R Statistical Application Development by Example Beginner's Guide
R Statistical Application
Development by Example
Beginner's Guide
Learn R Statistical Application Development from scratch
in a clear and pedagogical manner
Prabhanjan Narayanachar Tattar
BIRMINGHAM - MUMBAI
R Statistical Application Development by Example
Beginner's Guide
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2013
Production Reference: 1170713
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-944-1
www.packtpub.com
Cover Image by Asher Wishkerman (wishkerman@hotmail.com)
Credits
Author
Prabhanjan Narayanachar Tattar
Reviewers
Mark van der Loo
Mzabalazo Z. Ngwenya
A Ohri
Tengfei Yin
Acquisition Editor
Usha Iyer
Lead Technical Editor
Arun Nadar
Technical Editors
Madhuri Das
Mausam Kothari
Amit Ramadas
Varun Pius Rodrigues
Lubna Shaikh
Project Coordinator
Anurag Banerjee
Proofreaders
Maria Gould
Paul Hindle
Indexer
Hemangini Bari
Graphics
Ronak Dhruv
Production Coordinators
Melwyn D'sa
Zahid Shaikh
Cover Work
Melwyn D'sa
Zahid Shaikh
About the Author
Prabhanjan Narayanachar Tattar has seven years of experience with R software and has also co-authored the book A Course in Statistics with R published by Narosa Publishing House. The author has built two packages in R titled gpk and ACSWR. He has obtained a PhD (Statistics) from Bangalore University under the broad area of Survival Analysis and published several articles in peer-reviewed journals. During the PhD program, the author received the young Statistician honors of the IBS(IR)-GK Shukla Young Biometrician Award (2005) and the Dr. U.S. Nair Award for Young Statistician (2007), and also held a Junior and Senior Research Fellowship of CSIR-UGC.
Prabhanjan is working as a Business Analysis Advisor at Dell Inc., Bangalore. He works in the Customer Service Analytics unit of the larger Dell Global Analytics arm of Dell.
I would like to thank Prof. Athar Khan, Aligarh Muslim University, whose teaching during a shared R workshop inspired me to a very large extent.
My friend Veeresh Naidu has gone out of his way in helping and inspiring me to complete this book and I thank him for everything that defines our friendship.
Many of my colleagues at the Customer Service Analytics unit of Dell Global Analytics, Dell Inc. have been very tolerant of my stat talk with them and it is their need for the subject which has partly influenced the writing of the book. I would like to record my thanks to them and also my manager Debal Chakraborty.
My wife Chandrika has been very cooperative, and without her permission to work on the book during weekends and housework timings this book would have never been completed. Pranathi, at 2 years and 9 months, has started school, the pre-kindergarten, and it is genuinely believed that one day she will read this entire book.
I am also grateful to the reviewers whose constructive suggestions and criticisms have helped the book reach a higher level than where it would have ended up being without their help. Last but not least, I would like to take the opportunity to thank Usha Iyer and Anurag Banerjee for their inputs with the earlier drafts, and also their patience with my delays.
About the Reviewers
Mark van der Loo obtained his PhD at the Institute for Theoretical Chemistry at the University of Nijmegen (The Netherlands). Since 2007 he has worked at the statistical methodology department of the Dutch official statistics office (Statistics Netherlands). His research interests include automated data cleaning methods and statistical computing. At Statistics Netherlands he is responsible for the local R center of expertise, which supports and educates users on statistical computing with R. Mark has coauthored a number of R packages that are available via CRAN, namely editrules, deducorrect, rspa, extremevalues, and stringdist. Together with Edwin de Jonge he authored the book Learning RStudio for R Statistical Computing. A list of his publications can be found at www.markvanderloo.eu.
Mzabalazo Z. Ngwenya has worked extensively in the field of statistical consulting and currently works as a biometrician. He holds an MSc in Mathematical Statistics from the University of Cape Town and is at present studying towards a PhD (School of Information Technology, University of Pretoria) in the field of Computational Intelligence. His research interests include statistical computing, machine learning, spatial statistics, and simulation and stochastic processes. Previously he was involved in reviewing Learning RStudio for R Statistical Computing by Mark P.J. van der Loo and Edwin de Jonge, Packt Publishing.
A Ohri is the founder of the analytics startup Decisionstats.com. He has pursued graduation studies from the University of Tennessee, Knoxville and the Indian Institute of Management, Lucknow. In addition, he has a Mechanical Engineering degree from the Delhi College of Engineering. He has interviewed more than 100 practitioners in analytics, including leading members from all the analytics software vendors. He has written almost 1300 articles on his blog besides guest writing for influential analytics communities. He teaches courses in R through online education and has worked as an analytics consultant in India for the past decade. He was one of the earliest independent analytics consultants in India and his current research interests include spreading open source analytics, analyzing social media manipulation, simpler interfaces to cloud computing, and unorthodox cryptography. He is the author of R for Business Analytics (http://www.springer.com/statistics/book/978-1-4614-4342-1).
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
The work is dedicated to my father Narayanachar, the very first engineer who influenced my outlook towards Science and Engineering. For the same reason, my mother Lakshmi made me realize the importance of life and philosophy.
Table of Contents
Preface 1
Chapter 1: Data Characteristics 7
Questionnaire and its components 8
Understanding the data characteristics in an R environment 12
Experiments with uncertainty in computer science 14
R installation 15
Using R packages 16
RSADBE – the book's R package 17
Discrete distribution 18
Discrete uniform distribution 19
Binomial distribution 20
Hypergeometric distribution 24
Negative binomial distribution 24
Poisson distribution 26
Continuous distribution 26
Uniform distribution 27
Exponential distribution 28
Normal distribution 29
Summary 31
Chapter 2: Import/Export Data 33
data.frame and other formats 33
Constants, vectors, and matrices 34
Time for action – understanding constants, vectors, and basic arithmetic 35
Time for action – matrix computations 41
The list object 44
Time for action – creating a list object 44
The data.frame object 45
Time for action – creating a data.frame object 45
The table object 49
Time for action – creating the Titanic dataset as a table object 51
read.csv, read.xls, and the foreign package 52
Time for action – importing data from external files 55
Importing data from MySQL 58
Exporting data/graphs 60
Exporting R objects 60
Exporting graphs 60
Time for action – exporting a graph 61
Managing an R session 62
Time for action – session management 62
Summary 64
Chapter 3: Data Visualization 65
Visualization techniques for categorical data 66
Bar charts 66
Going through the built-in examples of R 66
Time for action – bar charts in R 68
Dot charts 74
Time for action – dot charts in R 74
Spine and mosaic plots 76
Time for action – the spine plot for the shift and operator data 77
Time for action – the mosaic plot for the Titanic dataset 80
Pie charts and the fourfold plot 81
Visualization techniques for continuous variable data 84
Boxplot 84
Time for action – using the boxplot 85
Histograms 88
Time for action – understanding the effectiveness of histograms 90
Scatter plots 93
Time for action – plot and pairs R functions 94
Pareto charts 97
A brief peek at ggplot2 99
Time for action – qplot 100
Time for action – ggplot 101
Summary 102
Chapter 4: Exploratory Analysis 103
Essential summary statistics 104
Percentiles, quantiles, and median 104
Hinges 104
The interquartile range 105
Time for action – the essential summary statistics for "The Wall" dataset 105
The stem-and-leaf plot 109
Time for action – the stem function in play 110
Letter values 113
Data re-expression 114
Bagplot – a bivariate boxplot 116
Time for action – the bagplot display for a multivariate dataset 117
The resistant line 118
Time for action – the resistant line as a first regression model 120
Smoothing data 122
Time for action – smoothening the cow temperature data 124
Median polish 125
Time for action – the median polish algorithm 125
Summary 128
Chapter 5: Statistical Inference 129
Maximum likelihood estimator 130
Visualizing the likelihood function 131
Time for action – visualizing the likelihood function 133
Finding the maximum likelihood estimator 135
Using the fitdistr function 136
Time for action – finding the MLE using mle and fitdistr functions 137
Confidence intervals 139
Time for action – confidence intervals 141
Hypotheses testing 144
Binomial test 144
Time for action – testing the probability of success 145
Tests of proportions and the chi-square test 147
Time for action – testing proportions 147
Tests based on normal distribution – one-sample 149
Time for action – testing one-sample hypotheses 152
Tests based on normal distribution – two-sample 156
Time for action – testing two-sample hypotheses 159
Summary 160
Chapter 6: Linear Regression Analysis 161
The simple linear regression model 162
What happens to the arbitrary choice of parameters? 163
Time for action – the arbitrary choice of parameters 164
Building a simple linear regression model 167
Time for action – building a simple linear regression model 167
ANOVA and the confidence intervals 169
Time for action – ANOVA and the confidence intervals 170
Model validation 171
Time for action – residual plots for model validation 172
Multiple linear regression model 176
Averaging k simple linear regression models or a multiple linear regression model 177
Time for action – averaging k simple linear regression models 177
Building a multiple linear regression model 179
Time for action – building a multiple linear regression model 179
The ANOVA and confidence intervals for the multiple linear regression model 181
Time for action – the ANOVA and confidence intervals for the multiple linear regression model 182
Useful residual plots 183
Time for action – residual plots for the multiple linear regression model 184
Regression diagnostics 186
Leverage points 187
Influential points 188
DFFITS and DFBETAS 189
The multicollinearity problem 189
Time for action – addressing the multicollinearity problem for the Gasoline data 191
Model selection 192
Stepwise procedures 192
The backward elimination 192
The forward selection 193
Criterion-based procedures 194
Time for action – model selection using the backward, forward, and AIC criteria 194
Summary 199
Chapter 7: The Logistic Regression Model 201
The binary regression problem 202
Time for action – limitations of linear regression models 202
Probit regression model 204
Time for action – understanding the constants 204
Logistic regression model 206
Time for action – fitting the logistic regression model 207
Hosmer-Lemeshow goodness-of-fit test statistic 210
Time for action – the Hosmer-Lemeshow goodness-of-fit statistic 211
Model validation and diagnostics 213
Residual plots for the GLM 213
Time for action – residual plots for the logistic regression model 214
Influence and leverage for the GLM 216
Time for action – diagnostics for the logistic regression 216
Receiving operator curves 220
Time for action – ROC construction 221
Logistic regression for the German credit screening dataset 223
Time for action – logistic regression for the German credit dataset 225
Summary 227
Chapter 8: Regression Models with Regularization 229
The overfitting problem 230
Time for action – understanding overfitting 231
Regression spline 234
Basis functions 234
Piecewise linear regression model 235
Time for action – fitting piecewise linear regression models 235
Natural cubic splines and the general B-splines 238
Time for action – fitting the spline regression models 239
Ridge regression for linear models 243
Time for action – ridge regression for the linear regression model 244
Ridge regression for logistic regression models 248
Time for action – ridge regression for the logistic regression model 248
Another look at model assessment 250
Time for action – selecting lambda iteratively and other topics 250
Summary 256
Chapter 9: Classification and Regression Trees 257
Recursive partitions 258
Time for action – partitioning the display plot 259
Splitting the data 260
The first tree 260
Time for action – building our first tree 261
The construction of a regression tree 265
Time for action – the construction of a regression tree 266
The construction of a classification tree 276
Time for action – the construction of a classification tree 279
Classification tree for the German credit data 284
Time for action – the construction of a classification tree 284
Pruning and other finer aspects of a tree 288
Time for action – pruning a classification tree 289
Summary 291
Chapter 10: CART and Beyond 293
Improving CART 294
Time for action – cross-validation predictions 296
Bagging 297
The bootstrap 297
Time for action – understanding the bootstrap technique 298
The bagging algorithm 300
Time for action – the bagging algorithm 301
Random forests 303
Time for action – random forests for the German credit data 304
The consolidation 308
Time for action – random forests for the low birth weight data 308
Summary 314
Appendix: References 315
Index 317
Preface
The open source software R is fast becoming one of the preferred companions of Statistics even as the subject continues to add many friends in Machine Learning, Data Mining, and so on among its already rich scientific network. The era in which mathematical theory and statistical applications are embedded in each other is truly a remarkable one for society, and the software has played a very pivotal role in it. This book is a humble attempt at presenting Statistical Models through R for any reader who has a bit of familiarity with the subject. In my experience of practicing the subject with colleagues and friends from different backgrounds, I realized that many are interested in learning the subject and applying it in their domain, which enables them to take appropriate decisions in analyses that involve uncertainty. A decade earlier my friends would be content with being pointed to a useful reference book. Not so anymore! The work in almost every domain is done through computers and naturally they do have their data available in spreadsheets, databases, and sometimes in plain text format. The request for an appropriate statistical model is invariably followed by a one word question "Software?" My answer to them has always been a single letter reply "R!" Why? It is really a very simple decision and it has been my companion over the last seven years. In this book, this experience has been converted into detailed chapters and a cleaner breakup of model building in R.
A by-product of the interaction with colleagues and friends who are all aspiring statistical model builders has been that I have been able to pick up the trough of their learning curve of the subject. The first attempt towards fixing the hurdle has been to introduce the fundamental concepts that the beginners are most familiar with, which is data. The difference is simply in the subtleties, and as such I firmly believe that introducing the subject on their turf motivates the reader for a long way in their journey. As with most statistical software, R provides modules and packages which mostly cover many of the recently invented statistical methodology. The first five chapters of the book focus on the fundamental aspects of the subject and the R software and hence cover R basics, data visualization, exploratory data analysis, and statistical inference.
The foundational aspects are illustrated using interesting examples and set up the framework for the later five chapters. Regression models, with linear and logistic regression models being at the forefront, are of paramount interest in applications. The discussion is more generic in nature and the techniques can be easily adapted across different domains. The last two chapters have been inspired by the Breiman school and hence the modern method of Classification and Regression Trees has been developed in detail and illustrated through a practical dataset.
What this book covers
Chapter 1, Data Characteristics, introduces the different types of data through a questionnaire and dataset. The need of statistical models is elaborated in some interesting contexts. This is followed by a brief explanation of R installation and the related packages. Discrete and continuous random variables are discussed through introductory R programs.
Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames, vectors, matrices, and lists are discussed with clear and simpler examples. Importing of data from external files in csv, xls, and other formats is elaborated next. Writing data/objects from R for other software is considered and the chapter concludes with a dialogue on R session management.
Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and numeric datasets. This translates into techniques of bar chart, dot chart, spine and mosaic plot, and fourfold plot for categorical data, while histogram, box plot, and scatter plot are covered for continuous/numeric data. A very brief introduction to ggplot2 is also provided here.
Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for preliminary analysis of data. The visualizing techniques of EDA such as stem-and-leaf, letter values, and modeling techniques of resistant line, smoothing data, and median polish give a rich insight as a preliminary analysis step.
Chapter 5, Statistical Inference, begins with the emphasis on the likelihood function and computing the maximum likelihood estimate. Confidence intervals for the parameters of interest are developed using functions defined for specific problems. The chapter also considers the important statistical tests of the Z-test and t-test for comparison of means and the chi-square test and F-test for comparison of variances.
Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a set of explanatory variables. The linear regression model has many underlying assumptions and such details are verified using validation techniques. A model may be affected by a single observation, or a single output value, or an explanatory variable. Statistical metrics are discussed in depth which help remove one or more kinds of anomalies. Given a large number of covariates, the efficient model is developed using model selection techniques.
Chapter 7, The Logistic Regression Model, is useful as a classification model when the output is a binary variable. Diagnostics and model validation through residuals are used, which lead to an improved model. ROC curves are next discussed, which help in identifying a better classification model.
Chapter 8, Regression Models with Regularization, discusses the problem of overfitting arising from the use of models developed in the previous two chapters. Ridge regression significantly reduces the possibility of an overfit model, and the development of natural spline models also lays the basis for the models considered in the next chapter.
Chapter 9, Classification and Regression Trees, provides a tree-based regression model. The trees are initially built using R functions and the final trees are also reproduced using rudimentary codes leading to a clear understanding of the CART mechanism.
Chapter 10, CART and Beyond, considers two enhancements of CART using bagging and random forests. A consolidation of all the models from Chapter 6 to Chapter 10 is also given through a dataset.
Chapter 1 to Chapter 5 form the basics of the R software and the Statistics subject. Practical and modern regression models are discussed in depth from Chapter 6 to Chapter 10.
Appendix, References, lists names of the books that have been referred to in this book.
What you need for this book
R is the only required software for this book and you can download it from http://www.cran.r-project.org/. R packages will be required too, though this task is done within a working R session. The datasets used in the book are available in the R package RSADBE, which is an abbreviation of the book's title, at http://cran.r-project.org/web/packages/RSADBE/index.html.
Who this book is for
This book will be useful for readers who have a flair and a need for statistical applications in their own domains. The first seven chapters are also useful for any masters students in Statistics, and the motivated student can easily complete the rest of the book and obtain a working knowledge of CART.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions of how to complete a procedure or task, we use:
Time for action – heading
1. Action 1
2. Action 2
3. Action 3
Instructions often need some extra explanation so that they make sense, so they are followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop quiz – heading
These are short multiple-choice questions intended to help you test your own understanding.
Have a go hero – heading
These are practical challenges that give you ideas for experimenting with what you have learned.
You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "The operator %% on two objects, say x and y, returns the remainder following an integer division, and the operator %/% returns the integer division." In certain cases the complete code cannot be included within the action list and in such cases you will find the following display:
Plot the "Response Residuals" against the "Fitted Values" of the pass_logistic model with the following values assigned:
plot(fitted(pass_logistic), residuals(pass_logistic, "response"),
  col="red", xlab="Fitted Values", ylab="Residuals", cex.axis=1.5,
  cex.lab=1.5)
In such a case you need to run the code starting with plot( up to cex.lab=1.5) in R.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book—what you liked or may have disliked. Reader feedback is important for us to
develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to feedback@packtpub.com,
and mention the book title through the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/9441OS_R-Statistical-Application-Development-by-Example-Color-Graphics.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt,
we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location
address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected
pirated material.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.
Questions
You can contact us at questions@packtpub.com if you are having a problem
with any aspect of the book, and we will do our best to address it.
Data Characteristics
Data consists of observations across different types of variables, and it is vital
that any Data Analyst understands these intricacies at the earliest stage of
exposure to statistical analysis. This chapter recognizes the importance of data
and begins with a template of a dummy questionnaire and then proceeds with
the nitty-gritty of the subject. We then explain how uncertainty creeps into
the domain of computer science. The chapter closes with coverage of important
families of discrete and continuous random variables.
We will cover the following topics:
Identification of the main variable types as nominal, categorical, and continuous variables
The uncertainty arising in many real experiments
R installation and packages
The mathematical form of discrete and continuous random variables and their applications
Questionnaire and its components
The goal of this section is the introduction of numerous variable types at the first possible occasion. Traditionally, an introductory course begins with the elements of probability theory and then builds up the requisites leading to random variables. This convention is dropped in this book and we begin straightaway with data. There is a primary reason for choosing this path. The approach builds on what the reader is already familiar with and then connects it with the essential framework of the subject.
It is very likely that the user is familiar with questionnaires. A questionnaire may be asked after the birth of a baby with a view to aid the hospital in the study about the experience of the mother, the health status of the baby, and the concerns of immediate guardians of the new born. A multi-store department may instantly request the customer to fill in a short questionnaire for capturing the customer's satisfaction after the sale of a product. A customer's satisfaction following the service of their vehicle (see the detailed example discussed later) can be captured through a few queries. The questionnaires may arise in different forms than just merely on paper. They may be sent via e-mail, telephone, short message service (SMS), and so on. As an example, one may receive an SMS that seeks a mandatory response in a Yes/No form. An e-mail may arrive in the Outlook inbox, which requires the recipient to respond through a vote for any of these three options, "Will attend the meeting", "Can't attend the meeting", or "Not yet decided".
Suppose the owner of a multi-brand car center wants to find out the satisfaction percentage of his customers. Customers bring their car to a service center for varied reasons. The owner wants to find out the satisfaction levels post the servicing of the cars and find the areas where improvement will lead to higher satisfaction among the customers. It is well known that the higher the satisfaction levels, the greater would be the customer's loyalty towards the service center. Towards this, a questionnaire is designed and then data is collected from the customers. A snippet of the questionnaire is given in Figure 1, and the information given by the customers leads to different types of data characteristics. The variables Customer ID and Questionnaire ID may be serial numbers or randomly generated unique numbers. The purpose of such variables is unique identification of people's responses. It may be possible that there are follow-up questionnaires as well. In such cases, the Customer ID for a responder will continue to be the same, whereas the Questionnaire ID needs to change for identification of the follow up. The values of these types of variables in general are not useful for analytical purposes.
Figure 1: A hypothetical questionnaire
The information of Full Name in this survey is a starting point to break the ice with the responder. In very exceptional cases the name may be useful for profiling purposes. For our purposes the name will simply be a text variable that is not used for analysis purposes. Gender is asked to know the person's gender, and in quite a few cases it may be an important factor explaining the main characteristics of the survey; in this case it may be mileage. Gender is an example of a categorical variable.
Age in Years is a variable that captures the age of the customer. The data for this field is numeric in nature and is an example of a continuous variable.
The fourth and fifth questions help the multi-brand dealer in identifying the car model and its age. The first question here enquires about the type of the car model. The car models of the customers may vary from Volkswagen Beetle, Ford Endeavor, Toyota Corolla, Honda Civic, to Tata Nano; see the dataset table that follows. Though the model name is actually a noun, we make a distinction from the first question of the questionnaire in the sense that the former is a text variable while the latter leads to a categorical variable. Next, the car model may easily be identified to classify the car into one of the car categories, such as a hatchback, sedan, station wagon, or utility vehicle, and such a classifying variable may serve as an ordinal variable, as per the overall car size. The age of the car in months since its manufacture date may explain the mileage and odometer reading.
The sixth and seventh questions simply ask the customer if their minor/major problems were completely fixed or not. This is a binary question that takes either of the values, Yes or No. Small dents, power windows malfunctioning, niggling noises in the cabin, low music speaker output, and other similar issues, which do not lead to good functioning of the car, may be treated as minor problems that are expected to be fixed in the car. Disc brake problems, wheel alignment, steering rattling issues, and similar problems that expose the user and co-users of the road to danger are of grave concern, as they affect the functioning of a car, and are treated as major problems. Any user will expect all of his/her issues to be resolved during a car service. An important goal of the survey is to find the service center efficiency in handling the minor and major issues of the car. The labels Yes/No may be replaced by +1 and -1, or any other label of convenience.
The eighth question, "What is the mileage (km/liter) of the car?", gives a measure of the average petrol/diesel consumption. In many practical cases this data is provided by the belief of the customer, who may simply declare it as between 5 km/liter and 25 km/liter. In the case of a lower mileage, the customer may ask for a finer tune-up of the engine, wheel alignment, and so on. A general belief is that if the mileage is closer to the assured mileage as marketed by the company, or some authority such as the Automotive Research Association of India (ARAI), the customer is more likely to be happy. An important variable is the overall kilometers done by the car up to the point of service. Vehicles have certain maintenances at the intervals of 5,000 km, 10,000 km, 20,000 km, 50,000 km, and 100,000 km. This variable may also be related with the age of the vehicle.
Let us now look at the final question of the snippet. Here, the customer is asked to rate his overall experience of the car service. A response from the customer may be sought immediately after a small test ride post the car service, or it may be through a questionnaire sent to the customer's e-mail ID. A rating of Very Poor suggests that the workshop has served the customer miserably, whereas the rating of Very Good conveys that the customer is completely satisfied with the workshop service. Note that there is some order in the response of the customer, in that we can grade the ranking in a certain order of Very Poor < Poor < Average < Good < Very Good. This implies that the structure of the ratings must be respected when we analyze the data of such a study. In the next section, these concepts are elaborated through a hypothetical dataset.
Customer_ID | Questionnaire_ID | Name | Gender | Age | Car_Model | Car Manufacture Year | Minor Problems | Major Problems | Mileage | Odometer | Satisfaction Rating
C601FAKNQXM | QC601FAKNQXM | J. Ram | Male | 57 | Beetle | Apr-11 | Yes | Yes | 23 | 18892 | Good
C5HZ8CP1NFB | QC5HZ8CP1NFB | Sanjeev Joshi | Male | 53 | Camry | Feb-09 | Yes | Yes | 17 | 22624 | Average
CY72H4J0V1X | QCY72H4J0V1X | John D | Female | 20 | Nano | Apr-10 | Yes | Yes | 24 | 42008 | Good
CH1NZO5VCD8 | QCH1NZO5VCD8 | Pranathi PT | Female | 54 | Civic | Oct-11 | Yes | Yes | 23 | 32556 | Average
CV1Y10SFW7N | QCV1Y10SFW7N | Pallavi M Daksh | Female | 32 | Civic | Mar-12 | Yes | No | 8 | 48172 | Very Good
CXO04WUYQAJ | QCXO04WUYQAJ | Mohammed Khan | Male | 20 | Corolla | Dec-10 | Yes | No | 21 | 25207 | Good
CJQZAYMI59Z | QCJQZAYMI59Z | Anand N T | Male | 53 | Civic | Mar-12 | Yes | No | 14 | 41449 | Good
CIZTA35PW19 | QCIZTA35PW19 | Arun Kumar T | Male | 65 | Endeavor | Aug-11 | Yes | Yes | 23 | 28555 | Good
C12XU9J0OAT | QC12XU9J0OAT | Prakash Prabhak | Male | 50 | Beetle | Mar-09 | Yes | No | 19 | 36841 | Very Poor
CXWBT0V17G | QCXWBT0V17G | Pramod R.K. | Male | 22 | Nano | Mar-11 | Yes | No | 23 | 1755 | Very Good
C5YOUIZ7PLC | QC5YOUIZ7PLC | Mithun Y. | Male | 49 | Nano | Apr-11 | No | No | 17 | 2007 | Good
CYF269HVUO | QCYF269HVUO | S.P. Bala | Male | 37 | Beetle | Jul-11 | Yes | No | 14 | 28265 | Poor
CAIE3Z0SYK9 | QCAIE3Z0SYK9 | Swamy J | Male | 42 | Nano | Dec-09 | Yes | Yes | 23 | 27997 | Poor
CE09UZHDP63 | QCE09UZHDP63 | Julfikar | Male | 47 | Camry | Jan-12 | Yes | Yes | 7 | 27491 | Good
CDWJ6ESYPZR | QCDWJ6ESYPZR | Chris John | Male | 31 | Endeavor | May-12 | Yes | Yes | 25 | 29527 | Very Poor
CH7XRZ6W9JQ | QCH7XRZ6W9JQ | Naveed Khan | Male | 24 | Fortuner | Aug-09 | Yes | Yes | 17 | 2702 | Good
CGXATR9DQEK | QCGXATR9DQEK | Prem Kashmiri | Female | 47 | Civic | Oct-11 | No | No | 21 | 6903 | Good
CYQO5RFIPK1 | QCYQO5RFIPK1 | Sujana Rao | Male | 54 | Camry | Mar-10 | No | Yes | 6 | 40873 | Poor
CG1SZ8IDURP | QCG1SZ8IDURP | Josh K | Male | 39 | Endeavor | Jul-11 | Yes | Yes | 8 | 15274 | Poor
CTUSRQDX396 | QCTUSRQDX396 | Aravind | Male | 61 | Fiesta | May-10 | Yes | Yes | 22 | 9934 | Average
A hypothetical dataset of a Questionnaire
Understanding the data characteristics in an R environment
A snippet of an R session is given in Figure 2. Here we simply relate an R session with the survey and sample data of the previous table. The simple goal here is to get a feel/buy-in of R and not necessarily follow the R codes. The R installation process is explained in the R installation section. Here the user is loading the SQ R data object (SQ simply stands for sample questionnaire) in the session. The nature of the SQ object is a data.frame that stores a variety of other objects in itself. For more technical details of the data.frame function, see The data.frame object section of Chapter 2, Import/Export Data. The names of a data.frame object may be extracted using the function variable.names. The R function class helps to identify the nature of the R object. As we have a list of variables, it is useful to find all of them using the function sapply. In the following screenshot, the mentioned steps have been carried out:
Figure 2: Understanding the variable types of an R object
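As a minimal sketch of the commands behind this screenshot, assuming that the SQ data frame is available in the session, for instance from the book's companion RSADBE package (package and dataset availability are taken from the text, not verified here), one may run:
> library(RSADBE)      # the book's companion package, assumed to be installed
> data(SQ)             # the sample questionnaire data as a data.frame
> class(SQ)            # confirms that SQ is a "data.frame"
> variable.names(SQ)   # the names of the variables stored in SQ
> sapply(SQ, class)    # the class of each variable in SQ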
The variable characteristics are also on expected lines, as they truly should be, and we see that the variables Customer_ID, Questionnaire_ID, and Name are character variables; Gender, Car_Model, Minor_Problems, and Major_Problems are factor variables; DoB and Car_Manufacture_Year are date variables; Mileage and Odometer are integer variables; and finally the variable Satisfaction_Rating is an ordered and factor variable.
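An ordered factor of this kind can also be declared explicitly, as in the following sketch; the level labels follow the questionnaire, while the column name is assumed from the screenshot:
> SQ$Satisfaction_Rating <- factor(SQ$Satisfaction_Rating,
+   levels = c("Very Poor", "Poor", "Average", "Good", "Very Good"),
+   ordered = TRUE)                # respects Very Poor < ... < Very Good
> class(SQ$Satisfaction_Rating)    # returns "ordered" "factor"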
In the remainder of this chapter we will delve into more details about the nature of various data types. In a more formal language, a variable is called a random variable, abbreviated as RV in the rest of the book, in statistical literature. A distinction needs to be made here. In this book we do not focus on the important aspects of probability theory. It is assumed that the reader is familiar with probability, say at the level of Freund (2003) or Ross (2001). An RV is a function that maps from the probability (sample) space Ω to the real line. From the previous example we have Odometer and Satisfaction_Rating as two examples of a random variable. In a formal language, the random variables are generally denoted by letters X, Y, …. The distinction that is required here is that in the applications what we observe are the realizations/values of the random variables. In general, the realized values are denoted by the lower cases x, y, …. Let us clarify this at more length.
Suppose that we denote the random variable Satisfaction_Rating by X. Here, the sample space Ω consists of the elements Very Poor, Poor, Average, Good, and Very Good. For the sake of convenience we will denote these elements by O1, O2, O3, O4, and O5 respectively. The random variable X takes one of the values O1, …, O5 with respective probabilities p1, …, p5. If the probabilities were known, we don't have to worry about statistical analysis. In simple terms, if we know the probabilities of the Satisfaction_Rating RV, we can simply use it to conclude whether more customers give a Very Good rating against Poor. However, our survey data does not contain every customer who has availed car service from the workshop, and as such we have representative probabilities and not actual probabilities. Now, we have seen 20 observations in the R session, and corresponding to each row we had some values under the Satisfaction_Rating column. Let us denote the satisfaction rating for the 20 observations by the symbols X1, …, X20. Before we collect the data, the random variables X1, …, X20 can assume any of the values in Ω. Post the data collection, we see that the first customer has given the rating as Good (that is, O4), the second as Average (O3), and so on up to the twentieth customer's rating as Average (again O3). By convention, what is observed in the data sheet is actually x1, …, x20, the realized values of the RVs X1, …, X20.
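Since only representative probabilities are available, the observed relative frequencies act as their estimates. A small sketch, again assuming the SQ data frame loaded earlier:
> table(SQ$Satisfaction_Rating)               # counts of each rating among the 20 responses
> prop.table(table(SQ$Satisfaction_Rating))   # relative frequencies, estimates of p1, ..., p5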
Experiments with uncertainty in computer science
The common man of the previous century was skeptical about chance/randomness and attributed it to the lack of accurate instruments, and that information is not necessarily captured in many variables. The skepticism about the need for modeling for randomness in the current era continues for the common man, as he feels that the instruments are too accurate and that multi-variable information eliminates uncertainty. However, this is not the fact, and we will look here at some examples that drive home this point. In the previous section we dealt with data arising from a questionnaire regarding the service level at a car dealer. It is natural to accept that different individuals respond in distinct ways, and further the car, being a complex assembly of different components, responds differently in near identical conditions. A question then arises whether we may have to really deal with such situations in computer science, which involve uncertainty. The answer is certainly affirmative and we will consider some examples in the context of computer science and engineering.
Suppose that the task is installation of software, say R itself. At a new lab there has been an arrangement of 10 new desktops that have the same configuration. That is, the RAM, memory, the processor, operating system, and so on are all the same in the 10 different machines. For simplicity, assume that the electricity supply and lab temperature are identical for all the machines. Do you expect that the complete R installation, as per the directions specified in the next section, will take the same time in milliseconds for all the 10 installations? The run time of an operation can be easily recorded, maybe using other software if not manually. The answer is a clear "No", as there will be minor variations of the processes active in the different desktops. Thus, we have our first experiment in the domain of computer science which involves uncertainty.
Suppose that the lab is now two years old. As an administrator, do you expect all the 10 machines to be working in the same identical conditions, as we started with identical configuration and environment? The question is relevant as, according to general experience, a few machines may have broken down. Despite warranty and assurance by the desktop company, the number of machines that may have broken down will not be exactly as assured. Thus, we again have uncertainty.
Assume that three machines are not functioning at the end of two years. As an administrator, you have called the service vendor to fix the problem. For the sake of simplicity, we assume that the nature of failure of the three machines is the same, say motherboard failure on the three failed machines. Is it practical that the vendor would fix the three machines within identical time? Again, by experience we know that this is very unlikely. If the reader thinks otherwise, assume that 100 identical machines were running for two years and 30 of them are now having the motherboard issue. It is now clear that some machines may require a component replacement while others would start functioning following a repair/fix.
Let us now summarize the preceding experiments through the set of following questions:
What is the average installation time for the R software on identically configured computer machines?
How many machines are likely to break down after a period of one year, two years, and three years?
If a failed machine has issues related to the motherboard, what is the average service time?
What is the fraction of failed machines that have a failed motherboard component?
The answers to these types of questions form the main objective of the Statistics subject. However, there are certain characteristics of uncertainty that are very richly covered by the families of probability distributions. According to the underlying problem, we have discrete or continuous RVs. The important and widely useful probability distributions form the content of the rest of the chapter. We will begin with the useful discrete distributions.
R installation
The official website of R is the Comprehensive R Archive Network (CRAN) at www.cran.r-project.org. As of the writing of this book, the most recent version of R is 2.15.1. This software can be downloaded for the three platforms Linux, Mac OS X, and Windows.
Figure 3: The CRAN website (a snapshot)
A Linux user may simply key in sudo apt-get install r-base in the terminal, and after the correct password with the required privilege level is supplied, the R software will be installed. After the completion of download and installation, the software can be started by simply keying in R at the terminal.
A Windows user first needs to click on Download R for Windows as shown in the preceding screenshot, and then in the base subdirectory click on install R for the first time. In the new window, click on Download R 3.0.0 for Windows and download the .exe file to a directory of your choice. The completely downloaded R-3.0.0-win.exe file can be installed as any other .exe file. The R software may be invoked either from the Start menu, or from the icon on the desktop.
Using R packages
The CRAN repository hosts 4475 packages as of May 01, 2013. The packages are written and maintained by Statisticians, Engineers, Biologists, and others. The reasons are varied and the resourcefulness is very rich, and it reduces the need of writing exhaustive, new functions and programs from scratch. These additional packages can be obtained from http://www.cran.r-project.org/web/packages/. The user can click on Table of available packages, sorted by name, which directs to a new web page. Let us illustrate the installation of an R package named gdata.
We now wish to install the package gdata. There are multiple ways of completing this task. Clicking on the gdata label leads to the web page http://www.cran.r-project.org/web/packages/gdata/index.html. In this HTML file we can find a lot of information about the package, from Version, Depends, Imports, Published, Author, Maintainer, License, System Requirements, Installation, and CRAN checks. Further, the download options may be chosen from Package source, MacOS X binary, and Windows binary depending on whether the user's OS is Unix, MacOS, or Windows respectively. Finally, a package may require other packages as a prerequisite, and it may itself be a prerequisite for other packages. This information is provided in the Reverse dependencies section in the options Reverse depends, Reverse imports, and Reverse suggests.
Suppose that the user is on the Windows OS. There are two ways to install the package gdata. Start R as explained earlier. At the console, execute the code install.packages("gdata"). A CRAN mirror window will pop up asking the user to select one of the available mirrors. Select one of the mirrors from the list, you may need to scroll down to locate your favorite mirror, and then click on the Ok button. A default setting is dependencies=TRUE, which will then download and install all other required packages. Unless there are some violations, such as the dependency requirement of the R version being at least 2.13.0 in this case, the packages are successfully installed.
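As a minimal sketch of the console route just described (the mirror prompt appears only if no default CRAN mirror has been set for the session):
> install.packages("gdata", dependencies = TRUE)   # download and install gdata and its dependencies
> library(gdata)                                   # load the package once the installation succeeds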
A second way of installing the gdata package is as follows. On the gdata web page click on the link gdata_2.11.0.zip. This action will then attempt to download the package through the File download window. Choose the option Save and specify the path where you wish to download the package. In my case, I have chosen the path C:\Users\author\Downloads. Now go to the R window. In the menu ribbon, we have seven options: File, Edit, View, Misc, Packages, Windows, and Help. Yes, your guess is correct and you would have wisely selected Packages from the menu. Now, select the last option of Packages, the Install package(s) from local zip files option, and direct it to the path where you have downloaded the .zip file. Select the file gdata_2.11.0 and R will do the required remaining part of installing the package. One of the drawbacks of doing this process manually is that if there are dependencies, the user needs to ensure that all such packages have been installed before embarking on this second task of installing the R packages. However, despite the problem, it is quite useful to know this technique, as we may not be connected to the Internet all the time, and then it is convenient to install the packages this way.
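The same local .zip file can also be installed from the console instead of the menus; the path below is simply the download location used in the example above and would need to be adapted:
> install.packages("C:/Users/author/Downloads/gdata_2.11.0.zip",
+   repos = NULL, type = "win.binary")   # install a downloaded Windows binary directly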
RSADBE – the book's R package
The book uses a lot of datasets from the Web, statistical text books, and so on. The file formats of the datasets have been varied and thus, to help the reader, we have put all the datasets used in the book in an R package, RSADBE, which is the abbreviation of the book's title. This package will be available from the CRAN website as well as the book's web page. Thus, whenever you are asked to run data(xyz), the dataset xyz will be available either in the RSADBE package or the datasets package of R.
The book also uses many of the packages available on CRAN. The following table gives the list of packages and the reader is advised to ensure that these packages are installed before you begin reading the chapter. That is, the reader needs to ensure that, as an example, install.packages(c("qcc","ggplot2")) is run in the R session before proceeding with Chapter 3, Data Visualization.
Chapter number | Packages required
2 | foreign, RMySQL
3 | qcc, ggplot2
4 | LearnEDA, aplpack
5 | stats4, PASWR, PairedData
6 | faraway
7 | pscl, ROCR
8 | ridge, DAAG
9 | rpart, rattle
10 | ipred, randomForest
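The contributed packages in this table can also be installed in one go, as in the following sketch; stats4 ships with base R and therefore does not need to be installed separately:
> install.packages(c("foreign", "RMySQL", "qcc", "ggplot2", "LearnEDA",
+   "aplpack", "PASWR", "PairedData", "faraway", "pscl", "ROCR",
+   "ridge", "DAAG", "rpart", "rattle", "ipred", "randomForest"))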
Discrete distribution
The previous section highlights the different forms of variables. The variables such as Gender, Car_Model, and Minor_Problems possibly take one of the finite values. These variables are particular cases of the more general class of discrete variables. It is to be noted that the sample space Ω of a discrete variable need not be finite. As an example, the number of errors on a page may take values in a set of positive integers, {0, 1, 2, …}. Suppose that a discrete random variable X can take values among x1, x2, … with respective probabilities p1, p2, …, that is, P(X = xi) = p(xi) = pi. Then, we require that the probabilities be non-zero and further that their sum be 1:
pi > 0, i = 1, 2, …, and Σi pi = 1,
where the Greek symbol Σ represents summation over the index i.
The function p(xi) is called the probability mass function (pmf) of the discrete RV X. We will now consider formal definitions of important families of discrete variables. The engineers may refer to Bury (1999) for a detailed collection of useful statistical distributions in their field. The two most important parameters of a probability distribution are specified by the mean and variance of the RV X. In some cases, and important too, these parameters may not exist for the RV. However, we will not focus on such distributions, though we caution the reader that this does not mean that such RVs are irrelevant. Let us define these parameters for the discrete RV. The mean and variance of a discrete RV are respectively calculated as:
E(X) = Σi xi pi, and Var(X) = Σi xi² pi − (E(X))²
The mean is a measure of central tendency, whereas the variance gives a measure
of the spread of the RV.
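As a minimal sketch of these two formulas in R (the support and probability values below are illustrative, not taken from the survey data), the mean and variance of a discrete RV can be computed directly from vectors holding its values and pmf:

# illustrative support and pmf of a discrete RV
x <- c(1, 2, 3, 4)
p <- c(0.1, 0.2, 0.3, 0.4)
sum(p)                       # must equal 1 for a valid pmf
EX <- sum(p * x)             # mean E(X)
VarX <- sum(p * (x - EX)^2)  # variance Var(X)
EX; VarX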
The variables defined so far are more commonly known as categorical variables.
Agresti (2002) defines a categorical variable as a measurement scale consisting of a set of categories.
Let us identify the categories for the variables listed in the previous section. The categories for the variable Gender are Male and Female, whereas the car categories derived from Car_Model are hatchback, sedan, station wagon, and utility vehicles. The variables Minor_Problems and Major_Problems have common but independent categories Yes and No; and finally the variable Satisfaction_Rating has the categories, as seen earlier, Very Poor, Poor, Average, Good, and Very Good. The variable Car_Model consists of just the labels of the car names and is an example of a nominal variable.
Finally, the output of the variable Satisfaction_Rating has an implicit order in it: Very Poor < Poor < Average < Good < Very Good. It may be realized that this difference poses subtle challenges in the analysis. These types of variables are called ordinal variables. We will look at another type of categorical variable that has not popped up thus far.
Practically, it is often the case that the output of a continuous variable is put into certain bins for ease of conceptualization. A very popular example is the categorization of income level or age. In the case of income variables, it has been realized in earlier studies that people are very conservative about revealing their income in precise numbers. For example, the author may be shy to reveal that his monthly income is Rs. 34,892. On the other hand, these very same people do not have a problem in disclosing their income as belonging to one of these bins: < Rs. 10,000, Rs. 10,000-30,000, Rs. 30,000-50,000, and > Rs. 50,000. Thus, this information may also be coded into labels, with each label then referring to some value in an interval bin. Hence, such variables are referred to as interval variables.
Discrete uniform distribution
A random variable X is said to be a discrete uniform random variable if it can take any one of M finite labels with equal probability.
As the discrete uniform random variable X can assume one of the values 1, 2, …, M with equal probability, this probability is actually 1/M. As the probability remains the same across the labels, the nomenclature "uniform" is justified. It might appear at the outset that this is not a very useful random variable. However, the reader is cautioned that this intuition is not correct. As a simple case, this variable arises in many situations where simple random sampling is in action. The pmf of the discrete uniform RV is calculated as:

P(X = x_i) = p(x_i) = 1/M, i = 1, 2, …, M

A simple plot of the probability distribution of a discrete uniform RV is demonstrated next:
> M = 10
> mylabels=1:M
> prob_labels=rep(1/M,length(mylabels))
> dotchart(prob_labels,labels=mylabels,xlim=c(.08,.12),
+ xlab="Probability")
> title("A Dot Chart for Probability of Discrete Uniform RV")
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Figure 4: Probability distribution of a discrete uniform random variable
The R programs here are indicative and it is not absolutely necessary that you follow them right now. The R programs proper will begin from the next chapter, and your flow won't be affected if you do not understand certain aspects of them.
Binomial distribution
Recall the second question in the Experiments with uncertainty in computer science section, which asks "How many machines are likely to break down after a period of one year, two years, and three years?". When the outcomes involve uncertainty, the more appropriate question to ask is about the probability of the number of breakdowns being x. Consider a fixed time frame, say 2 years. To make the question more generic, we assume that we have n machines. Suppose that the probability of a breakdown for a given machine at any given time is p. The goal is to obtain the probability of x machines with breakdowns, and implicitly (n-x) functional machines. Now consider a fixed pattern where the first x units have failed and the remaining are functioning properly. All the n machines function independently of the other machines. Thus, the probability of observing x machines in the breakdown state is p^x.
Similarly, each of the remaining (n-x) machines has probability (1-p) of being in the functional state, and thus the probability of these occurring together is (1-p)^(n-x). Again, by the independence assumption, the probability of x machines with breakdowns in this fixed pattern is then given by p^x (1-p)^(n-x). Finally, in the overall setup, the number of possible samples with x units broken down and (n-x) units functional is the number of possible combinations of choosing x out of n items, which is the binomial coefficient C(n, x), read "n choose x". As each of these samples is equally likely to occur, the probability of exactly x broken machines is given by C(n, x) p^x (1-p)^(n-x).
The RV X obtained in such a context is known as the binomial RV and its pmf is called the binomial distribution. In mathematical terms, the pmf of the binomial RV is calculated as:

P(X = x) = p(x) = C(n, x) p^x (1-p)^(n-x), x = 0, 1, …, n, 0 ≤ p ≤ 1

The pmf of the binomial distribution is sometimes denoted by b(x; n, p). Let us now look at some important properties of a binomial RV. The mean and variance of a binomial RV X are respectively calculated as:

E(X) = np and Var(X) = np(1-p)

As p is always a number between 0 and 1, the variance of a binomial RV is always less than its mean.
Example 1.3.1: Suppose n = 10 and p = 0.5. We need to obtain the probabilities p(x), x = 0, 1, 2, …, 10. The probabilities can be obtained using the built-in R function dbinom. The function dbinom returns the probabilities of a binomial RV. The first argument of this function may be a scalar or a vector, according to the points at which we wish to know the probability. The second argument of the function is n, the size of the binomial distribution. The third argument is the probability of success p. It is natural to forget the syntax of functions, and the R help system becomes very handy here. For any function, you can get its details using ? followed by the function name. Please do not put a space between ? and the function name. Here, you can try ?dbinom.
> n <- 10; p <- 0.5
> p_x <- round(dbinom(x=0:10, n, p),4)
> plot(x=0:10,p_x,xlab="x", ylab="P(X=x)")
The R function round rounds its argument to the specified number of digits.
Figure 5: Binomial probabilities
We have used the dbinom function in the previous example. There are three more utility functions for the binomial distribution, with the prefixes p, q, and r. These three respectively help us in computations related to cumulative probabilities, quantiles of the distribution, and simulation of random numbers from the distribution. To use these functions, we simply prefix the letters to the distribution name, binom here, giving pbinom, qbinom, and rbinom. There will of course be a change in the arguments. In fact, there are many distributions for which the quartet of d, p, q, and r functions is available; check ?Distributions.
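As a quick sketch of this d-p-q-r quartet for the binomial case (reusing the n and p of Example 1.3.1), the four functions may be exercised as follows:

dbinom(5, size = 10, prob = 0.5)    # d: probability of exactly 5 successes
pbinom(5, size = 10, prob = 0.5)    # p: cumulative probability P(X <= 5)
qbinom(0.5, size = 10, prob = 0.5)  # q: smallest x with P(X <= x) >= 0.5
rbinom(20, size = 10, prob = 0.5)   # r: 20 simulated values from the distribution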
Example 1.3.2: Assume that the probability of a key failing on an 83-key keyboard (the author's laptop keyboard has 83 keys) is 0.01. Now, we need to find the probability that at a given time there are 10, 20, and 30 non-functioning keys on this keyboard. Using the dbinom function, these probabilities are easy to calculate. Try to do the same problem using a scientific calculator or by writing a simple function in any language that you are comfortable with.
> n <- 83; p <- 0.01
> dbinom(10,n,p)
[1] 1.168e-08
> dbinom(20,n,p)
[1] 4.343e-22
> dbinom(30,n,p)
[1] 2.043e-38
> sum(dbinom(0:83,n,p))
[1] 1
As the probabilities of 10-30 keys failing appear too small, it is natural to suspect that something may be going wrong. As a check, the probabilities clearly sum to 1. Let us look at the problem from a different angle. For many x values, the probability p(x) will be approximately zero. We may not be interested in the probability of an exact number of failures; rather, we are interested in the probability of at most x failures occurring, that is, in the cumulative probabilities P(X ≤ x). The cumulative probabilities for the binomial distribution are obtained in R using the pbinom function. The main arguments of pbinom include size (for n), prob (for p), and q (the x argument). For the same problem, we now look at the cumulative probabilities for various p values:
> n <- 83; p <- seq(0.05,0.95,0.05)
> x <- seq(0,83,5)
> i <- 1
> plot(x,pbinom(x,n,p[i]),"l",col=1,xlab="x",ylab=
+ expression(P(X<=x)))
> for(i in 2:length(p)) { points(x,pbinom(x,n,p[i]),"l",col=i)}
Figure 6: Cumulative binomial probabilities
Try to interpret the preceding screenshot.
Hypergeometric distribution
A box of N = 200 pieces of 12 GB pen drives arrives at a sales center. The carton contains M = 20 defective pen drives. A random sample of n units is drawn from the carton. Let X denote the number of defective pen drives obtained in the sample of n units. The task is to obtain the probability distribution of X. The number of possible ways of obtaining the sample of size n is C(N, n). In this problem we have M defective units and N-M working pen drives; x defective units can be sampled in C(M, x) different ways and n-x good units can be obtained in C(N-M, n-x) distinct ways. Thus, the probability distribution of the RV X is calculated as:

P(X = x) = h(x; n, M, N) = C(M, x) C(N-M, n-x) / C(N, n)

where x is an integer between max(0, n-N+M) and min(n, M). The RV is called the hypergeometric RV and its probability distribution is called the hypergeometric distribution.
Suppose that we draw a sample of n = 10 units. The function dhyper in R can be used to find the probabilities of the RV X assuming different values.
> N <- 200; M <- 20
> n <- 10
> x <- 0:11
> round(dhyper(x,M,N,n),3)
[1] 0.377 0.395 0.176 0.044 0.007 0.001 0.000 0.000 0.000 0.000 0.000
0.000
The mean and variance of a hypergeometric distribution are stated as follows:

E(X) = n(M/N) and Var(X) = n(M/N)(1 - M/N)(N - n)/(N - 1)
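As a small illustrative sketch (our own, not part of the book's code), these formulas can be evaluated for the pen drive example and compared against a simulation. Note that the R functions dhyper and rhyper take the numbers of defective and good units as two separate arguments, m and n:

N <- 200; M <- 20; n <- 10
EX <- n * (M/N)                                   # theoretical mean
VarX <- n * (M/N) * (1 - M/N) * (N - n)/(N - 1)   # theoretical variance
EX; VarX
set.seed(123)
sims <- rhyper(nn = 10000, m = M, n = N - M, k = n)  # 10000 simulated samples
mean(sims); var(sims)                                # close to EX and VarX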
Negative binomial distribution
Consider a variant of the problem described in the previous subsection. Ten new desktops need to be fitted with add-on 5 megapixel external cameras to help the students attend a certain online course. Assume that the probability of a camera unit being non-defective is p. As an administrator, you keep on placing orders until you receive 10 non-defective cameras. Now, let X denote the number of orders placed to obtain the 10 good units. We denote the required number of successes by k, which in this discussion is k = 10. The goal in this unit is to obtain the probability distribution of X.
Suppose that the xth order placed results in the procurement of the kth non-defective unit. This implies that we have received (k-1) non-defective units among the first (x-1) orders placed, which is possible in C(x-1, k-1) distinct ways. At the xth order, the instant of having received the kth non-defective unit, we have k successes and x-k failures. Hence, the probability distribution of the RV is calculated as:

P(X = x) = C(x-1, k-1) p^k (1-p)^(x-k), x = k, k+1, k+2, …

Such an RV is called the negative binomial RV and its probability distribution is the negative binomial distribution. Technically, this RV has no upper bound, as the next required success may never turn up. We state the mean and variance of this distribution as follows:

E(X) = k/p and Var(X) = k(1-p)/p^2

A particular and important special case of the negative binomial RV occurs for k = 1, and it is known as the geometric RV. In this case, the pmf is calculated as:

P(X = x) = p(1-p)^(x-1), x = 1, 2, …
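Note that R's dnbinom is parameterized by the number of failures before the kth success rather than by the total number of trials X. The following small check (with illustrative values of x, k, and p) shows that the two viewpoints agree:

k <- 10; p <- 0.9; x <- 12                    # x = total trials, so x - k failures
choose(x - 1, k - 1) * p^k * (1 - p)^(x - k)  # pmf of X from the formula above
dnbinom(x - k, size = k, prob = p)            # the same value from R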
Example 1.3.3 (Baron (2007), page 77), Sequential Testing: In a certain setup, the probability of an item being defective is (1-p) = 0.05. To complete the lab setup, 12 non-defective units are required. We need to compute the probability that at least 15 units need to be tested. Here we make use of the cumulative distribution of the negative binomial distribution, the pnbinom function available in R. As with the pbinom function, the main arguments that we require here are size, prob, and q. This problem is solved in a single line of code:
> 1-pnbinom(3,size=12,0.95)
[1] 0.005467259
Note that we have specified 3 as the quantile point (the x argument) since the size parameter of this experiment is 12 and we are seeking at least 15 units, which translates into 3 more units than the size parameter. The function pnbinom computes the cumulative distribution function and the requirement is actually the complement; hence the expression in the code is 1-pnbinom. We may equivalently solve the problem using the dnbinom function, which straightforwardly computes the required probabilities:
> 1-(dnbinom(3,size=12,0.95)+dnbinom(2,size=12,0.95)+dnbinom(1,
+ size=12,0.95)+dnbinom(0,size=12,0.95))
[1] 0.005467259
Poisson distribution
The number of accidents on a 1 km stretch of road, the total calls received during a one-hour slot on your mobile, the number of "likes" received for a status update on a social networking site in a day, and similar other cases are some of the examples which are addressed by the Poisson RV. The probability distribution of a Poisson RV is calculated as:

P(X = x) = e^(-λ) λ^x / x!, x = 0, 1, 2, …, λ > 0

Here λ is the parameter of the Poisson RV, with X denoting the number of events. The Poisson distribution is sometimes also referred to as the law of rare events. The mean and variance of the Poisson RV are, surprisingly, the same and equal λ, that is, E(X) = Var(X) = λ.
Example 1.3.4: Suppose that Santa commits errors in a software program with a mean of three errors per A4-size page. Santa's manager wants to know the probability of Santa committing 0, 5, and 20 errors per page. The R function dpois helps to determine the answer.
> dpois(0,lambda=3); dpois(5,lambda=3); dpois(20, lambda=3)
[1] 0.04978707
[1] 0.1008188
[1] 7.135379e-11
Note that Santa's probability of committing 20 errors is almost 0.
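As a small add-on sketch (not part of the original example), the cumulative counterpart ppois answers questions such as the probability of at most five errors on a page or, via the complement, of at least 20 errors:

ppois(5, lambda = 3)       # P(X <= 5)
1 - ppois(19, lambda = 3)  # P(X >= 20)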
We will next focus on continuous distributions.
Continuous distribution
The numeric variables in the survey, Age, Mileage, and Odometer, can take any value over a continuous interval, and these are examples of continuous RVs. In the previous section we dealt with RVs which have discrete output. In this section we will deal with RVs which have continuous output. A distinction from the previous section needs to be pointed out explicitly. In the case of a discrete RV, there is a positive probability of the RV taking on a certain value, which is determined by the pmf. In the continuous case, an RV assumes any specific value with zero probability. These technical issues will not be discussed in this book. In the discrete case, the probabilities of certain values are specified by the pmf, and in the continuous case the probabilities over intervals are determined by the probability density function, abbreviated as pdf.
Suppose that we have a continuous RV, X, with the pdf f(x) defined over the possible x values; that is, we assume that the pdf f(x) is well defined over the range of the RV X, denoted by R_x. It is necessary that the integral of f(x) over the range R_x equals 1, that is:

∫_{R_x} f(s) ds = 1

The probability that the RV X takes a value in an interval [a, b] is defined by:

P(X ∈ [a, b]) = ∫_a^b f(x) dx

In general we are interested in the cumulative probabilities of a continuous RV, which is the probability of the event P(X ≤ x). In terms of the previous equations, this is obtained as:

P(X ≤ x) = ∫_{-∞}^x f(s) ds

A special name for this probability is the cumulative distribution function. The mean and variance of a continuous RV are then defined by:

E(X) = ∫_{R_x} x f(x) dx and Var(X) = ∫_{R_x} (x − E(X))^2 f(x) dx
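As an illustrative sketch (the density f(x) = 2x on [0, 1] is a toy example of our own, not one of the survey variables), these definitions can be checked numerically with R's integrate function:

f <- function(x) 2 * x                                        # a valid pdf on [0, 1]
integrate(f, 0, 1)                                            # total probability, equals 1
EX <- integrate(function(x) x * f(x), 0, 1)$value             # mean, 2/3
VarX <- integrate(function(x) (x - EX)^2 * f(x), 0, 1)$value  # variance, 1/18
EX; VarX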
As in the previous section, we will begin with the simpler of the RVs, the uniform distribution.
Uniform distribution
An RV is said to have a uniform distribution over the interval [0, θ], θ > 0, if its probability density function is given by:

f(x; θ) = 1/θ, 0 ≤ x ≤ θ, θ > 0

In fact, it is not necessary to restrict our focus to the positive real line. For any two real numbers a and b from the real line, with b > a, the uniform RV can be defined by:

f(x; a, b) = 1/(b − a), a ≤ x ≤ b, b > a
The uniform distribution has a very important role to play in simulation, as will be seen in Chapter 6, Linear Regression Analysis. As with the discrete counterpart, in the continuous case any two intervals of the same length have an equal probability of occurring. The mean and variance of a uniform RV over the interval [a, b] are respectively given by:

E(X) = (a + b)/2 and Var(X) = (b − a)^2/12
Example 1.4.1, Horgan's (2008) Example 15.3: The International Journal of Circuit Theory and Applications reported in 1990 that researchers at the University of California, Berkeley, had designed a switched capacitor circuit for generating random signals whose trajectory is uniformly distributed over the unit interval [0, 1]. Suppose that we are interested in calculating the probability that the trajectory falls in the interval [0.35, 0.58]. Though the answer is straightforward, we will obtain it using the punif function:
> punif(0.58)-punif(0.35)
[1] 0.23
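As a quick cross-check (a sketch of our own, not part of Horgan's example), the same probability can be approximated by simulating uniform trajectories with runif:

set.seed(1)
u <- runif(100000)             # simulated trajectories on [0, 1]
mean(u >= 0.35 & u <= 0.58)    # close to the exact answer 0.23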
Exponential distribution
The exponential distribution is probably one of the most important probability distributions in Statistics, and more so for computer scientists. The number of arrivals in a queuing system, the time between two incoming calls on a mobile, the lifetime of a laptop, and so on, are some of the important applications where this distribution has a lasting utility value. The pdf of an exponential RV is specified by:

f(x; λ) = λ e^(−λx), x ≥ 0, λ > 0
The parameter λ is sometimes referred to as the failure rate. The exponential RV enjoys a special property called the memoryless property, which says that:

P(X > t + s | X > s) = P(X > t), for all t, s > 0

This mathematical statement says that if X is an exponential RV, then its failure behavior in the future depends only on the present, and the past (age) of the RV does not matter. In simple words, this means that the probability of failure is constant in time and does not depend on the age of the system.
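The memoryless property can also be checked numerically with pexp; the rate and the time points below are arbitrary choices for illustration:

lambda <- 0.5; t <- 2; s <- 3
(1 - pexp(t + s, rate = lambda)) / (1 - pexp(s, rate = lambda))  # P(X > t + s | X > s)
1 - pexp(t, rate = lambda)                                       # P(X > t), the same value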
Let us obtain the plots of a few exponential distributions:
> par(mfrow=c(1,2))
> curve(dexp(x,1),0,10,ylab="f(x)",xlab="x",cex.axis=1.25)
> curve(dexp(x,0.2),add=TRUE,col=2)
> curve(dexp(x,0.5),add=TRUE,col=3)
> curve(dexp(x,0.7),add=TRUE,col=4)
> curve(dexp(x,0.85),add=TRUE,col=5)
> legend(6,1,paste("Rate = ",c(1,0.2,0.5,0.7,0.85)),col=1:5,pch=
+ "___")
> curve(dexp(x,50),0,0.5,ylab="f(x)",xlab="x")
> curve(dexp(x,10),add=TRUE,col=2)
> curve(dexp(x,20),add=TRUE,col=3)
> curve(dexp(x,30),add=TRUE,col=4)
> curve(dexp(x,40),add=TRUE,col=5)
> legend(0.3,50,paste("Rate = ",c(50,10,20,30,40)),col=1:5,pch=
+ "___")
Figure 7: The exponential densities
The mean and variance of the exponential distribution are given by:

E(X) = 1/λ and Var(X) = 1/λ^2
Normal distribution
The normal distribution is in some sense an all-pervasive distribution that arises sooner or later in almost any statistical discussion. In fact, it is very likely that the reader is already familiar with certain aspects of the normal distribution; for example, the shape of the normal distribution curve is bell-shaped. Its mathematical appeal is probably reflected in the fact that, though it has a fairly simple expression, its density function involves the three most famous irrational numbers, π, e, and √2.
Suppose that X is normally distributed with mean μ and variance σ^2. Then, the probability density function of the normal RV is given by:

f(x; μ, σ^2) = (1/(σ√(2π))) exp{−(x − μ)^2/(2σ^2)}, −∞ < x < ∞, −∞ < μ < ∞, σ > 0
If the mean is zero and the variance is one, the normal RV is referred to as the standard normal RV, and the standard practice is to denote it by Z.
Example 1.4.2, Shady Normal Curves: We will again consider a standard normal random variable, which is more popularly denoted in Statistics by Z. Some of the most needed probabilities are P(Z > 0), P(-1.96 < Z < 1.96), and P(-2.58 < Z < 2.58). These probabilities are shaded in the following plots.
> par(mfrow=c(3,1))
> # Probability Z Greater than 0
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(0,4,0.02)
> lines(z,dnorm(z),type="h",col="grey")
> # 95% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(-1.96,1.96,0.001)
> lines(z,dnorm(z),type="h",col="grey")
> # 99% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(-2.58,2.58,0.001)
> lines(z,dnorm(z),type="h",col="grey")
Figure 8: Shady normal curves
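The shaded areas can also be computed directly (a small sketch of our own) using the cumulative distribution function pnorm:

pnorm(0, lower.tail = FALSE)   # P(Z > 0) = 0.5
pnorm(1.96) - pnorm(-1.96)     # about 0.95
pnorm(2.58) - pnorm(-2.58)     # about 0.99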
Summary
You should now be clear about the distinct nature of variables that arise in different scenarios. In R, you should be able to verify that the data is in the correct format. Further, the important families of random variables were introduced in this chapter, which should help you in dealing with them when they crop up in your experiments. Computation of simple probabilities was also introduced and explained.
In the next chapter you will learn how to perform basic R computations, create data objects, and so on. As data can seldom be constructed completely in R, we need to import data from external files. The methods explained will help you to import data in file formats such as .csv and .xls. Similar to importing, it is also important to be able to export data/output to other software. Finally, R session management will conclude the next chapter.
Import/Export Data
The main goals of this chapter are to familiarize you with the various classes of objects in R, help the reader extract data from various popular formats, connect R with popular databases such as MySQL, and finally recommend the best export options for R output. The motivation is that the practitioner frequently has data available in a fixed format, and sometimes the dataset is available in popular database systems.
This chapter helps you to extract data from various sources, and also recommends the best export options for R output. We will, though, begin with a better understanding of the various formats in which R stores data. Updated information about the import/export options is maintained at http://cran.r-project.org/doc/manuals/R-data.html.
To summarize, the main learnings from this chapter are the following:
Basic and essential computations in R
Importing data from CSV, XLS, and a few more formats
Exporting data for other software
R session management
data.frame and other formats
Any software comes with its own structure and nuances. The Questionnaire and its components section of Chapter 1, Data Characteristics, introduced various facets of data. In the current section we will go into the details of how R works with data of different characteristics. Depending on the need, we have different formats for the data. In this section, we will begin with simpler objects and move up the ladder towards some of the more complex ones.
Constants, vectors, and matrices
R has five inbuilt objects which store certain constant values. The five objects are LETTERS, letters, month.abb, month.name, and pi. The first two objects contain the letters A-Z in upper and lower case. The third and fourth objects hold the abbreviated and the complete month names respectively. Finally, the object pi contains the value of the famous irrational number. So, the exercise here is for you to find the value of the irrational number e. The details about these R constant objects may be obtained using ?Constants or example(Constants), of course by executing these commands at the console.
There is also another class of constants in R which is very useful. These constants are called NumericConstants and include Inf for infinite numbers, NaN for not a number, and so on. You are encouraged to find more details and other useful constants. R can handle numeric, character, logical, integer, and complex kinds of vectors, and it is the class of the object which characterizes the vector. Typically, we deal with vectors which may be numeric, character, and so on. A vector of the desired class and number of elements may be initiated using the vector function. The length argument declares the size of the vector, which is the number of elements of the vector, whereas mode characterizes the vector as one of the required classes. The elements of a vector can be assigned names if required. The R names function comes in handy for this purpose.
Arithmetic on numeric vector objects can be performed easily. The operators (+, -, *, /, and ^) are respectively used for addition, subtraction, multiplication, division, and exponentiation. The characteristics of a vector object may be obtained using functions such as sum, prod, min, max, and so on. The accuracy of a vector up to a certain number of decimals may be fixed using the digits option, the round function, and so on.
Now, two vectors need not have the same number of elements and we may still carry out an arithmetic operation between them, say addition. In a mathematical sense two vectors of unequal length cannot be added. However, R goes ahead and performs the operation just the same. Thus, there is a need to understand how operations are carried out in such cases. To begin with the simpler case, let us consider two vectors with an equal number of elements. Suppose that we have a vector x = (1, 2, 3, …, 9, 10) and y = (11, 12, 13, …, 19, 20). If we add these two vectors, x + y, the result is an element-wise addition of the respective elements of x and y, that is, we get a new vector with elements 12, 14, 16, …, 28, 30. Now, let us increase the number of elements of y from 10 to 12, with y = (11, 12, 13, …, 19, 20, 21, 22). The operation is carried out in such a way that the elements of x (the smaller object) are element-wise added to the first ten elements of y. Now, R finds that there are two more elements of y, namely 21 and 22, which have not been touched as yet. It then picks the first two elements of x, 1 and 2, and adds them to 21 and 22. Hence, the 11th and 12th elements of the output are 21 + 1 = 22 and 22 + 2 = 24. The warning says that the longer object length is not a multiple of shorter object length, which has now been explained.
Let us have a brief peek at a few more operators related to vectors. The operator %% applied to two objects, say x and y, returns the remainder following integer division, and the operator %/% returns the integer quotient.
Time for action – understanding constants, vectors, and basic
arithmetic
We will look at a few important and interesting examples. You will understand the structure of vectors in R and will also be able to perform the basic arithmetic related to this requirement.
1. Key in LETTERS at the R console and hit the Enter key.
2. Key in letters at the R console and hit the Enter key.
3. To obtain the first five and the last five letters, try the following code: LETTERS[c(1:5,22:26)] and letters[c(1:5,22:26)].
4. Month names and their abbreviations are available in the base package; explore them using ?Constants at the console.
5. Selected month names and their abbreviations can be obtained using month.abb[c(1:3,8:10)] and month.name[c(1:3,8:10)]. Also, the value of pi in R can be found by entering pi at the console.
6. To generate a vector of length 4, without specifying the class, try vector(length=4). Vector objects of specific classes can be generated by declaring the mode value for the vector object. That is, a numeric vector (with default values 0) is obtained by the code vector(mode = "numeric", length=4). You can similarly generate logical, complex, character, and integer vectors by specifying them as options in the mode argument.
The next screenshot shows the results as you run the preceding code in R.
7. Creating new vector objects and name assignment: A generated vector can be assigned to new objects using either =, <-, or ->. With the arrow operators, the assignment goes from the generated vector at the tail of the arrow to the new variable at its head.
1. First assign the integer vector 1:10 to x by using x <- 1:10.
2. Check the names of x by using names(x).
3. Assign the first 10 letters of the alphabet as names for the elements of x by using names(x) <- letters[1:10], and verify that the assignment is done using names(x).
4. Finally, display the numeric vector x by simply keying in x at the console.
8. Basic arithmetic: Create new R objects by entering x <- 1:10; y <- 11:20; a <- 10; b <- -4; c <- 0.5 at the console. In a certain sense, x and y are vectors while a, b, and c are constants.
1. Perform simple addition of the numeric vectors with x + y.
2. Scalar multiplication of vectors followed by summing the resulting vectors is easily done using a*x + b*y.
3. Verify the result (a + b)x = ax + bx by checking that (a+b)*x == a*x + b*x results in a logical vector of length 10, each element having the value TRUE.
4. Vector multiplication is carried out by x*y.
5. Vector division in R is simply element-wise division of the two vectors, and does not have an interpretation in mathematics. We obtain the accuracy up to 4 digits using round(x/y,4).
6. Finally, (element-wise) exponentiation of vectors is carried out through x^2.
9. Adding two unequal length vectors: The arithmetic explained before applies to unequal length vectors in a slightly different way. Run the following operations: x=1:10; x+c(1:12), length(x+c(1:12)), c(1:3)^c(1,2), and (9:11)-c(4,6).
10. The integer quotient and the remainder following integer division may be obtained respectively using the %/% and %% operators. Key in -3:3 %% 2, -3:3 %% 3, and -3:3 %% c(2,3) to find the remainders of the sequence -3, -2, …, 2, 3 on division by 2, 3, and c(2,3). Replace the operator %% by %/% to find the integer quotients.
Now, we will first give the required R code so that you can execute it in the software:
LETTERS
letters
LETTERS[c(1:5,22:26)]
letters[c(1:5,22:26)]
?Constants
Chapter 2
[ 37 ]
month.abb[c(1:3,8:10)]
month.name[c(1:3,8:10)]
pi
vector(length=4)
vector(mode="numeric",length=4)
vector(mode="logical",length=4)
vector(mode="complex",length=4)
vector(mode="character",length=4)
vector(mode="integer",length=4)
x=1:10
names(x)
names(x)<- letters[1:10]
names(x)
x
x=1:10
y=11:20
a=10
b=-4
x + y
a*x + b*y
sum((a+b)*x == a*x + b*x)
x*y
round(x/y,4)
x^2
x=1:10
x+c(1:12)
length(x+c(1:12))
c(1:3)^c(1,2)
(9:11)-c(4,6)
-3:3 %% 2
-3:3 %% 3
-3:3 %% c(2,3)
-3:3 %/% 2
-3:3 %/% 3
-3:3 %/% c(2,3)
Execute the preceding code in your R session.
What just happened?
We have split the output into multiple screenshots for ease of explanation.
Introducing constants and vectors functioning in R
LETTERS is a character vector available in R that consists of the 26 uppercase letters of the English language, whereas letters contains the same alphabet in lowercase. We have used the integer vector c(1:5,22:26) as the index to extract the first and last five elements of both character vectors. When the ?Constants command is executed, R pops up an HTML file in your default internet browser and opens a page with a link of the form http://my_IP/library/base/html/Constants.html. You can find more details about Constants from the base package on this web page. The months, as in January-December, are available in the character vector month.name, whereas the popular abbreviated forms of the months are available in the character vector month.abb. Finally, the numeric object pi holds the value of the famous irrational number π, displayed to about seven significant digits by default.
Next, we consider the generation of various types of vectors using the R vector function. The code vector(mode="numeric",length=4) creates a numeric vector with default values of 0 and the required length of four. Similarly, the other vectors are created.
Vector arithmetic in R
An integer vector object is created by the code x = 1:10. We could have alternatively used options such as x <- 1:10 or 1:10 -> x. The final result is of course the same. The assignment operator <- (find more details by running ?assignOps at the R console) is far more popular in the R community and it can be used in any part of R programming. By default, there won't be any names assigned to either vectors or matrices; thus, the output NULL. names is a function in R which is useful for assigning appropriate names. Our task is to assign the first 10 lowercase letters of the alphabet as the names of the vector x. Hence, we have the code names(x) <- letters[1:10]. We verify that the names have been properly assigned, and observe the change in the display of x following the assignment of the names, using names(x) and x.
Next, we create two integer vectors, x and y, and two objects, a and b, which may be treated as scalars. Now, x + y; a*x + b*y; sum((a+b)*x == a*x + b*x) performs three different tasks. First, it performs addition of vectors and returns the result of element-wise addition of the two vectors, leading to the answer 12, 14, …, 28, 30. Second, we verify the result of scalar multiplication of vectors, and third, the result of (a + b)x = ax + bx.
In the next round of R code, we ran x*y; round(x/y,4); x^2. Similar to the addition operator, the * operator performs element-wise multiplication of the two vectors. Thus, we get the output 11, 24, …, 171, 200. In the next part, recall that ; executes the code that follows as if it were entered on a new line; first the element-wise division is carried out, and for the resulting (numeric) vector, the round function gives the accuracy up to four digits as specified. Finally, x^2 gives us the square of each element of x. Here, 2 can be replaced by any other real number.
In the last block of code, we repeat some of the earlier operations with the minor difference that the two vectors are not of the same length. As predicted earlier, R issues a warning that the length of the longer vector is not a multiple of the length of the shorter vector. Thus, for the operation x+c(1:12), first all the elements of x (the shorter vector here) are added to the first 10 elements of 1:12. Then the last two elements of 1:12, namely 11 and 12, need to be added to elements from x, and for this purpose R picks the first two elements of x. If the length of the longer vector is a multiple of that of the shorter one, the entire shorter vector is recycled over it in cycles. The remaining results, from running c(1:3)^c(1,2); (9:11)-c(4,6), are left to the reader for interpretation.
Let us look at the output after the R code for the integer quotient and remainder operations between two objects is carried out.
Integer divisor and remainder operations
In the segment -3:3 %% 2, we first create the sequence -3, -2, …, 2, 3 and then ask for the remainder when each element is divided by 2. Clearly, the remainder for any integer divided by 2 is either 0 or 1, and for a sequence of consecutive integers we expect an alternating sequence of 0s and 1s, which is the output in this case. Check the expected result for -3:3 %% 3. Now, for the operation -3:3 %% c(2,3), first look at the complete sequence -3:3 as -3, -2, -1, 0, 1, 2, 3. Here, the elements -3, -1, 1, 3 are divided by 2 and the remainders returned, whereas -2, 0, 2 are divided by 3 and their remainders returned. The operator %/% returns the integer quotient, and the interpretation of those results is left to the reader. Please refer to the previous screenshot for the results.
We now look at matrix objects. Similar to the vector function in R, we have matrix as a function that creates matrix objects. A matrix is an array of numbers with a certain number of rows and columns. By default, the elements of a matrix are generated as NA, that is, not available. Let r be the number of rows and c the number of columns. The order of a matrix is then r x c. A vector object of length rc in R can be converted into a matrix by the code matrix(vector, nrow=r, ncol=c, byrow=TRUE). The rows and columns of a matrix may be assigned names using the dimnames option of the matrix function.
The mathematics of matrices is preserved in R's matrix arithmetic. Suppose we have two matrices A and B with respective dimensions m x n and n x o. The matrix product A x B is then a matrix of order m x o, which is obtained in R by the operation A %*% B. We are also often interested in the determinant of a square matrix (one whose number of rows equals its number of columns), and this is obtained in R using the det function on the matrix, say det(A). Finally, we also more often than not require the computation of the inverse of a square matrix. The first temptation is to obtain it using A^{-1}. This will give a wrong answer, as it leads to the element-wise reciprocal and not the inverse of the matrix. The solve function in R, executed on a (non-singular) square matrix, gives the inverse of the matrix. Fine! Let us now do these operations using R.
Time for action – matrix computations
We will see basic matrix computations in the forthcoming steps. Matrix computations such as the matrix product, transpose, and inverse will be illustrated.
1. Generate a 2 x 2 matrix with default values using matrix(nrow=2, ncol=2).
2. Create a matrix from the vector 1:4 by running matrix(1:4,nrow=2,ncol=2, byrow=TRUE).
3. Assign row and column names for the preceding matrix by using the dimnames option, that is, by running A <- matrix(data=1:4, nrow=2, ncol=2, byrow=TRUE, dimnames = list(c("R_1", "R_2"),c("C_1", "C_2"))) at the R console.
4. Find the properties of the preceding matrices by using the commands nrow, ncol, dimnames, and a few more, with dim(A); nrow(A); ncol(A); dimnames(A).
5. Create two matrices X and Y of order 3 x 4 and 4 x 3, and obtain their matrix product with the code X <- matrix(c(1:12),nrow=3,ncol=4); Y <- matrix(13:24, nrow=4) and X %*% Y.
6. The transpose of a matrix is obtained using the t function, for example t(Y).
7. Create a new matrix A <- matrix(data=c(13,24,34,23,67,32,45,23,11), nrow=3) and find its determinant and inverse by using det(A) and solve(A) respectively.
The R code for the preceding action list is given in the following code snippet:
matrix(nrow=2,ncol=2)
matrix(1:4,nrow=2,ncol=2, byrow=TRUE)
A <- matrix(data=1:4, nrow=2, ncol=2, byrow=TRUE, dimnames =
list(c("R_1", "R_2"),c("C_1", "C_2")))
dim(A); nrow(A); ncol(A); dimnames(A)
X <- matrix(c(1:12),nrow=3,ncol=4)
Y <- matrix(13:24, nrow=4)
X %*% Y
t(Y)
A <- matrix(data=c(13,24,34,23,67,32,45,23,11),nrow=3)
det(A)
solve(A)
Note the use of a semicolon (;) in line 5 of the preceding code. The result of this usage
is that the code separated by a semicolon is executed as if it was entered on a new line.
Execute the preceding code in your R console. The output of the R code is given in the
following screenshot:
Matrix computations in R
What just happened?
You were able to create matrices in R and learned the basic operations. Remember that solve, and not ^-1, gives you the inverse of a matrix. It is now seen that matrix computations in R are really easy to carry out.
The options nrow and ncol are used to specify the dimensions of a matrix. Data for a matrix can be specified through the data argument. The first two lines of code in the previous screenshot create bare-bone matrices. Using the dimnames argument, we have created a more elegant matrix and assigned it to a matrix object named A.
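As a quick sanity check (a sketch of our own, reusing the matrix A defined in the preceding steps), multiplying A by solve(A) returns the identity matrix, whereas A^-1 merely gives element-wise reciprocals:

A <- matrix(data=c(13,24,34,23,67,32,45,23,11),nrow=3)
round(A %*% solve(A), 10)   # the 3 x 3 identity matrix, so solve(A) is the inverse
A^-1                        # element-wise reciprocals, NOT the matrix inverse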
We next focus on the list object. It has already been used earlier to specify the dimnames
of a matrix.
The list object
In the preceding subsection we saw different kinds of objects such as constants, vectors, and matrices. Sometimes it is required that we pool them together into a single object. The framework for this task is provided by the list object. From the online source http://cran.r-project.org/doc/manuals/R-intro.html#Lists-and-data-frames, we define a list as "an object consisting of an ordered collection of objects known as its components." Basically, various types of objects can be brought under a single umbrella using the list function. Let us create a list which contains a character vector, an integer vector, and a matrix.
Time for action – creating a list object
Here, we will have a first look at the creation of list objects, which can contain objects of different classes:
1. Create a character vector containing the first six capital letters with A <- LETTERS[1:6]. Create an integer vector of the first ten integers 1-10 with B <- 1:10, and a matrix with C <- matrix(1:6,nrow=2).
2. Create a list which has the three objects created in the previous step as its components with Z <- list(A = A, B = B, C = C).
3. Ensure that the class of Z and of its three components A, B, and C is indeed retained as follows: class(Z); class(Z$A); class(Z$B); class(Z$C).
The consolidated R code is given next, which you will have to enter at the R console:
A <- LETTERS[1:6]; B <- 1:10; C <- matrix(1:6,nrow=2)
Z <- list(A = A, B = B, C = C)
Z
class(Z); class(Z$A); class(Z$B); class(Z$C)
Creating and understanding a list object
What just happened?
Different classes of objects can be easily brought under a single umbrella, and their structures are also preserved within the newly created list object. In particular, here we put a character vector, an integer vector, and a matrix under a single list object. Next, we check the class of the Z object and find the answer to be list, as it should be. A new extraction tool has been introduced in the dollar symbol $, which needs an explanation. Elements/objects of a list can be extracted using the $ operator, along similar lines to the [ and [[ extraction tools. In our example, Z$A extracts the A object from the Z list, and we wrap the class function around Z$A to find its class. It is then confirmed that the classes of A, B, and C are preserved inside the list object. More details about the extraction tools may be obtained by running ?Extract at the R console.
Yes, you have successfully created your first list object. This utility is particularly useful when building big programs where we need to carry several related objects within a single object.
The data.frame object
In Figure 2 of Chapter 1, Data Characteristics, we saw that when the class function is applied to the SQ object, the output is data.frame. The details about this function can be obtained by executing ?data.frame at the R console. The first noticeable aspect is data.frame {base}, which means that this function is in the base package. Further, the description says: "This function creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software." This description is seen to be correct, as in the same figure we have numeric, character, and factor variables contained in the same data.frame object. Thus, we know that a data.frame object can contain different kinds of variables.
A data frame can contain different types of objects. That is, we can create two different classes of vectors and bind them together in a single data frame. A data frame can also be updated with new vectors, and existing components can be dropped from it. As with vectors and matrices, we can assign names to a data frame as is convenient for us.
Time for action – creating a data.frame object
Here, we create a data.frame from vectors. New objects are then added to the existing data frame and some preliminary manipulations are demonstrated.
1. Create a numeric and a character vector of length 3 each with x <- c(2,3,4); y <- LETTERS[1:3].
2. Create a new data frame with df1 <- data.frame(x,y).
3. Verify the variable names of the data frame, the classes of the components, and display the variables distinctly with variable.names(df1); sapply(df1, class); df1$x; df1$y.
4. Add a new numeric vector to df1 with df1$z <- c(pi,sqrt(2), 2.71828) and verify the changes in df1 by entering df1 at the console.
5. Nullify the x component of df1 and verify the change.
6. Bring back the original x values with df1$x <- x.
7. Add a fourth observation with df1[4,] <- list(y=LETTERS[2], z=3, x=5) and then remove the second observation with df1 <- df1[-2,] and verify the change again.
8. Find the row names (or the observation names) of the data frame object by using row.names(df1).
9. Obtain the column names (which should actually be x, y, and z) with colnames(df1). Change the row and column names using row.names(df1) <- 1:3; colnames(df1) <- LETTERS[1:3] and display the final form of the data frame.
Following is the consolidated code that you have to enter in the R console:
# The data.frame Object
x <- c(2,3,4); y <- LETTERS[1:3]
df1<-data.frame(x,y)
variable.names(df1)
sapply(df1,class)
df1$x
df1$y
df1$z <- c(pi,sqrt(2), 2.71828)
df1
df1$x <- NULL
df1
df1$x <- x
df1[4,]<- list(y=LETTERS[2],z=3,x=5)
df1 <- df1[-2,]
df1
row.names(df1)
dim(df1)
colnames(df1)
row.names(df1)<- 1:3
colnames(df1)<-LETTERS[1:3]
df1
On running the preceding code in R, you will see the output as shown in the
following screenshot:
Understanding a data.frame object
Let us now look at a larger data.frame object. iris is a very famous dataset and we will use it to check out some very useful tools for data display.
1. Load the iris data from the datasets package with data(iris).
2. Check the first 10 observations of the dataset with head(iris,10).
3. A compact display of a data.frame object is obtained with the str function in the following way: str(iris).
4. Using the $ extractor tool, inspect the different Species in the iris data in the following way: iris$Species.
5. We are asked to get the first 10 observations of the Sepal.Length and Petal.Length variables only. Now, we use the [ extractor in the following way: iris[1:10,c("Sepal.Length","Petal.Length")].
Different ways of extracting objects from a data.frame object
What just happened?
A data frame may be a complex structure. Here, we first created two vectors of the same length but with different structures, one being a numeric and the other a character vector. Using the data.frame function we created a new object, df1, which contains both the vectors. The variable names of df1 are then verified with the variable.names function. After verifying that the names are indeed as expected, we check that the variable classes are preserved through the application of two functions: sapply and class. lapply is a useful utility in R which applies a function over a list or vector, and sapply is a more user-friendly version of lapply. In our particular example, we needed R to return the classes of the variables of the df1 data frame.
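As a small side sketch (our own code, not from the book's listing) of the difference between the two, lapply returns a list whereas sapply simplifies the result to a vector where possible:

df1 <- data.frame(x = c(2, 3, 4), y = LETTERS[1:3])
lapply(df1, class)   # a list with one component per column
sapply(df1, class)   # the same information simplified to a named character vector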
Have a go hero
As an exercise, explain to yourself the rest of the R code that you have executed here.
We have thus seen how to create a data frame, add and remove components and observations, change the component names, and so on.
The table object
Data displayed in a table format is easy to understand. We will begin with the famous Titanic dataset, as it is very unlikely that you will not have heard about it. That the gigantic ship sinks at the end, and that there are many beautiful movies, novels, and documentaries about it, make this dataset a very interesting example. It is known that the ship had some survivors after its unfortunate and premature end. The ship had children, women, and different classes of passengers onboard. This dataset is shipped (again) in the datasets package along with the software. The dataset relates to the passengers' survival after the tragedy.
The Titanic dataset has four variables: Class, Sex, Age, and Survived. For each combination of the values of the four variables we have a count for that combination. The Class variable is specified at four levels: 1st, 2nd, 3rd, and Crew. The gender is specified for the passengers, and the age classification is Child or Adult. It is also known through the Survived variable whether the onboard passengers survived the collision of the ship with the iceberg. Thus, we have 4 x 2 x 2 x 2 = 32 different combinations of the Age, Sex, Class, and Survived statuses.
The following screenshot gives a display of the dataset in two formats. On the right-hand side we can see the dataset in a spreadsheet style, while the left-hand side displays the frequencies according to the combinatorial groups. The question is how we create table displays such as the one on the left-hand side of the screenshot. The present section addresses this aspect of table creation.
Two different views of the Titanic dataset
The left-hand side display of the screenshot is obtained by simply keying in Titanic at the R console, and the data format on the right-hand side is obtained by running View(Titanic) at the console. In general, we have our dataset available in the form shown on the right-hand side. Hence, we will pretend that we have the dataset available in the latter format.
Time for action – creating the Titanic dataset as a table object
The goal is to create a table object from the raw dataset. We will be using the expand.grid and xtabs functions towards this end.
1. First, create four character vectors for the four variables:
Class.Level <- c("1st","2nd","3rd", "Crew")
Sex.Level <- c("Male", "Female")
Age.Level <- c("Child", "Adult")
Survived.Level <- c("No", "Yes")
2. Create a list object which takes into account the variable names and their possible levels with Data.Level <- list(Class = Class.Level, Sex = Sex.Level, Age = Age.Level, Survived = Survived.Level).
3. Now, create a data.frame object for the levels of the four variables using the expand.grid function by entering T.Table <- expand.grid(Class = Class.Level, Sex = Sex.Level, Age = Age.Level, Survived = Survived.Level) at the console. It is advisable to view T.Table and appreciate the changes that occur in this step.
4. The Titanic dataset is ready except for the frequency count at each combinatorial level. Specify the counts with T.freq <- c(0,0,35,0,0,0,17,0,118,154,387,670,4,13,89,3,5,11,13,0,1,13,14,0,57,14,75,192,140,80,76,20).
5. Augment T.Table with T.freq by using T.Table <- cbind(T.Table, count=T.freq). Again, if you view T.Table, you will find the spreadsheet-style display shown on the right-hand side of the previous screenshot.
6. To obtain the table display on the left-hand side, enter xtabs(count~ Class + Sex + Age + Survived, data = T.Table).
The complete R code is given next, which needs to be compiled in the soware:
Class.Level <- c("1st","2nd","3rd", "Crew")
Sex.Level <- c("Male", "Female")
Age.Level <- c("Child", "Adult")
Survived.Level <- c("No", "Yes")
Data.Level <- list(Class = Class.Level, Sex = Sex.Level,
Age = Age.Level, Survived = Survived.Level)
T.Table <- expand.grid(Class = Class.Level, Sex =
Sex.Level, Age = Age.Level, Survived = Survived.Level)
T.freq = c(0,0,35,0,0,0,17,0,118,154,387,670,4,13,89,3,
5,11, 13,0,1,13,14,0,57,14,75,192,140,80, 76,20)
T.Table = cbind(T.Table, count=T.freq)
xtabs(count~ Class + Sex + Age + Survived, data = T.Table)
What just happened?
In practice we often have data in frequency format. It will be seen in later chapters that the table object is required for carrying out statistical analyses. To translate frequency-formatted data into a table object, we first defined the four variables through Class.Level, Sex.Level, Age.Level, and Survived.Level. The levels for the required table object have been specified through the list object Data.Level. The function expand.grid creates all possible combinations of the factors of the four variables. The table of all possible combinations is then stored in the T.Table object. Next, the frequencies are assigned through the T.freq integer vector. Finally, the xtabs function creates the counts according to the various levels of the variables, and the result is a table object which is the same as Titanic!
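As a quick check (our own sketch, assuming the preceding steps have been run), the reconstructed table can be compared against the built-in Titanic object:

T.xtabs <- xtabs(count ~ Class + Sex + Age + Survived, data = T.Table)
sum(T.xtabs); sum(Titanic)   # both give the 2201 persons onboard
all(T.xtabs == Titanic)      # TRUE if every cell matches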
Have a go hero
UCBAdmissions is one of the benchmark datasets in Statistics. It is available in the datasets package and it has data on the admission counts of six departments. The admissions data appears to show a bias in favor of admitting male candidates over females, and it led to an allegation against the University of California, Berkeley. The details of this problem may be found on the web at http://www.unc.edu/~nielsen/soci708/cdocs/Berkeley_admissions_bias.pdf. Information about the dataset is obtained with ?UCBAdmissions. Identify all the variables and their classes and regenerate the entire table from the raw counts.
read.csv, read.xls, and the foreign package
Data is generally available in an external file. The types of external files are certainly varied, and it is important to learn which of them may be imported into R. The probable spreadsheet files may exist in CSV (comma separated values) format, XLS or XLSX (Microsoft Excel) format, or ODS (OpenOffice/LibreOffice Calc) format. There are more possible formats, and we restrict our attention to those described earlier. A snapshot of two files, Employ.dat and SCV.csv, in gedit and MS Excel is given in the following screenshot. The brief characteristics of the two files are summarized in the following list:
The first row lists the names of the variables of the dataset
Each observation begins on a new line
In the DAT file the delimiter is a tab (\t), whereas for the CSV file it is a comma (,)
All three columns of the DAT file are numeric in nature
The first five columns of the CSV file are numeric while the last column is a character
Overall, both files have a well-defined structure going for them
The following screenshot underlines the theme that when external files have a well-defined structure, it is vital that we make the most of that structure when importing them into R.
Screenshot of the two spreadsheet files
The core function for importing files into R is the read.table function from the utils package, shipped with R core. The first argument of this function is the filename; see the following screenshot. We can use header=TRUE to specify that the header names are the variable names of the columns. The separator option sep needs to be properly specified. For example, for the Employ dataset it is a tab (\t), whereas for the CSV file it is a comma (,). Frequently, each row may also have a name, for example, the customer name in a survey dataset, a serial number, and so on. This can be specified through row.names. The row names may or may not be present in the external file. That is, either the row names or the column names need not be part of the file from which we are importing the data.
The read.table syntax
In many files there may be missing observations. Such data can be appropriately imported by specifying the missing-value codes in na.strings. The missing values may be represented by blank cells, a period, and so on. You may find more details about the other options on the read.table help page. We note that read.csv, read.delim, and so on, are variants of the read.table function. An Excel file of type XLS or XLSX may be imported into R with the use of the read.xls function from the gdata package.
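As a hedged sketch of these options (the filenames my_data.dat and my_data.xls and the missing-value code "." are purely illustrative), typical import calls might look as follows:

# a tab-delimited file with a header row and periods marking missing values
dat <- read.table("my_data.dat", header = TRUE, sep = "\t", na.strings = ".")
# an Excel sheet read via the gdata package
library(gdata)
xl <- read.xls("my_data.xls", sheet = 1)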
Let us begin with importing simpler data files into R.
Example 2.2.1, Reading from a DAT file: The datasets analyzed in Ryan (2007) are available on the web at ftp://ftp.wiley.com/public/sci_tech_med/engineering_statistics/. Download the file engineering_statistics.zip and unzip the contents into the working directory of the R session. The problem is described in Exercise 1.82 of Ryan. The monthly data on the number of employees over a period of five years for three Wisconsin industries (wholesale and retail trade, food and kindred products, and fabricated metals) is available in the file Employ.dat. The task is to import this dataset into the R session. Note that the three variables, namely the numbers of employees in the three industries, are numeric in their characteristics. These characteristics should be retained in our session too.
A useful practice is to actually open the source file and check the nature of the data in it. For example, you should ask yourself how you will interpret the number 3.220000000e+002 specified in the original DAT file. In the Time for action – importing data from external files section that follows, we will use the read.table function to import this data file.
Example 2.2.2, Reading from a CSV file: Ryan (2007) uses a dataset analyzed by Gupta (1997). In this case study, related to antibiotic suspension products, the response variable is Separated Clear Volume, whose smaller values indicate better quality. The experiment hosts five variables, each at two different levels, that is, each of the five variables is a factor variable, and the goal of the experiment is the determination of the best combination of these factors which yields the minimum value of the response variable.
Now, sometimes the required dataset may be available across several CSV files. In such cases, we first read them from their various destinations and then combine them to obtain a single metafile. A trick here is the use of the merge function. Suppose that the preceding dataset is divided into two datasets, SCV_Usual.csv and SCV_Modified.csv, according to the variable E. We read them into two separate data objects and then merge them into a single object.
We will carry out the imporng of these les in the next me for acon secon.
Example 2.2.3. Reading files using the foreign package: SPSS, SAS, STATA, and so on are some of the very popular statistical software packages. Each of these software packages has its own file structure for the datasets. The foreign package, which is shipped along with the R software, helps to read datasets used in these software packages. The rootstock dataset is a very popular dataset in the area of multivariate statistics, and it is available on the web at http://www.stata-press.com/data/r10/rootstock.dta. Essentially, the dataset is available for the STATA software. We will now see how R reads this dataset.
Let us get set for the action.
Time for action – importing data from external files
The external files may be imported into R using the right functions available in it. Here, we will use the read.table, read.csv, and read.dta functions to drive home the point.
1. Verify that you have the necessary files, Employ.dat, SCV.csv, SCV_Usual.csv, and SCV_Modified.csv, in the working directory by using list.files().
2. If the files are not available in the list displayed, find your working directory using getwd() and then copy the files to the working directory. Alternatively, you can set the working directory to the folder where the files are with setwd("C:/my_files_are_here").
3. Read the data in Employ.dat with the code employ <- read.table("Employ.dat", header=TRUE).
4. View the data with View(employ) and ensure that the data file has been properly imported into R.
5. Check that the class of employ and its variables have been imported in the correct
format with class(employ); sapply(employ,class).
6. Import the Separated Clear Volume data from the SCV.csv file using the code SCV <- read.csv("SCV.csv", header=TRUE).
7. Run sapply(SCV, class). You will find that variables A-D are of the numeric class. Convert the class of variable A to factor with either class(SCV$A) <- 'factor' or SCV$A <- as.factor(SCV$A).
8. Repeat the preceding step for variables B-D.
9. The data in the SCV.csv file is split into two files by the E variable values and is available in SCV_Usual.csv and SCV_Modified.csv. Import the data in these two files using the appropriate modifications in Step 6 and label the respective R data frame objects as SCV_Usual and SCV_Modified.
10. Combine the data from the two latest objects with SCV_Combined <- merge(SCV_Usual, SCV_Modified, by.y=c("Response","A","B","C","D","E"), all.x=TRUE, all.y=TRUE).
11. Initialize the library package foreign with library(foreign).
12. Tell R where on the web the dataset is available using rootstock.url <- "http://www.stata-press.com/data/r10/rootstock.dta".
An Internet connection is required to perform this step.
13. Use the read.dta function from the foreign package to import the dataset from the web into R: rootstock <- read.dta(rootstock.url)
The necessary R codes are given next in a consolidated format:
employ <- read.table("Employ.dat",header=TRUE)
View(employ)
class(employ)
sapply(employ,class)
SCV <- read.csv("SCV.csv",header=TRUE)
sapply(SCV, class)
class(SCV$A) <- 'factor'
class(SCV$B) <- 'factor'
class(SCV$C) <- 'factor'
class(SCV$D) <- 'factor'
SCV_Usual <- read.csv("SCV_Usual.csv",header=TRUE)
SCV_Modified <- read.csv("SCV_Modified.csv",header=TRUE)
SCV_Combined <- merge(SCV_Usual,SCV_Modified,by.y=c("Response",
"A","B","C","D","E"),all.x=TRUE,all.y=TRUE)
SCV_Combined
library(foreign)
rootstock.url <- "http://www.stata-press.com/data/r10/rootstock.dta"
rootstock <- read.dta(rootstock.url)
rootstock
What just happened?
Functions from the utils package help R users in importing data from various external files. The following screenshot, edited in a graphics tool, shows the result of running the previous code:
Importing data from external files
The read.table function succeeded in importing the data from the Employ.dat file. The utils function View confirms that the data has been imported with the desired classes. The function read.csv has been used to import data from the SCV.csv, SCV_Usual.csv, and SCV_Modified.csv files. The merge function combined the data in the usual and modified objects and created a new object, which is the same as the one obtained using the SCV.csv file.
Next, we used the function read.dta from the foreign package to complete the reading of a STATA file, which is available on the web.
You learned to import data in many different formats into R. The preceding program shows how to change the class of variables within the object itself. You also learned how to merge multiple data objects.
Importing data from MySQL
Data will often be available in databases (DB) such as SQL, MySQL, and so on. Emphasizing the importance of databases is beyond the scope of this section, and we will be content with importing data from a DB. The right-hand side of the following screenshot shows a snippet of the test DB in MySQL. This DB has a single table, IO_Time, and it has two variables, No_of_IO and CPU_Time. The IO_Time table has 10 observations, and we will be using this dataset for many concepts later in the book. The goal of this section is to show how to import this table into R.
An R package, RMySQL, is available from CRAN, which can be installed easily by Linux users. Unfortunately, for Windows users, the package is not available as a readily implementable installation, in the sense that install.packages("RMySQL") won't work for them. The best help for Windows users is available at http://www.r-bloggers.com/installing-the-rmysql-package-on-windows-7/, though some of the codes there are a bit outdated. However, the problem is certainly solvable! The program and illustration here work neatly for Linux users, and the following screenshot was produced on the Ubuntu 12.04 platform. Though a simple installation of R and MySQL generally does not help in installing the RMySQL package, running sudo apt-get install libmysqlclient-dev first and then install.packages("RMySQL") helps! If you still get an error, make a note that the downloaded package is saved in the /tmp/RtmpeLu7CG/downloaded_packages folder of the local machine with the name RMySQL_0.x.x.tar.gz.
You can then move to that directory and execute sudo R CMD INSTALL
RMySQL_0.x.x.tar.gz. We are now set to use the RMySQL package.
Importing data from MySQL
Note that on the Ubuntu 12.04 terminal we begin R with R -q. This suppresses the general details we get about the R software. First invoke the library with library(RMySQL). Set up the DB driver with d <- dbDriver("MySQL"). Specify the DB connector with con <- dbConnect(d, dbname='test') and then run your query to fetch the IO_Time table from MySQL with io_data <- dbGetQuery(con, 'select * from IO_Time'). Finally, verify that the data has been properly imported into R with io_data. The right-hand side of the previous screenshot confirms that the data has been correctly imported into R.
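The same steps, collected in one place as a sketch; it assumes a local MySQL server with the test database and the IO_Time table described above, and that RMySQL has been installed:

library(RMySQL)                          # assumes the package is installed
d <- dbDriver("MySQL")                   # database driver
con <- dbConnect(d, dbname = "test")     # connection to the test DB
io_data <- dbGetQuery(con, "select * from IO_Time")
io_data                                  # verify the imported table
dbDisconnect(con)                        # close the connection when done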
Exporting data/graphs
In the previous section, we learned how to import data from external files. Now, there will be many instances where we would be keen to export data from R into suitable foreign files. The need may arise in an automated system, reporting, and so on, where the other software requires making good use of the R output.
Exporting R objects
The basic R function that exports data is write.table, which is not surprising as we saw the utility of the read.table function. The following screenshot gives a snippet of the write.table function. While reading, we assign the imported file to an R object, and when exporting, we first specify the R object and then mention the filename. By default, R assigns row names while exporting the object. If there are no row names, R will simply choose serial numbers beginning with 1. If you do not need such row names, you need to specify row.names = FALSE in the program.
Exporting data using the write.table function
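As a minimal sketch, assuming the employ data frame imported earlier is still available, the object can be written back to a tab-delimited text file (the filename is arbitrary) without row names:

# export the employ object; row.names = FALSE suppresses the serial numbers
write.table(employ, file = "Employ_out.txt", sep = "\t", row.names = FALSE)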
Example 2.3.1. Exporting the Titanic data: In the Two different views of the Titanic dataset figure, we saw the Titanic dataset in two formats. It is the display on the right-hand side of the figure which we would like to export in a .csv format. We will use the write.csv function for this purpose.
> write.csv(Titanic,"Titanic.csv",row.names=FALSE)
The Titanic.csv file will be exported to the current working directory. The reader can open the CSV file in either Excel or LibreOffice Calc and confirm that it is of the desired format.
The other write/export options are also available in the foreign package. The write.xport, write.systat, write.dta, and write.arff functions are useful if the destination software is any of the following: SAS, SYSTAT, STATA, and Weka.
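As a minimal sketch, assuming the employ data frame imported earlier is still available, write.dta and write.arff from the foreign package can be used as follows:

library(foreign)
write.dta(employ, "Employ.dta")      # STATA format
write.arff(employ, "Employ.arff")    # Weka ARFF format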
Exporting graphs
In Chapter 3, Data Visualization, we will be generating a lot of graphs. Here, we will explain how to save the graphs in a desired format.
In the next screenshot, we have a graph generated by execution of the code plot(sin, -pi, 2*pi) at the terminal. This line of code generates the sine wave over the interval [-π, 2π].
Time for action – exporting a graph
Exporting of a graph will be explored here:
1. Plot the sin function over the range [-π, 2π] by running plot(sin, -pi, 2*pi).
2. A new window pops up with the title R Graphics Device 2 (ACTIVE).
3. In the menu bar, go to File | Save as | Png.
Saving graphs
4. Save the file as sin_plot.png, or any other name felt appropriate by you.
What just happened?
A file named sin_plot.png would have been created in the directory specified by you in the preceding Step 4.
Unix users do not have the luxury of saving the graph in the previously mentioned way. If you are using Unix, you have different options for saving a file. Suppose we wish to save the file when running R in a Linux environment.
The grDevices library gives different ways of saving a graph. Here, the user can use the pdf, jpeg, bmp, png, and a few more functions to save the graph. An example is given in the following code:
> jpeg("My_File.jpeg")
> plot(sin, -pi, 2*pi)
> dev.off()
null device
1
> ?jpeg
> ?pdf
Here, we first invoke the required device and specify the file name to save the output; the path directory may be specified as well along with the file name. Then, we plot the function and finally close the device with dev.off. Fortunately, this technique works on both Linux and Windows platforms.
Managing an R session
We will close the chapter with a discussion of managing the R session. In many ways, this section is similar to what we do to a dining table after we have completed the dinner. Now, there are quite a few aspects about saving the R session. We will first explain how to save the R codes executed during a session.
Time for action – session management
Managing a session is very important. Any well-developed software gives multiple options for managing a technical session, and we explore some of the methods available in R.
1. You have decided to stop the R session! At this moment, we would like to save all the R code executed at the console. In the File menu, we have an option in Save History. Basically, it is the action File | Save History…. After selecting the option, as in the previous section, we can save the history of that R session in a new text file. Save the history with the filename testhist. Basically, R saves it as an RHISTORY file which may be easily viewed/modified through any text editor. You may also save the R history in any appropriate directory, which is the destination.
2. Now, you want to reload the history testhist at the beginning of a new R session. The direction is File | Load History…; choose the testhist file.
3. In an R session, you would have created many objects with different characteristics. All of them can be saved in an .Rdata file with File | Save Workspace…. In a new session, this workspace can be loaded with File | Load Workspace….
R session management
4. Another way of saving the R codes (history and workspace) is when we close the session either with File | Exit, or by clicking on the X of the R window; a window will pop up as displayed in the previous screenshot. If you click on Yes, R will append the RHISTORY file in the working directory with the codes executed in the current session and also save the workspace.
5. If you want to save only certain objects from the current list, you can use the save function. As an example, if you wish to save the object x, run save(x, file="x.Rdata"). In a later session, you can reinstate the object x with load("x.Rdata").
However, the libraries that were invoked in the previous session are not available again. They need to be explicitly invoked again using the library() function. Therefore, you should be careful about this fact.
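The menu actions above also have command-line equivalents; the following sketch assumes an object x in the workspace and file names of your own choosing:

x <- rnorm(10)                       # a hypothetical object to save
savehistory("testhist.Rhistory")     # save the commands typed so far
save.image("mysession.RData")        # save the whole workspace
save(x, file = "x.Rdata")            # save a single object
# in a later session:
loadhistory("testhist.Rhistory")
load("x.Rdata")
library(foreign)                     # libraries must be re-attached explicitly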
Saving the R session
What just happened?
The session history is very important, and so are the objects created during a session. As you get deeper into the subject, you soon realize that it is not possible to complete all the tasks in a single session. Hence, it is vital to manage the sessions properly. You learned how to save the code history, workspace, and so on.
Have a go hero
You have two matrices $A = \begin{pmatrix} 1 & 2 & 3 \\ 6 & 5 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 2 & 1 \\ 9 & -12 \\ 1 & 6 \end{pmatrix}$. Obtain the cross-product AB and find the inverse of AB. Next, find $(B^T A^T)$ and then the transpose of its inverse. What will be your observation?
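A quick way to check the observation in R is sketched below, assuming A is laid out as a 2 x 3 matrix and B as a 3 x 2 matrix with the entries shown above:

A <- matrix(c(1, 2, 3, 6, 5, 0), nrow = 2, byrow = TRUE)    # assumed 2 x 3 layout
B <- matrix(c(2, 1, 9, -12, 1, 6), nrow = 3, byrow = TRUE)  # assumed 3 x 2 layout
AB <- A %*% B              # cross-product, a 2 x 2 matrix
solve(AB)                  # inverse of AB
t(solve(t(B) %*% t(A)))    # transpose of the inverse of B'A'; matches solve(AB),
                           # since (AB)' = B'A'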
Summary
In this chapter, we learned how to carry out the essential computations. We also learned how to import data from various foreign formats and then to export R objects and output suitable for other software. We also saw how to effectively manage an R session.
Now that we know how to create R data objects, the next step is the visualization of such data. In the spirit of Chapter 1, Data Characteristics, we consider graph generation according to the nature of the data. Thus, we will see specialized graphs for data related to discrete as well as continuous random variables. There is also a distinction made for graphs required for univariate and multivariate data. The next chapter must be pleasing on the eyes! Special emphasis is made on visualization techniques related to categorical data, which includes bar charts, dot charts, and spine plots. Multivariate data visualization is more than mere 3D plots, and the R methods such as the pairs plot discussed in the next chapter will be useful.
3
Data Visualization
Data is possibly best understood, wherever plausible, if it is displayed in a reasonable manner. Chen et al. (2008) have compiled articles where many scientists of data visualization give a deeper, historical, and modern view of data display. Data visualization is probably as old as data itself. It emerges across all the dimensions of science, history, and every stream of life where data is captured. The reader may especially go through the rich history of data visualization in the article of Friendly (2008) from Chen et al. (2008). The aesthetics of visualization have been elegantly described in Tufte (2001). The current chapter will have a deep impact on the rest of the book, and moreover this chapter aims to provide the guidance and specialized graphics in the appropriate context in the rest of the book.
This chapter provides the necessary stimulus for understanding the gist that discrete and continuous data need appropriate tools, and the validation may be seen through the distinct characteristics of such plots. Further, this chapter is also closely related to Chapter 4, Exploratory Analysis, and many visualization techniques here are indeed "exploratory" in nature. Thus, the current chapter and the next complement each other.
It has been observed that in many preliminary courses/texts, a lot of emphasis is on the types of plots, say the histogram, boxplot, and so on, which are more suitable for data arising from continuous variables. Thus, we need to make a distinction between the plots for discrete and continuous variable data, and towards this we first begin with techniques which give more insight into the visualization of discrete variable data.
In R there are four main frameworks for producing graphics: base graphics, grid, lattice, and ggplot2. In the current chapter, the first three are used mostly and there is a brief peek at ggplot2 at the end.
This chapter will mainly cover the details of effective data visualization:
Visualization of categorical data using a bar chart, dot chart, spine and mosaic plots, and the pie chart and its variants
Visualization of continuous data using a boxplot, histogram, scatter plot and its variants, and the Pareto chart
A very brief introduction to the rich school of ggplot2
Visualization techniques for categorical data
In Chapter 1, Data Characteristics, we came across many variables whose outcomes are categorical in nature. Gender, Car_Model, Minor_Problems, Major_Problems, and Satisfaction_Rating are examples of categorical data. In a software product development cycle, various issues or bugs are raised at different severity levels such as Minor and Show Stopper. Visualization methods for categorical data require special attention and techniques, and the goal of this section is to aid the reader with some useful graphical tools.
In this section, we will mainly focus on the dataset related to bugs, which are of primary concern for any software engineer. The source of the datasets is http://bug.inf.usi.ch/ and the reader is advised to check the website before proceeding further in this section. We will begin with the software system Eclipse JDT Core, and the details for this system may be found at http://www.eclipse.org/jdt/core/index.php. The files for download are available at http://bug.inf.usi.ch/download.php.
Bar charts
It is very likely that you are familiar with bar charts, though you may not be aware of categorical variables. Typically, in a bar chart one draws the bars proportional to the frequency of the category. An illustration will begin with the dataset Severity_Counts related to the Eclipse JDT Core software system. The reader may also explore the built-in examples in R.
Going through the built-in examples of R
The bar charts may be obtained using two options. The function barplot, from the graphics library, is one way of obtaining the bar charts. The built-in examples for this plot function may be reviewed with example(barplot). The second option is to load the package lattice and then use the example(barchart) function. The sixth plot, after you click for the sixth time on the prompt, is actually an example of the barchart function.
The main purpose of this example is to help the reader get a feel for the bar charts that may be obtained using R. It happens that often we have a specific variant of a plot in our mind and find it difficult to recollect it. Hence, it is a suggestion to explore the variety of bar charts you can produce using R. Of course, there are a lot more possibilities than the mere samples given by example().
Example 3.1.2. Bar charts for the Bug Metrics dataset: The software system Eclipse JDT Core has 997 different class environments related to the development. The bug identified on each occasion is classified by its severity as Bugs, NonTrivial, Major, Critical, and High. We need to plot the frequency of the severity level, and also require the frequencies to be highlighted by Before and After release of the software to be neatly reflected in the graph. The required data is available in the RSADBE package in the Severity_Counts object.
Example 3.1.3. Bar charts for the Bug Metrics of the five software: In the previous example, we had considered the frequencies only for the JDT software. Now, it will be a tedious exercise if we need to have five different bar plots for the different software. The frequency table for the five software is given in the Bug_Metrics_Software dataset of the RSADBE package.
Software   BA_Ind   Bugs     NonTrivial Bugs   Major Bugs   Critical Bugs   High Priority Bugs
JDT        Before   11,605   10,119            1,135        432             459
           After    374      17                35           10              3
PDE        Before   5,803    4,191             362          100             96
           After    341      14                57           6               0
Equinox    Before   325      1,393             156          71              14
           After    244      3                 4            1               0
Lucene     Before   1,714    1,714             0            0               0
           After    97       0                 0            0               0
Mylyn      Before   14,577   6,806             592          235             8,804
           After    340      187               18           3               36
It would be nice if we could simply display the frequency table across two graphs only. This is achieved using the option beside in the barplot function. The data from the preceding table is copied from an XLS/CSV file, and then we execute the first line of the following R program in the Time for action – bar charts in R section.
Let us begin the action and visualize the bar charts.
Time for action – bar charts in R
Different forms of bar charts will be displayed with the datasets. The type of bar chart also depends on the problem (and data) on hand.
1. Enter example(barplot) in the console and hit the Return key.
2. A new window pops up with the heading Click or hit Enter for next page. Click (and pause between the clicks) your way until it stops changing.
3. Load the lattice package with library(lattice).
4. Try example(barchart) in the console. The sixth plot is an example of the bar chart.
5. Load the dataset on severity counts for the JDT software from the RSADBE package with data(Severity_Counts). Also, check for this data.
A view of this object is given in the screenshot in step 7. We have five severities of bugs: general bugs (Bugs), non-trivial bugs (NT.Bugs), major bugs (Major.Bugs), critical bugs (Critical), and high priority bugs (H.Priority). For the JDT software, these bugs are counted before and after release, and these are marked in the object with the suffixes BR and AR. We need to understand this count data and as a first step, we use the bar plots for the purpose.
6. To obtain the bar chart for the severity-wise comparison before and after release of the JDT software, run the following R code:
barchart(Severity_Counts,xlab="Bug Count",xlim=c(0,12000),
col=rep(c(2,3),5))
The barchart function is available from the lattice package. The range for the count is specified with xlim=c(0,12000). Here, the argument col=rep(c(2,3),5) is used to tell R that we need two colors for BR and AR and that this should be repeated five times for the five severity levels of the bugs.
Figure 1: Bar graph for the Bug Metrics dataset
7. An alternative method is to use the barplot function from the graphics package:
barplot(Severity_Counts,xlab="Bug Count",xlim=c(0,12000),
horiz=TRUE,col=rep(c(2,3),5))
Here, we use the argument horiz = TRUE to get a horizontal display of the bar plot. A word of caution here: the argument horizontal = TRUE in barchart of the lattice package works very differently.
We will now focus on Bug_Metrics_Software, which has the bug count data for all the five software: JDT, PDE, Equinox, Lucene, and Mylyn.
Figure 2: View of Severity_Counts and Bug_Metrics_Software
8. Load the dataset related to all the five software with data(Bug_Metrics_Software).
9. To obtain the bar plots for before and after release of the software in the same window, run par(mfrow=c(1,2)).
What is the par function? It is a function frequently used to set the parameters of a graph. Let us consider a simple example. Recollect that when you tried the code example(dotchart), R would ask you to Click or hit Enter for next page, and after the click or Enter action, the next graph will be displayed. However, this prompt did not turn up when you ran barplot(Severity_Counts,xlab="Bug Count",xlim=c(0,12000), horiz=TRUE,col=rep(c(2,3),5)). Now, let us try using par, which will ask us to first click or hit Enter so that we get the bar plot. First run par(ask=TRUE), and then follow it with the bar plot code. You will now be asked to either click or hit Enter. Find more details of the par function with ?par. Let us now get into the mfrow argument. The default plot option displays the output on one device, and on the next plot the former will be replaced with the new one. We require the bar plots of before and after release counts to be displayed in the same window. The option mfrow = c(1,2) ensures that both the bar plots are displayed in the same window with one row and two columns.
10. To obtain the bar plot of bug frequencies before release, where each of the software bug frequencies are placed side by side for each type of bug severity, run the following:
barplot(Bug_Metrics_Software[,,1],beside=TRUE,col = c("lightblue",
"mistyrose", "lightcyan", "lavender", "cornsilk"),legend = c("JDT",
"PDE","Equinox","Lucene", "Mylyn"))
title(main = "Before Release Bug Frequency", font.main = 4)
Here, the code Bug_Metrics_Software[,,1] ensures that only the before release counts are considered. The option beside = TRUE ensures that the columns are displayed as juxtaposed bars; otherwise, the frequencies will be distributed in a single bar with areas proportional to the frequency of each software. The option col = c("lightblue", …) assigns the respective colors for the software. Finally, the title command is used to designate an appropriate title for the bar plot.
11. Similarly, to obtain the bar plot for the after release bug frequency, run the following:
barplot(Bug_Metrics_Software[,,2],beside=TRUE,col = c("lightblue",
"mistyrose", "lightcyan", "lavender", "cornsilk"),legend = c("JDT",
"PDE","Equinox","Lucene", "Mylyn"))
title(main = "After Release Bug Frequency",font.main = 4)
The reader can extend the code interpretation for the before release to the after release bug frequencies.
Figure 3: Bar plots for the five software
First notice that the scale on the y-axis for before and after release bug frequencies is drastically different. In fact, before release bug frequencies are in thousands while after release they are in hundreds. This clearly shows that the engineers have put in a lot of effort to ensure that the released products have minimum bugs. However, the comparison of bug counts is not fair since the frequency scales of the bar plots in the preceding screenshot are entirely different. Though we don't expect the results to be different in any case, it is still appropriate that the frequency scales remain the same for both before and after release bar plots. A common suggestion is to plot the diagrams with the same range on the y-axes (or x-axes), or take an appropriate transformation such as a logarithm. In our problem, neither of them will work, and we resort to another variant of the bar chart from the lattice package.
Now, we will use the formula structure for the barchart function and bring the BR and AR counts onto the same graph.
12. Run the following code in the R console:
barchart(Software~Freq|Bugs,groups=BA_Ind,
data=as.data.frame(Bug_Metrics_Software),col=c(2,3))
The formula Software~Freq|Bugs requires that we obtain the bar chart for the software count Freq according to the severity of Bugs. We further specify that each of the bar charts be further grouped according to BA_Ind. This will result in the following screenshot:
Figure 4: Bar chart for Before and After release bug counts on the same scale
To find the colors available in R, run colors() in the console and you will find the names of 657 colors.
What just happened?
barplot and barchart were the two functions we used to obtain the bar charts. For common recurring factors, AR and BR here, the colors can be correctly specified through the rep function. The argument beside=TRUE helped us to keep the bars for the various software together for the different bug types. We also saw how to use the formula structure of the lattice package. We saw the diversity of bar charts and learned how to create effective bar charts depending on the purpose of the day.
Have a go hero
Explore the option stack=TRUE in barchart(Software~Freq|Bugs,groups=BA_Ind,…). Also, observe that Freq for the bars in the preceding screenshot begins a little earlier than 0. Reobtain the plots by specifying the range for Freq with xlim=c(0,15000).
Dot charts
Cleveland (1993) proposed an alternative to the bar chart where dots are used to represent the frequency associated with the categorical variables. The dot charts are useful for small to moderate size datasets. The dot charts are also an alternative to the pie chart; refer to The default examples section. The dot charts may be varied to accommodate continuous variable datasets too. The dot charts are known to obey Tukey's principle of achieving as high an information-to-ink ratio as possible.
Example 3.1.4. (Continuation of Example 3.1.2): In the screenshot in step 6 in the Time for action – bar charts in R section, we saw that the bar charts for the frequencies of bugs for after release are almost non-existent. This is overcome using the dot chart; see the following action list on the dot chart.
Time for action – dot charts in R
The dotchart function from the graphics package and dotplot from the lattice package will be used to obtain the dot charts.
1. To view the default examples of dot charts, enter example(dotplot); example(dotchart); in the console and hit the Return key.
2. To obtain the dot chart of the before and after release bug frequencies, run the following code:
dotchart(Severity_Counts,col=15:16,lcolor="black",pch=2:3,
labels=names(Severity_Counts),main="Dot Plot for the Before and After
Release Bug Frequency",cex=1.5)
Here, the option col=15:16 is used to specify the choice of colors; lcolor is used for the color of the lines on the dot chart, which gives a good assessment of the relative positions of the frequencies for the human eye. The option pch=2:3 picks two different plotting symbols for indicating the positions of the after and before frequencies. The options labels and main are trivial to understand, whereas cex magnifies the size of all labels by 1.5 times. On execution of the preceding R code, we get a graph as displayed in the following screenshot:
Figure 5: Dot chart for the Bug Metrics dataset
3. The dot plot can be easily extended for all the five software as we did with the bar charts.
> par(mfrow=c(1,2))
> dotchart(Bug_Metrics_Software[,,1],gcolor=1:5,col=6:10,
lcolor="black",pch=15:19,labels=names(Bug_Metrics_Software[,,1]),
main="Before Release Bug Frequency",xlab="Frequency Count")
> dotchart(Bug_Metrics_Software[,,2],gcolor=1:5,col=6:10,
lcolor="black",pch=15:19,labels=names(Bug_Metrics_Software[,,2]),
main="After Release Bug Frequency",xlab="Frequency Count")
Figure 6: Dot charts for the five software bug frequency
For a matrix input in dotchart, the gcolor option assigns one color to each column group. Note that though the class of Bug_Metrics_Software is both xtabs and table, the class of Bug_Metrics_Software[,,1] is a matrix, and hence we create a dot chart of it. This means that the R code dotchart(Bug_Metrics_Software) leads to errors! The dot chart is able to display the bug frequency in a better way as compared to the bar chart.
What just happened?
Two different ways of obtaining the dot plot were seen, and a host of other options were also clearly indicated in the current section.
Spine and mosaic plots
In the bar plot, the length (height) of the bar varies, while the width for each bar is kept the same. In a spine/mosaic plot the height is kept constant for the categories and the width varies in accordance with the frequency. The advantages of a spine/mosaic plot become apparent when we have frequencies tabulated for several variables via a contingency table. The spine plot is a particular case of the mosaic plot. We first consider an example for understanding the spine plot.
Example 3.1.5. Visualizing Shift and Operator Data (Page 487, Ryan, 2007): In a manufacturing factory, operators are rotated across shifts and it is a concern to find out whether the time of shift affects the operator's performance. In this experiment, there are three operators who in a given month work in a particular shift. Over a period of three months, data is collected for the number of nonconforming parts an operator obtains during a given shift. The data is obtained from page 487 of Ryan (2007) and is reproduced in the following table:
          Operator 1   Operator 2   Operator 3
Shift 1   40           35           28
Shift 2   26           40           22
Shift 3   52           46           49
We will obtain a spine plot towards an understanding of the spread of the number of non-conforming units an operator produces during the shifts in the forthcoming action time. Let us ask the following questions:
Does the total number of non-conforming parts depend on the operators?
Does it depend on the shift?
Can we visualize the answers to the preceding questions?
Time for action – the spine plot for the shift and operator data
Spine plots are drawn using the spineplot function.
1. Explore the default examples for the spine plot with example(spineplot).
2. Enter the data for the shift and operator example with:
ShiftOperator <- matrix(c(40, 35, 28, 26, 40, 22, 52, 46, 49),
nrow=3,dimnames=list(c("Shift 1", "Shift 2", "Shift 3"),
c("Operator 1", "Operator 2", "Operator 3")),byrow=TRUE)
3. Find the number of non-conforming parts of the operators with the colSums function:
> colSums(ShiftOperator)
Operator 1 Operator 2 Operator 3
       118        121         99
The non-conforming parts for operators 1 and 2 are close enough, and the count is lower by about 20 percent for the third operator.
4. Find the number of non-conforming parts according to the shifts using the rowSums function:
> rowSums(ShiftOperator)
Shift 1 Shift 2 Shift 3
    103      88     147
Shift 3 appears to have about 50 percent more non-conforming parts in comparison with shifts 1 and 2. Let us look out for the spine plot.
5. Obtain the spine plot for the ShiftOperator data with spineplot(ShiftOperator).
Now, we will attempt to make the spine plot a bit more interpretable. In the absence of any external influence, we would expect the shifts and operators to have a near equal number of non-conforming objects.
6. Thus, on the overall x and y axes, we plot lines at approximately the one-thirds and check if we get approximately equal regions/squares.
abline(h=0.33,lwd=3,col="red")
abline(h=0.67,lwd=3,col="red")
abline(v=0.33,lwd=3,col="green")
abline(v=0.67,lwd=3,col="green")
The output in the graphics device window will be the following screenshot:
Figure 7: Spine plot for the Shift Operator problem
It appears from the partition induced by the red lines that all the operators have a nearly equal number of non-conforming parts. However, the spine chart shows that most of the non-conforming parts occur during Shift 3.
What just happened?
Data summaries were used to understand the behavior of the problem, and the spine plot helped in the clear identification of Shift 3 as a major source of the non-conforming units manufactured. The use of the abline function was particularly insightful for this dataset and needs to be explored whenever there is scope for it.
The spine plot is a special case of the mosaic plot. Friendly (2001) has pioneered the concept of mosaic plots, and Chapter 4, Exploratory Analysis, is an excellent reference for the same. For a simple understanding of the construction of the mosaic plot, you can go through slides 7-12 at http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture17.pdf. As explained there, suppose that there are three categorical variables, each with two levels. Then, the mosaic plot begins with a square and divides it into two parts, with each part having an area proportional to the frequency of the two levels of the first categorical variable. Next, each of the preceding two parts is divided into two parts each according to the predefined frequency of the two levels of the second categorical variable. Note that we now have four divisions of the total area. Finally, each of the four areas is further divided into two more parts, each with an area reflecting the predefined frequency of the two levels of the third categorical variable.
Example 3.1.6. The Titanic dataset: In the The table object section in Chapter 2, Import/Export Data, we came across the Titanic dataset. The dataset was seen in two different forms and we also constructed the data from scratch. Let us now continue the example. The main problems in this dataset are the following:
The distribution of the passengers by Class, and then the spread of Survived across Class.
The distribution of the passengers by Sex and its distribution across the survivors.
The distribution by Age followed by the survivors among them. We now want to visualize the distribution of Survived first by Class, then by Sex, and finally by the Age group.
Let us see the detailed action.
Time for action – the mosaic plot for the Titanic dataset
The goal here is to understand the survival percentages on the Titanic ship with respect to Class of the crew, Sex, and Age. We first use xtabs and prop.table to gain the insight for each of these variables, and then visualize the overall picture using mosaicplot.
1. Get the frequencies of Class for the Titanic dataset with xtabs(Freq~Class,data=Titanic).
2. Obtain the Survived proportions across Class with prop.table(xtabs(Freq~Class+Survived,data=Titanic),margin=1).
3. Repeat the preceding two steps for Sex: xtabs(Freq~Sex,data=Titanic) and prop.table(xtabs(Freq~Sex+Survived,data=Titanic),margin=1).
4. Repeat this exercise for Age: xtabs(Freq~Age,data=Titanic) and prop.table(xtabs(Freq~Age+Survived,data=Titanic),margin=1).
5. Obtain the mosaic plot for the dataset with mosaicplot(Titanic,col=c("red","green")).
The entire output is given in the following screenshot:
Figure 8: Mosaic plot for the Titanic dataset
The preceding output shows that the people traveling in the higher classes survived better than the lower class ones. The analysis also shows that females were given more priority than males when the rescue system was in action. Finally, it may be seen that children were given priority over adults.
The mosaic plot division process proceeds as follows. First, it divides the region into four parts with the regions proportional to the frequencies of each Class; that is, the widths of the regions are proportional to the Class frequencies. Each of the four regions is further divided using the predefined frequencies of the Sex categories. Now, we have eight regions. Next, each of these regions is divided using the predefined frequencies of the Age group, leading to 16 distinct regions. Finally, each of the regions is divided into two parts according to the Survived status. The Yes regions of Child for the first two classes are larger than the No regions. The third Class has more non-survivors than survivors, and this appears to be true across Age and Gender. Note that there are no children among the Crew class.
The rest of the regions' interpretation is left to the reader.
What just happened?
A clear demystification of the working of the mosaic plot has been provided. We applied it to the Titanic dataset and saw how it obtains clear regions which enable a deep dive into a categorical problem.
Pie charts and the fourfold plot
Pie charts are hugely popular among many business analysts. One reason for their popularity is of course their simplicity. That the pie chart is easy to interpret is actually not a fact. In fact, the pie chart is seriously discouraged for analysis and observations; refer to the caution of Cleveland and McGill, and also Sarkar (2008), page 57. However, we will still continue with an illustration of it.
Example 3.1.7. Pie chart for the Bugs Severity problem: Let us obtain the pie chart for the bug severity levels.
> pie(Severity_Counts[1:5])
> title("Severity Counts Post-Release of JDT Software")
> pie(Severity_Counts[6:10])
> title("Severity Counts Pre-Release of JDT Software")
Can you find the drawback of the pie chart?
Figure 9: Pie chart for the Before and After Bug counts (output edited)
The main drawback of the pie chart stems from the fact that humans have a problem in deciphering relative areas. A common recommendation is the use of a bar chart or a dot chart instead of the pie chart, as the problem of judging relative areas does not exist when comparing linear measures.
The fourfold plot is a novel way of visualizing a 2 × 2 × k contingency table. In this method, we obtain k plots, one for each 2 × 2 frequency table. Here, the cell frequency of each of the four cells is represented by a quarter circle whose radius is proportional to the square root of the frequency. In contrast to the pie chart, where the radius is constant and the area varies with the angle, the radius in a fourfold plot is varied to represent the cell.
Example 3.1.8. The fourfold plot for the UCBAdmissions dataset: An in-built R function which generates the required plot is fourfoldplot. The R code and its resultant screenshots are displayed as follows:
> fourfoldplot(UCBAdmissions,mfrow=c(2,3),space=0.4)
Figure 10: The fourfold plot of the UCBAdmissions dataset
In this section, we focused on graphical techniques for categorical data. In many books, the graphical methods begin with tools which are more appropriate for data arising from continuous variables. Such tools have many shortcomings if applied to categorical data. Thus, we have taken a different approach where the categorical data gets the right tools, which it truly deserves. In the next section, we deal with tools which are seemingly more appropriate for data related to continuous variables.
Visualization techniques for continuous variable data
Continuous variables have a different structure and hence, we need specialized methods for displaying them. Fortunately, many popular graphical techniques are suited very well for continuous variables. As continuous variables can arise from different phenomena, we consider many techniques in this section. The graphical methods discussed in this section may also be considered as a part of the next chapter on exploratory analysis.
Boxplot
The boxplot is based on five points: the minimum, lower quartile, median, upper quartile, and maximum. The median forms the thick line near the middle of the box, and the lower and upper quartiles complete the box. The lower and upper quartiles, along with the median, which is the second quartile, divide the data into four regions, each containing an equal number of observations. The median is the middle-most value among the data sorted in increasing (decreasing) order of magnitude. On similar lines, the lower quartile may be interpreted as the median of the observations between the minimum and the median data values. These concepts are dealt with in more detail in Chapter 4, Exploratory Analysis. The boxplot is generally used for two purposes: understanding the data spread and identifying the outliers. For the first purpose, we set the range value at zero, which will extend the whiskers to the extremes at the minimum and maximum, and give the overall distribution of the data.
If the purpose of the boxplot is to identify outliers, we extend the whiskers in a way which accommodates tolerance limits and enables us to capture the outliers. Thus, the whiskers are extended up to 1.5 times the interquartile range (IQR), the difference between the third and first quartiles, beyond the quartiles. The default setting of boxplot is the identification of the outliers. If any point is found beyond the whiskers, such observations may be marked as outliers. The boxplot is also sometimes called a box-and-whisker plot, and it is the whiskers which are obtained by drawing lines from the box ends to the minimum and maximum points. We will begin with an example of the boxplot.
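Before the examples, here is a tiny sketch on a made-up sample (the numbers are purely illustrative) showing how the five points and the default outlier rule can be inspected numerically with the base R helpers fivenum and boxplot.stats:

x <- c(2, 4, 5, 7, 8, 9, 11, 30)   # a small hypothetical sample
fivenum(x)                          # minimum, lower hinge, median, upper hinge, maximum
boxplot.stats(x)$out                # points flagged beyond 1.5 * IQR (here, 30)
boxplot(x)                          # default whiskers: outlier shown as a point
boxplot(x, range = 0)               # whiskers stretched to the extremes instead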
Example 3.2.1. Example (boxplot): For a quick tutorial on the various options of the boxplot function, the user may run the following code at the R console. Also, the reader is advised to explore the bwplot function from the lattice package. Try example(boxplot) and example(bwplot) from the respective graphics and lattice packages.
Example 3.2.2. Boxplot for the resistivity data: Gunst (2002) has 16 independent observations from eight pairs on the resistivity of a wire. There are two processes under which these observations are equally distributed. We would like to see if the resistivity of the wires depends on the processes, and which of the processes leads to a higher resistivity. A numerical comparison based on the summary function will first be carried out, and then we will visualize the two processes through boxplot to conclude whether the effects are the same, and if not, which process leads to higher resistivity.
Example 3.2.3. The Michelson-Morley experiment: This is a famous physics experiment from the late nineteenth century, which helped in proving the non-existence of ether. If the ether existed, one expects a shift of about 4 percent in the speed of light. The speed of light is measured 20 times in five different experiments. We will use this dataset for two purposes: is the drift of 4 percent evidenced in the data, and setting the whiskers at the extremes. The first one is a statistical issue and the latter is a software setting.
For the preceding three examples, we will now read the required data into R, then look at the necessary summary functions, and finally visualize them using the boxplots.
Time for action – using the boxplot
Boxplots will be obtained here using the function boxplot from the graphics package as well as bwplot from the lattice package.
1. Check the variety of boxplots with example(boxplot) from the graphics package and example(bwplot) for the variants in the lattice package.
2. The resistivity data from the RSADBE package contains two processes' information which we need to compare. Load it into the current session with data(resistivity).
3. Obtain the summary of the two processes with the following:
> summary(resistivity)
Process.1 Process.2
Min. 0.138 0.142
1st Qu. 0.140 0.144
Median 0.142 0.146
Mean 0.142 0.146
3rd Qu. 0.143 0.148
Max. 0.146 0.150
Clearly, Process 2 has approximately 0.004 higher resistivity as compared to Process 1 across all the essential summaries. Let us check if the boxplot captures the same.
4. Obtain the boxplot for the two processes with boxplot(resistivity, range=0).
The argument range=0 ensures that the whiskers are extended to the minimum and maximum data values. The boxplot diagram (left-hand side of the next screenshot) clearly shows that Process.2 has higher resistivity in comparison with Process.1. Next, we will consider the bwplot function from the lattice package. A slightly different rearrangement of the resistivity data frame will be required, in that we will specify all the resistivity values in a single column and their corresponding processes in another column.
An important option for boxplots is that of notch, which is especially useful for the comparison of medians. The top and bottom notches for a set of observations are defined at the points median ± 1.57(IQR)/√n. If the notches of two boxplots do not overlap, it can be concluded that the medians of the groups are significantly different. Such an option can be specified in both the boxplot and bwplot functions.
5. Convert resistivity to another useful form which will help the application of the bwplot function with resistivity2 <- data.frame(rep(names(resistivity),each=8), c(resistivity[,1],resistivity[,2])). Assign variable names to the new data frame with names(resistivity2) <- c("Process","Resistivity").
6. Run the bwplot function on resistivity2 with bwplot(Resistivity~Process, data=resistivity2, notch=TRUE).
Figure 11: Boxplots for resistivity data with boxplot, bwplot, and notches
The notches are overlapping and hence, we can't conclude from the boxplot that the resistivity medians are very different from each other.
With the data on the speed of light from the morley dataset, we have the important goal of identifying outliers. Towards this purpose the whiskers are extended up to 1.5 times the interquartile range (IQR) beyond the quartiles.
7. Create a boxplot with whiskers which enable identification of the outliers beyond 1.5 IQR with the following:
boxplot(Speed~Expt,data=morley,main = "Whiskers at Lower- and
Upper- Confidence Limits")
8. Add the line which helps to identify the presence of ether with abline(h=792.458,lty=3).
The resulting screenshot is as follows:
Figure 12: Boxplot for the Morley dataset
It may be seen from the preceding screenshot that experiment 1 has one outlier, while experiment 3 has three outlier values. Since the line is well below the median of the experiment values (speed, actually), we conclude that there is no experimental evidence for the existence of ether.
What just happened?
Varieties of boxplots have been elaborated in this section. The boxplot has also been put into action to identify the presence of outliers in a given dataset.
Histograms
The histogram is one of the earliest graphical techniques and undoubtedly one of the most versatile and adaptive graphs, whose relevance is as legitimate as it ever was. The invention of the histogram is credited to the great statistician Karl Pearson. Its strength is also in its simplicity. In this technique, a variable is divided over intervals and the height of an interval is determined by the frequency of the observations falling in that interval. In the case of an unbalanced division of the range of the variable values, histograms are especially very informative in revealing the shape of the probability distribution of the variable. We will see more about these points in the following examples.
The construction of a histogram is explained with the famous dataset of Galton, where an attempt has been made to understand the relationship between the heights of a child and parent. In this dataset, there are 928 pairs of observations of the height of the child and parent. Let us have a brief peek at the dataset:
> data(galton)
> names(galton)
[1] "child" "parent"
> dim(galton)
[1] 928 2
> head(galton)
child parent
1 61.7 70.5
2 61.7 68.5
3 61.7 65.5
4 61.7 64.5
5 61.7 64.0
6 62.2 67.5
> sapply(galton,range)
child parent
[1,] 61.7 64
[2,] 73.7 73
> summary(galton)
child parent
Min. :61.7 Min. :64.0
1st Qu.:66.2 1st Qu.:67.5
Median :68.2 Median :68.5
Mean :68.1 Mean :68.3
3rd Qu.:70.2 3rd Qu.:69.5
Max. :73.7 Max. :73.0
We need to cover all the 928 observations in the intervals, also known as bins, which need to cover the range of the variable. The natural question is how does one decide on the number of intervals and the width of these intervals? If the bin width, denoted by h, is known, the number of bins, denoted by k, can be determined by:
$$k = \left\lceil \frac{\max_i x_i - \min_i x_i}{h} \right\rceil$$
Here, $\lceil \cdot \rceil$ denotes the ceiling of the number. Similarly, if the number of bins k is known, the width is determined by $h = (\max_i x_i - \min_i x_i)/k$.
There are many guidelines for these problems. The three options available for the hist function in R are the formulas given by Sturges, Scott, and Freedman-Diaconis, the details of which may be obtained by running ?nclass.Sturges, ?nclass.FD, or ?nclass.scott in the R console. The default setting runs the Sturges option. The Sturges formula for the number of bins is given by:
$$k = \lceil \log_2 n \rceil + 1$$
This formula works well when the underlying distribution is approximately a normal distribution. Scott's normal reference rule for the bin width, using the sample standard deviation $\hat\sigma$, is:
$$h = \frac{3.5\,\hat\sigma}{n^{1/3}}$$
Finally, the Freedman-Diaconis rule for the bin width is given by:
$$h = \frac{2\,\mathrm{IQR}}{n^{1/3}}$$
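The corresponding bin-count helpers are available in base R (grDevices); a small sketch, assuming the galton data frame loaded in the earlier peek is still available:

x <- galton$child            # heights of the children
nclass.Sturges(x)            # Sturges' suggested number of bins
nclass.scott(x)              # Scott's suggestion
nclass.FD(x)                 # Freedman-Diaconis suggestion
diff(range(x))/nclass.FD(x)  # the implied bin width for the FD rule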
In the following, we will construct a few histograms, describing the problems through their examples and their R setup in the Time for action – understanding the effectiveness of histograms section.
Example 3.2.4. The default examples: To get a first preview on the generation of histograms, we suggest the reader go through the built-in examples; try example(hist) and example(histogram).
Example 3.2.5. The Galton dataset: We will obtain histograms for the height of child and parent from the Galton dataset. We will use the Freedman-Diaconis and Sturges choices of bin widths.
Example 3.2.6. Octane rating of gasoline blends: An experiment is conducted where the octane rating of gasoline blends can be obtained using two methods. Two samples are available for testing each type of blend, and Snee (1981) obtains 32 different blends over an appropriate spectrum of the target octane ratings. We obtain histograms for the ratings under the two different methods.
Example 3.2.7. Histogram with a dummy dataset: A dummy dataset has been created by the author. Here, we need to obtain histograms for the two samples in the Samplez data from the RSADBE package.
Time for action – understanding the effectiveness of histograms
Histograms are obtained using the hist and histogram functions. The choice of bin widths is also discussed.
1. Get a buy-in of the R capability of histograms through example(hist) and example(histogram) for the respective histogram functions from the graphics and lattice packages.
2. Invoke the graphics editor with par(mfrow=c(2,2)).
3. Create the histograms for the height of Child and Parent from the galton dataset seen in the earlier part of the section for the Freedman-Diaconis and Sturges choices of bin widths:
par(mfrow=c(2,2))
hist(galton$parent,breaks="FD",xlab="Height of Parent",
main="Histogram for Parent Height with Freedman-Diaconis Breaks",
xlim=c(60,75))
hist(galton$parent,xlab="Height of Parent",
main="Histogram for Parent Height with Sturges Breaks",xlim=c(60,75))
hist(galton$child,breaks="FD",xlab="Height of Child",
main="Histogram for Child Height with Freedman-Diaconis Breaks",
xlim=c(60,75))
hist(galton$child,xlab="Height of Child",
main="Histogram for Child Height with Sturges Breaks",xlim=c(60,75))
Consequently, we get the following screenshot:
Figure 13: Histograms for the Galton dataset
Note that a few people may not like the histograms for the height of parent for the Freedman-Diaconis choice of bin width.
4. For the experiment mentioned in Example 3.2.6, Octane rating of gasoline blends, first load the data into R with data(octane).
5. Invoke the graphics editor for the ratings under the two methods with par(mfrow=c(2,2)).
6. Create the histograms for the ratings under the two methods for the Sturges choice of bin widths with:
hist(octane$Method_1,xlab="Ratings Under Method I",
main="Histogram of Octane Ratings for Method I",col="mistyrose")
hist(octane$Method_2,xlab="Ratings Under Method II",
main="Histogram of Octane Ratings for Method II",col="cornsilk")
The resulting histogram plot will be the first row of the next screenshot. A visual inspection suggests that under Method_1, the mean rating is around 90 while under Method_2 it is approximately 95. Moreover, the Method_2 ratings look more symmetric than the Method_1 ratings.
7. Load the required data here with data(Samplez).
8. Create the histograms for the two samples in the Samplez data frame with:
hist(Samplez$Sample_1,xlab="Sample 1",main="Histogram: Sample 1",col="magenta")
hist(Samplez$Sample_2,xlab="Sample 2",main="Histogram: Sample 2",col="snow")
We obtain the following histogram plot:
Figure 14: Histogram for the Octane and Samples dummy dataset
The lack of symmetry is very apparent in the second row display of the preceding screenshot. It is very clear from the preceding screenshot that the left histogram exhibits an example of a positively skewed distribution for Sample_1, while the right histogram for Sample_2 shows that the distribution is negatively skewed.
What just happened?
Histograms have traditionally provided a lot of insight into the understanding of the distribution of variables. In this section, we dived deep into the intricacies of their construction, especially related to the options of bin widths. We also saw how the different natures of variables are clearly brought out by their histograms.
Scatter plots
In the previous subsection, we used histograms for understanding the nature of the variables. For multiple variables, we need multiple histograms. However, we need different tools for understanding the relationship between two or more variables. A simple, yet effective technique is the scatter plot. When we have two variables, the scatter plot simply draws the two variables across the two axes. The scatter plot is powerful in reflecting the relationship between the variables in that it reveals whether there is a linear/nonlinear relationship. If the relationship is linear, we may get an insight into whether there is a positive or negative relationship among the variables, and so forth.
Example 3.2.8. The drain current versus the ground-to-source voltage: Montgomery and Runger (2003) report an article from IEEE (Exercise 11.64) about an experiment where the drain current is measured against the ground-to-source voltage. In the scatter plot, the drain current values are plotted against each level of the ground-to-source voltage. The former value is measured in milliamperes and the latter in volts. The R function plot is used for understanding the relationship. We will soon visualize the relationship between the current values and the levels of the ground-to-source voltage. This data is available as DCD in the RSADBE package.
The scatter plot is very flexible when we need to understand the relationship between more than two variables. In the next example, we will extend the scatter plot to multiple variables.
Example 3.2.9. The Gasoline mileage performance data: The mileage of a car depends on various factors; in fact, it is a very complex problem. In the next table, the variables x1 to x11, which are believed to have an influence on the mileage of the car, are described. We need a plot which explains the inter-relationships between the variables and the mileage. The exercise of repeating the plot function may be done 11 times, though most people may struggle to recollect the influence of the first plot when they are looking at the sixth or maybe the seventh plot. The pairs function returns a matrix of scatter plots, which is really useful.
Let us visualize the matrix of scatter plots:
> data(Gasoline)
> pairs(Gasoline) # Output suppressed
It may be seen that this matrix of scatter plots is a symmetric plot in the sense that the upper and lower triangles of the matrix are simply copies of each other (transposed copies, actually). We can be more effective in representing the data in the matrix of scatter plots by specifying additional parameters. Even as we study the inter-relationships, it is important to understand the distribution of each variable itself. Since the diagonal elements merely indicate the names of the variables, we can instead replace them with their histograms. Further, if we display a measure of the relationship between two variables, say the correlation coefficient, we can be more effective. In fact, we do a step better by displaying the correlation coefficient in a font size that increases with its strength. We first define the necessary functions and then use the pairs function.
Variable Notation   Variable Description             Variable Notation   Variable Description
Y                   Miles/gallon                     x6                  Carburetor (barrels)
x1                  Displacement (cubic inches)      x7                  No. of transmission speeds
x2                  Horsepower (foot-pounds)         x8                  Overall length (inches)
x3                  Torque (foot-pounds)             x9                  Width (inches)
x4                  Compression ratio                x10                 Weight (pounds)
x5                  Rear axle ratio                  x11                 Type of transmission (A-automatic, M-manual)
Time for action – plot and pairs R functions
The scatter plot and its important multivariate extension with pairs will be considered in detail now.
1. Load the data with data(DCD).
Use the options xlab and ylab to specify the right labels for the axes. We specify xlim and ylim to get a good overview of the relationship.
2. Obtain the scatter plot for Example 3.2.8. The drain current versus the ground-to-source voltage using plot(DCD$Drain_Current, DCD$GTS_Voltage, type="b", xlim=c(1,2.2), ylim=c(0.6,2.4), xlab="Current Drain", ylab="Voltage").
Figure 15: The scatter plot for DCD
We can easily see from the preceding scatter plot that as the ground-to-source voltage increases, there is a corresponding increase in the drain current. This is an indication of a positive relationship between the two variables. However, the lab assistant now comes to you and says that the measurement error of the instrument has actually led to 15 percent higher recordings of the ground-to-source voltage. Now, instead of dropping the entire diagram, we may simply prefer to add the corrected figures to the existing one. The points function helps us add the new corrected data points to the figure.
3. Now, first obtain the corrected GTS_Voltage readings with DCD$GTS_Voltage/1.15 and add them to the existing plot with points(DCD$Drain_Current, DCD$GTS_Voltage/1.15, type="b", col="green").
4. We first create two functions, panel.hist and panel.cor, defined as follows:
panel.hist <- function(x, ...)
{
  # draw a histogram of x in the current (diagonal) panel
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  # print the absolute correlation, with a font size that grows
  # with the strength of the correlation
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, use = "complete.obs"))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
The preceding two functions are taken from the code of example(pairs).
5. It is time to put these two functions into action:
pairs(Gasoline, diag.panel=panel.hist, lower.panel=panel.smooth, upper.panel=panel.cor)
Figure 16: The pairs plot for the Gasoline dataset
In the upper triangle of the display, we can see that the mileage has a strong association with the displacement, horsepower, torque, number of transmission speeds, overall length, width, weight, and type of transmission. We can say a bit more too. The first three variables x1 to x3 relate to the engine characteristics, and there is a strong association within these three variables. Similarly, there is a strong association among x8 to x10, and together they describe the vehicle dimensions. We have also done a bit more than simply obtaining the scatter plots in the lower triangle of the display: a smooth approximation of the relationship between the variables is provided there.
6. Finally, we resort to the usual trick of looking at the capabilities of the plot and pairs functions with example(plot), example(pairs), and example(xyplot).
We have seen how multiple variables can be visualized. In the next subsection, we will explore more about Pareto charts.
What just happened?
Starting with a simple scatter plot and its effectiveness, we went to great lengths in extending it to the pairs function. The pairs function has been greatly explored using the panel.hist and panel.cor functions for truly understanding the relationships within a set of multiple variables.
Pareto charts
The Pareto rule, also known as the 80-20 rule or the law of the vital few, says that approximately 80 percent of the defects are due to 20 percent of the causes. It is important because it can identify the 20 percent of vital causes whose elimination removes 80 percent of the defects. The qcc package contains the function pareto.chart, which helps in generating the Pareto chart. We will give a simple illustration of this chart.
The Pareto chart is a display of the cause frequencies along two axes. Suppose that we have 10 causes C1 to C10 which have occurred with defect counts 5, 23, 7, 41, 19, 4, 3, 4, 2, and 1. Causes 2, 4, and 5 have high frequencies (dominating?) and the other causes look a bit feeble. Now, let us sort these causes in decreasing order of frequency and obtain their cumulative frequencies. We will also obtain their cumulative percentages.
> Cause_Freq <- c(5, 23, 7, 41, 19, 4, 3, 4, 2, 1)
> names(Cause_Freq) <- paste("C",1:10,sep="")
> Cause_Freq_Dec <- sort(Cause_Freq,dec=TRUE)
> Cause_Freq_Cumsum <- cumsum(Cause_Freq_Dec)
> Cause_Freq_Cumsum_Perc <- Cause_Freq_Cumsum/sum(Cause_Freq)
> cbind(Cause_Freq_Dec,Cause_Freq_Cumsum,Cause_Freq_Cumsum_Perc)
Cause_Freq_Dec Cause_Freq_Cumsum Cause_Freq_Cumsum_Perc
C4 41 41 0.3761
C2 23 64 0.5872
C5 19 83 0.7615
C3 7 90 0.8257
C1 5 95 0.8716
C6 4 99 0.9083
C8 4 103 0.9450
C7 3 106 0.9725
C9 2 108 0.9908
C10 1 109 1.0000
This appears to be a simple trick, and yet it is very effective in revealing that causes 2, 4, and 5 contribute more than 75 percent of the defects. A Pareto chart completes the preceding table with a bar chart of the causes in decreasing order of count, with a left vertical axis for the frequencies and a cumulative percentage curve referenced to the right vertical axis. We will see the Pareto chart in action in the next example.
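Since Cause_Freq above is already a named vector of counts, the same table and the corresponding chart can be produced in one step; a minimal sketch, assuming the qcc package is installed:
> library(qcc)
> pareto.chart(Cause_Freq)  # prints the cumulative table and draws the chart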
Example 3.2.10. The Pareto chart for incomplete applications: A simple step-by-step illustration of the Pareto chart is available on the Web at http://personnel.ky.gov/nr/rdonlyres/d04b5458-97eb-4a02-bde1-99fc31490151/0/paretochart.pdf. The reader can go through the clear steps mentioned in the document.
In the example from the web document mentioned previously, a bank which issues credit cards rejects application forms if they are deemed incomplete. An application form may be incomplete if information is not provided for one or more of the details sought in the form. For example, an application can't be processed further if the customer/applicant has not provided an address, has illegible handwriting, has not signed the form, or is an existing credit card holder, among other reasons. The concern of the manager of the credit card wing is to ensure that rejections for incomplete applications decline, since a cost is incurred on issuing the form which is generally not charged for. The manager wants to focus on the reasons which lead to the rejection of most forms. Here, we consider the frequency of the different causes which lead to the rejection of the applications.
> library(qcc)
> Reject_Freq = c(9,22,15,40,8)
> names(Reject_Freq) = c("No Addr.", "Illegible", "Curr. Customer", "No Sign.", "Other")
> Reject_Freq
      No Addr.      Illegible Curr. Customer       No Sign.          Other 
             9             22             15             40              8 
> options(digits=2)
> pareto.chart(Reject_Freq)
Pareto chart analysis for Reject_Freq
                 Frequency Cum.Freq. Percentage Cum.Percent.
  No Sign.            40.0      40.0       42.6         42.6
  Illegible           22.0      62.0       23.4         66.0
  Curr. Customer      15.0      77.0       16.0         81.9
  No Addr.             9.0      86.0        9.6         91.5
  Other                8.0      94.0        8.5        100.0
Figure 17: The Pareto chart for incomplete applications
In the preceding Pareto chart, the frequency of the five reasons for rejection is represented by bars as in a bar plot, with the distinction that they are displayed in decreasing order of frequency. The frequency of the reasons is indicated on the left vertical axis. At the mid-point of each bar, the cumulative frequency up to that reason is indicated, and the reference for this count is the right vertical axis. Thus, we can see that more than 75 percent of the rejections are due to the three causes No Signature, Illegible, and Current Customer. This is the main strength of a Pareto chart.
A brief peek at ggplot2
Tufte (2001) and Wilkinson (2005) emphasize a lot on the aesthetics of graphics. There is indeed more to graphics than mere mathematics, and subtle changes/corrections in a display may lead to improved, enhanced, and easy-on-the-eye diagrams. Wilkinson emphasizes what he calls the grammar of graphics, and an R adaptation of it is given by Wickham (2009).
Thus far, we have used various functions, such as barchart, dotchart, spineplot, fourfoldplot, boxplot, plot, and so on. The grammar of graphics emphasizes that a statistical graphic is a mapping from data to the aesthetic attributes of geometric objects. The aesthetic aspects consist of color, shape, and size, while the geometric objects are composed of points, lines, and bars. A detailed discussion of these aspects is unfortunately not feasible in our book, and we will have to settle for a quick introduction to the grammar of graphics. To begin with, we will simply consider the qplot function from the ggplot2 package.
Time for action – qplot
Here, we will first use the qplot function for obtaining various kinds of plots. To keep the story short, we are using the earlier datasets only, and hence a reproduction of the similar plots using qplot won't be displayed. The reader is encouraged to check ?qplot and its examples.
1. Load the library with library(ggplot2).
2. Rearrange the resistivity dataset in a different format and obtain the boxplots:
test <- data.frame(rep(c("R1","R2"),each=8), c(resistivity[,1], resistivity[,2]))
names(test) <- c("RES","VALUE")
qplot(factor(RES), VALUE, data=test, geom="boxplot")
The qplot function needs to be explicitly told that RES is a factor variable, and according to its levels we obtain the boxplots of the resistivity values.
3. For the Gasoline dataset, we would like to obtain a boxplot of the mileage according to whether the gear system is manual or automatic. Thus, qplot can be put to action with qplot(factor(x11), y, data=Gasoline, geom="boxplot").
4. A histogram is also one of the geometric aspects of ggplot2, and we next obtain the histogram for the height of the child with qplot(child, data=galton, geom="histogram", binwidth=2, xlim=c(60,75), xlab="Height of Child", ylab="Frequency").
5. The scatter plot for the height of parent against child is fetched with qplot(parent, child, data=galton, xlab="Height of Parent", ylab="Height of Child", main="Height of Parent Vs Child").
What just happened?
The qplot function, through its geom argument, allows a good family of graphics under a single function. This is particularly advantageous as it lets us perform a host of tricks under a single umbrella.
Of course, there is the all the more important ggplot function from the ggplot2 library, which is the primary source of the flexibility of the grammar of graphics. We will close the chapter with a very brief exposition of it. The main strength of ggplot stems from the fact that you can build a plot layer by layer. We will illustrate this with a simple example.
Time for action – ggplot
ggplot, aes, and layer will be put into action to explore the power of the grammar of graphics.
1. Load the library with library(ggplot2).
2. Using the aes and ggplot functions, first create a ggplot object with galton_gg <- ggplot(galton, aes(child, parent)) and display the most recent creation in R by running galton_gg. You will get an error, and the graphics device will show a blank screen, since no layers have been added yet.
3. Create a scatter plot by adding a layer to galton_gg with galton_gg <- galton_gg + layer(geom="point"), and then run galton_gg to check for changes. Yes, you will get a scatter plot of the height of child versus parent.
4. The labels of the axes are not satisfactory and we need better ones. The strength of ggplot is that you can continue to add layers to it with varied options. In fact, you can change the xlim and ylim on an existing plot and check the difference in the plot each time. Run the following code in a step-by-step manner and appreciate the strength of the grammar:
galton_gg <- galton_gg + xlim(60,75)
galton_gg
galton_gg <- galton_gg + ylim(60,75)
galton_gg
galton_gg <- galton_gg + ylab("Height of Parent") + xlab("Height of Child")
galton_gg
galton_gg <- galton_gg + ggtitle("Height of Parent Vs Child")
galton_gg
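With recent versions of ggplot2, the bare layer(geom="point") call may no longer be accepted; the same layered build-up is then written with geom_point(). The following is a minimal equivalent sketch, not the book's code, and it assumes the galton data frame used earlier is available:
library(ggplot2)
galton_gg <- ggplot(galton, aes(child, parent)) +
  geom_point() +                                      # the point layer
  xlim(60, 75) + ylim(60, 75) +                       # same axis limits as above
  xlab("Height of Child") + ylab("Height of Parent") +
  ggtitle("Height of Parent Vs Child")
galton_gg                                             # draw the accumulated layers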
What just happened?
The layer-by-layer approach of ggplot is very useful, and we have seen an illustration of it on the scatter plot for the Galton dataset. In fact, the reach of ggplot is much richer than our simple illustration, and the interested reader may refer to Wickham (2009) for more details.
Have a go hero
If you run par(list(cex.lab=2, ask=TRUE)) followed by barplot(Severity_Counts, xlab="Bug Count", xlim=c(0,12000), horiz=TRUE, col=rep(c(2,3),5)), what do you expect R to do?
Summary
In this chapter, we have visualized different types of graphs for different types of variables. We have also explored how to gain insights into data through graphs. It is important to realize that without a clear understanding of the data structure, and without exercising enough caution, the plots generated are meaningless. The GIGO adage is very true, and no rich visualization technique can overcome this problem.
In the previous chapter, we learned important methods of importing/exporting data, and here we visualized the data in different forms. Now that we have an understanding and visual insight of the data, we need to take the next step, namely quantitative analysis of the data. There are roughly two streams of analysis: exploratory and confirmatory analysis. It is the former technique that forms the core of the next chapter. As an instance, the scatter plot reveals whether there is a positive, negative, or no association between two variables. If the association is not zero, a numeric answer about the positive or negative relationship is then required. Techniques such as these and their extensions form the core of the next chapter.
4
Exploratory Analysis
Tukey (1977) in his benchmark book Exploratory Data Analysis, popularly abbreviated as EDA, describes "best methods" as:
We do not guarantee to introduce you to the "best" tools, particularly since we are not sure that there can be unique bests.
The goal of this chapter is to emphasize EDA and its strengths.
In the previous chapter, we saw visualization techniques for data of different characteristics. Analytical insight is also important, and this chapter considers EDA techniques for obtaining it. The more popular measures include the mean, standard error, and so on. It has been shown many times that the mean has several drawbacks, one of them being that it is very sensitive to outliers/extremes. Thus, in exploratory analysis the focus is on measures which are robust to the extremes. Many techniques considered in this chapter are discussed in more detail by Velleman and Hoaglin (1981), and an eBook has been kindly made available at http://dspace.library.cornell.edu/handle/1813/62. In the first section, we will have a peek at the often used measures for exploratory analysis. The main learnings from this chapter are listed as follows:
Summary statistics based on the median and its variants, which are robust to outliers
Visualization techniques in stem-and-leaf plots, letter values, and bagplots
A first regression model in the resistant line, and refined methods in smoothing data and median polish
Essential summary statistics
We have seen the useful summary statistics of mean and variance in the Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics. The concepts therein have their own utility value. The drawback of such statistical metrics is that they are very sensitive to outliers, in the sense that a single observation may completely distort the entire story. In this section, we discuss some exploratory analysis metrics which are intuitive and more robust than metrics such as the mean and variance.
Percentiles, quantiles, and median
For a given dataset and a number 0 < k < 1, the 100k% percentile divides the dataset into two partitions with 100k percent of the values below it and 100(1-k) percent of the values above it. The fraction k is referred to as a quantile. In Statistics, quantiles are used more often than percentiles, the difference being that the quantiles vary over the unit interval (0, 1), whereas 100 times the quantiles gives us the percentiles. It is important to note that the minimum (maximum) is the 0% (100%) percentile.
The median is the fiftieth percentile, which divides the data values into two equal parts with itself being the mid-point of these parts. The lower and upper quartiles are respectively the 25% and 75% percentiles. The standard notation for the lower, mid (median), and upper quartiles respectively is Q1, Q2, and Q3. By extension, Q0 and Q4 respectively denote the minimum and maximum of a dataset.
Example 4.1.1. Rahul Dravid – The Wall: The game of cricket is hugely popular in India, and many cricketers have given a lot of goose bumps to those watching. Sachin Tendulkar, Anil Kumble, Javagal Srinath, Sourav Ganguly, Rahul Dravid, and VVS Laxman are some of the iconic names across the world. The six players mentioned here have played an especially huge role in taking India to the number one position in the Test cricket rankings, and it is widely believed that Rahul Dravid has been the main backbone of this success. A century is credited to a cricketer on scoring 100 or more runs in a single innings. Dravid has scored 36 Test centuries across the globe, and quite a handful of them were so resolute in nature that they earned him the nickname "The Wall". We will seek some percentiles/quantiles for these scores soon.
Next, we will focus on a statistic which is similar to the quantiles.
Hinges
The nomenclature of the concept of hinges basically comes from the hinges seen on a door. For a door's frame, the mid-hinge is at the middle of the height of the frame, whereas the lower and upper hinges are respectively observed midway from the mid-hinge to the bottom and top of the frame. In exploratory analysis, the hinges are defined by arranging the data in increasing order, and to start with, the median is identified as the mid-hinge.
The lower hinge for the (ordered) data is defined as the middle-most observation from the minimum to the median. The upper hinge is defined similarly for the upper part of the data. At first, it may appear that the lower and upper hinges are the same as the lower and upper quartiles. Consider the data as the first 10 integers. Here, the median is 5.5, the average of the two middle-most numbers 5 and 6. Using the quantile function on 1:10, it may be checked that here Q1 = 3.25 and Q3 = 7.75. The lower hinge is the middle-most number between 1 and the median 5.5, which turns out to be 3, and the upper hinge is 8. Thus, it may be seen that the hinges are different from the quartiles.
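These numbers are quickly verified in R; the fivenum function, which we will meet again shortly, returns the hinges:
> quantile(1:10)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 
> fivenum(1:10)
[1]  1.0  3.0  5.5  8.0 10.0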
An extension of the concept of the hinges will be seen in the Letter values section. We will next look at exploratory measures of dispersion.
The interquartile range
The range, the difference between the minimum and maximum of the data values, is one measure of the spread of a variable. This measure is susceptible to the extreme points. The interquartile range, abbreviated as IQR, is defined as the difference between the upper and lower quartiles, that is:
$\mathrm{IQR} = Q_3 - Q_1$
The R function IQR calculates the IQR for a given numeric object. All the concepts theoretically described up to this point will be put into action.
Time for action – the essential summary statistics for
"The Wall" dataset
We will understand the summary measures for EDA through the data on centuries scored by Rahul Dravid in Test matches.
1. Load the useful EDA package: library(LearnEDA).
2. Load the dataset TheWALL from the RSADBE package: data(TheWALL).
3. Obtain the quantiles of the centuries with quantile(TheWALL$Score), and the differences between the quantiles using the diff function: diff(quantile(TheWALL$Score)). The output is as follows:
> quantile(TheWALL$Score)
0% 25% 50% 75% 100%
100.0 111.8 140.0 165.8 270.0
> diff(quantile(TheWALL$Score))
25% 50% 75% 100%
11.75 28.25 25.75 104.25
As we are considering Rahul Dravid's centuries only, the beginning point is 100. The median of his centuries is 140.0, and the first and third quartiles are respectively 111.8 and 165.8. The median of 140 runs can be interpreted as a 50 percent chance of The Wall reaching 140 runs if he scores a century. The highest ever score of Dravid is of course 270. Interpret the differences between the quantiles.
4. The percentiles of Dravid's centuries can be obtained by using the quantile function again: quantile(TheWALL$Score, seq(0,1,.1)); here seq(0,1,.1) creates a vector which increases in increments of 0.1, beginning at 0 until 1. The differences between consecutive percentiles are obtained with diff(quantile(TheWALL$Score, seq(0,1,.1))):
>quantile(TheWALL$Score,seq(0,1,.1))
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
100.0 103.5 111.0 116.0 128.0 140.0 146.0 154.0 180.0 208.5 270.0
> diff(quantile(TheWALL$Score,seq(0,1,.1)))
10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
3.5 7.5 5.0 12.0 12.0 6.0 8.0 26.0 28.5 61.5
The Wall is also known for his resolve to perform well in away Test matches. Let us verify that using the data on the centuries scored.
5. The numbers of home and away centuries are obtained using the table function. Further, we obtain a boxplot of the home and away centuries.
> table(HA_Ind)
HA_Ind
Away Home 
  21   15 
The R function table returns the frequencies of the various categories of a categorical variable. In fact, it is more versatile and can obtain frequencies for more than one categorical variable. That 21 of his 36 centuries came in away Tests partly confirms his away record. However, the boxplot says otherwise:
> boxplot(Score~HA_Ind,data=TheWALL)
Figure 1: Box plot for Home/Away centuries of The Wall
It may be tempting for The Wall's fans to believe that if they remove the outlier scores above 200, the result may show that his performance in away Test centuries is better than or equal to the home ones. However, this is not the case, which may be verified as follows.
6. Generate the boxplot for centuries whose score is less than 200 with boxplot(Score~HA_Ind, subset=(Score<200), data=TheWALL).
Figure 2: Box plot for Home/Away centuries of The Wall (less than 200 runs)
What do you conclude from the preceding screenshot?
7. The fivenum summary for the centuries is:
> fivenum(TheWALL$Score)
[1] 100.0 111.5 140.0 169.5 270.0
The fivenum function returns the minimum, lower hinge, median, upper hinge, and maximum values for the input data. The numbers 111.5 and 169.5 are the lower and upper hinges, and it may be seen that they are certainly different from the lower and upper quartiles of 111.8 and 165.8. Thus far, we have focused on measures of location, so let us now look at some measures of dispersion. The range function in R actually returns the minimum and maximum values of the data. Thus, to obtain the range as a measure of spread, we use diff(range()). We use the IQR function to obtain the interquartile range.
8. Using the range, diff, and IQR functions, obtain the spread of Dravid's centuries as follows:
> range(TheWALL$Score)
[1] 100 270
> diff(range(TheWALL$Score))
[1] 170
> IQR(TheWALL$Score)
[1] 54
>IQR(TheWALL$Score[TheWALL$HA_Ind=="Away"])
[1] 36
> IQR(TheWALL$Score[TheWALL$HA_Ind=="Home"])
[1] 63.5
Here, we extract the away and home centuries from Score by retaining only those elements of Score for which HA_Ind equals "Away" or "Home", respectively.
What just happened?
The data summaries in the EDA framework are slightly different. Here, we first used the quantile function to obtain the quartiles and the deciles (10 percent steps) of a numeric variable. The diff function has been used to find the differences between the consecutive elements of a vector. The boxplot function has been used to compare the home and away Test centuries, which led to the conclusion that the median score of Dravid's centuries at home is higher than that of the away centuries. The restriction to Test centuries under 200 runs further confirmed, in particular, that Dravid's centuries at home have a higher median value than those in away series, and in general that the median is robust to outliers.
The IQR function gives us the interquartile range of a vector, and the fivenum function gives us the hinges. Though intuitively it appears that hinges and quartiles are similar, this is not always true. In this section, you also learned the usage of functions such as quantile, fivenum, IQR, and so on.
We will now move on to the main techniques of exploratory analysis.
The stem-and-leaf plot
The stem-and-leaf plot is considered one of the seven important tools of Statistical Process Control (SPC); refer to Montgomery (2005). It is somewhat similar in nature to the histogram.
The stem-and-leaf plot is an effective method of displaying data in a (partial) tree form. Here, each datum is split into two parts: the stem part and the leaf part. In general, the last digit of a datum forms the leaf part and the rest forms the stem. Now, consider a datum 235. If the split criterion is the units place, the stem and leaf parts will respectively be 23 and 5; if it is tens, then 2 and 3; and finally, if it is hundreds, it will be 0 and 2. The left-hand side of the split datum is called the leading digits and the right-hand side the trailing digits.
In the next step, all the possible leading digits are arranged in increasing order. This includes even those stems for which we may not have data for the leaf part, which ensures that the final stem-and-leaf plot truly depicts the distribution of the data. All the possible leading digits are called stems. The leaves are then displayed to the right-hand side of the stems, and for each stem the leaves are again arranged in increasing order.
Example 4.2.1. A simple illustration: Consider a dataset of eight elements: 12, 22, 42, 13, 27, 46, 25, and 52. The leading digits for this data are 1, 2, 4, and 5. On inserting 3, the leading digits complete the required stems 1 to 5. The leaves for stem 1 are 2 and 3. The unordered leaves for stem 2 are 2, 7, and 5; the displayed leaves for stem 2 are then 2, 5, and 7. There are no leaves for stem 3. Similarly, the leaves for stems 4 and 5 are respectively the sets {2, 6} and {2}. The stem function in R will be used for generating the stem-and-leaf plots.
Example 4.2.2. Octane Rating of Gasoline Blends (continued from the Visualization techniques for continuous variable data section of Chapter 3, Data Visualization): In the earlier study, we used the histogram for understanding the octane ratings under two different methods. We will use the stem function in the forthcoming Time for action – the stem function in play section for displaying the octane ratings under Method_1 and Method_2.
Tukey (1977), the benchmark book for EDA, produces the stem-and-leaf plot in a slightly different style. For example, the stem plots for Method_1 and Method_2 are better understood if we can put the two displays adjacent to each other instead of one below the other. It is possible to achieve this using the stem.leaf.backback function from the aplpack package.
Time for action – the stem function in play
The R function stem from the base distribution and stem.leaf.backback from aplpack are fundamental for our purpose of creating stem-and-leaf plots. We will illustrate these two functions for the examples discussed earlier.
1. As mentioned in Example 4.2.1. A simple illustration, first create the x vector: x <- c(12,22,42,13,27,46,25,52).
2. Obtain the stem-and-leaf plot with:
> stem(x)
The decimal point is 1 digit(s) to the right of the |
1 | 23
2 | 257
3 |
4 | 26
5 | 2
To obtain the median from the stem display, we proceed as follows. Remove one point each from either end of the display: first we remove 52 and 12, and then 46 and 13. The trick is to proceed until we are left with either one point or two. In the former case, the remaining point is the median, and in the latter case we simply take the average of the two. Finally, we are left with 25 and 27, and hence their average, 26, is the median of x.
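This paper-and-pencil count is easily cross-checked in R:
> sort(x)
[1] 12 13 22 25 27 42 46 52
> median(x)
[1] 26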
We will now look at the octane dataset.
3. Obtain the stem plots for both methods: data(octane), stem(octane$Method_1, scale=2), and stem(octane$Method_2, scale=2).
The output will be similar to the following screenshot:
Figure 3: The stem plot for the octane dataset (R output edited)
Of course, the preceding screenshot has been edited. To generate such a back-to-back display, we need a different function.
4. Using the stem.leaf.backback function from aplpack, run library(aplpack) and stem.leaf.backback(octane$Method_1, octane$Method_2, back.to.back=FALSE, m=5) to get the output in the desired format.
Figure 4: Tukey's stem plot for octane data
The preceding screenshot has many unexplained, and a bit mysterious, symbols! Prof. J. W. Tukey took a very pragmatic approach when developing EDA. You are strongly encouraged to read Tukey (1977), as this brief chapter barely does it justice. Note that 18 of the 32 observations for Method_1 are in the range 80.4 to 89.3. Now, if we have stems only at 8, 9, and 10, the stem 8 will carry 18 leaves, which will not give a meaningful understanding of the data. The stems can have substems, or be stretched out, and a very novel solution for this is provided by Prof. Tukey. For a very high frequency stem, the solution is to squeeze out five more stems. For the stem 8 here, we have the trailing digits 0, 1, 2, …, 9. Tukey adopts a scheme of tagging lines which leads towards a clear reading of the stem-and-leaf display: use * for zero and one, t for two and three, f for four and five, s for six and seven, and a period (.) for eight and nine. Truly ingenious! Thus, if you are planning to write about the stem-and-leaf display in your local language, you may not require *, t, f, s, and '.' at all! Go back to the preceding screenshot, and it now looks much more beautiful.
Following the leaf part for each method, we are also given cumulative frequencies from the top and from the bottom. Why? We know that the stem-and-leaf display has increasing values from the top and decreasing values from the bottom. In this particular example, we have n: 32 observations. Thus, in sorted order, we know that the median is a value between the sixteenth and seventeenth sorted observations. The line whose cumulative frequency first exceeds 16, from either direction, contains the median; this is indicated by (2) for Method_1 and (6) for Method_2. Can you now make an approximate guess of the median values? Obviously, depending on the dataset, we may require m = 5, 1, or 2.
We have used the argument back.to.back=FALSE to ensure that the two stem-and-leaf displays can be seen independently. It is fairly easy to compare the two displays by setting back.to.back=TRUE, in which case the stem line is common for both methods and we can simply compare their leaf distributions. That is, you need to run stem.leaf.backback(octane$Method_1, octane$Method_2, back.to.back=TRUE, m=5) and investigate the results.
We can clearly see that the median for Method_2 is higher than that of Method_1.
What just happened?
Using the basic stem function and stem.leaf.backback from aplpack, we got two efficient exploratory displays of the datasets. The latter function can be used to compare two stem-and-leaf displays. Stems can be further squeezed to reveal more information with the option m set to 1, 2, or 5.
We will next look at the EDA technique which extends the scope of hinges.
Letter values
The median, quartiles, and the extremes (maximum and minimum) indicate how the data is spread over its range. These values can be used to examine two or more samples. Letter values offer another way of understanding the data. This small journey begins with a concept called depth, which measures the minimum position of a datum in the ordered sample from either of the extremes. Thus, the extremes have a depth of 1, the second largest and smallest data have a depth of 2, and so on.
Now, consider a sample of size n, assumed to be an odd number for convenience's sake. Then, the depth of the median is (n + 1)/2. The depth of a datum is denoted by d, and for the median it is indicated by d(M). Since the hinges, lower and upper, do to the samples divided by the median what the median does to the whole sample, the depth of the hinges is given by $d(H) = ([d(M)] + 1)/2$. Here, [ ] denotes the integer part of the argument. As the hinges, together with the mid-hinge which is the median, divide the data into four equal parts, we can define the eighths as the values which divide the data into eight equal parts. The eighths are denoted by E. The depth of the eighths is given by the formula $d(E) = ([d(H)] + 1)/2$. It may be seen that the depths of the median, hinges, and eighths depend on the sample size.
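As a small check of these depth formulas, the following lines (our own illustration, not part of LearnEDA) compute the depths for a sample of size n = 32; the values agree with the depth column reported by the lval function for the octane data a little later:
> n <- 32
> d_M <- (n + 1)/2             # depth of the median
> d_H <- (floor(d_M) + 1)/2    # depth of the hinges
> d_E <- (floor(d_H) + 1)/2    # depth of the eighths
> c(median = d_M, hinge = d_H, eighth = d_E)
median  hinge eighth 
  16.5    8.5    4.5 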
Using the eighths, we can further divide the data to obtain the sixteenths, then the thirty-seconds, and so on. The process of division continues until we end up at the extremes, where we cannot proceed any further; the letter values continue this search until we reach the extremes. The process of division is well understood when we record the lower and upper values for the hinges, the eighths, the sixteenths, the thirty-seconds, and so on. The difference between the lower and upper values of these metrics, a concept similar to the mid-range, is also useful for understanding the data. The R function lval from the LearnEDA package gives the letter values for the data.
Example 4.3.1. Octane Rating of Gasoline Blends (continued): We will now obtain the letter values for the octane dataset:
>lval(octane$Method_1)
depth lo hi mids spreads
M 16.5 88.00 88.0 88.000 0.00
H 8.5 85.55 91.4 88.475 5.85
E 4.5 84.25 94.2 89.225 9.95
D 2.5 82.25 101.5 91.875 19.25
C 1.0 80.40 105.5 92.950 25.10
> lval(octane$Method_2)
depth lo hi mids spreads
M 16.5 97.20 97.20 97.200 0.00
H 8.5 93.75 99.60 96.675 5.85
E 4.5 90.75 101.60 96.175 10.85
D 2.5 87.35 104.65 96.000 17.30
C 1.0 83.30 106.60 94.950 23.30
The letter values (look at the lo and hi columns in the preceding output) clearly show that the corresponding values for Method_1 are always lower than those under Method_2. In particular, note that the lower hinge of Method_2 is greater than the upper hinge of Method_1. However, the spreads under the two methods are nearly identical.
Data re-expression
The presence of an outlier or an overspread of the data may lead to an incomplete picture in a graphical display, and hence statistical inference may be inappropriate in such scenarios. In many such scenarios, re-expression of the data on another scale may be more useful; refer to Chapter 3, The Problem of Scale, of Tukey (1977). Here, we list the scenarios from Tukey where data re-expression may help circumvent the limitations cited at the beginning.
The first scenario where re-expression is useful is when the variables assume non-negative values, that is, the variables never assume a value less than zero. Examples of such variables are age, height, power, area, and so on. A thumb rule for the application of re-expression is when the ratio of the largest to the smallest value in the data is very large, say 100. This is one reason that regression analysis variables such as age are almost always re-expressed on the logarithmic scale.
The second scenario explained by Tukey is about variables like balances and profit-and-loss. If there is a deposit to an account, the balance increases, and if there is a withdrawal, it decreases. Thus, such variables can assume positive as well as negative values. Re-expression of variables like the balance itself rarely helps; re-expression of the amounts or quantities before the subtraction helps on some occasions. Fractions and percentage counts form the third scenario where re-expression of the data is useful, though special techniques are needed. The scenarios mentioned are indicative and not exhaustive. We will now look at the data re-expression techniques which are useful.
Example 4.4.1. Re-expression for the Power of 62 Hydroelectric Stations: We need to understand the distribution of the ultimate power, in megawatts, of 62 hydroelectric stations and power stations of the Corps of Engineers. The data for our illustration has been regenerated from Exhibit 3 of Tukey (1977). First, we simply look at the stem-and-leaf display of the original data on the power of the 62 hydroelectric stations. We use the stem.leaf function from the aplpack package.
> hydroelectric <- c(14,18,28,26,36,30,30,34,30,43,45,54,52,60,68,
+   68,61,75,76,70,76,86,90,96,100,100,100,100,100,100,110,112,
+   118,110,124,130,135,135,130,175,165,140,250,280,204,200,270,
+   40,320,330,468,400,518,540,595,600,810,810,1728,1400,1743,2700)
> stem.leaf(hydroelectric,unit=1)
1 | 2: represents 12
leaf unit: 1
n: 62
2 1 | 48
4 2 | 68
9 3 | 00046
…
24 9 | 06
30 10 | 000000
(4) 11 | 0028
28 12 | 4
27 13 | 0055
23 14 | 0
15 |
22 16 | 5
21 17 | 5
18 |
19 |
20 20 | 04
21 |
24 |
…
7 60 | 0
HI: 810 810 1400 1728 1743 2700
The data begins with values as low as 14, grows modestly to the hundreds, such as 100, 135, and so on, then to the five hundreds, and finally literally explodes into the thousands, running up to 2700. If all the leading digits must be displayed, we have 270 of them. With an average of 35 lines per page, the output would require approximately eight pages, and between the last two values of 1743 and 2700 we would have roughly 100 empty leading digits. The stem.leaf function has therefore collected all values beyond the hydroelectric plant producing 600 megawatts in the HI line.
Let us look at the ratio of the largest to the smallest value, which is as follows:
> max(hydroelectric)/min(hydroelectric)
[1] 192.8571
By the thumb rule, this is an indication that a data re-expression is in order. Thus, we take the log transformation (with base 10) and obtain the stem-and-leaf display for the transformed data.
>stem.leaf(round(log(hydroelectric,10),2),unit=0.01)
1 | 2: represents 0.12
leaf unit: 0.01
n: 62
1 11 | 5
2 12 | 6
13 |
…
24 19 | 358
(11) 20 | 00000044579
27 21 | 11335
22 22 | 24
20 23 | 110
…
30 |
4 31 | 5
3 32 | 44
HI: 3.43
The compactness of the stem-and-leaf display for the transformed data is indeed more useful, and we can further see that there are only about 30 leading digits. The display is also more elegant and comprehensible.
Have a go hero
The stem-and-leaf plot can be considered a particular case of the histogram from a certain point of view. You can attempt to understand the hydroelectric distribution using a histogram too. First, obtain the histogram of the hydroelectric variable, and then repeat the exercise on its logarithmic re-expression.
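One possible way to attempt this, assuming the hydroelectric vector created above is still in the workspace:
> par(mfrow=c(1,2))
> hist(hydroelectric, xlab="Power (megawatts)", main="Original scale")
> hist(log(hydroelectric,10), xlab="log10 of Power", main="Logarithmic re-expression")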
Bagplot – a bivariate boxplot
In Chapter 3, Data Visualization, we saw the effectiveness of the boxplot. For independent variables, we can simply draw separate boxplots and visualize the distributions. However, when there is dependency between two variables, separate boxplots lose that dependency. Thus, we need a way to visualize such data through a boxplot-like display. The answer is provided by the bagplot, or bivariate boxplot.
The bagplot is characterized as follows:
The depth median, denoted by * in the bagplot, is the point with the highest halfspace depth.
The depth median is surrounded by a polygon, called the bag, which covers the n/2 observations with the largest depth.
The bag is then magnified by a factor of 3, which gives the fence. The fence is not plotted since it would draw attention away from the data.
The observations between the bag and the fence are covered by a loop.
Points outside the fence are flagged as outliers.
For technical details of the bagplot, refer to the paper (http://venus.unive.it/romanaz/ada2/bagplot.pdf) by Rousseeuw, Ruts, and Tukey (1999).
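A single bagplot is easily drawn before we embed it in the pairs display; a minimal sketch, assuming the Gasoline mileage and displacement columns are named y and x1 as in the variable table of Chapter 3:
> library(aplpack)
> data(Gasoline)
> bagplot(Gasoline$x1, Gasoline$y)   # mileage against displacement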
Example 4.5.1. Bagplot for the Gasoline mileage problem: The pairs plot for the Gasoline mileage problem in Example 3.2.9. The Gasoline mileage performance data gave a good insight into the nature of the data. Now, we will modify that plot and replace the upper panel with the bagplot function for a cleaner comparison of the bagplot with the scatter plot. However, in the original dataset, the variables x4, x5, and x11 are factors, which we remove from the bagplot study. The bagplot function is available in the aplpack package. We first define panel.bagplot, and then generate the matrix of scatter plots with the bagplots produced in the upper triangle of the display.
Time for action – the bagplot display for a multivariate dataset
1. Load the aplpack package with library(aplpack).
2. Check the default examples of the bagplot funcon with example(bagplot).
3. Create the panel.bagplot function with:
panel.bagplot <- function(x,y)
{
  require(aplpack)
  bagplot(x, y, verbose=FALSE, create.plot=TRUE, add=TRUE)
}
Here, the panel.bagplot function is defined to enable us to obtain the bagplot in the upper panel region of the pairs function.
4. Apply the panel.bagplot function within the pairs function on the Gasoline dataset: pairs(Gasoline[-19,-c(1,4,5,13)], upper.panel=panel.bagplot).
We obtain the following display:
Figure 5: Bagplot for the Gasoline dataset
What just happened?
We created the panel.bagplot function and plugged it into the pairs function for an effective display of the multivariate dataset. The bagplot is an important EDA tool for getting exploratory insight in the important case of a multivariate dataset.
The resistant line
In Example 3.2.3. The Michelson-Morley experiment of Chapter 3, Data Visualization, we visualized data through the scatter plot, which indicates possible relationships between the dependent variable (y) and the independent variable (x). The scatter plot, or x-y plot, is again an EDA technique. However, we would like a more quantitative model which explains the interrelationship in a more precise manner. The traditional approach is taken up in Chapter 6, Linear Regression Analysis. In this section, we take an EDA approach to building our first regression model.
Consider n pairs of observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. We can easily visualize the data using the scatter plot. We need to obtain a model of the form $y = a + bx$, where a is the intercept term and b is the slope. This model is an attempt to explain the relationship between the variables x and y. Basically, we need to obtain the values of the slope and intercept from the data. In most real data, a single line will not pass through all the n pairs of observations. In fact, it is a difficult task for the determined line to pass through even a very few observations. As a simple approach, we may choose any two observations and determine the slope and intercept. However, the difficulty lies in the choice of the two points. We will now explain how the resistant line determines the two required terms.
The scatter plot (part A of the following figure) is divided into three regions, using the x-values, where each region has approximately the same number of data points; refer to part B of the figure. The three regions, from the left-hand to the right-hand side, are called the lower, middle, and upper regions. Note that the y-values are distributed among the three regions according to their x-values; hence, some y-values of the lower region may be higher than a few values in the upper regions. Within each region, we find the medians of the x- and y-values independently. That is, for the lower region, the median $y_L$ is determined by the y-values falling in this region, and similarly the median $x_L$ is determined by the x-values of the region. The medians $x_M$, $x_H$, $y_M$, and $y_H$ are determined in the same way; refer to part C of the figure. Using these median values, we now form three pairs: $(x_L, y_L)$, $(x_M, y_M)$, and $(x_H, y_H)$. Note that these pairs need not be data points.
To determine the slope b, two points suffice. The resistant line theory determines the slope by using the two pairs of points $(x_L, y_L)$ and $(x_H, y_H)$. Thus, we obtain the following:
$b = \dfrac{y_H - y_L}{x_H - x_L}$
For obtaining the intercept value a, we use all three pairs of medians. The value of a is determined using:
$a = \dfrac{1}{3}\left[ (y_L + y_M + y_H) - b\,(x_L + x_M + x_H) \right]$
Note that the properties of the median are exactly what make these solutions resistant. For example, the lower and upper medians would not be affected by outliers at the extreme ends.
[Figure 6 panels: A – the x-y scatter plot; B – dividing the plot into three portions by x-values; C – the region medians of the x- and y-values; D – obtaining the a and b values.]
Figure 6: Understanding the resistant line
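The construction just described is simple enough to sketch directly; the following is our own minimal illustration of the three-group idea, not the rline function used next:
# Split the data into three groups by x, take the group medians, and
# compute b and a exactly as in the two formulas above.
resistant_line_sketch <- function(x, y) {
  ord <- order(x)
  x <- x[ord]; y <- y[ord]
  grp <- cut(seq_along(x), breaks = 3, labels = c("L", "M", "H"))
  xm <- tapply(x, grp, median)   # x_L, x_M, x_H
  ym <- tapply(y, grp, median)   # y_L, y_M, y_H
  b <- (ym["H"] - ym["L"]) / (xm["H"] - xm["L"])
  a <- mean(ym) - b * mean(xm)   # (1/3)[(y_L+y_M+y_H) - b(x_L+x_M+x_H)]
  c(a = unname(a), b = unname(b))
}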
We will use the rline function from the LearnEDA package.
Example 4.6.1. Resistant line for the IO-CPU time: The CPU time is known to depend on the number of IO processes running at any given point of time. A simple dataset is available at http://www.cs.gmu.edu/~menasce/cs700/files/SimpleRegression.pdf. We aim at fitting a resistant line to this dataset.
Time for action – the resistant line as a rst regression model
We use the rline function from the LearnEDA package for fitting the resistant line on a dataset.
1. Load the LearnEDA package: library(LearnEDA).
2. Understand the default example with example(rline).
3. Load the dataset: data(IO_Time).
4. Create the IO_rline resistant line with CPU_Time as the output and No_of_IO as the input using IO_rline <- rline(IO_Time$No_of_IO, IO_Time$CPU_Time, iter=10) for 10 iterations.
5. Find the intercept and slope with IO_rline$a and IO_rline$b. The output will then be:
> IO_rline$a
[1] 0.2707
> IO_rline$b
[1] 0.03913
6. Obtain the scatter plot of CPU_Time against No_of_IO with plot(IO_Time$No_of_IO, IO_Time$CPU_Time).
7. Add the resistant line to the generated scatter plot with abline(a=IO_rline$a, b=IO_rline$b).
8. Finally, give a title to the plot: title("Resistant Line for the IO-CPU Time").
We then get the following screenshot:
Figure 7: Resistant line for CPU_Time
What just happened?
The rline function from the LearnEDA package fits a resistant line given the input and output vectors. It calculates the slope and intercept terms, which are driven by medians. The main advantage of the rline fit is that the model is not susceptible to outliers. We can see from the preceding screenshot that the resistant line model, IO_rline, provides a very good fit for the dataset. Well! You have created your first exploratory regression model.
Smoothing data
In The resistant line section, we constructed our first regression model for the relationship between two variables. In some instances, the x-values are so systematic that their values are almost redundant, and yet we need to understand the behavior of the y-values with respect to them. Consider the case where the x-values are equally spaced; the share price (y) at the end of day (x) is an example where the difference between two consecutive x-values is exactly one. Here, we are more interested in smoothing the data along the y-values, as one expects more variation in their direction. Time series data is a very good example of this type. In time series data, we typically have $x_{n+1} = x_n + 1$, and hence we can precisely define the data in a compact form by $y_t$, $t = 1, 2, \ldots$. The general model may then be specified by:
$y_t = a + bt + \varepsilon_t, \quad t = 1, 2, \ldots$
In the standard EDA notation, this is simply expressed as:
data = fit + residual
In the context of time series data, the model is succinctly expressed as:
data = smooth + rough
The fundamental concept of the smoothing technique makes use of running medians. In a freehand curve, we can simply draw a smooth curve using our judgment, ignoring the out-of-curve points, and complete the picture. A computer finds this task difficult only because it needs specific instructions for obtaining the smooth points across which it needs to draw a curve. For a sequence of points, such as the sequence yt, the smoothing needs to be carried out over a sequence of overlapping segments. Such segments are predefined to a specific length. As a simple example, we may have a three-length overlapping segment sequence in {y1, y2, y3}, {y2, y3, y4}, {y3, y4, y5}, and so on. Four-length or five-length overlapping segment sequences may be defined along similar lines as required. It is within each segment that smoothing is carried out. Two popular choices are the mean and the median. Of course, in exploratory analysis our natural choice is the median. Note that the median of the segment {y1, y2, y3} may be any of the y1, y2, or y3 values.
General smoothing techniques, such as LOESS, are nonparametric techniques and require good expertise in the subject. The ideas discussed here are mainly driven by the median as the core technique.
A three-moving median cannot correct two or more consecutive outliers, and similarly a five-moving median cannot correct three or more, and so on. A solution, or a workaround in an engineer's language, for this is to continue smoothing the sequence obtained in the previous iteration until there is no further change in the smooth part. We may also consider a moving median of span 4. Here, the median is the average of the two mid-points. However, considering that the x-values are integers, the four-moving median does not actually correspond to any of the time points t. Using the simplicity principle, it is easily possible to re-center the points at t by taking a two-moving median of the values obtained in the four-moving median step.
The notation for the first iteration in EDA is simply the number 3, or 5 or 7 as used. The notation for repeated smoothing is 3R, where R stands for repetition. For a four-moving median re-centered by a two-moving median, the notation is 42. On many occasions, a smoother operation giving more refinement than 42 may be desired. It is on such occasions that we may use the running weighted average, which gives different weights to the points within a span. Here, each point is replaced by a weighted average of the neighboring points. A popular choice of weights for a running weighted average of 3 is (¼, ½, ¼), and this smoothing process is referred to as hanning. The hanning process is denoted by H.
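The hanning weights are simple enough to code directly; the following is an illustrative sketch of the idea (the han function from LearnEDA is used later for the actual analysis):
# Replace each interior point by the weighted average (1/4, 1/2, 1/4) of
# itself and its two neighbours; the end points are left unchanged.
han_sketch <- function(y) {
  n <- length(y)
  out <- y
  out[2:(n - 1)] <- 0.25 * y[1:(n - 2)] + 0.5 * y[2:(n - 1)] + 0.25 * y[3:n]
  out
}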
Since the running median smoothens the data sequence a bit more than appropriate and hence removes some interesting patterns, those patterns can be recovered from the residuals, which in this context are called the rough. This is achieved by smoothing the rough sequence and adding it back to the smooth sequence. This operation is called reroughing. Velleman and Hoaglin (1984) point out that the smoothing process which performs better in general is 4253H, twice. That is, here we first begin with a running median of 4 which is re-centered by 2. The re-smoothing is then done by 5 followed by 3, and the outliers are removed by H. Finally, reroughing is carried out by smoothing the roughs and then adding them to the smoothed sequence. This full cycle is denoted by 4253H, twice. Unfortunately, we are not aware of any R function or package which implements the 4253H smoother. The options available in the LearnEDA package are 3RSS and 3RSSH.
We have not yet explained what the smoothers 3RSS and 3RSSH are. The 3R smoothing chops off peaks and valleys and leaves behind mesas and dales two points long. What does this mean? A mesa refers to an area of high land with a flat top and two or more steep cliff-like sides, whereas a dale refers to a valley. To overcome this problem, a special splitting, S, is used at each two-point mesa and dale, where the data is split into three pieces: the two-point flat segment, the smooth data to the left of the two points, and the smooth sequence to their right. Now, let yf-1, yf refer to the two-point flat segment, and yf+1, yf+2, … refer to the smooth sequence to the right of these two points. Then the S technique predicts the value of yf-1 as if it were on the straight line formed by yf+1 and yf+2; a simple method is to obtain yf-1 as 3yf+1 – 2yf+2. The yf value is then obtained as the median of the predicted yf-1, yf+1, and yf+2. After removing all the mesas and dales, we again repeat the 3R cycle. Thus, we have the notation 3RSS, and the reader can now easily connect with what 3RSSH means. Now, we will obtain the 3RSS for the cow temperature dataset of Velleman and Hoaglin.
Example 4.7.1. Smoothing Data for the Cow temperature: The temperature of a cow is measured at 6:30 a.m. for 75 consecutive days. We will use the smooth function from the stats package and the han function from the LearnEDA package to achieve the required smoothing. We will build the necessary R program in the forthcoming action list.
Time for action – smoothening the cow temperature data
First, we use the smooth function from the stats package on the cow temperature dataset. Next, we will use the han function from LearnEDA.
1. Load the cow temperature data in R with data(CT).
2. Plot the time series data using the plot.ts function: plot.ts(CT$Temperature, col="red", pch=1).
3. Create a 3RSS object for the cow temperature data using the smooth function and the kind option: CT_3RSS <- smooth(CT$Temperature, kind="3RSS").
4. Han the preceding 3RSS object using the han function from the LearnEDA package: CT_3RSSH <- han(smooth(CT$Temperature, kind="3RSS")).
5. Impose a line of the 3RSS data points with lines.ts(CT_3RSS, col="blue", pch=2).
6. Impose a line of the hanned 3RSS data points with lines.ts(CT_3RSSH, col="green", pch=3).
7. Add a meaningful legend to the plot: legend(20, 90, c("Original","3RSS","3RSSH"), col=c("red","blue","green"), pch="___").
We get a useful smoothened plot of the cow temperature data as follows:
Smoothening cow temperature data
The original plot shows a lot of variation in the cow temperature measurements. The 3RSS smoother has many sharp edges in comparison with the 3RSSH smoother, though it is itself a lot smoother than the original display. The plot further indicates that there has been a considerable decrease in the cow temperature measurements from about the fifteenth day of observation. This is confirmed by all three displays.
What just happened?
The discussion of the smoothing functions looked very promising in the theoretical development. We took a real dataset and saw its time series plot. Then we plotted two versions of the smoothing process and found both to be much smoother than the original plot.
Median polish
In Example 4.6.1. Resistant line for the IO-CPU time, we had IO as the only independent variable which explained the variations in the CPU time. In many practical problems, the dependent variable depends on more than one independent variable. In such cases, we need to factor in the effect of the independent variables using a single model. When we have two independent variables, median polish helps in building a robust model. A data display in which the rows and columns hold the different levels of two factors is called a two-way table. Here, the table entries are the values of the dependent variable.
An appropriate model for the two-way table is given by:
$y_{ij} = \alpha + \beta_i + \gamma_j + \varepsilon_{ij}$
Here, $\alpha$ is the intercept term, $\beta_i$ denotes the effect of the i-th row, $\gamma_j$ the effect of the j-th column, and $\varepsilon_{ij}$ is the error term. All the parameters are unknown. We need to find the unknown parameters through the EDA approach. The basic idea is to use row medians and column medians to obtain the row and column effects, and then find the overall intercept term. Any unexplained part of the data is considered as the residual.
Time for action – the median polish algorithm
The median polish algorithm (refer to http://www-rohan.sdsu.edu/~babailey/stat696/medpolishalg.html) is given next:
1. Obtain the row medians of the two-way table and append them to the right-hand side of the data matrix. From each element of every row, subtract the respective row median.
2. Find the median of the row medians and record it as the initial grand effect value. Also, subtract the initial grand effect value from each row median.
3. For the original data columns in the previously appended matrix, obtain the column medians and append them to the bottom of the matrix. As in step 1, subtract from each column element the corresponding column median.
4. For the bottom row of column medians in the previous table, obtain the median, and then add the obtained value to the initial grand effect value. Next, subtract the modified grand effect value from each of the column medians.
5. Iterate steps 1-4 until the changes in the row or column medians are insignificant.
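Before turning to the built-in function, the following minimal sketch (on a made-up 3 x 3 table, not the book's data) shows the repeated row and column median sweeps described above; at every stage the data decompose as grand effect + row effect + column effect + residual.
# A hand-rolled sketch of the median polish sweeps on a hypothetical table;
# the medpolish function in the stats package implements the full algorithm.
y <- matrix(c(12, 15, 20,
              14, 18, 22,
              11, 13, 19), nrow = 3, byrow = TRUE)
row_eff <- rep(0, nrow(y)); col_eff <- rep(0, ncol(y)); grand <- 0
for (iter in 1:10) {
  rmed <- apply(y, 1, median)            # step 1: row medians
  y <- y - rmed; row_eff <- row_eff + rmed
  grand <- grand + median(row_eff)       # step 2: update the grand effect
  row_eff <- row_eff - median(row_eff)
  cmed <- apply(y, 2, median)            # step 3: column medians
  y <- sweep(y, 2, cmed); col_eff <- col_eff + cmed
  grand <- grand + median(col_eff)       # step 4: update the grand effect again
  col_eff <- col_eff - median(col_eff)
}
grand; row_eff; col_eff; y               # grand effect, row/column effects, residuals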
We use the medpolish function from the stats library for the computations involved in median polish. For more details about the model, you can refer to Chapter 8 of Velleman and Hoaglin (1984).
Example 4.8.1. Male death rates: The dataset related to the male death rate per 1000 by the cause of death and the average amount of tobacco smoked daily is available on page 221 of Velleman and Hoaglin (1984). Here, the row effect is due to the cause of death, whereas the columns constitute the amount of tobacco smoked (in grams). We are interested in modeling the effect of these two variables on the male death rates in the region.
> data(MDR)
> MDR2 <- as.matrix(MDR[,2:5])
> rownames(MDR2) <- c("Lung", "UR", "Sto", "CaR", "Prost", "Other_Lung", "Pul_TB", "CB", "RD_Other", "CT", "Other_Cardio", "CH", "PU", "Viol", "Other_Dis")
> MDR_medpol <- medpolish(MDR2)
1 : 8.38
2 : 8.17
Final: 8.1625
> MDR_medpol$row
        Lung           UR          Sto          CaR        Prost
      0.1200      -0.4500      -0.2800      -0.0125      -0.3050
  Other_Lung       Pul_TB           CB     RD_Other           CT
      0.2050      -0.3900      -0.2050       0.0000       4.0750
Other_Cardio           CH           PU         Viol    Other_Dis
      1.6875       1.4725      -0.4325       0.0950       0.9650
> MDR_medpol$col
     G0     G14     G24     G25
-0.0950  0.0075 -0.0050  0.1350
> MDR_medpol$overall
[1] 0.545
> MDR_medpol$residuals
G0 G14 G24 G25
Lung -0.5000 -0.2025 2.000000e-01 0.8600
UR 0.0000 0.0275 0.000000e+00 -0.0200
Sto 0.2400 0.0875 -1.600000e-01 -0.0900
CaR 0.0025 0.0000 -1.575000e-01 0.0725
Prost 0.4050 0.0125 -1.500000e-02 -0.0350
Other_Lung -0.0150 -0.0375 1.500000e-02 0.1350
Pul_TB -0.0600 -0.0025 3.000000e-02 0.0000
CB -0.1250 -0.0575 5.500000e-02 0.2450
RD_Other 0.2400 -0.0025 1.387779e-17 -0.2800
CT -0.3050 0.0125 -1.500000e-02 1.2350
Other_Cardio 0.0925 -0.0900 2.425000e-01 -0.1175
CH 0.0875 -0.0850 -1.525000e-01 0.1775
PU -0.0175 0.0200 5.250000e-02 -0.0275
Viol -0.1250 0.1725 -1.850000e-01 0.1250
Other_Dis 0.0350 0.2925 -3.500000e-02 -0.0750
What just happened?
The output associated with MDR_medpol$row gives the row effects, while MDR_medpol$col gives the column effects. The negative value of -0.0950 for the non-consumers of tobacco shows that the male death rate is lower for this group, whereas the positive values of 0.0075 and 0.1350 for the groups under 14 grams and above 25 grams respectively are an indication that tobacco consumers are more prone to death.
Have a go hero
For the variables G0 and G25 in the MDR2 matrix object, obtain a back-to-back stem-and-leaf display.
Summary
The median and its variants form the core measures of EDA, and you would have got the hang of them in the first section. The visualization techniques of EDA also comprise more than just the stem-and-leaf plot, letter values, and bagplot. As EDA is basically about your attitude and approach, it is important to realize that you can (and should) use any method that is instinctive and appropriate for the data on hand. We have also built our first regression model in the resistant line and seen how robust it is to outliers. Smoothing data and median polish are also advanced EDA techniques with which the reader became acquainted in their respective sections.
EDA is exploratory in nature and its findings may need further statistical validation. The next chapter on statistical inference addresses what Tukey calls confirmatory analysis. In particular, we look at techniques which give good point estimates of the unknown parameters. This is then backed with further techniques such as goodness-of-fit and confidence intervals for the probability distribution and the parameters respectively. After the estimation step, it is often required to verify whether the parameters meet certain specified levels. This problem is addressed through hypotheses testing in the next chapter.
Chapter 5: Statistical Inference
In the previous chapter, we came across numerous tools that gave first exploratory insights into the distribution of datasets through visual techniques as well as quantitative methods. The next step is the translation of these exploratory results into confirmatory ones, and the topics of the current chapter pursue this goal. In the Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics, we came across many important families of probability distributions. In practical scenarios, we have
data on hand and the goal is to infer about the unknown parameters of the
probability distributions. This chapter focuses on one method of inference
for the parameters using the maximum likelihood estimator (MLE). Another
way of approaching this problem is by fitting a probability distribution for
the data. The MLE is a point estimate of the unknown parameter that needs
to be supplemented with a range of possible values. This is achieved through
confidence intervals. Finally, the chapter concludes with the important topic
of hypothesis testing.
You will learn the following things after reading through this chapter:
Visualizing the likelihood function and identifying the MLE
Fitting the most plausible statistical distribution for a dataset
Confidence intervals for the estimated parameters
Hypothesis testing of the parameters of a statistical distribution
Using exploratory techniques we had our first exposure to understanding a dataset. As an example, in the octane dataset we found that the median of Method_2 was larger than that of Method_1. As explained in the previous chapter, we need to confirm whatever exploratory findings we had with a dataset. Recall that the histograms and stem-and-leaf displays suggest a normal distribution. A question that arises then is how do we assert the central value, typically the mean, of a normal distribution, and how do we conclude that the average of the Method_2 procedure exceeds that of Method_1. The former question is answered by estimation techniques and the latter by testing hypotheses. This forms the core of Statistical Inference.
Maximum likelihood estimator
Let us consider the discrete probability distributions as seen in the Discrete distributions section of Chapter 1, Data Characteristics. We saw that a binomial distribution is characterized by the parameters n and p, the Poisson distribution by λ, and so on. Here, the parameters completely determine the probabilities of the x values. However, when the parameters are unknown, which is the case in almost all practical problems, we collect data for the random experiment and try to infer about the parameters. This is essentially inductive reasoning, and the subject of Statistics is essentially inductive driven as opposed to the deductive reasoning of Mathematics. This forms the core difference between the two beautiful subjects. Assume that we have n observations $X_1, X_2, \ldots, X_n$ from an unknown probability distribution $f(x, \theta)$, where θ may be a scalar or a vector whose values are not known. Let us consider a few important definitions that form the core of statistical inference.
Random sample: If the observations $X_1, X_2, \ldots, X_n$ are independent of each other, we say that they form a random sample from $f(x, \theta)$. A technical consequence of the observations forming a random sample is that their joint probability density (mass) function can be written as the product of the individual density (mass) functions. If the unknown parameter θ is the same for all the n observations, we say that we have an independent and identically distributed (iid) sample.
Let X denote the score of Rahul Dravid in a century innings, and let $X_i$ denote the runs scored in the i-th century, i = 1, 2, ..., 36. The assumption of independence is then appropriate for all the values of $X_i$. Consider the problem of the R software installation on 10 different computers of the same configuration. Let X denote the time it takes for the software to install. Here, again, it may be easily seen that the installation times on the 10 machines, $X_1, \ldots, X_{10}$, are identical (same configuration of the computers) and independent. We will use the vector notation here to represent a sample of size n, $X = (X_1, X_2, \ldots, X_n)$, for the random variables, and denote the realized values of the random variables in lower case, $x = (x_1, x_2, \ldots, x_n)$, with $x_i$ representing the realized value of the random variable $X_i$. All the required tools are now ready, which enable us to define the likelihood function.
Likelihood function: Let $f(x, \theta)$ be the joint pmf (or pdf) for an iid sample of n observations of X. Here, the pmf and pdf respectively correspond to the discrete and continuous random variables. The likelihood function is then defined by:
$$L(\theta \mid x) = f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Of course, the reader may be amused about the difference between a likelihood function and a pmf (or pdf). The pmf is to be seen as a function of x given that the parameters are known, whereas in the likelihood function we look at a function where the parameters are unknown with x being known. This distinction is vital as we are looking for a tool where we do not know the parameters. The likelihood function may be interpreted as the probability function of θ conditioned on the value of x, and this is the main reason for identifying that value of θ, say $\hat{\theta}$, which leads to the maximum of $L(\theta \mid x)$, that is, $L(\hat{\theta} \mid x) \geq L(\theta \mid x)$. Let us visualize the likelihood function for some important families of probability distributions. The importance of visualizing the likelihood function is emphasized in Chapter 7, The Logistic Regression Model, and Chapters 1-4 of Pawitan (2001).
Visualizing the likelihood function
We had seen a few plots of the pmf/pdf in the Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics. Recall that we were plotting the pmf/pdf over the range of x. In those examples, we had assumed certain values for the parameters of the distributions. For the problems of statistical inference, we typically do not know the parameter values. Thus, the likelihood functions are plotted against the plausible parameter values θ. What does this mean? For example, the pmf for a binomial distribution is plotted for x values ranging from 0 to n. However, the likelihood function needs to be plotted against p values ranging over the unit interval [0, 1].
Example 5.1.1. The likelihood function of a binomial distribution: A box of electronic chips is known to contain a certain number of defective chips. Suppose we take a random sample of n chips from the box and make a note of the number of non-defective chips. The probability of a non-defective chip is p, and that of it being defective is 1 - p. Let X be a random variable which takes the value 1 if the chip is non-defective and 0 if it is defective. Then X ~ b(1, p), where p is not known. Define $t_x = \sum_{i=1}^{n} x_i$. The likelihood function is then given by:
$$L(p \mid t_x, n) = \binom{n}{t_x} p^{t_x} (1-p)^{n-t_x}$$
Suppose that the observed value of $t_x$ is 7, that is, we have 7 successes out of 10 trials. Now, the purpose of likelihood inference is to understand the probability distribution of p given the data $t_x$. This gives us an idea about the most plausible value of p, and hence it is worthwhile to visualize the likelihood function $L(p \mid t_x, n)$.
Example 5.1.2. The likelihood function of a Poisson distribution: The number of accidents at a particular traffic signal of a city, the number of flight arrivals during a specific time interval at an airport, and so on are some of the scenarios where the assumption of a Poisson distribution is appropriate to explain the numbers. Now let us consider a sample from a Poisson distribution. Suppose that the number of flight arrivals at an airport during the duration of an hour follows a Poisson distribution with an unknown rate λ. Suppose that we have the number of arrivals over ten distinct hours as 1, 2, 2, 1, 0, 2, 3, 1, 2, and 4. Using this data, we need to infer about λ. Towards this we will first plot the likelihood function. The likelihood function for a random sample of size n is given by:
$$L(\lambda \mid x) = \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda} \lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}$$
Before we consider the R program for visualizing the likelihood functions for the samples from the binomial and Poisson distributions, let us look at the likelihood function for a sample from the normal distribution.
Example 5.1.3. The likelihood function of a normal distribution: The CPU_Time variable from IO_Time may be assumed to follow a normal distribution. For this problem, we will simulate n = 25 observations from a normal distribution; for more details about the simulation, refer to the next chapter. Though we simulate the n observations with mean 10 and standard deviation 2, we will pretend that we do not actually know the mean value, with the assumption that the standard deviation is known to be 2. The likelihood function for a sample from a normal distribution with a known standard deviation is given by:
$$L(\mu \mid x, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\}$$
In our particular example, it is:
$$L(\mu \mid x, \sigma = 2) = \frac{1}{2^{3n/2} \pi^{n/2}} \exp\left\{ -\frac{1}{8} \sum_{i=1}^{n} (x_i - \mu)^2 \right\}$$
It is time for action!
Time for action – visualizing the likelihood function
We will now visualize the likelihood functions for the binomial, Poisson, and normal distributions discussed before:
1. Initialize the graphics windows for the three samples using par(mfrow=c(1,3)).
2. Declare the number of trials n and the number of successes x by n <- 10; x <- 7.
3. Set the sequence of p values with p_seq <- seq(0,1,0.01). For p_seq, obtain the probabilities for n = 10 and x = 7 by using the dbinom function: dbinom(x=7,size=n,prob=p_seq).
4. Next, obtain the likelihood function plot by running plot(p_seq, dbinom(x=7,size=n,prob=p_seq), xlab="p", ylab="Binomial Likelihood Function", "l").
5. Enter the data for the Poisson random sample into R using x <- c(1,2,2,1,0,2,3,1,2,4) and the number of observations by n <- length(x).
6. Declare the sequence of possible λ values through lambda_seq <- seq(0,5,0.1).
7. Plot the likelihood function for the Poisson distribution with plot(lambda_seq, dpois(x=sum(x),lambda=n*lambda_seq)…).
We are generating random observations from a normal distribution using the rnorm function. Each run of the rnorm function results in different values, and hence to ensure that you are able to reproduce the exact output as produced here, we will set the initial seed for the random generation tool with set.seed(123).
8. For the generation of random numbers, fix the seed value with set.seed(123). This is to simply ensure that we obtain the same result.
9. Simulate 25 observations from the normal distribution with mean 10 and standard deviation 2 using n <- 25; xn <- rnorm(n,mean=10,sd=2).
10. Consider the following range of μ values: mu_seq <- seq(9,11,0.05).
11. Plot the normal likelihood function with plot(mu_seq, dnorm(x=mean(xn), mean=mu_seq, sd=2)).
The detailed code for the preceding action is now provided:
# Time for Action: Visualizing the Likelihood Function
par(mfrow=c(1,3))
# Visualizing the Likelihood Function of a Binomial Distribution.
n <- 10; x <- 7
p_seq <- seq(0,1,0.01)
plot(p_seq, dbinom(x=7,size=n,prob=p_seq), xlab="p", ylab="Binomial
Likelihood Function", "l")
# Visualizing the Likelihood Function of a Poisson Distribution.
x <- c(1, 2, 2, 1, 0, 2, 3, 1, 2, 4); n = length(x)
lambda_seq <- seq(0,5,0.1)
plot(lambda_seq,dpois(x=sum(x),lambda=n*lambda_seq),
xlab=expression(lambda),ylab="Poisson Likelihood Function", "l")
# Visualizing the Likelihood Function of a Normal Distribution.
set.seed(123)
n <- 25; xn <- rnorm(n,mean=10,sd=2)
mu_seq <- seq(9,11,0.05)
plot(mu_seq,dnorm(x=mean(xn),mean=mu_seq,sd=2),"l",
xlab=expression(mu),ylab="Normal Likelihood Function")
Run the preceding code in your R session.
You will find an identical copy of the next plot on your computer screen too. What does the plot tell us? The likelihood function for the binomial distribution has very small values up to 0.4, and then it gradually peaks at about 0.7 and then declines sharply. This means that the values in the neighborhood of 0.7 are more likely to be the true value of p than points away from it. Similarly, the likelihood function plot for the Poisson distribution says that λ values less than 1 and greater than 3 are very unlikely to be the true value of the actual λ. The peak of the likelihood function appears at a value a little less than 2. The interpretation for the normal likelihood function is left as an exercise to the reader.
Figure 1: Some likelihood functions
What just happened?
We took our first step in the problem of estimation of parameters. Visualization of the likelihood function is a very important aspect and is often overlooked in many introductory textbooks. Moreover, and as it is natural, we did it in R!
Finding the maximum likelihood estimator
The likelihood function plot indicates the plausibility of the data generating mechanism for different values of the parameters. Naturally, the value of the parameter for which the likelihood function has the highest value is the most likely value of the parameter. This forms the crux of maximum likelihood estimation.
The value of θ that leads to the maximum value of the likelihood function $L(\theta \mid x)$ is referred to as the maximum likelihood estimate, abbreviated as MLE.
For the reader familiar with numerical optimization, it is not a surprise that calculus is useful for finding the optimum value of a function. However, we will not indulge in mathematics more than what is required here. We will note some finer aspects of numerical optimization. For an independent sample of size n, the likelihood function is a product of n functions, and it is very likely that we may very soon end up in the mathematical world of intractable functions. To a large extent, we can circumvent this problem by resorting to the logarithm of the function, which then transforms the problem of optimizing a product of functions to that of a sum of functions. That is, we will focus on optimizing $\log L(\theta \mid x)$ instead of $L(\theta \mid x)$.
An important consequence of using the logarithm is that the maximization of a product function translates into that of a sum function since log(ab) = log(a) + log(b). It may also be seen that the maximum point of the likelihood function is preserved under the logarithm transformation since for a > b, log(a) > log(b). Further, many numerical techniques know how to minimize a function rather than maximize it. Thus, instead of maximizing the log-likelihood function $\log L(\theta \mid x)$, we will minimize $-\log L(\theta \mid x)$ in R.
In the R package stats4 we are provided with a mle function, which returns the MLE. There are a host of probability distributions for which it is possible to obtain the MLE. We will continue the illustrations for the examples considered earlier in the chapter.
Example 5.1.4. Finding the MLE of a binomial distribution (continuation of Example 5.1.1): The negative log-likelihood function of the binomial distribution, sans the constant values (the combinatorial term is excluded since its value is independent of p), is given by:
$$-\log L(p \mid t_x, n) = -t_x \log p - (n - t_x) \log(1 - p)$$
The maximum likelihood estimator of p, obtained by differentiating the preceding equation with respect to p, equating the result to zero, and then solving the equation, is given by the sample proportion:
$$\hat{p} = \frac{t_x}{n}$$
An estimator of a parameter is denoted by accentuating the parameter with a hat. Though this is very easy to compute, we will resort to the useful function mle.
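As a quick numerical check (an aside, not part of the book's code), the negative log-likelihood for the data of Example 5.1.1 can be minimized with optimize, and the minimizer agrees with the closed form $t_x/n = 7/10$:
# Numerical check that the sample proportion maximizes the binomial likelihood
binom_nll <- function(p) -dbinom(7, size = 10, prob = p, log = TRUE)
optimize(binom_nll, interval = c(0.01, 0.99))$minimum   # approximately 0.7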
Example 5.1.5. MLE of a Poisson distribution (continuation of Example 5.1.2): The negative log-likelihood function, sans the constant values, is given by:
$$-\log L(\lambda \mid x) = n\lambda - \log\lambda \sum_{i=1}^{n} x_i$$
The MLE for λ admits a closed form, which can be obtained from calculus arguments, and it is given by:
$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n}$$
To obtain the MLE, we need to write exclusive code for the negative log-likelihood function. For the normal distribution, we will use the mle function. There is another method of finding the MLE other than the mle function available in the stats4 package. We consider it next. The R code will be given in the forthcoming action.
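A similar quick check (again an illustration rather than the book's code) for the Poisson data of Example 5.1.2 confirms that the closed-form MLE is simply the sample mean:
# The flight-arrival data of Example 5.1.2; the MLE of lambda is mean(x)
x <- c(1, 2, 2, 1, 0, 2, 3, 1, 2, 4)
mean(x)                                                  # 1.8
pois_nll <- function(lambda) -sum(dpois(x, lambda, log = TRUE))
optimize(pois_nll, interval = c(0.1, 10))$minimum        # also about 1.8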
Using the fitdistr function
In the previous examples, we needed to explicitly specify the negative log-likelihood function. The fitdistr function from the MASS package can be used to obtain the unknown parameters of a probability distribution (for a list of the probability distributions for which it applies, see ?fitdistr), and the fact that it uses maximum likelihood fitting complements our approach in this section.
Example 5.1.6. MLEs for Poisson and normal distributions: In the next action, we will use the fitdistr function from the MASS package for obtaining the MLEs in Example 5.1.2 and Example 5.1.3. In fact, using this function, we get the answers readily without the need to specify the negative log-likelihood explicitly.
Time for action – finding the MLE using mle and fitdistr functions
The mle function from the stats4 package will be used for obtaining the MLE for popular distributions such as the binomial, normal, and so on. The fitdistr function will be used too, which fits the distributions using MLEs.
1. Load the stats4 package with library(stats4).
2. Specify the number of successes in a vector format and the number of observations with x <- rep(c(0,1),c(3,7)); n <- length(x).
3. Define the negative log-likelihood function with a function:
binomial_nll <- function(prob) -sum(dbinom(x,size=1,prob,log=TRUE))
The code works as follows. The dbinom function is invoked from the stats package, and the option log=TRUE is exercised to indicate that we need the log of the probability (actually likelihood) values. The dbinom function returns a vector of probabilities for all the values of x. The sum, multiplied by -1, now returns the value of the negative log-likelihood.
4. Now, enter fit_binom <- mle(binomial_nll,start=list(prob=0.5),nobs=n) on the R console. The mle function optimizes the binomial_nll function defined in the previous step. Initial values, a guess or a legitimate value, are specified for the start option, and we also declare the number of observations available for this problem.
5. summary(fit_binom) will give details of the mle function applied on binomial_nll. The output is displayed in the next screenshot.
6. Specify the data for the Poisson distribution problem in x <- c(1,2,2,1,0,2,3,1,2,4); n <- length(x).
7. Define the negative log-likelihood function along the same lines as for the binomial distribution:
pois_nll <- function(lambda) -sum(dpois(x,lambda,log=TRUE))
8. Explore different options of the mle function by specifying the method, a guess of the smallest and largest values of the parameter, and the initial value as the median of the observations:
fit_poisson <- mle(pois_nll,start=list(lambda=median(x)),nobs=n, method = "Brent", lower = 0, upper = 10)
9. Get the answer by entering summary(fit_poisson).
10. Define the negative log-likelihood function for the normal distribution by:
normal_nll <- function(mean) -sum(dnorm(xn,mean,sd=2,log=TRUE))
11. Find the MLE of the normal distribution with fit_normal <- mle(normal_nll,start=list(mean=8),nobs=n).
12. Get the final answer with summary(fit_normal).
13. Load the MASS package: library(MASS).
14. Fit the x vector with a Poisson distribution by running fitdistr(x,"poisson") in R.
15. Fit the xn vector with a normal distribution by running fitdistr(xn,"normal").
Figure 2: Finding the MLE and related summaries
What just happened?
You have explored the possibility of finding the MLEs for many standard distributions using mle from the stats4 package and fitdistr from the MASS package. The main key in obtaining the MLE is the right construction of the log-likelihood function.
Confidence intervals
The MLE is a point estimate, and as such, on its own it is of little practical use. It would be more appropriate to give a coverage of parameter points which is most likely to contain the true unknown parameter. A general practice is to specify the coverage of the points through an interval and then consider specific intervals which have a specified probability. A formal definition is in order.
A confidence interval for a population parameter is an interval that is predicted to contain the parameter with a certain probability.
The common choice is to obtain either 95 percent or 99 percent confidence intervals. It is common to specify the coverage of the confidence interval through a significance level α (more about this in the next section), which is a small number close to 0. The 95 percent and 99 percent confidence intervals then correspond to $100(1-\alpha)$ percent intervals with α equal to 0.05 and 0.01 respectively. In general, a $100(1-\alpha)$ percent confidence interval says that if the experiment is performed many times over, we expect the intervals so constructed to contain the true parameter value $100(1-\alpha)$ percent of the time.
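The connection between α and the interval endpoints is easy to see in R. The following two lines (a small illustration, not from the original text) compute the standard normal critical points used by the 95 percent and 99 percent intervals discussed below:
# Upper alpha/2 critical points of the standard normal distribution
qnorm(0.05/2, lower.tail = FALSE)   # about 1.96 for a 95 percent interval
qnorm(0.01/2, lower.tail = FALSE)   # about 2.58 for a 99 percent interval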
Example 5.2.1. Confidence interval for a binomial proportion: Consider n Bernoulli trials $X_1, \ldots, X_n$ with the probability of success being p. We saw earlier that the MLE of p is:
$$\hat{p} = \frac{t_x}{n}$$
where $t_x = \sum_{i=1}^{n} x_i$. Theoretically, the expected value of $\hat{p}$ is p and its standard deviation is $\sqrt{p(1-p)/n}$. An estimate of the standard deviation is $\sqrt{\hat{p}(1-\hat{p})/n}$. For large n, and when both np and n(1-p) are greater than 5, using a normal approximation by virtue of the central limit theorem, a $100(1-\alpha)$ percent confidence interval for p is given by:
$$\left( \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\; \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right)$$
where $z_{\alpha/2}$ is the upper α/2 quantile of the standard normal distribution. The confidence intervals obtained by using the normal approximation are not reliable when the p value is near 0 or 1. Thus, if the lower confidence limit falls below 0, or the upper confidence limit exceeds 1, we will adopt the convention of taking them as 0 and 1 respectively.
Example 5.2.2. Confidence interval for a normal mean with known variance: Consider a random sample of size n from a normal distribution with an unknown mean μ and a known standard deviation σ. It may be shown that the MLE of the mean μ is the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ and that the distribution of $\bar{X}$ is again normal with mean μ and standard deviation $\sigma/\sqrt{n}$. Then, the $100(1-\alpha)$ percent confidence interval is given by:
$$\left( \bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$$
where $z_{\alpha/2}$ is the upper α/2 quantile of the standard normal distribution. The width of the preceding confidence interval is $2 z_{\alpha/2} \sigma / \sqrt{n}$. Thus, when the sample size is increased four times, the width decreases by half.
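The statement about the width can be verified with a couple of lines of R (an illustration with σ = 2, not the book's code): quadrupling the sample size from 25 to 100 halves the width of the 95 percent interval.
# Width of the 95 percent interval, 2 * z * sigma / sqrt(n), for sigma = 2
z <- qnorm(0.025, lower.tail = FALSE)
2 * z * 2 / sqrt(25)     # about 1.57 for n = 25
2 * z * 2 / sqrt(100)    # about 0.78 for n = 100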
Example 5.2.3. Confidence interval for a normal mean with unknown variance: We continue with a sample of size n. When the variance is not known, the steps become quite different. Since the variance is not known, we replace it by the sample variance:
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
The denominator is n - 1 since we have already estimated μ using the n observations. To develop the confidence interval for μ, consider the following statistic:
$$T = \frac{\bar{X} - \mu}{\sqrt{S^2/n}}$$
This new statistic T has a t-distribution with n - 1 degrees of freedom. The $100(1-\alpha)$ percent confidence interval for μ is then given by the following interval:
$$\left( \bar{X} - t_{n-1, \alpha/2} \frac{S}{\sqrt{n}},\; \bar{X} + t_{n-1, \alpha/2} \frac{S}{\sqrt{n}} \right)$$
where $t_{n-1, \alpha/2}$ is the upper α/2 quantile of a t random variable with n - 1 degrees of freedom.
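For small samples the t critical point is noticeably larger than the normal one, which is why the interval of Example 5.2.3 tends to be wider than that of Example 5.2.2. A quick comparison (an aside, not the book's code) for n = 25 and α = 0.05:
# t versus normal critical points for a 95 percent interval with n = 25
qt(0.025, df = 24, lower.tail = FALSE)    # about 2.064
qnorm(0.025, lower.tail = FALSE)          # about 1.960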
We will create functions for obtaining the confidence intervals for the preceding three examples. Many statistical tests in R return confidence intervals at desired levels. However, we will be encountering these tests in the last section of the chapter, and hence up to that point, we will content ourselves with user-defined functions and their applications.
Time for action – confidence intervals
We create functions that will enable us to obtain confidence intervals of the desired size:
1. Create a function for obtaining the confidence intervals of the proportion from a binomial distribution with the following function:
binom_CI = function(x, n, alpha) {
phat = x/n
ll = phat - qnorm(alpha/2,lower.tail=FALSE)*sqrt(phat*(1-phat)/n)
ul = phat + qnorm(alpha/2,lower.tail=FALSE)*sqrt(phat*(1-phat)/n)
return(paste("The ", 100*(1-alpha),"% Confidence Interval for Binomial Proportion is (", round(ll,4),",", round(ul,4),")", sep=''))
}
The arguments of the function are x, n, and alpha. That is, the user of the function needs to specify x, the number of successes out of the n Bernoulli trials, and the significance level α. First, we obtain the MLE $\hat{p}$ of the proportion p by calculating phat = x/n. To obtain the value of $z_{\alpha/2}$, we use the qnorm quantile function: qnorm(alpha/2, lower.tail=FALSE). The quantity $\sqrt{\hat{p}(1-\hat{p})/n}$ is computed with sqrt(phat*(1-phat)/n). The rest of the code for ll and ul is self-explanatory. We use the paste function to get the output in a convenient format along with the return function.
2. Consider the data in Example 5.2.1 where we have x = 7 and n = 10. Suppose that we require 95 percent and 99 percent confidence intervals. The respective α values for these confidence intervals are 0.05 and 0.01. Let us execute the binom_CI function on this data. That is, we need to run binom_CI(x=7,n=10,alpha=0.05) and binom_CI(x=7,n=10,alpha=0.01) on the R console. The output will be as shown in the next screenshot.
Thus, (0.416, 0.984) is the 95 percent confidence interval for p and (0.3267, 1.0733) is the 99 percent confidence interval for it. Since the upper confidence limit exceeds 1, we will use (0.3267, 1) as the 99 percent confidence interval for p.
3. We first give the function for the construction of confidence intervals for the mean μ of a normal distribution when the standard deviation is known:
normal_CI_ksd = function(x,sigma,alpha) {
xbar = mean(x)
n = length(x)
ll = xbar - qnorm(alpha/2,lower.tail=FALSE)*sigma/sqrt(n)
ul = xbar + qnorm(alpha/2,lower.tail=FALSE)*sigma/sqrt(n)
return(paste("The ", 100*(1-alpha),"% Confidence Interval for the Normal mean is (", round(ll,4),",",round(ul,4),")",sep=''))
}
The function normal_CI_ksd works differently from the earlier binomial one. Here, we provide the entire data to the function and specify the known value of the standard deviation and the significance level. First, we obtain the MLE $\bar{X}$ of the mean μ with xbar = mean(x). The R code qnorm(alpha/2,lower.tail=FALSE) is used to obtain $z_{\alpha/2}$. Next, $\sigma/\sqrt{n}$ is computed by sigma/sqrt(n). The code for ll and ul is straightforward to comprehend. The return and paste have the same purpose as in the previous example. Run the code for the normal_CI_ksd function.
4. Let us see a few examples, in continuation of Example 5.1.3, for obtaining the confidence interval for the mean of a normal distribution with the standard deviation known. To obtain the 95 percent and 99 percent confidence intervals for the xn data, where the standard deviation was known to be 2, run normal_CI_ksd(x=xn,sigma=2,alpha=0.05) and normal_CI_ksd(x=xn,sigma=2,alpha=0.01) on the R console. The output is consolidated in the next screenshot.
Thus, the 95 percent confidence interval for μ is (9.1494, 10.7173) and the 99 percent confidence interval is (8.903, 10.9637).
5. Create a function, normal_CI_uksd, for obtaining the confidence intervals for μ of a normal distribution when the standard deviation is unknown:
normal_CI_uksd = function(x,alpha) {
xbar = mean(x); s = sd(x)
n = length(x)
ll = xbar - qt(alpha/2,n-1,lower.tail=FALSE)*s/sqrt(n)
ul = xbar + qt(alpha/2,n-1,lower.tail=FALSE)*s/sqrt(n)
return(paste("The ", 100*(1-alpha),"% Confidence Interval for the Normal mean is (", round(ll,4),",",round(ul,4),")",sep=''))
}
We have an additional computation here in comparison with the earlier function. Since the standard deviation is unknown, we estimate it with s = sd(x). Furthermore, we need to obtain the quantile from the t-distribution with n - 1 degrees of freedom, and hence we have qt(alpha/2,n-1,lower.tail=FALSE) for the computation of $t_{n-1,\alpha/2}$. The rest of the details follow the previous function.
6. Let us obtain the 95 percent and 99 percent confidence intervals for the vector xn under the assumption that the variance is not known. The codes for achieving the results are normal_CI_uksd(x=xn,alpha=0.05) and normal_CI_uksd(x=xn,alpha=0.01).
Thus, the 95 percent confidence interval for the mean is (9.1518, 10.7419) and the 99 percent confidence interval is (8.8742, 10.9925).
Figure 3: Confidence intervals: Some raw programs
What just happened?
We created special functions for obtaining the confidence intervals and executed them for three different cases. However, our framework is quite generic in nature, and with a bit of care and caution, it may be easily extended to other distributions too.
Hypotheses testing
"Best consumed before six months from date of manufacture", "Two years warranty",
"Expiry date: June 20, 2015", and so on, are some of the likely assurances that you would
have easily come across. An analyst will have to arrive at such statements using the related
data. Let us rst dene a hypothesis.
Hypothesis: A hypothesis is an asseron about the unknown parameter of the probability
distribuon. For the quote of this secon, denong the least me (in months) ll which
an eatery will not be losing its good taste by
θ
, the hypothesis of interest will be
0:6H
θ
≥
.
It is common to denote the hypothesis of interest by
0
H
and it called the null hypothesis.
We want to test the null hypothesis against the alternave hypothesis that the consumpon
me is well before the six months' me, which in symbols is denoted by
1:6H
θ
<
. We will
begin with some important denions followed by related examples.
Test statistic: A statistic that is a function of the random sample is called a test statistic. For an observation X following a binomial distribution b(n, p), the test statistic for p will be X/n, whereas for a random sample from the normal distribution, the test statistic may be the mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ or the sample variance $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$, depending on whether the testing problem is for μ or $\sigma^2$. The statistical solution to reject (or not) the null hypothesis depends on the value of the test statistic. This leads us to the next definition.
depends on the value of test stasc. This leads us to the next denion.
Crical region: The set of values of the test stasc which leads to the rejecon of the null
hypothesis is known as the crical region.
We have made various kinds of assumpons for the random experiments. Naturally,
depending on the type of the probability family, namely binomial, normal, and so on,
we will have an appropriate tesng tool. Let us look at the very popular tests arising
in stascs.
Binomial test
A binomial random variable X, with distribution represented by b(n, p), is characterized by two parameters, n and p. Typically, n represents the number of trials and is known in most cases, and it is the probability of success p in which one is generally interested for the hypotheses related to it.
For example, an LCD panel manufacturer would like to test if the proportion of defectives is at most four percent. The panel manufacturer has randomly inspected 893 LCDs and found 39 to be defective. Here the hypotheses testing problem would be $H_0\colon p \leq 0.04$ vs $H_1\colon p > 0.04$.
A doctor would like to test whether the proportion of people in a drought-affected area having a viral infection such as pneumonia is 0.2, that is, $H_0\colon p = 0.2$ vs $H_1\colon p \neq 0.2$. The drought-affected area may encompass a huge geographical area, and as such it becomes really difficult to carry out a census over a very short period of a day or two. Thus the doctor selects the second-eldest member of a family and inspects 119 households for pneumonia. He records that 28 out of the 119 inspected people are suffering from pneumonia. Using this information, we need to help the doctor in testing the hypothesis of interest to him.
In general, the hypotheses testing problems for the binomial distribution will be along the lines of $H_0\colon p \leq p_0$ vs $H_1\colon p > p_0$, $H_0\colon p \geq p_0$ vs $H_1\colon p < p_0$, or $H_0\colon p = p_0$ vs $H_1\colon p \neq p_0$.
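For the one-sided problem of the LCD panel example, the p-value is simply a binomial tail probability, which can be computed by hand with pbinom; this is the quantity that binom.test with alternative="greater" reports in the next action (the line below is an illustration, not the book's code).
# P(X >= 39) when X ~ b(893, 0.04): the one-sided p-value for the LCD example
pbinom(38, size = 893, prob = 0.04, lower.tail = FALSE)   # about 0.31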
Let us see how the binom.test function in R helps in testing hypotheses problems related to the binomial distribution.
Time for action – testing the probability of success
We will use the R function binom.test for testing hypotheses problems related to p. This function takes as arguments n, the number of trials, x, the number of successes, p as the probability of interest, and alternative as one of greater, less, or two.sided.
1. Discover the details related to binom.test using ?binom.test, and then run example(binom.test) and ensure that you understand the default example.
2. For the LCD panel manufacturer, we have n = 893 and x = 39. The hypothesized value is p = 0.04. Enter this data first in R with the following code:
n_lcd <- 893; x_lcd <- 39; p_lcd <- 0.04
3. The alternative hypothesis is that the proportion of success p is greater than 0.04, which is passed to the binom.test function with the option alternative="greater", and hence the complete binom.test call for the LCD panel is delivered by:
binom.test(n=n_lcd,x=x_lcd,p=p_lcd,alternative="greater")
The output, in the following screenshot, shows that the estimated probability of success is 0.04367, which is certainly greater than 0.04. However, the p-value = 0.3103 indicates that we do not have enough evidence in the data to reject the null hypothesis $H_0\colon p \leq 0.04$. Note that binom.test also gives us a 95 percent confidence interval for p as (0.033, 1.000), and since the hypothesized probability lies in this interval, we arrive at the same conclusion. This confidence interval is recommended over the one developed in the previous section, and in particular we don't have to worry about the confidence limits either being less than 0 or greater than 1. Also, you may obtain any $100(1-\alpha)$ percent confidence interval of your choice with the argument conf.level.
4. For the doctor's problem, we have the data as:
n_doc <- 119; x_doc <- 28; p_doc <- 0.2
5. We need to test the null hypothesis against a two-sided alternative hypothesis, and though this is the default setting of binom.test, it is a good practice to specify it explicitly, at least until the user gains expertise:
binom.test(n=n_doc,x=x_doc,p=p_doc,alternative="two.sided")
The estimated probability of success, actually a patient's probability of having the viral infection, is 0.3193. Since the p-value associated with the test is p-value = 0.001888, we reject the null hypothesis $H_0\colon p = 0.2$. All the output is given in the following screenshot. The 95 percent confidence interval is (0.2369, 0.4110), again given by binom.test, which does not contain the hypothesized value of 0.2, and hence we can reject the null hypothesis based on the confidence interval.
Figure 4: Binomial tests for probability of success
What just happened?
The binomial distribution arises in a large number of proportionality test problems. In this section we used binom.test for testing problems related to the probability of success. We also note that the confidence intervals for p are given as a by-product of the application of binom.test.
Tests of proportions and the chi-square test
In Chapter 3, Data Visualization, we came across the Titanic and UCBAdmissions datasets. For the Titanic dataset, we may like to test if the Survived proportion across the Class is the same for the two Sex groups. Similarly, for the UCBAdmissions dataset, we may wish to know if the proportion of the Admitted candidates for the Male and Female groups is the same across the six Dept values. Thus, there is a need to generalize the binom.test function to a group of proportions. In this problem, we may have k proportions and the probability vector is specified by $p = (p_1, \ldots, p_k)$. The hypothesis problem may be specified as testing the null hypothesis $H_0\colon p = p_0$ against the alternative hypothesis $H_1\colon p \neq p_0$. Equivalently, in the vector form the problem is testing $H_0\colon (p_1, \ldots, p_k) = (p_{01}, \ldots, p_{0k})$ against $H_1\colon (p_1, \ldots, p_k) \neq (p_{01}, \ldots, p_{0k})$. The R extension of binom.test is given in prop.test.
Time for action – testing proportions
We will use the prop.test R function here for testing the equality of proportions for count data problems.
1. Load the required dataset with data(UCBAdmissions). For the UCBAdmissions dataset, first obtain the Admitted and Rejected frequencies for both the genders across the six departments with:
UCBA.Dept <- ftable(UCBAdmissions, row.vars="Dept", col.vars = c("Gender", "Admit"))
2. Calculate the Admitted proportions for Female across the six departments with:
p_female <- prop.table(UCBA.Dept[,3:4],margin=1)[,1]
Check p_female!
3. Test whether the proportions across the departments for Male match with Female using prop.test:
prop.test(UCBA.Dept[,1:2],p=p_female)
The proportions are not equal across the Gender, as p-value < 2.2e-16 rejects the null hypothesis that they are equal.
4. Next, we want to investigate whether the Male and Female survivor proportions are the same in the Titanic dataset. The approach is similar to the UCBAdmissions problem; run the following code:
T.Class <- ftable(Titanic, row.vars="Class", col.vars = c("Sex", "Survived"))
5. Compute the Female survivor proportions across the four classes with p_female <- prop.table(T.Class[,3:4],margin=1)[,1]. Note that this new variable, p_female, will overwrite the same named variable from the earlier steps.
6. Display p_female and then carry out the comparison across the two genders:
prop.test(T.Class[,1:2],p=p_female)
The p-value < 2.2e-16 clearly shows that the survivor proportions are not the same across the genders.
Figure 5: prop.test in action
Indeed, there is more complexity to the two datasets than mere proportions for the two genders. The web page http://www-stat.stanford.edu/~sabatti/Stat48/UCB.R has a detailed analysis of the UCBAdmissions dataset, and here we will simply apply the chi-square test to check if the admission percentage within each department is independent of the gender.
7. The data for the admission/rejection for each department is extractable through the third index in the array, that is, UCBAdmissions[,,i], across the six departments. Now, we apply chisq.test to check if the admission procedure is independent of the gender by running chisq.test(UCBAdmissions[,,i]) six times. The result has been edited in an external text editor and then a screenshot of it is provided next.
It appears that the Dept = A admits more males than females.
Figure 6: Chi-square tests for the UCBAdmissions problem
What just happened?
We used prop.test and chisq.test to test the proportions and independence of attributes. Functions such as ftable and prop.table and arguments such as row.vars, col.vars, and margin were useful to get the data in the right format for the analysis.
We will now look at an important family of tests for the normal distribution.
Tests based on normal distribution – one-sample
The normal distribution pops up in many instances of statistical analysis. In fact, Whittaker and Robinson have quoted on the popularity of the normal distribution as follows:
Everybody believes in the exponential law of errors [that is, the normal distribution]: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation.
We will not make an attempt to find out whether the experimenters are correct or the mathematicians, well, at least not in this section.
In general we will be dealing with either one-sample or two-sample tests. In the one-sample problem we have a random sample of size n, $(X_1, X_2, \ldots, X_n)$, from $N(\mu, \sigma^2)$. The hypotheses testing problem may be related to either or both of the parameters $(\mu, \sigma^2)$. The interesting and most frequent hypotheses testing problems for the normal distribution are listed here:
Testing for the mean with known variance $\sigma^2$: $H_0\colon \mu \leq \mu_0$ vs $H_1\colon \mu > \mu_0$; $H_0\colon \mu \geq \mu_0$ vs $H_1\colon \mu < \mu_0$; $H_0\colon \mu = \mu_0$ vs $H_1\colon \mu \neq \mu_0$
Testing for the mean with unknown variance $\sigma^2$: this is the same set of hypotheses problems as in the preceding point
Testing for the variance with unknown mean: $H_0\colon \sigma \leq \sigma_0$ vs $H_1\colon \sigma > \sigma_0$; $H_0\colon \sigma \geq \sigma_0$ vs $H_1\colon \sigma < \sigma_0$; $H_0\colon \sigma = \sigma_0$ vs $H_1\colon \sigma \neq \sigma_0$
In the case of known variance, the hypotheses testing problem for the mean is based on the Z-statistic given by:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
where $\bar{X} = \sum_{i=1}^{n} X_i / n$. The test procedure, known as the Z-test, for the hypotheses testing problem $H_0\colon \mu \leq \mu_0$ vs $H_1\colon \mu > \mu_0$ is to reject the null hypothesis at the α level of significance if $\bar{X} > \mu_0 + z_{\alpha}\,\sigma/\sqrt{n}$, where $z_{\alpha}$ is the upper α percentile of the standard normal distribution. For the hypotheses testing problem $H_0\colon \mu \geq \mu_0$ vs $H_1\colon \mu < \mu_0$, the critical/rejection region is $\bar{X} < \mu_0 - z_{\alpha}\,\sigma/\sqrt{n}$. Finally, for the testing problem $H_0\colon \mu = \mu_0$ vs $H_1\colon \mu \neq \mu_0$, we reject the null hypothesis if:
$$\frac{|\bar{X} - \mu_0|}{\sigma/\sqrt{n}} \geq z_{\alpha/2}$$
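These rejection rules translate directly into a few lines of R. The following sketch, with made-up data and a made-up σ (it is not one of the book's examples), computes the Z statistic and its two-sided p-value by hand; the z.test function introduced next automates exactly this.
# Hand computation of the Z statistic and two-sided p-value (illustrative data)
x_demo <- c(10.2, 9.8, 10.5, 10.1, 9.9, 10.4)
mu0 <- 10; sigma <- 0.3
z <- (mean(x_demo) - mu0) / (sigma / sqrt(length(x_demo)))
2 * pnorm(abs(z), lower.tail = FALSE)    # two-sided p-value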
An R function, z.test, is available in the PASWR package, which carries out the Z-test for each type of the hypotheses testing problem. Now, we consider the case when the variance $\sigma^2$ is not known. In this case, we first find an estimate of the variance using $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$.
The test procedure is based on the well-known t-statistic:
$$t = \frac{\bar{X} - \mu_0}{\sqrt{S^2/n}}$$
The test procedure based on the t-statistic is highly popular as the t-test or Student's t-test, and its implementation is available in R with the t.test function in the stats package. The distribution of the t-statistic under the null hypothesis is the t-distribution with (n - 1) degrees of freedom. The rationale behind the application of the t-test for the various types of hypotheses remains the same as for the Z-test.
For the hypotheses testing problem concerning the variance $\sigma^2$ of the normal distribution, we need to first compute the sample variance using $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$ and define the chi-square statistic:
$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2}$$
Under the null hypothesis, the chi-square statistic is distributed as a chi-square random variable with n - 1 degrees of freedom. In the case of a known mean, which is seldom the case, the test procedure is based on the test statistic $\chi^2 = \sum_{i=1}^{n} (X_i - \mu)^2 / \sigma_0^2$, which follows a chi-square random variable with n degrees of freedom. For the hypotheses problem $H_0\colon \sigma \geq \sigma_0$ vs $H_1\colon \sigma < \sigma_0$, the test procedure is to reject $H_0$ if $\chi^2 < \chi^2_{1-\alpha}$. Similarly, for the hypotheses problem $H_0\colon \sigma \leq \sigma_0$ vs $H_1\colon \sigma > \sigma_0$, the procedure is to reject $H_0$ if $\chi^2 > \chi^2_{\alpha}$; and finally, for the problem $H_0\colon \sigma = \sigma_0$ vs $H_1\colon \sigma \neq \sigma_0$, the test procedure rejects $H_0$ if either $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$. Here, $\chi^2_{\alpha}$ denotes the upper α critical point of the chi-square distribution with n - 1 degrees of freedom.
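The chi-square critical points used in these rules come straight from qchisq. For instance, with n = 10 observations and α = 0.05 (a small illustration, not the book's code):
# Chi-square critical points with n - 1 = 9 degrees of freedom
qchisq(0.05, df = 9, lower.tail = FALSE)    # upper critical point, about 16.92
qchisq(0.95, df = 9, lower.tail = FALSE)    # lower critical point, about 3.33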
Test examples. Let us consider some situations where the preceding set of hypotheses arises in a natural way:
A certain chemical experiment requires that the solution used as a reactant has a pH level greater than 8.4. It is known that the manufacturing process gives measurements which follow a normal distribution with a standard deviation of 0.05. The ten random observations are 8.30, 8.42, 8.44, 8.32, 8.43, 8.41, 8.42, 8.46, 8.37, and 8.42. Here, the hypotheses testing problem of interest is $H_0\colon \mu \geq 8.4$ vs $H_1\colon \mu < 8.4$. This problem is adapted from page 408 of Ross (2010).
Following a series of complaints that his company's LCD panels never last more than a year, the manufacturer wants to test if his LCD panels indeed fail within a year. Using historical data, he knows the standard deviation of the panel life due to the manufacturing process is two years. A random sample of 15 units from a freshly manufactured lot gives their lifetimes as 13.37, 10.96, 12.06, 13.82, 12.96, 10.47, 10.55, 16.28, 12.94, 11.43, 14.51, 12.63, 13.50, 11.50, and 12.87. You need to help the manufacturer validate his hypothesis.
Freund and Wilson (2003). Suppose that the mean weight of peanuts put in jars is required to be 8 oz. The variance of the weights is known to be 0.03, and the observed weights for 16 jars are 8.08, 7.71, 7.89, 7.72, 8.00, 7.90, 7.77, 7.81, 8.33, 7.67, 7.79, 7.79, 7.94, 7.84, 8.17, and 7.87. Here, we are interested in testing $H_0\colon \mu = 8.0$ vs $H_1\colon \mu \neq 8.0$.
New managers have been appointed at the respective places in the preceding bullets. As a consequence, the new managers are not aware of the standard deviation for the processes under their control. As an analyst, help them!
Suppose that the variance in the first example is not known and that it is a critical requirement that the variance be less than 7, that is, the null hypothesis is $H_0\colon \sigma^2 < 7$ while the alternative is $H_1\colon \sigma^2 \geq 7$.
Suppose that the variance test needs to be carried out for the third example, that is, the hypotheses testing problem is then $H_0\colon \sigma^2 = 0.03$ vs $H_1\colon \sigma^2 \neq 0.03$.
We will perform the necessary test for all the problems described before.
Time for action – testing one-sample hypotheses
We will require the R packages PASWR and PairedData here. The R functions t.test, z.test, and var.test will be useful for testing one-sample hypotheses problems related to a random sample from normal distributions.
1. Load the library with library(PASWR).
2. Enter the data for pH in R:
pH_Data <- c(8.30,8.42,8.44,8.32,8.43,8.41,8.42,8.46,8.37,8.42)
3. Specify the known standard deviation of pH in pH_sigma <- 0.05.
4. Use z.test from the PASWR library to test the hypotheses described in the first example with:
z.test(x=pH_Data,alternative="less",sigma.x=pH_sigma,mu=8.4)
The data is specified in the x option, the type of the hypotheses problem is specified by stating the form of the alternative hypothesis, the known standard deviation is fed through the sigma.x option, and finally, the mu option is used to specify the value of the mean under the null hypothesis. The output of the complete R program is collected in the forthcoming two screenshots.
The p-value is 0.4748, which means that we do not have enough evidence to reject the null hypothesis $H_0\colon \mu \geq 8.4$, and hence we continue to hold that the mean pH value is at least 8.4.
5. Get the data of the LCD panels in your session with:
LCD_Data <- c(13.37, 10.96, 12.06, 13.82, 12.96, 10.47, 10.55, 16.28, 12.94, 11.43, 14.51, 12.63, 13.50, 11.50, 12.87)
6. Specify the known standard deviation LCD_sigma <- 2 and run the z.test with:
z.test(x=LCD_Data,alternative="greater",sigma.x=LCD_sigma,mu=12)
The p-value is seen to be 0.1018, and hence we do not have enough evidence to conclude that the average lifetime of an LCD panel exceeds a year (12 months).
7. The complete program for the third problem can be given as follows:
peanuts <- c(8.08, 7.71, 7.89, 7.72, 8.00, 7.90, 7.77, 7.81, 8.33, 7.67, 7.79, 7.79, 7.94, 7.84, 8.17, 7.87)
peanuts_sigma <- 0.03
z.test(x=peanuts,sigma.x=peanuts_sigma,mu=8.0)
Since the p-value associated with this test is 2.2e-16, that is, very close to zero, we reject the null hypothesis $H_0\colon \mu = 8.0$.
8. If the variance(s) are not known and a test of the sample means is required, we need to move from z.test (in the PASWR library) to t.test (in the stats library):
t.test(x=pH_Data,alternative="less",mu=8.4)
t.test(x=LCD_Data,alternative="greater",mu=12)
t.test(x=peanuts,mu=8.0)
If the variance is not known, the conclusions for the problems related to pH and peanuts do not change. However, the conclusion changes for the LCD panel problem, where the p-value is now 0.06414 and the null hypothesis is rejected at the 10 percent level of significance.
For the problem of testing variances related to the one-sample problem, my initial idea was to write raw R code as there did not seem to be a function, package, and so on, which readily gives the answers. However, a more appropriate search at google.com revealed that an R package titled PairedData and created by Stephane Champely did certainly have a function, var.test, not to be confused with the same named function in the stats library, which is appropriate for testing problems related to the variance of a normal distribution. The problem is that the routine method of fetching the package using install.packages("PairedData") gives a warning message, namely package 'PairedData' is not available (for R version 2.15.1). This is the classic case of "so near, yet so far…". However, a deeper look into this will lead us to http://cran.r-project.org/src/contrib/Archive/PairedData/. This web page shows the various versions of the PairedData package. A Linux user should have no problem in using it, though the other OS users can't be helped right away. A Linux user needs to first download one of the zipped files, say PairedData_1.0.0.tar.gz, to a specific directory and, with the path of the GNOME Terminal in that directory, execute R CMD INSTALL PairedData_1.0.0.tar.gz. Now, we are ready to carry out the tests related to the variance of a normal distribution. A Windows user need not be discouraged by this scenario, as the important function var1.test is made available in the RSADBE package of the book. A more recent check on the CRAN website reveals that the PairedData package is again available for all OS platforms since April 18, 2013.
Figure 7: z.test and t.test for one-sample problem
9. Load the required library with library(PairedData).
10. Carry out the two variance testing problems (the last two in the preceding list of test examples) with:
var.test(x=pH_Data,alternative="greater",ratio=7)
var.test(x=peanuts,alternative="two.sided",ratio=0.03)
11. It may be seen from the next screenshot that the data does not lead to rejection of the null hypotheses. For a Windows user, the alternative is to use the var1.test function from the RSADBE package. That is, you need to run:
var1.test(x=pH_Data,alternative="greater",ratio=7)
var1.test(x=peanuts,alternative="two.sided",ratio=0.03)
You'll get the same results:
Figure 8: var.test from the PairedData library
What just happened?
The tests z.test, t.test, and var.test (from the PairedData library) have been used for testing hypotheses problems of varying kinds.
Have a go hero
Consider the testing problem $H_0\colon \sigma = \sigma_0$ vs $H_1\colon \sigma \neq \sigma_0$. The test statistic for this hypothesis testing problem is given by:
$$\chi^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma_0^2}$$
which follows a chi-square distribution with n - 1 degrees of freedom. Create your own new function for the testing problems and compare it with the results given by var.test of the PairedData package.
With the testing problem of the parameters of a normal distribution in the case of one sample behind us, we will next focus on the important two-sample problem.
Tests based on normal distribution – two-sample
The two-sample problem has data from two populations, where $(X_1, X_2, \ldots, X_{n_1})$ are $n_1$ observations from $N(\mu_1, \sigma_1^2)$ and $(Y_1, Y_2, \ldots, Y_{n_2})$ are $n_2$ observations from $N(\mu_2, \sigma_2^2)$. We assume that the samples within each population are independent of each other and further that the samples across the two populations are also independent. Similar to the one-sample problem, we have the following set of recurring and interesting hypotheses testing problems.
Mean comparison with known variances $\sigma_1^2$ and $\sigma_2^2$: $H_0\colon \mu_1 \leq \mu_2$ vs $H_1\colon \mu_1 > \mu_2$; $H_0\colon \mu_1 \geq \mu_2$ vs $H_1\colon \mu_1 < \mu_2$; $H_0\colon \mu_1 = \mu_2$ vs $H_1\colon \mu_1 \neq \mu_2$
Mean comparison with unknown variances $\sigma_1^2$ and $\sigma_2^2$: the same set of hypotheses problems as before. We make an additional assumption here that the variances $\sigma_1^2$ and $\sigma_2^2$ are equal, though unknown.
The variances comparison: $H_0\colon \sigma_1 \leq \sigma_2$ vs $H_1\colon \sigma_1 > \sigma_2$; $H_0\colon \sigma_1 \geq \sigma_2$ vs $H_1\colon \sigma_1 < \sigma_2$; $H_0\colon \sigma_1 = \sigma_2$ vs $H_1\colon \sigma_1 \neq \sigma_2$
First define the sample means for the two populations as $\bar{X} = \sum_{i=1}^{n_1} X_i / n_1$ and $\bar{Y} = \sum_{i=1}^{n_2} Y_i / n_2$. For the case of known variances $\sigma_1^2$ and $\sigma_2^2$, the test statistic is defined by:
( )
1 2
2 2
1 1 2 2
/ /
XY
Zn n
µµ
σ σ
−− −
=+
Under the null hypotheses,
( )
2 2
1 1 2 2
/ / /Z X Y n n
σ σ
= − + follows a standard normal distribuon.
The test procedure for the problem 0 1 2 1 1 2
vs: :H H
µµ
µµ
> ≤ is to reject H0 if
zz
α
≥
, and the
procedure for
0 1 2 1 1 2
vs: :H H
µµ µµ
< ≥
is to reject H0 if
zz
α
<
. As expected and on earlier
intuive lines, the test procedure for the hypotheses problem
0 1 2 1 1 2
vs: :H H
µµ µµ
= ≠
is to
reject H0 as /2
zz
α
≥.
Let us now consider the case when the variances 2
1
σ
and 2
2
σ
are not known and assumed
(or known) to be equal. In this case, we can't use the Z-test any further and need to look at
the esmator of the common variance. For this, we dened the pooled variance esmator
as follows:
2 2 2
1 2
1 2 1 2
1 1
2 2
p x y
n n
S S S
n n n n
− −
= +
+ − + −
where
2
x
S
and
2
y
S
are the sampling variances of the two populaons. Dene the t-stasc
as follows:
( )
2
1 2
1/ 1/
p
XY
tS n n
−
=+
The test procedure for the set of the three hypotheses tesng problems is then to reject
the null hypotheses if
12
2,
nn
tt
α
+−
<
,
12
2,nn
tt
α
+−
>
, or
12
2, /2nn
tt
α
+−
<
.
Finally, we focus on the problem of testing the variances across the two samples. Here, the test statistic is given by:
$$F = \frac{S_x^2}{S_y^2}$$
The test procedures are then to respectively reject the null hypotheses of the testing problems $H_0: \sigma_1 \le \sigma_2$ vs $H_1: \sigma_1 > \sigma_2$, $H_0: \sigma_1 \ge \sigma_2$ vs $H_1: \sigma_1 < \sigma_2$, and $H_0: \sigma_1 = \sigma_2$ vs $H_1: \sigma_1 \neq \sigma_2$ if $F \ge F_{n_1-1,\,n_2-1,\,\alpha}$, $F \le F_{n_1-1,\,n_2-1,\,1-\alpha}$, or $F \ge F_{n_1-1,\,n_2-1,\,\alpha/2}$ (equivalently $F \le F_{n_1-1,\,n_2-1,\,1-\alpha/2}$).
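As a quick illustration of the preceding formulas, here is a minimal sketch (not from the book) that computes the two-sample Z statistic and the pooled t statistic directly; the function names and arguments are illustrative:
two_sample_z <- function(x, y, sigma.x, sigma.y) {
  # Z = (xbar - ybar)/sqrt(sigma.x^2/n1 + sigma.y^2/n2), under H0: mu1 = mu2
  (mean(x) - mean(y)) / sqrt(sigma.x^2/length(x) + sigma.y^2/length(y))
}
pooled_t <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)  # pooled variance
  (mean(x) - mean(y)) / sqrt(sp2 * (1/n1 + 1/n2))
}
# These should agree with the statistics reported by z.test from PASWR and by
# t.test(..., var.equal = TRUE), respectively.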
Let us now consider some scenarios where we have the previously listed hypotheses testing problems.
Test examples: Let us consider some situations where the preceding set of hypotheses arises in a natural way.
In continuation of the chemical experiment problem, let us assume that the chemists have come up with a new method of obtaining the same solution as discussed in the previous section. For the new technique, the standard deviation continues to be 0.05, and 12 observations for the new method yield the following measurements: 8.78, 8.85, 8.74, 8.83, 8.82, 8.79, 8.82, 8.74, 8.84, 8.78, 8.75, 8.81. Now, this new solution is acceptable if its mean is greater than that for the earlier one. Thus, the hypotheses testing problem is now $H_0: \mu_{NEW} \le \mu_{OLD}$ vs $H_1: \mu_{NEW} > \mu_{OLD}$.
Ross (2008), page 451. The precision of instruments in metal cutting is a serious business: the cut pieces can't be significantly shorter than the target, nor longer than it. Two machines are used to cut 10 pieces of steel each, and their measurements are respectively 122.4, 123.12, 122.51, 123.12, 122.55, 121.76, 122.31, 123.2, 122.48, 121.96, and 122.36, 121.88, 122.2, 122.88, 123.43, 122.4, 122.12, 121.78, 122.85, 123.04. The standard deviation of the length of a cut is known to be equal to 0.5. We need to test whether the average cut length is the same for the two machines.
For both the preceding problems, also assume that though the variances are equal, they are not known. Complete the hypotheses testing problems using t.test.
Freund and Wilson (2003), page 199. The monitoring of the amount of peanuts being put in jars is an important issue from a quality control viewpoint. The consistency of the weights is of prime importance, and the manufacturer has been introduced to a new machine which is supposed to give more accuracy in the weights of the peanuts put in the jars. With the new device, 11 jars were tested and their weights found to be 8.06, 8.64, 7.97, 7.81, 7.93, 8.57, 8.39, 8.46, 8.28, 8.02, 8.39, whereas a sample of nine jars from the previous machine weighed 7.99, 8.12, 8.34, 8.17, 8.11, 8.03, 8.14, 8.14, 7.87. Now, the task is to test $H_0: \sigma_{NEW} = \sigma_{OLD}$ vs $H_1: \sigma_{NEW} > \sigma_{OLD}$.
Let us do the tests for the preceding four problems in R.
Time for action – testing two-sample hypotheses
For the problem of testing hypotheses for the means arising from two populations, we will be using the functions z.test and t.test.
1. As earlier, load the PASWR package with library(PASWR).
2. Carry out the Z-test using z.test and the options x, y, sigma.x, and sigma.y:
pH_Data <- c(8.30, 8.42, 8.44, 8.32, 8.43, 8.41, 8.42, 8.46, 8.37, 8.42)
pH_New <- c(8.78, 8.85, 8.74, 8.83, 8.82, 8.79, 8.82, 8.74, 8.84, 8.78, 8.75, 8.81)
z.test(x=pH_Data,y=pH_New,sigma.x=0.05,sigma.y=0.05,alternative="less")
The p-value is very small (2.2e-16), indicating that we reject the null hypothesis $H_0: \mu_{NEW} \le \mu_{OLD}$ in favor of $H_1: \mu_{NEW} > \mu_{OLD}$.
3. For the steel length cut data problem, run the following code:
length_M1 <- c(122.4, 123.12, 122.51, 123.12, 122.55, 121.76, 122.31, 123.2, 122.48, 121.96)
length_M2 <- c(122.36, 121.88, 122.2, 122.88, 123.43, 122.4, 122.12, 121.78, 122.85, 123.04)
z.test(x=length_M1,y=length_M2,sigma.x=0.5,sigma.y=0.5)
The display of p-value = 0.8335 shows that the machines do not cut the steel in different ways.
4. If the variances are equal but not known, we need to use t.test instead of the z.test:
t.test(x=pH_Data,y=pH_New,alternative="less")
t.test(x=length_M1,y=length_M2)
5. The p-values for the two hypotheses problems are p-value = 3.95e-13 and p-value = 0.8397. We leave the interpretation aspect to the reader.
6. For the fourth problem, we have the following R program:
machine_new <- c(8.06, 8.64, 7.97, 7.81, 7.93, 8.57, 8.39, 8.46,
8.28, 8.02, 8.39)
machine_old <- c(7.99, 8.12, 8.34, 8.17, 8.11, 8.03, 8.14, 8.14,
7.87)
t.test(machine_new,machine_old, alternative="greater")
Again, we have p-value = 0.1005!
What just happened?
The functions t.test and z.test were simply extensions from the one-sample case to the two-sample tests.
Have a go hero
In the one-sample case, you used var.test for these datasets to compare the variance against a hypothesized value. Now, test for the variances in the two-sample case using var.test with the appropriate hypotheses. For example, test whether the variances are equal for pH_Data and pH_New. Find more details of the test with ?var.test.
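As a starting point, the following minimal sketch, assuming pH_Data and pH_New are defined as in the preceding steps, uses the two-sample F test offered by var.test in base R:
var.test(x = pH_Data, y = pH_New, alternative = "two.sided")
# The reported F statistic is the ratio of the two sample variances.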
Summary
In this chapter we have introduced "statistical inference", which in common usage consists of three parts: estimation, confidence intervals, and hypotheses testing. We began the chapter with the importance of the likelihood and obtained the MLE for many of the standard probability distributions using built-in modules. Later, simply to maintain the order of concepts, we defined functions exclusively for obtaining the confidence intervals. Finally, the chapter considered important families of tests that are useful across many important stochastic experiments. In the next chapter we will introduce the linear regression model, which more formally constitutes the applied face of the subject.
6
Linear Regression Analysis
In the Visualization techniques for continuous variable data section of Chapter
3, Data Visualization, we have seen different data visualization techniques
which help in understanding the data variables (boxplots and histograms) and
their interrelationships (matrix of scatter plots). We had seen in Example 4.6.1.
Resistant line for the IO-CPU time an illustration of the resistant line, where
CPU_Time depends linearly on the No_of_IO variable. The pairs function's output in Example 3.2.9. Octane rating of gasoline blends indicated that the
mileage of a car has strong correlations with the engine-related characteristics,
such as displacement, horsepower, torque, the number of transmission speeds,
and the type of transmission being manual or automatic. Further, the mileage
of a car also strongly depends on the vehicle dimensions, such as its length,
width, and weight. The question addressed in this chapter is meant to further
these initial findings through a more appropriate model. Now, we take the next
step forward and build linear regression models for the problems. Thus, in this
chapter we will provide more concrete answers for the mileage problem.
The first linear regression model was built by Sir Francis Galton in 1908. The word regression implies towards-the-center. The covariates, also known as independent variables, features, or regressors, have a regressive effect on the output, also called the dependent variable or regressand. Since the covariates are allowed, actually assumed, to affect the output in linear increments, we call the model the linear regression model. Linear regression models provide an answer for the correlation between the regressand and the regressors, and as such do not really establish causation. As it will be seen later in the chapter, using data, we will be able to understand the mileage of a car as a linear function of the car-related dynamics. From a purely scientific point of view, the mileage should really depend on complicated formulas of the car's speed, road conditions, the climate, and so on. However, it will be seen that linear models work just fine for the problem despite not really going into the technical details. However, there will also be a price to pay, in the sense that most regression models work well when the range of the variables is well defined, and an attempt to extrapolate the results usually does not give satisfactory answers. We will begin with a simple linear regression model where we have one dependent variable and one covariate.
At the conclusion of the chapter, you will be able to build a regression model through the following steps:
Building a linear regression model and interpreting it
Validating the model assumptions
Identifying the effect of every single observation, covariate, as well as the output
Fixing the problem of dependent covariates
Selecting the optimal linear regression model
The simple linear regression model
In Example 4.6.1. Resistant line for the IO-CPU time of Chapter 4, Exploratory Analysis, we built a resistant line for CPU_Time as a function of the No_of_IO processes. The results were satisfactory in the sense that the fitted line was very close to covering all the data points; refer to Figure 7 of Chapter 4, Exploratory Analysis. However, we need more statistical validation of the estimated values of the slope and intercept terms. Here we take a different approach and state the linear regression model in more technical detail.
The simple linear regression model is given by $Y = \beta_0 + \beta_1 X + \epsilon$, where X is the covariate/independent variable, Y is the regressand/dependent variable, and $\epsilon$ is the unobservable error term. The parameters of the linear model are specified by $\beta_0$ and $\beta_1$. Here $\beta_0$ is the intercept term and corresponds to the value of Y when x = 0. The slope term $\beta_1$ reflects the change in the Y value for a unit change in X. It is also common to refer to the $\beta_0$ and $\beta_1$ values as regression coefficients. To understand the regression model, we begin with n pairs of observations $(Y_1, X_1), \ldots, (Y_n, X_n)$, with each pair being completely independent of the other. We make an assumption of normal and independent and identically distributed (iid) errors; specifically, $\epsilon \sim N(0, \sigma^2)$, where $\sigma^2$ is the variance of the errors. The core assumptions of the model are listed as follows:
All the observations are independent
The regressand depends linearly on the regressors
The errors are normally distributed, that is, $\epsilon \sim N(0, \sigma^2)$
We need to find all the unknown parameters $\beta_0$, $\beta_1$, and $\sigma^2$. Suppose we have n independent observations. Statistical inference for the required parameters may be carried out using the maximum likelihood function as described in the Maximum likelihood estimator section of Chapter 5, Statistical Inference. The popular technique for the linear regression model is the least squares method, which identifies the parameters by minimizing the error sum of squares for the model, and under the assumptions made thus far it agrees with the MLE. Let $\beta_0$ and $\beta_1$ be a choice of parameters. Then the residual, the distance between the actual point and the model prediction, made by using the proposed choice of $(\beta_0, \beta_1)$ on the i-th pair of observations $(Y_i, X_i)$, is defined by:
$$e_i = Y_i - \left(\beta_0 + \beta_1 X_i\right), \quad i = 1, 2, \ldots, n$$
Let us now specify different values for the pair $(\beta_0, \beta_1)$ and visualize the residuals for them.
What happens to the arbitrary choice of parameters?
For the IO_Time dataset, the scatter plot suggests that the intercept term is about 0.05. Further, the resistant line gives an estimate of the slope at about 0.04. We will take three pairs of guesses for $(\beta_0, \beta_1)$: (0.05, 0.05), (0.1, 0.04), and (0.15, 0.03). We will now plot the data and see the different residuals for the three pairs of guesses.
Time for action – the arbitrary choice of parameters
1. We begin with reasonable guesses of the slope and intercept terms for a simple linear regression model. The idea is to inspect the difference between the fitted line and the actual observations. Invoke the graphics windows using par(mfrow=c(1,3)).
2. Obtain the scatter plot of CPU_Time against No_of_IO with:
plot(No_of_IO,CPU_Time,xlab="Number of Processes",ylab="CPU Time", ylim=c(0,0.6),xlim=c(0,11))
3. For the guessed regression line with the values of $(\beta_0, \beta_1)$ being (0.05, 0.05), plot a line on the scatter plot with abline(a=0.05,b=0.05,col="blue").
4. Define a function which will find the y value for the guess of the pair (0.05, 0.05) using myline1 <- function(x) 0.05*x+0.05.
5. Plot the error (residuals) made due to the choice of the pair (0.05, 0.05) from the actual points using the following loop, and give a title for the first pair of the guess:
for(i in 1:length(No_of_IO)){
lines(c(No_of_IO[i], No_of_IO[i]), c(CPU_Time[i], myline1(No_of_IO[i])), col="blue", pch=10)
}
title("Residuals for the First Guess")
6. Repeat the preceding exercise for the last two pairs of guesses for the regression coefficients $(\beta_0, \beta_1)$.
The complete R program is given as follows:
par(mfrow=c(1,3))
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
  ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.05, b=0.05, col="blue")
myline1 <- function(x) 0.05*x + 0.05
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
    c(IO_Time$CPU_Time[i], myline1(IO_Time$No_of_IO[i])), col="blue", pch=10)
}
title("Residuals for the First Guess")
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
  ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.1, b=0.04, col="green")
myline2 <- function(x) 0.04*x + 0.1
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
    c(IO_Time$CPU_Time[i], myline2(IO_Time$No_of_IO[i])), col="green", pch=10)
}
title("Residuals for the Second Guess")
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
  ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.15, b=0.03, col="yellow")
myline3 <- function(x) 0.03*x + 0.15
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
    c(IO_Time$CPU_Time[i], myline3(IO_Time$No_of_IO[i])), col="yellow", pch=10)
}
title("Residuals for the Third Guess")
Figure 1: Residuals for the three choices of regression coefficients
What just happened?
We have just executed an R program which displays the residuals for arbitrary choices of the regression parameters. The displayed result is the preceding screenshot.
In the preceding R program, we first plot CPU_Time against No_of_IO. The first choice of the line is plotted by using the abline function, and we specify the required intercept and slope through a = 0.05 and b = 0.05. From this straight line (in blue), we need to obtain the magnitude of the error made at each original point, shown through vertical segments from the points to the line. This is achieved through the for loop, where the lines function joins the points and the line.
For the pair (0.05, 0.05) as a guess of $(\beta_0, \beta_1)$, we see that there is a progression in the residual values as x increases, and it is the other way around for the guess of (0.15, 0.03). In either case, we are making large mistakes (residuals) for certain x values. The middle plot for the guess (0.1, 0.04) does not seem to have large residual values. This choice may be better than the other two choices. Thus, we need to define a criterion which enables us to find the best values of $(\beta_0, \beta_1)$ in some sense. The criterion is to minimize the sum of squared errors:
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} e_i^2$$
Where:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left\{ y_i - \left(\beta_0 + \beta_1 x_i\right) \right\}^2$$
Here, the summation is over all the observed pairs $(y_i, x_i), i = 1, 2, \ldots, n$. The technique of minimizing the error sum of squares is known as the method of least squares, and for the simple linear regression model, the values of $(\beta_0, \beta_1)$ which meet the criterion are given by:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Where:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{and} \quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
And:
$$S_{xx} = \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2 \quad \text{and} \quad S_{xy} = \sum_{i=1}^{n} y_i\left(x_i - \bar{x}\right)$$
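To see the preceding formulas at work, here is a minimal sketch (not from the book), assuming the IO_Time data frame with the columns No_of_IO and CPU_Time is available:
x <- IO_Time$No_of_IO; y <- IO_Time$CPU_Time
Sxx <- sum((x - mean(x))^2)                 # S_xx
Sxy <- sum(y * (x - mean(x)))               # S_xy
beta1_hat <- Sxy / Sxx                      # slope estimate
beta0_hat <- mean(y) - beta1_hat * mean(x)  # intercept estimate
c(beta0_hat, beta1_hat)
# These values should agree with coef(lm(CPU_Time ~ No_of_IO, data = IO_Time)).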
We will now learn how to use R for building a simple linear regression model.
Building a simple linear regression model
We will use the R function lm for the required construction. The lm function creates an object of class lm, which consists of an ensemble of the fitted regression model. Through the following exercise, you will learn:
The basic construction of an lm object
The criteria which signify the model significance
The criteria which signify the variable significance
The variation of the output explained by the inputs
The relationship is specified by a formula in R, and the details related to the generic form may be obtained by entering ?formula in the R console. That is, the lm function accepts a formula object for the model that we are attempting to build. A data.frame may also be explicitly specified, which consists of the required data. We need to model CPU_Time as a function of No_of_IO, and this is carried out by specifying CPU_Time ~ No_of_IO. The function lm is wrapped around the formula to obtain our first linear regression model.
Time for action – building a simple linear regression model
We will build the simple linear regression model using the lm function with its useful arguments.
1. Create a simple linear regression model for CPU_Time as a function of No_of_IO by keying in IO_lm <- lm(CPU_Time ~ No_of_IO, data=IO_Time).
2. Verify that IO_lm is of the lm class with class(IO_lm).
3. Find the details of the fitted regression model using the summary function: summary(IO_lm).
The output is given in the following screenshot:
Figure 2: Building the first simple linear regression model
The first question you should ask yourself is, "Is the model significant overall?". The answer is provided by the p-value of the F-statistic for the overall model, which appears in the final line of summary(IO_lm). If the p-value is close to 0, it implies that the model is useful. A rule of thumb for the significance of the model is that this p-value should be less than 0.05. The general rule is that if you need the model significance at a certain percentage, say P, then the p-value of the F-statistic should be less than (1 - P/100).
Now that we know that the model is useful, we can ask whether the independent variable, as well as the intercept term, are significant or not. The answer to this question is provided by Pr(>|t|) for the variables in the summary. R has a general way of displaying the highest significance level of a term by using ***, **, *, and . in the Signif. codes. This display may be easily compared with the review of a movie or a book! Just as with general ratings, where more stars indicate a better product, in our context more stars indicate that the variable is more significant for the built model. In our linear model, we find No_of_IO to be highly significant. The estimated coefficient of No_of_IO is 0.04076. This coefficient has the interpretation that for a unit increase in the number of IOs, CPU_Time is expected to increase by 0.04076.
Now that we know that the model, as well as the independent variable, are significant, we need to know how much of the variability in CPU_Time is explained by No_of_IO. The answer to this question is provided by the measure R2 (not to be confused with the letter R for the software), which when multiplied by 100 gives the percentage of variation in the regressand explained by the regressor. The term R2 is also called the coefficient of determination. In our example, 98.76 percent of the variation in CPU_Time is explained by No_of_IO; see the value associated with Multiple R-squared in summary(IO_lm). The R2 measure does not consider the number of parameters estimated or the number of observations n in a model. A more robust measure, which takes into consideration the number of parameters and observations, is provided by Adjusted R-squared, which is 98.6 percent.
We have thus far not commented on the first numerical display produced by the summary function. This relates to the residuals, and the display is a basic summary of the residual values. The residuals vary from -0.016509 to 0.024006, which are not very large in comparison with the CPU_Time values; check with summary(CPU_Time), for instance. Also, the median of the residual values is very close to zero, and this is an important criterion as the median of the standard normal distribution is 0.
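If you prefer to extract these quantities programmatically rather than read them off the printed summary, a small sketch (assuming IO_lm from the preceding steps) is:
IO_lm_summary <- summary(IO_lm)
IO_lm_summary$r.squared      # Multiple R-squared
IO_lm_summary$adj.r.squared  # Adjusted R-squared
IO_lm_summary$fstatistic     # F statistic with its degrees of freedom
IO_lm_summary$coefficients   # estimates, standard errors, t values, Pr(>|t|)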
What just happened?
You have fitted a simple linear regression model where the independent variable is No_of_IO and the dependent variable (output) is CPU_Time. The important quantities to look for, the model significance, the regression coefficients, and so on, have been clearly illustrated.
Have a go hero
Load the dataset anscombe from the datasets package. The anscombe dataset has four pairs of datasets in x1, x2, x3, x4, y1, y2, y3, and y4. Fit a simple regression model for all the four pairs and obtain the summary for each pair. Make your comments on the summaries. Pay careful attention to the details of the summary function. If you need further help, simply try out example(anscombe).
We will next look at the ANOVA (Analysis of Variance) method for the regression model, and also obtain the confidence intervals for the model parameters.
ANOVA and the confidence intervals
The summary function of the lm object specifies the p-value for each variable in the model, including the intercept term. Technically, the hypothesis problem being tested is $H_{0j}: \beta_j = 0$, $j = 0, 1$, against the corresponding alternative hypothesis $H_{1j}: \beta_j \neq 0$, $j = 0, 1$. This testing problem is technically different from the simultaneous hypothesis testing of $H_0: \beta_0 = \beta_1 = 0$ against the alternative that at least one of the regression coefficients is different from 0. The ANOVA technique gives the answer to the latter null hypothesis of interest.
For more details about the ANOVA technique, you may refer to http://en.wikipedia.org/wiki/Analysis_of_variance. Using the anova function, it is very simple in R to obtain the ANOVA table for a linear regression model. Let us apply it for our IO_lm linear regression model.
Time for action – ANOVA and the confidence intervals
The R functions anova and confint respectively help obtain the ANOVA table and confidence intervals from the lm objects. Here, we use them for the IO_lm regression object.
1. Use the anova function on the IO_lm lm object to obtain the ANOVA table by using IO_anova <- anova(IO_lm).
2. Display the ANOVA table by keying in IO_anova in the console.
3. The 95 percent confidence intervals for the intercept and the No_of_IO variable are obtained by confint(IO_lm).
The output in R is as follows:
Figure 3: ANOVA and the confidence intervals for the simple linear regression model
The ANOVA table confirms that the variable No_of_IO is indeed significant. Note the difference in the criteria for confirming this with respect to summary(IO_lm). In the former case, the significance was arrived at using the t-statistic, and here we have used the F-statistic; precisely, we check the significance of the variance explained by the input variable. We now give the tool for obtaining confidence intervals.
Check whether or not the estimated values of the parameters fall within the 95 percent confidence intervals. The preceding results show that we indeed have a very good linear regression model. However, we also made a host of assumptions at the beginning of the section, and a good practice is to ask how valid they are for the experiment. We next consider the problem of validation of the assumptions.
What just happened?
The ANOVA table is a very fundamental block for a regression model, and it gives the split of the sum of squares for the variable(s) and the error term. The difference between ANOVA and the summary of the linear model object is in the p-values reported by them, respectively, as Pr(>F) and Pr(>|t|). You also found a method for obtaining the confidence intervals for the independent variables of the regression model.
Model validation
Violations of the assumptions may arise in more than one way. Tattar, et. al. (2012) and Kutner, et. al. (2005) discuss the numerous ways in which the assumptions are violated, and an adaptation of the methods mentioned in these books is now considered:
The regression function is not linear. In this case, we expect the residuals to have a pattern which is not linear when viewed against the regressors. Thus, a plot of the residuals against the regressors is expected to indicate whether this assumption is violated.
The error terms do not have constant variance. Note that we made the assumption $\epsilon \sim N(0, \sigma^2)$, that is, the magnitude of the errors does not depend on the corresponding x or y value. Thus, we expect the plot of the residuals against the predicted y values to reveal whether this assumption is violated.
The error terms are not independent. A plot of the residuals against the serial number of the observations indicates whether the error terms are independent or not. We typically expect this plot to exhibit random scatter, with no systematic pattern, if the errors are independent. If any systematic pattern is observed, we conclude that the errors are not independent.
The model fits all but one or a few outlier observations. Outliers are a huge concern in any analytical study, as even a single outlier has a tendency to destabilize the entire model. A simple boxplot of the residuals indicates the presence of an outlier. If any outlier is present, such observations need to be removed and the model needs to be rebuilt. The current step of model validation needs to be repeated for the rebuilt model. In fact, the process needs to be iterated until there are no more outliers. However, we need to caution the reader that if the subject experts feel that such outliers are indeed expected values, it may convey that some appropriate variables are missing from the regression model.
The error terms are not normally distributed. This is one of the most crucial assumptions of the linear regression model. A violation of this assumption is verified using the normal probability plot, in which the expected values of the residuals under normality (obtained from the normal quantiles) are plotted against the observed residuals. If the points fall along a straight line, the normality assumption for the errors holds true. The model is to be rejected if this assumption is violated.
The next section shows you how to obtain the residual plots for the purpose of model validation.
Time for action – residual plots for model validation
The R functions resid and fitted can be used to extract residuals and fitted values from an lm object.
1. Find the residuals of the fitted regression model using the resid function: IO_lm_resid <- resid(IO_lm).
2. We need six plots, and hence we invoke the graphics editor with par(mfrow = c(3,2)).
3. Sketch the plot of residuals against the predictor variable with plot(No_of_IO, IO_lm_resid).
4. To check whether the regression model is linear or not, obtain the plots of absolute residual values against the predictor variable and of squared residual values against the predictor variable, respectively, with plot(No_of_IO, abs(IO_lm_resid),…) and plot(No_of_IO, IO_lm_resid^2,…).
5. The assumption that the errors have constant variance may be verified by the plot of residuals against the fitted values of the regressand. The required plot is obtained by using plot(IO_lm$fitted.values,IO_lm_resid).
6. The assumption that the errors are independent of each other may be verified by plotting the residuals against their index numbers: plot.ts(IO_lm_resid).
7. The presence of outliers is investigated by the boxplot of the residuals: boxplot(IO_lm_resid).
8. Finally, the assumption of normality for the error terms is verified through the normal probability plot. This plot is drawn on a new graphics editor.
The complete R program is as follows:
IO_lm_resid <- resid(IO_lm)
par(mfrow=c(3,2))
plot(No_of_IO, IO_lm_resid, main="Plot of Residuals Vs Predictor Variable",
  ylab="Residuals", xlab="Predictor Variable")
plot(No_of_IO, abs(IO_lm_resid), main="Plot of Absolute Residual Values Vs Predictor Variable",
  ylab="Absolute Residuals", xlab="Predictor Variable")
# Equivalently
plot(No_of_IO, IO_lm_resid^2, main="Plot of Squared Residual Values Vs Predictor Variable",
  ylab="Squared Residuals", xlab="Predictor Variable")
plot(IO_lm$fitted.values, IO_lm_resid, main="Plot of Residuals Vs Fitted Values",
  ylab="Residuals", xlab="Fitted Values")
plot.ts(IO_lm_resid, main="Sequence Plot of the Residuals")
boxplot(IO_lm_resid, main="Box Plot of the Residuals")
rpanova <- anova(IO_lm)
IO_lm_resid_rank <- rank(IO_lm_resid)
tc_mse <- rpanova$Mean[2]
IO_lm_resid_expected <- sqrt(tc_mse)*qnorm((IO_lm_resid_rank-0.375)/(length(CPU_Time)+0.25))
plot(IO_lm_resid_expected, IO_lm_resid, xlab="Expected", ylab="Residuals",
  main="The Normal Probability Plot")
abline(0,1)
The resulting plots for model validation are given next. If you run the preceding R program up to the rpanova line, you will find a plot similar to the following:
Figure 4: Checking for violations of assumptions of IO_lm
We have used the resid function to extract the residuals out of the lm object. The first plot of residuals against the predictor variable No_of_IO shows that the larger the number of IO processes, the larger the residual value, as is also confirmed by the Plot of Absolute Residual Values Vs Predictor Variable and the Plot of Squared Residual Values Vs Predictor Variable. However, there is no clear non-linear pattern suggested here. The Plot of Residuals Vs Fitted Values is similar to the first plot of residuals against the predictor. The time series plot of residuals does not indicate a deterministic trend and appears to be random scatter. Thus, these plots do not give any evidence of any kind of dependence among the observations. The boxplot does not indicate the presence of an outlier.
The normal probability plot for the residuals is given next:
Figure 5: The normal probability plot for IO_lm
As all the points fall close to the straight line, the normality assumpon for the errors does
not appear to be violated.
What just happened?
The R program given earlier produces various residual plots, which help in the validation of the model. It is important that these plots are always checked whenever a linear regression model is built. For CPU_Time as a function of No_of_IO, the linear regression model is a fairly good model.
Have a go hero
From a theoretical perspective and my own experience, the seven plots obtained earlier were found to be very useful. However, R, by default, also gives a very useful set of residual plots for an lm object. For example, plot(my_lm) generates a powerful set of model validation plots. Explore the same for IO_lm with plot(IO_lm). You can explore more about plot and lm with the plot.lm function.
We will next consider the general multiple linear regression model for the Gasoline problem considered in the earlier chapters.
Multiple linear regression model
In the The simple linear regression model section, we considered an almost (un)realistic problem of having only one predictor. We need to extend the model to the practical problems where one has more than a single predictor. In Example 3.2.9. Octane rating of gasoline blends we had a graphical study of mileage as a function of various vehicle variables. In this section, we will build a multiple linear regression model for the mileage.
If we have X1, X2, …, Xp, an independent set of variables which have a linear effect on the dependent variable Y, the multiple linear regression model is given by:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$
This model is similar to the simple linear regression model, and we have the same interpretation as earlier. Here, we have additional independent variables in $X_2, \ldots, X_p$, and their effect on the regressand Y is respectively through the additional regression parameters $\beta_2, \ldots, \beta_p$. Now, suppose we have n pairs of random observations $(Y_1, X_1), \ldots, (Y_n, X_n)$ for understanding the multiple linear regression model; here $X_i = (X_{i1}, \ldots, X_{ip}), i = 1, \ldots, n$. A matrix form representation of the multiple linear regression model is useful in understanding the estimator of the vector of regression coefficients. We define the following quantities:
$$\mathbf{Y} = (Y_1, \ldots, Y_n)', \quad \boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)', \quad \boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)', \quad \mathbf{X} = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}$$
The multiple linear regression model for n observations can be written in a compact matrix form as:
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
The least squares estimate of $\boldsymbol{\beta}$ is given by:
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{Y}$$
Let us fit a multiple linear regression model for the Gasoline mileage data considered earlier.
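As an aside, the matrix formula can be verified directly in R; the following is a minimal sketch, assuming the Gasoline data frame from the RSADBE package (with the mileage in column y) is loaded:
X <- model.matrix(y ~ ., data = Gasoline)   # design matrix, including the intercept column
Y <- Gasoline$y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)   # (X'X)^{-1} X'Y
cbind(beta_hat, coef(lm(y ~ ., data = Gasoline)))  # the two columns should agree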
Averaging k simple linear regression models or a multiple linear regression model
We already know how to build a simple linear regression model. Why should we learn another theory when an extension is possible in a certain manner? Intuitively, we can build k models, one for each of the k variables, and then simply average over the k models. Such an averaging can also be considered for the univariate case too: here, we may divide the covariate over distinct intervals, build a simple linear regression model over each interval, and finally average over the different models. Montgomery, et. al. (2001), pages 120-122, highlights the drawbacks of such an approach. Typically, the simple linear regression models may indicate the wrong sign for the regression coefficients. The wrong sign in such a naïve approach may arise for multiple reasons: restricting the range of some regressors, critical regressors having been omitted from the model building, or some computational errors having crept in.
To drive home the point, we consider an example from Montgomery, et. al.
Time for action – averaging k simple linear regression models
We will build three models here. We have a vector of regressand y and two covariates: x1
and x2.
1. Enter the dependent variable and the independent variables with y <- c(1,5,3,8,5,3,10,7), x1 <- c(2,4,5,6,8,10,11,13), and x2 <- c(1,2,2,4,4,4,6,6).
2. Visualize the relationships among the variables with:
par(mfrow=c(1,3))
plot(x1,y)
plot(x2,y)
plot(x1,x2)
3. Build the individual simple linear regression models and our first multiple regression model with:
summary(lm(y~x1))
summary(lm(y~x2))
summary(lm(y~x1+x2)) # Our first multiple regression model
Figure 6: Averaging k simple linear regression models
What just happened?
The visual plot (the preceding screenshot) indicates that both x1 and x2 have a positive impact on y, and this is also captured in lm(y~x1) and lm(y~x2); see the R output display. We have omitted the scatter plot, though you should be able to see it on your screen after running the R code in step 2. However, both the models are under the assumption that the information contained in x1 and x2 is complete. The variables are also seen to have a significant effect on the output. However, the metrics such as Multiple R-squared and Adjusted R-squared are very poor for both simple linear regression models. This is one of the indications that we need to collect more information, and thus we include both the variables and build our first multiple linear regression model; see the next section for more details. There are two important changes worth registering now. First, the sign of the regression coefficient of x1 becomes negative, which contradicts the intuition. The second observation is the great increase in the R-squared metric value.
To summarize our observations here, it suffices to say that the sum of the parts may sometimes fall way short of the entire picture.
Building a multiple linear regression model
The R function lm remains the same as earlier. We will continue Example 3.2.9. Octane rating of gasoline blends from the Visualization techniques for continuous variable data section of Chapter 3, Data Visualization. Recall that the variables, independent and dependent as well, are stored in the dataset Gasoline in the RSADBE package. Now, we tell R that y, which is the mileage, is the dependent variable, and we need to build a multiple linear regression model which includes all other variables of the Gasoline object. Thus, the formula is specified by y~., indicating that all other variables from the Gasoline object need to be treated as the independent variables. We proceed as earlier to obtain the summary of the fitted multiple linear regression model.
Time for action – building a multiple linear regression model
The method of building a multiple linear regression model remains the same as earlier. If all the variables in the data.frame are to be used, we use the formula y ~ .; however, if we need specific variables, say x1 and x3, the formula would be y ~ x1 + x3.
1. Build the multiple linear regression model with gasoline_lm <- lm(y~., data=Gasoline). Here, the formula y~. considers the variable y as the dependent variable and all the remaining variables in the Gasoline data frame as independent variables.
2. Get the details of the fitted multiple linear regression model with summary(gasoline_lm).
The R screen then appears as follows:
Figure 7: Building the multiple linear regression model
As with the simple model, we need to first check whether the overall model is significant by looking at the p-value of the F-statistic, which appears in the last line of the summary output. Here, the value 0.0003 being very close to zero, the overall model is significant. Of the 11 variables specified for modeling, only x1 and x3, that is, the engine displacement and torque, are found to have a meaningful linear effect on the mileage. The estimated regression coefficient values indicate that the engine displacement has a negative impact on the mileage, whereas the torque impacts it positively. These results are in confirmation with the basic science of vehicle mileage.
We have a tricky output for the eleventh independent variable, which R has renamed as x11M; we need to explain this. You should verify the output as a consequence of running sapply(Gasoline,class) on the console. Now, the x11 variable is a factor variable assuming two possible values, A and M, which stand for the transmission box being Automatic or Manual. As the categorical variables are of a special nature, they need to be handled differently. The user may be tempted to skip this, as the variable is seen to be insignificant in this case. However, the interpretation is very useful, and the "skip" may prove really expensive later. For computational purposes, an m-level factor variable is used to create m-1 new dummy variables, one for each level other than the baseline level (by default, R takes the first level as the baseline). If the variable assumes the level corresponding to a dummy, that dummy takes the value 1, else 0; if it assumes the baseline level, all the (m-1) new variables take the value 0. R names each dummy by concatenating the variable name and the factor level; hence, we have x11M as the variable name in the output. Here, we found the factor variable to be insignificant. If in certain experiments we find some factor levels to be significant at a certain p-value, we can't ignore the other factor levels even if their p-values suggest them as insignificant.
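A small illustrative sketch of this dummy coding, using a hypothetical two-level factor unrelated to the Gasoline data, is the following:
gearbox <- factor(c("A", "M", "A", "M", "M"))   # a hypothetical two-level factor
contrasts(gearbox)        # one dummy column, named after the non-baseline level "M"
model.matrix(~ gearbox)   # the column gearboxM is 1 for "M" and 0 for "A"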
What just happened?
The building of a multiple linear regression model is a straightforward extension of the simple linear regression model. The interpretation is where one has to be more careful with the multiple linear regression model.
We will now look at the ANOVA and confidence intervals for the multiple linear regression model. It is to be noted that the usage is not different from the simple linear regression model, as we are still dealing with the lm object.
The ANOVA and confidence intervals for the multiple linear regression model
Again, we use the anova and confint functions to obtain the required results. Here, the null hypothesis of interest is whether all the regression coefficients equal 0, that is, $H_0: \beta_0 = \beta_1 = \cdots = \beta_p = 0$, against the alternative that at least one of the regression coefficients is different from 0, that is, $H_1: \beta_j \neq 0$ for at least one j = 0, 1, …, p.
Time for action – the ANOVA and confidence intervals for the multiple linear regression model
The use of anova and confint extends in a similar way as lm is used for simple and multiple linear regression models.
1. The ANOVA table for the multiple regression model is obtained in the same way as for the simple regression model; after all, we are dealing with an object of class lm: gasoline_anova <- anova(gasoline_lm).
2. The confidence intervals for the independent variables are obtained by using confint(gasoline_lm).
The R output is given as follows:
Figure 8: The ANOVA and confidence intervals for the multiple linear regression model
Note the difference between the anova and summary results. Now, we find only the first variable to be significant. The interpretation of the confidence intervals is left to you.
What just happened?
The extension from the simple to the multiple linear regression model in R, especially for the ANOVA and confidence intervals, is really straightforward.
Have a go hero
Using the ANOVA table in the preceding screenshot and the summary of gasoline_lm given in the Time for action – building a multiple linear regression model section, build linear regression models with the significant variables only. Are you amused?
Useful residual plots
In the context of multiple linear regression models, modifications of the residuals have been found to be more useful than the residuals themselves. We have assumed the residuals to follow a normal distribution with mean zero and unknown variance. An estimator of the unknown variance is provided by the mean residual sum of squares. There are four useful types of residuals for the current model:
Standardized Residuals: We know that the residuals have zero mean. Thus, the standardized residuals are obtained by scaling the residuals with the estimator of the standard deviation, that is, the square root of the mean residual sum of squares. The standardized residuals are defined by:
$$d_i = \frac{e_i}{\sqrt{MS_{Res}}}$$
Here, $MS_{Res} = \sum_{i=1}^{n} e_i^2/(n - p)$, and p is the number of covariates in the model. The residual is expected to have mean 0, and $MS_{Res}$ is an estimate of its variance. Hence, we expect the standardized residuals to have a standard normal distribution. This in turn helps us to verify whether the normality assumption for the residuals is meaningful or not.
Semi-studentized Residuals: The semi-studentized residuals are defined by:
$$r_i = \frac{e_i}{\sqrt{MS_{Res}\left(1 - h_{ii}\right)}}, \quad i = 1, \ldots, n$$
Here, $h_{ii}$ is the ith diagonal element of the matrix $H = X\left(X'X\right)^{-1}X'$. The variance of a residual depends on the covariate value, and hence a flat scaling by $\sqrt{MS_{Res}}$ is not appropriate. A correction is provided by $\left(1 - h_{ii}\right)$, and $MS_{Res}\left(1 - h_{ii}\right)$ turns out to be an estimate of the variance of $e_i$. This is the motivation for the semi-studentized residual $r_i$.
PRESS Residuals: The predicted residual, PRESS, for observation i is the difference between the actual value $y_i$ and the value predicted for it by a regression model based on the remaining (n-1) observations. Now let $\hat{\beta}_{(i)}$ be the estimator of the regression coefficients based on the (n-1) observations (not including the ith observation). Then, the PRESS residual for observation i is given by:
$$e_{(i)} = y_i - x_i'\hat{\beta}_{(i)}, \quad i = 1, \ldots, n$$
Here, the idea is that the estimate of the residual for an observation is more appropriate when obtained from a model which is not influenced by its own value.
R-student Residuals: This residual is especially useful for the detection of outliers:
$$t_i = \frac{e_i}{\sqrt{MS_{Res(i)}\left(1 - h_{ii}\right)}}$$
Here, $MS_{Res(i)}$ is an estimator of the variance $\sigma^2$ based on the remaining (n-1) observations. The scaling change is on similar lines as with the studentized residuals.
The task of building n linear models may look daunting! However, there are very useful formulas in Statistics and functions in R which save the day for us. It is appropriate that we use those functions and develop the residual plots for the Gasoline dataset. Let us set ourselves up for some action.
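For instance, the PRESS residuals do not require refitting n models, since $e_{(i)} = e_i/(1 - h_{ii})$; a minimal sketch (assuming gasoline_lm from the earlier section) is:
e <- resid(gasoline_lm)
h <- hatvalues(gasoline_lm)
press_resid <- e / (1 - h)                 # PRESS residuals without any refitting
# Cross-check with an explicit leave-one-out refit for the first observation:
fit_minus_1 <- lm(y ~ ., data = Gasoline[-1, ])
Gasoline$y[1] - predict(fit_minus_1, newdata = Gasoline[1, ])  # equals press_resid[1]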
Time for action – residual plots for the multiple linear
regression model
R functions resid, hatvalues, rstandard, and rstudent are available, which can be applied on an lm object to obtain the required residuals.
1. Get the MSE of the regression model with gasoline_lm_mse <- gasoline_anova$Mean[length(gasoline_anova$Mean)].
2. Extract the residuals with the resid function, and standardize the residuals using stan_resid_gasoline <- resid(gasoline_lm)/sqrt(gasoline_lm_mse).
3. To obtain the semi-studentized residuals, we first need to get the h_ii elements, which are obtainable using the hatvalues function: hatvalues(gasoline_lm). The remaining code is given at the end of this list.
4. The PRESS residuals are calculated using the rstandard function available in R.
5. The R-student residuals can be obtained using the rstudent function in R.
The detailed code is as follows:
# Useful Residual Plots
gasoline_lm_mse <- gasoline_anova$Mean[length(gasoline_anova$Mean)]
stan_resid_gasoline <- resid(gasoline_lm)/sqrt(gasoline_lm_mse)
# Standardizing the residuals
studentized_resid_gasoline <- resid(gasoline_lm)/(sqrt(gasoline_lm_mse*(1-hatvalues(gasoline_lm))))
# Studentizing the residuals
pred_resid_gasoline <- rstandard(gasoline_lm)
pred_student_resid_gasoline <- rstudent(gasoline_lm)
# returns the R-Student Prediction Residuals
gasoline_fitted <- fitted(gasoline_lm)
# the fitted values, needed for the plots that follow
par(mfrow=c(2,2))
plot(gasoline_fitted, stan_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("Standardized Residual Plot")
plot(gasoline_fitted, studentized_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("Studentized Residual Plot")
plot(gasoline_fitted, pred_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("PRESS Plot")
plot(gasoline_fitted, pred_student_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("R-Student Residual Plot")
All the four residual plots in the following screenshot look identical, though there is a difference in their y-scaling. It is apparent from the residual plots that there are no patterns which show the presence of non-linearity, that is, the linearity assumption appears valid. In the standardized residual plot, all the observations are well within -3 and 3. Thus, it is correct to say that there are no outliers in the dataset.
Figure 9: Residual plots for the multiple linear regression model
What just happened?
Using the resid, rstudent, rstandard, and other functions, we have obtained useful residual plots for the multiple linear regression models.
Regression diagnostics
In the Useful residual plots subsection, we saw how outliers may be identified using the residual plots. If there are outliers, we need to ask the following questions:
Is the observation an outlier due to an anomalous value in one or more covariate values?
Is the observation an outlier due to an extreme output value?
Is the observation an outlier because both the covariate and output values are extreme values?
The distinction in the nature of an outlier is vital, as one needs to be sure of its type. The techniques for outlier identification are certainly different, as is their impact. If the outlier is due to the covariate value, the observation is called a leverage point, and if it is due to the y value, we call it an influential point. The rest of the section is about the exact statistical techniques for such outlier identification.
Leverage points
As noted, a leverage point has an anomalous x value. The leverage points may be theoretically proved not to impact the estimates of the regression coefficients. However, these points are known to drastically affect the $R^2$ value. The question then is, how do we identify such points? The answer is by looking at the diagonal elements of the hat matrix $H = X\left(X'X\right)^{-1}X'$. Note that the matrix H is of order $n \times n$. The (i, i) element of the hat matrix, $h_{ii}$, may be interpreted as the amount of leverage of observation i on the fitted value $\hat{y}_i$. The average size of a leverage is $\bar{h} = p/n$, where p is the number of covariates and n is the number of observations. It is better to leave out an observation if its leverage value is greater than twice $p/n$, and we then conclude that the observation is a leverage point.
Let us go back to the Gasoline problem and see the leverage of all the observations. In R, we have a function, hatvalues, which readily extracts the diagonal elements of H. The R output is given in the next screenshot.
Clearly, we have 10 observations which are leverage points. This is indeed a matter of concern, as we have only about 25 observations. Thus, the results of the linear model need to be interpreted with caution! Let us now identify the influential points for the Gasoline linear model.
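A minimal sketch of the 2p/n rule of thumb, assuming gasoline_lm from the earlier section and counting the estimated coefficients as p, is:
h <- hatvalues(gasoline_lm)
p <- length(coef(gasoline_lm))   # number of estimated coefficients (an assumption on how p is counted)
n <- nrow(Gasoline)
which(h > 2 * p / n)             # observations flagged as leverage points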
Influential points
An influential point has a tendency to pull the regression line (plane) towards its direction, and hence such points drastically affect the values of the regression coefficients. We want to identify the impact of an observation on the regression coefficients, and one approach is to consider how much the regression coefficient values change if the observation were not considered. The relevant mathematics for the identification of influential points is beyond the scope of the book, so we simply help ourselves with the metric Cook's distance, which finds the influential points. The R function cooks.distance returns the value of Cook's distance for each observation, and the thumb rule is that if the distance is greater than 1, the observation is an influential point. Let us use the R function and identify the influential points for the Gasoline problem.
Figure 10: Leverage and influential points of the gasoline_lm fitted model
For this dataset, we have only one influential point, in Eldorado. The plot of Cook's distance against the observation numbers and that of Cook's distance against the leverages may be easily obtained with plot(gasoline_lm,which=c(4,6)).
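A minimal sketch of the Cook's distance rule, again assuming gasoline_lm from above, is:
cooksd <- cooks.distance(gasoline_lm)
which(cooksd > 1)               # observations flagged as influential points
plot(gasoline_lm, which = 4)    # Cook's distance plot, as mentioned in the text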
DFFITS and DFBETAS
Belsley, Kuh, and Welsch (1980) proposed two additional metrics for finding the influential points. The DFBETAS metric indicates the change of the regression coefficients (in standard deviation units) if the ith observation is removed. Similarly, DFFITS is a metric which gives the impact on the fitted value $\hat{y}_i$. The rule which indicates the presence of an influential point using DFFITS is $|DFFITS_i| > 2\sqrt{p/n}$, where p is the number of covariates and n is the number of observations. Finally, an observation i is influential for regression coefficient j if $|DFBETAS_{j,i}| > 2/\sqrt{n}$.
Figure 11: DFFITS and DFBETAS for the Gasoline problem
We have given the DFFITS and DFBETAS values for the Gasoline dataset. It is left as an exercise to the reader to identify the influential points from the outputs given above.
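One way of carrying out this exercise is sketched below, assuming gasoline_lm from above and the cutoffs described in the text:
p <- length(coef(gasoline_lm))
n <- nrow(Gasoline)
which(abs(dffits(gasoline_lm)) > 2 * sqrt(p / n))                # influential by DFFITS
which(abs(dfbetas(gasoline_lm)) > 2 / sqrt(n), arr.ind = TRUE)   # influential by DFBETAS, per coefficient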
The multicollinearity problem
One of the important assumptions of the multiple linear regression model is that the covariates are linearly independent. Linear independence here is in the sense of Linear Algebra: a vector (a covariate in our context) cannot be expressed as a linear combination of the others. Mathematically, this assumption translates into the requirement that the matrix X'X is nonsingular, that is, its determinant is non-zero, so that $\left(X'X\right)^{-1}$ exists. If this is not the case, then we have one or more of the following problems:
The estimates will be unreliable, and there is a great chance of the regression coefficients having the wrong sign
The relevant significant factors will not be identified by either the t-tests or the F-tests
The importance of certain predictors will be undermined
Let us first obtain the correlation matrix for the predictors of the Gasoline dataset. We will exclude the final covariate, as it is a factor variable.
Figure 12: The correlation matrix of the Gasoline covariates
We can see that covariate x1 is strongly correlated with all other predictors except x4. Similarly, x8 to x10 are also strongly correlated. This is a strong indication of the presence of the multicollinearity problem.
Define $C = \left(X'X\right)^{-1}$. Then it can be proved, refer to Montgomery, et. al. (2003), that the jth diagonal element of C can be written as $C_{jj} = \left(1 - R_j^2\right)^{-1}$, where $R_j^2$ is the coefficient of determination obtained by regressing $x_j$, as the output, on all the other covariates. Now, if the variable $x_j$ is independent of all the other covariates, we expect the coefficient of determination to be zero, and hence $C_{jj}$ to be close to unity. However, if the covariate depends on the others, we expect the coefficient of determination to be a high value, and hence $C_{jj}$ to be a large number. The quantity $C_{jj}$ is also called the variance inflation factor, denoted by $VIF_j$. A general guideline for a covariate to be linearly independent of the other covariates is that its $VIF_j$ should be less than 5 or 10.
Time for action – addressing the multicollinearity problem for
the Gasoline data
The multicollinearity problem is addressed using the vif function, which is available from two libraries: car and faraway. We will use it from the faraway package.
1. Load the faraway package with library(faraway).
2. We need to find the variance inflation factor (VIF) of the independent variables only. The covariate x11 is a factor variable, and the first column of the Gasoline dataset is the regressand. Hence, run vif(Gasoline[,-c(1,12)]) to find the VIF of the eligible independent variables.
3. The VIF for x3 is the highest at 217.587. Hence, we remove it and find the VIF among the remaining variables with vif(Gasoline[,-c(1,4,12)]). Remember that x3 is the fourth column in the Gasoline data frame.
4. In the previous step, we find x10 having the maximum VIF at 77.810. Now, run vif(Gasoline[,-c(1,4,11,12)]) to find whether all VIFs are less than 10.
5. Now, the VIF of x1 is the highest at 31.956, and we remove it with vif(Gasoline[,-c(1,2,4,11,12)]).
6. At the end of the previous step, we have the VIF of x2 at 10.383. Thus, run vif(Gasoline[,-c(1,2,3,4,11,12)]).
7. Now, all the remaining independent variables have VIF less than 10. Hence, we stop at this step.
8. Removing all the independent variables with VIF greater than 10, we arrive at the final model, summary(lm(y~x4+x5+x6+x7+x8+x9,data=Gasoline)).
Figure 13: Addressing the multicollinearity problem for the Gasoline dataset
What just happened?
We used the vif function from the faraway package to overcome the problem of multicollinearity in the multiple linear regression model. This helped reduce the number of independent variables from 10 to 6, which is a huge 40 percent reduction.
The function vif from the faraway package is applied to the set of covariates. Indeed, there is another function of the same name in the car package which can be directly applied on an lm object.
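For instance, a minimal sketch of the car usage (assuming the car package is installed) is:
library(car)
vif(gasoline_lm)   # applied directly on the fitted lm object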
Model selection
The method of removal of covariates in the The multicollinearity problem section depended solely on the covariates themselves. However, it happens more often that the covariates in the final model are selected with respect to the output. Computational cost is almost a non-issue these days, especially for not-so-very-large datasets! The question that arises then is, can one retain all possible covariates in the model, or do we have a choice of covariates which meets a certain regression metric, say R2 > 60 percent? The problem is that having more covariates increases the variance of the model, while having fewer of them results in large bias. The philosophical Occam's Razor principle applies here, and the best model is the simplest model: in our context, the smallest model which fits the data is the best. There are two types of model selection procedures: stepwise procedures and criterion-based procedures. In this section, we will consider both.
Stepwise procedures
There are three methods of selecting covariates for inclusion in the final model: backward elimination, forward selection, and stepwise regression. We will first describe the backward elimination approach and develop the R function for it.
The backward elimination
In this method, one first begins with all the available covariates. Suppose we wish to retain all covariates for which the p-value is at most $\alpha$. The value $\alpha$ is referred to as the critical alpha. Now, we first eliminate the covariate whose p-value is the maximum among all the covariates having p-value greater than $\alpha$. The model is refitted with the remaining covariates. We continue the process until all the remaining covariates have p-values less than $\alpha$. In summary, the backward elimination algorithm is as explained next:
1. Consider all the available covariates.
2. Remove the covariate with the maximum p-value among all the covariates which have p-value greater than $\alpha$.
3. Refit the model and go to the first step.
4. Continue the process until all p-values are less than $\alpha$.
Typically, the user investigates the p-values in the summary output and then carries out the preceding algorithm. Tattar, et. al. (2013) gives a function which right away executes the entire algorithm; we adapt the same function here and apply it on the linear regression model gasoline_lm.
The forward selection
In the previous procedure we started with all the covariates. Here, we begin with an empty model and look for the most significant covariate with p-value less than $\alpha$. That is, we build k new linear models, with the kth covariate in the kth model. Naturally, by "most significant" we mean that the p-value should be the least among all the covariates whose p-value is less than $\alpha$. Then, we build the model with the selected covariate. A second covariate is selected by treating the previous model as the initial (empty) model. The model selection is continued until we fail to add any more covariates. This is summarized in the following algorithm:
1. Begin with an empty model.
2. For each covariate, obtain the p-value if it is added to the model. Select the covariate with the least p-value among all the covariates whose p-value is less than $\alpha$.
3. Repeat the preceding step until no more covariates can be added to the model.
We again use the function created in Tattar, et. al. (2013) and apply it to the gasoline problem.
There is yet another method of model selection. Here, we begin with the empty model. We add a covariate as in the forward selection step and then perform a backward elimination to remove any unwanted covariate. The forward and backward steps are then continued until we can neither add a new covariate nor remove an existing one. Of course, the critical alpha values for the forward and backward steps are specified distinctly. This method is called stepwise regression. It is, however, skipped here for the purpose of brevity.
Criterion-based procedures
A useful tool for the model selection problem is to evaluate all possible models and select one of them according to a certain criterion. The Akaike Information Criterion (AIC) is one such criterion which can be used to select the best model. Let $\log L(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p, \hat\sigma^2 \mid y)$ denote the log-likelihood function of the fitted regression model. Define K = p + 2, which is the total number of estimated parameters. The AIC for the fitted regression model is given by:

$$\mathrm{AIC} = -2 \log L\left(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p, \hat\sigma^2 \mid y\right) + 2K$$

Now, the model which has the least AIC among the candidate models is the best model. The step function available in R gets the job done for us, and we will close the chapter with the continued illustration of the Gasoline problem.
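As a quick check of this definition, the AIC of a fitted lm object can be recomputed from its log-likelihood. The following is a minimal sketch, assuming gasoline_lm has already been fitted in the workspace:

K <- length(coef(gasoline_lm)) + 1    # (p + 1) regression coefficients plus sigma^2, that is, K = p + 2
-2 * as.numeric(logLik(gasoline_lm)) + 2 * K    # AIC computed by hand
AIC(gasoline_lm)    # should agree with the value computed above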
Time for action – model selection using the backward, forward,
and AIC criteria
For the forward and backward selection procedures under the stepwise procedures of the model selection problem, we first define two functions: backwardlm and forwardlm. However, for the criterion-based model selection, say AIC, we use the step function, which can be performed on the fitted linear models.
1. Create a function pvalueslm which extracts the p-values related to the covariates of an lm object:
pvalueslm <- function(lm) {summary(lm)$coefficients[-1,4]}
2. Create a backwardlm function defined as follows:
backwardlm <- function(lm,criticalalpha) {
  lm2 <- lm
  while(max(pvalueslm(lm2))>criticalalpha) {
    lm2 <- update(lm2, paste(".~.-", attr(lm2$terms, "term.labels")[
      which(pvalueslm(lm2)==max(pvalueslm(lm2)))], sep=""))
  }
  return(lm2)
}
The code needs to be explained in more detail. There are two new functions created here for the implementation of the backward elimination procedure. Let us have a detailed look at them. The function pvalueslm extracts the p-values related to the covariates of an lm object. The choice of summary(lm)$coefficients[-1,4] is vital, as we are interested in the p-values of the covariates and not the intercept term. The p-values are available once the summary function is applied to the lm object. Now, let us focus on the backwardlm function. Its arguments are the lm object and the value of the critical $\alpha$. Our goal is to carry out the iterations until we no longer have any covariates with p-values greater than $\alpha$. Thus, we use the while function, which is typical of algorithms where the stopping condition appears at the beginning of the function/program. We want our function to work for all linear models and not just gasoline_lm, so we need to get the names of the covariates which are specified in the lm object. Remember, we conveniently used the formula lm(y~.) and this will come back to haunt us! Thankfully, attr(lm$terms, "term.labels") extracts all the covariate names of an lm object. The subscript [which(pvalueslm(lm2)==max(pvalueslm(lm2)))] identifies the covariate which has the maximum p-value above $\alpha$. Next, paste(".~.-", attr(...), sep="") returns the formula which removes the unwanted covariate. The explanation of the function is lengthier than the function itself, which is not surprising, as R is object-oriented and a few lines of code do more than detailed prose.
3. Obtain the efficient linear regression model by applying the backwardlm function, with the critical alpha at 0.20, on the gasoline_lm object:
gasoline_lm_backward <- backwardlm(gasoline_lm,criticalalpha=0.20)
4. Find the details of the final model obtained in the previous step:
summary(gasoline_lm_backward)
The output as a result of applying the backward selection algorithm is the following:
Figure 14: The backward selection model for the Gasoline problem
5. The forwardlm function is given by:
forwardlm <- function(y,x,criticalalpha) {
  yx <- data.frame(y,x)
  mylm <- lm(y~-.,data=yx)
  avail_cov <- attr(mylm$terms,"dataClasses")[-1]
  minpvalues <- 0
  while(minpvalues<criticalalpha){
    pvalues_curr <- NULL
    for(i in 1:length(avail_cov)){
      templm <- update(mylm,paste(".~.+",names(avail_cov[i])))
      mypvalues <- summary(templm)$coefficients[,4]
      pvalues_curr <- c(pvalues_curr,mypvalues[length(mypvalues)])
    }
    minpvalues <- min(pvalues_curr)
    if(minpvalues<criticalalpha){
      include_me_in <- min(which(pvalues_curr<criticalalpha))
      mylm <- update(mylm,paste(".~.+",names(avail_cov[include_me_in])))
      avail_cov <- avail_cov[-include_me_in]
    }
  }
  return(mylm)
}
6. Apply the forwardlm function on the Gasoline dataset:
gasoline_lm_forward <- forwardlm(Gasoline$y,Gasoline[,-1],criticalalpha=0.2)
7. Obtain the details of the finalized model with summary(gasoline_lm_forward).
The output in R is as follows:
Figure 15: The forward selection model for the Gasoline dataset
Note that the forward selection and backward elimination have resulted in two different models. This is to be expected and is not a surprise; in such scenarios, one can pick either of the models for further analysis/implementation. The understanding of the construction of the forwardlm function is left as an exercise to the reader.
8. The step function in R readily gives the best model using AIC: step(gasoline_lm, direction="both").
Figure 16: Stepwise AIC
Backward and forward selection can be easily performed using AIC with the options direction="backward" and direction="forward".
What just happened?
We used two customized functions, backwardlm and forwardlm, for the backward and forward selection criteria. The step function has been used for the model selection problem based on the AIC.
Have a go hero
The supervisor performance data is available in the SPD dataset from the RSADBE package. Here, Y (the regressand) represents the overall rating of the job done by a supervisor. The overall rating depends on six other inputs/regressors. Find more details about the dataset with ?SPD. First, visualize the dataset with the pairs function. Fit a multiple linear regression model for Y and complete the necessary regression tasks, such as model validation, regression diagnostics, and model selection.
Summary
In this chapter, we learned how to build a linear regression model, check for violations of the model assumptions, fix the multicollinearity problem, and finally how to find the best model. Here, we were aided by two important assumptions: the output being a continuous variable, and the normality assumption for the errors. The linear regression model provides the best footing for general regression problems. However, when the output variable is discrete, binary, or multi-category data, the linear regression model lets us down. This is not actually a letdown, as it was never intended to solve this class of problems. Thus, our next chapter will focus on the problem of regression models for binary data.
The Logistic Regression Model
In this chapter we will consider regression models where the regressand is dichotomous or binary in nature. The data is of the form $(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$, where the dependent variables $Y_i$, i = 1, ..., n, are the observed binary outputs assumed to be independent (in the statistical sense) of each other, and the vectors $X_i$, i = 1, ..., n, are the covariates (independent variables in the sense of a regression problem) associated with $Y_i$. In the previous chapter we considered the linear regression model where the regressand was assumed to be continuous, along with the assumption of normality for the error distribution. Here, we will first consider a Gaussian (normal) model for the binary regression problem, which is more widely known as the probit model.
A more generic model has emerged during the past four decades in the form of the logistic regression model. We will consider the logistic regression model for the rest of the chapter. The approach in this chapter will be on the following lines:
The binary regression problem
The probit regression model
The logistic regression model
Model validation and diagnostics
Receiver operator characteristic (ROC) curves
Logistic regression for the German credit screening dataset
The binary regression problem
Consider the problem of modeling the completion of a statistics course by students based on their Scholastic Assessment Test in mathematics (SAT-M) scores at the time of their admission. After the completion of the final exams we know which students successfully completed the course and which of them failed. Here, the output pass/fail may be represented by a binary number 1/0. It may be fairly said that the higher the SAT-M score at the time of admission to the course, the higher the likelihood of the candidate completing the course. This problem has been discussed in detail in Johnson and Albert (1999) and Tattar, et al. (2013).
Let us begin by denoting the pass/fail indicator by Y and the entry SAT-M score by X. Suppose that we have n pairs of observations on the students' scores and their course completion results. We can build a simple linear regression model for the probability of course completion $p_i = P(Y_i = 1)$ as a function of the SAT-M score with $p_i = \beta_0 + \beta_1 X_i + \epsilon_i$. The data from page 77 of Johnson and Albert (1999) is available in the sat dataset of the book's R package. The columns that contain the data on the variables Y and X are named Pass and Sat respectively. To build a linear regression model for the probability of completing the course, we take $p_i$ as 1 if $Y_i = 1$, and 0 otherwise. A scatter plot of Pass against Sat indicates that the students with higher SAT-M scores are more likely to complete the course. The SAT score varies from 463 to 649, and we then attempt to predict whether students with SAT scores of 400 and 700 would have successfully completed the course or not.
Time for action – limitations of linear regression models
A linear regression model is built for the dataset with a binary output. The model is used to predict the probabilities for some cases, which shows the limitations:
1. Load the dataset from the RSADBE package with data(sat).
2. Visualize the scatter plot of Pass against Sat with plot(sat$Sat, sat$Pass, xlab="SAT Score", ylab="Final Result").
3. Fit the simple linear regression model with passlm <- lm(Pass~Sat, data=sat) and obtain its summary by summary(passlm). Add the fitted regression line to the scatter plot using abline(passlm).
4. Make a prediction for students with SAT-M scores of 400 and 700 by using the R code predict(passlm,newdata=list(Sat=400)) and predict(passlm, newdata=list(Sat=700),interval="prediction").
Figure 1: Drawbacks of the linear regression model for the classification problem
The linear model is significant, as seen by the p-value of 0.000179 associated with the F-statistic. Next, the Pr(>|t|) associated with the Sat variable is 0.00018, which is again significant. However, the predicted values for SAT-M marks of 400 and 700 are respectively seen as -0.4793 and 1.743. The problem with the model is that the predicted values can be negative as well as greater than 1. It is essentially these limitations which restrict the use of the linear regression model when the regressand is a binary outcome.
What just happened?
We used the simple linear regression model for the probability prediction of a binary outcome and observed that the probabilities are not bound in the unit interval [0,1] as they are expected to be. This shows that we need special/different statistical models for understanding the relationship between the covariates and the binary output. We will use two regression models that are appropriate for a binary regressand: probit regression and logistic regression. The former model continues the use of the normal distribution for the error through a latent variable, whereas the latter uses the binomial distribution and is a popular member of the more generic generalized linear models.
Probit regression model
The probit regression model is constructed as a latent variable model. Define a latent variable, also called an auxiliary random variable, $Y^*$, as follows:

$$Y^* = X'\beta + \epsilon$$

which is the same as the earlier linear regression model with Y replaced by $Y^*$. The error term $\epsilon$ is assumed to follow a normal distribution $N(0, \sigma^2)$. Then Y is taken to be 1 if the latent variable is positive, that is:

$$Y = \begin{cases} 1, & \text{if } Y^* > 0, \text{ equivalently } \epsilon > -X'\beta, \\ 0, & \text{otherwise.} \end{cases}$$

Without loss of generality, we can assume that $\epsilon \sim N(0, 1)$. Then, the probit model is obtained by:

$$P(Y = 1 \mid X) = P(Y^* > 0) = P(\epsilon > -X'\beta) = P(\epsilon < X'\beta) = \Phi(X'\beta)$$

The method of maximum likelihood estimation is used to determine $\beta$. For a random sample of size n, the log-likelihood function is given by:

$$\log L(\beta) = \sum_{i=1}^{n} \left[ y_i \log \Phi(x_i'\beta) + (1 - y_i) \log\left(1 - \Phi(x_i'\beta)\right) \right]$$

Numerical optimization techniques can be deployed to find the MLE of $\beta$. Fortunately, we don't have to undertake this daunting task, as R helps us out with the glm function. Let us fit the probit model for the sat dataset seen earlier.
Time for action – fitting the probit regression model
The probit regression model is built for the Pass variable as a function of the Sat score using the glm R function and the argument binomial(probit).
1. Using the glm function and the binomial(probit) option, we can fit the probit model for Pass as a function of the Sat score:
pass_probit <- glm(Pass~Sat,data=sat,binomial(probit))
2. The details about the pass_probit glm object are fetched using summary(pass_probit).
The summary function does not give a measure of R2, the coefficient of determination, as we obtained for the linear regression model. In general, such a measure is not exactly available for GLMs. However, certain pseudo-R2 measures are available and we will use the pR2 function from the pscl package. This package has been developed at the Political Science Computational Laboratory, Stanford University, which explains the name of the package, pscl.
3. Load the pscl package with library(pscl), and apply the pR2 function on pass_probit to obtain the measures of pseudo-R2.
Finally, we check how the probit model overcomes the problems posed by the application of the linear regression model.
4. Find the probability of passing the course for students with SAT-M scores of 400 and 700 with the following code:
predict(pass_probit,newdata=list(Sat=400),type = "response")
predict(pass_probit,newdata=list(Sat=700),type = "response")
The following screenshot shows the R session:
Figure 2: The probit regression model for SAT problem
The Pr(>|z|) for Sat is 0.0052, which shows that the variable has a significant say in explaining whether the student successfully completes the course or not. The regression coefficient value for the Sat variable indicates that if the Sat variable increases by one mark, the probit link increases by 0.0334. In plain words, the SAT-M variable has a positive impact on the probability of success for the student. Next, the pseudo-R2 value of 0.3934 for the McFadden metric indicates that approximately 39.34 percent of the output is explained by the Sat variable. This appears to suggest that we need to collect more information about the students. That is, the experimenter may try to get information on how many hours the student spent exclusively on the course/examination, the students' attendance percentages, and so on. However, the SAT-M score, which may have been obtained nearly two years before the final exam of the course, continues to have good explanatory power!
Finally, it may be seen that the probabilities of completing the course for students with SAT-M scores of 400 and 700 are respectively 2.019e-06 and 1. It is important for the reader to note the importance of the type = "response" option. More details may be obtained by running ?predict.glm at the R terminal.
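To see what the type = "response" option buys us, compare it with the default link-scale prediction; a small illustration, assuming the pass_probit object fitted above:

predict(pass_probit, newdata=list(Sat=400))    # the linear predictor X'beta on the probit (link) scale
predict(pass_probit, newdata=list(Sat=400), type="response")    # the probability pnorm(X'beta), bounded in [0, 1]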
What just happened?
The probit regression model is appropriate for handling binary outputs and is certainly much more appropriate than the simple linear regression model. The reader learned how to build the probit model using the glm function, which is in fact more versatile, as will be seen in the rest of the chapter. The predicted probabilities were also seen to be in the range of 0 to 1.
The glm function can be conveniently used for more than one covariate. In fact, the formula structure of glm remains the same as that of lm. Model-related issues have not been considered in full detail till now. The reason is that there is more interest in the logistic regression model, as it will be the focus for the rest of the chapter, and the logic does not change. In fact, we will return to the probit model diagnostics in parallel with the logistic regression model.
Logistic regression model
The binary outcomes may be easily viewed as failures or successes, and we have done the same on many earlier occasions. Typically, it is then common to assume that we have a binomial distribution for the probability of an observation being successful. The logistic regression model specifies the linear effect of the covariates through a specific function of the probability of success. The probability of success for an observation x is denoted by $\pi(x) = P(Y = 1)$ and the model is specified through the logistic function:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}$$

The choice of this function is made for fairly good reasons. Define $w = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$. Then, it may be easily seen that $\pi(x) = e^w / (1 + e^w) = 1 / (1 + e^{-w})$. Thus, as w decreases towards negative infinity, $\pi(x)$ approaches 0, and as w increases towards infinity, $\pi(x)$ approaches 1. For w = 0, $\pi(x)$ takes the value 0.5. The ratio of the probability of success to that of failure is known as the odds ratio, denoted by OR, and following some simple arithmetic steps, it may be shown that:

$$OR = \frac{\pi(x)}{1 - \pi(x)} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}$$

Taking the logarithm of the odds ratio gets us:

$$\log OR = \log \frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

And thus, we finally see the logarithm of the odds ratio as a linear function of the covariates. It is the term $\log\left(\pi(x_i) / (1 - \pi(x_i))\right)$, the logit function, from which this model derives its name.
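These relationships are easy to verify numerically using base R's plogis (the logistic function) and qlogis (the logit); the values of w below are arbitrary illustrative choices:

w <- c(-5, 0, 5)    # arbitrary values of the linear predictor
pi_x <- exp(w)/(1 + exp(w))    # pi(x) = e^w / (1 + e^w)
pi_x    # approaches 0, equals 0.5, approaches 1
all.equal(pi_x, plogis(w))    # matches the built-in logistic function
qlogis(pi_x)    # the logit recovers w, that is, log(pi/(1 - pi)) = w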
The log-likelihood function based on the data $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$ is then:

$$\log L(\beta) = \sum_{i=1}^{n} y_i \sum_{j=0}^{p} \beta_j x_{ij} - \sum_{i=1}^{n} \log\left(1 + e^{\sum_{j=0}^{p} \beta_j x_{ij}}\right)$$

The preceding expression is indeed a bit too complex to yield an explicit form for an estimate of $\beta$. Indeed, a specialized algorithm is required here, known as the iteratively reweighted least squares (IRLS) algorithm. We will not go into the details of the algorithm and refer the readers to an online paper by Scott A. Czepiel available at http://czep.net/stat/mlelr.pdf. A raw R implementation of IRLS is provided in Chapter 19 of Tattar, et al. (2013). For our purpose, we will be using the solution as provided by the glm function.
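For intuition only, the following is a minimal sketch of what an IRLS iteration for the logistic model looks like. It assumes X is a numeric matrix whose first column is all ones and y is a 0/1 vector, and it omits the safeguards and refinements that glm adds:

irls_logistic <- function(X, y, tol=1e-8, maxit=25) {
  beta <- rep(0, ncol(X))
  for (iter in 1:maxit) {
    eta <- as.vector(X %*% beta)    # current linear predictor
    mu <- 1/(1 + exp(-eta))    # fitted probabilities pi(x)
    W <- mu*(1 - mu)    # IRLS weights
    z <- eta + (y - mu)/W    # working response
    beta_new <- as.vector(solve(t(X) %*% (W*X), t(X) %*% (W*z)))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
# Hypothetical check against glm on the sat data:
# X <- cbind(1, sat$Sat); irls_logistic(X, sat$Pass)    # close to coef(pass_logistic)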
Let us now fit the logistic regression model for the SAT-M dataset considered hitherto.
Time for action – fitting the logistic regression model
The logistic regression model is built using the glm function with the family = 'binomial' option. We will obtain the pseudo-R2 values using the pR2 function from the pscl package.
1. Fit the logistic regression model for Pass as a function of Sat using the option family = 'binomial' in the glm function:
pass_logistic <- glm(Pass~Sat,data=sat,family = 'binomial')
2. The details of the fitted logistic regression model are obtained using the summary function: summary(pass_logistic).
In the summary you will see two statistics called Null deviance and Residual deviance. In general, deviance is a measure useful for assessing goodness-of-fit, and for the logistic regression model it plays a role analogous to that of the residual sum of squares for the linear regression model. The null deviance is the measure for a model that is built without using any information, such as Sat, and thus we would expect such a model to have a large value. If the Sat variable is influencing Pass, we expect the residual deviance of the fitted model to be significantly less than the null deviance. If the residual deviance is significantly smaller than the null deviance, we conclude that the covariates have significantly improved the model fit.
3. Find the pseudo-R2 with pR2(pass_logistic) from the pscl package.
4. The overall model significance of the fitted logistic regression model is obtained with
with(pass_logistic, pchisq(null.deviance - deviance,
df.null - df.residual, lower.tail = FALSE))
The p-value is 0.0001496, which shows that the model is indeed significant. The p-value for the Sat covariate, Pr(>|z|), is 0.011, which means that this variable is indeed valuable for understanding Pass. The estimated regression coefficient for Sat of 0.0578 indicates that an increase of a single mark increases the log-odds of the candidate passing the course by 0.0578; equivalently, it multiplies the odds by exp(0.0578), roughly 1.06.
A brief explanation of this R code! It may be seen from the output of summary.glm(pass_logistic) that we have all the terms null.deviance, deviance, df.null, and df.residual. So, the with function extracts all these terms from the pass_logistic object and finds the p-value using the pchisq function based on the difference between the deviances (null.deviance - deviance) and the corresponding degrees of freedom (df.null - df.residual).
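Because the coefficients are on the log-odds scale, exponentiating them (and their confidence limits) gives odds ratios, which are often easier to interpret; a small sketch assuming pass_logistic has been fitted:

exp(coef(pass_logistic))    # odds ratios: multiplicative change in the odds per unit increase in Sat
exp(confint(pass_logistic))    # 95 percent confidence intervals on the odds-ratio scale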
Figure 3: Logistic regression model for the Sat dataset
5. The confidence intervals, with a default 95 percent requirement, for the regression coefficients are extracted using the confint function: confint(pass_logistic).
The ranges of the 95 percent confidence intervals do not contain 0, and hence we conclude that the intercept term and the Sat variable are both significant.
6. The predictions for the unknown scores are obtained as in the probit regression model:
predict.glm(pass_logistic,newdata=list(Sat=400),type = "response")
predict.glm(pass_logistic,newdata=list(Sat=700),type = "response")
7. Let us compare the logistic and probit models. Consider a sequence of hypothetical SAT-M scores: sat_x = seq(400,700, 10). For the new sequence sat_x, we predict the probability of course completion using both the pass_logistic and pass_probit models and visualize whether their predictions are vastly different:
pred_l <- predict(pass_logistic,newdata=list(Sat=sat_x), type=
"response")
pred_p <- predict(pass_probit,newdata=list(Sat=sat_x), type=
"response")
plot(sat_x,pred_l,type="l",ylab="Probability",xlab="Sat_M")
lines(sat_x,pred_p,lty=2)
The prediction says that a candidate with a SAT-M score of 400 is very unlikely to complete the course successfully, while one with a SAT-M score of 700 is almost guaranteed to complete it. Predictions with probabilities close to 0 or 1 need to be taken with a bit of caution, since we rarely have enough observations at the boundaries of the covariates.
Figure 4: Prediction using the logistic regression
What just happened?
We ed our rst logisc regression model and viewed its various measures which tell
us whether the ed model is a good model or not. Next, we learnt how to interpret
the esmated regression coecients and also had a peek at the pseudo-R2 value.
The importance of condence intervals is also emphasized. Finally, the model has
been used to make predicons for some unobserved SAT-M scores too.
Hosmer-Lemeshow goodness-of-fit test statistic
We may be satisfied with the analysis thus far, and yet there is always a lot more that we can do. The testing hypothesis problem is of the form

$$H_0: E(Y) = \frac{e^{\sum_{j=0}^{p} \beta_j x_{ij}}}{1 + e^{\sum_{j=0}^{p} \beta_j x_{ij}}} \quad \text{versus} \quad H_1: E(Y) \neq \frac{e^{\sum_{j=0}^{p} \beta_j x_{ij}}}{1 + e^{\sum_{j=0}^{p} \beta_j x_{ij}}}$$

An answer to this hypothesis testing problem is provided by the Hosmer-Lemeshow goodness-of-fit test statistic. The steps of the construction of this test statistic are first discussed:
1. Order the fitted values using the sort and fitted functions.
2. Group the fitted values into g classes; the preferred values of g vary between 6 and 10.
3. Find the observed and expected number in each group.
4. Perform a chi-square goodness-of-fit test on these groups. That is, denote by $O_{jk}$ the number of observations of class k, k = 0, 1, in the group j, j = 1, 2, ..., g, and by $E_{jk}$ the corresponding expected numbers. The chi-square test statistic is then given by:

$$\chi^2 = \sum_{j=1}^{g} \sum_{k=0}^{1} \frac{(O_{jk} - E_{jk})^2}{E_{jk}}$$

And it may be proved that under the null hypothesis, $\chi^2 \sim \chi^2_{g-2}$.
We will use an R program available at http://sas-and-r.blogspot.in/2010/09/example-87-hosmer-and-lemeshow-goodness.html. It is important that when we use code available on the web, we verify and understand that such code is indeed correct.
Time for action – The Hosmer-Lemeshow goodness-of-fit
statistic
The Hosmer-Lemeshow goodness-of-fit statistic is one of the very important metrics for evaluating a logistic regression model. The hosmerlem function from the preceding web link will be used for the pass_logistic regression model.
1. Extract the fitted values for the pass_logistic model with pass_hat <- fitted(pass_logistic).
2. Create the function hosmerlem from the previously-mentioned URL:
hosmerlem <- function(y, yhat, g=10) {
cutyhat = cut(yhat,
breaks = quantile(yhat, probs=seq(0,
1, 1/g)), include.lowest=TRUE)
obs = xtabs(cbind(1 - y, y) ~ cutyhat)
expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
chisq = sum((obs - expect)^2/expect)
P = 1 - pchisq(chisq, g - 2)
return(list(chisq=chisq,p.value=P))
}
What is the hosmerlem function doing here exactly? Obviously, it is a function of three variables: the real output values in y, the predicted probabilities in yhat, and the number of groups g. The cutyhat variable uses the cut function on the predicted probabilities and assigns each of them to one of the g groups. The obs matrix obtains the counts $O_{jk}$ using the xtabs function, and a similar action is repeated for $E_{jk}$. The code chisq = sum((obs - expect)^2/expect) then obtains the value of the Hosmer-Lemeshow chi-square test statistic, and using it we obtain the related p-value with P = 1 - pchisq(chisq, g - 2). Finally, the required values are returned with return(list(chisq=chisq,p.value=P)).
3. Complete the computation of the Hosmer-Lemeshow goodness-of-fit test statistic for the fitted model pass_logistic with hosmerlem(pass_logistic$y, pass_hat).
Figure 5: Hosmer-Lemeshow goodness-of-fit test
Since there is no significant difference between the observed and predicted y values, we conclude that the fitted model is a good fit. Now that we know that we have a good model on hand, it is time to investigate how valid the model assumptions are.
What just happened?
We used R code from the web and successfully adapted it to the problem on our hands! Particularly, the Hosmer-Lemeshow goodness-of-fit test is a vital metric for understanding the appropriateness of a logistic regression model.
Model validation and diagnostics
In the previous chapter we saw the utility of residual techniques. A similar technique is also required for the logistic regression model, and we will develop these methods for the logistic regression model in this section.
Residual plots for the GLM
In the case of the linear regression model, we had explored the role of residuals for the purpose of model validation. In the context of logistic regression, actually the GLM, we have five different types of residuals for the same purpose:
Response residual: The difference between the actual values and the fitted values is the response residual, that is, $y_i - \hat\pi_i$, and in particular it is $1 - \hat\pi_i$ if $y_i = 1$ and $-\hat\pi_i$ if $y_i = 0$.
Deviance residual: For an observation i, the deviance residual is the signed square root of the contribution of the observation to the sum of the model deviance. That is, it is given by:

$$r_i^{dev} = \pm \left\{ -2\left[ y_i \log \hat\pi_i + (1 - y_i) \log(1 - \hat\pi_i) \right] \right\}^{1/2}$$

where the sign is positive if $y_i - \hat\pi_i \geq 0$, and negative otherwise, and $\hat\pi_i$ is the predicted probability of success.
Pearson residual: The Pearson residual is defined by:

$$r_i^{P} = \frac{y_i - \hat\pi_i}{\sqrt{\hat\pi_i (1 - \hat\pi_i)}}$$

Partial residual: The partial residual of the jth predictor, j = 1, 2, ..., p, for the ith observation is defined by:

$$r_{ij}^{part} = \frac{y_i - \hat\pi_i}{\hat\pi_i (1 - \hat\pi_i)} + \hat\beta_j x_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p$$

The partial residuals are very useful for identification of the type of transformation that needs to be performed on the covariates.
Working residual: The working residual for the logistic regression model is given by:

$$r_i^{W} = \frac{y_i - \hat\pi_i}{\hat\pi_i (1 - \hat\pi_i)}$$
Each of the preceding residual variants is easily obtained using the residuals function; see ?glm.summaries for details. The residual variant is specified through the type option in the residuals function. We have not given the details related to the probit regression model; however, the same functions for logistic regression apply there nevertheless. We will obtain the residual plots against the fitted values and examine the appropriateness of the logistic and probit regression models.
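As a quick reference, the five variants map onto the type argument of residuals as follows; a sketch assuming pass_logistic has been fitted (the partial residuals are returned as a matrix with one column per covariate):

r_resp <- residuals(pass_logistic, type="response")
r_dev <- residuals(pass_logistic, type="deviance")
r_pear <- residuals(pass_logistic, type="pearson")
r_work <- residuals(pass_logistic, type="working")
r_part <- residuals(pass_logistic, type="partial")    # matrix: one column per covariate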
Time for action – residual plots for the logistic regression
model
The residuals and fitted functions will be used to obtain the residual plots from the probit and logistic regression models.
1. Initialize a graphics window for three panels with par(mfrow=c(1,3), oma = c(0,0,3,0)). The oma option ensures that we can appropriately title the grand output.
2. Plot the Response Residuals against the Fitted Values of the pass_logistic model with:
plot(fitted(pass_logistic), residuals(pass_logistic,"response"),
col= "red", xlab="Fitted Values", ylab="Residuals",cex.axis=1.5,
cex.lab=1.5)
The roles of xlab and ylab have been explained in the earlier chapters.
3. For the purpose of comparison with the probit regression model, add its response residuals to the previous plot with:
points(fitted(pass_probit), residuals(pass_probit,"response"),
col= "green")
And add a suitable legend and title as follows:
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
title("Response Residuals")
4. Add the horizontal line at 0 with abline(h=0).
5. Repeat the preceding steps for deviance and Pearson residuals with:
plot(fitted(pass_logistic), residuals(pass_logistic,"deviance"),
col= "red", xlab="Fitted Values", ylab="Residuals",cex.axis=1.5,
cex.lab=1.5)
points(fitted(pass_probit), residuals(pass_probit,"deviance"),
col= "green")
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
abline(h=0)
title("Deviance Residuals")
plot(fitted(pass_logistic), residuals(pass_logistic,"pearson"),
col= "red",xlab="Fitted Values",ylab="Residuals",cex.axis=1.5,
cex.lab= 1.5)
points(fitted(pass_probit), residuals(pass_probit,"pearson"), col=
"green")
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
abline(h=0)
title("Pearson Residuals")
6. Give an appropriate title with title(main="Response, Deviance, and Pearson Residuals Comparison for the Logistic and Probit Models", outer=TRUE, cex.main=1.5).
Figure 6: Residual plots for the logistic regression model
In each of the three preceding residual plots we observe two decreasing trends of residuals whose slope is -1. The reason for such a trend is that the residuals take one of two values at a point $X_i$: either $1 - \hat\pi_i$ or $-\hat\pi_i$. Thus, in these residual plots we always get two linear trends with slope -1. Clearly, there is not much difference between the residuals of the logistic and probit models. The Pearson residual graph also indicates the presence of an outlier for the observation with residual value less than -3.
What just happened?
The residuals function, along with the type option, helps in model validation and the identification of unusual residuals. A good thing is that the same function applies to the logistic as well as the probit regression model.
Have a go hero
In the previous exercise, we have left out the investigation using the partial and working types of residuals. Obtain these plots!
Influence and leverage for the GLM
In the previous chapter we saw how the influential and leverage points are identified in a linear regression model. It will be a bit difficult to go into the appropriate formulas and theory for the logistic regression model.
Time for action – diagnostics for the logistic regression
The influence and leverage points will be identified through the application of functions such as hatvalues, cooks.distance, dffits, and dfbetas to the pass_logistic fitted model.
1. The high leverage points of a logistic regression model are obtained with hatvalues(pass_logistic), while the Cook's distance is fetched with cooks.distance(pass_logistic). The DFFITS and DFBETAS measures of influence are obtained by running dfbetas(pass_logistic) and dffits(pass_logistic).
2. The influence and leverage measures are put together using the cbind function:
cbind(hatvalues(pass_logistic),cooks.distance(pass_logistic),
dfbetas(pass_logistic),dffits(pass_logistic))
The output is given in the following screenshot:
Figure 7: Influence measures for the logistic regression model
It is time to interpret these measures.
3. If the hatvalue associated with an observation is greater than $2(p+1)/n$, where p is the number of covariates considered in the model and n is the number of observations, it is considered a high leverage point. For the pass_logistic object, we find the high leverage points with:
hatvalues(pass_logistic) > 2*(length(pass_logistic$coefficients)-1)/length(pass_logistic$y)
4. An observation is considered to have a large influence on the parameter estimates if its Cook's distance, as given by cooks.distance, is greater than the 10 percent quantile of the $F_{p+1,\, n-p-1}$ distribution, and it is considered highly influential if it exceeds the 50 percent quantile of the same distribution. In terms of the R program, we need to execute:
cooks.distance(pass_logistic) > qf(0.1, length(pass_logistic$coefficients),
length(pass_logistic$y) - length(pass_logistic$coefficients))
cooks.distance(pass_logistic) > qf(0.5, length(pass_logistic$coefficients),
length(pass_logistic$y) - length(pass_logistic$coefficients))
Figure 8: Identifying the outliers
The previous screenshot shows that there are eight high leverage points. We also see that at the 10 percent quantile of the F-distribution we have two influential points, whereas we don't have any highly influential points.
5. Use the plot function to identify the influential observations suggested by the DFFITS and DFBETAS measures:
par(mfrow=c(1,3))
plot(dfbetas(pass_logistic)[,1],ylab="DFBETAS - INTERCEPT")
plot(dfbetas(pass_logistic)[,2],ylab="DFBETAS - SAT")
plot(dffits(pass_logistic),ylab="DFFITS")
Figure 9: DFFITS and DFBETAS for the logistic regression model
As with the linear regression model, the DFFITS and DFBETAS are measures of the influence of the observations on the regression coefficients. The rule of thumb for the DFBETAS is that if their absolute value exceeds 1, the observation has a significant influence on the corresponding coefficient. In our case this does not occur, and we conclude that we do not have influential observations. The interpretation of DFFITS is left as an exercise.
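The DFBETAS rule of thumb can be checked in a single line; a small sketch assuming pass_logistic is the fitted model (the cut-off of 1 for DFFITS is only one common choice, used here for illustration):

any(abs(dfbetas(pass_logistic)) > 1)    # TRUE would flag at least one influential observation
which(abs(dffits(pass_logistic)) > 1)    # observations flagged by a DFFITS cut-off of 1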
What just happened?
We adapted the influence measures in the context of generalized linear models, and especially in the context of logistic regression.
Have a go hero
The influence and leverage measures were executed on the logistic regression model, the pass_logistic object in particular. You also have the pass_probit object! Repeat the entire exercise of hatvalues, cooks.distance, dffits, and dfbetas on the pass_probit fitted probit model and draw your inference.
Receiver operator characteristic curves
In the binary classification problem, we have certain scenarios where the comparison between the predicted and actual class is of great importance. For example, there is a genuine problem in the banking industry of identifying fraudulent transactions among the non-fraudulent ones. There is another problem of sanctioning loans to customers who may successfully repay the entire loan, as against customers who will default at some stage during the loan tenure. Given the historical data, we build a classification model, for example the logistic regression model.
Now, with the logistic regression model, or any other classification model for that matter, if the predicted probability is greater than 0.5, the observation is predicted as a success, and as a failure otherwise. We remind ourselves again that success/failure is defined according to the experiment. At least with the data on hand, we know the true labels of the observations, and hence a comparison of the true labels with the predicted labels makes a lot of sense. In an ideal scenario we expect the predicted labels to match perfectly with the actual labels, that is, whenever the true label stands for success/failure, the predicted label is also success/failure. However, in a real scenario this is rarely the case. This means that there are some observations which are predicted as success/failure when the true labels are actually failure/success. In other words, we make mistakes! It is possible to put these notes in the form of a table widely known as the confusion matrix.
                                  Observed
Predicted            Success                   Failure
Success              True Positive (TP)        False Positive (FP)
Failure              False Negative (FN)       True Negative (TN)

Table 1: The confusion matrix
The number in each cell is the count of the corresponding cases. It may be seen from the preceding table that the diagonal cells (TP and TN) are the correct predictions made by the model, whereas the off-diagonal cells (FP and FN) are the mistakes. The following metrics may be considered for comparison across multiple models:

Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
Precision: $\frac{TP}{TP + FP}$
Recall: $\frac{TP}{TP + FN}$
However, it is known that these metrics have a lot of limitations and more robust measures are required. The answer is provided by the receiver operator characteristic (ROC) curve. We need two important metrics for the construction of an ROC curve. The true positive rate (tpr) and false positive rate (fpr) are respectively defined by:
$$tpr = \frac{TP}{TP + FN}, \qquad fpr = \frac{FP}{TN + FP}$$

The ROC graph is constructed by plotting the tpr against the fpr. We will now explain this in detail; our approach will be to explain the algorithm in a Time for action framework.
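Before constructing the full ROC curve, the scalar metrics above are easy to compute from a confusion matrix; the following sketch uses a hypothetical vector of true labels and 0.5-thresholded predictions, not data from this book:

truth <- c(1, 0, 1, 1, 0, 0, 1, 0)    # hypothetical true labels
pred <- as.numeric(c(0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1) >= 0.5)    # hypothetical predictions
TP <- sum(pred==1 & truth==1); FP <- sum(pred==1 & truth==0)
FN <- sum(pred==0 & truth==1); TN <- sum(pred==0 & truth==0)
accuracy <- (TP + TN)/(TP + TN + FP + FN)
precision <- TP/(TP + FP)
recall <- TP/(TP + FN)    # identical to the true positive rate (tpr)
fpr <- FP/(TN + FP)
c(accuracy=accuracy, precision=precision, recall=recall, fpr=fpr)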
Time for action – ROC construction
A simple dataset is considered and the ROC construction is explained in a very simple step-by-step approach:
1. Suppose that the predicted probabilities of n = 10 observations are 0.32, 0.62, 0.19, 0.75, 0.18, 0.18, 0.95, 0.79, 0.24, 0.59. Create a vector of them as follows:
pred_prob<-c(0.32, 0.62, 0.19, 0.75, 0.18, 0.18, 0.95, 0.79, 0.24,
0.59)
2. Sort the predicted probabilities in decreasing order:
> (pred_prob=sort(pred_prob,decreasing=TRUE))
[1] 0.95 0.79 0.75 0.62 0.59 0.32 0.24 0.19 0.18 0.18
3. Normalize the predicted probabilities in the preceding step to the unit interval:
> pred_prob <- (pred_prob-min(pred_prob))/(max(pred_prob)-min(pred_prob))
> pred_prob
[1] 1.00000 0.79221 0.74026 0.57143 0.53247 0.18182 0.07792
0.01299 0.00000 0.00000
Now, at each threshold along the previously sorted probabilities, we commit false positives as well as false negatives. Thus, we want to check, at each of our prediction percentiles, the quantum of tpr and fpr. Since ten points are very few, we now consider a dataset of predicted probabilities and the true labels.
4. Load the illustrative dataset from the RSADBE package with data(simpledata).
5. Set up the threshold vector threshold <- seq(1,0,-0.01).
6. Find the number of positive (success) and negative (failure) cases in the dataset with P <- sum(simpledata$Label==1) and N <- sum(simpledata$Label==0).
7. Initialize the fpr and tpr with tpr <- fpr <- threshold*0.
8. Set up the following loop which computes tpr and fpr at each point of the
threshold vector:
for(i in 1:length(threshold)) {
FP=TP=0
for(j in 1:nrow(simpledata)) {
if(simpledata$Predictions[j]>=threshold[i]) {
if(simpledata$Label[j]==1) TP=TP+1 else FP=FP+1
}
}
tpr[i]=TP/P
fpr[i]=FP/N
}
9. Plot the tpr against the fpr with:
plot(fpr,tpr,"l",xlab="False Positive Rate", ylab="True Positive
Rate",col="red")
abline(a=0,b=1)
Figure 10: An ROC illustration
The diagonal line represents the performance of a random classifier, in that it simply says "Yes" or "No" without looking at any characteristic of an observation. Any good classifier must sit, or rather be displayed, above this line. Our classifier, albeit an unknown one, appears to be a much better classifier than the random classifier. The ROC curve is useful for comparing competing classifiers in the sense that if one classifier is always above another, we select the former.
An excellent introductory exposition of ROC curves is available at http://ns1.ce.sharif.ir/courses/90-91/2/ce725-1/resources/root/Readings/Model%20Assessment%20with%20ROC%20Curves.pdf.
What just happened?
The construction of the ROC curve has been demystified! The preceding program is a very primitive one. In the later chapters we will use the ROCR package for the construction of ROC curves. We will next look at a real-world problem.
Logistic regression for the German credit screening
dataset
Millions of applications are made to banks for a variety of loans! The loan may be a personal loan, home loan, car loan, and so forth. From a bank's perspective, loans are an asset, as the customer obviously pays them interest and over a period of time the bank makes a profit. If all the customers promptly pay back their loan amount, either over the full tenure of equated monthly installments (EMI) or as the complete amount on preclosure of the principal, there is only money to be made. Unfortunately, it is not always the case that the customers pay back the entire amount. In fact, the fraction of people who do not complete the loan duration may also be very small, say about five percent. However, a bad customer may take away the profits of maybe 20 or more customers. In this hypothetical case, the bank eventually makes more losses than profit, and this may eventually lead to its own bankruptcy.
Now, a loan application form seeks a lot of details about the applicant. The data from these details in the application can help the bank build appropriate classifiers, such as a logistic regression model, and make predictions about which customers are most likely to turn out to be fraudulent. The customers who have been predicted to default in the future are then declined the loan. A real dataset of 1,000 customers who had borrowed loans from a bank is available on the web at http://www.stat.auckland.ac.nz/~reilly/credit-g.arff and http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data). This data has been made available by Prof. Hofmann and it contains details on 20 variables related to each customer. It is also known whether the customer defaulted or not. The variables are described in the following table.
A detailed analysis of the dataset using R has been done by Sharma, and his very useful document can be downloaded from cran.r-project.org/doc/contrib/Sharma-CreditScoring.pdf.
No  Variable   Characteristic  Description
1   checking   integer         Status of existing checking account
2   duration   integer         Duration in month
3   history    integer         Credit history
4   purpose    factor          Purpose
5   amount     numeric         Credit amount
6   savings    integer         Savings account/bonds
7   employed   integer         Present employment since
8   installp   integer         Installment rate in percentage of disposable income
9   marital    integer         Personal status and sex
10  coapp      integer         Other debtors/guarantors
11  resident   integer         Present residence since
12  property   factor          Property
13  age        numeric         Age in years
14  other      integer         Other installment plans
15  housing    integer         Housing
16  existcr    integer         Number of existing credits at this bank
17  job        integer         Job
18  depends    integer         Number of people being liable to provide maintenance for
19  telephon   integer         Telephone
20  foreign    integer         Foreign worker
21  good_bad   factor          Loan defaulter
22  default    integer         good_bad in numeric
We have the German credit dataset with us as the GC data frame from the RSADBE package. Let us build a classifier for identifying the good customers from the bad ones.
Time for action – logistic regression for the German credit
dataset
The logistic regression model will be built as a credit application scoring model and an ROC curve used to evaluate the fit of the model.
1. Invoke the ROCR library with library(ROCR).
2. Get the German credit dataset in your current session with data(GC).
3. Build the logistic regression model for good_bad with GC_LR <- glm(good_bad~.,data=GC,family=binomial()).
4. Run summary(GC_LR) and identify the significant variables. Also answer the question: is the model significant?
5. Get the predictions using the predict function:
LR_Pred <- predict(GC_LR, type='response')
6. Use the prediction function from the ROCR package to set up a prediction object:
GC_pred <- prediction(LR_Pred,GC$good_bad)
The prediction function sets up the different manipulations and computations required for constructing the ROC curve. Get more details related to it with ?prediction.
7. Set up the performance object required to obtain the ROC curve with GC_perf <- performance(GC_pred,"tpr","fpr").
The performance function uses the prediction object to set up the ROC curve.
8. Finally, visualize the ROC curve with plot(GC_perf).
Figure 11: Logistic regression model for the German credit data
The ROC curve shows that the logistic regression model is indeed effective in identifying fraudulent customers.
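The area under the ROC curve (AUC) is a convenient single-number summary of the same plot; a short sketch using the ROCR objects created above (the exact value depends on the fitted model):

GC_auc <- performance(GC_pred, "auc")    # ROCR stores the value in the y.values slot
as.numeric(GC_auc@y.values[[1]])    # an AUC close to 1 indicates a good classifier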
What just happened?
We have now considered a real-world problem with enough data points. The fitted logistic regression model gives a good explanation of the fraudulent customers in terms of the data that is collected about them.
Have a go hero
For simpledata, a raw program was written to draw the ROC curve. Redo the exercise with red colour for the curve. Using the prediction and performance functions from the ROCR package, add the curve for simpledata obtained in the previous step in green colour. What do you expect?
Summary
We started with a simple linear regression model for the binary classification problem and saw its limitations. The probit regression model, which is an adaptation of the linear regression model through a latent variable, overcomes the drawbacks of the straightforward linear regression model. The versatile logistic regression model has been considered in detail, and we considered the various kinds of residuals that help in model validation. The influential and leverage point detection has been discussed too, which helps us build a better model by removing the outliers. A metric in the form of the ROC curve helps us understand the performance of a classifier. Finally, we concluded the chapter with an application to the important problem of identifying good customers from bad ones.
Despite the advantages of linearity, we still have many drawbacks with either the linear regression model or the logistic regression model. The next chapter begins with the family of polynomial regression models and later considers the impact of regularization.
Regression Models with Regularization
In Chapter 6, Linear Regression Analysis, and Chapter 7, The Logistic Regression
Model, we focused on the linear and the logistic regression model. In the model
selection issues with the linear regression model, we found that a covariate
is either selected or not depending on the associated p-value. However, the
rejected covariates are not given any kind of consideration once the p-value
is lesser than the threshold. This may lead to discarding the covariates even if
they have some say on the regressand. Particularly, the final model may thus
lead to overfitting of the data, and this problem needs to be addressed.
We will first consider fitting a polynomial regression model, without the technical details, and see how higher-order polynomials give a very good fit, which actually comes at a higher price. A more general framework of B-splines is considered next. This approach leads us to the smooth spline models, which are actually ridge regression models. The chapter concludes with an extension of ridge regression for the linear and logistic regression models. For more details of the coverage, refer to Chapter 2 of Berk (2008) and Chapter 5 of Hastie, et al. (2008). This chapter will unfold on the following topics:
The problem of overfitting in a general regression model
The use of regression splines for certain special cases
Improving estimators of the regression coefficients, and overcoming the problem of overfitting with ridge regression for linear and logistic models
The framework of train + validate + test for regression models
The overfitting problem
The limitation of the linear regression model is best understood through an example. I have created a hypothetical dataset for understanding the problem of overfitting. A scatter plot of the dataset is shown in the following figure.
It appears from the scatter plot that for x values up to 6, there is a linear increase in y, and an eyeball estimate of the slope is (50 - 10) / (5.5 - 1.75) = 10.67. This slope may be on account of a linear term or even a quadratic term. On the other hand, the decline in y-values for x-values greater than 6 is very steep, approximately (10 - 50) / (10 - 6) = -10. Now, looking at the complete picture, it appears that the output Y depends upon higher orders of the covariate X. Let us fit polynomial curves of various degrees and understand the behavior of the different linear regression models. A polynomial regression model of degree k is defined as follows:

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \cdots + \beta_k X^k + \epsilon$$

Here, the terms $X, X^2, \ldots, X^k$ are treated as distinct variables, in the sense that one may compare the preceding model with the one introduced in the multiple linear regression model of Chapter 6, Linear Regression Analysis, by defining $X_1 = X, X_2 = X^2, \ldots, X_k = X^k$. The inference for the polynomial regression model proceeds in the same way as for the multiple linear regression with k terms:
Figure 1: A non-linear relationship displayed by a scatter plot
The data for the previous figure is available in the dataset OF from RSADBE. The option poly is used on the right-hand side of the formula in the lm function for fitting the polynomial regression models.
Time for action – understanding overfitting
Polynomial regression models are built using the lm function, as we saw earlier, with the option poly.
1. Read the hypothetical dataset into R by using data(OF).
2. Plot Y against X by using plot(OF$X, OF$Y,"b",col="red",xlab="X", ylab="Y").
3. Fit the polynomial regression models of orders 1, 2, 3, 6, and 9, and add their fitted lines against the covariate X with the following code:
lines(OF$X,lm(Y~poly(X,1,raw=TRUE),data=OF)$fitted.values,"b",col="green")
lines(OF$X,lm(Y~poly(X,2,raw=TRUE),data=OF)$fitted.values,"b",col="wheat")
lines(OF$X,lm(Y~poly(X,3,raw=TRUE),data=OF)$fitted.values,"b",col="yellow")
lines(OF$X,lm(Y~poly(X,6,raw=TRUE),data=OF)$fitted.values,"b",col="orange")
lines(OF$X,lm(Y~poly(X,9,raw=TRUE),data=OF)$fitted.values,"b",col="black")
The option poly is used to specify the polynomial degree:
Figure 2: Fitting higher-order polynomial terms in a regression model
4. Enhance the graph with a suitable legend:
legend(6,50,c("Poly 1","Poly 2","Poly 3","Poly 6","Poly 9"),
col=c("green","wheat","yellow","orange","black"),pch=1,ncol=3)
5. Initialize the following vectors:
R2 <- NULL; AdjR2 <- NULL; FStat <- NULL
Mvar <- NULL; PolyOrder<-1:9
6. Now, fit the regression models beginning with order 1 up to order 9 (since we only have ten points) and extract their R2, Adj-R2, F-statistic value, and model variability:
for(i in 1:9) {
temp <- summary(lm(Y~poly(X,i,raw=T),data=OF))
R2[i] <- temp$r.squared
AdjR2[i] <- temp$adj.r.squared
FStat[i] <- as.numeric(temp$fstatistic[1])
Mvar[i] <- temp$sigma
}
cbind(PolyOrder,R2,AdjR2,FStat,Mvar)
We will more formally define polynomial regression models in the next section. The output is given in the next figure.
7. Let us also look at the magnitude of the regression coefficients:
as.numeric(lm(Y~poly(X,1,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,2,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,3,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,4,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,5,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,6,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,7,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,8,raw=T),data=OF)$coefficients)
The following screenshot shows the large size of the regression coefficients; in particular, as the degree of the polynomial increases, so does the coefficient magnitude. This is a problem! As the complexity of a model increases, its interpretability becomes very difficult. In the next section, we will discuss various techniques in polynomial regression.
Figure 3: Regression coefficients of polynomial regression models
What just happened?
The scatter plot indicated that a polynomial regression model may be appropriate. Fitting higher-order polynomial curves gives a closer approximation of the fit. The regression coefficients have been observed to increase with the degree of the polynomial fit.
In the next section, we consider the more general regression spline model.
Have a go hero
The R2 value for gasoline_lm is at 0.895; see Figure 7: Building Multiple Linear Regression Model, of Chapter 6, Linear Regression Analysis. Add higher-order terms for the covariates and make an attempt to reach an R2 value of 0.95.
Regression spline
In this section, we will consider various enhancements/generalizations of the linear regression model. We will begin with a piecewise linear regression model and then consider the polynomial regression extension. The term spline refers to a thin strip of wood that can be easily bent along a curved line.
Basis functions
In the previous section, we made multiple transformations of the input variable X with $X_1 = X, X_2 = X^2, \ldots, X_k = X^k$. In the Data Re-expression section of Chapter 4, Exploratory Analysis, we saw how a useful log transformation gave a better stem-and-leaf display than the original variable itself. In many applications, it has been found that the transformed variables are more important than the original variable itself. Thus, we need a more generic framework for considering transformations of the variables. Such a framework is provided by the basis functions. For a single covariate X, the set of transformations may be defined as follows:

$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$

Here, $h_m(X)$ is the m-th transformation of X, and $\beta_m$ is the associated regression coefficient. In the case of a simple linear regression model, we have $h_1(X) = 1$ and $h_2(X) = X$. For the polynomial regression model, we have $h_m(X) = X^m, m = 1, 2, \ldots, k$, and for the logarithmic transformation, $h(X) = \log X$. In general, for the multiple linear regression model with p covariates, we have the basis transformation as follows:

$$f(X_1, \ldots, X_p) = \sum_{j=1}^{p} \sum_{m=1}^{M_j} \beta_{jm} h_{jm}(X_j)$$

For the multiple linear regression model, we have $h_j(X_j) = X_j, j = 1, \ldots, p$. In general, the transformations include functions such as sine, cosine, exponentiation, and indicator functions.
Piecewise linear regression model
Consider the scatter plot of the dataset, available as the dataset PW_Illus, in the next screenshot. We see a slanted letter N in Figure 4: Scatterplot of a dataset (A) and the fitted values using the piecewise linear regression model (B), where in the beginning Y increases with X up to the point, approximately, 15, then there is a steep decline, or negative relationship, till 30, and finally there is an increase in the y values beyond that. In a certain way, we can imagine the x values of 15 and 30 as breakpoints. It is apparent from the scatter plot display that a linear relationship between the x- and y-values over the intervals of the real line less than 15, between 15 and 30, and greater than 30 is appropriate. The question then is: how do we build a regression model for such a phenomenon? The answer is provided by the piecewise linear regression model. In this particular case, we have a piecewise linear regression model with two breakpoints.
In general, let $x_a$ and $x_b$ denote the two points where we believe the linear regression model has its breakpoints. Furthermore, we denote by $I_a$ an indicator function which equals 1 when the x value is greater than $x_a$ and takes the value 0 in other cases. Similarly, the second breakpoint indicator $I_b$ is defined. The piecewise linear regression model is defined as follows:

$$Y = \beta_0 + \beta_1 X + \beta_2 (X - x_a) I_a + \beta_3 (X - x_b) I_b + \epsilon$$
In this piecewise linear regression model, we have four transformations, namely h_1(X) = 1,
h_2(X) = X, h_3(X) = (X - x_a) I_a, and h_4(X) = (X - x_b) I_b. The regression model needs to be
interpreted with a bit of care. If the x value is less than x_a, the average Y value is
β_0 + β_1 X. For an x value greater than x_a but less than x_b, the average of Y is
(β_0 - β_2 x_a) + (β_1 + β_2) X, and for values greater than x_b, it is
(β_0 - β_2 x_a - β_3 x_b) + (β_1 + β_2 + β_3) X. The intercept terms in these three intervals
are thus β_0, (β_0 - β_2 x_a), and (β_0 - β_2 x_a - β_3 x_b), respectively, whereas the slopes
are β_1, (β_1 + β_2), and (β_1 + β_2 + β_3). Of course, we are now concerned about fitting
the piecewise linear regression model in R. Let us set ourselves up for this task!
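Before that, a minimal sketch of the model in its basis-function form may help; the values
xa = 15 and xb = 30 are only the eyeballed breakpoints from the scatter plot and serve as
assumptions here, to be refined in the following steps:
xa <- 15; xb <- 30
pw_basis <- lm(Y ~ X + I((X-xa)*(X>xa)) + I((X-xb)*(X>xb)), data=PW_Illus)
summary(pw_basis)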
Time for action – fitting piecewise linear regression models
A piecewise linear regression model can be easily fitted in R by using the same lm function
and a bit of caution. A loop is used to find the points at which the model is supposed to have
changed its trajectory.
1. Read the dataset into R with data(PW_Illus).
2. For convenience, attach the variables in the PW_Illus object by using
attach(PW_Illus).
3. To be on the safe side, we will select a range of the x values, which may contain either of
the breakpoints:
break1 <- X[which(X>=12 & X<=18)]
break2 <- X[which(X>=27 & X<=33)]
4. Get the number of points that are candidates for being the breakpoints with n1 <-
length(break1) and n2 <- length(break2).
We do not have a clear defining criterion to select one of the n1 or n2 x values to be
the breakpoints. Hence, we will run various linear regression models and select as the
breakpoints that pair of points (xa, xb) which returns the least mean residual
sum of squares. Towards this, we set up a matrix with three columns: the
first two columns hold the potential pairs of breakpoints, and the
third column holds the mean residual sum of squares. The pair of points
that corresponds to the least mean residual sum of squares will be selected as the
best model in the current case.
5. Set up the required matrix, build all the possible regression models with the pairs
of potential breakpoints, and note their mean residual sum of squares through the
following program:
MSE_MAT <- matrix(nrow=(n1*n2), ncol=3)
colnames(MSE_MAT) <- c("Break_1","Break_2","MSE")
curriter <- 0
for(i in 1:n1) {
  for(j in 1:n2) {
    curriter <- curriter + 1
    MSE_MAT[curriter,1] <- break1[i]
    MSE_MAT[curriter,2] <- break2[j]
    # fit a piecewise linear model with the current candidate pair of breakpoints
    piecewise1 <- lm(Y ~ X*(X<break1[i]) + X*(X>=break1[i] & X<break2[j]) +
                       X*(X>=break2[j]))
    # summary(piecewise1)[6] extracts the residual standard error of the fit
    MSE_MAT[curriter,3] <- as.numeric(summary(piecewise1)[6])
  }
}
Note the use of the formula ~ in the specification of the piecewise linear
regression model.
6. The time has arrived to find the pair of breakpoints:
MSE_MAT[which(MSE_MAT[,3]==min(MSE_MAT[,3])),]
The pair of breakpoints is hence (14.000, 30.000). Let us now look at how good
the model fit is!
7. First, reobtain the scatter plot with plot(PW_Illus). Fit the piecewise linear
regression model with breakpoints at (14,30) with
pw_final <- lm(Y ~ X*(X<14) + X*(X>=14 & X<30) + X*(X>=30)).
Add the fitted values to the scatter plot with
points(PW_Illus$X, pw_final$fitted.values, col="red").
Note that the fitted values are a very good reflection of the original data values
(Figure 4 (B)). The fact that linear models can be extended to such different
scenarios makes it very promising to study this in even more detail, as will be
seen in the later part of this section.
Figure 4: Scatterplot of a dataset (A) and the fitted values using piecewise linear regression model (B)
What just happened?
The piecewise linear regression model has been explored for a hypothetical scenario, and we
investigated how to identify breakpoints by using the criterion of the mean residual sum
of squares.
The piecewise linear regression model shows a useful flexibility, and it is indeed a very useful
model when there is a genuine reason to believe that there are certain breakpoints in the
model. This has some advantages and certain limitations too. From a technical perspective,
the model is not continuous, whereas from an applied perspective, the model poses
problems in guessing the breakpoint values and in extending to multi-dimensional cases.
It is thus required to look for a more general framework, where we
need not be bothered about these issues. Some answers are provided in the following sections.
Natural cubic splines and the general B-splines
We will first consider the polynomial regression splines model. As noted in the previous
discussion, we have a lot of discontinuity in the piecewise regression model. In some sense,
"greater continuity" can be achieved by using the cubic functions of x and then constructing
regression splines in what are known as "piecewise cubics", see Berk (2008) Section 2.2.
Suppose that there are K data points at which we require the knots. Suppose that the knots
are located at the points ξ_1, ξ_2, ..., ξ_K, which lie between the boundary points ξ_0 and
ξ_{K+1}, such that ξ_0 < ξ_1 < ξ_2 < ... < ξ_K < ξ_{K+1}. The piecewise cubic polynomial
regression model is given as follows:
Y = β_0 + β_1 X + β_2 X^2 + β_3 X^3 + Σ_{j=1}^{K} θ_j (X - ξ_j)_+^3 + ε
Here, the function (.)_+^3 indicates that only the positive part of the argument is retained
and then cubed; that is:

(X - ξ_j)_+^3 = (X - ξ_j)^3 if X > ξ_j, and 0 otherwise

For this model, the K+4 basis functions are as follows:

h_1(X) = 1, h_2(X) = X, h_3(X) = X^2, h_4(X) = X^3, h_{4+j}(X) = (X - ξ_j)_+^3, j = 1, ..., K
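The truncated cube operation is easy to code directly; a small sketch follows, where the
function name tpb3 is purely illustrative:
tpb3 <- function(x, knot) pmax(x - knot, 0)^3
tpb3(c(5, 6.5, 8, 14), knot=6.5)    # zero at or below the knot, cubed excess above it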
We will now consider an example from Montgomery, et. al. (2005), pages 231-3. It is
known that the battery voltage drop in a guided missile motor behaves differently as a
function of time. The next screenshot displays the scatterplot of the battery voltage drop at
different time points; see ?VD from the RSADBE package. We need to build a piecewise cubic
regression spline for this dataset with knots at time t = 6.5 and t = 13 seconds, since it
is known that the missile changes its course at these points. If we denote the battery voltage
drop by Y and the time by t, the model for this problem is then given as follows:

Y = β_0 + β_1 t + β_2 t^2 + β_3 t^3 + θ_1 (t - 6.5)_+^3 + θ_2 (t - 13)_+^3 + ε

It is not possible within the math scope of this book to look into the details of the
natural cubic spline regression models or the B-spline regression models. However, we can fit
them by using the ns and bs options in the formula of the lm function, along with the knots
at the appropriate places. These models will be built and their fit will be visualized too. Let us
now fit the models!
Time for action – fitting the spline regression models
A natural cubic spline regression model will be fitted for the voltage drop problem.
1. Read the required dataset into R by using data(VD).
2. Invoke the graphics editor by using par(mfrow=c(1,2)).
3. Plot the data and give an appropriate title:
plot(VD)
title(main="Scatter Plot for the Voltage Drop")
4. Build the piecewise cubic polynomial regression model by using the lm function
and related options:
VD_PRS <- lm(Voltage_Drop ~ Time + I(Time^2) + I(Time^3) +
  I(((Time-6.5)^3)*(sign(Time-6.5)==1)) +
  I(((Time-13)^3)*(sign(Time-13)==1)), data=VD)
The sign function returns the sign of a numeric vector as 1, 0, and -1, according to
whether the arguments are positive, zero, or negative respectively. The operator I is
an inhibit interpretation operator, in that the argument will be taken in an as-is
format; check ?I. This operator is especially useful in data.frame and the formula
interface of R.
5. To obtain the fitted plot along with the scatterplot, run the following code:
plot(VD)
points(VD$Time,fitted(VD_PRS),col="red","l")
title("Piecewise Cubic Polynomial Regression Model")
Figure 5: Voltage drop data - scatter plot and a cubic polynomial regression model
6. Obtain the details of the fitted model with summary(VD_PRS).
The R output is given in the next screenshot. The summary output shows that each
of the basis functions is indeed significant here.
Figure 6: Details of the fitted piecewise cubic polynomial regression model
7. Fit the natural cubic spline regression model using the ns option from the splines
package (load it first with library(splines)):
VD_NCS <- lm(Voltage_Drop ~ ns(Time, knots=c(6.5,13), intercept=TRUE), data=VD)
8. Obtain the fitted plot as follows:
par(mfrow=c(1,2))
plot(VD)
points(VD$Time,fitted(VD_NCS),col="green","l")
title("Natural Cubic Regression Model")
9. Obtain the details related to VD_NCS with the summary function
summary(VD_NCS); see Figure 8: Details of the natural cubic and B-spline regression models.
10. Fit the B-spline regression model by using the bs option:
VD_BS <- lm(Voltage_Drop ~ bs(Time, knots=c(6.5,13), intercept=TRUE, degree=3),
  data=VD)
11. Obtain the fitted plot for VD_BS with the R program:
plot(VD)
points(VD$Time,fitted(VD_BS),col="brown","l")
title("B-Spline Regression Model")
Figure 7: Natural Cubic and B-Spline Regression Modeling
12. Finally, get the details of the fitted B-spline regression model by using
summary(VD_BS).
The main purpose of the B-spline regression model is to illustrate that the splines
are smooth at the boundary points in contrast with the natural cubic regression
model. This can be clearly seen in Figure 8: Details of the natural cubic and B-spline
regression models.
Both the models, VD_NCS and VD_BS, have good summary statistics and have really
modeled the data well.
Figure 8: Details of the natural cubic and B-spline regression models
What just happened?
We began with the fitting of a piecewise polynomial regression model and then had a
look at the natural cubic spline regression and B-spline regression models. All three
models provide a very good fit to the actual data. Thus, with a good guess or experimental/
theoretical evidence, the linear regression model can be extended in an effective way.
Ridge regression for linear models
In Figure 3: Regression coefficients of polynomial regression models, we saw that the
magnitude of the regression coefficients increases in a drastic manner as the polynomial
degree increases. The right tweaking of the linear regression model, as seen in the previous
section, gives us the right results. However, the models considered in the previous section
had just one covariate, and the problem of identifying the knots in the multiple regression
model becomes an overly complex issue. That is, if we have a problem where there are
large numbers of covariates, naturally there may be some dependency amongst them,
which cannot be investigated for certain reasons. In such problems, it may happen that
certain covariates dominate other covariates in terms of the magnitude of their regression
coefficients, and this may mar the overall usefulness of the model. Furthermore, even in
the univariate case, we have the problem that the choice of the number of knots, their
placements, and the polynomial degree may be manipulated by the analyst. We have an
alternative to this problem in the way we minimize the residual sum of squares,
min_β Σ_{i=1}^{n} e_i^2.
The least-squares solution leads to an estimator of β:

β̂ = (X'X)^{-1} X'Y
We saw in Chapter 6, Linear Regression Analysis, how to guard ourselves against outliers,
the measures of model fit, and model selection techniques. However, these methods come
into action after the construction of the model, and hence, though they offer a certain
protection against the problem of overfitting, we need more robust methods. The question
that arises is: can we guard ourselves against overfitting when building the model itself? This
would go a long way in addressing the problem. The answer is an affirmative, and we
will check out this technique.
The least-squares solution is the optimal solution when we have the squared loss function.
The idea then is to modify this loss function by incorporating a penalty term, which will give
us additional protection against the overfitting problem. Mathematically, we add a penalty
term for the size of the regression coefficients; in fact, the penalty ensures that the sum of
squares of the regression coefficients is also kept small. Formally, the goal is to
obtain an optimal solution of the following problem:
min_β Σ_{i=1}^{n} e_i^2 + λ Σ_{j=1}^{p} β_j^2
Here, λ > 0 is the control factor, also known as the tuning parameter, and Σ_{j=1}^{p} β_j^2
is the penalty. If the λ value is zero, we get the earlier least-squares solution. Note that the
intercept has been deliberately kept out of the penalty term! Now, for large values of
Σ_{j=1}^{p} β_j^2, the penalized criterion will be large. Thus, loosely speaking, to minimize
Σ_{i=1}^{n} e_i^2 + λ Σ_{j=1}^{p} β_j^2, we will also require Σ_{j=1}^{p} β_j^2 to be small.
The optimal solution for the preceding minimization problem is given as follows:
β̂_Ridge = (X'X + λI)^{-1} X'Y
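The closed-form solution can be verified directly with a few matrix operations; the following
is only a sketch on simulated data and does not reproduce the centering and scaling
conventions used internally by lm.ridge or linearRidge:
set.seed(123)
X <- cbind(1, matrix(rnorm(50*3), ncol=3))    # an intercept column plus 3 covariates
y <- X %*% c(1, 2, -1, 0.5) + rnorm(50)
lambda <- 2
penalty <- lambda * diag(c(0, rep(1, ncol(X)-1)))   # the intercept is left unpenalized
beta_ridge <- solve(t(X) %*% X + penalty, t(X) %*% y)
beta_ridge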
The choice of λ is a critical one. There are multiple options to obtain it:
Find the value of λ by using the cross-validation technique (discussed in the last
section of this chapter)
Find the value of λ by the semi-automated method described at
http://arxiv.org/pdf/1205.0686.pdf
For the first technique, we can use the function lm.ridge from the MASS package, and
the second method of semi-automatic detection can be obtained from the linearRidge
function of the ridge package.
In the following R session, we use the functions lm.ridge and linearRidge.
Time for action – ridge regression for the linear regression
model
The linearRidge function from the ridge package and lm.ridge from the MASS package are
two good options for developing ridge regression models.
1. Though the OF object may still be there in your session, let us again load it by using
data(OF).
2. Load the MASS and ridge packages by using library(MASS); library(ridge).
3. For a polynomial regression model of degree 3 and various values of lambda,
including 0, 0.5, 1, 1.5, 2, 5, 10, and 30, obtain the ridge regression coefficients
with the following single line of R code:
LR <- linearRidge(Y~poly(X,3), data=as.data.frame(OF),
  lambda=c(0,0.5,1,1.5,2,5,10,30))
LR
The function linearRidge from the ridge package performs ridge regression
for a linear model. The lambda argument may either be a scalar or a vector. In the case
of a scalar lambda, linearRidge simply returns the corresponding set of (ridge) regression
coefficients. If it is a vector, it returns the related sets of regression coefficients.
4. Compute the value of Σ_{j=1}^{p} β_j^2 for different lambda values:
LR_Coef <- LR$coef
colSums(LR_Coef^2)
Note that as the lambda value increases, the value of Σ_{j=1}^{p} β_j^2 decreases. However,
this is not to say that a higher lambda value is preferable, since the sum Σ_{j=1}^{p} β_j^2
will decrease to 0, and eventually none of the variables will have significant
explanatory power about the output. The choice of the lambda value
will be discussed in the last section.
5. The linearRidge function also finds the "best" lambda value:
linearRidge(Y~poly(X,3), data=as.data.frame(OF), lambda="automatic")
6. Fetch the details of the "best" ridge regression model with the following line
of code:
summary(linearRidge(Y~poly(X,3), data=as.data.frame(OF), lambda="automatic"))
The summary shows that the value of lambda is chosen at 0.07881, and that it
used three PCs. Now, what is a PC? PC is an abbreviation of principal component,
and unfortunately we can't really go into the details of this aspect. Enthusiastic
readers may refer to Chapter 17 of Tattar, et. al. (2013). Compare these results with
those in the first section.
7. For the same choice of different lambda values, use the lm.ridge function from
the MASS package:
LM <- lm.ridge(Y~poly(X,3), data=as.data.frame(OF),
  lambda=c(0,0.5,1,1.5,2,5,10,30))
LM
8. The lm.ridge function obviously works a bit differently from linearRidge. The
results are given in the next image. Comparison of the results is left as an exercise
to the reader. As with the linearRidge model, let us compute the value of
Σ_{j=1}^{p} β_j^2 for the lm.ridge fitted models too.
9. Use the colSums function to get the required result:
LM_Coef <- LM$coef
colSums(LM_Coef^2)
Figure 09: A first look at the linear ridge regression
So far, we are still working with a single covariate only. However, we need to consider the
multiple linear regression models and see how ridge regression helps us. To do this, we will
return to the gasoline mileage problem considered in Chapter 6, Linear Regression Analysis.
1. Read the Gasoline data into R by using data(Gasoline).
2. Fit the ridge regression model (and the multiple linear regression model again)
for the mileage as a function of the other variables:
gasoline_lm <- lm(y~., data=Gasoline)
gasoline_rlm <- linearRidge(y~., data=Gasoline, lambda="automatic")
3. Compare the lm coefficients with the linearRidge coefficients:
sum(coef(gasoline_lm)[-1]^2)-sum(coef(gasoline_rlm)[-1]^2)
4. Look at the summary of the fitted ridge linear regression model by using
summary(gasoline_rlm).
5. The difference between the sums of squares of the regression coefficients for the
linear and ridge linear models is indeed very large. Furthermore, the gasoline_rlm
details reveal that there are four variables that have significant explanatory
power for the mileage of the car. Note that the gasoline_lm model had only one
significant variable for the car's mileage. The output is given in the following figure:
Figure 10: Ridge regression for the gasoline mileage problem
What just happened?
We made use of two functions, namely lm.ridge and linearRidge, for fitting ridge
regression models for the linear regression model. It is observed that the ridge regression
models may sometimes reveal more significant variables.
In the next section, we will consider the ridge penalty for the logistic regression model.
Ridge regression for logistic regression models
We will not be able to go into the math of ridge regression for the logistic regression
model, though we will happily make good use of the logisticRidge function from the
ridge package to illustrate how to build the ridge regression for the logistic regression model.
For more details, we refer to the research paper of Cule and De Iorio (2012) available at
http://arxiv.org/pdf/1205.0686.pdf. In the previous section, we saw that gasoline_rlm
found more significant variables than gasoline_lm. Now, in Chapter 7, The Logistic Regression
Model, we fit a logistic regression model for the German credit data problem in GC_LR. The
question that arises is: if we obtain a ridge regression model of the related logistic regression
model, say GC_RLR, can we expect to find more significant variables?
Time for action – ridge regression for the logistic regression
model
We will use the logisticRidge function from the ridge package to fit the ridge
regression, and check if we can obtain more significant variables.
1. Load the German credit dataset with data(German).
2. Use the logisticRidge function to obtain GC_RLR; a small manipulation is required
here, as in the following line of code:
GC_RLR <- logisticRidge(as.numeric(good_bad)-1~., data=as.data.frame(GC),
  lambda="automatic")
3. Obtain the summaries of GC_LR and GC_RLR by using summary(GC_LR) and
summary(GC_RLR).
The detailed summary output is given in the following screenshot:
Figure 11: Ridge regression with the logistic regression model
It can be seen that the ridge regression model offers a very slight improvement over the
standard logistic regression model.
What just happened?
The ridge regression concept has been applied to the important family of logistic
regression models. Although in the case of the German credit data problem we found only a
slight improvement in the identification of the significant variables, it is vital that we always
be on the lookout to fit better models, for instance models less sensitive to outliers, and the
logisticRidge function appears to be a good alternative to the glm function.
Another look at model assessment
In the previous two sections, we used the automatic option for obtaining the optimum
λ values, as discussed in the work of Cule and De Iorio (2012). There is also an iterative
technique for finding the penalty factor λ. This technique is especially useful when we do not
have sufficient well-developed theory for regression models beyond the linear and logistic
regression models. Neural networks, support vector machines, and so on, are some very useful
regression models where the theory may not have been well developed, at least to the
best known practice of the author. Hence, we will use the iterative method in this section.
For both the linearRidge and lm.ridge fitted models in the Ridge regression for linear
models section, we saw that for an increasing value of λ, the sum of squares of regression
coefficients, Σ_{j=1}^{p} β_j^2, decreases. The question then is how to select the "best" λ
value. A popular technique in the data mining community is to split the dataset into three
parts, namely the Train, Validate, and Test parts. There are no definitive answers for what
the split percentages for the three parts should be, and a common practice is to split them
in either 60:20:20 or 50:25:25 proportions. Let us now understand this process:
Training dataset: The models are built on the data available in this data part.
Validation dataset: For this part of the data, we pretend as though we do not know
the output values and make predictions based upon the covariate values. This step
is to ensure that overfitting is minimized. The errors (residual sums of squares for a
regression model and accuracy percentages for a classification model) are then compared
with the counterpart errors in the training part. If the errors decrease in the
training set while they remain the same for the validation part, it means that we are
overfitting the data. A threshold, after which this is observed, may be chosen as the
better lambda value.
Testing dataset: In practice, these are really unobserved cases for which the model
is applied for forecasting purposes.
For the gasoline mileage problem, we will split the data into three parts and use the training
and validation parts to select the λ value.
Time for action – selecting lambda iteratively and other topics
Iterative selection of the penalty parameter for ridge regression will be covered in this
section. The useful framework of train + validate + test will also be considered for the
German credit data problem.
1. For the sake of simplicity, we will remove the character variable of the dataset
by using Gasoline <- Gasoline[,-12].
2. Set the random seed by using set.seed(1234567). This step is to ensure that the
user can validate the results of the program.
3. Randomize the observations to enable the splitting part:
data_part_label <- c("Train","Validate","Test")
indv_label <- sample(data_part_label, size=nrow(Gasoline), replace=TRUE,
  prob=c(0.6,0.2,0.2))
4. Now, split the gasoline dataset:
G_Train <- Gasoline[indv_label=="Train",]
G_Validate <- Gasoline[indv_label=="Validate",]
G_Test <- Gasoline[indv_label=="Test",]
5. Define the λ vector with lambda <- seq(0,10,0.2).
6. Initialize the training and validation errors:
Train_Errors <- vector("numeric",length=length(lambda))
Val_Errors <- vector("numeric",length=length(lambda))
7. Run the following loop to get the required errors:
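The loop itself is not reproduced here, so the following is only a possible sketch; it assumes
that predict works on linearRidge fits in the usual way and is not necessarily the original
program:
for(i in 1:length(lambda)) {
  # fit a ridge model on the training part for the current lambda value
  G_rlm <- linearRidge(y ~ ., data=G_Train, lambda=lambda[i])
  # total squared errors on the training and validation parts
  Train_Errors[i] <- sum((G_Train$y - predict(G_rlm, newdata=G_Train))^2)
  Val_Errors[i] <- sum((G_Validate$y - predict(G_rlm, newdata=G_Validate))^2)
}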
8. Plot the training and validation errors:
plot(lambda, Val_Errors, "l", col="red", xlab=expression(lambda),
  ylab="Training and Validation Errors", ylim=c(0,600))
points(lambda, Train_Errors, "l", col="green")
legend(6, 500, c("Training Errors","Validation Errors"),
  col=c("green","red"), pch="-")
The final output will be the following:
Figure 12: Training and validation errors
The preceding plot suggests that the lambda value is between 0.5 and 1.5. Why?
The technique of train + validate + test is not simply restricted to selecting the
lambda value. In fact, for any regression/classification model, we can try to
understand whether the selected model really generalizes or not. For the German credit
data problem in the previous chapter, we will make an attempt to see what the
current technique suggests.
9. The program and its output (ROC curves) are displayed following it.
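The program itself is not reproduced here; a possible sketch, assuming the GC data frame
with the good_bad indicator from Chapter 7 and the ROCR package used there, is the
following (the helper pred_roc is illustrative only):
library(ROCR)
GC_label <- sample(data_part_label, size=nrow(GC), replace=TRUE, prob=c(0.6,0.2,0.2))
GC_Train <- GC[GC_label=="Train",]
GC_Validate <- GC[GC_label=="Validate",]
GC_Test <- GC[GC_label=="Test",]
GC_glm <- glm(good_bad~., data=GC_Train, family=binomial())
# ROC curve of the same fitted model evaluated on a given data part
pred_roc <- function(newdata) {
  performance(prediction(predict(GC_glm, newdata=newdata, type="response"),
    newdata$good_bad), "tpr", "fpr")
}
plot(pred_roc(GC_Train), col="green")
plot(pred_roc(GC_Validate), col="red", add=TRUE)
plot(pred_roc(GC_Test), col="blue", add=TRUE)
legend("bottomright", c("Train","Validate","Test"), col=c("green","red","blue"), lty=1)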
10. The ROC plot is given in the following screenshot:
Figure 13: ROC plot for the train + validate + test partition of the German data
We will close the chapter with a short discussion. In the train + validate + test
partitioning, we had one technique of avoiding overfitting. A generalization of this
technique is the well-known cross-validation method. In an n-fold cross-validation
approach, the data is randomly partitioned into n divisions. In the first step, the
first part is held out for validation, the model is built using the remaining n-1 parts,
and the accuracy percentage is calculated. Next, the second part is treated as the
validation dataset and the remaining parts 1, 3, ..., n are used to build
the model, which is then tested for accuracy on the second part. This process is then
repeated for the remaining n-2 parts. Finally, an overall accuracy metric is reported.
On the surface, this process is complex enough, and hence we will resort to the
well-defined functions available in the DAAG package.
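As a minimal sketch of the fold assignment idea (CVlm, used in the next step, does this
internally), assuming the VD data frame is still in the session:
n_folds <- 10
fold_id <- sample(rep(1:n_folds, length.out=nrow(VD)))
table(fold_id)    # roughly equal-sized folds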
11. As the cross-validation function itself carries out the n-fold partitioning, we build it
over the entire dataset:
library(DAAG)
data(VD)
CVlm(df=VD, form.lm=formula(Voltage_Drop~Time+I(Time^2)+I(Time^3)+
  I(((Time-6.5)^3)*(sign(Time-6.5)==1))+
  I(((Time-13)^3)*(sign(Time-13)==1))), m=10, plotit="Observed")
The VD data frame has 41 observations, and the output in Figure 14: Cross-
validation for the voltage-drop problem shows that the 10-fold cross-validation has
10 partitions, with fold 2 containing five observations and the rest of them having
four each. Now, for each fold, the cubic polynomial regression model is fit
by using the data in the remaining folds:
Figure 14: Cross-validation for the voltage-drop problem
Using the fitted polynomial regression model, a prediction is made for the units in
the fold. The observed versus predicted regressand values plot is given in Figure
15: Predicted versus observed plot using the cross-validation technique. A close
examination of the numerical predicted values and the plot indicates that we have a
very good model for the voltage drop phenomenon.
The generalized cross-validation (GCV) errors are also given with the details of a
lm.ridge fitted model. We can use this information to arrive at a better value of λ for
the ridge regression models:
Figure 15: Predicted versus observed plot using the cross-validation technique
12. For the OF and G_Train data frames, use the lm.ridge function to obtain the
GCV errors:
> LM_OF <- lm.ridge(Y~poly(X,3),data=as.data.frame(OF),
+ lambda=c(0,0.5,1,1.5,2,5,10,30))
> LM_OF$GCV
0.0 0.5 1.0 1.5 2.0 5.0 10.0 30.0
5.19 5.05 5.03 5.09 5.21 6.38 8.31 12.07
> LM_GT <- lm.ridge(y~.,data=G_Train,lambda=seq(0,10,0.2))
> LM_GT$GCV
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8
1.777 0.798 0.869 0.889 0.891 0.886 0.877 0.868 0.858 0.848
2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
0.838 0.830 0.821 0.813 0.806 0.798 0.792 0.786 0.780 0.774
4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8
0.769 0.764 0.760 0.755 0.751 0.748 0.744 0.740 0.737 0.734
6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8
0.731 0.729 0.726 0.723 0.721 0.719 0.717 0.715 0.713 0.711
8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
0.710 0.708 0.707 0.705 0.704 0.703 0.701 0.700 0.699 0.698
10.0
0.697
For the OF data frame, the optimal λ value appears to lie in the interval (1.0, 1.5).
On the other hand, for the G_Train data frame, the value appears to lie in (0.2, 0.4).
What just happened?
The choice of the penalty factor is indeed very crucial for the success of a ridge regression
model, and we saw different methods for obtaining it. These included the automatic
choice of Cule and De Iorio (2012) and the cross-validation technique. Furthermore, we also
saw the application of the popular train + validate + test approach. In practical applications,
these methodologies will go a long way towards obtaining the best models.
Pop quiz
What do you expect the results to be if you perform the model selection task with the step
function on a polynomial regression model? That is, you are trying to select the variables for
the polynomial model lm(Y~poly(X,9,raw=TRUE),data=OF), or say VD_PRS. Verify your
intuition by completing the R programs.
Summary
In this chapter, we began with a hypothetical dataset and highlighted the problem of
overfitting. In the presence of breakpoints, also known as knots, the extensions of the linear
model to the piecewise linear regression model and the spline regression model were found
to be very useful enhancements. The problem of overfitting can also sometimes be overcome
by using ridge regression. The ridge regression solution has been extended to the linear
and logistic regression models. Finally, we saw a different approach to model assessment
by using the train + validate + test approach and the cross-validation approach. In spite of
these developments, where we have intrinsically non-linear data it becomes difficult for the
models discussed in this chapter to emerge as useful solutions. The past two decades have
witnessed a powerful alternative in the so-called Classification and Regression Trees (CART).
The next chapter discusses CART in greater depth, and the final chapter considers modern
developments related to it.
9
Classification and Regression Trees
In the previous chapters, we focused on regression models, and the majority
of the emphasis was on the linearity assumption. Although the natural next
extension would appear to be non-linear models, we will instead turn to recursive
partitioning techniques, which are a bit more flexible than the non-linear
generalizations of the models considered in the earlier chapters. Of course, the
recursive partitioning techniques, in most cases, may be viewed as non-linear
models.
We will first introduce the notion of recursive partitions through a hypothetical dataset.
It will become apparent that the earlier approach of the linear models changes in an entirely
different way with the functioning of the recursive partitions. Recursive partitioning depends
upon the type of problem we have in hand. We develop a regression tree for the regression
problem when the output is a continuous variable, as in the linear models. If the output is
a binary variable, we develop a classification tree. A regression tree is first created by using
the rpart function from the rpart package. A very raw R program is created, which clearly
explains the unfolding of a regression tree. A similar effort is repeated for the classification
tree. In the final section of this chapter, a classification tree is created for the German credit
data problem along with the use of ROC curves for understanding the model performance.
The approach in this chapter will be on the following lines:
Understanding the basis of recursive partitions and the general CART
Construction of a regression tree
Construction of a classification tree
Application of a classification tree to the German credit data problem
The finer aspects of CART
Recursive partitions
The name of the library package rpart, shipped along with R, stands for Recursive
Partitioning. The package was first created by Terry M Therneau and Beth Atkinson,
and is currently maintained by Brian Ripley. We will first have a peek at what
recursive partitions are.
A complex and contrived relationship is generally not identifiable by linear models. In the
previous chapter, we saw the extensions of the linear models in piecewise, polynomial,
and spline regression models. It is also well known that if the order of a model is larger
than 4, then the interpretation and usability of the model become more difficult. We consider
a hypothetical dataset, where we have two classes for the output Y and two explanatory
variables in X1 and X2. The two classes are indicated by filled-in green circles and red
squares. First, we will focus only on the left display of Figure 1: A complex classification
dataset with partitions, as it is the actual depiction of the data. At the outset, it is clear that
a linear model is not appropriate, as there is quite an overlap of the green and red indicators.
Now, there is a clear demarcation of the classification problem according to whether X1 is
greater than 6 or not. In the area on the left side of X1=6, the middle third region contains a
majority of green circles and the rest are red squares. The red squares are predominantly
identifiable according to whether the X2 values are either lesser than or equal to 3 or greater
than 6. The green circles are the majority values in the region of X2 being greater than 3 and
lesser than 6. A similar story can be built for the points on the right side of X1 greater than 6.
Here, we first partitioned the data according to the X1 values, and then in each of the
partitioned regions, we obtained partitions according to the X2 values. This is the act of
recursive partitioning.
Figure 1: A complex classification dataset with partitions
Let us obtain the preceding plot in R.
Time for action – partitioning the display plot
We first visualize the CART_Dummy dataset and then look in the next subsection at how CART
gets at the patterns, which are believed to exist in the data.
1. Obtain the dataset CART_Dummy from the RSADBE package by using
data(CART_Dummy).
2. Convert the binary output Y into a factor variable, and attach the data frame
with CART_Dummy$Y <- as.factor(CART_Dummy$Y).
attach(CART_Dummy)
In Figure 1: A complex classification dataset with partitions, the red squares
refer to 0 and the green circles to 1.
3. Initialize the graphics window for the two plots by using
par(mfrow=c(1,2)).
4. Create a blank scatter plot:
plot(c(0,12),c(0,10),type="n",xlab="X1",ylab="X2")
5. Plot the green circles and red squares:
points(X1[Y==0],X2[Y==0],pch=15,col="red")
points(X1[Y==1],X2[Y==1],pch=19,col="green")
title(main="A Difficult Classification Problem")
6. Repeat the previous two steps to obtain the identical plot on the right side of the
graphics window.
7. First, partition according to the X1 values by using abline(v=6,lwd=2).
8. Add segments on the graph with the segments function:
segments(x0=c(0,0,6,6), y0=c(3.75,6.25,2.25,5), x1=c(6,6,12,12),
  y1=c(3.75,6.25,2.25,5), lwd=2)
title(main="Looks a Solvable Problem Under Partitions")
What just happened?
A complex problem is simplified through partitioning! A more generic function, segments,
has nicely slipped into our program, which you may use in many other scenarios.
Now, this approach of recursive partitioning is not feasible all the time! Why? We seldom
deal with just two or three explanatory variables and as few data points as in the preceding
hypothetical example. The question is how one creates a recursive partitioning of the dataset.
Breiman, et. al. (1984) and Quinlan (1988) invented tree building algorithms, and we
will follow the Breiman, et. al. approach in the rest of the book. The CART discussion in this
book is heavily influenced by Berk (2008).
Splitting the data
In the earlier discussion, we saw that partitioning the dataset can help a lot in reducing
the noise in the data. The question is how does one begin with it? The explanatory
variables can be discrete or continuous. We will begin with the continuous (numeric
objects in R) variables.
For a continuous variable, the task is a bit simpler. First, identify the unique distinct values
of the numeric object. Let us say, for example, that the distinct values of a numeric object,
say height in cms, are 160, 165, 170, 175, and 180. The data partitions are then obtained
as follows:
data[Height<=160,], data[Height>160,]
data[Height<=165,], data[Height>165,]
data[Height<=170,], data[Height>170,]
data[Height<=175,], data[Height>175,]
The reader should try to understand the rationale behind the code; certainly, this is just
an indicative one.
Now, we consider the discrete variables. Here, we have two types of variables, namely
categorical and ordinal. In the case of ordinal variables, we have an order among the distinct
values. For example, in the case of an economic status variable, the order may be among
the classes Very Poor, Poor, Average, Rich, and Very Rich. Here, the splits are similar to the
case of a continuous variable, and if there are m distinct ordered values, we consider m-1
distinct splits of the overall data. In the case of a categorical variable with m categories, for
example the departments A to F of the UCBAdmissions dataset, the number of possible splits
becomes 2^(m-1) - 1. However, the benefit of using software like R is that we do not have to
worry about these issues.
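A quick numerical check of these counts, using m = 6 as a hypothetical example:
m <- 6          # for example, the departments A to F of the UCBAdmissions dataset
m - 1           # ordered or continuous variable: 5 candidate splits
2^(m-1) - 1     # unordered categorical variable: 31 candidate splits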
The first tree
In the CART_Dummy dataset, we can easily visualize the partitions for Y as a function of the
inputs X1 and X2. Obviously, we have a classification problem, and hence we will build a
classification tree.
Time for action – building our first tree
The rpart function from the library rpart will be used to obtain the first classification tree.
The tree will be visualized by using the plot options of rpart, and we will follow this up
by extracting the rules of the tree using the asRules function from the rattle package.
1. Load the rpart package by using library(rpart).
2. Create the classification tree with
CART_Dummy_rpart <- rpart(Y~X1+X2, data=CART_Dummy).
3. Visualize the tree with appropriate text labels by using
plot(CART_Dummy_rpart); text(CART_Dummy_rpart).
Figure 2: A classification tree for the dummy dataset
Now, the classification tree flows as follows. Obviously, the tree built by the rpart
function does not partition as simply as we did in Figure 1: A complex classification
dataset with partitions; its inner working will be dealt with in the third section of
this chapter. First, we check if the value of the second variable X2 is less than 4.875.
If the answer is an affirmation, we move to the left side of the tree, and to the right side
in the other case. Let us move to the right side. A second question asked is whether X1
is lesser than 4.5 or not; if the answer is yes, the observation is identified as a red square,
and otherwise a green circle. You are now asked to interpret the left side of the first
node. Let us look at the summary of CART_Dummy_rpart.
4. Apply the summary, an S3 method, for the classification tree with
summary(CART_Dummy_rpart).
That one is a lot of output!
Figure 3: Summary of a classification tree
Our interest is in the nodes numbered 5 to 9! Why? The terminal nodes, of course!
A terminal node is one in which we can't split the data any further, and for the
classification problem, we arrive at a class assignment as the class that has a majority
count at the node. The summary shows that there are indeed some misclassifications
too. Now, wouldn't it be great if R gave the terminal nodes as rules? The function
asRules from the rattle package extracts the rules from an rpart object.
Let's do it!
5. Invoke the rattle package with library(rattle), and using the asRules function,
extract the rules from the terminal nodes with asRules(CART_Dummy_rpart).
The result is the following set of rules:
Figure 4: Extracting "rules" from a tree!
We can see that the classification tree is not according to our bird's eye partitioning.
However, as a final aspect of our initial understanding, let us plot the segments
in the naïve way. That is, we will partition the data display according to the
terminal nodes of the CART_Dummy_rpart tree.
6. The R code is given right away, though you should make an effort to find the logic
behind it. Of course, it is very likely that by now you need to run some of the earlier
code that was given previously.
abline(h=4.875,lwd=2)
segments(x0=4.5,y0=4.875,x1=4.5,y1=10,lwd=2)
abline(h=1.75,lwd=2)
segments(x0=3.5,y0=1.75,x1=3.5,y1=4.875,lwd=2)
title(main="Classification Tree on the Data Display")
It can be easily seen from the following that rpart works really well:
Figure 5: The terminal nodes on the original display of the data
What just happened?
We obtained our first classification tree, which is a good thing. Given the actual data display,
the classification tree gives satisfactory answers.
We have understood the "how" part of a classification tree. The "why" aspect is very
vital in science, and the next section explains the science behind the construction of
a regression tree; it will be followed later by a detailed explanation of the working
of a classification tree.
The construction of a regression tree
In the CART_Dummy dataset, the output is a categorical variable, and we built a classification
tree for it. In Chapter 6, Linear Regression Analysis, the linear regression models were built
for a continuous random variable, while in Chapter 7, The Logistic Regression Model, we built
a logistic regression model for a binary random variable. The same distinction is required in
CART, and we thus build classification trees for binary random variables, whereas regression
trees are for continuous random variables. Recall the rationale behind the estimation
of regression coefficients for the linear regression model. The main goal was to find the
estimates of the regression coefficients, which minimize the error sum of squares between
the actual regressand values and the fitted values. A similar approach is followed here, in the
sense that we need to split the data at the points that keep the residual sum of squares to
a minimum. That is, for each unique value of a predictor, which is a candidate for the node
value, we find the sum of squares of the y's within each partition of the data, and then add
them up. This step is performed for each unique value of the predictor, and the value which
leads to the least sum of squares among all the candidates is selected as the best split point
for that predictor. In the next step, we find the best split points for each of the predictors,
and then the best split is selected across the best split points of the predictors. Easy!
Now, the data is partitioned into two parts according to the best split. The process of finding
the best split within each partition is repeated in the same spirit as for the first split. This
process is carried out in a recursive fashion until the data can't be partitioned any further.
What is happening here? The residual sum of squares at each child node will be lesser than
that in the parent node.
At the outset, we note that the rpart function does exactly the same thing. However, as a
part of a cleaner understanding of the regression tree, we will write raw R code and ensure
that there is no ambiguity in the process of understanding CART. We will begin with a simple
example of a regression tree, and use the rpart function to plot the regression tree.
Then, we will first define a function, which will extract the best split given the covariate
and dependent variable. This action will be repeated for all the available covariates, and
then we find the best overall split. This will be verified with the regression tree. The data will
then be partitioned by using the best overall split, and then the best split will be identified
for each of the partitioned datasets. The process will be repeated until we reach the end of
the complete regression tree given by rpart. First, the experiment!
The cpus dataset available in the MASS package contains the relative performance measure
of 209 CPUs in the perf variable. It is known that the performance of a CPU depends
on factors such as the cycle time in nanoseconds (syct), the minimum and maximum main
memory in kilobytes (mmin and mmax), the cache size in kilobytes (cach), and the minimum
and maximum number of channels (chmin and chmax). The task at hand is to model perf
as a function of syct, mmin, mmax, cach, chmin, and chmax. The histogram of perf (try
hist(cpus$perf)) will show a highly skewed distribution, and hence we will build a
regression tree for the logarithmic transformation log10(perf).
Time for action – the construction of a regression tree
A regression tree is first built by using the rpart function. The getNode function is then
introduced, which helps in identifying the split node at each stage; using it we build
a regression tree and verify that we obtain the same tree as returned by the rpart function.
1. Load the MASS library by using library(MASS).
2. Create the regression tree for the logarithm (to the base 10) of perf as a function
of the covariates explained earlier, and display the regression tree:
cpus.ltrpart <- rpart(log10(perf)~syct+mmin+mmax+cach+chmin+chmax,
  data=cpus)
plot(cpus.ltrpart); text(cpus.ltrpart)
The regression tree will be displayed as follows:
Figure 6: Regression tree for the "perf" of a CPU
We will now define the getNode function. Given the regressand and the
covariate, we need to find the best split in the sense of the sum of squares criterion.
The evaluation needs to be done for every distinct value of the covariate. If there are
m distinct points, we need m-1 evaluations. At each distinct point, the regressand
needs to be partitioned accordingly, and the sum of squares should be obtained for
each partition. The two sums of squares (one in each part) are then added to obtain the
reduced sum of squares. Thus, we create the required function to meet all
these requirements.
3. Create the getNode function in R by running the following code:
getNode <- function(x,y) {
  xu <- sort(unique(x),decreasing=TRUE)
  ss <- numeric(length(xu)-1)
  for(i in 1:length(ss)) {
    partR <- y[x>xu[i]]
    partL <- y[x<=xu[i]]
    partRSS <- sum((partR-mean(partR))^2)
    partLSS <- sum((partL-mean(partL))^2)
    ss[i] <- partRSS + partLSS
  }
  return(list(xnode=xu[which(ss==min(ss,na.rm=TRUE))],
    minss=min(ss,na.rm=TRUE),ss,xu))
}
The getNode function gives the best split for a given covariate. It returns a list
consisting of four objects:
xnode, which is a datum of the covariate x that gives the minimum residual
sum of squares for the regressand y
The value of the minimum residual sum of squares
The vector of the residual sum of squares for the distinct points of the
vector x
The vector of the distinct x values
We will run this function for each of the six covariates, and find the best overall split.
The argument na.rm=TRUE is required, as at the maximum value of x we won't get
a numeric value.
4. We will first execute the getNode function on the syct covariate, and look at the
output we get as a result:
> getNode(cpus$syct,log10(cpus$perf))$xnode
[1] 48
> getNode(cpus$syct,log10(cpus$perf))$minss
[1] 24.72
> getNode(cpus$syct,log10(cpus$perf))[[3]]
[1] 43.12 42.42 41.23 39.93 39.44 37.54 37.23 36.87 36.51 36.52
35.92 34.91
[13] 34.96 35.10 35.03 33.65 33.28 33.49 33.23 32.75 32.96 31.59
31.26 30.86
[25] 30.83 30.62 29.85 30.90 31.15 31.51 31.40 31.50 31.23 30.41
30.55 28.98
[37] 27.68 27.55 27.44 26.80 25.98 27.45 28.05 28.11 28.66 29.11
29.81 30.67
[49] 28.22 28.50 24.72 25.22 26.37 28.28 29.10 33.02 34.39 39.05
39.29
> getNode(cpus$syct,log10(cpus$perf))[[4]]
[1] 1500 1100 900 810 800 700 600 480 400 350 330 320
300 250 240
[16] 225 220 203 200 185 180 175 167 160 150 143 140
133 125 124
[31] 116 115 112 110 105 100 98 92 90 84 75 72
70 64 60
[46] 59 57 56 52 50 48 40 38 35 30 29 26
25 23 17
The least sum of squares at a split for the best split value of the syct variable is
24.72, and it occurs at a value of syct greater than 48. The third and fourth list
objects given by getNode, respectively, contain the details of the sum of squares for
the potential candidates and the unique values of syct. The values of interest are
highlighted. Thus, we will first look at the second object from the list output for all
six covariates to find the best split among the best splits of each of the variables,
by the residual sum of squares criterion.
5. Now, run the getNode function for the remaining five covariates:
getNode(cpus$syct,log10(cpus$perf))[[2]]
getNode(cpus$mmin,log10(cpus$perf))[[2]]
getNode(cpus$mmax,log10(cpus$perf))[[2]]
getNode(cpus$cach,log10(cpus$perf))[[2]]
getNode(cpus$chmin,log10(cpus$perf))[[2]]
getNode(cpus$chmax,log10(cpus$perf))[[2]]
getNode(cpus$cach,log10(cpus$perf))[[1]]
sort(getNode(cpus$cach,log10(cpus$perf))[[4]],decreasing=FALSE)
The output is as follows:
Figure 7: Obtaining the best "first split" of regression tree
The sum of squares for cach is the lowest, and hence we need to find the best
split associated with it, which is 24. However, the regression tree shows that the
best split is for the cach value of 27. The getNode function says that the best split
occurs at a point greater than 24, and hence we take the average of 24 and the next
unique point, 30. Having obtained the best overall split, we next obtain the first
partition of the dataset.
6. Partition the data by using the best overall split point:
cpus_FS_R <- cpus[cpus$cach>=27,]
cpus_FS_L <- cpus[cpus$cach<27,]
The new names of the data objects are clear, with _FS_R indicating the dataset
obtained on the right side of the first split, and _FS_L indicating the left side.
In the rest of the section, the nomenclature won't be explained further.
7. Identify the best split in each of the partitioned datasets:
getNode(cpus_FS_R$syct,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmin,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$cach,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$chmin,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$chmax,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[1]]
sort(getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_L$syct,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$mmin,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$cach,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$chmin,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$chmax,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[1]]
sort(getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[4]],
decreasing=FALSE)
The following screenshot gives the results of running the preceding R code:
Figure 8: Obtaining the next two splits
Thus, for the first right-partitioned data, the best split is for the mmax value at the
mid-point between 24000 and 32000; that is, at mmax = 28000. Similarly, for the
first left-partitioned data, the best split is the average of 6000 and 6200,
which is 6100, for the same mmax covariate. Note the important step here. Even
though we used cach as the criterion for the first partition, it is still used within the two
partitioned datasets. The results are consistent with the display given by the regression
tree, Figure 6: Regression tree for the "perf" of a CPU. The next R program will take
care of the entire first split's right side's future partitions.
8. Partition the first right part cpus_FS_R as follows:
cpus_FS_R_SS_R <- cpus_FS_R[cpus_FS_R$mmax>=28000,]
cpus_FS_R_SS_L <- cpus_FS_R[cpus_FS_R$mmax<28000,]
Obtain the best split for cpus_FS_R_SS_R and cpus_FS_R_SS_L by running the
following code:
cpus_FS_R_SS_R <- cpus_FS_R[cpus_FS_R$mmax>=28000,]
cpus_FS_R_SS_L <- cpus_FS_R[cpus_FS_R$mmax<28000,]
getNode(cpus_FS_R_SS_R$syct,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$mmin,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$mmax,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$chmin,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$chmax,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[1]]
sort(getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_R_SS_L$syct,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$mmin,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$mmax,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$chmin,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$chmax,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))[[1]]
sort(getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))
[[4]],decreasing=FALSE)
For the cpus_FS_R_SS_R part, the final division is according to whether cach is
greater than 56 or not (the average of 48 and 64). If the cach value in this partition is
greater than 56, then perf (actually log10(perf)) ends in the terminal leaf 3,
else 2. However, for the region cpus_FS_R_SS_L, we partition the data further
by the cach value being greater than 96.5 (the average of 65 and 128). In the right
side of the region, log10(perf) is found to be 2, and a third-level split is required
for cpus_FS_R_SS_L with cpus_FS_R_SS_L_TS_L. Note that though the final
terminal leaves of the cpus_FS_R_SS_L_TS_L region show the same 2 as the
final log10(perf), this may actually result in a significant variability reduction of
the difference between the predicted and the actual log10(perf) values. We will
now focus on the first main split's left side.
Figure 9: Partitioning the right partition after the first main split
9. Partition cpus_FS_L according to whether the mmax value is greater than 6100
or otherwise:
cpus_FS_L_SS_R <- cpus_FS_L[cpus_FS_L$mmax>=6100,]
cpus_FS_L_SS_L <- cpus_FS_L[cpus_FS_L$mmax<6100,]
The rest of the partitioning for cpus_FS_L is completely given next.
10. The details will be skipped and the R program is given right away:
cpus_FS_L_SS_R <- cpus_FS_L[cpus_FS_L$mmax>=6100,]
cpus_FS_L_SS_L <- cpus_FS_L[cpus_FS_L$mmax<6100,]
getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$mmin,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$mmax,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$cach,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$chmin,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$chmax,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[1]]
sort(getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_L_SS_L$syct,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmin,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$cach,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$chmin,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$chmax,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))[[1]]
sort(getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))
[[4]],decreasing=FALSE)
cpus_FS_L_SS_R_TS_R <- cpus_FS_L_SS_R[cpus_FS_L_SS_R$syct<360,]
getNode(cpus_FS_L_SS_R_TS_R$syct,log10(cpus_FS_L_SS_R_TS_R$
perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$mmin,log10(cpus_FS_L_SS_R_TS_R$
perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$mmax,log10(cpus_FS_L_SS_R_TS_R$
perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$cach,log10(cpus_FS_L_SS_R_TS_R$
perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$
perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmax,log10(cpus_FS_L_SS_R_TS_R$
perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$
perf))[[1]]
sort(getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_
R$perf))[[4]],decreasing=FALSE)
We will now see how the preceding R code gets us closer to the regression tree:
Figure 10: Partitioning the left partition after the first main split
We leave it to you to interpret the output arising from the previous action.
What just happened?
Using the rpart function from the rpart library, we first built the regression tree for
log10(perf). Then, we explored the basic definitions underlying the construction of
a regression tree and defined the getNode function to obtain the best split for a pair of
regressand and covariate. This function was then applied for all the covariates, and the best
overall split was obtained; using this we got our first partition of the data, which is in
agreement with the tree given by the rpart function. We then recursively partitioned
the data by using the getNode function and verified that all the best splits in each
partitioned dataset are in agreement with the ones provided by the rpart function.
The reader may wonder if the preceding tedious task was really essential. However, it has
been the experience of the author that users/readers seldom remember the rationale
behind using direct code/functions for any software after some time. Moreover, CART is a
difficult concept and it is imperative that we clearly understand our first tree, and return to
the preceding program whenever the understanding of the science behind CART is forgotten.
The construction of a classification tree uses entirely different metrics, and hence its working
is also explained in considerable depth in the next section.
The construction of a classification tree
We first need to set up the splitting criteria for a classification tree. In the case of a
regression tree, we saw the sum of squares as the splitting criterion. For identifying the split
for a classification tree, we need to define certain measures known as impurity measures.
The three popular measures of impurity are the Bayes error, the cross-entropy function, and
the Gini index. Let p denote the proportion of successes in a dataset of size n. The formulae
of these impurity measures are given in the following table:
Measure | Formula
Bayes error | φ_B(p) = min(p, 1 − p)
The cross-entropy measure | φ_CE(p) = −p log(p) − (1 − p) log(1 − p)
Gini index | φ_G(p) = p(1 − p)
We will write a short program to understand the shape of these impurity measures as a function of p:
p <- seq(0.01,0.99,0.01)
plot(p,pmin(p,1-p),"l",col="red",xlab="p",xlim=c(0,1),ylim=c(0,1),ylab="Impurity Measures")
points(p,-p*log(p)-(1-p)*log(1-p),"l",col="green")
points(p,p*(1-p),"l",col="blue")
title(main="Impurity Measures")
legend(0.6,1,c("Bayes Error","Cross-Entropy","Gini Index"),col=c("red","green","blue"),pch="-")
The preceding R code, when executed in an R session, gives the following output:
Figure 11: Impurity metrics – Bayes error, cross-entropy, and Gini index
Basically, we have these three choices of impurity metric as a building block of a classification tree. The popular choice is the Gini index, and there are detailed discussions about the reason in the literature; see Breiman, et. al. (1984). However, we will not delve into this aspect, and for the development in this section we will be using the cross-entropy function.
Now, for a given predictor, assume that we have a node denoted by A. In the initial stage, where there are no partitions, the impurity is based on the proportion p. The impurity of node A is taken to be a non-negative function of the probability that y = 1, written p(y = 1 | A). The impurity of node A is defined as follows:

I(A) = φ(p(y = 1 | A))

Here, φ is one of the three impurity measures. When A is one of the internal nodes, the tree gets bifurcated into the left- and right-hand sides; that is, we now have a left daughter A_L and a right daughter A_R. For the moment, we will take the split according to the predictor variable x; that is, if x ≤ c, the observation moves to A_L, otherwise to A_R. Then, according to the split criterion, we have the following table; this is the same as Table 3.2 of Berk (2008):
Split criterion | Failure (0) | Success (1) | Total
A_L: x ≤ c | n11 | n12 | n1.
A_R: x > c | n21 | n22 | n2.
Total | n.1 | n.2 | n..
Using the frequencies in the preceding table, the impurities of the daughter nodes A_L and A_R, based on the cross-entropy metric, are given as follows:

I(A_L) = −(n11/n1.) log(n11/n1.) − (n12/n1.) log(n12/n1.)

And:

I(A_R) = −(n21/n2.) log(n21/n2.) − (n22/n2.) log(n22/n2.)

The probabilities of an observation falling in the left- and right-hand daughter nodes are given by p(A_L) = n1./n.. and p(A_R) = n2./n.., respectively. Then, the benefit of using the node A is given as follows:

Δ(A) = I(A) − p(A_L) I(A_L) − p(A_R) I(A_R)
Now, we compute Δ(A) for all unique values of a predictor, and choose as the best split the value for which Δ(A) is a maximum. This step is repeated across all the variables, and the best split, the one with the maximum Δ(A), is selected. According to the best split, the data is partitioned, and as seen earlier during the construction of the regression tree, a similar search is performed in each of the partitioned datasets. The process continues until the gain from a split falls below a threshold minimum in each of the partitioned datasets.
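The following short illustration plugs the preceding formulae into R for a single candidate split; the counts n11, n12, n21, and n22 are hypothetical numbers, chosen only to show the arithmetic of the cross-entropy gain:
ce <- function(p) ifelse(p %in% c(0,1), 0, -p*log(p) - (1-p)*log(1-p))
n11 <- 30; n12 <- 10    # left daughter A_L: failures, successes
n21 <- 20; n22 <- 40    # right daughter A_R: failures, successes
n1 <- n11 + n12; n2 <- n21 + n22; n <- n1 + n2
I_A  <- ce((n12 + n22)/n)       # impurity of the parent node A
I_AL <- ce(n12/n1)              # impurity of the left daughter
I_AR <- ce(n22/n2)              # impurity of the right daughter
I_A - (n1/n)*I_AL - (n2/n)*I_AR # the gain Delta(A) for this split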
We will begin with the classification tree as delivered by the rpart function. The illustrative dataset kyphosis is selected from the rpart library itself. The data relates to children who had corrective spinal surgery. This medical problem is about the exaggerated outward curvature of the thoracic region of the spine, which results in a rounded upper back. In this study, 81 children underwent spinal surgery, and after the surgery, information is captured in the column named Kyphosis to record whether the children still have the kyphosis problem. The value Kyphosis="absent" indicates that the child has been cured of the problem, and Kyphosis="present" means that the child has not been cured of kyphosis. The other information captured relates to the age of the children, the number of vertebrae involved, and the number of the first (topmost) vertebra operated on. The task for us is to build a classification tree that gives the Kyphosis status dependent on the described variables.
We will first build the classification tree for Kyphosis as a function of the three variables Age, Start, and Number. The tree will then be displayed and rules will be extracted from it. The getNode function will be defined based on the cross-entropy function, which will be applied on the raw data and the first overall optimal split obtained to partition the data. The process will be recursively repeated until we get the same tree as returned by the rpart function.
Time for action – the construction of a classification tree
The getNode function is now defined here to help us identify the best split for the classification problem. For the kyphosis dataset from the rpart package, we plot the classification tree by using the rpart function. The tree is then reobtained by using the getNode function.
1. Using the option split="information", construct a classification tree based on the cross-entropy information for the kyphosis data with the following code:
ky_rpart <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,parms=list(split="information"))
2. Visualize the classification tree by using plot(ky_rpart); text(ky_rpart):
Figure 12: Classification tree for the kyphosis problem
3. Extract rules from ky_rpart by using asRules:
> asRules(ky_rpart)
Rule number: 15 [Kyphosis=present cover=13 (16%) prob=0.69]
Start< 12.5
Age>=34.5
Number>=4.5
Rule number: 14 [Kyphosis=absent cover=12 (15%) prob=0.42]
Start< 12.5
Age>=34.5
Number< 4.5
Rule number: 6 [Kyphosis=absent cover=10 (12%) prob=0.10]
Start< 12.5
Age< 34.5
Rule number: 2 [Kyphosis=absent cover=46 (57%) prob=0.04]
Start>=12.5
4. Define the getNode function for the classification problem:
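A minimal sketch of such a getNode function is given here; it follows the description in the next paragraph (unique for the candidate split points, table for the counts, log for the cross-entropy, and an if guard against log(0)), and it returns, as used in the later steps, the best split point as element [[1]], the maximum gain as [[2]], and the gain at every candidate split as [[4]]. Treat it as an illustrative sketch, not necessarily the author's exact code:
getNode <- function(x, y) {
  xu <- sort(unique(x))             # distinct candidate split points
  n <- length(y)
  entropy <- function(counts) {     # cross-entropy impurity from class counts
    p <- counts/sum(counts)
    if (any(p == 0)) return(0)      # avoid log(0) = -Inf
    -sum(p*log(p))
  }
  I_A <- entropy(table(y))          # impurity of the parent node
  delta <- numeric(length(xu))
  for (i in seq_along(xu)) {
    left <- y[x <= xu[i]]; right <- y[x > xu[i]]
    if (length(left) == 0 || length(right) == 0) { delta[i] <- 0; next }
    delta[i] <- I_A - (length(left)/n)*entropy(table(left)) -
      (length(right)/n)*entropy(table(right))
  }
  best <- which.max(delta)
  list(xu[best], delta[best], xu, delta)
}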
In the preceding function, the key functions are unique, table, and log. We use unique to ensure that the search is carried out over the distinct values of the predictor in the data. table gets the required counts, as discussed earlier in this section. The if condition ensures that neither the p nor the 1 − p values become 0, in which case the logs would become minus infinity. The rest of the coding is self-explanatory. Let us now get our first best split.
5. We will need a few data manipulations to ensure that our R code works on the expected lines:
KYPHOSIS <- kyphosis
KYPHOSIS$Kyphosis_y <- (kyphosis$Kyphosis=="absent")*1
6. To find the first best split among the three variables, execute the following code; the output is given in a consolidated screenshot after all the iterations:
getNode(KYPHOSIS$Age,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Number,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[4]],
decreasing=FALSE)
Now, getNode indicates that the best split occurs for the Start variable, and the point for the best split is 12. Keeping in line with the argument of the previous section, we split the data into two parts according to whether the Start value is greater than 12.5, the average of 12 and 13. For the partitioned data, the search proceeds in a recursive fashion.
7. Partition the data according to whether the Start values are greater than 12.5, and find the best split for the right daughter node; the tree display shows that a search in the left daughter node is not necessary:
KYPHOSIS_FS_R <- KYPHOSIS[KYPHOSIS$Start<12.5,]
KYPHOSIS_FS_L <- KYPHOSIS[KYPHOSIS$Start>=12.5,]
getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Number,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Start,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[4]],
decreasing=FALSE)
The maximum incremental value occurs for the predictor Age, and the split point is 27. Again, we take the average of 27 and the next highest value of 42, which turns out to be 34.5. The (first) right daughter node region is then partitioned into two parts according to whether the Age values are greater than 34.5, and the search for the next split continues in the current right daughter part.
8. The following code completes our search:
KYPHOSIS_FS_R_SS_R <- KYPHOSIS_FS_R[KYPHOSIS_FS_R$Age>=34.5,]
KYPHOSIS_FS_R_SS_L <- KYPHOSIS_FS_R[KYPHOSIS_FS_R$Age<34.5,]
getNode(KYPHOSIS_FS_R_SS_R$Age,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Start,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[4]],decreasing=FALSE)
We see that the final split occurs for the predictor Number, the split point is 4, and we again settle on 4.5.
We see that the results from our raw code completely agree with the rpart function. Thus, the effort of writing custom code for the classification tree has paid the right dividends. We now have enough clarity about the construction of the classification tree:
Figure 13: Finding the best splits for the classification tree using the getNode function
What just happened?
A deliberate attempt has been made at demystifying the construction of a classification tree. As with the earlier attempt at understanding a regression tree, we first deployed the rpart function and saw a display of the classification tree for Kyphosis as a function of Age, Start, and Number, for the choice of the cross-entropy impurity metric. The getNode function is defined on the basis of the same impurity metric, and in a very systematic fashion we reproduced the same tree as obtained by the rpart function.
With the understanding of the basic construction behind us, we will now build the classification tree for the German credit data problem.
Classification tree for the German credit data
In Chapter 7, The Logistic Regression Model, we constructed a logistic regression model, and in the previous chapter, we obtained the ridge regression version for the German credit data problem. However, problems such as these and many others may have non-linearity built into them, and it is worthwhile to look at the same problem by using a classification tree. We also assessed model performance for the German credit data using the train, validate, and test approach, and we will follow the same approach here. First, we will partition the German dataset into three parts, namely train, validate, and test. The classification tree will be built by using the data in the train set and then it will be applied to the validate part. The corresponding ROC curves will be visualized, and if we feel that the two curves are reasonably similar, we will apply it to the test region, and take the necessary action of sanctioning the customers their required loan.
Time for action – the construction of a classification tree
A classification tree is built now for the German credit data by using the rpart function. The approach of train, validate, and test is implemented, and the ROC curves are obtained too.
1. The following code has been used earlier in the book, and hence there won't be an explanation of it:
set.seed(1234567)
data_part_label <- c("Train","Validate","Test")
indv_label = sample(data_part_label,size=1000,replace=TRUE,prob=c(0.6,0.2,0.2))
library(ROCR)
data(GC)
GC_Train <- GC[indv_label=="Train",]
GC_Validate <- GC[indv_label=="Validate",]
GC_Test <- GC[indv_label=="Test",]
2. Create the classification tree for the German credit data, and visualize the tree. We will also extract the rules from this classification tree:
GC_rpart <- rpart(good_bad~.,data=GC_Train)
plot(GC_rpart); text(GC_rpart)
asRules(GC_rpart)
The classification tree for the German credit data appears as in the following screenshot:
Figure 14: Classification tree for the test part of the German credit data problem
By now, we know how to find the rules of this tree. An edited version of the rules is given as follows:
Figure 15: Rules for the German credit data
3. We use the tree given in the previous step on the validate region, and plot the ROC curves for both the regions:
Pred_Train_Class <- predict(GC_rpart,type='class')
Pred_Train_Prob <- predict(GC_rpart,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
We will go ahead and predict for the test part too.
4. The necessary code is the following:
Pred_Test_Class <- predict(GC_rpart,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
The final ROC curve looks similar to the following screenshot:
Figure 16: ROC Curves for German Credit Data
The performance of the classification tree is certainly not satisfactory even on the validate group. The only solace here is that the test curve is fairly similar to the validate curve. We will look at more modern ways of improving the basic classification tree in the next chapter. The classification tree in Figure 14: Classification tree for the test part of the German credit data problem is very large and complex, and we sometimes need to truncate the tree to make the classification method a bit simpler. Of course, one of the things that we should suspect whenever we look at very large trees is that we may again be facing the problem of overfitting. The final section deals with a simple method of overcoming this problem.
What just happened?
A classification tree has been built for the German credit dataset. The ROC curve shows that the tree does not perform well on the validate data part. In the next and concluding section, we look at two ways of improving this tree.
Have a go hero
Using the getNode function, verify the first five splits of the classification tree for the German credit data.
Pruning and other finer aspects of a tree
Recall from Figure 14: Classification tree for the test part of the German credit data problem that the rules numbered 21, 143, 69, 165, 142, 70, 40, 164, and 16 covered only 20, 25, 11, 11, 14, 12, 28, 19, and 22 observations, respectively. If we look at the total number of observations, we have about 600, and individually these rules do not cover even about five percent of them. This is one reason to suspect that we may have overfitted the data. Using the minsplit option, we can restrict the minimum number of observations each rule should cover.
Another technical way of reducing the complexity of a classification tree is by "pruning" the tree. Here, the least important splits are recursively snipped off according to the complexity parameter; for details, refer to Breiman, et. al. (1984), or Section 3.6 of Berk (2008). We will illustrate the action through the R program.
Time for action – pruning a classification tree
A CART is improved by using the minsplit and cp arguments in the rpart function.
1. Set up a two-panel graphics device with par(mfrow=c(1,2)).
2. Specify minsplit=30, and redo the ROC plots by using the new classification tree:
GC_rpart_minsplit <- rpart(good_bad~.,data=GC_Train,minsplit=30)
Pred_Train_Class <- predict(GC_rpart_minsplit,type='class')
Pred_Train_Prob <- predict(GC_rpart_minsplit,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart_minsplit,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart_minsplit,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
Pred_Test_Class <- predict(GC_rpart_minsplit,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart_minsplit,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
title(main="Improving a Classification Tree with 'minsplit'")
3. For the pruning factor cp=0.02, repeat the ROC plot exercise:
GC_rpart_prune <- prune(GC_rpart,cp=0.02)
Pred_Train_Class <- predict(GC_rpart_prune,type='class')
Pred_Train_Prob <- predict(GC_rpart_prune,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart_prune,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart_prune,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
Pred_Test_Class <- predict(GC_rpart_prune,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart_prune,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
title(main="Improving a Classification Tree with Pruning")
The choice of cp=0.02 has been drawn from the plot of the complexity parameter against the relative error; try it yourself with plotcp(GC_rpart).
Figure 17: Pruning the CART
What just happened?
Using the minsplit and cp options, we have managed to obtain a reduced set of rules, and in that sense, the fitted model no longer appears to overfit. The ROC curves show that there has been considerable improvement in the performance on the validate region. Again, as earlier, the validate and test regions have similar ROC curves, and it is hence preferable to use GC_rpart_prune or GC_rpart_minsplit over GC_rpart.
Pop quiz
With the experience of model selection from the previous chapter, justify the choice of cp=0.02 from the plot obtained as a result of running plotcp(GC_rpart).
Summary
We began with the idea of recursive partitioning and gave a legitimate reason as to why such an approach is practical. The CART technique is completely demystified by using the getNode function, which has been defined appropriately depending upon whether we require a regression or a classification tree. With that conviction behind us, we applied the rpart function to the German credit data, and with its results, we had basically two problems. First, the fitted classification tree appeared to overfit the data. This problem may often be overcome by using the minsplit and cp options. The second problem was that the performance was really poor on the validate region. Though the reduced classification trees had slightly better performance compared to the initial tree, we still need to improve the classification tree. The next chapter will focus more on this aspect and discuss the modern developments of CART.
10
CART and Beyond
In the previous chapter, we studied CART as a powerful recursive partitioning
method, useful for building (non-linear) models. Despite the overall generality,
CART does have certain limitations that necessitate some enhancements. It is
these extensions that form the crux of the final chapter of this book. For some
technical reasons, we will focus solely on the classification trees in this chapter.
We will also briefly look at some limitations of the CART tool.
The first improvement that can be made to CART is provided by the bagging technique. In this technique, we build multiple trees on bootstrap samples drawn from the actual dataset. An observation is put through each of the trees and a prediction is made for its class; it is then predicted to belong to the majority count class. A different approach is provided by random forests, where a random pool of covariates is considered for each tree built on the resampled observations. We finally consider another important enhancement of CART by using boosting algorithms. The chapter will discuss the following:
Cross-validation errors for CART
The bootstrap aggregation (bagging) technique for CART
Extending the CART with random forests
A consolidation of the applications developed from Chapter 6 to Chapter 10
Improving CART
In the Another look at model assessment section of Chapter 8, we saw that the technique of train + validate + test may be further enhanced by using the cross-validation technique. In the case of the linear regression model, we had used the CVlm function from the DAAG package for the purpose of cross-validation of linear models. The cross-validation technique for the logistic regression models may be carried out by using the CVbinary function from the same package.
Profs. Therneau and Atkinson created the package rpart, and a detailed documentation of the entire rpart package is available on the Web at http://www.mayo.edu/hsr/techrpt/61.pdf. Recall the slight improvement provided in the Pruning and other finer aspects of a tree section of the previous chapter. The two aspects considered there related to the complexity parameter cp and the minimum split criterion minsplit. Now, the problem of overfitting with CART may be reduced to an extent by using the cross-validation technique. In the ridge regression model, we had the problem of selecting the penalty factor. Similarly, here we have the problem of selecting the complexity parameter, though not in an analogous way. That is, for the complexity parameter, which is a number between 0 and 1, we need to obtain the predictions based on the cross-validation technique. This may lead to a small loss of accuracy; however, we will then increase the accuracy by looking at the generality. An object of the rpart class has many summaries contained within it, and the various complexity parameters are stored in the cptable matrix. This matrix has values for the following metrics: CP, nsplit, rel error, xerror, and xstd. Let us understand this matrix through the default example in the rpart package, which is example(xpred.rpart); see Figure 1: Understanding the example for the "xpred.rpart" function:
Figure 1: Understanding the example for the "xpred.rpart" function
Here the tree has CP at four values, namely 0.595, 0.135, 0.013, and 0.010. The corresponding nsplit numbers are 0, 1, 2, and 3, and similarly, the relative error, xerror, and xstd values are given in the last part of the previous screenshot. The interpretation of the CP value is slightly different, the reason being that these have to be considered as ranges and not values, in the sense that the rest of the performance is not with respect to the CP values as mentioned previously; rather, it is with respect to the intervals [0.595, 1], [0.135, 0.595), [0.013, 0.135), and [0.010, 0.013); see ?xpred.rpart for more information. Now, the function xpred.rpart returns the predictions based on the cross-validation technique. Therefore, we will use this function for the German credit data problem and for different CP values (actually ranges), to obtain the accuracy of the cross-validation technique.
Time for action – cross-validation predictions
We will use the xpred.rpart function from rpart to obtain the cross-validation predictions from an rpart object.
1. Load the German dataset and the rpart package using data(GC); library(rpart).
2. Fit the classification tree with GC_Complete <- rpart(good_bad~.,data=GC).
3. Check cptable with GC_Complete$cptable:
CP nsplit rel error xerror xstd
1 0.05167 0 1.0000 1.0000 0.04830
2 0.04667 3 0.8400 0.9833 0.04807
3 0.01833 4 0.7933 0.8900 0.04663
4 0.01667 6 0.7567 0.8933 0.04669
5 0.01556 8 0.7233 0.8800 0.04646
6 0.01000 11 0.6767 0.8833 0.04652
4. Obtain the cross-validation predictions using GC_CV_Pred <- xpred.rpart(GC_Complete).
5. Find the accuracy of the cross-validation predictions:
sum(diag(table(GC_CV_Pred[,2],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,3],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,4],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,5],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,6],GC$good_bad)))/1000
The accuracy output is as follows:
> sum(diag(table(GC_CV_Pred[,2],GC$good_bad)))/1000
[1] 0.71
> sum(diag(table(GC_CV_Pred[,3],GC$good_bad)))/1000
[1] 0.744
> sum(diag(table(GC_CV_Pred[,4],GC$good_bad)))/1000
[1] 0.734
> sum(diag(table(GC_CV_Pred[,5],GC$good_bad)))/1000
[1] 0.74
> sum(diag(table(GC_CV_Pred[,6],GC$good_bad)))/1000
[1] 0.741
It is natural that when you execute the same code, you will most likely get a different output. Why is that? Also, you need to answer for yourself why we did not check the accuracy for GC_CV_Pred[,1]. In general, for a decreasing CP range, we expect higher accuracy. We have checked the cross-validation predictions for various CP ranges. There are also other techniques to enhance the performance of a CART.
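As a complementary check, the cptable shown in step 3 can itself guide the choice of cp; a common rule of thumb, used here as an assumption rather than a step from the text, is to take the CP value whose row has the smallest cross-validated error xerror and prune the full tree at that value:
cpt <- GC_Complete$cptable
best_cp <- cpt[which.min(cpt[,"xerror"]),"CP"]   # CP row with smallest xerror
GC_pruned <- prune(GC_Complete,cp=best_cp)       # prune the full tree at that value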
What just happened?
We used the xpred.rpart function to obtain the cross-validation predictions for a range of CP values. The accuracy of a prediction model has been assessed by using simple functions such as table and diag.
However, the control actions of minsplit and cp are of a reactive nature, applied after the splits have already been decided. In that sense, when we have a large number of covariates, the CART may lead to an overfit of the data and may try to capture all the local variations of the data, and thus lose sight of the overall generality. Therefore, we need useful mechanisms to overcome this problem.
The classification and regression tree considered in the previous chapter is a single model. That is, we are seeking the opinion (prediction) of a single model. Wouldn't it be nice if we could extend this! Alternatively, we can seek multiple models instead of a single model. What does this mean? In the forthcoming sections, we will see the use of multiple models for the same problem.
Bagging
Bagging is an abbreviation for bootstrap aggregation. The important underlying concept here is the bootstrap, which was invented by the eminent scientist Bradley Efron. We will first digress here a bit from the CART technique and consider a very brief illustration of the bootstrap technique.
The bootstrap
Consider a random sample X_1, ..., X_n of size n from f(x, θ), and let T(X_1, ..., X_n) be an estimator of θ. To begin with, we first draw a random sample of size n from X_1, ..., X_n with replacement; that is, we obtain a random sample X*_1, X*_2, ..., X*_n, where some of the observations from the original sample may be repeated and some may not be present at all. There is no one-to-one correspondence between X_1, ..., X_n and X*_1, X*_2, ..., X*_n. Using X*_1, X*_2, ..., X*_n, we compute T*1 = T(X*_1, ..., X*_n). Repeat this exercise a large number of times, say B. The inference for θ is carried out by using the sampling distribution of the bootstrap estimates T*1, ..., T*B. Let us illustrate the concept of the bootstrap with the famous aspirin example; see Chapter 8 of Tattar, et. al. (2013).
A surprising double-blind experiment, reported in the New York Times, indicated that aspirin consumed on alternate days significantly reduces the number of heart attacks among men. In the experiment, 104 out of 11037 healthy middle-aged men consuming small doses of aspirin suffered a fatal/non-fatal heart attack, whereas 189 out of 11034 individuals on the placebo had an attack. Therefore, the odds ratio of the aspirin-to-placebo heart attack possibility is (104/11037) / (189/11034) = 0.55. This indicates that only 55 percent of the number of heart attacks observed for the group taking the placebo is likely to be observed for men consuming small doses of aspirin. That is, the chance of having a heart attack while taking aspirin is almost halved. The experiment being scientific, the results look very promising. However, we would like to obtain a confidence interval for the odds ratio of a heart attack. If we don't know the sampling distribution of the odds ratio, we can use the bootstrap technique to obtain one. There is another aspect of the aspirin study. It has been observed that the aspirin group had about 119 individuals who had strokes, while the number of strokes for the placebo group was 98. Therefore, the odds ratio of a stroke is (119/11037) / (98/11034) = 1.21. This is shocking! It says that though aspirin reduces the possibility of a heart attack, about 21 percent more people are likely to have a stroke when compared to the placebo group. Now, let's use the bootstrap technique to obtain the confidence intervals for the heart attack as well as the stroke odds ratios.
Time for action – understanding the bootstrap technique
The boot package, which comes shipped with base R, will be used for bootstrapping the odds ratio.
1. Get the boot package using library(boot).
The boot package is shipped with the R software itself, and thus it does not require separate installation. The main components of the boot function will be explained shortly.
2. Define the odds-ratio function:
OR <- function(data,i) {
x <- data[,1]; y <- data[,2]
odds.ratio <- (sum(x[i]==1,na.rm=TRUE)/length(na.omit(x[i])))/
(sum(y[i]==1,na.rm=TRUE)/length(na.omit(y[i])))
return(odds.ratio)
}
The name OR stands, of course, for odds ratio. The data for this function consists of two columns, one of which may have more observations than the other. The option na.rm is used to ignore the NA data values, whereas the na.omit function removes them. It is easy to see that the odds.ratio object indeed computes the odds ratio. Note that we have specified i as an input to the function OR, since this function will be used within boot. Here, i indicates the indices of the ith bootstrap sample, so the odds ratio is calculated for that bootstrap sample; x[i] is therefore not simply the ith element of x.
3. Get the data for both the aspirin and placebo groups (the heart attack and stroke data), with the following code:
aspirin_hattack <- c(rep(1,104),rep(0,11037-104))
placebo_hattack <- c(rep(1,189),rep(0,11034-189))
aspirin_strokes <- c(rep(1,119),rep(0,11037-119))
placebo_strokes <- c(rep(1,98),rep(0,11034-98))
4. Combine the data groups and run 1000 bootstrap replicates, calculating the odds ratio for each of the bootstrap samples. Use the following boot function:
hattack <- cbind(aspirin_hattack,c(placebo_hattack,NA,NA,NA))
hattack_boot <- boot(data=hattack,statistic=OR,R=1000)
strokes <- cbind(aspirin_strokes,c(placebo_strokes,NA,NA,NA))
strokes_boot <- boot(data=strokes,statistic=OR,R=1000)
We are using three options of the boot function, namely data, statistic, and R. The first option accepts the data frame of interest; the second one accepts the statistic, either an existing R function or a function defined by the user; and finally, the third option accepts the number of bootstrap replications. The boot function creates an object of the boot class, and in this case, we are obtaining the odds ratio for the various bootstrap samples.
5. Using the bootstrap samples and the odds ratios computed for them, obtain 95 percent confidence intervals by using the quantile function:
quantile(hattack_boot$t,c(0.025,0.975))
quantile(strokes_boot$t,c(0.025,0.975))
The 95 percent confidence interval for the odds ratio of the heart attack rate is given as (0.4763, 0.6269), while that for the strokes is (1.126, 1.333). Since neither interval contains 1, we conclude that aspirin indeed reduces the odds of a heart attack, to about 55 percent of the placebo odds, while it also increases the odds of a stroke relative to the placebo group.
What just happened?
We used the boot function from the boot package and obtained bootstrap samples of the odds ratio.
Now that we have an understanding of the bootstrap technique, let us check out how
the bagging algorithm works.
The bagging algorithm
Breiman (1996) proposed the extension of CART in the following manner. Suppose that the values of the n random observations for the classification problem are (y_1, x_1), (y_2, x_2), ..., (y_n, x_n). As with our setup, the dependent variables y_i are binary. As with the bootstrap technique explained earlier, we obtain a bootstrap sample of size n from the data with replacement and build a tree. If we prune the tree, it is very likely that we may end up with the same tree on most occasions; hence, pruning is not advisable here. Now, using the tree based on the (first) bootstrap sample, a prediction is made for the class of the i-th observation and the predicted value is noted. This process is repeated a large number of times, say B. A general practice is to take B = 100. Therefore, we have B predictions for every observation. The decision process is to classify the observation to the category that has the majority of class predictions. That is, if more than 50 times out of B = 100 it has been predicted to belong to a particular class, we say that the observation is predicted to belong to that class. Let us formally state the bagging algorithm; a minimal R sketch of the procedure follows the list.
1. Draw a sample of size n with replacement from the data (y_1, x_1), (y_2, x_2), ..., (y_n, x_n), and denote the first bootstrap sample by (y_1^(1), x_1^(1)), (y_2^(1), x_2^(1)), ..., (y_n^(1), x_n^(1)).
2. Create a classification tree with (y_1^(1), x_1^(1)), (y_2^(1), x_2^(1)), ..., (y_n^(1), x_n^(1)). Do not prune the classification tree. Such a tree may be called a bootstrapped tree.
3. For each terminal node, assign a class; put each observation down the tree and find its predicted class.
4. Repeat steps 1 to 3 a large number of times, say B.
5. Find the number of times each observation is classified to a particular class out of the B bootstrapped trees. The bagging procedure classifies an observation as belonging to the class that has the majority count.
6. Compute the confusion table from the predictions made in step 5.
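The following is a minimal sketch of the preceding algorithm, grown directly with rpart; the bagging function from the ipred package, used in the next Time for action section, wraps the same idea. The function name bag_cart and the default B = 100 are illustrative choices, not part of the original text:
library(rpart)
bag_cart <- function(formula, data, B = 100) {
  n <- nrow(data)
  y <- model.response(model.frame(formula, data))
  votes <- matrix(0, nrow = n, ncol = nlevels(y),
                  dimnames = list(NULL, levels(y)))
  for (b in seq_len(B)) {
    idx <- sample(n, n, replace = TRUE)                    # step 1: bootstrap sample
    tree <- rpart(formula, data = data[idx, ],             # step 2: unpruned tree
                  control = rpart.control(cp = 0, minsplit = 2))
    pred <- predict(tree, newdata = data, type = "class")  # step 3: predicted classes
    vi <- cbind(seq_len(n), as.integer(pred))
    votes[vi] <- votes[vi] + 1                             # steps 4 and 5: accumulate votes
  }
  factor(levels(y)[max.col(votes)], levels = levels(y))    # majority-vote prediction
}
# For example, table(bag_cart(good_bad ~ ., GC), GC$good_bad) gives the
# confusion table of step 6 for the German credit data.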
The advantage of multiple trees is that the problem of overfitting, which happens in the case of a single tree, is overcome to a large extent, as we expect that resampling will ensure that the general features are captured and the impact of local features is minimized. Therefore, if an observation is classified to a particular class because of a local issue, this will not be repeated over the other bootstrapped trees. Therefore, with predictions based on a large number of trees, it is expected that the final prediction of an observation really depends upon its general features and not on a particular local feature.
There are some measures that are important to consider with the bagging algorithm. A good classifier, whether a single tree or a bunch of them, should be able to predict the class of an observation with conviction. For example, we use a probability threshold of 0.5 and above as a prediction of success when using a logistic regression model. If the model can predict most observations in the neighborhood of either 0 or 1, we will have more confidence in the predictions. As a consequence, we will be a bit hesitant to classify an observation as either a success or a failure if the predicted probability is in the vicinity of 0.5. This precarious situation applies to the bagging algorithm too.
Suppose we choose B = 100 for the number of bagging trees. Assume that an observation belongs to a class, Yes, and let the overall classes for the study be {"Yes", "No"}. If a large number of trees predict the observation to belong to the Yes class, we are confident about the prediction. On the other hand, if only approximately B/2 of the trees classify the observation to the Yes class, the decision could easily be swapped if a few more trees predicted the observation to belong to the No class. Therefore, we introduce a measure called the margin, defined as the difference between the proportion of times an observation is correctly classified and the proportion of times it is incorrectly classified. If the bagging algorithm is a good model, we expect the average margin over all the observations to be a large number away from 0. If bagging is not appropriate, we expect the average margin to be near 0.
Let us prepare ourselves for action. The bagging algorithm is available in the ipred and randomForest packages.
Time for action – the bagging algorithm
The bagging function from the ipred package will be used for bagging a CART. The options coob=FALSE and nbagg=200 are used to specify the appropriate settings.
1. Get the ipred package by using library(ipred).
2. Load the German credit data by using data(GC).
3. For B = 200, fit the bagging procedure for the GC data:
GC_bagging <- bagging(good_bad~.,data=GC,coob=FALSE,nbagg=200,keepX=T)
We know that we have fit B = 200 trees. Would you like to see them? Fine, here we go.
4. The B = 200 trees are stored in the mtrees list of the classbagg object GC_bagging. That is, GC_bagging$mtrees[[i]] gives us the i-th bootstrapped tree, and plot(GC_bagging$mtrees[[i]]$btree) displays that tree. Adding text(GC_bagging$mtrees[[i]]$btree,pretty=1,use.n=T) is also important. Next, put the entire thing in a loop, execute it, and simply sit back and enjoy the display of the B trees:
for(i in 1:200) {
plot(GC_bagging$mtrees[[i]]$btree);
text(GC_bagging$mtrees[[i]]$btree,pretty=1,use.n=T)
}
We hope that you understand that we can't publish all 200 trees! The next goal is to obtain the margin of the bagging algorithm.
5. Predict the class probabilities of all the observations with the predict.classbagg function by using GCB_Margin <- round(predict(GC_bagging,type="prob")*200,0).
Let us understand the preceding code. The predict function returns the probabilities of an observation belonging to the good and bad classes. We have used 200 trees, and hence multiplying these probabilities by 200 gives us the expected number of times an observation is predicted to belong to each class. The round function with the 0 argument rounds the predictions to integers.
6. Check the first six predicted classes with head(GCB_Margin):
bad good
[1,] 17 183
[2,] 165 35
[3,] 11 189
[4,] 123 77
[5,] 101 99
[6,] 95 105
7. To obtain the overall margin of the bagging technique, use the R code mean(pmax(GCB_Margin[,1],GCB_Margin[,2]) - pmin(GCB_Margin[,1],GCB_Margin[,2]))/200.
The overall margin for the author's execution turns out to be 0.5279. You may, though, get a different answer. Why?
Thus far, the bagging technique made predictions for the observations from which it built the model. In the earlier chapters, we championed the need for a validate group and cross-validation techniques. That is, we did not always rely on the model measures solely from the data on which the model was built. There is always the possibility of failure as a result of unforeseen examples. Can the bagging technique be built to take care of unforeseen observations? The answer is a definite yes, and this is well known as out-of-bag validation. In fact, such an option has been suppressed when building the bagging model in step 3 here, with the option coob=FALSE. coob stands for an out-of-bag estimate of the error rate. So, now rebuild the bagging model with the coob=TRUE option.
8. Build an out-of-bag bagging model with GC_bagging_oob <- bagging(good_bad~.,data=GC,coob=TRUE,nbagg=200,keepX=T). Find the error rate with GC_bagging_oob$err.
> GC_bagging_oob <- bagging(good_bad~.,data=GC,coob=TRUE,
nbagg=200,keepX=T)
> GC_bagging_oob$err
[1] 0.241
What just happened?
We have seen an important extension of the CART model in the bagging algorithm. To an extent, this enhancement is vital and vastly different from the improvements of the earlier models. The bagging algorithm is different in the sense that we rely on predictions based on more than a single model. This ensures that the overfitting problem, which occurs due to local features, is almost eliminated.
It is important to note that the bagging technique is not without its limitations; refer to Section 4.5 of Berk (2008). We now move to the final model of the book, which is an important technique of the CART school.
Random forests
In the previous section, we built multiple models for the same classification problem. The bootstrapped trees were generated by using resamples of the observations. Breiman (2001) suggested an important variation (in fact, there is more to it than just a variation), where a CART is built with the covariates (features) also being resampled for each of the bootstrap samples of the dataset. Since the final tree of each bootstrap sample uses different covariates, the ensemble of the collective trees is called a random forest. A formal algorithm is given next; a minimal single-tree sketch in R follows the list.
1. As with the bagging algorithm, draw a sample of size n1, n1 < n, with replacement from the data (y_1, x_1), (y_2, x_2), ..., (y_n, x_n), and denote the first resampled dataset by (y_1^(1), x_1^(1)), (y_2^(1), x_2^(1)), ..., (y_n1^(1), x_n1^(1)). The remaining n − n1 observations form the out-of-bag dataset.
2. Among the covariate vector x, select a random number of covariates without replacement. Note that the same covariates are selected for all the observations.
3. Create the CART tree from the data in steps 1 and 2, and, as earlier, do not prune the tree.
4. For each terminal node, assign a class. Put each out-of-bag observation down the tree and find its predicted class.
5. Repeat steps 1 to 4 a large number of times, say 200 or 500.
6. For each observation, count the number of times it is predicted to belong to a class, counting only when it is part of the out-of-bag dataset.
7. The class with the majority count is taken as the predicted class for that observation.
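To make steps 1 to 3 concrete, the following sketch grows a single tree of this kind: a with-replacement subsample of size n1 and a random subset of the covariates, with the tree left unpruned. The function name, the default subsample size, the choice of four covariates, and the use of rpart are illustrative assumptions; the randomForest function used below handles all of this internally (and, in fact, re-samples the candidate covariates at every split rather than once per tree):
library(rpart)
one_rf_tree <- function(data, response, n1 = floor(0.63*nrow(data)), m = 4) {
  covars <- setdiff(names(data), response)
  idx <- sample(nrow(data), n1, replace = TRUE)   # step 1: resample the observations
  vars <- sample(covars, m)                       # step 2: random covariates, without replacement
  oob <- data[-unique(idx), , drop = FALSE]       # the out-of-bag observations
  tree <- rpart(reformulate(vars, response = response),
                data = data[idx, ],               # step 3: grow an unpruned tree
                control = rpart.control(cp = 0, minsplit = 2))
  list(tree = tree, oob_pred = predict(tree, newdata = oob, type = "class"))
}
# For example, one_rf_tree(GC, "good_bad") grows one such tree for the
# German credit data and predicts the classes of its out-of-bag observations.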
This is quite a complex algorithm. Luckily, the randomForest package helps us out. We will continue with the German credit data problem.
Time for action – random forests for the German credit data
The function randomForest from the package of the same name will be used to build a random forest for the German credit data problem.
1. Get the randomForest package by using library(randomForest).
2. Load the German credit data by using data(GC).
3. Create a random forest with 500 trees:
GC_RF <- randomForest(good_bad~.,data=GC,keep.forest=TRUE,ntree=500)
It is very difficult to visualize a single tree of the random forest. A somewhat ad hoc approach can be found at http://stats.stackexchange.com/questions/2344/best-way-to-present-a-random-forest-in-a-publication. We now reproduce the necessary function to get the trees; as the solution is not exactly perfect, you may skip this part (steps 4 and 5).
4. Define the to.dendrogram function:
5. Use the getTree function, and with the to.dendrogram function defined previously, visualize the first 20 trees of the forest:
for(i in 1:20) {
tree <- getTree(GC_RF,i,labelVar=T)
d <- to.dendrogram(tree)
plot(d,center=TRUE,leaflab='none',edgePar=list(t.cex=1,p.col=NA,p.lty=0))
}
The error rate is of primary concern. As we increase the number of trees in the forest, we expect a decrease in the error rate. Let us investigate this for the GC_RF object.
6. Plot the out-of-bag error rate against the number of trees with plot(1:500,GC_RF$err.rate[,1],"l",xlab="No.of.Trees",ylab="OOB Error Rate").
Figure 2: Performance of a random forest
The covariates (features) are selected differently for different trees. It is then of interest to know which variables are significant. The important variables are obtained using the varImpPlot function.
7. The function varImpPlot produces a display of the importance of the variables by using varImpPlot(GC_RF).
Figure 3: Important variables of the German credit data problem
Thus, we can see which variables have more relevance than others.
What just happened?
Random forests are a very important extension of the CART concept. In this technique, we need to know how the error rate behaves as the number of trees increases; it is expected to decrease with an increasing number of trees. varImpPlot also gives a very important display of the importance of the covariates for classifying the customers as good or bad.
In conclusion, we will undertake a classification dataset and revise all the techniques seen in the book, especially in Chapter 6 to Chapter 10. We will now consider the problem of low birth weight among infants.
The consolidation
The goal of this section is to quickly review all of the techniques learnt in the latter half of the book. Towards this, a dataset has been selected where we have ten variables, including the output. Low birth weight is a serious concern, and it needs to be understood as a function of many other variables. If the weight of a child at birth is less than 2500 grams, it is considered a low birth weight. This problem has been studied in Chapter 19 of Tattar, et. al. (2013). The following table gives a description of the variables. Since the dataset may be studied as a regression problem (variable BWT) as well as a classification problem (LOW), you can choose any path(s) that you deem fit. Let the final action begin.
Serial number Description Abbreviation
1 Identification code ID
2 Low birth weight LOW
3 Age of mother AGE
4 Weight of mother at last menstrual period LWT
5 Race RACE
6 Smoking status during pregnancy SMOKE
7 History of premature labor PTL
8 History of hypertension HT
9 Presence of uterine irritability UI
10 Number of physician visits during the first trimester FTV
11 Birth weight BWT
Time for action – random forests for the low birth weight data
The techniques learnt from Chapter 6 to Chapter 10 will now be put to the test. That is, we will use the linear regression model, logistic regression, as well as CART.
1. Read the dataset into R with data(lowbwt).
2. Visualize the dataset with the options diag.panel, lower.panel, and upper.panel:
pairs(lowbwt,diag.panel=panel.hist,lower.panel=panel.smooth,upper.panel=panel.cor)
Interpret the matrix of scatter plots. Which statistical model seems most appropriate to you?
Figure 4: Multivariable display of the "lowbwt" dataset
As the correlations look weak, it seems that a regression model may not be appropriate. Let us check.
3. Create (sub) datasets for the regression and classification problems:
LOW <- lowbwt[,-10]
BWT <- lowbwt[,-1]
4. First, we will check if a linear regression model is appropriate:
BWT_lm <- lm(BWT~., data=BWT)
summary(BWT_lm)
Interpret the output of the linear regression model; refer to Linear Regression
Analysis, Chapter 6, if necessary.
Figure 5: Linear model for the low birth weight data
The low R2 makes it difficult for us to use the model. Let us check out the logistic regression model.
5. Fit the logistic regression model as follows:
LOW_glm <- glm(LOW~., data=LOW, family='binomial')
summary(LOW_glm)
The summary of the model is given in the following screenshot:
Figure 6: Logistic regression model for the low birth weight data
6. The Hosmer-Lemeshow goodness-of-fit test for the logistic regression model is given by hosmerlem(LOW_glm$y,fitted(LOW_glm)).
Now, the p-value obtained is 0.7813, which shows that there is no significant difference between the fitted values and the observed values. Therefore, we conclude that the logistic regression model is a good fit. However, we will go ahead and fit CART models for this problem as well. Note that the estimated regression coefficients are not huge values, and hence we do not need to check for the ridge regression problem.
7. Fit a classification tree with the rpart function:
LOW_rpart <- rpart(LOW~.,data=LOW)
plot(LOW_rpart)
text(LOW_rpart,pretty=1)
Does the classification tree appear more appropriate than the logistic regression fitted earlier?
Figure 7: Classification tree for the low birth weight data
8. Get the rules of the classification tree using asRules(LOW_rpart).
Figure 8: Rules for the low birth weight problem
You can see that these rules are of great importance to the physician who performs the operations. Let us check the effect of bagging on the classification tree.
9. Using the bagging function, find the error rate of the bagging technique with the following code:
LOW_bagging <- bagging(LOW~., data=LOW,coob=TRUE,nbagg=50,keepX=T)
LOW_bagging$err
The error rate is 0.3228, which seems very high. Let us see if random forests help us out.
10. Using the randomForest function, find the error rate for the out-of-bag problem:
LOW_RF <- randomForest(LOW~.,data=LOW,keep.forest=TRUE, ntree=50)
LOW_RF$err.rate
The error rate is still around 0.34. The initial thought was that, with the number of observations being less than 200, only 50 trees would suffice. Repeat the task with 150 trees and check if the error rate decreases.
11. Increase the number of trees to 150 and obtain the error-rate plot:
LOW_RF <- randomForest(LOW~.,data=LOW,keep.forest=TRUE, ntree=150)
plot(1:150,LOW_RF$err.rate[,1],"l",xlab="No.of.Trees",ylab="OOB Error Rate")
The error rate of about 0.32 seems to be the best solution we can obtain for this problem.
Figure 9: The error rate for the low birth weight problem
What just happened?
We had a very quick look back at all the techniques used over the last five chapters
of the book.
Summary
The chapter began with two important variations of the CART technique: the bagging technique and random forests. Random forests in particular are a very modern technique, invented by Breiman in 2001. The goal of the chapter was to familiarize you with these modern techniques. Together with the German credit data and the complete revision of the earlier techniques with the low birth weight problem, it is hoped that you have benefited a lot from the book and will have gained enough confidence to apply these tools in your own analytical problems.
References
The book has been influenced by many of the classical texts on the subject, from Tukey (1977) to Breiman, et. al. (1984). The modern texts of Hastie, et. al. (2009) and Berk (2008) have particularly influenced the later chapters of the book. We have alphabetically listed only the books/monographs that have been cited in the text; however, the reader may go beyond the current list too.
Agresti, A. (2002), Categorical Data Analysis, Second Edition, J. Wiley
Baron, M. (2007), Probability and Statistics for Computer Scientists, Chapman and Hall/CRC
Belsley, D.A., Kuh, E., and Welsch, R.E. (1980), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, J. Wiley
Berk, R. A. (2008), Statistical Learning from a Regression Perspective, Springer
Breiman, L. (1996), Bagging predictors. Machine Learning, 24(2), 123-140
Breiman, L. (2001), Random forests. Machine Learning, 45(1), 5-32
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C.J. (1984), Classification and Regression Trees, Wadsworth
Chen, Ch., Härdle, W., and Unwin, A. (2008), Handbook of Data Visualization, Springer
Cleveland, W. S. (1985), The Elements of Graphing Data, Monterey, CA: Wadsworth
Cule, E. and De Iorio, M. (2012), A semi-automatic method to guide the choice of ridge parameter in ridge regression, arXiv:1205.0686v1 [stat.AP]
Freund, R.F., and Wilson, W.J. (2003), Statistical Methods, Second Edition, Academic Press
Friendly, M. (2001), Visualizing Categorical Data, SAS
Friendly, M. (2008), A brief history of data visualization. In Handbook of Data Visualization (pp. 15-56), Springer
Gunst, R. F. (2002), Finding confidence in statistical significance. Quality Progress, 35(10), 107-108
Gupta, A. (1997), Establishing Optimum Process Levels of Suspending Agents for a Suspension Product. Quality Engineering, 10, 347-350
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning, Second Edition, Springer
Horgan, J. M. (2008), Probability with R – An Introduction with Computer Science Applications, J. Wiley
Johnson, V. E., and Albert, J. H. (1999), Ordinal Data Modeling, Springer
Kutner, M. H., Nachtsheim, C., and Neter, J. (2004), Applied Linear Regression Models, McGraw Hill
Montgomery, D. C. (2007), Introduction to Statistical Quality Control, J. Wiley
Montgomery, D. C., Peck, E. A., and Vining, G. G. (2012), Introduction to Linear Regression Analysis, Wiley
Montgomery, D.C., and Runger, G. C. (2003), Applied Statistics and Probability for Engineers, J. Wiley
Pawitan, Yudi. (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, OUP Oxford
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann
Ross, S.M. (2010), Introductory Statistics, 3e, Academic Press
Rousseeuw, P.J., Ruts, I., and Tukey, J.W. (1999), The bagplot: a bivariate boxplot, The American Statistician, 53(4), 382-387
Ryan, T.P. (2007), Modern Engineering Statistics, J. Wiley
Sarkar, D. (2008), Lattice, Springer
Tattar, P.T., Suresh, R., and Manjunath, B.G. (2013), A Course in Statistics with R, Narosa
Tufte, E. R. (2001), The Visual Display of Quantitative Information, Graphics Pr
Tukey, J.W. (1977), Exploratory Data Analysis, Addison-Wesley
Velleman, P.F., and Hoaglin, D.C. (1981), Applications, Basics, and Computing of Exploratory Data Analysis; available at http://dspace.library.cornell.edu/
Wickham, H. (2009), ggplot2, Springer
Wilkinson, L. (2005), The Grammar of Graphics, Second Edition, Springer
Index
Symbols
3RSS 123
3RSSH 123
4253H smoother 123
%% operator 35
A
actual probabilies 13
Age in Years variable 9
Akaike Informaon Criteria (AIC) 194
alternave hypothesis 144
amount of leverage by the observaon 187
ANOVA technique
about 170
obtaining 170
anscombe dataset 169
aplpack package 117
Automove Research Associaon of India (ARAI)
10
B
backward eliminaon approach 192
backwardlm funcon 195
bagging 297
bagging algorithm 300-302
bagging technique 93
bagplot
about 116
characteriscs 116
displaying, for mulvariate dataset 117, 118
for gasoline mileage problem 117
barchart funcon 73
bar charts
built-in examples 66, 67
visualizing 68-73
barplot funcon 73
basic arithmec, vectors
performing 36
unequal length vectors, adding 36
basis funcons, regression spline 234
best split point 265
binary regression problem 202
binomial distribuon
about 20, 21
examples 21-23
binomial test
performing 144
proporons, tesng 147, 148
success probability, tesng 145, 146
bivariate boxplot 116
boosng algorithms 293
bootstrap 298, 299
bootstrap aggregaon 297
bootstrapped tree 300
box-and-whisker plot 84
boxplot
about 84
examples 84, 85
implemenng 85-87
boxplot funcon 86, 108
B-spline regression model
ng 241, 242
purpose 241
[ 318 ]
built-in examples, bar chart
Bug Metrics dataset 67
Bug Metrics of five software 67
bwplot function 86
C
CART
about 293
cross-validation predictions, obtaining 296, 297
improving 294, 295
CART_Dummy dataset
visualizing 259
categorical variable 9, 18, 260
central limit theorem 139
classification tree 257
construction 276-283
pruning 289-291
classification tree, German credit data
construction 284-288
coefficient of determination 169
colSums function 77
Comprehensive R Archive Network. See CRAN
computer science
experiments with uncertainty 14
confidence interval
about 139
for binomial proportion 139
for normal mean with known variance 140
for normal mean with unknown variance 140
obtaining 141, 142, 170, 171
confusion matrix 220
continuous distributions
about 26
exponential distribution 28
normal distribution 29
uniform distribution 27
continuous variable 9
Cooks distance 188
covariate 229
CRAN
about 15
URL 15
criteria-based model selection
about 194
AIC criteria, using 194-197
backward, using 194-197
forward, using 194-197
critical alpha 192
critical region 144
CSV file
reading, from 54
cumulative density function 27
Customer ID variable 8
CVbinary function 294
CVlm function 294
D
data
importing, from external files 55, 57
importing, from MySQL 58, 59
splitting 260
databases (DB) 58
data characteristics 12, 13
data formats
CSV (comma separated variable) format 52
ODS (OpenOffice/LibreOffice Calc) 52
XLS or XLSX (Microsoft Excel) format 52
data.frame object
about 45
creating 45, 46
data/graphs
exporting 60
graphs, exporting 60, 61
R objects, exporting 60
data re-expression
about 114
for Power of 62 Hydroelectric Stations 114-116
data visualization
about 65
visualization techniques, for categorical data 66
visualization techniques, for continuous variable data 84
DAT file
reading, from 54
depth 113
deviance residual 213
DFBETAS 189
DFFITS 189
diff function 108
discrete distributions
about 18
binomial distribution 20
discrete uniform distribution 19
hypergeometric distribution 24
negative binomial distribution 24
poisson distribution 26
discrete uniform distribution 19
dot chart
about 74
example 74
visualizing 74-76
dotchart function 74
E
EDA 103
exponential distribution 28, 29
F
false positive rate (fpr) 220
fence 117
files
reading, foreign package used 55
first tree
building 261-263
fitdistr function
used, for finding MLE 136
fivenum function 108
forwardlm function 194
forward selection approach 193
fourfold plot
about 82
examples 83
Full Name variable 9
G
Gender variable 9
generalized cross-validation (GCV) errors 255
geometric RV 25
German credit data
classification tree 284-288
German credit screening dataset
logistic regression 223-226
getNode function 267
getTree function 305
ggplot 101
ggplot2
about 99, 100
ggplot 101
qplot function 100
GLM
influential and leverage points, identifying 216
residual plots 213
graphs
exporting 60, 61
H
han function 124
hanning 123
hinges 104, 105
hist function 90
histogram
about 88
construction 88, 89
creating 90-92
effectiveness 90
examples 89
histogram function 90
Hosmer-Lemeshow goodness-of-fit test statistic 210, 212
hypergeometric distribution 24
hypergeometric RV 24
hypotheses testing
about 144
binomial test 144
one-sample hypotheses, testing 152-155
one-sample problems, for normal distribution 150, 151
two-sample hypotheses, testing 159
two sample problems, for normal distribution 156, 157
hypothesis
about 144
alternative hypothesis 144
null hypothesis 144
statistic, testing 144
I
impurity measures 276
independent and identical distributed (iid) sample 130
influence and leverage points, GLM
identifying 216
influential point 188
interquartile range (IQR) 84, 105
iterative reweighted least-squares (IRLS) algorithm 207
L
leading digits 109
letter values 113
leverage point 187
likelihood function
about 131
visualizing 131-134
likelihood function, of binomial distribution 131
likelihood function, of normal distribution 132
likelihood function, of Poisson distribution 132
linear regression model
about 162
limitations 202, 203
linearRidge function 244
list object
about 44
creating 44, 45
lm.ridge fit model 255
lm.ridge function 245, 255
logistic regression, German credit dataset 223-226
logistic regression model
about 201-207
diagnostics 216
fitting 207-210
influence and leverage points, identifying 216-218
model validation 213
residuals plots, for GLM 213
ROC 220
logisticRidge function 248
loop 117
M
margin 301
matrix computations
about 41
performing 41-43
maximum likelihood estimator. See MLE
mean 18, 122
mean residual sum of squares 183
median 104, 122
median polish 125
median polish algorithm
about 125, 126
example 126, 127
medpolish function 126
MLE
about 129-131
finding 135
finding, fitdistr function used 137
finding, mle function used 137
likelihood function, visualizing 131
MLE, binomial distribution
finding 135, 136
MLE, Poisson distribution
finding 136
model assessment
about 250
penalty factor, finding 250
penalty parameter, selecting iteratively 250-255
testing dataset 250
training dataset 250
validation dataset 250
model selection
about 192
criterion-based procedures 194
stepwise procedures 192
model validation, simple linear regression model
about 171
residual plots, obtaining 172, 174
mosaic plot
for Titanic dataset 80, 81
mosaicplot function 80
multicollinearity problem
about 189, 190
addressing 191, 192
multiple linear regression model
about 176, 177
ANOVA, obtaining 182
building 179, 180
confidence intervals, obtaining 182
k simple linear regression models, averaging 177-179
useful residual plots 183
multivariate dataset
bagplot, displaying for 117, 118
MySQL
data, importing from 58, 59
N
natural cubic spline regression model
about 238
fitting 239-241
negative binomial distribution
about 24, 25
examples 25
negative binomial RV 25
nominal variable 18
normal distribution
about 29
examples 30
null hypothesis 144
NumericConstants 34
O
Octane Rating
of Gasoline Blends 109
odds ratio 207
one-sample hypotheses
testing 152-155
operator curves
receiving 220
ordinal variable 9, 19, 260
out-of-bag validation 303
overfitting 230-233
P
pairs function 118
panel.bagplot function 118
Pareto chart
about 97
examples 98, 99
partial residual 213
pdf 26
pearson residual 213
percentiles 104
piecewise cubic 238
piecewise linear regression model
about 235
fitting 235-237
pie charts
about 81
drawbacks 82
examples 81
plot.lm function 176
poisson distribution
about 26
examples 26
polynomial regression model
building 231
fitting 229
pooled variance estimator 157
PRESS residuals 184
principal component (PC) 245
probability density function 26
probability mass function (pmf) 18
probit model 201
probit regression model
about 204
constants 204-206
pruning 288
Q
qplot function 100
quantiles 104
questionnaire
about 8
components 8
Questionnaire ID variable 8
R
R
constants 34
continuous distributions 26
data characteristics 12, 13
data.frame 33
data visualization 65
discrete distributions 18
downloading, for Linux 16
downloading, for Windows 16
session management 62, 64
simple linear regression model 162
vectors 34, 35
randomForest function 304
Random Forests
about 293, 303
for German credit data 304, 306
for low birth weight data 308-313
random sample 130
random variable 13
range function 108
R databases 33
read.csv 54
read.delim 54
read.table function 53
read.xls function 54
receiving operator curves. See ROC
Recursive Partitioning
about 258
data, splitting 260
display plot, partitioning 259
regression 162
regression diagnostics
about 186
DFBETAS 189
DFFITS 189
influential point 187, 188
leverage point 187
regression spline
about 234
basis functions 234
natural cubic splines 238
piecewise linear regression model 235
regression tree
about 257
construction 265-274
representative probabilities 13
reroughing 123
resid function 174
residual plots, GLM
deviance residual 213
obtaining, fitted function used 214, 215
obtaining, residuals function used 214, 215
partial residual 213
pearson residual 213
response residual 213
working residual 213
residual plots, multiple linear regression model
about 183
PRESS residuals 184
R-student residuals 184, 186
semi-studentized residuals 183, 184
standardized residuals 183
residuals function 216
resistant line
about 118, 120
as regression model 120, 121
for IO-CPU time 120
response residual 213
ridge regression, for linear regression model 243-247
ridge regression, for logistic regression models 248, 249
R installation
about 15, 16
R packages, using 16, 17
RSADBE 17
rline function 120
R names function 34
R objects
about 33
exporting 60
letters 34
LETTERS 34
month.abb 34
month.name 34
pi 34
ROC
about 220
construction 221, 222
rootstock dataset 55
rough 123
R output 33
rowSums function 78
R packages
using 16
rpart class 294
rpart function 257
RSADBE package 17, 105
R-student residuals 184
RV 13
S
scatter plot
about 93
creating 94-96
examples 93
semi-studentized residuals 183
Separated Clear Volume 54
session management
about 62
performing 62, 64
short message service (SMS) 8
significance level 139
simple linear regression model
about 163
ANOVA technique 169
building 167, 168
confidence intervals, obtaining 170
core assumptions 163
limitation 230
overfitting problem 230
residuals for arbitrary choice of parameters, displaying 164-166
validation 171
smooth function 124
smoothing data technique
about 122
for cow temperature 124, 125
spine/mosaic plot
about 76
advantages 76
examples 77
spine plot
for shift and operator data 77, 79
spineplot function 77
spline 234
standardized residuals 183
Statistical Process Control (SPC) 109
stem-and-leaf plot 109
stem function
about 110
working 110-112
stems 109
step function 194
stepwise procedures
about 192
backward elimination 192
forward selection 193
stepwise regression 193
summary function 169
summary statistics
about 104
for The Wall dataset 105-108
hinges 104, 105
interquartile range (IQR) 105
median 104
percentiles 104
quantiles 104
T
table object
about 49, 50
Titanic dataset, creating 51, 52
testing dataset 250
text variable 9
Titanic data
exporting 60
to.dendrogram function 305
towards-the-center 162
trailing digits 109
training dataset 250
true positive rate (tpr) 220
two-sample hypotheses
testing 159
U
UCBAdmissions 52
UCBAdmissions dataset 260
uniform distribution
about 27
examples 28
V
validation dataset 250
variance 18
variance inflation factor (VIF) 191
vector
about 34
examples 35
generating 35
vector objects
basic arithmetic 36
creating 35
visualization techniques, for categorical data
about 66
bar chart 66
dot chart 74
Thank you for buying
R Statistical Application Development by Example
Beginner's Guide
About Packt Publishing
Packt, pronounced 'packed', published its first book "Mastering phpMyAdmin for Effective
MySQL Management" in April 2004 and subsequently continued to specialize in publishing
highly focused books on specific technologies and solutions.
Our books and publications share the experiences of your fellow IT professionals in adapting
and customizing today's systems, applications, and frameworks. Our solution-based books
give you the knowledge and power to customize the software and technologies you're
using to get the job done. Packt books are more specific and less general than the IT books
you have seen in the past. Our unique business model allows us to bring you more focused
information, giving you more of what you need to know, and less of what you don't.
Packt is a modern, yet unique publishing company, which focuses on producing quality,
cutting-edge books for communities of developers, administrators, and newbies alike. For
more information, please visit our website: www.packtpub.com.
About Packt Open Source
In 2010, Packt launched two new brands, Packt Open Source and Packt Enterprise, in order
to continue its focus on specialization. This book is part of the Packt Open Source brand,
home to books published on software built around Open Source licences, and offering
information to anybody from advanced developers to budding web designers. The Open
Source brand also runs Packt's Open Source Royalty Scheme, by which Packt gives a royalty
to each Open Source project about whose software a book is sold.
Writing for Packt
We welcome all inquiries from people who are interested in authoring. Book proposals
should be sent to author@packtpub.com. If your book idea is still at an early stage and you
would like to discuss it first before writing a formal book proposal, contact us; one of our
commissioning editors will get in touch with you.
We're not just looking for published authors; if you have strong technical skills but no writing
experience, our experienced editors can help you develop a writing career, or simply get
some additional reward for your expertise.
NumPy 1.5 Beginner's Guide
ISBN: 978-1-84951-530-6 Paperback: 234 pages
An action-packed guide for the easy-to-use, high
performance, Python based free open source NumPy
mathematical library using real-world examples
1. The first and only book that truly explores NumPy
practically
2. Perform high performance calculations with clean
and efficient NumPy code
3. Analyze large data sets with statistical functions
4. Execute complex linear algebra and mathematical
computations
Matplotlib for Python Developers
ISBN: 978-1-84719-790-0 Paperback: 308 pages
Build remarkable publication-quality plots the easy
way
1. Create high quality 2D plots by using Matplotlib
productively
2. Incremental introduction to Matplotlib, from the
ground up to advanced levels
3. Embed Matplotlib in GTK+, Qt, and wxWidgets
applications as well as web sites to utilize them in
Python applications
4. Deploy Matplotlib in web applications and expose it
on the Web using popular web frameworks such as
Pylons and Django
Please check www.PacktPub.com for information on our titles
Sage Beginner's Guide
ISBN: 978-1-84951-446-0 Paperback: 364 pages
Unlock the full potential of Sage for simplifying and
automating mathematical computing
1. The best way to learn Sage, which is an open source
alternative to Magma, Maple, Mathematica, and
Matlab
2. Learn to use symbolic and numerical computation
to simplify your work and produce publication-
quality graphics
3. Numerically solve systems of equations, find roots,
and analyze data from experiments or simulations
R Graph Cookbook
ISBN: 978-1-84951-306-7 Paperback: 272 pages
Detailed hands-on recipes for creating the most
useful types of graphs in R—starting from the
simplest versions to more advanced applications
1. Learn to draw any type of graph or visual data
representation in R
2. Filled with practical tips and techniques for creating
any type of graph you need; not just theoretical
explanations
3. All examples are accompanied with the
corresponding graph images, so you know what the
results look like
Please check www.PacktPub.com for information on our titles