R Statistical Application Development by Example Beginner's Guide


R Statistical Application
Development by Example
Beginner's Guide
Learn R Statistical Application Development from scratch
in a clear and pedagogical manner
Prabhanjan Narayanachar Tattar
BIRMINGHAM - MUMBAI
R Statistical Application Development by Example
Beginner's Guide
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be
caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2013
Production Reference: 1170713
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-944-1
www.packtpub.com
Cover Image by Asher Wishkerman (wishkerman@hotmail.com)
Credits
Author
Prabhanjan Narayanachar Tattar
Reviewers
Mark van der Loo
Mzabalazo Z. Ngwenya
A Ohri
Tengfei Yin
Acquisition Editor
Usha Iyer
Lead Technical Editor
Arun Nadar
Technical Editors
Madhuri Das
Mausam Kothari
Amit Ramadas
Varun Pius Rodrigues
Lubna Shaikh
Project Coordinator
Anurag Banerjee
Proofreaders
Maria Gould
Paul Hindle
Indexer
Hemangini Bari
Graphics
Ronak Dhruv
Production Coordinators
Melwyn D'sa
Zahid Shaikh
Cover Work
Melwyn D'sa
Zahid Shaikh
About the Author
Prabhanjan Narayanachar Tattar has seven years of experience with the R software and
has also co-authored the book A Course in Statistics with R, published by Narosa Publishing
House. He has built two R packages, gpk and ACSWR, and obtained a PhD
(Statistics) from Bangalore University in the broad area of Survival Analysis, publishing
several articles in peer-reviewed journals. During the PhD program, he received
the IBS(IR)-GK Shukla Young Biometrician Award (2005) and the
Dr. U.S. Nair Award for Young Statistician (2007), and also held a Junior and Senior Research
Fellowship of CSIR-UGC.
Prabhanjan works as a Business Analysis Advisor at Dell Inc., Bangalore, in
the Customer Service Analytics unit of the larger Dell Global Analytics arm of Dell.
I would like to thank Prof. Athar Khan, Aligarh Muslim University, whose
teaching during a shared R workshop inspired me to a very large extent.
My friend Veeresh Naidu has gone out of his way in helping and inspiring
me to complete this book, and I thank him for everything that defines our
friendship.
Many of my colleagues at the Customer Service Analytics unit of Dell Global
Analytics, Dell Inc. have been very tolerant of my stat talk with them, and it
is their need for the subject which has partly influenced the writing of this
book. I would like to record my thanks to them and also to my manager, Debal
Chakraborty.
My wife Chandrika has been very cooperative, and without her permission to
work on the book during weekends and housework timings this book would
never have been completed. My daughter Pranathi, at 2 years and 9 months, has
started pre-kindergarten school, and I genuinely believe that one day she
will read this entire book.
I am also grateful to the reviewers whose constructive suggestions and
criticisms have helped the book reach a higher level than where it would
have ended up without their help. Last but not least, I would like to
take the opportunity to thank Usha Iyer and Anurag Banerjee for their inputs
on the earlier drafts, and also for their patience with my delays.
About the Reviewers
Mark van der Loo obtained his PhD at the Institute for Theoretical Chemistry at the
University of Nijmegen (The Netherlands). Since 2007 he has worked at the statistical
methodology department of the Dutch official statistics office (Statistics Netherlands).
His research interests include automated data cleaning methods and statistical computing.
At Statistics Netherlands he is responsible for the local R center of expertise, which supports
and educates users on statistical computing with R. Mark has co-authored a number of R
packages that are available via CRAN, namely editrules, deducorrect, rspa, extremevalues,
and stringdist. Together with Edwin de Jonge he authored the book Learning RStudio for R
Statistical Computing. A list of his publications can be found at www.markvanderloo.eu.
Mzabalazo Z. Ngwenya has worked extensively in the field of statistical consulting and
currently works as a biometrician. He holds an MSc in Mathematical Statistics from the
University of Cape Town and is at present studying towards a PhD (School of Information
Technology, University of Pretoria) in the field of Computational Intelligence. His research
interests include statistical computing, machine learning, spatial statistics, and simulation and
stochastic processes. Previously, he was involved in reviewing Learning RStudio for R Statistical
Computing by Mark P.J. van der Loo and Edwin de Jonge, Packt Publishing.
A Ohri is the founder of the analytics startup Decisionstats.com. He pursued graduate
studies at the University of Tennessee, Knoxville and the Indian Institute of Management,
Lucknow. In addition, he has a Mechanical Engineering degree from the Delhi College of
Engineering. He has interviewed more than 100 practitioners in analytics, including leading
members from all the analytics software vendors. He has written almost 1300 articles on
his blog, besides guest writing for influential analytics communities. He teaches courses
in R through online education and has worked as an analytics consultant in India for the
past decade. He was one of the earliest independent analytics consultants in India, and his
current research interests include spreading open source analytics, analyzing social media
manipulation, simpler interfaces to cloud computing, and unorthodox cryptography.
He is the author of R for Business Analytics
(http://www.springer.com/statistics/book/978-1-4614-4342-1).
www.PacktPub.com
Support files, eBooks, discount offers, and more
You might want to visit www.PacktPub.com for support files and downloads related
to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files
available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book
customer, you are entitled to a discount on the eBook copy. Get in touch with us at
service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can access, read, and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib
today and view nine entirely free books. Simply use your login credentials for immediate access.
The work is dedicated to my father Narayanachar, the very first engineer who
influenced my outlook towards Science and Engineering. For the same reason,
my mother Lakshmi made me realize the importance of life and philosophy.
Table of Contents
Preface
Chapter 1: Data Characteristics
Questionnaire and its components
Understanding the data characteristics in an R environment
Experiments with uncertainty in computer science
R installation
Using R packages
RSADBE – the book's R package
Discrete distribution
Discrete uniform distribution
Binomial distribution
Hypergeometric distribution
Negative binomial distribution
Poisson distribution
Continuous distribution
Uniform distribution
Exponential distribution
Normal distribution
Summary
Chapter 2: Import/Export Data
data.frame and other formats
Constants, vectors, and matrices
Time for action – understanding constants, vectors, and basic arithmetic
Time for action – matrix computations
The list object
Time for action – creating a list object
The data.frame object
Time for action – creating a data.frame object
The table object
Time for action – creating the Titanic dataset as a table object
read.csv, read.xls, and the foreign package
Time for action – importing data from external files
Importing data from MySQL
Exporting data/graphs
Exporting R objects
Exporting graphs
Time for action – exporting a graph
Managing an R session
Time for action – session management
Summary
Chapter 3: Data Visualization
Visualization techniques for categorical data
Bar charts
Going through the built-in examples of R
Time for action – bar charts in R
Dot charts
Time for action – dot charts in R
Spine and mosaic plots
Time for action – the spine plot for the shift and operator data
Time for action – the mosaic plot for the Titanic dataset
Pie charts and the fourfold plot
Visualization techniques for continuous variable data
Boxplot
Time for action – using the boxplot
Histograms
Time for action – understanding the effectiveness of histograms
Scatter plots
Time for action – plot and pairs R functions
Pareto charts
A brief peek at ggplot2
Time for action – qplot
Time for action – ggplot
Summary
Chapter 4: Exploratory Analysis
Essential summary statistics
Percentiles, quantiles, and median
Hinges
The interquartile range
Time for action – the essential summary statistics for "The Wall" dataset
The stem-and-leaf plot
Time for action – the stem function in play
Letter values
Data re-expression
Bagplot – a bivariate boxplot
Time for action – the bagplot display for a multivariate dataset
The resistant line
Time for action – the resistant line as a first regression model
Smoothing data
Time for action – smoothening the cow temperature data
Median polish
Time for action – the median polish algorithm
Summary
Chapter 5: Statistical Inference
Maximum likelihood estimator
Visualizing the likelihood function
Time for action – visualizing the likelihood function
Finding the maximum likelihood estimator
Using the fitdistr function
Time for action – finding the MLE using mle and fitdistr functions
Confidence intervals
Time for action – confidence intervals
Hypotheses testing
Binomial test
Time for action – testing the probability of success
Tests of proportions and the chi-square test
Time for action – testing proportions
Tests based on normal distribution – one-sample
Time for action – testing one-sample hypotheses
Tests based on normal distribution – two-sample
Time for action – testing two-sample hypotheses
Summary
Chapter 6: Linear Regression Analysis
The simple linear regression model
What happens to the arbitrary choice of parameters?
Time for action – the arbitrary choice of parameters
Building a simple linear regression model
Time for action – building a simple linear regression model
ANOVA and the confidence intervals
Time for action – ANOVA and the confidence intervals
Model validation
Time for action – residual plots for model validation
Multiple linear regression model
Averaging k simple linear regression models or a multiple linear regression model
Time for action – averaging k simple linear regression models
Building a multiple linear regression model
Time for action – building a multiple linear regression model
The ANOVA and confidence intervals for the multiple linear regression model
Time for action – the ANOVA and confidence intervals for the multiple linear regression model
Useful residual plots
Time for action – residual plots for the multiple linear regression model
Regression diagnostics
Leverage points
Influential points
DFFITS and DFBETAS
The multicollinearity problem
Time for action – addressing the multicollinearity problem for the Gasoline data
Model selection
Stepwise procedures
The backward elimination
The forward selection
Criterion-based procedures
Time for action – model selection using the backward, forward, and AIC criteria
Summary
Chapter 7: The Logistic Regression Model
The binary regression problem
Time for action – limitations of linear regression models
Probit regression model
Time for action – understanding the constants
Logistic regression model
Time for action – fitting the logistic regression model
Hosmer-Lemeshow goodness-of-fit test statistic
Time for action – the Hosmer-Lemeshow goodness-of-fit statistic
Model validation and diagnostics
Residual plots for the GLM
Time for action – residual plots for the logistic regression model
Influence and leverage for the GLM
Time for action – diagnostics for the logistic regression
Receiving operator curves
Time for action – ROC construction
Logistic regression for the German credit screening dataset
Time for action – logistic regression for the German credit dataset
Summary
Chapter 8: Regression Models with Regularization
The overfitting problem
Time for action – understanding overfitting
Regression spline
Basis functions
Piecewise linear regression model
Time for action – fitting piecewise linear regression models
Natural cubic splines and the general B-splines
Time for action – fitting the spline regression models
Ridge regression for linear models
Time for action – ridge regression for the linear regression model
Ridge regression for logistic regression models
Time for action – ridge regression for the logistic regression model
Another look at model assessment
Time for action – selecting lambda iteratively and other topics
Summary
Chapter 9: Classification and Regression Trees
Recursive partitions
Time for action – partitioning the display plot
Splitting the data
The first tree
Time for action – building our first tree
The construction of a regression tree
Time for action – the construction of a regression tree
The construction of a classification tree
Time for action – the construction of a classification tree
Classification tree for the German credit data
Time for action – the construction of a classification tree
Pruning and other finer aspects of a tree
Time for action – pruning a classification tree
Summary
Chapter 10: CART and Beyond
Improving CART
Time for action – cross-validation predictions
Bagging
The bootstrap
Time for action – understanding the bootstrap technique
The bagging algorithm
Time for action – the bagging algorithm
Random forests
Time for action – random forests for the German credit data
The consolidation
Time for action – random forests for the low birth weight data
Summary
Appendix: References
Index
Preface
The open source software R is fast becoming one of the preferred companions of Statistics,
even as the subject continues to add many friends in Machine Learning, Data Mining, and so
on among its already rich scientific network. The era of embedding mathematical theory and
statistical applications is truly a remarkable one for society, and the software has
played a very pivotal role in it. This book is a humble attempt at presenting Statistical Models
through R for any reader who has a bit of familiarity with the subject. In my experience of
practicing the subject with colleagues and friends from different backgrounds, I realized that
many are interested in learning the subject and applying it in their domain, which enables
them to take appropriate decisions in analyses that involve uncertainty. A decade earlier
my friends would be content with being pointed to a useful reference book. Not so anymore!
The work in almost every domain is done through computers, and naturally people have
their data available in spreadsheets, databases, and sometimes in plain text format. The
request for an appropriate statistical model is invariably followed by a one word question:
"Software?" My answer to them has always been a single letter reply: "R!" Why? It is really a
very simple decision, and it has been my companion over the last seven years. In this book,
this experience has been converted into detailed chapters and a cleaner breakup of model
building in R.
A by-product of the interaction with colleagues and friends who are all aspiring statistical
model builders has been that I have been able to pick up the trough of their learning
curve of the subject. The first attempt towards fixing the hurdle has been to introduce
the fundamental concept that beginners are most familiar with, which is data.
The difference is simply in the subtleties, and as such I firmly believe that introducing the
subject on their turf motivates the reader for a long way in their journey. As with most
statistical software, R provides modules and packages which cover most of the
recently invented statistical methodology. The first five chapters of the book focus on the
fundamental aspects of the subject and the R software, and hence cover R basics, data
visualization, exploratory data analysis, and statistical inference.
The foundational aspects are illustrated using interesting examples, and this sets up the
framework for the later five chapters. Regression models, with linear and logistic regression
models at the forefront, are of paramount interest in applications. The discussion is
more generic in nature and the techniques can be easily adapted across different domains.
The last two chapters have been inspired by the Breiman school, and hence the modern
method of Classification and Regression Trees is developed in detail and illustrated
through a practical dataset.
What this book covers
Chapter 1, Data Characteristics, introduces the different types of data through a
questionnaire and dataset. The need for statistical models is elaborated in some interesting
contexts. This is followed by a brief explanation of R installation and the related packages.
Discrete and continuous random variables are discussed through introductory R programs.
Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames,
vectors, matrices, and lists are discussed with clear and simple examples. Importing of data
from external files in csv, xls, and other formats is elaborated next. Writing data/objects from
R for other software is considered, and the chapter concludes with a dialogue on R session
management.
Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and
numeric datasets. This translates into techniques of bar chart, dot chart, spine and mosaic
plot, and fourfold plot for categorical data, while histogram, box plot, and scatter plot for
continuous/numeric data. A very brief introduction to ggplot2 is also provided here.
Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for preliminary
analysis of data. The visualization techniques of EDA such as stem-and-leaf, letter values, and
modeling techniques of resistant line, smoothing data, and median polish give a rich insight
as a preliminary analysis step.
Chapter 5, Statistical Inference, begins with an emphasis on the likelihood function and
computing the maximum likelihood estimate. Confidence intervals for the parameters
of interest are developed using functions defined for specific problems. The chapter also
considers important statistical tests: the Z-test and t-test for comparison of means, and the
chi-square test and F-test for comparison of variances.
Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a
set of explanatory variables. The linear regression model has many underlying assumptions,
and such details are verified using validation techniques. A model may be affected by a
single observation, or a single output value, or an explanatory variable. Statistical metrics
that help remove one or more kinds of such anomalies are discussed in depth. Given a large
number of covariates, an efficient model is developed using model selection techniques.
Chapter 7, The Logistic Regression Model, covers a model that is useful for classification when
the output is a binary variable. Diagnostics and model validation through residuals are used,
which lead to an improved model. ROC curves are discussed next, which help in identifying
a better classification model.
Chapter 8, Regression Models with Regularization, discusses the problem of overfitting
arising from the use of models developed in the previous two chapters. Ridge regression
significantly reduces the possibility of an overfit model, and the development of natural
spline models also lays the basis for the models considered in the next chapter.
Chapter 9, Classification and Regression Trees, provides a tree-based regression model.
The trees are initially built using R functions, and the final trees are also reproduced using
rudimentary code, leading to a clear understanding of the CART mechanism.
Chapter 10, CART and Beyond, considers two enhancements of CART using bagging and
random forests. A consolidation of all the models from Chapter 6 to Chapter 10 is also given
through a dataset.
Chapter 1 to Chapter 5 form the basics of the R software and the subject of Statistics. Practical
and modern regression models are discussed in depth from Chapter 6 to Chapter 10.
Appendix, References, lists the books that have been referred to in this book.
What you need for this book
R is the only required software for this book and you can download it from
http://www.cran.r-project.org/. R packages will be required too, though this task is done
within a working R session. The datasets used in the book are available in the R package RSADBE,
which is an abbreviation of the book's title, at
http://cran.r-project.org/web/packages/RSADBE/index.html.
Who this book is for
This book will be useful for readers who have a flair and a need for statistical applications
in their own domains. The first seven chapters are also useful for any masters student in
Statistics, and the motivated student can easily complete the rest of the book and obtain
a working knowledge of CART.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions on how to complete a procedure or task, we use:
Time for action – heading
1. Action 1
2. Action 2
3. Action 3
Instructions often need some extra explanation so that they make sense, so they are
followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop quiz – heading
These are short multiple-choice questions intended to help you test your own
understanding.
Have a go hero – heading
These are practical challenges that give you ideas for experimenting with what you
have learned.
You will also find a number of styles of text that distinguish between different kinds of
information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "The operator %% on two objects, say x and y,
returns the remainder following an integer division, and the operator %/% returns the integer
division." In certain cases the complete code cannot be included within the action list, and
in such cases you will find the following display:
Plot the "Response Residuals" against the "Fitted Values" of the pass_logistic model
with the following values assigned:
plot(fitted(pass_logistic), residuals(pass_logistic,"response"),
col="red", xlab="Fitted Values", ylab="Residuals", cex.axis=1.5,
cex.lab=1.5)
In such a case you need to run the code starting with plot( up to cex.lab=1.5) in R.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or may have disliked. Reader feedback is important for us to
develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to feedback@packtpub.com,
and mention the book title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your
account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in
this book. The color images will help you better understand the changes in the output. You
can download this file from http://www.packtpub.com/sites/default/files/
downloads/9441OS_R-Statistical-Application-Development-by-Example-
Color-Graphics.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books, maybe a mistake in the text or the
code, we would be grateful if you would report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the errata submission form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata
will be uploaded to our website, or added to any list of existing errata, under the Errata
section of that title.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt,
we take the protection of our copyright and licenses very seriously. If you come across any
illegal copies of our works, in any form, on the Internet, please provide us with the location
address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected
pirated material.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.
Questions
You can contact us at questions@packtpub.com if you are having a problem
with any aspect of the book, and we will do our best to address it.
Data Characteristics
Data consists of observations across different types of variables, and it is vital
that any Data Analyst understands these intricacies at the earliest stage of
exposure to statistical analysis. This chapter recognizes the importance of data
and begins with a template of a dummy questionnaire, and then proceeds with
the nitty-gritty of the subject. We then explain how uncertainty creeps into
the domain of computer science. The chapter closes with coverage of important
families of discrete and continuous random variables.
We will cover the following topics:
Identification of the main variable types as nominal, categorical, and continuous variables
The uncertainty arising in many real experiments
R installation and packages
The mathematical form of discrete and continuous random variables and their applications
Questionnaire and its components
The goal of this section is the introduction of numerous variable types at the first possible
occasion. Traditionally, an introductory course begins with the elements of probability theory
and then builds up the requisites leading to random variables. This convention is dropped in
this book and we begin straightaway with data. There is a primary reason for choosing this
path. The approach builds on what the reader is already familiar with and then connects it
with the essential framework of the subject.
It is very likely that the user is familiar with questionnaires. A questionnaire may be administered
after the birth of a baby with a view to aiding the hospital in the study of the experience
of the mother, the health status of the baby, and the concerns of the immediate guardians
of the newborn. A multi-store department may instantly request the customer to fill in a
short questionnaire for capturing the customer's satisfaction after the sale of a product.
A customer's satisfaction following the service of their vehicle (see the detailed example
discussed later) can be captured through a few queries. The questionnaires may arise in
different forms than just merely on paper. They may be sent via e-mail, telephone, short
message service (SMS), and so on. As an example, one may receive an SMS that seeks a
mandatory response in a Yes/No form. An e-mail may arrive in the Outlook inbox, which
requires the recipient to respond through a vote for any of these three options: "Will attend
the meeting", "Can't attend the meeting", or "Not yet decided".
Suppose the owner of a multi-brand car center wants to find out the satisfaction percentage
of his customers. Customers bring their cars to a service center for varied reasons. The owner
wants to find out the satisfaction levels post the servicing of the cars and find the areas
where improvement will lead to higher satisfaction among the customers. It is well known
that the higher the satisfaction levels, the greater the customer's loyalty towards
the service center. Towards this, a questionnaire is designed and then data is collected from
the customers. A snippet of the questionnaire is given in Figure 1, and the information given
by the customers leads to different types of data characteristics. The variables Customer ID
and Questionnaire ID may be serial numbers or randomly generated unique numbers. The
purpose of such variables is the unique identification of people's responses. It may be possible that
there are follow-up questionnaires as well. In such cases, the Customer ID for a responder will
continue to be the same, whereas the Questionnaire ID needs to change for identification of
the follow-up. The values of these types of variables are, in general, not useful for
analytical purposes.
Figure 1: A hypothetical questionnaire
The information of Full Name in this survey is a starting point to break the ice with
the responder. In very exceptional cases the name may be useful for profiling purposes.
For our purposes the name will simply be a text variable that is not used for analysis
purposes. Gender is asked to know the person's gender, and in quite a few cases it may
be an important factor explaining the main characteristics of the survey; in this case it may
be mileage. Gender is an example of a categorical variable.
Age in Years is a variable that captures the age of the customer. The data for this field is
numeric in nature and is an example of a continuous variable.
The fourth and fifth questions help the multi-brand dealer in identifying the car model
and its age. The first question here enquires about the type of the car model. The car models
of the customers may vary from Volkswagen Beetle, Ford Endeavor, Toyota Corolla, Honda
Civic, to Tata Nano; see the next screenshot. Though the model name is actually a noun, we
make a distinction from the first question of the questionnaire in the sense that the former is
a text variable while the latter leads to a categorical variable. Next, the car model may easily
be identified to classify the car into one of the car categories, such as hatchback, sedan,
station wagon, or utility vehicle, and such a classifying variable may serve as an
ordinal variable, as per the overall car size. The age of the car in months since its manufacture
date may explain the mileage and odometer reading.
The sixth and seventh questions simply ask the customer if their minor/major problems
were completely fixed or not. This is a binary question that takes either of the values
Yes or No. Small dents, power windows malfunctioning, niggling noises in the cabin, low output
from the music speakers, and other similar issues which do not hinder the good functioning of the
car may be treated as minor problems that are expected to be fixed in the car. Disc brake
problems, wheel alignment, steering rattling issues, and similar problems that expose the user
and co-users of the road to danger are of grave concern, as they affect the functioning of a
car, and are treated as major problems. Any user will expect all of his/her issues to be resolved
during a car service. An important goal of the survey is to find the service center efficiency in
handling the minor and major issues of the car. The labels Yes/No may be replaced by +1 and
-1, or any other labels of convenience.
The eighth question, "What is the mileage (km/liter) of the car?", gives a measure of the average
petrol/diesel consumption. In many practical cases this data is provided by the belief of
the customer, who may simply declare it to be between 5 km/liter and 25 km/liter. In the case of
a lower mileage, the customer may ask for a finer tune-up of the engine, wheel alignment,
and so on. A general belief is that if the mileage is closer to the assured mileage as marketed
by the company, or some authority such as the Automotive Research Association of India (ARAI),
the customer is more likely to be happy. An important variable is the overall kilometers done
by the car up to the point of service. Vehicles have certain maintenances at the intervals
of 5,000 km, 10,000 km, 20,000 km, 50,000 km, and 100,000 km. This variable may also be
related to the age of the vehicle.
Let us now look at the final question of the snippet. Here, the customer is asked to rate
his overall experience of the car service. A response from the customer may be sought
immediately after a small test ride post the car service, or it may be through a questionnaire
sent to the customer's e-mail ID. A rating of Very Poor suggests that the workshop has served
the customer miserably, whereas the rating of Very Good conveys that the customer is
completely satisfied with the workshop service. Note that there is some order in the response
of the customer, in that we can grade the rankings in a certain order: Very Poor < Poor <
Average < Good < Very Good. This implies that the structure of the ratings must be respected
when we analyze the data of such a study. In the next section, these concepts are elaborated
through a hypothetical dataset.
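To see how such an ordered structure can be preserved in an R analysis, here is a minimal sketch; the handful of rating values below is made up purely for illustration and the object names are arbitrary:
> ratings <- c("Good", "Average", "Very Good", "Poor", "Very Poor")
> sat_rating <- factor(ratings, levels=c("Very Poor", "Poor", "Average",
+ "Good", "Very Good"), ordered=TRUE)
> class(sat_rating)     # "ordered" "factor", so the ordering is retained
> sat_rating < "Good"   # order-based comparisons are now meaningful
An unordered factor would suffice for a plain categorical variable such as Gender, while ordered=TRUE is what tells R to respect the Very Poor < Poor < Average < Good < Very Good structure.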
Figure: A hypothetical dataset collected through the questionnaire, showing 20 customer records with the columns Customer_ID, Questionnaire_ID, Name, Gender, Age, Car_Model, Car Manufacture Year, Minor Problems, Major Problems, Mileage, Odometer, and Satisfaction Rating.
Understanding the data characteristics in an R environment
A snippet of an R session is given in Figure 2. Here we simply relate an R session with the
survey and sample data of the previous table. The simple goal here is to get a feel/buy-in
of R and not necessarily follow the R codes. The R installation process is explained in the R
installation section. Here the user is loading the SQ R data object (SQ simply stands for sample
questionnaire) into the session. The nature of the SQ object is a data.frame that stores a
variety of other objects in itself. For more technical details of the data.frame function, see
The data.frame object section of Chapter 2, Import/Export Data. The names of a data.frame
object may be extracted using the function variable.names. The R function class helps
to identify the nature of an R object. As we have a list of variables, it is useful to find the class of all of
them using the function sapply. In the following screenshot, the mentioned steps have been
carried out:
Figure 2: Understanding the variable types of an R object
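As the screenshot itself cannot be reproduced here, the following minimal sketch retraces the steps it shows, assuming the SQ data frame has already been loaded into the session as described above:
> class(SQ)            # confirms that SQ is a data.frame
> variable.names(SQ)   # lists the variables of the data.frame
> sapply(SQ, class)    # reports the class of every variable in SQ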
The variable characteristics are also on expected lines, as they truly should be, and
we see that the variables Customer_ID, Questionnaire_ID, and Name are character
variables; Gender, Car_Model, Minor_Problems, and Major_Problems are factor
variables; DoB and Car_Manufacture_Year are date variables; Mileage and Odometer
are integer variables; and finally the variable Satisfaction_Rating is an ordered
and factor variable.
In the remainder of this chapter we will delve into more details about the nature of various
data types. In more formal language, a variable is called a random variable, abbreviated
as RV in the rest of the book, in statistical literature. A distinction needs to be made here.
In this book we do not focus on the important aspects of probability theory. It is assumed that
the reader is familiar with probability, say at the level of Freund (2003) or Ross (2001). An RV
is a function that maps from the probability (sample) space Ω to the real line. From
the previous example we have Odometer and Satisfaction_Rating as two examples of a
random variable. In formal language, random variables are generally denoted by the letters
X, Y, …. The distinction that is required here is that in applications what we observe are
the realizations/values of the random variables. In general, the realized values are denoted by
the lower cases x, y, …. Let us clarify this at more length.
Suppose that we denote the random variable Satisfaction_Rating by X. Here,
the sample space Ω consists of the elements Very Poor, Poor, Average, Good, and Very
Good. For the sake of convenience we will denote these elements by O1, O2, O3, O4, and
O5 respectively. The random variable X takes one of the values O1, …, O5 with respective
probabilities p1, …, p5. If the probabilities were known, we wouldn't have to worry about statistical
analysis. In simple terms, if we knew the probabilities of the Satisfaction_Rating RV, we
could simply use them to conclude whether more customers give a Very Good rating than a Poor one.
However, our survey data does not contain every customer who has availed of car service from
the workshop, and as such we have representative probabilities and not actual probabilities.
Now, we have seen 20 observations in the R session, and corresponding to each row we had
some values under the Satisfaction_Rating column. Let us denote the satisfaction ratings
for the 20 observations by the symbols X1, …, X20. Before we collect the data, the random
variables X1, …, X20 can assume any of the values in Ω. Post the data collection, we see that the
first customer has given the rating as Good (that is, O4), the second as Average (O3), and so on
up to the twentieth customer's rating as Average (again O3). By convention, what is observed
in the data sheet is actually x1, …, x20, the realized values of the RVs X1, …, X20.
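For instance, transcribing the Satisfaction_Rating column of the 20 observations, the representative (observed) proportions can be computed in a couple of lines; these are only estimates of p1, …, p5 and not the actual probabilities:
> x <- c("Good", "Average", "Good", "Average", "Very Good", "Good",
+ "Good", "Good", "Very Poor", "Very Good", "Good", "Poor", "Poor",
+ "Good", "Very Poor", "Good", "Good", "Poor", "Poor", "Average")
> table(x)/length(x)   # relative frequency of each rating among x1, ..., x20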
Experiments with uncertainty in computer science
The common man of the previous century was skeptical about chance/randomness and
attributed it to the lack of accurate instruments, and to information not necessarily being
captured in many variables. The skepticism about the need for modeling randomness
continues for the common man in the current era, as he feels that the instruments are too
accurate and that multi-variable information eliminates uncertainty. However, this is not
the case, and we will look here at some examples that drive home this point. In the previous
section we dealt with data arising from a questionnaire regarding the service level at a car
dealer. It is natural to accept that different individuals respond in distinct ways, and further
that the car, being a complex assembly of different components, responds differently in near
identical conditions. A question then arises whether we really have to deal with such
situations involving uncertainty in computer science. The answer is certainly affirmative
and we will consider some examples in the context of computer science and engineering.
Suppose that the task is the installation of software, say R itself. At a new lab there has been
an arrangement of 10 new desktops that have the same configuration. That is, the RAM,
memory, the processor, operating system, and so on are all the same in the 10 different machines.
For simplicity, assume that the electricity supply and lab temperature are identical for all the
machines. Do you expect that the complete R installation, as per the directions specified in
the next section, will take the same number of milliseconds on all the 10 machines? The run time
of an operation can be easily recorded, maybe using other software if not manually. The
answer is a clear "No", as there will be minor variations in the processes active on the different
desktops. Thus, we have our first experiment in the domain of computer science which
involves uncertainty.
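A minimal sketch of such a timing experiment in R itself is given below; the computation being timed is an arbitrary stand-in for the installation task, and the elapsed times will differ slightly from run to run and from machine to machine, which is precisely the point:
> # repeat an identical task five times and compare the elapsed times
> replicate(5, system.time(sum(rnorm(1e6)))["elapsed"])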
Suppose that the lab is now two years old. As an administrator, do you expect all the 10
machines to be working in the same identical condition, given that we started with identical
configurations and environments? The question is relevant, as according to general experience
a few machines may have broken down. Despite the warranty and assurances by the desktop
company, the number of machines that have broken down will not be exactly as assured.
Thus, we again have uncertainty.
Assume that three machines are not functioning at the end of two years. As an administrator,
you have called the service vendor to fix the problem. For the sake of simplicity, we assume
that the nature of the failure is the same for all three machines, say a motherboard failure on the
three failed machines. Is it practical to expect that the vendor would fix the three machines within
identical times? Again, by experience we know that this is very unlikely. If the reader thinks
otherwise, assume that 100 identical machines were running for two years and 30 of them
are now having the motherboard issue. It is now clear that some machines may require a
component replacement while others would start functioning following a repair/fix.
Let us now summarize the preceding experiments through the following set of questions:
What is the average installation time for the R software on identically configured
computer machines?
How many machines are likely to break down after a period of one year, two years,
and three years?
If a failed machine has issues related to the motherboard, what is the average
service time?
What is the fraction of failed machines that have a failed motherboard component?
The answers to these types of questions form the main objective of the subject of Statistics.
However, there are certain characteristics of uncertainty that are very richly covered by the
families of probability distributions. Depending on the underlying problem, we have discrete or
continuous RVs. The important and widely useful probability distributions form the content of
the rest of the chapter. We will begin with the useful discrete distributions.
R installation
The official website of R is the Comprehensive R Archive Network (CRAN) at
www.cran.r-project.org. As of the writing of this book, the most recent version of R is 2.15.1.
This software can be downloaded for the three platforms Linux, Mac OS X, and Windows.
Figure 3: The CRAN website (a snapshot)
A Linux user may simply key in sudo apt-get install r-base in the terminal, and after
entering the right password and having the required privilege levels, the R software will be installed.
After the completion of the download and installation, the software can be started by simply keying in R
at the terminal.
A Windows user first needs to click on Download R for Windows as shown in the preceding
screenshot, and then in the base subdirectory click on install R for the first time. In the new
window, click on Download R 3.0.0 for Windows and download the .exe file to a directory
of your choice. The completely downloaded R-3.0.0-win.exe file can be installed as any
other .exe file. The R software may be invoked either from the Start menu, or from the icon
on the desktop.
Using R packages
The CRAN repository hosts 4475 packages as of May 01, 2013. The packages are written and
maintained by statisticians, engineers, biologists, and others. The reasons are varied and the
resourcefulness is very rich, and it reduces the need to write exhaustive, new functions
and programs from scratch. These additional packages can be obtained from
http://www.cran.r-project.org/web/packages/. The user can click on Table of available packages,
sorted by name, which directs to a new web page. Let us illustrate the installation of an R
package named gdata.
We now wish to install the package gdata. There are multiple ways of completing this task.
Clicking on the gdata label leads to the web page http://www.cran.r-project.org/
web/packages/gdata/index.html. In this HTML file we can find a lot of information
about the package: Version, Depends, Imports, Published, Author, Maintainer, License,
System Requirements, Installation, and CRAN checks. Further, the download options may be
chosen from Package source, MacOS X binary, and Windows binary depending on whether
the user's OS is Unix, MacOS, or Windows respectively. Finally, a package may require
other packages as a prerequisite, and it may itself be a prerequisite for other packages.
This information is provided in the Reverse dependencies section in the options Reverse
depends, Reverse imports, and Reverse suggests.
Suppose that the user has a Windows OS. There are two ways to install the package
gdata. Start R as explained earlier. At the console, execute the code
install.packages("gdata"). A CRAN mirror window will pop up asking the user to select one
of the available mirrors. Select one of the mirrors from the list; you may need to scroll
down to locate your favorite mirror, and then click on the OK button. A default setting is
dependencies=TRUE, which will then download and install all other required packages.
Unless there are some violations, such as the dependency requirement of the R version
being at least 2.13.0 in this case, the packages are successfully installed.
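In code, the console route just described amounts to the following two lines; dependencies=TRUE is spelled out here only to mirror the setting mentioned above:
> install.packages("gdata", dependencies=TRUE)
> library(gdata)   # load the package once the installation succeeds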
A second way of installing the gdata package is as follows. On the gdata web page click on the
link gdata_2.11.0.zip. This action will then attempt to download the package through the File
download window. Choose the option Save and specify the path where you wish to download
the package. In my case, I have chosen the path C:\Users\author\Downloads. Now go to
the R window. In the menu ribbon, we have seven options: File, Edit, View, Misc, Packages,
Windows, and Help. Yes, your guess is correct and you would have wisely selected Packages
from the menu. Now, select the last option of Packages, the Install package(s) from local zip
files option, and direct it to the path where you have downloaded the .zip file. Select the
file gdata_2.11.0 and R will do the required remaining part of installing the package. One
of the drawbacks of doing this process manually is that if there are dependencies, the user
needs to ensure that all such packages have been installed before embarking on this second
task of installing the R packages. However, despite this problem, it is quite useful to know this
technique, as we may not be connected to the Internet all the time and can install the packages
as and when it is convenient.
RSADBE – the book's R package
The book uses a lot of datasets from the Web, statistical textbooks, and so on. The file formats
of the datasets are varied, and thus to help the reader, we have put all the datasets
used in the book in an R package, RSADBE, which is the abbreviation of the book's title.
This package will be available from the CRAN website as well as the book's web page.
Thus, whenever you are asked to run data(xyz), the dataset xyz will be available
either in the RSADBE package or the datasets package of R.
The book also uses many of the packages available on CRAN. The following table gives the list
of packages, and the reader is advised to ensure that these packages are installed before you
begin reading the chapter. That is, the reader needs to ensure that, as an example,
install.packages(c("qcc","ggplot2")) is run in the R session before proceeding with Chapter
3, Data Visualization. A consolidated installation call is sketched after the following table.
Chapter number   Packages required
2                foreign, RMySQL
3                qcc, ggplot2
4                LearnEDA, aplpack
5                stats4, PASWR, PairedData
6                faraway
7                pscl, ROCR
8                ridge, DAAG
9                rpart, rattle
10               ipred, randomForest
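A consolidated call such as the following sketch installs all of the listed packages in one go; stats4 is left out of the vector as it ships with R itself, and package availability on CRAN may change over time:
> pkgs <- c("foreign", "RMySQL", "qcc", "ggplot2", "LearnEDA", "aplpack",
+ "PASWR", "PairedData", "faraway", "pscl", "ROCR", "ridge", "DAAG",
+ "rpart", "rattle", "ipred", "randomForest")
> install.packages(pkgs, dependencies=TRUE)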
Discrete distribution
The previous section highlighted the different forms of variables. Variables such as
Gender, Car_Model, and Minor_Problems can take only one of finitely many possible values.
These variables are particular cases of the more general class of discrete variables.
It is to be noted that the sample space Ω of a discrete variable need not be finite. As an
example, the number of errors on a page may take values in the set of non-negative integers,
{0, 1, 2, …}. Suppose that a discrete random variable X can take the values x1, x2, … with
respective probabilities p1, p2, …, that is, $P(X = x_i) = p(x_i) = p_i$. Then we require that the
probabilities be non-negative and further that their sum be 1:

$p_i \geq 0 \text{ for all } i, \quad \text{and} \quad \sum_i p_i = 1,$

where the Greek symbol $\sum$ represents summation over the index i.
The function $p(x_i)$ is called the probability mass function (pmf) of the discrete RV X. We will
now consider formal definitions of important families of discrete variables. Engineers
may refer to Bury (1999) for a detailed collection of statistical distributions useful in their
field. The two most important parameters of a probability distribution are the mean
and the variance of the RV X. In some cases, and important ones too, these parameters may not exist
for the RV. However, we will not focus on such distributions, though we caution the reader
that this does not mean that such RVs are irrelevant. Let us define these parameters for the
discrete RV. The mean and variance of a discrete RV are respectively calculated as:

$E(X) = \sum_i x_i \, p(x_i), \quad \text{and} \quad Var(X) = \sum_i \big(x_i - E(X)\big)^2 \, p(x_i).$

The mean is a measure of central tendency, whereas the variance gives a measure
of the spread of the RV.
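As a small worked illustration of the two formulas, the pmf below is made up purely for demonstration, and the mean and variance are computed directly from the support and the probabilities:
> x <- c(1, 2, 3, 4)              # values taken by the discrete RV
> px <- c(0.1, 0.2, 0.3, 0.4)     # p(x), non-negative and summing to 1
> EX <- sum(x*px)                 # E(X) = 3
> VarX <- sum((x - EX)^2*px)      # Var(X) = 1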
The variables defined so far are more commonly known as categorical variables.
Agresti (2002) defines a categorical variable as a measurement scale consisting
of a set of categories.
Let us identify the categories for the variables listed in the previous section. The categories
for the variable Gender are Male and Female, whereas the car category variable derived
from Car_Model has the categories hatchback, sedan, station wagon, and utility vehicle. The variables
Minor_Problems and Major_Problems have common but independent categories Yes and
No; and finally the variable Satisfaction_Rating has the categories, as seen earlier, Very
Poor, Poor, Average, Good, and Very Good. The variable Car_Model is just a label of the name
of the car and it is an example of a nominal variable.
Finally, the output of the variable Satisfaction_Rating has an implicit order in it: Very
Poor < Poor < Average < Good < Very Good. It may be realized that this difference poses subtle
challenges in the analysis. These types of variables are called ordinal variables. We will now look
at another type of categorical variable that has not popped up thus far.
Practically, it is often the case that the output of a continuous variable is put in a certain bin for
ease of conceptualization. A very popular example is the categorization of the income level
or age. In the case of income variables, it has been realized in one of the earlier studies that
people are very conservative about revealing their income in precise numbers. For example,
the author may be shy to reveal that his monthly income is Rs. 34,892. On the other hand,
it has been revealed that these very same people do not have a problem in disclosing their
income as belonging to one of these bins: < Rs. 10,000, Rs. 10,000-30,000, Rs. 30,000-50,000,
and > Rs. 50,000. Thus, this information may also be coded into labels and then each of the
labels may refer to any one value in an interval bin. Hence, such variables are referred to as
interval variables.
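In R, such binning of a continuous measurement into an interval variable is typically carried out with the cut function; the income values below are arbitrary illustrations:
> income <- c(8500, 34892, 12000, 56000, 42000)
> cut(income, breaks=c(0, 10000, 30000, 50000, Inf),
+ labels=c("< Rs. 10,000", "Rs. 10,000-30,000", "Rs. 30,000-50,000",
+ "> Rs. 50,000"))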
Discrete uniform distribution
A random variable X is said to be a discrete uniform random variable if it can take any one
of M possible labels with equal probability.
As the discrete uniform random variable X can assume one of the values 1, 2, …, M with equal
probability, this probability is actually 1/M. As the probability remains the same across the labels,
the nomenclature "uniform" is justified. It might appear at the outset that this is not a very
useful random variable. However, the reader is cautioned that this intuition is not correct. As
a simple case, this variable arises in many cases where simple random sampling is in
action. The pmf of the discrete uniform RV is calculated as:

$P(X = x_i) = p(x_i) = \frac{1}{M}, \quad i = 1, 2, \ldots, M.$

A simple plot of the probability distribution of a discrete uniform RV is demonstrated next:
> M = 10
> mylabels=1:M
> prob_labels=rep(1/M,length(mylabels))
> dotchart(prob_labels,labels=mylabels,xlim=c(.08,.12),
+ xlab="Probability")
> title("A Dot Chart for Probability of Discrete Uniform RV")
Downloading the example code
You can download the example code files for all Packt books you have
purchased from your account at http://www.packtpub.com. If you
purchased this book elsewhere, you can visit http://www.packtpub.com/support
and register to have the files e-mailed directly to you.
Figure 4: Probability distribution of a discrete uniform random variable
The R programs here are indicative and it is not absolutely necessary that you
follow them here. The R programs will actually begin from the next chapter and
your flow won't be affected if you do not understand certain aspects of them.
Binomial distribution
Recall the second question in the Experiments with uncertainty in computer science section,
which asks "How many machines are likely to break down after a period of one year, two
years, and three years?". When the outcomes involve uncertainty, the more appropriate
question that we ask is related to the probability of the number of breakdowns being x.
Consider a fixed time frame, say 2 years. To make the question more generic, we assume
that we have n machines. Suppose that the probability of a breakdown for a
given machine at any given time is p. The goal is to obtain the probability of x machines with
breakdowns, and implicitly (n-x) functional machines. Now consider a fixed pattern where
the first x units have failed and the remaining are functioning properly. All the n machines
function independently of the other machines. Thus, the probability of observing x machines
in the breakdown state is $p^x$.
Similarly, each of the remaining (n-x) machines has probability (1-p) of being in the
functional state, and thus the probability of these occurring together is (1-p)^{n-x}. Again,
by the independence of the machines, the probability of x machines with breakdown in this
fixed pattern is given by p^{x}(1-p)^{n-x}. Finally, in the overall setup, the number of possible
samples with x machines broken down and (n-x) functional is the number of possible
combinations of choosing x out of n items, which is the combinatorial \binom{n}{x}. As each
of these samples is equally likely to occur, the probability of exactly x broken machines is
given by \binom{n}{x} p^{x}(1-p)^{n-x}.
The RV X obtained in such a context is known as the binomial RV and its pmf is called the
binomial distribution. In mathematical terms, the pmf of the binomial RV is calculated as:

P(X = x) = p(x) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \quad 0 \le x \le n, \; 0 \le p \le 1
The pmf of the binomial distribution is sometimes denoted by b(x; n, p). Let us now look at
some important properties of a binomial RV. The mean and variance of a binomial RV X
are respectively calculated as:

E(X) = np \quad \text{and} \quad Var(X) = np(1 - p)

As p is always a number between 0 and 1, the variance of a binomial RV
is always less than its mean.
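As a quick numerical check of these two formulas, the following sketch computes the mean and variance directly from the pmf values returned by dbinom, using n = 10 and p = 0.5 (the same values as in the next example):
> n <- 10; p <- 0.5; x <- 0:n
> sum(x * dbinom(x, n, p))               # mean, equals n*p = 5
[1] 5
> sum((x - n * p)^2 * dbinom(x, n, p))   # variance, equals n*p*(1-p) = 2.5
[1] 2.5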
Example 1.3.1: Suppose n = 10 and p = 0.5. We need to obtain the probabilities
p(x), x = 0, 1, 2, …, 10. The probabilities can be obtained using the built-in R function dbinom.
The function dbinom returns the probabilities of a binomial RV. The first argument of this
function may be a scalar or a vector according to the points at which we wish to know the
probability. The second argument of the function needs the value of n, the size of
the binomial distribution. The third argument of this function requires the user to specify the
probability of success p. It is natural to forget the syntax of functions and the R help system
becomes very handy here. For any function, you can get details of it using ? followed by the
function name. Please do not give a space between ? and the function name.
Here, you can try ?dbinom.
> n <- 10; p <- 0.5
> p_x <- round(dbinom(x=0:10, n, p),4)
> plot(x=0:10,p_x,xlab="x", ylab="P(X=x)")
Data Characteriscs
[ 22 ]
The R funcon round xes the accuracy of the argument up to the specied number
of digits.
Figure 5: Binomial probabilities
We have used the dbinom function in the previous example. There are three more utility
facets for the binomial distribution. The three facets are p, q, and r. These three facets
respectively help us in computations related to cumulative probabilities, quantiles of the
distribution, and simulation of random numbers from the distribution. To use these functions,
we simply augment the letters with the distribution name, binom here, as pbinom, qbinom,
and rbinom. There will of course be a critical change in the arguments. In fact, there are many
distributions for which the quartet of d, p, q, and r is available; check ?Distributions.
Example 1.3.2: Assume that the probability of a key failing on an 83-key keyboard (the author's
laptop keyboard has 83 keys) is 0.01. Now, we need to find the probability that at a given
time there are 10, 20, or 30 non-functioning keys on this keyboard. Using the dbinom
function these probabilities are easy to calculate. Try to do the same problem using a scientific
calculator or by writing a simple function in any language that you are comfortable with.
> n <- 83; p <- 0.01
> dbinom(10,n,p)
[1] 1.168e-08
> dbinom(20,n,p)
[1] 4.343e-22
> dbinom(30,n,p)
[1] 2.043e-38
> sum(dbinom(0:83,n,p))
[1] 1
As the probabilies of 10-30 keys failing appear too small, it is natural to believe that may
be something is going wrong. As a check, the sum clearly equals 1. Let us have a look at the
problem from a dierent angle. For many x values, the probability p(x) will be approximately
zero. We may not be interested in the probability of an exact number of failures, though we
are interested in the probability of at least x failures occurring, that is, we are interested in
the cumulave probabilies
()
PX x
. The cumulave probabilies for binomial distribuon
are obtained in R using the pbinom funcon. The main arguments of pbinom include size
(for n), prob (for p), and q (the x argument). For the same problem, we now look at the
cumulave probabilies for various p values:
> n <- 83; p <- seq(0.05,0.95,0.05)
> x <- seq(0,83,5)
> i <- 1
> plot(x,pbinom(x,n,p[i]),"l",col=1,xlab="x",ylab=
+ expression(P(X<=x)))
> for(i in 2:length(p)) { points(x,pbinom(x,n,p[i]),"l",col=i)}
Figure 6: Cumulative binomial probabilities
Try to interpret the preceding screenshot.
Data Characteriscs
[ 24 ]
Hypergeometric distribution
A box of N = 200 pieces of 12 GB pen drives arrives at a sales center. The carton contains
M = 20 defective pen drives. A random sample of n units is drawn from the carton. Let X
denote the number of defective pen drives obtained in the sample of n units. The task is to
obtain the probability distribution of X. The number of possible ways of obtaining the sample
of size n is \binom{N}{n}. In this problem we have M defective units and N-M working pen
drives, and x defective units can be sampled in \binom{M}{x} different ways, while n-x good
units can be obtained in \binom{N-M}{n-x} distinct ways. Thus, the probability distribution of
the RV X is calculated as:

P(X = x) = h(x; n, M, N) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}},

where x is an integer between max(0, n - N + M) and min(n, M). The RV is called
the hypergeometric RV and its probability distribution is called the
hypergeometric distribution.
Suppose that we draw a sample of n = 10 units. The function dhyper in R can be used to find
the probabilities of the RV X assuming different values. Note that the second and third
arguments of dhyper are the number of defective units, M, and the number of good units,
N-M, and the last argument is the sample size n.
> N <- 200; M <- 20
> n <- 10
> x <- 0:11
> round(dhyper(x, M, N-M, n), 3)
 [1] 0.340 0.397 0.198 0.055 0.009 0.001 0.000 0.000 0.000 0.000 0.000 0.000
The mean and variance of a hypergeometric distribution are stated as follows:

E(X) = n\frac{M}{N} \quad \text{and} \quad Var(X) = n\frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N - n}{N - 1}
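As an illustrative check of these formulas for the pen drive example (N = 200, M = 20, n = 10), the mean and variance may also be computed directly from the dhyper probabilities; a small sketch:
> N <- 200; M <- 20; n <- 10
> x <- 0:n; p_x <- dhyper(x, M, N - M, n)
> sum(x * p_x)                              # mean, equals n*M/N = 1
> n * (M/N) * (1 - M/N) * (N - n)/(N - 1)   # theoretical variance
> sum((x - sum(x * p_x))^2 * p_x)           # variance computed from the pmf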
Negative binomial distribution
Consider a variant of the problem described in the previous subsection. The 10 new desktops
need to be fitted with add-on 5 megapixel external cameras to help the students attend a
certain online course. Assume that the probability of a non-defective camera unit is p. As an
administrator you keep on placing orders until you receive 10 non-defective cameras. Now, let X
denote the number of orders placed for obtaining the 10 good units. We denote the required
number of successes by k, which in this discussion is k = 10. The goal in this unit is to
obtain the probability distribution of X.
Suppose that the xth order placed results in the procurement of the kth non-defective unit.
This implies that we have received (k-1) non-defective units among the first (x-1) orders
placed, which is possible in \binom{x-1}{k-1} distinct ways. At the xth order, the instant of
having received the kth non-defective unit, we have k successes and x-k failures. Hence, the
probability distribution of the RV is calculated as:

P(X = x) = \binom{x - 1}{k - 1} p^{k} (1 - p)^{x - k}, \quad x = k, k + 1, k + 2, \ldots
Such an RV is called the negative binomial RV and its probability distribution is the negative
binomial distribution. Technically, this RV has no upper bound as the next required success
may never turn up. We state the mean and variance of this distribution as follows:

E(X) = \frac{k}{p} \quad \text{and} \quad Var(X) = \frac{k(1 - p)}{p^{2}}
A parcular and important special case of the negave binomial RV occurs for k = 1,
which is known as the geometric RV. In this case, the pmf is calculated as:

 
[
3; [SS[
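The geometric case is available in R through dgeom, pgeom, qgeom, and rgeom. One caution for a small sketch: R parameterizes the geometric RV by the number of failures before the first success, so the trial number x above corresponds to x - 1 in the R call (p = 0.95 is an illustrative value):
> p <- 0.95
> dgeom(0, prob = p)   # P(success at the very first trial), that is, x = 1
[1] 0.95
> dgeom(2, prob = p)   # P(first success at the third trial), that is, x = 3
[1] 0.002375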
Example 1.3.3. (Baron (2007), page 77) Sequential Testing: In a certain setup, the probability
of an item being defective is (1-p) = 0.05. To complete the lab setup, 12 non-defective units
are required. We need to compute the probability that more than 15 units need to be tested.
Here we make use of the cumulative distribution of the negative binomial distribution through
the pnbinom function available in R. Similar to the pbinom function, the main arguments that
we require here are size, prob, and q. This problem is solved in a single line of code:
> 1-pnbinom(3,size=12,0.95)
[1] 0.005467259
Note that we have specified 3 as the quantile point (the q argument) because pnbinom counts
the number of defective units obtained before the 12th good one; testing more than 15 units
means obtaining at least 4 defective units, and the required probability is the complement
P(at least 4 defectives) = 1 - P(at most 3 defectives), hence the expression 1-pnbinom in the
code. We may equivalently solve the problem using the dnbinom function,
which straightforwardly computes the required probability:
> 1-(dnbinom(3,size=12,0.95)+dnbinom(2,size=12,0.95)+dnbinom(1,
+ size=12,0.95)+dnbinom(0,size=12,0.95))
[1] 0.005467259
Data Characteriscs
[ 26 ]
Poisson distribution
The number of accidents on a 1 km stretch of road, the total calls received during a one-hour
slot on your mobile, the number of "likes" received for a status update on a social networking
site in a day, and similar other counts are some of the examples which are addressed by the
Poisson RV. The probability distribution of a Poisson RV is calculated as:

P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots, \; \lambda > 0

Here λ is the parameter of the Poisson RV, with X denoting the number of events. The Poisson
distribution is sometimes also referred to as the law of rare events. The mean and variance of
the Poisson RV are, surprisingly, the same and equal λ, that is, E(X) = Var(X) = λ.
Example 1.3.4: Suppose that Santa commits errors in a software program at a mean rate
of three errors per A4-size page. Santa's manager wants to know the probability of
Santa committing 0, 5, and 20 errors per page. The R function dpois helps to determine
the answer.
> dpois(0,lambda=3); dpois(5,lambda=3); dpois(20, lambda=3)
[1] 0.04978707
[1] 0.1008188
[1] 7.135379e-11
Note that Santa's probability of committing 20 errors is almost 0.
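The cumulative facet works here as well; as a small follow-up sketch to the same example, the probability of at most five errors on a page is obtained with ppois:
> ppois(5, lambda = 3)   # P(X <= 5), at most five errors per page
[1] 0.9160821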
We will next focus on continuous distributions.
Continuous distribution
The numeric variables in the survey, Age, Mileage, and Odometer, can take any value over
a continuous interval, and these are examples of continuous RVs. In the previous section we
dealt with RVs which have a discrete output. In this section we will deal with RVs which have
a continuous output. A distinction from the previous section needs to be pointed out explicitly.
In the case of a discrete RV, there is a positive number for the probability of the RV taking on
a certain value, which is determined by the pmf. In the continuous case, an RV necessarily
assumes any specific value with zero probability. These technical issues will not be discussed
in this book. In the discrete case, the probabilities of certain values are specified by the pmf,
and in the continuous case the probabilities over intervals are decided by the probability
density function, abbreviated as pdf.
Suppose that we have a connuous RV, X, with the pdf f(x) dened over the possible x values,
that is, we assume that the pdf f(x) is well dened over the range of the RV X, denoted by
x
R
.
It is necessary that the integraon of f(x) over the range
x
R
is necessarily 1, that is,
()
1
x
Rfsds =
.
The probability that the RV X takes a value in an interval [a, b] is dened by:
[]
( )
()
,b
a
PX ab fxdx =
In general we are interested in the cumulave probabilies of a connuous RV, which is
the probability of the event P(X<x). In terms of the previous equaons, this is obtained as:
 
f
³[
3; [IVGV
A special name for this probability is the cumulave density funcon. The mean and variance
of a connuous RV are then dened by:
   
 

DQG ;
³ ³
[ [
5 5
(; [I [G[ [ (; I[G[9DU
As in the previous secon, we will begin with the simpler RV in uniform distribuon.
Uniform distribution
An RV is said to have a uniform distribution over the interval [0, θ], θ > 0, if its probability
density function is given by:

f(x; \theta) = \frac{1}{\theta}, \quad 0 \le x \le \theta, \; \theta > 0

In fact, it is not necessary to restrict our focus to the positive real line. For any two
real numbers a and b from the real line, with b > a, the uniform RV can be defined by:

f(x; a, b) = \frac{1}{b - a}, \quad a \le x \le b, \; b > a
Data Characteriscs
[ 28 ]
The uniform distribuon has a very important role to play in simulaon, as will be seen
in Chapter 6, Linear Regression Analysis. As with the discrete counterpart, in the connuous
case any two intervals of the same length will have equal probability of occurring. The mean
and variance of a uniform RV over the interval [a, b] are respecvely given by:
  
 
ED
DE
(; 9DU;
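These formulas, and more generally the integral definitions of the previous section, can be checked numerically with R's integrate function. A small sketch for the uniform RV over [0, 1], so that a = 0 and b = 1:
> f <- function(x) dunif(x, 0, 1)                        # pdf of U(0, 1)
> integrate(f, 0, 1)$value                               # total probability, 1
[1] 1
> EX <- integrate(function(x) x * f(x), 0, 1)$value; EX  # mean, (a + b)/2
[1] 0.5
> integrate(function(x) (x - EX)^2 * f(x), 0, 1)$value   # variance, (b - a)^2/12
[1] 0.08333333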
Example 1.4.1. Horgan's (2008), Example 15.3: The International Journal of Circuit Theory
and Applications reported in 1990 that researchers at the University of California, Berkeley,
had designed a switched capacitor circuit for generating random signals whose trajectory
is uniformly distributed over the unit interval [0, 1]. Suppose that we are interested
in calculating the probability that the trajectory falls in the interval [0.35, 0.58].
Though the answer is straightforward, we will obtain it using the punif function:
> punif(0.58)-punif(0.35)
[1] 0.23
Exponential distribution
The exponenal distribuon is probably one of the most important probability distribuons in
Stascs, and more so for Computer Sciensts. The numbers of arrivals in a queuing system,
the me between two incoming calls on a mobile, the lifeme of a laptop, and so on, are
some of the important applicaons where this distribuon has a lasng ulity value.
The pdf of an exponenal RV is specied by
()
;,0, 0
x
fx ex
λ
λλ λ
=≥>
.
The parameter
λ
is somemes referred to as the failure rate. The exponenal RV enjoys
a special property called the memory-less property which conveys that :
 
IRUD_ OO  !t t t3; WV;V 3; W W V
This mathemacal statement states that if X is an exponenal RV, then its failure in the future
depends on the present, and the past (age) of the RV does not maer. In simple words this
means that the probability of failure is constant in me and does not depend on the age of
the system. Let us obtain the plots of a few exponenal distribuons.
> par(mfrow=c(1,2))
> curve(dexp(x,1),0,10,ylab="f(x)",xlab="x",cex.axis=1.25)
> curve(dexp(x,0.2),add=TRUE,col=2)
> curve(dexp(x,0.5),add=TRUE,col=3)
> curve(dexp(x,0.7),add=TRUE,col=4)
> curve(dexp(x,0.85),add=TRUE,col=5)
> legend(6,1,paste("Rate = ",c(1,0.2,0.5,0.7,0.85)),col=1:5,pch=
+ "___")
> curve(dexp(x,50),0,0.5,ylab="f(x)",xlab="x")
> curve(dexp(x,10),add=TRUE,col=2)
> curve(dexp(x,20),add=TRUE,col=3)
> curve(dexp(x,30),add=TRUE,col=4)
> curve(dexp(x,40),add=TRUE,col=5)
> legend(0.3,50,paste("Rate = ",c(50,10,20,30,40)),col=1:5,pch=
+ "___")
Figure 7: The exponential densities
The mean and variance of the exponential distribution are as follows:

E(X) = \frac{1}{\lambda} \quad \text{and} \quad Var(X) = \frac{1}{\lambda^{2}}
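A small sketch verifying the memoryless property numerically with pexp; the rate 0.5 and the times s = 2 and t = 3 are illustrative choices:
> lambda <- 0.5; s <- 2; t <- 3
> # P(X > t + s | X > s) as a ratio of survival probabilities
> (1 - pexp(t + s, rate = lambda))/(1 - pexp(s, rate = lambda))
[1] 0.2231302
> 1 - pexp(t, rate = lambda)   # P(X > t), the same value
[1] 0.2231302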
Normal distribution
The normal distribuon is in some sense an all-pervasive distribuon that arises sooner or
later in almost any stascal discussion. In fact it is very likely that the reader may already be
familiar with certain aspects of the normal distribuon, for example, the shape of a normal
distribuon curve is bell-shaped. The mathemacal appropriateness is probably reected
through the reason that though it has a simpler expression, and its density funcon includes
the three most famous irraonal numbers
 
   d
d!
I[DE D[EE D
ED
.
Data Characteriscs
[ 30 ]
Suppose that X is normally distributed with mean μ and variance σ². Then the probability
density function of the normal RV is given by:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right\}, \quad -\infty < x < \infty, \; -\infty < \mu < \infty, \; \sigma > 0

If the mean is zero and the variance is one, the normal RV is referred to as the standard
normal RV, and the convention is to denote it by Z.
Example 1.4.2. Shady Normal Curves: We will again consider a standard normal random
variable, which is more popularly denoted in Statistics by Z. Some of the most needed
probabilities are P(Z > 0), P(-1.96 < Z < 1.96), and P(-2.58 < Z < 2.58). These probabilities
are now shaded.
> par(mfrow=c(3,1))
> # Probability Z Greater than 0
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(0,4,0.02)
> lines(z,dnorm(z),type="h",col="grey")
> # 95% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(-1.96,1.96,0.001)
> lines(z,dnorm(z),type="h",col="grey")
> # 99% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(-2.58,2.58,0.001)
> lines(z,dnorm(z),type="h",col="grey")
Figure 8: Shady normal curves
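The shaded areas in the preceding figure can be verified directly with the pnorm function; a quick sketch:
> 1 - pnorm(0)                 # P(Z > 0)
[1] 0.5
> pnorm(1.96) - pnorm(-1.96)   # approximately 95% coverage
[1] 0.9500042
> pnorm(2.58) - pnorm(-2.58)   # approximately 99% coverage
[1] 0.99012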
Summary
You should now be clear about the distinct nature of variables that arise in different scenarios.
In R, you should be able to verify that the data is in the correct format. Further, the important
families of random variables were introduced in this chapter, which should help you in dealing
with them when they crop up in your experiments. Computation of simple probabilities was
also introduced and explained.
In the next chapter you will learn how to perform basic R computations, create data
objects, and so on. As data can seldom be constructed completely in R, we need to import
data from external files. The methods explained help you to import data in file formats
such as .csv and .xls. Similar to importing, it is also important to be able to export data/
output to other software. Finally, R session management will conclude the next chapter.
Import/Export Data
The main goals of this chapter are to familiarize you with the various classes of
objects in R, help the reader extract data from various popular formats, connect
R with popular databases such as MySQL, and finally cover the best export options
for the R output. The main purpose is that the practitioner frequently has data
available in a fixed format, and sometimes the dataset is available in popular
database systems.
This chapter helps you to extract data from various sources, and then also recommends
the best export options for the R output. We will, though, begin with a better understanding of
the various formats in which R stores data. Updated information about the import/export
options is maintained at http://cran.r-project.org/doc/manuals/R-data.html.
To summarize, the main learning from this chapter would be the following:
Basic and essential computations in R
Importing data from CSV, XLS, and a few more formats
Exporting data for other software
R session management
data.frame and other formats
Any soware comes with its structure and nuances. The Quesonnaire and its component
secon of Chapter 1, Data Characteriscs, introduced various facets of data. In the current
secon we will go into the details of how R works with data of dierent characteriscs.
Depending on the need we have dierent formats of the data. In this secon, we will begin
with simpler objects and move up the ladder towards some of the more complex ones.
Constants, vectors, and matrices
R has ve inbuilt objects which store certain constant values. The ve objects are LETTERS,
letters, month.abb, month.name, and pi. The rst two objects contain the leers A-Z
in upper and lower cases. The third and fourth objects have month's abbreviated form and
the complete month names. Finally, the object pi contains the value of the famous irraonal
number. So, the exercise here is for you to nd the value of the irraonal number e. The
details about these R constant objects may be obtained using the funcon ?Constants
or example(Constants), of course by execung these commands in the console.
There is also another class of constants in R which is very useful. These constants are called
NumericConstants and include Inf for innite numbers, NaN for not a number, and so on.
You are encouraged to nd more details and other useful constants. R can handle numerical,
character, logical, integer, and complex kind of vectors and it is the class of the object which
characterizes the vector. Typically, we deal with vectors which may be numeric, characters
and so on. A vector of the desired class and number of elements may be iniated using
the vector funcon. The length argument declares the size of the vector, which is the
number of elements for the vector, whereas mode characterizes the vector to take one of the
required classes. The elements of a vector can be assigned names if required. The R names
funcon comes handy for this purpose.
Arithmec on numeric vector objects can be performed in an easier way. The operators
(+, -, *, /, and ^) are respecvely used for (addion, subtracon, mulplicaon, division,
and power). The characteriscs of a vector object may be obtained using funcons such as
sum, prod, min, max, and so on. Accuracy of a vector up to certain decimals may be xed
using opons in digits, round, and so on.
Now, two vectors need not have the same number of elements and we may carry the
arithmec operaon between them, say addion. In a mathemacal sense two vectors of
unequal length cannot be added. However, R goes ahead and performs the operaons just
the same. Thus, there is a necessity to understand how operaons are carried out in such
cases. To begin with the simpler case, let us consider two vectors with an equal number of
elements. Suppose that we have a vector x = (1, 2, 3, …, 9, 10), and y = (11, 12, 13, …, 19, 20).
If we add these two vectors, x + y, the result is an element-wise addion of the respecve
elements in x and y, that is, we will get a new vector with elements 12, 14, 16, …, 28, 30.
Now, let us increase the number of elements of y from 10 to 12 with y = (11, 12, 13, …, 19,
20, 21, 22). The operaon is carried out in the order that the elements of x (the smaller
object one) are element-wise added to the rst ten elements of y. Now, R nds that there
are two more elements of y in 11 and 12 which have not been touched as of now. It now
picks the rst two elements of x in 1 and 2 and adds them to 11 and 12. Hence, the 11 and
12 elements of the output are 11+1 =12 and 12 + 2 = 14. The warning says that longer
object length is not a multiple of shorter object length, which has now
been explained.
Let us have a brief peek at a few more operators related to vectors. The operator %% on
two objects, say x and y, returns the remainder following integer division, and the operator
%/% returns the integer quotient.
Time for action – understanding constants, vectors, and basic
arithmetic
We will look at a few important and interesting examples. You will understand the structure of
vectors in R and will also be able to perform the basic arithmetic related to this requirement.
1. Key in LETTERS at the R console and hit the Enter key.
2. Key in letters at the R console and hit the Enter key.
3. To obtain the rst ve and the last ve alphabets, try the following code:
LETTERS[c(1:5,22:26)] and letters[c(1:5,22:26)].
4. Month names and their abbreviaons are available in the base package and explore
them using ?Constants at the console.
5. Selected month names and their abbreviaons can be obtained using month.
abb[c(1:3,8:10)] and month.name[c(1:3,8:10)]. Also, the value of pi in R
can found by entering pi at the console.
6. To generate a vector of length 4, without specifying the class, try
vector(length=4). In specic classes, vector objects can be generated
by declaring the "mode" values for a vector object. That is, a numeric vector
(with default values 0) is obtained by the code vector(mode = "numeric",
length=4). You can similarly generate logical, complex, character, and integer
vectors by specifying them as opons in the mode argument.
The next screenshot shows the results as you run the preceding codes in R.
7. Creang new vector objects and name assignment: A generated vector can be
assigned to new objects using either =, <-, or ->. The last two assignments are in
the order from the generated vector of the tail end to the new variables at the end
of the arrow.
1. First assign the integer vector 1:10 to x by using x <- 1:10.
2. Check the names of x by using names(x).
3. Assign first 10 letters of the alphabets as names for elements of x by using
names(x)<- letters[1:10], and verify that the assignment is done
using names(x).
4. Finally, display the numeric vector x by simply keying in x at the console.
8. Basic arithmec: Create new R objects by entering x<-1:10; y<-11:20; a<-
10; b<--4; c<-0.5 at the console. In a certain sense, x and y are vectors while
a, b, and c are constants.
1. Perform simple addition of numeric vectors with x + y.
2. Scalar multiplication of vectors and then summing the resulting vectors is
easily done by using a*x + b*y.
3. Verify the result (a + b)x = ax + bx by checking that the result of ((a+b)*x
== a*x + b*x results is a logical vector of length 10, each having TRUE
value.
4. Vector multiplication is carried by x*y.
5. Vector division in R is simply element-wise division of the two vectors, and
does not have an interpretation in mathematics. We obtain the accuracy up
to 4 digits using round(x/y,4).
6. Finally, (element-wise) exponentiation of vectors is carried out through x^2.
9. Adding two unequal length vectors: The arithmec explained before applies to
unequal length vectors in a slightly dierent way. Run the following operaons:
x=1:10; x+c(1:12), length(x+c(1:12)), c(1:3)^c(1,2), and
(9:11)-c(4,6).
10. The integer divisor and remainder following integer division may be obtained
respecvely using %/% and %% operators. Key in -3:3 %% 2, -3:3 %% 3, and -3:3
%% c(2,3) to nd remainders between the sequence -3, -2, …, 2, 3 and 2, 3, and
c(2,3). Replace the operator %% by %/% to nd the integer divisors.
Now, we will rst give the required R codes so you can execute them in the soware:
LETTERS
letters
LETTERS[c(1:5,22:26)]
letters[c(1:5,22:26)]
?Constants
Chapter 2
[ 37 ]
month.abb[c(1:3,8:10)]
month.name[c(1:3,8:10)]
pi
vector(length=4)
vector(mode="numeric",length=4)
vector(mode="logical",length=4)
vector(mode="complex",length=4)
vector(mode="character",length=4)
vector(mode="integer",length=4)
x=1:10
names(x)
names(x)<- letters[1:10]
names(x)
x
x=1:10
y=11:20
a=10
b=-4
x + y
a*x + b*y
sum((a+b)*x == a*x + b*x)
x*y
round(x/y,4)
x^2
x=1:10
x+c(1:12)
length(x+c(1:12))
c(1:3)^c(1,2)
(9:11)-c(4,6)
-3:3 %% 2
-3:3 %% 3
-3:3 %% c(2,3)
-3:3 %/% 2
-3:3 %/% 3
-3:3 %/% c(2,3)
Execute the preceding code in your R session.
What just happened?
We have split the output into multiple screenshots for ease of explanation.
Introducing constants and vectors functioning in R
LETTERS is a character vector available in R that consists of the 26 uppercase letters of the
English language, whereas letters contains the alphabet in lowercase letters. We have used
the integer vector c(1:5,22:26) in the index to extract the first and last five elements of
both the character vectors. When the ?Constants command is executed, R pops out an
HTML file in your default internet browser and opens a page with the link http://my_IP/
library/base/html/Constants.html. You can find more details about Constants
from the base package on this web page. Months, as in January-December, are available
in the character vector month.name, whereas the popular abbreviated forms of the months
are available in the character vector month.abb. Finally, the numeric object pi contains
the value of π in double precision, although it is printed to only about seven significant
digits by default.
Next, we consider, generaon of various types of vector using the R vector funcon.
Now, the code vector(mode="numeric",length=4) creates a numeric vector with
default values of 0 and required length of four. Similarly, the other vectors are created.
Vector arithmetic in R
An integer vector object is created by the code x = 1:10. We could have alternatively
used options such as x <- 1:10 or 1:10 -> x. The final result is of course the same.
The assignment operator <- is far more popular in the R community and it can be used
during any part of R programming; find more details by running ?assignOps at the R
console. By default, there won't be any names assigned for either vectors or matrices.
Thus, the output NULL. names is a function in R which is useful for assigning appropriate
names. Our task is to assign the first 10 lowercase letters of the alphabet to the vector x.
Hence, we have the code names(x) <- letters[1:10]. We verify that the names have
been properly assigned, and see the change in the display of x following the assignment of
the names, using names(x) and x.
Next, we create two integer vectors in x and y, and two objects a and b, which may be treated
as scalars. Now, x + y; a*x + b*y; sum((a+b)*x == a*x + b*x) performs three
different tasks. First, it performs addition of vectors and returns the result of element-wise
addition of the two vectors leading to the answer 12, 14, …, 28, 30. Second, we are verifying
the result of scalar multiplication of vectors, and third, the result of (a + b)x = ax + bx.
In the next round of R codes, we ran x*y; round(x/y,4); x^2. Similar to the addition
operator, the * operator performs element-wise multiplication for the two vectors. Thus,
we get the output as 11, 24, …, 171, 200. In the next line, recall that ; separates commands
so that they are executed as if entered on separate lines; first the element-wise division is
carried out. For the resulting vector (a numeric one), the round function gives the accuracy
up to four digits as specified. Finally, x^2 gives us the square of each element of x. Here, 2
can be replaced by any other real number.
In the last line of code, we repeat some of the earlier operations with the minor difference
that the two vectors are not of the same length. As predicted earlier, R issues a warning
that the length of the longer vector is not a multiple of the length of the shorter vector.
Thus, for the operation x+c(1:12), first all the elements of x (which is the shorter length
vector here) are added to the first 10 elements of 1:12. Then the last two elements of
1:12, namely 11 and 12, need to be added to elements from x, and for this purpose R picks
the first two elements of x. If the length of the longer vector is a multiple of that of the
shorter one, the entire shorter vector is recycled over the longer one in cycles. The remaining
results, obtained as a consequence of running c(1:3)^c(1,2); (9:11)-c(4,6), are left
to the reader for interpretation.
Let us look at the output after the R code for the integer quotient and the remainder between
two objects is carried out.
Integer divisor and remainder operations
In the segment -3:3 %% 2, we first create the sequence -3, -2, …, 2, 3 and then ask for the
remainder when each element is divided by 2. Clearly, the remainder for any integer divided
by 2 is either 0 or 1, and for a sequence of consecutive integers we expect an alternating
sequence of 0s and 1s, which is the output in this case. Check the expected result for
-3:3 %% 3. Now, for the operation -3:3 %% c(2,3), first look at the complete sequence
-3:3 as -3, -2, -1, 0, 1, 2, 3. Here, the elements -3, -1, 1, 3 are divided by 2 and the remainders
are returned, whereas -2, 0, 2 are divided by 3 and the remainders are returned.
The operator %/% returns the integer quotient and the interpretation of the results is left to
the reader. Please refer to the previous screenshot for the results.
We now look at matrix objects. Similar to the vector function in R, we have matrix as a
function that creates matrix objects. A matrix is an array of numbers with a certain number
of rows and columns. By default, the elements of a matrix are generated as NA, that is, not
available. Let r be the number of rows and c the number of columns. The order of the matrix
is then r x c. A vector object of length rc in R can be converted into a matrix by the code
matrix(vector, nrow=r, ncol=c, byrow=TRUE). The rows and columns of a matrix
may be assigned names using the dimnames option in the matrix function.
The mathematics of matrices in R is preserved in relation to matrix arithmetic. Suppose
we have two matrices A and B with respective dimensions m x n and n x o. The matrix product
A x B is then a matrix of order m x o, which is obtained in R by the operation A %*% B. We
are also interested in the determinant of a square matrix, one with the number of rows equal
to the number of columns, and this is obtained in R using the det function on the matrix, say
det(A). Finally, we also more often than not require the computation of the inverse of a
square matrix. The first temptation is to obtain it using A^(-1). This will give a
wrong answer, as it leads to the element-wise reciprocal and not the inverse of the matrix.
The solve function in R, if executed on a square matrix, gives the inverse of the matrix. Fine!
Let us now do these operations using R.
Time for action – matrix computations
We will see the basic matrix computations in the forthcoming steps. Matrix computations
such as the product of matrices, the transpose, and the inverse will be illustrated.
1. Generate a 2 x 2 matrix with default values using matrix(nrow=2, ncol=2).
2. Create a matrix from the 1:4 vector by running matrix(1:4,nrow=2,ncol=2,
byrow=TRUE).
3. Assign row and column names for the preceding matrix by using the option
dimnames, that is, by running A <- matrix(data=1:4, nrow=2, ncol=2,
byrow=TRUE, dimnames = list(c("R_1", "R_2"),c("C_1", "C_2")))
at the R console.
4. Find the properties of the preceding matrices by using the commands nrow, ncol,
dimnames, and a few more, with dim(A); nrow(A); ncol(A); dimnames(A).
5. Create two matrices X and Y of order 3 x 4 and 4 x 3, and obtain their
matrix product with the code X <- matrix(c(1:12),nrow=3,ncol=4); Y <-
matrix(13:24, nrow=4) and X %*% Y.
6. The transpose of a matrix is obtained using the t function, for example, t(Y).
7. Create a new matrix A <- matrix(data=c(13,24,34,23,67,32,
45,23,11), nrow=3) and find its determinant and inverse by using det(A)
and solve(A) respectively.
The R code for the preceding action list is given in the following code snippet:
matrix(nrow=2,ncol=2)
matrix(1:4,nrow=2,ncol=2, byrow=TRUE)
A <- matrix(data=1:4, nrow=2, ncol=2, byrow=TRUE, dimnames =
list(c("R_1", "R_2"),c("C_1", "C_2")))
dim(A); nrow(A); ncol(A); dimnames(A)
X <- matrix(c(1:12),nrow=3,ncol=4)
Y <- matrix(13:24, nrow=4)
X %*% Y
t(Y)
A <- matrix(data=c(13,24,34,23,67,32,45,23,11),nrow=3)
det(A)
solve(A)
Note the use of a semicolon (;) in line 5 of the preceding code. The result of this usage
is that the code separated by a semicolon is executed as if it was entered on a new line.
Execute the preceding code in your R console. The output of the R code is given in the
following screenshot:
Matrix computations in R
What just happened?
You were able to create matrices in R and learned the basic operations. Remember
that solve, and not ^(-1), gives you the inverse of a matrix. It is now seen that matrix
computations in R are really easy to carry out.
The options nrow and ncol are used to specify the dimensions of a matrix. Data for
a matrix can be specified through the data argument. The first two lines of code in the
previous screenshot create a bare-bones matrix. Using the dimnames argument, we have
created a more elegant matrix and assigned it to a matrix object named A.
We next focus on the list object. It has already been used earlier to specify the dimnames
of a matrix.
The list object
In the preceding subsecon we saw dierent kinds of objects such as constants, vectors,
and matrices. Somemes it is required that we pool them together in a single object. The
framework for this task is provided by the list object. From the online source http://
cran.r-project.org/doc/manuals/R-intro.html#Lists-and-data-frames,
we dene a list as "an object consisng of an ordered collecon of objects known as its
components." Basically, various types of objects can be brought under a single umbrella
using the list funcon. Let us create list which contains a character vector, an integer
vector, and a matrix.
Time for action – creating a list object
Here, we will have a rst look at the creaon of list objects, which can contain in them
objects of dierent classes:
1. Create a character vector containing the rst six capital leers with A <-
LETTERS[1:6]. Create an integer vector of the rst ten integers 1-10 with B <-
1:10, and a matrix with C <- matrix(1:6,nrow=2).
2. Create a list which has the three objects created in the previous steps as its
components with Z <- list(A = A, B = B, C = C).
3. Ensure that the class of Z and its three components in A, B, and C are indeed
retained as follows: class(Z); class(Z$A); class(Z$B); class(Z$C).
The consolidated R codes are given next, which you will have to enter at the R console:
A <- LETTERS[1:6]; B <- 1:10; C <- matrix(1:6,nrow=2)
Z <- list(A = A, B = B, C = C)
Z
class(Z); class(Z$A); class(Z$B); class(Z$C)
Creating and understanding a list object
What just happened?
Dierent classes of objects can be easily brought under a single umbrella and their structures
are also preserved within the newly created list object. Especially, here we put a character
vector, an integer vector, and a matrix under a single list object. Next, we check for the
class of the Z object and nd the answer to be list as it should be. A new extracon tool
has been introduced in the dollar symbol $, which needs an explanaon. Elements/objects
from a list vector can be extracted using the $ opon on similar lines of the [ and [[
extracng tools. In our example, Z$A extracts the A object from the Z list, and we use the
class funcon wrapper on Z$A to nd its class. It is then conrmed that the classes of A, B,
and C are preserved under the list object. More details about the extracon tools may be
obtained by running ?Extract at the R console.
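A brief sketch of the three extraction tools on the Z list created above; the practical difference is that [ returns a sub-list, while [[ and $ return the component itself:
> Z$A
[1] "A" "B" "C" "D" "E" "F"
> Z[["A"]]
[1] "A" "B" "C" "D" "E" "F"
> class(Z["A"]); class(Z[["A"]])
[1] "list"
[1] "character"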
Yes, you have successfully created your first list object. This utility is particularly useful
when building big programs where we need to keep the results of several actions within a
single object.
The data.frame object
In Figure 2 of Chapter 1, Data Characteristics, we saw that when the class function is
applied to the SQ object, the output is data.frame. The details about this
function can be obtained by executing ?data.frame at the R console. The first noticeable
aspect is data.frame {base}, which means that this function is in the base library.
Further, the description says: "This function creates data frames, tightly-coupled collections
of variables, which share many of the properties of matrices and of lists, used as the
fundamental data structure by most of R's modeling software." This description is seen to
be correct, as in the same figure we have different numeric, character, and factor variables
contained in the same data.frame object. Thus, we know that a data.frame object can
contain different kinds of variables.
A data frame can contain different types of objects. That is, we can create two different
classes of vectors and bind them together in a single data frame. A data frame can also be
updated with new vectors, and existing components can also be dropped from it. As with
vectors and matrices, we can assign names to a data frame as is convenient for us.
Time for action – creating a data.frame object
Here, we create a data.frame object from vectors. New objects are then added to an existing
data frame and some preliminary manipulations are demonstrated.
1. Create a numeric and a character vector of length 3 each with x <- c(2,3,4); y
<- LETTERS[1:3].
2. Create a new data frame with df1<-data.frame(x,y).
3. Verify the variable names of the data frame, the classes of the components,
and display the variables distinctly with variable.names(df1); sapply
(df1, class); df1$x; df1$y.
4. Add a new numeric vector to df1 with df1$z <- c(pi,sqrt(2), 2.71828)
and verify the changes in df1 by entering df1 at the console.
5. Nullify the x component of df1 and verify the change.
6. Bring back the original x values with df1$x <- x.
7. Add a fourth observation with df1[4,]<- list(y=LETTERS[2], z=3,x=5)
and then remove the second observation with df1 <- df1[-2,] and verify the
change again.
8. Find the row names (or the observation names) of the data frame object by using
row.names(df1).
9. Obtain the column names (which should actually be x, y, and z) with
colnames(df1). Change the row and column names using row.names(df1)<-
1:3; colnames(df1)<-LETTERS[1:3] and display the final form of the data frame.
Following is the consolidated code that you have to enter in the R console:
# The data.frame Object
x <- c(2,3,4); y <- LETTERS[1:3]
df1<-data.frame(x,y)
variable.names(df1)
sapply(df1,class)
df1$x
df1$y
df1$z <- c(pi,sqrt(2), 2.71828)
df1
df1$x <- NULL
df1
df1$x <- x
df1[4,]<- list(y=LETTERS[2],z=3,x=5)
df1 <- df1[-2,]
df1
row.names(df1)
dim(df1)
colnames(df1)
row.names(df1)<- 1:3
colnames(df1)<-LETTERS[1:3]
df1
On running the preceding code in R, you will see the output as shown in the
following screenshot:
Understanding a data.frame object
Let us now look at a larger data.frame object. iris is a very famous dataset and we will
use it to check out some very useful tools for data display.
1. Load the iris data from the datasets package with data(iris).
2. Check the first 10 observations of the dataset with head(iris,10).
3. A compact display of a data.frame object is obtained with the str function
in the following way: str(iris).
4. Using the $ extractor tool, inspect the different Species in the iris data in the
following way: iris$Species.
5. We are asked to get the first 10 observations with the Sepal.Length and
Petal.Length variables only. Now, we use the [ extractor in the following way:
iris[1:10,c("Sepal.Length","Petal.Length")].
Different ways of extracting objects from a data.frame object
What just happened?
A data frame may be a complex structure. Here, we first created two vectors of the same
length with different structures, one being a numeric and the other a character vector. Using
the data.frame function we created a new object df1, which contains both the vectors.
The variable names of df1 are then verified with the variable.names function. After
verifying that the names are indeed as expected, we verify that the variable classes are
preserved with the application of two functions: sapply and class. lapply is a useful
utility in R which applies a function over a list or vector, and sapply is a more user-friendly
version of lapply that simplifies the result where possible. In our particular example, we
need R to return the classes of the variables of the df1 data frame.
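A quick sketch of the difference between the two functions, using the df1 object created above:
> lapply(df1, class)   # a list of classes, one component per variable
> sapply(df1, class)   # the same information simplified to a named character vector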
Have a go hero
As an exercise, explain to yourself the rest of the R code that you have executed here.
We have thus seen how to create a data frame, add and remove components and observations,
change the component names, and so on.
The table object
Data displayed in a table format is easy to understand. We will begin with the famous
Titanic dataset, as it is very unlikely that you will not have heard about it. That the gigantic
ship sinks at the end, and that there are many beautiful movies, novels, documentaries,
and more about it, make this dataset a very interesting example. It is known that the ship had
some survivors post its unfortunate and premature end. The ship had children, women,
and different classes of passengers onboard. This dataset is shipped (again) in the datasets
package along with the software. The dataset relates to the passengers' survival post
the tragedy.
The Titanic dataset has four variables: Class, Sex, Age, and Survived. For each
combination of the values of the four variables we have a count for that combination. The
Class variable is specified at the four levels 1st, 2nd, 3rd, and Crew. The gender is
specified for the passengers, and the age classification is Child or Adult. It is also known
through the Survived variable whether the onboard passengers survived the collision of the
ship with the iceberg. Thus, we have 4 x 2 x 2 x 2 = 32 different combinations of the Age,
Sex, Class, and Survived statuses.
The following screenshot gives a display of the dataset in two formats. On the right-hand
side we can see the dataset in a spreadsheet style, while the left-hand side displays the
frequencies according to the combinatorial groups. The question is how do we create table
displays such as the one on the left-hand side of the screenshot. The present section addresses
this aspect of table creation.
Two different views of the Titanic dataset
The left-hand side display of the screenshot is obtained by simply keying in Titanic at the R
console, and the data format on the right-hand side is obtained by running View(Titanic)
at the console. In general, we have our dataset available as on the right-hand side. Hence,
we will pretend that we have the dataset available in the latter format.
Time for action – creating the Titanic dataset as a table object
The goal is to create a table object from the raw dataset. We will be using the expand.grid
and xtabs functions towards this end.
1. First, create four character vectors for the four types of variables:
Class.Level <- c("1st","2nd","3rd", "Crew")
Sex.Level <- c("Male", "Female")
Age.Level <- c("Child", "Adult")
Survived.Level <- c("No", "Yes")
2. Create a list object which takes into account the variable names and their possible
levels with Data.Level <- list(Class = Class.Level, Sex = Sex.
Level, Age = Age.Level, Survived = Survived.Level).
3. Now, create a data.frame object for the levels of the four variables using
the expand.grid function by entering T.Table <- expand.grid(Class
= Class.Level, Sex = Sex.Level, Age = Age.Level, Survived
= Survived.Level) at the console. It is advisable to view the T.Table and
appreciate the changes that are occurring in this step.
4. The Titanic dataset is ready except for the frequency count at each combinatorial
level. Specify the counts with T.freq <- c(0,0,35,0,0,0,17,0, 118,
154,387,670,4,13,89,3, 5,11, 13,0,1,13,14,0,57,14,75,192,140,
80,76,20).
5. Augment T.Table with T.freq by using T.Table <- cbind(T.Table,
count=T.freq). Again, if you view the T.Table, you will find the display
on the right-hand side of the previous screenshot.
6. To obtain the display on the left-hand side, enter xtabs(count~ Class + Sex
+ Age + Survived, data = T.Table).
The complete R code is given next, which needs to be executed in the software:
Class.Level <- c("1st","2nd","3rd", "Crew")
Sex.Level <- c("Male", "Female")
Age.Level <- c("Child", "Adult")
Survived.Level <- c("No", "Yes")
Data.Level <- list(Class = Class.Level, Sex = Sex.Level,
Age = Age.Level, Survived = Survived.Level)
T.Table <- expand.grid(Class = Class.Level, Sex =
Sex.Level, Age = Age.Level, Survived = Survived.Level)
T.freq = c(0,0,35,0,0,0,17,0,118,154,387,670,4,13,89,3,
5,11, 13,0,1,13,14,0,57,14,75,192,140,80, 76,20)
T.Table = cbind(T.Table, count=T.freq)
xtabs(count~ Class + Sex + Age + Survived, data = T.Table)
What just happened?
In pracce we may oen have data in frequency format. It will be seen in later chapters
that the table object is required for carrying out stascal analysis. To translate frequency
formated data into a table object, we rst dened four variables through Class.Level,
Sex.Level, Age.Level, and Survived.Level. The levels for the required table object
have been specied through the list object Data.Level. The funcon expand.grid
creates all possible combinaons of the factors of four variables. The table of all possible
combinaons is then stored in the T.Table object. Next, the frequencies are assigned
through the T.freq integer vector. Finally, the xtabs funcon creates the count according
to the various levels of the variables and the result is a table object, which is the same
as Titanic!
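The reverse direction is a one-liner as well; a small sketch showing that a table object can be flattened back into the frequency format with as.data.frame, and re-tabulated with xtabs:
> T.df <- as.data.frame(Titanic)   # columns Class, Sex, Age, Survived, Freq
> head(T.df, 3)                    # first three rows of the frequency format
> xtabs(Freq ~ Class + Sex + Age + Survived, data = T.df)   # back to the table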
Have a go hero
UCBAdmissions is one of the benchmark datasets in Statistics. It is available in the
datasets package and it has data on the admission counts of six departments. The
admissions data appears to show a bias towards admitting male candidates over
females, and it led to an allegation against the University of California, Berkeley. The details
of this problem may be found on the web at http://www.unc.edu/~nielsen/soci708/
cdocs/Berkeley_admissions_bias.pdf. Information about the dataset is obtained
with ?UCBAdmissions. Identify all the variables and their classes and regenerate the entire
table from the raw codes.
read.csv, read.xls, and the foreign package
Data is generally available in an external file. The types of external files are certainly varied
and it is important to learn which of them may be imported into R. The probable spreadsheet
files may exist in a CSV (comma-separated values) format, an XLS or XLSX (Microsoft Excel)
format, or ODS (OpenOffice/LibreOffice Calc). There are more possible formats and we restrict
our attention to those described earlier. A snapshot of two files, Employ.dat and SCV.csv,
in gedit and MS Excel is given in the following screenshot. The brief characteristics of the two
files are summarized in the following list:
The first row lists the names of the variables of the dataset
Each observation begins on a new line
In the DAT file, the delimiter is a tab (\t) whereas for the CSV file it is a comma (,)
All three columns of the DAT file are numeric in nature
The first five columns of the CSV file are numeric while the last column is a character
Overall, both files have a well-defined structure going for them
The following screenshot underlines the theme that when the external files have
a well-defined structure, it is vital that we make the most of that structure when importing
them into R.
Screenshot of the two spreadsheet files
The core funcon for imporng les in R is the read.table funcon from the utils
package, shipped with R core. The rst argument of this funcon is the lename; see the
following screenshot. We can use header=TRUE to specify that the header names are the
variable names of the columns. The separator opon sep needs to be properly specied.
For example, for the Employ dataset, it is a tab \t whereas for the CSV le, it is a comma ,.
Frequently, each row may also have a name. For example, the customer name in a survey
dataset, or serial number, and so on. This can be specied through row.names. The row
names may or may not be present in the external le. That is, either the row names or the
column names need not be part of the le from which we are imporng the data.
The read.table syntax
In many les, there may be missing observaons. Such data can be appropriately imported
by specifying the missing values in na.strings. The missing values may be represented by
blank cells, a period, and so on. You may nd more details about the other opons in the
read.table funcon. We note that read.csv, read.delim, and so on are other variants
of the read.table funcon. An Excel le of the type XLS or XLSX may be imported into R
with the use of the read.xls funcon from the gdata package.
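As a hedged sketch of these options (the file name mydata.txt, the tab separator, and the missing-value codes are hypothetical; adapt them to your own file):
> # tab-delimited file with a header row; missing values coded as "." or blank
> mydata <- read.table("mydata.txt", header=TRUE, sep="\t",
+   na.strings=c(".",""), row.names=1)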
Let us begin with imporng simpler data les into R.
Example 2.2.1. Reading from a DAT file: The datasets analyzed in Ryan (2007) are available
on the web at ftp://ftp.wiley.com/public/sci_tech_med/engineering_statistics/.
Download the file engineering_statistics.zip and unzip the contents
to the working directory of the R session. The problem is described in Exercise 1.82 of
Ryan. The monthly data on the number of employees over a period of five years for three
Wisconsin industries in the wholesale and retail trade, food and kindred products, and
fabricated metals is available in the file Employ.dat. The task is to import this dataset into
the R session. Note that the three variables, namely the number of employees in the three
industries, are numeric in their characteristics. These characteristics should be retained in
our session too.
A useful practice is to actually open the source file and check the nature of the data in it.
For example, you should question how you will interpret the number 3.220000000e+002
specified in the original DAT file. In the Time for action – importing data from external files
section that follows, we will use the read.table function to import this data file.
Example 2.2.2. Reading from a CSV file: Ryan (2007) uses a dataset analyzed by Gupta
(1997). In this case study related to antibiotic suspension products, the response variable
is Separated Clear Volume, whose smaller value indicates better quality. The experiment
hosts five variables, each at two different levels, that is, each of the five variables is a factor
variable, and the goal of the experiment is the determination of the best combination of
these factors which yields the minimum value for the response variable.
Now, sometimes the required dataset may be available in various CSV files. In such cases,
we first read them from the various destinations and then combine them to obtain a single
metafile. A trick is the usage of the merge function. Suppose that the preceding dataset is
divided into two datasets, SCV_Usual.csv and SCV_Modified.csv, according to the variable
E. We read them into two separate data objects and then merge them into a single object.
We will carry out the imporng of these les in the next me for acon secon.
Example 2.2.3. Reading les using foreign package: SPSS, SAS, STATA, and so on, are some
of the very popular stascal soware packages. Each of the soware packages has their
own le structure for the datasets. The foreign package, which is shipped along with the R
soware , helps to read datasets used in these soware packages. The rootstock dataset is
a very popular dataset in the area of mulvariate stascs, and it is available on the web at
http://www.stata-press.com/data/r10/rootstock.dta. Essenally, the dataset is
available for the STATA soware. We will now see how R reads this dataset.
Let us set for the acon.
Time for action – importing data from external les
The external les may be imported into R using the right funcons available in it. Here, we
will use read.table, read.csv, and read.sta funcons to drive home the point.
1. Verify that you have the necessary files, Employ.dat, SCV.csv, SCV_Usual.csv,
and SCV_Modified.csv, in the working directory by using list.files().
2. If the files are not available in the list displayed, find your working directory
using getwd() and then copy the files to the working directory. Alternatively,
you can set the working directory to the folder where the files are with setwd
("C:/my_files_are_here").
3. Read the data in Employ.dat with the code employ <- read.table("Employ.
dat", header=TRUE).
4. View the data with View(employ) and ensure that the data file has been properly
imported into R.
5. Check that the class of employ and its variables have been imported in the correct
format with class(employ); sapply(employ,class).
6. Import the Separated Clear Volume data from the SCV.csv file using the code SCV
<- read.csv("SCV.csv",header=TRUE).
7. Run sapply(SCV, class). You will find that variables A-D are of the numeric
class. Convert the class of variable A to factor with either class(SCV$A) <-
'factor' or SCV$A <- as.factor(SCV$A).
8. Repeat the preceding step for variables B-D.
9. The data in the SCV.csv file is split into two files by the E variable values and is
available in SCV_Usual.csv and SCV_Modified.csv. Import the data in these
two files using the appropriate modifications of Step 6 and label the respective R
data frame objects as SCV_Usual and SCV_Modified.
10. Combine the data from the two latest objects with SCV_Combined <- merge(SCV_
Usual,SCV_Modified,by.y=c("Response","A","B","C","D","E"),all.
x=TRUE,all.y=TRUE).
11. Initialize the library package foreign with library(foreign).
12. Tell R where on the web the dataset is available using rootstock.url <-
"http://www.stata-press.com/data/r10/rootstock.dta".
An Internet connection is required to perform this step.
13. Use the read.dta function from the foreign package to import the dataset
from the web into R: rootstock <- read.dta(rootstock.url).
The necessary R codes are given next in a consolidated format:
employ <- read.table("Employ.dat",header=TRUE)
View(employ)
class(employ)
sapply(employ,class)
SCV <- read.csv("SCV.csv",header=TRUE)
sapply(SCV, class)
class(SCV$A) <- 'factor'
class(SCV$B) <- 'factor'
class(SCV$C) <- 'factor'
class(SCV$D) <- 'factor'
SCV_Usual <- read.csv("SCV_Usual.csv",header=TRUE)
SCV_Modified <- read.csv("SCV_Modified.csv",header=TRUE)
SCV_Combined <- merge(SCV_Usual,SCV_Modified,by.y=c("Response",
"A","B","C","D","E"),all.x=TRUE,all.y=TRUE)
SCV_Combined
library(foreign)
rootstock.url <- "http://www.stata-press.com/data/r10/rootstock.dta"
rootstock <- read.dta(rootstock.url)
rootstock
What just happened?
Funcons from the utils package help the R users in imporng data from various
external les. The following screenshot, edited in a graphics tool, shows the result
of running the previous code:
Importing data from external files
The read.table funcon succeeded in imporng the data from the Employ.dat le.
The utils funcon View conrms that the data has been imported with the desired classes.
The funcon read.csv has been used to import data from SCV.csv, SCV_Usual.csv, and
SCV_Modified.csv les. The merge funcon combined the data in the usual and modied
objects and created a new object, which is same as the one obtained using the SCV.csv le.
Next, we used the function read.dta from the foreign package to complete the reading
of a STATA file, which is available on the web.
You learned to import data in many different formats into R. The preceding program shows
how to change the class of variables within the object itself. You also learned how to merge
multiple data objects.
Importing data from MySQL
Data will be oen available in databases (DB) such as SQL, MySQL, and so on. To emphasize
the importance of databases is beyond the scope of this secon, and we will be content with
imporng data from a DB. The right-hand side of the following screenshot shows a snippet
of the test DB in MySQL. This DB has a single table in IO_Time and it has two variables
No_of_IO and CPU_Time. The IO_Time has 10 observaons, and we will be using this
dataset for many concepts later in the book. The goal of this secon is to show how to
import this table in to R.
An R package, RMySQL, is available from CRAN, which can be installed easily by Linux users.
Unfortunately, for Windows users, the package is not available as a readily installable
binary, in the sense that install.packages("RMySQL") won't work for them. The
best help for Windows users is available at http://www.r-bloggers.com/installing-
the-rmysql-package-on-windows-7/, though some of the code there is a bit
outdated. However, the problem is certainly solvable! The program and illustration here
work neatly for Linux users, and the following screenshot is from the Ubuntu 12.04
platform. Though a simple installation of R and MySQL generally does not help in installing
the RMySQL package, running sudo apt-get install libmysqlclient-dev first and
then install.packages("RMySQL") helps! If you still get an error, note that the
downloaded package is saved in the /tmp/RtmpeLu7CG/downloaded_packages folder of
the local machine with the name RMySQL_0.x.x.tar.gz.
You can then move to that directory and execute sudo R CMD INSTALL
RMySQL_0.x.x.tar.gz. We are now set to use the RMySQL package.
Importing data from MySQL
Note that on the Ubuntu 12.04 terminal we begin R with R -q. This suppresses the general
details we get about the R software. First, invoke the library with library(RMySQL). Set
up the DB driver with d <- dbDriver("MySQL"). Specify the DB connection with con <-
dbConnect(d,dbname='test') and then run your query to fetch the IO_Time table from
MySQL with io_data <- dbGetQuery(con,'select * from IO_Time'). Finally,
verify that the data has been properly imported into R by printing io_data. The right-hand side
of the previous screenshot confirms that the data has been correctly imported into R.
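The same steps can be collected into one short script. The following is a minimal sketch assuming a local MySQL server with a database named test that contains the IO_Time table, as described above; the closing dbDisconnect() call is simply good practice and is not part of the original walkthrough.
library(RMySQL)
d <- dbDriver("MySQL")                                 # register the MySQL driver
con <- dbConnect(d, dbname = "test")                   # connect to the local 'test' database
io_data <- dbGetQuery(con, "select * from IO_Time")    # pull the whole table into a data frame
io_data                                                # print it to verify the import
dbDisconnect(con)                                      # close the connection when done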
Exporting data/graphs
In the previous section, we learned how to import data from external files. Now, there will
be many instances where we would be keen to export data from R into suitable foreign files.
The need may arise in an automated system, reporting, and so on, where the other software
needs to make good use of the R output.
Exporting R objects
The basic R function that exports data is write.table, which is not surprising as we saw
the utility of the read.table function. The following screenshot gives a snippet of the
write.table function. While reading, we assign the imported file to an R object, and when
exporting, we first specify the R object and then mention the filename. By default, R writes
row names while exporting the object. If there are no row names, R will simply choose serial
numbers beginning with 1. If you do not need such row names, you need to specify
row.names = FALSE in the program.
Exporting data using the write.table function
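For instance, a minimal sketch of exporting the employ data frame created earlier might look as follows; the output file name Employ_out.dat is only illustrative.
# export the employ object without the automatically generated row names
write.table(employ, file = "Employ_out.dat", row.names = FALSE)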
Example 2.3.1. Exporting the Titanic data: In the Two different views of the Titanic dataset
figure, we saw the Titanic dataset in two formats. It is the display on the right-hand side
of the figure which we would like to export in the .csv format. We will use the write.csv
function for this purpose.
> write.csv(Titanic,"Titanic.csv",row.names=FALSE)
The Titanic.csv file will be exported to the current working directory. The reader can
open the CSV file in either Excel or LibreOffice Calc and confirm that it is of the desired format.
The other write/export options are also available in the foreign package. The write.xport,
write.systat, write.dta, and write.arff functions are useful if the
destination software is any of the following: SAS, SYSTAT, STATA, and Weka.
Exporting graphs
In Chapter 3, Data Visualization, we will be generating a lot of graphs. Here, we will explain
how to save the graphs in a desired format.
In the next screenshot, we have a graph generated by execution of the code plot(sin,
-pi, 2*pi) at the terminal. This line of code generates the sine wave over the interval
[-π, 2π].
Time for action – exporting a graph
Exporng of graph will be explored here:
1. Plot the sin funcon over the range [-π, 2π] by running plot(sin, -pi, 2*pi).
2. A new window pops up with the tle R Graphics Device 2 (ACTIVE).
3. In the menu bar, go to File | Save as | Png.
Saving graphs
4. Save the le as sin_plot.png, or any other name felt appropriate by you.
What just happened?
A le named sin_plot.png would have been created in the directory as specied by
you in the preceding Step 4.
Unix users do not have the luxury of saving the graph in the previously menoned way.
If you are using Unix, you have dierent opons of saving a le. Suppose we wish to save
the le when running R in a Linux environment.
The grDevices library gives different ways of saving a graph. Here, the user can use the
pdf, jpeg, bmp, png, and a few more functions to save the graph. An example is given in the
following code:
> jpeg("My_File.jpeg")
> plot(sin, -pi, 2*pi)
> dev.off()
null device
1
> ?jpeg
> ?pdf
Here, we rst invoke the required device and specify the le name to save the output, the
path directory may be specied as well along with the le name. Then, we plot the funcon
and nally close the device with dev.off. Fortunately, this technique works on both Linux
and Windows plaorms.
Managing an R session
We will close the chapter with a discussion of managing the R session. In many ways, this
section is similar to what we do to a dining table after we have completed the dinner. Now,
there are quite a few aspects to saving the R session. We will first explain how to save the
R code executed during a session.
Time for action – session management
Managing a session is very important. Any well-developed software gives multiple options
for managing a technical session, and we explore some of the methods available in R.
1. You have decided to stop the R session! At this moment, we would like to save all
the R code executed at the console. In the File menu, we have the option Save
History. Basically, it is the action File | Save History…. After selecting the option,
as with the previous section, we can save the history of that R session in a new text file.
Save the history with the filename testhist. Basically, R saves it as an .Rhistory
file which may be easily viewed/modified through any text editor. You may also save
the R history in any appropriate destination directory.
2. Now, you want to reload the history testhist at the beginning of a new R session.
The direction is File | Load History…; choose the testhist file.
3. In an R session, you would have created many objects with different characteristics.
All of them can be saved in an .Rdata file with File | Save Workspace…. In a new
session, this workspace can be loaded with File | Load Workspace….
R session management
4. Another way of saving the R code (history and workspace) is when we close the
session either with File | Exit, or by clicking on the X of the R window; a window will
pop up as displayed in the previous screenshot. If you click on Yes, R will append
the .Rhistory file in the working directory with the code executed in the current
session and also save the workspace.
5. If you want to save only certain objects from the current list, you can use the save
function. As an example, if you wish to save the object x, run save(x,file="x.Rdata").
In a later session, you can reinstate the object x with load("x.Rdata"). A consolidated
sketch of these session-management commands follows the note below.
However, the libraries that were invoked in the previous session are not available again.
They need to be explicitly invoked again using the library() function. Therefore, you
should be careful about this fact.
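The GUI menu actions described above also have scripted equivalents. The following is a minimal sketch, assuming an interactive session and using purely illustrative file names; savehistory() and loadhistory() are not available under every R front end.
savehistory(file = "testhist.Rhistory")   # save the commands typed so far
loadhistory(file = "testhist.Rhistory")   # reload them in a later session
save.image(file = "mysession.RData")      # save the complete workspace
load("mysession.RData")                   # restore the workspace later
save(x, file = "x.Rdata")                 # save a single object x (assuming x exists) ...
load("x.Rdata")                           # ... and reinstate it afterwards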
Saving the R session
What just happened?
The session history is very important, and so are the objects created during a session. As you
get deeper into the subject, you will soon realize that it is not possible to complete all the tasks
in a single session. Hence, it is vital to manage the sessions properly. You learned how to
save the code history, workspace, and so on.
Have a go hero
You have two matrices, A = [1 2 3; 6 5 0] (a 2 x 3 matrix, written row by row) and
B = [2 1; 9 -12; 1 6] (a 3 x 2 matrix, written row by row). Obtain the cross-product AB and
find the inverse of AB. Next, find (B^T A^T) and then the transpose of its inverse. What will
be your observation?
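If you would like to try this at the console, here is a minimal sketch for entering the matrices as reconstructed above and the functions involved; the observation itself is left for you to draw.
A <- matrix(c(1, 2, 3, 6, 5, 0), nrow = 2, byrow = TRUE)    # the 2 x 3 matrix A
B <- matrix(c(2, 1, 9, -12, 1, 6), nrow = 3, byrow = TRUE)  # the 3 x 2 matrix B
AB <- A %*% B            # the product AB (the cross-product referred to above)
solve(AB)                # the inverse of AB
t(B) %*% t(A)            # B'A' -- compare the transpose of its inverse with solve(AB)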
Summary
In this chapter, we learned how to carry out the essential computations. We also learned
how to import data from various foreign formats and then to export R objects and
output suitable for other software. We also saw how to effectively manage an R session.
Now that we know how to create R data objects, the next step is the visualization of such
data. In the spirit of Chapter 1, Data Characteristics, we consider graph generation according
to the nature of the data. Thus, we will see specialized graphs for data related to discrete as
well as continuous random variables. A distinction is also made between graphs required
for univariate and multivariate data. The next chapter should be pleasing on the eyes! Special
emphasis is placed on visualization techniques related to categorical data, which include bar
charts, dot charts, and spine plots. Multivariate data visualization is more than mere 3D plots,
and R methods such as the pairs plot discussed in the next chapter will be useful.
3
Data Visualization
Data is possibly best understood, wherever plausible, if it is displayed in a
reasonable manner. Chen et al. (2008) have compiled articles where many
scientists of data visualization give a deeper, historical, and modern trend of
data display. Data visualization is probably as old as data itself. It
emerges across all the dimensions of science, history, and every stream of life
where data is captured. The reader may especially go through the rich history
of data visualization in the article of Friendly (2008) from Chen et al. (2008).
The aesthetics of visualization have been elegantly described in Tufte (2001).
The current chapter will have a deep impact on the rest of the book, and
moreover this chapter aims to provide the guidance and specialized graphics
in the appropriate context for the rest of the book.
This chapter provides the necessary stimulus for understanding the gist that discrete
and continuous data need appropriate tools, and the validation may be seen through
the distinct characteristics of such plots. Further, this chapter is also closely related
to Chapter 4, Exploratory Analysis, and many visualization techniques here are indeed
"exploratory" in nature. Thus, the current chapter and the next complement each other.
It has been observed that in many preliminary courses/texts, a lot of emphasis is on the types
of plots, say histogram, boxplot, and so on, which are more suitable for data arising from
continuous variables. Thus, we need to make a distinction between the plots for discrete and
continuous variable data, and towards this we first begin with techniques which give more
insight into the visualization of discrete variable data.
In R there are four main frameworks for producing graphics: base graphics, grid, lattice,
and ggplot2. In the current chapter, the first three are used mostly and there is a brief
peek at ggplot2 at the end.
Data Visualizaon
[ 66 ]
This chapter will mainly cover the details of effective data visualization:
Visualization of categorical data using a bar chart, dot chart, spine and mosaic plots,
and the pie chart and its variants
Visualization of continuous data using a boxplot, histogram, scatter plot and its
variants, and the Pareto chart
A very brief introduction to the rich school of ggplot2
Visualization techniques for categorical data
In Chapter 1, Data Characteristics, we came across many variables whose outcomes
are categorical in nature. Gender, Car_Model, Minor_Problems, Major_Problems,
and Satisfaction_Rating are examples of categorical data. In a software product
development cycle, various issues or bugs are raised at different severity levels such as
Minor and Show Stopper. Visualization methods for categorical data require special
attention and techniques, and the goal of this section is to aid the reader with some useful
graphical tools.
In this secon, we will mainly focus on the dataset related to bugs, which are of primary
concern for any soware engineer. The source of the datasets is http://bug.inf.usi.ch/
and the reader is advised to check the website before proceeding further in this secon. We
will begin with the soware system Eclipse JDT Core, and the details for this system may be
found at http://www.eclipse.org/jdt/core/index.php. The les for download are
available at http://bug.inf.usi.ch/download.php.
Bar charts
It is very likely that you are familiar with bar charts, though you may not be aware of
categorical variables. Typically, in a bar chart one draws bars proportional to the
frequency of each category. An illustration will begin with the dataset Severity_Counts
related to the Eclipse JDT Core software system. The reader may also explore the built-in
examples in R.
Going through the built-in examples of R
The bar charts may be obtained using two options. The function barplot, from the
graphics library, is one way of obtaining bar charts. The built-in examples for this
plot function may be reviewed with example(barplot). The second option is to load the
package lattice and then run example(barchart). The sixth plot, after you
click for the sixth time on the prompt, is actually an example of the barchart function.
The main purpose of this example is to help the reader get a flavor of the bar charts that may
be obtained using R. It often happens that we have a specific variant of a plot in mind
and find it difficult to recollect it. Hence, it is a suggestion to explore the variety of bar charts
you can produce using R. Of course, there are a lot more possibilities than the mere samples
given by example().
Example 3.1.2. Bar charts for the Bug Metrics dataset: The software system Eclipse JDT Core
has 997 different class environments related to the development. The bugs identified on each
occasion are classified by severity as Bugs, NonTrivial, Major, Critical, and High.
We need to plot the frequency of the severity levels, and also require the frequencies to be
highlighted by Before and After release of the software to be neatly reflected in the graph.
The required data is available in the RSADBE package in the Severity_Counts object.
Example 3.1.3. Bar charts for the Bug Metrics of the five software systems: In the previous example,
we considered the frequencies only for the JDT software. Now, it would be a tedious
exercise if we needed five different bar plots for the different software systems. The frequency
table for the five software systems is given in the Bug_Metrics_Software dataset of the
RSADBE package.
Software  BA_Ind  Bugs    NonTrivial Bugs  Major Bugs  Critical Bugs  High Priority Bugs
JDT       Before  11,605  10,119           1,135       432            459
          After   374     17               35          10             3
PDE       Before  5,803   4,191            362         100            96
          After   341     14               57          6              0
Equinox   Before  325     1,393            156         71             14
          After   244     3                4           1              0
Lucene    Before  1,714   1,714            0           0              0
          After   97      0                0           0              0
Mylyn     Before  14,577  6,806            592         235            8,804
          After   340     187              18          3              36
It would be nice if we could simply display the frequency table across two graphs only.
This is achieved using the option beside in the barplot function. The data from the
preceding table is copied from an XLS/CSV file, and then we execute the first line of the
following R program in the Time for action – bar charts in R section.
Let us begin the action and visualize the bar charts.
Data Visualizaon
[ 68 ]
Time for action – bar charts in R
Dierent forms of bar charts will be displayed with datasets. The type of bar chart also
depends on the problem (and data) on hand.
1. Enter example(barplot) in the console and hit the Return key.
2. A new window pops up with the heading Click or hit Enter for next page.
Click (and pause between the clicks) your way until it stops changing.
3. Load the lattice package with library(lattice).
4. Try example(barchart) in the console. The sixth plot is an example of the
bar chart.
5. Load the dataset on severity counts for the JDT software from the RSADBE package
with data(Severity_Counts). Also, check for this data.
A view of this object is given in the screenshot in Step 7. We have five severities
of bugs: general bugs (Bugs), non-trivial bugs (NT.Bugs), major bugs (Major.Bugs),
critical bugs (Critical), and high priority bugs (H.Priority). For the JDT software, these
bugs are counted before and after release, and these are marked in the object with
the suffixes BR and AR. We need to understand this count data, and as a first step, we
use bar plots for the purpose.
6. To obtain the bar chart for the severity-wise comparison before and after release
of the JDT software, run the following R code:
barchart(Severity_Counts,xlab="Bug Count",xlim=c(0,12000),
col=rep(c(2,3),5))
The barchart function is available from the lattice package. The range
for the count is specified with xlim=c(0,12000). Here, the argument
col=rep(c(2,3),5) is used to tell R that we need two colors for BR and AR
and that this should be repeated five times for the five severity levels of the bugs.
Figure 1: Bar graph for the Bug Metrics dataset
7. An alternave method is to use the barplot funcon from the graphics package:
barplot(Severity_Counts,xlab="Bug Count",xlim=c(0,12000), horiz
=TRUE,col=rep(c(2,3),5))
Here, we use the argument horiz = TRUE to get a horizontal display of the bar
plot. A word of cauon here that the argument horizontal = TRUE in barchart
of the lattice package works very dierently.
Data Visualizaon
[ 70 ]
We will now focus on Bug_Metrics_Software, which has the bug count data for
all the five software systems: JDT, PDE, Equinox, Lucene, and Mylyn.
Figure 2: View of Severity_Counts and Bug_Metrics_Software
8. Load the dataset related to all the five software systems with data(Bug_Metrics_Software).
9. To obtain the bar plots for before and after release of the software in the same
window, run par(mfrow=c(1,2)).
What is the par function? It is a function frequently used to set the parameters
of a graph. Let us consider a simple example. Recollect that when you tried the
code example(dotchart), R would ask you to Click or hit Enter for next page,
and after the click or Enter action, the next graph would be displayed. However, this
prompt did not turn up when you ran barplot(Severity_Counts,xlab="Bug
Count",xlim=c(0,12000), horiz=TRUE,col=rep(c(2,3),5)). Now,
let us try using par, which will ask us to first click or hit Enter so that we get the bar
plot. First run par(ask=TRUE), and then follow it with the bar plot code. You will
now be asked to either click or hit Enter. Find more details of the par function with
?par. Let us now get to the mfrow argument. The default plot option displays
the output on one device, and for the next plot, the former will be replaced with the
new one. We require the bar plots of the before and after release counts to be displayed
in the same window. The option mfrow = c(1,2) ensures that both the bar plots
are displayed in the same window with one row and two columns.
10. To obtain the bar plot of bug frequencies before release, where each of the software
bug frequencies is placed side by side for each type of bug severity, run
the following:
barplot(Bug_Metrics_Software[,,1],beside=TRUE,col = c("lightblue",
"mistyrose", "lightcyan", "lavender", "cornsilk"),legend = c("JDT",
"PDE","Equinox","Lucene", "Mylyn"))
title(main = "Before Release Bug Frequency", font.main = 4)
Here, the code Bug_Metrics_Software[,,1] ensures that only the before release
counts are considered. The option beside = TRUE ensures that the columns are displayed
as juxtaposed bars; otherwise, the frequencies will be stacked in a single bar
with areas proportional to the frequency of each software. The option col =
c("lightblue", …) assigns the respective colors for the software. Finally, the
title command is used to designate an appropriate title for the bar plot.
11. Similarly, to obtain the bar plot for the after release bug frequencies, run the following:
barplot(Bug_Metrics_Software[,,2],beside=TRUE,col = c("lightblue",
"mistyrose", "lightcyan", "lavender", "cornsilk"),legend = c("JDT",
"PDE","Equinox","Lucene", "Mylyn"))
title(main = "After Release Bug Frequency",font.main = 4)
Data Visualizaon
[ 72 ]
The reader can extend the code interpretation for the before release to the after
release bug frequencies.
Figure 3: Bar plots for the five software
First noce that the scale on the y-axis for before and aer release bug frequencies
is drascally dierent. In fact, before release bug frequencies are in thousands while
aer release are in hundreds. This clearly shows that the engineers have put a lot of
eort to ensure that the released products are with minimum bugs. However, the
comparison of bug counts is not fair since the frequency scales of the bar plots in
the preceding screenshot are enrely dierent. Though we don't expect the results
to be dierent under any case, it is sll appropriate that the frequency scales remain
the same for both before and aer release bar plots. A common suggeson is to plot
the diagrams with the same range on the y-axes (or x-axes), or take an appropriate
transformaon such as a logarithm. In our problem, neither of them will work, and
we resort to another variant of the bar chart from the lattice package.
Now, we will use the formula structure for the barchart funcon and bring the
BR and AR on the same graph.
12. Run the following code in the R console:
barchart(Software~Freq|Bugs,groups=BA_Ind,
data=as.data.frame(Bug_Metrics_Software),col=c(2,3))
The formula Software~Freq|Bugs requires that we obtain the bar chart for the
software count Freq according to the severity of Bugs. We further specify that
each of the bar charts be further grouped according to BA_Ind. This will result in
the following screenshot:
Figure 4: Bar chart for Before and After release bug counts on the same scale
To nd the colors available in R, run try colors() in the console and you will nd the
names of 657 colors.
Data Visualizaon
[ 74 ]
What just happened?
barplot and barchart were the two functions we used to obtain the bar charts. For
common recurring factors, AR and BR here, the colors can be correctly specified through the
rep function. The argument beside=TRUE helped us to keep the bars for the various software
together for the different bug types. We also saw how to use the formula structure of the
lattice package. We saw the diversity of bar charts and learned how to create effective
bar charts depending on the purpose of the day.
Have a go hero
Explore the opon stack=TRUE in the barchart(Software~Freq|Bugs,groups= BA_
Ind,…). Also, observe that Freq for bars in the preceding screenshot begins a lile earlier
than 0. Reobtain the plots by specifying the range for Freq with xlim=c(0,15000).
Dot charts
Cleveland (1993) proposed an alternative to the bar chart where dots are used to represent
the frequency associated with the categorical variables. Dot charts are useful for small
to moderately sized datasets. Dot charts are also an alternative to the pie chart; refer to
The default examples section. Dot charts may be varied to accommodate continuous
variable datasets too. Dot charts are known to obey Tukey's principle of achieving as
high an information-to-ink ratio as possible.
Example 3.1.4. (Continuation of Example 3.1.2): In the screenshot in Step 6 in the Time for
action – bar charts in R section, we saw that the bars for the frequencies of bugs
after release are almost non-existent. This is overcome using the dot chart; see the following
action list on the dot chart.
Time for action – dot charts in R
The dotchart funcon from the graphics package and dotplot from the lattice
package will be used to obtain the dot charts.
1. To view the default examples of dot charts, enter example(dotplot);
example(dotchart); in the console and hit the Return key.
2. To obtain the dot chart of the before and after release bug frequencies, run the
following code:
dotchart(Severity_Counts,col=15:16,lcolor="black",pch=2:3,
labels=names(Severity_Counts),main="Dot Plot for the Before and After
Release Bug Frequency",cex=1.5)
Here, the option col=15:16 is used to specify the choice of colors; lcolor is used
for the color of the lines on the dot chart, which gives the human eye a good assessment
of the relative positions of the frequencies. The option pch=2:3 picks triangle and plus
symbols for indicating the positions of the after and before frequencies. The
options labels and main are trivial to understand, whereas cex magnifies the size
of all labels by 1.5 times. On execution of the preceding R code, we get a graph as
displayed in the following screenshot:
Figure 5: Dot chart for the Bug Metrics dataset
Data Visualizaon
[ 76 ]
3. The dot plot can be easily extended to all the five software systems as we did with the
bar charts.
> par(mfrow=c(1,2))
> dotchart(Bug_Metrics_Software[,,1],gcolor=1:5,col=6:10,lcolor="black",
pch=15:19,labels=names(Bug_Metrics_Software[,,1]),
main="Before Release Bug Frequency",xlab="Frequency Count")
> dotchart(Bug_Metrics_Software[,,2],gcolor=1:5,col=6:10,lcolor="black",
pch=15:19,labels=names(Bug_Metrics_Software[,,2]),
main="After Release Bug Frequency",xlab="Frequency Count")
Figure 6: Dot charts for the five software bug frequency
For a matrix input to dotchart, the gcolor option assigns one color per column group. Note
that though the class of Bug_Metrics_Software is both xtabs and table, the class of
Bug_Metrics_Software[,,1] is a matrix, and hence we create a dot chart of it. This
means that the R code dotchart(Bug_Metrics_Software) leads to errors! The dot
chart is able to display the bug frequencies in a better way as compared to the bar chart.
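A quick check of these classes at the console, as a small sketch of the point above:
class(Bug_Metrics_Software)           # "xtabs" "table", as stated above
class(Bug_Metrics_Software[,,1])      # a plain matrix, which dotchart() accepts
dotchart(Bug_Metrics_Software[,,1])   # works; dotchart(Bug_Metrics_Software) would not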
What just happened?
Two dierent ways of obtaining the dot plot were seen, and a host of other opons were
also clearly indicated in the current secon.
Spine and mosaic plots
In the bar plot, the length (height) of the bar varies, while the width of each bar is kept the
same. In a spine/mosaic plot, the height is kept constant for the categories and the width
varies in accordance with the frequency. The advantages of a spine/mosaic plot become
apparent when we have frequencies tabulated for several variables via a contingency table.
The spine plot is a particular case of the mosaic plot. We first consider an example for
understanding the spine plot.
Example 3.1.5. Visualizing the Shift and Operator data (Page 487, Ryan, 2007): In a manufacturing
factory, operators are rotated across shifts and it is a concern to find out whether the time of
the shift affects the operator's performance. In this experiment, there are three operators who in a
given month work in a particular shift. Over a period of three months, data is collected on the
number of nonconforming parts an operator produces during a given shift. The data is obtained
from page 487 of Ryan (2007) and is reproduced in the following table:
Operator 1 Operator 2 Operator 3
Shift 1 40 35 28
Shift 2 26 40 22
Shift 3 52 46 49
We will obtain a spine plot towards an understanding of the spread of the number
of non-conforming units an operator produces during the shifts in the forthcoming action
time. Let us ask the following questions:
Does the total number of non-conforming parts depend on the operators?
Does it depend on the shift?
Can we visualize the answers to the preceding questions?
Time for action – the spine plot for the shift and operator data
Spine plots are drawn using the spineplot function.
1. Explore the default examples for the spine plot with example(spineplot).
2. Enter the data for the shift and operator example with:
ShiftOperator <- matrix(c(40, 35, 28, 26, 40, 22, 52, 46, 49),nrow=3,
dimnames=list(c("Shift 1", "Shift 2", "Shift 3"), c("Operator 1",
"Operator 2", "Operator 3")),byrow=TRUE)
3. Find the number of non-conforming parts for the operators with the
colSums function:
> colSums(ShiftOperator)
Operator 1 Operator 2 Operator 3
       118        121         99
Data Visualizaon
[ 78 ]
The non-conforming parts for operators 1 and 2 are close enough, and the count is
about 20 percent lower for the third operator.
4. Find the number of non-conforming parts according to the shifts using the
rowSums function:
> rowSums(ShiftOperator)
Shift 1 Shift 2 Shift 3
    103      88     147
Shift 3 appears to have about 50 percent more non-conforming parts in
comparison with shifts 1 and 2. Let us look at the spine plot.
5. Obtain the spine plot for the ShiftOperator data with
spineplot(ShiftOperator).
Now, we will attempt to make the spine plot a bit more interpretable. In the
absence of any external influence, we would expect the shifts and operators to
have a near equal number of non-conforming units.
6. Thus, on the overall x and y axes, we plot lines at approximately the one-third
marks and check if we get approximately equal regions/squares.
abline(h=0.33,lwd=3,col="red")
abline(h=0.67,lwd=3,col="red")
abline(v=0.33,lwd=3,col="green")
abline(v=0.67,lwd=3,col="green")
The output in the graphics device window will be the following screenshot:
Figure 7: Spine plot for the Shift Operator problem
It appears from the partition induced by the red lines that all the operators have
a nearly equal number of non-conforming parts. However, the spine chart shows
that most of the non-conforming parts occur during Shift 3.
What just happened?
Data summaries were used to understand the behavior of the problem, and the spine
plot helped in clear identification of Shift 3 as a major source of the non-conforming units
manufactured. The use of the abline function was particularly insightful for this dataset
and should be explored whenever there is scope for it.
The spine plot is a special case of the mosaic plot. Friendly (2001) has pioneered the concept
of mosaic plots, and Chapter 4, Exploratory Analysis, is an excellent reference for the same.
For a simple understanding of the construction of the mosaic plot, you can go through slides
7-12 at http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture17.pdf.
As explained there, suppose that there are three categorical variables, each with two
levels. Then, the mosaic plot begins with a square and divides it into two parts, with each
part having an area proportional to the frequency of the two levels of the first categorical
variable. Next, each of the preceding two parts is divided into two parts according to
the predefined frequency of the two levels of the second categorical variable. Note that we
now have four divisions of the total area. Finally, each of the four areas is further divided
into two more parts, each with an area reflecting the predefined frequency of the two levels
of the third categorical variable.
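A minimal sketch of this three-variable construction, using a small hypothetical 2 x 2 x 2 table (all names and frequencies below are made up purely for illustration):
# three two-level factors with made-up cell frequencies
tab <- array(c(30, 10, 20, 40, 15, 25, 35, 5),
             dim = c(2, 2, 2),
             dimnames = list(First  = c("a1", "a2"),
                             Second = c("b1", "b2"),
                             Third  = c("c1", "c2")))
mosaicplot(tab, main = "Mosaic plot of three two-level factors")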
Example 3.1.6. The Titanic dataset: In the The table object section in Chapter 2, Import/
Export Data, we came across the Titanic dataset. The dataset was seen in two different
forms and we also constructed the data from scratch. Let us now continue the example.
The main problems in this dataset are the following:
The distribution of the passengers by Class, and then the spread of Survived
across Class
The distribution of the passengers by Sex and its distribution across the survivors
The distribution by Age followed by the survivors among them
We now want to visualize the distribution of Survived first by Class, then by Sex, and finally by
the Age group.
Let us see the detailed action.
Data Visualizaon
[ 80 ]
Time for action – the mosaic plot for the Titanic dataset
The goal here is to understand the survival percentages on the Titanic ship with respect to
the Class of the passenger/crew, Sex, and Age. We first use xtabs and prop.table to gain insight
into each of these variables, and then visualize the overall picture using mosaicplot.
1. Get the frequencies of Class for the Titanic dataset with
xtabs(Freq~Class,data=Titanic).
2. Obtain the Survived proportions across Class with
prop.table(xtabs(Freq~Class+Survived,data=Titanic),margin=1).
3. Repeat the preceding two steps for Sex: xtabs(Freq~Sex,data=Titanic)
and prop.table(xtabs(Freq~Sex+Survived,data=Titanic),margin=1).
4. Repeat this exercise for Age: xtabs(Freq~Age,data=Titanic) and
prop.table(xtabs(Freq~Age+Survived,data=Titanic),margin=1).
5. Obtain the mosaic plot for the dataset with
mosaicplot(Titanic,col=c("red","green")).
The enre output is given in the following screenshot:
Figure 8: Mosaic plot for the Titanic dataset
The preceding output shows that the people traveling in the higher classes survived better than
those in the lower classes. The analysis also shows that females were given more priority over
males when the rescue system was in action. Finally, it may be seen that children were given
priority over adults.
The mosaic plot division process proceeds as follows. First, it divides the region into four
parts with the regions proportional to the frequencies of each Class; that is, the widths of
the regions are proportional to the Class frequencies. Each of the four regions is further
divided using the predefined frequencies of the Sex categories. Now, we have eight regions.
Next, each of these regions is divided using the predefined frequencies of the Age groups,
leading to 16 distinct regions. Finally, each of the regions is divided into two parts according
to the Survived status. The Yes regions of Child for the first two classes are larger than the
No regions. The third Class has more non-survivors than survivors, and this appears to be
true across Age and Gender. Note that there are no children among the Crew class.
The interpretation of the rest of the regions is left to the reader.
What just happened?
A clear demyscaon of the working of the mosaic plot has been provided. We applied it to
the Titanic dataset and saw how it obtains clear regions which enable to deep dive into a
categorical problem.
Pie charts and the fourfold plot
Pie charts are hugely popular among many business analysts. One reason for their popularity is,
of course, their simplicity. That the pie chart is easy to interpret is actually not a fact. In fact, the pie
chart is seriously discouraged for analysis and observations; refer to the caution of Cleveland
and McGill, and also Sarkar (2008), page 57. However, we will still continue with an illustration
of it.
Example 3.1.7. Pie chart for the Bugs Severity problem: Let us obtain the pie chart for the
bug severity levels.
> pie(Severity_Counts[1:5])
> title("Severity Counts Post-Release of JDT Software")
> pie(Severity_Counts[6:10])
> title("Severity Counts Pre-Release of JDT Software")
Data Visualizaon
[ 82 ]
Can you nd the drawback of the pie chart?
Figure 9: Pie chart for the Before and After Bug counts (output edited)
The main drawback of the pie chart stems from the fact that humans have a problem in
deciphering relative areas. A common recommendation is the use of a bar chart or a dot
chart instead of the pie chart, as the problem of judging relative areas does not exist when
comparing linear measures.
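As a quick sketch of that recommendation, the same post-release counts shown in the first pie chart above can be redrawn as a dot chart:
# the dot-chart alternative to the first pie chart
dotchart(Severity_Counts[1:5],
         main = "Severity Counts Post-Release of JDT Software")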
The fourfold plot is a novel way of visualizing a 2 x 2 x k contingency table. In this method,
we obtain k plots, one for each 2 x 2 frequency table. Here, the cell frequency of each of the four
cells is represented by a quarter circle whose radius is proportional to the square root of the
frequency. In contrast to the pie chart, where the radius is constant and the area is varied by the
perimeter, the radius in a fourfold plot is varied to represent the cell.
Example 3.1.8. The fourfold plot for the UCBAdmissions dataset: An in-built R function
which generates the required plot is fourfoldplot. The R code and the resultant
screenshot are displayed as follows:
> fourfoldplot(UCBAdmissions,mfrow=c(2,3),space=0.4)
Figure 10: The fourfold plot of the UCBAdmissions dataset
In this secon, we focused on graphical techniques for categorical data. In many books,
the graphical methods begin with tools which are more appropriate for data arising for
connuous variables. Such tools have many shortcomings if applied to categorical data.
Thus, we have taken a dierent approach where the categorical data gets the right tools,
which it truly deserves. In the next secon, we deal with tools which are seemingly more
appropriate for data related to connuous variables.
Data Visualizaon
[ 84 ]
Visualization techniques for continuous variable data
Connuous variables have a dierent structure and hence, we need specialized methods
for displaying them. Fortunately, many popular graphical techniques are suited very well for
connuous variables. As the connuous variables can arise from dierent phenomenon, we
consider many techniques in this secon. The graphical methods discussed in this secon
may also be considered as a part of the next chapter on exploratory analysis.
Boxplot
The boxplot is based on five points: the minimum, lower quartile, median, upper quartile, and
maximum. The median forms the thick line near the middle of the box, and the lower and
upper quartiles complete the box. The lower and upper quartiles, along with the median,
which is the second quartile, divide the data into four regions, each containing an equal
number of observations. The median is the middle-most value among the data sorted in
increasing (decreasing) order of magnitude. On similar lines, the lower quartile may be
interpreted as the median of the observations between the minimum and the median.
These concepts are dealt with in more detail in Chapter 4, Exploratory Analysis. The boxplot is
generally used for two purposes: understanding the data spread and identifying outliers.
For the first purpose, we set the range argument to zero, which will extend the whiskers to the
extremes at the minimum and maximum, and give the overall distribution of the data.
If the purpose of the boxplot is to identify outliers, we extend the whiskers in a way which
accommodates tolerance limits to enable us to capture the outliers. Thus, the whiskers are
extended up to 1.5 times (the default value) the interquartile range (IQR), which is the difference
between the third and first quartiles, beyond the quartiles. The default setting of boxplot is the
identification of outliers. If any point is found beyond the whiskers, such observations
may be marked as outliers. The boxplot is also sometimes called a box-and-whisker plot;
the whiskers are the lines drawn from the ends of the box to the minimum
and maximum points. We will begin with an example of the boxplot.
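As a quick illustration of the range argument before the worked examples, consider this small sketch; the vector x is hypothetical data with one extreme value.
x <- c(2, 3, 3, 4, 5, 5, 6, 30)   # made-up data with one extreme point
boxplot(x, range = 0)             # whiskers run all the way to the minimum and maximum
boxplot(x)                        # default: whiskers within 1.5*IQR, the value 30 is flagged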
Example 3.2.1. Example (boxplot): For a quick tutorial on the various options of the boxplot
function, the user may run the following code at the R console. Also, the reader is advised
to explore the bwplot function from the lattice package. Try example(boxplot) and
example(bwplot) from the respective graphics and lattice packages.
Example 3.2.2. Boxplot for the resistivity data: Gunst (2002) has 16 independent
observations from eight pairs on the resistivity of a wire. There are two processes under
which these observations are equally distributed. We would like to see if the resistivity of the
wires depends on the processes, and which of the processes leads to higher resistivity.
A numerical comparison based on the summary function will be carried out first, and then
we will visualize the two processes through boxplots to conclude whether the effects are the
same, and if not, which process leads to higher resistivity.
Example 3.2.3. The Michelson-Morley experiment: This is a famous physics experiment from
the late nineteenth century, which helped in proving the non-existence of ether. If the ether
existed, one would expect a shift of about 4 percent in the speed of light. The speed of light is
measured 20 times in each of five different experiments. We will use this dataset for two purposes:
checking whether the drift of 4 percent is evidenced in the data, and setting the whiskers at the
extremes. The first one is a statistical issue and the latter is a software setting.
For the preceding three examples, we will now read the required data into R, then
look at the necessary summary functions, and finally visualize them using boxplots.
Time for action – using the boxplot
Boxplots will be obtained here using the function boxplot from the graphics package as well
as bwplot from the lattice package.
1. Check the variety of boxplots with example(boxplot) from the graphics
package and example(bwplot) for the variants in the lattice package.
2. The resistivity data from the RSADBE package contains the two processes'
information which we need to compare. Load it into the current session with
data(resistivity).
3. Obtain the summary of the two processes with the following:
> summary(resistivity)
Process.1 Process.2
Min. 0.138 0.142
1st Qu. 0.140 0.144
Median 0.142 0.146
Mean 0.142 0.146
3rd Qu. 0.143 0.148
Max. 0.146 0.150
Clearly, Process 2 has approximately 0.004 higher resistivity as compared
to Process 1 across all the essential summaries. Let us check if the boxplot
captures the same.
4. Obtain the boxplot for the two processes with boxplot(resistivity, range=0).
The argument range=0 ensures that the whiskers are extended to the
minimum and maximum data values. The boxplot diagram (left-hand side of the
next screenshot) clearly shows that Process.2 has higher resistivity in comparison
with Process.1. Next, we will consider the bwplot function from the lattice
package. A slightly different rearrangement of the resistivity data frame will be
required, in that we will specify all the resistivity values in a single column and their
corresponding processes in another column.
Data Visualizaon
[ 86 ]
An important opon for boxplots is that of notch, which is especially useful for
the comparison of medians. The top and boom notches for a set of observaons
are dened at the point's median
±
1.57(IQR)/n1/2. If notches of two boxplots do
not overlap, it can be concluded that the medians of the groups are signicantly
dierent. Such an opon can be specied in both boxplot and bwplot funcons.
5. Convert resisvity to another useful form which will help the applicaon of the
bwplot funcon with resistivity2 <- data.frame(rep(names( re
sistivity),each=8),c(resistivity[,1],resistivity[,2])).
Assign variable names to the new data frame with names(resistivity2)<-
c("Process","Resistivity").
6. Run the bwplot funcon on resistivity2 with
bwplot(Resistivity~Process, data=resistivity2,notch=TRUE).
Figure 11: Boxplots for resistivity data with boxplot, bwplot, and notches
The notches overlap and hence we can't conclude from the boxplot that the
resistivity medians are very different from each other.
With the data on the speed of light from the morley dataset, we have the important
goal of identifying outliers. Towards this purpose, the whiskers are extended up to 1.5
times (the default value) the interquartile range (IQR) beyond the quartiles.
7. Create a boxplot with whiskers which enable identification of the outliers beyond
1.5 times the IQR with the following:
boxplot(Speed~Expt,data=morley,main = "Whiskers at Lower- and
Upper- Confidence Limits")
8. Add the line which helps to identify the presence of ether with
abline(h=792.458,lty=3).
The resulting screenshot is as follows:
Figure 12: Boxplot for the Morley dataset
It may be seen from the preceding screenshot that experiment 1 has one outlier, while
experiment 3 has three outlier values. Since the line is well below the medians of the
experiment values (speeds, actually), we conclude that there is no experimental evidence
for the existence of ether.
What just happened?
Variees of boxplots have been elaborated in this secon. The boxplot has also been put in
acon to idenfy the presence of outliers in a given dataset.
Data Visualizaon
[ 88 ]
Histograms
The histogram is one of the earliest graphical techniques and undoubtedly one of the most
versatile and adaptive graphs, whose relevance remains as legitimate as it ever was. The invention
of the histogram is credited to the great statistician Karl Pearson. Its strength is also in its
simplicity. In this technique, the range of a variable is divided into intervals, and the height of an
interval is determined by the frequency of the observations falling in that interval. Even in the case
of an unbalanced division of the range of the variable values, histograms are especially
informative in revealing the shape of the probability distribution of the variable. We will see
more about these points in the following examples.
The construcon of a histogram is explained with the famous dataset of Galton, where an
aempt has been made for understanding the relaonship between the heights of a child
and parent. In this dataset, there are 928 pairs of observaon of the height of the child and
parent. Let us have a brief peek at the dataset:
> data(galton)
> names(galton)
[1] "child" "parent"
> dim(galton)
[1] 928 2
> head(galton)
child parent
1 61.7 70.5
2 61.7 68.5
3 61.7 65.5
4 61.7 64.5
5 61.7 64.0
6 62.2 67.5
> sapply(galton,range)
child parent
[1,] 61.7 64
[2,] 73.7 73
> summary(galton)
child parent
Min. :61.7 Min. :64.0
1st Qu.:66.2 1st Qu.:67.5
Median :68.2 Median :68.5
Mean :68.1 Mean :68.3
3rd Qu.:70.2 3rd Qu.:69.5
Max. :73.7 Max. :73.0
We need to cover all the 928 observations in intervals, also known as bins, which need to
cover the range of the variable. The natural question is how does one decide on the number
of intervals and the width of these intervals? If the bin width, denoted by h, is known, the
number of bins, denoted by k, can be determined by:
$$k = \left\lceil \frac{\max_i x_i - \min_i x_i}{h} \right\rceil$$
Here, $\lceil \cdot \rceil$ denotes the ceiling of the number. Similarly, if the number of bins k
is known, the width is determined by $h = (\max_i x_i - \min_i x_i)/k$.
There are many guidelines for these problems. The three options available for the hist
function in R are the formulas given by Sturges, Scott, and Freedman-Diaconis, the details of
which may be obtained by running ?nclass.Sturges, ?nclass.FD, and ?nclass.scott
in the R console. The default setting uses the Sturges option. The Sturges formula
for the number of bins is given by:
$$k = \lceil \log_2 n \rceil + 1$$
This formula works well when the underlying distribution is approximately a
normal distribution. Scott's normal reference rule for the bin width, using the sample
standard deviation $\hat{\sigma}$, is:
$$h = \frac{3.5\,\hat{\sigma}}{n^{1/3}}$$
Finally, the Freedman-Diaconis rule for the bin width is given by:
$$h = \frac{2\,\mathrm{IQR}}{n^{1/3}}$$
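These three rules are implemented in the grDevices package, so the implied number of bins can be checked directly; a quick sketch on the parent heights from the Galton data loaded above:
nclass.Sturges(galton$parent)   # number of bins by Sturges' formula
nclass.scott(galton$parent)     # bins implied by Scott's bin width
nclass.FD(galton$parent)        # bins implied by the Freedman-Diaconis width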
In the following, we will construct a few histograms, describing the problems through their
examples and their R setup in the Time for action – understanding the effectiveness of
histograms section.
Example 3.2.4. The default examples: To get a first preview of the generation of histograms,
we suggest the reader go through the built-in examples; try example(hist) and
example(histogram).
Data Visualizaon
[ 90 ]
Example 3.2.5. The Galton dataset: We will obtain histograms for the heights of child and
parent from the Galton dataset. We will use the Freedman-Diaconis and Sturges choices of
bin widths.
Example 3.2.6. Octane rating of gasoline blends: An experiment is conducted where the
octane rating of gasoline blends can be obtained using two methods. Two samples are
available for testing each type of blend, and Snee (1981) obtains 32 different blends over
an appropriate spectrum of the target octane ratings. We obtain histograms for the ratings
under the two different methods.
Example 3.2.7. Histogram with a dummy dataset: A dummy dataset has been created by the
author. Here, we need to obtain histograms for the two samples in the Samplez data from
the RSADBE package.
Time for action – understanding the effectiveness of histograms
Histograms are obtained using the hist and histogram functions. The choice of bin widths
is also discussed.
1. Get a feel for R's histogram capabilities through example(hist) and
example(histogram) for the respective histogram functions from the graphics
and lattice packages.
2. Set up the graphics device with par(mfrow=c(2,2)).
3. Create the histograms for the heights of Child and Parent from the galton dataset,
seen in the earlier part of the section, for the Freedman-Diaconis and Sturges
choices of bin widths:
par(mfrow=c(2,2))
hist(galton$parent,breaks="FD",xlab="Height of Parent",
main="Histogram for Parent Height with Freedman-Diaconis
Breaks",xlim=c(60,75))
hist(galton$parent,xlab="Height of Parent",main="Histogram for
Parent Height with Sturges Breaks",xlim=c(60,75))
hist(galton$child,breaks="FD",xlab="Height of Child",
main="Histogram for Child Height with Freedman-Diaconis
Breaks",xlim=c(60,75))
hist(galton$child,xlab="Height of Child",main="Histogram for Child
Height with Sturges Breaks",xlim=c(60,75))
Consequently, we get the following screenshot:
Figure 13: Histograms for the Galton dataset
Note that a few people may not like histograms for the height of parent for the
Freedman-Diaconis choice of bin width.
4. For the experiment mentioned in Example 3.2.6, Octane rating of gasoline blends,
first load the data into R with data(octane).
5. Invoke the graphics device for the ratings under the two methods with
par(mfrow=c(2,2)).
6. Create the histograms for the ratings under the two methods for the Sturges choice
of bin widths with:
hist(octane$Method_1,xlab="Ratings Under Method I",main="Histogram
of Octane Ratings for Method I",col="mistyrose")
hist(octane$Method_2,xlab="Ratings Under Method
II",main="Histogram of Octane Ratings for Method II",col="cornsilk")
The resulting histogram plot will be the first row of the next screenshot.
A visual inspection suggests that under Method_1, the mean rating is around 90
while under Method_2 it is approximately 95. Moreover, the Method_2 ratings
look more symmetric than the Method_1 ratings.
Data Visualizaon
[ 92 ]
7. Load the required data here with data(Samplez).
8. Create the histograms for the two samples in the Samplez data frame with:
hist(Samplez$Sample_1,xlab="Sample 1",main="Histogram: Sample 1",
col="magenta")
hist(Samplez$Sample_2,xlab="Sample 2",main="Histogram: Sample 2",
col="snow")
We obtain the following histogram plot:
Figure 14: Histogram for the Octane and Samples dummy dataset
The lack of symmetry is very apparent in the second-row display of the preceding screenshot.
It is very clear from the preceding screenshot that the left histogram exhibits an example of
a positively skewed distribution for Sample_1, while the right histogram for Sample_2 shows
a negatively skewed distribution.
What just happened?
Histograms have traditionally provided a lot of insight into the understanding of the
distribution of variables. In this section, we dived deep into the intricacies of their construction,
especially related to the choice of bin widths. We also saw how the different natures of the
variables are clearly brought out by their histograms.
Scatter plots
In the previous subsecon, we used histograms for understanding the nature of the
variables. For mulple variables, we need mulple histograms. However, we need dierent
tools for understanding the relaonship between two or more variables. A simple, yet
eecve technique is the scaer plot. When we have two variables, the scaer plot simply
draws the two variables across the two axes. The scaer plot is powerful in reecng the
relaonship between the variables as in it reveals if there is a linear/nonlinear relaonship.
If the relaonship is linear, we may get an insight if there is a posive or negave relaonship
among the variables, and so forth.
Example 3.2.8. The drain current versus the ground-to-source voltage: Montgomery and
Runger (2003) report an article from IEEE (Exercise 11.64) about an experiment where the
drain current is measured against the ground-to-source voltage. In the scatter plot, the drain
current values are plotted against each level of the ground-to-source voltage. The former
is measured in milliamperes and the latter in volts. The R function plot is used for
understanding the relationship. We will soon visualize the relationship between the current
values and the levels of the ground-to-source voltage. This data is available as DCD in the
RSADBE package.
The scatter plot is very flexible when we need to understand the relationship between more
than two variables. In the next example, we will extend the scatter plot to multiple variables.
Example 3.2.9. The Gasoline mileage performance data: The mileage of a car depends on
various factors; in fact, it is a very complex problem. In the next table, the various variables
x1 to x11 are described, which are believed to have an influence on the mileage of the car. We
need a plot which explains the inter-relationships between the variables and the mileage.
The exercise of repeating the plot function may be done 11 times, though most people may
struggle to recollect the influence of the first plot when they are looking at the sixth or maybe
the seventh plot. The pairs function returns a matrix of scatter plots, which is really useful.
Let us visualize the matrix of scatter plots:
> data(Gasoline)
> pairs(Gasoline) # Output suppressed
It may be seen that this matrix of scatter plots is a symmetric plot in the sense that the
upper and lower triangles of this matrix are simply copies of each other (transposed copies,
actually). We can be more effective in representing the data in the matrix of scatter plots by
specifying additional parameters. Even as we study the inter-relationships, it is important to
understand the variable distributions themselves. Since the diagonal elements just indicate the
names of the variables, we can instead replace them by their histograms. Further, if we can
give a measure of the relationship between two variables, say the correlation coefficient,
we can be more effective. In fact, we go a step further by displaying the correlation coefficient
with a font size that increases with its strength. We first define the necessary
functions and then use the pairs function.
Data Visualizaon
[ 94 ]
Variable Notation  Variable Description          Variable Notation  Variable Description
Y                  Miles/gallon                  x6                 Carburetor (barrels)
x1                 Displacement (cubic inches)   x7                 No. of transmission speeds
x2                 Horsepower (foot-pounds)      x8                 Overall length (inches)
x3                 Torque (foot-pounds)          x9                 Width (inches)
x4                 Compression ratio             x10                Weight (pounds)
x5                 Rear axle ratio               x11                Type of transmission
                                                                    (A-automatic, M-manual)
Time for action – plot and pairs R functions
The scaer plot and its important mulvariate extension with pairs will be considered in
detail now.
1. Consider the data data(DCD).
Use the opons xlab and ylab to specify the right labels for the axes. We specify
xlim and ylim to get a good overview of the relaonship.
2. Obtain the scaer plot for Example 3.2.8. The Drain current versus the ground-
to-source voltage using plot(DCD$Drain_Current, DCD$GTS_Voltage,t
ype="b",xlim=c(1,2.2),ylim=c(0.6,2.4),xlab="Current Drain",
ylab="Voltage").
Figure 15: The scatter plot for DCD
We can easily see from the preceding scatter plot that as the ground-to-source
voltage increases, there is a corresponding increase in the drain current. This is an
indication of a positive relationship between the two variables. However, the lab
assistant now comes to you and says that the measurement error of the instrument
has actually led to 15 percent higher recordings of the ground-to-source voltage.
Now, instead of dropping the entire diagram, we may simply prefer to add the
corrected figures to the existing one. The points function helps us to add the new
corrected data points to the figure.
3. Now, first obtain the correct GTS_Voltage readings with DCD$GTS_Voltage/1.15
and add them to the existing plot with points(DCD$Drain_Current,
DCD$GTS_Voltage/1.15,type="b",col="green").
4. We rst create two funcons panel.hist and panel.cor dened as follows:
panel.hist<- function(x, ...)
{
usr<- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks<- h$breaks; nB<- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}
panel.cor<- function(x, y, digits=2, prefix="", cex.cor, ...) {
usr<- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x,y,use="complete.obs"))
txt<- format(c(r, 0.123456789), digits=digits)[1]
txt<- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor<- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
The preceding two functions are taken from the code of example(pairs).
5. It is me to put these two funcons in to acon:
pairs(Gasoline,diag.panel=panel.hist,lower.panel=panel.
smooth,upper.panel=panel.cor)
Figure 16: The pairs plot for the Gasoline dataset
In the upper triangle of the display, we can see that the mileage has a strong association with the displacement, horsepower, torque, number of transmission speeds, overall length, width, weight, and type of transmission. We can say a bit more too. The first three variables, x1 to x3, relate to the engine characteristics, and there is a strong association within these three variables. Similarly, there is a strong association among x8 to x10, and together they describe the vehicle dimensions. Also, we have done a bit more than simply obtaining the scatter plots in the lower triangle of the display: a smooth approximation of the relationship between the variables is provided there.
6. Finally, we resort to the usual trick of looking at the capabilities of the plot and pairs functions with example(plot), example(pairs), and example(xyplot).
We have seen how multiple variables can be visualized. In the next subsection, we will explore Pareto charts.
What just happened?
Starting with a simple scatter plot and its effectiveness, we went to great lengths in its extension to the pairs function. The pairs function has been explored using the panel.hist and panel.cor functions for truly understanding the relationships among a set of multiple variables.
Pareto charts
The Pareto rule, also known as the 80-20 rule or the law of the vital few, says that approximately 80 percent of the defects are due to 20 percent of the causes. It is important because identifying and eliminating the vital 20 percent of causes removes roughly 80 percent of the defects. The qcc package contains the function pareto.chart, which helps in generating the Pareto chart. We will give a simple illustration of this chart.
The Pareto chart is a display of the cause frequencies along two axes. Suppose that we have 10 causes, C1 to C10, which have occurred with defect counts 5, 23, 7, 41, 19, 4, 3, 4, 2, and 1. Causes 2, 4, and 5 have high frequencies (dominating?) and the other causes look a bit feeble. Now, let us sort these causes in decreasing order of frequency and obtain their cumulative frequencies. We will also obtain their cumulative percentages.
> Cause_Freq <- c(5, 23, 7, 41, 19, 4, 3, 4, 2, 1)
> names(Cause_Freq) <- paste("C",1:10,sep="")
> Cause_Freq_Dec <- sort(Cause_Freq,dec=TRUE)
> Cause_Freq_Cumsum <- cumsum(Cause_Freq_Dec)
> Cause_Freq_Cumsum_Perc <- Cause_Freq_Cumsum/sum(Cause_Freq)
> cbind(Cause_Freq_Dec,Cause_Freq_Cumsum,Cause_Freq_Cumsum_Perc)
Cause_Freq_Dec Cause_Freq_Cumsum Cause_Freq_Cumsum_Perc
C4 41 41 0.3761
C2 23 64 0.5872
C5 19 83 0.7615
C3 7 90 0.8257
C1 5 95 0.8716
C6 4 99 0.9083
C8 4 103 0.9450
C7 3 106 0.9725
C9 2 108 0.9908
C10 1 109 1.0000
This appears to be a simple trick, and yet it is very effective in revealing that causes 2, 4, and 5 contribute more than 75 percent of the defects. A Pareto chart completes the preceding table with a bar chart of the causes in decreasing order of count, with a left vertical axis for the frequencies and a cumulative curve read against the right vertical axis. We will see the Pareto chart in action for the next example.
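Before moving on, note that the Cause_Freq vector built above can be passed directly to pareto.chart; a minimal sketch, assuming the qcc package is installed:
> library(qcc)
> pareto.chart(Cause_Freq)  # bars of the sorted causes plus the cumulative percentage curve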
Data Visualizaon
[ 98 ]
Example 3.2.10. The Pareto chart for incomplete applications: A simple step-by-step illustration of the Pareto chart is available on the Web at http://personnel.ky.gov/nr/rdonlyres/d04b5458-97eb-4a02-bde1-99fc31490151/0/paretochart.pdf. The reader can go through the clear steps mentioned in the document.
In the example from the web document mentioned previously, a bank which issues credit cards rejects application forms if they are deemed incomplete. An application form may be incomplete if information is not provided for one or more of the details sought in the form. For example, an application can't be processed further if the customer/applicant has not provided an address, has illegible handwriting, the signature is missing, or the customer is already an existing credit card holder, among other reasons. The concern of the manager of the credit card wing is to ensure that the rejections for incomplete applications decline, since a cost is incurred on issuing the form which is generally not charged for. The manager wants to focus on the reasons which may be leading to the rejection of the forms. Here, we consider the frequency of the different causes which lead to the rejection of the applications.
> library(qcc)
> Reject_Freq = c(9,22,15,40,8)
> names(Reject_Freq) = c("No Addr.", "Illegible", "Curr. Customer", "No Sign.", "Other")
> Reject_Freq
      No Addr.      Illegible Curr. Customer       No Sign.          Other 
             9             22             15             40              8 
> options(digits=2)
> pareto.chart(Reject_Freq)
Pareto chart analysis for Reject_Freq
Frequency Cum.Freq. Percentage Cum.Percent.
No Sign. 40.0 40.0 42.6 42.6
Illegible 22.0 62.0 23.4 66.0
Curr. Customer 15.0 77.0 16.0 81.9
No Addr. 9.0 86.0 9.6 91.5
Other 8.0 94.0 8.5 100.0
Figure 17: The Pareto chart for incomplete applications
In the preceding Pareto chart (Figure 17), the frequency of the five reasons for rejection is represented by bars as in a bar plot, with the distinction of being displayed in decreasing magnitude of frequency. The frequency of the reasons is indicated on the left vertical axis. At the mid-point of each bar, the cumulative frequency up to that reason is indicated, and the reference for this count is the right vertical axis. Thus, we can see that more than 75 percent of the rejections are due to the three causes No Signature, Illegible, and Current Customer. This is the main strength of a Pareto chart.
A brief peek at ggplot2
Tufte (2001) and Wilkinson (2005) place a lot of emphasis on the aesthetics of graphics. There is indeed more to graphics than mere mathematics, and subtle changes/corrections in a display may lead to improved, enhanced, and easier-on-the-eye diagrams. Wilkinson emphasizes what he calls the grammar of graphics, and an R adaptation of it is given by Wickham (2009).
Thus far, we have used various functions, such as barchart, dotchart, spineplot, fourfoldplot, boxplot, plot, and so on. The grammar of graphics emphasizes that a statistical graphic is a mapping from data to the aesthetic attributes of geometric objects. The aesthetics consist of color, shape, and size, while the geometric objects are composed of points, lines, and bars. A detailed discussion of these aspects is unfortunately not feasible in our book, and we will have to settle for a quick introduction to the grammar of graphics.
To begin with, we will simply consider the qplot function from the ggplot2 package.
Time for action – qplot
Here, we will first use the qplot function for obtaining various kinds of plots. To keep the story short, we are using the earlier datasets only and hence, a reproduction of the similar plots using qplot won't be displayed. The reader is encouraged to check ?qplot and its examples.
1. Load the library with library(ggplot2).
2. Rearrange the resistivity dataset in a different format and obtain the boxplots:
test <- data.frame(rep(c("R1","R2"),each=8), c(resistivity[,1],resistivity[,2]))
names(test) <- c("RES","VALUE")
qplot(factor(RES), VALUE, data=test, geom="boxplot")
The qplot function needs to be told explicitly that RES is a factor variable, and according to its levels we obtain the boxplots of the resistivity values.
3. For the Gasoline dataset, we would like to obtain a boxplot of the mileage according to whether the gear system is manual or automatic. Thus, qplot can be put to action with qplot(factor(x11), y, data=Gasoline, geom="boxplot").
4. A histogram is also one of the geometric aspects of ggplot2, and we next obtain the histogram for the height of the child with qplot(child, data=galton, geom="histogram", binwidth=2, xlim=c(60,75), xlab="Height of Child", ylab="Frequency").
5. The scatter plot of the height of the parent against the child is fetched with qplot(parent, child, data=galton, xlab="Height of Parent", ylab="Height of Child", main="Height of Parent Vs Child").
What just happened?
The geom argument of qplot allows a good family of graphics under a single function. This is particularly advantageous as it lets us perform a host of tricks under a single umbrella.
Of course, there is the all the more important ggplot function from the ggplot2 library, which is the primary reason for the flexibility of the grammar of graphics. We will close the chapter with a very brief exposition of it. The main strength of ggplot stems from the fact that you can build a plot layer by layer. We will illustrate this with a simple example.
Time for action – ggplot
ggplot, aes, and layer will be put into action to explore the power of the grammar of graphics.
1. Load the library with library(ggplot2).
2. Using the aes and ggplot functions, first create a ggplot object with galton_gg <- ggplot(galton, aes(child, parent)) and inspect the most recent creation in R by running galton_gg. You will get an error, and the graphics device will show a blank screen.
3. Create a scatter plot by adding a layer to galton_gg with galton_gg <- galton_gg + layer(geom="point"), and then run galton_gg to check for changes. Yes, you will get a scatter plot of the height of the child versus the parent.
4. The labels of the axes are not satisfactory and we need better ones. The strength of ggplot is that you can continue to add layers to it with varied options. In fact, you can change the xlim and ylim on an existing plot and check the difference in the plot at each step. Run the following code in a step-by-step manner and appreciate the strength of the grammar (a sketch using the current ggplot2 syntax follows this list):
galton_gg <- galton_gg + xlim(60,75)
galton_gg
galton_gg <- galton_gg + ylim(60,75)
galton_gg
galton_gg <- galton_gg + ylab("Height of Parent") + xlab("Height of Child")
galton_gg
galton_gg <- galton_gg + ggtitle("Height of Parent Vs Child")
galton_gg
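As promised, a side note: the layer(geom="point") call reflects the ggplot2 version used here; in more recent versions of the package the same plot is usually built with geom_point() and labs(). A minimal sketch, assuming the galton data frame with columns child and parent is already loaded:
library(ggplot2)
ggplot(galton, aes(child, parent)) +
  geom_point() +                    # the point layer
  xlim(60, 75) + ylim(60, 75) +     # axis limits
  labs(x = "Height of Child", y = "Height of Parent",
       title = "Height of Parent Vs Child")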
What just happened?
The layer-by-layer approach of ggplot is very useful, and we have seen an illustration of it on the scatter plot for the Galton dataset. The reach of ggplot is, of course, much richer than our simple illustration, and the interested reader may refer to Wickham (2009) for more details.
Have a go hero
If you run par(list(cex.lab=2, ask=TRUE)) followed by barplot(Severity_Counts, xlab="Bug Count", xlim=c(0,12000), horiz=TRUE, col=rep(c(2,3),5)), what do you expect R to do?
Summary
In this chapter, we have visualized different types of graphs for different types of variables. We have also explored how to gain insights into data through graphs. It is important to realize that plots generated without a clear understanding of the data structure, and without exercising enough caution, are meaningless. The GIGO adage is very true, and no rich visualization technique can overcome this problem.
In the previous chapter, we learned the important methods of importing/exporting data, and in this chapter we have visualized the data in different forms. Now that we have an understanding and visual insight of the data, we need to take the next step, namely quantitative analysis of the data. There are roughly two streams of analysis: exploratory and confirmatory analysis. It is the former technique that forms the core of the next chapter. As an instance, the scatter plot reveals whether there is a positive, negative, or no association between two variables. If the association is not zero, a numeric answer about the strength of the positive or negative relationship is then required. Techniques such as these and their extensions form the core of the next chapter.
4
Exploratory Analysis
Tukey (1977), in his benchmark book Exploratory Data Analysis, abbreviated popularly as EDA, describes "best methods" as:
We do not guarantee to introduce you to the "best" tools, particularly since we are not sure that there can be unique bests.
The goal of this chapter is to emphasize EDA and its strengths.
In the previous chapter, we saw visualization techniques for data of different characteristics. Analytical insight is also important, and this chapter considers EDA techniques for obtaining it. The more popular summary measures include the mean, standard error, and so on. It has been shown many times that the mean has several drawbacks, one of them being that it is very sensitive to outliers/extremes. Thus, in exploratory analysis the focus is on measures which are robust to the extremes. Many techniques considered in this chapter are discussed in more detail by Velleman and Hoaglin (1981), and an eBook has been kindly made available at http://dspace.library.cornell.edu/handle/1813/62. In the first section, we will have a peek at the often-used measures for exploratory analysis. The main learnings from this chapter are listed as follows:
Summary statistics based on the median and its variants, which are robust to outliers
Visualization techniques in the stem-and-leaf plot, letter values, and bagplots
A first regression model in the resistant line, and refined methods in smoothing data and median polish
Essential summary statistics
We have seen the useful summary statistics of mean and variance in the Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics. The concepts therein have their own utility value. The drawback of such statistical metrics is that they are very sensitive to outliers, in the sense that a single observation may completely distort the entire story. In this section, we discuss some exploratory analysis metrics which are intuitive and more robust than metrics such as the mean and variance.
Percentiles, quantiles, and median
For a given dataset and a number 0 < k < 1, the 100k% percentile divides the dataset into two partitions with 100k percent of the values below it and 100(1-k) percent of the values above it. The fraction k is referred to as a quantile. In Statistics, quantiles are used more often than percentiles, the difference being that quantiles vary over the unit interval (0, 1), whereas 100 times the quantiles gives us the percentiles. It is important to note that the minimum (maximum) is the 0% (100%) percentile.
The median is the fiftieth percentile, which divides the data values into two equal parts with itself being the mid-point of these parts. The lower and upper quartiles are respectively the 25% and 75% percentiles. The standard notation for the lower, mid (median), and upper quartiles respectively is Q1, Q2, and Q3. By extension, Q0 and Q4 respectively denote the minimum and maximum of a dataset.
Example 4.1.1. Rahul Dravid – The Wall: The game of cricket is hugely popular in India, and many cricketers have given a lot of goose bumps to those watching. Sachin Tendulkar, Anil Kumble, Javagal Srinath, Sourav Ganguly, Rahul Dravid, and VVS Laxman are some of the iconic names across the world. The six players mentioned here have especially played a huge role in taking India to the number one position in the Test cricket rankings, and it is widely believed that Rahul Dravid has been the main backbone of this success. A century is credited to a cricketer on scoring 100 or more runs in a single innings. Dravid has scored 36 Test centuries across the globe, quite a handful of them so resolute in nature that they earned him the nickname "The Wall", and we will seek some percentiles/quantiles for these scores soon.
Next, we will focus on a statistic which is similar to the quantiles.
Hinges
The nomenclature of the concept of hinges basically comes from the hinges seen on a door. For a door's frame, the mid-hinge is at the middle of the height of the frame, whereas the lower and upper hinges are respectively observed midway from the mid-hinge to the bottom and top of the frame. In exploratory analysis, the hinges are defined by arranging the data in increasing order, and to start with, the median is identified as the mid-hinge.
The lower hinge for the (ordered) data is defined as the middle-most observation from the minimum to the median. The upper hinge is defined similarly for the upper part of the data. At first it may appear that the lower and upper hinges are the same as the lower and upper quartiles. Consider the data as the first 10 integers. Here, the median is 5.5, the average of the two middle-most numbers 5 and 6. Using the quantile function on 1:10, it may be checked that here Q1 = 3.25 and Q3 = 7.75. The lower hinge is the middle-most number between 1 and the median 5.5, which turns out to be 3, and the upper hinge is 8. Thus, it may be seen that the hinges are different from the quartiles.
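This difference is easy to verify at the console; a minimal check on the first 10 integers (quantile and fivenum are both base R functions):
> quantile(1:10)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 
> fivenum(1:10)   # minimum, lower hinge, median, upper hinge, maximum
[1]  1.0  3.0  5.5  8.0 10.0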
An extension of the concept of hinges will be seen in the Letter values section.
We will next look at exploratory measures of dispersion.
The interquartile range
Range, the difference between the minimum and maximum of the data values, is one measure of the spread of the variable. This measure is susceptible to the extreme points. The interquartile range, abbreviated as IQR, is defined as the difference between the upper and lower quartile, that is:
IQR = Q3 - Q1
The R function IQR calculates the IQR for a given numeric object. All the concepts theoretically described up to this point will now be put into action.
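As a quick sanity check before the main example, the IQR function agrees with the difference of the quartiles returned by quantile; a minimal sketch on 1:10:
> IQR(1:10)
[1] 4.5
> unname(quantile(1:10, 0.75) - quantile(1:10, 0.25))
[1] 4.5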
Time for action – the essential summary statistics for "The Wall" dataset
We will understand the summary measures for EDA through the data on centuries scored by Rahul Dravid in Test matches.
1. Load the useful EDA package: library(LearnEDA).
2. Load the dataset TheWALL from the RSADBE package: data(TheWALL).
3. Obtain the quantiles of the centuries with quantile(TheWALL$Score), and the differences between the quantiles using the diff function: diff(quantile(TheWALL$Score)). The output is as follows:
> quantile(TheWALL$Score)
    0%    25%    50%    75%   100% 
 100.0  111.8  140.0  165.8  270.0 
> diff(quantile(TheWALL$Score))
   25%    50%    75%   100% 
 11.75  28.25  25.75 104.25
As we are considering Rahul Dravid's centuries only, the starting point is 100. The median of his centuries is 140.0, and the first and third quartiles are respectively 111.8 and 165.8. The median of 140 runs can be interpreted as follows: given that The Wall scores a century, there is a 50 percent chance of the score reaching 140 runs or more. His highest ever score is, of course, 270. Interpret the differences between the quantiles.
4. The percentiles of Dravid's centuries can be obtained by using the quantile function again: quantile(TheWALL$Score, seq(0,1,.1)); here, seq(0,1,.1) creates a vector which increases in increments of 0.1 beginning with 0 until 1. The inter-differences between the percentiles are found with diff(quantile(TheWALL$Score, seq(0,1,.1))):
> quantile(TheWALL$Score,seq(0,1,.1))
   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
100.0 103.5 111.0 116.0 128.0 140.0 146.0 154.0 180.0 208.5 270.0 
> diff(quantile(TheWALL$Score,seq(0,1,.1)))
 10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 3.5  7.5  5.0 12.0 12.0  6.0  8.0 26.0 28.5 61.5
The Wall is also known for his resolve to perform well in away Test matches. Let us verify that using the data on his century scores.
5. The number of Home and Away centuries is obtained using the table function. Further, we obtain a boxplot of the home and away centuries.
> table(HA_Ind)
HA_Ind
Away Home 
  21   15 
The R function table returns the frequencies of the various categories of a categorical variable. In fact, it is more versatile and can obtain frequencies for more than one categorical variable. His reputation in away Test matches is partly confirmed by the fact that 21 of his 36 centuries came in away Tests. However, the boxplot says otherwise:
> boxplot(Score~HA_Ind,data=TheWALL)
Figure 1: Box plot for Home/Away centuries of The Wall
It may be tempting for The Wall's fans to believe that if they remove the outliers of scores above 200, the result may show that his performance in away Test centuries is better than or equal to the home ones. However, this is not the case, which may be verified as follows.
6. Generate the boxplot for centuries whose score is less than 200 with boxplot(Score~HA_Ind, subset=(Score<200), data=TheWALL).
Figure 2: Box plot for Home/Away centuries of The Wall (less than 200 runs)
What do you conclude from the preceding screenshot?
7. The fivenum summary for the centuries is:
> fivenum(TheWALL$Score)
[1] 100.0 111.5 140.0 169.5 270.0
The fivenum function returns the minimum, lower hinge, median, upper hinge, and maximum values of the input data. The numbers 111.5 and 169.5 are the lower and upper hinges, and it may be seen that they are certainly different from the lower and upper quartiles of 111.8 and 165.8. Thus far, we have focused on measures of location, so let us now look at some measures of dispersion.
The range function in R actually returns the minimum and maximum values of the data. Thus, to obtain the range as a measure of spread, we get it using diff(range()). We use the IQR function to obtain the interquartile range.
8. Using the range, diff, and IQR functions, obtain the spread of Dravid's centuries as follows:
> range(TheWALL$Score)
[1] 100 270
> diff(range(TheWALL$Score))
[1] 170
> IQR(TheWALL$Score)
[1] 54
> IQR(TheWALL$Score[TheWALL$HA_Ind=="Away"])
[1] 36
> IQR(TheWALL$Score[TheWALL$HA_Ind=="Home"])
[1] 63.5
Here, we extract the away and home centuries from Score by selecting only those elements of Score for which HA_Ind equals "Away" or "Home" respectively.
What just happened?
The data summaries in the EDA framework are slightly different. Here, we first used the quantile function to obtain the quartiles and the deciles (10 percent steps) of a numeric variable. The diff function has been used to find the difference between consecutive elements of a vector. The boxplot function has been used to compare the home and away Test centuries, which led to the conclusion that the median score of Dravid's centuries at home is higher than that of the away centuries. Restricting to Test centuries under 200 runs further confirmed, in particular, that Dravid's centuries at home have a higher median value than those in away series, and in general that the median is robust to outliers.
The IQR function gives us the interquartile range for a vector, and the fivenum function gives us the hinges. Though intuitively it appears that hinges and quartiles are similar, this is not always true. In this section, you also learned the usage of functions such as quantile, fivenum, IQR, and so on.
We will now move on to the main techniques of exploratory analysis.
The stem-and-leaf plot
The stem-and-leaf plot is considered one of the seven important tools of Statistical Process Control (SPC); refer to Montgomery (2005). It is somewhat similar in nature to the histogram.
The stem-and-leaf plot is an effective method of displaying data in a (partial) tree form. Here, each datum is split into two parts: the stem part and the leaf part. In general, the last digit of a datum forms the leaf part and the rest forms the stem. Now, consider a datum 235. If the split criterion is the units place, the stem and leaf parts will respectively be 23 and 5; if it is tens, then 2 and 3; and finally, if it is hundreds, they will be 0 and 2. The left-hand side of the split datum is called the leading digits and the right-hand side the trailing digits.
In the next step, all the possible leading digits are arranged in increasing order. This includes even those stems for which we may not have data for the leaf part, which ensures that the final stem-and-leaf plot truly depicts the distribution of the data. All the possible leading digits are called stems. The leaves are then displayed to the right-hand side of the stems, and for each stem the leaves are again arranged in increasing order.
Example 4.2.1. A simple illustration: Consider a dataset of eight elements: 12, 22, 42, 13, 27, 46, 25, and 52. The leading digits for this data are 1, 2, 4, and 5. On inserting 3, the leading digits complete the required stems of 1 to 5. The leaves for stem 1 are 2 and 3. The unordered leaves for stem 2 are 2, 7, and 5; the displayed leaves for stem 2 are then 2, 5, and 7. There are no leaves for stem 3. Similarly, the leaves for stems 4 and 5 are respectively the sets {2, 6} and {2}. The stem function in R will be used for generating the stem-and-leaf plots.
Example 4.2.2. Octane Rating of Gasoline Blends: (Continued from the Visualization techniques for continuous variable data section of Chapter 3, Data Visualization): In the earlier study, we used the histogram for understanding the octane ratings under two different methods. We will use the stem function in the forthcoming Time for action – the stem function in play section for displaying the octane ratings under Method_1 and Method_2.
Tukey (1977), being the benchmark book for EDA, produces the stem-and-leaf plot in a slightly different style. For example, the stem plots for Method_1 and Method_2 are better understood if we can put the two displays adjacent to each other instead of one below the other. It is possible to achieve this using the stem.leaf.backback function from the aplpack package.
Time for action – the stem function in play
The R function stem from the base package and stem.leaf.backback from aplpack are fundamental for our purpose of creating stem-and-leaf plots. We will illustrate these two functions for the examples discussed earlier.
1. As mentioned in Example 4.2.1, A simple illustration, first create the x vector: x <- c(12,22,42,13,27,46,25,52).
2. Obtain the stem-and-leaf plot with:
> stem(x)
The decimal point is 1 digit(s) to the right of the |
1 | 23
2 | 257
3 |
4 | 26
5 | 2
To obtain the median from the stem display, we proceed as follows. Remove one point from each side of the display: first we remove 52 and 12, then 46 and 13. The trick is to proceed until we are left with either one point or two. In the former case, the remaining point is the median, and in the latter case, we simply take the average of the two. Finally, we are left with 25 and 27, and hence their average, 26, is the median of x.
We will now look at the octane dataset.
3. Obtain the stem plots for both methods: data(octane), stem(octane$Method_1, scale=2), and stem(octane$Method_2, scale=2).
The output will be similar to the following screenshot:
Figure 3: The stem plot for the octane dataset (R output edited)
Of course, the preceding screenshot has been edited. To generate such a back-to-back display, we need a different function.
4. Using the stem.leaf.backback function from aplpack, with library(aplpack) and stem.leaf.backback(octane$Method_1, octane$Method_2, back.to.back=FALSE, m=5), we get the output in the desired format.
Figure 4: Tukey's stem plot for octane data
The preceding screenshot has many unexplained, and a bit mysterious, symbols! Prof. J. W. Tukey took a very pragmatic approach when developing EDA. You are strongly encouraged to read Tukey (1977), as this brief chapter barely does justice to it. Note that 18 of the 32 observations for Method_1 are in the range 80.4 to 89.3. Now, if we have stems only at 8, 9, and 10, the count at stem 8 will be 18, which will not give a meaningful understanding of the data. The stems can have sub-stems, or be stretched out, and for this a very novel solution is provided by Prof. Tukey. For very high-frequency stems, the solution is to split the stem into five sub-stems. For the stem 8 here, we have trailing digits at 0, 1, 2, ..., 9. Now, adopt a scheme of tagging lines which leads towards a clear reading of the stem-and-leaf display. Tukey suggests using * for zero and one, t for two and three, f for four and five, s for six and seven, and a period (.) for eight and nine. Truly ingenious! Thus, if you are planning to write about stem-and-leaf plots in your local language, you may not require *, t, f, s, and "." at all! Go back to the preceding screenshot and now it looks much more beautiful.
Following the leaf part for each method, we are also given cumulative frequencies from the top and from the bottom. Why? We know that the stem-and-leaf display has increasing values from the top and decreasing values from the bottom. In this particular example, we have n: 32 observations. Thus, in sorted order, we know that the median is a value between the sixteenth and seventeenth sorted observations. The cumulative frequencies, when they exceed 16 from either direction, lead to the median. This is indicated by (2) for Method_1 and (6) for Method_2. Can you now make an approximate guess of the median values? Obviously, depending on the dataset, we may require m = 5, 1, or 2.
We have used the argument back.to.back=FALSE to ensure that the two stem-and-leaf displays can be seen independently. Now, it is fairly easy to compare the two displays by setting back.to.back=TRUE, in which case the stem line will be common for both methods and thus we can simply compare their leaf distributions. That is, you need to run stem.leaf.backback(octane$Method_1, octane$Method_2, back.to.back=TRUE, m=5) and investigate the results.
We can clearly see that the median for Method_2 is higher than that of Method_1.
What just happened?
Using the basic stem function and stem.leaf.backback from aplpack, we got two efficient exploratory displays of the datasets. The latter function can be used to compare two stem-and-leaf displays. Stems can be further squeezed to reveal more information with the option m set to 1, 2, or 5.
We will next look at the EDA technique which extends the scope of hinges.
Letter values
The median, quartiles, and the extremes (maximum and minimum) indicate how the data is spread over its range. These values can be used to examine two or more samples. There is another way of understanding the data, offered by letter values. This small journey begins with a concept called depth, which measures the minimum position of a datum in the ordered sample counted from either of the extremes. Thus, the extremes have a depth of 1, the second largest and second smallest datum have a depth of 2, and so on.
Now, consider a sample of size n, assumed to be an odd number for convenience's sake. Then, the depth of the median is (n + 1)/2. The depth of a datum is denoted by d, and for the median it is indicated by d(M). Since the hinges, lower and upper, do to the two halves of the sample (split at the median) what the median does to the whole sample, the depth of the hinges is given by d(H) = ([d(M)] + 1)/2. Here, [ ] denotes the integer part of the argument. As the hinges, including the mid-hinge which is the median, divide the data into four equal parts, we can define the eighths as the values which divide the data into eight equal parts. The eighths are denoted by E. The depth of the eighths is given by the formula d(E) = ([d(H)] + 1)/2. It may be seen that the depths of the median, hinges, and eighths depend on the sample size.
Using the eighths, we can further carry out the division of the data to obtain the sixteenths, then the thirty-seconds, and so on. The process of division continues until we end up at the extremes, where we cannot proceed with the division any longer. The letter values continue the search until we end at the extremes. The process of division is well understood when we list the lower and upper values for the hinges, the eighths, the sixteenths, the thirty-seconds, and so on. The difference between the lower and upper values of these metrics (the spread), together with their average (the mid, a concept similar to the mid-range), is also useful for understanding the data.
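A minimal sketch of these depth formulas, using the sample size n = 32 of the octane data (compare with the depth column of the lval output shown shortly):
n <- 32
d_M <- (n + 1)/2              # depth of the median: 16.5
d_H <- (floor(d_M) + 1)/2     # depth of the hinges: 8.5
d_E <- (floor(d_H) + 1)/2     # depth of the eighths: 4.5
c(median = d_M, hinge = d_H, eighth = d_E)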
The R function lval from the LearnEDA package gives the letter values for the data.
Example 4.3.1. Octane Rating of Gasoline Blends (Continued): We will now obtain the letter values for the octane dataset:
> library(LearnEDA)
> lval(octane$Method_1)
depth lo hi mids spreads
M 16.5 88.00 88.0 88.000 0.00
H 8.5 85.55 91.4 88.475 5.85
E 4.5 84.25 94.2 89.225 9.95
D 2.5 82.25 101.5 91.875 19.25
C 1.0 80.40 105.5 92.950 25.10
> lval(octane$Method_2)
depth lo hi mids spreads
M 16.5 97.20 97.20 97.200 0.00
H 8.5 93.75 99.60 96.675 5.85
E 4.5 90.75 101.60 96.175 10.85
D 2.5 87.35 104.65 96.000 17.30
C 1.0 83.30 106.60 94.950 23.30
The leer values, look at the lo and hi in the preceding code, clearly show that the
corresponding values for Method_1 are always lower than those under Method_2.
Parcularly, note that the lower hinge of Method_2 is greater than the higher hinge
of Method_1. However, the spread under both the methods are very idencal.
Data re-expression
The presence of an outlier or over-spread of the data may lead to an incomplete picture in a graphical display, and hence statistical inference may be inappropriate in such scenarios. In many of these scenarios, re-expression of the data on another scale may be more useful; refer to Chapter 3, The Problem of Scale, of Tukey (1977). Here, we list the scenarios from Tukey where data re-expression may help circumvent the limitations cited at the beginning.
The rst scenario where re-expression is useful is when the variables assume non-negave
values, that is, the variables never assume a value lesser than zero. Examples of such
variables are age, height, power, area, and so on. A thumb rule for the applicaon of
re-expression is when the rao of the largest to the smallest value in the data is very large,
say 100. This is one reason that most regression analysis variables such as age are almost
always re-expressed on the logarithmic scale.
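A minimal sketch of the thumb rule on a hypothetical non-negative variable:
x <- c(3, 8, 15, 40, 120, 450, 900)   # hypothetical skewed, non-negative data
max(x)/min(x)                         # 300, well above 100: re-expression indicated
log10(x)                              # the log scale compresses the long right tail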
The second scenario explained by Tukey concerns variables like balances and profit-and-loss. If there is a deposit to an account, the balance increases, and if there is a withdrawal, it decreases. Thus, such variables can assume positive as well as negative values. Re-expression of variables like balances rarely helps; re-expression of the amounts or quantities before the subtraction helps on some occasions. Fractions and percentage counts form the third scenario where re-expression of the data is useful, though special techniques are needed. The scenarios mentioned are indicative and not exhaustive. We will now look at the data re-expression techniques which are useful.
Example 4.4.1. Re-expression for the Power of 62 Hydroelectric Stations: We need to understand the distribution of the ultimate power, in megawatts, of 62 hydroelectric stations and power stations of the Corps of Engineers. The data for our illustration has actually been regenerated from Exhibit 3 of Tukey (1977). First, we simply look at the stem-and-leaf display of the original data on the power of the 62 hydroelectric stations. We use the stem.leaf function from the aplpack package.
> hydroelectric <- c(14,18,28,26,36,30,30,34,30,43,45,54,52,60,68,
+ 68,61,75,76,70,76,86,90,96,100,100,100,100,100,100,110,112,
+ 118,110,124,130,135,135,130,175,165,140,250,280,204,200,270,
+ 40,320,330,468,400,518,540,595,600,810,810,1728,1400,1743,2700)
> stem.leaf(hydroelectric, unit=1)
1 | 2: represents 12
leaf unit: 1
n: 62
2 1 | 48
4 2 | 68
9 3 | 00046
24 9 | 06
30 10 | 000000
(4) 11 | 0028
28 12 | 4
27 13 | 0055
23 14 | 0
15 |
22 16 | 5
21 17 | 5
18 |
19 |
20 20 | 04
21 |
24 |
7 60 | 0
HI: 810 810 1400 1728 1743 2700
The data begins with values as low as 14, grows modestly into the hundreds, such as 100, 135, and so on, then grows into the five hundreds and finally literally explodes into the thousands, running up to 2700. If all the leading digits had to be mandatorily displayed, we would have 270 of them. With an average of 35 lines per page, the output would require approximately eight pages, and between the last two values of 1743 and 2700 we would have roughly 100 empty leading digits. The stem.leaf function has therefore collapsed all the leading digits after the hydroelectric plant producing 600 megawatts into the HI list.
Let us look at the ratio of the largest to the smallest value, which is as follows:
> max(hydroelectric)/min(hydroelectric)
[1] 192.8571
By the thumb rule, this is an indication that a data re-expression is in order. Thus, we take the log transformation (with base 10) and obtain the stem-and-leaf display of the transformed data.
> stem.leaf(round(log(hydroelectric,10),2), unit=0.01)
1 | 2: represents 0.12
leaf unit: 0.01
n: 62
1 11 | 5
2 12 | 6
13 |
24 19 | 358
(11) 20 | 00000044579
27 21 | 11335
22 22 | 24
20 23 | 110
30 |
4 31 | 5
3 32 | 44
HI: 3.43
The compactness of the stem-and-leaf display for the transformed data is indeed more useful, and we can further see that there are only about 30 leading digits. The display is also more elegant and comprehensible.
Have a go hero
The stem-and-leaf plot is, from a certain point of view, considered a particular case of the histogram. You can attempt to understand the hydroelectric distribution using a histogram too. First, obtain the histogram of the hydroelectric variable, and then repeat the exercise on its logarithmic re-expression.
Bagplot – a bivariate boxplot
In Chapter 3, Data Visualization, we saw the effectiveness of the boxplot. For independent variables, we can simply draw separate boxplots of the variables and visualize their distributions. However, when there is dependency between two variables, distinct boxplots lose the dependency between them. Thus, we need to see if there is a way to visualize such data through a boxplot. The answer to this question is provided by the bagplot, or bivariate boxplot.
The bagplot's characteristics are described in the following steps (a minimal single-pair sketch follows the list):
The depth median, denoted by * in the bagplot, is the point with the highest half space depth.
The depth median is surrounded by a polygon, called the bag, which covers n/2 observations with the largest depth.
The bag is then magnified by a factor of 3, which gives the fence. The fence is not plotted since it would drive attention away from the data.
The observations between the bag and the fence are covered by a loop.
Points outside the fence are flagged as outliers.
For the technical details of the bagplot, refer to the paper (http://venus.unive.it/romanaz/ada2/bagplot.pdf) by Rousseeuw, Ruts, and Tukey (1999).
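As promised, a minimal sketch of a single bagplot, assuming the aplpack package is installed and that the Gasoline data frame contains the mileage y and the (numeric) displacement x1:
library(aplpack)
# draw the bag, loop, and any outliers for mileage against displacement
bagplot(Gasoline$x1, Gasoline$y)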
Example 4.5.1. Bagplot for the Gasoline mileage problem: The pairs plot of the gasoline mileage problem in Example 3.2.9, The Gasoline mileage performance data, gave good insight into the nature of the data. Now, we will modify that plot and replace the upper panel with the bagplot function for a cleaner comparison of the bagplot with the scatter plot. However, in the original dataset, the variables x4, x5, and x11 are factors, which we remove from the bagplot study. The bagplot function is available in the aplpack package. We first define panel.bagplot, and then generate the matrix of scatter plots with the bagplots produced in the upper part of the display.
Time for action – the bagplot display for a multivariate dataset
1. Load the aplpack package with library(aplpack).
2. Check the default examples of the bagplot function with example(bagplot).
3. Create the panel.bagplot function with:
panel.bagplot <- function(x,y)
{
  require(aplpack)
  bagplot(x, y, verbose=FALSE, create.plot=TRUE, add=TRUE)
}
Here, the panel.bagplot function is defined to enable us to obtain the bagplot in the upper panel region of the pairs function.
4. Apply the panel.bagplot function within the pairs function on the Gasoline dataset: pairs(Gasoline[-19,-c(1,4,5,13)], upper.panel=panel.bagplot).
We obtain the following display:
Figure 5: Bagplot for the Gasoline dataset
What just happened?
We created the panel.bagplot function and plugged it into the pairs function for an effective display of the multivariate dataset. The bagplot is an important EDA tool for getting exploratory insight in the important case of a multivariate dataset.
The resistant line
In Example 3.2.3, The Michelson-Morley experiment, of Chapter 3, Data Visualization, we visualized data through the scatter plot, which indicates possible relationships between the dependent variable (y) and the independent variable (x). The scatter plot, or x-y plot, is again an EDA technique. However, we would like a more quantitative model which explains the interrelationship in a more precise manner. The traditional approach will be taken up in Chapter 7, The Logistic Regression Model. In this section, we take an EDA approach to building our first regression model.
Consider n pairs of observations: (x1, y1), (x2, y2), ..., (xn, yn). We can easily visualize the data using the scatter plot. We need to obtain a model of the form y = a + bx, where a is the intercept term and b is the slope. This model is an attempt to explain the relationship between the variables x and y. Basically, we need to obtain the values of the slope and intercept from the data. In most real data, a single line will not pass through all the n pairs of observations. In fact, it is a difficult task for the determined line to pass through even a very few observations. As a simple approach, we may choose any two observations and determine the slope and intercept. However, the difficulty lies in the choice of the two points. We will now explain how the resistant line determines the two required terms.
The scatter plot (part A of the next figure) is divided into three regions, using the x-values, where each region has approximately the same number of data points; refer to part B of the next figure. The three regions, from the left-hand to the right-hand side, are called the lower, middle, and upper regions. Note that the y-values are distributed among the three regions according to their x-values. Hence, it is possible for some y-values of the lower region to be higher than a few values in the upper region. Within each region, we find the medians of the x- and y-values independently. That is, for the lower region, the median yL is determined by the y-values falling in this region, and similarly, the median xL is determined by the x-values of the region. Similarly, the medians xM, xH, yM, and yH are determined; refer to part C of the next figure. Using these median values, we now form three pairs: (xL, yL), (xM, yM), and (xH, yH). Note that these pairs need not be actual data points.
To determine the slope b, two points suffice. The resistant line determines the slope by using the two outer pairs (xL, yL) and (xH, yH). Thus, we obtain the following:
b = (yH - yL) / (xH - xL)
For obtaining the intercept value a, we use all three pairs of medians. The value of a is determined using:
a = (1/3) * [(yL + yM + yH) - b (xL + xM + xH)]
Note that the properties of the median are exactly what make these solutions resistant. For example, the lower and upper medians are not affected by outliers at the extreme ends.
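To make the construction concrete, here is a minimal sketch of the summary-point computations on a small hypothetical dataset (the rline function used later additionally iterates on the residuals, so its answers will generally differ slightly):
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)                     # hypothetical predictor, already sorted
y <- c(2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8, 10.3)  # hypothetical response
n <- length(x)
region <- cut(seq_len(n), breaks = 3, labels = c("L", "M", "H"))  # thirds by x-order
xL <- median(x[region == "L"]); yL <- median(y[region == "L"])
xM <- median(x[region == "M"]); yM <- median(y[region == "M"])
xH <- median(x[region == "H"]); yH <- median(y[region == "H"])
b <- (yH - yL) / (xH - xL)                       # slope from the outer summary points
a <- ((yL + yM + yH) - b * (xL + xM + xH)) / 3   # intercept from all three summary points
c(intercept = a, slope = b)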
Figure 6: Understanding the resistant line. Part A shows the x-y scatter plot; part B divides it into three portions by x-values; part C marks the region medians xL, xM, xH and yL, yM, yH; part D shows how the a and b values are obtained from b = (yH - yL)/(xH - xL) and a = (1/3)[(yL + yM + yH) - b(xL + xM + xH)].
We will use the rline function from the LearnEDA package.
Example 4.6.1. Resistant line for the IO-CPU time: The CPU time is known to depend on the number of IO processes running at any given point of time. A simple dataset is available at http://www.cs.gmu.edu/~menasce/cs700/files/SimpleRegression.pdf. We aim at fitting a resistant line for this dataset.
Time for action – the resistant line as a first regression model
We use the rline function from the LearnEDA package for fitting the resistant line on a dataset.
1. Load the LearnEDA package: library(LearnEDA).
2. Understand the default example with example(rline).
3. Load the dataset: data(IO_Time).
4. Create the IO_rline resistant line object, for CPU_Time as the output and No_of_IO as the input, with IO_rline <- rline(IO_Time$No_of_IO, IO_Time$CPU_Time, iter=10) for 10 iterations.
5. Find the intercept and slope with IO_rline$a and IO_rline$b. The output will then be:
> IO_rline$a
[1] 0.2707
> IO_rline$b
[1] 0.03913
6. Obtain the scatter plot of CPU_Time against No_of_IO with plot(IO_Time$No_of_IO, IO_Time$CPU_Time).
7. Add the resistant line to the generated scatter plot with abline(a=IO_rline$a, b=IO_rline$b).
8. Finally, give a title to the plot: title("Resistant Line for the IO-CPU Time").
We then get the following screenshot:
Figure 7: Resistant line for CPU_Time
What just happened?
The rline function from the LearnEDA package fits a resistant line given the input and output vectors. It calculated the slope and intercept terms, which are driven by medians. The main advantage of the rline fit is that the model is not susceptible to outliers. We can see from the preceding screenshot that the resistant line model, IO_rline, provides a very good fit for the dataset. Well! You have created your first exploratory regression model.
Smoothing data
In The resistant line section, we constructed our first regression model for the relationship between two variables. In some instances, the x-values are so systematic that their values are almost redundant, and yet we need to understand the behavior of the y-values with respect to them. Consider the case where the x-values are equally spaced; the share price (y) at the end of each day (x) is an example where the difference between two consecutive x-values is exactly one. Here, we are more interested in smoothing the data along the y-values, as one expects more variation in their direction. Time series data is a very good example of this type. In time series data, consecutive x-values typically differ by exactly one, and hence we can describe the data compactly by yt, t = 1, 2, .... The general model may then be specified by:
yt = a + bt, t = 1, 2, ...
In the standard EDA notation, this is simply expressed as:
data = fit + residual
In the context of time series data, the model is succinctly expressed as:
data = smooth + rough
The fundamental concept of the smoothing data technique makes use of running medians. In a freehand curve, we can simply draw a smooth curve using our judgment, ignoring the out-of-curve points, and complete the picture. A computer, however, needs specific instructions for obtaining the smooth points across which it draws the curve. For a sequence of points, such as the sequence yt, the smoothing needs to be carried out over a sequence of overlapping segments. Such segments are predefined to be of a specific length. As a simple example, we may have three-length overlapping segments {y1, y2, y3}, {y2, y3, y4}, {y3, y4, y5}, and so on. Along similar lines, four-length or five-length overlapping segment sequences may be defined as required. It is within each segment that the smoothing needs to be carried out. Two popular choices of summary are the mean and the median. Of course, in exploratory analysis our natural choice is the median. Note that the median of the segment {y1, y2, y3} may be any of the y1, y2, or y3 values.
General smoothing techniques, such as LOESS, are nonparametric techniques and require good expertise in the subject. The ideas discussed here are mainly driven by the median as the core technique.
A three-moving median cannot correct for two or more consecutive outliers and, similarly, a five-moving median cannot correct for three or more consecutive outliers, and so on. A solution, or workaround in an engineer's language, for this is to continue smoothing the sequence obtained in the previous iteration until there is no further change in the smooth part. We may also consider a moving median of span 4; here, the median is the average of the two mid-points. However, considering that the x-values are integers, the four-moving median does not actually correspond to any of the time points t. Using the simplicity principle, it is easily possible to re-center the points at t by taking a two-moving median of the values obtained in the four-moving-median step.
The notation for the first iteration in EDA is simply the number 3, or 5 or 7 as used. The notation for repeated smoothing is 3R, where R stands for repetitions. For a four-moving median re-centered by a two-moving median, the notation is 42. On many occasions, a smoothing operation giving more refinement than 42 may be desired. It is on such occasions that we may use the running weighted average, which gives different weights to the points within a span. Here, each point is replaced by a weighted average of the neighboring points. A popular choice of weights for a running weighted average of 3 is (1/4, 1/2, 1/4), and this smoothing process is referred to as hanning. The hanning process is denoted by H.
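A minimal sketch of hanning on a short hypothetical sequence, using stats::filter to compute the running weighted average with weights (1/4, 1/2, 1/4):
y <- c(5, 7, 6, 9, 15, 8, 7, 6, 8)                    # hypothetical rough sequence
y_han <- stats::filter(y, filter = c(1/4, 1/2, 1/4))  # centered weighted average of span 3
cbind(original = y, hanned = y_han)                   # the two end points remain NA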
Since the running median smoothens the data sequence a bit more than appropriate, and hence may remove interesting patterns, those patterns can be recovered from the residuals, which in this context are called the rough. This is achieved by smoothing the rough sequence and adding it back to the smooth sequence. This operation is called reroughing. Velleman and Hoaglin (1984) point out that the smoothing process which performs better in general is 4253H, twice. That is, we first begin with a running median of 4, which is re-centered by 2. The re-smoothing is then done by 5 followed by 3, and the outliers are removed by H. Finally, reroughing is carried out by smoothing the roughs and then adding them to the smoothed sequence. This full cycle is denoted by 4253H, twice. Unfortunately, we are not aware of any R function or package which implements the 4253H smoother. The options available in the LearnEDA package are 3RSS and 3RSSH.
We have not yet explained what the smoothers 3RSS and 3RSSH are. The 3R smoothing chops off peaks and valleys and leaves behind mesas and dales two points long. What does this mean? A mesa refers to an area of high land with a flat top and two or more steep, cliff-like sides, whereas a dale refers to a valley. To overcome this problem, a special splitting, S, is used at each two-point mesa and dale, where the data is split into three pieces: the two-point flat segment, the smooth data to the left of the two points, and the smooth sequence to their right. Now, let yf-1, yf refer to the two-point flat segment, and yf+1, yf+2 refer to the smooth sequence next to it. The S technique then predicts the value of yf-1 as if it were on the straight line formed by yf+1 and yf+2; a simple way is to obtain yf-1 as 3yf+1 - 2yf+2. The yf value is then obtained as the median of the predicted yf-1, yf+1, and yf+2. After removing all the mesas and dales, we again repeat the 3R cycle. Thus, we have the notation 3RSS, and the reader can now easily connect with what 3RSSH means. Now, we will obtain the 3RSS smooth for the cow temperature dataset of Velleman and Hoaglin.
Example 4.7.1. Smoothing data for the cow temperature dataset: The temperature of a cow is measured at 6:30 a.m. for 75 consecutive days. We will use the smooth function from the stats package and the han function from the LearnEDA package to achieve the required smoothing sequence. We will build the necessary R program in the forthcoming action list.
Time for action – smoothening the cow temperature data
First, we use the smooth function from the stats package on the cow temperature dataset. Next, we will use the han function from LearnEDA.
1. Load the cow temperature data in R with data(CT).
2. Plot the time series data using the plot.ts function: plot.ts(CT$Temperature, col="red", pch=1).
3. Create a 3RSS object for the cow temperature data using the smooth function and the kind option: CT_3RSS <- smooth(CT$Temperature, kind="3RSS").
4. Han the preceding 3RSS object using the han function from the LearnEDA package: CT_3RSSH <- han(smooth(CT$Temperature, kind="3RSS")).
5. Superimpose a line of the 3RSS data points with lines.ts(CT_3RSS, col="blue", pch=2).
6. Superimpose a line of the hanned 3RSS data points with lines.ts(CT_3RSSH, col="green", pch=3).
7. Add a meaningful legend to the plot: legend(20, 90, c("Original","3RSS","3RSSH"), col=c("red","blue","green"), pch="___").
We get a useful smoothened plot of the cow temperature data as follows:
Smoothening cow temperature data
The original plot shows a lot of variation in the cow temperature measurements. The 3RSS smoother shows many sharp edges in comparison with the 3RSSH smoother, though it is itself a lot smoother than the original display. The plot further indicates that there has been a marked decrease in the cow temperature measurements from about the fifteenth day of observation. This is confirmed by all three displays.
What just happened?
The discussion of the smoothing functions looked very promising in the theoretical development. We took a real dataset and saw its time series plot. Then we plotted two versions of the smoothing process and found both to be much smoother than the original plot.
Median polish
In Example 4.6.1, Resistant line for the IO-CPU time, we had IO as the only independent variable explaining the variation in the CPU time. In many practical problems, the dependent variable depends on more than one independent variable. In such cases, we need to factor in the effect of these independent variables using a single model. When we have two independent variables, median polish helps in building a robust model. A data display in which the rows and columns hold the different levels of two factors is called a two-way table. Here, the table entries are the values of the dependent variable.
An appropriate model for the two-way table is given by:
yij = α + βi + γj + εij
Here, α is the intercept term, βi denotes the effect of the i-th row, γj the effect of the j-th column, and εij is the error term. All the parameters are unknown. We need to find the unknown parameters through the EDA approach. The basic idea is to use row medians and column medians for obtaining the row and column effects, and then find the overall intercept term. Any unexplained part of the data is considered as the residual.
Time for action – the median polish algorithm
The median polish algorithm (refer to http://www-rohan.sdsu.edu/~babailey/stat696/medpolishalg.html) is given next:
1. Obtain the row medians of the two-way table and append them to the right-hand side of the data matrix. From each element of every row, subtract the respective row median.
2. Find the median of the row medians and record it as the initial grand effect value. Also, subtract the initial grand effect value from each row median.
3. For the original data columns in the previously appended matrix, obtain the column medians and append them to the bottom of the matrix. As in step 1, subtract from each column element the corresponding column median.
4. For the bottom row of column medians in the previous table, obtain the median, and then add the obtained value to the initial grand effect value. Next, subtract the modified grand effect value from each of the column medians.
5. Iterate steps 1-4 until the changes in the row or column medians are insignificant.
We use the medpolish function from the stats library for the computations involved in median polish. For more details about the model, you can refer to Chapter 8 of Velleman and Hoaglin (1984).
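Before the real dataset, a minimal sketch of medpolish on a tiny hypothetical 3 x 3 two-way table helps connect the preceding algorithm to the R output:
toy <- matrix(c(10, 12, 14,
                11, 13, 16,
                 9, 12, 13),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("Row", 1:3), paste0("Col", 1:3)))
toy_mp <- medpolish(toy)   # iterates the row/column median sweeps
toy_mp$overall             # grand effect
toy_mp$row                 # row effects
toy_mp$col                 # column effects
toy_mp$residuals           # what the additive fit leaves behind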
Example 4.8.1. Male death rates: The dataset relates to the male death rate per 1000, by cause of death and the average amount of tobacco smoked daily, and is available on page 221 of Velleman and Hoaglin (1984). Here, the row effect is due to the cause of death, whereas the columns constitute the amount of tobacco smoked (in grams). We are interested in modeling the effect of these two variables on the male death rates in the region.
> data(MDR)
> MDR2 <- as.matrix(MDR[,2:5])
> rownames(MDR2) <- c("Lung", "UR", "Sto", "CaR", "Prost", "Other_Lung", "Pul_TB", "CB", "RD_Other", "CT", "Other_Cardio", "CH", "PU", "Viol", "Other_Dis")
> MDR_medpol <- medpolish(MDR2)
1 : 8.38
2 : 8.17
Final: 8.1625
> MDR_medpol$row
        Lung           UR          Sto          CaR        Prost   Other_Lung 
      0.1200      -0.4500      -0.2800      -0.0125      -0.3050       0.2050 
      Pul_TB           CB     RD_Other           CT Other_Cardio           CH 
     -0.3900      -0.2050       0.0000       4.0750       1.6875       1.4725 
          PU         Viol    Other_Dis 
     -0.4325       0.0950       0.9650 
> MDR_medpol$col
     G0     G14     G24     G25 
-0.0950  0.0075 -0.0050  0.1350 
> MDR_medpol$overall
[1] 0.545
> MDR_medpol$residuals
G0 G14 G24 G25
Lung -0.5000 -0.2025 2.000000e-01 0.8600
UR 0.0000 0.0275 0.000000e+00 -0.0200
Sto 0.2400 0.0875 -1.600000e-01 -0.0900
CaR 0.0025 0.0000 -1.575000e-01 0.0725
Prost 0.4050 0.0125 -1.500000e-02 -0.0350
Other_Lung -0.0150 -0.0375 1.500000e-02 0.1350
Pul_TB -0.0600 -0.0025 3.000000e-02 0.0000
CB -0.1250 -0.0575 5.500000e-02 0.2450
RD_Other 0.2400 -0.0025 1.387779e-17 -0.2800
CT -0.3050 0.0125 -1.500000e-02 1.2350
Other_Cardio 0.0925 -0.0900 2.425000e-01 -0.1175
CH 0.0875 -0.0850 -1.525000e-01 0.1775
PU -0.0175 0.0200 5.250000e-02 -0.0275
Viol -0.1250 0.1725 -1.850000e-01 0.1250
Other_Dis 0.0350 0.2925 -3.500000e-02 -0.0750
What just happened?
The output associated with MDR_medpol$row gives the row effects, while MDR_medpol$col gives the column effects. The negative value of -0.0950 for the non-consumers of tobacco shows that the male death rate is lower for this group, whereas the positive values of 0.0075 and 0.1350 for the groups under 14 grams and above 25 grams respectively are an indication that tobacco consumers are more prone to death.
Have a go hero
For the variables G0 and G25 in the MDR2 matrix object, obtain a back-to-back stem-and-leaf display.
Summary
The median and its variants form the core measures of EDA, and you should have got a hang of them by the first section. The visualization techniques of EDA also comprise more than just the stem-and-leaf plot, letter values, and bagplot. As EDA is basically about your attitude and approach, it is important to realize that you can (and should) use any method that is instinctive and appropriate for the data on hand. We have also built our first regression model in the resistant line and seen how robust it is to outliers. Smoothing data and median polish are further advanced EDA techniques with which the reader became acquainted in their respective sections.
EDA is exploratory in nature and its findings may need further statistical validation. The next chapter on statistical inference addresses what Tukey calls confirmatory analysis. In particular, we look at techniques which give good point estimates of the unknown parameters. This is then backed with further techniques such as goodness-of-fit and confidence intervals for the probability distribution and the parameters respectively. After estimation, it is often required to verify whether the parameters meet certain specified levels. This problem is addressed through hypothesis testing in the next chapter.
Statistical Inference
In the previous chapter, we came across numerous tools that gave first
insights of exploratory evidence into the distribution of datasets through visual
techniques as well as quantitative methods. The next step is the translation
of these exploratory results to confirmatory ones and the topics of the
current chapter pursue this goal. In the Discrete distributions and Continuous
distributions sections of Chapter 1, Data Characteristics, we came across many
important families of probability distribution. In practical scenarios, we have
data on hand and the goal is to infer about the unknown parameters of the
probability distributions. This chapter focuses on one method of inference
for the parameters using the maximum likelihood estimator (MLE). Another
way of approaching this problem is by fitting a probability distribution for
the data. The MLE is a point estimate of the unknown parameter that needs
to be supplemented with a range of possible values. This is achieved through
confidence intervals. Finally, the chapter concludes with the important topic
of hypothesis testing.
You will learn the following things after reading through this chapter:
Visualizing the likelihood function and identifying the MLE
Fitting the most plausible statistical distribution for a dataset
Confidence intervals for the estimated parameters
Hypothesis testing of the parameters of a statistical distribution
Using exploratory techniques we had our first exposure to understanding a dataset. As an example, in the octane dataset we found that the median of Method_2 was larger than that of Method_1. As explained in the previous chapter, we need to confirm whatever exploratory findings we obtained from a dataset. Recall that the histograms and stem-and-leaf displays suggest a normal distribution. A question that arises then is how do we assert the central values, typically the mean, of a normal distribution, and how do we conclude that the average of the Method_2 procedure exceeds that of Method_1. The former question is answered by estimation techniques and the latter by testing hypotheses. This forms the core of statistical inference.
Maximum likelihood estimator
Let us consider the discrete probability distributions as seen in the Discrete distributions section of Chapter 1, Data Characteristics. We saw that a binomial distribution is characterized by the parameters n and p, the Poisson distribution by $\lambda$, and so on. Here, the parameters completely determine the probabilities of the x values. However, when the parameters are unknown, which is the case in almost all practical problems, we collect data for the random experiment and try to infer about the parameters. This is essentially inductive reasoning, and the subject of statistics is essentially inductively driven, as opposed to the deductive reasoning of mathematics. This forms the core difference between the two beautiful subjects. Assume that we have n observations $X_1, X_2, \ldots, X_n$ from an unknown probability distribution $f(x, \theta)$, where $\theta$ may be a scalar or a vector whose values are not known. Let us consider a few important definitions that form the core of statistical inference.
Random sample: If the observations $X_1, X_2, \ldots, X_n$ are independent of each other, we say that they form a random sample from $f(x, \theta)$. A technical consequence of the observations forming a random sample is that their joint probability density (mass) function can be written as the product of the individual density (mass) functions. If the unknown parameter $\theta$ is the same for all the n observations, we say that we have an independent and identically distributed (iid) sample.
Let X denote the score of Rahul Dravid in a century innings, and let $X_i$ denote the runs scored in the i-th century, i = 1, 2, ..., 36. The assumption of independence is then appropriate for all the values of $X_i$. Consider the problem of R software installation on 10 different computers of the same configuration. Let X denote the time it takes for the software to install. Here again, it may easily be seen that the installation times on the 10 machines, $X_1, \ldots, X_{10}$, are identical (same configuration of the computers) and independent. We will use the vector notation here to represent a sample of size n, $X = (X_1, X_2, \ldots, X_n)$, for the random variables, and denote the realized values of the random variables in lower case, $x = (x_1, x_2, \ldots, x_n)$, with $x_i$ representing the realized value of the random variable $X_i$. All the required tools are now ready, which enable us to define the likelihood function.
Likelihood function: Let $f(x, \theta)$ be the joint pmf (or pdf) for an iid sample of n observations of X. Here, the pmf and pdf respectively correspond to the discrete and continuous random variables. The likelihood function is then defined by:

$$L(\theta \mid x) = f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Of course, the reader may be amused about the difference between a likelihood function and a pmf (or pdf). The pmf is to be seen as a function of x given that the parameters are known, whereas in the likelihood function we look at a function where the parameters are unknown with x being known. This distinction is vital as we are looking for a tool where we do not know the parameters. The likelihood function may be interpreted as the probability function of $\theta$ conditioned on the value of x, and this is the main reason for identifying that value of $\theta$, say $\hat{\theta}$, which leads to the maximum of $L(\theta \mid x)$, that is, $L(\hat{\theta} \mid x) \geq L(\theta \mid x)$ for all $\theta$. Let us visualize the likelihood function for some important families of probability distribution. The importance of visualizing the likelihood function is emphasized in Chapter 7, The Logistic Regression Model, and Chapters 1-4 of Pawitan (2001).
Visualizing the likelihood function
We saw a few plots of the pmf/pdf in the Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics. Recall that we were plotting the pmf/pdf over the range of x. In those examples, we had assumed certain values for the parameters of the distributions. For the problems of statistical inference, we typically do not know the parameter values. Thus, the likelihood functions are plotted against the plausible parameter values $\theta$. What does this mean? For example, the pmf for a binomial distribution is plotted for x values ranging from 0 to n. However, the likelihood function needs to be plotted against p values ranging over the unit interval [0, 1].
Example 5.1.1. The likelihood function of a binomial distribution: A box of electronic chips is known to contain a certain number of defective chips. Suppose we take a random sample of n chips from the box and make a note of the number of non-defective chips. The probability of a non-defective chip is p, and that of a defective one is 1 - p. Let X be a random variable which takes the value 1 if the chip is non-defective and 0 if it is defective. Then X ~ b(1, p), where p is not known. Define $t_x = \sum_{i=1}^{n} x_i$. The likelihood function is then given by:

$$L(p \mid t_x, n) = \binom{n}{t_x} p^{t_x} (1-p)^{n - t_x}$$
Stascal Inference
[ 132 ]
Suppose that the observed value of $t_x$ is 7, that is, we have 7 successes out of 10 trials. Now, the purpose of likelihood inference is to understand the probability distribution of p given the data $t_x$. This gives us an idea about the most plausible value of p, and hence it is worthwhile to visualize the likelihood function $L(p \mid t_x, n)$.
Example 5.1.2. The likelihood function of a Poisson distribution: The number of accidents at a particular traffic signal of a city, the number of flight arrivals during a specific time interval at an airport, and so on are some of the scenarios where the assumption of a Poisson distribution is appropriate to explain the counts. Now let us consider a sample from a Poisson distribution. Suppose that the number of flight arrivals at an airport during the duration of an hour follows a Poisson distribution with an unknown rate $\lambda$. Suppose that we have the number of arrivals over ten distinct hours as 1, 2, 2, 1, 0, 2, 3, 1, 2, and 4. Using this data, we need to infer about $\lambda$. Towards this we will first plot the likelihood function. The likelihood function for a random sample of size n is given by:

$$L(\lambda \mid x) = \frac{e^{-n\lambda} \lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}$$
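As a cross-check on the formula above, the likelihood can also be computed directly in R by multiplying the individual Poisson probabilities over a grid of $\lambda$ values. This is a minimal sketch, not the book's code; the grid and object names are illustrative.

# Direct evaluation of the Poisson likelihood over a grid of lambda values
x <- c(1, 2, 2, 1, 0, 2, 3, 1, 2, 4)
lambda_seq <- seq(0.1, 5, by = 0.1)
lik <- sapply(lambda_seq, function(l) prod(dpois(x, lambda = l)))
lambda_seq[which.max(lik)]   # grid value maximizing the likelihood, close to mean(x) = 1.8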
Before we consider the R program for visualizing the likelihood function for the samples from the binomial and Poisson distributions, let us look at the likelihood function for a sample from the normal distribution.
Example 5.1.3. The likelihood function of a normal distribution: The CPU_Time variable from IO_Time may be assumed to follow a normal distribution. For this problem, we will simulate n = 25 observations from a normal distribution; for more details about the simulation, refer to the next chapter. Though we simulate the n observations with mean 10 and standard deviation 2, we will pretend that we do not actually know the mean value, with the assumption that the standard deviation is known to be 2. The likelihood function for a sample from a normal distribution with a known standard deviation is given by:
$$L(\mu \mid x, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right\}$$
In our particular example, with $\sigma = 2$, it is:

$$L(\mu \mid x, \sigma = 2) = \frac{1}{(8\pi)^{n/2}} \exp\left\{-\frac{1}{8} \sum_{i=1}^{n} (x_i - \mu)^2\right\}$$

It is time for action!
Time for action – visualizing the likelihood function
We will now visualize the likelihood functions for the binomial, Poisson, and normal distributions discussed before:
1. Inialize the graphics windows for the three samples using par(mfrow= c(1,3)).
2. Declare the number of trials n and the number of success x by n <- 10; x <- 7.
3. Set the sequence of p values with p_seq <- seq(0,1,0.01).
For p_seq, obtain the probabilies for n = 10 and x = 7 by using the dbinom
funcon: dbinom(x=7,size=n,prob=p_seq).
4. Next, obtain the likelihood funcon plot by running plot(p_seq, dbinom(
x=7,size=n,prob=p_seq), xlab="p", ylab="Binomial Likelihood
Function", "l").
5. Enter the data for the Poisson random sample into R using x <- c(1,2,2,1,
0,2,3,1,2,4) and the number of observaons by n <- length(x).
6. Declare the sequence of possible
λ
values through lambda_seq <-
seq(0,5,0.1).
7. Plot the likelihood funcon for the Poisson distribuon with plot( lambda_seq,
dpois(x=sum(x),lambda=n*lambda_seq)…).
We are generang random observaons from a normal distribuon using the rnorm
funcon. Each run of the rnorm funcon results in dierent values and hence to
ensure that you are able to reproduce the exact output as produced here, we will
set the inial seed for random generaon tool with set.seed(123).
8. For generaon of random numbers, x the seed value with set.seed(123).
This is to simply ensure that we obtain the same result.
9. Simulate 25 observaons from the normal distribuon with mean 10 and standard
deviaon 2 using n<-25; xn <- rnorm(n,mean=10,sd=2).
10. Consider the following range of
µ
values mu_seq <- seq(9,11,0.05).
11. Plot the normal likelihood funcon with plot(mu_seq,dnorm(x= mean(xn),
mean=mu_seq,sd=2) ).
The detailed code for the preceding action is now provided:
# Time for Action: Visualizing the Likelihood Function
par(mfrow=c(1,3))
# # Visualizing the Likelihood Function of a Binomial Distribution.
n <- 10; x <- 7
p_seq <- seq(0,1,0.01)
Stascal Inference
[ 134 ]
plot(p_seq, dbinom(x=7,size=n,prob=p_seq), xlab="p", ylab="Binomial Likelihood Function", "l")
# Visualizing the Likelihood Function of a Poisson Distribution.
x <- c(1, 2, 2, 1, 0, 2, 3, 1, 2, 4); n = length(x)
lambda_seq <- seq(0,5,0.1)
plot(lambda_seq,dpois(x=sum(x),lambda=n*lambda_seq),
xlab=expression(lambda),ylab="Poisson Likelihood Function", "l")
# Visualizing the Likelihood Function of a Normal Distribution.
set.seed(123)
n <- 25; xn <- rnorm(n,mean=10,sd=2)
mu_seq <- seq(9,11,0.05)
plot(mu_seq,dnorm(x=mean(xn),mean=mu_seq,sd=2),"l",
xlab=expression(mu),ylab="Normal Likelihood Function")
Run the preceding code in your R session. You will find an identical copy of the next plot on your computer screen too. What does the plot tell us? The likelihood function for the binomial distribution has very small values up to 0.4, then it gradually peaks at around 0.7 and then declines sharply. This means that the values in the neighborhood of 0.7 are more likely to be the true value of p than points away from it. Similarly, the likelihood function plot for the Poisson distribution says that $\lambda$ values less than 1 and greater than 3 are very unlikely to be the true value of the actual $\lambda$. The peak of the likelihood function appears at a value a little less than 2. The interpretation of the normal likelihood function is left as an exercise to the reader.
Figure 1: Some likelihood functions
What just happened?
We took our rst step in the problem of esmaon of parameters. Visualizaon of the
likelihood funcon is a very important aspect and is oen overlooked in many introductory
textbooks. Moreover, and as it is natural, we did it in R!
Finding the maximum likelihood estimator
The likelihood function plot indicates the plausibility of the data-generating mechanism for different values of the parameters. Naturally, the value of the parameter for which the likelihood function attains its highest value is the most likely value of the parameter. This forms the crux of maximum likelihood estimation.
The value of $\theta$ that leads to the maximum value of the likelihood function $L(\theta \mid x)$ is referred to as the maximum likelihood estimate, abbreviated as MLE.
For the reader familiar with numerical optimization, it is not a surprise that calculus is useful for finding the optimum value of a function. However, we will not indulge in more mathematics than is required here. We will note some finer aspects of numerical optimization. For an independent sample of size n, the likelihood function is a product of n functions, and it is very likely that we may soon end up in the mathematical world of intractable functions. To a large extent, we can circumvent this problem by resorting to the logarithm of the function, which transforms the problem of optimizing a product of functions into that of a sum of functions. That is, we will focus on optimizing $\log L(\theta \mid x)$ instead of $L(\theta \mid x)$.
An important consequence of using the logarithm is that the maximization of a product function translates into that of a sum function, since log(ab) = log(a) + log(b). It may also be seen that the maximum point of the likelihood function is preserved under the logarithm transformation, since for a > b, log(a) > log(b). Further, many numerical techniques know how to minimize a function rather than maximize it. Thus, instead of maximizing the log-likelihood function $\log L(\theta \mid x)$, we will minimize $-\log L(\theta \mid x)$ in R.
The R package stats4 provides the mle function, which returns the MLE. There is a host of probability distributions for which it is possible to obtain the MLE. We will continue the illustrations for the examples considered earlier in the chapter.
Example 5.1.4. Finding the MLE of a binomial distribution (continuation of Example 5.1.1): The negative log-likelihood function of the binomial distribution, sans the constant term (the combinatorial term is excluded since its value is independent of p), is given by:

$$-\log L(p \mid t_x, n) = -t_x \log p - (n - t_x) \log(1 - p)$$
Stascal Inference
[ 136 ]
The maximum likelihood estimator of p, obtained by differentiating the preceding equation with respect to p, equating the result to zero, and then solving the equation, is given by the sample proportion:

$$\hat{p} = \frac{t_x}{n}$$
An estimator of a parameter is denoted by accentuating the parameter with a hat. Though this is very easy to compute, we will resort to the useful function mle.
Example 5.1.5. MLE of a Poisson distribution (continuation of Example 5.1.2): The negative log-likelihood function, up to an additive constant, is given by:

$$-\log L(\lambda \mid x) = n\lambda - \log \lambda \sum_{i=1}^{n} x_i$$

The MLE for $\lambda$ admits a closed form, which can be obtained from calculus arguments, and it is given by:

$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n}$$
To obtain the MLE, we need to write exclusive code for the negative log-likelihood function. For the normal distribution, we will also use the mle function. There is another method of finding the MLE than the mle function available in the stats4 package. We consider it next. The R code will be given in the forthcoming action.
Using the tdistr function
In the previous examples, we needed to explicitly specify the negave log-likelihood
funcon. The fitdistr funcon from the MASS package can be used to obtain the
unknown parameters of a probability distribuon, for a list of the probability funcons for
which it applies see ?fitdistr, and the fact that it uses the maximum likelihood ng
complements our approach in this secon.
Example 5.1.6. MLEs for Poisson and normal distributions: In the next action, we will use the fitdistr function from the MASS package for obtaining the MLEs in Example 5.1.2 and Example 5.1.3. In fact, using this function, we get the answers readily without the need to specify the negative log-likelihood explicitly.
Time for action – finding the MLE using the mle and fitdistr functions
The mle function from the stats4 package will be used for obtaining the MLE for popular distributions such as the binomial, normal, and so on. The fitdistr function will be used too, which fits the distributions using MLEs.
1. Load the stats4 package with library(stats4).
2. Specify the number of successes in a vector format and the number of observations with x <- rep(c(0,1),c(3,7)); n <- length(x).
3. Define the negative log-likelihood function with a function:
binomial_nll <- function(prob) -sum(dbinom(x,size=1,prob,log=TRUE))
The code works as follows. The dbinom function is invoked from the stats package and the option log=TRUE is exercised to indicate that we need the log of the probability (actually likelihood) values. The dbinom function returns a vector of probabilities for all the values of x. The sum, multiplied by -1, now returns the value of the negative log-likelihood.
4. Now, enter fit_binom <- mle(binomial_nll,start=list(prob=0.5),nobs=n) on the R console. Now, mle as a function optimizes the binomial_nll function defined in the previous step. Initial values, a guess or a legitimate value, are specified for the start option, and we also declare the number of observations available for this problem.
5. summary(fit_binom) will give details of the mle function applied on binomial_nll. The output is displayed in the next screenshot.
6. Specify the data for the Poisson distribution problem in x <- c(1,2,2,1,0,2,3,1,2,4); n <- length(x).
7. Define the negative log-likelihood function along parallel lines to the binomial distribution:
pois_nll <- function(lambda) -sum(dpois(x,lambda,log=TRUE))
8. Explore different options of the mle function by specifying the method, a guess of the least and most values of the parameter, and the initial value as the median of the observations:
fit_poisson <- mle(pois_nll,start=list(lambda=median(x)),nobs=n, method = "Brent", lower = 0, upper = 10)
9. Get the answer by entering summary(fit_poisson).
10. Define the negative log-likelihood function for the normal distribution by:
normal_nll <- function(mean) -sum(dnorm(xn,mean,sd=2,log=TRUE))
Stascal Inference
[ 138 ]
11. Find the MLE of the normal distribution with fit_normal <- mle(normal_nll, start=list(mean=8),nobs=n).
12. Get the final answer with summary(fit_normal).
13. Load the MASS package: library(MASS).
14. Fit the x vector with a Poisson distribution by running fitdistr(x,"poisson") in R.
15. Fit the xn vector with a normal distribution by running fitdistr(xn,"normal"); a quick cross-check against the closed-form MLEs follows.
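Since the closed-form MLEs derived earlier are the sample means (for the Poisson rate and for the normal location), it is easy to verify that fitdistr returns the same values. The following is a small cross-check sketch, assuming x and xn are the vectors created in the preceding steps.

# Cross-check: the fitted Poisson rate equals mean(x),
# and the fitted normal mean equals mean(xn)
library(MASS)
fitdistr(x, "poisson")$estimate    # should agree with mean(x)
mean(x)
fitdistr(xn, "normal")$estimate    # the "mean" component should agree with mean(xn)
mean(xn)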
Figure 2: Finding the MLE and related summaries
What just happened?
You have explored the possibility of finding the MLEs for many standard distributions using mle from the stats4 package and fitdistr from the MASS package. The main key to obtaining the MLE is the right construction of the (negative) log-likelihood function.
Condence intervals
The MLE is a point estimate and as such, on its own, it is almost of no practical use. It would be more appropriate to give a coverage of parameter points which is most likely to contain the true unknown parameter. A general practice is to specify the coverage of the points through an interval and then consider specific intervals which have a specified probability. A formal definition is in order.
A confidence interval for a population parameter is an interval that is predicted to contain the parameter with a certain probability.
The common choice is to obtain either 95 percent or 99 percent confidence intervals. It is common to specify the coverage of the confidence interval through a significance level $\alpha$ (more about this in the next section), which is a small number close to 0. The 95 percent and 99 percent confidence intervals then correspond to $100(1-\alpha)$ percent intervals with respective $\alpha$ equal to 0.05 and 0.01. In general, a $100(1-\alpha)$ percent confidence interval says that if the experiment is performed many times over, we expect the resulting intervals to contain the true parameter $100(1-\alpha)$ percent of the time.
Example 5.2.1. Condence interval for binomial proporon: Consider n Bernoulli trials
1,..., n
X X
with the probability of success being p. We saw earlier that the MLE of p is:
ˆx
t
p
n
=
where
1
n
x i
i
t x
=
=
. Theorecally, the expected value of
ˆ
p
is p and its standard deviaon is
( )
1 /p p n. An esmate of the standard deviaon is
( )
ˆ ˆ
1 /p p n. For large n and when both
np and np(1-p) are greater than 5, using a normal approximaon, by virtue of the central
limit theorem, a
( )
100 1
α
percent condence interval for p is given by:
( ) ( )
/2 /2
ˆ ˆ ˆ ˆ
1 1
ˆ ˆ
,
p p p p
p z p z
n n
α α
 
− −
 
− +
 
 
where
/2
z
α
is the
/2
α
quanle of the standard normal distribuon. The condence
intervals obtained by using the normal approximaon are not reliable when the p value is
near 0 or 1. Thus, if the lower condence limit falls below 0, or the upper condence limit
exceeds 0, we will adapt the convenon of taking them as 0 and 1 respecvely.
Stascal Inference
[ 140 ]
Example 5.2.2. Condence interval for normal mean with known variance: Consider a
random sample of size n from a normal distribuon with an unknown mean
µ
and a known
standard deviaon
σ
. It may be shown that the MLE of mean
µ
is the sample mean
1
/
n
i
i
X x n
=
=
and that the distribuon of
X
is again normal with mean
µ
and standard deviaon
/n
σ
. Then, the
( )
100 1
α
percent condence interval is given by:
/2 /2
,X Z X Z
n n
α α
σ σ
 
− +
 
 
where
/2
z
α
is the
/2
α
quanle of the standard normal distribuon. The width of the
preceding condence interval is /2
2 /z n
α
σ
. Thus, when the sample size is increased by
four mes, the width will decrease by half.
Example 5.2.3. Condence interval for normal mean with unknown variance: We connue
with a sample of size n. When the variance is not known, the steps become very dierent.
Since the variance is not known, we replace it by the sample variance:
( )
2
2
1
1
1
n
i
i
S X X
n=
= −
The denominator is n - 1 since we have already esmated
µ
using the n observaons.
To develop the condence interval for
µ
, consider the following stasc:
2
/
X
T
S n
µ
=
This new stasc T has a t-distribuon with n - 1 degrees of freedom. The
( )
100 1
α
percent
condence interval for
µ
is then given by the following interval:
1, /2 1, /2
,
n n
S S
X t X t
n n
α α
− −
 
− +
 
 
where
1, /2n
t
α
is the
/2
α
quanle of a t random variable with n - 1 degrees of freedom.
We will create functions for obtaining the confidence intervals for the preceding three examples. Many statistical tests in R return confidence intervals at desired levels. However, we will encounter these tests in the last section of the chapter, and hence, up to that point, we will confine ourselves to user-defined functions and applications.
Time for action – confidence intervals
We create functions that will enable us to obtain confidence intervals of the desired level:
1. Create a function for obtaining confidence intervals for the proportion of a binomial distribution with the following function:
binom_CI = function(x, n, alpha) {
  phat = x/n
  ll = phat - qnorm(alpha/2,lower.tail=FALSE)*sqrt(phat*(1-phat)/n)
  ul = phat + qnorm(alpha/2,lower.tail=FALSE)*sqrt(phat*(1-phat)/n)
  return(paste("The ", 100*(1-alpha), "% Confidence Interval for Binomial Proportion is (", round(ll,4), ",", round(ul,4), ")", sep=''))
}
The arguments of the function are x, n, and alpha. That is, the user of the function needs to specify the number of successes x out of the n Bernoulli trials, and the significance level $\alpha$. First, we obtain the MLE $\hat{p}$ of the proportion p by calculating phat = x/n. To obtain the value of $z_{\alpha/2}$, we use the quantile function qnorm(alpha/2, lower.tail=FALSE). The quantity $\sqrt{\hat{p}(1-\hat{p})/n}$ is computed with sqrt(phat*(1-phat)/n). The rest of the code for ll and ul is self-explanatory. We use the paste function together with the return function to get the output in a convenient format.
2. Consider the data in Example 5.2.1, where we have x = 7 and n = 10. Suppose that we require 95 percent and 99 percent confidence intervals. The respective $\alpha$ values for these confidence intervals are 0.05 and 0.01. Let us execute the binom_CI function on this data. That is, we need to run binom_CI(x=7,n=10,alpha=0.05) and binom_CI(x=7,n=10,alpha=0.01) on the R console. The output will be as shown in the next screenshot.
Thus, (0.416, 0.984) is the 95 percent confidence interval for p and (0.3267, 1.0733) is the 99 percent confidence interval for it. Since the upper confidence limit exceeds 1, we will use (0.3267, 1) as the 99 percent confidence interval for p.
3. We rst give the funcon for construcon of condence intervals for the mean
µ
of a normal distribuon when the standard deviaon is known:
normal_CI_ksd = function(x,sigma,alpha) {
xbar = mean(x)
n = length(x)
ll = xbar-qnorm(alpha/2,lower.tail=FALSE)*sigma/sqrt(n)
ul = xbar+qnorm(alpha/2,lower.tail=FALSE)*sigma/sqrt(n)
return(paste("The ", 100*(1-alpha),"% Confidence Interval for
the Normal mean is (", round(ll,4),",",round(ul,4),")",sep=''))
}
Stascal Inference
[ 142 ]
The function normal_CI_ksd works differently from the earlier binomial one. Here, we provide the entire data to the function and specify the known value of the standard deviation and the significance level. First, we obtain the MLE $\bar{X}$ of the mean $\mu$ with xbar = mean(x). The R code qnorm(alpha/2, lower.tail=FALSE) is used to obtain $z_{\alpha/2}$. Next, $\sigma/\sqrt{n}$ is computed by sigma/sqrt(n). The code for ll and ul is straightforward to comprehend. The return and paste functions have the same purpose as in the previous example. Compile the code for the normal_CI_ksd function.
4. Let us see a few examples, in continuation of Example 5.1.3, for obtaining the confidence interval for the mean of a normal distribution with the standard deviation known. To obtain the 95 percent and 99 percent confidence intervals for the xn data, where the standard deviation was known to be 2, run normal_CI_ksd(x=xn,sigma=2,alpha=0.05) and normal_CI_ksd(x=xn,sigma=2,alpha=0.01) on the R console. The output is consolidated in the next screenshot.
Thus, the 95 percent confidence interval for $\mu$ is (9.1494, 10.7173) and the 99 percent confidence interval is (8.903, 10.9637).
5. Create a function, normal_CI_uksd, for obtaining the confidence intervals for $\mu$ of a normal distribution when the standard deviation is unknown:
normal_CI_uksd = function(x,alpha) {
  xbar = mean(x); s = sd(x)
  n = length(x)
  ll = xbar - qt(alpha/2,n-1,lower.tail=FALSE)*s/sqrt(n)
  ul = xbar + qt(alpha/2,n-1,lower.tail=FALSE)*s/sqrt(n)
  return(paste("The ", 100*(1-alpha), "% Confidence Interval for the Normal mean is (", round(ll,4), ",", round(ul,4), ")", sep=''))
}
We have an additional computation here in comparison with the earlier function. Since the standard deviation is unknown, we estimate it with s = sd(x). Furthermore, we need to obtain the quantile from the t-distribution with n - 1 degrees of freedom, and hence we have qt(alpha/2,n-1,lower.tail=FALSE) for the computation of $t_{n-1,\alpha/2}$. The rest of the details follow the previous function.
6. Let us obtain the 95 percent and 99 percent confidence intervals for the vector xn under the assumption that the variance is not known. The codes for achieving the results are normal_CI_uksd(x=xn,alpha=0.05) and normal_CI_uksd(x=xn,alpha=0.01).
Thus, the 95 percent confidence interval for the mean is (9.1518, 10.7419) and the 99 percent confidence interval is (8.8742, 10.9925).
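As a sanity check, the same intervals for the unknown-variance case can be obtained directly from the built-in t.test function, which returns a confidence interval as part of its output. This is a quick sketch, assuming xn is the simulated vector from earlier.

# Cross-check of normal_CI_uksd against the interval returned by t.test
t.test(xn, conf.level = 0.95)$conf.int
t.test(xn, conf.level = 0.99)$conf.int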
Figure 3: Confidence intervals: Some raw programs
What just happened?
We created special functions for obtaining confidence intervals and executed them for three different cases. However, our framework is quite generic in nature and, with a bit of care and caution, it may easily be extended to other distributions too.
Stascal Inference
[ 144 ]
Hypotheses testing
"Best consumed before six months from date of manufacture", "Two years warranty",
"Expiry date: June 20, 2015", and so on, are some of the likely assurances that you would
have easily come across. An analyst will have to arrive at such statements using the related
data. Let us rst dene a hypothesis.
Hypothesis: A hypothesis is an asseron about the unknown parameter of the probability
distribuon. For the quote of this secon, denong the least me (in months) ll which
an eatery will not be losing its good taste by
θ
, the hypothesis of interest will be
0:6H
θ
.
It is common to denote the hypothesis of interest by
0
H
and it called the null hypothesis.
We want to test the null hypothesis against the alternave hypothesis that the consumpon
me is well before the six months' me, which in symbols is denoted by
1:6H
θ
<
. We will
begin with some important denions followed by related examples.
Test stasc: A stasc that is a funcon of the random sample is called as test stasc.
For an observaon X following a binomial distribuon b(n, p), the test stasc for p will be
X/n, whereas for a random sample from the normal distribuon, the test stasc may be
mean
1
/
n
i
i
X X n
=
=
or the sample variance
( )
( )
2
2
1
/ 1
n
i
i
S X X n
=
= − −
depending on whether the
tesng problem is for
2
or
µσ
. The stascal soluon to reject (or not) the null hypothesis
depends on the value of test stasc. This leads us to the next denion.
Crical region: The set of values of the test stasc which leads to the rejecon of the null
hypothesis is known as the crical region.
We have made various kinds of assumpons for the random experiments. Naturally,
depending on the type of the probability family, namely binomial, normal, and so on,
we will have an appropriate tesng tool. Let us look at the very popular tests arising
in stascs.
Binomial test
A binomial random variable X, with distribution represented by b(n, p), is characterized by two parameters, n and p. Typically, n represents the number of trials and is known in most cases, and it is the probability of success p in which one is generally interested for the related hypotheses.
For example, an LCD panel manufacturer would like to test whether the proportion of defectives is at most four percent. The panel manufacturer has randomly inspected 893 LCDs and found 39 to be defective. Here, the hypothesis testing problem would be $H_0: p \leq 0.04$ vs $H_1: p > 0.04$.
A doctor would like to test whether the proportion of people in a drought-affected area having a viral infection such as pneumonia is 0.2, that is, $H_0: p = 0.2$ vs $H_1: p \neq 0.2$. The drought-affected area may encompass a huge geographical area, and as such it becomes really difficult to carry out a census over a very short period of a day or two. Thus, the doctor selects the second-eldest member of a family and inspects 119 households for pneumonia. He records that 28 out of the 119 inspected people are suffering from pneumonia. Using this information, we need to help the doctor test the hypothesis of interest to him.
In general, the hypothesis testing problems for the binomial distribution will be along the lines of $H_0: p \leq p_0$ vs $H_1: p > p_0$, $H_0: p \geq p_0$ vs $H_1: p < p_0$, or $H_0: p = p_0$ vs $H_1: p \neq p_0$.
Let us see how the binom.test function in R helps in testing hypothesis problems related to the binomial distribution.
Time for action – testing the probability of success
We will use the R function binom.test for testing hypothesis problems related to p. This function takes as arguments the number of trials n, the number of successes x, the hypothesized probability of interest p, and alternative as one of "greater", "less", or "two.sided".
1. Discover the details related to binom.test using ?binom.test, and then run example(binom.test) and ensure that you understand the default example.
2. For the LCD panel manufacturer, we have n = 893 and x = 39. The null hypothesis is specified at p = 0.04. Enter this data first in R with the following code:
n_lcd <- 893; x_lcd <- 39; p_lcd <- 0.04
3. The alternative hypothesis is that the proportion of success p is greater than 0.04, which is passed to the binom.test function with the option alternative="greater", and hence the complete binom.test call for the LCD panel is:
binom.test(n=n_lcd,x=x_lcd,p=p_lcd,alternative="greater")
The output, shown in the following screenshot, shows that the estimated probability of success is 0.04367, which is certainly greater than 0.04. However, the p-value = 0.3103 indicates that we do not have enough evidence in the data to reject the null hypothesis $H_0: p \leq 0.04$. Note that binom.test also gives us a 95 percent confidence interval for p as (0.033, 1.000), and since the hypothesized probability lies in this interval, we arrive at the same conclusion. This confidence interval is recommended over the one developed in the previous section, and in particular we don't have to worry about the confidence limits being either less than 0 or greater than 1. Also, you may obtain a confidence interval of any $100(1-\alpha)$ percent level of your choice with the conf.level argument, as illustrated next.
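For instance, a 99 percent interval for the LCD example can be requested as follows; this is a small illustrative sketch using the objects defined in step 2.

# Requesting a 99 percent confidence interval from binom.test
binom.test(n=n_lcd, x=x_lcd, p=p_lcd, alternative="greater", conf.level=0.99)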
Stascal Inference
[ 146 ]
4. For the doctor's problem, we have the data as:
n_doc <- 119; x_doc <- 28; p_doc <- 0.2
5. We need to test the null hypothesis against a two-sided alternative hypothesis, and though this is the default setting of binom.test, it is good practice to specify it explicitly, at least until the user feels sufficiently expert:
binom.test(n=n_doc,x=x_doc,p=p_doc,alternative="two.sided")
The estimated probability of success, actually a patient's probability of having the viral infection, is 0.3193. Since the p-value associated with the test is p-value = 0.001888, we reject the null hypothesis $H_0: p = 0.2$. All the output is given in the following screenshot. The 95 percent confidence interval is (0.2369, 0.4110), again given by binom.test, which does not contain the hypothesized value of 0.2, and hence we can also reject the null hypothesis based on the confidence interval.
Figure 4: Binomial tests for probability of success
What just happened?
The binomial distribution arises in a large number of proportion-testing problems. In this section we used binom.test for testing problems related to the probability of success. We also note that confidence intervals for p are given as a by-product of the application of binom.test.
Tests of proportions and the chi-square test
In Chapter 3, Data Visualization, we came across the Titanic and UCBAdmissions datasets. For the Titanic dataset, we may like to test whether the Survived proportion across the Class is the same for the two Sex groups. Similarly, for the UCBAdmissions dataset, we may wish to know whether the proportion of Admitted candidates for the Male and Female groups is the same across the six Dept values. Thus, there is a need to generalize the binom.test function to a group of proportions. In this problem, we may have k proportions, and the probability vector is specified by $p = (p_1, \ldots, p_k)$. The hypothesis problem may be specified as testing the null hypothesis $H_0: p = p_0$ against the alternative hypothesis $H_1: p \neq p_0$. Equivalently, in vector form, the problem is testing $H_0: (p_1, \ldots, p_k) = (p_{01}, \ldots, p_{0k})$ against $H_1: (p_1, \ldots, p_k) \neq (p_{01}, \ldots, p_{0k})$. The R extension of binom.test is given in prop.test.
Time for action – testing proportions
We will use the prop.test R function here for testing the equality of proportions for count data problems.
1. Load the required dataset with data(UCBAdmissions). For the UCBAdmissions dataset, first obtain the Admitted and Rejected frequencies for both genders across the six departments with:
UCBA.Dept <- ftable(UCBAdmissions, row.vars="Dept", col.vars = c("Gender", "Admit"))
2. Calculate the Admitted proportions for Female across the six departments with:
p_female <- prop.table(UCBA.Dept[,3:4],margin=1)[,1]
Check p_female!
3. Test whether the proportions across the departments for Male match those for Female using prop.test:
prop.test(UCBA.Dept[,1:2],p=p_female)
The proportions are not equal across the Gender groups, as p-value < 2.2e-16 rejects the null hypothesis that they are equal.
4. Next, we want to investigate whether the Male and Female survivor proportions are the same in the Titanic dataset. The approach is similar to the UCBAdmissions problem; run the following code:
T.Class <- ftable(Titanic, row.vars="Class", col.vars = c("Sex", "Survived"))
Stascal Inference
[ 148 ]
5. Compute the Female survivor proportions across the four classes with p_female <- prop.table(T.Class[,3:4],margin=1)[,1]. Note that this new variable, p_female, will overwrite the same-named variable from the earlier steps.
6. Display p_female and then carry out the comparison across the two genders:
prop.test(T.Class[,1:2],p=p_female)
The p-value < 2.2e-16 clearly shows that the survivor proportions are not the same across the genders.
Figure 5: prop.test in action
Indeed, there is more complexity to the two datasets than mere proportions for the two genders. The web page http://www-stat.stanford.edu/~sabatti/Stat48/UCB.R has a detailed analysis of the UCBAdmissions dataset, and here we will simply apply the chi-square test to check whether the admission percentage within each department is independent of the gender.
7. The data for the admission/rejection for each department is extractable through the third index in the array, that is, UCBAdmissions[,,i] across the six departments. Now, we apply chisq.test to check whether the admission procedure is independent of the gender by running chisq.test(UCBAdmissions[,,i]) six times, as sketched after this step. The result has been edited in an external text editor and a screenshot of it is provided next.
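Rather than typing the call six times, the six tests can be run in one pass; the following is a compact sketch (the loop is our own shorthand, not the book's screenshot code).

# Chi-square test of independence between Gender and Admit within each department
sapply(dimnames(UCBAdmissions)$Dept, function(d)
  chisq.test(UCBAdmissions[, , d])$p.value)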
It appears that the Dept = A admits more males than females.
Figure 6: Chi-square tests for the UCBAdmissions problem
What just happened?
We used prop.test and chisq.test to test proportions and the independence of attributes. Functions such as ftable and prop.table, and arguments such as row.vars, col.vars, and margin, were useful for getting the data in the right format for the analysis.
We will now look at an important family of tests for the normal distribution.
Tests based on normal distribution – one-sample
The normal distribution pops up in many instances of statistical analysis. In fact, Whittaker and Robinson have remarked on the popularity of the normal distribution as follows:
Everybody believes in the exponential law of errors [that is, the normal distribution]: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation.
We will not make an attempt to find out whether the experimenters or the mathematicians are correct, well, at least not in this section.
Stascal Inference
[ 150 ]
In general, we will be dealing with either one-sample or two-sample tests. In the one-sample problem, we have a random sample of size n from $N(\mu, \sigma^2)$ in $(X_1, X_2, \ldots, X_n)$. The hypothesis testing problem may be related to either or both of the parameters $(\mu, \sigma^2)$. The interesting and most frequent hypothesis testing problems for the normal distribution are listed here:
Testing for the mean with known variance $\sigma^2$:
  $H_0: \mu \geq \mu_0$ vs $H_1: \mu < \mu_0$
  $H_0: \mu \leq \mu_0$ vs $H_1: \mu > \mu_0$
  $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$
Testing for the mean with unknown variance $\sigma^2$: the same set of hypothesis problems as in the preceding point
Testing for the variance with unknown mean:
  $H_0: \sigma \leq \sigma_0$ vs $H_1: \sigma > \sigma_0$
  $H_0: \sigma \geq \sigma_0$ vs $H_1: \sigma < \sigma_0$
  $H_0: \sigma = \sigma_0$ vs $H_1: \sigma \neq \sigma_0$
In the case of known variance, the hypothesis testing problem for the mean is based on the Z-statistic given by:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

where $\bar{X} = \sum_{i=1}^{n} X_i / n$. The test procedure, known as the Z-test, for the hypothesis testing problem $H_0: \mu \leq \mu_0$ vs $H_1: \mu > \mu_0$ is to reject the null hypothesis at the $\alpha$ level of significance if $\bar{X} > \mu_0 + z_\alpha \sigma/\sqrt{n}$, where $z_\alpha$ is the upper $\alpha$ percentile of the standard normal distribution. For the hypothesis testing problem $H_0: \mu \geq \mu_0$ vs $H_1: \mu < \mu_0$, the critical/rejection region is $\bar{X} < \mu_0 - z_\alpha \sigma/\sqrt{n}$. Finally, for the testing problem $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$, we reject the null hypothesis if:

$$\left| \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \right| > z_{\alpha/2}$$
An R function, z.test, is available in the PASWR package, which carries out the Z-test for each type of hypothesis testing problem. Now, we consider the case when the variance $\sigma^2$ is not known. In this case, we first find an estimate of the variance using $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$. The test procedure is based on the well-known t-statistic:

$$t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

The test procedure based on the t-statistic is highly popular as the t-test or Student's t-test, and its implementation is available in R through the t.test function in the stats package. The distribution of the t-statistic under the null hypothesis is the t-distribution with (n - 1) degrees of freedom. The rationale behind the application of the t-test for the various types of hypotheses remains the same as for the Z-test.
For the hypothesis testing problem concerning the variance $\sigma^2$ of the normal distribution, we need to first compute the sample variance using $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$ and define the chi-square statistic:

$$\chi^2 = \frac{(n-1) S^2}{\sigma_0^2}$$

Under the null hypothesis, the chi-square statistic is distributed as a chi-square random variable with n - 1 degrees of freedom. In the case of a known mean, which is seldom the case, the test procedure is based on the test statistic $\chi^2 = \sum_{i=1}^{n} (X_i - \mu)^2 / \sigma_0^2$, which follows a chi-square random variable with n degrees of freedom. For the hypothesis problem $H_0: \sigma \geq \sigma_0$ vs $H_1: \sigma < \sigma_0$, the test procedure is to reject $H_0$ if $\chi^2 < \chi^2_{n-1, 1-\alpha}$. Similarly, for the hypothesis problem $H_0: \sigma \leq \sigma_0$ vs $H_1: \sigma > \sigma_0$, the procedure is to reject $H_0$ if $\chi^2 > \chi^2_{n-1, \alpha}$; and finally, for the problem $H_0: \sigma = \sigma_0$ vs $H_1: \sigma \neq \sigma_0$, the test procedure rejects $H_0$ if either $\chi^2 < \chi^2_{n-1, 1-\alpha/2}$ or $\chi^2 > \chi^2_{n-1, \alpha/2}$. Here, $\chi^2_{n-1, \alpha}$ denotes the upper $\alpha$ quantile of the chi-square distribution with n - 1 degrees of freedom.
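To make the Z-test formula concrete before turning to the packaged functions, the statistic and its p-value can be computed in a few lines of base R. This is an illustrative sketch under the one-sided "less" alternative, not the PASWR implementation; the function name z_stat is our own.

# A bare-bones one-sample Z-test (alternative: mu < mu0), for illustration only
z_stat <- function(x, mu0, sigma) {
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
  p_value <- pnorm(z)              # lower-tail p-value for H1: mu < mu0
  c(z = z, p.value = p_value)
}
# Example, once pH_Data is defined in the next Time for action:
# z_stat(pH_Data, mu0 = 8.4, sigma = 0.05)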
Test examples. Let us consider some situations where the preceding set of hypotheses arises in a natural way:
A certain chemical experiment requires that the solution used as a reactant has a pH level greater than 8.4. It is known that the manufacturing process gives measurements which follow a normal distribution with a standard deviation of 0.05. The ten random observations are 8.30, 8.42, 8.44, 8.32, 8.43, 8.41, 8.42, 8.46, 8.37, and 8.42. Here, the hypothesis testing problem of interest is $H_0: \mu \geq 8.4$ vs $H_1: \mu < 8.4$. This problem is adopted from page 408 of Ross (2010).
Stascal Inference
[ 152 ]
Following a series of complaints that his company's LCD panels never last more than a year, the manufacturer wants to test whether his LCD panels indeed fail within a year. Using historical data, he knows the standard deviation of the panel life due to the manufacturing process is two. A random sample of 15 units from a freshly manufactured lot gives their lifetimes as 13.37, 10.96, 12.06, 13.82, 12.96, 10.47, 10.55, 16.28, 12.94, 11.43, 14.51, 12.63, 13.50, 11.50, and 12.87. You need to help the manufacturer validate his hypothesis.
Freund and Wilson (2003). Suppose that the mean weight of peanuts put in jars is required to be 8 oz. The variance of the weights is known to be 0.03, and the observed weights for 16 jars are 8.08, 7.71, 7.89, 7.72, 8.00, 7.90, 7.77, 7.81, 8.33, 7.67, 7.79, 7.79, 7.94, 7.84, 8.17, and 7.87. Here, we are interested in testing $H_0: \mu = 8.0$ vs $H_1: \mu \neq 8.0$.
New managers have been appointed at the respective places in the preceding bullets. As a consequence, the new managers are not aware of the standard deviation of the processes under their control. As an analyst, help them!
Suppose that the variance in the first example is not known and that it is a critical requirement that the variance be less than 7; the testing problem may then be set up as $H_0: \sigma^2 \leq 7$ vs $H_1: \sigma^2 > 7$.
Suppose that a variance test needs to be carried out for the third problem, that is, the hypothesis testing problem is then $H_0: \sigma^2 = 0.03$ vs $H_1: \sigma^2 \neq 0.03$.
We will perform the necessary tests for all the problems described before.
Time for action – testing one-sample hypotheses
We will require the R packages PASWR and PairedData here. The R functions t.test, z.test, and var.test will be useful for testing one-sample hypothesis problems related to a random sample from normal distributions.
1. Load the library with library(PASWR).
2. Enter the data for pH in R:
pH_Data <- c(8.30,8.42,8.44,8.32,8.43,8.41,8.42,8.46,8.37,8.42)
3. Specify the known standard deviation of pH in pH_sigma <- 0.05.
4. Use z.test from the PASWR library to test the hypotheses described in the first example with:
z.test(x=pH_Data,alternative="less",sigma.x=pH_sigma,mu=8.4)
The data is specified in the x option, the type of hypothesis problem is specified by stating the form of the alternative hypothesis, the known standard deviation is fed through the sigma.x option, and finally, the mu option is used to specify the value of the mean under the null hypothesis. The output of the complete R program is collected in the forthcoming two screenshots.
The p-value is 0.4748, which means that we do not have enough evidence to reject the null hypothesis $H_0: \mu \geq 8.4$, and hence we conclude that the mean pH value is at least 8.4.
5. Get the data of the LCD panels in your session with:
LCD_Data <- c(13.37, 10.96, 12.06, 13.82, 12.96, 10.47, 10.55, 16.28, 12.94, 11.43, 14.51, 12.63, 13.50, 11.50, 12.87)
6. Specify the known standard deviation LCD_sigma <- 2 and run the z.test with:
z.test(x=LCD_Data,alternative="greater",sigma.x=LCD_sigma,mu=12)
The p-value is seen to be 0.1018, and hence we again do not have enough evidence in the data to reject the null hypothesis; that is, we cannot conclude that the average lifetime of an LCD panel exceeds a year.
7. The complete program for the third problem is as follows:
peanuts <- c(8.08, 7.71, 7.89, 7.72, 8.00, 7.90, 7.77, 7.81, 8.33, 7.67, 7.79, 7.79, 7.94, 7.84, 8.17, 7.87)
peanuts_sigma <- 0.03
z.test(x=peanuts,sigma.x=peanuts_sigma,mu=8.0)
Since the p-value associated with this test is 2.2e-16, that is, very close to zero, we reject the null hypothesis $H_0: \mu = 8.0$.
8. If the variance(s) are not known and a test of the sample means is required, we need to move from z.test (in the PASWR library) to t.test (in the stats library):
t.test(x=pH_Data,alternative="less",mu=8.4)
t.test(x=LCD_Data,alternative="greater",mu=12)
t.test(x=peanuts,mu=8.0)
If the variance is not known, the conclusions for the problems related to pH and peanuts do not change. However, the conclusion changes for the LCD panel problem: at the 10 percent significance level the null hypothesis is rejected, since the p-value is 0.06414.
Stascal Inference
[ 154 ]
For the problem of testing variances related to the one-sample problem, my initial idea was to write raw R code, as there did not seem to be a function, package, and so on, which readily gives the answers. However, a more appropriate search at google.com revealed that an R package titled PairedData, created by Stephane Champely, does have a function, var.test, not to be confused with the same-named function in the stats library, which is appropriate for testing problems related to the variance of a normal distribution. The problem is that the routine method of fetching the package using install.packages("PairedData") gives a warning message, namely package 'PairedData' is not available (for R version 2.15.1). This is the classic case of "so near, yet so far…". However, a deeper look into this will lead us to http://cran.r-project.org/src/contrib/Archive/PairedData/. This web page shows the various versions of the PairedData package. A Linux user should have no problem in using it, though other OS users can't be helped right away. A Linux user needs to first download one of the zipped files, say PairedData_1.0.0.tar.gz, to a specific directory and, with the GNOME Terminal path in that directory, execute R CMD INSTALL PairedData_1.0.0.tar.gz. Now, we are ready to carry out the tests related to the variance of a normal distribution. A Windows user need not be discouraged by this scenario, as the important function var1.test is made available in the RSADBE package of the book. A more recent check on the CRAN website reveals that the PairedData package is again available for all OS platforms since April 18, 2013.
Figure 7: z.test and t.test for one-sample problem
9. Load the required library with library(PairedData).
10. Carry out the two variance testing problems from the last two test examples with:
var.test(x=pH_Data,alternative="greater",ratio=7)
var.test(x=peanuts,alternative="two.sided",ratio=0.03)
11. It may be seen from the next screenshot that the data does not lead to rejection of the null hypotheses. For a Windows user, the alternative is to use the var1.test function from the RSADBE package. That is, you need to run:
var1.test(x=pH_Data,alternative="greater",ratio=7)
var1.test(x=peanuts,alternative="two.sided",ratio=0.03)
You'll get the same results:
Figure 8: var.test from the PairedData library
What just happened?
The tests z.test, t.test, and var.test (from the PairedData library) have been used for testing hypothesis problems under varying assumptions about what is known.
Stascal Inference
[ 156 ]
Have a go hero
Consider the testing problem $H_0: \sigma = \sigma_0$ vs $H_1: \sigma \neq \sigma_0$. The test statistic for this hypothesis testing problem is given by:

$$\chi^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma_0^2}$$

which follows a chi-square distribution with n - 1 degrees of freedom. Create your own function for this testing problem and compare it with the results given by var.test of the PairedData package. One possible sketch is given next.
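The following is one possible sketch of such a function, written directly from the formula above; the name chisq_var_test and its output format are our own choices, and readers should compare its p-value against var.test from PairedData.

# A two-sided chi-square test for H0: sigma^2 = sigma0^2, written from scratch
chisq_var_test <- function(x, sigma0sq) {
  n <- length(x)
  stat <- sum((x - mean(x))^2) / sigma0sq      # (n - 1) * S^2 / sigma0^2
  # two-sided p-value from the chi-square distribution with n - 1 df
  p_value <- 2 * min(pchisq(stat, df = n - 1),
                     pchisq(stat, df = n - 1, lower.tail = FALSE))
  list(statistic = stat, df = n - 1, p.value = p_value)
}
# Example: chisq_var_test(peanuts, sigma0sq = 0.03)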
With the testing problem for the parameters of a normal distribution in the one-sample case behind us, we will next focus on the important two-sample problem.
Tests based on normal distribution – two-sample
The two-sample problem has data from two populations, where $(X_1, X_2, \ldots, X_{n_1})$ are $n_1$ observations from $N(\mu_1, \sigma_1^2)$ and $(Y_1, Y_2, \ldots, Y_{n_2})$ are $n_2$ observations from $N(\mu_2, \sigma_2^2)$. We assume that the samples within each population are independent of each other, and further that the samples across the two populations are also independent. Similar to the one-sample problem, we have the following set of recurring and interesting hypothesis testing problems.
Mean comparison with known variances $\sigma_1^2$ and $\sigma_2^2$:
  $H_0: \mu_1 \leq \mu_2$ vs $H_1: \mu_1 > \mu_2$
  $H_0: \mu_1 \geq \mu_2$ vs $H_1: \mu_1 < \mu_2$
  $H_0: \mu_1 = \mu_2$ vs $H_1: \mu_1 \neq \mu_2$
Mean comparison with unknown variances $\sigma_1^2$ and $\sigma_2^2$: the same set of hypothesis problems as before. We make an additional assumption here that the variances $\sigma_1^2$ and $\sigma_2^2$ are equal, though unknown.
The variances comparison:
  $H_0: \sigma_1 \leq \sigma_2$ vs $H_1: \sigma_1 > \sigma_2$
  $H_0: \sigma_1 \geq \sigma_2$ vs $H_1: \sigma_1 < \sigma_2$
  $H_0: \sigma_1 = \sigma_2$ vs $H_1: \sigma_1 \neq \sigma_2$
First define the sample means for the two populations: $\bar{X} = \sum_{i=1}^{n_1} X_i / n_1$ and $\bar{Y} = \sum_{i=1}^{n_2} Y_i / n_2$. For the case of known variances $\sigma_1^2$ and $\sigma_2^2$, the test statistic is defined by:
$$Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$

Under the null hypothesis of equal means, $Z = (\bar{X} - \bar{Y}) / \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$ follows a standard normal distribution. The test procedure for the problem $H_0: \mu_1 \leq \mu_2$ vs $H_1: \mu_1 > \mu_2$ is to reject $H_0$ if $z > z_\alpha$, and the procedure for $H_0: \mu_1 \geq \mu_2$ vs $H_1: \mu_1 < \mu_2$ is to reject $H_0$ if $z < -z_\alpha$. As expected, and on the lines of earlier intuition, the test procedure for the hypothesis problem $H_0: \mu_1 = \mu_2$ vs $H_1: \mu_1 \neq \mu_2$ is to reject $H_0$ if $|z| > z_{\alpha/2}$.
Let us now consider the case when the variances $\sigma_1^2$ and $\sigma_2^2$ are not known and are assumed (or known) to be equal. In this case, we can't use the Z-test any further and need to look at an estimator of the common variance. For this, we define the pooled variance estimator as follows:

$$S_p^2 = \frac{n_1 - 1}{n_1 + n_2 - 2} S_x^2 + \frac{n_2 - 1}{n_1 + n_2 - 2} S_y^2$$

where $S_x^2$ and $S_y^2$ are the sample variances of the two populations. Define the t-statistic as follows:

$$t = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{1/n_1 + 1/n_2}}$$

The test procedure for the set of three hypothesis testing problems is then to reject the null hypothesis if, respectively, $t > t_{n_1+n_2-2, \alpha}$, $t < -t_{n_1+n_2-2, \alpha}$, or $|t| > t_{n_1+n_2-2, \alpha/2}$.
Finally, we focus on the problem of testing the equality of variances across the two samples. Here, the test statistic is given by:

$$F = \frac{S_x^2}{S_y^2}$$

The test procedures are to reject the null hypotheses of the testing problems $H_0: \sigma_1 \leq \sigma_2$ vs $H_1: \sigma_1 > \sigma_2$, $H_0: \sigma_1 \geq \sigma_2$ vs $H_1: \sigma_1 < \sigma_2$, and $H_0: \sigma_1 = \sigma_2$ vs $H_1: \sigma_1 \neq \sigma_2$ if, respectively, $F > F_{n_1-1, n_2-1, \alpha}$, $F < F_{n_1-1, n_2-1, 1-\alpha}$, or either $F > F_{n_1-1, n_2-1, \alpha/2}$ or $F < F_{n_1-1, n_2-1, 1-\alpha/2}$.
Stascal Inference
[ 158 ]
Let us now consider some scenarios where we have the previously listed hypotheses
tesng problems.
Test examples. Let us consider some situaons when the preceding set of hypotheses arise
in a natural way.
In connuaon of the chemical experiment problem, let us assume that the
chemists have come up with a new method of obtaining the same soluon as
discussed in the previous secon. For the new technique, the standard deviaon
connues to be 0.05 and 12 observaons for the new method yield the following
measurements: 8.78, 8.85, 8.74, 8.83, 8.82, 8.79, 8.82, 8.74, 8.84, 8.78, 8.75, 8.81.
Now, this new soluon is acceptable if its mean is greater than that for the earlier
one. Thus, the hypotheses tesng problem is now 0 1
vs: :
NEW OLD NEW OLD
H H
µ µ µ µ
> .
Ross (2008), page 451. The precision of instruments in metal cutting is serious business, and the cut pieces can't be significantly shorter than the target, nor longer than it. Two machines are each used to cut 10 pieces of steel, and their measurements are respectively 122.4, 123.12, 122.51, 123.12, 122.55, 121.76, 122.31, 123.2, 122.48, 121.96, and 122.36, 121.88, 122.2, 122.88, 123.43, 122.4, 122.12, 121.78, 122.85, 123.04. The standard deviation of the length of a cut is known to be equal to 0.5. We need to test whether the average cut length is the same for the two machines.
For both the preceding problems, also assume that though the variances are equal, they are not known. Complete the hypotheses testing problems using t.test.
Freund and Wilson (2003), page 199. The monitoring of the amount of peanuts being put in jars is an important issue from a quality control viewpoint. The consistency of the weights is of prime importance, and the manufacturer has been introduced to a new machine, which is supposed to give more accuracy in the weights of the peanuts put in the jars. With the new device, 11 jars were tested and their weights found to be 8.06, 8.64, 7.97, 7.81, 7.93, 8.57, 8.39, 8.46, 8.28, 8.02, 8.39, whereas a sample of nine jars from the previous machine weighed 7.99, 8.12, 8.34, 8.17, 8.11, 8.03, 8.14, 8.14, 7.87. Now, the task is to test $H_0: \sigma_{NEW} = \sigma_{OLD}$ vs $H_1: \sigma_{NEW} \neq \sigma_{OLD}$.
Let us do the tests for the preceding four problems in R.
Time for action – testing two-sample hypotheses
For the problem of tesng hypotheses for the means arising from two populaons, we will
be using the funcons z.test and t.test.
1. As earlier, load the library(PASWR) library.
2. Carry out the Z-test using z.test and the opons x, y, sigma.x, and sigma.y:
pH_Data <- c(8.30, 8.42, 8.44, 8.32, 8.43, 8.41, 8.42,
8.46, 8.37, 8.42)
pH_New <- c(8.78, 8.85, 8.74, 8.83, 8.82, 8.79, 8.82,
8.74, 8.84, 8.78, 8.75, 8.81)
z.test(x=pH_Data,y=pH_New,sigma.x=sigma.y=0.05,alternative="less")
The p-value is very small (2.2e-16) indicang that we reject the null hypothesis
that
0:
NE
WO
LD
H
µ µ
>
.
3. For the steel length cut data problem, run the following code:
length_M1 <- c(122.4, 123.12, 122.51, 123.12, 122.55, 121.76, 122.31, 123.2, 122.48, 121.96)
length_M2 <- c(122.36, 121.88, 122.2, 122.88, 123.43, 122.4, 122.12, 121.78, 122.85, 123.04)
z.test(x=length_M1, y=length_M2, sigma.x=0.5, sigma.y=0.5)
The display of p-value = 0.8335 shows that the two machines do not cut the steel differently.
4. If the variances are equal but not known, we need to use t.test instead of z.test:
t.test(x=pH_Data, y=pH_New, alternative="less")
t.test(x=length_M1, y=length_M2)
5. The p-values for the two hypotheses problems are p-value = 3.95e-13 and p-value = 0.8397. We leave the interpretation to the reader.
6. For the fourth problem, we have the following R program:
machine_new <- c(8.06, 8.64, 7.97, 7.81, 7.93, 8.57, 8.39, 8.46, 8.28, 8.02, 8.39)
machine_old <- c(7.99, 8.12, 8.34, 8.17, 8.11, 8.03, 8.14, 8.14, 7.87)
t.test(machine_new, machine_old, alternative="greater")
Again, we have p-value = 0.1005!
Stascal Inference
[ 160 ]
What just happened?
The funcons t.test and z.test were simply extensions from the one-sample case to the
two-sample test.
Have a go hero
In the one-sample case you used var.test for the same datasets, which needed a comparison of means with some known standard deviation. Now, test for the variances in the two-sample case using var.test with appropriate hypotheses for them. For example, test whether the variances are equal for pH_Data and pH_New; a short sketch follows. Find more details of the test with ?var.test.
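A minimal sketch, assuming pH_Data and pH_New are defined as in the preceding Time for action:
var.test(pH_Data, pH_New)                          # two-sided F test for equality of variances
var.test(pH_Data, pH_New, alternative="greater")   # a one-sided version, if required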
Summary
In this chapter we have introduced "statistical inference", which in common usage consists of three parts: estimation, confidence intervals, and hypotheses testing. We began the chapter with the importance of likelihood and obtained the MLE for many of the standard probability distributions using built-in modules. Later, simply to maintain the order of concepts, we defined functions exclusively for obtaining the confidence intervals. Finally, the chapter considered important families of tests that are useful across many important stochastic experiments. In the next chapter we will introduce the linear regression model, which more formally constitutes the applied face of the subject.
6
Linear Regression Analysis
In the Visualization techniques for continuous variable data section of Chapter
3, Data Visualization, we have seen different data visualization techniques
which help in understanding the data variables (boxplots and histograms) and
their interrelationships (matrix of scatter plots). We had seen in Example 4.6.1.
Resistant line for the IO-CPU time an illustration of the resistant line, where
CPU_Time depends linearly on the No_of_IO variable. The pairs function's output in Example 3.2.9. Octane rating of gasoline blends indicated that the
mileage of a car has strong correlations with the engine-related characteristics,
such as displacement, horsepower, torque, the number of transmission speeds,
and the type of transmission being manual or automatic. Further, the mileage
of a car also strongly depends on the vehicle dimensions, such as its length,
width, and weight. The question addressed in this chapter is meant to further
these initial findings through a more appropriate model. Now, we take the next
step forward and build linear regression models for the problems. Thus, in this
chapter we will provide more concrete answers for the mileage problem.
The rst linear regression model was built by Sir Francis Galton in 1908. The word regression
implies towards-the-center. The covariates, also known as independent variables, features,
or regressors, have a regressive eect on the output, also called dependent or regressand
variable. Since the covariates are allowed, actually assumed, to aect the output in linear
increments, we call the model the linear regression model. The linear regression models
provide an answer for the correlaon between the regressand and the regressors, and as
such do not really establish causaon. As it will be seen later in the chapter, using data,
we will be able to understand the mileage of a car as a linear funcon of the car-related
dynamics. From a pure scienc point of view, the mileage should really depend on
complicated formulas of the car's speed, road condions, the climate, and so on. However,
it will be seen that linear models work just ne for the problem despite not really going
into the technical details. However, there will also be a price to pay, in the sense that most
regression models work well when the range of the variables is well dened, and that an
aempt to extrapolate the results usually does not result in sasfactory answers. We will
begin with a simple linear regression model where we have one dependent variable
and one covariate.
At the conclusion of the chapter, you will be able to build a regression model through
the following steps:
Building a linear regression model and their interpretaon
Validaon of the model assumpons
Idenfying the eect of every single observaon, covariates, as well as the output
Fixing the problem of dependent covariates
Selecon of the opmal linear regression model
The simple linear regression model
In Example 4.6.1. Resistant line for the IO-CPU time of Chapter 4, Exploratory Analysis, we built a resistant line for CPU_Time as a function of the No_of_IO processes. The results were satisfactory in the sense that the fitted line was very close to covering all the data points; refer to Figure 7 of Chapter 4, Exploratory Analysis. However, we need more statistical validation of the estimated values of the slope and intercept terms. Here we take a different approach and state the linear regression model in more technical detail.
The simple linear regression model is given by $Y = \beta_0 + \beta_1 X + \epsilon$, where X is the covariate/independent variable, Y is the regressand/dependent variable, and $\epsilon$ is the unobservable error term. The parameters of the linear model are specified by $\beta_0$ and $\beta_1$. Here $\beta_0$ is the intercept term and corresponds to the value of Y when x = 0. The slope term $\beta_1$ reflects the change in the Y value for a unit change in X. It is also common to refer to the $\beta_0$ and $\beta_1$ values as regression coefficients. To understand the regression model, we begin with n pairs of observations $(Y_1, X_1), \ldots, (Y_n, X_n)$, with each pair being completely independent of the other. We make the assumption of normal and independent and identically distributed (iid) behavior for the error term $\epsilon$, specifically $\epsilon \sim N(0, \sigma^2)$, where $\sigma^2$ is the variance of the errors. The core assumptions of the model are listed as follows:
All the observaons are independent
The regressand depends linearly on the regressors
The errors are normally distributed, that is
( )
2
0N,
ε σ
We need to nd all the unknown parameters in
01
,
ββ
and 2
σ
. Suppose we have n
independent observaons. Stascal inference for the required parameters may be carried
out using the maximum likelihood funcon as described in the Maximum likelihood esmator
secon of Chapter 5, Stascal Inference. The popular technique for the linear regression
model is the least squares method which idenes the parameters by minimizing the error
sum of squares for the model, and under the assumpons made thus far agrees with the
MLE. Let
0
β
and
1
β
be a choice of parameters. Then the residuals, the distance between the
actual points and the model predicons, made by using the proposed choice of
01
,
ββ
on the
i-th pair of observaon
( )
ii
Y,X
is dened by:
01 12
ii i
eY X,i,,...,n
ββ
=− + =

Let us now specify dierent values for the pair
( )
01
,
ββ
and visualize the residuals for them.
What happens to the arbitrary choice of parameters?
For the IO_Time dataset, the scaer plot suggests that the intercept term is about 0.05.
Further, the resistant line gives an esmate of the slope at about 0.04. We will have three
pairs for guesses for
( )
01
,
ββ
as (0.05, 0.05), (0.1, 0.04), and (0.15, 0.03). We will now plot the
data and see the dierent residuals for the three pairs of guesses.
Time for action – the arbitrary choice of parameters
1. We begin with reasonable guesses of the slope and intercept terms for a simple linear regression model. The idea is to inspect the difference between the fitted line and the actual observations. Invoke the graphics windows using par(mfrow=c(1,3)).
2. Obtain the scatter plot of CPU_Time against No_of_IO with:
plot(No_of_IO, CPU_Time, xlab="Number of Processes", ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
3. For the guessed regression line with the values of $(\beta_0, \beta_1)$ being (0.05, 0.05), plot a line on the scatter plot with abline(a=0.05, b=0.05, col="blue").
4. Define a function which will find the y value for the guess of the pair (0.05, 0.05) using myline1 <- function(x) 0.05*x+0.05.
5. Plot the error (residuals) made due to the choice of the pair (0.05, 0.05) from the actual points using the following loop, and give a title for the first pair of guesses:
for(i in 1:length(No_of_IO)){
  lines(c(No_of_IO[i], No_of_IO[i]), c(CPU_Time[i], myline1(No_of_IO[i])), col="blue", pch=10)
}
title("Residuals for the First Guess")
6. Repeat the preceding exercise for the last two pairs of guesses for the regression coefficients $(\beta_0, \beta_1)$.
The complete R program is given as follows:
par(mfrow=c(1,3))
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
     ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.05, b=0.05, col="blue")
myline1 <- function(x) 0.05*x+0.05
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
        c(IO_Time$CPU_Time[i], myline1(IO_Time$No_of_IO[i])), col="blue", pch=10)
}
title("Residuals for the First Guess")
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
     ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.1, b=0.04, col="green")
myline2 <- function(x) 0.04*x+0.1
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
        c(IO_Time$CPU_Time[i], myline2(IO_Time$No_of_IO[i])), col="green", pch=10)
}
title("Residuals for the Second Guess")
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
     ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.15, b=0.03, col="yellow")
myline3 <- function(x) 0.03*x+0.15
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
        c(IO_Time$CPU_Time[i], myline3(IO_Time$No_of_IO[i])), col="yellow", pch=10)
}
title("Residuals for the Third Guess")
Figure 1: Residuals for the three choices of regression coefficients
What just happened?
We have just executed an R program which displays the residuals for arbitrary choices of the regression parameters. The displayed result is the preceding screenshot.
In the preceding R program, we first plot CPU_Time against No_of_IO. The first choice of the line is plotted by using the abline function, where we specify the required intercept and slope through a = 0.05 and b = 0.05. From this straight line (colored blue), we need to obtain the magnitude of the error made at each original point, through vertical lines joining the points to the line. This is achieved through the for loop, where the lines function joins each point and the line.
For the pair (0.05, 0.05) as a guess of $(\beta_0, \beta_1)$, we see that there is a progression in the residual values as x increases, and it is the other way around for the guess of (0.15, 0.03). In either case, we are making large mistakes (residuals) for certain x values. The middle plot for the guess (0.1, 0.04) does not seem to have large residual values. This choice may be better than the other two choices. Thus, we need to define a criterion which enables us to find the best values of $(\beta_0, \beta_1)$ in some sense. The criterion is to minimize the sum of squared errors:

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} e_i^2$$

where:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left\{ y_i - (\beta_0 + \beta_1 x_i) \right\}^2$$
Here, the summaon is over all the observed pairs
( )
12
ii
y,x,i,,...,n=
. The technique of
minimizing the error sum of squares is known as the method of least squares, and for the
simple linear regression model, the values of
( )
01
,
ββ
which meet the criterion are given by:
1 0 1
xy
xx
S
ˆ ˆ ˆ
, y x
S
β β β
= = −
Where:
1 1
and
n n
ii
ii
x y
x , y
n n
− −
∑ ∑
= =
And:
( ) ( )
2 2
1 1
and
n n
xx i xx i i
i i
S x x, S y xx
= =
= − =
∑ ∑
We will now learn how to use R for building a simple linear regression model.
Building a simple linear regression model
We will use the R funcon lm for the required construcon. The lm funcon creates
an object of class lm, which consists of an ensemble of the ed regression model.
Through the following exercise, you will learn the following:
The basic construcon of an lm object
The criteria which signies the model signicance
The criteria which signies the variable signicance
The variaon of the output explained by the inputs
The relaonship is specied by a formula in R, and the details related to the generic form
may be obtained by entering ?formula in the R console. That is, the lm funcon accepts
a formula object for the model that we are aempng to build. data.frame may also be
explicitly specied, which consists of the required data. We need to model CPU_Time as a
funcon of No_of_IO, and this is carried out by specifying CPU_Time ~ No_of_IO.
The funcon lm is wrapped around the formula to obtain our rst linear regression model.
Time for action – building a simple linear regression model
We will build the simple linear regression model using the lm funcon with its
useful arguments.
1. Create a simple linear regression model for CPU_Time as a funcon of No_of_IO
by keying in IO_lm <- lm(CPU_Time ~ No_of_IO, data=IO_Time).
2. Verify that IO_lm is of the lm class with class(IO_lm).
3. Find the details of the ed regression model using the summary funcon:
summary(IO_lm).
The output is given in the following screenshot:
Figure 2: Building the first simple linear regression model
The rst queson you should ask yourself is, "Is the model signicant overall?".
The answer is provided by the p-value of the F-stasc for the overall model. This appears
in the nal line of summary(IO_lm). If the p-value is closer to 0, it implies that the model is
useful. A rule of thumb for the signicance of the model is that it should be less than 0.05.
The general rule is that if you need the model signicance at a certain percentage, say P,
then the p-value of the F-stasc should be lesser than (1-P/100).
Now that we know that the model is useful, we can ask whether the independent variable
as well as the intercept term, are signicant or not. The answer to this queson is provided
by Pr(>|t|) for the variables in the summary. R has a general way of displaying the highest
signicance level of a term by using ***, **, * and . in the Signif. codes:. This display may be
easily compared with the review of a movie or a book! Just as with general rangs, where
more stars indicate a beer product, in our context, the higher the number of stars indicate
the variables are more signicant for the built model. In our linear model, we nd No_of_IO
to be highly signicant. The esmate value of No_of_IO is given as 0.04076. This coecient
has the interpretaon that for a unit increase in the number of IOs; CPU_Time is expected to
increase by 0.04076.
Now that we know that the model, as well as the independent variable, are significant, we need to know how much of the variability in CPU_Time is explained by No_of_IO. The answer to this question is provided by the measure R², not to be confused with the letter R for the software, which when multiplied by 100 gives the percentage of variation in the regressand explained by the regressor. The term R² is also called the coefficient of determination. In our example, 98.76 percent of the variation in CPU_Time is explained by No_of_IO; see the value associated with Multiple R-squared in summary(IO_lm). The R² measure does not consider the number of parameters estimated or the number of observations n in a model. A more robust measure, which takes into consideration the number of parameters and observations, is provided by Adjusted R-squared, which is 98.6 percent.
We have thus far not commented on the first numerical display produced by the summary function. This relates to the residuals, and the display is a basic summary of the residual values. The residuals vary from -0.016509 to 0.024006, which are not very large in comparison with the CPU_Time values; check with summary(CPU_Time), for instance. Also, the median of the residual values is very close to zero, which is an important criterion as the median of the standard normal distribution is 0.
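As a side note, the quantities discussed above can be pulled out of the summary object directly; the following is a small sketch, assuming IO_lm has been fitted as shown earlier.
IO_sum <- summary(IO_lm)
IO_sum$coefficients      # estimates, standard errors, t values, and Pr(>|t|)
IO_sum$r.squared         # Multiple R-squared
IO_sum$adj.r.squared     # Adjusted R-squared
IO_sum$fstatistic        # F-statistic and its degrees of freedom
summary(resid(IO_lm))    # the residual summary printed at the top of the output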
What just happened?
You have ed a simple linear regression model where the independent variable is No_of_
IO and the dependent variable (output) is CPU_Time. The important quanes to look for
the model signicance, the regression coecients, and so on have been clearly illustrated.
Have a go hero
Load the dataset anscombe from the datasets package. The anscombe dataset has four pairs of datasets in x1, x2, x3, x4, y1, y2, y3, and y4. Fit a simple regression model for each of the four pairs and obtain the summary for each pair; a starting sketch is given below. Make your comments on the summaries. Pay careful attention to the details of the summary function. If you need further help, simply try out example(anscombe).
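One convenient, though by no means the only, way of fitting the four models is sketched here:
data(anscombe)
anscombe_fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})
lapply(anscombe_fits, summary)   # compare the four summaries carefully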
We will next look at the ANOVA (Analysis of Variance) method for the regression model, and also obtain the confidence intervals for the model parameters.
ANOVA and the condence intervals
The summary funcon of the lm object species the p-value for each variable in the model,
including the intercept term. Technically, the hypothesis problem is tesng 00 0
1
j
j
H: ,j ,
β
==
against the corresponding alternave hypothesis,
10 0 1
j
j
H: ,j ,
β
≠=
. This tesng problem
is technically dierent from the simultaneous hypothesis tesng
00 10:H
ββ
==
against the
alternave that at least one of the regression coecients is dierent from 0. The ANOVA
technique gives the answer to the laer null hypothesis of interest.
For more details about the ANOVA technique, you may refer to http://en.wikipedia.org/wiki/Analysis_of_variance. Using the anova function, it is very simple in R to obtain the ANOVA table for a linear regression model. Let us apply it to our IO_lm linear regression model.
Time for action – ANOVA and the condence intervals
The R funcons anova and confint respecvely help obtain the ANOVA table
and condence intervals from the lm objects. Here, we use them for the IO_lm
regression object.
1. Use the anova funcon on the IO_lm lm object to obtain the ANOVA table by using
IO_anova <- anova(IO_lm).
2. Display the ANOVA table by keying in IO_anova in the console.
3. The 95 percent condence intervals for the intercept and the No_of_IO variable
is obtained by confint(IO_lm).
The output in R is as follows:
Figure 3: ANOVA and the confidence intervals for the simple linear regression model
The ANOVA table conrms that the variable No_of_IO is signicant indeed. Note
the dierence of the criteria for conrming this with respect to summary(IO_lm).
In the former case, the signicance was arrived at using the t-stascs and here we have
used the F-stasc. Precisely, we check for the variance signicance of the input variable.
We now give the tool for obtaining condence intervals.
Check whether or not the esmated value of the parameters fall within the 95 percent
condence intervals. The preceding results show that we indeed have a very good linear
regression model. However, we also made a host of assumpons in the beginning of the
secon, and a good pracce is to ask how valid they are in the experiment. We next consider
the problem of validaon of the assumpons.
What just happened?
The ANOVA table is a very fundamental block for a regression model, and it gives the split of the sum of squares between the variable(s) and the error term. The difference between the ANOVA and the summary of the linear model object lies in the respective p-values reported by them as Pr(>F) and Pr(>|t|). You also found a method for obtaining the confidence intervals for the regression coefficients.
Model validation
The violaons of the assumpons may arise in more than one way. Taar, et. al. (2012),
Kutner, et. al. (2005) discusses the numerous ways in which the assumpons are violated
and an adapon of the methods menoned in these books is now considered:
The regression funcon is not linear. In this case, we expect the residuals to have
a paern which is not linear when viewed against the regressors. Thus, a plot of the
residuals against the regressors is expected to indicate if this assumpon is violated.
The error terms do not have constant variance. Note that we made an assumpon
stang that
( )
2
0N,
ε∼ σ
, that is the magnitude of errors do not depend on the
corresponding x or y value. Thus, we expect the plot of the residuals against the
predicted y values to reveal if this assumpon is violated.
The error terms are not independent. A plot of the residuals against the serial
number of the observaons indicated if the error terms are independent or not.
We typically expect this plot to exhibit a random walk if the errors are
independent. If any systemac paern is observed, we conclude that the errors
are not independent.
The model ts all but one or a few outlier observaons. Outliers are a huge concern
in any analycal study as even a single outlier has a tendency to destabilize the
enre model. A simple boxplot of the residuals indicates the presence of an outlier.
If any outlier is present, such observaons need to be removed and the model
needs to be rebuilt. The current step of model validaon needs to be repeated for
the rebuilt model. In fact the process needs to be iterated unl there are no more
outliers. However, we need to cauon the reader that if the subject experts feel
that such outliers are indeed expected values, it may convey that some appropriate
variables are missing in the regression model.
The error terms are not normally distributed. This is one of the most crucial assumptions of the linear regression model. The violation of this assumption is verified using the normal probability plot, in which the expected values of the residuals (obtained from the normal cumulative probabilities) are plotted against the observed residuals. If the values fall along a straight line, the normality assumption for the errors holds true. The model is to be rejected if this assumption is violated.
The next section shows you how to obtain the residual plots for the purpose of model validation.
Time for action – residual plots for model validation
The R funcons resid and fitted can be used to extract residuals and ed values from
an lm object.
1. Find the residuals of the ed regression model using the resid funcon: IO_lm_
resid <- resid(IO_lm).
2. We need six plots, and hence we invoke the graphics editor with par(mfrow =
c(3,2)).
3. Sketch the plot of residuals against the predictor variable with plot(No_of_IO,
IO_lm_resid).
4. To check whether the regression model is linear or not, obtain the plots of absolute
residual values against the predictor variable and also that of squared residual
values against the predictor variable respecvely with plot(No_of_IO, abs(IO_
lm_resid),…) and plot(No_of_IO, IO_lm_resid^2,…).
5. The assumpon that errors have constant variance may be veried by the plot of
residuals against the ed values of the regressand. The required plot is obtained by
using plot(IO_lm$fitted.values,IO_lm_resid).
6. The assumpon that the errors are independent of each other may be veried
plong the residuals against their index numbers: plot.ts(IO_lm_resid).
7. The presence of outliers is investigated by the boxplot of the residuals: boxplot(IO_lm_resid).
8. Finally, the assumption of normality for the error terms is verified through the normal probability plot. This plot goes on a new graphics page.
The complete R program is as follows:
IO_lm_resid <- resid(IO_lm)
par(mfrow=c(3,2))
plot(No_of_IO, IO_lm_resid, main="Plot of Residuals Vs Predictor Variable",
     ylab="Residuals", xlab="Predictor Variable")
plot(No_of_IO, abs(IO_lm_resid), main="Plot of Absolute Residual Values Vs Predictor Variable",
     ylab="Absolute Residuals", xlab="Predictor Variable")
# Equivalently
plot(No_of_IO, IO_lm_resid^2, main="Plot of Squared Residual Values Vs Predictor Variable",
     ylab="Squared Residuals", xlab="Predictor Variable")
plot(IO_lm$fitted.values, IO_lm_resid, main="Plot of Residuals Vs Fitted Values",
     ylab="Residuals", xlab="Fitted Values")
plot.ts(IO_lm_resid, main="Sequence Plot of the Residuals")
boxplot(IO_lm_resid, main="Box Plot of the Residuals")
rpanova <- anova(IO_lm)
IO_lm_resid_rank <- rank(IO_lm_resid)
tc_mse <- rpanova$Mean[2]
IO_lm_resid_expected <- sqrt(tc_mse)*qnorm((IO_lm_resid_rank-0.375)/(length(CPU_Time)+0.25))
plot(IO_lm_resid_expected, IO_lm_resid, xlab="Expected", ylab="Residuals",
     main="The Normal Probability Plot")
abline(0,1)
The resulng plot for the model validaon plot is given next. If you run the preceding R
program up to the rpanova code, you will nd the plot similar to the following:
Figure 4: Checking for violations of assumptions of IO_lm
We have used the resid funcon to extract the residuals out of the lm object. The rst plot
of residuals against the predictor variable No_of_IO shows that more the number of IO
processes, the larger is the residual value, as is also conrmed by Plot of Absolute Residual
Values Vs Predictor Variable and Plot of Squared Residual Values Vs Predictor Variable.
However, there is no clear non-linear paern suggested here. The Plot of Residuals Vs
Fied Values is similar to the rst plot of residuals against the predictor. The me series plot
of residuals does not indicate a strict determinisc trend and appears a bit similar to the
random walk. Thus, these plots do not give any evidence of any kind of dependence among
the observaons. The boxplot does not indicate any presence of an outlier.
The normal probability plot for the residuals is given next:
Figure 5: The normal probability plot for IO_lm
As all the points fall close to the straight line, the normality assumption for the errors does not appear to be violated.
What just happened?
The R program given earlier produces various residual plots, which help in validation of the model. It is important that these plots are always checked whenever a linear regression model is built. For CPU_Time as a function of No_of_IO, the linear regression model is a fairly good model.
Have a go hero
From a theorecal perspecve and my own experience, the seven plots obtained earlier
were found to be very useful. However, R, by default, also gives a very useful set of residual
plots for an lm object. For example, plot(my_lm) generates a powerful set of model
validaon plots. Explore the same for IO_lm with plot(IO_lm). You can explore more
about plot and lm with the plot.lm funcon.
We will next consider the general mulple linear regression model for the Gasoline problem
considered in the earlier chapters.
Multiple linear regression model
In the The simple linear regression model secon, we considered an almost (un)realisc
problem of having only one predictor. We need to extend the model for the praccal
problems when one has more than a single predictor. In Example 3.2.9. Octane rang
of gasoline blends we had a graphical study of mileage as a funcon of various vehicle
variables. In this secon, we will build a mulple linear regression model for the mileage.
If we have X1, X2, …, Xp independent set of variables which have a linear eect on the
dependent variable Y, the mulple linear regression model is given by:
01122pp
Y X X ... X
ββ β β ε
=+ + + ++ +
This model is similar to the simple linear regression model, and we have the same interpretation as earlier. Here, we have additional independent variables in X2, ..., Xp, and their effect on the regressand Y is respectively through the additional regression parameters $\beta_2, \ldots, \beta_p$. Now, suppose we have n pairs of random observations $(Y_1, X_1), \ldots, (Y_n, X_n)$ for understanding the multiple linear regression model, where $X_i = (X_{i1}, \ldots, X_{ip}), i = 1, \ldots, n$. A matrix form representation of the multiple linear regression model is useful in understanding the estimator of the vector of regression coefficients. We define the following quantities:

$$Y = (Y_1, \ldots, Y_n)', \quad \epsilon = (\epsilon_1, \ldots, \epsilon_n)', \quad \beta = (\beta_0, \beta_1, \ldots, \beta_p)', \quad X = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}$$
The mulple linear regression model for n observaons can be wrien in a compact matrix
form as:
YX'
βε
= +
The least squares esmate of
β
is given by:
( )
1
ˆXX X'Y
β
=
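To connect this formula with what lm does internally, here is a small sketch that computes the least squares estimate directly for the IO_Time data used earlier; it assumes IO_Time and IO_lm are available from the previous sections.
X <- cbind(1, IO_Time$No_of_IO)            # design matrix with an intercept column
y <- IO_Time$CPU_Time
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
beta_hat
coef(IO_lm)                                # should match the lm() estimates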
Let us t a mulple linear regression model for the Gasoline mileage data considered earlier.
Averaging k simple linear regression models or a multiple linear
regression model
We already know how to build a simple linear regression model. Why should we learn another theory when an extension is possible in a certain manner? Intuitively, we can build k models, each consisting of the kth variable, and then simply average over the k models. Such an averaging can also be considered for the univariate case too. Here, we may divide the covariate over distinct intervals, then build a simple linear regression model over each interval, and finally average over the different models. Montgomery, et. al. (2001), pages 120-122, highlights the drawback of such an approach. Typically, the simple linear regression models may indicate the wrong sign for the regression coefficients. The wrong sign from such a naïve approach may arise for multiple reasons: the range of some regressors has been restricted, critical regressors may have been omitted from the model building, or some computational errors may have crept in.
To drive home the point, we consider an example from Montgomery, et. al.
Time for action – averaging k simple linear regression models
We will build three models here. We have a vector of regressand y and two covariates, x1 and x2.
1. Enter the dependent variable and the independent variables with y <- c(1,5,3,8,5,3,10,7), x1 <- c(2,4,5,6,8,10,11,13), and x2 <- c(1,2,2,4,4,4,6,6).
2. Visualize the relationships among the variables with:
par(mfrow=c(1,3))
plot(x1,y)
plot(x2,y)
plot(x1,x2)
3. Build the individual simple linear regression models and our first multiple regression model with:
summary(lm(y~x1))
summary(lm(y~x2))
summary(lm(y~x1+x2)) # Our first multiple regression model
Figure 6: Averaging k simple linear regression models
What just happened?
The visual plot (the preceding screenshot) indicates that both x1 and x2 have a positive impact on y, and this is also captured in lm(y~x1) and lm(y~x2); see the R output display. We have omitted the scatter plot, though you should be able to see it on your screen after running the R code of step 2 above. However, both models are built under the assumption that the information contained in x1 or x2 alone is complete. The variables are also seen to have a significant effect on the output. However, metrics such as Multiple R-squared and Adjusted R-squared are very poor for both simple linear regression models. This is one indication that we need to bring in more information, and thus we include both variables and build our first multiple linear regression model; see the next section for more details. There are two important changes worth registering now. First, the sign of the regression coefficient of x1 becomes negative, which contradicts the intuition. The second observation is the great increase in the R-squared metric value.
To summarize our observations here, it suffices to say that the sum of the parts may sometimes fall way short of the entire picture.
Building a multiple linear regression model
The R funcon lm remains the same as earlier. We will connue Example 3.2.9. Octane rang
of gasoline blends from the Visualizaon techniques for connuous variable data secon of
Chapter 3, Data Visualizaon. Recall that the variables, independent and dependent as well,
are stored in the dataset Gasoline in the RSADBE package. Now, we tell R that y, which is
the mileage, is the dependent variable, and we need to build a mulple linear regression
model which includes all other variables of the Gasoline object. Thus, the formula is
specied by y~. indicang that all other variables from the Gasoline object need to be
treated as the independent variables. We proceed as earlier to obtain the summary of the
ed mulple linear regression model.
Time for action – building a multiple linear regression model
The method of building a mulple linear regression model remains the same as earlier.
If all the variables in data.frame are to be used, we use the formula y ~ .. However,
if we need specic variables, say x1 and x3, the formula would be y ~ x1 + x3.
1. Build the mulple linear regression model with gasoline_lm <- lm(y~.,
data=Gasoline). Here, the formula y~. considers the variable y as the
dependent variable and all the remaining variables in the Gasoline data
frame as independent variables.
2. Get the details of the ed mulple linear regression model with
summary(gasoline_lm).
The R screen then appears as follows:
Figure 7: Building the multiple linear regression model
As with the simple model, we need to first check whether the overall model is significant by looking at the p-value of the F-statistic, which appears in the last line of the summary output. Here, the value 0.0003 being very close to zero, the overall model is significant. Of the 11 variables specified for modeling, only x1 and x3, that is, the engine displacement and torque, are found to have a meaningful linear effect on the mileage. The estimated regression coefficient values indicate that the engine displacement has a negative impact on the mileage, whereas the torque impacts it positively. These results are in agreement with the basic science of vehicle mileage.
We have a tricky output for the eleventh independent variable, which for some strange reason R has renamed as x11M. We need to explain this. You should verify the output of running sapply(Gasoline, class) on the console. Now, the x11 variable is a factor variable assuming two possible values, A and M, which stand for the transmission box being Automatic or Manual. As categorical variables are of a special nature, they need to be handled differently. The user may be tempted to skip this, as the variable is seen to be insignificant in this case. However, the interpretation is very useful and the "skip" may prove really expensive later. For computational purposes, an m-level factor variable is used to create m-1 new variables. With R's default coding, one level acts as the baseline: if the variable assumes one of the remaining m-1 levels, the corresponding new variable takes the value 1, else 0, and if the variable assumes the baseline level, all the (m-1) new variables take the value 0. Now, R names each such vector by concatenating the variable name and the factor level it represents. Hence, we have x11M as the variable name in the output; a small sketch of this coding is given below. Here, we found the factor variable to be insignificant. If in certain experiments we find some factor levels to be significant at a certain p-value, we can't ignore the other factor levels even if their p-values suggest them as insignificant.
What just happened?
The building of a mulple linear regression model is a straighorward extension of the
simple linear regression model. The interpretaon is where one has to be more careful
with the mulple linear regression model.
We will now look at the ANOVA and condence intervals for the mulple linear regression
model. It is to be noted that the usage is not dierent from the simple linear regression
model, as we are sll dealing with the lm object.
The ANOVA and condence intervals for the multiple linear
regression model
Again, we use the anova and confint funcons to obtain the required results. Here,
the null hypothesis of interest is whether all the regression coecients equal 0, that is
00 10
p
H:
ββ β
==
==
against the alternave that at least one of the regression
coecients is dierent from 0, that is 10
0H:
β
for at least one j = 0, 1, …, p.
Time for action – the ANOVA and condence intervals for the
multiple linear regression model
The use of anova and confint extend in a similar way as lm is used for simple and mulple
linear regression models.
1. The ANOVA table for the mulple regression model is obtained in the same way as
for the simple regression model, aer all we are dealing with the object of class lm:
gasoline_anova<-anova(gasoline_lm).
2. The condence intervals for the independent variables are obtained by using
confint(gasoline_lm).
The R output is given as follows:
Figure 8: The ANOVA and confidence intervals for the multiple linear regression model
Note the dierence between the anova and summary results. Now, we nd only the rst
variable to be signicant. The interpretaon of the condence intervals is le to you.
What just happened?
The extension from simple to mulple linear regression model in R, especially for the ANOVA
and condence intervals, is really straighorward.
Have a go hero
Using the ANOVA table in the preceding screenshot and the summary of gasoline_lm in the screenshot given in step 2 of the Time for action – building a multiple linear regression model section, build linear regression models using the significant variables only. Are you amused?
Useful residual plots
In the context of mulple linear regression models, modicaons of the residuals have been
found to be more useful than the residuals themselves. We have assumed the residuals to
follow a normal distribuon with mean zero and unknown variance. An esmator of the
unknown variance is provided by the mean residual sum of squares. There are four useful
types of residuals for the current model:
Standardized Residuals: We know that the residuals have zero mean. Thus,
the standardized residuals are obtained by scaling the residuals with the esmator
of the standard deviaon, that is the square root of the mean residual sum of
squares. The standardized residuals are dened by:
i
i
Re s
e
dMS
=
Here,
( )
2
1
n
Re s i i
MS e/np
=
=∑
, and p is the number of covariates in the model.
The residual is expected to have mean 0 and Re s
MS
is an esmate of its variance.
Hence, we expect the standardized residuals to have a standard normal distribuon.
This in turn helps us to verify whether the normality assumpon for the residuals is
meaningful or not.
Semi-studenzed Residuals: The semi-studenzed residuals are dened by:
( )
1
1
i
i
Re s ii
e
r ,i,...,n
MS h
= =
Here, $h_{ii}$ is the ith diagonal element of the matrix $H = X(X'X)^{-1}X'$. The variance of a residual depends on the covariate value, and hence a flat scaling by $\sqrt{MS_{Res}}$ is not appropriate. A correction is provided by $(1 - h_{ii})$, and $MS_{Res}(1 - h_{ii})$ turns out to be an estimate of the variance of $e_i$. This is the motivation for the semi-studentized residual $r_i$.
PRESS residuals: The predicted residual, PRESS, for observation i is the difference between the actual value $y_i$ and the value predicted for it by a regression model based on the remaining (n-1) observations. Now let $\hat{\beta}_{(i)}$ be the estimator of the regression coefficients based on the (n-1) observations (not including the ith observation). Then, the PRESS residual for observation i is given by:

$$e_{(i)} = y_i - x_i' \hat{\beta}_{(i)}, \quad i = 1, \ldots, n$$

Here, the idea is that the estimate of the residual for an observation is more appropriate when obtained from a model which is not influenced by its own value.
R-student residuals: This residual is especially useful for the detection of outliers.

$$t_i = \frac{e_i}{\sqrt{MS_{Res(i)}(1 - h_{ii})}}, \quad i = 1, \ldots, n$$

Here, $MS_{Res(i)}$ is an estimator of the variance $\sigma^2$ based on the remaining (n-1) observations. The scaling change is on similar lines as with the studentized residuals.
The task of building n linear models may look daunting! However, there are very useful formulas in statistics and functions in R which save the day for us. It is appropriate that we use those functions and develop the residual plots for the Gasoline dataset. Let us set ourselves up for some action.
Time for action – residual plots for the multiple linear
regression model
R funcons resid, hatvalues, rstandard, and rstudent are available, which can be
applied on an lm object to obtain the required residuals.
1. Get the MSE of the regression model with gasoline_lm_mse <- gasoline_anova$Mean[length(gasoline_anova$Mean)].
2. Extract the residuals with the resid function, and standardize the residuals using stan_resid_gasoline <- resid(gasoline_lm)/sqrt(gasoline_lm_mse).
3. To obtain the semi-studentized residuals, we first need to get the h_ii elements, which are obtainable using the hatvalues function: hatvalues(gasoline_lm). The remaining code is given at the end of this list.
4. The PRESS residuals are calculated using the rstandard function available in R.
5. The R-student residuals can be obtained using the rstudent function in R.
The detailed code is as follows:
# Useful residual plots
gasoline_lm_mse <- gasoline_anova$Mean[length(gasoline_anova$Mean)]
stan_resid_gasoline <- resid(gasoline_lm)/sqrt(gasoline_lm_mse)
# Standardizing the residuals
studentized_resid_gasoline <- resid(gasoline_lm)/(sqrt(gasoline_lm_mse*(1-hatvalues(gasoline_lm))))
# Studentizing the residuals
pred_resid_gasoline <- rstandard(gasoline_lm)
pred_student_resid_gasoline <- rstudent(gasoline_lm)
# returns the R-student prediction residuals
gasoline_fitted <- fitted(gasoline_lm)   # fitted values, needed for the plots below
par(mfrow=c(2,2))
plot(gasoline_fitted, stan_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("Standardized Residual Plot")
plot(gasoline_fitted, studentized_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("Studentized Residual Plot")
plot(gasoline_fitted, pred_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("PRESS Plot")
plot(gasoline_fitted, pred_student_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("R-Student Residual Plot")
All four residual plots in the following screenshot look identical, though there is a difference in their y-scaling. It is apparent from the residual plots that there are no patterns which show the presence of non-linearity, that is, the linearity assumption appears valid. In the standardized residual plot, all the observations are well within -3 and 3. Thus, it is correct to say that there are no outliers in the dataset.
Figure 9: Residual plots for the multiple linear regression model
What just happened?
Using the resid, rstudent, rstandard, and other functions, we have obtained useful residual plots for the multiple linear regression model.
Regression diagnostics
In the Useful residual plots subsecon, we saw how outliers may be idened using
the residual plots. If there are outliers, we need to ask the following quesons:
Is the observaon an outlier due to an anomalous value in one or more
covariate values?
Is the observaon an outlier due to an extreme output value?
Is the observaon an outlier because of both the covariate and output values being
extreme values?
The disncon in the nature of an outlier is vital as one needs to be sure of its type. The
techniques for outlier idencaon are certainly dierent as is their impact. If the outlier
is due to the covariate value, the observaon is called a leverage point, and if it is due to
the y value, we call it an inuenal point. The rest of the secon is for the exact stascal
technique for such an outlier idencaon.
Leverage points
As noted, a leverage point has an anomalous x value. The leverage points may be theoretically proved not to impact the estimates of the regression coefficients. However, these points are known to drastically affect the $R^2$ value. The question then is, how do we identify such points? The answer is by looking at the diagonal elements of the hat matrix $H = X(X'X)^{-1}X'$. Note that the matrix H is of the order $n \times n$. The (i, i) element of the hat matrix, $h_{ii}$, may be interpreted as the amount of leverage of observation i on the fitted value $\hat{y}_i$. The average size of a leverage is $\bar{h} = p/n$, where p is the number of covariates and n is the number of observations. It is better to leave out an observation if its leverage value is greater than twice $p/n$, and we then conclude that the observation is a leverage point.
Let us go back to the Gasoline problem and see the leverage of all the observations. In R, we have a function, hatvalues, which readily extracts the diagonal elements of H; a short sketch follows. The R output is given in the next screenshot.
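A minimal sketch of the computation, assuming gasoline_lm is the model fitted earlier:
lev <- hatvalues(gasoline_lm)        # diagonal elements of the hat matrix
n <- length(lev)
p <- length(coef(gasoline_lm)) - 1   # number of covariates, as counted in the text
which(lev > 2 * p / n)               # observations flagged as leverage points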
Clearly, we have 10 observaons which are leverage points. This is indeed a maer of
concern as we have only about 25 observaons. Thus, the results of the linear model need
to be interpreted with cauon! Let us now idenfy the inuenal points for the Gasoline
linear model.
Inuential points
An inuenal point has a tendency to pull the regression line (plane) towards its direcon
and hence, they drascally aect the values of the regression coecients. We want to
idenfy the impact of an observaon on the regression coecients, and one approach is
to consider how much the regression coecient values change if the observaon was not
considered. The relevant mathemacs for idencaon of inuenal points is beyond the
scope of the book, so we simply help ourselves with the metric Cook's distance which nds
the inuenal points. The R funcon, cooks.distance, returns the values of the Cook's
distance for each observaon, and the thumb rule is that if the distance is greater than 1,
the observaon is an inuenal point. Let us use the R funcon and idenfy the inuenal
points for the Gasoline problem.
Figure 10: Leverage and influential points of the gasoline_lm fitted model
For this dataset, we have only one inuenal point in Eldorado. The plot of Cook's distance
against the observaon numbers and that of Cook's distance against the leverages may be
easily obtained with plot(gasoline_lm,which=c(4,6)).
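A minimal sketch, assuming gasoline_lm is available:
cd <- cooks.distance(gasoline_lm)
sort(cd, decreasing = TRUE)[1:3]     # the largest Cook's distances
which(cd > 1)                        # thumb rule: distance greater than 1
plot(gasoline_lm, which = c(4, 6))   # the Cook's distance plots mentioned above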
DFFITS and DFBETAS
Belsley, Kuh, and Welsch (1980) proposed two additional metrics for finding the influential points. The DFBETAS metric indicates the change in the regression coefficients (in standard deviation units) if the ith observation is removed. Similarly, DFFITS is a metric which gives the impact on the fitted value $\hat{y}_i$. The rule which indicates the presence of an influential point using DFFITS is $|DFFITS_i| > 2\sqrt{p/n}$, where p is the number of covariates and n is the number of observations. Finally, an observation i is influential for regression coefficient j if $|DFBETAS_{j,i}| > 2/\sqrt{n}$.
Figure 11: DFFITS and DFBETAS for the Gasoline problem
We have given the DFFITS and DFBETAS values for the Gasoline dataset. It is left as an exercise to the reader to identify the influential points from the outputs given above; a starting sketch is given below.
The multicollinearity problem
One of the important assumpons of the mulple linear regression model is that the
covariates are linearly independent. The linear independence here is the sense of Linear
Algebra that a vector (covariate in our context) cannot be expressed as a linear combinaon
of others. Mathemacally, this assumpon translates into an implicaon that
( )
1
X'X is
nonsingular, or that its determinant is non-zero. If this is not the case then we have one or
more of the following problems:
The esmated will be unreliable, and there is a great chance of the regression
coecients having the wrong sign
The relevant signicant factors will not be idened by either the t-tests or the
F-tests
The importance of certain predictors will be undermined
Let us rst obtain the correlaon matrix for the predictors of the Gasoline dataset.
We will exclude the nal covariate, as it is factor variable.
Figure 12: The correlation matrix of the Gasoline covariates
We can see that covariate x1 is strongly correlated with all the other predictors except x4. Similarly, x8 to x10 are also strongly correlated. This is a strong indication of the presence of the multicollinearity problem.
Dene
( )
1
CX'X
=. Then it can be proved that, refer Montgomery, et. al. (2003), the
jth diagonal element of
C
can be wrien as
( )
1
2
1
jj j
C R
=− , where
2
j
R
is the coecient
of determinaon obtained by regressing all other covariates for xj as the output. Now,
the variable xj is independent of all the other covariates; we expect the coecient of
determinaon to be zero, and hence jj
C
to be closer to unity. However, if the covariate
depends on the others, we expect the coecient of determinaon to be a high value, and
hence the jj
C
to be a large number. The quanty jj
C
is also called the variance inaon
factor, denoted by VIFj. A general guideline for a covariate to be linearly independent of
other covariates is that its VIFj should be lesser than 5 or 10.
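A small sketch of the relation for the covariate x1, assuming the Gasoline data is loaded; the value should agree with the VIF of x1 reported by the vif function used next.
r2_x1 <- summary(lm(x1 ~ . - y - x11, data = Gasoline))$r.squared   # R^2 of x1 regressed on the other numeric covariates
1 / (1 - r2_x1)                                                     # the variance inflation factor of x1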
Time for action – addressing the multicollinearity problem for
the Gasoline data
The mulcollinearity problem is addressed using the vif funcon, which is available from
two libraries: car and faraway. We will use it from the faraway package.
1. Load the faraway package with library(faraway).
2. We need to nd the variance inaon factor (VIF) of the independent variables
only. The covariate x11 is a character variable, and the rst column of the Gasoline
dataset is the regressand. Hence, run vif(Gasoline[,-c(1,12)]) to nd the VIF
of the eligible independent variables.
3. The VIF for x3 is highest at 217.587. Hence, we remove it to nd the VIF among the
remaining variables with vif(Gasoline[,-c(1,4,12)]). Remember that x3 is
the fourth column in the Gasoline data frame.
4. In the previous step, we nd x10 having the maximum VIF at 77.810. Now, run
vif(Gasoline[,-c(1,4,11,12)]) to nd if all VIFs are less than 10.
5. For the rst variable x1 the VIF is 31.956, and we now remove it with
vif(Gasoline[,-c(1,2,4,11,12)]).
6. At the end of the previous step, we have the VIF of x1 at 10.383. Thus, run
vif(Gasoline[,-c(1,2,3,4,11,12)]).
7. Now, all the independent variables have VIF lesser than 10. Hence, we stop
at this step.
8. Removing all the independent variables with VIF greater than 10, we arrive at the
nal model, summary(lm(y~x4+x5+x6+x7+x8+x9,data= Gasoline)).
Figure 13: Addressing the multicollinearity problem for the Gasoline dataset
What just happened?
We used the vif funcon from the faraway package to overcome the problem of
mulcollinearity in the mulple linear regression model. This helped to reduce the
number of independent variables from 10 to 6, which is a huge 40 percent reducon.
The funcon vif from the faraway package is applied to the set of covariates. Indeed,
there is another funcon of the same name from the car package which can be directly
applied on an lm object.
Model selection
The method of removing covariates in the The multicollinearity problem section depended solely on the covariates themselves. However, it happens more often that the covariates in the final model are selected with respect to the output. Computational cost is almost a non-issue these days, especially for not-so-very-large datasets! The question that arises then is, can one retain all possible covariates in the model, or do we have a choice of covariates which meets a certain regression metric, say R² > 60 percent? The problem is that having more covariates increases the variance of the model, while having fewer of them leads to a larger bias. The philosophical Occam's razor principle applies here, and the best model is the simplest model. In our context, the smallest model which fits the data is the best. There are two types of model selection procedures: stepwise procedures and criterion-based procedures. In this section, we will consider both.
Stepwise procedures
There are three methods of selecng covariates for inclusion in the nal model: backward
eliminaon, forward selecon, and stepwise regression. We will rst describe the backward
eliminaon approach and develop the R funcon for it.
The backward elimination
In this model, one rst begins with all the available covariates. Suppose we wish to retain all
covariates for whom the p-value is at the most
α
. The value
α
is referred to as crical alpha.
Now, we rst eliminate that covariate whose p-value is maximum among all the covariates
having p-value greater than
α
. The model is reed for the current covariates. We connue
the process unl we have all the covariates whose p-value is less than
α
. In summary, the
backward eliminaon algorithm is as explained next:
1. Consider all the available covariates.
2. Remove the covariate with the maximum p-value among all the covariates which have a p-value greater than α.
3. Refit the model and repeat the preceding step.
4. Continue the process until all p-values are less than α.
Typically, the user investigates the p-values in the summary output and then carries out the preceding algorithm. Tattar, et. al. (2013) gives a function which executes the entire algorithm right away, and we adapt the same function here and apply it to the linear regression model gasoline_lm.
The forward selection
In the previous procedure we started with all the covariates. Here, we begin with an empty model and look for the most significant covariate with a p-value less than α. That is, we build k new linear models, the kth model containing the kth covariate. Naturally, by "most significant" we mean that the p-value should be the least among all the covariates whose p-value is less than α. Then, we build the model with the selected covariate. A second covariate is selected by treating the previous model as the initial empty model. The model selection is continued until we fail to add any more covariates. This is summarized in the following algorithm:
1. Begin with an empty model.
2. For each covariate, obtain the p-value if it is added to the model. Select the covariate with the least p-value among all the covariates whose p-value is less than α.
3. Repeat the preceding step until no more covariates can be added to the model.
We again use the function created in Tattar, et. al. (2013) and apply it to the gasoline problem; a rough sketch of one possible implementation is also given below.
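Since only the backwardlm function is reproduced later in this chapter, the following is a rough sketch of one possible p-value based forward selection loop built on the add1 function; it assumes the Gasoline data is loaded, and it is an illustration rather than the forwardlm function of Tattar, et. al. (2013).
criticalalpha <- 0.20
scope_formula <- reformulate(setdiff(names(Gasoline), "y"))   # ~ x1 + ... + x11
current <- lm(y ~ 1, data = Gasoline)                         # the empty model
repeat {
  cand <- add1(current, scope = scope_formula, test = "F")    # F-test p-value for each candidate
  pvals <- cand[["Pr(>F)"]][-1]                               # drop the <none> row
  if (all(is.na(pvals)) || min(pvals, na.rm = TRUE) >= criticalalpha) break
  best <- rownames(cand)[-1][which.min(pvals)]
  current <- update(current, as.formula(paste(". ~ . +", best)))
}
summary(current)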
There is yet another method of model selecon. Here, we begin with the empty model. We
add a covariate as in the forward selecon step and then perform a backward eliminaon to
remove any unwanted covariate. Then, the forward and backward steps are connued unl
we can't either add a new covariate or remove an exisng covariate. Of course, the alpha
crical values for forward and backward steps are specied disnctly. This method is called
stepwise regression. This method is however skipped here for the purpose of brevity.
Criterion-based procedures
A useful tool for the model selection problem is to evaluate all possible models and select
one of them according to a certain criterion. The Akaike Information Criterion (AIC) is one such
criterion which can be used to select the best model. Let $\log L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \hat{\sigma}^2 \mid y)$ denote the
log-likelihood function of the fitted regression model. Define K = p + 2, which is the total
number of estimated parameters. The AIC for the fitted regression model is given by:

$$\mathrm{AIC} = -2 \log L\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \hat{\sigma}^2 \mid y\right) + 2K$$

Now, the model which has the least AIC among the candidate models is the best model.
The step function available in R gets the job done for us, and we will close the chapter
with the continued illustration of the Gasoline problem.
Time for action – model selection using the backward, forward,
and AIC criteria
For the forward and backward selection procedures under the stepwise procedures of the
model selection problem, we first define two functions: backwardlm and forwardlm.
However, for the criterion-based model selection, say AIC, we use the step function, which
can be applied to the fitted linear models.
1. Create a function pvalueslm which extracts the p-values related to the covariates
of an lm object:
pvalueslm <- function(lm) {summary(lm)$coefficients[-1,4]}
2. Create a backwardlm function defined as follows:
backwardlm <- function(lm, criticalalpha) {
  lm2 <- lm
  while(max(pvalueslm(lm2)) > criticalalpha) {
    # drop the covariate with the largest p-value and refit the model
    lm2 <- update(lm2, paste(".~.-",
      attr(lm2$terms, "term.labels")[which(pvalueslm(lm2) == max(pvalueslm(lm2)))],
      sep = ""))
  }
  return(lm2)
}
The code needs to be explained in more detail. There are two new functions created
here for the implementation of the backward elimination procedure. Let us have a
detailed look at them. The function pvalueslm extracts the p-values related to the
covariates of an lm object. The choice of summary(lm)$coefficients[-1,4]
is vital, as we are interested in the p-values of the covariates and not the intercept
term. The p-values are available once the summary function is applied on the lm
object. Now, let us focus on the backwardlm function. Its arguments are the lm
object and the value of the critical α. Our goal is to carry out the iterations until we
do not have any more covariates with a p-value greater than α. Thus, we use the
while function, which is typical of algorithms where the stopping condition appears
at the beginning of a function/program. We want our function to work for all
linear models and not just gasoline_lm, so we need to get the names of the
covariates which are specified in the lm object. Remember, we conveniently used
the formula lm(y~.) and this will come back to haunt us! Thankfully, attr(lm$terms,
"term.labels") extracts all the covariate names of an lm object. The argument
[which(pvalueslm(lm2)==max(pvalueslm(lm2)))] identifies the
covariate which has the maximum p-value above α. Next, paste(".~.-
",attr(), sep="") returns the formula which removes the
unwanted covariate. The explanation of the formula is lengthier than the function
itself, which is not surprising, as R is object-oriented and a few lines of code do more
actions than detailed prose.
3. Obtain the ecient linear regression model by applying the backwardlm funcon,
with the crical alpha at 0.20 on the Gasoline lm object:
gasoline_lm_backward <- backwardlm(gasoline_lm,criticalalpha=0.20)
4. Find the details of the final model obtained in the previous step:
summary(gasoline_lm_backward)
The output as a result of applying the backward elimination algorithm is the following:
Figure 14: The backward selection model for the Gasoline problem
5. The forwardlm function is given by:
forwardlm <- function(y, x, criticalalpha) {
  yx <- data.frame(y, x)
  mylm <- lm(y ~ -., data = yx)                  # start with the empty (intercept-only) model
  avail_cov <- attr(mylm$terms, "dataClasses")[-1]
  minpvalues <- 0
  while(minpvalues < criticalalpha) {
    pvalues_curr <- NULL
    for(i in 1:length(avail_cov)) {
      # add the i-th available covariate and record its p-value
      templm <- update(mylm, paste(".~.+", names(avail_cov[i])))
      mypvalues <- summary(templm)$coefficients[, 4]
      pvalues_curr <- c(pvalues_curr, mypvalues[length(mypvalues)])
    }
    minpvalues <- min(pvalues_curr)
    if(minpvalues < criticalalpha) {
      include_me_in <- min(which(pvalues_curr < criticalalpha))
      mylm <- update(mylm, paste(".~.+", names(avail_cov[include_me_in])))
      avail_cov <- avail_cov[-include_me_in]
    }
  }
  return(mylm)
}
6. Apply the forwardlm function on the Gasoline dataset:
gasoline_lm_forward <- forwardlm(Gasoline$y,Gasoline[,-1],
criticalalpha=0.2)
7. Obtain the details of the finalized model with summary(gasoline_lm_forward).
The output in R is as follows:
Figure 15: The forward selection model for the Gasoline dataset
Note that the forward selection and backward elimination have resulted in two
different models. This is to be expected and is not a surprise, and in such scenarios,
one can pick either of the models for further analysis/implementation.
The understanding of the construction of the forwardlm function is left as
an exercise to the reader.
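Although stepwise regression was skipped earlier for brevity, a crude version of it can be pieced together from the two functions just created; the sketch below uses the same critical alpha for both directions purely for illustration, whereas the two alphas would normally be specified distinctly.
# A rough stepwise sketch: a forward pass followed by a backward clean-up
gasoline_lm_step <- backwardlm(
  forwardlm(Gasoline$y, Gasoline[,-1], criticalalpha = 0.2),
  criticalalpha = 0.2)
summary(gasoline_lm_step)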
8. The step function in R readily gives the best model using AIC:
step(gasoline_lm, direction="both").
Figure 16: Stepwise AIC
Backward and forward selection can also be performed using AIC with the options
direction="backward" and direction="forward"; for a forward search, the scope
argument must additionally specify the largest model to be considered.
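The following is a rough sketch of these two calls (assuming the Gasoline data frame and gasoline_lm from the running example are in the workspace):
# Backward elimination by AIC, starting from the full model
step(gasoline_lm, direction = "backward")
# Forward selection by AIC: start from the intercept-only model and pass the
# full model's (expanded) formula as the upper scope
null_lm <- lm(y ~ 1, data = Gasoline)
step(null_lm, scope = formula(terms(gasoline_lm)), direction = "forward")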
What just happened?
We used two customized functions, backwardlm and forwardlm, for the backward and forward
selection procedures. The step function has been used for the model selection problem based
on the AIC.
Have a go hero
The supervisor performance data is available in the SPD dataset from the RSADBE package.
Here, Y (the regressand) represents the overall rating of the job done by a supervisor.
The overall rating depends on six other inputs/regressors. Find more details about
the dataset with ?SPD. First, visualize the dataset with the pairs function. Fit a multiple
linear regression model for Y and complete the necessary regression tasks, such as model
validation, regression diagnostics, and model selection.
Summary
In this chapter, we learned how to build a linear regression model, check for violations of the
model assumptions, fix the multicollinearity problem, and finally how to find the best model.
Here, we were aided by two important assumptions: the output being a continuous variable,
and the normality assumption for the errors. The linear regression model provides the best
footing for general regression problems. However, when the output variable is discrete,
binary, or multi-category data, the linear regression model lets us down. This is not actually
a letdown, as it was never intended to solve this class of problem. Thus, our next chapter
will focus on the problem of regression models for binary data.
7
The Logistic Regression Model
In this chapter we will consider regression models where the regressand is
dichotomous or binary in nature. The data is of the form $(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$,
where the dependent variables $Y_i$, i = 1, ..., n, are the observed binary outputs
assumed to be independent (in the statistical sense) of each other, and the vectors $X_i$,
i = 1, ..., n, are the covariates (independent variables in the sense of a regression
problem) associated with $Y_i$. In the previous chapter we considered the linear
regression model where the regressand was assumed to be continuous along
with the assumption of normality for the error distribution. Here, we will first
consider a Gaussian (normal) model for the binary regression problem, which is
more widely known as the probit model.
A more generic model has emerged during the past four decades in the form of the logistic
regression model. We will consider the logistic regression model for the rest of the chapter.
The approach in this chapter will be along the following lines:
The binary regression problem
The probit regression model
The logistic regression model
Model validation and diagnostics
Receiver operating characteristic (ROC) curves
Logistic regression for the German credit screening dataset
The binary regression problem
Consider the problem of modeling the completion of a statistics course by students based
on their Scholastic Assessment Test in the subject of mathematics (SAT-M) scores at the
time of their admission. After the completion of the final exams, we know which students
successfully completed the course and which of them failed. Here, the output pass/fail
may be represented by a binary number 1/0. It may be fairly said that the higher the SAT-M
score at the time of admission to the course, the more likely the candidate is to
complete the course. This problem has been discussed in detail in Johnson and Albert
(1999) and Tattar, et al. (2013).
Let us begin by denoting the pass/fail indicator by Y and the entry SAT-M score by X. Suppose
that we have n pairs of observations on the students' scores and their course completion
results. We can build the simple linear regression model for the probability of course
completion $p_i = P(Y_i = 1)$ as a function of the SAT-M score with $p_i = \beta_0 + \beta_1 X_i + \epsilon_i$. The data
from page 77 of Johnson and Albert (1999) is available in the sat dataset of the book's R
package. The columns that contain the data on the variables Y and X are named Pass and
Sat respectively. To build a linear regression model for the probability of completing the
course, we take pi as 1 if Yi = 1, and 0 otherwise. A scatter plot of Pass against Sat indicates
that the students with higher SAT-M scores are more likely to complete the course. The SAT score
varies from 463 to 649, and we then attempt to predict whether students with SAT scores of
400 and 700 would have successfully completed the course or not.
Time for action – limitations of linear regression models
A linear regression model is built for the dataset with a binary output. The model is used
to predict the probabilities for some cases, which shows its limitations:
1. Load the dataset from the RSADBE package with data(sat).
2. Visualize the scatter plot of Pass against Sat with plot(sat$Sat,
sat$Pass, xlab="SAT Score", ylab = "Final Result").
3. Fit the simple linear regression model with passlm <- lm(Pass~Sat,
data=sat) and obtain its summary by summary(passlm). Add the fitted
regression line to the scatter plot using abline(passlm).
4. Make a prediction for students with SAT-M scores of 400 and 700 by using the R
code predict(passlm,newdata=list(Sat=400)) and predict(passlm,
newdata=list(Sat=700),interval="prediction").
Figure 1: Drawbacks of the linear regression model for the classification problem
The linear model is significant, as seen from the p-value of 0.000179 associated with the
F-statistic. Next, Pr(>|t|) associated with the Sat variable is 0.00018, which is again
significant. However, the predicted values for SAT-M marks of 400 and 700 are respectively
seen to be -0.4793 and 1.743. The problem with the model is that the predicted values can
be negative as well as greater than 1. It is essentially these limitations which restrict the use
of the linear regression model when the regressand is a binary outcome.
What just happened?
We used the simple linear regression model for the probability prediction of a binary
outcome and observed that the probabilities are not bound in the unit interval [0,1]
as they are expected to be. This shows that we need special/different statistical
models for understanding the relationship between the covariates and the binary output.
We will use two regression models that are appropriate for a binary regressand: probit
regression and logistic regression. The former model will continue the use of the normal
distribution for the error through a latent variable, whereas the latter uses the binomial
distribution and is a popular member of the more generic generalized linear models.
The Logisc Regression Model
[ 204 ]
Probit regression model
The probit regression model is constructed as a latent variable model. Define a latent
variable, also called an auxiliary random variable, $Y^*$, as follows:

$$Y^* = X'\beta + \epsilon$$

which is the same as the earlier linear regression model with Y replaced by $Y^*$. The error term
$\epsilon$ is assumed to follow a normal distribution $N(0, \sigma^2)$. Then Y is taken as 1 if the
latent variable is positive, that is:

$$Y = \begin{cases} 1, & \text{if } Y^* > 0, \text{ equivalently } \epsilon > -X'\beta, \\ 0, & \text{otherwise.} \end{cases}$$

Without loss of generality, we can assume that $\epsilon \sim N(0, 1)$. Then, the probit model
is obtained by:

$$P(Y = 1 \mid X) = P(Y^* > 0) = P(\epsilon > -X'\beta) = P(\epsilon < X'\beta) = \Phi(X'\beta)$$
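A quick numerical illustration of the probit link (with made-up coefficient values, purely for the sketch) shows that the success probability is just the standard normal CDF evaluated at the linear predictor:
# Hypothetical coefficients, only to illustrate P(Y = 1 | x) = pnorm(beta0 + beta1 * x)
beta0 <- -18; beta1 <- 0.0334
x <- c(450, 550, 650)
pnorm(beta0 + beta1 * x)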
The method of maximum likelihood estimation is used to determine $\beta$. For a random
sample of size n, the log-likelihood function is given by:

$$\log L(\beta) = \sum_{i=1}^{n}\left[ y_i \log \Phi(x_i'\beta) + (1 - y_i)\log\left(1 - \Phi(x_i'\beta)\right)\right]$$

Numerical optimization techniques can be deployed to find the MLE of $\beta$. Fortunately,
we don't have to undertake this daunting task, as R helps us out with the glm function.
Let us fit the probit model for the sat dataset seen earlier.
Time for action – understanding the constants
The probit regression model is built for the Pass variable as a function of the Sat score
using the glm R function and the argument binomial(probit).
1. Using the glm function and the binomial(probit) option, we can fit the probit
model for Pass as a function of the Sat score:
pass_probit <- glm(Pass~Sat,data=sat,binomial(probit))
2. The details about the pass_probit glm object are fetched using
summary(pass_probit).
The summary function does not give a measure of R2, the coefficient of
determination, as we obtained for the linear regression model. In general, such a
measure is not exactly available for the GLMs. However, certain pseudo-R2 measures
are available and we will use the pR2 function from the pscl package. This package
has been developed at the Political Science Computational Laboratory, Stanford
University, which explains the name of the package, pscl.
3. Load the pscl package with library(pscl), and apply the pR2 function
on pass_probit to obtain the measures of pseudo-R2.
Finally, we check how the probit model overcomes the problems posed by the
application of the linear regression model.
4. Find the probability of passing the course for students with SAT-M scores of 400
and 700 with the following code:
predict(pass_probit,newdata=list(Sat=400),type = "response")
predict(pass_probit,newdata=list(Sat=700),type = "response")
The following is a screenshot of the R session:
Figure 2: The probit regression model for SAT problem
The Logisc Regression Model
[ 206 ]
The Pr(>|z|) for Sat is 0.0052, which shows that the variable has a significant say
in explaining whether the student successfully completes the course or not. The regression
coefficient value for the Sat variable indicates that if the Sat variable increases by one
mark, the probit link (the linear predictor) increases by 0.0334. In simple words, the SAT-M
variable has a positive impact on the probability of success for the student. Next, the pseudo-R2
value of 0.3934 for the McFadden metric indicates that approximately 39.34 percent of
the output is explained by the Sat variable. This appears to suggest that we need to collect
more information about the students. That is, the experimenter may try to get information
on how many hours the student spent exclusively on the course/examination, the
students' attendance percentages, and so on. However, the SAT-M score, which may have
been obtained nearly two years before the final exam of the course, continues to have
good explanatory power!
Finally, it may be seen that the probabilities of completing the course for students with SAT-M
scores of 400 and 700 are respectively 2.019e-06 and 1. It is important for the reader to
note the importance of the type = "response" option. More details may be obtained by
running ?predict.glm at the R terminal.
What just happened?
The probit regression model is appropriate for handling binary outputs and is certainly
much more appropriate than the simple linear regression model. The reader learned how to
build the probit model using the glm function, which is in fact more versatile, as will be seen in
the rest of the chapter. The prediction probabilities were also seen to be in the range of 0 to 1.
The glm function can be conveniently used for more than one covariate. In fact, the formula
structure of glm remains the same as that of lm. Model-related issues have not been considered
in full detail till now. The reason is that there is more interest in the logistic regression
model, as it will be the focus for the rest of the chapter, and the logic does not change. In fact,
we will return to the probit model diagnostics in parallel with the logistic regression model.
Logistic regression model
The binary outcomes may be easily viewed as failures or successes, and we have done
the same on many earlier occasions. Typically, it is then common to assume that we
have a binomial distribution for the probability of an observation being successful.
The logistic regression model specifies the linear effect of the covariates as a specific
function of the probability of success. The probability of success for an observation is
denoted by $\pi(x) = P(Y = 1)$ and the model is specified through the logistic function:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}$$
The choice of this function is for fairly good reasons. Define $w = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$. Then,
it may be easily seen that $\pi(x) = e^w/(1 + e^w) = 1/(1 + e^{-w})$. Thus, as w decreases towards negative
infinity, $\pi(x)$ approaches 0, and as w increases towards infinity, $\pi(x)$ approaches 1. For w = 0, $\pi(x)$
takes the value 0.5. The ratio of the probability of success to that of failure is known as the odds
ratio, denoted by OR, and following some simple arithmetic steps, it may be shown that:

$$OR = \frac{\pi(x)}{1 - \pi(x)} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}$$

Taking the logarithm of the odds ratio gets us:

$$\log OR = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

And thus, we finally see that the logarithm of the odds ratio is a linear function of the
covariates. It is actually the term $\log\left(\pi(x_i)/(1 - \pi(x_i))\right)$, which is the form of a logit
function, that this model derives its name from.
The log-likelihood funcon based on the data
( ) ( ) ( )
11 2 2 n n
y,x,y,x,..., y,x
is then:
()
()
0
1 0 1
log log 1
p
jij
j
p
n n x
i j ij
i j i
L y x e
β
β β
=
= = =
 
= − +
 
 
∑ ∑
The preceding expression is indeed a bit complex in nature to obtain an explicit form for an
esmate of
β
. Indeed, a specialized algorithm is required here and it is known as the iterave
reweighted least-squares (IRLS) algorithm. We will not go into the details of the algorithm and
refer the readers to an online paper of Sco A. Czepiel available at http://czep.net/stat/
mlelr.pdf. A raw R implementaon of the IRLS is provided in Chapter 19 of Taar, et. al.
(2013). For our purpose, we will be using the soluon as provided from the glm funcon.
Let us now t the logisc regression model for the Sat-M dataset considered hitherto.
Time for action – tting the logistic regression model
The logisc regression model is built using the glm funcon with the family =
'binomial' opon. We will obtain the pseudo-R2 values using the pR2 funcon from
the pscl package.
1. Fit the logisc regression model for the Pass as a funcon of the Sat using
the opon family = 'binomial' in the glm funcon:
pass_logistic <- glm(Pass~Sat,data=sat,family = 'binomial')
The Logisc Regression Model
[ 208 ]
2. The details of the fitted logistic regression model are obtained using the summary
function: summary(pass_logistic).
In the summary you will see two statistics called Null deviance and Residual deviance.
In general, a deviance is a measure useful for assessing the goodness-of-fit, and for
the logistic regression model it plays a role analogous to that of the residual sum of squares
for the linear regression model. The null deviance is the measure for a model that is
built without using any covariate information, such as Sat, and thus we would expect such
a model to have a large value. If the Sat variable is influencing Pass, we expect
the residual deviance of the fitted model to be significantly smaller than the null deviance.
If the residual deviance is significantly smaller than the null deviance, we
conclude that the covariates have significantly improved the model fit.
3. Find the pseudo-R2 with pR2(pass_logistic) from the pscl package.
4. The overall model significance of the fitted logistic regression model is obtained with:
with(pass_logistic, pchisq(null.deviance - deviance,
df.null - df.residual, lower.tail = FALSE))
The p-value is 0.0001496, which shows that the model is indeed significant. The
p-value for the Sat covariate, Pr(>|z|), is 0.011, which means that this variable
is indeed valuable for understanding Pass. The estimated regression coefficient for
Sat of 0.0578 indicates that an increase of a single mark increases the log-odds of
the candidate passing the course by 0.0578.
A brief explanation of this R code! It may be seen from the output following
summary(pass_logistic) that we have all the terms null.deviance,
deviance, df.null, and df.residual. So, the with function extracts all these
terms from the pass_logistic object and finds the p-value using the pchisq
function based on the difference between the deviances (null.deviance -
deviance) and the correct degrees of freedom (df.null - df.residual).
Figure 3: Logistic regression model for the Sat dataset
5. The condence intervals, with a default 95 percent requirement, for the
parameters of the regression coecients, is extracted using the confint
funcon: confint(pass_logistic).
The ranges of the 95 percent condence intervals do not contain 0 among them,
and hence we conclude that the intercept term and Sat variable are both signicant.
6. The predicon for the unknown scores are obtained as in the probit regression model:
predict.glm(pass_logistic,newdata=list(Sat=400),type = "response")
predict.glm(pass_logistic,newdata=list(Sat=700),type = "response")
7. Let us compare the logistic and probit models. Consider a sequence of hypothetical
SAT-M scores: sat_x = seq(400,700, 10). For the new sequence sat_x, we
predict the probability of course completion using both the pass_logistic and
pass_probit models and visualize whether their predictions are vastly different:
pred_l <- predict(pass_logistic,newdata=list(Sat=sat_x), type=
"response")
pred_p <- predict(pass_probit,newdata=list(Sat=sat_x), type=
"response")
plot(sat_x,pred_l,type="l",ylab="Probability",xlab="Sat_M")
lines(sat_x,pred_p,lty=2)
The Logisc Regression Model
[ 210 ]
The predicon says that a candidate with a SAT-M score of 400 is very unlikely to
complete the course successfully while the one with SAT-M score of 700 is almost
guaranteed to complete it. The predicons with probabilies closer to 0 or 1 need
to be taken with a bit cauon since we rarely have enough observaons at the
boundaries of the covariates.
Figure 4: Prediction using the logistic regression
What just happened?
We ed our rst logisc regression model and viewed its various measures which tell
us whether the ed model is a good model or not. Next, we learnt how to interpret
the esmated regression coecients and also had a peek at the pseudo-R2 value.
The importance of condence intervals is also emphasized. Finally, the model has
been used to make predicons for some unobserved SAT-M scores too.
Hosmer-Lemeshow goodness-of-fit test statistic
We may be satisfied with the analysis thus far, but there is always a lot more that we
can do. The hypothesis testing problem here is of the form:

$$H_0: E(Y) = \frac{e^{\sum_{j=0}^{p}\beta_j x_{ij}}}{1 + e^{\sum_{j=0}^{p}\beta_j x_{ij}}} \quad \text{versus} \quad H_1: E(Y) \neq \frac{e^{\sum_{j=0}^{p}\beta_j x_{ij}}}{1 + e^{\sum_{j=0}^{p}\beta_j x_{ij}}}$$

An answer to this hypothesis testing problem is provided by
the Hosmer-Lemeshow goodness-of-fit test statistic. The steps of the construction of this
test statistic are first discussed:
1. Order the ed values using sort and ed funcons.
2. Group the ed values into g classes, the preferred values of g vary between 6-10.
3. Find the observed and expected number in each group.
4. Perform a chi-square goodness-of-t test on the these groups. That is, denote Ojk
for the number of observaons of class k, k = 0, 1, in the group j, j = 1, 2, …, g, and
by Ejk the corresponding expected numbers. The chi-square test stasc is then
given by:
( )
2
2
101
gjk jk
jk.jk
O E
E
χ
==
=∑∑
And it may be proved that under the null-hypothesis
22
2
g
χχ
.
We will use an R program available at http://sas-and-r.blogspot.in/2010/09/
example-87-hosmer-and-lemeshow-goodness.html. It is important to note here
that when we use the code available on the web we verify and understand that such code
is indeed correct.
Time for action – the Hosmer-Lemeshow goodness-of-fit statistic
The Hosmer-Lemeshow goodness-of-fit statistic is one of the very
important metrics for evaluating a logistic regression model. The hosmerlem function
from the preceding web link will be used for the pass_logistic regression model.
1. Extract the fitted values for the pass_logistic model with pass_hat <-
fitted(pass_logistic).
2. Create the function hosmerlem from the previously-mentioned URL:
hosmerlem <- function(y, yhat, g=10) {
  cutyhat = cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
                include.lowest = TRUE)
  obs = xtabs(cbind(1 - y, y) ~ cutyhat)
  expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  chisq = sum((obs - expect)^2/expect)
  P = 1 - pchisq(chisq, g - 2)
  return(list(chisq=chisq, p.value=P))
}
The Logisc Regression Model
[ 212 ]
What is the hosmerlem function doing here, exactly? Obviously, it is a function of
three variables: the real output values in y, the predicted probabilities in yhat,
and the number of groups g. The cutyhat variable uses the cut function on the
predicted probabilities and assigns each of them to one of the g groups. The obs
matrix obtains the counts Ojk using the xtabs function, and a similar
action is repeated for Ejk. The code chisq = sum((obs - expect)^2/expect)
then obtains the value of the Hosmer-Lemeshow chi-square test statistic, and
using it we obtain the related p-value with P = 1 - pchisq(chisq, g - 2).
Finally, the required values are returned with return(list(chisq=chisq, p.value=P)).
3. Complete the computation of the Hosmer-Lemeshow goodness-of-fit test statistic
for the fitted model pass_logistic with hosmerlem(pass_logistic$y, pass_hat).
Figure 5: Hosmer-Lemeshow goodness-of-fit test
Since there is no significant difference between the observed and predicted y values, we
conclude that the fitted model is a good fit. Now that we know that we have a good
model on hand, it is time to investigate how valid the model assumptions are.
What just happened?
We used R code from the web and successfully adapted it to the problem on our hands!
Particularly, the Hosmer-Lemeshow goodness-of-fit test is a vital metric for understanding
the appropriateness of a logistic regression model.
Model validation and diagnostics
In the previous chapter we saw the ulity of residual techniques. A similar technique is also
required for the logisc regression model and we will develop these methods for the logisc
regression model in this secon.
Residual plots for the GLM
In the case of the linear regression model, we had explored the role of residuals for the
purpose of model validation. In the context of logistic regression, actually the GLM, we have
five different types of residuals for the same purpose:
Response residual: The difference between the actual values and the fitted values
is the response residual, that is, $y_i - \hat{\pi}_i$; in particular it is $1 - \hat{\pi}_i$ if $y_i = 1$ and $-\hat{\pi}_i$ if $y_i = 0$.
Deviance residual: For an observation i, the deviance residual is the signed square
root of the contribution of the observation to the model deviance. That
is, it is given by:

$$r_i^{dev} = \pm\left\{-2\left[y_i \log \hat{\pi}_i + (1 - y_i)\log(1 - \hat{\pi}_i)\right]\right\}^{1/2}$$

where the sign is positive if $y_i \geq \hat{\pi}_i$, and negative otherwise, and $\hat{\pi}_i$ is the predicted
probability of success.
Pearson residual: The Pearson residual is defined by:

$$r_i^{P} = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)}}$$

Partial residual: The partial residual of the jth predictor, j = 1, 2, ..., p, for the ith
observation is defined by:

$$r_{ij}^{part} = \frac{y_i - \hat{\pi}_i}{\hat{\pi}_i (1 - \hat{\pi}_i)} + \hat{\beta}_j x_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p$$

The partial residuals are very useful for identification of the type of transformation
that needs to be performed on the covariates.
Working residual: The working residual for the logistic regression model is given by:

$$r_i^{W} = \frac{y_i - \hat{\pi}_i}{\hat{\pi}_i (1 - \hat{\pi}_i)}$$
The Logisc Regression Model
[ 214 ]
Each of the preceding residual variants is easily obtained using the residuals function; see
?glm.summaries for details. The residual variant is specified through the type option of
the residuals function. We have not given the details related to the probit regression model;
however, the same functions for logistic regression apply there nevertheless. We will obtain
the residual plots against the fitted values and examine the appropriateness of the logistic
and probit regression models.
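As a quick pointer (a sketch, assuming the pass_logistic object fitted earlier is available), each variant is requested through the type argument of residuals:
# The five residual variants of a fitted glm object, selected via the type argument
residuals(pass_logistic, type = "response")[1:5]
residuals(pass_logistic, type = "deviance")[1:5]
residuals(pass_logistic, type = "pearson")[1:5]
residuals(pass_logistic, type = "working")[1:5]
residuals(pass_logistic, type = "partial")[1:5, ]   # one column per covariate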
Time for action – residual plots for the logistic regression
model
The residuals and fitted functions will be used to obtain the residual plots for the
probit and logistic regression models.
1. Initialize a graphics window with three panels using par(mfrow=c(1,3),
oma = c(0,0,3,0)). The oma option ensures that we can appropriately title
the combined output.
2. Plot the response residuals against the fitted values of the pass_logistic
model with:
plot(fitted(pass_logistic), residuals(pass_logistic,"response"),
col= "red", xlab="Fitted Values", ylab="Residuals",cex.axis=1.5,
cex.lab=1.5)
The reason for xlab and ylab has been explained in the earlier chapters.
3. For the purpose of comparison with the probit regression model, add its response
residuals to the previous plot with:
points(fitted(pass_probit), residuals(pass_probit,"response"),
col= "green")
And add a suitable legend and title as follows:
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
title("Response Residuals")
4. Add the horizontal line at 0 with abline(h=0).
5. Repeat the preceding steps for deviance and Pearson residuals with:
plot(fitted(pass_logistic), residuals(pass_logistic,"deviance"),
col= "red", xlab="Fitted Values", ylab="Residuals",cex.axis=1.5,
cex.lab=1.5)
points(fitted(pass_probit), residuals(pass_probit,"deviance"),
col= "green")
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
abline(h=0)
title("Deviance Residuals")
plot(fitted(pass_logistic), residuals(pass_logistic,"pearson"),
col= "red",xlab="Fitted Values",ylab="Residuals",cex.axis=1.5,
cex.lab= 1.5)
points(fitted(pass_probit), residuals(pass_probit,"pearson"), col=
"green")
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
abline(h=0)
title("Pearson Residuals")
6. Give an appropriate title with title(main="Response, Deviance, and
Pearson Residuals Comparison for the Logistic and Probit
Models", outer=TRUE, cex.main=1.5).
Figure 6: Residual plots for the logistic regression model
In each of the three preceding residual plots we observe two decreasing trends of residuals
whose slope is -1. The reason for such a trend is that the residuals take one of two values at
a point Xi, either $1 - \hat{\pi}_i$ or $-\hat{\pi}_i$. Thus, in these residual plots we always get two linear trends
with slope -1. Clearly, there is not much difference between the residuals of the logistic and
probit models. The Pearson residual graph also indicates the presence of an outlier for the
observation with a residual value lesser than -3.
The Logisc Regression Model
[ 216 ]
What just happened?
The residuals funcon along with the type opon helps in model validaon and
idencaon of some residuals. A good thing is that the same funcon applies on the
logisc as well as the probit regression model.
Have a go hero
In the previous exercise, we left out the investigation using the partial and working types
of residuals. Obtain these plots!
Influence and leverage for the GLM
In the previous chapter we saw how the influential and leverage points are identified in
a linear regression model. It would be a bit difficult to go into the appropriate formulas and
theory for the logistic regression model.
Time for action – diagnostics for the logistic regression
The inuence and leverage points will be idened through the applicaon of the funcons,
such as hatvalues, cooks.distance, dffits, and dfbetas for the pass_logistic
ed model.
1. The high leverage points of a logisc regression model are obtained with
hatvalues(pass_logistic) while the Cooks distance is fetched with cooks.
distance(pass_logistic). The DFFITS and DFBETAS measures of inuence are
obtained by running dfbetas(pass_logistic) and dffits(pass_logistic).
2. The inuence and leverage measures are put together using the cbind funcon:
cbind(hatvalues(pass_logistic),cooks.distance(pass_
logistic),dfbetas(pass_logistic),dffits(pass_logistic))
The output is given in the following screenshot:
Figure 7: Influence measures for the logistic regression model
It is me to interpret these measures.
3. If the hatvalues associated with an observaon is greater than
( )
2 1
p / n+, where
p is the number of covariates considered in the model and n is the number of
observaons, it is considered as a high leverage point. For the pass_logistic
object, we nd the high leverage points with:
hatvalues(pass_logistic)>2*(length(pass_logistic$coefficients)-1)/
length(pass_logistic$y)
The Logisc Regression Model
[ 218 ]
4. An observaon is considered to have great inuence on the parameter esmates
if the Cooks distance, as given by cooks.distance, is greater than 10 percent
quanle of the
()
11
p,np
F
+−+ distribuon, and it is considered highly inuenal if it
exceeds 50 percent quanle of the same distribuon. In terms of R program,
we need to execute:
cooks.distance(pass_logistic)>qf(0.1,length(pass_logistic$
coefficients),length(pass_logistic$y)-length(pass_logistic$
coefficients))
cooks.distance(pass_logistic)>qf(0.5,length(pass_logistic$
coefficients),length(pass_logistic$y)-length(pass_logistic$
coefficients))
Figure 8: Identifying the outliers
The previous screenshot shows that there are eight high leverage points. We also
see that at the 10 percent quantile of the F-distribution we have two influential
points, whereas we don't have any highly influential points.
5. Use the plot function to identify the influential observations suggested by the
DFFITS and DFBETAS measures:
par(mfrow=c(1,3))
plot(dfbetas(pass_logistic)[,1],ylab="DFBETAS - INTERCEPT")
plot(dfbetas(pass_logistic)[,2],ylab="DFBETAS - SAT")
plot(dffits(pass_logistic),ylab="DFFITS")
Figure 9: DFFITS and DFBETAS for the logistic regression model
As with the linear regression model, the DFFITS and DFBETAS are measures of the influence of
the observations on the regression coefficients. The thumb rule for the DFBETAS is that if
their absolute value exceeds 1, the observation has a significant influence on the estimated
coefficients. In our case this does not happen, and we conclude that we do not have
influential observations. The interpretation of DFFITS is left as an exercise.
What just happened?
We adapted the influence measures to the context of generalized linear models,
and especially to the context of logistic regression.
Have a go hero
The inuence and leverage measures were executed on the logisc regression model,
the pass_logistic object in parcular. You also have the pass_probit object!
Repeat the enre exercise of hatvalues, cooks.distance, dffits, and dfbetas
on the pass_probit ed probit model and draw your inference.
The Logisc Regression Model
[ 220 ]
Receiver operating characteristic curves
In the binary classification problem, we have certain scenarios where the comparison
between the predicted and actual class is of great importance. For example, there is
a genuine problem in the banking industry of identifying fraudulent transactions as against
the non-fraudulent transactions. There is another problem of sanctioning loans to customers
who may successfully repay the entire loan versus the customers who will default at some stage
during the loan tenure. Given the historical data, we build a classification model, for
example the logistic regression model.
Now with the logistic regression model, or any other classification model for that matter,
if the predicted probability is greater than 0.5, the observation is predicted as a successful
observation, and as a failure otherwise. We remind ourselves again that success/failure is
defined according to the experiment. At least with the data on hand, we know the true
labels of the observations, and hence a comparison of the true labels with the predicted
labels makes a lot of sense. In an ideal scenario we expect the predicted labels to match
perfectly with the actual labels, that is, whenever the true label stands for success/failure,
the predicted label is also success/failure. However, in a real scenario this is rarely the case.
This means that there are some observations which are predicted as success/failure when
the true labels are actually failure/success. In other words, we make mistakes! It is possible
to put these notes in the form of a table widely known as the confusion matrix.
                           Observed
Predicted        Success                  Failure
Success          True Positive (TP)       False Positive (FP)
Failure          False Negative (FN)      True Negative (TN)
Table 1: The confusion matrix
The abbreviations in parentheses denote the counts of the cases. It may be seen from the
preceding table that the diagonal cells (TP and TN) are the correct predictions made by the
model, whereas the off-diagonal cells (FP and FN) are the mistakes. The following metrics
may be considered for comparison across multiple models:
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
However, it is known that these metrics have a lot of limitations and more robust measures
are required. The answer is provided by the receiver operating characteristic (ROC) curve.
We need two important metrics for the construction of an ROC curve. The true positive rate
(tpr) and the false positive rate (fpr) are respectively defined by:

$$tpr = \frac{TP}{TP + FN}, \qquad fpr = \frac{FP}{TN + FP}$$
ROC graphs are constructed by plotting the tpr against the fpr. We will now explain this
in detail. Our approach will be to explain the algorithm in an Action framework.
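For concreteness, a tiny sketch with made-up counts shows how these quantities are computed in R:
# Made-up confusion-matrix counts, purely to illustrate the metric definitions
TP <- 40; FP <- 5; FN <- 10; TN <- 45
c(accuracy  = (TP + TN) / (TP + TN + FP + FN),
  precision = TP / (TP + FP),
  recall    = TP / (TP + FN),     # recall is the same as the tpr
  fpr       = FP / (TN + FP))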
Time for action – ROC construction
A simple dataset is considered and the ROC construction is explained in a very simple
step-by-step approach:
1. Suppose that the predicted probabilities of n = 10 observations are 0.32, 0.62, 0.19,
0.75, 0.18, 0.18, 0.95, 0.79, 0.24, and 0.59. Create a vector of them as follows:
pred_prob<-c(0.32, 0.62, 0.19, 0.75, 0.18, 0.18, 0.95, 0.79, 0.24,
0.59)
2. Sort the predicted probabilities in decreasing order:
> (pred_prob=sort(pred_prob,decreasing=TRUE))
[1] 0.95 0.79 0.75 0.62 0.59 0.32 0.24 0.19 0.18 0.18
3. Normalize the predicted probabilities of the preceding step to the unit interval:
> pred_prob <- (pred_prob-min(pred_prob))/(max(pred_prob)-min(pred_prob))
> pred_prob
 [1] 1.00000 0.79221 0.74026 0.57143 0.53247 0.18182 0.07792 0.01299 0.00000 0.00000
Now, at each threshold along the previously sorted probabilities, we commit false
positives as well as false negatives. Thus, we want to check, at each point of our
prediction percentiles, the quantum of tpr and fpr. Since ten points are too few,
we now consider a dataset of predicted probabilities and the true labels.
4. Load the illustrative dataset from the RSADBE package with data(simpledata).
5. Set up the threshold vector threshold <- seq(1,0,-0.01).
6. Find the number of positive (success) and negative (failure) cases in the dataset with
P <- sum(simpledata$Label==1) and N <- sum(simpledata$Label==0).
7. Initialize the fpr and tpr vectors with tpr <- fpr <- threshold*0.
The Logisc Regression Model
[ 222 ]
8. Set up the following loop, which computes tpr and fpr at each point of the
threshold vector:
for(i in 1:length(threshold)) {
FP=TP=0
for(j in 1:nrow(simpledata)) {
if(simpledata$Predictions[j]>=threshold[i]) {
if(simpledata$Label[j]==1) TP=TP+1 else FP=FP+1
}
}
tpr[i]=TP/P
fpr[i]=FP/N
}
9. Plot the tpr against the fpr with:
plot(fpr, tpr, "l", xlab="False Positive Rate", ylab="True Positive Rate", col="red")
abline(a=0,b=1)
Figure 10: An ROC illustration
The diagonal line represents the performance of a random classifier, in that it simply says
"Yes" or "No" without looking at any characteristic of an observation. Any good classifier
must sit, or rather be displayed, above this line. The classifier considered here, albeit an
unknown one, seems a much better classifier than the random one. The ROC curve is useful
for comparing competing classifiers in the sense that if one classifier is always above another,
we select the former.
An excellent introductory exposition of the ROC curves is available at the website http://
ns1.ce.sharif.ir/courses/90-91/2/ce725-1/resources/root/Readings/
Model%20Assessment%20with%20ROC%20Curves.pdf.
What just happened?
The construcon of ROC has been demysed! The preceding program is a very primive
one. In the later chapters we will use the ROCR package for the construcon of ROC.
We will next look at a real-world problem.
Logistic regression for the German credit screening
dataset
Millions of applicaons are made to a bank for a variety of loans! The loan may be a personal
loan, home loan, car loan, and so forth. From a bank's perspecve, loans are an asset for them
as obviously the customer pays them interest and over a period of me the bank makes prot.
If all the customers promptly pay back their loan amount, all their tenure equated monthly
installment (EMI) or the complete amount on preclosure of the principal amount, there is
only money to be made. Unfortunately, it is not always the case that the customers pay back
the enre amount. In fact, the fracon of people who do not complete the loan duraon may
also be very small, say about ve percent. However, a bad customer may take away the prots
of may be 20 or more customers. In this hypothecal case, the bank eventually makes more
losses than prot and this may eventually lead to its own bankruptcy.
Now, a loan applicaon form seeks a lot of details about the applicant. The data from these
details in the applicaon can help the bank build appropriate classiers, such as a logisc
regression model, and make predicons about which customers are most likely to turn up as
fraudulent. The customers who have been predicted to default in the future are then declined
the loan. A real dataset of 1,000 customers who had borrowed loan from a bank is available
on the web at http://www.stat.auckland.ac.nz/~reilly/credit-g.arff and
http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).
This data has been made available by Prof. Hofmann and it contains details on 20 variables
related to the customer. It is also known whether the customers defaulted or not. The
variables are described in the following table.
The Logisc Regression Model
[ 224 ]
A detailed analysis of the dataset using R has been done by Sharma, and his very useful
document can be downloaded from cran.r-project.org/doc/contrib/Sharma-
CreditScoring.pdf.
No  Variable  Characteristic  Description
1   checking  integer         Status of existing checking account
2   duration  integer         Duration in month
3   history   integer         Credit history
4   purpose   factor          Purpose
5   amount    numeric         Credit amount
6   savings   integer         Savings account/bonds
7   employed  integer         Present employment since
8   installp  integer         Installment rate in percentage of disposable income
9   marital   integer         Personal status and sex
10  coapp     integer         Other debtors/guarantors
11  resident  integer         Present residence since
12  property  factor          Property
13  age       numeric         Age in years
14  other     integer         Other installment plans
15  housing   integer         Housing
16  existcr   integer         Number of existing credits at this bank
17  job       integer         Job
18  depends   integer         Number of people being liable to provide maintenance for
19  telephon  integer         Telephone
20  foreign   integer         Foreign worker
21  good_bad  factor          Loan defaulter
22  default   integer         good_bad in numeric
We have the German credit data with us in the GC dataset from the RSADBE package.
Let us build a classifier for identifying the good customers from the bad ones.
Time for action – logistic regression for the German credit
dataset
The logisc regression model will be built for credit card applicaon scoring model and an
ROC curve t to evaluate the t of the model.
1. Invoke the ROCR library with library(ROCR).
2. Get the German credit dataset in your current session with data(GC).
3. Build the logisc regression model for good_bad with GC_LR <- glm
(good_bad~.,data=GC,family=binomial()).
4. Run summary(GC_LR) and idenfy the signicant variables. Also answer the
queson of whether the model is signicant?
5. Get the predicons using the predict funcon:
LR_Pred <- predict( GC_LR,type='response')
6. Use the predicon funcon from the ROCR package to set up a predicon object:
GC_pred <- prediction(LR_Pred,GC$good_bad)
The funcon prediction sets up dierent manipulaons required computaons
as required for construcng the ROC curve. Get more details related to it with
?prediction.
7. Set up the performance vector required to obtain the ROC curve with GC_perf <-
performance(GC_pred,"tpr","fpr").
The performance funcon uses the predicon object to set up the ROC curve.
The Logisc Regression Model
[ 226 ]
8. Finally, visualize the ROC curve with plot(GC_perf).
Figure 11: Logistic regression model for the German credit data
The ROC curve shows that the logistic regression is indeed effective in identifying
fraudulent customers.
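As an optional follow-up (a sketch using the same ROCR objects), the area under the ROC curve can be extracted from the prediction object as a single summary number:
# Area under the ROC curve (AUC) for the fitted logistic regression model
GC_auc <- performance(GC_pred, "auc")
unlist(GC_auc@y.values)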
What just happened?
Now, we considered a real-world problem with enough data points. The fitted logistic
regression model gives a good explanation of the fraudulent customers in terms of the
data that is collected about them.
Have a go hero
For simpledata, a raw program was written to draw the ROC curve. Redo that exercise with
a red colour for the curve. Then, using the prediction and performance functions from the ROCR
package, add the ROC curve for simpledata obtained in the previous step in green colour.
What do you expect?
Summary
We started with a simple linear regression model for the binary classification problem
and saw its limitations. The probit regression model, which is an adaptation
of the linear regression model through a latent variable, overcomes the drawbacks of the
straightforward linear regression model. The versatile logistic regression model has been
considered in detail, and we considered the various kinds of residuals that help in model
validation. The influential and leverage point detection has been discussed too, which helps
us build a better model by removing the outliers. A metric in the form of the ROC curve helps
us in understanding the performance of a classifier. Finally, we concluded the chapter with an
application to the important problem of identifying good customers from the bad ones.
Despite the advantages of linearity, we still have many drawbacks with either the linear
regression model or the logistic regression model. The next chapter begins with the family
of polynomial regression models and later considers the impact of regularization.
8
Regression Models with Regularization
In Chapter 6, Linear Regression Analysis, and Chapter 7, The Logistic Regression
Model, we focused on the linear and the logistic regression model. In the model
selection issues with the linear regression model, we found that a covariate
is either selected or not depending on the associated p-value. However, the
rejected covariates are not given any kind of consideration once their p-value
exceeds the threshold. This may lead to discarding covariates even if
they have some say on the regressand. Particularly, the final model may thus
lead to overfitting of the data, and this problem needs to be addressed.
We will rst consider ng a polynomial regression model, without the technical details,
and see how higher order polynomials give a very good t, which actually comes with a
higher price. A more general framework of B-splines is considered next. This approach leads
us to the smooth spline models, which are actually ridge regression models. The chapter
concludes with an extension of the ridge regression for the linear and logisc regression
models. For more details of the coverage, refer to Chapter 2 of Berk (2008) and Chapter 5
of Hase, et. al. (2008). This chapter will unfold on the following topics:
The problem of overng in a general regression model
The use of regression splines for certain special cases
Improving esmators of the regression coecients, and overcoming the
problem of overng with ridge regression for linear and logisc models
The framework of train + validate + test for regression models
Regression Models with Regularizaon
[ 230 ]
The overtting problem
The limitaon of the linear regression model is best understood through an example. I have
created a hypothecal dataset for understanding the problem of overng. A scaerplot of
the dataset is shown in the following gure.
It appears from the scaerplot that for x values up to 6, there is a linear increase in y, and an
eye-bird esmate of the slope is (50 - 10) / (5.5 - 1.75) = 10.67. This slope may
be on account of a linear term or even a quadrac term. On the other hand, the decline in
y-values for x-values greater than 6 is very steep, approximately (10 - 50) / (10 - 6)
= -10. Now, looking at the complete picture, it appears that the output Y depends upon
the higher order of the covariate X. Let us t polynomial curves of various degrees and
understand the behavior of the dierent linear regression models. A polynomial regression
model of degree k is dened as follows:
0 1 2
Y X X ... X
ββ β β ε
=+ + + + +
Here, the terms 2k
X,X,..., X are treated as disnct variables, in the sense that one may
compare the preceding model with the one introduced in the mulple linear regression
model of Chapter 6, Linear Regression Analysis, by dening 2
1 2
k
k
XX,X X,...,
XX
= = = .
The inference for the polynomial regression model proceeds in the same way as the
mulple linear regression with k terms:
Figure 1: A non-linear relationship displayed by a scatter plot
The data for the previous figure is available in the dataset OF from RSADBE. The option poly
is used on the right-hand side of the formula in the lm function for fitting the polynomial
regression models.
Time for action – understanding overfitting
Polynomial regression models are built using the lm function, as we saw earlier, with the
option poly.
1. Read the hypothetical dataset into R by using data(OF).
2. Plot Y against X by using plot(OF$X, OF$Y, "b", col="red", xlab="X",
ylab="Y").
3. Fit the polynomial regression models of orders 1, 2, 3, 6, and 9, and add their fitted
lines against the covariate X with the following code:
lines(OF$X, lm(Y~poly(X,1,raw=TRUE), data=OF)$fitted.values, "b", col="green")
lines(OF$X, lm(Y~poly(X,2,raw=TRUE), data=OF)$fitted.values, "b", col="wheat")
lines(OF$X, lm(Y~poly(X,3,raw=TRUE), data=OF)$fitted.values, "b", col="yellow")
lines(OF$X, lm(Y~poly(X,6,raw=TRUE), data=OF)$fitted.values, "b", col="orange")
lines(OF$X, lm(Y~poly(X,9,raw=TRUE), data=OF)$fitted.values, "b", col="black")
Regression Models with Regularizaon
[ 232 ]
The opon poly is used to specify the polynomial degree:
Figure 2: Fitting higher-order polynomial terms in a regression model
4. Enhance the graph with a suitable legend:
legend(6,50,c("Poly 1","Poly 2","Poly 3","Poly 6","Poly 9"),
col=c("green","wheat","yellow","orange","black"),pch=1,ncol=3)
5. Inialize the following vectors:
R2 <- NULL; AdjR2 <- NULL; FStat <- NULL
Mvar <- NULL; PolyOrder<-1:9
6. Now, t the regression models beginning with order 1 up to order 9 (since we only
have ten points) and extract their R2, Adj- R2, F-stasc value, and model variability:
for(i in 1:9) {
temp <- summary(lm(Y~poly(X,i,raw=T),data=OF))
R2[i] <- temp$r.squared
AdjR2[i] <- temp$adj.r.squared
FStat[i] <- as.numeric(temp$fstatistic[1])
Mvar[i] <- temp$sigma
}
cbind(PolyOrder,R2,AdjR2,FStat,Mvar)
We will more formally define polynomial regression models in the next section.
The output is given in the next figure.
7. Let us also look at the magnitude of the regression coefficients:
as.numeric(lm(Y~poly(X,1,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,2,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,3,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,4,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,5,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,6,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,7,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,8,raw=T),data=OF)$coefficients)
The following screenshot shows the large size of the regression coefficients; in
particular, as the degree of the polynomial increases, so does the coefficient
magnitude. This is a problem! As the complexity of a model increases, its
interpretability becomes very difficult. In the next section, we will discuss various
techniques in polynomial regression.
Figure 3: Regression coefficients of polynomial regression models
What just happened?
The scaer plot indicated that a polynomial regression model may be appropriate.
Fing higher order polynomial curves gives a closer approximaon of the t.
The regression coecients have been observed to increase with the degree
of the polynomial t.
In the next secon, we consider the more general regression spline model.
Regression Models with Regularizaon
[ 234 ]
Have a go hero
The R2 value for gasoline_lm is at 0.895; see Figure 7: Building Multiple Linear Regression
Model, of Chapter 6, Linear Regression Analysis. Add higher-order terms for the covariates
and make an attempt to reach an R2 value of 0.95.
Regression spline
In this secon, we will consider various enhancements/generalizaons of the linear
regression model. We will begin with a piecewise linear regression model and then consider
the polynomial regression extension. The term spline refers to a thin strip of wood that can
be easily bent along a curved line.
Basis functions
In the previous secon, we made mulple transformaons of the input variable X with
2
1 2
k
k
XX,X X,...,
XX
= = = . In the Data Re-expression secon of Chapter 4, Exploratory
Analysis, we saw how a useful log transformaon gave a beer stem-and-leaf display than
the original variable itself. In many applicaons, it has been found that the transformed
variables are more important than the original variable itself. Thus, we need a more generic
framework to consider the transformaons of the variables. Such a framework is provided by
the basis funcons. For a single covariate X, the set of transformaons may be dened
as follows:
( ) ( )
1
M
mm
m
fX h X
β
=
=
Here,
()
m
hX
is the m-th transformaon of X, and
m
β
is the associated regression coecient.
In the case of a simple linear regression model, we have
()
1
1hX= and
()
2
hX X
=. For the
polynomial regression model, we have
()
12
m m
hX X,m,,...,k= =
, and for the logarithmic
transformaon
()
loghX X=. In general, for the p mulple linear regression model,
we have the basis transformaon as follows:
( ) ( )
1
1 1
j
M
p
p jmjm j
jm
fX,..., X h X
β
= =
=∑∑
For the mulple linear regression model, we have
( )
11
j j j
h X X,j ,...p= = . In general, the
transformaon includes funcons such as sine, cosine, exponenaon, and indicator funcons.
Piecewise linear regression model
Consider the scaer plot of the dataset, which is available in the dataset PWR_Illus, in
the next screenshot. We see a slanted leer N in Figure 4: Scaerplot of a dataset (A) and
the ed values using piecewise linear regression model (B), where in the beginning Y
increases with X up to the point, approximately, 15, then there is a steep decline, or negave
relaonship, ll 30, and nally there is an increase in the y values beyond that. In a certain
way, we can imagine the x values of 15 and 30 as break-down points. It is apparent from the
scaerplot display that a linear relaonship between the x- and y-values over the real line
intervals less than 15, between 15 to 30, and greater than 30 is appropriate. The queson
then is how do we build a regression model for such a phenomenon? The answer is provided
by the piecewise linear regression model. In this parcular case, we have a two-piece linear
regression model.
In general, let $x_a$ and $x_b$ denote the two points where we believe the linear regression model has its breakpoints. Furthermore, we denote by $I_a$ an indicator function that equals 1 when the x value is greater than $x_a$ and takes the value 0 otherwise. Similarly, the second breakpoint indicator $I_b$ is defined. The piecewise linear regression model is defined as follows:
$$ Y = \beta_0 + \beta_1 X + \beta_2 (X - x_a) I_a + \beta_3 (X - x_b) I_b + \varepsilon $$
In this piecewise linear regression model, we have four transformations: $h_1(X) = 1$, $h_2(X) = X$, $h_3(X) = (X - x_a) I_a$, and $h_4(X) = (X - x_b) I_b$. The regression model needs to be interpreted with a bit of care. If the x value is less than $x_a$, the average Y value is $\beta_0 + \beta_1 X$. For an x value greater than $x_a$ but less than $x_b$, the average of Y is $(\beta_0 - \beta_2 x_a) + (\beta_1 + \beta_2) X$, and for values greater than $x_b$, it is $(\beta_0 - \beta_2 x_a - \beta_3 x_b) + (\beta_1 + \beta_2 + \beta_3) X$. That is, the intercept terms in the three intervals are $\beta_0$, $(\beta_0 - \beta_2 x_a)$, and $(\beta_0 - \beta_2 x_a - \beta_3 x_b)$, respectively, whereas the slopes are $\beta_1$, $(\beta_1 + \beta_2)$, and $(\beta_1 + \beta_2 + \beta_3)$. Of course, we are now concerned about fitting the piecewise linear regression model in R. Let us set ourselves up for this task!
Time for action – fitting piecewise linear regression models
A piecewise linear regression model can be easily fitted in R by using the same lm function and a bit of caution. A loop is used to find the points at which the model is supposed to have changed its trajectory.
1. Read the dataset into R with data(PW_Illus).
2. For convenience, attach the variables in the PW_Illus object by using attach(PW_Illus).
3. To be on the safe side, we will select a range of the x values, any of which may be either of the breakpoints:
break1 <- X[which(X>=12 & X<=18)]
break2 <- X[which(X>=27 & X<=33)]
4. Get the number of points that are candidates for being the breakpoints with n1 <- length(break1) and n2 <- length(break2).
We do not have a clear defining criterion to select one of the n1 or n2 x values to be the breakpoints. Hence, we will run various linear regression models and select as breakpoints the pair of points (xa, xb) that returns the least mean residual sum of squares. Towards this, we set up a matrix with three columns: the first two columns hold the potential pairs of breakpoints, and the third column contains the mean residual sum of squares. The pair of points that corresponds to the least mean residual sum of squares will be selected as the best model in the current case.
5. Set up the required matrix, build all the possible regression models with the pairs of potential breakpoints, and note their mean residual sum of squares through the following program:
MSE_MAT <- matrix(nrow=(n1*n2), ncol=3)
colnames(MSE_MAT) <- c("Break_1","Break_2","MSE")
curriter <- 0
for(i in 1:n1){
  for(j in 1:n2){
    curriter <- curriter + 1
    MSE_MAT[curriter,1] <- break1[i]
    MSE_MAT[curriter,2] <- break2[j]
    # Fit the piecewise linear model with candidate breakpoints break1[i] and break2[j]
    piecewise1 <- lm(Y ~ X*(X<break1[i]) + X*(X>=break1[i] & X<break2[j]) +
      X*(X>=break2[j]))
    # The sixth component of summary.lm is "sigma", the residual standard error
    MSE_MAT[curriter,3] <- as.numeric(summary(piecewise1)[6])
  }
}
Note the use of the formula operator ~ in the specification of the piecewise linear regression model.
6. The time has arrived to find the pair of breakpoints:
MSE_MAT[which(MSE_MAT[,3]==min(MSE_MAT[,3])),]
The pair of breakpoints is hence (14.000, 30.000). Let us now look at how good the model fit is!
7. First, reobtain the scatter plot with plot(PW_Illus). Fit the piecewise linear regression model with breakpoints at (14, 30) with pw_final <- lm(Y ~ X*(X<14)+X*(X>=14 & X<30)+X*(X>=30)). Add the fitted values to the scatter plot with points(PW_Illus$X, pw_final$fitted.values, col="red").
Note that the fitted values are a very good reflection of the original data values (Figure 4 (B)). The fact that linear models can be extended to such different scenarios makes it very promising to study this in even more detail, as will be seen in the later part of this section.
Figure 4: Scatterplot of a dataset (A) and the fitted values using piecewise linear regression model (B)
What just happened?
The piecewise linear regression model has been explored for a hypothetical scenario, and we investigated how to identify breakpoints by using the criterion of the mean residual sum of squares.
The piecewise linear regression model offers useful flexibility, and it is indeed a very useful model when there is a genuine reason to believe that there are certain breakpoints in the model. It has some advantages and certain limitations too. From a technical perspective, the fitted model is not continuous, whereas from an applied perspective, the model poses problems in guessing the breakpoint values and in extending to multi-dimensional cases. It is thus necessary to look for a more general framework, where we need not be bothered about these issues. Some answers are provided in the following sections.
Natural cubic splines and the general B-splines
We will first consider the polynomial regression splines model. As noted in the previous discussion, we have a lot of discontinuity in the piecewise regression model. In some sense, "greater continuity" can be achieved by using the cubic functions of x and then constructing regression splines in what are known as "piecewise cubics"; see Berk (2008), Section 2.2. Suppose that there are K data points at which we require the knots. Suppose that the knots are located at the points $\xi_1, \xi_2, \ldots, \xi_K$, which lie between the boundary points $\xi_0$ and $\xi_{K+1}$, such that $\xi_0 < \xi_1 < \xi_2 < \ldots < \xi_K < \xi_{K+1}$. The piecewise cubic polynomial regression model is given as follows:
$$ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \sum_{j=1}^{K} \theta_j (X - \xi_j)_+^3 + \varepsilon $$
Here, the funcon
()
3
.
+
represents that the posive values from the argument are accepted
and then the cube power performed on it; that is:
( ) ( )
3
3if
othe0 rwise
jj
j
X
XX
,
ξ
ξξ
+
= >
For this model, the K + 4 basis functions are as follows:
$$ h_1(X) = 1,\; h_2(X) = X,\; h_3(X) = X^2,\; h_4(X) = X^3,\; h_{4+j}(X) = (X - \xi_j)_+^3,\; j = 1, \ldots, K $$
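To make the truncated-power basis concrete, the following short sketch builds these K + 4 basis columns by hand for a toy vector of x values and two knots; the vector x and the knots here are hypothetical and serve only as an illustration.
# Hypothetical illustration of the truncated-power (piecewise cubic) basis
x     <- seq(0, 20, by = 0.5)   # toy predictor values, not a book dataset
knots <- c(6.5, 13)             # two illustrative knots
# pmax(x - knot, 0)^3 implements (x - knot)_+^3
H <- cbind(1, x, x^2, x^3, sapply(knots, function(k) pmax(x - k, 0)^3))
colnames(H) <- c("h1", "h2", "h3", "h4", paste0("h", 4 + seq_along(knots)))
head(H)   # K + 4 = 6 basis columns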
We will now consider an example from Montgomery, et al. (2005), pages 231-3. It is known that the battery voltage drop in a guided missile motor has a different behavior as a function of time. The next screenshot displays the scatter plot of the battery voltage drop for different time points; see ?VD from the RSADBE package. We need to build a piecewise cubic regression spline for this dataset with knots at time t = 6.5 and t = 13 seconds, since it is known that the missile changes its course at these points. If we denote the battery voltage drop by Y and the time by t, the model for this problem is then given as follows:
$$ Y = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \theta_1 (t - 6.5)_+^3 + \theta_2 (t - 13)_+^3 + \varepsilon $$
It is not possible within the mathematical scope of this book to look into the details related to the natural cubic spline regression models or the B-spline regression models. However, we can fit them by using the ns and bs functions in the formula of the lm function, along with the knots at the appropriate places. These models will be built and their fit will be visualized too. Let us now fit the models!
Time for action – fitting the spline regression models
A natural cubic spline regression model will be fitted for the voltage drop problem.
1. Read the required dataset into R by using data(VD).
2. Invoke the graphics editor by using par(mfrow=c(1,2)).
3. Plot the data and give an appropriate title:
plot(VD)
title(main="Scatter Plot for the Voltage Drop")
4. Build the piecewise cubic polynomial regression model by using the lm function and related options:
VD_PRS <- lm(Voltage_Drop ~ Time + I(Time^2) + I(Time^3) +
  I(((Time-6.5)^3)*(sign(Time-6.5)==1)) +
  I(((Time-13)^3)*(sign(Time-13)==1)), data=VD)
The sign funcon returns the sign of a numeric vector as 1, 0, and -1, accordingly
as the arguments are posive, zero, and negave respecvely. The operator I is
an inhibit interpretator operator, in that the argument will be taken in an as is
format, check ?I. This operator is especially useful in data.frame and the formula
program of R.
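The same truncated cubic terms can also be written with pmax, which some readers may find easier to read; the following is only an equivalent rewriting of the model above, not a different fit.
# Equivalent specification of VD_PRS using pmax() for the (t - knot)_+^3 terms
VD_PRS_alt <- lm(Voltage_Drop ~ Time + I(Time^2) + I(Time^3) +
  I(pmax(Time - 6.5, 0)^3) + I(pmax(Time - 13, 0)^3), data = VD)
all.equal(fitted(VD_PRS), fitted(VD_PRS_alt))   # should be TRUE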
5. To obtain the fitted plot along with the scatter plot, run the following code:
plot(VD)
points(VD$Time, fitted(VD_PRS), col="red", "l")
title("Piecewise Cubic Polynomial Regression Model")
Figure 5: Voltage drop data - scatter plot and a cubic polynomial regression model
6. Obtain the details of the fitted model with summary(VD_PRS).
The R output is given in the next screenshot. The summary output shows that each of the basis functions is indeed significant here.
Figure 6: Details of the fitted piecewise cubic polynomial regression model
7. Fit the natural cubic spline regression model using the ns function from the splines package (load it first with library(splines)); note that ns does not take a degree argument, since natural cubic splines are cubic by definition:
VD_NCS <- lm(Voltage_Drop ~ ns(Time, knots=c(6.5,13), intercept=TRUE), data=VD)
8. Obtain the ed plot as follows:
par(mfrow=c(1,2))
plot(VD)
points(VD$Time,fitted(VD_NCS),col="green","l")
title("Natural Cubic Regression Model")
9. Obtain the details related to VD_NCS with the summary function summary(VD_NCS); see Figure 8: Details of the natural cubic and B-spline regression models.
10. Fit the B-spline regression model by using the bs function, also from the splines package:
VD_BS <- lm(Voltage_Drop ~ bs(Time, knots=c(6.5,13), intercept=TRUE, degree=3), data=VD)
11. Obtain the fitted plot for VD_BS with the R program:
plot(VD)
points(VD$Time, fitted(VD_BS), col="brown", "l")
title("B-Spline Regression Model")
Figure 7: Natural Cubic and B-Spline Regression Modeling
12. Finally, get the details of the fitted B-spline regression model by using summary(VD_BS).
The main purpose of the B-spline regression model is to illustrate that these splines are smooth at the boundary points, in contrast with the natural cubic regression model. This can be clearly seen in Figure 8: Details of the natural cubic and B-spline regression models.
Both the models, VD_NCS and VD_BS, have good summary statistics and have modeled the data really well.
Figure 8: Details of the natural cubic and B-spline regression models
What just happened?
We began with the fitting of a piecewise polynomial regression model and then had a look at the natural cubic spline regression and B-spline regression models. All three models provide a very good fit to the actual data. Thus, with a good guess or experimental/theoretical evidence, the linear regression model can be extended in an effective way.
Ridge regression for linear models
In Figure 3: Regression coefficients of polynomial regression models, we saw that the magnitude of the regression coefficients increases drastically as the polynomial degree increases. The right tweaking of the linear regression model, as seen in the previous section, gives us the right results. However, the models considered in the previous section had just one covariate, and the problem of identifying the knots in the multiple regression model becomes an overly complex issue. That is, if we have a problem with a large number of covariates, there may naturally be some dependency amongst them, which cannot be investigated for certain reasons. In such problems, it may happen that certain covariates dominate other covariates in terms of the magnitude of their regression coefficients, and this may mar the overall usefulness of the model. Furthermore, even in the univariate case, we have the problem that the choice of the number of knots, their placements, and the polynomial degree may be manipulated by the analyst. We have an alternative to this problem in the way we minimize the residual sum of squares $\min_{\beta} \sum_{i=1}^{n} e_i^2$.
The least-squares solution leads to an estimator of $\beta$:
$$ \hat{\beta} = (X'X)^{-1} X'Y $$
We saw in Chapter 6, Linear Regression Analysis, how to guard ourselves against outliers, the measures of model fit, and model selection techniques. However, these methods come into action after the construction of the model, and hence, though they offer protection in a certain sense against the problem of overfitting, we need more robust methods. The question that arises is: can we guard ourselves against overfitting while building the model itself? This would go a long way in addressing the problem. The answer is an affirmative, and we will check out this technique.
The least-squares solution is the optimal solution when we have the squared loss function. The idea then is to modify this loss function by incorporating a penalty term, which will give us additional protection against the overfitting problem. Mathematically, we add a penalty term for the size of the regression coefficients; in fact, the constraint ensures that the sum of squares of the regression coefficients is kept small. Formally, the goal is to obtain an optimal solution of the following problem:
$$ \min_{\beta} \left\{ \sum_{i=1}^{n} e_i^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} $$
Here, $\lambda > 0$ is the control factor, also known as the tuning parameter, and $\sum_{j=1}^{p} \beta_j^2$ is the penalty. If the $\lambda$ value is zero, we get the earlier least-squares solution. Note that the intercept has been deliberately kept out of the penalty term! Now, for large values of $\sum_{j=1}^{p} \beta_j^2$, the penalized criterion will be large. Thus, loosely speaking, minimizing $\sum_{i=1}^{n} e_i^2 + \lambda \sum_{j=1}^{p} \beta_j^2$ also requires $\sum_{j=1}^{p} \beta_j^2$ to be kept small. The optimal solution for the preceding minimization problem is given as follows:
$$ \hat{\beta}_{\text{Ridge}} = (X'X + \lambda I)^{-1} X'Y $$
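To see the closed-form solution in action, here is a small sketch that computes the ridge estimate directly from the matrix formula and compares it with the ordinary least-squares estimate; the simulated design matrix and response are hypothetical and only illustrate the algebra.
# Hypothetical illustration of the ridge closed-form solution (not a book dataset)
set.seed(42)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))        # design matrix with an intercept column
beta_true <- c(1, 2, -1, 0.5)
y <- X %*% beta_true + rnorm(n)
lambda <- 2
Pen <- diag(c(0, rep(1, p)))                     # penalize only the slopes, not the intercept
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)
beta_ridge <- solve(t(X) %*% X + lambda * Pen, t(X) %*% y)
round(cbind(OLS = as.vector(beta_ols), Ridge = as.vector(beta_ridge)), 3)  # ridge slopes are shrunk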
The choice of $\lambda$ is a critical one. There are multiple options to obtain it:
Find the value of $\lambda$ by using the cross-validation technique (discussed in the last section of this chapter)
Find the value of $\lambda$ by a semi-automated method, as described at http://arxiv.org/pdf/1205.0686.pdf
For the first technique, we can use the function lm.ridge from the MASS package, and the second method of semi-automatic detection is available in the linearRidge function of the ridge package.
In the following R session, we use the functions lm.ridge and linearRidge.
Time for action – ridge regression for the linear regression model
The linearRidge function from the ridge package and lm.ridge from the MASS package are two good options for developing ridge regression models.
1. Though the OF object may still be there in your session, let us again load it by using data(OF).
2. Load the MASS and ridge packages by using library(MASS); library(ridge).
3. For a polynomial regression model of degree 3 and various values of lambda, including 0, 0.5, 1, 1.5, 2, 5, 10, and 30, obtain the ridge regression coefficients with the following R code:
LR <- linearRidge(Y~poly(X,3), data=as.data.frame(OF),
  lambda=c(0,0.5,1,1.5,2,5,10,30))
LR
The funcon linearRidge from the ridge package performs the ridge regression
for a linear model. We have two opons. First, we specify the values of lambda,
which may either be a scalar or a vector. In the case of a scalar lambda, it will simply
return the set of (ridge) regression coecients. If it is a vector, it returns the related
set of regression coecients.
4. Compute the value of
2
1
p
j
j
β
=
for dierent lambda values:
LR_Coef <- LR$coef
colSums(LR_Coef^2)
Note that as the lambda value increases, the value of
2
1
p
j
j
β
=
decreases. However,
this is not to say that higher lambda value is preferable, since the sum
2
1
p
j
j
β
=
will decrease to 0, and eventually none of the variables will have a signicant
explanatory power about the output. The choice of selecon of the lambda value
will be discussed in the last secon.
5. The linearRidge function can also find the "best" lambda value:
linearRidge(Y~poly(X,3), data=as.data.frame(OF), lambda="automatic")
6. Fetch the details of the "best" ridge regression model with the following line of code:
summary(linearRidge(Y~poly(X,3), data=as.data.frame(OF), lambda="automatic"))
The summary shows that the value of lambda is chosen as 0.07881, and that it used three PCs. Now, what is a PC? PC is an abbreviation of principal component, and unfortunately we can't really go into the details of this aspect. Enthusiastic readers may refer to Chapter 17 of Tattar, et al. (2013). Compare these results with those in the first section.
7. For the same choice of different lambda values, use the lm.ridge function from the MASS package:
LM <- lm.ridge(Y~poly(X,3), data=as.data.frame(OF),
  lambda=c(0,0.5,1,1.5,2,5,10,30))
LM
8. The lm.ridge function obviously works a bit differently from linearRidge. The results are given in the next image. Comparison of the results is left as an exercise to the reader. As with the linearRidge model, let us compute the value of $\sum_{j=1}^{p} \beta_j^2$ for the lm.ridge fitted models too.
9. Use the colSums function to get the required result:
LM_Coef <- LM$coef
colSums(LM_Coef^2)
Figure 09: A first look at the linear ridge regression
So far, we have been working with a single covariate only. However, we need to consider multiple linear regression models and see how ridge regression helps us. To do this, we will return to the gasoline mileage problem considered in Chapter 6, Linear Regression Analysis.
1. Read the Gasoline data into R by using data(Gasoline).
2. Fit the ridge regression model (and the multiple linear regression model again) for the mileage as a function of the other variables:
gasoline_lm <- lm(y~., data=Gasoline)
gasoline_rlm <- linearRidge(y~., data=Gasoline, lambda="automatic")
3. Compare the lm coefficients with the linearRidge coefficients:
sum(coef(gasoline_lm)[-1]^2) - sum(coef(gasoline_rlm)[-1]^2)
4. Look at the summary of the fitted ridge linear regression model by using summary(gasoline_rlm).
5. The difference between the sum of squares of the regression coefficients for the linear and the ridge linear model is indeed very large. Furthermore, the gasoline_rlm details reveal that there are four variables that have significant explanatory power for the mileage of the car. Note that the gasoline_lm model had only one significant variable for the car's mileage. The output is given in the following figure:
Figure 10: Ridge regression for the gasoline mileage problem
What just happened?
We made use of two functions, namely lm.ridge and linearRidge, for fitting ridge regression models for the linear regression model. It is observed that ridge regression models may sometimes reveal more significant variables.
In the next section, we will consider the ridge penalty for the logistic regression model.
Ridge regression for logistic regression models
We will not be able to go into the math of ridge regression for the logistic regression model, though we will happily make good use of the logisticRidge function from the ridge package to illustrate how to build ridge regression for the logistic regression model. For more details, we refer to the research paper of Cule and De Iorio (2012), available at http://arxiv.org/pdf/1205.0686.pdf. In the previous section, we saw that gasoline_rlm found more significant variables than gasoline_lm. Now, in Chapter 7, Logistic Regression Model, we fit a logistic regression model for the German credit data problem in GC_LR. The question that arises is: if we obtain a ridge regression model of the related logistic regression model, say GC_RLR, can we expect to find more significant variables?
Time for action – ridge regression for the logistic regression model
We will use the logisticRidge function from the ridge package here to fit the ridge regression, and check whether we can obtain more significant variables.
1. Load the German credit dataset with data(German).
2. Use the logisticRidge function to obtain GC_RLR; a small manipulation is required here, by using the following line of code:
GC_RLR <- logisticRidge(as.numeric(good_bad)-1 ~ ., data=as.data.frame(GC),
  lambda="automatic")
3. Obtain the summaries of GC_LR and GC_RLR by using summary(GC_LR) and summary(GC_RLR).
The detailed summary output is given in the following screenshot:
Figure 11: Ridge regression with the logistic regression model
It can be seen that the ridge regression model offers a very slight improvement over the standard logistic regression model.
What just happened?
The ridge regression concept has been applied to the important family of logistic regression models. Although in the case of the German credit data problem we found only a slight improvement in the identification of significant variables, it is vital that we always be on the lookout for better models, for example with respect to sensitivity to outliers, and the logisticRidge function appears to be a good alternative to the glm function.
Another look at model assessment
In the previous two sections, we used the automatic option for obtaining the optimum $\lambda$ values, as discussed in the work of Cule and De Iorio (2012). There is also an iterative technique for finding the penalty factor $\lambda$. This technique is especially useful when we do not have a sufficiently well-developed theory for regression models beyond the linear and logistic regression models. Neural networks, support vector machines, and so on, are some very useful regression models where the theory may not have been well developed, at least to the best knowledge of the author. Hence, we will use the iterative method in this section.
For both the linearRidge and lm.ridge fitted models in the Ridge regression for linear models section, we saw that for an increasing value of $\lambda$, the sum of squares of the regression coefficients, $\sum_{j=1}^{p} \beta_j^2$, decreases. The question then is how to select the "best" $\lambda$ value. A popular technique in the data mining community is to split the dataset into three parts, namely the Train, Validate, and Test parts. There are no definitive answers for what the split percentages for the three parts should be, and a common practice is to split them in either 60:20:20 or 50:25:25 proportions. Let us now understand this process:
Training dataset: The models are built on the data available in this data part.
Validation dataset: For this part of the data, we pretend as though we do not know the output values and make predictions based upon the covariate values. This step is to ensure that overfitting is minimized. The errors (residual sums of squares for a regression model and accuracy percentages for a classification model) are then compared with the counterpart errors in the training part. If the errors decrease in the training set while they remain the same for the validation part, it means that we are overfitting the data. A threshold after which this is observed may be chosen as the better lambda value.
Testing dataset: In practice, these are really unobserved cases for which the model is applied for forecasting purposes.
For the gasoline mileage problem, we will split the data into three parts and use the training and validation parts to select the $\lambda$ value.
Time for action – selecting lambda iteratively and other topics
Iterative selection of the penalty parameter for ridge regression will be covered in this section. The useful framework of train + validate + test will also be considered for the German credit data problem.
1. For the sake of simplicity, we will remove the character variable of the dataset by using Gasoline <- Gasoline[,-12].
2. Set the random seed by using set.seed(1234567). This step is to ensure that the user can reproduce the results of the program.
3. Randomize the observations to enable the splitting part:
data_part_label <- c("Train","Validate","Test")
indv_label <- sample(data_part_label, size=nrow(Gasoline), replace=TRUE,
  prob=c(0.6,0.2,0.2))
4. Now, split the gasoline dataset:
G_Train <- Gasoline[indv_label=="Train",]
G_Validate <- Gasoline[indv_label=="Validate",]
G_Test <- Gasoline[indv_label=="Test",]
5. Define the $\lambda$ vector with lambda <- seq(0,10,0.2).
6. Initialize the training and validation errors:
Train_Errors <- vector("numeric", length=length(lambda))
Val_Errors <- vector("numeric", length=length(lambda))
7. Run a loop to get the required errors; the loop itself appears in the book as a screenshot, and a sketch of it is given after this step.
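Since the loop is not reproduced in the extracted text, the following is only a minimal sketch of what it might look like, assuming the errors are residual sums of squares computed from lm.ridge fits refitted for every lambda value; the author's actual loop may differ in its details.
# Sketch of the missing loop (assumption: errors = residual sums of squares
# from an lm.ridge fit, refitted separately for every value of lambda)
for (i in seq_along(lambda)) {
  fit  <- lm.ridge(y ~ ., data = G_Train, lambda = lambda[i])
  beta <- coef(fit)                     # intercept and slopes on the original scale
  Xtr  <- as.matrix(cbind(1, G_Train[, names(G_Train) != "y"]))
  Xval <- as.matrix(cbind(1, G_Validate[, names(G_Validate) != "y"]))
  Train_Errors[i] <- sum((G_Train$y    - Xtr  %*% beta)^2)
  Val_Errors[i]   <- sum((G_Validate$y - Xval %*% beta)^2)
}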
8. Plot the training and validation errors:
plot(lambda, Val_Errors, "l", col="red", xlab=expression(lambda),
  ylab="Training and Validation Errors", ylim=c(0,600))
points(lambda, Train_Errors, "l", col="green")
legend(6, 500, c("Training Errors","Validation Errors"),
  col=c("green","red"), pch="-")
The nal output will be the following:
Figure 12: Training and validation errors
The preceding plot suggests that the lambda value is between 0.5 and 1.5. Why?
The technique of train + validate + test is not simply restricted to selecng the
lambda value. In fact, for any regression/classicaon model, we can try to
understand if the selected model really generalizes or not. For the German credit
data problem in the previous chapter, we will make an aempt to see what the
current technique suggests.
9. The program, whose output is the ROC curves shown next, appears in the book as a screenshot; a sketch of how such a program might look follows.
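Since the program itself is not reproduced here, the following is only an illustrative sketch, assuming the German credit data frame GC with the good_bad outcome from Chapter 7 and the ROCR package for the ROC curves; the author's actual program may differ in its details.
# Sketch of a train + validate + test run for the German credit data
# (assumptions: the GC data frame with a good_bad factor, and the ROCR package)
library(ROCR)
set.seed(1234567)
gc_label <- sample(c("Train","Validate","Test"), size=nrow(GC),
  replace=TRUE, prob=c(0.6,0.2,0.2))
GC_Train    <- GC[gc_label=="Train",]
GC_Validate <- GC[gc_label=="Validate",]
GC_Test     <- GC[gc_label=="Test",]
# Build the logistic regression model on the training part only
GC_Train_LR <- glm(good_bad ~ ., data=GC_Train, family=binomial())
# ROC curves for the validation and test parts
val_pred  <- prediction(predict(GC_Train_LR, newdata=GC_Validate, type="response"),
  GC_Validate$good_bad)
test_pred <- prediction(predict(GC_Train_LR, newdata=GC_Test, type="response"),
  GC_Test$good_bad)
plot(performance(val_pred, "tpr", "fpr"), col="red")
plot(performance(test_pred, "tpr", "fpr"), col="blue", add=TRUE)
legend("bottomright", c("Validate","Test"), col=c("red","blue"), lty=1)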
10. The ROC plot is given in the following screenshot:
Figure 13: ROC plot for the train + validate + test partition of the German data
We will close the chapter with a short discussion. In the train + validate + test partitioning, we had one technique for avoiding overfitting. A generalization of this technique is the well-known cross-validation method. In an n-fold cross-validation approach, the data is randomly partitioned into n divisions. In the first step, the first part is held out for validation, the model is built using the remaining n-1 parts, and the accuracy percentage is calculated. Next, the second part is treated as the validation dataset, the remaining parts 1, 3, ..., n are used to build the model, and the accuracy is then computed on the second part. This process is repeated for the remaining n-2 folds. Finally, an overall accuracy metric is reported. On the surface, this process is complex enough, and hence we will resort to the well-defined functions available in the DAAG package; a bare-bones sketch of the idea itself is given after this paragraph.
11. As the cross-validation function itself carries out the n-fold partitioning, we build it over the entire dataset:
library(DAAG)
data(VD)
CVlm(df=VD, form.lm=formula(Voltage_Drop~Time+I(Time^2)+I(Time^3)+
  I(((Time-6.5)^3)*(sign(Time-6.5)==1))+I(((Time-13)^3)*(sign(Time-13)==1))),
  m=10, plotit="Observed")
The VD data frame has 41 observations, and the output in Figure 14: Cross-validation for the voltage-drop problem shows that the 10-fold cross-validation has 10 partitions, with fold 2 containing five observations and the rest of them having four each. Now, for each fold, the cubic polynomial regression model is fitted by using the data in the remaining folds:
Figure 14: Cross-validation for the voltage-drop problem
Using the ed polynomial regression model, a predicon is made for the units in
the fold. The observed versus predicted regressand values plot is given in Figure
15: Predicted versus observed plot using the cross-validaon technique. A close
examinaon of the numerical predicted values and the plot indicate that we have a
very good model for the voltage drop phenomenon.
The generalized cross-validaon (GCV) errors are also given with the details of a
lm.ridge t model. We can use this informaon to arrive at the beer value for
the ridge regression models:
Figure 15: Predicted versus observed plot using the cross-validation technique
12. For the OF and G_Train data frames, use the lm.ridge function to obtain the GCV errors:
> LM_OF <- lm.ridge(Y~poly(X,3), data=as.data.frame(OF),
+ lambda=c(0,0.5,1,1.5,2,5,10,30))
> LM_OF$GCV
0.0 0.5 1.0 1.5 2.0 5.0 10.0 30.0
5.19 5.05 5.03 5.09 5.21 6.38 8.31 12.07
> LM_GT <- lm.ridge(y~.,data=G_Train,lambda=seq(0,10,0.2))
> LM_GT$GCV
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8
1.777 0.798 0.869 0.889 0.891 0.886 0.877 0.868 0.858 0.848
2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
0.838 0.830 0.821 0.813 0.806 0.798 0.792 0.786 0.780 0.774
4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8
0.769 0.764 0.760 0.755 0.751 0.748 0.744 0.740 0.737 0.734
6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8
0.731 0.729 0.726 0.723 0.721 0.719 0.717 0.715 0.713 0.711
8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
0.710 0.708 0.707 0.705 0.704 0.703 0.701 0.700 0.699 0.698
10.0
0.697
For the OF data frame, the best value appears to lie around 1.0, where the GCV is smallest. On the other hand, for the G_Train data frame, the value appears to lie in (0.2, 0.4).
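As a quick programmatic check (a small convenience sketch, not part of the original steps), the $\lambda$ value with the smallest GCV can be picked out directly from the fitted lm.ridge object; MASS also provides a select method for such objects that reports model-selection summaries, including the GCV minimizer.
# Pick out the lambda with the smallest GCV error from the fitted object
LM_OF$lambda[which.min(LM_OF$GCV)]   # 1.0 for the OF grid used above
select(LM_OF)                        # HKB, L-W, and GCV-based choices of lambda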
What just happened?
The choice of the penalty factor is indeed very crucial for the success of a ridge regression model, and we saw different methods for obtaining it. These included the automatic choice of Cule and De Iorio (2012) and the cross-validation technique. Furthermore, we also saw the application of the popular train + validate + test approach. In practical applications, these methodologies will go a long way towards obtaining the best models.
Pop quiz
What do you expect as the result if you perform the model selection task with the step function on a polynomial regression model? That is, you are trying to select the variables for the polynomial model lm(Y~poly(X,9,raw=TRUE),data=OF), or say VD_PRS. Verify your intuition by completing the R programs.
Summary
In this chapter, we began with a hypothetical dataset and highlighted the problem of overfitting. In the presence of breakpoints, also known as knots, the extensions of the linear model to the piecewise linear regression model and the spline regression model were found to be very useful enhancements. The problem of overfitting can also sometimes be overcome by using ridge regression. The ridge regression solution has been extended for the linear and logistic regression models. Finally, we saw a different approach to model assessment by using the train + validate + test approach and the cross-validation approach. Despite these developments, when the data is intrinsically non-linear, it becomes difficult for the models discussed in this chapter to emerge as useful solutions. The past two decades have witnessed a powerful alternative in the so-called Classification and Regression Trees (CART). The next chapter discusses CART in greater depth, and the final chapter considers modern developments related to it.
9
Classification and Regression Trees
In the previous chapters, we focused on regression models, and the majority of the emphasis was on the linearity assumption. Although the next natural extension would appear to be non-linear models, we will instead deviate to recursive partitioning techniques, which are a bit more flexible than the non-linear generalization of the models considered in the earlier chapters. Of course, the recursive partitioning techniques, in most cases, may be viewed as non-linear models.
We will rst introduce the noon of recursive parons through a hypothecal dataset.
It is apparent that the earlier approach of the linear models changes in an enrely dierent
way with the funconing of the recursive parons. Recursive paroning depends upon
the type of problem we have in hand. We develop a regression tree for the regression
problem when the output is a connuous variable, as in the linear models. If the output is
a binary variable, we develop a classicaon tree. A regression tree is rst created by using
the rpart funcon from the rpart package. A very raw R program is created, which clearly
explains the unfolding of a regression tree. A similar eort is repeated for the classicaon
tree. In the nal secon of this chapter, a classicaon tree is created for the German credit
data problem along with the use of ROC curves for understanding the model performance.
The approach in this chapter will be on the following lines:
Understanding the basis of recursive parons and the general CART.
Construcon of a regression tree
Construcon of a classicaon tree
Applicaon of a classicaon tree to the German credit data problem
The ner aspects of CART
Classicaon and Regression
[ 258 ]
Recursive partitions
The name of the library package rpart, shipped along with R, stands for Recursive Partitioning. The package was first created by Terry M Therneau and Beth Atkinson, and is currently maintained by Brian Ripley. We will first have a peek at what recursive partitions are.
A complex and contrived relationship is generally not identifiable by linear models. In the previous chapter, we saw the extensions of the linear models in the piecewise, polynomial, and spline regression models. It is also well known that if the order of a model is larger than 4, then the interpretation and usability of the model become more difficult. We consider a hypothetical dataset, where we have two classes for the output Y and two explanatory variables in X1 and X2. The two classes are indicated by filled-in green circles and red squares. First, we will focus only on the left display of Figure 1: A complex classification dataset with partitions, as it is the actual depiction of the data. At the outset, it is clear that a linear model is not appropriate, as there is quite an overlap of the green and red indicators. Now, there is a clear demarcation of the classification problem according to whether X1 is greater than 6 or not. In the area on the left side of X1=6, the middle-third region contains a majority of green circles and the rest are red squares. The red squares are predominantly identifiable according to whether the X2 values are either less than or equal to 3 or greater than 6. The green circles are the majority values in the region of X2 being greater than 3 and less than 6. A similar story can be built for the points on the right side of X1 greater than 6. Here, we first partitioned the data according to the X1 values, and then in each of the partitioned regions, we obtained partitions according to the X2 values. This is the act of recursive partitioning.
Figure 1: A complex classification dataset with partitions
Let us obtain the preceding plot in R.
Time for action – partitioning the display plot
We first visualize the CART_Dummy dataset and then look, in the next subsection, at how CART finds the patterns that are believed to exist in the data.
1. Obtain the dataset CART_Dummy from the RSADBE package by using data(CART_Dummy).
2. Convert the binary output Y into a factor variable, and attach the data frame with CART_Dummy$Y <- as.factor(CART_Dummy$Y).
attach(CART_Dummy)
In Figure 1: A complex classification dataset with partitions, the red squares refer to 0 and the green circles to 1.
3. Initialize a graphics window for the two plots by using par(mfrow=c(1,2)).
4. Create a blank scatter plot:
plot(c(0,12),c(0,10),type="n",xlab="X1",ylab="X2")
5. Plot the green circles and red squares:
points(X1[Y==0],X2[Y==0],pch=15,col="red")
points(X1[Y==1],X2[Y==1],pch=19,col="green")
title(main="A Difficult Classification Problem")
6. Repeat the previous two steps to obtain an identical plot on the right side of the graphics window.
7. First, partition according to the X1 values by using abline(v=6,lwd=2).
8. Add segments on the graph with the segments function:
segments(x0=c(0,0,6,6), y0=c(3.75,6.25,2.25,5),
  x1=c(6,6,12,12), y1=c(3.75,6.25,2.25,5), lwd=2)
title(main="Looks a Solvable Problem Under Partitions")
What just happened?
A complex problem is simplified through partitioning! A more generic function, segments, has nicely slipped into our program, which you may use for many other scenarios.
Classicaon and Regression
[ 260 ]
Now, this approach of recursive partitioning is not feasible all the time! Why? We seldom deal with just two or three explanatory variables and with as few data points as in the preceding hypothetical example. The question is how one creates a recursive partitioning of the dataset. Breiman, et al. (1984) and Quinlan (1988) invented tree-building algorithms, and we will follow the Breiman, et al. approach in the rest of the book. The CART discussion in this book is heavily influenced by Berk (2008).
Splitting the data
In the earlier discussion, we saw that partitioning the dataset can help a lot in reducing the noise in the data. The question is how does one begin with it? The explanatory variables can be discrete or continuous. We will begin with the continuous (numeric objects in R) variables.
For a continuous variable, the task is a bit simpler. First, identify the unique distinct values of the numeric object. Let us say, for example, that the distinct values of a numeric object, say height in cms, are 160, 165, 170, 175, and 180. The data partitions are then obtained as follows:
data[Height<=160,], data[Height>160,]
data[Height<=165,], data[Height>165,]
data[Height<=170,], data[Height>170,]
data[Height<=175,], data[Height>175,]
The reader should try to understand the rationale behind the code; certainly, this is just an indicative one.
Now, we consider the discrete variables. Here, we have two types of variables, namely categorical and ordinal. In the case of ordinal variables, we have an order among the distinct values. For example, in the case of an economic status variable, the order may be among the classes Very Poor, Poor, Average, Rich, and Very Rich. Here, the splits are similar to the case of a continuous variable, and if there are m distinct ordered values, we consider m-1 distinct splits of the overall data. In the case of a categorical variable with m categories, for example the departments A to F of the UCBAdmissions dataset, the number of possible splits becomes 2^(m-1)-1. However, the benefit of using software like R is that we do not have to worry about these issues; a small numerical check of these counts is sketched below.
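As a small numerical check of the split counts just described (an illustrative sketch, not part of the original steps), the following lines count the candidate splits for the hypothetical height variable and for a six-level categorical variable such as the UCBAdmissions departments.
# Candidate splits for a continuous/ordinal variable with m distinct values: m - 1
height <- c(160, 165, 170, 175, 180)
length(unique(height)) - 1          # 4 candidate splits, as listed above
# Candidate splits for an unordered categorical variable with m categories: 2^(m-1) - 1
m <- length(LETTERS[1:6])           # departments A to F
2^(m - 1) - 1                       # 31 possible splits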
The rst tree
In the CART_Dummy dataset, we can easily visualize the parons for Y as a funcon of the
inputs X1 and X2. Obviously, we have a classicaon problem, and hence we will build the
classicaon tree.
Time for action – building our first tree
The rpart function from the library rpart will be used to obtain the first classification tree. The tree will be visualized by using the plot options of rpart, and we will follow this up by extracting the rules of the tree with the asRules function from the rattle package.
1. Load the rpart package by using library(rpart).
2. Create the classification tree with CART_Dummy_rpart <- rpart(Y~X1+X2, data=CART_Dummy).
3. Visualize the tree with appropriate text labels by using plot(CART_Dummy_rpart); text(CART_Dummy_rpart).
Figure 2: A classification tree for the dummy dataset
Now, the classification tree flows as follows. Obviously, the tree built by the rpart function does not partition as simply as we did in Figure 1: A complex classification dataset with partitions; its working will be dealt with in the third section of this chapter. First, we check whether the value of the second variable X2 is less than 4.875. If the answer is an affirmation, we move to the left side of the tree, and to the right side otherwise. Let us move to the right side. A second question asked is whether X1 is less than 4.5 or not; if the answer is yes, the observation is identified as a red square, and otherwise as a green circle. You are now asked to interpret the left side of the first node. Let us look at the summary of CART_Dummy_rpart.
Classicaon and Regression
[ 262 ]
4. Apply summary, an S3 method, to the classification tree with summary(CART_Dummy_rpart).
That is a lot of output!
Figure 3: Summary of a classification tree
Our interest is in the nodes numbered 5 to 9! Why? The terminal nodes, of course! A terminal node is one in which we can't split the data any further, and for the classification problem, we arrive at a class assignment as the class that has a majority count at the node. The summary shows that there are indeed some misclassifications too. Now, wouldn't it be great if R gave the terminal nodes as rules? The function asRules from the rattle package extracts the rules from an rpart object. Let's do it!
5. Invoke the rattle package with library(rattle) and, using the asRules function, extract the rules from the terminal nodes with asRules(CART_Dummy_rpart).
The result is the following set of rules:
Figure 4: Extracting "rules" from a tree!
We can see that the classification tree is not exactly according to our "bird's-eye" partitioning. However, as a final aspect of our initial understanding, let us plot the segments the naïve way. That is, we will partition the data display according to the terminal nodes of the CART_Dummy_rpart tree.
6. The R code is given right away, though you should make an effort to find the logic behind it. Of course, it is very likely that by now you need to rerun some of the code that was given previously.
abline(h=4.875,lwd=2)
segments(x0=4.5,y0=4.875,x1=4.5,y1=10,lwd=2)
abline(h=1.75,lwd=2)
segments(x0=3.5,y0=1.75,x1=3.5,y1=4.875,lwd=2)
title(main="Classification Tree on the Data Display")
Classicaon and Regression
[ 264 ]
It can be easily seen from the following that rpart works really well:
Figure 5: The terminal nodes on the original display of the data
What just happened?
We obtained our first classification tree, which is a good thing. Given the actual data display, the classification tree gives satisfactory answers.
We have understood the "how" part of a classification tree. The "why" aspect is very vital in science, and the next section explains the science behind the construction of a regression tree; it will be followed later by a detailed explanation of the working of a classification tree.
The construction of a regression tree
In the CART_Dummy dataset, the output is a categorical variable, and we built a classification tree for it. In Chapter 6, Linear Regression Analysis, the linear regression models were built for a continuous random variable, while in Chapter 7, The Logistic Regression Model, we built a logistic regression model for a binary random variable. The same distinction is required in CART, and we thus build classification trees for binary random variables, whereas regression trees are for continuous random variables. Recall the rationale behind the estimation of regression coefficients for the linear regression model. The main goal was to find the estimates of the regression coefficients that minimize the error sum of squares between the actual regressand values and the fitted values. A similar approach is followed here, in the sense that we need to split the data at the points that keep the residual sum of squares to a minimum. That is, for each unique value of a predictor, which is a candidate for the node value, we find the sum of squares of the y's within each partition of the data, and then add them up. This step is performed for each unique value of the predictor, and the value that leads to the least sum of squares among all the candidates is selected as the best split point for that predictor. In the next step, we find the best split points for each of the predictors, and then the best split is selected across the best split points of the predictors. Easy!
Now, the data is partitioned into two parts according to the best split. The process of finding the best split within each partition is repeated in the same spirit as for the first split. This process is carried out in a recursive fashion until the data can't be partitioned any further. What is happening here? The residual sum of squares at each child node will be smaller than that in the parent node.
At the outset, we note that the rpart function does exactly the same thing. However, for a cleaner understanding of the regression tree, we will write raw R code and ensure that there is no ambiguity in the process of understanding CART. We will begin with a simple example of a regression tree, and use the rpart function to plot the regression tree. Then, we will first define a function that extracts the best split given a covariate and the dependent variable. This action will be repeated for all the available covariates, and then we find the best overall split. This will be verified with the regression tree. The data will then be partitioned by using the best overall split, and then the best split will be identified for each of the partitioned parts. The process will be repeated until we reach the end of the complete regression tree given by rpart. First, the experiment!
Classicaon and Regression
[ 266 ]
The cpus dataset available in the MASS package contains the relative performance measure of 209 CPUs in the perf variable. It is known that the performance of a CPU depends on factors such as the cycle time in nanoseconds (syct), the minimum and maximum main memory in kilobytes (mmin and mmax), the cache size in kilobytes (cach), and the minimum and maximum number of channels (chmin and chmax). The task in hand is to model perf as a function of syct, mmin, mmax, cach, chmin, and chmax. The histogram of perf (try hist(cpus$perf)) will show a highly skewed distribution, and hence we will build a regression tree for the logarithmic transformation log10(perf).
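As a quick optional check of the skewness claim (a small sketch, not one of the numbered steps; it assumes the MASS package has already been loaded so that cpus is available), the two histograms can be compared side by side.
# Compare the raw and log10-transformed performance measure (optional check)
par(mfrow=c(1,2))
hist(cpus$perf, main="perf", xlab="perf")                        # heavily right-skewed
hist(log10(cpus$perf), main="log10(perf)", xlab="log10(perf)")   # far more symmetric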
Time for action – the construction of a regression tree
A regression tree is first built by using the rpart function. The getNode function is then introduced, which helps in identifying the split node at each stage; using it, we build a regression tree and verify that we get the same tree as the one returned by the rpart function.
1. Load the MASS library by using library(MASS).
2. Create the regression tree for the logarithm (to the base 10) of perf as a function of the covariates explained earlier, and display the regression tree:
cpus.ltrpart <- rpart(log10(perf)~syct+mmin+mmax+cach+chmin+chmax, data=cpus)
plot(cpus.ltrpart); text(cpus.ltrpart)
The regression tree will be displayed as follows:
Figure 6: Regression tree for the "perf" of a CPU
We will now dene the getNode funcon. Given the regressand and the
covariate, we need to nd the best split in the sense of the sum of squares criterion.
The evaluaon needs to be done for every disnct value of the covariate. If there are
m disnct points, we need m-1 evaluaons. At each disnct point, the regressand
needs to be paroned accordingly, and the sum of squares should be obtained for
each paron. The two sums of squares (in each part) are then added to obtain the
reduced sum of squares. Thus, we create the required funcon to meet all
these requirements.
3. Create the getNode function in R by running the following code:
getNode <- function(x, y) {
  # Candidate split points: the distinct values of x, in decreasing order
  xu <- sort(unique(x), decreasing=TRUE)
  ss <- numeric(length(xu)-1)
  for(i in 1:length(ss)) {
    # Partition y according to whether x lies above or at/below the candidate point
    partR <- y[x>xu[i]]
    partL <- y[x<=xu[i]]
    # Within-partition sums of squares, added to give the split criterion
    partRSS <- sum((partR-mean(partR))^2)
    partLSS <- sum((partL-mean(partL))^2)
    ss[i] <- partRSS + partLSS
  }
  return(list(xnode=xu[which(ss==min(ss,na.rm=TRUE))],
    minss=min(ss,na.rm=TRUE), ss, xu))
}
The getNode funcon gives the best split for a given covariate. It returns a list
consisng of four objects:
xnode, which is a datum of the covariate x that gives the minimum residual
sum of squares for the regressand y
The value of the minimum residual sum of squares
The vector of the residual sum of squares for the distinct points of the
vector x
The vector of the distinct x values
We will run this funcon for each of the six covariates, and nd the best overall split.
The argument na.rm=TRUE is required, as at the maximum value of x we won't get
a numeric value.
Classicaon and Regression
[ 268 ]
4. We will rst execute the getNode funcon on the syct covariate, and look at the
output we get as a result:
> getNode(cpus$syct,log10(cpus$perf))$xnode
[1] 48
> getNode(cpus$syct,log10(cpus$perf))$minss
[1] 24.72
> getNode(cpus$syct,log10(cpus$perf))[[3]]
[1] 43.12 42.42 41.23 39.93 39.44 37.54 37.23 36.87 36.51 36.52
35.92 34.91
[13] 34.96 35.10 35.03 33.65 33.28 33.49 33.23 32.75 32.96 31.59
31.26 30.86
[25] 30.83 30.62 29.85 30.90 31.15 31.51 31.40 31.50 31.23 30.41
30.55 28.98
[37] 27.68 27.55 27.44 26.80 25.98 27.45 28.05 28.11 28.66 29.11
29.81 30.67
[49] 28.22 28.50 24.72 25.22 26.37 28.28 29.10 33.02 34.39 39.05
39.29
> getNode(cpus$syct,log10(cpus$perf))[[4]]
[1] 1500 1100 900 810 800 700 600 480 400 350 330 320
300 250 240
[16] 225 220 203 200 185 180 175 167 160 150 143 140
133 125 124
[31] 116 115 112 110 105 100 98 92 90 84 75 72
70 64 60
[46] 59 57 56 52 50 48 40 38 35 30 29 26
25 23 17
The least sum of squares at a split for the best split value of the syct variable is 24.72, and it occurs at a value of syct greater than 48. The third and fourth list objects given by getNode respectively contain the details of the sums of squares for the potential candidates and the unique values of syct. The values of interest are highlighted. Thus, we will first look at the second object from the list output for all six covariates to find the best split among the best splits of each of the variables, by the residual sum of squares criterion.
5. Now, run the getNode function for the remaining five covariates:
getNode(cpus$syct,log10(cpus$perf))[[2]]
getNode(cpus$mmin,log10(cpus$perf))[[2]]
getNode(cpus$mmax,log10(cpus$perf))[[2]]
getNode(cpus$cach,log10(cpus$perf))[[2]]
getNode(cpus$chmin,log10(cpus$perf))[[2]]
getNode(cpus$chmax,log10(cpus$perf))[[2]]
getNode(cpus$cach,log10(cpus$perf))[[1]]
sort(getNode(cpus$cach,log10(cpus$perf))[[4]],decreasing=FALSE)
The output is as follows:
Figure 7: Obtaining the best "first split" of regression tree
The sum of squares for cach is the lowest, and hence we need to find the best split associated with it, which is 24. However, the regression tree shows that the best split is at the cach value of 27. The getNode function says that the best split occurs at a point greater than 24, and hence we take the average of 24 and the next unique point, 30, which gives 27. Having obtained the best overall split, we next obtain the first partition of the dataset.
6. Paron the data by using the best overall split point:
cpus_FS_R <- cpus[cpus$cach>=27,]
cpus_FS_L <- cpus[cpus$cach<27,]
The new names of the data objects are clear with _FS_R indicang the dataset
obtained on the right side for the rst split, and _FS_L indicang the le side.
In the rest of the secon, the nomenclature won't be further explained.
7. Idenfy the best split in each of the paroned datasets:
getNode(cpus_FS_R$syct,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmin,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$cach,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$chmin,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$chmax,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[1]]
sort(getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_L$syct,log10(cpus_FS_L$perf))[[2]]
Classicaon and Regression
[ 270 ]
getNode(cpus_FS_L$mmin,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$cach,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$chmin,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$chmax,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[1]]
sort(getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[4]],
decreasing=FALSE)
The following screenshot gives the results of running the preceding R code:
Figure 8: Obtaining the next two splits
Thus, for the rst right paroned data, the best split is for the mmax value as the
mid-point between 24000 and 32000; that is, at mmax = 28000. Similarly, for the
rst le-paroned data, the best split is the average value of 6000 and 6200,
which is 6100, for the same mmax covariate. Note the important step here. Even
though we used cach as the criteria for the rst paron, it is sll used with the two
paroned data. The results are consistent with the display given by the regression
tree, Figure 6: Regression tree for the "perf" of a CPU. The next R program will take
care of the enre rst split's right side's future parons.
8. Paron the rst right part cpus_FS_R as follows:
cpus_FS_R_SS_R <- cpus_FS_R[cpus_FS_R$mmax>=28000,]
cpus_FS_R_SS_L <- cpus_FS_R[cpus_FS_R$mmax<28000,]
Obtain the best split for cpus_FS_R_SS_R and cpus_FS_R_SS_L by running the
following code:
cpus_FS_R_SS_R <- cpus_FS_R[cpus_FS_R$mmax>=28000,]
cpus_FS_R_SS_L <- cpus_FS_R[cpus_FS_R$mmax<28000,]
getNode(cpus_FS_R_SS_R$syct,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$mmin,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$mmax,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$chmin,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$chmax,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[1]]
sort(getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_R_SS_L$syct,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$mmin,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$mmax,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$chmin,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$chmax,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))[[1]]
sort(getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))
[[4]],decreasing=FALSE)
Classicaon and Regression
[ 272 ]
For the cpus_FS_R_SS_R part, the final division is according to whether cach is greater than 56 or not (the average of 48 and 64). If the cach value in this partition is greater than 56, then perf (actually log10(perf)) ends in the terminal leaf with value 3, else 2. However, for the region cpus_FS_R_SS_L, we partition the data further according to whether the cach value is greater than 96.5 (the average of 65 and 128). On the right side of this region, log10(perf) is found as 2, and a third-level split is required for cpus_FS_R_SS_L with cpus_FS_R_SS_L_TS_L. Note that though the final terminal leaves of the cpus_FS_R_SS_L_TS_L region show the same value 2 for the final log10(perf), this split may still result in a significant reduction in the variability of the difference between the predicted and the actual log10(perf) values. We will now focus on the left side of the first main split.
Figure 9: Partitioning the right partition after the first main split
Classicaon and Regression
[ 274 ]
9. Paron cpus_FS_L accordingly, as the mmax value being greater than 6100
or otherwise:
cpus_FS_L_SS_R <- cpus_FS_L[cpus_FS_L$mmax>=6100,]
cpus_FS_L_SS_L <- cpus_FS_L[cpus_FS_L$mmax<6100,]
The rest of the paron for cpus_FS_L is completely given next.
10. The details will be skipped and the R program is given right away:
cpus_FS_L_SS_R <- cpus_FS_L[cpus_FS_L$mmax>=6100,]
cpus_FS_L_SS_L <- cpus_FS_L[cpus_FS_L$mmax<6100,]
getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$mmin,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$mmax,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$cach,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$chmin,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$chmax,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[1]]
sort(getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_L_SS_L$syct,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmin,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$cach,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$chmin,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$chmax,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))[[1]]
sort(getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))
[[4]],decreasing=FALSE)
cpus_FS_L_SS_R_TS_R <- cpus_FS_L_SS_R[cpus_FS_L_SS_R$syct<360,]
getNode(cpus_FS_L_SS_R_TS_R$syct,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$mmin,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$mmax,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$cach,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmax,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$perf))[[1]]
sort(getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$perf))[[4]],
  decreasing=FALSE)
We will now see how the preceding R code gets us closer to the regression tree:
Figure 10: Partitioning the left partition after the first main split
We leave it to you to interpret the output arising from the previous action.
Classicaon and Regression
[ 276 ]
What just happened?
Using the rpart function from the rpart library, we first built the regression tree for log10(perf). Then, we explored the basic definitions underlying the construction of a regression tree and defined the getNode function to obtain the best split for a pair of a regressand and a covariate. This function was then applied to all the covariates, and the best overall split was obtained; using this, we got our first partition of the data, which was in agreement with the tree given by the rpart function. We then recursively partitioned the data by using the getNode function and verified that all the best splits in each partitioned part are in agreement with the ones provided by the rpart function.
The reader may wonder if the preceding tedious task was really essential. However, it has been the experience of the author that users/readers seldom remember the rationale behind directly used code/functions of any software after some time. Moreover, CART is a difficult concept, and it is imperative that we clearly understand our first tree, and return to the preceding program whenever the understanding of the science behind CART is forgotten. The construction of a classification tree uses entirely different metrics, and hence its working is also explained in considerable depth in the next section.
The construction of a classication tree
We rst need to set up the spling criteria for a classicaon tree. In the case of a
regression tree, we saw the sum of squares as the spling criteria. For idenfying the split
for a classicaon tree, we need to dene certain measures known as impurity measures.
The three popular measures of impurity are Bayes error, the cross-entropy funcon, and Gini
index. Let p denote the percentage of success in a dataset of size n. The formulae of these
impurity measures are given in the following table:
Measure                      Formula
Bayes error                  \( \phi_B(p) = \min(p, 1 - p) \)
The cross-entropy measure    \( \phi_{CE}(p) = -p \log p - (1 - p) \log (1 - p) \)
Gini index                   \( \phi_G(p) = p(1 - p) \)
We will write a short program to understand the shape of these impurity measures as a function of p:
p <- seq(0.01,0.99,0.01)
plot(p,pmin(p,1-p),"l",col="red",xlab="p",xlim=c(0,1),ylim=c(0,1),ylab="Impurity Measures")
points(p,-p*log(p)-(1-p)*log(1-p),"l",col="green")
points(p,p*(1-p),"l",col="blue")
title(main="Impurity Measures")
legend(0.6,1,c("Bayes Error","Cross-Entropy","Gini Index"),col=c("red","green","blue"),pch="-")
The preceding R code, when executed in an R session, gives the following output:
Figure 11: Impurity metrics – Bayes error, cross-entropy, and Gini index
Basically, we have these three choices of impurity metrics as a building block of a classification tree. The popular choice is the Gini index, and there are detailed discussions about the reason in the literature; see Breiman, et. al. (1984). However, we will not delve into this aspect, and for the development in this section, we will use the cross-entropy function.
Classicaon and Regression
[ 278 ]
Now, for a given predictor, assume that we have a node denoted by A. In the initial stage, where there are no partitions, the impurity is based on the proportion p. The impurity of node A is taken to be a non-negative function of the probability that y = 1, written mathematically as p(y = 1 | A). The impurity of node A is defined as follows:

\( I(A) = \phi\left[ p(y = 1 \mid A) \right] \)

Here, \( \phi \) is one of the three impurity measures. When A is one of the internal nodes, the tree gets bifurcated into the left- and right-hand sides; that is, we now have a left daughter \( A_L \) and a right daughter \( A_R \). For the moment, we will take the split according to the predictor variable x; that is, if \( x \leq c \), the observation moves to \( A_L \), otherwise to \( A_R \). Then, according to the split criterion, we have the following table; this is the same as Table 3.2 of Berk (2008):

Split criterion          Failure (0)    Success (1)    Total
\( A_L: x \leq c \)      \( n_{11} \)   \( n_{12} \)   \( n_{1.} \)
\( A_R: x > c \)         \( n_{21} \)   \( n_{22} \)   \( n_{2.} \)
Total                    \( n_{.1} \)   \( n_{.2} \)   \( n_{..} \)
Using the frequencies in the preceding table, the impurities of the daughter nodes \( A_L \) and \( A_R \), based on the cross-entropy metric, are given as follows:

\( I(A_L) = -\left[ \frac{n_{11}}{n_{1.}} \log \frac{n_{11}}{n_{1.}} + \frac{n_{12}}{n_{1.}} \log \frac{n_{12}}{n_{1.}} \right] \)

And:

\( I(A_R) = -\left[ \frac{n_{21}}{n_{2.}} \log \frac{n_{21}}{n_{2.}} + \frac{n_{22}}{n_{2.}} \log \frac{n_{22}}{n_{2.}} \right] \)

The probabilities of an observation falling in the left- and right-hand daughter nodes are respectively given by \( p(A_L) = n_{1.}/n_{..} \) and \( p(A_R) = n_{2.}/n_{..} \). Then, the benefit of splitting the node A is given as follows:

\( \Delta(A) = I(A) - p(A_L) I(A_L) - p(A_R) I(A_R) \)
Now, we compute \( \Delta(A) \) for all unique values of a predictor, and choose as the best split the value for which \( \Delta(A) \) is a maximum. This step is repeated across all the variables, and the best split, the one with the maximum \( \Delta(A) \), is selected. According to the best split, the data is partitioned, and as seen earlier during the construction of the regression tree, a similar search is performed in each of the partitioned datasets. The process continues until the gain from a split falls below a minimum threshold in each of the partitioned datasets.
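To make the computation concrete, the following short R snippet (a sketch with made-up counts, not taken from the original text) evaluates the cross-entropy impurities and the gain \( \Delta(A) \) for a single candidate split summarized by a 2 x 2 table of counts:
# Cross-entropy impurity for a vector of class counts (hypothetical toy numbers)
cross_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                     # guard against log(0)
  -sum(p * log(p))
}
# Rows are the daughter nodes (A_L, A_R); columns are (Failure, Success)
n <- matrix(c(30, 10,
               5, 25), nrow = 2, byrow = TRUE)
I_A  <- cross_entropy(colSums(n))   # impurity of the parent node A
I_AL <- cross_entropy(n[1, ])       # impurity of the left daughter
I_AR <- cross_entropy(n[2, ])       # impurity of the right daughter
p_AL <- sum(n[1, ]) / sum(n)        # probability of falling into A_L
p_AR <- sum(n[2, ]) / sum(n)        # probability of falling into A_R
I_A - p_AL * I_AL - p_AR * I_AR     # the gain Delta(A) of this split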
We will begin with the classicaon tree as delivered by the rpart funcon. The illustrave
dataset kyphosis is selected from the rpart library itself. The data relates to children
who had correcve spinal surgery. This medical problem is about the exaggerated outward
curvature of the thoracic region of the spine, which results in a rounded upper back. In
this study, 81 children underwent a spinal surgery and aer the surgery, informaon is
captured to know whether the children sll have the kyphosis problem in the column
named Kyphosis. The value of Kyphosis="absent" indicates that the child has been
cured of the problem, and Kyphosis="present" means that child has not been cured for
kyphosis. The other informaon captured is related to the age of the children, the number
of vertebrae involved, and the number of rst (topmost) vertebrae operated on. The task
for us is building a classicaon tree, which gives the Kyphosis status dependent on the
described variables.
We will rst build the classicaon tree for Kyphosis as a funcon of the three variables
Age, Start, and Number. The tree will then be displayed and rules will be extracted from
it. The getNode funcon will be dened based on the cross-entropy funcon, which will be
applied on the raw data and the rst overall opmal split obtained to paron the data.
The process will be recursively repeated unl we get the same tree as returned by the
rpart funcon.
Time for action – the construction of a classification tree
The getNode function is now defined here to help us identify the best split for the classification problem. For the kyphosis dataset from the rpart package, we plot the classification tree by using the rpart function. The tree is then reobtained by using the getNode function.
1. Using the option split="information", construct a classification tree based on the cross-entropy information for the kyphosis data with the following code:
ky_rpart <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, parms=list(split="information"))
Classicaon and Regression
[ 280 ]
2. Visualize the classicaon tree by using plot(ky_rpart); text(ky_rpart):
Figure 12: Classification tree for the kyphosis problem
3. Extract the rules from ky_rpart by using the asRules function from the rattle package:
> asRules(ky_rpart)
Rule number: 15 [Kyphosis=present cover=13 (16%) prob=0.69]
Start< 12.5
Age>=34.5
Number>=4.5
Rule number: 14 [Kyphosis=absent cover=12 (15%) prob=0.42]
Start< 12.5
Age>=34.5
Number< 4.5
Rule number: 6 [Kyphosis=absent cover=10 (12%) prob=0.10]
Start< 12.5
Age< 34.5
Rule number: 2 [Kyphosis=absent cover=46 (57%) prob=0.04]
Start>=12.5
4. Dene the getNode funcon for the classicaon problem:
In the preceding funcon, the key funcons would be unique, table, and log.
We use unique to ensure that the search is carried for the disnct elements of
the predictor values in the data. table gets the required counts as discussed earlier in
this secon. The if condion ensures that neither the p nor 1-p values become 0, in
which case the logs become minus innity. The rest of the coding is self-explanatory.
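Since the function body appeared only as a screenshot in the original, the following is a minimal reconstruction (an illustrative sketch consistent with the preceding description and with how getNode is indexed in the following steps, not the author's original code):
getNode <- function(x, y) {
  xu <- sort(unique(x))                      # distinct candidate split points
  n  <- length(y)
  # cross-entropy impurity of a 0/1 vector; the if condition guards against log(0)
  impurity <- function(yy) {
    cnt <- table(factor(yy, levels = c(0, 1)))
    p <- as.numeric(cnt["1"] / sum(cnt))
    if (p == 0 | p == 1) return(0)
    -p * log(p) - (1 - p) * log(1 - p)
  }
  I_A   <- impurity(y)                       # impurity before splitting
  gains <- numeric(length(xu))
  for (i in seq_along(xu)) {
    yl <- y[x <= xu[i]]; yr <- y[x > xu[i]]
    if (length(yl) == 0 | length(yr) == 0) next
    gains[i] <- I_A - (length(yl) / n) * impurity(yl) -
                      (length(yr) / n) * impurity(yr)
  }
  best <- which.max(gains)
  # [[1]] best split point, [[2]] maximum gain, [[3]] candidate points, [[4]] all gains
  list(xu[best], gains[best], xu, gains)
}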
Let us now get our rst best split.
5. We will need a few data manipulaons to ensure that our R code works on the
expected lines:
KYPHOSIS <- kyphosis
KYPHOSIS$Kyphosis_y <- (kyphosis$Kyphosis=="absent")*1
6. To nd the rst best split among the three variables, execute the following code;
the output is given in a consolidated screenshot aer all the iteraons:
getNode(KYPHOSIS$Age,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Number,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[4]],
decreasing=FALSE)
Classicaon and Regression
[ 282 ]
Now, getNode indicates that the best split occurs for the Start variable, and the point of the best split is 12. Keeping in line with the argument of the previous section, we split the data into two parts according to whether the Start value is greater than the average of 12 and 13, that is, 12.5. For the partitioned data, the search proceeds in a recursive fashion.
7. Paron the data accordingly, as the Start values are greater than 12.5, and nd
the best split for the right daughter node, as the tree display shows that a search in
the le daughter node is not necessary:
KYPHOSIS_FS_R <- KYPHOSIS[KYPHOSIS$Start<12.5,]
KYPHOSIS_FS_L <- KYPHOSIS[KYPHOSIS$Start>=12.5,]
getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Number,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Start,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[4]],
decreasing=FALSE)
The maximum incremental value occurs for the predictor Age, and the split point is 27. Again, we take the average of 27 and the next highest value of 42, which turns out to be 34.5. The (first) right daughter node region is then partitioned into two parts according to whether the Age values are greater than 34.5, and the search for the next split continues in the current right daughter part.
8. The following code completes our search:
KYPHOSIS_FS_R_SS_R <- KYPHOSIS_FS_R[KYPHOSIS_FS_R$Age>=34.5,]
KYPHOSIS_FS_R_SS_L <- KYPHOSIS_FS_R[KYPHOSIS_FS_R$Age<34.5,]
getNode(KYPHOSIS_FS_R_SS_R$Age,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Start,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[4]],decreasing=FALSE)
We see that the nal split occurs for the predictor Number and the split is 4,
and we again stop at 4.5.
We see that the results from our raw code completely agree with the rpart
funcon. Thus, the eorts of wring custom code for the classicaon tree have
paid the right dividends. We now have enough clarity for the construcon of the
classicaon tree:
Figure 13: Finding the best splits for classification tree using the getnode function
Classicaon and Regression
[ 284 ]
What just happened?
A deliberate attempt has been made at demystifying the construction of a classification tree. As with the earlier attempt at understanding a regression tree, we first deployed the rpart function and saw a display of the classification tree for Kyphosis as a function of Age, Start, and Number, for the choice of the cross-entropy impurity metric. The getNode function was defined on the basis of the same impurity metric, and in a very systematic fashion, we reproduced the same tree as obtained by the rpart function.
With the understanding of the basic construction behind us, we will now build the classification tree for the German credit data problem.
Classication tree for the German credit data
In Chapter 7, The Logisc Regression Model, we constructed a logisc regression model,
and in the previous chapter, we obtained the ridge regression version for the German credit
data problem. However, problems such as these and many others may have non-linearity
built in them, and it is worthwhile to look at the same problem by using a classicaon
tree. Also, we saw another model performance of the German credit data using the train,
validate, and test approach. We will have the following approach. First, we will paron
the German dataset into three parts, namely train, validate, and test. The classicaon tree
will be built by using the data in the train set and then it will be applied on the validate
part. The corresponding ROC curves will be visualized, and if we feel that the two curves
are reasonably similar, we will apply it on the test region, and take the necessary acon of
sanconing the customers their required loan.
Time for action – the construction of a classification tree
A classification tree is now built for the German credit data by using the rpart function. The approach of train, validate, and test is implemented, and the ROC curves are obtained too.
1. The following code has been used earlier in the book, and hence there won't be an explanation of it:
set.seed(1234567)
data_part_label <- c("Train","Validate","Test")
indv_label = sample(data_part_label,size=1000,replace=TRUE,prob=c(0.6,0.2,0.2))
library(ROCR)
data(GC)
GC_Train <- GC[indv_label=="Train",]
GC_Validate <- GC[indv_label=="Validate",]
GC_Test <- GC[indv_label=="Test",]
2. Create the classicaon tree for the German credit data, and visualize the tree.
We will also extract the rules from this classicaon tree:
GC_rpart <- rpart(good_bad~.,data=GC_Train)
plot(GC_rpart); text(GC_rpart)
asRules(GC_rpart)
The classicaon tree for the German credit data appears as in the
following screenshot:
Figure 14: Classification tree for the test part of the German credit data problem
Classicaon and Regression
[ 286 ]
By now, we know how to nd the rules of this tree. An edited version of the rules is
given as follows:
Figure 15: Rules for the German credit data
3. We use the tree given in the previous step on the validate region, and plot the ROC for both the regions:
Pred_Train_Class <- predict(GC_rpart,type='class')
Pred_Train_Prob <- predict(GC_rpart,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
We will go ahead and predict for the test part too.
4. The necessary code is the following:
Pred_Test_Class <- predict(GC_rpart,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
The final ROC curve plot looks similar to the following screenshot:
Figure 16: ROC Curves for German Credit Data
Classicaon and Regression
[ 288 ]
The performance of the classification tree is certainly not satisfactory even on the validate group. The only solace here is that the test curve is quite similar to the validate curve. We will look at more modern ways of improving the basic classification tree in the next chapter. The classification tree in Figure 14: Classification tree for the test part of the German credit data problem is very large and complex, and we sometimes need to truncate the tree to make the classification method a bit simpler. Of course, one of the things that we should suspect whenever we look at very large trees is that we may again be facing the problem of overfitting. The final section deals with a simple method of overcoming this problem.
What just happened?
A classification tree has been built for the German credit dataset. The ROC curve shows that the tree does not perform well on the validate data part. In the next and concluding section, we look at two ways of improving this tree.
Have a go hero
Using the getNode function, verify the first five splits of the classification tree for the German credit data.
Pruning and other ner aspects of a tree
Recall from Figure 14: Classicaon tree for the test part of the German credit data problem
that the rules numbered 21, 143, 69, 165, 142, 70, 40, 164, and 16, respecvely, covered
only 20, 25, 11, 11, 14, 12, 28, 19, and 22. If we look at the total number of observaons,
we have about 600, and individually these rules do not cover even about ve percent of
them. This is one reason to suspect that maybe we overed the data. Using the opon of
minsplit, we can restrict the minimum number of observaons each rule should cover at
the least.
Another technical way of reducing the complexity of a classicaon tree is by "pruning" the
tree. Here, the least important splits are recursively snipped o according to the complexity
parameter; for details, refer to Breiman, et. al. (1984), or Secon 3.6 of Berk (2008). We will
illustrate the acon through the R program.
Time for action – pruning a classification tree
A CART is improved by using the minsplit and cp arguments in the rpart function.
1. Invoke the graphics editor with par(mfrow=c(1,2)).
2. Specify minsplit=30, and re-do the ROC plots by using the new classification tree:
GC_rpart_minsplit <- rpart(good_bad~.,data=GC_Train,minsplit=30)
Pred_Train_Class <- predict(GC_rpart_minsplit,type='class')
Pred_Train_Prob <- predict(GC_rpart_minsplit,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart_minsplit,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart_minsplit,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
Pred_Test_Class <- predict(GC_rpart_minsplit,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart_minsplit,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
title(main='Improving a Classification Tree with "minsplit"')
3. For the pruning factor cp=0.02, repeat the ROC plot exercise:
GC_rpart_prune <- prune(GC_rpart,cp=0.02)
Pred_Train_Class <- predict(GC_rpart_prune,type='class')
Pred_Train_Prob <- predict(GC_rpart_prune,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Classicaon and Regression
[ 290 ]
Pred_Validate_Class <- predict(GC_rpart_prune,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart_prune,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
Pred_Test_Class <- predict(GC_rpart_prune,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart_prune,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
title(main="Improving a Classification Tree with Pruning")
The choice of cp=0.02 has been drawn from the plot of the complexity parameter against the relative error; try it yourself with plotcp(GC_rpart).
Figure 17: Pruning the CART
What just happened?
Using the minsplit and cp options, we have managed to obtain a reduced set of rules, and in that sense, the fitted model does not appear to be an overfit. The ROC curves show that there has been a considerable improvement in the performance on the validate region. Again, as earlier, the validate and test regions have similar ROC curves, and it is hence preferable to use GC_rpart_prune or GC_rpart_minsplit over GC_rpart.
Pop quiz
With the experience of model selection from the previous chapter, justify the choice of cp=0.02 from the plot obtained as a result of running plotcp(GC_rpart).
Summary
We began with the idea of recursive partitioning and gave a legitimate reason as to why such an approach is practical. The CART technique was completely demystified by using the getNode function, which was defined appropriately depending upon whether we require a regression or a classification tree. With the conviction behind us, we applied the rpart function to the German credit data, and with its results, we had basically two problems. First, the fitted classification tree appeared to overfit the data. This problem may often be overcome by using the minsplit and cp options. The second problem was that the performance was really poor on the validate region. Though the reduced classification trees had slightly better performance compared to the initial tree, we still need to improve the classification tree. The next chapter will focus more on this aspect and discuss the modern developments of CART.
Chapter 10 – CART and Beyond
In the previous chapter, we studied CART as a powerful recursive partitioning
method, useful for building (non-linear) models. Despite the overall generality,
CART does have certain limitations that necessitate some enhancements. It is
these extensions that form the crux of the final chapter of this book. For some
technical reasons, we will focus solely on the classification trees in this chapter.
We will also briefly look at some limitations of the CART tool.
The rst improvement that can be made to the CART is provided by the bagging technique.
In this technique, we build mulple trees on the bootstrap samples drawn from the
actual dataset. An observaon is put through each of the trees and a predicon is made
for its class, and based on the majority predicon of its class, it is predicted to belong to
the majority count class. A dierent approach is provided by Random Forests, where you
consider a random pool of covariates against the observaons. We nally consider another
important enhancement of a CART by using the boosng algorithms. The chapter will
discuss the following:
Cross-validaon errors for CART
The bootstrap aggregaon (bagging) technique for CART
Extending the CART with random forests
A consolidaon of the applicaons developed from Chapter 6 to Chapter 10,
CART and Beyond
[ 294 ]
Improving CART
In the Another look at model assessment section of Chapter 8, we saw that the technique of train + validate + test may be further enhanced by using the cross-validation technique. In the case of the linear regression model, we used the CVlm function from the DAAG package for the purpose of cross-validation of linear models. The cross-validation technique for logistic regression models may be carried out by using the CVbinary function from the same package.
Profs. Therneau and Atkinson created the rpart package, and detailed documentation of the entire rpart package is available on the Web at http://www.mayo.edu/hsr/techrpt/61.pdf. Recall the slight improvement provided in the Pruning and other finer aspects of a tree section of the previous chapter. The two aspects considered there related to the complexity parameter cp and the minimum split criterion minsplit. Now, the problem of overfitting with CART may be reduced to an extent by using the cross-validation technique. In the ridge regression model, we had the problem of selecting the penalty factor \( \lambda \). Similarly, here we have the problem of selecting the complexity parameter, though not in an analogous way. That is, for the complexity parameter, which is a number between 0 and 1, we need to obtain the predictions based on the cross-validation technique. This may lead to a small loss of accuracy; however, we will then increase the accuracy by looking at the generality. An object of the rpart class has many summaries contained within it, and the various complexity parameters are stored in the cptable matrix. This matrix has values for the following metrics: CP, nsplit, rel error, xerror, and xstd. Let us understand this matrix through the default example in the rpart package, which is example(xpred.rpart); see Figure 1: Understanding the example for the "xpred.rpart" function:
Figure 1: Understanding the example for the "xpred.rpart" function
Here the tree has CP at four values, namely 0.595, 0.135, 0.013, and 0.010. The corresponding nsplit numbers are 0, 1, 2, and 3, and similarly, the relative error values, xerror and xstd, are given in the last part of the previous screenshot. The interpretation of the CP value is slightly different, the reason being that these have to be considered as ranges and not values, in the sense that the rest of the performance is not with respect to the CP values as mentioned previously; rather, they are with respect to the intervals [0.595, 1], [0.135, 0.595), [0.013, 0.135), and [0.010, 0.013); see ?xpred.rpart for more information. Now, the function xpred.rpart returns the predictions based on the cross-validation technique. Therefore, we will use this function for the German credit data problem and for different CP values (actually ranges), to obtain the accuracy of the cross-validation technique.
Time for action – cross-validation predictions
We will use the xpred.rpart function from rpart to obtain the cross-validation predictions from an rpart object.
1. Load the German dataset and the rpart package using data(GC); library(rpart).
2. Fit the classification tree with GC_Complete <- rpart(good_bad~., data=GC).
3. Check cptable with GC_Complete$cptable:
CP nsplit rel error xerror xstd
1 0.05167 0 1.0000 1.0000 0.04830
2 0.04667 3 0.8400 0.9833 0.04807
3 0.01833 4 0.7933 0.8900 0.04663
4 0.01667 6 0.7567 0.8933 0.04669
5 0.01556 8 0.7233 0.8800 0.04646
6 0.01000 11 0.6767 0.8833 0.04652
4. Obtain the cross-validation predictions using GC_CV_Pred <- xpred.rpart(GC_Complete).
5. Find the accuracy of the cross-validation predictions:
sum(diag(table(GC_CV_Pred[,2],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,3],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,4],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,5],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,6],GC$good_bad)))/1000
The accuracy output is as follows:
> sum(diag(table(GC_CV_Pred[,2],GC$good_bad)))/1000
[1] 0.71
> sum(diag(table(GC_CV_Pred[,3],GC$good_bad)))/1000
[1] 0.744
> sum(diag(table(GC_CV_Pred[,4],GC$good_bad)))/1000
[1] 0.734
> sum(diag(table(GC_CV_Pred[,5],GC$good_bad)))/1000
[1] 0.74
> sum(diag(table(GC_CV_Pred[,6],GC$good_bad)))/1000
[1] 0.741
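The same five accuracies can also be computed in one sweep (a small sketch, not part of the original code):
sapply(2:6, function(j) sum(diag(table(GC_CV_Pred[, j], GC$good_bad))) / 1000)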
It is natural that when you execute the same code, you will most likely get a different output. Why is that? Also, you need to answer for yourselves why we did not check the accuracy for GC_CV_Pred[,1]. In general, for a decreasing CP range, we expect higher accuracy. We have checked the cross-validation predictions for various CP ranges. There are also other techniques to enhance the performance of a CART.
What just happened?
We used the xpred.rpart function to obtain the cross-validation predictions for a range of CP values. The accuracy of a prediction model has been assessed by using simple functions such as table and diag.
However, the control actions of minsplit and cp are of a reactive nature, applied after the splits have already been decided. In that sense, when we have a large number of covariates, the CART may lead to an overfit of the data and may try to capture all the local variations of the data, and thus lose sight of the overall generality. Therefore, we need useful mechanisms to overcome this problem.
The classification and regression tree considered in the previous chapter is a single model. That is, we are seeking the opinion (prediction) of a single model. Wouldn't it be nice if we could extend this! Alternatively, we can seek multiple models instead of a single model. What does this mean? In the forthcoming sections, we will see the use of multiple models for the same problem.
Bagging
Bagging is an abbreviation for bootstrap aggregation. The important underlying concept here is the bootstrap, which was invented by the eminent scientist Bradley Efron. We will first digress here a bit from the CART technique and consider a very brief illustration of the bootstrap technique.
The bootstrap
Consider a random sample \( X_1, \ldots, X_n \) of size n from \( f(x, \theta) \). Let \( T(X_1, \ldots, X_n) \) be an estimator of \( \theta \). To begin with, we first draw a random sample of size n from \( X_1, \ldots, X_n \) with replacement; that is, we obtain a random sample \( X_1^*, X_2^*, \ldots, X_n^* \), where some of the observations from the original sample may be repeated and some may not be present at all. There is no one-to-one correspondence between \( X_1, \ldots, X_n \) and \( X_1^*, X_2^*, \ldots, X_n^* \). Using \( X_1^*, X_2^*, \ldots, X_n^* \), we compute \( T^{(1)}(X_1^*, \ldots, X_n^*) \). Repeat this exercise a large number of times, say B. The inference for \( \theta \) is carried out by using the sampling distribution of the bootstrap estimates \( T^{(1)}(X_1^*, \ldots, X_n^*), \ldots, T^{(B)}(X_1^*, \ldots, X_n^*) \). Let us illustrate the concept of the bootstrap with the famous aspirin example; see Chapter 8 of Tattar, et. al. (2013).
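Before turning to that example, here is a bare-bones illustration of the notation (a sketch, not from the original text), using the sample mean as the estimator T and simulated data:
set.seed(101)
x <- rnorm(50, mean = 10, sd = 2)        # the observed sample X_1, ..., X_n
B <- 1000
T_boot <- replicate(B, mean(sample(x, replace = TRUE)))   # T^(1), ..., T^(B)
quantile(T_boot, c(0.025, 0.975))        # a bootstrap 95% interval for the mean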
A surprising double-blind experiment, reported in the New York Times, indicated that an aspirin consumed on alternate days significantly reduces the number of heart attacks among men. In the experiment, 104 out of 11037 healthy middle-aged men consuming small doses of aspirin suffered a fatal/non-fatal heart attack, whereas 189 out of 11034 individuals on the placebo had an attack. Therefore, the odds ratio of the aspirin-to-placebo heart attack possibility is (104 / 11037) / (189 / 11034) = 0.55. This indicates that only about 55 percent of the number of heart attacks observed for the group taking the placebo is likely to be observed for men consuming small doses of aspirin. That is, the chances of having a heart attack when taking aspirin are almost halved. The experiment being scientific, the results look very promising. However, we would like to obtain a confidence interval for the odds ratio of the heart attack. If we don't know the sampling distribution of the odds ratio, we can use the bootstrap technique to obtain the same. There is another aspect of the aspirin study. It has been observed that the aspirin group had 119 individuals who had strokes. The number of strokes for the placebo group is 98. Therefore, the odds ratio of a stroke is (119 / 11037) / (98 / 11034) = 1.21. This is shocking! It says that though aspirin reduces the possibility of a heart attack, about 21 percent more people are likely to have a stroke when compared to the placebo group. Now, let's use the bootstrap technique to obtain the confidence intervals for the heart attacks as well as the strokes.
Time for action – understanding the bootstrap technique
The boot package, which comes shipped with R, will be used for bootstrapping the odds ratio.
1. Get the boot package using library(boot).
The boot package is shipped with the R software itself, and thus it does not require separate installation. The main components of the boot function will be explained soon.
2. Define the odds-ratio function:
OR <- function(data,i) {
  x <- data[,1]; y <- data[,2]
  odds.ratio <- (sum(x[i]==1,na.rm=TRUE)/length(na.omit(x[i]))) /
    (sum(y[i]==1,na.rm=TRUE)/length(na.omit(y[i])))
  return(odds.ratio)
}
The name OR stands, of course, for odds ratio. The data for this function consists of two columns, one of which may have more observations than the other. The option na.rm is used to ignore the NA data values, whereas the na.omit function will remove them. It is easy to see that the odds.ratio object indeed computes the odds ratio. Note that we have specified i as an input to the function OR, since this function will be used within boot; i is the vector of indices that identifies the ith bootstrap sample, so the odds ratio is calculated on that resample. Note that x[i] therefore does not refer to the ith element of x.
3. Get the data for both the aspirin and placebo groups (the heart attack and stroke data), with the following code:
aspirin_hattack <- c(rep(1,104),rep(0,11037-104))
placebo_hattack <- c(rep(1,189),rep(0,11034-189))
aspirin_strokes <- c(rep(1,119),rep(0,11037-119))
placebo_strokes <- c(rep(1,98),rep(0,11034-98))
4. Combine the data groups and run 1000 bootstrap replicates, calculating the odds ratio for each of the bootstrap samples. Use the following boot function:
hattack <- cbind(aspirin_hattack,c(placebo_hattack,NA,NA,NA))
hattack_boot <- boot(data=hattack,statistic=OR,R=1000)
strokes <- cbind(aspirin_strokes,c(placebo_strokes,NA,NA,NA))
strokes_boot <- boot(data=strokes,statistic=OR,R=1000)
We are using three opons of the boot funcon, namely data, statistic,
and R. The rst opon accepts the data frame of interest; the second one accepts
the stasc, either an exisng R funcon or a funcon dened by the user; and
nally, the third opon accepts the number of bootstrap replicaons. The boot
funcon creates an object of the boot class, and in this case, we are obtaining the
odds-rao for various bootstrap samples.
5. Using the bootstrap samples and the odds-rao for the bootstrap samples, obtain
a 95 percent condence interval by using the quantile funcon:
quantile(hattack_boot$t,c(0.025,0.975))
quantile(strokes_boot$t,c(0.025,0.975))
The 95 percent condence interval for the odds-rao of the heart aack rate is
given as (0.4763, 0.6269), while that for the strokes is (1.126, 1.333).
Since the point esmates lie in the 95 percent condence intervals, we accept that
the odds-rao of a heart aack for the aspirin tablet indeed reduces by 55 percent
in comparison with the placebo group.
What just happened?
We used the boot function from the boot package and obtained bootstrap samples of the odds ratio.
Now that we have an understanding of the bootstrap technique, let us check out how the bagging algorithm works.
The bagging algorithm
Breiman (1996) proposed the extension of CART in the following manner. Suppose that the values of the n random observations for the classification problem are \( (y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n) \). As with our setup, the dependent variables \( y_i \) are binary. As with the bootstrap technique explained earlier, we obtain a bootstrap sample of size n from the data with replacement and build a tree. If we prune the tree, it is very likely that we may end up with the same tree on most occasions. Hence, pruning is not advisable here. Now, using the tree based on the (first) bootstrap sample, a prediction is made for the class of the i-th observation and the predicted value is noted. This process is repeated a large number of times, say B. A general practice is to take B = 100. Therefore, we have B predictions for every observation. The decision process is to classify the observation to the category that has the majority of the class predictions. That is, if more than 50 times out of B = 100 it has been predicted to belong to a particular class, we say that the observation is predicted to belong to that class. Let us formally state the bagging algorithm.
1. Draw a sample of size n with replacement from the data \( (y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n) \), and denote the first bootstrap sample by \( (y_1^{(1)}, x_1^{(1)}), (y_2^{(1)}, x_2^{(1)}), \ldots, (y_n^{(1)}, x_n^{(1)}) \).
2. Create a classification tree with \( (y_1^{(1)}, x_1^{(1)}), (y_2^{(1)}, x_2^{(1)}), \ldots, (y_n^{(1)}, x_n^{(1)}) \). Do not prune the classification tree. Such a tree may be called a bootstrapped tree.
3. For each terminal node, assign a class; put each observation down the tree and find its predicted class.
4. Repeat steps 1 to 3 a large number of times, say B.
5. Find the number of times each observation is classified to a particular class out of the B bootstrapped trees. The bagging procedure classifies an observation as belonging to the class that has the majority count.
6. Compute the confusion table from the predictions made in step 5.
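The ipred package used shortly wraps all of this up, but the algorithm is simple enough to sketch by hand. The following code is a minimal, hypothetical illustration of steps 1 to 6 using rpart on the kyphosis data seen earlier; it is not the implementation used by ipred:
library(rpart)
B <- 100
n <- nrow(kyphosis)
votes <- matrix(NA_character_, nrow = n, ncol = B)       # one class prediction per tree
set.seed(123)
for (b in 1:B) {
  idx  <- sample(1:n, n, replace = TRUE)                 # step 1: bootstrap sample
  tree <- rpart(Kyphosis ~ Age + Number + Start,         # step 2: an unpruned tree
                data = kyphosis[idx, ], control = rpart.control(cp = 0))
  votes[, b] <- as.character(predict(tree, newdata = kyphosis, type = "class"))
}
# Steps 5 and 6: majority vote over the B bootstrapped trees, then the confusion table
bagged_class <- apply(votes, 1, function(v) names(which.max(table(v))))
table(bagged_class, kyphosis$Kyphosis)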
The advantage of mulple trees is that the problem of overng, which happens in the case
of a single tree, is overcome to a large extent, as we expect that resampling will ensure that
the general features are captured and the impact of local features is minimized. Therefore,
if an observaon is classied to belong to a parcular class because of a local issue, it will
not get repeated over in the other bootstrapped trees. Therefore, with predicons based
on a large number of trees, it is expected that the nal predicon of an observaon really
depends upon its general features and not on a parcular local feature.
There are some measures that are important to be considered with the bagging algorithm.
A good classier, a single tree, or a bunch of them should be able to predict the class of
an observaon with more convicon. For example, we use a probability threshold of 0.5
and above as a predicon for success when using a logisc regression model. If the model
can predict most observaons in the neighborhood of either 0 or 1, we will have more
condence in the predicons. As a consequence, we will be a bit hesitant to classify an
observaon as either a success or failure if the predicted probability is in the vicinity of 0.5.
This precarious situaon applies to the bagging algorithm too.
Suppose we choose B = 100 for the number of bagging trees. Assume that an observaon
belongs to a class, Yes, and let the overall classes for the study be {"Yes", "No"}. If a
large number of trees predict an observaon to belong to the Yes class, we are condent
about the predicon. On the other hand, if approximately B/2 number of trees classify the
observaon to the Yes class, the decision gets swapped, as a few more trees had predicted
the observaon to belong to the No class. Therefore, we introduce a measure called margin
as the dierence between the proporon of mes an observaon is correctly classied and
the proporon of mes it is incorrectly classied. If the bagging algorithm is a good model,
we expect the average margin over all the observaons to be a large number away from
0. If bagging is not appropriate, we expect the average margin to be near the number 0.
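In R terms, for a two-class problem, the margin is a one-liner (a minimal sketch, assuming a hypothetical matrix vote_prop of vote proportions, one row per observation, with the proportion of votes for the correct class in the first column):
margin_per_obs <- vote_prop[, 1] - vote_prop[, 2]   # proportion correct minus proportion incorrect
mean(margin_per_obs)                                # average margin over all observations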
Let us prepare ourselves for action. The bagging algorithm is available in the ipred and randomForest packages.
Time for action – the bagging algorithm
The bagging function from the ipred package will be used for bagging a CART. The options coob=FALSE and nbagg=200 are used to specify the appropriate settings.
1. Get the ipred package by using library(ipred).
2. Load the German credit data by using data(GC).
3. For B=200, fit the bagging procedure for the GC data:
GC_bagging <- bagging(good_bad~.,data=GC,coob=FALSE,nbagg=200,keepX=T)
We know that we have t B =200 number of trees. Would you like to see them?
Fine, here we go.
4. The B =200 trees are stored in the mtrees list of classbagg GC_bagging. That
is, GC_bagging$mtrees[[i]] gives us the i-th bootstrapped tree, and plot(GC_
bagging$mtrees[[i]]$btree) displays that tree. Adding text(GC_bagging$m
trees[[i]]$btree,pretty=1, use.n=T) is also important. Next, put the enre
thing in a loop, execute it, and simply sit back and enjoy the display of the B number
of trees:
for(i in 1:200) {
plot(GC_bagging$mtrees[[i]]$btree);
text(GC_bagging$mtrees[[i]]$btree,pretty=1,use.n=T)
}
We hope that you understand that we can't publish all 200 trees! The next goal
is to obtain the margin of the bagging algorithm.
5. Predict the class probabilities of all the observations with the predict.classbagg function by using GCB_Margin <- round(predict(GC_bagging,type="prob")*200,0).
Let us understand the preceding code. The predict function returns the probabilities of an observation belonging to the good and bad classes. We have used 200 trees, and hence multiplying these probabilities by 200 gives us the expected number of times an observation is predicted to belong to each class. The round function with the 0 argument rounds the predictions to integers.
6. Check the rst six predicted classes with head(GCB_Margin):
bad good
[1,] 17 183
[2,] 165 35
[3,] 11 189
[4,] 123 77
[5,] 101 99
[6,] 95 105
7. To obtain the overall margin of the bagging technique, use the R code mean(pmax(GCB_Margin[,1],GCB_Margin[,2]) - pmin(GCB_Margin[,1],GCB_Margin[,2]))/200.
The overall margin for the author's execution turns out to be 0.5279. You may, though, get a different answer. Why?
Thus far, the bagging technique made predictions for the very observations from which it built the model. In the earlier chapters, we championed the need for a validate group and cross-validation techniques. That is, we did not always rely on the model measures computed solely from the data on which the model was built. There is always the possibility of failure as a result of unforeseen examples. Can the bagging technique be built to take care of unforeseen observations? The answer is a definite yes, and this is well known as out-of-bag validation. In fact, such an option was suppressed when building the bagging model in step 3 here, with the option coob=FALSE. coob stands for an out-of-bag estimate of the error rate. So, now rebuild the bagging model with the coob=TRUE option.
8. Build an out-of-bag bagging model with GC_bagging_oob <- bagging(good_bad~.,data=GC,coob=TRUE,nbagg=200,keepX=T). Find the error rate with GC_bagging_oob$err.
> GC_bagging_oob <- bagging(good_bad~.,data=GC,coob=TRUE,nbagg=200,keepX=T)
> GC_bagging_oob$err
[1] 0.241
What just happened?
We have seen an important extension of the CART model in the bagging algorithm. To an extent, this enhancement is vital and vastly different from the improvements of the earlier models. The bagging algorithm is different in the sense that we rely on the predictions of more than a single model. This ensures that the overfitting problem, which occurs due to local features, is almost eliminated.
It is important to note that the bagging technique is not without limitations; refer to Section 4.5 of Berk (2008). We now move to the final model of the book, which is an important technique of the CART school.
Random forests
In the previous section, we built multiple models for the same classification problem. The bootstrapped trees were generated by using resamples of the observations. Breiman (2001) suggested an important variation (actually, there is more to it than just a variation) where a CART is built with the covariates (features) also being resampled for each of the bootstrap samples of the dataset. Since the final tree for each bootstrap sample uses different covariates, the ensemble of the collective trees is called a random forest. A formal algorithm is given next.
1. As with the bagging algorithm, draw a sample of size \( n_1 \), \( n_1 < n \), with replacement from the data \( (y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n) \), and denote the first resampled dataset by \( (y_1^{(1)}, x_1^{(1)}), (y_2^{(1)}, x_2^{(1)}), \ldots, (y_{n_1}^{(1)}, x_{n_1}^{(1)}) \). The remaining \( n - n_1 \) observations form the out-of-bag dataset.
2. From the covariate vector x, select a random subset of covariates without replacement. Note that the same covariates are selected for all the observations.
3. Create the CART tree from the data in steps 1 and 2, and, as earlier, do not prune the tree.
4. For each terminal node, assign a class. Put each out-of-bag observation down the tree and find its predicted class.
5. Repeat steps 1 to 4 a large number of times, say 200 or 500.
6. For each observation, count the number of times it is predicted to belong to each class, counting only the trees for which it is part of the out-of-bag dataset.
7. The class with the majority count is taken as the predicted class of the observation.
This is quite a complex algorithm. Luckily, the randomForest package helps us out. We will continue with the German credit data problem.
Time for action – random forests for the German credit data
The function randomForest from the package of the same name will be used to build a random forest for the German credit data problem.
1. Get the randomForest package by using library(randomForest).
2. Load the German credit data by using data(GC).
3. Create a random forest with 500 trees:
GC_RF <- randomForest(good_bad~.,data=GC,keep.forest=TRUE,ntree=500)
It is very dicult to visualize a single tree of the random forest. A very
ad-hoc approach has been found at http://stats.stackexchange.com/
questions/2344/best-way-to-present-a-random-forest-in-a-
publication. Now we reproduce the necessary funcon to get the trees, and as
the soluon step is not exactly perfect, you may skip this part; steps 4 and 5.
4. Dene the to.dendrogram funcon:
5. Use the getTree funcon, and with the to.dendrogram funcon dened
previously, visualize the rst 20 trees of the forest:
for(i in 1:20) {
tree <- getTree(GC_RF,i,labelVar=T)
d <- to.dendrogram(tree)
plot(d,center=TRUE,leaflab='none',edgePar=list(t.cex=1,p.col=NA,p.
lty=0))
}
The error rate is of primary concern. As we increase the number of trees in the forest,
we expect a decrease in the error rate. Let us invesgate this for the GC_RF object.
6. Plot the out-of-bag error rate against the number of trees with plot(1:500,GC_RF$err.rate[,1],"l",xlab="No.of.Trees",ylab="OOB Error Rate").
Figure 2: Performance of a random forest
The covariates (features) are selected differently for different trees. It is then a concern to know which variables are significant. The important variables are obtained using the varImpPlot function.
7. The function varImpPlot produces a display of the importance of the variables; use varImpPlot(GC_RF).
Figure 3: Important variables of the German credit data problem
Thus, we can see which variables have more relevance than others.
What just happened?
Random forests are a very important extension of the CART concept. In this technique, we need to know how the error rate behaves as the number of trees increases; it is expected to decrease with an increase in the number of trees. varImpPlot also gives a very useful display of the importance of the covariates for classifying the customers as good or bad.
In conclusion, we will undertake a classification dataset and revise all the techniques seen in the book, especially in Chapters 6 to 10. We will now consider the problem of low birth weight among infants.
The consolidation
The goal of this section is to quickly review all of the techniques learnt in the latter half of the book. Towards this, a dataset has been selected where we have ten variables, including the output. Low birth weight is a serious concern, and it needs to be understood as a function of many other variables. If the weight of a child at birth is less than 2500 grams, it is considered a low birth weight. This problem has been studied in Chapter 19 of Tattar, et. al. (2013). The following table gives a description of the variables. Since the dataset may be studied as a regression problem (variable BWT) as well as a classification problem (LOW), you can choose any path(s) that you deem fit. Let the final action begin.

Serial number   Description                                               Abbreviation
1               Identification code                                       ID
2               Low birth weight                                          LOW
3               Age of mother                                             AGE
4               Weight of mother at last menstrual period                 LWT
5               Race                                                      RACE
6               Smoking status during pregnancy                           SMOKE
7               History of premature labor                                PTL
8               History of hypertension                                   HT
9               Presence of uterine irritability                          UI
10              Number of physician visits during the first trimester     FTV
11              Birth weight                                              BWT
Time for action – random forests for the low birth weight data
The techniques learnt from Chapter 6 to Chapter 10 will now be put to the test.
That is, we will use the linear regression model, logistic regression, as well as CART.
1. Read the dataset into R with data(lowbwt).
2. Visualize the dataset with the options diag.panel, lower.panel, and upper.panel:
pairs(lowbwt,diag.panel=panel.hist,lower.panel=panel.smooth,upper.panel=panel.cor)
Interpret the matrix of scatter plots. Which statistical model seems most appropriate to you?
Figure 4: Multivariable display of the "lowbwt" dataset
As the correlations look weak, it seems that a regression model may not be appropriate. Let us check.
3. Create (sub) datasets for the regression and classification problems:
LOW <- lowbwt[,-10]
BWT <- lowbwt[,-1]
4. First, we will check if a linear regression model is appropriate:
BWT_lm <- lm(BWT~., data=BWT)
summary(BWT_lm)
Interpret the output of the linear regression model; refer to Linear Regression Analysis, Chapter 6, if necessary.
Figure 5: Linear model for the low birth weight data
The low R2 makes it difficult for us to use the model. Let us check out the logistic regression model.
5. Fit the logistic regression model as follows:
LOW_glm <- glm(LOW~., data=LOW, family='binomial')
summary(LOW_glm)
The summary of the model is given in the following screenshot:
Figure 6: Logistic regression model for the low birth weight data
6. The Hosmer-Lemeshow goodness-of-fit test for the logistic regression model is given by hosmerlem(LOW_glm$y,fitted(LOW_glm)).
Now, the p-value obtained is 0.7813, which shows that there is no significant difference between the fitted values and the observed values. Therefore, we conclude that the logistic regression model is a good fit. However, we will go ahead and fit CART models for this problem as well. Note that the estimated regression coefficients are not huge values, and hence we do not need to check out the ridge regression approach.
7. Fit a classicaon tree with the rpart funcon:
LOW_rpart <- rpart(LOW~.,data=LOW)
plot(LOW_rpart)
text(LOW_rpart,pretty=1)
Does the classicaon tree appear more appropriate than the logisc regression
ed earlier?
Figure 7: Classification tree for the low birth weight data
8. Get the rules of the classicaon tree using asRules(LOW_rpart).
Figure 8: Rules for the low birth weight problem
You can see that these rules are of great importance to the physician who performs the operations. Let us check the bagging effect on the classification tree.
9. Using the bagging function, find the error rate of the bagging technique with the following code:
LOW_bagging <- bagging(LOW~., data=LOW,coob=TRUE,nbagg=50,keepX=T)
LOW_bagging$err
The error rate is 0.3228, which seems very high. Let us see if random forests help us out.
10. Using the randomForest function, find the error rate for the out-of-bag problem:
LOW_RF <- randomForest(LOW~.,data=LOW,keep.forest=TRUE, ntree=50)
LOW_RF$err.rate
The error rate is still around 0.34. The initial idea was that, with the number of observations being less than 200, we would work with only 50 trees. Repeat the task with 150 trees and check if the error rate decreases.
11. Increase the number of trees to 150 and obtain the error-rate plot:
LOW_RF <- randomForest(LOW~.,data=LOW,keep.forest=TRUE, ntree=150)
plot(1:150,LOW_RF$err.rate[,1],"l",xlab="No.of.Trees",ylab="OOB Error Rate")
The error rate of about 0.32 seems to be the best solution we can obtain for this problem.
Figure 9: The error rate for the low birth weight problem
What just happened?
We had a very quick look back at all the techniques used over the last five chapters of the book.
Summary
The chapter began with two important variations of the CART technique: the bagging technique and random forests. Random forests are a particularly modern technique, invented in 2001 by Breiman. The goal of the chapter was to familiarize you with these modern techniques. Together with the German credit data and the complete revision of the earlier techniques with the low birth weight problem, it is hoped that you have benefited a lot from the book and will have gained enough confidence to apply these tools to your own analytical problems.
References
The book has been influenced by many of the classical texts on the subject, from Tukey (1977) to Breiman, et. al. (1984). The modern texts of Hastie, et. al. (2009) and Berk (2008) have particularly influenced the later chapters of the book. We have alphabetically listed only the books/monographs which have been cited in the text; however, the reader may go beyond the current list too.
Agresti, A. (2002), Categorical Data Analysis, Second Edition, J. Wiley
Baron, M. (2007), Probability and Statistics for Computer Scientists, Chapman and Hall/CRC
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, J. Wiley
Berk, R. A. (2008), Statistical Learning from a Regression Perspective, Springer
Breiman, L. (1996), Bagging predictors. Machine Learning, 24(2), 123-140
Breiman, L. (2001), Random forests. Machine Learning, 45(1), 5-32
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Wadsworth
Chen, Ch., Härdle, W., and Unwin, A. (2008), Handbook of Data Visualization, Springer
Cleveland, W. S. (1985), The Elements of Graphing Data, Monterey, CA: Wadsworth
Cule, E. and De Iorio, M. (2012), A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686v1 [stat.AP]
Freund, R. F., and Wilson, W. J. (2003), Statistical Methods, Second Edition, Academic Press
Friendly, M. (2001), Visualizing Categorical Data, SAS
Friendly, M. (2008), A brief history of data visualization. In Handbook of Data Visualization (pp. 15-56), Springer
Gunst, R. F. (2002), Finding confidence in statistical significance. Quality Progress, 35(10), 107-108
Gupta, A. (1997), Establishing Optimum Process Levels of Suspending Agents for a Suspension Product. Quality Engineering, 10, 347-350
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning, Second Edition, Springer
Horgan, J. M. (2008), Probability with R – An Introduction with Computer Science Applications, J. Wiley
Johnson, V. E., and Albert, J. H. (1999), Ordinal Data Modeling, Springer
Kutner, M. H., Nachtsheim, C., and Neter, J. (2004), Applied Linear Regression Models, McGraw Hill
Montgomery, D. C. (2007), Introduction to Statistical Quality Control, J. Wiley
Montgomery, D. C., Peck, E. A., and Vining, G. G. (2012), Introduction to Linear Regression Analysis, Wiley
Montgomery, D. C., and Runger, G. C. (2003), Applied Statistics and Probability for Engineers, J. Wiley
Pawitan, Y. (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, OUP Oxford
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann
Ross, S. M. (2010), Introductory Statistics, 3e, Academic Press
Rousseeuw, P. J., Ruts, I., and Tukey, J. W. (1999), The bagplot: a bivariate boxplot, The American Statistician, 53(4), 382-387
Ryan, T. P. (2007), Modern Engineering Statistics, J. Wiley
Sarkar, D. (2008), Lattice, Springer
Tattar, P. N., Suresh, R., and Manjunath, B. G. (2013), A Course in Statistics with R, Narosa
Tufte, E. R. (2001), The Visual Display of Quantitative Information, Graphics Press
Tukey, J. W. (1977), Exploratory Data Analysis, Addison-Wesley
Velleman, P. F., and Hoaglin, D. C. (1981), Applications, Basics, and Computing of Exploratory Data Analysis; available at http://dspace.library.cornell.edu/
Wickham, H. (2009), ggplot2, Springer
Wilkinson, L. (2005), The Grammar of Graphics, Second Edition, Springer
Index
Symbols
3RSS 123
3RSSH 123
4253H smoother 123
%% operator 35
A
actual probabilies 13
Age in Years variable 9
Akaike Informaon Criteria (AIC) 194
alternave hypothesis 144
amount of leverage by the observaon 187
ANOVA technique
about 170
obtaining 170
anscombe dataset 169
aplpack package 117
Automove Research Associaon of India (ARAI)
10
B
backward eliminaon approach 192
backwardlm funcon 195
bagging 297
bagging algorithm 300-302
bagging technique 93
bagplot
about 116
characteriscs 116
displaying, for mulvariate dataset 117, 118
for gasoline mileage problem 117
barchart funcon 73
bar charts
built-in examples 66, 67
visualizing 68-73
barplot funcon 73
basic arithmec, vectors
performing 36
unequal length vectors, adding 36
basis funcons, regression spline 234
best split point 265
binary regression problem 202
binomial distribuon
about 20, 21
examples 21-23
binomial test
performing 144
proporons, tesng 147, 148
success probability, tesng 145, 146
bivariate boxplot 116
boosng algorithms 293
bootstrap 298, 299
bootstrap aggregaon 297
bootstrapped tree 300
box-and-whisker plot 84
boxplot
about 84
examples 84, 85
implemenng 85-87
boxplot funcon 86, 108
B-spline regression model
ng 241, 242
purpose 241
[ 318 ]
built-in examples, bar chart
Bug Metrics dataset 67
Bug Metrics of five software 67
bwplot function 86
C
CART
about 293
cross-validation predictions, obtaining 296, 297
improving 294, 295
CART_Dummy dataset
visualizing 259
categorical variable 9, 18, 260
central limit theorem 139
classification tree 257
construction 276-283
pruning 289-291
classification tree, German credit data
construction 284-288
coefficient of determination 169
colSums function 77
Comprehensive R Archive Network. See CRAN
computer science
experiments with uncertainty 14
confidence interval
about 139
for binomial proportion 139
for normal mean with known variance 140
for normal mean with unknown variance 140
obtaining 141, 142, 170, 171
confusion matrix 220
continuous distributions
about 26
exponential distribution 28
normal distribution 29
uniform distribution 27
continuous variable 9
Cook's distance 188
covariate 229
CRAN
about 15
URL 15
criteria-based model selection
about 194
AIC criteria, using 194-197
backward, using 194-197
forward, using 194-197
critical alpha 192
critical region 144
CSV file
reading, from 54
cumulative density function 27
Customer ID variable 8
CVbinary function 294
CVlm function 294
D
data
importing, from external files 55, 57
importing, from MySQL 58, 59
splitting 260
databases (DB) 58
data characteristics 12, 13
data formats
CSV (comma separated variable) format 52
ODS (OpenOffice/LibreOffice Calc) 52
XLS or XLSX (Microsoft Excel) format 52
data.frame object
about 45
creating 45, 46
data/graphs
exporting 60
graphs, exporting 60, 61
R objects, exporting 60
data re-expression
about 114
for Power of 62 Hydroelectric Stations 114-116
data visualization
about 65
visualization techniques, for categorical data 66
visualization techniques, for continuous variable data 84
DAT file
reading, from 54
depth 113
deviance residual 213
DFBETAS 189
DFFITS 189
diff function 108
discrete distributions
about 18
binomial distribution 20
discrete uniform distribution 19
hypergeometric distribution 24
negative binomial distribution 24
Poisson distribution 26
discrete uniform distribution 19
dot chart
about 74
example 74
visualizing 74-76
dotchart function 74
E
EDA 103
exponential distribution 28, 29
F
false positive rate (fpr) 220
fence 117
files
reading, foreign package used 55
first tree
building 261-263
fitdistr function
used, for finding MLE 136
fivenum function 108
forwardlm function 194
forward selection approach 193
fourfold plot
about 82
examples 83
Full Name variable 9
G
Gender variable 9
generalized cross-validation (GCV) errors 255
geometric RV 25
German credit data
classification tree 284-288
German credit screening dataset
logistic regression 223-226
getNode function 267
getTree function 305
ggplot 101
ggplot2
about 99, 100
ggplot 101
qplot function 100
GLM
influential and leverage points, identifying 216
residual plots 213
graphs
exporting 60, 61
H
han function 124
hanning 123
hinges 104, 105
hist function 90
histogram
about 88
construction 88, 89
creating 90-92
effectiveness 90
examples 89
histogram function 90
Hosmer-Lemeshow goodness-of-fit test statistic 210, 212
hypergeometric distribution 24
hypergeometric RV 24
hypotheses testing
about 144
binomial test 144
one-sample hypotheses, testing 152-155
one-sample problems, for normal distribution 150, 151
two-sample hypotheses, testing 159
two-sample problems, for normal distribution 156, 157
hypothesis
about 144
alternative hypothesis 144
null hypothesis 144
statistic, testing 144
I
impurity measures 276
independent and identically distributed (iid) sample 130
inuence and leverage points, GLM
idenfying 216
inuenal point 188
interquarle range (IQR) 84, 105
iterave reweighted least-squares (IRLS)
algorithm 207
L
leading digits 109
leer values 113
leverage point 187
likelihood funcon
about 131
visualizing 131-134
likelihood funcon, of binomial distribuon 131
likelihood funcon, of normal distribuon 132
likelihood funcon, of Poisson distribuon 132
linear regression model
about 162
limitaons 202, 203
linearRidge funcon 244
list object
about 44
creang 44, 45
lm.ridge t model 255
lm.ridge funcon 245, 255
logisc regression, German credit dataset 223-
226
logisc regression model
about 201-207
diagnoscs 216
ng 207- 210
inuence and leverage points, idenfying
216-218
model validaon 213
residuals plots, for GLM 213
ROC 220
logiscRidge funcon 248
loop 117
M
margin 301
matrix computaons
about 41
performing 41-43
maximum likelihood esmator. See MLE
mean 18, 122
mean residual sum of squares 183
median 104, 122
median polish 125
median polish algorithm
about 125, 126
example 126, 127
medpolish funcon 126
MLE
about 129-131
nding 135
nding, tdistr funcon used 137
nding, mle funcon used 137
likelihood funcon, visualizing 131
MLE, binomial distribuon
nding 135, 136
MLE, Poisson distribuon
nding 136
model assessment
about 250
penalty factor, nding 250
penalty parameter, selecng iteravely
250-255
tesng dataset 250
training dataset 250
validaon dataset 250
model selecon
about 192
criterion-based procedures 194
stepwise procedures 192
model validaon, simple linear regression model
about 171
residual plots, obtaining 172, 174
mosaic plot
for Titanic dataset 80, 81
mosaicplot funcon 80
mulcollinearity problem
about 189, 190
addressing 191, 192
mulple linear regression model
about 176, 177
ANOVA, obtaining 182
building 179, 180
condence intervals, obtaining 182
k simple linear regression models, averaging
177-179
useful residual plots 183
multivariate dataset
bagplot, displaying for 117, 118
MySQL
data, importing from 58, 59
N
natural cubic spline regression model
about 238
fitting 239-241
negative binomial distribution
about 24, 25
examples 25
negative binomial RV 25
nominal variable 18
normal distribution
about 29
examples 30
null hypothesis 144
NumericConstants 34
O
Octane Rating
of Gasoline Blends 109
odds ratio 207
one-sample hypotheses
testing 152-155
operator curves
receiving 220
ordinal variable 9, 19, 260
out-of-bag validation 303
overfitting 230-233
P
pairs function 118
panel.bagplot function 118
Pareto chart
about 97
examples 98, 99
partial residual 213
pdf 26
Pearson residual 213
percentiles 104
piecewise cubic 238
piecewise linear regression model
about 235
fitting 235-237
pie charts
about 81
drawbacks 82
examples 81
plot.lm function 176
Poisson distribution
about 26
examples 26
polynomial regression model
building 231
fitting 229
pooled variance estimator 157
PRESS residuals 184
principal component (PC) 245
probability density function 26
probability mass function (pmf) 18
probit model 201
probit regression model
about 204
constants 204-206
pruning 288
Q
qplot function 100
quantiles 104
questionnaire
about 8
components 8
Questionnaire ID variable 8
R
R
constants 34
continuous distributions 26
data characteristics 12, 13
data.frame 33
data visualization 65
discrete distributions 18
downloading, for Linux 16
downloading, for Windows 16
session management 62, 64
simple linear regression model 162
vectors 34, 35
randomForest funcon 304
Random Forests
about 293, 303
for German credit data 304, 306
for low birth weight data 308-313
random sample 130
random variable 13
range funcon 108
R databases 33
read.csv 54
read.delim 54
read.table funcon 53
read.xls funcon 54
receiving operator curves. See ROC
Recursive Paroning
about 258
data, spling 260
display plot, paroning 259
regression 162
regression diagnoscs
about 186
DFBETAS 189
DFFITS 189
inuenal point 187, 188
leverage point 187
regression spline
about 234
basis funcons 234
natural cubic splines 238
piecewise linear regression model 235
regression tree
about 257
construcon 265-274
representave probabilies 13
reroughing 123
resid funcon 174
residual plots, GLM
deviance residual 213
obtaining, ed funcon used 214, 215
obtaining, residuals funcon used 214, 215
paral residual 213
pearson residual 213
response residual 213
working residual 213
residual plots, mulple linear regression model
about 183
PRESS residuals 184
R-student residuals 184, 186
semi-studenzed residuals 183, 184
standardized residuals 183
residuals funcon 216
resistant line
about 118, 120
as regression model 120, 121
for IO-CPU me 120
response residual 213
ridge regression, for linear regression model
243-247
ridge regression, for logisc regression models
248, 249
R installaon
about 15, 16
R packages, using 16, 17
RSADBE 17
rline funcon 120
R names funcon 34
R objects
about 33
exporng 60
leers 34
LETTERS 34
month.abb 34
month.name 34
pi 34
ROC
about 220
construcon 221, 222
rootstock dataset 55
rough 123
R output 33
rowSums funcon 78
R packages
using 16
rpart class 294
rpart funcon 257
RSADBE package 17, 105
R-student residuals 184
RV 13
S
scaer plot
about 93
creang 94-96
examples 93
semi-studentized residuals 183
Separated Clear Volume 54
session management
about 62
performing 62, 64
short message service (SMS) 8
significance level 139
simple linear regression model
about 163
ANOVA technique 169
building 167, 168
confidence intervals, obtaining 170
core assumptions 163
limitation 230
overfitting problem 230
residuals for arbitrary choice of parameters, displaying 164-166
validation 171
smooth function 124
smoothing data technique
about 122
for cow temperature 124, 125
spine/mosaic plot
about 76
advantages 76
examples 77
spine plot
for shift and operator data 77, 79
spineplot function 77
spline 234
standardized residuals 183
Statistical Process Control (SPC) 109
stem-and-leaf plot 109
stem function
about 110
working 110-112
stems 109
step function 194
stepwise procedures
about 192
backward elimination 192
forward selection 193
stepwise regression 193
summary function 169
summary statistics
about 104
for The Wall dataset 105-108
hinges 104, 105
interquartile range (IQR) 105
median 104
percentiles 104
quantiles 104
T
table object
about 49, 50
Titanic dataset, creating 51, 52
testing dataset 250
text variable 9
Titanic data
exporting 60
to.dendrogram function 305
towards-the-center 162
trailing digits 109
training dataset 250
true positive rate (tpr) 220
two-sample hypotheses
testing 159
U
UCBAdmissions 52
UCBAdmissions dataset 260
uniform distribution
about 27
examples 28
V
validation dataset 250
variance 18
variance inflation factor (VIF) 191
vector
about 34
examples 35
generating 35
vector objects
basic arithmetic 36
creating 35
visualization techniques, for categorical data
about 66
bar chart 66
dot chart 74
fourfold plot 81
mosaic plot 76
pie charts 81
spine plot 76
visualization techniques, for continuous variable data
about 84
boxplot 84
histogram 88
Pareto chart 97
scatter plot 93
W
working residual 213
write.table function 60
X
xpred.rpart function 296
Thank you for buying
R Statistical Application Development by Example
Beginner's Guide
About Packt Publishing
Packt, pronounced 'packed', published its first book "Mastering phpMyAdmin for Effective MySQL Management" in April 2004 and subsequently continued to specialize in publishing highly focused books on specific technologies and solutions.
Our books and publications share the experiences of your fellow IT professionals in adapting and customizing today's systems, applications, and frameworks. Our solution-based books give you the knowledge and power to customize the software and technologies you're using to get the job done. Packt books are more specific and less general than the IT books you have seen in the past. Our unique business model allows us to bring you more focused information, giving you more of what you need to know, and less of what you don't.
Packt is a modern, yet unique publishing company, which focuses on producing quality, cutting-edge books for communities of developers, administrators, and newbies alike. For more information, please visit our website: www.packtpub.com.
About Packt Open Source
In 2010, Packt launched two new brands, Packt Open Source and Packt Enterprise, in order to continue its focus on specialization. This book is part of the Packt Open Source brand, home to books published on software built around Open Source licences, and offering information to anybody from advanced developers to budding web designers. The Open Source brand also runs Packt's Open Source Royalty Scheme, by which Packt gives a royalty to each Open Source project about whose software a book is sold.
Writing for Packt
We welcome all inquiries from people who are interested in authoring. Book proposals should be sent to author@packtpub.com. If your book idea is still at an early stage and you would like to discuss it first before writing a formal book proposal, contact us; one of our commissioning editors will get in touch with you.
We're not just looking for published authors; if you have strong technical skills but no writing experience, our experienced editors can help you develop a writing career, or simply get some additional reward for your expertise.
NumPy 1.5 Beginner's Guide
ISBN: 978-1-84951-530-6 Paperback: 234 pages
An acon-packed guide for the easy-to-use, high
performance, Python based free open source NumPy
mathemacal library using real-world examples
1. The rst and only book that truly explores NumPy
praccally
2. Perform high performance calculaons with clean
and ecient NumPy code
3. Analyze large data sets with stascal funcons
4. Execute complex linear algebra and mathemacal
computaons
Matplotlib for Python Developers
ISBN: 978-1-84719-790-0 Paperback: 308 pages
Build remarkable publication-quality plots the easy way
1. Create high quality 2D plots by using Matplotlib productively
2. Incremental introduction to Matplotlib, from the ground up to advanced levels
3. Embed Matplotlib in GTK+, Qt, and wxWidgets applications as well as web sites to utilize them in Python applications
4. Deploy Matplotlib in web applications and expose it on the Web using popular web frameworks such as Pylons and Django
Sage Beginner's Guide
ISBN: 978-1-84951-446-0 Paperback: 364 pages
Unlock the full potential of Sage for simplifying and automating mathematical computing
1. The best way to learn Sage, which is an open source alternative to Magma, Maple, Mathematica, and Matlab
2. Learn to use symbolic and numerical computation to simplify your work and produce publication-quality graphics
3. Numerically solve systems of equations, find roots, and analyze data from experiments or simulations
R Graph Cookbook
ISBN: 978-1-84951-306-7 Paperback: 272 pages
Detailed hands-on recipes for creating the most useful types of graphs in R, starting from the simplest versions to more advanced applications
1. Learn to draw any type of graph or visual data representation in R
2. Filled with practical tips and techniques for creating any type of graph you need; not just theoretical explanations
3. All examples are accompanied with the corresponding graph images, so you know what the results look like
Please check www.PacktPub.com for information on our titles