R Statistical Application Development by Example Beginner's Guide


R Statistical Application
Development by Example
Beginner's Guide
Learn R Statistical Application Development from scratch
in a clear and pedagogical manner
Prabhanjan Narayanachar Tattar
BIRMINGHAM - MUMBAI
R Statistical Application Development by Example
Beginner's Guide
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be
caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: July 2013
Production Reference: 1170713
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-944-1
www.packtpub.com
Cover Image by Asher Wishkerman (wishkerman@hotmail.com)
Credits
Author
Prabhanjan Narayanachar Tattar
Reviewers
Mark van der Loo
Mzabalazo Z. Ngwenya
A Ohri
Tengfei Yin
Acquisition Editor
Usha Iyer
Lead Technical Editor
Arun Nadar
Technical Editors
Madhuri Das
Mausam Kothari
Amit Ramadas
Varun Pius Rodrigues
Lubna Shaikh
Project Coordinator
Anurag Banerjee
Proofreaders
Maria Gould
Paul Hindle
Indexer
Hemangini Bari
Graphics
Ronak Dhruv
Production Coordinators
Melwyn D'sa
Zahid Shaikh
Cover Work
Melwyn D'sa
Zahid Shaikh
About the Author
Prabhanjan Narayanachar Tattar has seven years of experience with the R software and
has also co-authored the book A Course in Statistics with R, published by Narosa Publishing
House. He has built two R packages, gpk and ACSWR, and obtained a PhD
(Statistics) from Bangalore University in the broad area of Survival Analysis, publishing
several articles in peer-reviewed journals. During the PhD program, he received
the IBS(IR)-GK Shukla Young Biometrician Award (2005) and the
Dr. U.S. Nair Award for Young Statistician (2007), and also held a Junior and Senior Research
Fellowship of CSIR-UGC.
Prabhanjan works as a Business Analysis Advisor at Dell Inc., Bangalore, in
the Customer Service Analytics unit of the larger Dell Global Analytics arm of Dell.
I would like to thank Prof. Athar Khan, Aligarh Muslim University, whose
teaching during a shared R workshop inspired me to a very large extent.
My friend Veeresh Naidu has gone out of his way in helping and inspiring
me to complete this book, and I thank him for everything that defines our
friendship.
Many of my colleagues at the Customer Service Analytics unit of Dell Global
Analytics, Dell Inc. have been very tolerant of my stat talk with them, and it
is their need for the subject which has partly influenced the writing of this
book. I would like to record my thanks to them and also to my manager, Debal
Chakraborty.
My wife Chandrika has been very cooperative, and without her permission to
work on the book during weekends and housework timings this book would
never have been completed. My daughter Pranathi, at 2 years and 9 months, has
started pre-kindergarten school, and I genuinely believe that one day she
will read this entire book.
I am also grateful to the reviewers whose constructive suggestions and
criticisms have helped the book reach a higher level than where it would
have ended up without their help. Last but not least, I would like to
take the opportunity to thank Usha Iyer and Anurag Banerjee for their inputs
on the earlier drafts, and also for their patience with my delays.
About the Reviewers
Mark van der Loo obtained his PhD at the Institute for Theoretical Chemistry at the
University of Nijmegen (The Netherlands). Since 2007 he has worked at the statistical
methodology department of the Dutch official statistics office (Statistics Netherlands).
His research interests include automated data cleaning methods and statistical computing.
At Statistics Netherlands he is responsible for the local R center of expertise, which supports
and educates users on statistical computing with R. Mark has co-authored a number of R
packages that are available via CRAN, namely editrules, deducorrect, rspa, extremevalues,
and stringdist. Together with Edwin de Jonge he authored the book Learning RStudio for R
Statistical Computing. A list of his publications can be found at www.markvanderloo.eu.
Mzabalazo Z. Ngwenya has worked extensively in the field of statistical consulting and
currently works as a biometrician. He holds an MSc in Mathematical Statistics from the
University of Cape Town and is at present studying towards a PhD (School of Information
Technology, University of Pretoria) in the field of Computational Intelligence. His research
interests include statistical computing, machine learning, spatial statistics, and simulation and
stochastic processes. Previously, he was involved in reviewing Learning RStudio for R Statistical
Computing by Mark P.J. van der Loo and Edwin de Jonge, Packt Publishing.
A Ohri is the founder of the analytics startup Decisionstats.com. He pursued graduate
studies at the University of Tennessee, Knoxville and the Indian Institute of Management,
Lucknow. In addition, he has a Mechanical Engineering degree from the Delhi College of
Engineering. He has interviewed more than 100 practitioners in analytics, including leading
members from all the analytics software vendors. He has written almost 1300 articles on
his blog, besides guest writing for influential analytics communities. He teaches courses
in R through online education and has worked as an analytics consultant in India for the
past decade. He was one of the earliest independent analytics consultants in India, and his
current research interests include spreading open source analytics, analyzing social media
manipulation, simpler interfaces to cloud computing, and unorthodox cryptography.
He is the author of R for Business Analytics
(http://www.springer.com/statistics/book/978-1-4614-4342-1).
www.PacktPub.com
Support files, eBooks, discount offers, and more
You might want to visit www.PacktPub.com for support files and downloads related
to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files
available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book
customer, you are entitled to a discount on the eBook copy. Get in touch with us at
service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can access, read, and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib
today and view nine entirely free books. Simply use your login credentials for immediate access.
The work is dedicated to my father Narayanachar, the very first engineer who
influenced my outlook towards Science and Engineering. For the same reason,
my mother Lakshmi made me realize the importance of life and philosophy.
Table of Contents
Preface
Chapter 1: Data Characteristics
Questionnaire and its components
Understanding the data characteristics in an R environment
Experiments with uncertainty in computer science
R installation
Using R packages
RSADBE – the book's R package
Discrete distribution
Discrete uniform distribution
Binomial distribution
Hypergeometric distribution
Negative binomial distribution
Poisson distribution
Continuous distribution
Uniform distribution
Exponential distribution
Normal distribution
Summary
Chapter 2: Import/Export Data
data.frame and other formats
Constants, vectors, and matrices
Time for action – understanding constants, vectors, and basic arithmetic
Time for action – matrix computations
The list object
Time for action – creating a list object
The data.frame object
Time for action – creating a data.frame object
The table object
Time for action – creating the Titanic dataset as a table object
read.csv, read.xls, and the foreign package
Time for action – importing data from external files
Importing data from MySQL
Exporting data/graphs
Exporting R objects
Exporting graphs
Time for action – exporting a graph
Managing an R session
Time for action – session management
Summary
Chapter 3: Data Visualization
Visualization techniques for categorical data
Bar charts
Going through the built-in examples of R
Time for action – bar charts in R
Dot charts
Time for action – dot charts in R
Spine and mosaic plots
Time for action – the spine plot for the shift and operator data
Time for action – the mosaic plot for the Titanic dataset
Pie charts and the fourfold plot
Visualization techniques for continuous variable data
Boxplot
Time for action – using the boxplot
Histograms
Time for action – understanding the effectiveness of histograms
Scatter plots
Time for action – plot and pairs R functions
Pareto charts
A brief peek at ggplot2
Time for action – qplot
Time for action – ggplot
Summary
Chapter 4: Exploratory Analysis
Essential summary statistics
Percentiles, quantiles, and median
Hinges
The interquartile range
Time for action – the essential summary statistics for "The Wall" dataset
The stem-and-leaf plot
Time for action – the stem function in play
Letter values
Data re-expression
Bagplot – a bivariate boxplot
Time for action – the bagplot display for a multivariate dataset
The resistant line
Time for action – the resistant line as a first regression model
Smoothing data
Time for action – smoothening the cow temperature data
Median polish
Time for action – the median polish algorithm
Summary
Chapter 5: Statistical Inference
Maximum likelihood estimator
Visualizing the likelihood function
Time for action – visualizing the likelihood function
Finding the maximum likelihood estimator
Using the fitdistr function
Time for action – finding the MLE using mle and fitdistr functions
Confidence intervals
Time for action – confidence intervals
Hypotheses testing
Binomial test
Time for action – testing the probability of success
Tests of proportions and the chi-square test
Time for action – testing proportions
Tests based on normal distribution – one-sample
Time for action – testing one-sample hypotheses
Tests based on normal distribution – two-sample
Time for action – testing two-sample hypotheses
Summary
Chapter 6: Linear Regression Analysis
The simple linear regression model
What happens to the arbitrary choice of parameters?
Time for action – the arbitrary choice of parameters
Building a simple linear regression model
Time for action – building a simple linear regression model
ANOVA and the confidence intervals
Time for action – ANOVA and the confidence intervals
Model validation
Time for action – residual plots for model validation
Multiple linear regression model
Averaging k simple linear regression models or a multiple linear regression model
Time for action – averaging k simple linear regression models
Building a multiple linear regression model
Time for action – building a multiple linear regression model
The ANOVA and confidence intervals for the multiple linear regression model
Time for action – the ANOVA and confidence intervals for the multiple linear regression model
Useful residual plots
Time for action – residual plots for the multiple linear regression model
Regression diagnostics
Leverage points
Influential points
DFFITS and DFBETAS
The multicollinearity problem
Time for action – addressing the multicollinearity problem for the Gasoline data
Model selection
Stepwise procedures
The backward elimination
The forward selection
Criterion-based procedures
Time for action – model selection using the backward, forward, and AIC criteria
Summary
Chapter 7: The Logistic Regression Model
The binary regression problem
Time for action – limitations of linear regression models
Probit regression model
Time for action – understanding the constants
Logistic regression model
Time for action – fitting the logistic regression model
Hosmer-Lemeshow goodness-of-fit test statistic
Time for action – the Hosmer-Lemeshow goodness-of-fit statistic
Model validation and diagnostics
Residual plots for the GLM
Time for action – residual plots for the logistic regression model
Influence and leverage for the GLM
Time for action – diagnostics for the logistic regression
Receiving operator curves
Time for action – ROC construction
Logistic regression for the German credit screening dataset
Time for action – logistic regression for the German credit dataset
Summary
Chapter 8: Regression Models with Regularization
The overfitting problem
Time for action – understanding overfitting
Regression spline
Basis functions
Piecewise linear regression model
Time for action – fitting piecewise linear regression models
Natural cubic splines and the general B-splines
Time for action – fitting the spline regression models
Ridge regression for linear models
Time for action – ridge regression for the linear regression model
Ridge regression for logistic regression models
Time for action – ridge regression for the logistic regression model
Another look at model assessment
Time for action – selecting lambda iteratively and other topics
Summary
Chapter 9: Classification and Regression Trees
Recursive partitions
Time for action – partitioning the display plot
Splitting the data
The first tree
Time for action – building our first tree
The construction of a regression tree
Time for action – the construction of a regression tree
The construction of a classification tree
Time for action – the construction of a classification tree
Classification tree for the German credit data
Time for action – the construction of a classification tree
Pruning and other finer aspects of a tree
Time for action – pruning a classification tree
Summary
Chapter 10: CART and Beyond
Improving CART
Time for action – cross-validation predictions
Bagging
The bootstrap
Time for action – understanding the bootstrap technique
The bagging algorithm
Time for action – the bagging algorithm
Random forests
Time for action – random forests for the German credit data
The consolidation
Time for action – random forests for the low birth weight data
Summary
Appendix: References
Index
Preface
The open source software R is fast becoming one of the preferred companions of Statistics,
even as the subject continues to add many friends in Machine Learning, Data Mining, and so
on among its already rich scientific network. The era of embedding mathematical theory and
statistical applications is truly a remarkable one for society, and the software has
played a very pivotal role in it. This book is a humble attempt at presenting Statistical Models
through R for any reader who has a bit of familiarity with the subject. In my experience of
practicing the subject with colleagues and friends from different backgrounds, I realized that
many are interested in learning the subject and applying it in their domain, which enables
them to take appropriate decisions in analyses that involve uncertainty. A decade earlier
my friends would be content with being pointed to a useful reference book. Not so anymore!
The work in almost every domain is done through computers, and naturally people have
their data available in spreadsheets, databases, and sometimes in plain text format. The
request for an appropriate statistical model is invariably followed by a one word question:
"Software?" My answer to them has always been a single letter reply: "R!" Why? It is really a
very simple decision, and it has been my companion over the last seven years. In this book,
this experience has been converted into detailed chapters and a cleaner breakup of model
building in R.
A by-product of the interaction with colleagues and friends who are all aspiring statistical
model builders has been that I have been able to pick up the trough of their learning
curve of the subject. The first attempt towards fixing the hurdle has been to introduce
the fundamental concept that beginners are most familiar with, which is data.
The difference is simply in the subtleties, and as such I firmly believe that introducing the
subject on their turf motivates the reader for a long way in their journey. As with most
statistical software, R provides modules and packages which cover most of the
recently invented statistical methodology. The first five chapters of the book focus on the
fundamental aspects of the subject and the R software, and hence cover R basics, data
visualization, exploratory data analysis, and statistical inference.
The foundational aspects are illustrated using interesting examples, and this sets up the
framework for the later five chapters. Regression models, with linear and logistic regression
models at the forefront, are of paramount interest in applications. The discussion is
more generic in nature and the techniques can be easily adapted across different domains.
The last two chapters have been inspired by the Breiman school, and hence the modern
method of Classification and Regression Trees is developed in detail and illustrated
through a practical dataset.
What this book covers
Chapter 1, Data Characteristics, introduces the different types of data through a
questionnaire and dataset. The need for statistical models is elaborated in some interesting
contexts. This is followed by a brief explanation of R installation and the related packages.
Discrete and continuous random variables are discussed through introductory R programs.
Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames,
vectors, matrices, and lists are discussed with clear and simple examples. Importing of data
from external files in csv, xls, and other formats is elaborated next. Writing data/objects from
R for other software is considered, and the chapter concludes with a dialogue on R session
management.
Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and
numeric datasets. This translates into techniques of bar chart, dot chart, spine and mosaic
plot, and fourfold plot for categorical data, while histogram, box plot, and scatter plot for
continuous/numeric data. A very brief introduction to ggplot2 is also provided here.
Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for preliminary
analysis of data. The visualization techniques of EDA such as stem-and-leaf, letter values, and
modeling techniques of resistant line, smoothing data, and median polish give a rich insight
as a preliminary analysis step.
Chapter 5, Statistical Inference, begins with an emphasis on the likelihood function and
computing the maximum likelihood estimate. Confidence intervals for the parameters
of interest are developed using functions defined for specific problems. The chapter also
considers important statistical tests: the Z-test and t-test for comparison of means, and the
chi-square test and F-test for comparison of variances.
Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a
set of explanatory variables. The linear regression model has many underlying assumptions,
and such details are verified using validation techniques. A model may be affected by a
single observation, or a single output value, or an explanatory variable. Statistical metrics
that help remove one or more kinds of such anomalies are discussed in depth. Given a large
number of covariates, an efficient model is developed using model selection techniques.
Chapter 7, The Logistic Regression Model, covers a model that is useful for classification when
the output is a binary variable. Diagnostics and model validation through residuals are used,
which lead to an improved model. ROC curves are discussed next, which help in identifying
a better classification model.
Chapter 8, Regression Models with Regularization, discusses the problem of overfitting
arising from the use of models developed in the previous two chapters. Ridge regression
significantly reduces the possibility of an overfit model, and the development of natural
spline models also lays the basis for the models considered in the next chapter.
Chapter 9, Classification and Regression Trees, provides a tree-based regression model.
The trees are initially built using R functions, and the final trees are also reproduced using
rudimentary code, leading to a clear understanding of the CART mechanism.
Chapter 10, CART and Beyond, considers two enhancements of CART using bagging and
random forests. A consolidation of all the models from Chapter 6 to Chapter 10 is also given
through a dataset.
Chapter 1 to Chapter 5 form the basics of the R software and the subject of Statistics. Practical
and modern regression models are discussed in depth from Chapter 6 to Chapter 10.
Appendix, References, lists the books that have been referred to in this book.
What you need for this book
R is the only required software for this book and you can download it from
http://www.cran.r-project.org/. R packages will be required too, though this task is done
within a working R session. The datasets used in the book are available in the R package RSADBE,
which is an abbreviation of the book's title, at
http://cran.r-project.org/web/packages/RSADBE/index.html.
Who this book is for
This book will be useful for readers who have a flair and a need for statistical applications
in their own domains. The first seven chapters are also useful for any masters student in
Statistics, and the motivated student can easily complete the rest of the book and obtain
a working knowledge of CART.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions on how to complete a procedure or task, we use:
Time for action – heading
1. Action 1
2. Action 2
3. Action 3
Instructions often need some extra explanation so that they make sense, so they are
followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop quiz – heading
These are short multiple-choice questions intended to help you test your own
understanding.
Have a go hero – heading
These are practical challenges that give you ideas for experimenting with what you
have learned.
You will also find a number of styles of text that distinguish between different kinds of
information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "The operator %% on two objects, say x and y,
returns the remainder following an integer division, and the operator %/% returns the integer
division." In certain cases the complete code cannot be included within the action list, and
in such cases you will find the following display:
Plot the "Response Residuals" against the "Fitted Values" of the pass_logistic model
with the following values assigned:
plot(fitted(pass_logistic), residuals(pass_logistic,"response"),
col="red", xlab="Fitted Values", ylab="Residuals", cex.axis=1.5,
cex.lab=1.5)
In such a case you need to run the code starting with plot( up to cex.lab=1.5) in R.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or may have disliked. Reader feedback is important for us to
develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to feedback@packtpub.com,
and mention the book title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your
account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in
this book. The color images will help you better understand the changes in the output. You
can download this file from http://www.packtpub.com/sites/default/files/
downloads/9441OS_R-Statistical-Application-Development-by-Example-
Color-Graphics.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books, maybe a mistake in the text or the
code, we would be grateful if you would report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the errata submission form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata
will be uploaded to our website, or added to any list of existing errata, under the Errata
section of that title.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt,
we take the protection of our copyright and licenses very seriously. If you come across any
illegal copies of our works, in any form, on the Internet, please provide us with the location
address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected
pirated material.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.
Questions
You can contact us at questions@packtpub.com if you are having a problem
with any aspect of the book, and we will do our best to address it.
Data Characteristics
Data consists of observations across different types of variables, and it is vital
that any Data Analyst understands these intricacies at the earliest stage of
exposure to statistical analysis. This chapter recognizes the importance of data
and begins with a template of a dummy questionnaire, and then proceeds with
the nitty-gritty of the subject. We then explain how uncertainty creeps into
the domain of computer science. The chapter closes with coverage of important
families of discrete and continuous random variables.
We will cover the following topics:
Identification of the main variable types as nominal, categorical, and continuous variables
The uncertainty arising in many real experiments
R installation and packages
The mathematical form of discrete and continuous random variables and their applications
Questionnaire and its components
The goal of this section is the introduction of numerous variable types at the first possible
occasion. Traditionally, an introductory course begins with the elements of probability theory
and then builds up the requisites leading to random variables. This convention is dropped in
this book and we begin straightaway with data. There is a primary reason for choosing this
path. The approach builds on what the reader is already familiar with and then connects it
with the essential framework of the subject.
It is very likely that the user is familiar with questionnaires. A questionnaire may be administered
after the birth of a baby with a view to aiding the hospital in the study of the experience
of the mother, the health status of the baby, and the concerns of the immediate guardians
of the newborn. A multi-store department may instantly request the customer to fill in a
short questionnaire for capturing the customer's satisfaction after the sale of a product.
A customer's satisfaction following the service of their vehicle (see the detailed example
discussed later) can be captured through a few queries. The questionnaires may arise in
different forms than just merely on paper. They may be sent via e-mail, telephone, short
message service (SMS), and so on. As an example, one may receive an SMS that seeks a
mandatory response in a Yes/No form. An e-mail may arrive in the Outlook inbox, which
requires the recipient to respond through a vote for any of these three options: "Will attend
the meeting", "Can't attend the meeting", or "Not yet decided".
Suppose the owner of a multi-brand car center wants to find out the satisfaction percentage
of his customers. Customers bring their cars to a service center for varied reasons. The owner
wants to find out the satisfaction levels post the servicing of the cars and find the areas
where improvement will lead to higher satisfaction among the customers. It is well known
that the higher the satisfaction levels, the greater the customer's loyalty towards
the service center. Towards this, a questionnaire is designed and then data is collected from
the customers. A snippet of the questionnaire is given in Figure 1, and the information given
by the customers leads to different types of data characteristics. The variables Customer ID
and Questionnaire ID may be serial numbers or randomly generated unique numbers. The
purpose of such variables is the unique identification of people's responses. It may be possible that
there are follow-up questionnaires as well. In such cases, the Customer ID for a responder will
continue to be the same, whereas the Questionnaire ID needs to change for identification of
the follow-up. The values of these types of variables are, in general, not useful for
analytical purposes.
Figure 1: A hypothetical questionnaire
The information of Full Name in this survey is a starting point to break the ice with
the responder. In very exceptional cases the name may be useful for profiling purposes.
For our purposes the name will simply be a text variable that is not used for analysis
purposes. Gender is asked to know the person's gender, and in quite a few cases it may
be an important factor explaining the main characteristics of the survey; in this case it may
be mileage. Gender is an example of a categorical variable.
Age in Years is a variable that captures the age of the customer. The data for this field is
numeric in nature and is an example of a continuous variable.
The fourth and fifth questions help the multi-brand dealer in identifying the car model
and its age. The first question here enquires about the type of the car model. The car models
of the customers may vary from Volkswagen Beetle, Ford Endeavor, Toyota Corolla, Honda
Civic, to Tata Nano; see the next screenshot. Though the model name is actually a noun, we
make a distinction from the first question of the questionnaire in the sense that the former is
a text variable while the latter leads to a categorical variable. Next, the car model may easily
be identified to classify the car into one of the car categories, such as hatchback, sedan,
station wagon, or utility vehicle, and such a classifying variable may serve as an
ordinal variable, as per the overall car size. The age of the car in months since its manufacture
date may explain the mileage and odometer reading.
The sixth and seventh questions simply ask the customer if their minor/major problems
were completely fixed or not. This is a binary question that takes either of the values
Yes or No. Small dents, power windows malfunctioning, niggling noises in the cabin, low output
from the music speakers, and other similar issues which do not hinder the good functioning of the
car may be treated as minor problems that are expected to be fixed in the car. Disc brake
problems, wheel alignment, steering rattling issues, and similar problems that expose the user
and co-users of the road to danger are of grave concern, as they affect the functioning of a
car, and are treated as major problems. Any user will expect all of his/her issues to be resolved
during a car service. An important goal of the survey is to find the service center efficiency in
handling the minor and major issues of the car. The labels Yes/No may be replaced by +1 and
-1, or any other labels of convenience.
The eighth question, "What is the mileage (km/liter) of the car?", gives a measure of the average
petrol/diesel consumption. In many practical cases this data is provided by the belief of
the customer, who may simply declare it to be between 5 km/liter and 25 km/liter. In the case of
a lower mileage, the customer may ask for a finer tune-up of the engine, wheel alignment,
and so on. A general belief is that if the mileage is closer to the assured mileage as marketed
by the company, or some authority such as the Automotive Research Association of India (ARAI),
the customer is more likely to be happy. An important variable is the overall kilometers done
by the car up to the point of service. Vehicles have certain maintenances at the intervals
of 5,000 km, 10,000 km, 20,000 km, 50,000 km, and 100,000 km. This variable may also be
related to the age of the vehicle.
Let us now look at the final question of the snippet. Here, the customer is asked to rate
his overall experience of the car service. A response from the customer may be sought
immediately after a small test ride post the car service, or it may be through a questionnaire
sent to the customer's e-mail ID. A rating of Very Poor suggests that the workshop has served
the customer miserably, whereas the rating of Very Good conveys that the customer is
completely satisfied with the workshop service. Note that there is some order in the response
of the customer, in that we can grade the rankings in a certain order: Very Poor < Poor <
Average < Good < Very Good. This implies that the structure of the ratings must be respected
when we analyze the data of such a study. In the next section, these concepts are elaborated
through a hypothetical dataset.
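To see how such an ordered structure can be preserved in an R analysis, here is a minimal sketch; the handful of rating values below is made up purely for illustration and the object names are arbitrary:
> ratings <- c("Good", "Average", "Very Good", "Poor", "Very Poor")
> sat_rating <- factor(ratings, levels=c("Very Poor", "Poor", "Average",
+ "Good", "Very Good"), ordered=TRUE)
> class(sat_rating)     # "ordered" "factor", so the ordering is retained
> sat_rating < "Good"   # order-based comparisons are now meaningful
An unordered factor would suffice for a plain categorical variable such as Gender, while ordered=TRUE is what tells R to respect the Very Poor < Poor < Average < Good < Very Good structure.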
Figure: A hypothetical dataset collected through the questionnaire, showing 20 customer records with the columns Customer_ID, Questionnaire_ID, Name, Gender, Age, Car_Model, Car Manufacture Year, Minor Problems, Major Problems, Mileage, Odometer, and Satisfaction Rating.
Understanding the data characteristics in an R environment
A snippet of an R session is given in Figure 2. Here we simply relate an R session with the
survey and sample data of the previous table. The simple goal here is to get a feel/buy-in
of R and not necessarily follow the R codes. The R installation process is explained in the R
installation section. Here the user is loading the SQ R data object (SQ simply stands for sample
questionnaire) into the session. The nature of the SQ object is a data.frame that stores a
variety of other objects in itself. For more technical details of the data.frame function, see
The data.frame object section of Chapter 2, Import/Export Data. The names of a data.frame
object may be extracted using the function variable.names. The R function class helps
to identify the nature of an R object. As we have a list of variables, it is useful to find the class of all of
them using the function sapply. In the following screenshot, the mentioned steps have been
carried out:
Figure 2: Understanding the variable types of an R object
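As the screenshot itself cannot be reproduced here, the following minimal sketch retraces the steps it shows, assuming the SQ data frame has already been loaded into the session as described above:
> class(SQ)            # confirms that SQ is a data.frame
> variable.names(SQ)   # lists the variables of the data.frame
> sapply(SQ, class)    # reports the class of every variable in SQ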
The variable characteristics are also on expected lines, as they truly should be, and
we see that the variables Customer_ID, Questionnaire_ID, and Name are character
variables; Gender, Car_Model, Minor_Problems, and Major_Problems are factor
variables; DoB and Car_Manufacture_Year are date variables; Mileage and Odometer
are integer variables; and finally the variable Satisfaction_Rating is an ordered
and factor variable.
In the remainder of this chapter we will delve into more details about the nature of various
data types. In more formal language, a variable is called a random variable, abbreviated
as RV in the rest of the book, in statistical literature. A distinction needs to be made here.
In this book we do not focus on the important aspects of probability theory. It is assumed that
the reader is familiar with probability, say at the level of Freund (2003) or Ross (2001). An RV
is a function that maps from the probability (sample) space Ω to the real line. From
the previous example we have Odometer and Satisfaction_Rating as two examples of a
random variable. In formal language, random variables are generally denoted by the letters
X, Y, …. The distinction that is required here is that in applications what we observe are
the realizations/values of the random variables. In general, the realized values are denoted by
the lower cases x, y, …. Let us clarify this at more length.
Suppose that we denote the random variable Satisfaction_Rating by X. Here,
the sample space Ω consists of the elements Very Poor, Poor, Average, Good, and Very
Good. For the sake of convenience we will denote these elements by O1, O2, O3, O4, and
O5 respectively. The random variable X takes one of the values O1, …, O5 with respective
probabilities p1, …, p5. If the probabilities were known, we wouldn't have to worry about statistical
analysis. In simple terms, if we knew the probabilities of the Satisfaction_Rating RV, we
could simply use them to conclude whether more customers give a Very Good rating than a Poor one.
However, our survey data does not contain every customer who has availed of car service from
the workshop, and as such we have representative probabilities and not actual probabilities.
Now, we have seen 20 observations in the R session, and corresponding to each row we had
some values under the Satisfaction_Rating column. Let us denote the satisfaction ratings
for the 20 observations by the symbols X1, …, X20. Before we collect the data, the random
variables X1, …, X20 can assume any of the values in Ω. Post the data collection, we see that the
first customer has given the rating as Good (that is, O4), the second as Average (O3), and so on
up to the twentieth customer's rating as Average (again O3). By convention, what is observed
in the data sheet is actually x1, …, x20, the realized values of the RVs X1, …, X20.
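For instance, transcribing the Satisfaction_Rating column of the 20 observations, the representative (observed) proportions can be computed in a couple of lines; these are only estimates of p1, …, p5 and not the actual probabilities:
> x <- c("Good", "Average", "Good", "Average", "Very Good", "Good",
+ "Good", "Good", "Very Poor", "Very Good", "Good", "Poor", "Poor",
+ "Good", "Very Poor", "Good", "Good", "Poor", "Poor", "Average")
> table(x)/length(x)   # relative frequency of each rating among x1, ..., x20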
Experiments with uncertainty in computer science
The common man of the previous century was skeptical about chance/randomness and
attributed it to the lack of accurate instruments, and to information not necessarily being
captured in many variables. The skepticism about the need for modeling randomness
continues for the common man in the current era, as he feels that the instruments are too
accurate and that multi-variable information eliminates uncertainty. However, this is not
the case, and we will look here at some examples that drive home this point. In the previous
section we dealt with data arising from a questionnaire regarding the service level at a car
dealer. It is natural to accept that different individuals respond in distinct ways, and further
that the car, being a complex assembly of different components, responds differently in near
identical conditions. A question then arises whether we really have to deal with such
situations involving uncertainty in computer science. The answer is certainly affirmative
and we will consider some examples in the context of computer science and engineering.
Suppose that the task is the installation of software, say R itself. At a new lab there has been
an arrangement of 10 new desktops that have the same configuration. That is, the RAM,
memory, the processor, operating system, and so on are all the same in the 10 different machines.
For simplicity, assume that the electricity supply and lab temperature are identical for all the
machines. Do you expect that the complete R installation, as per the directions specified in
the next section, will take the same number of milliseconds on all the 10 machines? The run time
of an operation can be easily recorded, maybe using other software if not manually. The
answer is a clear "No", as there will be minor variations in the processes active on the different
desktops. Thus, we have our first experiment in the domain of computer science which
involves uncertainty.
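A minimal sketch of such a timing experiment in R itself is given below; the computation being timed is an arbitrary stand-in for the installation task, and the elapsed times will differ slightly from run to run and from machine to machine, which is precisely the point:
> # repeat an identical task five times and compare the elapsed times
> replicate(5, system.time(sum(rnorm(1e6)))["elapsed"])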
Suppose that the lab is now two years old. As an administrator, do you expect all the 10
machines to be working in the same identical condition, given that we started with identical
configurations and environments? The question is relevant, as according to general experience
a few machines may have broken down. Despite the warranty and assurances by the desktop
company, the number of machines that have broken down will not be exactly as assured.
Thus, we again have uncertainty.
Assume that three machines are not functioning at the end of two years. As an administrator,
you have called the service vendor to fix the problem. For the sake of simplicity, we assume
that the nature of the failure is the same for all three machines, say a motherboard failure on the
three failed machines. Is it practical to expect that the vendor would fix the three machines within
identical times? Again, by experience we know that this is very unlikely. If the reader thinks
otherwise, assume that 100 identical machines were running for two years and 30 of them
are now having the motherboard issue. It is now clear that some machines may require a
component replacement while others would start functioning following a repair/fix.
Let us now summarize the preceding experiments through the following set of questions:
What is the average installation time for the R software on identically configured
computer machines?
How many machines are likely to break down after a period of one year, two years,
and three years?
If a failed machine has issues related to the motherboard, what is the average
service time?
What is the fraction of failed machines that have a failed motherboard component?
The answers to these types of questions form the main objective of the subject of Statistics.
However, there are certain characteristics of uncertainty that are very richly covered by the
families of probability distributions. Depending on the underlying problem, we have discrete or
continuous RVs. The important and widely useful probability distributions form the content of
the rest of the chapter. We will begin with the useful discrete distributions.
R installation
The official website of R is the Comprehensive R Archive Network (CRAN) at
www.cran.r-project.org. As of the writing of this book, the most recent version of R is 2.15.1.
This software can be downloaded for the three platforms Linux, Mac OS X, and Windows.
Figure 3: The CRAN website (a snapshot)
A Linux user may simply key in sudo apt-get install r-base in the terminal, and after
entering the right password and having the required privilege levels, the R software will be installed.
After the completion of the download and installation, the software can be started by simply keying in R
at the terminal.
A Windows user first needs to click on Download R for Windows as shown in the preceding
screenshot, and then in the base subdirectory click on install R for the first time. In the new
window, click on Download R 3.0.0 for Windows and download the .exe file to a directory
of your choice. The completely downloaded R-3.0.0-win.exe file can be installed as any
other .exe file. The R software may be invoked either from the Start menu, or from the icon
on the desktop.
Using R packages
The CRAN repository hosts 4475 packages as of May 01, 2013. The packages are written and
maintained by statisticians, engineers, biologists, and others. The reasons are varied and the
resourcefulness is very rich, and it reduces the need to write exhaustive, new functions
and programs from scratch. These additional packages can be obtained from
http://www.cran.r-project.org/web/packages/. The user can click on Table of available packages,
sorted by name, which directs to a new web page. Let us illustrate the installation of an R
package named gdata.
We now wish to install the package gdata. There are multiple ways of completing this task.
Clicking on the gdata label leads to the web page http://www.cran.r-project.org/
web/packages/gdata/index.html. In this HTML file we can find a lot of information
about the package: Version, Depends, Imports, Published, Author, Maintainer, License,
System Requirements, Installation, and CRAN checks. Further, the download options may be
chosen from Package source, MacOS X binary, and Windows binary depending on whether
the user's OS is Unix, MacOS, or Windows respectively. Finally, a package may require
other packages as a prerequisite, and it may itself be a prerequisite for other packages.
This information is provided in the Reverse dependencies section in the options Reverse
depends, Reverse imports, and Reverse suggests.
Suppose that the user has a Windows OS. There are two ways to install the package
gdata. Start R as explained earlier. At the console, execute the code
install.packages("gdata"). A CRAN mirror window will pop up asking the user to select one
of the available mirrors. Select one of the mirrors from the list; you may need to scroll
down to locate your favorite mirror, and then click on the OK button. A default setting is
dependencies=TRUE, which will then download and install all other required packages.
Unless there are some violations, such as the dependency requirement of the R version
being at least 2.13.0 in this case, the packages are successfully installed.
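In code, the console route just described amounts to the following two lines; dependencies=TRUE is spelled out here only to mirror the setting mentioned above:
> install.packages("gdata", dependencies=TRUE)
> library(gdata)   # load the package once the installation succeeds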
A second way of installing the gdata package is as follows. On the gdata web page click on the
link gdata_2.11.0.zip. This action will then attempt to download the package through the File
download window. Choose the option Save and specify the path where you wish to download
the package. In my case, I have chosen the path C:\Users\author\Downloads. Now go to
the R window. In the menu ribbon, we have seven options: File, Edit, View, Misc, Packages,
Windows, and Help. Yes, your guess is correct and you would have wisely selected Packages
from the menu. Now, select the last option of Packages, the Install package(s) from local zip
files option, and direct it to the path where you have downloaded the .zip file. Select the
file gdata_2.11.0 and R will do the required remaining part of installing the package. One
of the drawbacks of doing this process manually is that if there are dependencies, the user
needs to ensure that all such packages have been installed before embarking on this second
task of installing the R packages. However, despite this problem, it is quite useful to know this
technique, as we may not be connected to the Internet all the time and can install the packages
as and when it is convenient.
RSADBE – the book's R package
The book uses a lot of datasets from the Web, statistical textbooks, and so on. The file formats
of the datasets are varied, and thus to help the reader, we have put all the datasets
used in the book in an R package, RSADBE, which is the abbreviation of the book's title.
This package will be available from the CRAN website as well as the book's web page.
Thus, whenever you are asked to run data(xyz), the dataset xyz will be available
either in the RSADBE package or the datasets package of R.
The book also uses many of the packages available on CRAN. The following table gives the list
of packages, and the reader is advised to ensure that these packages are installed before you
begin reading the chapter. That is, the reader needs to ensure that, as an example,
install.packages(c("qcc","ggplot2")) is run in the R session before proceeding with Chapter
3, Data Visualization. A consolidated installation call is sketched after the following table.
Chapter number   Packages required
2                foreign, RMySQL
3                qcc, ggplot2
4                LearnEDA, aplpack
5                stats4, PASWR, PairedData
6                faraway
7                pscl, ROCR
8                ridge, DAAG
9                rpart, rattle
10               ipred, randomForest
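A consolidated call such as the following sketch installs all of the listed packages in one go; stats4 is left out of the vector as it ships with R itself, and package availability on CRAN may change over time:
> pkgs <- c("foreign", "RMySQL", "qcc", "ggplot2", "LearnEDA", "aplpack",
+ "PASWR", "PairedData", "faraway", "pscl", "ROCR", "ridge", "DAAG",
+ "rpart", "rattle", "ipred", "randomForest")
> install.packages(pkgs, dependencies=TRUE)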
Discrete distribution
The previous section highlighted the different forms of variables. Variables such as
Gender, Car_Model, and Minor_Problems can take only one of finitely many possible values.
These variables are particular cases of the more general class of discrete variables.
It is to be noted that the sample space Ω of a discrete variable need not be finite. As an
example, the number of errors on a page may take values in the set of non-negative integers,
{0, 1, 2, …}. Suppose that a discrete random variable X can take the values x1, x2, … with
respective probabilities p1, p2, …, that is, $P(X = x_i) = p(x_i) = p_i$. Then we require that the
probabilities be non-negative and further that their sum be 1:

$p_i \geq 0 \text{ for all } i, \quad \text{and} \quad \sum_i p_i = 1,$

where the Greek symbol $\sum$ represents summation over the index i.
The function $p(x_i)$ is called the probability mass function (pmf) of the discrete RV X. We will
now consider formal definitions of important families of discrete variables. Engineers
may refer to Bury (1999) for a detailed collection of statistical distributions useful in their
field. The two most important parameters of a probability distribution are the mean
and the variance of the RV X. In some cases, and important ones too, these parameters may not exist
for the RV. However, we will not focus on such distributions, though we caution the reader
that this does not mean that such RVs are irrelevant. Let us define these parameters for the
discrete RV. The mean and variance of a discrete RV are respectively calculated as:

$E(X) = \sum_i x_i \, p(x_i), \quad \text{and} \quad Var(X) = \sum_i \big(x_i - E(X)\big)^2 \, p(x_i).$

The mean is a measure of central tendency, whereas the variance gives a measure
of the spread of the RV.
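As a small worked illustration of the two formulas, the pmf below is made up purely for demonstration, and the mean and variance are computed directly from the support and the probabilities:
> x <- c(1, 2, 3, 4)              # values taken by the discrete RV
> px <- c(0.1, 0.2, 0.3, 0.4)     # p(x), non-negative and summing to 1
> EX <- sum(x*px)                 # E(X) = 3
> VarX <- sum((x - EX)^2*px)      # Var(X) = 1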
The variables defined so far are more commonly known as categorical variables.
Agresti (2002) defines a categorical variable as a measurement scale consisting
of a set of categories.
Let us identify the categories for the variables listed in the previous section. The categories
for the variable Gender are Male and Female, whereas the car category variable derived
from Car_Model has the categories hatchback, sedan, station wagon, and utility vehicle. The variables
Minor_Problems and Major_Problems have common but independent categories Yes and
No; and finally the variable Satisfaction_Rating has the categories, as seen earlier, Very
Poor, Poor, Average, Good, and Very Good. The variable Car_Model is just a label of the name
of the car and it is an example of a nominal variable.
Finally, the output of the variable Satisfaction_Rating has an implicit order in it: Very
Poor < Poor < Average < Good < Very Good. It may be realized that this difference poses subtle
challenges in the analysis. These types of variables are called ordinal variables. We will now look
at another type of categorical variable that has not popped up thus far.
Practically, it is often the case that the output of a continuous variable is put in a certain bin for
ease of conceptualization. A very popular example is the categorization of the income level
or age. In the case of income variables, it has been realized in one of the earlier studies that
people are very conservative about revealing their income in precise numbers. For example,
the author may be shy to reveal that his monthly income is Rs. 34,892. On the other hand,
it has been revealed that these very same people do not have a problem in disclosing their
income as belonging to one of these bins: < Rs. 10,000, Rs. 10,000-30,000, Rs. 30,000-50,000,
and > Rs. 50,000. Thus, this information may also be coded into labels and then each of the
labels may refer to any one value in an interval bin. Hence, such variables are referred to as
interval variables.
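In R, such binning of a continuous measurement into an interval variable is typically carried out with the cut function; the income values below are arbitrary illustrations:
> income <- c(8500, 34892, 12000, 56000, 42000)
> cut(income, breaks=c(0, 10000, 30000, 50000, Inf),
+ labels=c("< Rs. 10,000", "Rs. 10,000-30,000", "Rs. 30,000-50,000",
+ "> Rs. 50,000"))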
Discrete uniform distribution
A random variable X is said to be a discrete uniform random variable if it can take any one
of M possible labels with equal probability.
As the discrete uniform random variable X can assume one of the values 1, 2, …, M with equal
probability, this probability is actually 1/M. As the probability remains the same across the labels,
the nomenclature "uniform" is justified. It might appear at the outset that this is not a very
useful random variable. However, the reader is cautioned that this intuition is not correct. As
a simple case, this variable arises in many cases where simple random sampling is in
action. The pmf of the discrete uniform RV is calculated as:

$P(X = x_i) = p(x_i) = \frac{1}{M}, \quad i = 1, 2, \ldots, M.$

A simple plot of the probability distribution of a discrete uniform RV is demonstrated next:
> M = 10
> mylabels=1:M
> prob_labels=rep(1/M,length(mylabels))
> dotchart(prob_labels,labels=mylabels,xlim=c(.08,.12),
+ xlab="Probability")
> title("A Dot Chart for Probability of Discrete Uniform RV")
Downloading the example code
You can download the example code files for all Packt books you have
purchased from your account at http://www.packtpub.com. If you
purchased this book elsewhere, you can visit http://www.packtpub.com/support
and register to have the files e-mailed directly to you.
Figure 4: Probability distribution of a discrete uniform random variable
The R programs here are indicative and it is not absolutely necessary that you
follow them here. The R programs will actually begin from the next chapter and
your flow won't be affected if you do not understand certain aspects of them.
Binomial distribution
Recall the second question in the Experiments with uncertainty in computer science section,
which asks "How many machines are likely to break down after a period of one year, two
years, and three years?". When the outcomes involve uncertainty, the more appropriate
question that we ask is related to the probability of the number of breakdowns being x.
Consider a fixed time frame, say 2 years. To make the question more generic, we assume
that we have n machines. Suppose that the probability of a breakdown for a
given machine at any given time is p. The goal is to obtain the probability of x machines with
breakdowns, and implicitly (n-x) functional machines. Now consider a fixed pattern where
the first x units have failed and the remaining are functioning properly. All the n machines
function independently of the other machines. Thus, the probability of observing x machines
in the breakdown state is $p^x$.
Similarly, each of the remaining (n-x) machines has probability (1-p) of being in the
functional state, and thus the probability of these occurring together is (1-p)^{n-x}. Again,
by the independence of the machines, the probability of x machines with breakdown in this
fixed pattern is given by p^{x}(1-p)^{n-x}. Finally, in the overall setup, the number of possible
samples with x machines broken down and (n-x) functional is the number of possible
combinations of choosing x out of n items, which is the combinatorial \binom{n}{x}. As each
of these samples is equally likely to occur, the probability of exactly x broken machines is
given by \binom{n}{x} p^{x}(1-p)^{n-x}.
The RV X obtained in such a context is known as the binomial RV and its pmf is called the
binomial distribution. In mathematical terms, the pmf of the binomial RV is calculated as:

P(X = x) = p(x) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \quad 0 \le x \le n, \; 0 \le p \le 1
The pmf of the binomial distribution is sometimes denoted by b(x; n, p). Let us now look at
some important properties of a binomial RV. The mean and variance of a binomial RV X
are respectively calculated as:

E(X) = np \quad \text{and} \quad Var(X) = np(1 - p)

As p is always a number between 0 and 1, the variance of a binomial RV
is always less than its mean.
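As a quick numerical check of these two formulas, the following sketch computes the mean and variance directly from the pmf values returned by dbinom, using n = 10 and p = 0.5 (the same values as in the next example):
> n <- 10; p <- 0.5; x <- 0:n
> sum(x * dbinom(x, n, p))               # mean, equals n*p = 5
[1] 5
> sum((x - n * p)^2 * dbinom(x, n, p))   # variance, equals n*p*(1-p) = 2.5
[1] 2.5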
Example 1.3.1: Suppose n = 10 and p = 0.5. We need to obtain the probabilities
p(x), x = 0, 1, 2, …, 10. The probabilities can be obtained using the built-in R function dbinom.
The function dbinom returns the probabilities of a binomial RV. The first argument of this
function may be a scalar or a vector according to the points at which we wish to know the
probability. The second argument of the function needs the value of n, the size of
the binomial distribution. The third argument of this function requires the user to specify the
probability of success p. It is natural to forget the syntax of functions and the R help system
becomes very handy here. For any function, you can get details of it using ? followed by the
function name. Please do not give a space between ? and the function name.
Here, you can try ?dbinom.
> n <- 10; p <- 0.5
> p_x <- round(dbinom(x=0:10, n, p),4)
> plot(x=0:10,p_x,xlab="x", ylab="P(X=x)")
Data Characteriscs
[ 22 ]
The R funcon round xes the accuracy of the argument up to the specied number
of digits.
Figure 5: Binomial probabilities
We have used the dbinom function in the previous example. There are three more utility
facets for the binomial distribution. The three facets are p, q, and r. These three facets
respectively help us in computations related to cumulative probabilities, quantiles of the
distribution, and simulation of random numbers from the distribution. To use these functions,
we simply augment the letters with the distribution name, binom here, as pbinom, qbinom,
and rbinom. There will of course be a critical change in the arguments. In fact, there are many
distributions for which the quartet of d, p, q, and r is available; check ?Distributions.
Example 1.3.2: Assume that the probability of a key failing on an 83-key keyboard (the author's
laptop keyboard has 83 keys) is 0.01. Now, we need to find the probability that at a given
time there are 10, 20, or 30 non-functioning keys on this keyboard. Using the dbinom
function these probabilities are easy to calculate. Try to do the same problem using a scientific
calculator or by writing a simple function in any language that you are comfortable with.
> n <- 83; p <- 0.01
> dbinom(10,n,p)
[1] 1.168e-08
> dbinom(20,n,p)
[1] 4.343e-22
> dbinom(30,n,p)
[1] 2.043e-38
> sum(dbinom(0:83,n,p))
[1] 1
As the probabilies of 10-30 keys failing appear too small, it is natural to believe that may
be something is going wrong. As a check, the sum clearly equals 1. Let us have a look at the
problem from a dierent angle. For many x values, the probability p(x) will be approximately
zero. We may not be interested in the probability of an exact number of failures, though we
are interested in the probability of at least x failures occurring, that is, we are interested in
the cumulave probabilies
()
PX x
. The cumulave probabilies for binomial distribuon
are obtained in R using the pbinom funcon. The main arguments of pbinom include size
(for n), prob (for p), and q (the x argument). For the same problem, we now look at the
cumulave probabilies for various p values:
> n <- 83; p <- seq(0.05,0.95,0.05)
> x <- seq(0,83,5)
> i <- 1
> plot(x,pbinom(x,n,p[i]),"l",col=1,xlab="x",ylab=
+ expression(P(X<=x)))
> for(i in 2:length(p)) { points(x,pbinom(x,n,p[i]),"l",col=i)}
Figure 6: Cumulative binomial probabilities
Try to interpret the preceding screenshot.
Data Characteriscs
[ 24 ]
Hypergeometric distribution
A box of N = 200 pieces of 12 GB pen drives arrives at a sales center. The carton contains
M = 20 defective pen drives. A random sample of n units is drawn from the carton. Let X
denote the number of defective pen drives obtained in the sample of n units. The task is to
obtain the probability distribution of X. The number of possible ways of obtaining the sample
of size n is \binom{N}{n}. In this problem we have M defective units and N-M working pen
drives, and x defective units can be sampled in \binom{M}{x} different ways, while n-x good
units can be obtained in \binom{N-M}{n-x} distinct ways. Thus, the probability distribution of
the RV X is calculated as:

P(X = x) = h(x; n, M, N) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}},

where x is an integer between max(0, n - N + M) and min(n, M). The RV is called
the hypergeometric RV and its probability distribution is called the
hypergeometric distribution.
Suppose that we draw a sample of n = 10 units. The function dhyper in R can be used to find
the probabilities of the RV X assuming different values. Note that the second and third
arguments of dhyper are the number of defective units, M, and the number of good units,
N-M, and the last argument is the sample size n.
> N <- 200; M <- 20
> n <- 10
> x <- 0:11
> round(dhyper(x, M, N-M, n), 3)
 [1] 0.340 0.397 0.198 0.055 0.009 0.001 0.000 0.000 0.000 0.000 0.000 0.000
The mean and variance of a hypergeometric distribution are stated as follows:

E(X) = n\frac{M}{N} \quad \text{and} \quad Var(X) = n\frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N - n}{N - 1}
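As an illustrative check of these formulas for the pen drive example (N = 200, M = 20, n = 10), the mean and variance may also be computed directly from the dhyper probabilities; a small sketch:
> N <- 200; M <- 20; n <- 10
> x <- 0:n; p_x <- dhyper(x, M, N - M, n)
> sum(x * p_x)                              # mean, equals n*M/N = 1
> n * (M/N) * (1 - M/N) * (N - n)/(N - 1)   # theoretical variance
> sum((x - sum(x * p_x))^2 * p_x)           # variance computed from the pmf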
Negative binomial distribution
Consider a variant of the problem described in the previous subsection. The 10 new desktops
need to be fitted with add-on 5 megapixel external cameras to help the students attend a
certain online course. Assume that the probability of a non-defective camera unit is p. As an
administrator you keep on placing orders until you receive 10 non-defective cameras. Now, let X
denote the number of orders placed for obtaining the 10 good units. We denote the required
number of successes by k, which in this discussion is k = 10. The goal in this unit is to
obtain the probability distribution of X.
Suppose that the xth order placed results in the procurement of the kth non-defective unit.
This implies that we have received (k-1) non-defective units among the first (x-1) orders
placed, which is possible in \binom{x-1}{k-1} distinct ways. At the xth order, the instant of
having received the kth non-defective unit, we have k successes and x-k failures. Hence, the
probability distribution of the RV is calculated as:

P(X = x) = \binom{x - 1}{k - 1} p^{k} (1 - p)^{x - k}, \quad x = k, k + 1, k + 2, \ldots
Such an RV is called the negative binomial RV and its probability distribution is the negative
binomial distribution. Technically, this RV has no upper bound as the next required success
may never turn up. We state the mean and variance of this distribution as follows:

E(X) = \frac{k}{p} \quad \text{and} \quad Var(X) = \frac{k(1 - p)}{p^{2}}
A parcular and important special case of the negave binomial RV occurs for k = 1,
which is known as the geometric RV. In this case, the pmf is calculated as:

 
[
3; [SS[
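The geometric case is available in R through dgeom, pgeom, qgeom, and rgeom. One caution for a small sketch: R parameterizes the geometric RV by the number of failures before the first success, so the trial number x above corresponds to x - 1 in the R call (p = 0.95 is an illustrative value):
> p <- 0.95
> dgeom(0, prob = p)   # P(success at the very first trial), that is, x = 1
[1] 0.95
> dgeom(2, prob = p)   # P(first success at the third trial), that is, x = 3
[1] 0.002375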
Example 1.3.3. (Baron (2007), page 77) Sequential Testing: In a certain setup, the probability
of an item being defective is (1-p) = 0.05. To complete the lab setup, 12 non-defective units
are required. We need to compute the probability that more than 15 units need to be tested.
Here we make use of the cumulative distribution of the negative binomial distribution through
the pnbinom function available in R. Similar to the pbinom function, the main arguments that
we require here are size, prob, and q. This problem is solved in a single line of code:
> 1-pnbinom(3,size=12,0.95)
[1] 0.005467259
Note that we have specified 3 as the quantile point (the q argument) because pnbinom counts
the number of defective units obtained before the 12th good one; testing more than 15 units
means obtaining at least 4 defective units, and the required probability is the complement
P(at least 4 defectives) = 1 - P(at most 3 defectives), hence the expression 1-pnbinom in the
code. We may equivalently solve the problem using the dnbinom function,
which straightforwardly computes the required probability:
> 1-(dnbinom(3,size=12,0.95)+dnbinom(2,size=12,0.95)+dnbinom(1,
+ size=12,0.95)+dnbinom(0,size=12,0.95))
[1] 0.005467259
Data Characteriscs
[ 26 ]
Poisson distribution
The number of accidents on a 1 km stretch of road, the total calls received during a one-hour
slot on your mobile, the number of "likes" received for a status update on a social networking
site in a day, and similar other counts are some of the examples which are addressed by the
Poisson RV. The probability distribution of a Poisson RV is calculated as:

P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots, \; \lambda > 0

Here λ is the parameter of the Poisson RV, with X denoting the number of events. The Poisson
distribution is sometimes also referred to as the law of rare events. The mean and variance of
the Poisson RV are, surprisingly, the same and equal λ, that is, E(X) = Var(X) = λ.
Example 1.3.4: Suppose that Santa commits errors in a software program at a mean rate
of three errors per A4-size page. Santa's manager wants to know the probability of
Santa committing 0, 5, and 20 errors per page. The R function dpois helps to determine
the answer.
> dpois(0,lambda=3); dpois(5,lambda=3); dpois(20, lambda=3)
[1] 0.04978707
[1] 0.1008188
[1] 7.135379e-11
Note that Santa's probability of committing 20 errors is almost 0.
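The cumulative facet works here as well; as a small follow-up sketch to the same example, the probability of at most five errors on a page is obtained with ppois:
> ppois(5, lambda = 3)   # P(X <= 5), at most five errors per page
[1] 0.9160821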
We will next focus on continuous distributions.
Continuous distribution
The numeric variables in the survey, Age, Mileage, and Odometer, can take any value over
a continuous interval, and these are examples of continuous RVs. In the previous section we
dealt with RVs which have a discrete output. In this section we will deal with RVs which have
a continuous output. A distinction from the previous section needs to be pointed out explicitly.
In the case of a discrete RV, there is a positive number for the probability of the RV taking on
a certain value, which is determined by the pmf. In the continuous case, an RV necessarily
assumes any specific value with zero probability. These technical issues will not be discussed
in this book. In the discrete case, the probabilities of certain values are specified by the pmf,
and in the continuous case the probabilities over intervals are decided by the probability
density function, abbreviated as pdf.
Suppose that we have a connuous RV, X, with the pdf f(x) dened over the possible x values,
that is, we assume that the pdf f(x) is well dened over the range of the RV X, denoted by
x
R
.
It is necessary that the integraon of f(x) over the range
x
R
is necessarily 1, that is,
()
1
x
Rfsds =
.
The probability that the RV X takes a value in an interval [a, b] is dened by:
[]
( )
()
,b
a
PX ab fxdx =
In general we are interested in the cumulave probabilies of a connuous RV, which is
the probability of the event P(X<x). In terms of the previous equaons, this is obtained as:
 
f
³[
3; [IVGV
A special name for this probability is the cumulave density funcon. The mean and variance
of a connuous RV are then dened by:
   
 

DQG ;
³ ³
[ [
5 5
(; [I [G[ [ (; I[G[9DU
As in the previous secon, we will begin with the simpler RV in uniform distribuon.
Uniform distribution
An RV is said to have a uniform distribution over the interval [0, θ], θ > 0, if its probability
density function is given by:

f(x; \theta) = \frac{1}{\theta}, \quad 0 \le x \le \theta, \; \theta > 0

In fact, it is not necessary to restrict our focus to the positive real line. For any two
real numbers a and b from the real line, with b > a, the uniform RV can be defined by:

f(x; a, b) = \frac{1}{b - a}, \quad a \le x \le b, \; b > a
Data Characteriscs
[ 28 ]
The uniform distribuon has a very important role to play in simulaon, as will be seen
in Chapter 6, Linear Regression Analysis. As with the discrete counterpart, in the connuous
case any two intervals of the same length will have equal probability of occurring. The mean
and variance of a uniform RV over the interval [a, b] are respecvely given by:
  
 
ED
DE
(; 9DU;
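These formulas, and more generally the integral definitions of the previous section, can be checked numerically with R's integrate function. A small sketch for the uniform RV over [0, 1], so that a = 0 and b = 1:
> f <- function(x) dunif(x, 0, 1)                        # pdf of U(0, 1)
> integrate(f, 0, 1)$value                               # total probability, 1
[1] 1
> EX <- integrate(function(x) x * f(x), 0, 1)$value; EX  # mean, (a + b)/2
[1] 0.5
> integrate(function(x) (x - EX)^2 * f(x), 0, 1)$value   # variance, (b - a)^2/12
[1] 0.08333333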
Example 1.4.1. Horgan's (2008), Example 15.3: The International Journal of Circuit Theory
and Applications reported in 1990 that researchers at the University of California, Berkeley,
had designed a switched capacitor circuit for generating random signals whose trajectory
is uniformly distributed over the unit interval [0, 1]. Suppose that we are interested
in calculating the probability that the trajectory falls in the interval [0.35, 0.58].
Though the answer is straightforward, we will obtain it using the punif function:
> punif(0.58)-punif(0.35)
[1] 0.23
Exponential distribution
The exponenal distribuon is probably one of the most important probability distribuons in
Stascs, and more so for Computer Sciensts. The numbers of arrivals in a queuing system,
the me between two incoming calls on a mobile, the lifeme of a laptop, and so on, are
some of the important applicaons where this distribuon has a lasng ulity value.
The pdf of an exponenal RV is specied by
()
;,0, 0
x
fx ex
λ
λλ λ
=≥>
.
The parameter
λ
is somemes referred to as the failure rate. The exponenal RV enjoys
a special property called the memory-less property which conveys that :
 
IRUD_ OO  !t t t3; WV;V 3; W W V
This mathemacal statement states that if X is an exponenal RV, then its failure in the future
depends on the present, and the past (age) of the RV does not maer. In simple words this
means that the probability of failure is constant in me and does not depend on the age of
the system. Let us obtain the plots of a few exponenal distribuons.
> par(mfrow=c(1,2))
> curve(dexp(x,1),0,10,ylab="f(x)",xlab="x",cex.axis=1.25)
> curve(dexp(x,0.2),add=TRUE,col=2)
> curve(dexp(x,0.5),add=TRUE,col=3)
> curve(dexp(x,0.7),add=TRUE,col=4)
> curve(dexp(x,0.85),add=TRUE,col=5)
> legend(6,1,paste("Rate = ",c(1,0.2,0.5,0.7,0.85)),col=1:5,pch=
+ "___")
> curve(dexp(x,50),0,0.5,ylab="f(x)",xlab="x")
> curve(dexp(x,10),add=TRUE,col=2)
> curve(dexp(x,20),add=TRUE,col=3)
> curve(dexp(x,30),add=TRUE,col=4)
> curve(dexp(x,40),add=TRUE,col=5)
> legend(0.3,50,paste("Rate = ",c(50,10,20,30,40)),col=1:5,pch=
+ "___")
Figure 7: The exponential densities
The mean and variance of the exponential distribution are as follows:

E(X) = \frac{1}{\lambda} \quad \text{and} \quad Var(X) = \frac{1}{\lambda^{2}}
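A small sketch verifying the memoryless property numerically with pexp; the rate 0.5 and the times s = 2 and t = 3 are illustrative choices:
> lambda <- 0.5; s <- 2; t <- 3
> # P(X > t + s | X > s) as a ratio of survival probabilities
> (1 - pexp(t + s, rate = lambda))/(1 - pexp(s, rate = lambda))
[1] 0.2231302
> 1 - pexp(t, rate = lambda)   # P(X > t), the same value
[1] 0.2231302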
Normal distribution
The normal distribuon is in some sense an all-pervasive distribuon that arises sooner or
later in almost any stascal discussion. In fact it is very likely that the reader may already be
familiar with certain aspects of the normal distribuon, for example, the shape of a normal
distribuon curve is bell-shaped. The mathemacal appropriateness is probably reected
through the reason that though it has a simpler expression, and its density funcon includes
the three most famous irraonal numbers
 
   d
d!
I[DE D[EE D
ED
.
Data Characteriscs
[ 30 ]
Suppose that X is normally distributed with mean μ and variance σ². Then the probability
density function of the normal RV is given by:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right\}, \quad -\infty < x < \infty, \; -\infty < \mu < \infty, \; \sigma > 0

If the mean is zero and the variance is one, the normal RV is referred to as the standard
normal RV, and the convention is to denote it by Z.
Example 1.4.2. Shady Normal Curves: We will again consider a standard normal random
variable, which is more popularly denoted in Statistics by Z. Some of the most needed
probabilities are P(Z > 0), P(-1.96 < Z < 1.96), and P(-2.58 < Z < 2.58). These probabilities
are now shaded.
> par(mfrow=c(3,1))
> # Probability Z Greater than 0
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(0,4,0.02)
> lines(z,dnorm(z),type="h",col="grey")
> # 95% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(-1.96,1.96,0.001)
> lines(z,dnorm(z),type="h",col="grey")
> # 99% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(-2.58,2.58,0.001)
> lines(z,dnorm(z),type="h",col="grey")
Figure 8: Shady normal curves
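The shaded areas in the preceding figure can be verified directly with the pnorm function; a quick sketch:
> 1 - pnorm(0)                 # P(Z > 0)
[1] 0.5
> pnorm(1.96) - pnorm(-1.96)   # approximately 95% coverage
[1] 0.9500042
> pnorm(2.58) - pnorm(-2.58)   # approximately 99% coverage
[1] 0.99012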
Summary
You should now be clear about the distinct nature of variables that arise in different scenarios.
In R, you should be able to verify that the data is in the correct format. Further, the important
families of random variables were introduced in this chapter, which should help you in dealing
with them when they crop up in your experiments. Computation of simple probabilities was
also introduced and explained.
In the next chapter you will learn how to perform basic R computations, create data
objects, and so on. As data can seldom be constructed completely in R, we need to import
data from external files. The methods explained help you to import data in file formats
such as .csv and .xls. Similar to importing, it is also important to be able to export data/
output to other software. Finally, R session management will conclude the next chapter.
Import/Export Data
The main goals of this chapter are to familiarize you with the various classes of
objects in R, help the reader extract data from various popular formats, connect
R with popular databases such as MySQL, and finally cover the best export options
for the R output. The main purpose is that the practitioner frequently has data
available in a fixed format, and sometimes the dataset is available in popular
database systems.
This chapter helps you to extract data from various sources, and then also recommends
the best export options for the R output. We will, though, begin with a better understanding of
the various formats in which R stores data. Updated information about the import/export
options is maintained at http://cran.r-project.org/doc/manuals/R-data.html.
To summarize, the main learning from this chapter would be the following:
Basic and essential computations in R
Importing data from CSV, XLS, and a few more formats
Exporting data for other software
R session management
data.frame and other formats
Any soware comes with its structure and nuances. The Quesonnaire and its component
secon of Chapter 1, Data Characteriscs, introduced various facets of data. In the current
secon we will go into the details of how R works with data of dierent characteriscs.
Depending on the need we have dierent formats of the data. In this secon, we will begin
with simpler objects and move up the ladder towards some of the more complex ones.
Constants, vectors, and matrices
R has ve inbuilt objects which store certain constant values. The ve objects are LETTERS,
letters, month.abb, month.name, and pi. The rst two objects contain the leers A-Z
in upper and lower cases. The third and fourth objects have month's abbreviated form and
the complete month names. Finally, the object pi contains the value of the famous irraonal
number. So, the exercise here is for you to nd the value of the irraonal number e. The
details about these R constant objects may be obtained using the funcon ?Constants
or example(Constants), of course by execung these commands in the console.
There is also another class of constants in R which is very useful. These constants are called
NumericConstants and include Inf for innite numbers, NaN for not a number, and so on.
You are encouraged to nd more details and other useful constants. R can handle numerical,
character, logical, integer, and complex kind of vectors and it is the class of the object which
characterizes the vector. Typically, we deal with vectors which may be numeric, characters
and so on. A vector of the desired class and number of elements may be iniated using
the vector funcon. The length argument declares the size of the vector, which is the
number of elements for the vector, whereas mode characterizes the vector to take one of the
required classes. The elements of a vector can be assigned names if required. The R names
funcon comes handy for this purpose.
Arithmec on numeric vector objects can be performed in an easier way. The operators
(+, -, *, /, and ^) are respecvely used for (addion, subtracon, mulplicaon, division,
and power). The characteriscs of a vector object may be obtained using funcons such as
sum, prod, min, max, and so on. Accuracy of a vector up to certain decimals may be xed
using opons in digits, round, and so on.
Now, two vectors need not have the same number of elements and we may carry the
arithmec operaon between them, say addion. In a mathemacal sense two vectors of
unequal length cannot be added. However, R goes ahead and performs the operaons just
the same. Thus, there is a necessity to understand how operaons are carried out in such
cases. To begin with the simpler case, let us consider two vectors with an equal number of
elements. Suppose that we have a vector x = (1, 2, 3, …, 9, 10), and y = (11, 12, 13, …, 19, 20).
If we add these two vectors, x + y, the result is an element-wise addion of the respecve
elements in x and y, that is, we will get a new vector with elements 12, 14, 16, …, 28, 30.
Now, let us increase the number of elements of y from 10 to 12 with y = (11, 12, 13, …, 19,
20, 21, 22). The operaon is carried out in the order that the elements of x (the smaller
object one) are element-wise added to the rst ten elements of y. Now, R nds that there
are two more elements of y in 11 and 12 which have not been touched as of now. It now
picks the rst two elements of x in 1 and 2 and adds them to 11 and 12. Hence, the 11 and
12 elements of the output are 11+1 =12 and 12 + 2 = 14. The warning says that longer
object length is not a multiple of shorter object length, which has now
been explained.
Let us have a brief peek at a few more operators related to vectors. The operator %% on
two objects, say x and y, returns the remainder following integer division, and the operator
%/% returns the integer quotient.
Time for action – understanding constants, vectors, and basic
arithmetic
We will look at a few important and interesting examples. You will understand the structure of
vectors in R and will also be able to perform the basic arithmetic related to this requirement.
1. Key in LETTERS at the R console and hit the Enter key.
2. Key in letters at the R console and hit the Enter key.
3. To obtain the rst ve and the last ve alphabets, try the following code:
LETTERS[c(1:5,22:26)] and letters[c(1:5,22:26)].
4. Month names and their abbreviaons are available in the base package and explore
them using ?Constants at the console.
5. Selected month names and their abbreviaons can be obtained using month.
abb[c(1:3,8:10)] and month.name[c(1:3,8:10)]. Also, the value of pi in R
can found by entering pi at the console.
6. To generate a vector of length 4, without specifying the class, try
vector(length=4). In specic classes, vector objects can be generated
by declaring the "mode" values for a vector object. That is, a numeric vector
(with default values 0) is obtained by the code vector(mode = "numeric",
length=4). You can similarly generate logical, complex, character, and integer
vectors by specifying them as opons in the mode argument.
The next screenshot shows the results as you run the preceding codes in R.
7. Creang new vector objects and name assignment: A generated vector can be
assigned to new objects using either =, <-, or ->. The last two assignments are in
the order from the generated vector of the tail end to the new variables at the end
of the arrow.
1. First assign the integer vector 1:10 to x by using x <- 1:10.
2. Check the names of x by using names(x).
3. Assign first 10 letters of the alphabets as names for elements of x by using
names(x)<- letters[1:10], and verify that the assignment is done
using names(x).
4. Finally, display the numeric vector x by simply keying in x at the console.
8. Basic arithmec: Create new R objects by entering x<-1:10; y<-11:20; a<-
10; b<--4; c<-0.5 at the console. In a certain sense, x and y are vectors while
a, b, and c are constants.
1. Perform simple addition of numeric vectors with x + y.
2. Scalar multiplication of vectors and then summing the resulting vectors is
easily done by using a*x + b*y.
3. Verify the result (a + b)x = ax + bx by checking that the result of ((a+b)*x
== a*x + b*x results is a logical vector of length 10, each having TRUE
value.
4. Vector multiplication is carried by x*y.
5. Vector division in R is simply element-wise division of the two vectors, and
does not have an interpretation in mathematics. We obtain the accuracy up
to 4 digits using round(x/y,4).
6. Finally, (element-wise) exponentiation of vectors is carried out through x^2.
9. Adding two unequal length vectors: The arithmec explained before applies to
unequal length vectors in a slightly dierent way. Run the following operaons:
x=1:10; x+c(1:12), length(x+c(1:12)), c(1:3)^c(1,2), and
(9:11)-c(4,6).
10. The integer divisor and remainder following integer division may be obtained
respecvely using %/% and %% operators. Key in -3:3 %% 2, -3:3 %% 3, and -3:3
%% c(2,3) to nd remainders between the sequence -3, -2, …, 2, 3 and 2, 3, and
c(2,3). Replace the operator %% by %/% to nd the integer divisors.
Now, we will rst give the required R codes so you can execute them in the soware:
LETTERS
letters
LETTERS[c(1:5,22:26)]
letters[c(1:5,22:26)]
?Constants
Chapter 2
[ 37 ]
month.abb[c(1:3,8:10)]
month.name[c(1:3,8:10)]
pi
vector(length=4)
vector(mode="numeric",length=4)
vector(mode="logical",length=4)
vector(mode="complex",length=4)
vector(mode="character",length=4)
vector(mode="integer",length=4)
x=1:10
names(x)
names(x)<- letters[1:10]
names(x)
x
x=1:10
y=11:20
a=10
b=-4
x + y
a*x + b*y
sum((a+b)*x == a*x + b*x)
x*y
round(x/y,4)
x^2
x=1:10
x+c(1:12)
length(x+c(1:12))
c(1:3)^c(1,2)
(9:11)-c(4,6)
-3:3 %% 2
-3:3 %% 3
-3:3 %% c(2,3)
-3:3 %/% 2
-3:3 %/% 3
-3:3 %/% c(2,3)
Execute the preceding code in your R session.
What just happened?
We have split the output into multiple screenshots for ease of explanation.
Introducing constants and vectors functioning in R
LETTERS is a character vector available in R that consists of the 26 uppercase letters of the
English language, whereas letters contains the alphabet in lowercase letters. We have used
the integer vector c(1:5,22:26) in the index to extract the first and last five elements of
both the character vectors. When the ?Constants command is executed, R pops out an
HTML file in your default internet browser and opens a page with the link http://my_IP/
library/base/html/Constants.html. You can find more details about Constants
from the base package on this web page. Months, as in January-December, are available
in the character vector month.name, whereas the popular abbreviated forms of the months
are available in the character vector month.abb. Finally, the numeric object pi contains
the value of π in double precision, although it is printed to only about seven significant
digits by default.
Next, we consider, generaon of various types of vector using the R vector funcon.
Now, the code vector(mode="numeric",length=4) creates a numeric vector with
default values of 0 and required length of four. Similarly, the other vectors are created.
Vector arithmetic in R
An integer vector object is created by the code x = 1:10. We could have alternatively
used options such as x <- 1:10 or 1:10 -> x. The final result is of course the same.
The assignment operator <- is far more popular in the R community and it can be used
during any part of R programming; find more details by running ?assignOps at the R
console. By default, there won't be any names assigned for either vectors or matrices.
Thus, the output NULL. names is a function in R which is useful for assigning appropriate
names. Our task is to assign the first 10 lowercase letters of the alphabet to the vector x.
Hence, we have the code names(x) <- letters[1:10]. We verify that the names have
been properly assigned, and see the change in the display of x following the assignment of
the names, using names(x) and x.
Next, we create two integer vectors in x and y, and two objects a and b, which may be treated
as scalars. Now, x + y; a*x + b*y; sum((a+b)*x == a*x + b*x) performs three
different tasks. First, it performs addition of vectors and returns the result of element-wise
addition of the two vectors leading to the answer 12, 14, …, 28, 30. Second, we are verifying
the result of scalar multiplication of vectors, and third, the result of (a + b)x = ax + bx.
In the next round of R codes, we ran x*y; round(x/y,4); x^2. Similar to the addition
operator, the * operator performs element-wise multiplication for the two vectors. Thus,
we get the output as 11, 24, …, 171, 200. In the next line, recall that ; separates commands
so that they are executed as if entered on separate lines; first the element-wise division is
carried out. For the resulting vector (a numeric one), the round function gives the accuracy
up to four digits as specified. Finally, x^2 gives us the square of each element of x. Here, 2
can be replaced by any other real number.
In the last line of code, we repeat some of the earlier operations with the minor difference
that the two vectors are not of the same length. As predicted earlier, R issues a warning
that the length of the longer vector is not a multiple of the length of the shorter vector.
Thus, for the operation x+c(1:12), first all the elements of x (which is the shorter length
vector here) are added to the first 10 elements of 1:12. Then the last two elements of
1:12, namely 11 and 12, need to be added to elements from x, and for this purpose R picks
the first two elements of x. If the length of the longer vector is a multiple of that of the
shorter one, the entire shorter vector is recycled over the longer one in cycles. The remaining
results, obtained as a consequence of running c(1:3)^c(1,2); (9:11)-c(4,6), are left
to the reader for interpretation.
Let us look at the output after the R code for the integer quotient and the remainder between
two objects is carried out.
Integer divisor and remainder operations
In the segment -3:3 %% 2, we first create the sequence -3, -2, …, 2, 3 and then ask for the
remainder when each element is divided by 2. Clearly, the remainder for any integer divided
by 2 is either 0 or 1, and for a sequence of consecutive integers we expect an alternating
sequence of 0s and 1s, which is the output in this case. Check the expected result for
-3:3 %% 3. Now, for the operation -3:3 %% c(2,3), first look at the complete sequence
-3:3 as -3, -2, -1, 0, 1, 2, 3. Here, the elements -3, -1, 1, 3 are divided by 2 and the remainders
are returned, whereas -2, 0, 2 are divided by 3 and the remainders are returned.
The operator %/% returns the integer quotient and the interpretation of the results is left to
the reader. Please refer to the previous screenshot for the results.
We now look at matrix objects. Similar to the vector function in R, we have matrix as a
function that creates matrix objects. A matrix is an array of numbers with a certain number
of rows and columns. By default, the elements of a matrix are generated as NA, that is, not
available. Let r be the number of rows and c the number of columns. The order of the matrix
is then r x c. A vector object of length rc in R can be converted into a matrix by the code
matrix(vector, nrow=r, ncol=c, byrow=TRUE). The rows and columns of a matrix
may be assigned names using the dimnames option in the matrix function.
The mathematics of matrices in R is preserved in relation to matrix arithmetic. Suppose
we have two matrices A and B with respective dimensions m x n and n x o. The matrix product
A x B is then a matrix of order m x o, which is obtained in R by the operation A %*% B. We
are also interested in the determinant of a square matrix, one with the number of rows equal
to the number of columns, and this is obtained in R using the det function on the matrix, say
det(A). Finally, we also more often than not require the computation of the inverse of a
square matrix. The first temptation is to obtain it using A^(-1). This will give a
wrong answer, as it leads to the element-wise reciprocal and not the inverse of the matrix.
The solve function in R, if executed on a square matrix, gives the inverse of the matrix. Fine!
Let us now do these operations using R.
Time for action – matrix computations
We will see the basic matrix computations in the forthcoming steps. Matrix computations
such as the product of matrices, the transpose, and the inverse will be illustrated.
1. Generate a 2 x 2 matrix with default values using matrix(nrow=2, ncol=2).
2. Create a matrix from the 1:4 vector by running matrix(1:4,nrow=2,ncol=2,
byrow=TRUE).
3. Assign row and column names for the preceding matrix by using the option
dimnames, that is, by running A <- matrix(data=1:4, nrow=2, ncol=2,
byrow=TRUE, dimnames = list(c("R_1", "R_2"),c("C_1", "C_2")))
at the R console.
4. Find the properties of the preceding matrices by using the commands nrow, ncol,
dimnames, and a few more, with dim(A); nrow(A); ncol(A); dimnames(A).
5. Create two matrices X and Y of order 3 x 4 and 4 x 3, and obtain their
matrix product with the code X <- matrix(c(1:12),nrow=3,ncol=4); Y <-
matrix(13:24, nrow=4) and X %*% Y.
6. The transpose of a matrix is obtained using the t function, for example, t(Y).
7. Create a new matrix A <- matrix(data=c(13,24,34,23,67,32,
45,23,11), nrow=3) and find its determinant and inverse by using det(A)
and solve(A) respectively.
The R code for the preceding action list is given in the following code snippet:
matrix(nrow=2,ncol=2)
matrix(1:4,nrow=2,ncol=2, byrow=TRUE)
A <- matrix(data=1:4, nrow=2, ncol=2, byrow=TRUE, dimnames =
list(c("R_1", "R_2"),c("C_1", "C_2")))
dim(A); nrow(A); ncol(A); dimnames(A)
X <- matrix(c(1:12),nrow=3,ncol=4)
Y <- matrix(13:24, nrow=4)
X %*% Y
t(Y)
A <- matrix(data=c(13,24,34,23,67,32,45,23,11),nrow=3)
det(A)
solve(A)
Note the use of a semicolon (;) in line 5 of the preceding code. The result of this usage
is that the code separated by a semicolon is executed as if it was entered on a new line.
Execute the preceding code in your R console. The output of the R code is given in the
following screenshot:
Matrix computations in R
What just happened?
You were able to create matrices in R and learned the basic operations. Remember
that solve, and not ^(-1), gives you the inverse of a matrix. It is now seen that matrix
computations in R are really easy to carry out.
The options nrow and ncol are used to specify the dimensions of a matrix. Data for
a matrix can be specified through the data argument. The first two lines of code in the
previous screenshot create a bare-bones matrix. Using the dimnames argument, we have
created a more elegant matrix and assigned it to a matrix object named A.
We next focus on the list object. It has already been used earlier to specify the dimnames
of a matrix.
The list object
In the preceding subsecon we saw dierent kinds of objects such as constants, vectors,
and matrices. Somemes it is required that we pool them together in a single object. The
framework for this task is provided by the list object. From the online source http://
cran.r-project.org/doc/manuals/R-intro.html#Lists-and-data-frames,
we dene a list as "an object consisng of an ordered collecon of objects known as its
components." Basically, various types of objects can be brought under a single umbrella
using the list funcon. Let us create list which contains a character vector, an integer
vector, and a matrix.
Time for action – creating a list object
Here, we will have a rst look at the creaon of list objects, which can contain in them
objects of dierent classes:
1. Create a character vector containing the rst six capital leers with A <-
LETTERS[1:6]. Create an integer vector of the rst ten integers 1-10 with B <-
1:10, and a matrix with C <- matrix(1:6,nrow=2).
2. Create a list which has the three objects created in the previous steps as its
components with Z <- list(A = A, B = B, C = C).
3. Ensure that the class of Z and its three components in A, B, and C are indeed
retained as follows: class(Z); class(Z$A); class(Z$B); class(Z$C).
The consolidated R codes are given next, which you will have to enter at the R console:
A <- LETTERS[1:6]; B <- 1:10; C <- matrix(1:6,nrow=2)
Z <- list(A = A, B = B, C = C)
Z
class(Z); class(Z$A); class(Z$B); class(Z$C)
Creating and understanding a list object
What just happened?
Dierent classes of objects can be easily brought under a single umbrella and their structures
are also preserved within the newly created list object. Especially, here we put a character
vector, an integer vector, and a matrix under a single list object. Next, we check for the
class of the Z object and nd the answer to be list as it should be. A new extracon tool
has been introduced in the dollar symbol $, which needs an explanaon. Elements/objects
from a list vector can be extracted using the $ opon on similar lines of the [ and [[
extracng tools. In our example, Z$A extracts the A object from the Z list, and we use the
class funcon wrapper on Z$A to nd its class. It is then conrmed that the classes of A, B,
and C are preserved under the list object. More details about the extracon tools may be
obtained by running ?Extract at the R console.
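A brief sketch of the three extraction tools on the Z list created above; the practical difference is that [ returns a sub-list, while [[ and $ return the component itself:
> Z$A
[1] "A" "B" "C" "D" "E" "F"
> Z[["A"]]
[1] "A" "B" "C" "D" "E" "F"
> class(Z["A"]); class(Z[["A"]])
[1] "list"
[1] "character"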
Yes, you have successfully created your first list object. This utility is particularly useful
when building big programs where we need to keep the results of several actions within a
single object.
The data.frame object
In Figure 2 of Chapter 1, Data Characteristics, we saw that when the class function is
applied to the SQ object, the output is data.frame. The details about this
function can be obtained by executing ?data.frame at the R console. The first noticeable
aspect is data.frame {base}, which means that this function is in the base library.
Further, the description says: "This function creates data frames, tightly-coupled collections
of variables, which share many of the properties of matrices and of lists, used as the
fundamental data structure by most of R's modeling software." This description is seen to
be correct, as in the same figure we have different numeric, character, and factor variables
contained in the same data.frame object. Thus, we know that a data.frame object can
contain different kinds of variables.
A data frame can contain different types of objects. That is, we can create two different
classes of vectors and bind them together in a single data frame. A data frame can also be
updated with new vectors, and existing components can also be dropped from it. As with
vectors and matrices, we can assign names to a data frame as is convenient for us.
Time for action – creating a data.frame object
Here, we create a data.frame object from vectors. New objects are then added to an existing
data frame and some preliminary manipulations are demonstrated.
1. Create a numeric and a character vector of length 3 each with x <- c(2,3,4); y
<- LETTERS[1:3].
2. Create a new data frame with df1<-data.frame(x,y).
3. Verify the variable names of the data frame, the classes of the components,
and display the variables distinctly with variable.names(df1); sapply
(df1, class); df1$x; df1$y.
4. Add a new numeric vector to df1 with df1$z <- c(pi,sqrt(2), 2.71828)
and verify the changes in df1 by entering df1 at the console.
5. Nullify the x component of df1 and verify the change.
6. Bring back the original x values with df1$x <- x.
7. Add a fourth observation with df1[4,]<- list(y=LETTERS[2], z=3,x=5)
and then remove the second observation with df1 <- df1[-2,] and verify the
change again.
8. Find the row names (or the observation names) of the data frame object by using
row.names(df1).
9. Obtain the column names (which should actually be x, y, and z) with
colnames(df1). Change the row and column names using row.names(df1)<-
1:3; colnames(df1)<-LETTERS[1:3] and display the final form of the data frame.
Following is the consolidated code that you have to enter in the R console:
# The data.frame Object
x <- c(2,3,4); y <- LETTERS[1:3]
df1<-data.frame(x,y)
variable.names(df1)
sapply(df1,class)
df1$x
df1$y
df1$z <- c(pi,sqrt(2), 2.71828)
df1
df1$x <- NULL
df1
df1$x <- x
df1[4,]<- list(y=LETTERS[2],z=3,x=5)
df1 <- df1[-2,]
df1
row.names(df1)
dim(df1)
colnames(df1)
row.names(df1)<- 1:3
colnames(df1)<-LETTERS[1:3]
df1
On running the preceding code in R, you will see the output as shown in the
following screenshot:
Understanding a data.frame object
Let us now look at a larger data.frame object. iris is a very famous dataset and we will
use it to check out some very useful tools for data display.
1. Load the iris data from the datasets package with data(iris).
2. Check the first 10 observations of the dataset with head(iris,10).
3. A compact display of a data.frame object is obtained with the str function
in the following way: str(iris).
4. Using the $ extractor tool, inspect the different Species in the iris data in the
following way: iris$Species.
5. We are asked to get the first 10 observations with the Sepal.Length and
Petal.Length variables only. Now, we use the [ extractor in the following way:
iris[1:10,c("Sepal.Length","Petal.Length")].
Different ways of extracting objects from a data.frame object
What just happened?
A data frame may be a complex structure. Here, we first created two vectors of the same
length with different structures, one being a numeric and the other a character vector. Using
the data.frame function we created a new object df1, which contains both the vectors.
The variable names of df1 are then verified with the variable.names function. After
verifying that the names are indeed as expected, we verify that the variable classes are
preserved with the application of two functions: sapply and class. lapply is a useful
utility in R which applies a function over a list or vector, and sapply is a more user-friendly
version of lapply that simplifies the result where possible. In our particular example, we
need R to return the classes of the variables of the df1 data frame.
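A quick sketch of the difference between the two functions, using the df1 object created above:
> lapply(df1, class)   # a list of classes, one component per variable
> sapply(df1, class)   # the same information simplified to a named character vector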
Have a go hero
As an exercise, explain to yourself the rest of the R code that you have executed here.
We have thus seen how to create a data frame, add and remove components and observations,
change the component names, and so on.
The table object
Data displayed in a table format is easy to understand. We will begin with the famous
Titanic dataset, as it is very unlikely that you will not have heard about it. That the gigantic
ship sinks at the end, and that there are many beautiful movies, novels, documentaries,
and more about it, make this dataset a very interesting example. It is known that the ship had
some survivors post its unfortunate and premature end. The ship had children, women,
and different classes of passengers onboard. This dataset is shipped (again) in the datasets
package along with the software. The dataset relates to the passengers' survival post
the tragedy.
The Titanic dataset has four variables: Class, Sex, Age, and Survived. For each
combination of the values of the four variables we have a count for that combination. The
Class variable is specified at the four levels 1st, 2nd, 3rd, and Crew. The gender is
specified for the passengers, and the age classification is Child or Adult. It is also known
through the Survived variable whether the onboard passengers survived the collision of the
ship with the iceberg. Thus, we have 4 x 2 x 2 x 2 = 32 different combinations of the Age,
Sex, Class, and Survived statuses.
The following screenshot gives a display of the dataset in two formats. On the right-hand
side we can see the dataset in a spreadsheet style, while the left-hand side displays the
frequencies according to the combinatorial groups. The question is how do we create table
displays such as the one on the left-hand side of the screenshot. The present section addresses
this aspect of table creation.
Two different views of the Titanic dataset
The left-hand side display of the screenshot is obtained by simply keying in Titanic at the R
console, and the data format on the right-hand side is obtained by running View(Titanic)
at the console. In general, we have our dataset available as on the right-hand side. Hence,
we will pretend that we have the dataset available in the latter format.
Time for action – creating the Titanic dataset as a table object
The goal is to create a table object from the raw dataset. We will be using the expand.grid
and xtabs functions towards this end.
1. First, create four character vectors for the four types of variables:
Class.Level <- c("1st","2nd","3rd", "Crew")
Sex.Level <- c("Male", "Female")
Age.Level <- c("Child", "Adult")
Survived.Level <- c("No", "Yes")
2. Create a list object which takes into account the variable names and their possible
levels with Data.Level <- list(Class = Class.Level, Sex = Sex.
Level, Age = Age.Level, Survived = Survived.Level).
3. Now, create a data.frame object for the levels of the four variables using
the expand.grid function by entering T.Table <- expand.grid(Class
= Class.Level, Sex = Sex.Level, Age = Age.Level, Survived
= Survived.Level) at the console. It is advisable to view the T.Table and
appreciate the changes that are occurring in this step.
4. The Titanic dataset is ready except for the frequency count at each combinatorial
level. Specify the counts with T.freq <- c(0,0,35,0,0,0,17,0, 118,
154,387,670,4,13,89,3, 5,11, 13,0,1,13,14,0,57,14,75,192,140,
80,76,20).
5. Augment T.Table with T.freq by using T.Table <- cbind(T.Table,
count=T.freq). Again, if you view the T.Table, you will find the display
on the right-hand side of the previous screenshot.
6. To obtain the display on the left-hand side, enter xtabs(count~ Class + Sex
+ Age + Survived, data = T.Table).
The complete R code is given next, which needs to be executed in the software:
Class.Level <- c("1st","2nd","3rd", "Crew")
Sex.Level <- c("Male", "Female")
Age.Level <- c("Child", "Adult")
Survived.Level <- c("No", "Yes")
Data.Level <- list(Class = Class.Level, Sex = Sex.Level,
Age = Age.Level, Survived = Survived.Level)
T.Table <- expand.grid(Class = Class.Level, Sex =
Sex.Level, Age = Age.Level, Survived = Survived.Level)
T.freq = c(0,0,35,0,0,0,17,0,118,154,387,670,4,13,89,3,
5,11, 13,0,1,13,14,0,57,14,75,192,140,80, 76,20)
T.Table = cbind(T.Table, count=T.freq)
xtabs(count~ Class + Sex + Age + Survived, data = T.Table)
What just happened?
In pracce we may oen have data in frequency format. It will be seen in later chapters
that the table object is required for carrying out stascal analysis. To translate frequency
formated data into a table object, we rst dened four variables through Class.Level,
Sex.Level, Age.Level, and Survived.Level. The levels for the required table object
have been specied through the list object Data.Level. The funcon expand.grid
creates all possible combinaons of the factors of four variables. The table of all possible
combinaons is then stored in the T.Table object. Next, the frequencies are assigned
through the T.freq integer vector. Finally, the xtabs funcon creates the count according
to the various levels of the variables and the result is a table object, which is the same
as Titanic!
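The reverse direction is a one-liner as well; a small sketch showing that a table object can be flattened back into the frequency format with as.data.frame, and re-tabulated with xtabs:
> T.df <- as.data.frame(Titanic)   # columns Class, Sex, Age, Survived, Freq
> head(T.df, 3)                    # first three rows of the frequency format
> xtabs(Freq ~ Class + Sex + Age + Survived, data = T.df)   # back to the table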
Have a go hero
UCBAdmissions is one of the benchmark datasets in Statistics. It is available in the
datasets package and it has data on the admission counts of six departments. The
admissions data appears to show a bias towards admitting male candidates over
females, and it led to an allegation against the University of California, Berkeley. The details
of this problem may be found on the web at http://www.unc.edu/~nielsen/soci708/
cdocs/Berkeley_admissions_bias.pdf. Information about the dataset is obtained
with ?UCBAdmissions. Identify all the variables and their classes and regenerate the entire
table from the raw codes.
read.csv, read.xls, and the foreign package
Data is generally available in an external file. The types of external files are certainly varied
and it is important to learn which of them may be imported into R. The probable spreadsheet
files may exist in a CSV (comma-separated values) format, an XLS or XLSX (Microsoft Excel)
format, or ODS (OpenOffice/LibreOffice Calc). There are more possible formats and we restrict
our attention to those described earlier. A snapshot of two files, Employ.dat and SCV.csv,
in gedit and MS Excel is given in the following screenshot. The brief characteristics of the two
files are summarized in the following list:
The first row lists the names of the variables of the dataset
Each observation begins on a new line
In the DAT file, the delimiter is a tab (\t) whereas for the CSV file it is a comma (,)
All three columns of the DAT file are numeric in nature
The first five columns of the CSV file are numeric while the last column is a character
Overall, both files have a well-defined structure going for them
The following screenshot underlines the theme that when the external files have
a well-defined structure, it is vital that we make the most of that structure when importing
them into R.
Screenshot of the two spreadsheet files
The core funcon for imporng les in R is the read.table funcon from the utils
package, shipped with R core. The rst argument of this funcon is the lename; see the
following screenshot. We can use header=TRUE to specify that the header names are the
variable names of the columns. The separator opon sep needs to be properly specied.
For example, for the Employ dataset, it is a tab \t whereas for the CSV le, it is a comma ,.
Frequently, each row may also have a name. For example, the customer name in a survey
dataset, or serial number, and so on. This can be specied through row.names. The row
names may or may not be present in the external le. That is, either the row names or the
column names need not be part of the le from which we are imporng the data.
The read.table syntax
In many les, there may be missing observaons. Such data can be appropriately imported
by specifying the missing values in na.strings. The missing values may be represented by
blank cells, a period, and so on. You may nd more details about the other opons in the
read.table funcon. We note that read.csv, read.delim, and so on are other variants
of the read.table funcon. An Excel le of the type XLS or XLSX may be imported into R
with the use of the read.xls funcon from the gdata package.
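As a hedged sketch of these options (the file name mydata.txt, the tab separator, and the missing-value codes are hypothetical; adapt them to your own file):
> # tab-delimited file with a header row; missing values coded as "." or blank
> mydata <- read.table("mydata.txt", header=TRUE, sep="\t",
+   na.strings=c(".",""), row.names=1)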
Let us begin with imporng simpler data les into R.
Example 2.2.1. Reading from a DAT file: The datasets analyzed in Ryan (2007) are available
on the web at ftp://ftp.wiley.com/public/sci_tech_med/engineering_statistics/.
Download the file engineering_statistics.zip and unzip the contents
to the working directory of the R session. The problem is described in Exercise 1.82 of
Ryan. The monthly data on the number of employees over a period of five years for three
Wisconsin industries in the wholesale and retail trade, food and kindred products, and
fabricated metals is available in the file Employ.dat. The task is to import this dataset into
the R session. Note that the three variables, namely the number of employees in the three
industries, are numeric in their characteristics. These characteristics should be retained in
our session too.
A useful practice is to actually open the source file and check the nature of the data in it.
For example, you should question how you will interpret the number 3.220000000e+002
specified in the original DAT file. In the Time for action – importing data from external files
section that follows, we will use the read.table function to import this data file.
Example 2.2.2. Reading from a CSV file: Ryan (2007) uses a dataset analyzed by Gupta
(1997). In this case study related to antibiotic suspension products, the response variable
is Separated Clear Volume, whose smaller value indicates better quality. The experiment
hosts five variables, each at two different levels, that is, each of the five variables is a factor
variable, and the goal of the experiment is the determination of the best combination of
these factors which yields the minimum value for the response variable.
Now, sometimes the required dataset may be available in various CSV files. In such cases,
we first read them from the various destinations and then combine them to obtain a single
metafile. A trick is the usage of the merge function. Suppose that the preceding dataset is
divided into two datasets, SCV_Usual.csv and SCV_Modified.csv, according to the variable
E. We read them into two separate data objects and then merge them into a single object.
We will carry out the imporng of these les in the next me for acon secon.
Example 2.2.3. Reading les using foreign package: SPSS, SAS, STATA, and so on, are some
of the very popular stascal soware packages. Each of the soware packages has their
own le structure for the datasets. The foreign package, which is shipped along with the R
soware , helps to read datasets used in these soware packages. The rootstock dataset is
a very popular dataset in the area of mulvariate stascs, and it is available on the web at
http://www.stata-press.com/data/r10/rootstock.dta. Essenally, the dataset is
available for the STATA soware. We will now see how R reads this dataset.
Let us set for the acon.
Time for action – importing data from external les
The external les may be imported into R using the right funcons available in it. Here, we
will use read.table, read.csv, and read.sta funcons to drive home the point.
1. Verify that you have the necessary files, Employ.dat, SCV.csv, SCV_Usual.csv,
and SCV_Modified.csv, in the working directory by using list.files().
2. If the files are not available in the list displayed, find your working directory
using getwd() and then copy the files to the working directory. Alternatively,
you can set the working directory to the folder where the files are with setwd
("C:/my_files_are_here").
3. Read the data in Employ.dat with the code employ <- read.table("Employ.
dat", header=TRUE).
4. View the data with View(employ) and ensure that the data file has been properly
imported into R.
5. Check that the class of employ and its variables have been imported in the correct
format with class(employ); sapply(employ,class).
6. Import the Separated Clear Volume data from the SCV.csv file using the code SCV
<- read.csv("SCV.csv",header=TRUE).
7. Run sapply(SCV, class). You will find that variables A-D are of the numeric
class. Convert the class of variable A to factor with either class(SCV$A) <-
'factor' or SCV$A <- as.factor(SCV$A).
8. Repeat the preceding step for variables B-D.
9. The data in the SCV.csv file is split into two files by the E variable values and is
available in SCV_Usual.csv and SCV_Modified.csv. Import the data in these
two files using the appropriate modifications of Step 6 and label the respective R
data frame objects as SCV_Usual and SCV_Modified.
10. Combine the data from the two latest objects with SCV_Combined <- merge(SCV_
Usual,SCV_Modified,by.y=c("Response","A","B","C","D","E"),all.
x=TRUE,all.y=TRUE).
11. Initialize the library package foreign with library(foreign).
12. Tell R where on the web the dataset is available using rootstock.url <-
"http://www.stata-press.com/data/r10/rootstock.dta".
An Internet connection is required to perform this step.
13. Use the read.dta function from the foreign package to import the dataset
from the web into R: rootstock <- read.dta(rootstock.url).
The necessary R codes are given next in a consolidated format:
employ <- read.table("Employ.dat",header=TRUE)
View(employ)
class(employ)
sapply(employ,class)
SCV <- read.csv("SCV.csv",header=TRUE)
sapply(SCV, class)
class(SCV$A) <- 'factor'
class(SCV$B) <- 'factor'
class(SCV$C) <- 'factor'
class(SCV$D) <- 'factor'
SCV_Usual <- read.csv("SCV_Usual.csv",header=TRUE)
SCV_Modified <- read.csv("SCV_Modified.csv",header=TRUE)
SCV_Combined <- merge(SCV_Usual,SCV_Modified,by.y=c("Response",
"A","B","C","D","E"),all.x=TRUE,all.y=TRUE)
SCV_Combined
library(foreign)
rootstock.url <- "http://www.stata-press.com/data/r10/rootstock.dta"
rootstock <- read.dta(rootstock.url)
rootstock
What just happened?
Funcons from the utils package help the R users in imporng data from various
external les. The following screenshot, edited in a graphics tool, shows the result
of running the previous code:
Importing data from external files
The read.table funcon succeeded in imporng the data from the Employ.dat le.
The utils funcon View conrms that the data has been imported with the desired classes.
The funcon read.csv has been used to import data from SCV.csv, SCV_Usual.csv, and
SCV_Modified.csv les. The merge funcon combined the data in the usual and modied
objects and created a new object, which is same as the one obtained using the SCV.csv le.
Next, we used the function read.dta from the foreign package to complete the reading
of a STATA file, which is available on the web.
You learned to import data in many different formats into R. The preceding program shows
how to change the class of variables within the object itself. You also learned how to merge
multiple data objects.
Importing data from MySQL
Data will be oen available in databases (DB) such as SQL, MySQL, and so on. To emphasize
the importance of databases is beyond the scope of this secon, and we will be content with
imporng data from a DB. The right-hand side of the following screenshot shows a snippet
of the test DB in MySQL. This DB has a single table in IO_Time and it has two variables
No_of_IO and CPU_Time. The IO_Time has 10 observaons, and we will be using this
dataset for many concepts later in the book. The goal of this secon is to show how to
import this table in to R.
An R package, RMySQL, is available from CRAN, which can be installed easily by Linux users.
Unfortunately, for Windows users, the package is not available as a readily installable
binary, in the sense that install.packages("RMySQL") won't work for them. The
best help for Windows users is available at http://www.r-bloggers.com/installing-
the-rmysql-package-on-windows-7/, though some of the code there is a bit
outdated. However, the problem is certainly solvable! The program and illustration here
work neatly for Linux users, and the following screenshot is from the Ubuntu 12.04
platform. Though a simple installation of R and MySQL generally does not help in installing
the RMySQL package, running sudo apt-get install libmysqlclient-dev first and
then install.packages("RMySQL") helps! If you still get an error, note that the
downloaded package is saved in the /tmp/RtmpeLu7CG/downloaded_packages folder of
the local machine with the name RMySQL_0.x.x.tar.gz.
You can then move to that directory and execute sudo R CMD INSTALL
RMySQL_0.x.x.tar.gz. We are now set to use the RMySQL package.
Importing data from MySQL
Note that on the Ubuntu 12.04 terminal we begin R with R -q. This suppresses the general
details we get about the R software. First, invoke the library with library(RMySQL). Set
up the DB driver with d <- dbDriver("MySQL"). Specify the DB connection with con <-
dbConnect(d,dbname='test') and then run your query to fetch the IO_Time table from
MySQL with io_data <- dbGetQuery(con,'select * from IO_Time'). Finally,
verify that the data has been properly imported into R by printing io_data. The right-hand side
of the previous screenshot confirms that the data has been correctly imported into R.
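The same steps can be collected into one short script. The following is a minimal sketch assuming a local MySQL server with a database named test that contains the IO_Time table, as described above; the closing dbDisconnect() call is simply good practice and is not part of the original walkthrough.
library(RMySQL)
d <- dbDriver("MySQL")                                 # register the MySQL driver
con <- dbConnect(d, dbname = "test")                   # connect to the local 'test' database
io_data <- dbGetQuery(con, "select * from IO_Time")    # pull the whole table into a data frame
io_data                                                # print it to verify the import
dbDisconnect(con)                                      # close the connection when done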
Exporting data/graphs
In the previous section, we learned how to import data from external files. Now, there will
be many instances where we would be keen to export data from R into suitable foreign files.
The need may arise in an automated system, reporting, and so on, where the other software
needs to make good use of the R output.
Exporting R objects
The basic R function that exports data is write.table, which is not surprising as we saw
the utility of the read.table function. The following screenshot gives a snippet of the
write.table function. While reading, we assign the imported file to an R object, and when
exporting, we first specify the R object and then mention the filename. By default, R writes
row names while exporting the object. If there are no row names, R will simply choose serial
numbers beginning with 1. If you do not need such row names, you need to specify
row.names = FALSE in the program.
Exporting data using the write.table function
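For instance, a minimal sketch of exporting the employ data frame created earlier might look as follows; the output file name Employ_out.dat is only illustrative.
# export the employ object without the automatically generated row names
write.table(employ, file = "Employ_out.dat", row.names = FALSE)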
Example 2.3.1. Exporting the Titanic data: In the Two different views of the Titanic dataset
figure, we saw the Titanic dataset in two formats. It is the display on the right-hand side
of the figure which we would like to export in the .csv format. We will use the write.csv
function for this purpose.
> write.csv(Titanic,"Titanic.csv",row.names=FALSE)
The Titanic.csv file will be exported to the current working directory. The reader can
open the CSV file in either Excel or LibreOffice Calc and confirm that it is of the desired format.
The other write/export options are also available in the foreign package. The write.xport,
write.systat, write.dta, and write.arff functions are useful if the
destination software is any of the following: SAS, SYSTAT, STATA, and Weka.
Exporting graphs
In Chapter 3, Data Visualization, we will be generating a lot of graphs. Here, we will explain
how to save the graphs in a desired format.
In the next screenshot, we have a graph generated by execution of the code plot(sin,
-pi, 2*pi) at the terminal. This line of code generates the sine wave over the interval
[-π, 2π].
Time for action – exporting a graph
Exporng of graph will be explored here:
1. Plot the sin funcon over the range [-π, 2π] by running plot(sin, -pi, 2*pi).
2. A new window pops up with the tle R Graphics Device 2 (ACTIVE).
3. In the menu bar, go to File | Save as | Png.
Saving graphs
4. Save the le as sin_plot.png, or any other name felt appropriate by you.
What just happened?
A le named sin_plot.png would have been created in the directory as specied by
you in the preceding Step 4.
Unix users do not have the luxury of saving the graph in the previously menoned way.
If you are using Unix, you have dierent opons of saving a le. Suppose we wish to save
the le when running R in a Linux environment.
The grDevices library gives different ways of saving a graph. Here, the user can use the
pdf, jpeg, bmp, png, and a few more functions to save the graph. An example is given in the
following code:
> jpeg("My_File.jpeg")
> plot(sin, -pi, 2*pi)
> dev.off()
null device
1
> ?jpeg
> ?pdf
Here, we rst invoke the required device and specify the le name to save the output, the
path directory may be specied as well along with the le name. Then, we plot the funcon
and nally close the device with dev.off. Fortunately, this technique works on both Linux
and Windows plaorms.
Managing an R session
We will close the chapter with a discussion of managing the R session. In many ways, this
section is similar to what we do to a dining table after we have completed the dinner. Now,
there are quite a few aspects to saving the R session. We will first explain how to save the
R code executed during a session.
Time for action – session management
Managing a session is very important. Any well-developed software gives multiple options
for managing a technical session, and we explore some of the methods available in R.
1. You have decided to stop the R session! At this moment, we would like to save all
the R code executed at the console. In the File menu, we have the option Save
History. Basically, it is the action File | Save History…. After selecting the option,
as with the previous section, we can save the history of that R session in a new text file.
Save the history with the filename testhist. Basically, R saves it as an .Rhistory
file which may be easily viewed/modified through any text editor. You may also save
the R history in any appropriate destination directory.
2. Now, you want to reload the history testhist at the beginning of a new R session.
The direction is File | Load History…; choose the testhist file.
3. In an R session, you would have created many objects with different characteristics.
All of them can be saved in an .Rdata file with File | Save Workspace…. In a new
session, this workspace can be loaded with File | Load Workspace….
R session management
4. Another way of saving the R code (history and workspace) is when we close the
session either with File | Exit, or by clicking on the X of the R window; a window will
pop up as displayed in the previous screenshot. If you click on Yes, R will append
the .Rhistory file in the working directory with the code executed in the current
session and also save the workspace.
5. If you want to save only certain objects from the current list, you can use the save
function. As an example, if you wish to save the object x, run save(x,file="x.Rdata").
In a later session, you can reinstate the object x with load("x.Rdata"). A consolidated
sketch of these session-management commands follows the note below.
However, the libraries that were invoked in the previous session are not available again.
They need to be explicitly invoked again using the library() function. Therefore, you
should be careful about this fact.
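The GUI menu actions described above also have scripted equivalents. The following is a minimal sketch, assuming an interactive session and using purely illustrative file names; savehistory() and loadhistory() are not available under every R front end.
savehistory(file = "testhist.Rhistory")   # save the commands typed so far
loadhistory(file = "testhist.Rhistory")   # reload them in a later session
save.image(file = "mysession.RData")      # save the complete workspace
load("mysession.RData")                   # restore the workspace later
save(x, file = "x.Rdata")                 # save a single object x (assuming x exists) ...
load("x.Rdata")                           # ... and reinstate it afterwards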
Saving the R session
What just happened?
The session history is very important, and so are the objects created during a session. As you
get deeper into the subject, you will soon realize that it is not possible to complete all the tasks
in a single session. Hence, it is vital to manage the sessions properly. You learned how to
save the code history, workspace, and so on.
Have a go hero
You have two matrices, A = [1 2 3; 6 5 0] (a 2 x 3 matrix, written row by row) and
B = [2 1; 9 -12; 1 6] (a 3 x 2 matrix, written row by row). Obtain the cross-product AB and
find the inverse of AB. Next, find (B^T A^T) and then the transpose of its inverse. What will
be your observation?
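If you would like to try this at the console, here is a minimal sketch for entering the matrices as reconstructed above and the functions involved; the observation itself is left for you to draw.
A <- matrix(c(1, 2, 3, 6, 5, 0), nrow = 2, byrow = TRUE)    # the 2 x 3 matrix A
B <- matrix(c(2, 1, 9, -12, 1, 6), nrow = 3, byrow = TRUE)  # the 3 x 2 matrix B
AB <- A %*% B            # the product AB (the cross-product referred to above)
solve(AB)                # the inverse of AB
t(B) %*% t(A)            # B'A' -- compare the transpose of its inverse with solve(AB)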
Summary
In this chapter, we learned how to carry out the essential computations. We also learned
how to import data from various foreign formats and then to export R objects and
output suitable for other software. We also saw how to effectively manage an R session.
Now that we know how to create R data objects, the next step is the visualization of such
data. In the spirit of Chapter 1, Data Characteristics, we consider graph generation according
to the nature of the data. Thus, we will see specialized graphs for data related to discrete as
well as continuous random variables. A distinction is also made between graphs required
for univariate and multivariate data. The next chapter should be pleasing on the eyes! Special
emphasis is placed on visualization techniques related to categorical data, which include bar
charts, dot charts, and spine plots. Multivariate data visualization is more than mere 3D plots,
and R methods such as the pairs plot discussed in the next chapter will be useful.
3
Data Visualization
Data is possibly best understood, wherever plausible, if it is displayed in a
reasonable manner. Chen et al. (2008) have compiled articles where many
scientists of data visualization give a deeper, historical, and modern trend of
data display. Data visualization is probably as old as data itself. It
emerges across all the dimensions of science, history, and every stream of life
where data is captured. The reader may especially go through the rich history
of data visualization in the article of Friendly (2008) from Chen et al. (2008).
The aesthetics of visualization have been elegantly described in Tufte (2001).
The current chapter will have a deep impact on the rest of the book, and
moreover this chapter aims to provide the guidance and specialized graphics
in the appropriate context for the rest of the book.
This chapter provides the necessary stimulus for understanding the gist that discrete
and continuous data need appropriate tools, and the validation may be seen through
the distinct characteristics of such plots. Further, this chapter is also closely related
to Chapter 4, Exploratory Analysis, and many visualization techniques here are indeed
"exploratory" in nature. Thus, the current chapter and the next complement each other.
It has been observed that in many preliminary courses/texts, a lot of emphasis is on the types
of plots, say histogram, boxplot, and so on, which are more suitable for data arising from
continuous variables. Thus, we need to make a distinction between the plots for discrete and
continuous variable data, and towards this we first begin with techniques which give more
insight into the visualization of discrete variable data.
In R there are four main frameworks for producing graphics: base graphics, grid, lattice,
and ggplot2. In the current chapter, the first three are used mostly and there is a brief
peek at ggplot2 at the end.
Data Visualizaon
[ 66 ]
This chapter will mainly cover the details of effective data visualization:
Visualization of categorical data using a bar chart, dot chart, spine and mosaic plots,
and the pie chart and its variants
Visualization of continuous data using a boxplot, histogram, scatter plot and its
variants, and the Pareto chart
A very brief introduction to the rich school of ggplot2
Visualization techniques for categorical data
In Chapter 1, Data Characteristics, we came across many variables whose outcomes
are categorical in nature. Gender, Car_Model, Minor_Problems, Major_Problems,
and Satisfaction_Rating are examples of categorical data. In a software product
development cycle, various issues or bugs are raised at different severity levels such as
Minor and Show Stopper. Visualization methods for categorical data require special
attention and techniques, and the goal of this section is to aid the reader with some useful
graphical tools.
In this secon, we will mainly focus on the dataset related to bugs, which are of primary
concern for any soware engineer. The source of the datasets is http://bug.inf.usi.ch/
and the reader is advised to check the website before proceeding further in this secon. We
will begin with the soware system Eclipse JDT Core, and the details for this system may be
found at http://www.eclipse.org/jdt/core/index.php. The les for download are
available at http://bug.inf.usi.ch/download.php.
Bar charts
It is very likely that you are familiar with bar charts, though you may not be aware of
categorical variables. Typically, in a bar chart one draws bars proportional to the
frequency of each category. An illustration will begin with the dataset Severity_Counts
related to the Eclipse JDT Core software system. The reader may also explore the built-in
examples in R.
Going through the built-in examples of R
The bar charts may be obtained using two options. The function barplot, from the
graphics library, is one way of obtaining bar charts. The built-in examples for this
plot function may be reviewed with example(barplot). The second option is to load the
package lattice and then run example(barchart). The sixth plot, after you
click for the sixth time on the prompt, is actually an example of the barchart function.
The main purpose of this example is to help the reader get a flavor of the bar charts that may
be obtained using R. It often happens that we have a specific variant of a plot in mind
and find it difficult to recollect it. Hence, it is a suggestion to explore the variety of bar charts
you can produce using R. Of course, there are a lot more possibilities than the mere samples
given by example().
Example 3.1.2. Bar charts for the Bug Metrics dataset: The software system Eclipse JDT Core
has 997 different class environments related to the development. The bugs identified on each
occasion are classified by severity as Bugs, NonTrivial, Major, Critical, and High.
We need to plot the frequency of the severity levels, and also require the frequencies to be
highlighted by Before and After release of the software to be neatly reflected in the graph.
The required data is available in the RSADBE package in the Severity_Counts object.
Example 3.1.3. Bar charts for the Bug Metrics of the five software systems: In the previous example,
we considered the frequencies only for the JDT software. Now, it would be a tedious
exercise if we needed five different bar plots for the different software systems. The frequency
table for the five software systems is given in the Bug_Metrics_Software dataset of the
RSADBE package.
Software  BA_Ind  Bugs    NonTrivial Bugs  Major Bugs  Critical Bugs  High Priority Bugs
JDT       Before  11,605  10,119           1,135       432            459
          After   374     17               35          10             3
PDE       Before  5,803   4,191            362         100            96
          After   341     14               57          6              0
Equinox   Before  325     1,393            156         71             14
          After   244     3                4           1              0
Lucene    Before  1,714   1,714            0           0              0
          After   97      0                0           0              0
Mylyn     Before  14,577  6,806            592         235            8,804
          After   340     187              18          3              36
It would be nice if we could simply display the frequency table across two graphs only.
This is achieved using the option beside in the barplot function. The data from the
preceding table is copied from an XLS/CSV file, and then we execute the first line of the
following R program in the Time for action – bar charts in R section.
Let us begin the action and visualize the bar charts.
Data Visualizaon
[ 68 ]
Time for action – bar charts in R
Dierent forms of bar charts will be displayed with datasets. The type of bar chart also
depends on the problem (and data) on hand.
1. Enter example(barplot) in the console and hit the Return key.
2. A new window pops up with the heading Click or hit Enter for next page.
Click (and pause between the clicks) your way until it stops changing.
3. Load the lattice package with library(lattice).
4. Try example(barchart) in the console. The sixth plot is an example of the
bar chart.
5. Load the dataset on severity counts for the JDT software from the RSADBE package
with data(Severity_Counts). Also, check for this data.
A view of this object is given in the screenshot in Step 7. We have five severities
of bugs: general bugs (Bugs), non-trivial bugs (NT.Bugs), major bugs (Major.Bugs),
critical bugs (Critical), and high priority bugs (H.Priority). For the JDT software, these
bugs are counted before and after release, and these are marked in the object with
the suffixes BR and AR. We need to understand this count data, and as a first step, we
use bar plots for the purpose.
6. To obtain the bar chart for the severity-wise comparison before and after release
of the JDT software, run the following R code:
barchart(Severity_Counts,xlab="Bug Count",xlim=c(0,12000),
col=rep(c(2,3),5))
The barchart function is available from the lattice package. The range
for the count is specified with xlim=c(0,12000). Here, the argument
col=rep(c(2,3),5) is used to tell R that we need two colors for BR and AR
and that this should be repeated five times for the five severity levels of the bugs.
Figure 1: Bar graph for the Bug Metrics dataset
7. An alternave method is to use the barplot funcon from the graphics package:
barplot(Severity_Counts,xlab="Bug Count",xlim=c(0,12000), horiz
=TRUE,col=rep(c(2,3),5))
Here, we use the argument horiz = TRUE to get a horizontal display of the bar
plot. A word of cauon here that the argument horizontal = TRUE in barchart
of the lattice package works very dierently.
Data Visualizaon
[ 70 ]
We will now focus on Bug_Metrics_Software, which has the bug count data for
all the five software systems: JDT, PDE, Equinox, Lucene, and Mylyn.
Figure 2: View of Severity_Counts and Bug_Metrics_Software
8. Load the dataset related to all the five software systems with data(Bug_Metrics_Software).
9. To obtain the bar plots for before and after release of the software in the same
window, run par(mfrow=c(1,2)).
What is the par function? It is a function frequently used to set the parameters
of a graph. Let us consider a simple example. Recollect that when you tried the
code example(dotchart), R would ask you to Click or hit Enter for next page,
and after the click or Enter action, the next graph would be displayed. However, this
prompt did not turn up when you ran barplot(Severity_Counts,xlab="Bug
Count",xlim=c(0,12000), horiz=TRUE,col=rep(c(2,3),5)). Now,
let us try using par, which will ask us to first click or hit Enter so that we get the bar
plot. First run par(ask=TRUE), and then follow it with the bar plot code. You will
now be asked to either click or hit Enter. Find more details of the par function with
?par. Let us now get to the mfrow argument. The default plot option displays
the output on one device, and for the next plot, the former will be replaced with the
new one. We require the bar plots of the before and after release counts to be displayed
in the same window. The option mfrow = c(1,2) ensures that both the bar plots
are displayed in the same window with one row and two columns.
10. To obtain the bar plot of bug frequencies before release, where each of the software
bug frequencies is placed side by side for each type of bug severity, run
the following:
barplot(Bug_Metrics_Software[,,1],beside=TRUE,col = c("lightblue",
"mistyrose", "lightcyan", "lavender", "cornsilk"),legend = c("JDT",
"PDE","Equinox","Lucene", "Mylyn"))
title(main = "Before Release Bug Frequency", font.main = 4)
Here, the code Bug_Metrics_Software[,,1] ensures that only the before release
counts are considered. The option beside = TRUE ensures that the columns are displayed
as juxtaposed bars; otherwise, the frequencies will be stacked in a single bar
with areas proportional to the frequency of each software. The option col =
c("lightblue", …) assigns the respective colors for the software. Finally, the
title command is used to designate an appropriate title for the bar plot.
11. Similarly, to obtain the bar plot for the after release bug frequencies, run the following:
barplot(Bug_Metrics_Software[,,2],beside=TRUE,col = c("lightblue",
"mistyrose", "lightcyan", "lavender", "cornsilk"),legend = c("JDT",
"PDE","Equinox","Lucene", "Mylyn"))
title(main = "After Release Bug Frequency",font.main = 4)
Data Visualizaon
[ 72 ]
The reader can extend the code interpretation for the before release to the after
release bug frequencies.
Figure 3: Bar plots for the five software
First noce that the scale on the y-axis for before and aer release bug frequencies
is drascally dierent. In fact, before release bug frequencies are in thousands while
aer release are in hundreds. This clearly shows that the engineers have put a lot of
eort to ensure that the released products are with minimum bugs. However, the
comparison of bug counts is not fair since the frequency scales of the bar plots in
the preceding screenshot are enrely dierent. Though we don't expect the results
to be dierent under any case, it is sll appropriate that the frequency scales remain
the same for both before and aer release bar plots. A common suggeson is to plot
the diagrams with the same range on the y-axes (or x-axes), or take an appropriate
transformaon such as a logarithm. In our problem, neither of them will work, and
we resort to another variant of the bar chart from the lattice package.
Now, we will use the formula structure for the barchart funcon and bring the
BR and AR on the same graph.
12. Run the following code in the R console:
barchart(Software~Freq|Bugs,groups=BA_Ind,
data=as.data.frame(Bug_Metrics_Software),col=c(2,3))
The formula Software~Freq|Bugs requires that we obtain the bar chart for the
software count Freq according to the severity of Bugs. We further specify that
each of the bar charts be further grouped according to BA_Ind. This will result in
the following screenshot:
Figure 4: Bar chart for Before and After release bug counts on the same scale
To nd the colors available in R, run try colors() in the console and you will nd the
names of 657 colors.
Data Visualizaon
[ 74 ]
What just happened?
barplot and barchart were the two functions we used to obtain the bar charts. For
common recurring factors, AR and BR here, the colors can be correctly specified through the
rep function. The argument beside=TRUE helped us to keep the bars for the various software
together for the different bug types. We also saw how to use the formula structure of the
lattice package. We saw the diversity of bar charts and learned how to create effective
bar charts depending on the purpose of the day.
Have a go hero
Explore the opon stack=TRUE in the barchart(Software~Freq|Bugs,groups= BA_
Ind,…). Also, observe that Freq for bars in the preceding screenshot begins a lile earlier
than 0. Reobtain the plots by specifying the range for Freq with xlim=c(0,15000).
Dot charts
Cleveland (1993) proposed an alternative to the bar chart where dots are used to represent
the frequency associated with the categorical variables. Dot charts are useful for small
to moderately sized datasets. Dot charts are also an alternative to the pie chart; refer to
The default examples section. Dot charts may be varied to accommodate continuous
variable datasets too. Dot charts are known to obey Tukey's principle of achieving as
high an information-to-ink ratio as possible.
Example 3.1.4. (Continuation of Example 3.1.2): In the screenshot in Step 6 in the Time for
action – bar charts in R section, we saw that the bars for the frequencies of bugs
after release are almost non-existent. This is overcome using the dot chart; see the following
action list on the dot chart.
Time for action – dot charts in R
The dotchart funcon from the graphics package and dotplot from the lattice
package will be used to obtain the dot charts.
1. To view the default examples of dot charts, enter example(dotplot);
example(dotchart); in the console and hit the Return key.
2. To obtain the dot chart of the before and after release bug frequencies, run the
following code:
dotchart(Severity_Counts,col=15:16,lcolor="black",pch=2:3,
labels=names(Severity_Counts),main="Dot Plot for the Before and After
Release Bug Frequency",cex=1.5)
Here, the option col=15:16 is used to specify the choice of colors; lcolor is used
for the color of the lines on the dot chart, which gives the human eye a good assessment
of the relative positions of the frequencies. The option pch=2:3 picks triangle and plus
symbols for indicating the positions of the after and before frequencies. The
options labels and main are trivial to understand, whereas cex magnifies the size
of all labels by 1.5 times. On execution of the preceding R code, we get a graph as
displayed in the following screenshot:
Figure 5: Dot chart for the Bug Metrics dataset
Data Visualizaon
[ 76 ]
3. The dot plot can be easily extended to all the five software systems as we did with the
bar charts.
> par(mfrow=c(1,2))
> dotchart(Bug_Metrics_Software[,,1],gcolor=1:5,col=6:10,lcolor="black",
pch=15:19,labels=names(Bug_Metrics_Software[,,1]),
main="Before Release Bug Frequency",xlab="Frequency Count")
> dotchart(Bug_Metrics_Software[,,2],gcolor=1:5,col=6:10,lcolor="black",
pch=15:19,labels=names(Bug_Metrics_Software[,,2]),
main="After Release Bug Frequency",xlab="Frequency Count")
Figure 6: Dot charts for the five software bug frequency
For a matrix input to dotchart, the gcolor option assigns one color per column group. Note
that though the class of Bug_Metrics_Software is both xtabs and table, the class of
Bug_Metrics_Software[,,1] is a matrix, and hence we create a dot chart of it. This
means that the R code dotchart(Bug_Metrics_Software) leads to errors! The dot
chart is able to display the bug frequencies in a better way as compared to the bar chart.
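A quick check of these classes at the console, as a small sketch of the point above:
class(Bug_Metrics_Software)           # "xtabs" "table", as stated above
class(Bug_Metrics_Software[,,1])      # a plain matrix, which dotchart() accepts
dotchart(Bug_Metrics_Software[,,1])   # works; dotchart(Bug_Metrics_Software) would not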
What just happened?
Two dierent ways of obtaining the dot plot were seen, and a host of other opons were
also clearly indicated in the current secon.
Spine and mosaic plots
In the bar plot, the length (height) of the bar varies, while the width of each bar is kept the
same. In a spine/mosaic plot, the height is kept constant for the categories and the width
varies in accordance with the frequency. The advantages of a spine/mosaic plot become
apparent when we have frequencies tabulated for several variables via a contingency table.
The spine plot is a particular case of the mosaic plot. We first consider an example for
understanding the spine plot.
Example 3.1.5. Visualizing the Shift and Operator data (Page 487, Ryan, 2007): In a manufacturing
factory, operators are rotated across shifts and it is a concern to find out whether the time of
the shift affects the operator's performance. In this experiment, there are three operators who in a
given month work in a particular shift. Over a period of three months, data is collected on the
number of nonconforming parts an operator produces during a given shift. The data is obtained
from page 487 of Ryan (2007) and is reproduced in the following table:
Operator 1 Operator 2 Operator 3
Shift 1 40 35 28
Shift 2 26 40 22
Shift 3 52 46 49
We will obtain a spine plot towards an understanding of the spread of the number
of non-conforming units an operator produces during the shifts in the forthcoming action
time. Let us ask the following questions:
Does the total number of non-conforming parts depend on the operators?
Does it depend on the shift?
Can we visualize the answers to the preceding questions?
Time for action – the spine plot for the shift and operator data
Spine plots are drawn using the spineplot function.
1. Explore the default examples for the spine plot with example(spineplot).
2. Enter the data for the shift and operator example with:
ShiftOperator <- matrix(c(40, 35, 28, 26, 40, 22, 52, 46, 49),nrow=3,
dimnames=list(c("Shift 1", "Shift 2", "Shift 3"), c("Operator 1",
"Operator 2", "Operator 3")),byrow=TRUE)
3. Find the number of non-conforming parts for the operators with the
colSums function:
> colSums(ShiftOperator)
Operator 1 Operator 2 Operator 3
       118        121         99
Data Visualizaon
[ 78 ]
The non-conforming parts for operators 1 and 2 are close enough, and the count is
about 20 percent lower for the third operator.
4. Find the number of non-conforming parts according to the shifts using the
rowSums function:
> rowSums(ShiftOperator)
Shift 1 Shift 2 Shift 3
    103      88     147
Shift 3 appears to have about 50 percent more non-conforming parts in
comparison with shifts 1 and 2. Let us look at the spine plot.
5. Obtain the spine plot for the ShiftOperator data with
spineplot(ShiftOperator).
Now, we will attempt to make the spine plot a bit more interpretable. In the
absence of any external influence, we would expect the shifts and operators to
have a near equal number of non-conforming units.
6. Thus, on the overall x and y axes, we plot lines at approximately the one-third
marks and check if we get approximately equal regions/squares.
abline(h=0.33,lwd=3,col="red")
abline(h=0.67,lwd=3,col="red")
abline(v=0.33,lwd=3,col="green")
abline(v=0.67,lwd=3,col="green")
The output in the graphics device window will be the following screenshot:
Figure 7: Spine plot for the Shift Operator problem
It appears from the partition induced by the red lines that all the operators have
a nearly equal number of non-conforming parts. However, the spine chart shows
that most of the non-conforming parts occur during Shift 3.
What just happened?
Data summaries were used to understand the behavior of the problem, and the spine
plot helped in clear identification of Shift 3 as a major source of the non-conforming units
manufactured. The use of the abline function was particularly insightful for this dataset
and should be explored whenever there is scope for it.
The spine plot is a special case of the mosaic plot. Friendly (2001) has pioneered the concept
of mosaic plots, and Chapter 4, Exploratory Analysis, is an excellent reference for the same.
For a simple understanding of the construction of the mosaic plot, you can go through slides
7-12 at http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture17.pdf.
As explained there, suppose that there are three categorical variables, each with two
levels. Then, the mosaic plot begins with a square and divides it into two parts, with each
part having an area proportional to the frequency of the two levels of the first categorical
variable. Next, each of the preceding two parts is divided into two parts according to
the predefined frequency of the two levels of the second categorical variable. Note that we
now have four divisions of the total area. Finally, each of the four areas is further divided
into two more parts, each with an area reflecting the predefined frequency of the two levels
of the third categorical variable.
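A minimal sketch of this three-variable construction, using a small hypothetical 2 x 2 x 2 table (all names and frequencies below are made up purely for illustration):
# three two-level factors with made-up cell frequencies
tab <- array(c(30, 10, 20, 40, 15, 25, 35, 5),
             dim = c(2, 2, 2),
             dimnames = list(First  = c("a1", "a2"),
                             Second = c("b1", "b2"),
                             Third  = c("c1", "c2")))
mosaicplot(tab, main = "Mosaic plot of three two-level factors")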
Example 3.1.6. The Titanic dataset: In the The table object section in Chapter 2, Import/
Export Data, we came across the Titanic dataset. The dataset was seen in two different
forms and we also constructed the data from scratch. Let us now continue the example.
The main problems in this dataset are the following:
The distribution of the passengers by Class, and then the spread of Survived
across Class
The distribution of the passengers by Sex and its distribution across the survivors
The distribution by Age followed by the survivors among them
We now want to visualize the distribution of Survived first by Class, then by Sex, and finally by
the Age group.
Let us see the detailed action.
Data Visualizaon
[ 80 ]
Time for action – the mosaic plot for the Titanic dataset
The goal here is to understand the survival percentages on the Titanic ship with respect to
the Class of the passenger/crew, Sex, and Age. We first use xtabs and prop.table to gain insight
into each of these variables, and then visualize the overall picture using mosaicplot.
1. Get the frequencies of Class for the Titanic dataset with
xtabs(Freq~Class,data=Titanic).
2. Obtain the Survived proportions across Class with
prop.table(xtabs(Freq~Class+Survived,data=Titanic),margin=1).
3. Repeat the preceding two steps for Sex: xtabs(Freq~Sex,data=Titanic)
and prop.table(xtabs(Freq~Sex+Survived,data=Titanic),margin=1).
4. Repeat this exercise for Age: xtabs(Freq~Age,data=Titanic) and
prop.table(xtabs(Freq~Age+Survived,data=Titanic),margin=1).
5. Obtain the mosaic plot for the dataset with
mosaicplot(Titanic,col=c("red","green")).
The enre output is given in the following screenshot:
Figure 8: Mosaic plot for the Titanic dataset
The preceding output shows that the people traveling in the higher classes survived better than
those in the lower classes. The analysis also shows that females were given more priority over
males when the rescue system was in action. Finally, it may be seen that children were given
priority over adults.
The mosaic plot division process proceeds as follows. First, it divides the region into four
parts with the regions proportional to the frequencies of each Class; that is, the widths of
the regions are proportional to the Class frequencies. Each of the four regions is further
divided using the predefined frequencies of the Sex categories. Now, we have eight regions.
Next, each of these regions is divided using the predefined frequencies of the Age groups,
leading to 16 distinct regions. Finally, each of the regions is divided into two parts according
to the Survived status. The Yes regions of Child for the first two classes are larger than the
No regions. The third Class has more non-survivors than survivors, and this appears to be
true across Age and Gender. Note that there are no children among the Crew class.
The interpretation of the rest of the regions is left to the reader.
What just happened?
A clear demyscaon of the working of the mosaic plot has been provided. We applied it to
the Titanic dataset and saw how it obtains clear regions which enable to deep dive into a
categorical problem.
Pie charts and the fourfold plot
Pie charts are hugely popular among many business analysts. One reason for their popularity is,
of course, their simplicity. That the pie chart is easy to interpret is actually not a fact. In fact, the pie
chart is seriously discouraged for analysis and observations; refer to the caution of Cleveland
and McGill, and also Sarkar (2008), page 57. However, we will still continue with an illustration
of it.
Example 3.1.7. Pie chart for the Bugs Severity problem: Let us obtain the pie chart for the
bug severity levels.
> pie(Severity_Counts[1:5])
> title("Severity Counts Post-Release of JDT Software")
> pie(Severity_Counts[6:10])
> title("Severity Counts Pre-Release of JDT Software")
Data Visualizaon
[ 82 ]
Can you nd the drawback of the pie chart?
Figure 9: Pie chart for the Before and After Bug counts (output edited)
The main drawback of the pie chart stems from the fact that humans have a problem in
deciphering relative areas. A common recommendation is the use of a bar chart or a dot
chart instead of the pie chart, as the problem of judging relative areas does not exist when
comparing linear measures.
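As a quick sketch of that recommendation, the same post-release counts shown in the first pie chart above can be redrawn as a dot chart:
# the dot-chart alternative to the first pie chart
dotchart(Severity_Counts[1:5],
         main = "Severity Counts Post-Release of JDT Software")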
The fourfold plot is a novel way of visualizing a 2 x 2 x k contingency table. In this method,
we obtain k plots, one for each 2 x 2 frequency table. Here, the cell frequency of each of the four
cells is represented by a quarter circle whose radius is proportional to the square root of the
frequency. In contrast to the pie chart, where the radius is constant and the area is varied by the
perimeter, the radius in a fourfold plot is varied to represent the cell.
Example 3.1.8. The fourfold plot for the UCBAdmissions dataset: An in-built R function
which generates the required plot is fourfoldplot. The R code and the resultant
screenshot are displayed as follows:
> fourfoldplot(UCBAdmissions,mfrow=c(2,3),space=0.4)
Figure 10: The fourfold plot of the UCBAdmissions dataset
In this secon, we focused on graphical techniques for categorical data. In many books,
the graphical methods begin with tools which are more appropriate for data arising for
connuous variables. Such tools have many shortcomings if applied to categorical data.
Thus, we have taken a dierent approach where the categorical data gets the right tools,
which it truly deserves. In the next secon, we deal with tools which are seemingly more
appropriate for data related to connuous variables.
Data Visualizaon
[ 84 ]
Visualization techniques for continuous variable data
Connuous variables have a dierent structure and hence, we need specialized methods
for displaying them. Fortunately, many popular graphical techniques are suited very well for
connuous variables. As the connuous variables can arise from dierent phenomenon, we
consider many techniques in this secon. The graphical methods discussed in this secon
may also be considered as a part of the next chapter on exploratory analysis.
Boxplot
The boxplot is based on five points: the minimum, lower quartile, median, upper quartile, and
maximum. The median forms the thick line near the middle of the box, and the lower and
upper quartiles complete the box. The lower and upper quartiles, along with the median,
which is the second quartile, divide the data into four regions, each containing an equal
number of observations. The median is the middle-most value among the data sorted in
increasing (decreasing) order of magnitude. On similar lines, the lower quartile may be
interpreted as the median of the observations between the minimum and the median.
These concepts are dealt with in more detail in Chapter 4, Exploratory Analysis. The boxplot is
generally used for two purposes: understanding the data spread and identifying outliers.
For the first purpose, we set the range argument to zero, which will extend the whiskers to the
extremes at the minimum and maximum, and give the overall distribution of the data.
If the purpose of the boxplot is to identify outliers, we extend the whiskers in a way which
accommodates tolerance limits to enable us to capture the outliers. Thus, the whiskers are
extended up to 1.5 times (the default value) the interquartile range (IQR), which is the difference
between the third and first quartiles, beyond the quartiles. The default setting of boxplot is the
identification of outliers. If any point is found beyond the whiskers, such observations
may be marked as outliers. The boxplot is also sometimes called a box-and-whisker plot;
the whiskers are the lines drawn from the ends of the box to the minimum
and maximum points. We will begin with an example of the boxplot.
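As a quick illustration of the range argument before the worked examples, consider this small sketch; the vector x is hypothetical data with one extreme value.
x <- c(2, 3, 3, 4, 5, 5, 6, 30)   # made-up data with one extreme point
boxplot(x, range = 0)             # whiskers run all the way to the minimum and maximum
boxplot(x)                        # default: whiskers within 1.5*IQR, the value 30 is flagged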
Example 3.2.1. Example (boxplot): For a quick tutorial on the various options of the boxplot
function, the user may run the following code at the R console. Also, the reader is advised
to explore the bwplot function from the lattice package. Try example(boxplot) and
example(bwplot) from the respective graphics and lattice packages.
Example 3.2.2. Boxplot for the resistivity data: Gunst (2002) has 16 independent
observations from eight pairs on the resistivity of a wire. There are two processes under
which these observations are equally distributed. We would like to see if the resistivity of the
wires depends on the processes, and which of the processes leads to higher resistivity.
A numerical comparison based on the summary function will be carried out first, and then
we will visualize the two processes through boxplots to conclude whether the effects are the
same, and if not, which process leads to higher resistivity.
Example 3.2.3. The Michelson-Morley experiment: This is a famous physics experiment from
the late nineteenth century, which helped in proving the non-existence of ether. If the ether
existed, one would expect a shift of about 4 percent in the speed of light. The speed of light is
measured 20 times in each of five different experiments. We will use this dataset for two purposes:
checking whether the drift of 4 percent is evidenced in the data, and setting the whiskers at the
extremes. The first one is a statistical issue and the latter is a software setting.
For the preceding three examples, we will now read the required data into R, then
look at the necessary summary functions, and finally visualize them using boxplots.
Time for action – using the boxplot
Boxplots will be obtained here using the function boxplot from the graphics package as well
as bwplot from the lattice package.
1. Check the variety of boxplots with example(boxplot) from the graphics
package and example(bwplot) for the variants in the lattice package.
2. The resistivity data from the RSADBE package contains the two processes'
information which we need to compare. Load it into the current session with
data(resistivity).
3. Obtain the summary of the two processes with the following:
> summary(resistivity)
Process.1 Process.2
Min. 0.138 0.142
1st Qu. 0.140 0.144
Median 0.142 0.146
Mean 0.142 0.146
3rd Qu. 0.143 0.148
Max. 0.146 0.150
Clearly, Process 2 has approximately 0.004 higher resistivity as compared
to Process 1 across all the essential summaries. Let us check if the boxplot
captures the same.
4. Obtain the boxplot for the two processes with boxplot(resistivity, range=0).
The argument range=0 ensures that the whiskers are extended to the
minimum and maximum data values. The boxplot diagram (left-hand side of the
next screenshot) clearly shows that Process.2 has higher resistivity in comparison
with Process.1. Next, we will consider the bwplot function from the lattice
package. A slightly different rearrangement of the resistivity data frame will be
required, in that we will specify all the resistivity values in a single column and their
corresponding processes in another column.
Data Visualizaon
[ 86 ]
An important opon for boxplots is that of notch, which is especially useful for
the comparison of medians. The top and boom notches for a set of observaons
are dened at the point's median
±
1.57(IQR)/n1/2. If notches of two boxplots do
not overlap, it can be concluded that the medians of the groups are signicantly
dierent. Such an opon can be specied in both boxplot and bwplot funcons.
5. Convert resisvity to another useful form which will help the applicaon of the
bwplot funcon with resistivity2 <- data.frame(rep(names( re
sistivity),each=8),c(resistivity[,1],resistivity[,2])).
Assign variable names to the new data frame with names(resistivity2)<-
c("Process","Resistivity").
6. Run the bwplot funcon on resistivity2 with
bwplot(Resistivity~Process, data=resistivity2,notch=TRUE).
Figure 11: Boxplots for resistivity data with boxplot, bwplot, and notches
The notches overlap and hence we can't conclude from the boxplot that the
resistivity medians are very different from each other.
With the data on the speed of light from the morley dataset, we have the important
goal of identifying outliers. Towards this purpose, the whiskers are extended up to 1.5
times (the default value) the interquartile range (IQR) beyond the quartiles.
7. Create a boxplot with whiskers which enable identification of the outliers beyond
1.5 times the IQR with the following:
boxplot(Speed~Expt,data=morley,main = "Whiskers at Lower- and
Upper- Confidence Limits")
8. Add the line which helps to identify the presence of ether with
abline(h=792.458,lty=3).
The resulting screenshot is as follows:
Figure 12: Boxplot for the Morley dataset
It may be seen from the preceding screenshot that experiment 1 has one outlier, while
experiment 3 has three outlier values. Since the line is well below the medians of the
experiment values (speeds, actually), we conclude that there is no experimental evidence
for the existence of ether.
What just happened?
Variees of boxplots have been elaborated in this secon. The boxplot has also been put in
acon to idenfy the presence of outliers in a given dataset.
Data Visualizaon
[ 88 ]
Histograms
The histogram is one of the earliest graphical techniques and undoubtedly one of the most
versatile and adaptive graphs, whose relevance remains as legitimate as it ever was. The invention
of the histogram is credited to the great statistician Karl Pearson. Its strength is also in its
simplicity. In this technique, the range of a variable is divided into intervals, and the height of an
interval is determined by the frequency of the observations falling in that interval. Even in the case
of an unbalanced division of the range of the variable values, histograms are especially
informative in revealing the shape of the probability distribution of the variable. We will see
more about these points in the following examples.
The construcon of a histogram is explained with the famous dataset of Galton, where an
aempt has been made for understanding the relaonship between the heights of a child
and parent. In this dataset, there are 928 pairs of observaon of the height of the child and
parent. Let us have a brief peek at the dataset:
> data(galton)
> names(galton)
[1] "child" "parent"
> dim(galton)
[1] 928 2
> head(galton)
child parent
1 61.7 70.5
2 61.7 68.5
3 61.7 65.5
4 61.7 64.5
5 61.7 64.0
6 62.2 67.5
> sapply(galton,range)
child parent
[1,] 61.7 64
[2,] 73.7 73
> summary(galton)
child parent
Min. :61.7 Min. :64.0
1st Qu.:66.2 1st Qu.:67.5
Median :68.2 Median :68.5
Mean :68.1 Mean :68.3
3rd Qu.:70.2 3rd Qu.:69.5
Max. :73.7 Max. :73.0
We need to cover all the 928 observations in intervals, also known as bins, which need to
cover the range of the variable. The natural question is how does one decide on the number
of intervals and the width of these intervals? If the bin width, denoted by h, is known, the
number of bins, denoted by k, can be determined by:
$$k = \left\lceil \frac{\max_i x_i - \min_i x_i}{h} \right\rceil$$
Here, $\lceil \cdot \rceil$ denotes the ceiling of the number. Similarly, if the number of bins k
is known, the width is determined by $h = (\max_i x_i - \min_i x_i)/k$.
There are many guidelines for these problems. The three options available for the hist
function in R are the formulas given by Sturges, Scott, and Freedman-Diaconis, the details of
which may be obtained by running ?nclass.Sturges, ?nclass.FD, and ?nclass.scott
in the R console. The default setting uses the Sturges option. The Sturges formula
for the number of bins is given by:
$$k = \lceil \log_2 n \rceil + 1$$
This formula works well when the underlying distribution is approximately a
normal distribution. Scott's normal reference rule for the bin width, using the sample
standard deviation $\hat{\sigma}$, is:
$$h = \frac{3.5\,\hat{\sigma}}{n^{1/3}}$$
Finally, the Freedman-Diaconis rule for the bin width is given by:
$$h = \frac{2\,\mathrm{IQR}}{n^{1/3}}$$
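These three rules are implemented in the grDevices package, so the implied number of bins can be checked directly; a quick sketch on the parent heights from the Galton data loaded above:
nclass.Sturges(galton$parent)   # number of bins by Sturges' formula
nclass.scott(galton$parent)     # bins implied by Scott's bin width
nclass.FD(galton$parent)        # bins implied by the Freedman-Diaconis width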
In the following, we will construct a few histograms, describing the problems through their
examples and their R setup in the Time for action – understanding the effectiveness of
histograms section.
Example 3.2.4. The default examples: To get a first preview of the generation of histograms,
we suggest the reader go through the built-in examples; try example(hist) and
example(histogram).
Data Visualizaon
[ 90 ]
Example 3.2.5. The Galton dataset: We will obtain histograms for the heights of child and
parent from the Galton dataset. We will use the Freedman-Diaconis and Sturges choices of
bin widths.
Example 3.2.6. Octane rating of gasoline blends: An experiment is conducted where the
octane rating of gasoline blends can be obtained using two methods. Two samples are
available for testing each type of blend, and Snee (1981) obtains 32 different blends over
an appropriate spectrum of the target octane ratings. We obtain histograms for the ratings
under the two different methods.
Example 3.2.7. Histogram with a dummy dataset: A dummy dataset has been created by the
author. Here, we need to obtain histograms for the two samples in the Samplez data from
the RSADBE package.
Time for action – understanding the effectiveness of histograms
Histograms are obtained using the hist and histogram functions. The choice of bin widths
is also discussed.
1. Get a feel for R's histogram capabilities through example(hist) and
example(histogram) for the respective histogram functions from the graphics
and lattice packages.
2. Set up the graphics device with par(mfrow=c(2,2)).
3. Create the histograms for the heights of Child and Parent from the galton dataset,
seen in the earlier part of the section, for the Freedman-Diaconis and Sturges
choices of bin widths:
par(mfrow=c(2,2))
hist(galton$parent,breaks="FD",xlab="Height of Parent",
main="Histogram for Parent Height with Freedman-Diaconis
Breaks",xlim=c(60,75))
hist(galton$parent,xlab="Height of Parent",main="Histogram for
Parent Height with Sturges Breaks",xlim=c(60,75))
hist(galton$child,breaks="FD",xlab="Height of Child",
main="Histogram for Child Height with Freedman-Diaconis
Breaks",xlim=c(60,75))
hist(galton$child,xlab="Height of Child",main="Histogram for Child
Height with Sturges Breaks",xlim=c(60,75))
Consequently, we get the following screenshot:
Figure 13: Histograms for the Galton dataset
Note that a few people may not like histograms for the height of parent for the
Freedman-Diaconis choice of bin width.
4. For the experiment mentioned in Example 3.2.6, Octane rating of gasoline blends,
first load the data into R with data(octane).
5. Invoke the graphics device for the ratings under the two methods with
par(mfrow=c(2,2)).
6. Create the histograms for the ratings under the two methods for the Sturges choice
of bin widths with:
hist(octane$Method_1,xlab="Ratings Under Method I",main="Histogram
of Octane Ratings for Method I",col="mistyrose")
hist(octane$Method_2,xlab="Ratings Under Method
II",main="Histogram of Octane Ratings for Method II",col="cornsilk")
The resulting histogram plot will be the first row of the next screenshot.
A visual inspection suggests that under Method_1, the mean rating is around 90
while under Method_2 it is approximately 95. Moreover, the Method_2 ratings
look more symmetric than the Method_1 ratings.
Data Visualizaon
[ 92 ]
7. Load the required data here with data(Samplez).
8. Create the histograms for the two samples in the Samplez data frame with:
hist(Samplez$Sample_1,xlab="Sample 1",main="Histogram: Sample 1",
col="magenta")
hist(Samplez$Sample_2,xlab="Sample 2",main="Histogram: Sample 2",
col="snow")
We obtain the following histogram plot:
Figure 14: Histogram for the Octane and Samples dummy dataset
The lack of symmetry is very apparent in the second-row display of the preceding screenshot.
It is very clear from the preceding screenshot that the left histogram exhibits an example of
a positively skewed distribution for Sample_1, while the right histogram for Sample_2 shows
a negatively skewed distribution.
What just happened?
Histograms have traditionally provided a lot of insight into the understanding of the
distribution of variables. In this section, we dived deep into the intricacies of their construction,
especially related to the choice of bin widths. We also saw how the different natures of the
variables are clearly brought out by their histograms.
Scatter plots
In the previous subsecon, we used histograms for understanding the nature of the
variables. For mulple variables, we need mulple histograms. However, we need dierent
tools for understanding the relaonship between two or more variables. A simple, yet
eecve technique is the scaer plot. When we have two variables, the scaer plot simply
draws the two variables across the two axes. The scaer plot is powerful in reecng the
relaonship between the variables as in it reveals if there is a linear/nonlinear relaonship.
If the relaonship is linear, we may get an insight if there is a posive or negave relaonship
among the variables, and so forth.
Example 3.2.8. The drain current versus the ground-to-source voltage: Montgomery and
Runger (2003) report an article from IEEE (Exercise 11.64) about an experiment where the
drain current is measured against the ground-to-source voltage. In the scatter plot, the drain
current values are plotted against each level of the ground-to-source voltage. The former
is measured in milliamperes and the latter in volts. The R function plot is used for
understanding the relationship. We will soon visualize the relationship between the current
values and the levels of the ground-to-source voltage. This data is available as DCD in the
RSADBE package.
The scatter plot is very flexible when we need to understand the relationship between more
than two variables. In the next example, we will extend the scatter plot to multiple variables.
Example 3.2.9. The Gasoline mileage performance data: The mileage of a car depends on
various factors; in fact, it is a very complex problem. In the next table, the various variables
x1 to x11 are described, which are believed to have an influence on the mileage of the car. We
need a plot which explains the inter-relationships between the variables and the mileage.
The exercise of repeating the plot function may be done 11 times, though most people may
struggle to recollect the influence of the first plot when they are looking at the sixth or maybe
the seventh plot. The pairs function returns a matrix of scatter plots, which is really useful.
Let us visualize the matrix of scatter plots:
> data(Gasoline)
> pairs(Gasoline) # Output suppressed
It may be seen that this matrix of scatter plots is a symmetric plot in the sense that the
upper and lower triangles of this matrix are simply copies of each other (transposed copies,
actually). We can be more effective in representing the data in the matrix of scatter plots by
specifying additional parameters. Even as we study the inter-relationships, it is important to
understand the variable distributions themselves. Since the diagonal elements just indicate the
names of the variables, we can instead replace them by their histograms. Further, if we can
give a measure of the relationship between two variables, say the correlation coefficient,
we can be more effective. In fact, we go a step further by displaying the correlation coefficient
with a font size that increases with its strength. We first define the necessary
functions and then use the pairs function.
Data Visualizaon
[ 94 ]
Variable Notation  Variable Description          Variable Notation  Variable Description
Y                  Miles/gallon                  x6                 Carburetor (barrels)
x1                 Displacement (cubic inches)   x7                 No. of transmission speeds
x2                 Horsepower (foot-pounds)      x8                 Overall length (inches)
x3                 Torque (foot-pounds)          x9                 Width (inches)
x4                 Compression ratio             x10                Weight (pounds)
x5                 Rear axle ratio               x11                Type of transmission
                                                                    (A-automatic, M-manual)
Time for action – plot and pairs R functions
The scaer plot and its important mulvariate extension with pairs will be considered in
detail now.
1. Consider the data data(DCD).
Use the opons xlab and ylab to specify the right labels for the axes. We specify
xlim and ylim to get a good overview of the relaonship.
2. Obtain the scaer plot for Example 3.2.8. The Drain current versus the ground-
to-source voltage using plot(DCD$Drain_Current, DCD$GTS_Voltage,t
ype="b",xlim=c(1,2.2),ylim=c(0.6,2.4),xlab="Current Drain",
ylab="Voltage").
Figure 15: The scatter plot for DCD
We can easily see from the preceding scatter plot that as the ground-to-source
voltage increases, there is a corresponding increase in the drain current. This is an
indication of a positive relationship between the two variables. However, the lab
assistant now comes to you and says that the measurement error of the instrument
has actually led to 15 percent higher recordings of the ground-to-source voltage.
Now, instead of dropping the entire diagram, we may simply prefer to add the
corrected figures to the existing one. The points function helps us to add the new
corrected data points to the figure.
3. Now, first obtain the correct GTS_Voltage readings with DCD$GTS_Voltage/1.15
and add them to the existing plot with points(DCD$Drain_Current,
DCD$GTS_Voltage/1.15,type="b",col="green").
4. We rst create two funcons panel.hist and panel.cor dened as follows:
panel.hist<- function(x, ...)
{
usr<- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks<- h$breaks; nB<- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}
panel.cor<- function(x, y, digits=2, prefix="", cex.cor, ...) {
usr<- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x,y,use="complete.obs"))
txt<- format(c(r, 0.123456789), digits=digits)[1]
txt<- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor<- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
The preceding two functions are taken from the code of example(pairs).
5. It is me to put these two funcons in to acon:
pairs(Gasoline,diag.panel=panel.hist,lower.panel=panel.
smooth,upper.panel=panel.cor)
Figure 16: The pairs plot for the Gasoline dataset
In the upper triangle of the display, we can see that the mileage has a strong association with the displacement, horsepower, torque, number of transmission speeds, overall length, width, weight, and type of transmission. We can say a bit more too. The first three variables, x1 to x3, relate to the engine characteristics, and there is a strong association within these three variables. Similarly, there is a strong association among x8 to x10, and together they describe the vehicle dimensions. Also, we have done a bit more than simply obtaining the scatter plots in the lower triangle of the display: a smooth approximation of the relationship between the variables is provided there.
6. Finally, we resort to the usual trick of looking at the capabilities of the plot and pairs functions with example(plot), example(pairs), and example(xyplot).
We have seen how multiple variables can be visualized. In the next subsection, we will explore Pareto charts.
What just happened?
Starting with a simple scatter plot and its effectiveness, we went to great lengths in its extension to the pairs function. The pairs function has been explored using the panel.hist and panel.cor functions for truly understanding the relationships among a set of multiple variables.
Pareto charts
The Pareto rule, also known as the 80-20 rule or the law of the vital few, says that approximately 80 percent of the defects are due to 20 percent of the causes. It is important because identifying and eliminating the vital 20 percent of causes removes roughly 80 percent of the defects. The qcc package contains the function pareto.chart, which helps in generating the Pareto chart. We will give a simple illustration of this chart.
The Pareto chart is a display of the cause frequencies along two axes. Suppose that we have 10 causes, C1 to C10, which have occurred with defect counts 5, 23, 7, 41, 19, 4, 3, 4, 2, and 1. Causes 2, 4, and 5 have high frequencies (dominating?) and the other causes look a bit feeble. Now, let us sort these causes in decreasing order of frequency and obtain their cumulative frequencies. We will also obtain their cumulative percentages.
> Cause_Freq <- c(5, 23, 7, 41, 19, 4, 3, 4, 2, 1)
> names(Cause_Freq) <- paste("C",1:10,sep="")
> Cause_Freq_Dec <- sort(Cause_Freq,dec=TRUE)
> Cause_Freq_Cumsum <- cumsum(Cause_Freq_Dec)
> Cause_Freq_Cumsum_Perc <- Cause_Freq_Cumsum/sum(Cause_Freq)
> cbind(Cause_Freq_Dec,Cause_Freq_Cumsum,Cause_Freq_Cumsum_Perc)
Cause_Freq_Dec Cause_Freq_Cumsum Cause_Freq_Cumsum_Perc
C4 41 41 0.3761
C2 23 64 0.5872
C5 19 83 0.7615
C3 7 90 0.8257
C1 5 95 0.8716
C6 4 99 0.9083
C8 4 103 0.9450
C7 3 106 0.9725
C9 2 108 0.9908
C10 1 109 1.0000
This appears to be a simple trick, and yet it is very effective in revealing that causes 2, 4, and 5 contribute more than 75 percent of the defects. A Pareto chart completes the preceding table with a bar chart of the causes in decreasing order of count, with a left vertical axis for the frequencies and a cumulative curve read against the right vertical axis. We will see the Pareto chart in action for the next example.
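Before moving on, note that the Cause_Freq vector built above can be passed directly to pareto.chart; a minimal sketch, assuming the qcc package is installed:
> library(qcc)
> pareto.chart(Cause_Freq)  # bars of the sorted causes plus the cumulative percentage curve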
Data Visualizaon
[ 98 ]
Example 3.2.10. The Pareto chart for incomplete applications: A simple step-by-step illustration of the Pareto chart is available on the Web at http://personnel.ky.gov/nr/rdonlyres/d04b5458-97eb-4a02-bde1-99fc31490151/0/paretochart.pdf. The reader can go through the clear steps mentioned in the document.
In the example from the web document mentioned previously, a bank which issues credit cards rejects application forms if they are deemed incomplete. An application form may be incomplete if information is not provided for one or more of the details sought in the form. For example, an application can't be processed further if the customer/applicant has not provided an address, has illegible handwriting, the signature is missing, or the customer is already an existing credit card holder, among other reasons. The concern of the manager of the credit card wing is to ensure that the rejections for incomplete applications decline, since a cost is incurred on issuing the form which is generally not charged for. The manager wants to focus on the reasons which may be leading to the rejection of the forms. Here, we consider the frequency of the different causes which lead to the rejection of the applications.
> library(qcc)
> Reject_Freq = c(9,22,15,40,8)
> names(Reject_Freq) = c("No Addr.", "Illegible", "Curr. Customer", "No Sign.", "Other")
> Reject_Freq
      No Addr.      Illegible Curr. Customer       No Sign.          Other 
             9             22             15             40              8 
> options(digits=2)
> pareto.chart(Reject_Freq)
Pareto chart analysis for Reject_Freq
Frequency Cum.Freq. Percentage Cum.Percent.
No Sign. 40.0 40.0 42.6 42.6
Illegible 22.0 62.0 23.4 66.0
Curr. Customer 15.0 77.0 16.0 81.9
No Addr. 9.0 86.0 9.6 91.5
Other 8.0 94.0 8.5 100.0
Figure 17: The Pareto chart for incomplete applications
In the preceding Pareto chart (Figure 17), the frequency of the five reasons for rejection is represented by bars as in a bar plot, with the distinction of being displayed in decreasing magnitude of frequency. The frequency of the reasons is indicated on the left vertical axis. At the mid-point of each bar, the cumulative frequency up to that reason is indicated, and the reference for this count is the right vertical axis. Thus, we can see that more than 75 percent of the rejections are due to the three causes No Signature, Illegible, and Current Customer. This is the main strength of a Pareto chart.
A brief peek at ggplot2
Tufte (2001) and Wilkinson (2005) place a lot of emphasis on the aesthetics of graphics. There is indeed more to graphics than mere mathematics, and subtle changes/corrections in a display may lead to improved, enhanced, and easier-on-the-eye diagrams. Wilkinson emphasizes what he calls the grammar of graphics, and an R adaptation of it is given by Wickham (2009).
Thus far, we have used various functions, such as barchart, dotchart, spineplot, fourfoldplot, boxplot, plot, and so on. The grammar of graphics emphasizes that a statistical graphic is a mapping from data to the aesthetic attributes of geometric objects. The aesthetics consist of color, shape, and size, while the geometric objects are composed of points, lines, and bars. A detailed discussion of these aspects is unfortunately not feasible in our book, and we will have to settle for a quick introduction to the grammar of graphics.
To begin with, we will simply consider the qplot function from the ggplot2 package.
Time for action – qplot
Here, we will first use the qplot function for obtaining various kinds of plots. To keep the story short, we are using the earlier datasets only and hence, a reproduction of the similar plots using qplot won't be displayed. The reader is encouraged to check ?qplot and its examples.
1. Load the library with library(ggplot2).
2. Rearrange the resistivity dataset in a different format and obtain the boxplots:
test <- data.frame(rep(c("R1","R2"),each=8), c(resistivity[,1],resistivity[,2]))
names(test) <- c("RES","VALUE")
qplot(factor(RES), VALUE, data=test, geom="boxplot")
The qplot function needs to be told explicitly that RES is a factor variable, and according to its levels we obtain the boxplots of the resistivity values.
3. For the Gasoline dataset, we would like to obtain a boxplot of the mileage according to whether the gear system is manual or automatic. Thus, qplot can be put to action with qplot(factor(x11), y, data=Gasoline, geom="boxplot").
4. A histogram is also one of the geometric aspects of ggplot2, and we next obtain the histogram for the height of the child with qplot(child, data=galton, geom="histogram", binwidth=2, xlim=c(60,75), xlab="Height of Child", ylab="Frequency").
5. The scatter plot of the height of the parent against the child is fetched with qplot(parent, child, data=galton, xlab="Height of Parent", ylab="Height of Child", main="Height of Parent Vs Child").
What just happened?
The geom argument of qplot allows a good family of graphics under a single function. This is particularly advantageous as it lets us perform a host of tricks under a single umbrella.
Of course, there is the all the more important ggplot function from the ggplot2 library, which is the primary reason for the flexibility of the grammar of graphics. We will close the chapter with a very brief exposition of it. The main strength of ggplot stems from the fact that you can build a plot layer by layer. We will illustrate this with a simple example.
Time for action – ggplot
ggplot, aes, and layer will be put into action to explore the power of the grammar of graphics.
1. Load the library with library(ggplot2).
2. Using the aes and ggplot functions, first create a ggplot object with galton_gg <- ggplot(galton, aes(child, parent)) and inspect the most recent creation in R by running galton_gg. You will get an error, and the graphics device will show a blank screen.
3. Create a scatter plot by adding a layer to galton_gg with galton_gg <- galton_gg + layer(geom="point"), and then run galton_gg to check for changes. Yes, you will get a scatter plot of the height of the child versus the parent.
4. The labels of the axes are not satisfactory and we need better ones. The strength of ggplot is that you can continue to add layers to it with varied options. In fact, you can change the xlim and ylim on an existing plot and check the difference in the plot at each step. Run the following code in a step-by-step manner and appreciate the strength of the grammar (a sketch using the current ggplot2 syntax follows this list):
galton_gg <- galton_gg + xlim(60,75)
galton_gg
galton_gg <- galton_gg + ylim(60,75)
galton_gg
galton_gg <- galton_gg + ylab("Height of Parent") + xlab("Height of Child")
galton_gg
galton_gg <- galton_gg + ggtitle("Height of Parent Vs Child")
galton_gg
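As promised, a side note: the layer(geom="point") call reflects the ggplot2 version used here; in more recent versions of the package the same plot is usually built with geom_point() and labs(). A minimal sketch, assuming the galton data frame with columns child and parent is already loaded:
library(ggplot2)
ggplot(galton, aes(child, parent)) +
  geom_point() +                    # the point layer
  xlim(60, 75) + ylim(60, 75) +     # axis limits
  labs(x = "Height of Child", y = "Height of Parent",
       title = "Height of Parent Vs Child")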
What just happened?
The layer-by-layer approach of ggplot is very useful, and we have seen an illustration of it on the scatter plot for the Galton dataset. The reach of ggplot is, of course, much richer than our simple illustration, and the interested reader may refer to Wickham (2009) for more details.
Have a go hero
If you run par(list(cex.lab=2, ask=TRUE)) followed by barplot(Severity_Counts, xlab="Bug Count", xlim=c(0,12000), horiz=TRUE, col=rep(c(2,3),5)), what do you expect R to do?
Summary
In this chapter, we have visualized different types of graphs for different types of variables. We have also explored how to gain insights into data through graphs. It is important to realize that plots generated without a clear understanding of the data structure, and without exercising enough caution, are meaningless. The GIGO adage is very true, and no rich visualization technique can overcome this problem.
In the previous chapter, we learned the important methods of importing/exporting data, and in this chapter we have visualized the data in different forms. Now that we have an understanding and visual insight of the data, we need to take the next step, namely quantitative analysis of the data. There are roughly two streams of analysis: exploratory and confirmatory analysis. It is the former technique that forms the core of the next chapter. As an instance, the scatter plot reveals whether there is a positive, negative, or no association between two variables. If the association is not zero, a numeric answer about the strength of the positive or negative relationship is then required. Techniques such as these and their extensions form the core of the next chapter.
4
Exploratory Analysis
Tukey (1977), in his benchmark book Exploratory Data Analysis, abbreviated popularly as EDA, describes "best methods" as:
We do not guarantee to introduce you to the "best" tools, particularly since we are not sure that there can be unique bests.
The goal of this chapter is to emphasize EDA and its strengths.
In the previous chapter, we saw visualization techniques for data of different characteristics. Analytical insight is also important, and this chapter considers EDA techniques for obtaining it. The more popular summary measures include the mean, standard error, and so on. It has been shown many times that the mean has several drawbacks, one of them being that it is very sensitive to outliers/extremes. Thus, in exploratory analysis the focus is on measures which are robust to the extremes. Many techniques considered in this chapter are discussed in more detail by Velleman and Hoaglin (1981), and an eBook has been kindly made available at http://dspace.library.cornell.edu/handle/1813/62. In the first section, we will have a peek at the often-used measures for exploratory analysis. The main learnings from this chapter are listed as follows:
Summary statistics based on the median and its variants, which are robust to outliers
Visualization techniques in the stem-and-leaf plot, letter values, and bagplots
A first regression model in the resistant line, and refined methods in smoothing data and median polish
Essential summary statistics
We have seen the useful summary statistics of mean and variance in the Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics. The concepts therein have their own utility value. The drawback of such statistical metrics is that they are very sensitive to outliers, in the sense that a single observation may completely distort the entire story. In this section, we discuss some exploratory analysis metrics which are intuitive and more robust than metrics such as the mean and variance.
Percentiles, quantiles, and median
For a given dataset and a number 0 < k < 1, the 100k% percentile divides the dataset into two partitions with 100k percent of the values below it and 100(1-k) percent of the values above it. The fraction k is referred to as a quantile. In Statistics, quantiles are used more often than percentiles, the difference being that quantiles vary over the unit interval (0, 1), whereas 100 times the quantiles gives us the percentiles. It is important to note that the minimum (maximum) is the 0% (100%) percentile.
The median is the fiftieth percentile, which divides the data values into two equal parts with itself being the mid-point of these parts. The lower and upper quartiles are respectively the 25% and 75% percentiles. The standard notation for the lower, mid (median), and upper quartiles respectively is Q1, Q2, and Q3. By extension, Q0 and Q4 respectively denote the minimum and maximum of a dataset.
Example 4.1.1. Rahul Dravid – The Wall: The game of cricket is hugely popular in India, and many cricketers have given a lot of goose bumps to those watching. Sachin Tendulkar, Anil Kumble, Javagal Srinath, Sourav Ganguly, Rahul Dravid, and VVS Laxman are some of the iconic names across the world. The six players mentioned here have especially played a huge role in taking India to the number one position in the Test cricket rankings, and it is widely believed that Rahul Dravid has been the main backbone of this success. A century is credited to a cricketer on scoring 100 or more runs in a single innings. Dravid has scored 36 Test centuries across the globe, quite a handful of them so resolute in nature that they earned him the nickname "The Wall", and we will seek some percentiles/quantiles for these scores soon.
Next, we will focus on a statistic which is similar to the quantiles.
Hinges
The nomenclature of the concept of hinges basically comes from the hinges seen on a door. For a door's frame, the mid-hinge is at the middle of the height of the frame, whereas the lower and upper hinges are respectively observed midway from the mid-hinge to the bottom and top of the frame. In exploratory analysis, the hinges are defined by arranging the data in increasing order, and to start with, the median is identified as the mid-hinge.
The lower hinge for the (ordered) data is defined as the middle-most observation from the minimum to the median. The upper hinge is defined similarly for the upper part of the data. At first it may appear that the lower and upper hinges are the same as the lower and upper quartiles. Consider the data as the first 10 integers. Here, the median is 5.5, the average of the two middle-most numbers 5 and 6. Using the quantile function on 1:10, it may be checked that here Q1 = 3.25 and Q3 = 7.75. The lower hinge is the middle-most number between 1 and the median 5.5, which turns out to be 3, and the upper hinge is 8. Thus, it may be seen that the hinges are different from the quartiles.
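This difference is easy to verify at the console; a minimal check on the first 10 integers (quantile and fivenum are both base R functions):
> quantile(1:10)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 
> fivenum(1:10)   # minimum, lower hinge, median, upper hinge, maximum
[1]  1.0  3.0  5.5  8.0 10.0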
An extension of the concept of hinges will be seen in the Letter values section.
We will next look at exploratory measures of dispersion.
The interquartile range
Range, the difference between the minimum and maximum of the data values, is one measure of the spread of the variable. This measure is susceptible to the extreme points. The interquartile range, abbreviated as IQR, is defined as the difference between the upper and lower quartile, that is:
IQR = Q3 - Q1
The R function IQR calculates the IQR for a given numeric object. All the concepts theoretically described up to this point will now be put into action.
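As a quick sanity check before the main example, the IQR function agrees with the difference of the quartiles returned by quantile; a minimal sketch on 1:10:
> IQR(1:10)
[1] 4.5
> unname(quantile(1:10, 0.75) - quantile(1:10, 0.25))
[1] 4.5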
Time for action – the essential summary statistics for "The Wall" dataset
We will understand the summary measures for EDA through the data on centuries scored by Rahul Dravid in Test matches.
1. Load the useful EDA package: library(LearnEDA).
2. Load the dataset TheWALL from the RSADBE package: data(TheWALL).
3. Obtain the quantiles of the centuries with quantile(TheWALL$Score), and the differences between the quantiles using the diff function: diff(quantile(TheWALL$Score)). The output is as follows:
> quantile(TheWALL$Score)
    0%    25%    50%    75%   100% 
 100.0  111.8  140.0  165.8  270.0 
> diff(quantile(TheWALL$Score))
   25%    50%    75%   100% 
 11.75  28.25  25.75 104.25
As we are considering Rahul Dravid's centuries only, the starting point is 100. The median of his centuries is 140.0, and the first and third quartiles are respectively 111.8 and 165.8. The median of 140 runs can be interpreted as follows: given that The Wall scores a century, there is a 50 percent chance of the score reaching 140 runs or more. His highest ever score is, of course, 270. Interpret the differences between the quantiles.
4. The percentiles of Dravid's centuries can be obtained by using the quantile function again: quantile(TheWALL$Score, seq(0,1,.1)); here, seq(0,1,.1) creates a vector which increases in increments of 0.1 beginning with 0 until 1. The inter-differences between the percentiles are found with diff(quantile(TheWALL$Score, seq(0,1,.1))):
> quantile(TheWALL$Score,seq(0,1,.1))
   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
100.0 103.5 111.0 116.0 128.0 140.0 146.0 154.0 180.0 208.5 270.0 
> diff(quantile(TheWALL$Score,seq(0,1,.1)))
 10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 3.5  7.5  5.0 12.0 12.0  6.0  8.0 26.0 28.5 61.5
The Wall is also known for his resolve to perform well in away Test matches. Let us verify that using the data on his century scores.
5. The number of Home and Away centuries is obtained using the table function. Further, we obtain a boxplot of the home and away centuries.
> table(HA_Ind)
HA_Ind
Away Home 
  21   15 
The R function table returns the frequencies of the various categories of a categorical variable. In fact, it is more versatile and can obtain frequencies for more than one categorical variable. His reputation in away Test matches is partly confirmed by the fact that 21 of his 36 centuries came in away Tests. However, the boxplot says otherwise:
> boxplot(Score~HA_Ind,data=TheWALL)
Figure 1: Box plot for Home/Away centuries of The Wall
It may be tempting for The Wall's fans to believe that if they remove the outliers of scores above 200, the result may show that his performance in away Test centuries is better than or equal to the home ones. However, this is not the case, which may be verified as follows.
6. Generate the boxplot for centuries whose score is less than 200 with boxplot(Score~HA_Ind, subset=(Score<200), data=TheWALL).
Figure 2: Box plot for Home/Away centuries of The Wall (less than 200 runs)
What do you conclude from the preceding screenshot?
7. The fivenum summary for the centuries is:
> fivenum(TheWALL$Score)
[1] 100.0 111.5 140.0 169.5 270.0
The fivenum function returns the minimum, lower hinge, median, upper hinge, and maximum values of the input data. The numbers 111.5 and 169.5 are the lower and upper hinges, and it may be seen that they are certainly different from the lower and upper quartiles of 111.8 and 165.8. Thus far, we have focused on measures of location, so let us now look at some measures of dispersion.
The range function in R actually returns the minimum and maximum values of the data. Thus, to obtain the range as a measure of spread, we get it using diff(range()). We use the IQR function to obtain the interquartile range.
8. Using the range, diff, and IQR functions, obtain the spread of Dravid's centuries as follows:
> range(TheWALL$Score)
[1] 100 270
> diff(range(TheWALL$Score))
[1] 170
> IQR(TheWALL$Score)
[1] 54
> IQR(TheWALL$Score[TheWALL$HA_Ind=="Away"])
[1] 36
> IQR(TheWALL$Score[TheWALL$HA_Ind=="Home"])
[1] 63.5
Here, we extract the away and home centuries from Score by selecting only those elements of Score for which HA_Ind equals "Away" or "Home" respectively.
What just happened?
The data summaries in the EDA framework are slightly different. Here, we first used the quantile function to obtain the quartiles and the deciles (10 percent steps) of a numeric variable. The diff function has been used to find the difference between consecutive elements of a vector. The boxplot function has been used to compare the home and away Test centuries, which led to the conclusion that the median score of Dravid's centuries at home is higher than that of the away centuries. Restricting to Test centuries under 200 runs further confirmed, in particular, that Dravid's centuries at home have a higher median value than those in away series, and in general that the median is robust to outliers.
The IQR function gives us the interquartile range for a vector, and the fivenum function gives us the hinges. Though intuitively it appears that hinges and quartiles are similar, this is not always true. In this section, you also learned the usage of functions such as quantile, fivenum, IQR, and so on.
We will now move on to the main techniques of exploratory analysis.
The stem-and-leaf plot
The stem-and-leaf plot is considered one of the seven important tools of Statistical Process Control (SPC); refer to Montgomery (2005). It is somewhat similar in nature to the histogram.
The stem-and-leaf plot is an effective method of displaying data in a (partial) tree form. Here, each datum is split into two parts: the stem part and the leaf part. In general, the last digit of a datum forms the leaf part and the rest forms the stem. Now, consider a datum 235. If the split criterion is the units place, the stem and leaf parts will respectively be 23 and 5; if it is tens, then 2 and 3; and finally, if it is hundreds, they will be 0 and 2. The left-hand side of the split datum is called the leading digits and the right-hand side the trailing digits.
In the next step, all the possible leading digits are arranged in increasing order. This includes even those stems for which we may not have data for the leaf part, which ensures that the final stem-and-leaf plot truly depicts the distribution of the data. All the possible leading digits are called stems. The leaves are then displayed to the right-hand side of the stems, and for each stem the leaves are again arranged in increasing order.
Example 4.2.1. A simple illustration: Consider a dataset of eight elements: 12, 22, 42, 13, 27, 46, 25, and 52. The leading digits for this data are 1, 2, 4, and 5. On inserting 3, the leading digits complete the required stems of 1 to 5. The leaves for stem 1 are 2 and 3. The unordered leaves for stem 2 are 2, 7, and 5; the displayed leaves for stem 2 are then 2, 5, and 7. There are no leaves for stem 3. Similarly, the leaves for stems 4 and 5 are respectively the sets {2, 6} and {2}. The stem function in R will be used for generating the stem-and-leaf plots.
Example 4.2.2. Octane Rating of Gasoline Blends: (Continued from the Visualization techniques for continuous variable data section of Chapter 3, Data Visualization): In the earlier study, we used the histogram for understanding the octane ratings under two different methods. We will use the stem function in the forthcoming Time for action – the stem function in play section for displaying the octane ratings under Method_1 and Method_2.
Tukey (1977), being the benchmark book for EDA, produces the stem-and-leaf plot in a slightly different style. For example, the stem plots for Method_1 and Method_2 are better understood if we can put the two displays adjacent to each other instead of one below the other. It is possible to achieve this using the stem.leaf.backback function from the aplpack package.
Time for action – the stem function in play
The R function stem from the base package and stem.leaf.backback from aplpack are fundamental for our purpose of creating stem-and-leaf plots. We will illustrate these two functions for the examples discussed earlier.
1. As mentioned in Example 4.2.1, A simple illustration, first create the x vector: x <- c(12,22,42,13,27,46,25,52).
2. Obtain the stem-and-leaf plot with:
> stem(x)
The decimal point is 1 digit(s) to the right of the |
1 | 23
2 | 257
3 |
4 | 26
5 | 2
To obtain the median from the stem display, we proceed as follows. Remove one point from each side of the display: first we remove 52 and 12, then 46 and 13. The trick is to proceed until we are left with either one point or two. In the former case, the remaining point is the median, and in the latter case, we simply take the average of the two. Finally, we are left with 25 and 27, and hence their average, 26, is the median of x.
We will now look at the octane dataset.
3. Obtain the stem plots for both methods: data(octane), stem(octane$Method_1, scale=2), and stem(octane$Method_2, scale=2).
The output will be similar to the following screenshot:
Figure 3: The stem plot for the octane dataset (R output edited)
Of course, the preceding screenshot has been edited. To generate such a back-to-back display, we need a different function.
4. Using the stem.leaf.backback function from aplpack, with library(aplpack) and stem.leaf.backback(octane$Method_1, octane$Method_2, back.to.back=FALSE, m=5), we get the output in the desired format.
Figure 4: Tukey's stem plot for octane data
The preceding screenshot has many unexplained, and a bit mysterious, symbols! Prof. J. W. Tukey took a very pragmatic approach when developing EDA. You are strongly encouraged to read Tukey (1977), as this brief chapter barely does justice to it. Note that 18 of the 32 observations for Method_1 are in the range 80.4 to 89.3. Now, if we have stems only at 8, 9, and 10, the count at stem 8 will be 18, which will not give a meaningful understanding of the data. The stems can have sub-stems, or be stretched out, and for this a very novel solution is provided by Prof. Tukey. For very high-frequency stems, the solution is to split the stem into five sub-stems. For the stem 8 here, we have trailing digits at 0, 1, 2, ..., 9. Now, adopt a scheme of tagging lines which leads towards a clear reading of the stem-and-leaf display. Tukey suggests using * for zero and one, t for two and three, f for four and five, s for six and seven, and a period (.) for eight and nine. Truly ingenious! Thus, if you are planning to write about stem-and-leaf plots in your local language, you may not require *, t, f, s, and "." at all! Go back to the preceding screenshot and now it looks much more beautiful.
Following the leaf part for each method, we are also given cumulative frequencies from the top and from the bottom. Why? We know that the stem-and-leaf display has increasing values from the top and decreasing values from the bottom. In this particular example, we have n: 32 observations. Thus, in sorted order, we know that the median is a value between the sixteenth and seventeenth sorted observations. The cumulative frequencies, when they exceed 16 from either direction, lead to the median. This is indicated by (2) for Method_1 and (6) for Method_2. Can you now make an approximate guess of the median values? Obviously, depending on the dataset, we may require m = 5, 1, or 2.
We have used the argument back.to.back=FALSE to ensure that the two stem-and-leaf displays can be seen independently. Now, it is fairly easy to compare the two displays by setting back.to.back=TRUE, in which case the stem line will be common for both methods and thus we can simply compare their leaf distributions. That is, you need to run stem.leaf.backback(octane$Method_1, octane$Method_2, back.to.back=TRUE, m=5) and investigate the results.
We can clearly see that the median for Method_2 is higher than that of Method_1.
What just happened?
Using the basic stem function and stem.leaf.backback from aplpack, we got two efficient exploratory displays of the datasets. The latter function can be used to compare two stem-and-leaf displays. Stems can be further squeezed to reveal more information with the option m set to 1, 2, or 5.
We will next look at the EDA technique which extends the scope of hinges.
Letter values
The median, quartiles, and the extremes (maximum and minimum) indicate how the data is spread over its range. These values can be used to examine two or more samples. There is another way of understanding the data, offered by letter values. This small journey begins with a concept called depth, which measures the minimum position of a datum in the ordered sample counted from either of the extremes. Thus, the extremes have a depth of 1, the second largest and second smallest datum have a depth of 2, and so on.
Now, consider a sample of size n, assumed to be an odd number for convenience's sake. Then, the depth of the median is (n + 1)/2. The depth of a datum is denoted by d, and for the median it is indicated by d(M). Since the hinges, lower and upper, do to the two halves of the sample (split at the median) what the median does to the whole sample, the depth of the hinges is given by d(H) = ([d(M)] + 1)/2. Here, [ ] denotes the integer part of the argument. As the hinges, including the mid-hinge which is the median, divide the data into four equal parts, we can define the eighths as the values which divide the data into eight equal parts. The eighths are denoted by E. The depth of the eighths is given by the formula d(E) = ([d(H)] + 1)/2. It may be seen that the depths of the median, hinges, and eighths depend on the sample size.
Using the eighths, we can further carry out the division of the data to obtain the sixteenths, then the thirty-seconds, and so on. The process of division continues until we end up at the extremes, where we cannot proceed with the division any longer. The letter values continue the search until we end at the extremes. The process of division is well understood when we list the lower and upper values for the hinges, the eighths, the sixteenths, the thirty-seconds, and so on. The difference between the lower and upper values of these metrics (the spread), together with their average (the mid, a concept similar to the mid-range), is also useful for understanding the data.
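A minimal sketch of these depth formulas, using the sample size n = 32 of the octane data (compare with the depth column of the lval output shown shortly):
n <- 32
d_M <- (n + 1)/2              # depth of the median: 16.5
d_H <- (floor(d_M) + 1)/2     # depth of the hinges: 8.5
d_E <- (floor(d_H) + 1)/2     # depth of the eighths: 4.5
c(median = d_M, hinge = d_H, eighth = d_E)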
The R function lval from the LearnEDA package gives the letter values for the data.
Example 4.3.1. Octane Rating of Gasoline Blends (Continued): We will now obtain the letter values for the octane dataset:
> library(LearnEDA)
> lval(octane$Method_1)
depth lo hi mids spreads
M 16.5 88.00 88.0 88.000 0.00
H 8.5 85.55 91.4 88.475 5.85
E 4.5 84.25 94.2 89.225 9.95
D 2.5 82.25 101.5 91.875 19.25
C 1.0 80.40 105.5 92.950 25.10
> lval(octane$Method_2)
depth lo hi mids spreads
M 16.5 97.20 97.20 97.200 0.00
H 8.5 93.75 99.60 96.675 5.85
E 4.5 90.75 101.60 96.175 10.85
D 2.5 87.35 104.65 96.000 17.30
C 1.0 83.30 106.60 94.950 23.30
The leer values, look at the lo and hi in the preceding code, clearly show that the
corresponding values for Method_1 are always lower than those under Method_2.
Parcularly, note that the lower hinge of Method_2 is greater than the higher hinge
of Method_1. However, the spread under both the methods are very idencal.
Data re-expression
The presence of an outlier or over-spread of the data may lead to an incomplete picture in a graphical display, and hence statistical inference may be inappropriate in such scenarios. In many of these scenarios, re-expression of the data on another scale may be more useful; refer to Chapter 3, The Problem of Scale, of Tukey (1977). Here, we list the scenarios from Tukey where data re-expression may help circumvent the limitations cited at the beginning.
The rst scenario where re-expression is useful is when the variables assume non-negave
values, that is, the variables never assume a value lesser than zero. Examples of such
variables are age, height, power, area, and so on. A thumb rule for the applicaon of
re-expression is when the rao of the largest to the smallest value in the data is very large,
say 100. This is one reason that most regression analysis variables such as age are almost
always re-expressed on the logarithmic scale.
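A minimal sketch of the thumb rule on a hypothetical non-negative variable:
x <- c(3, 8, 15, 40, 120, 450, 900)   # hypothetical skewed, non-negative data
max(x)/min(x)                         # 300, well above 100: re-expression indicated
log10(x)                              # the log scale compresses the long right tail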
The second scenario explained by Tukey concerns variables like balances and profit-and-loss. If there is a deposit to an account, the balance increases, and if there is a withdrawal, it decreases. Thus, such variables can assume positive as well as negative values. Re-expression of variables like balances rarely helps; re-expression of the amounts or quantities before the subtraction helps on some occasions. Fractions and percentage counts form the third scenario where re-expression of the data is useful, though special techniques are needed. The scenarios mentioned are indicative and not exhaustive. We will now look at the data re-expression techniques which are useful.
Example 4.4.1. Re-expression for the Power of 62 Hydroelectric Stations: We need to understand the distribution of the ultimate power, in megawatts, of 62 hydroelectric stations and power stations of the Corps of Engineers. The data for our illustration has actually been regenerated from Exhibit 3 of Tukey (1977). First, we simply look at the stem-and-leaf display of the original data on the power of the 62 hydroelectric stations. We use the stem.leaf function from the aplpack package.
> hydroelectric <- c(14,18,28,26,36,30,30,34,30,43,45,54,52,60,68,
+ 68,61,75,76,70,76,86,90,96,100,100,100,100,100,100,110,112,
+ 118,110,124,130,135,135,130,175,165,140,250,280,204,200,270,
+ 40,320,330,468,400,518,540,595,600,810,810,1728,1400,1743,2700)
> stem.leaf(hydroelectric, unit=1)
1 | 2: represents 12
leaf unit: 1
n: 62
2 1 | 48
4 2 | 68
9 3 | 00046
24 9 | 06
30 10 | 000000
(4) 11 | 0028
28 12 | 4
27 13 | 0055
23 14 | 0
15 |
22 16 | 5
21 17 | 5
18 |
19 |
20 20 | 04
21 |
24 |
7 60 | 0
HI: 810 810 1400 1728 1743 2700
The data begins with values as low as 14, grows modestly into the hundreds, such as 100, 135, and so on, then grows into the five hundreds and finally literally explodes into the thousands, running up to 2700. If all the leading digits had to be mandatorily displayed, we would have 270 of them. With an average of 35 lines per page, the output would require approximately eight pages, and between the last two values of 1743 and 2700 we would have roughly 100 empty leading digits. The stem.leaf function has therefore collapsed all the leading digits after the hydroelectric plant producing 600 megawatts into the HI list.
Let us look at the ratio of the largest to the smallest value, which is as follows:
> max(hydroelectric)/min(hydroelectric)
[1] 192.8571
By the thumb rule, this is an indication that a data re-expression is in order. Thus, we take the log transformation (with base 10) and obtain the stem-and-leaf display of the transformed data.
> stem.leaf(round(log(hydroelectric,10),2), unit=0.01)
1 | 2: represents 0.12
leaf unit: 0.01
n: 62
1 11 | 5
2 12 | 6
13 |
24 19 | 358
(11) 20 | 00000044579
27 21 | 11335
22 22 | 24
20 23 | 110
30 |
4 31 | 5
3 32 | 44
HI: 3.43
The compactness of the stem-and-leaf display for the transformed data is indeed more useful, and we can further see that there are only about 30 leading digits. The display is also more elegant and comprehensible.
Have a go hero
The stem-and-leaf plot is, from a certain point of view, considered a particular case of the histogram. You can attempt to understand the hydroelectric distribution using a histogram too. First, obtain the histogram of the hydroelectric variable, and then repeat the exercise on its logarithmic re-expression.
Bagplot – a bivariate boxplot
In Chapter 3, Data Visualization, we saw the effectiveness of the boxplot. For independent variables, we can simply draw separate boxplots of the variables and visualize their distributions. However, when there is dependency between two variables, distinct boxplots lose the dependency between them. Thus, we need to see if there is a way to visualize such data through a boxplot. The answer to this question is provided by the bagplot, or bivariate boxplot.
The bagplot's characteristics are described in the following steps (a minimal single-pair sketch follows the list):
The depth median, denoted by * in the bagplot, is the point with the highest half space depth.
The depth median is surrounded by a polygon, called the bag, which covers n/2 observations with the largest depth.
The bag is then magnified by a factor of 3, which gives the fence. The fence is not plotted since it would drive attention away from the data.
The observations between the bag and the fence are covered by a loop.
Points outside the fence are flagged as outliers.
For the technical details of the bagplot, refer to the paper (http://venus.unive.it/romanaz/ada2/bagplot.pdf) by Rousseeuw, Ruts, and Tukey (1999).
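As promised, a minimal sketch of a single bagplot, assuming the aplpack package is installed and that the Gasoline data frame contains the mileage y and the (numeric) displacement x1:
library(aplpack)
# draw the bag, loop, and any outliers for mileage against displacement
bagplot(Gasoline$x1, Gasoline$y)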
Example 4.5.1. Bagplot for the Gasoline mileage problem: The pairs plot of the gasoline mileage problem in Example 3.2.9, The Gasoline mileage performance data, gave good insight into the nature of the data. Now, we will modify that plot and replace the upper panel with the bagplot function for a cleaner comparison of the bagplot with the scatter plot. However, in the original dataset, the variables x4, x5, and x11 are factors, which we remove from the bagplot study. The bagplot function is available in the aplpack package. We first define panel.bagplot, and then generate the matrix of scatter plots with the bagplots produced in the upper part of the display.
Time for action – the bagplot display for a multivariate dataset
1. Load the aplpack package with library(aplpack).
2. Check the default examples of the bagplot function with example(bagplot).
3. Create the panel.bagplot function with:
panel.bagplot <- function(x,y)
{
  require(aplpack)
  bagplot(x, y, verbose=FALSE, create.plot=TRUE, add=TRUE)
}
Here, the panel.bagplot function is defined to enable us to obtain the bagplot in the upper panel region of the pairs function.
4. Apply the panel.bagplot function within the pairs function on the Gasoline dataset: pairs(Gasoline[-19,-c(1,4,5,13)], upper.panel=panel.bagplot).
We obtain the following display:
Figure 5: Bagplot for the Gasoline dataset
What just happened?
We created the panel.bagplot function and plugged it into the pairs function for an effective display of the multivariate dataset. The bagplot is an important EDA tool for getting exploratory insight in the important case of a multivariate dataset.
The resistant line
In Example 3.2.3, The Michelson-Morley experiment, of Chapter 3, Data Visualization, we visualized data through the scatter plot, which indicates possible relationships between the dependent variable (y) and the independent variable (x). The scatter plot, or x-y plot, is again an EDA technique. However, we would like a more quantitative model which explains the interrelationship in a more precise manner. The traditional approach will be taken up in Chapter 7, The Logistic Regression Model. In this section, we take an EDA approach to building our first regression model.
Consider n pairs of observations: (x1, y1), (x2, y2), ..., (xn, yn). We can easily visualize the data using the scatter plot. We need to obtain a model of the form y = a + bx, where a is the intercept term and b is the slope. This model is an attempt to explain the relationship between the variables x and y. Basically, we need to obtain the values of the slope and intercept from the data. In most real data, a single line will not pass through all the n pairs of observations. In fact, it is a difficult task for the determined line to pass through even a very few observations. As a simple approach, we may choose any two observations and determine the slope and intercept. However, the difficulty lies in the choice of the two points. We will now explain how the resistant line determines the two required terms.
The scatter plot (part A of the next figure) is divided into three regions, using the x-values, where each region has approximately the same number of data points; refer to part B of the next figure. The three regions, from the left-hand to the right-hand side, are called the lower, middle, and upper regions. Note that the y-values are distributed among the three regions according to their x-values. Hence, it is possible for some y-values of the lower region to be higher than a few values in the upper region. Within each region, we find the medians of the x- and y-values independently. That is, for the lower region, the median yL is determined by the y-values falling in this region, and similarly, the median xL is determined by the x-values of the region. Similarly, the medians xM, xH, yM, and yH are determined; refer to part C of the next figure. Using these median values, we now form three pairs: (xL, yL), (xM, yM), and (xH, yH). Note that these pairs need not be actual data points.
To determine the slope b, two points suffice. The resistant line determines the slope by using the two outer pairs (xL, yL) and (xH, yH). Thus, we obtain the following:
b = (yH - yL) / (xH - xL)
For obtaining the intercept value a, we use all three pairs of medians. The value of a is determined using:
a = (1/3) * [(yL + yM + yH) - b (xL + xM + xH)]
Note that the properties of the median are exactly what make these solutions resistant. For example, the lower and upper medians are not affected by outliers at the extreme ends.
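To make the construction concrete, here is a minimal sketch of the summary-point computations on a small hypothetical dataset (the rline function used later additionally iterates on the residuals, so its answers will generally differ slightly):
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)                     # hypothetical predictor, already sorted
y <- c(2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8, 10.3)  # hypothetical response
n <- length(x)
region <- cut(seq_len(n), breaks = 3, labels = c("L", "M", "H"))  # thirds by x-order
xL <- median(x[region == "L"]); yL <- median(y[region == "L"])
xM <- median(x[region == "M"]); yM <- median(y[region == "M"])
xH <- median(x[region == "H"]); yH <- median(y[region == "H"])
b <- (yH - yL) / (xH - xL)                       # slope from the outer summary points
a <- ((yL + yM + yH) - b * (xL + xM + xH)) / 3   # intercept from all three summary points
c(intercept = a, slope = b)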
Figure 6: Understanding the resistant line. Part A shows the x-y scatter plot; part B divides it into three portions by x-values; part C marks the region medians xL, xM, xH and yL, yM, yH; part D shows how the a and b values are obtained from b = (yH - yL)/(xH - xL) and a = (1/3)[(yL + yM + yH) - b(xL + xM + xH)].
We will use the rline function from the LearnEDA package.
Example 4.6.1. Resistant line for the IO-CPU time: The CPU time is known to depend on the number of IO processes running at any given point of time. A simple dataset is available at http://www.cs.gmu.edu/~menasce/cs700/files/SimpleRegression.pdf. We aim at fitting a resistant line for this dataset.
Time for action – the resistant line as a first regression model
We use the rline function from the LearnEDA package for fitting the resistant line on a dataset.
1. Load the LearnEDA package: library(LearnEDA).
2. Understand the default example with example(rline).
3. Load the dataset: data(IO_Time).
4. Create the IO_rline resistant line object, for CPU_Time as the output and No_of_IO as the input, with IO_rline <- rline(IO_Time$No_of_IO, IO_Time$CPU_Time, iter=10) for 10 iterations.
5. Find the intercept and slope with IO_rline$a and IO_rline$b. The output will then be:
> IO_rline$a
[1] 0.2707
> IO_rline$b
[1] 0.03913
6. Obtain the scatter plot of CPU_Time against No_of_IO with plot(IO_Time$No_of_IO, IO_Time$CPU_Time).
7. Add the resistant line to the generated scatter plot with abline(a=IO_rline$a, b=IO_rline$b).
8. Finally, give a title to the plot: title("Resistant Line for the IO-CPU Time").
We then get the following screenshot:
Figure 7: Resistant line for CPU_Time
What just happened?
The rline function from the LearnEDA package fits a resistant line given the input and output vectors. It calculated the slope and intercept terms, which are driven by medians. The main advantage of the rline fit is that the model is not susceptible to outliers. We can see from the preceding screenshot that the resistant line model, IO_rline, provides a very good fit for the dataset. Well! You have created your first exploratory regression model.
Smoothing data
In The resistant line section, we constructed our first regression model for the relationship between two variables. In some instances, the x-values are so systematic that their values are almost redundant, and yet we need to understand the behavior of the y-values with respect to them. Consider the case where the x-values are equally spaced; the share price (y) at the end of each day (x) is an example where the difference between two consecutive x-values is exactly one. Here, we are more interested in smoothing the data along the y-values, as one expects more variation in their direction. Time series data is a very good example of this type. In time series data, consecutive x-values typically differ by exactly one, and hence we can describe the data compactly by yt, t = 1, 2, .... The general model may then be specified by:
yt = a + bt, t = 1, 2, ...
In the standard EDA notation, this is simply expressed as:
data = fit + residual
In the context of time series data, the model is succinctly expressed as:
data = smooth + rough
The fundamental concept of the smoothing data technique makes use of running medians. In a freehand curve, we can simply draw a smooth curve using our judgment, ignoring the out-of-curve points, and complete the picture. A computer, however, needs specific instructions for obtaining the smooth points across which it draws the curve. For a sequence of points, such as the sequence yt, the smoothing needs to be carried out over a sequence of overlapping segments. Such segments are predefined to be of a specific length. As a simple example, we may have three-length overlapping segments {y1, y2, y3}, {y2, y3, y4}, {y3, y4, y5}, and so on. Along similar lines, four-length or five-length overlapping segment sequences may be defined as required. It is within each segment that the smoothing needs to be carried out. Two popular choices of summary are the mean and the median. Of course, in exploratory analysis our natural choice is the median. Note that the median of the segment {y1, y2, y3} may be any of the y1, y2, or y3 values.
General smoothing techniques, such as LOESS, are nonparametric techniques and require good expertise in the subject. The ideas discussed here are mainly driven by the median as the core technique.
A three-moving median cannot correct for two or more consecutive outliers and, similarly, a five-moving median cannot correct for three or more consecutive outliers, and so on. A solution, or workaround in an engineer's language, for this is to continue smoothing the sequence obtained in the previous iteration until there is no further change in the smooth part. We may also consider a moving median of span 4; here, the median is the average of the two mid-points. However, considering that the x-values are integers, the four-moving median does not actually correspond to any of the time points t. Using the simplicity principle, it is easily possible to re-center the points at t by taking a two-moving median of the values obtained in the four-moving-median step.
The notation for the first iteration in EDA is simply the number 3, or 5 or 7 as used. The notation for repeated smoothing is 3R, where R stands for repetitions. For a four-moving median re-centered by a two-moving median, the notation is 42. On many occasions, a smoothing operation giving more refinement than 42 may be desired. It is on such occasions that we may use the running weighted average, which gives different weights to the points within a span. Here, each point is replaced by a weighted average of the neighboring points. A popular choice of weights for a running weighted average of 3 is (1/4, 1/2, 1/4), and this smoothing process is referred to as hanning. The hanning process is denoted by H.
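A minimal sketch of hanning on a short hypothetical sequence, using stats::filter to compute the running weighted average with weights (1/4, 1/2, 1/4):
y <- c(5, 7, 6, 9, 15, 8, 7, 6, 8)                    # hypothetical rough sequence
y_han <- stats::filter(y, filter = c(1/4, 1/2, 1/4))  # centered weighted average of span 3
cbind(original = y, hanned = y_han)                   # the two end points remain NA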
Since the running median smoothens the data sequence a bit more than appropriate, and hence may remove interesting patterns, those patterns can be recovered from the residuals, which in this context are called the rough. This is achieved by smoothing the rough sequence and adding it back to the smooth sequence. This operation is called reroughing. Velleman and Hoaglin (1984) point out that the smoothing process which performs better in general is 4253H, twice. That is, we first begin with a running median of 4, which is re-centered by 2. The re-smoothing is then done by 5 followed by 3, and the outliers are removed by H. Finally, reroughing is carried out by smoothing the roughs and then adding them to the smoothed sequence. This full cycle is denoted by 4253H, twice. Unfortunately, we are not aware of any R function or package which implements the 4253H smoother. The options available in the LearnEDA package are 3RSS and 3RSSH.
We have not yet explained what the smoothers 3RSS and 3RSSH are. The 3R smoothing chops off peaks and valleys and leaves behind mesas and dales two points long. What does this mean? A mesa refers to an area of high land with a flat top and two or more steep, cliff-like sides, whereas a dale refers to a valley. To overcome this problem, a special splitting, S, is used at each two-point mesa and dale, where the data is split into three pieces: the two-point flat segment, the smooth data to the left of the two points, and the smooth sequence to their right. Now, let yf-1, yf refer to the two-point flat segment, and yf+1, yf+2 refer to the smooth sequence next to it. The S technique then predicts the value of yf-1 as if it were on the straight line formed by yf+1 and yf+2; a simple way is to obtain yf-1 as 3yf+1 - 2yf+2. The yf value is then obtained as the median of the predicted yf-1, yf+1, and yf+2. After removing all the mesas and dales, we again repeat the 3R cycle. Thus, we have the notation 3RSS, and the reader can now easily connect with what 3RSSH means. Now, we will obtain the 3RSS smooth for the cow temperature dataset of Velleman and Hoaglin.
Example 4.7.1. Smoothing data for the cow temperature dataset: The temperature of a cow is measured at 6:30 a.m. for 75 consecutive days. We will use the smooth function from the stats package and the han function from the LearnEDA package to achieve the required smoothing sequence. We will build the necessary R program in the forthcoming action list.
Time for action – smoothening the cow temperature data
First, we use the smooth function from the stats package on the cow temperature dataset. Next, we will use the han function from LearnEDA.
1. Load the cow temperature data in R with data(CT).
2. Plot the time series data using the plot.ts function: plot.ts(CT$Temperature, col="red", pch=1).
3. Create a 3RSS object for the cow temperature data using the smooth function and the kind option: CT_3RSS <- smooth(CT$Temperature, kind="3RSS").
4. Han the preceding 3RSS object using the han function from the LearnEDA package: CT_3RSSH <- han(smooth(CT$Temperature, kind="3RSS")).
5. Superimpose a line of the 3RSS data points with lines.ts(CT_3RSS, col="blue", pch=2).
6. Superimpose a line of the hanned 3RSS data points with lines.ts(CT_3RSSH, col="green", pch=3).
7. Add a meaningful legend to the plot: legend(20, 90, c("Original","3RSS","3RSSH"), col=c("red","blue","green"), pch="___").
We get a useful smoothened plot of the cow temperature data as follows:
Smoothening cow temperature data
The original plot shows a lot of variation in the cow temperature measurements. The 3RSS smoother shows many sharp edges in comparison with the 3RSSH smoother, though it is itself a lot smoother than the original display. The plot further indicates that there has been a marked decrease in the cow temperature measurements from about the fifteenth day of observation. This is confirmed by all three displays.
What just happened?
The discussion of the smoothing functions looked very promising in the theoretical development. We took a real dataset and saw its time series plot. Then we plotted two versions of the smoothing process and found both to be much smoother than the original plot.
Median polish
In Example 4.6.1, Resistant line for the IO-CPU time, we had IO as the only independent variable explaining the variation in the CPU time. In many practical problems, the dependent variable depends on more than one independent variable. In such cases, we need to factor in the effect of these independent variables using a single model. When we have two independent variables, median polish helps in building a robust model. A data display in which the rows and columns hold the different levels of two factors is called a two-way table. Here, the table entries are the values of the dependent variable.
An appropriate model for the two-way table is given by:
yij = α + βi + γj + εij
Here, α is the intercept term, βi denotes the effect of the i-th row, γj the effect of the j-th column, and εij is the error term. All the parameters are unknown. We need to find the unknown parameters through the EDA approach. The basic idea is to use row medians and column medians for obtaining the row and column effects, and then find the overall intercept term. Any unexplained part of the data is considered as the residual.
Time for action – the median polish algorithm
The median polish algorithm (refer to http://www-rohan.sdsu.edu/~babailey/stat696/medpolishalg.html) is given next:
1. Obtain the row medians of the two-way table and append them to the right-hand side of the data matrix. From each element of every row, subtract the respective row median.
2. Find the median of the row medians and record it as the initial grand effect value. Also, subtract the initial grand effect value from each row median.
3. For the original data columns in the previously appended matrix, obtain the column medians and append them to the bottom of the matrix. As in step 1, subtract from each column element the corresponding column median.
4. For the bottom row of column medians in the previous table, obtain the median, and then add the obtained value to the initial grand effect value. Next, subtract the modified grand effect value from each of the column medians.
5. Iterate steps 1-4 until the changes in the row or column medians are insignificant.
We use the medpolish function from the stats library for the computations involved in median polish. For more details about the model, you can refer to Chapter 8 of Velleman and Hoaglin (1984).
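Before the real dataset, a minimal sketch of medpolish on a tiny hypothetical 3 x 3 two-way table helps connect the preceding algorithm to the R output:
toy <- matrix(c(10, 12, 14,
                11, 13, 16,
                 9, 12, 13),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("Row", 1:3), paste0("Col", 1:3)))
toy_mp <- medpolish(toy)   # iterates the row/column median sweeps
toy_mp$overall             # grand effect
toy_mp$row                 # row effects
toy_mp$col                 # column effects
toy_mp$residuals           # what the additive fit leaves behind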
Example 4.8.1. Male death rates: The dataset relates to the male death rate per 1000, by cause of death and the average amount of tobacco smoked daily, and is available on page 221 of Velleman and Hoaglin (1984). Here, the row effect is due to the cause of death, whereas the columns constitute the amount of tobacco smoked (in grams). We are interested in modeling the effect of these two variables on the male death rates in the region.
> data(MDR)
> MDR2 <- as.matrix(MDR[,2:5])
> rownames(MDR2) <- c("Lung", "UR", "Sto", "CaR", "Prost", "Other_Lung", "Pul_TB", "CB", "RD_Other", "CT", "Other_Cardio", "CH", "PU", "Viol", "Other_Dis")
> MDR_medpol <- medpolish(MDR2)
1 : 8.38
2 : 8.17
Final: 8.1625
> MDR_medpol$row
        Lung           UR          Sto          CaR        Prost   Other_Lung 
      0.1200      -0.4500      -0.2800      -0.0125      -0.3050       0.2050 
      Pul_TB           CB     RD_Other           CT Other_Cardio           CH 
     -0.3900      -0.2050       0.0000       4.0750       1.6875       1.4725 
          PU         Viol    Other_Dis 
     -0.4325       0.0950       0.9650 
> MDR_medpol$col
     G0     G14     G24     G25 
-0.0950  0.0075 -0.0050  0.1350 
> MDR_medpol$overall
[1] 0.545
> MDR_medpol$residuals
G0 G14 G24 G25
Lung -0.5000 -0.2025 2.000000e-01 0.8600
UR 0.0000 0.0275 0.000000e+00 -0.0200
Sto 0.2400 0.0875 -1.600000e-01 -0.0900
CaR 0.0025 0.0000 -1.575000e-01 0.0725
Prost 0.4050 0.0125 -1.500000e-02 -0.0350
Other_Lung -0.0150 -0.0375 1.500000e-02 0.1350
Pul_TB -0.0600 -0.0025 3.000000e-02 0.0000
CB -0.1250 -0.0575 5.500000e-02 0.2450
RD_Other 0.2400 -0.0025 1.387779e-17 -0.2800
CT -0.3050 0.0125 -1.500000e-02 1.2350
Other_Cardio 0.0925 -0.0900 2.425000e-01 -0.1175
CH 0.0875 -0.0850 -1.525000e-01 0.1775
PU -0.0175 0.0200 5.250000e-02 -0.0275
Viol -0.1250 0.1725 -1.850000e-01 0.1250
Other_Dis 0.0350 0.2925 -3.500000e-02 -0.0750
What just happened?
The output associated with MDR_medpol$row gives the row effects, while MDR_medpol$col gives the column effects. The negative value of -0.0950 for the non-consumers of tobacco shows that the male death rate is lower for this group, whereas the positive values of 0.0075 and 0.1350 for the groups under 14 grams and above 25 grams respectively are an indication that tobacco consumers are more prone to death.
Have a go hero
For the variables G0 and G25 in the MDR2 matrix object, obtain a back-to-back stem-and-leaf display.
Summary
The median and its variants form the core measures of EDA, and you should have got a hang of them by the first section. The visualization techniques of EDA also comprise more than just the stem-and-leaf plot, letter values, and bagplot. As EDA is basically about your attitude and approach, it is important to realize that you can (and should) use any method that is instinctive and appropriate for the data on hand. We have also built our first regression model in the resistant line and seen how robust it is to outliers. Smoothing data and median polish are further advanced EDA techniques with which the reader became acquainted in their respective sections.
EDA is exploratory in nature and its findings may need further statistical validation. The next chapter on statistical inference addresses what Tukey calls confirmatory analysis. In particular, we look at techniques which give good point estimates of the unknown parameters. This is then backed with further techniques such as goodness-of-fit and confidence intervals for the probability distribution and the parameters respectively. After estimation, it is often required to verify whether the parameters meet certain specified levels. This problem is addressed through hypothesis testing in the next chapter.
Statistical Inference
In the previous chapter, we came across numerous tools that gave first
insights of exploratory evidence into the distribution of datasets through visual
techniques as well as quantitative methods. The next step is the translation
of these exploratory results to confirmatory ones and the topics of the
current chapter pursue this goal. In the Discrete distributions and Continuous
distributions sections of Chapter 1, Data Characteristics, we came across many
important families of probability distribution. In practical scenarios, we have
data on hand and the goal is to infer about the unknown parameters of the
probability distributions. This chapter focuses on one method of inference
for the parameters using the maximum likelihood estimator (MLE). Another
way of approaching this problem is by fitting a probability distribution for
the data. The MLE is a point estimate of the unknown parameter that needs
to be supplemented with a range of possible values. This is achieved through
confidence intervals. Finally, the chapter concludes with the important topic
of hypothesis testing.
You will learn the following things after reading through this chapter:
Visualizing the likelihood function and identifying the MLE
Fitting the most plausible statistical distribution for a dataset
Confidence intervals for the estimated parameters
Hypothesis testing of the parameters of a statistical distribution
Using exploratory techniques we had our first exposure to understanding a dataset. As an example, in the octane dataset we found that the median of Method_2 was larger than that of Method_1. As explained in the previous chapter, we need to confirm whatever exploratory findings we obtained from a dataset. Recall that the histograms and stem-and-leaf displays suggest a normal distribution. A question that arises then is how do we assert the central values, typically the mean, of a normal distribution, and how do we conclude that the average of the Method_2 procedure exceeds that of Method_1. The former question is answered by estimation techniques and the latter by testing hypotheses. This forms the core of statistical inference.
Maximum likelihood estimator
Let us consider the discrete probability distributions as seen in the Discrete distributions section of Chapter 1, Data Characteristics. We saw that a binomial distribution is characterized by the parameters n and p, the Poisson distribution by $\lambda$, and so on. Here, the parameters completely determine the probabilities of the x values. However, when the parameters are unknown, which is the case in almost all practical problems, we collect data for the random experiment and try to infer about the parameters. This is essentially inductive reasoning, and the subject of statistics is essentially inductively driven, as opposed to the deductive reasoning of mathematics. This forms the core difference between the two beautiful subjects. Assume that we have n observations $X_1, X_2, \ldots, X_n$ from an unknown probability distribution $f(x, \theta)$, where $\theta$ may be a scalar or a vector whose values are not known. Let us consider a few important definitions that form the core of statistical inference.
Random sample: If the observations $X_1, X_2, \ldots, X_n$ are independent of each other, we say that they form a random sample from $f(x, \theta)$. A technical consequence of the observations forming a random sample is that their joint probability density (mass) function can be written as the product of the individual density (mass) functions. If the unknown parameter $\theta$ is the same for all the n observations, we say that we have an independent and identically distributed (iid) sample.
Let X denote the score of Rahul Dravid in a century innings, and let $X_i$ denote the runs scored in the i-th century, i = 1, 2, ..., 36. The assumption of independence is then appropriate for all the values of $X_i$. Consider the problem of R software installation on 10 different computers of the same configuration. Let X denote the time it takes for the software to install. Here again, it may easily be seen that the installation times on the 10 machines, $X_1, \ldots, X_{10}$, are identical (same configuration of the computers) and independent. We will use the vector notation here to represent a sample of size n, $X = (X_1, X_2, \ldots, X_n)$, for the random variables, and denote the realized values of the random variables in lower case, $x = (x_1, x_2, \ldots, x_n)$, with $x_i$ representing the realized value of the random variable $X_i$. All the required tools are now ready, which enable us to define the likelihood function.
Likelihood function: Let $f(x, \theta)$ be the joint pmf (or pdf) for an iid sample of n observations of X. Here, the pmf and pdf respectively correspond to the discrete and continuous random variables. The likelihood function is then defined by:

$$L(\theta \mid x) = f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$
Of course, the reader may be amused about the difference between a likelihood function and a pmf (or pdf). The pmf is to be seen as a function of x given that the parameters are known, whereas in the likelihood function we look at a function where the parameters are unknown with x being known. This distinction is vital as we are looking for a tool where we do not know the parameters. The likelihood function may be interpreted as the probability function of $\theta$ conditioned on the value of x, and this is the main reason for identifying that value of $\theta$, say $\hat{\theta}$, which leads to the maximum of $L(\theta \mid x)$, that is, $L(\hat{\theta} \mid x) \geq L(\theta \mid x)$ for all $\theta$. Let us visualize the likelihood function for some important families of probability distribution. The importance of visualizing the likelihood function is emphasized in Chapter 7, The Logistic Regression Model, and Chapters 1-4 of Pawitan (2001).
Visualizing the likelihood function
We saw a few plots of the pmf/pdf in the Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics. Recall that we were plotting the pmf/pdf over the range of x. In those examples, we had assumed certain values for the parameters of the distributions. For the problems of statistical inference, we typically do not know the parameter values. Thus, the likelihood functions are plotted against the plausible parameter values $\theta$. What does this mean? For example, the pmf for a binomial distribution is plotted for x values ranging from 0 to n. However, the likelihood function needs to be plotted against p values ranging over the unit interval [0, 1].
Example 5.1.1. The likelihood function of a binomial distribution: A box of electronic chips is known to contain a certain number of defective chips. Suppose we take a random sample of n chips from the box and make a note of the number of non-defective chips. The probability of a non-defective chip is p, and that of a defective one is 1 - p. Let X be a random variable which takes the value 1 if the chip is non-defective and 0 if it is defective. Then X ~ b(1, p), where p is not known. Define $t_x = \sum_{i=1}^{n} x_i$. The likelihood function is then given by:

$$L(p \mid t_x, n) = \binom{n}{t_x} p^{t_x} (1-p)^{n - t_x}$$
Stascal Inference
[ 132 ]
Suppose that the observed value of $t_x$ is 7, that is, we have 7 successes out of 10 trials. Now, the purpose of likelihood inference is to understand the probability distribution of p given the data $t_x$. This gives us an idea about the most plausible value of p, and hence it is worthwhile to visualize the likelihood function $L(p \mid t_x, n)$.
Example 5.1.2. The likelihood function of a Poisson distribution: The number of accidents at a particular traffic signal of a city, the number of flight arrivals during a specific time interval at an airport, and so on are some of the scenarios where the assumption of a Poisson distribution is appropriate to explain the counts. Now let us consider a sample from a Poisson distribution. Suppose that the number of flight arrivals at an airport during the duration of an hour follows a Poisson distribution with an unknown rate $\lambda$. Suppose that we have the number of arrivals over ten distinct hours as 1, 2, 2, 1, 0, 2, 3, 1, 2, and 4. Using this data, we need to infer about $\lambda$. Towards this we will first plot the likelihood function. The likelihood function for a random sample of size n is given by:

$$L(\lambda \mid x) = \frac{e^{-n\lambda} \lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}$$
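As a cross-check on the formula above, the likelihood can also be computed directly in R by multiplying the individual Poisson probabilities over a grid of $\lambda$ values. This is a minimal sketch, not the book's code; the grid and object names are illustrative.

# Direct evaluation of the Poisson likelihood over a grid of lambda values
x <- c(1, 2, 2, 1, 0, 2, 3, 1, 2, 4)
lambda_seq <- seq(0.1, 5, by = 0.1)
lik <- sapply(lambda_seq, function(l) prod(dpois(x, lambda = l)))
lambda_seq[which.max(lik)]   # grid value maximizing the likelihood, close to mean(x) = 1.8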
Before we consider the R program for visualizing the likelihood function for the samples from the binomial and Poisson distributions, let us look at the likelihood function for a sample from the normal distribution.
Example 5.1.3. The likelihood function of a normal distribution: The CPU_Time variable from IO_Time may be assumed to follow a normal distribution. For this problem, we will simulate n = 25 observations from a normal distribution; for more details about the simulation, refer to the next chapter. Though we simulate the n observations with mean 10 and standard deviation 2, we will pretend that we do not actually know the mean value, with the assumption that the standard deviation is known to be 2. The likelihood function for a sample from a normal distribution with a known standard deviation is given by:
$$L(\mu \mid x, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right\}$$
In our particular example, with $\sigma = 2$, it is:

$$L(\mu \mid x, \sigma = 2) = \frac{1}{(8\pi)^{n/2}} \exp\left\{-\frac{1}{8} \sum_{i=1}^{n} (x_i - \mu)^2\right\}$$

It is time for action!
Time for action – visualizing the likelihood function
We will now visualize the likelihood functions for the binomial, Poisson, and normal distributions discussed before:
1. Inialize the graphics windows for the three samples using par(mfrow= c(1,3)).
2. Declare the number of trials n and the number of success x by n <- 10; x <- 7.
3. Set the sequence of p values with p_seq <- seq(0,1,0.01).
For p_seq, obtain the probabilies for n = 10 and x = 7 by using the dbinom
funcon: dbinom(x=7,size=n,prob=p_seq).
4. Next, obtain the likelihood funcon plot by running plot(p_seq, dbinom(
x=7,size=n,prob=p_seq), xlab="p", ylab="Binomial Likelihood
Function", "l").
5. Enter the data for the Poisson random sample into R using x <- c(1,2,2,1,
0,2,3,1,2,4) and the number of observaons by n <- length(x).
6. Declare the sequence of possible
λ
values through lambda_seq <-
seq(0,5,0.1).
7. Plot the likelihood funcon for the Poisson distribuon with plot( lambda_seq,
dpois(x=sum(x),lambda=n*lambda_seq)…).
We are generang random observaons from a normal distribuon using the rnorm
funcon. Each run of the rnorm funcon results in dierent values and hence to
ensure that you are able to reproduce the exact output as produced here, we will
set the inial seed for random generaon tool with set.seed(123).
8. For generaon of random numbers, x the seed value with set.seed(123).
This is to simply ensure that we obtain the same result.
9. Simulate 25 observaons from the normal distribuon with mean 10 and standard
deviaon 2 using n<-25; xn <- rnorm(n,mean=10,sd=2).
10. Consider the following range of
µ
values mu_seq <- seq(9,11,0.05).
11. Plot the normal likelihood funcon with plot(mu_seq,dnorm(x= mean(xn),
mean=mu_seq,sd=2) ).
The detailed code for the preceding action is now provided:
# Time for Action: Visualizing the Likelihood Function
par(mfrow=c(1,3))
# # Visualizing the Likelihood Function of a Binomial Distribution.
n <- 10; x <- 7
p_seq <- seq(0,1,0.01)
Stascal Inference
[ 134 ]
plot(p_seq, dbinom(x=7,size=n,prob=p_seq), xlab="p", ylab="Binomial Likelihood Function", "l")
# Visualizing the Likelihood Function of a Poisson Distribution.
x <- c(1, 2, 2, 1, 0, 2, 3, 1, 2, 4); n = length(x)
lambda_seq <- seq(0,5,0.1)
plot(lambda_seq,dpois(x=sum(x),lambda=n*lambda_seq),
xlab=expression(lambda),ylab="Poisson Likelihood Function", "l")
# Visualizing the Likelihood Function of a Normal Distribution.
set.seed(123)
n <- 25; xn <- rnorm(n,mean=10,sd=2)
mu_seq <- seq(9,11,0.05)
plot(mu_seq,dnorm(x=mean(xn),mean=mu_seq,sd=2),"l",
xlab=expression(mu),ylab="Normal Likelihood Function")
Run the preceding code in your R session. You will find an identical copy of the next plot on your computer screen too. What does the plot tell us? The likelihood function for the binomial distribution has very small values up to 0.4, then it gradually peaks at around 0.7 and then declines sharply. This means that the values in the neighborhood of 0.7 are more likely to be the true value of p than points away from it. Similarly, the likelihood function plot for the Poisson distribution says that $\lambda$ values less than 1 and greater than 3 are very unlikely to be the true value of the actual $\lambda$. The peak of the likelihood function appears at a value a little less than 2. The interpretation of the normal likelihood function is left as an exercise to the reader.
Figure 1: Some likelihood functions
What just happened?
We took our rst step in the problem of esmaon of parameters. Visualizaon of the
likelihood funcon is a very important aspect and is oen overlooked in many introductory
textbooks. Moreover, and as it is natural, we did it in R!
Finding the maximum likelihood estimator
The likelihood function plot indicates the plausibility of the data-generating mechanism for different values of the parameters. Naturally, the value of the parameter for which the likelihood function attains its highest value is the most likely value of the parameter. This forms the crux of maximum likelihood estimation.
The value of $\theta$ that leads to the maximum value of the likelihood function $L(\theta \mid x)$ is referred to as the maximum likelihood estimate, abbreviated as MLE.
For the reader familiar with numerical optimization, it is not a surprise that calculus is useful for finding the optimum value of a function. However, we will not indulge in more mathematics than is required here. We will note some finer aspects of numerical optimization. For an independent sample of size n, the likelihood function is a product of n functions, and it is very likely that we may soon end up in the mathematical world of intractable functions. To a large extent, we can circumvent this problem by resorting to the logarithm of the function, which transforms the problem of optimizing a product of functions into that of a sum of functions. That is, we will focus on optimizing $\log L(\theta \mid x)$ instead of $L(\theta \mid x)$.
An important consequence of using the logarithm is that the maximization of a product function translates into that of a sum function, since log(ab) = log(a) + log(b). It may also be seen that the maximum point of the likelihood function is preserved under the logarithm transformation, since for a > b, log(a) > log(b). Further, many numerical techniques know how to minimize a function rather than maximize it. Thus, instead of maximizing the log-likelihood function $\log L(\theta \mid x)$, we will minimize $-\log L(\theta \mid x)$ in R.
The R package stats4 provides the mle function, which returns the MLE. There is a host of probability distributions for which it is possible to obtain the MLE. We will continue the illustrations for the examples considered earlier in the chapter.
Example 5.1.4. Finding the MLE of a binomial distribution (continuation of Example 5.1.1): The negative log-likelihood function of the binomial distribution, sans the constant term (the combinatorial term is excluded since its value is independent of p), is given by:

$$-\log L(p \mid t_x, n) = -t_x \log p - (n - t_x) \log(1 - p)$$
Stascal Inference
[ 136 ]
The maximum likelihood estimator of p, obtained by differentiating the preceding equation with respect to p, equating the result to zero, and then solving the equation, is given by the sample proportion:

$$\hat{p} = \frac{t_x}{n}$$
An estimator of a parameter is denoted by accentuating the parameter with a hat. Though this is very easy to compute, we will resort to the useful function mle.
Example 5.1.5. MLE of a Poisson distribution (continuation of Example 5.1.2): The negative log-likelihood function, up to an additive constant, is given by:

$$-\log L(\lambda \mid x) = n\lambda - \log \lambda \sum_{i=1}^{n} x_i$$

The MLE for $\lambda$ admits a closed form, which can be obtained from calculus arguments, and it is given by:

$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n}$$
To obtain the MLE, we need to write exclusive code for the negative log-likelihood function. For the normal distribution, we will also use the mle function. There is another method of finding the MLE than the mle function available in the stats4 package. We consider it next. The R code will be given in the forthcoming action.
Using the tdistr function
In the previous examples, we needed to explicitly specify the negave log-likelihood
funcon. The fitdistr funcon from the MASS package can be used to obtain the
unknown parameters of a probability distribuon, for a list of the probability funcons for
which it applies see ?fitdistr, and the fact that it uses the maximum likelihood ng
complements our approach in this secon.
Example 5.1.6. MLEs for Poisson and normal distributions: In the next action, we will use the fitdistr function from the MASS package for obtaining the MLEs in Example 5.1.2 and Example 5.1.3. In fact, using this function, we get the answers readily without the need to specify the negative log-likelihood explicitly.
Time for action – finding the MLE using the mle and fitdistr functions
The mle function from the stats4 package will be used for obtaining the MLE for popular distributions such as the binomial, normal, and so on. The fitdistr function will be used too, which fits the distributions using MLEs.
1. Load the stats4 package with library(stats4).
2. Specify the number of successes in a vector format and the number of observations with x <- rep(c(0,1),c(3,7)); n <- length(x).
3. Define the negative log-likelihood function with a function:
binomial_nll <- function(prob) -sum(dbinom(x,size=1,prob,log=TRUE))
The code works as follows. The dbinom function is invoked from the stats package and the option log=TRUE is exercised to indicate that we need the log of the probability (actually likelihood) values. The dbinom function returns a vector of probabilities for all the values of x. The sum, multiplied by -1, now returns the value of the negative log-likelihood.
4. Now, enter fit_binom <- mle(binomial_nll,start=list(prob=0.5),nobs=n) on the R console. Now, mle as a function optimizes the binomial_nll function defined in the previous step. Initial values, a guess or a legitimate value, are specified for the start option, and we also declare the number of observations available for this problem.
5. summary(fit_binom) will give details of the mle function applied on binomial_nll. The output is displayed in the next screenshot.
6. Specify the data for the Poisson distribution problem in x <- c(1,2,2,1,0,2,3,1,2,4); n <- length(x).
7. Define the negative log-likelihood function along parallel lines to the binomial distribution:
pois_nll <- function(lambda) -sum(dpois(x,lambda,log=TRUE))
8. Explore different options of the mle function by specifying the method, a guess of the least and most values of the parameter, and the initial value as the median of the observations:
fit_poisson <- mle(pois_nll,start=list(lambda=median(x)),nobs=n, method = "Brent", lower = 0, upper = 10)
9. Get the answer by entering summary(fit_poisson).
10. Define the negative log-likelihood function for the normal distribution by:
normal_nll <- function(mean) -sum(dnorm(xn,mean,sd=2,log=TRUE))
Stascal Inference
[ 138 ]
11. Find the MLE of the normal distribution with fit_normal <- mle(normal_nll, start=list(mean=8),nobs=n).
12. Get the final answer with summary(fit_normal).
13. Load the MASS package: library(MASS).
14. Fit the x vector with a Poisson distribution by running fitdistr(x,"poisson") in R.
15. Fit the xn vector with a normal distribution by running fitdistr(xn,"normal"); a quick cross-check against the closed-form MLEs follows.
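Since the closed-form MLEs derived earlier are the sample means (for the Poisson rate and for the normal location), it is easy to verify that fitdistr returns the same values. The following is a small cross-check sketch, assuming x and xn are the vectors created in the preceding steps.

# Cross-check: the fitted Poisson rate equals mean(x),
# and the fitted normal mean equals mean(xn)
library(MASS)
fitdistr(x, "poisson")$estimate    # should agree with mean(x)
mean(x)
fitdistr(xn, "normal")$estimate    # the "mean" component should agree with mean(xn)
mean(xn)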
Figure 2: Finding the MLE and related summaries
What just happened?
You have explored the possibility of finding the MLEs for many standard distributions using mle from the stats4 package and fitdistr from the MASS package. The main key to obtaining the MLE is the right construction of the (negative) log-likelihood function.
Condence intervals
The MLE is a point estimate and as such, on its own, it is almost of no practical use. It would be more appropriate to give a coverage of parameter points which is most likely to contain the true unknown parameter. A general practice is to specify the coverage of the points through an interval and then consider specific intervals which have a specified probability. A formal definition is in order.
A confidence interval for a population parameter is an interval that is predicted to contain the parameter with a certain probability.
The common choice is to obtain either 95 percent or 99 percent confidence intervals. It is common to specify the coverage of the confidence interval through a significance level $\alpha$ (more about this in the next section), which is a small number close to 0. The 95 percent and 99 percent confidence intervals then correspond to $100(1-\alpha)$ percent intervals with respective $\alpha$ equal to 0.05 and 0.01. In general, a $100(1-\alpha)$ percent confidence interval says that if the experiment is performed many times over, we expect the resulting intervals to contain the true parameter $100(1-\alpha)$ percent of the time.
Example 5.2.1. Condence interval for binomial proporon: Consider n Bernoulli trials
1,..., n
X X
with the probability of success being p. We saw earlier that the MLE of p is:
ˆx
t
p
n
=
where
1
n
x i
i
t x
=
=
. Theorecally, the expected value of
ˆ
p
is p and its standard deviaon is
( )
1 /p p n. An esmate of the standard deviaon is
( )
ˆ ˆ
1 /p p n. For large n and when both
np and np(1-p) are greater than 5, using a normal approximaon, by virtue of the central
limit theorem, a
( )
100 1
α
percent condence interval for p is given by:
( ) ( )
/2 /2
ˆ ˆ ˆ ˆ
1 1
ˆ ˆ
,
p p p p
p z p z
n n
α α
 
− −
 
− +
 
 
where
/2
z
α
is the
/2
α
quanle of the standard normal distribuon. The condence
intervals obtained by using the normal approximaon are not reliable when the p value is
near 0 or 1. Thus, if the lower condence limit falls below 0, or the upper condence limit
exceeds 0, we will adapt the convenon of taking them as 0 and 1 respecvely.
Stascal Inference
[ 140 ]
Example 5.2.2. Condence interval for normal mean with known variance: Consider a
random sample of size n from a normal distribuon with an unknown mean
µ
and a known
standard deviaon
σ
. It may be shown that the MLE of mean
µ
is the sample mean
1
/
n
i
i
X x n
=
=
and that the distribuon of
X
is again normal with mean
µ
and standard deviaon
/n
σ
. Then, the
( )
100 1
α
percent condence interval is given by:
/2 /2
,X Z X Z
n n
α α
σ σ
 
− +
 
 
where
/2
z
α
is the
/2
α
quanle of the standard normal distribuon. The width of the
preceding condence interval is /2
2 /z n
α
σ
. Thus, when the sample size is increased by
four mes, the width will decrease by half.
Example 5.2.3. Condence interval for normal mean with unknown variance: We connue
with a sample of size n. When the variance is not known, the steps become very dierent.
Since the variance is not known, we replace it by the sample variance:
( )
2
2
1
1
1
n
i
i
S X X
n=
= −
The denominator is n - 1 since we have already esmated
µ
using the n observaons.
To develop the condence interval for
µ
, consider the following stasc:
2
/
X
T
S n
µ
=
This new stasc T has a t-distribuon with n - 1 degrees of freedom. The
( )
100 1
α
percent
condence interval for
µ
is then given by the following interval:
1, /2 1, /2
,
n n
S S
X t X t
n n
α α
− −
 
− +
 
 
where
1, /2n
t
α
is the
/2
α
quanle of a t random variable with n - 1 degrees of freedom.
We will create functions for obtaining the confidence intervals for the preceding three examples. Many statistical tests in R return confidence intervals at desired levels. However, we will encounter these tests in the last section of the chapter, and hence, up to that point, we will confine ourselves to user-defined functions and applications.
Time for action – confidence intervals
We create functions that will enable us to obtain confidence intervals of the desired level:
1. Create a function for obtaining confidence intervals for the proportion of a binomial distribution with the following function:
binom_CI = function(x, n, alpha) {
  phat = x/n
  ll = phat - qnorm(alpha/2,lower.tail=FALSE)*sqrt(phat*(1-phat)/n)
  ul = phat + qnorm(alpha/2,lower.tail=FALSE)*sqrt(phat*(1-phat)/n)
  return(paste("The ", 100*(1-alpha), "% Confidence Interval for Binomial Proportion is (", round(ll,4), ",", round(ul,4), ")", sep=''))
}
The arguments of the function are x, n, and alpha. That is, the user of the function needs to specify the number of successes x out of the n Bernoulli trials, and the significance level $\alpha$. First, we obtain the MLE $\hat{p}$ of the proportion p by calculating phat = x/n. To obtain the value of $z_{\alpha/2}$, we use the quantile function qnorm(alpha/2, lower.tail=FALSE). The quantity $\sqrt{\hat{p}(1-\hat{p})/n}$ is computed with sqrt(phat*(1-phat)/n). The rest of the code for ll and ul is self-explanatory. We use the paste function together with the return function to get the output in a convenient format.
2. Consider the data in Example 5.2.1, where we have x = 7 and n = 10. Suppose that we require 95 percent and 99 percent confidence intervals. The respective $\alpha$ values for these confidence intervals are 0.05 and 0.01. Let us execute the binom_CI function on this data. That is, we need to run binom_CI(x=7,n=10,alpha=0.05) and binom_CI(x=7,n=10,alpha=0.01) on the R console. The output will be as shown in the next screenshot.
Thus, (0.416, 0.984) is the 95 percent confidence interval for p and (0.3267, 1.0733) is the 99 percent confidence interval for it. Since the upper confidence limit exceeds 1, we will use (0.3267, 1) as the 99 percent confidence interval for p.
3. We rst give the funcon for construcon of condence intervals for the mean
µ
of a normal distribuon when the standard deviaon is known:
normal_CI_ksd = function(x,sigma,alpha) {
xbar = mean(x)
n = length(x)
ll = xbar-qnorm(alpha/2,lower.tail=FALSE)*sigma/sqrt(n)
ul = xbar+qnorm(alpha/2,lower.tail=FALSE)*sigma/sqrt(n)
return(paste("The ", 100*(1-alpha),"% Confidence Interval for
the Normal mean is (", round(ll,4),",",round(ul,4),")",sep=''))
}
Stascal Inference
[ 142 ]
The function normal_CI_ksd works differently from the earlier binomial one. Here, we provide the entire data to the function and specify the known value of the standard deviation and the significance level. First, we obtain the MLE $\bar{X}$ of the mean $\mu$ with xbar = mean(x). The R code qnorm(alpha/2, lower.tail=FALSE) is used to obtain $z_{\alpha/2}$. Next, $\sigma/\sqrt{n}$ is computed by sigma/sqrt(n). The code for ll and ul is straightforward to comprehend. The return and paste functions have the same purpose as in the previous example. Compile the code for the normal_CI_ksd function.
4. Let us see a few examples, in continuation of Example 5.1.3, for obtaining the confidence interval for the mean of a normal distribution with the standard deviation known. To obtain the 95 percent and 99 percent confidence intervals for the xn data, where the standard deviation was known to be 2, run normal_CI_ksd(x=xn,sigma=2,alpha=0.05) and normal_CI_ksd(x=xn,sigma=2,alpha=0.01) on the R console. The output is consolidated in the next screenshot.
Thus, the 95 percent confidence interval for $\mu$ is (9.1494, 10.7173) and the 99 percent confidence interval is (8.903, 10.9637).
5. Create a function, normal_CI_uksd, for obtaining the confidence intervals for $\mu$ of a normal distribution when the standard deviation is unknown:
normal_CI_uksd = function(x,alpha) {
  xbar = mean(x); s = sd(x)
  n = length(x)
  ll = xbar - qt(alpha/2,n-1,lower.tail=FALSE)*s/sqrt(n)
  ul = xbar + qt(alpha/2,n-1,lower.tail=FALSE)*s/sqrt(n)
  return(paste("The ", 100*(1-alpha), "% Confidence Interval for the Normal mean is (", round(ll,4), ",", round(ul,4), ")", sep=''))
}
We have an additional computation here in comparison with the earlier function. Since the standard deviation is unknown, we estimate it with s = sd(x). Furthermore, we need to obtain the quantile from the t-distribution with n - 1 degrees of freedom, and hence we have qt(alpha/2,n-1,lower.tail=FALSE) for the computation of $t_{n-1,\alpha/2}$. The rest of the details follow the previous function.
6. Let us obtain the 95 percent and 99 percent confidence intervals for the vector xn under the assumption that the variance is not known. The codes for achieving the results are normal_CI_uksd(x=xn,alpha=0.05) and normal_CI_uksd(x=xn,alpha=0.01).
Thus, the 95 percent confidence interval for the mean is (9.1518, 10.7419) and the 99 percent confidence interval is (8.8742, 10.9925).
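As a sanity check, the same intervals for the unknown-variance case can be obtained directly from the built-in t.test function, which returns a confidence interval as part of its output. This is a quick sketch, assuming xn is the simulated vector from earlier.

# Cross-check of normal_CI_uksd against the interval returned by t.test
t.test(xn, conf.level = 0.95)$conf.int
t.test(xn, conf.level = 0.99)$conf.int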
Figure 3: Confidence intervals: Some raw programs
What just happened?
We created special functions for obtaining confidence intervals and executed them for three different cases. However, our framework is quite generic in nature and, with a bit of care and caution, it may easily be extended to other distributions too.
Stascal Inference
[ 144 ]
Hypotheses testing
"Best consumed before six months from date of manufacture", "Two years warranty",
"Expiry date: June 20, 2015", and so on, are some of the likely assurances that you would
have easily come across. An analyst will have to arrive at such statements using the related
data. Let us rst dene a hypothesis.
Hypothesis: A hypothesis is an asseron about the unknown parameter of the probability
distribuon. For the quote of this secon, denong the least me (in months) ll which
an eatery will not be losing its good taste by
θ
, the hypothesis of interest will be
0:6H
θ
.
It is common to denote the hypothesis of interest by
0
H
and it called the null hypothesis.
We want to test the null hypothesis against the alternave hypothesis that the consumpon
me is well before the six months' me, which in symbols is denoted by
1:6H
θ
<
. We will
begin with some important denions followed by related examples.
Test stasc: A stasc that is a funcon of the random sample is called as test stasc.
For an observaon X following a binomial distribuon b(n, p), the test stasc for p will be
X/n, whereas for a random sample from the normal distribuon, the test stasc may be
mean
1
/
n
i
i
X X n
=
=
or the sample variance
( )
( )
2
2
1
/ 1
n
i
i
S X X n
=
= − −
depending on whether the
tesng problem is for
2
or
µσ
. The stascal soluon to reject (or not) the null hypothesis
depends on the value of test stasc. This leads us to the next denion.
Crical region: The set of values of the test stasc which leads to the rejecon of the null
hypothesis is known as the crical region.
We have made various kinds of assumpons for the random experiments. Naturally,
depending on the type of the probability family, namely binomial, normal, and so on,
we will have an appropriate tesng tool. Let us look at the very popular tests arising
in stascs.
Binomial test
A binomial random variable X, with distribution represented by b(n, p), is characterized by two parameters, n and p. Typically, n represents the number of trials and is known in most cases, and it is the probability of success p in which one is generally interested for the related hypotheses.
For example, an LCD panel manufacturer would like to test whether the proportion of defectives is at most four percent. The panel manufacturer has randomly inspected 893 LCDs and found 39 to be defective. Here, the hypothesis testing problem would be $H_0: p \leq 0.04$ vs $H_1: p > 0.04$.
A doctor would like to test whether the proportion of people in a drought-affected area having a viral infection such as pneumonia is 0.2, that is, $H_0: p = 0.2$ vs $H_1: p \neq 0.2$. The drought-affected area may encompass a huge geographical area, and as such it becomes really difficult to carry out a census over a very short period of a day or two. Thus, the doctor selects the second-eldest member of a family and inspects 119 households for pneumonia. He records that 28 out of the 119 inspected people are suffering from pneumonia. Using this information, we need to help the doctor test the hypothesis of interest to him.
In general, the hypothesis testing problems for the binomial distribution will be along the lines of $H_0: p \leq p_0$ vs $H_1: p > p_0$, $H_0: p \geq p_0$ vs $H_1: p < p_0$, or $H_0: p = p_0$ vs $H_1: p \neq p_0$.
Let us see how the binom.test function in R helps in testing hypothesis problems related to the binomial distribution.
Time for action – testing the probability of success
We will use the R function binom.test for testing hypothesis problems related to p. This function takes as arguments the number of trials n, the number of successes x, the hypothesized probability of interest p, and alternative as one of "greater", "less", or "two.sided".
1. Discover the details related to binom.test using ?binom.test, and then run example(binom.test) and ensure that you understand the default example.
2. For the LCD panel manufacturer, we have n = 893 and x = 39. The null hypothesis is specified at p = 0.04. Enter this data first in R with the following code:
n_lcd <- 893; x_lcd <- 39; p_lcd <- 0.04
3. The alternative hypothesis is that the proportion of success p is greater than 0.04, which is passed to the binom.test function with the option alternative="greater", and hence the complete binom.test call for the LCD panel is:
binom.test(n=n_lcd,x=x_lcd,p=p_lcd,alternative="greater")
The output, shown in the following screenshot, shows that the estimated probability of success is 0.04367, which is certainly greater than 0.04. However, the p-value = 0.3103 indicates that we do not have enough evidence in the data to reject the null hypothesis $H_0: p \leq 0.04$. Note that binom.test also gives us a 95 percent confidence interval for p as (0.033, 1.000), and since the hypothesized probability lies in this interval, we arrive at the same conclusion. This confidence interval is recommended over the one developed in the previous section, and in particular we don't have to worry about the confidence limits being either less than 0 or greater than 1. Also, you may obtain a confidence interval of any $100(1-\alpha)$ percent level of your choice with the conf.level argument, as illustrated next.
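For instance, a 99 percent interval for the LCD example can be requested as follows; this is a small illustrative sketch using the objects defined in step 2.

# Requesting a 99 percent confidence interval from binom.test
binom.test(n=n_lcd, x=x_lcd, p=p_lcd, alternative="greater", conf.level=0.99)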
Stascal Inference
[ 146 ]
4. For the doctor's problem, we have the data as:
n_doc <- 119; x_doc <- 28; p_doc <- 0.2
5. We need to test the null hypothesis against a two-sided alternative hypothesis, and though this is the default setting of binom.test, it is good practice to specify it explicitly, at least until the user feels sufficiently expert:
binom.test(n=n_doc,x=x_doc,p=p_doc,alternative="two.sided")
The estimated probability of success, actually a patient's probability of having the viral infection, is 0.3193. Since the p-value associated with the test is p-value = 0.001888, we reject the null hypothesis $H_0: p = 0.2$. All the output is given in the following screenshot. The 95 percent confidence interval is (0.2369, 0.4110), again given by binom.test, which does not contain the hypothesized value of 0.2, and hence we can also reject the null hypothesis based on the confidence interval.
Figure 4: Binomial tests for probability of success
What just happened?
The binomial distribution arises in a large number of proportion-testing problems. In this section we used binom.test for testing problems related to the probability of success. We also note that confidence intervals for p are given as a by-product of the application of binom.test.
Tests of proportions and the chi-square test
In Chapter 3, Data Visualization, we came across the Titanic and UCBAdmissions datasets. For the Titanic dataset, we may like to test whether the Survived proportion across the Class is the same for the two Sex groups. Similarly, for the UCBAdmissions dataset, we may wish to know whether the proportion of Admitted candidates for the Male and Female groups is the same across the six Dept values. Thus, there is a need to generalize the binom.test function to a group of proportions. In this problem, we may have k proportions, and the probability vector is specified by $p = (p_1, \ldots, p_k)$. The hypothesis problem may be specified as testing the null hypothesis $H_0: p = p_0$ against the alternative hypothesis $H_1: p \neq p_0$. Equivalently, in vector form, the problem is testing $H_0: (p_1, \ldots, p_k) = (p_{01}, \ldots, p_{0k})$ against $H_1: (p_1, \ldots, p_k) \neq (p_{01}, \ldots, p_{0k})$. The R extension of binom.test is given in prop.test.
Time for action – testing proportions
We will use the prop.test R function here for testing the equality of proportions for count data problems.
1. Load the required dataset with data(UCBAdmissions). For the UCBAdmissions dataset, first obtain the Admitted and Rejected frequencies for both genders across the six departments with:
UCBA.Dept <- ftable(UCBAdmissions, row.vars="Dept", col.vars = c("Gender", "Admit"))
2. Calculate the Admitted proportions for Female across the six departments with:
p_female <- prop.table(UCBA.Dept[,3:4],margin=1)[,1]
Check p_female!
3. Test whether the proportions across the departments for Male match those for Female using prop.test:
prop.test(UCBA.Dept[,1:2],p=p_female)
The proportions are not equal across the Gender groups, as p-value < 2.2e-16 rejects the null hypothesis that they are equal.
4. Next, we want to investigate whether the Male and Female survivor proportions are the same in the Titanic dataset. The approach is similar to the UCBAdmissions problem; run the following code:
T.Class <- ftable(Titanic, row.vars="Class", col.vars = c("Sex", "Survived"))
Stascal Inference
[ 148 ]
5. Compute the Female survivor proportions across the four classes with p_female <- prop.table(T.Class[,3:4],margin=1)[,1]. Note that this new variable, p_female, will overwrite the same-named variable from the earlier steps.
6. Display p_female and then carry out the comparison across the two genders:
prop.test(T.Class[,1:2],p=p_female)
The p-value < 2.2e-16 clearly shows that the survivor proportions are not the same across the genders.
Figure 5: prop.test in action
Indeed, there is more complexity to the two datasets than mere proportions for the two genders. The web page http://www-stat.stanford.edu/~sabatti/Stat48/UCB.R has a detailed analysis of the UCBAdmissions dataset, and here we will simply apply the chi-square test to check whether the admission percentage within each department is independent of the gender.
7. The data for the admission/rejection for each department is extractable through the third index in the array, that is, UCBAdmissions[,,i] across the six departments. Now, we apply chisq.test to check whether the admission procedure is independent of the gender by running chisq.test(UCBAdmissions[,,i]) six times, as sketched after this step. The result has been edited in an external text editor and a screenshot of it is provided next.
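Rather than typing the call six times, the six tests can be run in one pass; the following is a compact sketch (the loop is our own shorthand, not the book's screenshot code).

# Chi-square test of independence between Gender and Admit within each department
sapply(dimnames(UCBAdmissions)$Dept, function(d)
  chisq.test(UCBAdmissions[, , d])$p.value)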
It appears that the Dept = A admits more males than females.
Figure 6: Chi-square tests for the UCBAdmissions problem
What just happened?
We used prop.test and chisq.test to test proportions and the independence of attributes. Functions such as ftable and prop.table, and arguments such as row.vars, col.vars, and margin, were useful for getting the data in the right format for the analysis.
We will now look at an important family of tests for the normal distribution.
Tests based on normal distribution – one-sample
The normal distribution pops up in many instances of statistical analysis. In fact, Whittaker and Robinson have remarked on the popularity of the normal distribution as follows:
Everybody believes in the exponential law of errors [that is, the normal distribution]: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation.
We will not make an attempt to find out whether the experimenters or the mathematicians are correct, well, at least not in this section.
Stascal Inference
[ 150 ]
In general, we will be dealing with either one-sample or two-sample tests. In the one-sample problem, we have a random sample of size n from $N(\mu, \sigma^2)$ in $(X_1, X_2, \ldots, X_n)$. The hypothesis testing problem may be related to either or both of the parameters $(\mu, \sigma^2)$. The interesting and most frequent hypothesis testing problems for the normal distribution are listed here:
Testing for the mean with known variance $\sigma^2$:
  $H_0: \mu \geq \mu_0$ vs $H_1: \mu < \mu_0$
  $H_0: \mu \leq \mu_0$ vs $H_1: \mu > \mu_0$
  $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$
Testing for the mean with unknown variance $\sigma^2$: the same set of hypothesis problems as in the preceding point
Testing for the variance with unknown mean:
  $H_0: \sigma \leq \sigma_0$ vs $H_1: \sigma > \sigma_0$
  $H_0: \sigma \geq \sigma_0$ vs $H_1: \sigma < \sigma_0$
  $H_0: \sigma = \sigma_0$ vs $H_1: \sigma \neq \sigma_0$
In the case of known variance, the hypothesis testing problem for the mean is based on the Z-statistic given by:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

where $\bar{X} = \sum_{i=1}^{n} X_i / n$. The test procedure, known as the Z-test, for the hypothesis testing problem $H_0: \mu \leq \mu_0$ vs $H_1: \mu > \mu_0$ is to reject the null hypothesis at the $\alpha$ level of significance if $\bar{X} > \mu_0 + z_\alpha \sigma/\sqrt{n}$, where $z_\alpha$ is the upper $\alpha$ percentile of the standard normal distribution. For the hypothesis testing problem $H_0: \mu \geq \mu_0$ vs $H_1: \mu < \mu_0$, the critical/rejection region is $\bar{X} < \mu_0 - z_\alpha \sigma/\sqrt{n}$. Finally, for the testing problem $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$, we reject the null hypothesis if:

$$\left| \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \right| > z_{\alpha/2}$$
An R function, z.test, is available in the PASWR package, which carries out the Z-test for each type of hypothesis testing problem. Now, we consider the case when the variance $\sigma^2$ is not known. In this case, we first find an estimate of the variance using $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$. The test procedure is based on the well-known t-statistic:

$$t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

The test procedure based on the t-statistic is highly popular as the t-test or Student's t-test, and its implementation is available in R through the t.test function in the stats package. The distribution of the t-statistic under the null hypothesis is the t-distribution with (n - 1) degrees of freedom. The rationale behind the application of the t-test for the various types of hypotheses remains the same as for the Z-test.
For the hypothesis testing problem concerning the variance $\sigma^2$ of the normal distribution, we need to first compute the sample variance using $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$ and define the chi-square statistic:

$$\chi^2 = \frac{(n-1) S^2}{\sigma_0^2}$$

Under the null hypothesis, the chi-square statistic is distributed as a chi-square random variable with n - 1 degrees of freedom. In the case of a known mean, which is seldom the case, the test procedure is based on the test statistic $\chi^2 = \sum_{i=1}^{n} (X_i - \mu)^2 / \sigma_0^2$, which follows a chi-square random variable with n degrees of freedom. For the hypothesis problem $H_0: \sigma \geq \sigma_0$ vs $H_1: \sigma < \sigma_0$, the test procedure is to reject $H_0$ if $\chi^2 < \chi^2_{n-1, 1-\alpha}$. Similarly, for the hypothesis problem $H_0: \sigma \leq \sigma_0$ vs $H_1: \sigma > \sigma_0$, the procedure is to reject $H_0$ if $\chi^2 > \chi^2_{n-1, \alpha}$; and finally, for the problem $H_0: \sigma = \sigma_0$ vs $H_1: \sigma \neq \sigma_0$, the test procedure rejects $H_0$ if either $\chi^2 < \chi^2_{n-1, 1-\alpha/2}$ or $\chi^2 > \chi^2_{n-1, \alpha/2}$. Here, $\chi^2_{n-1, \alpha}$ denotes the upper $\alpha$ quantile of the chi-square distribution with n - 1 degrees of freedom.
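To make the Z-test formula concrete before turning to the packaged functions, the statistic and its p-value can be computed in a few lines of base R. This is an illustrative sketch under the one-sided "less" alternative, not the PASWR implementation; the function name z_stat is our own.

# A bare-bones one-sample Z-test (alternative: mu < mu0), for illustration only
z_stat <- function(x, mu0, sigma) {
  z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))
  p_value <- pnorm(z)              # lower-tail p-value for H1: mu < mu0
  c(z = z, p.value = p_value)
}
# Example, once pH_Data is defined in the next Time for action:
# z_stat(pH_Data, mu0 = 8.4, sigma = 0.05)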
Test examples. Let us consider some situations where the preceding set of hypotheses arises in a natural way:
A certain chemical experiment requires that the solution used as a reactant has a pH level greater than 8.4. It is known that the manufacturing process gives measurements which follow a normal distribution with a standard deviation of 0.05. The ten random observations are 8.30, 8.42, 8.44, 8.32, 8.43, 8.41, 8.42, 8.46, 8.37, and 8.42. Here, the hypothesis testing problem of interest is $H_0: \mu \geq 8.4$ vs $H_1: \mu < 8.4$. This problem is adopted from page 408 of Ross (2010).
Stascal Inference
[ 152 ]
Following a series of complaints that his company's LCD panels never last more than a year, the manufacturer wants to test whether his LCD panels indeed fail within a year. Using historical data, he knows the standard deviation of the panel life due to the manufacturing process is two. A random sample of 15 units from a freshly manufactured lot gives their lifetimes as 13.37, 10.96, 12.06, 13.82, 12.96, 10.47, 10.55, 16.28, 12.94, 11.43, 14.51, 12.63, 13.50, 11.50, and 12.87. You need to help the manufacturer validate his hypothesis.
Freund and Wilson (2003). Suppose that the mean weight of peanuts put in jars is required to be 8 oz. The variance of the weights is known to be 0.03, and the observed weights for 16 jars are 8.08, 7.71, 7.89, 7.72, 8.00, 7.90, 7.77, 7.81, 8.33, 7.67, 7.79, 7.79, 7.94, 7.84, 8.17, and 7.87. Here, we are interested in testing $H_0: \mu = 8.0$ vs $H_1: \mu \neq 8.0$.
New managers have been appointed at the respective places in the preceding bullets. As a consequence, the new managers are not aware of the standard deviation of the processes under their control. As an analyst, help them!
Suppose that the variance in the first example is not known and that it is a critical requirement that the variance be less than 7; the testing problem may then be set up as $H_0: \sigma^2 \leq 7$ vs $H_1: \sigma^2 > 7$.
Suppose that a variance test needs to be carried out for the third problem, that is, the hypothesis testing problem is then $H_0: \sigma^2 = 0.03$ vs $H_1: \sigma^2 \neq 0.03$.
We will perform the necessary tests for all the problems described before.
Time for action – testing one-sample hypotheses
We will require the R packages PASWR and PairedData here. The R functions t.test, z.test, and var.test will be useful for testing one-sample hypothesis problems related to a random sample from normal distributions.
1. Load the library with library(PASWR).
2. Enter the data for pH in R:
pH_Data <- c(8.30,8.42,8.44,8.32,8.43,8.41,8.42,8.46,8.37,8.42)
3. Specify the known standard deviation of pH in pH_sigma <- 0.05.
4. Use z.test from the PASWR library to test the hypotheses described in the first example with:
z.test(x=pH_Data,alternative="less",sigma.x=pH_sigma,mu=8.4)
The data is specified in the x option, the type of hypothesis problem is specified by stating the form of the alternative hypothesis, the known standard deviation is fed through the sigma.x option, and finally, the mu option is used to specify the value of the mean under the null hypothesis. The output of the complete R program is collected in the forthcoming two screenshots.
The p-value is 0.4748, which means that we do not have enough evidence to reject the null hypothesis $H_0: \mu \geq 8.4$, and hence we conclude that the mean pH value is at least 8.4.
5. Get the data of the LCD panels in your session with:
LCD_Data <- c(13.37, 10.96, 12.06, 13.82, 12.96, 10.47, 10.55, 16.28, 12.94, 11.43, 14.51, 12.63, 13.50, 11.50, 12.87)
6. Specify the known standard deviation LCD_sigma <- 2 and run the z.test with:
z.test(x=LCD_Data,alternative="greater",sigma.x=LCD_sigma,mu=12)
The p-value is seen to be 0.1018, and hence we again do not have enough evidence in the data to reject the null hypothesis; that is, we cannot conclude that the average lifetime of an LCD panel exceeds a year.
7. The complete program for the third problem is as follows:
peanuts <- c(8.08, 7.71, 7.89, 7.72, 8.00, 7.90, 7.77, 7.81, 8.33, 7.67, 7.79, 7.79, 7.94, 7.84, 8.17, 7.87)
peanuts_sigma <- 0.03
z.test(x=peanuts,sigma.x=peanuts_sigma,mu=8.0)
Since the p-value associated with this test is 2.2e-16, that is, very close to zero, we reject the null hypothesis $H_0: \mu = 8.0$.
8. If the variance(s) are not known and a test of the sample means is required, we need to move from z.test (in the PASWR library) to t.test (in the stats library):
t.test(x=pH_Data,alternative="less",mu=8.4)
t.test(x=LCD_Data,alternative="greater",mu=12)
t.test(x=peanuts,mu=8.0)
If the variance is not known, the conclusions for the problems related to pH and peanuts do not change. However, the conclusion changes for the LCD panel problem: at the 10 percent significance level the null hypothesis is rejected, since the p-value is 0.06414.
Stascal Inference
[ 154 ]
For the problem of testing variances related to the one-sample problem, my initial idea was to write raw R code, as there did not seem to be a function, package, and so on, which readily gives the answers. However, a more appropriate search at google.com revealed that an R package titled PairedData, created by Stephane Champely, does have a function, var.test, not to be confused with the same-named function in the stats library, which is appropriate for testing problems related to the variance of a normal distribution. The problem is that the routine method of fetching the package using install.packages("PairedData") gives a warning message, namely package 'PairedData' is not available (for R version 2.15.1). This is the classic case of "so near, yet so far…". However, a deeper look into this will lead us to http://cran.r-project.org/src/contrib/Archive/PairedData/. This web page shows the various versions of the PairedData package. A Linux user should have no problem in using it, though other OS users can't be helped right away. A Linux user needs to first download one of the zipped files, say PairedData_1.0.0.tar.gz, to a specific directory and, with the GNOME Terminal path in that directory, execute R CMD INSTALL PairedData_1.0.0.tar.gz. Now, we are ready to carry out the tests related to the variance of a normal distribution. A Windows user need not be discouraged by this scenario, as the important function var1.test is made available in the RSADBE package of the book. A more recent check on the CRAN website reveals that the PairedData package is again available for all OS platforms since April 18, 2013.
Figure 7: z.test and t.test for one-sample problem
9. Load the required library with library(PairedData).
10. Carry out the two variance testing problems from the last two test examples with:
var.test(x=pH_Data,alternative="greater",ratio=7)
var.test(x=peanuts,alternative="two.sided",ratio=0.03)
11. It may be seen from the next screenshot that the data does not lead to rejection of the null hypotheses. For a Windows user, the alternative is to use the var1.test function from the RSADBE package. That is, you need to run:
var1.test(x=pH_Data,alternative="greater",ratio=7)
var1.test(x=peanuts,alternative="two.sided",ratio=0.03)
You'll get the same results:
Figure 8: var.test from the PairedData library
What just happened?
The tests z.test, t.test, and var.test (from the PairedData library) have been used for testing hypothesis problems under varying assumptions about what is known.
Stascal Inference
[ 156 ]
Have a go hero
Consider the testing problem $H_0: \sigma = \sigma_0$ vs $H_1: \sigma \neq \sigma_0$. The test statistic for this hypothesis testing problem is given by:

$$\chi^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma_0^2}$$

which follows a chi-square distribution with n - 1 degrees of freedom. Create your own function for this testing problem and compare it with the results given by var.test of the PairedData package. One possible sketch is given next.
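The following is one possible sketch of such a function, written directly from the formula above; the name chisq_var_test and its output format are our own choices, and readers should compare its p-value against var.test from PairedData.

# A two-sided chi-square test for H0: sigma^2 = sigma0^2, written from scratch
chisq_var_test <- function(x, sigma0sq) {
  n <- length(x)
  stat <- sum((x - mean(x))^2) / sigma0sq      # (n - 1) * S^2 / sigma0^2
  # two-sided p-value from the chi-square distribution with n - 1 df
  p_value <- 2 * min(pchisq(stat, df = n - 1),
                     pchisq(stat, df = n - 1, lower.tail = FALSE))
  list(statistic = stat, df = n - 1, p.value = p_value)
}
# Example: chisq_var_test(peanuts, sigma0sq = 0.03)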
With the testing problem for the parameters of a normal distribution in the one-sample case behind us, we will next focus on the important two-sample problem.
Tests based on normal distribution – two-sample
The two-sample problem has data from two populations, where $(X_1, X_2, \ldots, X_{n_1})$ are $n_1$ observations from $N(\mu_1, \sigma_1^2)$ and $(Y_1, Y_2, \ldots, Y_{n_2})$ are $n_2$ observations from $N(\mu_2, \sigma_2^2)$. We assume that the samples within each population are independent of each other, and further that the samples across the two populations are also independent. Similar to the one-sample problem, we have the following set of recurring and interesting hypothesis testing problems.
Mean comparison with known variances $\sigma_1^2$ and $\sigma_2^2$:
  $H_0: \mu_1 \leq \mu_2$ vs $H_1: \mu_1 > \mu_2$
  $H_0: \mu_1 \geq \mu_2$ vs $H_1: \mu_1 < \mu_2$
  $H_0: \mu_1 = \mu_2$ vs $H_1: \mu_1 \neq \mu_2$
Mean comparison with unknown variances $\sigma_1^2$ and $\sigma_2^2$: the same set of hypothesis problems as before. We make an additional assumption here that the variances $\sigma_1^2$ and $\sigma_2^2$ are equal, though unknown.
The variances comparison:
  $H_0: \sigma_1 \leq \sigma_2$ vs $H_1: \sigma_1 > \sigma_2$
  $H_0: \sigma_1 \geq \sigma_2$ vs $H_1: \sigma_1 < \sigma_2$
  $H_0: \sigma_1 = \sigma_2$ vs $H_1: \sigma_1 \neq \sigma_2$
First define the sample means for the two populations: $\bar{X} = \sum_{i=1}^{n_1} X_i / n_1$ and $\bar{Y} = \sum_{i=1}^{n_2} Y_i / n_2$. For the case of known variances $\sigma_1^2$ and $\sigma_2^2$, the test statistic is defined by:
$$Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$

Under the null hypothesis of equal means, $Z = (\bar{X} - \bar{Y}) / \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$ follows a standard normal distribution. The test procedure for the problem $H_0: \mu_1 \leq \mu_2$ vs $H_1: \mu_1 > \mu_2$ is to reject $H_0$ if $z > z_\alpha$, and the procedure for $H_0: \mu_1 \geq \mu_2$ vs $H_1: \mu_1 < \mu_2$ is to reject $H_0$ if $z < -z_\alpha$. As expected, and on the lines of earlier intuition, the test procedure for the hypothesis problem $H_0: \mu_1 = \mu_2$ vs $H_1: \mu_1 \neq \mu_2$ is to reject $H_0$ if $|z| > z_{\alpha/2}$.
Let us now consider the case when the variances $\sigma_1^2$ and $\sigma_2^2$ are not known and are assumed (or known) to be equal. In this case, we can't use the Z-test any further and need to look at an estimator of the common variance. For this, we define the pooled variance estimator as follows:

$$S_p^2 = \frac{n_1 - 1}{n_1 + n_2 - 2} S_x^2 + \frac{n_2 - 1}{n_1 + n_2 - 2} S_y^2$$

where $S_x^2$ and $S_y^2$ are the sample variances of the two populations. Define the t-statistic as follows:

$$t = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{1/n_1 + 1/n_2}}$$

The test procedure for the set of three hypothesis testing problems is then to reject the null hypothesis if, respectively, $t > t_{n_1+n_2-2, \alpha}$, $t < -t_{n_1+n_2-2, \alpha}$, or $|t| > t_{n_1+n_2-2, \alpha/2}$.
Finally, we focus on the problem of testing the equality of variances across the two samples. Here, the test statistic is given by:

$$F = \frac{S_x^2}{S_y^2}$$

The test procedures are to reject the null hypotheses of the testing problems $H_0: \sigma_1 \leq \sigma_2$ vs $H_1: \sigma_1 > \sigma_2$, $H_0: \sigma_1 \geq \sigma_2$ vs $H_1: \sigma_1 < \sigma_2$, and $H_0: \sigma_1 = \sigma_2$ vs $H_1: \sigma_1 \neq \sigma_2$ if, respectively, $F > F_{n_1-1, n_2-1, \alpha}$, $F < F_{n_1-1, n_2-1, 1-\alpha}$, or either $F > F_{n_1-1, n_2-1, \alpha/2}$ or $F < F_{n_1-1, n_2-1, 1-\alpha/2}$.
Stascal Inference
[ 158 ]
Let us now consider some scenarios where we have the previously listed hypotheses
tesng problems.
Test examples. Let us consider some situaons when the preceding set of hypotheses arise
in a natural way.
In connuaon of the chemical experiment problem, let us assume that the
chemists have come up with a new method of obtaining the same soluon as
discussed in the previous secon. For the new technique, the standard deviaon
connues to be 0.05 and 12 observaons for the new method yield the following
measurements: 8.78, 8.85, 8.74, 8.83, 8.82, 8.79, 8.82, 8.74, 8.84, 8.78, 8.75, 8.81.
Now, this new soluon is acceptable if its mean is greater than that for the earlier
one. Thus, the hypotheses tesng problem is now 0 1
vs: :
NEW OLD NEW OLD
H H
µ µ µ µ
> .
Ross (2008), page 451. The precision of instruments in metal cutting is serious business, and the cut pieces can't be significantly shorter than the target, nor longer than it. Two machines are each used to cut 10 pieces of steel, and their measurements are respectively 122.4, 123.12, 122.51, 123.12, 122.55, 121.76, 122.31, 123.2, 122.48, 121.96, and 122.36, 121.88, 122.2, 122.88, 123.43, 122.4, 122.12, 121.78, 122.85, 123.04. The standard deviation of the length of a cut is known to be equal to 0.5. We need to test whether the average cut length is the same for the two machines.
For both the preceding problems, also assume that though the variances are equal, they are not known. Complete the hypotheses testing problems using t.test.
Freund and Wilson (2003), page 199. The monitoring of the amount of peanuts being put in jars is an important issue from a quality control viewpoint. The consistency of the weights is of prime importance, and the manufacturer has been introduced to a new machine, which is supposed to give more accuracy in the weights of the peanuts put in the jars. With the new device, 11 jars were tested and their weights found to be 8.06, 8.64, 7.97, 7.81, 7.93, 8.57, 8.39, 8.46, 8.28, 8.02, 8.39, whereas a sample of nine jars from the previous machine weighed 7.99, 8.12, 8.34, 8.17, 8.11, 8.03, 8.14, 8.14, 7.87. Now, the task is to test $H_0: \sigma_{NEW} = \sigma_{OLD}$ vs $H_1: \sigma_{NEW} \neq \sigma_{OLD}$.
Let us do the tests for the preceding four problems in R.
Time for action – testing two-sample hypotheses
For the problem of tesng hypotheses for the means arising from two populaons, we will
be using the funcons z.test and t.test.
1. As earlier, load the library(PASWR) library.
2. Carry out the Z-test using z.test and the opons x, y, sigma.x, and sigma.y:
pH_Data <- c(8.30, 8.42, 8.44, 8.32, 8.43, 8.41, 8.42,
8.46, 8.37, 8.42)
pH_New <- c(8.78, 8.85, 8.74, 8.83, 8.82, 8.79, 8.82,
8.74, 8.84, 8.78, 8.75, 8.81)
z.test(x=pH_Data,y=pH_New,sigma.x=sigma.y=0.05,alternative="less")
The p-value is very small (2.2e-16) indicang that we reject the null hypothesis
that
0:
NE
WO
LD
H
µ µ
>
.
3. For the steel length cut data problem, run the following code:
length_M1 <- c(122.4, 123.12, 122.51, 123.12, 122.55, 121.76, 122.31, 123.2, 122.48, 121.96)
length_M2 <- c(122.36, 121.88, 122.2, 122.88, 123.43, 122.4, 122.12, 121.78, 122.85, 123.04)
z.test(x=length_M1, y=length_M2, sigma.x=0.5, sigma.y=0.5)
The display of p-value = 0.8335 shows that the two machines do not cut the steel differently.
4. If the variances are equal but not known, we need to use t.test instead of z.test:
t.test(x=pH_Data, y=pH_New, alternative="less")
t.test(x=length_M1, y=length_M2)
5. The p-values for the two hypotheses problems are p-value = 3.95e-13 and p-value = 0.8397. We leave the interpretation to the reader.
6. For the fourth problem, we have the following R program:
machine_new <- c(8.06, 8.64, 7.97, 7.81, 7.93, 8.57, 8.39, 8.46, 8.28, 8.02, 8.39)
machine_old <- c(7.99, 8.12, 8.34, 8.17, 8.11, 8.03, 8.14, 8.14, 7.87)
t.test(machine_new, machine_old, alternative="greater")
Again, we have p-value = 0.1005!
Stascal Inference
[ 160 ]
What just happened?
The funcons t.test and z.test were simply extensions from the one-sample case to the
two-sample test.
Have a go hero
In the one-sample case you used var.test for the same datasets, which needed a comparison of means with some known standard deviation. Now, test for the variances in the two-sample case using var.test with appropriate hypotheses for them. For example, test whether the variances are equal for pH_Data and pH_New; a short sketch follows. Find more details of the test with ?var.test.
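A minimal sketch, assuming pH_Data and pH_New are defined as in the preceding Time for action:
var.test(pH_Data, pH_New)                          # two-sided F test for equality of variances
var.test(pH_Data, pH_New, alternative="greater")   # a one-sided version, if required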
Summary
In this chapter we have introduced "statistical inference", which in common usage consists of three parts: estimation, confidence intervals, and hypotheses testing. We began the chapter with the importance of likelihood and obtained the MLE for many of the standard probability distributions using built-in modules. Later, simply to maintain the order of concepts, we defined functions exclusively for obtaining the confidence intervals. Finally, the chapter considered important families of tests that are useful across many important stochastic experiments. In the next chapter we will introduce the linear regression model, which more formally constitutes the applied face of the subject.
6
Linear Regression Analysis
In the Visualization techniques for continuous variable data section of Chapter
3, Data Visualization, we have seen different data visualization techniques
which help in understanding the data variables (boxplots and histograms) and
their interrelationships (matrix of scatter plots). We had seen in Example 4.6.1.
Resistant line for the IO-CPU time an illustration of the resistant line, where
CPU_Time depends linearly on the No_of_IO variable. The pairs function's output in Example 3.2.9. Octane rating of gasoline blends indicated that the
mileage of a car has strong correlations with the engine-related characteristics,
such as displacement, horsepower, torque, the number of transmission speeds,
and the type of transmission being manual or automatic. Further, the mileage
of a car also strongly depends on the vehicle dimensions, such as its length,
width, and weight. The question addressed in this chapter is meant to further
these initial findings through a more appropriate model. Now, we take the next
step forward and build linear regression models for the problems. Thus, in this
chapter we will provide more concrete answers for the mileage problem.
The rst linear regression model was built by Sir Francis Galton in 1908. The word regression
implies towards-the-center. The covariates, also known as independent variables, features,
or regressors, have a regressive eect on the output, also called dependent or regressand
variable. Since the covariates are allowed, actually assumed, to aect the output in linear
increments, we call the model the linear regression model. The linear regression models
provide an answer for the correlaon between the regressand and the regressors, and as
such do not really establish causaon. As it will be seen later in the chapter, using data,
we will be able to understand the mileage of a car as a linear funcon of the car-related
dynamics. From a pure scienc point of view, the mileage should really depend on
complicated formulas of the car's speed, road condions, the climate, and so on. However,
it will be seen that linear models work just ne for the problem despite not really going
into the technical details. However, there will also be a price to pay, in the sense that most
regression models work well when the range of the variables is well dened, and that an
aempt to extrapolate the results usually does not result in sasfactory answers. We will
begin with a simple linear regression model where we have one dependent variable
and one covariate.
At the conclusion of the chapter, you will be able to build a regression model through
the following steps:
Building a linear regression model and their interpretaon
Validaon of the model assumpons
Idenfying the eect of every single observaon, covariates, as well as the output
Fixing the problem of dependent covariates
Selecon of the opmal linear regression model
The simple linear regression model
In Example 4.6.1. Resistant line for the IO-CPU time of Chapter 4, Exploratory Analysis, we built a resistant line for CPU_Time as a function of the No_of_IO processes. The results were satisfactory in the sense that the fitted line was very close to covering all the data points; refer to Figure 7 of Chapter 4, Exploratory Analysis. However, we need more statistical validation of the estimated values of the slope and intercept terms. Here we take a different approach and state the linear regression model in more technical detail.
The simple linear regression model is given by $Y = \beta_0 + \beta_1 X + \epsilon$, where X is the covariate/independent variable, Y is the regressand/dependent variable, and $\epsilon$ is the unobservable error term. The parameters of the linear model are specified by $\beta_0$ and $\beta_1$. Here $\beta_0$ is the intercept term and corresponds to the value of Y when x = 0. The slope term $\beta_1$ reflects the change in the Y value for a unit change in X. It is also common to refer to the $\beta_0$ and $\beta_1$ values as regression coefficients. To understand the regression model, we begin with n pairs of observations $(Y_1, X_1), \ldots, (Y_n, X_n)$, with each pair being completely independent of the other. We make the assumption of normal and independent and identically distributed (iid) behavior for the error term $\epsilon$, specifically $\epsilon \sim N(0, \sigma^2)$, where $\sigma^2$ is the variance of the errors. The core assumptions of the model are listed as follows:
All the observaons are independent
The regressand depends linearly on the regressors
The errors are normally distributed, that is
( )
2
0N,
ε σ
We need to nd all the unknown parameters in
01
,
ββ
and 2
σ
. Suppose we have n
independent observaons. Stascal inference for the required parameters may be carried
out using the maximum likelihood funcon as described in the Maximum likelihood esmator
secon of Chapter 5, Stascal Inference. The popular technique for the linear regression
model is the least squares method which idenes the parameters by minimizing the error
sum of squares for the model, and under the assumpons made thus far agrees with the
MLE. Let
0
β
and
1
β
be a choice of parameters. Then the residuals, the distance between the
actual points and the model predicons, made by using the proposed choice of
01
,
ββ
on the
i-th pair of observaon
( )
ii
Y,X
is dened by:
01 12
ii i
eY X,i,,...,n
ββ
=− + =

Let us now specify dierent values for the pair
( )
01
,
ββ
and visualize the residuals for them.
What happens to the arbitrary choice of parameters?
For the IO_Time dataset, the scaer plot suggests that the intercept term is about 0.05.
Further, the resistant line gives an esmate of the slope at about 0.04. We will have three
pairs for guesses for
( )
01
,
ββ
as (0.05, 0.05), (0.1, 0.04), and (0.15, 0.03). We will now plot the
data and see the dierent residuals for the three pairs of guesses.
Time for action – the arbitrary choice of parameters
1. We begin with reasonable guesses of the slope and intercept terms for a simple linear regression model. The idea is to inspect the difference between the fitted line and the actual observations. Invoke the graphics windows using par(mfrow=c(1,3)).
2. Obtain the scatter plot of CPU_Time against No_of_IO with:
plot(No_of_IO, CPU_Time, xlab="Number of Processes", ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
3. For the guessed regression line with the values of $(\beta_0, \beta_1)$ being (0.05, 0.05), plot a line on the scatter plot with abline(a=0.05, b=0.05, col="blue").
4. Define a function which will find the y value for the guess of the pair (0.05, 0.05) using myline1 <- function(x) 0.05*x+0.05.
5. Plot the error (residuals) made due to the choice of the pair (0.05, 0.05) from the actual points using the following loop, and give a title for the first pair of guesses:
for(i in 1:length(No_of_IO)){
  lines(c(No_of_IO[i], No_of_IO[i]), c(CPU_Time[i], myline1(No_of_IO[i])), col="blue", pch=10)
}
title("Residuals for the First Guess")
6. Repeat the preceding exercise for the last two pairs of guesses for the regression coefficients $(\beta_0, \beta_1)$.
The complete R program is given as follows:
par(mfrow=c(1,3))
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
     ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.05, b=0.05, col="blue")
myline1 <- function(x) 0.05*x+0.05
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
        c(IO_Time$CPU_Time[i], myline1(IO_Time$No_of_IO[i])), col="blue", pch=10)
}
title("Residuals for the First Guess")
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
     ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.1, b=0.04, col="green")
myline2 <- function(x) 0.04*x+0.1
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
        c(IO_Time$CPU_Time[i], myline2(IO_Time$No_of_IO[i])), col="green", pch=10)
}
title("Residuals for the Second Guess")
plot(IO_Time$No_of_IO, IO_Time$CPU_Time, xlab="Number of Processes",
     ylab="CPU Time", ylim=c(0,0.6), xlim=c(0,11))
abline(a=0.15, b=0.03, col="yellow")
myline3 <- function(x) 0.03*x+0.15
for(i in 1:length(IO_Time$No_of_IO)){
  lines(c(IO_Time$No_of_IO[i], IO_Time$No_of_IO[i]),
        c(IO_Time$CPU_Time[i], myline3(IO_Time$No_of_IO[i])), col="yellow", pch=10)
}
title("Residuals for the Third Guess")
Figure 1: Residuals for the three choices of regression coefficients
What just happened?
We have just executed an R program which displays the residuals for arbitrary choices of the regression parameters. The displayed result is the preceding screenshot.
In the preceding R program, we first plot CPU_Time against No_of_IO. The first choice of the line is plotted by using the abline function, where we specify the required intercept and slope through a = 0.05 and b = 0.05. From this straight line (colored blue), we need to obtain the magnitude of the error made at each original point, through vertical lines joining the points to the line. This is achieved through the for loop, where the lines function joins each point and the line.
For the pair (0.05, 0.05) as a guess of $(\beta_0, \beta_1)$, we see that there is a progression in the residual values as x increases, and it is the other way around for the guess of (0.15, 0.03). In either case, we are making large mistakes (residuals) for certain x values. The middle plot for the guess (0.1, 0.04) does not seem to have large residual values. This choice may be better than the other two choices. Thus, we need to define a criterion which enables us to find the best values of $(\beta_0, \beta_1)$ in some sense. The criterion is to minimize the sum of squared errors:

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} e_i^2$$

where:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left\{ y_i - (\beta_0 + \beta_1 x_i) \right\}^2$$
Here, the summaon is over all the observed pairs
( )
12
ii
y,x,i,,...,n=
. The technique of
minimizing the error sum of squares is known as the method of least squares, and for the
simple linear regression model, the values of
( )
01
,
ββ
which meet the criterion are given by:
1 0 1
xy
xx
S
ˆ ˆ ˆ
, y x
S
β β β
= = −
Where:
1 1
and
n n
ii
ii
x y
x , y
n n
− −
∑ ∑
= =
And:
( ) ( )
2 2
1 1
and
n n
xx i xx i i
i i
S x x, S y xx
= =
= − =
∑ ∑
We will now learn how to use R for building a simple linear regression model.
Building a simple linear regression model
We will use the R funcon lm for the required construcon. The lm funcon creates
an object of class lm, which consists of an ensemble of the ed regression model.
Through the following exercise, you will learn the following:
The basic construcon of an lm object
The criteria which signies the model signicance
The criteria which signies the variable signicance
The variaon of the output explained by the inputs
The relaonship is specied by a formula in R, and the details related to the generic form
may be obtained by entering ?formula in the R console. That is, the lm funcon accepts
a formula object for the model that we are aempng to build. data.frame may also be
explicitly specied, which consists of the required data. We need to model CPU_Time as a
funcon of No_of_IO, and this is carried out by specifying CPU_Time ~ No_of_IO.
The funcon lm is wrapped around the formula to obtain our rst linear regression model.
Time for action – building a simple linear regression model
We will build the simple linear regression model using the lm funcon with its
useful arguments.
1. Create a simple linear regression model for CPU_Time as a funcon of No_of_IO
by keying in IO_lm <- lm(CPU_Time ~ No_of_IO, data=IO_Time).
2. Verify that IO_lm is of the lm class with class(IO_lm).
3. Find the details of the ed regression model using the summary funcon:
summary(IO_lm).
The output is given in the following screenshot:
Figure 2: Building the first simple linear regression model
The rst queson you should ask yourself is, "Is the model signicant overall?".
The answer is provided by the p-value of the F-stasc for the overall model. This appears
in the nal line of summary(IO_lm). If the p-value is closer to 0, it implies that the model is
useful. A rule of thumb for the signicance of the model is that it should be less than 0.05.
The general rule is that if you need the model signicance at a certain percentage, say P,
then the p-value of the F-stasc should be lesser than (1-P/100).
Now that we know that the model is useful, we can ask whether the independent variable
as well as the intercept term, are signicant or not. The answer to this queson is provided
by Pr(>|t|) for the variables in the summary. R has a general way of displaying the highest
signicance level of a term by using ***, **, * and . in the Signif. codes:. This display may be
easily compared with the review of a movie or a book! Just as with general rangs, where
more stars indicate a beer product, in our context, the higher the number of stars indicate
the variables are more signicant for the built model. In our linear model, we nd No_of_IO
to be highly signicant. The esmate value of No_of_IO is given as 0.04076. This coecient
has the interpretaon that for a unit increase in the number of IOs; CPU_Time is expected to
increase by 0.04076.
Now that we know that the model, as well as the independent variable, are significant, we need to know how much of the variability in CPU_Time is explained by No_of_IO. The answer to this question is provided by the measure R², not to be confused with the letter R for the software, which when multiplied by 100 gives the percentage of variation in the regressand explained by the regressor. The term R² is also called the coefficient of determination. In our example, 98.76 percent of the variation in CPU_Time is explained by No_of_IO; see the value associated with Multiple R-squared in summary(IO_lm). The R² measure does not consider the number of parameters estimated or the number of observations n in a model. A more robust measure, which takes into consideration the number of parameters and observations, is provided by Adjusted R-squared, which is 98.6 percent.
We have thus far not commented on the first numerical display produced by the summary function. This relates to the residuals, and the display is a basic summary of the residual values. The residuals vary from -0.016509 to 0.024006, which are not very large in comparison with the CPU_Time values; check with summary(CPU_Time), for instance. Also, the median of the residual values is very close to zero, which is an important criterion as the median of the standard normal distribution is 0.
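As a side note, the quantities discussed above can be pulled out of the summary object directly; the following is a small sketch, assuming IO_lm has been fitted as shown earlier.
IO_sum <- summary(IO_lm)
IO_sum$coefficients      # estimates, standard errors, t values, and Pr(>|t|)
IO_sum$r.squared         # Multiple R-squared
IO_sum$adj.r.squared     # Adjusted R-squared
IO_sum$fstatistic        # F-statistic and its degrees of freedom
summary(resid(IO_lm))    # the residual summary printed at the top of the output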
What just happened?
You have ed a simple linear regression model where the independent variable is No_of_
IO and the dependent variable (output) is CPU_Time. The important quanes to look for
the model signicance, the regression coecients, and so on have been clearly illustrated.
Have a go hero
Load the dataset anscombe from the datasets package. The anscombe dataset has four pairs of datasets in x1, x2, x3, x4, y1, y2, y3, and y4. Fit a simple regression model for each of the four pairs and obtain the summary for each pair; a starting sketch is given below. Make your comments on the summaries. Pay careful attention to the details of the summary function. If you need further help, simply try out example(anscombe).
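One convenient, though by no means the only, way of fitting the four models is sketched here:
data(anscombe)
anscombe_fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})
lapply(anscombe_fits, summary)   # compare the four summaries carefully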
We will next look at the ANOVA (Analysis of Variance) method for the regression model, and also obtain the confidence intervals for the model parameters.
ANOVA and the condence intervals
The summary funcon of the lm object species the p-value for each variable in the model,
including the intercept term. Technically, the hypothesis problem is tesng 00 0
1
j
j
H: ,j ,
β
==
against the corresponding alternave hypothesis,
10 0 1
j
j
H: ,j ,
β
≠=
. This tesng problem
is technically dierent from the simultaneous hypothesis tesng
00 10:H
ββ
==
against the
alternave that at least one of the regression coecients is dierent from 0. The ANOVA
technique gives the answer to the laer null hypothesis of interest.
For more details about the ANOVA technique, you may refer to http://en.wikipedia.org/wiki/Analysis_of_variance. Using the anova function, it is very simple in R to obtain the ANOVA table for a linear regression model. Let us apply it to our IO_lm linear regression model.
Time for action – ANOVA and the condence intervals
The R funcons anova and confint respecvely help obtain the ANOVA table
and condence intervals from the lm objects. Here, we use them for the IO_lm
regression object.
1. Use the anova funcon on the IO_lm lm object to obtain the ANOVA table by using
IO_anova <- anova(IO_lm).
2. Display the ANOVA table by keying in IO_anova in the console.
3. The 95 percent condence intervals for the intercept and the No_of_IO variable
is obtained by confint(IO_lm).
The output in R is as follows:
Figure 3: ANOVA and the confidence intervals for the simple linear regression model
The ANOVA table conrms that the variable No_of_IO is signicant indeed. Note
the dierence of the criteria for conrming this with respect to summary(IO_lm).
In the former case, the signicance was arrived at using the t-stascs and here we have
used the F-stasc. Precisely, we check for the variance signicance of the input variable.
We now give the tool for obtaining condence intervals.
Check whether or not the esmated value of the parameters fall within the 95 percent
condence intervals. The preceding results show that we indeed have a very good linear
regression model. However, we also made a host of assumpons in the beginning of the
secon, and a good pracce is to ask how valid they are in the experiment. We next consider
the problem of validaon of the assumpons.
What just happened?
The ANOVA table is a very fundamental block for a regression model, and it gives the split of the sum of squares between the variable(s) and the error term. The difference between the ANOVA and the summary of the linear model object lies in the respective p-values reported by them as Pr(>F) and Pr(>|t|). You also found a method for obtaining the confidence intervals for the regression coefficients.
Model validation
The violaons of the assumpons may arise in more than one way. Taar, et. al. (2012),
Kutner, et. al. (2005) discusses the numerous ways in which the assumpons are violated
and an adapon of the methods menoned in these books is now considered:
The regression funcon is not linear. In this case, we expect the residuals to have
a paern which is not linear when viewed against the regressors. Thus, a plot of the
residuals against the regressors is expected to indicate if this assumpon is violated.
The error terms do not have constant variance. Note that we made an assumpon
stang that
( )
2
0N,
ε∼ σ
, that is the magnitude of errors do not depend on the
corresponding x or y value. Thus, we expect the plot of the residuals against the
predicted y values to reveal if this assumpon is violated.
The error terms are not independent. A plot of the residuals against the serial
number of the observaons indicated if the error terms are independent or not.
We typically expect this plot to exhibit a random walk if the errors are
independent. If any systemac paern is observed, we conclude that the errors
are not independent.
The model ts all but one or a few outlier observaons. Outliers are a huge concern
in any analycal study as even a single outlier has a tendency to destabilize the
enre model. A simple boxplot of the residuals indicates the presence of an outlier.
If any outlier is present, such observaons need to be removed and the model
needs to be rebuilt. The current step of model validaon needs to be repeated for
the rebuilt model. In fact the process needs to be iterated unl there are no more
outliers. However, we need to cauon the reader that if the subject experts feel
that such outliers are indeed expected values, it may convey that some appropriate
variables are missing in the regression model.
The error terms are not normally distributed. This is one of the most crucial assumptions of the linear regression model. The violation of this assumption is verified using the normal probability plot, in which the expected values of the residuals (obtained from the normal cumulative probabilities) are plotted against the observed residuals. If the values fall along a straight line, the normality assumption for the errors holds true. The model is to be rejected if this assumption is violated.
The next section shows you how to obtain the residual plots for the purpose of model validation.
Time for action – residual plots for model validation
The R funcons resid and fitted can be used to extract residuals and ed values from
an lm object.
1. Find the residuals of the ed regression model using the resid funcon: IO_lm_
resid <- resid(IO_lm).
2. We need six plots, and hence we invoke the graphics editor with par(mfrow =
c(3,2)).
3. Sketch the plot of residuals against the predictor variable with plot(No_of_IO,
IO_lm_resid).
4. To check whether the regression model is linear or not, obtain the plots of absolute
residual values against the predictor variable and also that of squared residual
values against the predictor variable respecvely with plot(No_of_IO, abs(IO_
lm_resid),…) and plot(No_of_IO, IO_lm_resid^2,…).
5. The assumpon that errors have constant variance may be veried by the plot of
residuals against the ed values of the regressand. The required plot is obtained by
using plot(IO_lm$fitted.values,IO_lm_resid).
6. The assumpon that the errors are independent of each other may be veried
plong the residuals against their index numbers: plot.ts(IO_lm_resid).
7. The presence of outliers is investigated by the boxplot of the residuals: boxplot(IO_lm_resid).
8. Finally, the assumption of normality for the error terms is verified through the normal probability plot. This plot goes on a new graphics page.
The complete R program is as follows:
IO_lm_resid <- resid(IO_lm)
par(mfrow=c(3,2))
plot(No_of_IO, IO_lm_resid, main="Plot of Residuals Vs Predictor Variable",
     ylab="Residuals", xlab="Predictor Variable")
plot(No_of_IO, abs(IO_lm_resid), main="Plot of Absolute Residual Values Vs Predictor Variable",
     ylab="Absolute Residuals", xlab="Predictor Variable")
# Equivalently
plot(No_of_IO, IO_lm_resid^2, main="Plot of Squared Residual Values Vs Predictor Variable",
     ylab="Squared Residuals", xlab="Predictor Variable")
plot(IO_lm$fitted.values, IO_lm_resid, main="Plot of Residuals Vs Fitted Values",
     ylab="Residuals", xlab="Fitted Values")
plot.ts(IO_lm_resid, main="Sequence Plot of the Residuals")
boxplot(IO_lm_resid, main="Box Plot of the Residuals")
rpanova <- anova(IO_lm)
IO_lm_resid_rank <- rank(IO_lm_resid)
tc_mse <- rpanova$Mean[2]
IO_lm_resid_expected <- sqrt(tc_mse)*qnorm((IO_lm_resid_rank-0.375)/(length(CPU_Time)+0.25))
plot(IO_lm_resid_expected, IO_lm_resid, xlab="Expected", ylab="Residuals",
     main="The Normal Probability Plot")
abline(0,1)
The resulng plot for the model validaon plot is given next. If you run the preceding R
program up to the rpanova code, you will nd the plot similar to the following:
Figure 4: Checking for violations of assumptions of IO_lm
We have used the resid funcon to extract the residuals out of the lm object. The rst plot
of residuals against the predictor variable No_of_IO shows that more the number of IO
processes, the larger is the residual value, as is also conrmed by Plot of Absolute Residual
Values Vs Predictor Variable and Plot of Squared Residual Values Vs Predictor Variable.
However, there is no clear non-linear paern suggested here. The Plot of Residuals Vs
Fied Values is similar to the rst plot of residuals against the predictor. The me series plot
of residuals does not indicate a strict determinisc trend and appears a bit similar to the
random walk. Thus, these plots do not give any evidence of any kind of dependence among
the observaons. The boxplot does not indicate any presence of an outlier.
The normal probability plot for the residuals is given next:
Figure 5: The normal probability plot for IO_lm
As all the points fall close to the straight line, the normality assumption for the errors does not appear to be violated.
What just happened?
The R program given earlier produces various residual plots, which help in validation of the model. It is important that these plots are always checked whenever a linear regression model is built. For CPU_Time as a function of No_of_IO, the linear regression model is a fairly good model.
Have a go hero
From a theorecal perspecve and my own experience, the seven plots obtained earlier
were found to be very useful. However, R, by default, also gives a very useful set of residual
plots for an lm object. For example, plot(my_lm) generates a powerful set of model
validaon plots. Explore the same for IO_lm with plot(IO_lm). You can explore more
about plot and lm with the plot.lm funcon.
We will next consider the general mulple linear regression model for the Gasoline problem
considered in the earlier chapters.
Multiple linear regression model
In the The simple linear regression model secon, we considered an almost (un)realisc
problem of having only one predictor. We need to extend the model for the praccal
problems when one has more than a single predictor. In Example 3.2.9. Octane rang
of gasoline blends we had a graphical study of mileage as a funcon of various vehicle
variables. In this secon, we will build a mulple linear regression model for the mileage.
If we have X1, X2, …, Xp independent set of variables which have a linear eect on the
dependent variable Y, the mulple linear regression model is given by:
01122pp
Y X X ... X
ββ β β ε
=+ + + ++ +
This model is similar to the simple linear regression model, and we have the same interpretation as earlier. Here, we have additional independent variables in X2, ..., Xp, and their effect on the regressand Y is respectively through the additional regression parameters $\beta_2, \ldots, \beta_p$. Now, suppose we have n pairs of random observations $(Y_1, X_1), \ldots, (Y_n, X_n)$ for understanding the multiple linear regression model, where $X_i = (X_{i1}, \ldots, X_{ip}), i = 1, \ldots, n$. A matrix form representation of the multiple linear regression model is useful in understanding the estimator of the vector of regression coefficients. We define the following quantities:

$$Y = (Y_1, \ldots, Y_n)', \quad \epsilon = (\epsilon_1, \ldots, \epsilon_n)', \quad \beta = (\beta_0, \beta_1, \ldots, \beta_p)', \quad X = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ 1 & X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}$$
The mulple linear regression model for n observaons can be wrien in a compact matrix
form as:
YX'
βε
= +
The least squares esmate of
β
is given by:
( )
1
ˆXX X'Y
β
=
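To connect this formula with what lm does internally, here is a small sketch that computes the least squares estimate directly for the IO_Time data used earlier; it assumes IO_Time and IO_lm are available from the previous sections.
X <- cbind(1, IO_Time$No_of_IO)            # design matrix with an intercept column
y <- IO_Time$CPU_Time
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'y
beta_hat
coef(IO_lm)                                # should match the lm() estimates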
Let us t a mulple linear regression model for the Gasoline mileage data considered earlier.
Averaging k simple linear regression models or a multiple linear
regression model
We already know how to build a simple linear regression model. Why should we learn another theory when an extension is possible in a certain manner? Intuitively, we can build k models, each consisting of the kth variable, and then simply average over the k models. Such an averaging can also be considered for the univariate case too. Here, we may divide the covariate over distinct intervals, then build a simple linear regression model over each interval, and finally average over the different models. Montgomery, et. al. (2001), pages 120-122, highlights the drawback of such an approach. Typically, the simple linear regression models may indicate the wrong sign for the regression coefficients. The wrong sign from such a naïve approach may arise for multiple reasons: the range of some regressors has been restricted, critical regressors may have been omitted from the model building, or some computational errors may have crept in.
To drive home the point, we consider an example from Montgomery, et. al.
Time for action – averaging k simple linear regression models
We will build three models here. We have a vector of regressand y and two covariates, x1 and x2.
1. Enter the dependent variable and the independent variables with y <- c(1,5,3,8,5,3,10,7), x1 <- c(2,4,5,6,8,10,11,13), and x2 <- c(1,2,2,4,4,4,6,6).
2. Visualize the relationships among the variables with:
par(mfrow=c(1,3))
plot(x1,y)
plot(x2,y)
plot(x1,x2)
3. Build the individual simple linear regression models and our first multiple regression model with:
summary(lm(y~x1))
summary(lm(y~x2))
summary(lm(y~x1+x2)) # Our first multiple regression model
Figure 6: Averaging k simple linear regression models
What just happened?
The visual plot (the preceding screenshot) indicates that both x1 and x2 have a positive impact on y, and this is also captured in lm(y~x1) and lm(y~x2); see the R output display. We have omitted the scatter plot, though you should be able to see it on your screen after running the R code of step 2 above. However, both models are built under the assumption that the information contained in x1 or x2 alone is complete. The variables are also seen to have a significant effect on the output. However, metrics such as Multiple R-squared and Adjusted R-squared are very poor for both simple linear regression models. This is one indication that we need to bring in more information, and thus we include both variables and build our first multiple linear regression model; see the next section for more details. There are two important changes worth registering now. First, the sign of the regression coefficient of x1 becomes negative, which contradicts the intuition. The second observation is the great increase in the R-squared metric value.
To summarize our observations here, it suffices to say that the sum of the parts may sometimes fall way short of the entire picture.
Building a multiple linear regression model
The R funcon lm remains the same as earlier. We will connue Example 3.2.9. Octane rang
of gasoline blends from the Visualizaon techniques for connuous variable data secon of
Chapter 3, Data Visualizaon. Recall that the variables, independent and dependent as well,
are stored in the dataset Gasoline in the RSADBE package. Now, we tell R that y, which is
the mileage, is the dependent variable, and we need to build a mulple linear regression
model which includes all other variables of the Gasoline object. Thus, the formula is
specied by y~. indicang that all other variables from the Gasoline object need to be
treated as the independent variables. We proceed as earlier to obtain the summary of the
ed mulple linear regression model.
Time for action – building a multiple linear regression model
The method of building a mulple linear regression model remains the same as earlier.
If all the variables in data.frame are to be used, we use the formula y ~ .. However,
if we need specic variables, say x1 and x3, the formula would be y ~ x1 + x3.
1. Build the mulple linear regression model with gasoline_lm <- lm(y~.,
data=Gasoline). Here, the formula y~. considers the variable y as the
dependent variable and all the remaining variables in the Gasoline data
frame as independent variables.
2. Get the details of the ed mulple linear regression model with
summary(gasoline_lm).
The R screen then appears as follows:
Figure 7: Building the multiple linear regression model
As with the simple model, we need to first check whether the overall model is significant by looking at the p-value of the F-statistic, which appears in the last line of the summary output. Here, the value 0.0003 being very close to zero, the overall model is significant. Of the 11 variables specified for modeling, only x1 and x3, that is, the engine displacement and torque, are found to have a meaningful linear effect on the mileage. The estimated regression coefficient values indicate that the engine displacement has a negative impact on the mileage, whereas the torque impacts it positively. These results are in agreement with the basic science of vehicle mileage.
We have a tricky output for the eleventh independent variable, which for some strange reason R has renamed as x11M. We need to explain this. You should verify the output of running sapply(Gasoline, class) on the console. Now, the x11 variable is a factor variable assuming two possible values, A and M, which stand for the transmission box being Automatic or Manual. As categorical variables are of a special nature, they need to be handled differently. The user may be tempted to skip this, as the variable is seen to be insignificant in this case. However, the interpretation is very useful and the "skip" may prove really expensive later. For computational purposes, an m-level factor variable is used to create m-1 new variables. With R's default coding, one level acts as the baseline: if the variable assumes one of the remaining m-1 levels, the corresponding new variable takes the value 1, else 0, and if the variable assumes the baseline level, all the (m-1) new variables take the value 0. Now, R names each such vector by concatenating the variable name and the factor level it represents. Hence, we have x11M as the variable name in the output; a small sketch of this coding is given below. Here, we found the factor variable to be insignificant. If in certain experiments we find some factor levels to be significant at a certain p-value, we can't ignore the other factor levels even if their p-values suggest them as insignificant.
What just happened?
The building of a mulple linear regression model is a straighorward extension of the
simple linear regression model. The interpretaon is where one has to be more careful
with the mulple linear regression model.
We will now look at the ANOVA and condence intervals for the mulple linear regression
model. It is to be noted that the usage is not dierent from the simple linear regression
model, as we are sll dealing with the lm object.
The ANOVA and condence intervals for the multiple linear
regression model
Again, we use the anova and confint funcons to obtain the required results. Here,
the null hypothesis of interest is whether all the regression coecients equal 0, that is
00 10
p
H:
ββ β
==
==
against the alternave that at least one of the regression
coecients is dierent from 0, that is 10
0H:
β
for at least one j = 0, 1, …, p.
Time for action – the ANOVA and condence intervals for the
multiple linear regression model
The use of anova and confint extend in a similar way as lm is used for simple and mulple
linear regression models.
1. The ANOVA table for the mulple regression model is obtained in the same way as
for the simple regression model, aer all we are dealing with the object of class lm:
gasoline_anova<-anova(gasoline_lm).
2. The condence intervals for the independent variables are obtained by using
confint(gasoline_lm).
The R output is given as follows:
Figure 8: The ANOVA and confidence intervals for the multiple linear regression model
Note the dierence between the anova and summary results. Now, we nd only the rst
variable to be signicant. The interpretaon of the condence intervals is le to you.
What just happened?
The extension from simple to mulple linear regression model in R, especially for the ANOVA
and condence intervals, is really straighorward.
Have a go hero
Using the ANOVA table in the preceding screenshot and the summary of gasoline_lm in the screenshot given in step 2 of the Time for action – building a multiple linear regression model section, build linear regression models using the significant variables only. Are you amused?
Useful residual plots
In the context of mulple linear regression models, modicaons of the residuals have been
found to be more useful than the residuals themselves. We have assumed the residuals to
follow a normal distribuon with mean zero and unknown variance. An esmator of the
unknown variance is provided by the mean residual sum of squares. There are four useful
types of residuals for the current model:
Standardized Residuals: We know that the residuals have zero mean. Thus,
the standardized residuals are obtained by scaling the residuals with the esmator
of the standard deviaon, that is the square root of the mean residual sum of
squares. The standardized residuals are dened by:
i
i
Re s
e
dMS
=
Here,
( )
2
1
n
Re s i i
MS e/np
=
=∑
, and p is the number of covariates in the model.
The residual is expected to have mean 0 and Re s
MS
is an esmate of its variance.
Hence, we expect the standardized residuals to have a standard normal distribuon.
This in turn helps us to verify whether the normality assumpon for the residuals is
meaningful or not.
Semi-studenzed Residuals: The semi-studenzed residuals are dened by:
( )
1
1
i
i
Re s ii
e
r ,i,...,n
MS h
= =
Here, $h_{ii}$ is the ith diagonal element of the matrix $H = X(X'X)^{-1}X'$. The variance of a residual depends on the covariate value, and hence a flat scaling by $\sqrt{MS_{Res}}$ is not appropriate. A correction is provided by $(1 - h_{ii})$, and $MS_{Res}(1 - h_{ii})$ turns out to be an estimate of the variance of $e_i$. This is the motivation for the semi-studentized residual $r_i$.
PRESS residuals: The predicted residual, PRESS, for observation i is the difference between the actual value $y_i$ and the value predicted for it by a regression model based on the remaining (n-1) observations. Now let $\hat{\beta}_{(i)}$ be the estimator of the regression coefficients based on the (n-1) observations (not including the ith observation). Then, the PRESS residual for observation i is given by:

$$e_{(i)} = y_i - x_i' \hat{\beta}_{(i)}, \quad i = 1, \ldots, n$$

Here, the idea is that the estimate of the residual for an observation is more appropriate when obtained from a model which is not influenced by its own value.
R-student residuals: This residual is especially useful for the detection of outliers.

$$t_i = \frac{e_i}{\sqrt{MS_{Res(i)}(1 - h_{ii})}}, \quad i = 1, \ldots, n$$

Here, $MS_{Res(i)}$ is an estimator of the variance $\sigma^2$ based on the remaining (n-1) observations. The scaling change is on similar lines as with the studentized residuals.
The task of building n linear models may look daunting! However, there are very useful formulas in statistics and functions in R which save the day for us. It is appropriate that we use those functions and develop the residual plots for the Gasoline dataset. Let us set ourselves up for some action.
Time for action – residual plots for the multiple linear
regression model
R funcons resid, hatvalues, rstandard, and rstudent are available, which can be
applied on an lm object to obtain the required residuals.
1. Get the MSE of the regression model with gasoline_lm_mse <- gasoline_anova$Mean[length(gasoline_anova$Mean)].
2. Extract the residuals with the resid function, and standardize the residuals using stan_resid_gasoline <- resid(gasoline_lm)/sqrt(gasoline_lm_mse).
3. To obtain the semi-studentized residuals, we first need to get the h_ii elements, which are obtainable using the hatvalues function: hatvalues(gasoline_lm). The remaining code is given at the end of this list.
4. The PRESS residuals are calculated using the rstandard function available in R.
5. The R-student residuals can be obtained using the rstudent function in R.
The detailed code is as follows:
# Useful residual plots
gasoline_lm_mse <- gasoline_anova$Mean[length(gasoline_anova$Mean)]
stan_resid_gasoline <- resid(gasoline_lm)/sqrt(gasoline_lm_mse)
# Standardizing the residuals
studentized_resid_gasoline <- resid(gasoline_lm)/(sqrt(gasoline_lm_mse*(1-hatvalues(gasoline_lm))))
# Studentizing the residuals
pred_resid_gasoline <- rstandard(gasoline_lm)
pred_student_resid_gasoline <- rstudent(gasoline_lm)
# returns the R-student prediction residuals
gasoline_fitted <- fitted(gasoline_lm)   # fitted values, needed for the plots below
par(mfrow=c(2,2))
plot(gasoline_fitted, stan_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("Standardized Residual Plot")
plot(gasoline_fitted, studentized_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("Studentized Residual Plot")
plot(gasoline_fitted, pred_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("PRESS Plot")
plot(gasoline_fitted, pred_student_resid_gasoline, xlab="Fitted", ylab="Residuals")
title("R-Student Residual Plot")
All four residual plots in the following screenshot look identical, though there is a difference in their y-scaling. It is apparent from the residual plots that there are no patterns which show the presence of non-linearity, that is, the linearity assumption appears valid. In the standardized residual plot, all the observations are well within -3 and 3. Thus, it is correct to say that there are no outliers in the dataset.
Figure 9: Residual plots for the multiple linear regression model
What just happened?
Using the resid, rstudent, rstandard, and other functions, we have obtained useful residual plots for the multiple linear regression model.
Regression diagnostics
In the Useful residual plots subsecon, we saw how outliers may be idened using
the residual plots. If there are outliers, we need to ask the following quesons:
Is the observaon an outlier due to an anomalous value in one or more
covariate values?
Is the observaon an outlier due to an extreme output value?
Is the observaon an outlier because of both the covariate and output values being
extreme values?
The disncon in the nature of an outlier is vital as one needs to be sure of its type. The
techniques for outlier idencaon are certainly dierent as is their impact. If the outlier
is due to the covariate value, the observaon is called a leverage point, and if it is due to
the y value, we call it an inuenal point. The rest of the secon is for the exact stascal
technique for such an outlier idencaon.
Leverage points
As noted, a leverage point has an anomalous x value. The leverage points may be theoretically proved not to impact the estimates of the regression coefficients. However, these points are known to drastically affect the $R^2$ value. The question then is, how do we identify such points? The answer is by looking at the diagonal elements of the hat matrix $H = X(X'X)^{-1}X'$. Note that the matrix H is of the order $n \times n$. The (i, i) element of the hat matrix, $h_{ii}$, may be interpreted as the amount of leverage of observation i on the fitted value $\hat{y}_i$. The average size of a leverage is $\bar{h} = p/n$, where p is the number of covariates and n is the number of observations. It is better to leave out an observation if its leverage value is greater than twice $p/n$, and we then conclude that the observation is a leverage point.
Let us go back to the Gasoline problem and see the leverage of all the observations. In R, we have a function, hatvalues, which readily extracts the diagonal elements of H; a short sketch follows. The R output is given in the next screenshot.
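A minimal sketch of the computation, assuming gasoline_lm is the model fitted earlier:
lev <- hatvalues(gasoline_lm)        # diagonal elements of the hat matrix
n <- length(lev)
p <- length(coef(gasoline_lm)) - 1   # number of covariates, as counted in the text
which(lev > 2 * p / n)               # observations flagged as leverage points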
Clearly, we have 10 observaons which are leverage points. This is indeed a maer of
concern as we have only about 25 observaons. Thus, the results of the linear model need
to be interpreted with cauon! Let us now idenfy the inuenal points for the Gasoline
linear model.
Inuential points
An inuenal point has a tendency to pull the regression line (plane) towards its direcon
and hence, they drascally aect the values of the regression coecients. We want to
idenfy the impact of an observaon on the regression coecients, and one approach is
to consider how much the regression coecient values change if the observaon was not
considered. The relevant mathemacs for idencaon of inuenal points is beyond the
scope of the book, so we simply help ourselves with the metric Cook's distance which nds
the inuenal points. The R funcon, cooks.distance, returns the values of the Cook's
distance for each observaon, and the thumb rule is that if the distance is greater than 1,
the observaon is an inuenal point. Let us use the R funcon and idenfy the inuenal
points for the Gasoline problem.
Figure 10: Leverage and influential points of the gasoline_lm fitted model
For this dataset, we have only one inuenal point in Eldorado. The plot of Cook's distance
against the observaon numbers and that of Cook's distance against the leverages may be
easily obtained with plot(gasoline_lm,which=c(4,6)).
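A minimal sketch, assuming gasoline_lm is available:
cd <- cooks.distance(gasoline_lm)
sort(cd, decreasing = TRUE)[1:3]     # the largest Cook's distances
which(cd > 1)                        # thumb rule: distance greater than 1
plot(gasoline_lm, which = c(4, 6))   # the Cook's distance plots mentioned above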
DFFITS and DFBETAS
Belsley, Kuh, and Welsch (1980) proposed two additional metrics for finding the influential points. The DFBETAS metric indicates the change in the regression coefficients (in standard deviation units) if the ith observation is removed. Similarly, DFFITS is a metric which gives the impact on the fitted value $\hat{y}_i$. The rule which indicates the presence of an influential point using DFFITS is $|DFFITS_i| > 2\sqrt{p/n}$, where p is the number of covariates and n is the number of observations. Finally, an observation i is influential for regression coefficient j if $|DFBETAS_{j,i}| > 2/\sqrt{n}$.
Figure 11: DFFITS and DFBETAS for the Gasoline problem
We have given the DFFITS and DFBETAS values for the Gasoline dataset. It is left as an exercise to the reader to identify the influential points from the outputs given above; a starting sketch is given below.
The multicollinearity problem
One of the important assumpons of the mulple linear regression model is that the
covariates are linearly independent. The linear independence here is the sense of Linear
Algebra that a vector (covariate in our context) cannot be expressed as a linear combinaon
of others. Mathemacally, this assumpon translates into an implicaon that
( )
1
X'X is
nonsingular, or that its determinant is non-zero. If this is not the case then we have one or
more of the following problems:
The esmated will be unreliable, and there is a great chance of the regression
coecients having the wrong sign
The relevant signicant factors will not be idened by either the t-tests or the
F-tests
The importance of certain predictors will be undermined
Let us rst obtain the correlaon matrix for the predictors of the Gasoline dataset.
We will exclude the nal covariate, as it is factor variable.
Figure 12: The correlation matrix of the Gasoline covariates
We can see that covariate x1 is strongly correlated with all the other predictors except x4. Similarly, x8 to x10 are also strongly correlated. This is a strong indication of the presence of the multicollinearity problem.
Dene
( )
1
CX'X
=. Then it can be proved that, refer Montgomery, et. al. (2003), the
jth diagonal element of
C
can be wrien as
( )
1
2
1
jj j
C R
=− , where
2
j
R
is the coecient
of determinaon obtained by regressing all other covariates for xj as the output. Now,
the variable xj is independent of all the other covariates; we expect the coecient of
determinaon to be zero, and hence jj
C
to be closer to unity. However, if the covariate
depends on the others, we expect the coecient of determinaon to be a high value, and
hence the jj
C
to be a large number. The quanty jj
C
is also called the variance inaon
factor, denoted by VIFj. A general guideline for a covariate to be linearly independent of
other covariates is that its VIFj should be lesser than 5 or 10.
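A small sketch of the relation for the covariate x1, assuming the Gasoline data is loaded; the value should agree with the VIF of x1 reported by the vif function used next.
r2_x1 <- summary(lm(x1 ~ . - y - x11, data = Gasoline))$r.squared   # R^2 of x1 regressed on the other numeric covariates
1 / (1 - r2_x1)                                                     # the variance inflation factor of x1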
Time for action – addressing the multicollinearity problem for
the Gasoline data
The mulcollinearity problem is addressed using the vif funcon, which is available from
two libraries: car and faraway. We will use it from the faraway package.
1. Load the faraway package with library(faraway).
2. We need to nd the variance inaon factor (VIF) of the independent variables
only. The covariate x11 is a character variable, and the rst column of the Gasoline
dataset is the regressand. Hence, run vif(Gasoline[,-c(1,12)]) to nd the VIF
of the eligible independent variables.
3. The VIF for x3 is highest at 217.587. Hence, we remove it to nd the VIF among the
remaining variables with vif(Gasoline[,-c(1,4,12)]). Remember that x3 is
the fourth column in the Gasoline data frame.
4. In the previous step, we nd x10 having the maximum VIF at 77.810. Now, run
vif(Gasoline[,-c(1,4,11,12)]) to nd if all VIFs are less than 10.
5. For the rst variable x1 the VIF is 31.956, and we now remove it with
vif(Gasoline[,-c(1,2,4,11,12)]).
6. At the end of the previous step, we have the VIF of x1 at 10.383. Thus, run
vif(Gasoline[,-c(1,2,3,4,11,12)]).
7. Now, all the independent variables have VIF lesser than 10. Hence, we stop
at this step.
8. Removing all the independent variables with VIF greater than 10, we arrive at the
nal model, summary(lm(y~x4+x5+x6+x7+x8+x9,data= Gasoline)).
Figure 13: Addressing the multicollinearity problem for the Gasoline dataset
What just happened?
We used the vif funcon from the faraway package to overcome the problem of
mulcollinearity in the mulple linear regression model. This helped to reduce the
number of independent variables from 10 to 6, which is a huge 40 percent reducon.
The funcon vif from the faraway package is applied to the set of covariates. Indeed,
there is another funcon of the same name from the car package which can be directly
applied on an lm object.
Model selection
The method of removing covariates in the The multicollinearity problem section depended solely on the covariates themselves. However, it happens more often that the covariates in the final model are selected with respect to the output. Computational cost is almost a non-issue these days, especially for not-so-very-large datasets! The question that arises then is, can one retain all possible covariates in the model, or do we have a choice of covariates which meets a certain regression metric, say R² > 60 percent? The problem is that having more covariates increases the variance of the model, while having fewer of them leads to a larger bias. The philosophical Occam's razor principle applies here, and the best model is the simplest model. In our context, the smallest model which fits the data is the best. There are two types of model selection procedures: stepwise procedures and criterion-based procedures. In this section, we will consider both.
Stepwise procedures
There are three methods of selecng covariates for inclusion in the nal model: backward
eliminaon, forward selecon, and stepwise regression. We will rst describe the backward
eliminaon approach and develop the R funcon for it.
The backward elimination
In this model, one rst begins with all the available covariates. Suppose we wish to retain all
covariates for whom the p-value is at the most
α
. The value
α
is referred to as crical alpha.
Now, we rst eliminate that covariate whose p-value is maximum among all the covariates
having p-value greater than
α
. The model is reed for the current covariates. We connue
the process unl we have all the covariates whose p-value is less than
α
. In summary, the
backward eliminaon algorithm is as explained next:
1. Consider all the available covariates.
2. Remove the covariate with the maximum p-value among all the covariates which have a p-value greater than α.
3. Refit the model and repeat the preceding step.
4. Continue the process until all p-values are less than α.
Typically, the user investigates the p-values in the summary output and then carries out the preceding algorithm. Tattar, et. al. (2013) gives a function which executes the entire algorithm right away, and we adapt the same function here and apply it to the linear regression model gasoline_lm.
The forward selection
In the previous procedure we started with all the covariates. Here, we begin with an empty model and look for the most significant covariate with a p-value less than α. That is, we build k new linear models, the kth model containing the kth covariate. Naturally, by "most significant" we mean that the p-value should be the least among all the covariates whose p-value is less than α. Then, we build the model with the selected covariate. A second covariate is selected by treating the previous model as the initial empty model. The model selection is continued until we fail to add any more covariates. This is summarized in the following algorithm:
1. Begin with an empty model.
2. For each covariate, obtain the p-value if it is added to the model. Select the covariate with the least p-value among all the covariates whose p-value is less than α.
3. Repeat the preceding step until no more covariates can be added to the model.
We again use the function created in Tattar, et. al. (2013) and apply it to the gasoline problem; a rough sketch of one possible implementation is also given below.
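Since only the backwardlm function is reproduced later in this chapter, the following is a rough sketch of one possible p-value based forward selection loop built on the add1 function; it assumes the Gasoline data is loaded, and it is an illustration rather than the forwardlm function of Tattar, et. al. (2013).
criticalalpha <- 0.20
scope_formula <- reformulate(setdiff(names(Gasoline), "y"))   # ~ x1 + ... + x11
current <- lm(y ~ 1, data = Gasoline)                         # the empty model
repeat {
  cand <- add1(current, scope = scope_formula, test = "F")    # F-test p-value for each candidate
  pvals <- cand[["Pr(>F)"]][-1]                               # drop the <none> row
  if (all(is.na(pvals)) || min(pvals, na.rm = TRUE) >= criticalalpha) break
  best <- rownames(cand)[-1][which.min(pvals)]
  current <- update(current, as.formula(paste(". ~ . +", best)))
}
summary(current)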
There is yet another method of model selecon. Here, we begin with the empty model. We
add a covariate as in the forward selecon step and then perform a backward eliminaon to
remove any unwanted covariate. Then, the forward and backward steps are connued unl
we can't either add a new covariate or remove an exisng covariate. Of course, the alpha
crical values for forward and backward steps are specied disnctly. This method is called
stepwise regression. This method is however skipped here for the purpose of brevity.
Criterion-based procedures
A useful tool for the model selection problem is to evaluate all possible models and select
one of them according to a certain criterion. The Akaike Information Criterion (AIC) is one such
criterion which can be used to select the best model. Let $\log L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \hat{\sigma}^2 \mid y)$ denote the
log-likelihood function of the fitted regression model. Define K = p + 2, which is the total
number of estimated parameters. The AIC for the fitted regression model is given by:

$$\mathrm{AIC} = -2 \log L\left(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \hat{\sigma}^2 \mid y\right) + 2K$$

Now, the model which has the least AIC among the candidate models is the best model.
The step function available in R gets the job done for us, and we will close the chapter
with the continued illustration of the Gasoline problem.
Time for action – model selection using the backward, forward,
and AIC criteria
For the forward and backward selection procedures under the stepwise procedures of the
model selection problem, we first define two functions: backwardlm and forwardlm.
However, for the criterion-based model selection, say AIC, we use the step function, which
can be applied to the fitted linear models.
1. Create a function pvalueslm which extracts the p-values related to the covariates
of an lm object:
pvalueslm <- function(lm) {summary(lm)$coefficients[-1,4]}
2. Create a backwardlm function defined as follows:
backwardlm <- function(lm, criticalalpha) {
  lm2 <- lm
  while(max(pvalueslm(lm2)) > criticalalpha) {
    # drop the covariate with the largest p-value and refit the model
    lm2 <- update(lm2, paste(".~.-",
      attr(lm2$terms, "term.labels")[which(pvalueslm(lm2) == max(pvalueslm(lm2)))],
      sep = ""))
  }
  return(lm2)
}
The code needs to be explained in more detail. There are two new functions created
here for the implementation of the backward elimination procedure. Let us have a
detailed look at them. The function pvalueslm extracts the p-values related to the
covariates of an lm object. The choice of summary(lm)$coefficients[-1,4]
is vital, as we are interested in the p-values of the covariates and not the intercept
term. The p-values are available once the summary function is applied on the lm
object. Now, let us focus on the backwardlm function. Its arguments are the lm
object and the value of the critical α. Our goal is to carry out the iterations until we
do not have any more covariates with a p-value greater than α. Thus, we use the
while function, which is typical of algorithms where the stopping condition appears
at the beginning of a function/program. We want our function to work for all
linear models and not just gasoline_lm, so we need to get the names of the
covariates which are specified in the lm object. Remember, we conveniently used
the formula lm(y~.) and this will come back to haunt us! Thankfully, attr(lm$terms,
"term.labels") extracts all the covariate names of an lm object. The argument
[which(pvalueslm(lm2)==max(pvalueslm(lm2)))] identifies the
covariate which has the maximum p-value above α. Next, paste(".~.-
",attr(), sep="") returns the formula which removes the
unwanted covariate. The explanation of the formula is lengthier than the function
itself, which is not surprising, as R is object-oriented and a few lines of code do more
actions than detailed prose.
3. Obtain the ecient linear regression model by applying the backwardlm funcon,
with the crical alpha at 0.20 on the Gasoline lm object:
gasoline_lm_backward <- backwardlm(gasoline_lm,criticalalpha=0.20)
4. Find the details of the final model obtained in the previous step:
summary(gasoline_lm_backward)
The output as a result of applying the backward elimination algorithm is the following:
Figure 14: The backward selection model for the Gasoline problem
5. The forwardlm function is given by:
forwardlm <- function(y, x, criticalalpha) {
  yx <- data.frame(y, x)
  mylm <- lm(y ~ -., data = yx)                  # start with the empty (intercept-only) model
  avail_cov <- attr(mylm$terms, "dataClasses")[-1]
  minpvalues <- 0
  while(minpvalues < criticalalpha) {
    pvalues_curr <- NULL
    for(i in 1:length(avail_cov)) {
      # add the i-th available covariate and record its p-value
      templm <- update(mylm, paste(".~.+", names(avail_cov[i])))
      mypvalues <- summary(templm)$coefficients[, 4]
      pvalues_curr <- c(pvalues_curr, mypvalues[length(mypvalues)])
    }
    minpvalues <- min(pvalues_curr)
    if(minpvalues < criticalalpha) {
      include_me_in <- min(which(pvalues_curr < criticalalpha))
      mylm <- update(mylm, paste(".~.+", names(avail_cov[include_me_in])))
      avail_cov <- avail_cov[-include_me_in]
    }
  }
  return(mylm)
}
6. Apply the forwardlm function on the Gasoline dataset:
gasoline_lm_forward <- forwardlm(Gasoline$y,Gasoline[,-1],
criticalalpha=0.2)
7. Obtain the details of the finalized model with summary(gasoline_lm_forward).
The output in R is as follows:
Figure 15: The forward selection model for the Gasoline dataset
Note that the forward selection and backward elimination have resulted in two
different models. This is to be expected and is not a surprise, and in such scenarios,
one can pick either of the models for further analysis/implementation.
The understanding of the construction of the forwardlm function is left as
an exercise to the reader.
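Although stepwise regression was skipped earlier for brevity, a crude version of it can be pieced together from the two functions just created; the sketch below uses the same critical alpha for both directions purely for illustration, whereas the two alphas would normally be specified distinctly.
# A rough stepwise sketch: a forward pass followed by a backward clean-up
gasoline_lm_step <- backwardlm(
  forwardlm(Gasoline$y, Gasoline[,-1], criticalalpha = 0.2),
  criticalalpha = 0.2)
summary(gasoline_lm_step)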
8. The step function in R readily gives the best model using AIC:
step(gasoline_lm, direction="both").
Figure 16: Stepwise AIC
Backward and forward selection can also be performed using AIC with the options
direction="backward" and direction="forward"; for a forward search, the scope
argument must additionally specify the largest model to be considered.
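The following is a rough sketch of these two calls (assuming the Gasoline data frame and gasoline_lm from the running example are in the workspace):
# Backward elimination by AIC, starting from the full model
step(gasoline_lm, direction = "backward")
# Forward selection by AIC: start from the intercept-only model and pass the
# full model's (expanded) formula as the upper scope
null_lm <- lm(y ~ 1, data = Gasoline)
step(null_lm, scope = formula(terms(gasoline_lm)), direction = "forward")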
What just happened?
We used two customized functions, backwardlm and forwardlm, for the backward and forward
selection procedures. The step function has been used for the model selection problem based
on the AIC.
Have a go hero
The supervisor performance data is available in the SPD dataset from the RSADBE package.
Here, Y (the regressand) represents the overall rating of the job done by a supervisor.
The overall rating depends on six other inputs/regressors. Find more details about
the dataset with ?SPD. First, visualize the dataset with the pairs function. Fit a multiple
linear regression model for Y and complete the necessary regression tasks, such as model
validation, regression diagnostics, and model selection.
Summary
In this chapter, we learned how to build a linear regression model, check for violations of the
model assumptions, fix the multicollinearity problem, and finally how to find the best model.
Here, we were aided by two important assumptions: the output being a continuous variable,
and the normality assumption for the errors. The linear regression model provides the best
footing for general regression problems. However, when the output variable is discrete,
binary, or multi-category data, the linear regression model lets us down. This is not actually
a letdown, as it was never intended to solve this class of problem. Thus, our next chapter
will focus on the problem of regression models for binary data.
7
The Logistic Regression Model
In this chapter we will consider regression models where the regressand is
dichotomous or binary in nature. The data is of the form $(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$,
where the dependent variables $Y_i$, i = 1, ..., n, are the observed binary outputs
assumed to be independent (in the statistical sense) of each other, and the vectors $X_i$,
i = 1, ..., n, are the covariates (independent variables in the sense of a regression
problem) associated with $Y_i$. In the previous chapter we considered the linear
regression model where the regressand was assumed to be continuous along
with the assumption of normality for the error distribution. Here, we will first
consider a Gaussian (normal) model for the binary regression problem, which is
more widely known as the probit model.
A more generic model has emerged during the past four decades in the form of the logistic
regression model. We will consider the logistic regression model for the rest of the chapter.
The approach in this chapter will be along the following lines:
The binary regression problem
The probit regression model
The logistic regression model
Model validation and diagnostics
Receiver operating characteristic (ROC) curves
Logistic regression for the German credit screening dataset
The binary regression problem
Consider the problem of modeling the completion of a statistics course by students based
on their Scholastic Assessment Test in the subject of mathematics (SAT-M) scores at the
time of their admission. After the completion of the final exams, we know which students
successfully completed the course and which of them failed. Here, the output pass/fail
may be represented by a binary number 1/0. It may be fairly said that the higher the SAT-M
score at the time of admission to the course, the more likely the candidate is to
complete the course. This problem has been discussed in detail in Johnson and Albert
(1999) and Tattar, et al. (2013).
Let us begin by denoting the pass/fail indicator by Y and the entry SAT-M score by X. Suppose
that we have n pairs of observations on the students' scores and their course completion
results. We can build the simple linear regression model for the probability of course
completion $p_i = P(Y_i = 1)$ as a function of the SAT-M score with $p_i = \beta_0 + \beta_1 X_i + \epsilon_i$. The data
from page 77 of Johnson and Albert (1999) is available in the sat dataset of the book's R
package. The columns that contain the data on the variables Y and X are named Pass and
Sat respectively. To build a linear regression model for the probability of completing the
course, we take pi as 1 if Yi = 1, and 0 otherwise. A scatter plot of Pass against Sat indicates
that the students with higher SAT-M scores are more likely to complete the course. The SAT score
varies from 463 to 649, and we then attempt to predict whether students with SAT scores of
400 and 700 would have successfully completed the course or not.
Time for action – limitations of linear regression models
A linear regression model is built for the dataset with a binary output. The model is used
to predict the probabilities for some cases, which shows its limitations:
1. Load the dataset from the RSADBE package with data(sat).
2. Visualize the scatter plot of Pass against Sat with plot(sat$Sat,
sat$Pass, xlab="SAT Score", ylab = "Final Result").
3. Fit the simple linear regression model with passlm <- lm(Pass~Sat,
data=sat) and obtain its summary by summary(passlm). Add the fitted
regression line to the scatter plot using abline(passlm).
4. Make a prediction for students with SAT-M scores of 400 and 700 by using the R
code predict(passlm,newdata=list(Sat=400)) and predict(passlm,
newdata=list(Sat=700),interval="prediction").
Figure 1: Drawbacks of the linear regression model for the classification problem
The linear model is significant, as seen from the p-value of 0.000179 associated with the
F-statistic. Next, Pr(>|t|) associated with the Sat variable is 0.00018, which is again
significant. However, the predicted values for SAT-M marks of 400 and 700 are respectively
seen to be -0.4793 and 1.743. The problem with the model is that the predicted values can
be negative as well as greater than 1. It is essentially these limitations which restrict the use
of the linear regression model when the regressand is a binary outcome.
What just happened?
We used the simple linear regression model for the probability prediction of a binary
outcome and observed that the probabilities are not bound in the unit interval [0,1]
as they are expected to be. This shows that we need special/different statistical
models for understanding the relationship between the covariates and the binary output.
We will use two regression models that are appropriate for a binary regressand: probit
regression and logistic regression. The former model will continue the use of the normal
distribution for the error through a latent variable, whereas the latter uses the binomial
distribution and is a popular member of the more generic generalized linear models.
The Logisc Regression Model
[ 204 ]
Probit regression model
The probit regression model is constructed as a latent variable model. Define a latent
variable, also called an auxiliary random variable, $Y^*$, as follows:

$$Y^* = X'\beta + \epsilon$$

which is the same as the earlier linear regression model with Y replaced by $Y^*$. The error term
$\epsilon$ is assumed to follow a normal distribution $N(0, \sigma^2)$. Then Y is taken as 1 if the
latent variable is positive, that is:

$$Y = \begin{cases} 1, & \text{if } Y^* > 0, \text{ equivalently } \epsilon > -X'\beta, \\ 0, & \text{otherwise.} \end{cases}$$

Without loss of generality, we can assume that $\epsilon \sim N(0, 1)$. Then, the probit model
is obtained by:

$$P(Y = 1 \mid X) = P(Y^* > 0) = P(\epsilon > -X'\beta) = P(\epsilon < X'\beta) = \Phi(X'\beta)$$
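A quick numerical illustration of the probit link (with made-up coefficient values, purely for the sketch) shows that the success probability is just the standard normal CDF evaluated at the linear predictor:
# Hypothetical coefficients, only to illustrate P(Y = 1 | x) = pnorm(beta0 + beta1 * x)
beta0 <- -18; beta1 <- 0.0334
x <- c(450, 550, 650)
pnorm(beta0 + beta1 * x)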
The method of maximum likelihood estimation is used to determine $\beta$. For a random
sample of size n, the log-likelihood function is given by:

$$\log L(\beta) = \sum_{i=1}^{n}\left[ y_i \log \Phi(x_i'\beta) + (1 - y_i)\log\left(1 - \Phi(x_i'\beta)\right)\right]$$

Numerical optimization techniques can be deployed to find the MLE of $\beta$. Fortunately,
we don't have to undertake this daunting task, as R helps us out with the glm function.
Let us fit the probit model for the sat dataset seen earlier.
Time for action – understanding the constants
The probit regression model is built for the Pass variable as a function of the Sat score
using the glm R function and the argument binomial(probit).
1. Using the glm function and the binomial(probit) option, we can fit the probit
model for Pass as a function of the Sat score:
pass_probit <- glm(Pass~Sat,data=sat,binomial(probit))
2. The details about the pass_probit glm object are fetched using
summary(pass_probit).
The summary function does not give a measure of R2, the coefficient of
determination, as we obtained for the linear regression model. In general, such a
measure is not exactly available for the GLMs. However, certain pseudo-R2 measures
are available and we will use the pR2 function from the pscl package. This package
has been developed at the Political Science Computational Laboratory, Stanford
University, which explains the name of the package, pscl.
3. Load the pscl package with library(pscl), and apply the pR2 function
on pass_probit to obtain the measures of pseudo-R2.
Finally, we check how the probit model overcomes the problems posed by the
application of the linear regression model.
4. Find the probability of passing the course for students with SAT-M scores of 400
and 700 with the following code:
predict(pass_probit,newdata=list(Sat=400),type = "response")
predict(pass_probit,newdata=list(Sat=700),type = "response")
The following is a screenshot of the R session:
Figure 2: The probit regression model for SAT problem
The Logisc Regression Model
[ 206 ]
The Pr(>|z|) for Sat is 0.0052, which shows that the variable has a significant say
in explaining whether the student successfully completes the course or not. The regression
coefficient value for the Sat variable indicates that if the Sat variable increases by one
mark, the probit link (the linear predictor) increases by 0.0334. In simple words, the SAT-M
variable has a positive impact on the probability of success for the student. Next, the pseudo-R2
value of 0.3934 for the McFadden metric indicates that approximately 39.34 percent of
the output is explained by the Sat variable. This appears to suggest that we need to collect
more information about the students. That is, the experimenter may try to get information
on how many hours the student spent exclusively on the course/examination, the
students' attendance percentages, and so on. However, the SAT-M score, which may have
been obtained nearly two years before the final exam of the course, continues to have
good explanatory power!
Finally, it may be seen that the probabilities of completing the course for students with SAT-M
scores of 400 and 700 are respectively 2.019e-06 and 1. It is important for the reader to
note the importance of the type = "response" option. More details may be obtained by
running ?predict.glm at the R terminal.
What just happened?
The probit regression model is appropriate for handling binary outputs and is certainly
much more appropriate than the simple linear regression model. The reader learned how to
build the probit model using the glm function, which is in fact more versatile, as will be seen in
the rest of the chapter. The prediction probabilities were also seen to be in the range of 0 to 1.
The glm function can be conveniently used for more than one covariate. In fact, the formula
structure of glm remains the same as that of lm. Model-related issues have not been considered
in full detail till now. The reason is that there is more interest in the logistic regression
model, as it will be the focus for the rest of the chapter, and the logic does not change. In fact,
we will return to the probit model diagnostics in parallel with the logistic regression model.
Logistic regression model
The binary outcomes may be easily viewed as failures or successes, and we have done
the same on many earlier occasions. Typically, it is then common to assume that we
have a binomial distribution for the probability of an observation being successful.
The logistic regression model specifies the linear effect of the covariates as a specific
function of the probability of success. The probability of success for an observation is
denoted by $\pi(x) = P(Y = 1)$ and the model is specified through the logistic function:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}$$
The choice of this function is for fairly good reasons. Define $w = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$. Then,
it may be easily seen that $\pi(x) = e^w/(1 + e^w) = 1/(1 + e^{-w})$. Thus, as w decreases towards negative
infinity, $\pi(x)$ approaches 0, and as w increases towards infinity, $\pi(x)$ approaches 1. For w = 0, $\pi(x)$
takes the value 0.5. The ratio of the probability of success to that of failure is known as the odds
ratio, denoted by OR, and following some simple arithmetic steps, it may be shown that:

$$OR = \frac{\pi(x)}{1 - \pi(x)} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}$$

Taking the logarithm of the odds ratio gets us:

$$\log OR = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

And thus, we finally see that the logarithm of the odds ratio is a linear function of the
covariates. It is actually the term $\log\left(\pi(x_i)/(1 - \pi(x_i))\right)$, which is the form of a logit
function, that this model derives its name from.
The log-likelihood funcon based on the data
( ) ( ) ( )
11 2 2 n n
y,x,y,x,..., y,x
is then:
()
()
0
1 0 1
log log 1
p
jij
j
p
n n x
i j ij
i j i
L y x e
β
β β
=
= = =
 
= − +
 
 
∑ ∑
The preceding expression is indeed a bit complex in nature to obtain an explicit form for an
esmate of
β
. Indeed, a specialized algorithm is required here and it is known as the iterave
reweighted least-squares (IRLS) algorithm. We will not go into the details of the algorithm and
refer the readers to an online paper of Sco A. Czepiel available at http://czep.net/stat/
mlelr.pdf. A raw R implementaon of the IRLS is provided in Chapter 19 of Taar, et. al.
(2013). For our purpose, we will be using the soluon as provided from the glm funcon.
Let us now t the logisc regression model for the Sat-M dataset considered hitherto.
Time for action – tting the logistic regression model
The logisc regression model is built using the glm funcon with the family =
'binomial' opon. We will obtain the pseudo-R2 values using the pR2 funcon from
the pscl package.
1. Fit the logisc regression model for the Pass as a funcon of the Sat using
the opon family = 'binomial' in the glm funcon:
pass_logistic <- glm(Pass~Sat,data=sat,family = 'binomial')
The Logisc Regression Model
[ 208 ]
2. The details of the fitted logistic regression model are obtained using the summary
function: summary(pass_logistic).
In the summary you will see two statistics called Null deviance and Residual deviance.
In general, a deviance is a measure useful for assessing the goodness-of-fit, and for
the logistic regression model it plays a role analogous to that of the residual sum of squares
for the linear regression model. The null deviance is the measure for a model that is
built without using any covariate information, such as Sat, and thus we would expect such
a model to have a large value. If the Sat variable is influencing Pass, we expect
the residual deviance of the fitted model to be significantly smaller than the null deviance.
If the residual deviance is significantly smaller than the null deviance, we
conclude that the covariates have significantly improved the model fit.
3. Find the pseudo-R2 with pR2(pass_logistic) from the pscl package.
4. The overall model significance of the fitted logistic regression model is obtained with:
with(pass_logistic, pchisq(null.deviance - deviance,
df.null - df.residual, lower.tail = FALSE))
The p-value is 0.0001496, which shows that the model is indeed significant. The
p-value for the Sat covariate, Pr(>|z|), is 0.011, which means that this variable
is indeed valuable for understanding Pass. The estimated regression coefficient for
Sat of 0.0578 indicates that an increase of a single mark increases the log-odds of
the candidate passing the course by 0.0578.
A brief explanation of this R code! It may be seen from the output following
summary(pass_logistic) that we have all the terms null.deviance,
deviance, df.null, and df.residual. So, the with function extracts all these
terms from the pass_logistic object and finds the p-value using the pchisq
function based on the difference between the deviances (null.deviance -
deviance) and the correct degrees of freedom (df.null - df.residual).
Figure 3: Logistic regression model for the Sat dataset
5. The condence intervals, with a default 95 percent requirement, for the
parameters of the regression coecients, is extracted using the confint
funcon: confint(pass_logistic).
The ranges of the 95 percent condence intervals do not contain 0 among them,
and hence we conclude that the intercept term and Sat variable are both signicant.
6. The predicon for the unknown scores are obtained as in the probit regression model:
predict.glm(pass_logistic,newdata=list(Sat=400),type = "response")
predict.glm(pass_logistic,newdata=list(Sat=700),type = "response")
7. Let us compare the logistic and probit models. Consider a sequence of hypothetical
SAT-M scores: sat_x = seq(400,700, 10). For the new sequence sat_x, we
predict the probability of course completion using both the pass_logistic and
pass_probit models and visualize whether their predictions are vastly different:
pred_l <- predict(pass_logistic,newdata=list(Sat=sat_x), type=
"response")
pred_p <- predict(pass_probit,newdata=list(Sat=sat_x), type=
"response")
plot(sat_x,pred_l,type="l",ylab="Probability",xlab="Sat_M")
lines(sat_x,pred_p,lty=2)
The Logisc Regression Model
[ 210 ]
The predicon says that a candidate with a SAT-M score of 400 is very unlikely to
complete the course successfully while the one with SAT-M score of 700 is almost
guaranteed to complete it. The predicons with probabilies closer to 0 or 1 need
to be taken with a bit cauon since we rarely have enough observaons at the
boundaries of the covariates.
Figure 4: Prediction using the logistic regression
What just happened?
We ed our rst logisc regression model and viewed its various measures which tell
us whether the ed model is a good model or not. Next, we learnt how to interpret
the esmated regression coecients and also had a peek at the pseudo-R2 value.
The importance of condence intervals is also emphasized. Finally, the model has
been used to make predicons for some unobserved SAT-M scores too.
Hosmer-Lemeshow goodness-of-fit test statistic
We may be satisfied with the analysis thus far, but there is always a lot more that we
can do. The hypothesis testing problem here is of the form:

$$H_0: E(Y) = \frac{e^{\sum_{j=0}^{p}\beta_j x_{ij}}}{1 + e^{\sum_{j=0}^{p}\beta_j x_{ij}}} \quad \text{versus} \quad H_1: E(Y) \neq \frac{e^{\sum_{j=0}^{p}\beta_j x_{ij}}}{1 + e^{\sum_{j=0}^{p}\beta_j x_{ij}}}$$

An answer to this hypothesis testing problem is provided by
the Hosmer-Lemeshow goodness-of-fit test statistic. The steps of the construction of this
test statistic are first discussed:
1. Order the ed values using sort and ed funcons.
2. Group the ed values into g classes, the preferred values of g vary between 6-10.
3. Find the observed and expected number in each group.
4. Perform a chi-square goodness-of-t test on the these groups. That is, denote Ojk
for the number of observaons of class k, k = 0, 1, in the group j, j = 1, 2, …, g, and
by Ejk the corresponding expected numbers. The chi-square test stasc is then
given by:
( )
2
2
101
gjk jk
jk.jk
O E
E
χ
==
=∑∑
And it may be proved that under the null-hypothesis
22
2
g
χχ
.
We will use an R program available at http://sas-and-r.blogspot.in/2010/09/
example-87-hosmer-and-lemeshow-goodness.html. It is important to note here
that when we use the code available on the web we verify and understand that such code
is indeed correct.
Time for action – the Hosmer-Lemeshow goodness-of-fit statistic
The Hosmer-Lemeshow goodness-of-fit statistic is one of the very
important metrics for evaluating a logistic regression model. The hosmerlem function
from the preceding web link will be used for the pass_logistic regression model.
1. Extract the fitted values for the pass_logistic model with pass_hat <-
fitted(pass_logistic).
2. Create the function hosmerlem from the previously-mentioned URL:
hosmerlem <- function(y, yhat, g=10) {
  cutyhat = cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
                include.lowest = TRUE)
  obs = xtabs(cbind(1 - y, y) ~ cutyhat)
  expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
  chisq = sum((obs - expect)^2/expect)
  P = 1 - pchisq(chisq, g - 2)
  return(list(chisq=chisq, p.value=P))
}
The Logisc Regression Model
[ 212 ]
What is the hosmerlem function doing here, exactly? Obviously, it is a function of
three variables: the real output values in y, the predicted probabilities in yhat,
and the number of groups g. The cutyhat variable uses the cut function on the
predicted probabilities and assigns each of them to one of the g groups. The obs
matrix obtains the counts Ojk using the xtabs function, and a similar
action is repeated for Ejk. The code chisq = sum((obs - expect)^2/expect)
then obtains the value of the Hosmer-Lemeshow chi-square test statistic, and
using it we obtain the related p-value with P = 1 - pchisq(chisq, g - 2).
Finally, the required values are returned with return(list(chisq=chisq, p.value=P)).
3. Complete the computation of the Hosmer-Lemeshow goodness-of-fit test statistic
for the fitted model pass_logistic with hosmerlem(pass_logistic$y, pass_hat).
Figure 5: Hosmer-Lemeshow goodness-of-fit test
Since there is no significant difference between the observed and predicted y values, we
conclude that the fitted model is a good fit. Now that we know that we have a good
model on hand, it is time to investigate how valid the model assumptions are.
What just happened?
We used R code from the web and successfully adapted it to the problem on our hands!
Particularly, the Hosmer-Lemeshow goodness-of-fit test is a vital metric for understanding
the appropriateness of a logistic regression model.
Model validation and diagnostics
In the previous chapter we saw the ulity of residual techniques. A similar technique is also
required for the logisc regression model and we will develop these methods for the logisc
regression model in this secon.
Residual plots for the GLM
In the case of the linear regression model, we had explored the role of residuals for the
purpose of model validation. In the context of logistic regression, actually the GLM, we have
five different types of residuals for the same purpose:
Response residual: The difference between the actual values and the fitted values
is the response residual, that is, $y_i - \hat{\pi}_i$; in particular it is $1 - \hat{\pi}_i$ if $y_i = 1$ and $-\hat{\pi}_i$ if $y_i = 0$.
Deviance residual: For an observation i, the deviance residual is the signed square
root of the contribution of the observation to the model deviance. That
is, it is given by:

$$r_i^{dev} = \pm\left\{-2\left[y_i \log \hat{\pi}_i + (1 - y_i)\log(1 - \hat{\pi}_i)\right]\right\}^{1/2}$$

where the sign is positive if $y_i \geq \hat{\pi}_i$, and negative otherwise, and $\hat{\pi}_i$ is the predicted
probability of success.
Pearson residual: The Pearson residual is defined by:

$$r_i^{P} = \frac{y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)}}$$

Partial residual: The partial residual of the jth predictor, j = 1, 2, ..., p, for the ith
observation is defined by:

$$r_{ij}^{part} = \frac{y_i - \hat{\pi}_i}{\hat{\pi}_i (1 - \hat{\pi}_i)} + \hat{\beta}_j x_{ij}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p$$

The partial residuals are very useful for identification of the type of transformation
that needs to be performed on the covariates.
Working residual: The working residual for the logistic regression model is given by:

$$r_i^{W} = \frac{y_i - \hat{\pi}_i}{\hat{\pi}_i (1 - \hat{\pi}_i)}$$
The Logisc Regression Model
[ 214 ]
Each of the preceding residual variants is easily obtained using the residuals function; see
?glm.summaries for details. The residual variant is specified through the type option of
the residuals function. We have not given the details related to the probit regression model;
however, the same functions for logistic regression apply there nevertheless. We will obtain
the residual plots against the fitted values and examine the appropriateness of the logistic
and probit regression models.
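As a quick pointer (a sketch, assuming the pass_logistic object fitted earlier is available), each variant is requested through the type argument of residuals:
# The five residual variants of a fitted glm object, selected via the type argument
residuals(pass_logistic, type = "response")[1:5]
residuals(pass_logistic, type = "deviance")[1:5]
residuals(pass_logistic, type = "pearson")[1:5]
residuals(pass_logistic, type = "working")[1:5]
residuals(pass_logistic, type = "partial")[1:5, ]   # one column per covariate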
Time for action – residual plots for the logistic regression
model
The residuals and fitted functions will be used to obtain the residual plots for the
probit and logistic regression models.
1. Initialize a graphics window with three panels using par(mfrow=c(1,3),
oma = c(0,0,3,0)). The oma option ensures that we can appropriately title
the combined output.
2. Plot the response residuals against the fitted values of the pass_logistic
model with:
plot(fitted(pass_logistic), residuals(pass_logistic,"response"),
col= "red", xlab="Fitted Values", ylab="Residuals",cex.axis=1.5,
cex.lab=1.5)
The reason for xlab and ylab has been explained in the earlier chapters.
3. For the purpose of comparison with the probit regression model, add its response
residuals to the previous plot with:
points(fitted(pass_probit), residuals(pass_probit,"response"),
col= "green")
And add a suitable legend and title as follows:
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
title("Response Residuals")
4. Add the horizontal line at 0 with abline(h=0).
5. Repeat the preceding steps for deviance and Pearson residuals with:
plot(fitted(pass_logistic), residuals(pass_logistic,"deviance"),
col= "red", xlab="Fitted Values", ylab="Residuals",cex.axis=1.5,
cex.lab=1.5)
points(fitted(pass_probit), residuals(pass_probit,"deviance"),
col= "green")
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
abline(h=0)
title("Deviance Residuals")
plot(fitted(pass_logistic), residuals(pass_logistic,"pearson"),
col= "red",xlab="Fitted Values",ylab="Residuals",cex.axis=1.5,
cex.lab= 1.5)
points(fitted(pass_probit), residuals(pass_probit,"pearson"), col=
"green")
legend(0.6,0,c("Logistic","Probit"),col=c("red","green"),pch="-")
abline(h=0)
title("Pearson Residuals")
6. Give an appropriate title with title(main="Response, Deviance, and
Pearson Residuals Comparison for the Logistic and Probit
Models", outer=TRUE, cex.main=1.5).
Figure 6: Residual plots for the logistic regression model
In each of the three preceding residual plots we observe two decreasing trends of residuals
whose slope is -1. The reason for such a trend is that the residuals take one of two values at
a point Xi, either $1 - \hat{\pi}_i$ or $-\hat{\pi}_i$. Thus, in these residual plots we always get two linear trends
with slope -1. Clearly, there is not much difference between the residuals of the logistic and
probit models. The Pearson residual graph also indicates the presence of an outlier for the
observation with a residual value lesser than -3.
The Logisc Regression Model
[ 216 ]
What just happened?
The residuals funcon along with the type opon helps in model validaon and
idencaon of some residuals. A good thing is that the same funcon applies on the
logisc as well as the probit regression model.
Have a go hero
In the previous exercise, we left out the investigation using the partial and working types
of residuals. Obtain these plots!
Influence and leverage for the GLM
In the previous chapter we saw how the influential and leverage points are identified in
a linear regression model. It would be a bit difficult to go into the appropriate formulas and
theory for the logistic regression model.
Time for action – diagnostics for the logistic regression
The inuence and leverage points will be idened through the applicaon of the funcons,
such as hatvalues, cooks.distance, dffits, and dfbetas for the pass_logistic
ed model.
1. The high leverage points of a logisc regression model are obtained with
hatvalues(pass_logistic) while the Cooks distance is fetched with cooks.
distance(pass_logistic). The DFFITS and DFBETAS measures of inuence are
obtained by running dfbetas(pass_logistic) and dffits(pass_logistic).
2. The inuence and leverage measures are put together using the cbind funcon:
cbind(hatvalues(pass_logistic),cooks.distance(pass_
logistic),dfbetas(pass_logistic),dffits(pass_logistic))
The output is given in the following screenshot:
Figure 7: Influence measures for the logistic regression model
It is me to interpret these measures.
3. If the hatvalues associated with an observaon is greater than
( )
2 1
p / n+, where
p is the number of covariates considered in the model and n is the number of
observaons, it is considered as a high leverage point. For the pass_logistic
object, we nd the high leverage points with:
hatvalues(pass_logistic)>2*(length(pass_logistic$coefficients)-1)/
length(pass_logistic$y)
The Logisc Regression Model
[ 218 ]
4. An observaon is considered to have great inuence on the parameter esmates
if the Cooks distance, as given by cooks.distance, is greater than 10 percent
quanle of the
()
11
p,np
F
+−+ distribuon, and it is considered highly inuenal if it
exceeds 50 percent quanle of the same distribuon. In terms of R program,
we need to execute:
cooks.distance(pass_logistic)>qf(0.1,length(pass_logistic$
coefficients),length(pass_logistic$y)-length(pass_logistic$
coefficients))
cooks.distance(pass_logistic)>qf(0.5,length(pass_logistic$
coefficients),length(pass_logistic$y)-length(pass_logistic$
coefficients))
Figure 8: Identifying the outliers
The previous screenshot shows that there are eight high leverage points. We also
see that at the 10 percent quantile of the F-distribution we have two influential
points, whereas we don't have any highly influential points.
5. Use the plot function to identify the influential observations suggested by the
DFFITS and DFBETAS measures:
par(mfrow=c(1,3))
plot(dfbetas(pass_logistic)[,1],ylab="DFBETAS - INTERCEPT")
plot(dfbetas(pass_logistic)[,2],ylab="DFBETAS - SAT")
plot(dffits(pass_logistic),ylab="DFFITS")
Figure 9: DFFITS and DFBETAS for the logistic regression model
As with the linear regression model, the DFFITS and DFBETAS are measures of the influence of
the observations on the regression coefficients. The thumb rule for the DFBETAS is that if
their absolute value exceeds 1, the observation has a significant influence on the estimated
coefficients. In our case this does not happen, and we conclude that we do not have
influential observations. The interpretation of DFFITS is left as an exercise.
What just happened?
We adapted the influence measures to the context of generalized linear models,
and especially to the context of logistic regression.
Have a go hero
The inuence and leverage measures were executed on the logisc regression model,
the pass_logistic object in parcular. You also have the pass_probit object!
Repeat the enre exercise of hatvalues, cooks.distance, dffits, and dfbetas
on the pass_probit ed probit model and draw your inference.
The Logisc Regression Model
[ 220 ]
Receiver operating characteristic curves
In the binary classification problem, we have certain scenarios where the comparison
between the predicted and actual class is of great importance. For example, there is
a genuine problem in the banking industry of identifying fraudulent transactions as against
the non-fraudulent transactions. There is another problem of sanctioning loans to customers
who may successfully repay the entire loan versus the customers who will default at some stage
during the loan tenure. Given the historical data, we build a classification model, for
example the logistic regression model.
Now with the logistic regression model, or any other classification model for that matter,
if the predicted probability is greater than 0.5, the observation is predicted as a successful
observation, and as a failure otherwise. We remind ourselves again that success/failure is
defined according to the experiment. At least with the data on hand, we know the true
labels of the observations, and hence a comparison of the true labels with the predicted
labels makes a lot of sense. In an ideal scenario we expect the predicted labels to match
perfectly with the actual labels, that is, whenever the true label stands for success/failure,
the predicted label is also success/failure. However, in a real scenario this is rarely the case.
This means that there are some observations which are predicted as success/failure when
the true labels are actually failure/success. In other words, we make mistakes! It is possible
to put these notes in the form of a table widely known as the confusion matrix.
                           Observed
Predicted        Success                  Failure
Success          True Positive (TP)       False Positive (FP)
Failure          False Negative (FN)      True Negative (TN)
Table 1: The confusion matrix
The abbreviations in parentheses denote the counts of the cases. It may be seen from the
preceding table that the diagonal cells (TP and TN) are the correct predictions made by the
model, whereas the off-diagonal cells (FP and FN) are the mistakes. The following metrics
may be considered for comparison across multiple models:
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
However, it is known that these metrics have a lot of limitations and more robust measures
are required. The answer is provided by the receiver operating characteristic (ROC) curve.
We need two important metrics for the construction of an ROC curve. The true positive rate
(tpr) and the false positive rate (fpr) are respectively defined by:

$$tpr = \frac{TP}{TP + FN}, \qquad fpr = \frac{FP}{TN + FP}$$
ROC graphs are constructed by plotting the tpr against the fpr. We will now explain this
in detail. Our approach will be to explain the algorithm in an Action framework.
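For concreteness, a tiny sketch with made-up counts shows how these quantities are computed in R:
# Made-up confusion-matrix counts, purely to illustrate the metric definitions
TP <- 40; FP <- 5; FN <- 10; TN <- 45
c(accuracy  = (TP + TN) / (TP + TN + FP + FN),
  precision = TP / (TP + FP),
  recall    = TP / (TP + FN),     # recall is the same as the tpr
  fpr       = FP / (TN + FP))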
Time for action – ROC construction
A simple dataset is considered and the ROC construction is explained in a very simple
step-by-step approach:
1. Suppose that the predicted probabilities of n = 10 observations are 0.32, 0.62, 0.19,
0.75, 0.18, 0.18, 0.95, 0.79, 0.24, and 0.59. Create a vector of them as follows:
pred_prob<-c(0.32, 0.62, 0.19, 0.75, 0.18, 0.18, 0.95, 0.79, 0.24,
0.59)
2. Sort the predicted probabilities in decreasing order:
> (pred_prob=sort(pred_prob,decreasing=TRUE))
[1] 0.95 0.79 0.75 0.62 0.59 0.32 0.24 0.19 0.18 0.18
3. Normalize the predicted probabilities of the preceding step to the unit interval:
> pred_prob <- (pred_prob-min(pred_prob))/(max(pred_prob)-min(pred_prob))
> pred_prob
 [1] 1.00000 0.79221 0.74026 0.57143 0.53247 0.18182 0.07792 0.01299 0.00000 0.00000
Now, at each threshold along the previously sorted probabilities, we commit false
positives as well as false negatives. Thus, we want to check, at each point of our
prediction percentiles, the quantum of tpr and fpr. Since ten points are too few,
we now consider a dataset of predicted probabilities and the true labels.
4. Load the illustrative dataset from the RSADBE package with data(simpledata).
5. Set up the threshold vector threshold <- seq(1,0,-0.01).
6. Find the number of positive (success) and negative (failure) cases in the dataset with
P <- sum(simpledata$Label==1) and N <- sum(simpledata$Label==0).
7. Initialize the fpr and tpr vectors with tpr <- fpr <- threshold*0.
The Logisc Regression Model
[ 222 ]
8. Set up the following loop, which computes tpr and fpr at each point of the
threshold vector:
for(i in 1:length(threshold)) {
FP=TP=0
for(j in 1:nrow(simpledata)) {
if(simpledata$Predictions[j]>=threshold[i]) {
if(simpledata$Label[j]==1) TP=TP+1 else FP=FP+1
}
}
tpr[i]=TP/P
fpr[i]=FP/N
}
9. Plot the tpr against the fpr with:
plot(fpr, tpr, "l", xlab="False Positive Rate", ylab="True Positive Rate", col="red")
abline(a=0,b=1)
Figure 10: An ROC illustration
The diagonal line represents the performance of a random classifier, in that it simply says
"Yes" or "No" without looking at any characteristic of an observation. Any good classifier
must sit, or rather be displayed, above this line. The classifier considered here, albeit an
unknown one, seems a much better classifier than the random one. The ROC curve is useful
for comparing competing classifiers in the sense that if one classifier is always above another,
we select the former.
An excellent introductory exposition of the ROC curves is available at the website http://
ns1.ce.sharif.ir/courses/90-91/2/ce725-1/resources/root/Readings/
Model%20Assessment%20with%20ROC%20Curves.pdf.
What just happened?
The construcon of ROC has been demysed! The preceding program is a very primive
one. In the later chapters we will use the ROCR package for the construcon of ROC.
We will next look at a real-world problem.
Logistic regression for the German credit screening
dataset
Millions of applicaons are made to a bank for a variety of loans! The loan may be a personal
loan, home loan, car loan, and so forth. From a bank's perspecve, loans are an asset for them
as obviously the customer pays them interest and over a period of me the bank makes prot.
If all the customers promptly pay back their loan amount, all their tenure equated monthly
installment (EMI) or the complete amount on preclosure of the principal amount, there is
only money to be made. Unfortunately, it is not always the case that the customers pay back
the enre amount. In fact, the fracon of people who do not complete the loan duraon may
also be very small, say about ve percent. However, a bad customer may take away the prots
of may be 20 or more customers. In this hypothecal case, the bank eventually makes more
losses than prot and this may eventually lead to its own bankruptcy.
Now, a loan applicaon form seeks a lot of details about the applicant. The data from these
details in the applicaon can help the bank build appropriate classiers, such as a logisc
regression model, and make predicons about which customers are most likely to turn up as
fraudulent. The customers who have been predicted to default in the future are then declined
the loan. A real dataset of 1,000 customers who had borrowed loan from a bank is available
on the web at http://www.stat.auckland.ac.nz/~reilly/credit-g.arff and
http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).
This data has been made available by Prof. Hofmann and it contains details on 20 variables
related to the customer. It is also known whether the customers defaulted or not. The
variables are described in the following table.
The Logisc Regression Model
[ 224 ]
A detailed analysis of the dataset using R has been done by Sharma, and his very useful
document can be downloaded from cran.r-project.org/doc/contrib/Sharma-
CreditScoring.pdf.
No  Variable  Characteristic  Description
1   checking  integer         Status of existing checking account
2   duration  integer         Duration in month
3   history   integer         Credit history
4   purpose   factor          Purpose
5   amount    numeric         Credit amount
6   savings   integer         Savings account/bonds
7   employed  integer         Present employment since
8   installp  integer         Installment rate in percentage of disposable income
9   marital   integer         Personal status and sex
10  coapp     integer         Other debtors/guarantors
11  resident  integer         Present residence since
12  property  factor          Property
13  age       numeric         Age in years
14  other     integer         Other installment plans
15  housing   integer         Housing
16  existcr   integer         Number of existing credits at this bank
17  job       integer         Job
18  depends   integer         Number of people being liable to provide maintenance for
19  telephon  integer         Telephone
20  foreign   integer         Foreign worker
21  good_bad  factor          Loan defaulter
22  default   integer         good_bad in numeric
We have the German credit data with us in the GC dataset from the RSADBE package.
Let us build a classifier for identifying the good customers from the bad ones.
Time for action – logistic regression for the German credit
dataset
The logisc regression model will be built for credit card applicaon scoring model and an
ROC curve t to evaluate the t of the model.
1. Invoke the ROCR library with library(ROCR).
2. Get the German credit dataset in your current session with data(GC).
3. Build the logisc regression model for good_bad with GC_LR <- glm
(good_bad~.,data=GC,family=binomial()).
4. Run summary(GC_LR) and idenfy the signicant variables. Also answer the
queson of whether the model is signicant?
5. Get the predicons using the predict funcon:
LR_Pred <- predict( GC_LR,type='response')
6. Use the predicon funcon from the ROCR package to set up a predicon object:
GC_pred <- prediction(LR_Pred,GC$good_bad)
The funcon prediction sets up dierent manipulaons required computaons
as required for construcng the ROC curve. Get more details related to it with
?prediction.
7. Set up the performance vector required to obtain the ROC curve with GC_perf <-
performance(GC_pred,"tpr","fpr").
The performance funcon uses the predicon object to set up the ROC curve.
The Logisc Regression Model
[ 226 ]
8. Finally, visualize the ROC curve with plot(GC_perf).
Figure 11: Logistic regression model for the German credit data
The ROC curve shows that the logistic regression is indeed effective in identifying
fraudulent customers.
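As an optional follow-up (a sketch using the same ROCR objects), the area under the ROC curve can be extracted from the prediction object as a single summary number:
# Area under the ROC curve (AUC) for the fitted logistic regression model
GC_auc <- performance(GC_pred, "auc")
unlist(GC_auc@y.values)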
What just happened?
Now, we considered a real-world problem with enough data points. The fitted logistic
regression model gives a good explanation of the fraudulent customers in terms of the
data that is collected about them.
Have a go hero
For simpledata, a raw program was written to draw the ROC curve. Redo that exercise with
a red colour for the curve. Then, using the prediction and performance functions from the ROCR
package, add the ROC curve for simpledata obtained in the previous step in green colour.
What do you expect?
Summary
We started with a simple linear regression model for the binary classification problem
and saw its limitations. The probit regression model, which is an adaptation
of the linear regression model through a latent variable, overcomes the drawbacks of the
straightforward linear regression model. The versatile logistic regression model has been
considered in detail, and we considered the various kinds of residuals that help in model
validation. The influential and leverage point detection has been discussed too, which helps
us build a better model by removing the outliers. A metric in the form of the ROC curve helps
us in understanding the performance of a classifier. Finally, we concluded the chapter with an
application to the important problem of identifying good customers from the bad ones.
Despite the advantages of linearity, we still have many drawbacks with either the linear
regression model or the logistic regression model. The next chapter begins with the family
of polynomial regression models and later considers the impact of regularization.
8
Regression Models with Regularization
In Chapter 6, Linear Regression Analysis, and Chapter 7, The Logistic Regression
Model, we focused on the linear and the logistic regression model. In the model
selection issues with the linear regression model, we found that a covariate
is either selected or not depending on the associated p-value. However, the
rejected covariates are not given any kind of consideration once their p-value
exceeds the threshold. This may lead to discarding covariates even if
they have some say on the regressand. Particularly, the final model may thus
lead to overfitting of the data, and this problem needs to be addressed.
We will rst consider ng a polynomial regression model, without the technical details,
and see how higher order polynomials give a very good t, which actually comes with a
higher price. A more general framework of B-splines is considered next. This approach leads
us to the smooth spline models, which are actually ridge regression models. The chapter
concludes with an extension of the ridge regression for the linear and logisc regression
models. For more details of the coverage, refer to Chapter 2 of Berk (2008) and Chapter 5
of Hase, et. al. (2008). This chapter will unfold on the following topics:
The problem of overng in a general regression model
The use of regression splines for certain special cases
Improving esmators of the regression coecients, and overcoming the
problem of overng with ridge regression for linear and logisc models
The framework of train + validate + test for regression models
Regression Models with Regularizaon
[ 230 ]
The overtting problem
The limitaon of the linear regression model is best understood through an example. I have
created a hypothecal dataset for understanding the problem of overng. A scaerplot of
the dataset is shown in the following gure.
It appears from the scaerplot that for x values up to 6, there is a linear increase in y, and an
eye-bird esmate of the slope is (50 - 10) / (5.5 - 1.75) = 10.67. This slope may
be on account of a linear term or even a quadrac term. On the other hand, the decline in
y-values for x-values greater than 6 is very steep, approximately (10 - 50) / (10 - 6)
= -10. Now, looking at the complete picture, it appears that the output Y depends upon
the higher order of the covariate X. Let us t polynomial curves of various degrees and
understand the behavior of the dierent linear regression models. A polynomial regression
model of degree k is dened as follows:
0 1 2
Y X X ... X
ββ β β ε
=+ + + + +
Here, the terms 2k
X,X,..., X are treated as disnct variables, in the sense that one may
compare the preceding model with the one introduced in the mulple linear regression
model of Chapter 6, Linear Regression Analysis, by dening 2
1 2
k
k
XX,X X,...,
XX
= = = .
The inference for the polynomial regression model proceeds in the same way as the
mulple linear regression with k terms:
Figure 1: A non-linear relationship displayed by a scatter plot
The data for the previous figure is available in the dataset OF from RSADBE. The option poly
is used on the right-hand side of the formula in the lm function for fitting the polynomial
regression models.
Time for action – understanding overfitting
Polynomial regression models are built using the lm function, as we saw earlier, with the
option poly.
1. Read the hypothetical dataset into R by using data(OF).
2. Plot Y against X by using plot(OF$X, OF$Y, "b", col="red", xlab="X",
ylab="Y").
3. Fit the polynomial regression models of orders 1, 2, 3, 6, and 9, and add their fitted
lines against the covariate X with the following code:
lines(OF$X, lm(Y~poly(X,1,raw=TRUE), data=OF)$fitted.values, "b", col="green")
lines(OF$X, lm(Y~poly(X,2,raw=TRUE), data=OF)$fitted.values, "b", col="wheat")
lines(OF$X, lm(Y~poly(X,3,raw=TRUE), data=OF)$fitted.values, "b", col="yellow")
lines(OF$X, lm(Y~poly(X,6,raw=TRUE), data=OF)$fitted.values, "b", col="orange")
lines(OF$X, lm(Y~poly(X,9,raw=TRUE), data=OF)$fitted.values, "b", col="black")
Regression Models with Regularizaon
[ 232 ]
The opon poly is used to specify the polynomial degree:
Figure 2: Fitting higher-order polynomial terms in a regression model
4. Enhance the graph with a suitable legend:
legend(6,50,c("Poly 1","Poly 2","Poly 3","Poly 6","Poly 9"),
col=c("green","wheat","yellow","orange","black"),pch=1,ncol=3)
5. Inialize the following vectors:
R2 <- NULL; AdjR2 <- NULL; FStat <- NULL
Mvar <- NULL; PolyOrder<-1:9
6. Now, t the regression models beginning with order 1 up to order 9 (since we only
have ten points) and extract their R2, Adj- R2, F-stasc value, and model variability:
for(i in 1:9) {
temp <- summary(lm(Y~poly(X,i,raw=T),data=OF))
R2[i] <- temp$r.squared
AdjR2[i] <- temp$adj.r.squared
FStat[i] <- as.numeric(temp$fstatistic[1])
Mvar[i] <- temp$sigma
}
cbind(PolyOrder,R2,AdjR2,FStat,Mvar)
We will more formally define polynomial regression models in the next section.
The output is given in the next figure.
7. Let us also look at the magnitude of the regression coefficients:
as.numeric(lm(Y~poly(X,1,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,2,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,3,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,4,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,5,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,6,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,7,raw=T),data=OF)$coefficients)
as.numeric(lm(Y~poly(X,8,raw=T),data=OF)$coefficients)
The following screenshot shows the large size of the regression coefficients; in
particular, as the degree of the polynomial increases, so does the coefficient
magnitude. This is a problem! As the complexity of a model increases, its
interpretability becomes very difficult. In the next section, we will discuss various
techniques in polynomial regression.
Figure 3: Regression coefficients of polynomial regression models
What just happened?
The scaer plot indicated that a polynomial regression model may be appropriate.
Fing higher order polynomial curves gives a closer approximaon of the t.
The regression coecients have been observed to increase with the degree
of the polynomial t.
In the next secon, we consider the more general regression spline model.
Regression Models with Regularizaon
[ 234 ]
Have a go hero
The R2 value for gasoline_lm is at 0.895; see Figure 7: Building Multiple Linear Regression
Model, of Chapter 6, Linear Regression Analysis. Add higher-order terms for the covariates
and make an attempt to reach an R2 value of 0.95.
Regression spline
In this secon, we will consider various enhancements/generalizaons of the linear
regression model. We will begin with a piecewise linear regression model and then consider
the polynomial regression extension. The term spline refers to a thin strip of wood that can
be easily bent along a curved line.
Basis functions
In the previous secon, we made mulple transformaons of the input variable X with
2
1 2
k
k
XX,X X,...,
XX
= = = . In the Data Re-expression secon of Chapter 4, Exploratory
Analysis, we saw how a useful log transformaon gave a beer stem-and-leaf display than
the original variable itself. In many applicaons, it has been found that the transformed
variables are more important than the original variable itself. Thus, we need a more generic
framework to consider the transformaons of the variables. Such a framework is provided by
the basis funcons. For a single covariate X, the set of transformaons may be dened
as follows:
( ) ( )
1
M
mm
m
fX h X
β
=
=
Here,
()
m
hX
is the m-th transformaon of X, and
m
β
is the associated regression coecient.
In the case of a simple linear regression model, we have
()
1
1hX= and
()
2
hX X
=. For the
polynomial regression model, we have
()
12
m m
hX X,m,,...,k= =
, and for the logarithmic
transformaon
()
loghX X=. In general, for the p mulple linear regression model,
we have the basis transformaon as follows:
( ) ( )
1
1 1
j
M
p
p jmjm j
jm
fX,..., X h X
β
= =
=∑∑
For the mulple linear regression model, we have
( )
11
j j j
h X X,j ,...p= = . In general, the
transformaon includes funcons such as sine, cosine, exponenaon, and indicator funcons.
Piecewise linear regression model
Consider the scaer plot of the dataset, which is available in the dataset PWR_Illus, in
the next screenshot. We see a slanted leer N in Figure 4: Scaerplot of a dataset (A) and
the ed values using piecewise linear regression model (B), where in the beginning Y
increases with X up to the point, approximately, 15, then there is a steep decline, or negave
relaonship, ll 30, and nally there is an increase in the y values beyond that. In a certain
way, we can imagine the x values of 15 and 30 as break-down points. It is apparent from the
scaerplot display that a linear relaonship between the x- and y-values over the real line
intervals less than 15, between 15 to 30, and greater than 30 is appropriate. The queson
then is how do we build a regression model for such a phenomenon? The answer is provided
by the piecewise linear regression model. In this parcular case, we have a two-piece linear
regression model.
In general, let $x_a$ and $x_b$ denote the two points where we believe the linear regression model has its breakpoints. Furthermore, we denote by $I_a$ an indicator function that equals 1 when the x value is greater than $x_a$ and takes the value 0 otherwise. Similarly, the second breakpoint indicator $I_b$ is defined. The piecewise linear regression model is defined as follows:
$$ Y = \beta_0 + \beta_1 X + \beta_2 (X - x_a) I_a + \beta_3 (X - x_b) I_b + \varepsilon $$
In this piecewise linear regression model, we have four transformations: $h_1(X) = 1$, $h_2(X) = X$, $h_3(X) = (X - x_a) I_a$, and $h_4(X) = (X - x_b) I_b$. The regression model needs to be interpreted with a bit of care. If the x value is less than $x_a$, the average Y value is $\beta_0 + \beta_1 X$. For an x value greater than $x_a$ but less than $x_b$, the average of Y is $(\beta_0 - \beta_2 x_a) + (\beta_1 + \beta_2) X$, and for values greater than $x_b$, it is $(\beta_0 - \beta_2 x_a - \beta_3 x_b) + (\beta_1 + \beta_2 + \beta_3) X$. That is, the intercept terms in the three intervals are $\beta_0$, $(\beta_0 - \beta_2 x_a)$, and $(\beta_0 - \beta_2 x_a - \beta_3 x_b)$, respectively, whereas the slopes are $\beta_1$, $(\beta_1 + \beta_2)$, and $(\beta_1 + \beta_2 + \beta_3)$. Of course, we are now concerned about fitting the piecewise linear regression model in R. Let us set ourselves up for this task!
Time for action – fitting piecewise linear regression models
A piecewise linear regression model can be easily fitted in R by using the same lm function and a bit of caution. A loop is used to find the points at which the model is supposed to have changed its trajectory.
1. Read the dataset into R with data(PW_Illus).
2. For convenience, attach the variables in the PW_Illus object by using attach(PW_Illus).
3. To be on the safe side, we will select a range of the x values, any of which may be either of the breakpoints:
break1 <- X[which(X>=12 & X<=18)]
break2 <- X[which(X>=27 & X<=33)]
4. Get the number of points that are candidates for being the breakpoints with n1 <- length(break1) and n2 <- length(break2).
We do not have a clear defining criterion to select one of the n1 or n2 x values to be the breakpoints. Hence, we will run various linear regression models and select as breakpoints the pair of points (xa, xb) that returns the least mean residual sum of squares. Towards this, we set up a matrix with three columns: the first two columns hold the potential pairs of breakpoints, and the third column contains the mean residual sum of squares. The pair of points that corresponds to the least mean residual sum of squares will be selected as the best model in the current case.
5. Set up the required matrix, build all the possible regression models with the pairs of potential breakpoints, and note their mean residual sum of squares through the following program:
MSE_MAT <- matrix(nrow=(n1*n2), ncol=3)
colnames(MSE_MAT) <- c("Break_1","Break_2","MSE")
curriter <- 0
for(i in 1:n1){
  for(j in 1:n2){
    curriter <- curriter + 1
    MSE_MAT[curriter,1] <- break1[i]
    MSE_MAT[curriter,2] <- break2[j]
    # Fit the piecewise linear model with candidate breakpoints break1[i] and break2[j]
    piecewise1 <- lm(Y ~ X*(X<break1[i]) + X*(X>=break1[i] & X<break2[j]) +
      X*(X>=break2[j]))
    # The sixth component of summary.lm is "sigma", the residual standard error
    MSE_MAT[curriter,3] <- as.numeric(summary(piecewise1)[6])
  }
}
Note the use of the formula operator ~ in the specification of the piecewise linear regression model.
6. The time has arrived to find the pair of breakpoints:
MSE_MAT[which(MSE_MAT[,3]==min(MSE_MAT[,3])),]
The pair of breakpoints is hence (14.000, 30.000). Let us now look at how good the model fit is!
7. First, reobtain the scatter plot with plot(PW_Illus). Fit the piecewise linear regression model with breakpoints at (14, 30) with pw_final <- lm(Y ~ X*(X<14)+X*(X>=14 & X<30)+X*(X>=30)). Add the fitted values to the scatter plot with points(PW_Illus$X, pw_final$fitted.values, col="red").
Note that the fitted values are a very good reflection of the original data values (Figure 4 (B)). The fact that linear models can be extended to such different scenarios makes it very promising to study this in even more detail, as will be seen in the later part of this section.
Figure 4: Scatterplot of a dataset (A) and the fitted values using piecewise linear regression model (B)
What just happened?
The piecewise linear regression model has been explored for a hypothetical scenario, and we investigated how to identify breakpoints by using the criterion of the mean residual sum of squares.
The piecewise linear regression model offers useful flexibility, and it is indeed a very useful model when there is a genuine reason to believe that there are certain breakpoints in the model. It has some advantages and certain limitations too. From a technical perspective, the fitted model is not continuous, whereas from an applied perspective, the model poses problems in guessing the breakpoint values and in extending to multi-dimensional cases. It is thus necessary to look for a more general framework, where we need not be bothered about these issues. Some answers are provided in the following sections.
Natural cubic splines and the general B-splines
We will first consider the polynomial regression splines model. As noted in the previous discussion, we have a lot of discontinuity in the piecewise regression model. In some sense, "greater continuity" can be achieved by using the cubic functions of x and then constructing regression splines in what are known as "piecewise cubics"; see Berk (2008), Section 2.2. Suppose that there are K data points at which we require the knots. Suppose that the knots are located at the points $\xi_1, \xi_2, \ldots, \xi_K$, which lie between the boundary points $\xi_0$ and $\xi_{K+1}$, such that $\xi_0 < \xi_1 < \xi_2 < \ldots < \xi_K < \xi_{K+1}$. The piecewise cubic polynomial regression model is given as follows:
$$ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \sum_{j=1}^{K} \theta_j (X - \xi_j)_+^3 + \varepsilon $$
Here, the funcon
()
3
.
+
represents that the posive values from the argument are accepted
and then the cube power performed on it; that is:
( ) ( )
3
3if
othe0 rwise
jj
j
X
XX
,
ξ
ξξ
+
= >
For this model, the K + 4 basis functions are as follows:
$$ h_1(X) = 1,\; h_2(X) = X,\; h_3(X) = X^2,\; h_4(X) = X^3,\; h_{4+j}(X) = (X - \xi_j)_+^3,\; j = 1, \ldots, K $$
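To make the truncated-power basis concrete, the following short sketch builds these K + 4 basis columns by hand for a toy vector of x values and two knots; the vector x and the knots here are hypothetical and serve only as an illustration.
# Hypothetical illustration of the truncated-power (piecewise cubic) basis
x     <- seq(0, 20, by = 0.5)   # toy predictor values, not a book dataset
knots <- c(6.5, 13)             # two illustrative knots
# pmax(x - knot, 0)^3 implements (x - knot)_+^3
H <- cbind(1, x, x^2, x^3, sapply(knots, function(k) pmax(x - k, 0)^3))
colnames(H) <- c("h1", "h2", "h3", "h4", paste0("h", 4 + seq_along(knots)))
head(H)   # K + 4 = 6 basis columns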
We will now consider an example from Montgomery, et al. (2005), pages 231-3. It is known that the battery voltage drop in a guided missile motor has a different behavior as a function of time. The next screenshot displays the scatter plot of the battery voltage drop for different time points; see ?VD from the RSADBE package. We need to build a piecewise cubic regression spline for this dataset with knots at time t = 6.5 and t = 13 seconds, since it is known that the missile changes its course at these points. If we denote the battery voltage drop by Y and the time by t, the model for this problem is then given as follows:
$$ Y = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \theta_1 (t - 6.5)_+^3 + \theta_2 (t - 13)_+^3 + \varepsilon $$
It is not possible within the mathematical scope of this book to look into the details related to the natural cubic spline regression models or the B-spline regression models. However, we can fit them by using the ns and bs functions in the formula of the lm function, along with the knots at the appropriate places. These models will be built and their fit will be visualized too. Let us now fit the models!
Time for action – fitting the spline regression models
A natural cubic spline regression model will be fitted for the voltage drop problem.
1. Read the required dataset into R by using data(VD).
2. Invoke the graphics editor by using par(mfrow=c(1,2)).
3. Plot the data and give an appropriate title:
plot(VD)
title(main="Scatter Plot for the Voltage Drop")
4. Build the piecewise cubic polynomial regression model by using the lm function and related options:
VD_PRS <- lm(Voltage_Drop ~ Time + I(Time^2) + I(Time^3) +
  I(((Time-6.5)^3)*(sign(Time-6.5)==1)) +
  I(((Time-13)^3)*(sign(Time-13)==1)), data=VD)
The sign funcon returns the sign of a numeric vector as 1, 0, and -1, accordingly
as the arguments are posive, zero, and negave respecvely. The operator I is
an inhibit interpretator operator, in that the argument will be taken in an as is
format, check ?I. This operator is especially useful in data.frame and the formula
program of R.
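The same truncated cubic terms can also be written with pmax, which some readers may find easier to read; the following is only an equivalent rewriting of the model above, not a different fit.
# Equivalent specification of VD_PRS using pmax() for the (t - knot)_+^3 terms
VD_PRS_alt <- lm(Voltage_Drop ~ Time + I(Time^2) + I(Time^3) +
  I(pmax(Time - 6.5, 0)^3) + I(pmax(Time - 13, 0)^3), data = VD)
all.equal(fitted(VD_PRS), fitted(VD_PRS_alt))   # should be TRUE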
5. To obtain the fitted plot along with the scatter plot, run the following code:
plot(VD)
points(VD$Time, fitted(VD_PRS), col="red", "l")
title("Piecewise Cubic Polynomial Regression Model")
Figure 5: Voltage drop data - scatter plot and a cubic polynomial regression model
6. Obtain the details of the fitted model with summary(VD_PRS).
The R output is given in the next screenshot. The summary output shows that each of the basis functions is indeed significant here.
Figure 6: Details of the fitted piecewise cubic polynomial regression model
7. Fit the natural cubic spline regression model using the ns function from the splines package (load it first with library(splines)); note that ns does not take a degree argument, since natural cubic splines are cubic by definition:
VD_NCS <- lm(Voltage_Drop ~ ns(Time, knots=c(6.5,13), intercept=TRUE), data=VD)
8. Obtain the ed plot as follows:
par(mfrow=c(1,2))
plot(VD)
points(VD$Time,fitted(VD_NCS),col="green","l")
title("Natural Cubic Regression Model")
9. Obtain the details related to VD_NCS with the summary function summary(VD_NCS); see Figure 8: Details of the natural cubic and B-spline regression models.
10. Fit the B-spline regression model by using the bs function, also from the splines package:
VD_BS <- lm(Voltage_Drop ~ bs(Time, knots=c(6.5,13), intercept=TRUE, degree=3), data=VD)
11. Obtain the fitted plot for VD_BS with the R program:
plot(VD)
points(VD$Time, fitted(VD_BS), col="brown", "l")
title("B-Spline Regression Model")
Figure 7: Natural Cubic and B-Spline Regression Modeling
12. Finally, get the details of the fitted B-spline regression model by using summary(VD_BS).
The main purpose of the B-spline regression model is to illustrate that these splines are smooth at the boundary points, in contrast with the natural cubic regression model. This can be clearly seen in Figure 8: Details of the natural cubic and B-spline regression models.
Both the models, VD_NCS and VD_BS, have good summary statistics and have modeled the data really well.
Figure 8: Details of the natural cubic and B-spline regression models
What just happened?
We began with the fitting of a piecewise polynomial regression model and then had a look at the natural cubic spline regression and B-spline regression models. All three models provide a very good fit to the actual data. Thus, with a good guess or experimental/theoretical evidence, the linear regression model can be extended in an effective way.
Ridge regression for linear models
In Figure 3: Regression coefficients of polynomial regression models, we saw that the magnitude of the regression coefficients increases drastically as the polynomial degree increases. The right tweaking of the linear regression model, as seen in the previous section, gives us the right results. However, the models considered in the previous section had just one covariate, and the problem of identifying the knots in the multiple regression model becomes an overly complex issue. That is, if we have a problem with a large number of covariates, there may naturally be some dependency amongst them, which cannot be investigated for certain reasons. In such problems, it may happen that certain covariates dominate other covariates in terms of the magnitude of their regression coefficients, and this may mar the overall usefulness of the model. Furthermore, even in the univariate case, we have the problem that the choice of the number of knots, their placements, and the polynomial degree may be manipulated by the analyst. We have an alternative to this problem in the way we minimize the residual sum of squares $\min_{\beta} \sum_{i=1}^{n} e_i^2$.
The least-squares solution leads to an estimator of $\beta$:
$$ \hat{\beta} = (X'X)^{-1} X'Y $$
We saw in Chapter 6, Linear Regression Analysis, how to guard ourselves against outliers, the measures of model fit, and model selection techniques. However, these methods come into action after the construction of the model, and hence, though they offer protection in a certain sense against the problem of overfitting, we need more robust methods. The question that arises is: can we guard ourselves against overfitting while building the model itself? This would go a long way in addressing the problem. The answer is an affirmative, and we will check out this technique.
The least-squares solution is the optimal solution when we have the squared loss function. The idea then is to modify this loss function by incorporating a penalty term, which will give us additional protection against the overfitting problem. Mathematically, we add a penalty term for the size of the regression coefficients; in fact, the constraint ensures that the sum of squares of the regression coefficients is kept small. Formally, the goal is to obtain an optimal solution of the following problem:
$$ \min_{\beta} \left\{ \sum_{i=1}^{n} e_i^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} $$
Here, $\lambda > 0$ is the control factor, also known as the tuning parameter, and $\sum_{j=1}^{p} \beta_j^2$ is the penalty. If the $\lambda$ value is zero, we get the earlier least-squares solution. Note that the intercept has been deliberately kept out of the penalty term! Now, for large values of $\sum_{j=1}^{p} \beta_j^2$, the penalized criterion will be large. Thus, loosely speaking, minimizing $\sum_{i=1}^{n} e_i^2 + \lambda \sum_{j=1}^{p} \beta_j^2$ also requires $\sum_{j=1}^{p} \beta_j^2$ to be kept small. The optimal solution for the preceding minimization problem is given as follows:
$$ \hat{\beta}_{\text{Ridge}} = (X'X + \lambda I)^{-1} X'Y $$
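To see the closed-form solution in action, here is a small sketch that computes the ridge estimate directly from the matrix formula and compares it with the ordinary least-squares estimate; the simulated design matrix and response are hypothetical and only illustrate the algebra.
# Hypothetical illustration of the ridge closed-form solution (not a book dataset)
set.seed(42)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))        # design matrix with an intercept column
beta_true <- c(1, 2, -1, 0.5)
y <- X %*% beta_true + rnorm(n)
lambda <- 2
Pen <- diag(c(0, rep(1, p)))                     # penalize only the slopes, not the intercept
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)
beta_ridge <- solve(t(X) %*% X + lambda * Pen, t(X) %*% y)
round(cbind(OLS = as.vector(beta_ols), Ridge = as.vector(beta_ridge)), 3)  # ridge slopes are shrunk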
The choice of $\lambda$ is a critical one. There are multiple options to obtain it:
Find the value of $\lambda$ by using the cross-validation technique (discussed in the last section of this chapter)
Find the value of $\lambda$ by a semi-automated method, as described at http://arxiv.org/pdf/1205.0686.pdf
For the first technique, we can use the function lm.ridge from the MASS package, and the second method of semi-automatic detection is available in the linearRidge function of the ridge package.
In the following R session, we use the functions lm.ridge and linearRidge.
Time for action – ridge regression for the linear regression model
The linearRidge function from the ridge package and lm.ridge from the MASS package are two good options for developing ridge regression models.
1. Though the OF object may still be there in your session, let us again load it by using data(OF).
2. Load the MASS and ridge packages by using library(MASS); library(ridge).
3. For a polynomial regression model of degree 3 and various values of lambda, including 0, 0.5, 1, 1.5, 2, 5, 10, and 30, obtain the ridge regression coefficients with the following R code:
LR <- linearRidge(Y~poly(X,3), data=as.data.frame(OF),
  lambda=c(0,0.5,1,1.5,2,5,10,30))
LR
The funcon linearRidge from the ridge package performs the ridge regression
for a linear model. We have two opons. First, we specify the values of lambda,
which may either be a scalar or a vector. In the case of a scalar lambda, it will simply
return the set of (ridge) regression coecients. If it is a vector, it returns the related
set of regression coecients.
4. Compute the value of
2
1
p
j
j
β
=
for dierent lambda values:
LR_Coef <- LR$coef
colSums(LR_Coef^2)
Note that as the lambda value increases, the value of
2
1
p
j
j
β
=
decreases. However,
this is not to say that higher lambda value is preferable, since the sum
2
1
p
j
j
β
=
will decrease to 0, and eventually none of the variables will have a signicant
explanatory power about the output. The choice of selecon of the lambda value
will be discussed in the last secon.
5. The linearRidge function can also find the "best" lambda value:
linearRidge(Y~poly(X,3), data=as.data.frame(OF), lambda="automatic")
6. Fetch the details of the "best" ridge regression model with the following line of code:
summary(linearRidge(Y~poly(X,3), data=as.data.frame(OF), lambda="automatic"))
The summary shows that the value of lambda is chosen as 0.07881, and that it used three PCs. Now, what is a PC? PC is an abbreviation of principal component, and unfortunately we can't really go into the details of this aspect. Enthusiastic readers may refer to Chapter 17 of Tattar, et al. (2013). Compare these results with those in the first section.
7. For the same choice of different lambda values, use the lm.ridge function from the MASS package:
LM <- lm.ridge(Y~poly(X,3), data=as.data.frame(OF),
  lambda=c(0,0.5,1,1.5,2,5,10,30))
LM
8. The lm.ridge function obviously works a bit differently from linearRidge. The results are given in the next image. Comparison of the results is left as an exercise to the reader. As with the linearRidge model, let us compute the value of $\sum_{j=1}^{p} \beta_j^2$ for the lm.ridge fitted models too.
9. Use the colSums function to get the required result:
LM_Coef <- LM$coef
colSums(LM_Coef^2)
Figure 09: A first look at the linear ridge regression
So far, we have been working with a single covariate only. However, we need to consider multiple linear regression models and see how ridge regression helps us. To do this, we will return to the gasoline mileage problem considered in Chapter 6, Linear Regression Analysis.
1. Read the Gasoline data into R by using data(Gasoline).
2. Fit the ridge regression model (and the multiple linear regression model again) for the mileage as a function of the other variables:
gasoline_lm <- lm(y~., data=Gasoline)
gasoline_rlm <- linearRidge(y~., data=Gasoline, lambda="automatic")
3. Compare the lm coefficients with the linearRidge coefficients:
sum(coef(gasoline_lm)[-1]^2) - sum(coef(gasoline_rlm)[-1]^2)
4. Look at the summary of the fitted ridge linear regression model by using summary(gasoline_rlm).
5. The difference between the sum of squares of the regression coefficients for the linear and the ridge linear model is indeed very large. Furthermore, the gasoline_rlm details reveal that there are four variables that have significant explanatory power for the mileage of the car. Note that the gasoline_lm model had only one significant variable for the car's mileage. The output is given in the following figure:
Figure 10: Ridge regression for the gasoline mileage problem
What just happened?
We made use of two functions, namely lm.ridge and linearRidge, for fitting ridge regression models for the linear regression model. It is observed that ridge regression models may sometimes reveal more significant variables.
In the next section, we will consider the ridge penalty for the logistic regression model.
Ridge regression for logistic regression models
We will not be able to go into the math of ridge regression for the logistic regression model, though we will happily make good use of the logisticRidge function from the ridge package to illustrate how to build ridge regression for the logistic regression model. For more details, we refer to the research paper of Cule and De Iorio (2012), available at http://arxiv.org/pdf/1205.0686.pdf. In the previous section, we saw that gasoline_rlm found more significant variables than gasoline_lm. Now, in Chapter 7, Logistic Regression Model, we fit a logistic regression model for the German credit data problem in GC_LR. The question that arises is: if we obtain a ridge regression model of the related logistic regression model, say GC_RLR, can we expect to find more significant variables?
Time for action – ridge regression for the logistic regression model
We will use the logisticRidge function from the ridge package here to fit the ridge regression, and check whether we can obtain more significant variables.
1. Load the German credit dataset with data(German).
2. Use the logisticRidge function to obtain GC_RLR; a small manipulation is required here, by using the following line of code:
GC_RLR <- logisticRidge(as.numeric(good_bad)-1 ~ ., data=as.data.frame(GC),
  lambda="automatic")
3. Obtain the summaries of GC_LR and GC_RLR by using summary(GC_LR) and summary(GC_RLR).
The detailed summary output is given in the following screenshot:
Figure 11: Ridge regression with the logistic regression model
It can be seen that the ridge regression model offers a very slight improvement over the standard logistic regression model.
What just happened?
The ridge regression concept has been applied to the important family of logistic regression models. Although in the case of the German credit data problem we found only a slight improvement in the identification of significant variables, it is vital that we always be on the lookout for better models, for example with respect to sensitivity to outliers, and the logisticRidge function appears to be a good alternative to the glm function.
Another look at model assessment
In the previous two sections, we used the automatic option for obtaining the optimum $\lambda$ values, as discussed in the work of Cule and De Iorio (2012). There is also an iterative technique for finding the penalty factor $\lambda$. This technique is especially useful when we do not have a sufficiently well-developed theory for regression models beyond the linear and logistic regression models. Neural networks, support vector machines, and so on, are some very useful regression models where the theory may not have been well developed, at least to the best knowledge of the author. Hence, we will use the iterative method in this section.
For both the linearRidge and lm.ridge fitted models in the Ridge regression for linear models section, we saw that for an increasing value of $\lambda$, the sum of squares of the regression coefficients, $\sum_{j=1}^{p} \beta_j^2$, decreases. The question then is how to select the "best" $\lambda$ value. A popular technique in the data mining community is to split the dataset into three parts, namely the Train, Validate, and Test parts. There are no definitive answers for what the split percentages for the three parts should be, and a common practice is to split them in either 60:20:20 or 50:25:25 proportions. Let us now understand this process:
Training dataset: The models are built on the data available in this data part.
Validation dataset: For this part of the data, we pretend as though we do not know the output values and make predictions based upon the covariate values. This step is to ensure that overfitting is minimized. The errors (residual sums of squares for a regression model and accuracy percentages for a classification model) are then compared with the counterpart errors in the training part. If the errors decrease in the training set while they remain the same for the validation part, it means that we are overfitting the data. A threshold after which this is observed may be chosen as the better lambda value.
Testing dataset: In practice, these are really unobserved cases for which the model is applied for forecasting purposes.
For the gasoline mileage problem, we will split the data into three parts and use the training and validation parts to select the $\lambda$ value.
Time for action – selecting lambda iteratively and other topics
Iterative selection of the penalty parameter for ridge regression will be covered in this section. The useful framework of train + validate + test will also be considered for the German credit data problem.
1. For the sake of simplicity, we will remove the character variable of the dataset by using Gasoline <- Gasoline[,-12].
2. Set the random seed by using set.seed(1234567). This step is to ensure that the user can reproduce the results of the program.
3. Randomize the observations to enable the splitting part:
data_part_label <- c("Train","Validate","Test")
indv_label <- sample(data_part_label, size=nrow(Gasoline), replace=TRUE,
  prob=c(0.6,0.2,0.2))
4. Now, split the gasoline dataset:
G_Train <- Gasoline[indv_label=="Train",]
G_Validate <- Gasoline[indv_label=="Validate",]
G_Test <- Gasoline[indv_label=="Test",]
5. Define the $\lambda$ vector with lambda <- seq(0,10,0.2).
6. Initialize the training and validation errors:
Train_Errors <- vector("numeric", length=length(lambda))
Val_Errors <- vector("numeric", length=length(lambda))
7. Run a loop to get the required errors; the loop itself appears in the book as a screenshot, and a sketch of it is given after this step.
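Since the loop is not reproduced in the extracted text, the following is only a minimal sketch of what it might look like, assuming the errors are residual sums of squares computed from lm.ridge fits refitted for every lambda value; the author's actual loop may differ in its details.
# Sketch of the missing loop (assumption: errors = residual sums of squares
# from an lm.ridge fit, refitted separately for every value of lambda)
for (i in seq_along(lambda)) {
  fit  <- lm.ridge(y ~ ., data = G_Train, lambda = lambda[i])
  beta <- coef(fit)                     # intercept and slopes on the original scale
  Xtr  <- as.matrix(cbind(1, G_Train[, names(G_Train) != "y"]))
  Xval <- as.matrix(cbind(1, G_Validate[, names(G_Validate) != "y"]))
  Train_Errors[i] <- sum((G_Train$y    - Xtr  %*% beta)^2)
  Val_Errors[i]   <- sum((G_Validate$y - Xval %*% beta)^2)
}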
8. Plot the training and validation errors:
plot(lambda, Val_Errors, "l", col="red", xlab=expression(lambda),
  ylab="Training and Validation Errors", ylim=c(0,600))
points(lambda, Train_Errors, "l", col="green")
legend(6, 500, c("Training Errors","Validation Errors"),
  col=c("green","red"), pch="-")
The nal output will be the following:
Figure 12: Training and validation errors
The preceding plot suggests that the lambda value is between 0.5 and 1.5. Why?
The technique of train + validate + test is not simply restricted to selecng the
lambda value. In fact, for any regression/classicaon model, we can try to
understand if the selected model really generalizes or not. For the German credit
data problem in the previous chapter, we will make an aempt to see what the
current technique suggests.
9. The program, whose output is the ROC curves shown next, appears in the book as a screenshot; a sketch of how such a program might look follows.
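Since the program itself is not reproduced here, the following is only an illustrative sketch, assuming the German credit data frame GC with the good_bad outcome from Chapter 7 and the ROCR package for the ROC curves; the author's actual program may differ in its details.
# Sketch of a train + validate + test run for the German credit data
# (assumptions: the GC data frame with a good_bad factor, and the ROCR package)
library(ROCR)
set.seed(1234567)
gc_label <- sample(c("Train","Validate","Test"), size=nrow(GC),
  replace=TRUE, prob=c(0.6,0.2,0.2))
GC_Train    <- GC[gc_label=="Train",]
GC_Validate <- GC[gc_label=="Validate",]
GC_Test     <- GC[gc_label=="Test",]
# Build the logistic regression model on the training part only
GC_Train_LR <- glm(good_bad ~ ., data=GC_Train, family=binomial())
# ROC curves for the validation and test parts
val_pred  <- prediction(predict(GC_Train_LR, newdata=GC_Validate, type="response"),
  GC_Validate$good_bad)
test_pred <- prediction(predict(GC_Train_LR, newdata=GC_Test, type="response"),
  GC_Test$good_bad)
plot(performance(val_pred, "tpr", "fpr"), col="red")
plot(performance(test_pred, "tpr", "fpr"), col="blue", add=TRUE)
legend("bottomright", c("Validate","Test"), col=c("red","blue"), lty=1)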
10. The ROC plot is given in the following screenshot:
Figure 13: ROC plot for the train + validate + test partition of the German data
We will close the chapter with a short discussion. In the train + validate + test partitioning, we had one technique for avoiding overfitting. A generalization of this technique is the well-known cross-validation method. In an n-fold cross-validation approach, the data is randomly partitioned into n divisions. In the first step, the first part is held out for validation, the model is built using the remaining n-1 parts, and the accuracy percentage is calculated. Next, the second part is treated as the validation dataset, the remaining parts 1, 3, ..., n are used to build the model, and the accuracy is then computed on the second part. This process is repeated for the remaining n-2 folds. Finally, an overall accuracy metric is reported. On the surface, this process is complex enough, and hence we will resort to the well-defined functions available in the DAAG package; a bare-bones sketch of the idea itself is given after this paragraph.
11. As the cross-validation function itself carries out the n-fold partitioning, we build it over the entire dataset:
library(DAAG)
data(VD)
CVlm(df=VD, form.lm=formula(Voltage_Drop~Time+I(Time^2)+I(Time^3)+
  I(((Time-6.5)^3)*(sign(Time-6.5)==1))+I(((Time-13)^3)*(sign(Time-13)==1))),
  m=10, plotit="Observed")
The VD data frame has 41 observations, and the output in Figure 14: Cross-validation for the voltage-drop problem shows that the 10-fold cross-validation has 10 partitions, with fold 2 containing five observations and the rest of them having four each. Now, for each fold, the cubic polynomial regression model is fitted by using the data in the remaining folds:
Figure 14: Cross-validation for the voltage-drop problem
Using the ed polynomial regression model, a predicon is made for the units in
the fold. The observed versus predicted regressand values plot is given in Figure
15: Predicted versus observed plot using the cross-validaon technique. A close
examinaon of the numerical predicted values and the plot indicate that we have a
very good model for the voltage drop phenomenon.
The generalized cross-validaon (GCV) errors are also given with the details of a
lm.ridge t model. We can use this informaon to arrive at the beer value for
the ridge regression models:
Figure 15: Predicted versus observed plot using the cross-validation technique
12. For the OF and G_Train data frames, use the lm.ridge function to obtain the GCV errors:
> LM_OF <- lm.ridge(Y~poly(X,3), data=as.data.frame(OF),
+ lambda=c(0,0.5,1,1.5,2,5,10,30))
> LM_OF$GCV
0.0 0.5 1.0 1.5 2.0 5.0 10.0 30.0
5.19 5.05 5.03 5.09 5.21 6.38 8.31 12.07
> LM_GT <- lm.ridge(y~.,data=G_Train,lambda=seq(0,10,0.2))
> LM_GT$GCV
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8
1.777 0.798 0.869 0.889 0.891 0.886 0.877 0.868 0.858 0.848
2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
0.838 0.830 0.821 0.813 0.806 0.798 0.792 0.786 0.780 0.774
4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8
0.769 0.764 0.760 0.755 0.751 0.748 0.744 0.740 0.737 0.734
6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8
0.731 0.729 0.726 0.723 0.721 0.719 0.717 0.715 0.713 0.711
8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8
0.710 0.708 0.707 0.705 0.704 0.703 0.701 0.700 0.699 0.698
10.0
0.697
For the OF data frame, the best value appears to lie around 1.0, where the GCV is smallest. On the other hand, for the G_Train data frame, the value appears to lie in (0.2, 0.4).
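As a quick programmatic check (a small convenience sketch, not part of the original steps), the $\lambda$ value with the smallest GCV can be picked out directly from the fitted lm.ridge object; MASS also provides a select method for such objects that reports model-selection summaries, including the GCV minimizer.
# Pick out the lambda with the smallest GCV error from the fitted object
LM_OF$lambda[which.min(LM_OF$GCV)]   # 1.0 for the OF grid used above
select(LM_OF)                        # HKB, L-W, and GCV-based choices of lambda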
What just happened?
The choice of the penalty factor is indeed very crucial for the success of a ridge regression model, and we saw different methods for obtaining it. These included the automatic choice of Cule and De Iorio (2012) and the cross-validation technique. Furthermore, we also saw the application of the popular train + validate + test approach. In practical applications, these methodologies will go a long way towards obtaining the best models.
Pop quiz
What do you expect as the result if you perform the model selection task with the step function on a polynomial regression model? That is, you are trying to select the variables for the polynomial model lm(Y~poly(X,9,raw=TRUE),data=OF), or say VD_PRS. Verify your intuition by completing the R programs.
Summary
In this chapter, we began with a hypothetical dataset and highlighted the problem of overfitting. In the presence of breakpoints, also known as knots, the extensions of the linear model to the piecewise linear regression model and the spline regression model were found to be very useful enhancements. The problem of overfitting can also sometimes be overcome by using ridge regression. The ridge regression solution has been extended for the linear and logistic regression models. Finally, we saw a different approach to model assessment by using the train + validate + test approach and the cross-validation approach. Despite these developments, when the data is intrinsically non-linear, it becomes difficult for the models discussed in this chapter to emerge as useful solutions. The past two decades have witnessed a powerful alternative in the so-called Classification and Regression Trees (CART). The next chapter discusses CART in greater depth, and the final chapter considers modern developments related to it.
9
Classification and Regression Trees
In the previous chapters, we focused on regression models, and the majority of the emphasis was on the linearity assumption. Although the next natural extension would appear to be non-linear models, we will instead deviate to recursive partitioning techniques, which are a bit more flexible than the non-linear generalization of the models considered in the earlier chapters. Of course, the recursive partitioning techniques, in most cases, may be viewed as non-linear models.
We will rst introduce the noon of recursive parons through a hypothecal dataset.
It is apparent that the earlier approach of the linear models changes in an enrely dierent
way with the funconing of the recursive parons. Recursive paroning depends upon
the type of problem we have in hand. We develop a regression tree for the regression
problem when the output is a connuous variable, as in the linear models. If the output is
a binary variable, we develop a classicaon tree. A regression tree is rst created by using
the rpart funcon from the rpart package. A very raw R program is created, which clearly
explains the unfolding of a regression tree. A similar eort is repeated for the classicaon
tree. In the nal secon of this chapter, a classicaon tree is created for the German credit
data problem along with the use of ROC curves for understanding the model performance.
The approach in this chapter will be on the following lines:
Understanding the basis of recursive parons and the general CART.
Construcon of a regression tree
Construcon of a classicaon tree
Applicaon of a classicaon tree to the German credit data problem
The ner aspects of CART
Classicaon and Regression
[ 258 ]
Recursive partitions
The name of the library package rpart, shipped along with R, stands for Recursive Partitioning. The package was first created by Terry M Therneau and Beth Atkinson, and is currently maintained by Brian Ripley. We will first have a peek at what recursive partitions are.
A complex and contrived relationship is generally not identifiable by linear models. In the previous chapter, we saw the extensions of the linear models in the piecewise, polynomial, and spline regression models. It is also well known that if the order of a model is larger than 4, then the interpretation and usability of the model become more difficult. We consider a hypothetical dataset, where we have two classes for the output Y and two explanatory variables in X1 and X2. The two classes are indicated by filled-in green circles and red squares. First, we will focus only on the left display of Figure 1: A complex classification dataset with partitions, as it is the actual depiction of the data. At the outset, it is clear that a linear model is not appropriate, as there is quite an overlap of the green and red indicators. Now, there is a clear demarcation of the classification problem according to whether X1 is greater than 6 or not. In the area on the left side of X1=6, the middle-third region contains a majority of green circles and the rest are red squares. The red squares are predominantly identifiable according to whether the X2 values are either less than or equal to 3 or greater than 6. The green circles are the majority values in the region of X2 being greater than 3 and less than 6. A similar story can be built for the points on the right side of X1 greater than 6. Here, we first partitioned the data according to the X1 values, and then in each of the partitioned regions, we obtained partitions according to the X2 values. This is the act of recursive partitioning.
Figure 1: A complex classification dataset with partitions
Let us obtain the preceding plot in R.
Time for action – partitioning the display plot
We first visualize the CART_Dummy dataset and then look, in the next subsection, at how CART finds the patterns that are believed to exist in the data.
1. Obtain the dataset CART_Dummy from the RSADBE package by using data(CART_Dummy).
2. Convert the binary output Y into a factor variable, and attach the data frame with CART_Dummy$Y <- as.factor(CART_Dummy$Y).
attach(CART_Dummy)
In Figure 1: A complex classification dataset with partitions, the red squares refer to 0 and the green circles to 1.
3. Initialize a graphics window for the two plots by using par(mfrow=c(1,2)).
4. Create a blank scatter plot:
plot(c(0,12),c(0,10),type="n",xlab="X1",ylab="X2")
5. Plot the green circles and red squares:
points(X1[Y==0],X2[Y==0],pch=15,col="red")
points(X1[Y==1],X2[Y==1],pch=19,col="green")
title(main="A Difficult Classification Problem")
6. Repeat the previous two steps to obtain an identical plot on the right side of the graphics window.
7. First, partition according to the X1 values by using abline(v=6,lwd=2).
8. Add segments on the graph with the segments function:
segments(x0=c(0,0,6,6), y0=c(3.75,6.25,2.25,5),
  x1=c(6,6,12,12), y1=c(3.75,6.25,2.25,5), lwd=2)
title(main="Looks a Solvable Problem Under Partitions")
What just happened?
A complex problem is simplified through partitioning! A more generic function, segments, has nicely slipped into our program, which you may use for many other scenarios.
Classicaon and Regression
[ 260 ]
Now, this approach of recursive partitioning is not feasible all the time! Why? We seldom deal with just two or three explanatory variables and with as few data points as in the preceding hypothetical example. The question is how one creates a recursive partitioning of the dataset. Breiman, et al. (1984) and Quinlan (1988) invented tree-building algorithms, and we will follow the Breiman, et al. approach in the rest of the book. The CART discussion in this book is heavily influenced by Berk (2008).
Splitting the data
In the earlier discussion, we saw that partitioning the dataset can help a lot in reducing the noise in the data. The question is how does one begin with it? The explanatory variables can be discrete or continuous. We will begin with the continuous (numeric objects in R) variables.
For a continuous variable, the task is a bit simpler. First, identify the unique distinct values of the numeric object. Let us say, for example, that the distinct values of a numeric object, say height in cms, are 160, 165, 170, 175, and 180. The data partitions are then obtained as follows:
data[Height<=160,], data[Height>160,]
data[Height<=165,], data[Height>165,]
data[Height<=170,], data[Height>170,]
data[Height<=175,], data[Height>175,]
The reader should try to understand the rationale behind the code; certainly, this is just an indicative one.
Now, we consider the discrete variables. Here, we have two types of variables, namely categorical and ordinal. In the case of ordinal variables, we have an order among the distinct values. For example, in the case of an economic status variable, the order may be among the classes Very Poor, Poor, Average, Rich, and Very Rich. Here, the splits are similar to the case of a continuous variable, and if there are m distinct ordered values, we consider m-1 distinct splits of the overall data. In the case of a categorical variable with m categories, for example the departments A to F of the UCBAdmissions dataset, the number of possible splits becomes 2^(m-1)-1. However, the benefit of using software like R is that we do not have to worry about these issues; a small numerical check of these counts is sketched below.
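As a small numerical check of the split counts just described (an illustrative sketch, not part of the original steps), the following lines count the candidate splits for the hypothetical height variable and for a six-level categorical variable such as the UCBAdmissions departments.
# Candidate splits for a continuous/ordinal variable with m distinct values: m - 1
height <- c(160, 165, 170, 175, 180)
length(unique(height)) - 1          # 4 candidate splits, as listed above
# Candidate splits for an unordered categorical variable with m categories: 2^(m-1) - 1
m <- length(LETTERS[1:6])           # departments A to F
2^(m - 1) - 1                       # 31 possible splits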
The rst tree
In the CART_Dummy dataset, we can easily visualize the parons for Y as a funcon of the
inputs X1 and X2. Obviously, we have a classicaon problem, and hence we will build the
classicaon tree.
Time for action – building our first tree
The rpart function from the library rpart will be used to obtain the first classification tree. The tree will be visualized by using the plot options of rpart, and we will follow this up by extracting the rules of the tree with the asRules function from the rattle package.
1. Load the rpart package by using library(rpart).
2. Create the classification tree with CART_Dummy_rpart <- rpart(Y~X1+X2, data=CART_Dummy).
3. Visualize the tree with appropriate text labels by using plot(CART_Dummy_rpart); text(CART_Dummy_rpart).
Figure 2: A classification tree for the dummy dataset
Now, the classification tree flows as follows. Obviously, the tree built by the rpart function does not partition as simply as we did in Figure 1: A complex classification dataset with partitions; its working will be dealt with in the third section of this chapter. First, we check whether the value of the second variable X2 is less than 4.875. If the answer is an affirmation, we move to the left side of the tree, and to the right side otherwise. Let us move to the right side. A second question asked is whether X1 is less than 4.5 or not; if the answer is yes, the observation is identified as a red square, and otherwise as a green circle. You are now asked to interpret the left side of the first node. Let us look at the summary of CART_Dummy_rpart.
Classicaon and Regression
[ 262 ]
4. Apply summary, an S3 method, to the classification tree with summary(CART_Dummy_rpart).
That is a lot of output!
Figure 3: Summary of a classification tree
Our interest is in the nodes numbered 5 to 9! Why? The terminal nodes, of course! A terminal node is one in which we can't split the data any further, and for the classification problem, we arrive at a class assignment as the class that has a majority count at the node. The summary shows that there are indeed some misclassifications too. Now, wouldn't it be great if R gave the terminal nodes as rules? The function asRules from the rattle package extracts the rules from an rpart object. Let's do it!
5. Invoke the rattle package with library(rattle) and, using the asRules function, extract the rules from the terminal nodes with asRules(CART_Dummy_rpart).
The result is the following set of rules:
Figure 4: Extracting "rules" from a tree!
We can see that the classification tree is not exactly according to our "bird's-eye" partitioning. However, as a final aspect of our initial understanding, let us plot the segments the naïve way. That is, we will partition the data display according to the terminal nodes of the CART_Dummy_rpart tree.
6. The R code is given right away, though you should make an effort to find the logic behind it. Of course, it is very likely that by now you need to rerun some of the code that was given previously.
abline(h=4.875,lwd=2)
segments(x0=4.5,y0=4.875,x1=4.5,y1=10,lwd=2)
abline(h=1.75,lwd=2)
segments(x0=3.5,y0=1.75,x1=3.5,y1=4.875,lwd=2)
title(main="Classification Tree on the Data Display")
Classicaon and Regression
[ 264 ]
It can be easily seen from the following that rpart works really well:
Figure 5: The terminal nodes on the original display of the data
What just happened?
We obtained our first classification tree, which is a good thing. Given the actual data display, the classification tree gives satisfactory answers.
We have understood the "how" part of a classification tree. The "why" aspect is very vital in science, and the next section explains the science behind the construction of a regression tree; it will be followed later by a detailed explanation of the working of a classification tree.
The construction of a regression tree
In the CART_Dummy dataset, the output is a categorical variable, and we built a classification tree for it. In Chapter 6, Linear Regression Analysis, the linear regression models were built for a continuous random variable, while in Chapter 7, The Logistic Regression Model, we built a logistic regression model for a binary random variable. The same distinction is required in CART, and we thus build classification trees for binary random variables, whereas regression trees are for continuous random variables. Recall the rationale behind the estimation of regression coefficients for the linear regression model. The main goal was to find the estimates of the regression coefficients that minimize the error sum of squares between the actual regressand values and the fitted values. A similar approach is followed here, in the sense that we need to split the data at the points that keep the residual sum of squares to a minimum. That is, for each unique value of a predictor, which is a candidate for the node value, we find the sum of squares of the y's within each partition of the data, and then add them up. This step is performed for each unique value of the predictor, and the value that leads to the least sum of squares among all the candidates is selected as the best split point for that predictor. In the next step, we find the best split points for each of the predictors, and then the best split is selected across the best split points of the predictors. Easy!
Now, the data is partitioned into two parts according to the best split. The process of finding the best split within each partition is repeated in the same spirit as for the first split. This process is carried out in a recursive fashion until the data can't be partitioned any further. What is happening here? The residual sum of squares at each child node will be smaller than that in the parent node.
At the outset, we note that the rpart function does exactly the same thing. However, for a cleaner understanding of the regression tree, we will write raw R code and ensure that there is no ambiguity in the process of understanding CART. We will begin with a simple example of a regression tree, and use the rpart function to plot the regression tree. Then, we will first define a function that extracts the best split given a covariate and the dependent variable. This action will be repeated for all the available covariates, and then we find the best overall split. This will be verified with the regression tree. The data will then be partitioned by using the best overall split, and then the best split will be identified for each of the partitioned parts. The process will be repeated until we reach the end of the complete regression tree given by rpart. First, the experiment!
Classicaon and Regression
[ 266 ]
The cpus dataset available in the MASS package contains the relative performance measure of 209 CPUs in the perf variable. It is known that the performance of a CPU depends on factors such as the cycle time in nanoseconds (syct), the minimum and maximum main memory in kilobytes (mmin and mmax), the cache size in kilobytes (cach), and the minimum and maximum number of channels (chmin and chmax). The task in hand is to model perf as a function of syct, mmin, mmax, cach, chmin, and chmax. The histogram of perf (try hist(cpus$perf)) will show a highly skewed distribution, and hence we will build a regression tree for the logarithmic transformation log10(perf).
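As a quick optional check of the skewness claim (a small sketch, not one of the numbered steps; it assumes the MASS package has already been loaded so that cpus is available), the two histograms can be compared side by side.
# Compare the raw and log10-transformed performance measure (optional check)
par(mfrow=c(1,2))
hist(cpus$perf, main="perf", xlab="perf")                        # heavily right-skewed
hist(log10(cpus$perf), main="log10(perf)", xlab="log10(perf)")   # far more symmetric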
Time for action – the construction of a regression tree
A regression tree is first built by using the rpart function. The getNode function is then introduced, which helps in identifying the split node at each stage; using it, we build a regression tree and verify that we get the same tree as the one returned by the rpart function.
1. Load the MASS library by using library(MASS).
2. Create the regression tree for the logarithm (to the base 10) of perf as a function of the covariates explained earlier, and display the regression tree:
cpus.ltrpart <- rpart(log10(perf)~syct+mmin+mmax+cach+chmin+chmax, data=cpus)
plot(cpus.ltrpart); text(cpus.ltrpart)
The regression tree will be displayed as follows:
Figure 6: Regression tree for the "perf" of a CPU
We will now dene the getNode funcon. Given the regressand and the
covariate, we need to nd the best split in the sense of the sum of squares criterion.
The evaluaon needs to be done for every disnct value of the covariate. If there are
m disnct points, we need m-1 evaluaons. At each disnct point, the regressand
needs to be paroned accordingly, and the sum of squares should be obtained for
each paron. The two sums of squares (in each part) are then added to obtain the
reduced sum of squares. Thus, we create the required funcon to meet all
these requirements.
3. Create the getNode function in R by running the following code:
getNode <- function(x, y) {
  # Candidate split points: the distinct values of x, in decreasing order
  xu <- sort(unique(x), decreasing=TRUE)
  ss <- numeric(length(xu)-1)
  for(i in 1:length(ss)) {
    # Partition y according to whether x lies above or at/below the candidate point
    partR <- y[x>xu[i]]
    partL <- y[x<=xu[i]]
    # Within-partition sums of squares, added to give the split criterion
    partRSS <- sum((partR-mean(partR))^2)
    partLSS <- sum((partL-mean(partL))^2)
    ss[i] <- partRSS + partLSS
  }
  return(list(xnode=xu[which(ss==min(ss,na.rm=TRUE))],
    minss=min(ss,na.rm=TRUE), ss, xu))
}
The getNode funcon gives the best split for a given covariate. It returns a list
consisng of four objects:
xnode, which is a datum of the covariate x that gives the minimum residual
sum of squares for the regressand y
The value of the minimum residual sum of squares
The vector of the residual sum of squares for the distinct points of the
vector x
The vector of the distinct x values
We will run this funcon for each of the six covariates, and nd the best overall split.
The argument na.rm=TRUE is required, as at the maximum value of x we won't get
a numeric value.
Classicaon and Regression
[ 268 ]
4. We will rst execute the getNode funcon on the syct covariate, and look at the
output we get as a result:
> getNode(cpus$syct,log10(cpus$perf))$xnode
[1] 48
> getNode(cpus$syct,log10(cpus$perf))$minss
[1] 24.72
> getNode(cpus$syct,log10(cpus$perf))[[3]]
[1] 43.12 42.42 41.23 39.93 39.44 37.54 37.23 36.87 36.51 36.52
35.92 34.91
[13] 34.96 35.10 35.03 33.65 33.28 33.49 33.23 32.75 32.96 31.59
31.26 30.86
[25] 30.83 30.62 29.85 30.90 31.15 31.51 31.40 31.50 31.23 30.41
30.55 28.98
[37] 27.68 27.55 27.44 26.80 25.98 27.45 28.05 28.11 28.66 29.11
29.81 30.67
[49] 28.22 28.50 24.72 25.22 26.37 28.28 29.10 33.02 34.39 39.05
39.29
> getNode(cpus$syct,log10(cpus$perf))[[4]]
[1] 1500 1100 900 810 800 700 600 480 400 350 330 320
300 250 240
[16] 225 220 203 200 185 180 175 167 160 150 143 140
133 125 124
[31] 116 115 112 110 105 100 98 92 90 84 75 72
70 64 60
[46] 59 57 56 52 50 48 40 38 35 30 29 26
25 23 17
The least sum of squares at a split for the best split value of the syct variable is 24.72, and it occurs at a value of syct greater than 48. The third and fourth list objects given by getNode respectively contain the details of the sums of squares for the potential candidates and the unique values of syct. The values of interest are highlighted. Thus, we will first look at the second object from the list output for all six covariates to find the best split among the best splits of each of the variables, by the residual sum of squares criterion.
5. Now, run the getNode function for the remaining five covariates:
getNode(cpus$syct,log10(cpus$perf))[[2]]
getNode(cpus$mmin,log10(cpus$perf))[[2]]
getNode(cpus$mmax,log10(cpus$perf))[[2]]
getNode(cpus$cach,log10(cpus$perf))[[2]]
getNode(cpus$chmin,log10(cpus$perf))[[2]]
getNode(cpus$chmax,log10(cpus$perf))[[2]]
getNode(cpus$cach,log10(cpus$perf))[[1]]
sort(getNode(cpus$cach,log10(cpus$perf))[[4]],decreasing=FALSE)
The output is as follows:
Figure 7: Obtaining the best "first split" of regression tree
The sum of squares for cach is the lowest, and hence we need to find the best split associated with it, which is 24. However, the regression tree shows that the best split is at the cach value of 27. The getNode function says that the best split occurs at a point greater than 24, and hence we take the average of 24 and the next unique point, 30, which gives 27. Having obtained the best overall split, we next obtain the first partition of the dataset.
6. Paron the data by using the best overall split point:
cpus_FS_R <- cpus[cpus$cach>=27,]
cpus_FS_L <- cpus[cpus$cach<27,]
The new names of the data objects are clear with _FS_R indicang the dataset
obtained on the right side for the rst split, and _FS_L indicang the le side.
In the rest of the secon, the nomenclature won't be further explained.
7. Idenfy the best split in each of the paroned datasets:
getNode(cpus_FS_R$syct,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmin,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$cach,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$chmin,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$chmax,log10(cpus_FS_R$perf))[[2]]
getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[1]]
sort(getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_L$syct,log10(cpus_FS_L$perf))[[2]]
Classicaon and Regression
[ 270 ]
getNode(cpus_FS_L$mmin,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$cach,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$chmin,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$chmax,log10(cpus_FS_L$perf))[[2]]
getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[1]]
sort(getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[4]],
decreasing=FALSE)
The following screenshot gives the results of running the preceding R code:
Figure 8: Obtaining the next two splits
Thus, for the rst right paroned data, the best split is for the mmax value as the
mid-point between 24000 and 32000; that is, at mmax = 28000. Similarly, for the
rst le-paroned data, the best split is the average value of 6000 and 6200,
which is 6100, for the same mmax covariate. Note the important step here. Even
though we used cach as the criteria for the rst paron, it is sll used with the two
paroned data. The results are consistent with the display given by the regression
tree, Figure 6: Regression tree for the "perf" of a CPU. The next R program will take
care of the enre rst split's right side's future parons.
8. Paron the rst right part cpus_FS_R as follows:
cpus_FS_R_SS_R <- cpus_FS_R[cpus_FS_R$mmax>=28000,]
cpus_FS_R_SS_L <- cpus_FS_R[cpus_FS_R$mmax<28000,]
Obtain the best split for cpus_FS_R_SS_R and cpus_FS_R_SS_L by running the
following code:
cpus_FS_R_SS_R <- cpus_FS_R[cpus_FS_R$mmax>=28000,]
cpus_FS_R_SS_L <- cpus_FS_R[cpus_FS_R$mmax<28000,]
getNode(cpus_FS_R_SS_R$syct,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$mmin,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$mmax,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$chmin,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$chmax,log10(cpus_FS_R_SS_R$perf))[[2]]
getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[1]]
sort(getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_R_SS_L$syct,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$mmin,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$mmax,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$chmin,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$chmax,log10(cpus_FS_R_SS_L$perf))[[2]]
getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))[[1]]
sort(getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))
[[4]],decreasing=FALSE)
Classicaon and Regression
[ 272 ]
For the cpus_FS_R_SS_R part, the final division is according to whether cach is greater than 56 or not (the average of 48 and 64). If the cach value in this partition is greater than 56, then perf (actually log10(perf)) ends in the terminal leaf with value 3, else 2. However, for the region cpus_FS_R_SS_L, we partition the data further according to whether the cach value is greater than 96.5 (the average of 65 and 128). On the right side of this region, log10(perf) is found as 2, and a third-level split is required for cpus_FS_R_SS_L with cpus_FS_R_SS_L_TS_L. Note that though the final terminal leaves of the cpus_FS_R_SS_L_TS_L region show the same value 2 for the final log10(perf), this split may still result in a significant reduction in the variability of the difference between the predicted and the actual log10(perf) values. We will now focus on the left side of the first main split.
Figure 9: Partitioning the right partition after the first main split
Classicaon and Regression
[ 274 ]
9. Paron cpus_FS_L accordingly, as the mmax value being greater than 6100
or otherwise:
cpus_FS_L_SS_R <- cpus_FS_L[cpus_FS_L$mmax>=6100,]
cpus_FS_L_SS_L <- cpus_FS_L[cpus_FS_L$mmax<6100,]
The rest of the paron for cpus_FS_L is completely given next.
10. The details will be skipped and the R program is given right away:
cpus_FS_L_SS_R <- cpus_FS_L[cpus_FS_L$mmax>=6100,]
cpus_FS_L_SS_L <- cpus_FS_L[cpus_FS_L$mmax<6100,]
getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$mmin,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$mmax,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$cach,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$chmin,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$chmax,log10(cpus_FS_L_SS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[1]]
sort(getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[4]],
decreasing=FALSE)
getNode(cpus_FS_L_SS_L$syct,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmin,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$cach,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$chmin,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$chmax,log10(cpus_FS_L_SS_L$perf))[[2]]
getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))[[1]]
sort(getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))
[[4]],decreasing=FALSE)
cpus_FS_L_SS_R_TS_R <- cpus_FS_L_SS_R[cpus_FS_L_SS_R$syct<360,]
getNode(cpus_FS_L_SS_R_TS_R$syct,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$mmin,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$mmax,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$cach,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmax,log10(cpus_FS_L_SS_R_TS_R$perf))[[2]]
getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$perf))[[1]]
sort(getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$perf))[[4]],
  decreasing=FALSE)
We will now see how the preceding R code gets us closer to the regression tree:
Figure 10: Partitioning the left partition after the first main split
We leave it to you to interpret the output arising from the previous action.
Classicaon and Regression
[ 276 ]
What just happened?
Using the rpart function from the rpart library, we first built the regression tree for log10(perf). Then, we explored the basic definitions underlying the construction of a regression tree and defined the getNode function to obtain the best split for a pair of a regressand and a covariate. This function was then applied to all the covariates, and the best overall split was obtained; using this, we got our first partition of the data, which was in agreement with the tree given by the rpart function. We then recursively partitioned the data by using the getNode function and verified that all the best splits in each partitioned part are in agreement with the ones provided by the rpart function.
The reader may wonder if the preceding tedious task was really essential. However, it has been the experience of the author that users/readers seldom remember the rationale behind directly used code/functions of any software after some time. Moreover, CART is a difficult concept, and it is imperative that we clearly understand our first tree, and return to the preceding program whenever the understanding of the science behind CART is forgotten. The construction of a classification tree uses entirely different metrics, and hence its working is also explained in considerable depth in the next section.
The construction of a classication tree
We rst need to set up the spling criteria for a classicaon tree. In the case of a
regression tree, we saw the sum of squares as the spling criteria. For idenfying the split
for a classicaon tree, we need to dene certain measures known as impurity measures.
The three popular measures of impurity are Bayes error, the cross-entropy funcon, and Gini
index. Let p denote the percentage of success in a dataset of size n. The formulae of these
impurity measures are given in the following table:
Measure                      Formula
Bayes error                  \( \phi_B(p) = \min(p, 1 - p) \)
The cross-entropy measure    \( \phi_{CE}(p) = -p \log p - (1 - p) \log (1 - p) \)
Gini index                   \( \phi_G(p) = p(1 - p) \)
We will write a short program to understand the shape of these impurity measures as a function of p:
p <- seq(0.01,0.99,0.01)
plot(p,pmin(p,1-p),"l",col="red",xlab="p",xlim=c(0,1),ylim=c(0,1),ylab="Impurity Measures")
points(p,-p*log(p)-(1-p)*log(1-p),"l",col="green")
points(p,p*(1-p),"l",col="blue")
title(main="Impurity Measures")
legend(0.6,1,c("Bayes Error","Cross-Entropy","Gini Index"),col=c("red","green","blue"),pch="-")
The preceding R code, when executed in an R session, gives the following output:
Figure 11: Impurity metrics – Bayes error, cross-entropy, and Gini index
Basically, we have these three choices of impurity metrics as a building block of a classification tree. The popular choice is the Gini index, and there are detailed discussions about the reason in the literature; see Breiman, et. al. (1984). However, we will not delve into this aspect, and for the development in this section, we will use the cross-entropy function.
Classicaon and Regression
[ 278 ]
Now, for a given predictor, assume that we have a node denoted by A. In the initial stage, where there are no partitions, the impurity is based on the proportion p. The impurity of node A is taken to be a non-negative function of the probability that y = 1, written mathematically as p(y = 1 | A). The impurity of node A is defined as follows:

\( I(A) = \phi\left[ p(y = 1 \mid A) \right] \)

Here, \( \phi \) is one of the three impurity measures. When A is one of the internal nodes, the tree gets bifurcated into the left- and right-hand sides; that is, we now have a left daughter \( A_L \) and a right daughter \( A_R \). For the moment, we will take the split according to the predictor variable x; that is, if \( x \leq c \), the observation moves to \( A_L \), otherwise to \( A_R \). Then, according to the split criterion, we have the following table; this is the same as Table 3.2 of Berk (2008):

Split criterion          Failure (0)    Success (1)    Total
\( A_L: x \leq c \)      \( n_{11} \)   \( n_{12} \)   \( n_{1.} \)
\( A_R: x > c \)         \( n_{21} \)   \( n_{22} \)   \( n_{2.} \)
Total                    \( n_{.1} \)   \( n_{.2} \)   \( n_{..} \)
Using the frequencies in the preceding table, the impurities of the daughter nodes \( A_L \) and \( A_R \), based on the cross-entropy metric, are given as follows:

\( I(A_L) = -\left[ \frac{n_{11}}{n_{1.}} \log \frac{n_{11}}{n_{1.}} + \frac{n_{12}}{n_{1.}} \log \frac{n_{12}}{n_{1.}} \right] \)

And:

\( I(A_R) = -\left[ \frac{n_{21}}{n_{2.}} \log \frac{n_{21}}{n_{2.}} + \frac{n_{22}}{n_{2.}} \log \frac{n_{22}}{n_{2.}} \right] \)

The probabilities of an observation falling in the left- and right-hand daughter nodes are respectively given by \( p(A_L) = n_{1.}/n_{..} \) and \( p(A_R) = n_{2.}/n_{..} \). Then, the benefit of splitting the node A is given as follows:

\( \Delta(A) = I(A) - p(A_L) I(A_L) - p(A_R) I(A_R) \)
Now, we compute \( \Delta(A) \) for all unique values of a predictor, and choose as the best split the value for which \( \Delta(A) \) is a maximum. This step is repeated across all the variables, and the best split, the one with the maximum \( \Delta(A) \), is selected. According to the best split, the data is partitioned, and as seen earlier during the construction of the regression tree, a similar search is performed in each of the partitioned datasets. The process continues until the gain from a split falls below a minimum threshold in each of the partitioned datasets.
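To make the computation concrete, the following short R snippet (a sketch with made-up counts, not taken from the original text) evaluates the cross-entropy impurities and the gain \( \Delta(A) \) for a single candidate split summarized by a 2 x 2 table of counts:
# Cross-entropy impurity for a vector of class counts (hypothetical toy numbers)
cross_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                     # guard against log(0)
  -sum(p * log(p))
}
# Rows are the daughter nodes (A_L, A_R); columns are (Failure, Success)
n <- matrix(c(30, 10,
               5, 25), nrow = 2, byrow = TRUE)
I_A  <- cross_entropy(colSums(n))   # impurity of the parent node A
I_AL <- cross_entropy(n[1, ])       # impurity of the left daughter
I_AR <- cross_entropy(n[2, ])       # impurity of the right daughter
p_AL <- sum(n[1, ]) / sum(n)        # probability of falling into A_L
p_AR <- sum(n[2, ]) / sum(n)        # probability of falling into A_R
I_A - p_AL * I_AL - p_AR * I_AR     # the gain Delta(A) of this split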
We will begin with the classicaon tree as delivered by the rpart funcon. The illustrave
dataset kyphosis is selected from the rpart library itself. The data relates to children
who had correcve spinal surgery. This medical problem is about the exaggerated outward
curvature of the thoracic region of the spine, which results in a rounded upper back. In
this study, 81 children underwent a spinal surgery and aer the surgery, informaon is
captured to know whether the children sll have the kyphosis problem in the column
named Kyphosis. The value of Kyphosis="absent" indicates that the child has been
cured of the problem, and Kyphosis="present" means that child has not been cured for
kyphosis. The other informaon captured is related to the age of the children, the number
of vertebrae involved, and the number of rst (topmost) vertebrae operated on. The task
for us is building a classicaon tree, which gives the Kyphosis status dependent on the
described variables.
We will rst build the classicaon tree for Kyphosis as a funcon of the three variables
Age, Start, and Number. The tree will then be displayed and rules will be extracted from
it. The getNode funcon will be dened based on the cross-entropy funcon, which will be
applied on the raw data and the rst overall opmal split obtained to paron the data.
The process will be recursively repeated unl we get the same tree as returned by the
rpart funcon.
Time for action – the construction of a classification tree
The getNode function is now defined here to help us identify the best split for the classification problem. For the kyphosis dataset from the rpart package, we plot the classification tree by using the rpart function. The tree is then reobtained by using the getNode function.
1. Using the option split="information", construct a classification tree based on the cross-entropy information for the kyphosis data with the following code:
ky_rpart <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, parms=list(split="information"))
Classicaon and Regression
[ 280 ]
2. Visualize the classicaon tree by using plot(ky_rpart); text(ky_rpart):
Figure 12: Classification tree for the kyphosis problem
3. Extract the rules from ky_rpart by using the asRules function from the rattle package:
> asRules(ky_rpart)
Rule number: 15 [Kyphosis=present cover=13 (16%) prob=0.69]
Start< 12.5
Age>=34.5
Number>=4.5
Rule number: 14 [Kyphosis=absent cover=12 (15%) prob=0.42]
Start< 12.5
Age>=34.5
Number< 4.5
Rule number: 6 [Kyphosis=absent cover=10 (12%) prob=0.10]
Start< 12.5
Age< 34.5
Rule number: 2 [Kyphosis=absent cover=46 (57%) prob=0.04]
Start>=12.5
4. Dene the getNode funcon for the classicaon problem:
In the preceding funcon, the key funcons would be unique, table, and log.
We use unique to ensure that the search is carried for the disnct elements of
the predictor values in the data. table gets the required counts as discussed earlier in
this secon. The if condion ensures that neither the p nor 1-p values become 0, in
which case the logs become minus innity. The rest of the coding is self-explanatory.
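Since the function body appeared only as a screenshot in the original, the following is a minimal reconstruction (an illustrative sketch consistent with the preceding description and with how getNode is indexed in the following steps, not the author's original code):
getNode <- function(x, y) {
  xu <- sort(unique(x))                      # distinct candidate split points
  n  <- length(y)
  # cross-entropy impurity of a 0/1 vector; the if condition guards against log(0)
  impurity <- function(yy) {
    cnt <- table(factor(yy, levels = c(0, 1)))
    p <- as.numeric(cnt["1"] / sum(cnt))
    if (p == 0 | p == 1) return(0)
    -p * log(p) - (1 - p) * log(1 - p)
  }
  I_A   <- impurity(y)                       # impurity before splitting
  gains <- numeric(length(xu))
  for (i in seq_along(xu)) {
    yl <- y[x <= xu[i]]; yr <- y[x > xu[i]]
    if (length(yl) == 0 | length(yr) == 0) next
    gains[i] <- I_A - (length(yl) / n) * impurity(yl) -
                      (length(yr) / n) * impurity(yr)
  }
  best <- which.max(gains)
  # [[1]] best split point, [[2]] maximum gain, [[3]] candidate points, [[4]] all gains
  list(xu[best], gains[best], xu, gains)
}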
Let us now get our rst best split.
5. We will need a few data manipulaons to ensure that our R code works on the
expected lines:
KYPHOSIS <- kyphosis
KYPHOSIS$Kyphosis_y <- (kyphosis$Kyphosis=="absent")*1
6. To nd the rst best split among the three variables, execute the following code;
the output is given in a consolidated screenshot aer all the iteraons:
getNode(KYPHOSIS$Age,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Number,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[2]]
getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[4]],
decreasing=FALSE)
Classicaon and Regression
[ 282 ]
Now, getNode indicates that the best split occurs for the Start variable, and the point of the best split is 12. Keeping in line with the argument of the previous section, we split the data into two parts according to whether the Start value is greater than the average of 12 and 13, that is, 12.5. For the partitioned data, the search proceeds in a recursive fashion.
7. Paron the data accordingly, as the Start values are greater than 12.5, and nd
the best split for the right daughter node, as the tree display shows that a search in
the le daughter node is not necessary:
KYPHOSIS_FS_R <- KYPHOSIS[KYPHOSIS$Start<12.5,]
KYPHOSIS_FS_L <- KYPHOSIS[KYPHOSIS$Start>=12.5,]
getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Number,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Start,KYPHOSIS_FS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[4]],
decreasing=FALSE)
The maximum incremental value occurs for the predictor Age, and the split point is 27. Again, we take the average of 27 and the next highest value of 42, which turns out to be 34.5. The (first) right daughter node region is then partitioned into two parts according to whether the Age values are greater than 34.5, and the search for the next split continues in the current right daughter part.
8. The following code completes our search:
KYPHOSIS_FS_R_SS_R <- KYPHOSIS_FS_R[KYPHOSIS_FS_R$Age>=34.5,]
KYPHOSIS_FS_R_SS_L <- KYPHOSIS_FS_R[KYPHOSIS_FS_R$Age<34.5,]
getNode(KYPHOSIS_FS_R_SS_R$Age,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Start,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]]
getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[1]]
sort(getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[4]],decreasing=FALSE)
We see that the nal split occurs for the predictor Number and the split is 4,
and we again stop at 4.5.
We see that the results from our raw code completely agree with the rpart
funcon. Thus, the eorts of wring custom code for the classicaon tree have
paid the right dividends. We now have enough clarity for the construcon of the
classicaon tree:
Figure 13: Finding the best splits for classification tree using the getnode function
Classicaon and Regression
[ 284 ]
What just happened?
A deliberate attempt has been made at demystifying the construction of a classification tree. As with the earlier attempt at understanding a regression tree, we first deployed the rpart function and saw a display of the classification tree for Kyphosis as a function of Age, Start, and Number, for the choice of the cross-entropy impurity metric. The getNode function was defined on the basis of the same impurity metric, and in a very systematic fashion, we reproduced the same tree as obtained by the rpart function.
With the understanding of the basic construction behind us, we will now build the classification tree for the German credit data problem.
Classication tree for the German credit data
In Chapter 7, The Logisc Regression Model, we constructed a logisc regression model,
and in the previous chapter, we obtained the ridge regression version for the German credit
data problem. However, problems such as these and many others may have non-linearity
built in them, and it is worthwhile to look at the same problem by using a classicaon
tree. Also, we saw another model performance of the German credit data using the train,
validate, and test approach. We will have the following approach. First, we will paron
the German dataset into three parts, namely train, validate, and test. The classicaon tree
will be built by using the data in the train set and then it will be applied on the validate
part. The corresponding ROC curves will be visualized, and if we feel that the two curves
are reasonably similar, we will apply it on the test region, and take the necessary acon of
sanconing the customers their required loan.
Time for action – the construction of a classification tree
A classification tree is now built for the German credit data by using the rpart function. The approach of train, validate, and test is implemented, and the ROC curves are obtained too.
1. The following code has been used earlier in the book, and hence there won't be an explanation of it:
set.seed(1234567)
data_part_label <- c("Train","Validate","Test")
indv_label = sample(data_part_label,size=1000,replace=TRUE,prob=c(0.6,0.2,0.2))
library(ROCR)
data(GC)
GC_Train <- GC[indv_label=="Train",]
GC_Validate <- GC[indv_label=="Validate",]
GC_Test <- GC[indv_label=="Test",]
2. Create the classicaon tree for the German credit data, and visualize the tree.
We will also extract the rules from this classicaon tree:
GC_rpart <- rpart(good_bad~.,data=GC_Train)
plot(GC_rpart); text(GC_rpart)
asRules(GC_rpart)
The classicaon tree for the German credit data appears as in the
following screenshot:
Figure 14: Classification tree for the test part of the German credit data problem
Classicaon and Regression
[ 286 ]
By now, we know how to nd the rules of this tree. An edited version of the rules is
given as follows:
Figure 15: Rules for the German credit data
3. We use the tree given in the previous step on the validate region, and plot the ROC for both the regions:
Pred_Train_Class <- predict(GC_rpart,type='class')
Pred_Train_Prob <- predict(GC_rpart,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
We will go ahead and predict for the test part too.
4. The necessary code is the following:
Pred_Test_Class <- predict(GC_rpart,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
The final ROC curve plot looks similar to the following screenshot:
Figure 16: ROC Curves for German Credit Data
Classicaon and Regression
[ 288 ]
The performance of the classification tree is certainly not satisfactory even on the validate group. The only solace here is that the test curve is quite similar to the validate curve. We will look at more modern ways of improving the basic classification tree in the next chapter. The classification tree in Figure 14: Classification tree for the test part of the German credit data problem is very large and complex, and we sometimes need to truncate the tree to make the classification method a bit simpler. Of course, one of the things that we should suspect whenever we look at very large trees is that we may again be facing the problem of overfitting. The final section deals with a simple method of overcoming this problem.
What just happened?
A classification tree has been built for the German credit dataset. The ROC curve shows that the tree does not perform well on the validate data part. In the next and concluding section, we look at two ways of improving this tree.
Have a go hero
Using the getNode function, verify the first five splits of the classification tree for the German credit data.
Pruning and other ner aspects of a tree
Recall from Figure 14: Classicaon tree for the test part of the German credit data problem
that the rules numbered 21, 143, 69, 165, 142, 70, 40, 164, and 16, respecvely, covered
only 20, 25, 11, 11, 14, 12, 28, 19, and 22. If we look at the total number of observaons,
we have about 600, and individually these rules do not cover even about ve percent of
them. This is one reason to suspect that maybe we overed the data. Using the opon of
minsplit, we can restrict the minimum number of observaons each rule should cover at
the least.
Another technical way of reducing the complexity of a classicaon tree is by "pruning" the
tree. Here, the least important splits are recursively snipped o according to the complexity
parameter; for details, refer to Breiman, et. al. (1984), or Secon 3.6 of Berk (2008). We will
illustrate the acon through the R program.
Time for action – pruning a classification tree
A CART is improved by using the minsplit and cp arguments in the rpart function.
1. Invoke the graphics editor with par(mfrow=c(1,2)).
2. Specify minsplit=30, and re-do the ROC plots by using the new classification tree:
GC_rpart_minsplit <- rpart(good_bad~.,data=GC_Train,minsplit=30)
Pred_Train_Class <- predict(GC_rpart_minsplit,type='class')
Pred_Train_Prob <- predict(GC_rpart_minsplit,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart_minsplit,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart_minsplit,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
Pred_Test_Class <- predict(GC_rpart_minsplit,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart_minsplit,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
title(main='Improving a Classification Tree with "minsplit"')
3. For the pruning factor cp=0.02, repeat the ROC plot exercise:
GC_rpart_prune <- prune(GC_rpart,cp=0.02)
Pred_Train_Class <- predict(GC_rpart_prune,type='class')
Pred_Train_Prob <- predict(GC_rpart_prune,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Classicaon and Regression
[ 290 ]
Pred_Validate_Class <- predict(GC_rpart_prune,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart_prune,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
Pred_Test_Class <- predict(GC_rpart_prune,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart_prune,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
title(main="Improving a Classification Tree with Pruning")
The choice of cp=0.02 has been drawn from the plot of the complexity parameter against the relative error; try it yourself with plotcp(GC_rpart).
Figure 17: Pruning the CART
What just happened?
Using the minsplit and cp options, we have managed to obtain a reduced set of rules, and in that sense, the fitted model does not appear to be an overfit. The ROC curves show that there has been a considerable improvement in the performance on the validate region. Again, as earlier, the validate and test regions have similar ROC curves, and it is hence preferable to use GC_rpart_prune or GC_rpart_minsplit over GC_rpart.
Pop quiz
With the experience of model selection from the previous chapter, justify the choice of cp=0.02 from the plot obtained as a result of running plotcp(GC_rpart).
Summary
We began with the idea of recursive partitioning and gave a legitimate reason as to why such an approach is practical. The CART technique was completely demystified by using the getNode function, which was defined appropriately depending upon whether we require a regression or a classification tree. With the conviction behind us, we applied the rpart function to the German credit data, and with its results, we had basically two problems. First, the fitted classification tree appeared to overfit the data. This problem may often be overcome by using the minsplit and cp options. The second problem was that the performance was really poor on the validate region. Though the reduced classification trees had slightly better performance compared to the initial tree, we still need to improve the classification tree. The next chapter will focus more on this aspect and discuss the modern developments of CART.
Chapter 10 – CART and Beyond
In the previous chapter, we studied CART as a powerful recursive partitioning
method, useful for building (non-linear) models. Despite the overall generality,
CART does have certain limitations that necessitate some enhancements. It is
these extensions that form the crux of the final chapter of this book. For some
technical reasons, we will focus solely on the classification trees in this chapter.
We will also briefly look at some limitations of the CART tool.
The rst improvement that can be made to the CART is provided by the bagging technique.
In this technique, we build mulple trees on the bootstrap samples drawn from the
actual dataset. An observaon is put through each of the trees and a predicon is made
for its class, and based on the majority predicon of its class, it is predicted to belong to
the majority count class. A dierent approach is provided by Random Forests, where you
consider a random pool of covariates against the observaons. We nally consider another
important enhancement of a CART by using the boosng algorithms. The chapter will
discuss the following:
Cross-validaon errors for CART
The bootstrap aggregaon (bagging) technique for CART
Extending the CART with random forests
A consolidaon of the applicaons developed from Chapter 6 to Chapter 10,
CART and Beyond
[ 294 ]
Improving CART
In the Another look at model assessment section of Chapter 8, we saw that the technique of train + validate + test may be further enhanced by using the cross-validation technique. In the case of the linear regression model, we used the CVlm function from the DAAG package for the purpose of cross-validation of linear models. The cross-validation technique for logistic regression models may be carried out by using the CVbinary function from the same package.
Profs. Therneau and Atkinson created the rpart package, and detailed documentation of the entire rpart package is available on the Web at http://www.mayo.edu/hsr/techrpt/61.pdf. Recall the slight improvement provided in the Pruning and other finer aspects of a tree section of the previous chapter. The two aspects considered there related to the complexity parameter cp and the minimum split criterion minsplit. Now, the problem of overfitting with CART may be reduced to an extent by using the cross-validation technique. In the ridge regression model, we had the problem of selecting the penalty factor \( \lambda \). Similarly, here we have the problem of selecting the complexity parameter, though not in an analogous way. That is, for the complexity parameter, which is a number between 0 and 1, we need to obtain the predictions based on the cross-validation technique. This may lead to a small loss of accuracy; however, we will then increase the accuracy by looking at the generality. An object of the rpart class has many summaries contained within it, and the various complexity parameters are stored in the cptable matrix. This matrix has values for the following metrics: CP, nsplit, rel error, xerror, and xstd. Let us understand this matrix through the default example in the rpart package, which is example(xpred.rpart); see Figure 1: Understanding the example for the "xpred.rpart" function:
Figure 1: Understanding the example for the "xpred.rpart" function
Here the tree has CP at four values, namely 0.595, 0.135, 0.013, and 0.010. The corresponding nsplit numbers are 0, 1, 2, and 3, and similarly, the relative error values, xerror and xstd, are given in the last part of the previous screenshot. The interpretation of the CP value is slightly different, the reason being that these have to be considered as ranges and not values, in the sense that the rest of the performance is not with respect to the CP values as mentioned previously; rather, they are with respect to the intervals [0.595, 1], [0.135, 0.595), [0.013, 0.135), and [0.010, 0.013); see ?xpred.rpart for more information. Now, the function xpred.rpart returns the predictions based on the cross-validation technique. Therefore, we will use this function for the German credit data problem and for different CP values (actually ranges), to obtain the accuracy of the cross-validation technique.
Time for action – cross-validation predictions
We will use the xpred.rpart function from rpart to obtain the cross-validation predictions from an rpart object.
1. Load the German dataset and the rpart package using data(GC); library(rpart).
2. Fit the classification tree with GC_Complete <- rpart(good_bad~., data=GC).
3. Check cptable with GC_Complete$cptable:
CP nsplit rel error xerror xstd
1 0.05167 0 1.0000 1.0000 0.04830
2 0.04667 3 0.8400 0.9833 0.04807
3 0.01833 4 0.7933 0.8900 0.04663
4 0.01667 6 0.7567 0.8933 0.04669
5 0.01556 8 0.7233 0.8800 0.04646
6 0.01000 11 0.6767 0.8833 0.04652
4. Obtain the cross-validation predictions using GC_CV_Pred <- xpred.rpart(GC_Complete).
5. Find the accuracy of the cross-validation predictions:
sum(diag(table(GC_CV_Pred[,2],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,3],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,4],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,5],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,6],GC$good_bad)))/1000
The accuracy output is as follows:
> sum(diag(table(GC_CV_Pred[,2],GC$good_bad)))/1000
[1] 0.71
> sum(diag(table(GC_CV_Pred[,3],GC$good_bad)))/1000
[1] 0.744
> sum(diag(table(GC_CV_Pred[,4],GC$good_bad)))/1000
[1] 0.734
> sum(diag(table(GC_CV_Pred[,5],GC$good_bad)))/1000
[1] 0.74
> sum(diag(table(GC_CV_Pred[,6],GC$good_bad)))/1000
[1] 0.741
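The same five accuracies can also be computed in one sweep (a small sketch, not part of the original code):
sapply(2:6, function(j) sum(diag(table(GC_CV_Pred[, j], GC$good_bad))) / 1000)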
It is natural that when you execute the same code, you will most likely get a different output. Why is that? Also, you need to answer for yourselves why we did not check the accuracy for GC_CV_Pred[,1]. In general, for a decreasing CP range, we expect higher accuracy. We have checked the cross-validation predictions for various CP ranges. There are also other techniques to enhance the performance of a CART.
What just happened?
We used the xpred.rpart function to obtain the cross-validation predictions for a range of CP values. The accuracy of a prediction model has been assessed by using simple functions such as table and diag.
However, the control actions of minsplit and cp are of a reactive nature, applied after the splits have already been decided. In that sense, when we have a large number of covariates, the CART may lead to an overfit of the data and may try to capture all the local variations of the data, and thus lose sight of the overall generality. Therefore, we need useful mechanisms to overcome this problem.
The classification and regression tree considered in the previous chapter is a single model. That is, we are seeking the opinion (prediction) of a single model. Wouldn't it be nice if we could extend this! Alternatively, we can seek multiple models instead of a single model. What does this mean? In the forthcoming sections, we will see the use of multiple models for the same problem.
Bagging
Bagging is an abbreviation for bootstrap aggregation. The important underlying concept here is the bootstrap, which was invented by the eminent scientist Bradley Efron. We will first digress here a bit from the CART technique and consider a very brief illustration of the bootstrap technique.
The bootstrap
Consider a random sample \( X_1, \ldots, X_n \) of size n from \( f(x, \theta) \). Let \( T(X_1, \ldots, X_n) \) be an estimator of \( \theta \). To begin with, we first draw a random sample of size n from \( X_1, \ldots, X_n \) with replacement; that is, we obtain a random sample \( X_1^*, X_2^*, \ldots, X_n^* \), where some of the observations from the original sample may be repeated and some may not be present at all. There is no one-to-one correspondence between \( X_1, \ldots, X_n \) and \( X_1^*, X_2^*, \ldots, X_n^* \). Using \( X_1^*, X_2^*, \ldots, X_n^* \), we compute \( T^{(1)}(X_1^*, \ldots, X_n^*) \). Repeat this exercise a large number of times, say B. The inference for \( \theta \) is carried out by using the sampling distribution of the bootstrap estimates \( T^{(1)}(X_1^*, \ldots, X_n^*), \ldots, T^{(B)}(X_1^*, \ldots, X_n^*) \). Let us illustrate the concept of the bootstrap with the famous aspirin example; see Chapter 8 of Tattar, et. al. (2013).
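Before turning to that example, here is a bare-bones illustration of the notation (a sketch, not from the original text), using the sample mean as the estimator T and simulated data:
set.seed(101)
x <- rnorm(50, mean = 10, sd = 2)        # the observed sample X_1, ..., X_n
B <- 1000
T_boot <- replicate(B, mean(sample(x, replace = TRUE)))   # T^(1), ..., T^(B)
quantile(T_boot, c(0.025, 0.975))        # a bootstrap 95% interval for the mean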
A surprising double-blind experiment, reported in the New York Times, indicated that an aspirin consumed on alternate days significantly reduces the number of heart attacks among men. In the experiment, 104 out of 11037 healthy middle-aged men consuming small doses of aspirin suffered a fatal/non-fatal heart attack, whereas 189 out of 11034 individuals on the placebo had an attack. Therefore, the odds ratio of the aspirin-to-placebo heart attack possibility is (104 / 11037) / (189 / 11034) = 0.55. This indicates that only about 55 percent of the number of heart attacks observed for the group taking the placebo is likely to be observed for men consuming small doses of aspirin. That is, the chances of having a heart attack when taking aspirin are almost halved. The experiment being scientific, the results look very promising. However, we would like to obtain a confidence interval for the odds ratio of the heart attack. If we don't know the sampling distribution of the odds ratio, we can use the bootstrap technique to obtain the same. There is another aspect of the aspirin study. It has been observed that the aspirin group had 119 individuals who had strokes. The number of strokes for the placebo group is 98. Therefore, the odds ratio of a stroke is (119 / 11037) / (98 / 11034) = 1.21. This is shocking! It says that though aspirin reduces the possibility of a heart attack, about 21 percent more people are likely to have a stroke when compared to the placebo group. Now, let's use the bootstrap technique to obtain the confidence intervals for the heart attacks as well as the strokes.
Time for action – understanding the bootstrap technique
The boot package, which comes shipped with R, will be used for bootstrapping the odds ratio.
1. Get the boot package using library(boot).
The boot package is shipped with the R software itself, and thus it does not require separate installation. The main components of the boot function will be explained soon.
2. Define the odds-ratio function:
OR <- function(data,i) {
  x <- data[,1]; y <- data[,2]
  odds.ratio <- (sum(x[i]==1,na.rm=TRUE)/length(na.omit(x[i]))) /
    (sum(y[i]==1,na.rm=TRUE)/length(na.omit(y[i])))
  return(odds.ratio)
}
The name OR stands, of course, for odds ratio. The data for this function consists of two columns, one of which may have more observations than the other. The option na.rm is used to ignore the NA data values, whereas the na.omit function will remove them. It is easy to see that the odds.ratio object indeed computes the odds ratio. Note that we have specified i as an input to the function OR, since this function will be used within boot; i is the vector of indices that identifies the ith bootstrap sample, so the odds ratio is calculated on that resample. Note that x[i] therefore does not refer to the ith element of x.
3. Get the data for both the aspirin and placebo groups (the heart attack and stroke data), with the following code:
aspirin_hattack <- c(rep(1,104),rep(0,11037-104))
placebo_hattack <- c(rep(1,189),rep(0,11034-189))
aspirin_strokes <- c(rep(1,119),rep(0,11037-119))
placebo_strokes <- c(rep(1,98),rep(0,11034-98))
4. Combine the data groups and run 1000 bootstrap replicates, calculating the odds ratio for each of the bootstrap samples. Use the following boot function:
hattack <- cbind(aspirin_hattack,c(placebo_hattack,NA,NA,NA))
hattack_boot <- boot(data=hattack,statistic=OR,R=1000)
strokes <- cbind(aspirin_strokes,c(placebo_strokes,NA,NA,NA))
strokes_boot <- boot(data=strokes,statistic=OR,R=1000)
We are using three opons of the boot funcon, namely data, statistic,
and R. The rst opon accepts the data frame of interest; the second one accepts
the stasc, either an exisng R funcon or a funcon dened by the user; and
nally, the third opon accepts the number of bootstrap replicaons. The boot
funcon creates an object of the boot class, and in this case, we are obtaining the
odds-rao for various bootstrap samples.
5. Using the bootstrap samples and the odds-rao for the bootstrap samples, obtain
a 95 percent condence interval by using the quantile funcon:
quantile(hattack_boot$t,c(0.025,0.975))
quantile(strokes_boot$t,c(0.025,0.975))
The 95 percent condence interval for the odds-rao of the heart aack rate is
given as (0.4763, 0.6269), while that for the strokes is (1.126, 1.333).
Since the point esmates lie in the 95 percent condence intervals, we accept that
the odds-rao of a heart aack for the aspirin tablet indeed reduces by 55 percent
in comparison with the placebo group.
What just happened?
We used the boot function from the boot package and obtained bootstrap samples of the odds ratio.
Now that we have an understanding of the bootstrap technique, let us check out how the bagging algorithm works.
The bagging algorithm
Breiman (1996) proposed the extension of CART in the following manner. Suppose that the values of the n random observations for the classification problem are \( (y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n) \). As with our setup, the dependent variables \( y_i \) are binary. As with the bootstrap technique explained earlier, we obtain a bootstrap sample of size n from the data with replacement and build a tree. If we prune the tree, it is very likely that we may end up with the same tree on most occasions. Hence, pruning is not advisable here. Now, using the tree based on the (first) bootstrap sample, a prediction is made for the class of the i-th observation and the predicted value is noted. This process is repeated a large number of times, say B. A general practice is to take B = 100. Therefore, we have B predictions for every observation. The decision process is to classify the observation to the category that has the majority of the class predictions. That is, if more than 50 times out of B = 100 it has been predicted to belong to a particular class, we say that the observation is predicted to belong to that class. Let us formally state the bagging algorithm.
1. Draw a sample of size n with replacement from the data \( (y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n) \), and denote the first bootstrap sample by \( (y_1^{(1)}, x_1^{(1)}), (y_2^{(1)}, x_2^{(1)}), \ldots, (y_n^{(1)}, x_n^{(1)}) \).
2. Create a classification tree with \( (y_1^{(1)}, x_1^{(1)}), (y_2^{(1)}, x_2^{(1)}), \ldots, (y_n^{(1)}, x_n^{(1)}) \). Do not prune the classification tree. Such a tree may be called a bootstrapped tree.
3. For each terminal node, assign a class; put each observation down the tree and find its predicted class.
4. Repeat steps 1 to 3 a large number of times, say B.
5. Find the number of times each observation is classified to a particular class out of the B bootstrapped trees. The bagging procedure classifies an observation as belonging to the class that has the majority count.
6. Compute the confusion table from the predictions made in step 5.
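The ipred package used shortly wraps all of this up, but the algorithm is simple enough to sketch by hand. The following code is a minimal, hypothetical illustration of steps 1 to 6 using rpart on the kyphosis data seen earlier; it is not the implementation used by ipred:
library(rpart)
B <- 100
n <- nrow(kyphosis)
votes <- matrix(NA_character_, nrow = n, ncol = B)       # one class prediction per tree
set.seed(123)
for (b in 1:B) {
  idx  <- sample(1:n, n, replace = TRUE)                 # step 1: bootstrap sample
  tree <- rpart(Kyphosis ~ Age + Number + Start,         # step 2: an unpruned tree
                data = kyphosis[idx, ], control = rpart.control(cp = 0))
  votes[, b] <- as.character(predict(tree, newdata = kyphosis, type = "class"))
}
# Steps 5 and 6: majority vote over the B bootstrapped trees, then the confusion table
bagged_class <- apply(votes, 1, function(v) names(which.max(table(v))))
table(bagged_class, kyphosis$Kyphosis)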
The advantage of mulple trees is that the problem of overng, which happens in the case
of a single tree, is overcome to a large extent, as we expect that resampling will ensure that
the general features are captured and the impact of local features is minimized. Therefore,
if an observaon is classied to belong to a parcular class because of a local issue, it will
not get repeated over in the other bootstrapped trees. Therefore, with predicons based
on a large number of trees, it is expected that the nal predicon of an observaon really
depends upon its general features and not on a parcular local feature.
There are some measures that are important to be considered with the bagging algorithm.
A good classier, a single tree, or a bunch of them should be able to predict the class of
an observaon with more convicon. For example, we use a probability threshold of 0.5
and above as a predicon for success when using a logisc regression model. If the model
can predict most observaons in the neighborhood of either 0 or 1, we will have more
condence in the predicons. As a consequence, we will be a bit hesitant to classify an
observaon as either a success or failure if the predicted probability is in the vicinity of 0.5.
This precarious situaon applies to the bagging algorithm too.
Suppose we choose B = 100 for the number of bagging trees. Assume that an observaon
belongs to a class, Yes, and let the overall classes for the study be {"Yes", "No"}. If a
large number of trees predict an observaon to belong to the Yes class, we are condent
about the predicon. On the other hand, if approximately B/2 number of trees classify the
observaon to the Yes class, the decision gets swapped, as a few more trees had predicted
the observaon to belong to the No class. Therefore, we introduce a measure called margin
as the dierence between the proporon of mes an observaon is correctly classied and
the proporon of mes it is incorrectly classied. If the bagging algorithm is a good model,
we expect the average margin over all the observaons to be a large number away from
0. If bagging is not appropriate, we expect the average margin to be near the number 0.
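In R terms, for a two-class problem, the margin is a one-liner (a minimal sketch, assuming a hypothetical matrix vote_prop of vote proportions, one row per observation, with the proportion of votes for the correct class in the first column):
margin_per_obs <- vote_prop[, 1] - vote_prop[, 2]   # proportion correct minus proportion incorrect
mean(margin_per_obs)                                # average margin over all observations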
Let us prepare ourselves for action. The bagging algorithm is available in the ipred and randomForest packages.
Time for action – the bagging algorithm
The bagging function from the ipred package will be used for bagging a CART. The options coob=FALSE and nbagg=200 are used to specify the appropriate settings.
1. Get the ipred package by using library(ipred).
2. Load the German credit data by using data(GC).
3. For B=200, fit the bagging procedure for the GC data:
GC_bagging <- bagging(good_bad~.,data=GC,coob=FALSE,nbagg=200,keepX=T)
We know that we have t B =200 number of trees. Would you like to see them?
Fine, here we go.
4. The B =200 trees are stored in the mtrees list of classbagg GC_bagging. That
is, GC_bagging$mtrees[[i]] gives us the i-th bootstrapped tree, and plot(GC_
bagging$mtrees[[i]]$btree) displays that tree. Adding text(GC_bagging$m
trees[[i]]$btree,pretty=1, use.n=T) is also important. Next, put the enre
thing in a loop, execute it, and simply sit back and enjoy the display of the B number
of trees:
for(i in 1:200) {
plot(GC_bagging$mtrees[[i]]$btree);
text(GC_bagging$mtrees[[i]]$btree,pretty=1,use.n=T)
}
We hope that you understand that we can't publish all 200 trees! The next goal
is to obtain the margin of the bagging algorithm.
5. Predict the class probabilities of all the observations with the predict.classbagg function by using GCB_Margin <- round(predict(GC_bagging,type="prob")*200,0).
Let us understand the preceding code. The predict function returns the probabilities of an observation belonging to the good and bad classes. We have used 200 trees, and hence multiplying these probabilities by 200 gives us the expected number of times an observation is predicted to belong to each class. The round function with the 0 argument rounds the predictions to integers.
6. Check the rst six predicted classes with head(GCB_Margin):
bad good
[1,] 17 183
[2,] 165 35
[3,] 11 189
[4,] 123 77
[5,] 101 99
[6,] 95 105
7. To obtain the overall margin of the bagging technique, use the R code mean(pmax(GCB_Margin[,1],GCB_Margin[,2]) - pmin(GCB_Margin[,1],GCB_Margin[,2]))/200.
The overall margin for the author's execution turns out to be 0.5279. You may, though, get a different answer. Why?
Thus far, the bagging technique made predictions for the very observations from which it built the model. In the earlier chapters, we championed the need for a validate group and cross-validation techniques. That is, we did not always rely on the model measures computed solely from the data on which the model was built. There is always the possibility of failure as a result of unforeseen examples. Can the bagging technique be built to take care of unforeseen observations? The answer is a definite yes, and this is well known as out-of-bag validation. In fact, such an option was suppressed when building the bagging model in step 3 here, with the option coob=FALSE. coob stands for an out-of-bag estimate of the error rate. So, now rebuild the bagging model with the coob=TRUE option.
8. Build an out-of-bag bagging model with GC_bagging_oob <- bagging(good_bad~.,data=GC,coob=TRUE,nbagg=200,keepX=T). Find the error rate with GC_bagging_oob$err.
> GC_bagging_oob <- bagging(good_bad~.,data=GC,coob=TRUE,nbagg=200,keepX=T)
> GC_bagging_oob$err
[1] 0.241
What just happened?
We have seen an important extension of the CART model in the bagging algorithm. To an extent, this enhancement is vital and vastly different from the improvements of the earlier models. The bagging algorithm is different in the sense that we rely on the predictions of more than a single model. This ensures that the overfitting problem, which occurs due to local features, is almost eliminated.
It is important to note that the bagging technique is not without limitations; refer to Section 4.5 of Berk (2008). We now move to the final model of the book, which is an important technique of the CART school.
Random forests
In the previous section, we built multiple models for the same classification problem. The bootstrapped trees were generated by using resamples of the observations. Breiman (2001) suggested an important variation (actually, there is more to it than just a variation) where a CART is built with the covariates (features) also being resampled for each of the bootstrap samples of the dataset. Since the final tree for each bootstrap sample uses different covariates, the ensemble of the collective trees is called a random forest. A formal algorithm is given next.
1. As with the bagging algorithm, draw a sample of size \( n_1 \), \( n_1 < n \), with replacement from the data \( (y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n) \), and denote the first resampled dataset by \( (y_1^{(1)}, x_1^{(1)}), (y_2^{(1)}, x_2^{(1)}), \ldots, (y_{n_1}^{(1)}, x_{n_1}^{(1)}) \). The remaining \( n - n_1 \) observations form the out-of-bag dataset.
2. From the covariate vector x, select a random subset of covariates without replacement. Note that the same covariates are selected for all the observations.
3. Create the CART tree from the data in steps 1 and 2, and, as earlier, do not prune the tree.
4. For each terminal node, assign a class. Put each out-of-bag observation down the tree and find its predicted class.
5. Repeat steps 1 to 4 a large number of times, say 200 or 500.
6. For each observation, count the number of times it is predicted to belong to each class, counting only the trees for which it is part of the out-of-bag dataset.
7. The class with the majority count is taken as the predicted class of the observation.
This is quite a complex algorithm. Luckily, the randomForest package helps us out. We will continue with the German credit data problem.
Time for action – random forests for the German credit data
The function randomForest from the package of the same name will be used to build a random forest for the German credit data problem.
1. Get the randomForest package by using library(randomForest).
2. Load the German credit data by using data(GC).
3. Create a random forest with 500 trees:
GC_RF <- randomForest(good_bad~.,data=GC,keep.forest=TRUE,ntree=500)
It is very dicult to visualize a single tree of the random forest. A very
ad-hoc approach has been found at http://stats.stackexchange.com/
questions/2344/best-way-to-present-a-random-forest-in-a-
publication. Now we reproduce the necessary funcon to get the trees, and as
the soluon step is not exactly perfect, you may skip this part; steps 4 and 5.
4. Dene the to.dendrogram funcon:
5. Use the getTree funcon, and with the to.dendrogram funcon dened
previously, visualize the rst 20 trees of the forest:
for(i in 1:20) {
tree <- getTree(GC_RF,i,labelVar=T)
d <- to.dendrogram(tree)
plot(d,center=TRUE,leaflab='none',edgePar=list(t.cex=1,p.col=NA,p.
lty=0))
}
The error rate is of primary concern. As we increase the number of trees in the forest,
we expect a decrease in the error rate. Let us invesgate this for the GC_RF object.
6. Plot the out-of-bag error rate against the number of trees with plot(1:500,GC_RF$err.rate[,1],"l",xlab="No.of.Trees",ylab="OOB Error Rate").
Figure 2: Performance of a random forest
The covariates (features) are selected differently for different trees. It is then a concern to know which variables are significant. The important variables are obtained using the varImpPlot function.
7. The function varImpPlot produces a display of the importance of the variables; use varImpPlot(GC_RF).
Figure 3: Important variables of the German credit data problem
Thus, we can see which variables have more relevance than others.
What just happened?
Random forests are a very important extension of the CART concept. In this technique, we need to know how the error rate behaves as the number of trees increases; it is expected to decrease with an increase in the number of trees. varImpPlot also gives a very useful display of the importance of the covariates for classifying the customers as good or bad.
In conclusion, we will undertake a classification dataset and revise all the techniques seen in the book, especially in Chapters 6 to 10. We will now consider the problem of low birth weight among infants.
The consolidation
The goal of this section is to quickly review all of the techniques learnt in the latter half of the book. Towards this, a dataset has been selected where we have ten variables, including the output. Low birth weight is a serious concern, and it needs to be understood as a function of many other variables. If the weight of a child at birth is less than 2500 grams, it is considered a low birth weight. This problem has been studied in Chapter 19 of Tattar, et. al. (2013). The following table gives a description of the variables. Since the dataset may be studied as a regression problem (variable BWT) as well as a classification problem (LOW), you can choose any path(s) that you deem fit. Let the final action begin.

Serial number   Description                                               Abbreviation
1               Identification code                                       ID
2               Low birth weight                                          LOW
3               Age of mother                                             AGE
4               Weight of mother at last menstrual period                 LWT
5               Race                                                      RACE
6               Smoking status during pregnancy                           SMOKE
7               History of premature labor                                PTL
8               History of hypertension                                   HT
9               Presence of uterine irritability                          UI
10              Number of physician visits during the first trimester     FTV
11              Birth weight                                              BWT
Time for action – random forests for the low birth weight data
The techniques learnt from Chapter 6 to Chapter 10 will now be put to the test.
That is, we will use the linear regression model, logistic regression, as well as CART.
1. Read the dataset into R with data(lowbwt).
2. Visualize the dataset with the options diag.panel, lower.panel, and upper.panel:
pairs(lowbwt,diag.panel=panel.hist,lower.panel=panel.smooth,upper.panel=panel.cor)
Interpret the matrix of scatter plots. Which statistical model seems most appropriate to you?
Figure 4: Multivariable display of the "lowbwt" dataset
As the correlations look weak, it seems that a regression model may not be appropriate. Let us check.
3. Create (sub) datasets for the regression and classification problems:
LOW <- lowbwt[,-10]
BWT <- lowbwt[,-1]
4. First, we will check if a linear regression model is appropriate:
BWT_lm <- lm(BWT~., data=BWT)
summary(BWT_lm)
Interpret the output of the linear regression model; refer to Linear Regression Analysis, Chapter 6, if necessary.
Figure 5: Linear model for the low birth weight data
The low R2 makes it difficult for us to use the model. Let us check out the logistic regression model.
5. Fit the logistic regression model as follows:
LOW_glm <- glm(LOW~., data=LOW, family='binomial')
summary(LOW_glm)
The summary of the model is given in the following screenshot:
Figure 6: Logistic regression model for the low birth weight data
6. The Hosmer-Lemeshow goodness-of-fit test for the logistic regression model is given by hosmerlem(LOW_glm$y,fitted(LOW_glm)).
Now, the p-value obtained is 0.7813, which shows that there is no significant difference between the fitted values and the observed values. Therefore, we conclude that the logistic regression model is a good fit. However, we will go ahead and fit CART models for this problem as well. Note that the estimated regression coefficients are not huge values, and hence we do not need to check out the ridge regression approach.
7. Fit a classicaon tree with the rpart funcon:
LOW_rpart <- rpart(LOW~.,data=LOW)
plot(LOW_rpart)
text(LOW_rpart,pretty=1)
Does the classicaon tree appear more appropriate than the logisc regression
ed earlier?
Figure 7: Classification tree for the low birth weight data
8. Get the rules of the classicaon tree using asRules(LOW_rpart).
Figure 8: Rules for the low birth weight problem
You can see that these rules are of great importance to the physician who performs the operations. Let us check the bagging effect on the classification tree.
9. Using the bagging function, find the error rate of the bagging technique with the following code:
LOW_bagging <- bagging(LOW~., data=LOW,coob=TRUE,nbagg=50,keepX=T)
LOW_bagging$err
The error rate is 0.3228, which seems very high. Let us see if random forests help us out.
10. Using the randomForest function, find the error rate for the out-of-bag problem:
LOW_RF <- randomForest(LOW~.,data=LOW,keep.forest=TRUE, ntree=50)
LOW_RF$err.rate
The error rate is still around 0.34. The initial idea was that, with the number of observations being less than 200, we would work with only 50 trees. Repeat the task with 150 trees and check if the error rate decreases.
11. Increase the number of trees to 150 and obtain the error-rate plot:
LOW_RF <- randomForest(LOW~.,data=LOW,keep.forest=TRUE, ntree=150)
plot(1:150,LOW_RF$err.rate[,1],"l",xlab="No.of.Trees",ylab="OOB Error Rate")
The error rate of about 0.32 seems to be the best solution we can obtain for this problem.
Figure 9: The error rate for the low birth weight problem
What just happened?
We had a very quick look back at all the techniques used over the last five chapters of the book.
Summary
The chapter began with two important variations of the CART technique: the bagging technique and random forests. Random forests are a particularly modern technique, invented in 2001 by Breiman. The goal of the chapter was to familiarize you with these modern techniques. Together with the German credit data and the complete revision of the earlier techniques with the low birth weight problem, it is hoped that you have benefited a lot from the book and will have gained enough confidence to apply these tools to your own analytical problems.
References
The book has been influenced by many of the classical texts on the subject, from Tukey (1977) to Breiman, et. al. (1984). The modern texts of Hastie, et. al. (2009) and Berk (2008) have particularly influenced the later chapters of the book. We have alphabetically listed only the books/monographs which have been cited in the text; however, the reader may go beyond the current list too.
Agresti, A. (2002), Categorical Data Analysis, Second Edition, J. Wiley
Baron, M. (2007), Probability and Statistics for Computer Scientists, Chapman and Hall/CRC
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, J. Wiley
Berk, R. A. (2008), Statistical Learning from a Regression Perspective, Springer
Breiman, L. (1996), Bagging predictors. Machine Learning, 24(2), 123-140
Breiman, L. (2001), Random forests. Machine Learning, 45(1), 5-32
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Wadsworth
Chen, Ch., Härdle, W., and Unwin, A. (2008), Handbook of Data Visualization, Springer
Cleveland, W. S. (1985), The Elements of Graphing Data, Monterey, CA: Wadsworth
Cule, E. and De Iorio, M. (2012), A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686v1 [stat.AP]
Freund, R. F., and Wilson, W. J. (2003), Statistical Methods, Second Edition, Academic Press
Friendly, M. (2001), Visualizing Categorical Data, SAS
Friendly, M. (2008), A brief history of data visualization. In Handbook of Data Visualization (pp. 15-56), Springer
Gunst, R. F. (2002), Finding confidence in statistical significance. Quality Progress, 35(10), 107-108
Gupta, A. (1997), Establishing Optimum Process Levels of Suspending Agents for a Suspension Product. Quality Engineering, 10, 347-350
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning, Second Edition, Springer
Horgan, J. M. (2008), Probability with R – An Introduction with Computer Science Applications, J. Wiley
Johnson, V. E., and Albert, J. H. (1999), Ordinal Data Modeling, Springer
Kutner, M. H., Nachtsheim, C., and Neter, J. (2004), Applied Linear Regression Models, McGraw Hill
Montgomery, D. C. (2007), Introduction to Statistical Quality Control, J. Wiley
Montgomery, D. C., Peck, E. A., and Vining, G. G. (2012), Introduction to Linear Regression Analysis, Wiley
Montgomery, D. C., and Runger, G. C. (2003), Applied Statistics and Probability for Engineers, J. Wiley
Pawitan, Y. (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, OUP Oxford
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann
Ross, S. M. (2010), Introductory Statistics, 3e, Academic Press
Rousseeuw, P. J., Ruts, I., and Tukey, J. W. (1999), The bagplot: a bivariate boxplot, The American Statistician, 53(4), 382-387
Ryan, T. P. (2007), Modern Engineering Statistics, J. Wiley
Sarkar, D. (2008), Lattice, Springer
Tattar, P. N., Suresh, R., and Manjunath, B. G. (2013), A Course in Statistics with R, Narosa
Tufte, E. R. (2001), The Visual Display of Quantitative Information, Graphics Press
Tukey, J. W. (1977), Exploratory Data Analysis, Addison-Wesley
Velleman, P. F., and Hoaglin, D. C. (1981), Applications, Basics, and Computing of Exploratory Data Analysis; available at http://dspace.library.cornell.edu/
Wickham, H. (2009), ggplot2, Springer
Wilkinson, L. (2005), The Grammar of Graphics, Second Edition, Springer
Index
Symbols
3RSS 123
3RSSH 123
4253H smoother 123
%% operator 35
A
actual probabilies 13
Age in Years variable 9
Akaike Informaon Criteria (AIC) 194
alternave hypothesis 144
amount of leverage by the observaon 187
ANOVA technique
about 170
obtaining 170
anscombe dataset 169
aplpack package 117
Automove Research Associaon of India (ARAI)
10
B
backward eliminaon approach 192
backwardlm funcon 195
bagging 297
bagging algorithm 300-302
bagging technique 93
bagplot
about 116
characteriscs 116
displaying, for mulvariate dataset 117, 118
for gasoline mileage problem 117
barchart funcon 73
bar charts
built-in examples 66, 67
visualizing 68-73
barplot funcon 73
basic arithmec, vectors
performing 36
unequal length vectors, adding 36
basis funcons, regression spline 234
best split point 265
binary regression problem 202
binomial distribuon
about 20, 21
examples 21-23
binomial test
performing 144
proporons, tesng 147, 148
success probability, tesng 145, 146
bivariate boxplot 116
boosng algorithms 293
bootstrap 298, 299
bootstrap aggregaon 297
bootstrapped tree 300
box-and-whisker plot 84
boxplot
about 84
examples 84, 85
implemenng 85-87
boxplot funcon 86, 108
B-spline regression model
ng 241, 242
purpose 241
[ 318 ]
built-in examples, bar chart
Bug Metrics dataset 67
Bug Metrics of five software 67
bwplot function 86
C
CART
about 293
cross-validation predictions, obtaining 296, 297
improving 294, 295
CART_Dummy dataset
visualizing 259
categorical variable 9, 18, 260
central limit theorem 139
classification tree 257
construction 276-283
pruning 289-291
classification tree, German credit data
construction 284-288
coefficient of determination 169
colSums function 77
Comprehensive R Archive Network. See CRAN
computer science
experiments with uncertainty 14
confidence interval
about 139
for binomial proportion 139
for normal mean with known variance 140
for normal mean with unknown variance 140
obtaining 141, 142, 170, 171
confusion matrix 220
continuous distributions
about 26
exponential distribution 28
normal distribution 29
uniform distribution 27
continuous variable 9
Cook's distance 188
covariate 229
CRAN
about 15
URL 15
criteria-based model selection
about 194
AIC criteria, using 194-197
backward, using 194-197
forward, using 194-197
critical alpha 192
critical region 144
CSV file
reading, from 54
cumulative density function 27
Customer ID variable 8
CVbinary function 294
CVlm function 294
D
data
importing, from external files 55, 57
importing, from MySQL 58, 59
splitting 260
databases (DB) 58
data characteristics 12, 13
data formats
CSV (comma separated variable) format 52
ODS (OpenOffice/LibreOffice Calc) 52
XLS or XLSX (Microsoft Excel) format 52
data.frame object
about 45
creating 45, 46
data/graphs
exporting 60
graphs, exporting 60, 61
R objects, exporting 60
data re-expression
about 114
for Power of 62 Hydroelectric Stations 114-116
data visualization
about 65
visualization techniques, for categorical data 66
visualization techniques, for continuous variable data 84
DAT file
reading, from 54
depth 113
deviance residual 213
DFBETAS 189
DFFITS 189
diff function 108
discrete distributions
about 18
binomial distribution 20
discrete uniform distribution 19
hypergeometric distribution 24
negative binomial distribution 24
Poisson distribution 26
discrete uniform distribution 19
dot chart
about 74
example 74
visualizing 74-76
dotchart function 74
E
EDA 103
exponential distribution 28, 29
F
false positive rate (fpr) 220
fence 117
files
reading, foreign package used 55
first tree
building 261-263
fitdistr function
used, for finding MLE 136
fivenum function 108
forwardlm function 194
forward selection approach 193
fourfold plot
about 82
examples 83
Full Name variable 9
G
Gender variable 9
generalized cross-validation (GCV) errors 255
geometric RV 25
German credit data
classification tree 284-288
German credit screening dataset
logistic regression 223-226
getNode function 267
getTree function 305
ggplot 101
ggplot2
about 99, 100
ggplot 101
qplot function 100
GLM
influential and leverage points, identifying 216
residual plots 213
graphs
exporting 60, 61
H
han function 124
hanning 123
hinges 104, 105
hist function 90
histogram
about 88
construction 88, 89
creating 90-92
effectiveness 90
examples 89
histogram function 90
Hosmer-Lemeshow goodness-of-fit test statistic 210, 212
hypergeometric distribution 24
hypergeometric RV 24
hypotheses testing
about 144
binomial test 144
one-sample hypotheses, testing 152-155
one-sample problems, for normal distribution 150, 151
two-sample hypotheses, testing 159
two-sample problems, for normal distribution 156, 157
hypothesis
about 144
alternative hypothesis 144
null hypothesis 144
statistic, testing 144
I
impurity measures 276
independent and identically distributed (iid) sample 130
inuence and leverage points, GLM
idenfying 216
inuenal point 188
interquarle range (IQR) 84, 105
iterave reweighted least-squares (IRLS)
algorithm 207
L
leading digits 109
leer values 113
leverage point 187
likelihood funcon
about 131
visualizing 131-134
likelihood funcon, of binomial distribuon 131
likelihood funcon, of normal distribuon 132
likelihood funcon, of Poisson distribuon 132
linear regression model
about 162
limitaons 202, 203
linearRidge funcon 244
list object
about 44
creang 44, 45
lm.ridge t model 255
lm.ridge funcon 245, 255
logisc regression, German credit dataset 223-
226
logisc regression model
about 201-207
diagnoscs 216
ng 207- 210
inuence and leverage points, idenfying
216-218
model validaon 213
residuals plots, for GLM 213
ROC 220
logiscRidge funcon 248
loop 117
M
margin 301
matrix computaons
about 41
performing 41-43
maximum likelihood esmator. See MLE
mean 18, 122
mean residual sum of squares 183
median 104, 122
median polish 125
median polish algorithm
about 125, 126
example 126, 127
medpolish funcon 126
MLE
about 129-131
nding 135
nding, tdistr funcon used 137
nding, mle funcon used 137
likelihood funcon, visualizing 131
MLE, binomial distribuon
nding 135, 136
MLE, Poisson distribuon
nding 136
model assessment
about 250
penalty factor, nding 250
penalty parameter, selecng iteravely
250-255
tesng dataset 250
training dataset 250
validaon dataset 250
model selecon
about 192
criterion-based procedures 194
stepwise procedures 192
model validaon, simple linear regression model
about 171
residual plots, obtaining 172, 174
mosaic plot
for Titanic dataset 80, 81
mosaicplot funcon 80
mulcollinearity problem
about 189, 190
addressing 191, 192
mulple linear regression model
about 176, 177
ANOVA, obtaining 182
building 179, 180
condence intervals, obtaining 182
k simple linear regression models, averaging
177-179
useful residual plots 183
multivariate dataset
bagplot, displaying for 117, 118
MySQL
data, importing from 58, 59
N
natural cubic spline regression model
about 238
fitting 239-241
negative binomial distribution
about 24, 25
examples 25
negative binomial RV 25
nominal variable 18
normal distribution
about 29
examples 30
null hypothesis 144
NumericConstants 34
O
Octane Rating
of Gasoline Blends 109
odds ratio 207
one-sample hypotheses
testing 152-155
operator curves
receiving 220
ordinal variable 9, 19, 260
out-of-bag validation 303
overfitting 230-233
P
pairs function 118
panel.bagplot function 118
Pareto chart
about 97
examples 98, 99
partial residual 213
pdf 26
Pearson residual 213
percentiles 104
piecewise cubic 238
piecewise linear regression model
about 235
fitting 235-237
pie charts
about 81
drawbacks 82
examples 81
plot.lm function 176
Poisson distribution
about 26
examples 26
polynomial regression model
building 231
fitting 229
pooled variance estimator 157
PRESS residuals 184
principal component (PC) 245
probability density function 26
probability mass function (pmf) 18
probit model 201
probit regression model
about 204
constants 204-206
pruning 288
Q
qplot function 100
quantiles 104
questionnaire
about 8
components 8
Questionnaire ID variable 8
R
R
constants 34
continuous distributions 26
data characteristics 12, 13
data.frame 33
data visualization 65
discrete distributions 18
downloading, for Linux 16
downloading, for Windows 16
session management 62, 64
simple linear regression model 162
vectors 34, 35
randomForest funcon 304
Random Forests
about 293, 303
for German credit data 304, 306
for low birth weight data 308-313
random sample 130
random variable 13
range funcon 108
R databases 33
read.csv 54
read.delim 54
read.table funcon 53
read.xls funcon 54
receiving operator curves. See ROC
Recursive Paroning
about 258
data, spling 260
display plot, paroning 259
regression 162
regression diagnoscs
about 186
DFBETAS 189
DFFITS 189
inuenal point 187, 188
leverage point 187
regression spline
about 234
basis funcons 234
natural cubic splines 238
piecewise linear regression model 235
regression tree
about 257
construcon 265-274
representave probabilies 13
reroughing 123
resid funcon 174
residual plots, GLM
deviance residual 213
obtaining, ed funcon used 214, 215
obtaining, residuals funcon used 214, 215
paral residual 213
pearson residual 213
response residual 213
working residual 213
residual plots, mulple linear regression model
about 183
PRESS residuals 184
R-student residuals 184, 186
semi-studenzed residuals 183, 184
standardized residuals 183
residuals funcon 216
resistant line
about 118, 120
as regression model 120, 121
for IO-CPU me 120
response residual 213
ridge regression, for linear regression model
243-247
ridge regression, for logisc regression models
248, 249
R installaon
about 15, 16
R packages, using 16, 17
RSADBE 17
rline funcon 120
R names funcon 34
R objects
about 33
exporng 60
leers 34
LETTERS 34
month.abb 34
month.name 34
pi 34
ROC
about 220
construcon 221, 222
rootstock dataset 55
rough 123
R output 33
rowSums funcon 78
R packages
using 16
rpart class 294
rpart funcon 257
RSADBE package 17, 105
R-student residuals 184
RV 13
S
scaer plot
about 93
creang 94-96
examples 93
semi-studentized residuals 183
Separated Clear Volume 54
session management
about 62
performing 62, 64
short message service (SMS) 8
significance level 139
simple linear regression model
about 163
ANOVA technique 169
building 167, 168
confidence intervals, obtaining 170
core assumptions 163
limitation 230
overfitting problem 230
residuals for arbitrary choice of parameters, displaying 164-166
validation 171
smooth function 124
smoothing data technique
about 122
for cow temperature 124, 125
spine/mosaic plot
about 76
advantages 76
examples 77
spine plot
for shift and operator data 77, 79
spineplot function 77
spline 234
standardized residuals 183
Statistical Process Control (SPC) 109
stem-and-leaf plot 109
stem function
about 110
working 110-112
stems 109
step function 194
stepwise procedures
about 192
backward elimination 192
forward selection 193
stepwise regression 193
summary function 169
summary statistics
about 104
for The Wall dataset 105-108
hinges 104, 105
interquartile range (IQR) 105
median 104
percentiles 104
quantiles 104
T
table object
about 49, 50
Titanic dataset, creating 51, 52
testing dataset 250
text variable 9
Titanic data
exporting 60
to.dendrogram function 305
towards-the-center 162
trailing digits 109
training dataset 250
true positive rate (tpr) 220
two-sample hypotheses
testing 159
U
UCBAdmissions 52
UCBAdmissions dataset 260
uniform distribution
about 27
examples 28
V
validation dataset 250
variance 18
variance inflation factor (VIF) 191
vector
about 34
examples 35
generating 35
vector objects
basic arithmetic 36
creating 35
visualization techniques, for categorical data
about 66
bar chart 66
dot chart 74
fourfold plot 81
mosaic plot 76
pie charts 81
spine plot 76
visualization techniques, for continuous variable data
about 84
boxplot 84
histogram 88
Pareto chart 97
scatter plot 93
W
working residual 213
write.table function 60
X
xpred.rpart function 296
Thank you for buying
R Statistical Application Development by Example
Beginner's Guide
About Packt Publishing
Packt, pronounced 'packed', published its first book "Mastering phpMyAdmin for Effective MySQL Management" in April 2004 and subsequently continued to specialize in publishing highly focused books on specific technologies and solutions.
Our books and publications share the experiences of your fellow IT professionals in adapting and customizing today's systems, applications, and frameworks. Our solution-based books give you the knowledge and power to customize the software and technologies you're using to get the job done. Packt books are more specific and less general than the IT books you have seen in the past. Our unique business model allows us to bring you more focused information, giving you more of what you need to know, and less of what you don't.
Packt is a modern, yet unique publishing company, which focuses on producing quality, cutting-edge books for communities of developers, administrators, and newbies alike. For more information, please visit our website: www.packtpub.com.
About Packt Open Source
In 2010, Packt launched two new brands, Packt Open Source and Packt Enterprise, in order to continue its focus on specialization. This book is part of the Packt Open Source brand, home to books published on software built around Open Source licences, and offering information to anybody from advanced developers to budding web designers. The Open Source brand also runs Packt's Open Source Royalty Scheme, by which Packt gives a royalty to each Open Source project about whose software a book is sold.
Writing for Packt
We welcome all inquiries from people who are interested in authoring. Book proposals should be sent to author@packtpub.com. If your book idea is still at an early stage and you would like to discuss it first before writing a formal book proposal, contact us; one of our commissioning editors will get in touch with you.
We're not just looking for published authors; if you have strong technical skills but no writing experience, our experienced editors can help you develop a writing career, or simply get some additional reward for your expertise.
NumPy 1.5 Beginner's Guide
ISBN: 978-1-84951-530-6 Paperback: 234 pages
An acon-packed guide for the easy-to-use, high
performance, Python based free open source NumPy
mathemacal library using real-world examples
1. The rst and only book that truly explores NumPy
praccally
2. Perform high performance calculaons with clean
and ecient NumPy code
3. Analyze large data sets with stascal funcons
4. Execute complex linear algebra and mathemacal
computaons
Matplotlib for Python Developers
ISBN: 978-1-84719-790-0 Paperback: 308 pages
Build remarkable publication-quality plots the easy way
1. Create high quality 2D plots by using Matplotlib productively
2. Incremental introduction to Matplotlib, from the ground up to advanced levels
3. Embed Matplotlib in GTK+, Qt, and wxWidgets applications as well as web sites to utilize them in Python applications
4. Deploy Matplotlib in web applications and expose it on the Web using popular web frameworks such as Pylons and Django
Sage Beginner's Guide
ISBN: 978-1-84951-446-0 Paperback: 364 pages
Unlock the full potential of Sage for simplifying and automating mathematical computing
1. The best way to learn Sage, which is an open source alternative to Magma, Maple, Mathematica, and Matlab
2. Learn to use symbolic and numerical computation to simplify your work and produce publication-quality graphics
3. Numerically solve systems of equations, find roots, and analyze data from experiments or simulations
R Graph Cookbook
ISBN: 978-1-84951-306-7 Paperback: 272 pages
Detailed hands-on recipes for creating the most useful types of graphs in R, starting from the simplest versions to more advanced applications
1. Learn to draw any type of graph or visual data representation in R
2. Filled with practical tips and techniques for creating any type of graph you need; not just theoretical explanations
3. All examples are accompanied with the corresponding graph images, so you know what the results look like
Please check www.PacktPub.com for information on our titles