R Statistical Application Development by Example Beginner's Guide
Learn R Statistical Application Development from scratch in a clear and pedagogical manner

Prabhanjan Narayanachar Tattar

BIRMINGHAM - MUMBAI

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013
Production Reference: 1170713
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-84951-944-1
www.packtpub.com
Cover Image by Asher Wishkerman (wishkerman@hotmail.com)

Credits
Author: Prabhanjan Narayanachar Tattar
Reviewers: Mark van der Loo, Mzabalazo Z. Ngwenya, A Ohri, Tengfei Yin
Acquisition Editor: Usha Iyer
Lead Technical Editor: Arun Nadar
Technical Editors: Madhuri Das, Mausam Kothari, Amit Ramadas, Varun Pius Rodrigues, Lubna Shaikh
Project Coordinator: Anurag Banerjee
Proofreaders: Maria Gould, Paul Hindle
Indexer: Hemangini Bari
Graphics: Ronak Dhruv
Production Coordinators: Melwyn D'sa, Zahid Shaikh
Cover Work: Melwyn D'sa, Zahid Shaikh

About the Author
Prabhanjan Narayanachar Tattar has seven years of experience with the R software and has also co-authored the book A Course in Statistics with R, published by Narosa Publishing House. The author has built two packages in R, titled gpk and ACSWR. He obtained a PhD (Statistics) from Bangalore University in the broad area of Survival Analysis and has published several articles in peer-reviewed journals. During the PhD program, the author received young-statistician honors, the IBS(IR) GK Shukla Young Biometrician Award (2005) and the Dr. U.S. Nair Award for Young Statistician (2007), and also held a Junior and Senior Research Fellowship of CSIR-UGC. Prabhanjan is working as a Business Analysis Advisor at Dell Inc., Bangalore, in the Customer Service Analytics unit of the larger Dell Global Analytics arm of Dell.

I would like to thank Prof. Athar Khan, Aligarh Muslim University, whose teaching during a shared R workshop inspired me to a very large extent. My friend Veeresh Naidu has gone out of his way in helping and inspiring me to complete this book, and I thank him for everything that defines our friendship. Many of my colleagues at the Customer Service Analytics unit of Dell Global Analytics, Dell Inc. have been very tolerant of my stat talk with them, and it is their need for the subject which has partly influenced the writing of the book. I would like to record my thanks to them and also my manager Debal Chakraborty.
My wife Chandrika has been very cooperative, and without her permission to work on the book during weekends and housework hours this book would never have been completed. Pranathi, at 2 years and 9 months, has started school, the pre-kindergarten, and it is genuinely believed that one day she will read this entire book. I am also grateful to the reviewers whose constructive suggestions and criticisms have helped the book reach a higher level than where it would have ended up without their help. Last but not least, I would like to take the opportunity to thank Usha Iyer and Anurag Banerjee for their inputs on the earlier drafts, and also their patience with my delays.

About the Reviewers
Mark van der Loo obtained his PhD at the Institute for Theoretical Chemistry at the University of Nijmegen (The Netherlands). Since 2007 he has worked at the statistical methodology department of the Dutch official statistics office (Statistics Netherlands). His research interests include automated data cleaning methods and statistical computing. At Statistics Netherlands he is responsible for the local R center of expertise, which supports and educates users on statistical computing with R. Mark has coauthored a number of R packages that are available via CRAN, namely editrules, deducorrect, rspa, extremevalues, and stringdist. Together with Edwin de Jonge he authored the book Learning RStudio for R Statistical Computing. A list of his publications can be found at www.markvanderloo.eu.

Mzabalazo Z. Ngwenya has worked extensively in the field of statistical consulting and currently works as a biometrician. He holds an MSc in Mathematical Statistics from the University of Cape Town and is at present studying towards a PhD (School of Information Technology, University of Pretoria) in the field of Computational Intelligence. His research interests include statistical computing, machine learning, spatial statistics, and simulation and stochastic processes. Previously he was involved in reviewing Learning RStudio for R Statistical Computing by Mark P.J. van der Loo and Edwin de Jonge, Packt Publishing.

A Ohri is the founder of the analytics startup Decisionstats.com. He has pursued graduate studies at the University of Tennessee, Knoxville and the Indian Institute of Management, Lucknow. In addition, he has a Mechanical Engineering degree from the Delhi College of Engineering. He has interviewed more than 100 practitioners in analytics, including leading members from all the analytics software vendors. He has written almost 1300 articles on his blog, besides guest writing for influential analytics communities. He teaches courses in R through online education and has worked as an analytics consultant in India for the past decade. He was one of the earliest independent analytics consultants in India, and his current research interests include spreading open source analytics, analyzing social media manipulation, simpler interfaces to cloud computing, and unorthodox cryptography. He is the author of R for Business Analytics (http://www.springer.com/statistics/book/9781461443421).

www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at service@packtpub.com for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

PacktLib – http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser

Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

The work is dedicated to my father Narayanachar, the very first engineer who influenced my outlook towards Science and Engineering. For the same reason, my mother Lakshmi made me realize the importance of life and philosophy.

Table of Contents
Preface
Chapter 1: Data Characteristics – Questionnaire and its components; Understanding the data characteristics in an R environment; Experiments with uncertainty in computer science; R installation; Using R packages; RSADBE – the book's R package; Discrete distribution (Discrete uniform distribution, Binomial distribution, Hypergeometric distribution, Negative binomial distribution, Poisson distribution); Continuous distribution (Uniform distribution, Exponential distribution, Normal distribution); Summary
Chapter 2: Import/Export Data – data.frame and other formats; Constants, vectors, and matrices; Time for action – understanding constants, vectors, and basic arithmetic; Time for action – matrix computations; The list object; Time for action – creating a list object; The data.frame object; Time for action – creating a data.frame object; The table object; Time for action – creating the Titanic dataset as a table object; read.csv, read.xls, and the foreign package; Time for action – importing data from external files; Importing data from MySQL; Exporting data/graphs; Exporting R objects; Exporting graphs; Time for action – exporting a graph; Managing an R session; Time for action – session management; Summary
Chapter 3: Data Visualization – Visualization techniques for categorical data; Bar charts; Going through the built-in examples of R; Time for action – bar charts in R; Dot charts; Time for action – dot charts in R; Spine and mosaic plots; Time for action – the spine plot for the shift and operator data; Time for action – the mosaic plot for the Titanic dataset; Pie charts and the fourfold plot; Visualization techniques for continuous variable data; Boxplot; Time for action – using the boxplot; Histograms; Time for action – understanding the effectiveness of histograms; Scatter plots; Time for action – plot and pairs R functions; Pareto charts; A brief peek at ggplot2; Time for action – qplot; Time for action – ggplot; Summary
Chapter 4: Exploratory Analysis – Essential summary statistics; Percentiles, quantiles, and median; Hinges; The interquartile range; Time for action – the essential summary statistics for "The Wall" dataset; The stem-and-leaf plot; Time for action – the stem function in play; Letter values; Data re-expression; Bagplot – a bivariate boxplot; Time for action – the bagplot display for a multivariate dataset; The resistant line; Time for action – the resistant line as a first regression model; Smoothing data; Time for action – smoothening the cow temperature data; Median polish; Time for action – the median polish algorithm; Summary
Chapter 5: Statistical Inference – Maximum likelihood estimator; Visualizing the likelihood function; Time for action – visualizing the likelihood function; Finding the maximum likelihood estimator; Using the fitdistr function; Time for action – finding the MLE using mle and fitdistr functions; Confidence intervals; Time for action – confidence intervals; Hypotheses testing; Binomial test; Time for action – testing the probability of success; Tests of proportions and the chi-square test; Time for action – testing proportions; Tests based on normal distribution – one-sample; Time for action – testing one-sample hypotheses; Tests based on normal distribution – two-sample; Time for action – testing two-sample hypotheses; Summary
Chapter 6: Linear Regression Analysis – The simple linear regression model; What happens to the arbitrary choice of parameters?; Time for action – the arbitrary choice of parameters; Building a simple linear regression model; Time for action – building a simple linear regression model; ANOVA and the confidence intervals; Time for action – ANOVA and the confidence intervals; Model validation; Time for action – residual plots for model validation; Multiple linear regression model; Averaging k simple linear regression models or a multiple linear regression model; Time for action – averaging k simple linear regression models; Building a multiple linear regression model; Time for action – building a multiple linear regression model; The ANOVA and confidence intervals for the multiple linear regression model; Time for action – the ANOVA and confidence intervals for the multiple linear regression model; Useful residual plots; Time for action – residual plots for the multiple linear regression model; Regression diagnostics; Leverage points; Influential points; DFFITS and DFBETAS; The multicollinearity problem; Time for action – addressing the multicollinearity problem for the Gasoline data; Model selection; Stepwise procedures; The backward elimination; The forward selection; Criterion-based procedures; Time for action – model selection using the backward, forward, and AIC criteria; Summary
Chapter 7: The Logistic Regression Model – The binary regression problem; Time for action – limitations of linear regression models; Probit regression model; Time for action – understanding the constants; Logistic regression model; Time for action – fitting the logistic regression model; Hosmer-Lemeshow goodness-of-fit test statistic; Time for action – the Hosmer-Lemeshow goodness-of-fit statistic; Model validation and diagnostics; Residual plots for the GLM; Time for action – residual plots for the logistic regression model; Influence and leverage for the GLM; Time for action – diagnostics for the logistic regression; Receiving operator curves; Time for action – ROC construction; Logistic regression for the German credit screening dataset; Time for action – logistic regression for the German credit dataset; Summary
Chapter 8: Regression Models with Regularization – The overfitting problem; Time for action – understanding overfitting; Regression spline; Basis functions; Piecewise linear regression model; Time for action – fitting piecewise linear regression models; Natural cubic splines and the general B-splines; Time for action – fitting the spline regression models; Ridge regression for linear models; Time for action – ridge regression for the linear regression model; Ridge regression for logistic regression models; Time for action – ridge regression for the logistic regression model; Another look at model assessment; Time for action – selecting lambda iteratively and other topics; Summary
Chapter 9: Classification and Regression Trees – Recursive partitions; Time for action – partitioning the display plot; Splitting the data; The first tree; Time for action – building our first tree; The construction of a regression tree; Time for action – the construction of a regression tree; The construction of a classification tree; Time for action – the construction of a classification tree; Classification tree for the German credit data; Time for action – the construction of a classification tree; Pruning and other finer aspects of a tree; Time for action – pruning a classification tree; Summary
Chapter 10: CART and Beyond – Improving CART; Time for action – cross-validation predictions; Bagging; The bootstrap; Time for action – understanding the bootstrap technique; The bagging algorithm; Time for action – the bagging algorithm; Random forests; Time for action – random forests for the German credit data; The consolidation; Time for action – random forests for the low birth weight data; Summary
Appendix: References
Index

Preface
The open source software R is fast becoming one of the preferred companions of Statistics, even as the subject continues to add many friends in Machine Learning, Data Mining, and so on among its already rich scientific network. This era of mathematical theory embedded in statistical applications is truly a remarkable one for society, and the software has played a very pivotal role in it. This book is a humble attempt at presenting Statistical Models through R for any reader who has a bit of familiarity with the subject. In my experience of practicing the subject with colleagues and friends from different backgrounds, I realized that many are interested in learning the subject and applying it in their domain, which enables them to take appropriate decisions in analyses that involve uncertainty. A decade earlier my friends would be content with being pointed to a useful reference book. Not so anymore! The work in almost every domain is done through computers, and naturally they do have their data available in spreadsheets, databases, and sometimes in plain text format. The request for an appropriate statistical model is invariably followed by a one-word question: "Software?" My answer to them has always been a single-letter reply: "R!" Why? It is really a very simple decision, and it has been my companion over the last seven years. In this book, this experience has been converted into detailed chapters and a cleaner breakup of model building in R.
A by-product of the interaction with colleagues and friends who are all aspiring statistical model builders has been that I have been able to pick up the trough of their learning curve of the subject. The first attempt towards fixing the hurdle has been to introduce the fundamental concept that the beginners are most familiar with, which is data. The difference is simply in the subtleties, and as such I firmly believe that introducing the subject on their turf motivates the reader for a long way in their journey. As with most statistical software, R provides modules and packages that cover many of the recently invented statistical methodologies. The first five chapters of the book focus on the fundamental aspects of the subject and the R software, and hence cover R basics, data visualization, exploratory data analysis, and statistical inference. The foundational aspects are illustrated using interesting examples and set up the framework for the later five chapters. Regression models, linear and logistic regression models being at the forefront, are of paramount interest in applications. The discussion is more generic in nature and the techniques can be easily adapted across different domains. The last two chapters have been inspired by the Breiman school, and hence the modern method of Classification and Regression Trees has been developed in detail and illustrated through a practical dataset.

What this book covers
Chapter 1, Data Characteristics, introduces the different types of data through a questionnaire and dataset. The need for statistical models is elaborated in some interesting contexts. This is followed by a brief explanation of R installation and the related packages. Discrete and continuous random variables are discussed through introductory R programs.
Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames, vectors, matrices, and lists are discussed with clear and simple examples. Importing data from external files in csv, xls, and other formats is elaborated next. Writing data/objects from R for other software is considered, and the chapter concludes with a dialogue on R session management.
Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and numeric datasets. This translates into techniques of bar chart, dot chart, spine and mosaic plot, and fourfold plot for categorical data, while histogram, box plot, and scatter plot are used for continuous/numeric data. A very brief introduction to ggplot2 is also provided here.
Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for preliminary analysis of data. The visualizing techniques of EDA such as stem-and-leaf and letter values, and the modeling techniques of resistant line, smoothing data, and median polish, give a rich insight as a preliminary analysis step.
Chapter 5, Statistical Inference, begins with the emphasis on the likelihood function and computing the maximum likelihood estimate. Confidence intervals for the parameters of interest are developed using functions defined for specific problems. The chapter also considers the important statistical tests of the Z-test and t-test for comparison of means, and the chi-square test and F-test for comparison of variances.
Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a set of explanatory variables. The linear regression model has many underlying assumptions, and such details are verified using validation techniques.
A model may be affected by a single observation, a single output value, or an explanatory variable. Statistical metrics that help remove one or more of these kinds of anomalies are discussed in depth. Given a large number of covariates, an efficient model is developed using model selection techniques.
Chapter 7, The Logistic Regression Model, is useful as a classification model when the output is a binary variable. Diagnostics and model validation through residuals are used, which lead to an improved model. ROC curves are discussed next, which help in identifying a better classification model.
Chapter 8, Regression Models with Regularization, discusses the problem of overfitting arising from the use of models developed in the previous two chapters. Ridge regression significantly reduces the possibility of an overfitted model, and the development of natural spline models also lays the basis for the models considered in the next chapter.
Chapter 9, Classification and Regression Trees, provides a tree-based regression model. The trees are initially built using R functions, and the final trees are also reproduced using rudimentary code, leading to a clear understanding of the CART mechanism.
Chapter 10, CART and Beyond, considers two enhancements of CART using bagging and random forests. A consolidation of all the models from Chapter 6 to Chapter 10 is also given through a dataset.
Chapter 1 to Chapter 5 form the basics of the R software and the Statistics subject. Practical and modern regression models are discussed in depth from Chapter 6 to Chapter 10.
Appendix, References, lists the books that have been referred to in this book.

What you need for this book
R is the only required software for this book, and you can download it from http://www.cran.r-project.org/. R packages will be required too, though this task is done within a working R session. The datasets used in the book are available in the R package RSADBE, which is an abbreviation of the book's title, at http://cran.r-project.org/web/packages/RSADBE/index.html.

Who this book is for
This book will be useful for readers who have a flair and a need for statistical applications in their own domains. The first seven chapters are also useful for any master's student in Statistics, and the motivated student can easily complete the rest of the book and obtain a working knowledge of CART.

Conventions
In this book, you will find several headings appearing frequently. To give clear instructions of how to complete a procedure or task, we use:

Time for action – heading
1. Action 1
2. Action 2
3. Action 3
Instructions often need some extra explanation so that they make sense, so they are followed with:

What just happened?
This heading explains the working of the tasks or instructions that you have just completed.

You will also find some other learning aids in the book, including:

Pop quiz – heading
These are short multiple-choice questions intended to help you test your own understanding.

Have a go hero – heading
These are practical challenges that give you ideas for experimenting with what you have learned.

You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning. Code words in text are shown as follows: "The operator %% on two objects, say x and y, returns the remainder following an integer division, and the operator %/% returns the integer division."
In certain cases the complete code cannot be included within the action list, and in such cases you will find the following display: Plot the "Response Residuals" against the "Fitted Values" of the pass_logistic model with the following values assigned:
plot(fitted(pass_logistic), residuals(pass_logistic, "response"), col="red",
  xlab="Fitted Values", ylab="Residuals", cex.axis=1.5, cex.lab=1.5)
In such a case you need to run the code starting with plot( up to cex.lab=1.5) in R.

Warnings or important notes appear in a box like this.
Tips and tricks appear like this.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of. To send us general feedback, simply send an e-mail to feedback@packtpub.com, and mention the book title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Downloading the color images of this book
We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/9441OS_R-Statistical-Application-Development-by-ExampleColor-Graphics.pdf.

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.

Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at copyright@packtpub.com with a link to the suspected pirated material. We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions
You can contact us at questions@packtpub.com if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1: Data Characteristics
Data consists of observations across different types of variables, and it is vital that any data analyst understands these intricacies at the earliest stage of exposure to statistical analysis. This chapter recognizes the importance of data and begins with a template of a dummy questionnaire, and then proceeds with the nitty-gritty of the subject. We then explain how uncertainty creeps into the domain of computer science. The chapter closes with coverage of important families of discrete and continuous random variables. We will cover the following topics:
Identification of the main variable types as nominal, categorical, and continuous variables
The uncertainty arising in many real experiments
R installation and packages
The mathematical form of discrete and continuous random variables and their applications

Questionnaire and its components
The goal of this section is to introduce the numerous variable types at the first possible occasion. Traditionally, an introductory course begins with the elements of probability theory and then builds up the requisites leading to random variables. This convention is dropped in this book, and we begin straightaway with data. There is a primary reason for choosing this path. The approach builds on what the reader is already familiar with and then connects it with the essential framework of the subject.
It is very likely that the user is familiar with questionnaires. A questionnaire may be asked after the birth of a baby with a view to aid the hospital in the study about the experience of the mother, the health status of the baby, and the concerns of the immediate guardians of the newborn. A multi-store department may instantly request the customer to fill in a short questionnaire for capturing the customer's satisfaction after the sale of a product. A customer's satisfaction following the service of their vehicle (see the detailed example discussed later) can be captured through a few queries. The questionnaires may arise in different forms than merely on paper. They may be sent via e-mail, telephone, short message service (SMS), and so on. As an example, one may receive an SMS that seeks a mandatory response in a Yes/No form. An e-mail may arrive in the Outlook inbox, which requires the recipient to respond through a vote for any of these three options: "Will attend the meeting", "Can't attend the meeting", or "Not yet decided".
Suppose the owner of a multi-brand car center wants to find out the satisfaction percentage of his customers. Customers bring their car to a service center for varied reasons. The owner wants to find out the satisfaction levels post the servicing of the cars and find the areas where improvement will lead to higher satisfaction among the customers. It is well known that the higher the satisfaction levels, the greater would be the customer's loyalty towards the service center. Towards this, a questionnaire is designed and then data is collected from the customers. A snippet of the questionnaire is given in Figure 1, and the information given by the customers leads to different types of data characteristics. The variables Customer ID and Questionnaire ID may be serial numbers or randomly generated unique numbers. The purpose of such variables is unique identification of people's responses. It may be possible that there are follow-up questionnaires as well.
In such cases, the Customer ID for a responder will continue to be the same, whereas the Questionnaire ID needs to change for identification of the follow-up. The values of these types of variables are in general not useful for analytical purposes.

Figure 1: A hypothetical questionnaire

The information of Full Name in this survey is a starting point to break the ice with the responder. In very exceptional cases the name may be useful for profiling purposes. For our purposes the name will simply be a text variable that is not used for analysis purposes. Gender is asked to know the person's gender, and in quite a few cases it may be an important factor explaining the main characteristic of the survey; in this case it may be mileage. Gender is an example of a categorical variable. Age in Years is a variable that captures the age of the customer. The data for this field is numeric in nature and is an example of a continuous variable.
The fourth and fifth questions help the multi-brand dealer in identifying the car model and its age. The first question here enquires about the type of the car model. The car models of the customers may vary from Volkswagen Beetle, Ford Endeavor, Toyota Corolla, and Honda Civic to Tata Nano; see the next screenshot. Though the model name is actually a noun, we make a distinction from the first question of the questionnaire in the sense that the former is a text variable while the latter leads to a categorical variable. Next, the car model may easily be identified to classify the car into one of the car categories, such as a hatchback, sedan, station wagon, or utility vehicle, and such a classifying variable may serve as an ordinal variable, as per the overall car size. The age of the car in months since its manufacture date may explain the mileage and odometer reading.
The sixth and seventh questions simply ask the customer if their minor/major problems were completely fixed or not. This is a binary question that takes either of the values, Yes or No. Small dents, power windows malfunctioning, niggling noises in the cabin, low output from the music speakers, and other similar issues which do not hinder the good functioning of the car may be treated as minor problems that are expected to be fixed in the car. Disc brake problems, wheel alignment, steering rattling issues, and similar problems that expose the user and co-users of the road to danger are of grave concern, as they affect the functioning of a car, and are treated as major problems. Any user will expect all of his/her issues to be resolved during a car service. An important goal of the survey is to find the service center's efficiency in handling the minor and major issues of the car. The labels Yes/No may be replaced by +1 and -1, or any other label of convenience.
The eighth question, "What is the mileage (km/liter) of the car?", gives a measure of the average petrol/diesel consumption. In many practical cases this data is provided by the belief of the customer, who may simply declare it to be between 5 km/liter and 25 km/liter. In the case of a lower mileage, the customer may ask for a finer tune-up of the engine, wheel alignment, and so on. A general belief is that if the mileage is closer to the assured mileage as marketed by the company, or some authority such as the Automotive Research Association of India (ARAI), the customer is more likely to be happy. An important variable is the overall kilometers done by the car up to the point of service.
Vehicles have certain maintenance services at the intervals of 5,000 km, 10,000 km, 20,000 km, 50,000 km, and 100,000 km. This variable may also be related with the age of the vehicle.
Let us now look at the final question of the snippet. Here, the customer is asked to rate his overall experience of the car service. A response from the customer may be sought immediately after a small test ride post the car service, or it may be through a questionnaire sent to the customer's e-mail ID. A rating of Very Poor suggests that the workshop has served the customer miserably, whereas the rating of Very Good conveys that the customer is completely satisfied with the workshop service. Note that there is some order in the response of the customer, in that we can grade the ranking in a certain order of Very Poor < Poor < Average < Good < Very Good. This implies that the structure of the ratings must be respected when we analyze the data of such a study. In the next section, these concepts are elaborated through a hypothetical dataset.

[Table: A hypothetical dataset of a Questionnaire – 20 sample customer records with the columns Customer_ID, Questionnaire_ID, Name, Gender, Age, Car_Model, Car_Manufacture_Year, Minor_Problems, Major_Problems, Mileage, Odometer, and Satisfaction_Rating.]

Understanding the data characteristics in an R environment
A snippet of an R session is given in Figure 2. Here we simply relate an R session with the survey and sample data of the previous table. The simple goal here is to get a feel/buy-in of R and not necessarily follow the R codes. The R installation process is explained in the R installation section. Here the user is loading the SQ R data object (SQ simply stands for sample questionnaire) in the session. The nature of the SQ object is a data.frame that stores a variety of other objects in itself. For more technical details of the data.frame function, see The data.frame object section of Chapter 2, Import/Export Data.
The names of a data.frame object may be extracted using the function variable.names. The R function class helps to identify the nature of an R object. As we have a list of variables, it is useful to find the class of all of them using the function sapply. In the following screenshot, the mentioned steps have been carried out:

Figure 2: Understanding the variable types of an R object

The variable characteristics are also on expected lines, as they truly should be, and we see that the variables Customer_ID, Questionnaire_ID, and Name are character variables; Gender, Car_Model, Minor_Problems, and Major_Problems are factor variables; DoB and Car_Manufacture_Year are date variables; Mileage and Odometer are integer variables; and finally the variable Satisfaction_Rating is an ordered factor variable. In the remainder of this chapter we will delve into more details about the nature of the various data types.
In a more formal language, a variable is called a random variable, abbreviated as RV in the rest of the book, in statistical literature. A distinction needs to be made here. In this book we do not focus on the important aspects of probability theory. It is assumed that the reader is familiar with probability, say at the level of Freund (2003) or Ross (2001). An RV is a function that maps from the probability (sample) space Ω to the real line. From the previous example we have Odometer and Satisfaction_Rating as two examples of a random variable. In formal language, random variables are generally denoted by the letters X, Y, …. The distinction that is required here is that in applications what we observe are the realizations/values of the random variables. In general, the realized values are denoted by the lower cases x, y, …. Let us clarify this at more length.
Suppose that we denote the random variable Satisfaction_Rating by X. Here, the sample space Ω consists of the elements Very Poor, Poor, Average, Good, and Very Good. For the sake of convenience, we will denote these elements by O1, O2, O3, O4, and O5 respectively. The random variable X takes one of the values O1, …, O5 with respective probabilities p1, …, p5. If the probabilities were known, we wouldn't have to worry about statistical analysis. In simple terms, if we knew the probabilities of the Satisfaction_Rating RV, we could simply use them to conclude whether more customers give a Very Good rating than a Poor one. However, our survey data does not contain every customer who has availed car service from the workshop, and as such we have representative probabilities and not actual probabilities. Now, we have seen 20 observations in the R session, and corresponding to each row we had some values under the Satisfaction_Rating column. Let us denote the satisfaction rating for the 20 observations by the symbols X1, …, X20. Before we collect the data, the random variables X1, …, X20 can assume any of the values in Ω. Post the data collection, we see that the first customer has given the rating as Good (that is, O4), the second as Average (O3), and so on up to the twentieth customer's rating as Average (again O3). By convention, what is observed in the data sheet is actually x1, …, x20, the realized values of the RVs X1, …, X20.
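Figure 2 is a screenshot and does not survive text extraction, so the following is a minimal sketch of the same inspection steps on a small, made-up data frame; the object toy and its values are hypothetical and only mimic a few of the questionnaire columns, they are not the book's SQ data.
# A small, hypothetical stand-in for the SQ data.frame
toy <- data.frame(
  Customer_ID = c("C001", "C002", "C003"),
  Gender = factor(c("Male", "Female", "Male")),
  Age = c(57L, 53L, 20L),
  Mileage = c(23L, 17L, 21L),
  Satisfaction_Rating = factor(c("Good", "Average", "Good"),
                               levels = c("Very Poor", "Poor", "Average", "Good", "Very Good"),
                               ordered = TRUE),
  stringsAsFactors = FALSE)
variable.names(toy)   # the variable names of the data.frame
class(toy)            # "data.frame"
sapply(toy, class)    # the class of each variable in one go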
Experiments with uncertainty in computer science
The common man of the previous century was skeptical about chance/randomness and attributed it to the lack of accurate instruments, and to information not being captured across many variables. The skepticism about the need for modeling randomness continues for the common man in the current era, as he feels that the instruments are too accurate and that multi-variable information eliminates uncertainty. However, this is not the fact, and we will look here at some examples that drive home this point.
In the previous section we dealt with data arising from a questionnaire regarding the service level at a car dealer. It is natural to accept that different individuals respond in distinct ways, and further, the car being a complex assembly of different components, it responds differently in near-identical conditions. A question then arises whether we really have to deal with such situations involving uncertainty in computer science. The answer is certainly affirmative, and we will consider some examples in the context of computer science and engineering.
Suppose that the task is installation of software, say R itself. At a new lab there has been an arrangement of 10 new desktops that have the same configuration. That is, the RAM, memory, processor, operating system, and so on are all the same in the 10 different machines. For simplicity, assume that the electricity supply and lab temperature are identical for all the machines. Do you expect that the complete R installation, as per the directions specified in the next section, will take the same number of milliseconds on all the 10 machines? The run time of an operation can be easily recorded, maybe using other software if not manually. The answer is a clear "No", as there will be minor variations in the processes active on the different desktops. Thus, we have our first experiment in the domain of computer science which involves uncertainty.
Suppose that the lab is now two years old. As an administrator, do you expect all the 10 machines to be working in the same identical conditions, as we started with identical configuration and environment? The question is relevant, as according to general experience a few machines may have broken down. Despite warranty and assurance by the desktop company, the number of machines that may have broken down will not be exactly as assured. Thus, we again have uncertainty.
Assume that three machines are not functioning at the end of two years. As an administrator, you have called the service vendor to fix the problem. For the sake of simplicity, we assume that the nature of failure of the three machines is the same, say motherboard failure on the three failed machines. Is it practical that the vendor would fix the three machines within identical times? Again, by experience we know that this is very unlikely. If the reader thinks otherwise, assume that 100 identical machines were running for two years and 30 of them are now having the motherboard issue. It is now clear that some machines may require a component replacement while others would start functioning following a repair/fix. Let us now summarize the preceding experiments through the following set of questions:
What is the average installation time for the R software on identically configured computer machines?
How many machines are likely to break down after a period of one year, two years, and three years?
If a failed machine has issues related to the motherboard, what is the average service time?
What is the fraction of failed machines that have a failed motherboard component?
The answers to these types of questions form the main objective of the Statistics subject. However, there are certain characteristics of uncertainty that are very richly covered by the families of probability distributions. According to the underlying problem, we have discrete or continuous RVs. The important and widely useful probability distributions form the content of the rest of the chapter. We will begin with the useful discrete distributions.
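The first of these questions is easy to explore empirically in R itself. The following sketch uses an arbitrary stand-in task (sorting a large random vector) rather than an actual installation, simply to show that repeated runs of the very same operation on the same machine do not take identical times.
# Time the same task ten times; the run-to-run differences illustrate the
# uncertainty discussed above, even on a single, fixed machine.
times <- replicate(10, system.time(sort(runif(1e6)))["elapsed"])
times        # ten elapsed times, typically not all identical
mean(times)  # an estimate of the average run time
var(times)   # a measure of the run-to-run variation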
R installation
The official website of R is the Comprehensive R Archive Network (CRAN) at www.cran.r-project.org. As of the writing of this book, the most recent version of R is 2.15.1. This software can be downloaded for the three platforms Linux, Mac OS X, and Windows.

Figure 3: The CRAN website (a snapshot)

A Linux user may simply key in sudo apt-get install r-base in the terminal, and post the entry of the right password and privilege levels, the R software will be installed. After the completion of download and installation, the software can be started by simply keying in R at the terminal.
A Windows user first needs to click on Download R for Windows as shown in the preceding screenshot, and then in the base subdirectory click on install R for the first time. In the new window, click on Download R 3.0.0 for Windows and download the .exe file to a directory of your choice. The completely downloaded R-3.0.0-win.exe file can be installed as any other .exe file. The R software may be invoked either from the Start menu, or from the icon on the desktop.

Using R packages
The CRAN repository hosts 4475 packages as of May 01, 2013. The packages are written and maintained by Statisticians, Engineers, Biologists, and others. The reasons are varied, the resourcefulness is very rich, and it reduces the need of writing exhaustive new functions and programs from scratch. These additional packages can be obtained from http://www.cran.r-project.org/web/packages/. The user can click on Table of available packages, sorted by name, which directs to a new web page. Let us illustrate the installation of an R package named gdata.
We now wish to install the package gdata. There are multiple ways of completing this task. Clicking on the gdata label leads to the web page http://www.cran.r-project.org/web/packages/gdata/index.html. In this HTML file we can find a lot of information about the package: Version, Depends, Imports, Published, Author, Maintainer, License, System Requirements, Installation, and CRAN checks. Further, the download options may be chosen from Package source, MacOS X binary, and Windows binary, depending on whether the user's OS is Unix, MacOS, or Windows respectively. Finally, a package may require other packages as a prerequisite, and it may itself be a prerequisite for other packages. This information is provided in the Reverse dependencies section in the options Reverse depends, Reverse imports, and Reverse suggests.
Suppose that the user is running the Windows OS. There are two ways to install the package gdata. Start R as explained earlier. At the console, execute the code install.packages("gdata"). A CRAN mirror window will pop up asking the user to select one of the available mirrors. Select one of the mirrors from the list (you may need to scroll down to locate your favorite mirror) and then click on the Ok button. A default setting is dependencies=TRUE, which will then download and install all other required packages. Unless there are some violations, such as the dependency requirement of the R version being at least 2.13.0 in this case, the packages are successfully installed.
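In condensed form, the console route just described amounts to the following sketch; library() then attaches the installed package to the current session, and the explicit dependencies = TRUE mirrors the behaviour described above.
# Install gdata from a CRAN mirror along with the packages it needs,
# then attach it to the running R session.
install.packages("gdata", dependencies = TRUE)
library(gdata)      # load the newly installed package
search()            # "package:gdata" should now appear in the search path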
A second way of installing the gdata package is as follows. On the gdata web page click on the link gdata_2.11.0.zip. This action will then attempt to download the package through the File download window. Choose the option Save and specify the path where you wish to download the package. In my case, I have chosen the path C:\Users\author\Downloads. Now go to the R window. In the menu ribbon, we have seven options: File, Edit, View, Misc, Packages, Windows, and Help. Yes, your guess is correct and you would have wisely selected Packages from the menu. Now, select the last option of Packages, the Install Package(s) from local zip files option, and direct it to the path where you have downloaded the .zip file. Select the file gdata_2.11.0 and R will do the required remaining part of installing the package. One of the drawbacks of doing this process manually is that if there are dependencies, the user needs to ensure that all such packages have been installed before embarking on this second task of installing the R packages. However, despite the problem, it is quite useful to know this technique, as we may not be connected to the Internet all the time and can install the packages as is convenient.

RSADBE – the book's R package
The book uses a lot of datasets from the Web, statistical text books, and so on. The file formats of the datasets have been varied, and thus to help the reader we have put all the datasets used in the book in an R package, RSADBE, which is the abbreviation of the book's title. This package will be available from the CRAN website as well as the book's web page. Thus, whenever you are asked to run data(xyz), the dataset xyz will be available either in the RSADBE package or the datasets package of R.
The book also uses many of the packages available on CRAN. The following table gives the list of packages, and the reader is advised to ensure that these packages are installed before beginning the chapter. That is, the reader needs to ensure that, as an example, install.packages(c("qcc","ggplot2")) is run in the R session before proceeding with Chapter 3, Data Visualization.

Chapter number and packages required:
Chapter 2: foreign, RMySQL
Chapter 3: qcc, ggplot2
Chapter 4: LearnEDA, aplpack
Chapter 5: stats4, PASWR, PairedData
Chapter 6: faraway
Chapter 7: pscl, ROCR
Chapter 8: ridge, DAAG
Chapter 9: rpart, rattle
Chapter 10: ipred, randomForest
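A convenience sketch (not from the book): the companion package and the chapter prerequisites listed above can be set up in one session as follows. The package names are taken from the table, and their continued availability on CRAN should be checked at install time.
# Install the book's companion package (if available on CRAN) and list its datasets.
install.packages("RSADBE")
library(RSADBE)
data(package = "RSADBE")   # shows the datasets that can be loaded with data(...)

# Install the chapter-wise prerequisites from the preceding table in one call.
# stats4 is omitted because it ships with R itself.
install.packages(c("foreign", "RMySQL", "qcc", "ggplot2", "LearnEDA", "aplpack",
                   "PASWR", "PairedData", "faraway", "pscl", "ROCR",
                   "ridge", "DAAG", "rpart", "rattle", "ipred", "randomForest"))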
Discrete distribution
The previous section highlights the different forms of variables. Variables such as Gender, Car_Model, and Minor_Problems possibly take one of only finitely many values. These variables are particular cases of the more general class of discrete variables. It is to be noted that the sample space Ω of a discrete variable need not be finite. As an example, the number of errors on a page may take values in the set of non-negative integers, {0, 1, 2, …}. Suppose that a discrete random variable X can take values among x_1, x_2, … with respective probabilities p_1, p_2, …, that is, P(X = x_i) = p(x_i) = p_i. Then we require that the probabilities be non-negative and, further, that they sum to 1:
p_i ≥ 0 for each i, and Σ_i p_i = 1,
where the Greek symbol Σ represents summation over the index i.
The function p(x_i) is called the probability mass function (pmf) of the discrete RV X. We will now consider formal definitions of important families of discrete variables. Engineers may refer to Bury (1999) for a detailed collection of statistical distributions useful in their field.
The two most important parameters of a probability distribution are the mean and variance of the RV X. In some cases, and important ones too, these parameters may not exist for the RV. However, we will not focus on such distributions, though we caution the reader that this does not mean that such RVs are irrelevant. Let us define these parameters for a discrete RV. The mean and variance of a discrete RV are respectively calculated as:
E(X) = Σ_i p_i x_i, and Var(X) = Σ_i p_i (x_i − E(X))^2.
The mean is a measure of central tendency, whereas the variance gives a measure of the spread of the RV.
The variables defined so far are more commonly known as categorical variables. Agresti (2002) defines a categorical variable as a measurement scale consisting of a set of categories. Let us identify the categories for the variables listed in the previous section. The categories for the variable Gender are Male and Female, whereas the car category variable derived from Car_Model has the categories hatchback, sedan, station wagon, and utility vehicle. The variables Minor_Problems and Major_Problems have common but independent categories Yes and No; and finally the variable Satisfaction_Rating has the categories, as seen earlier, Very Poor, Poor, Average, Good, and Very Good. The variable Car_Model just labels the name of the car and is an example of a nominal variable.
Finally, the output of the variable Satisfaction_Rating has an implicit order in it: Very Poor < Poor < Average < Good < Very Good. It may be realized that this difference poses subtle challenges in their analysis. These types of variables are called ordinal variables.
We will look at another type of categorical variable that has not popped up thus far. Practically, it is often the case that the output of a continuous variable is put in a certain bin for ease of conceptualization. A very popular example is the categorization of income level or age. In the case of income variables, it was realized in one of the earlier studies that people are very conservative about revealing their income in precise numbers. For example, the author may be shy to reveal that his monthly income is Rs. 34,892. On the other hand, it has been revealed that these very same people do not have a problem in disclosing their income as belonging to one of such bins: < Rs. 10,000, Rs. 10,000-30,000, Rs. 30,000-50,000, and > Rs. 50,000. Thus, this information may also be coded into labels, and then each of the labels may refer to any one value in an interval bin. Hence, such variables are referred to as interval variables.

Discrete uniform distribution
A random variable X is said to be a discrete uniform random variable if it can take any one of M finite labels with equal probability. As the discrete uniform random variable X can assume one of the values 1, 2, …, M with equal probability, this probability is actually 1/M. As the probability remains the same across the labels, the nomenclature "uniform" is justified. It might appear at the outset that this is not a very useful random variable. However, the reader is cautioned that this intuition is not correct. As a simple case, this variable arises in many cases where simple random sampling is in action. The pmf of the discrete uniform RV is calculated as:
P(X = x_i) = p(x_i) = 1/M, i = 1, 2, …, M.
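To make the mean and variance formulas concrete, here is a small sketch (not from the book) that evaluates E(X) and Var(X) directly from a probability vector, using the discrete uniform case with M = 10, the same setting as the plot that follows.
# Mean and variance of a discrete RV computed straight from its pmf:
# E(X) = sum(p_i * x_i) and Var(X) = sum(p_i * (x_i - E(X))^2).
M <- 10
x <- 1:M              # the M equally likely labels
p <- rep(1/M, M)      # the discrete uniform pmf
sum(p)                # sanity check: the probabilities add up to 1
EX <- sum(p * x); EX                # mean, here (M + 1)/2 = 5.5
VarX <- sum(p * (x - EX)^2); VarX   # variance, here (M^2 - 1)/12 = 8.25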
A simple plot of the probability distribution of a discrete uniform RV is demonstrated next:
> M = 10
> mylabels = 1:M
> prob_labels = rep(1/M, length(mylabels))
> dotchart(prob_labels, labels = mylabels, xlim = c(.08, .12), xlab = "Probability")
> title("A Dot Chart for Probability of Discrete Uniform RV")

Figure 4: Probability distribution of a discrete uniform random variable

The R programs here are indicative and it is not absolutely necessary that you follow them here. The R programs will actually begin from the next chapter, and your flow won't be affected if you do not understand certain aspects of them.

Binomial distribution
Recall the second question in the Experiments with uncertainty in computer science section, which asks "How many machines are likely to break down after a period of one year, two years, and three years?". When the outcomes involve uncertainty, the more appropriate question that we ask is related to the probability of the number of breakdowns being x. Consider a fixed time frame, say 2 years. To make the question more generic, we assume that we have n machines. Suppose that the probability of a breakdown for a given machine at any given time is p. The goal is to obtain the probability of x machines with breakdowns, and implicitly (n-x) functional machines. Now consider a fixed pattern where the first x units have failed and the remaining are functioning properly. All the n machines function independently of the other machines. Thus, the probability of observing the first x machines in the breakdown state is p^x.
Example 1.3.1: Suppose n = 10 and p = 0.5. We need to obtain the probabilities p(x), x = 0, 1, 2, …, 10. The probabilities can be obtained using the built-in R function dbinom, which returns the probabilities of a binomial RV. The first argument of this function may be a scalar or a vector, according to the points at which we wish to know the probability. The second argument of the function needs the value of n, the size of the binomial distribution, and the third argument requires the user to specify the probability of success p. It is natural to forget the syntax of functions, and the R help system becomes very handy here. For any function, you can get its details using ? followed by the function name; please do not give a space between ? and the function name. Here, you can try ?dbinom.

> n <- 10; p <- 0.5
> p_x <- round(dbinom(x=0:10, n, p), 4)
> plot(x=0:10, p_x, xlab="x", ylab="P(X=x)")

The R function round fixes the accuracy of the argument up to the specified number of digits.

Figure 5: Binomial probabilities

We have used the dbinom function in the previous example. There are three further utility facets for the binomial distribution: p, q, and r. These three facets respectively help us in computations related to cumulative probabilities, quantiles of the distribution, and simulation of random numbers from the distribution. To use these functions, we simply prefix the distribution name, binom here, with the required letter, giving pbinom, qbinom, and rbinom. There will, of course, be changes in the arguments. In fact, there are many distributions for which the quartet of d, p, q, and r is available; check ?Distributions.

Example 1.3.2: Assume that the probability of a key failing on an 83-key keyboard (the author's laptop keyboard has 83 keys) is 0.01. We need to find the probability that, at a given time, there are 10, 20, or 30 non-functioning keys on this keyboard. Using the dbinom function these probabilities are easy to calculate. Try the same problem using a scientific calculator, or by writing a simple function in any language that you are comfortable with.

> n <- 83; p <- 0.01
> dbinom(10,n,p)
[1] 1.168e-08
> dbinom(20,n,p)
[1] 4.343e-22
> dbinom(30,n,p)
[1] 2.043e-38
> sum(dbinom(0:83,n,p))
[1] 1

As the probabilities of 10-30 keys failing appear too small, it is natural to suspect that something may be going wrong. As a check, the sum of all the probabilities clearly equals 1. Let us look at the problem from a different angle. For many x values, the probability p(x) will be approximately zero. We may not be interested in the probability of an exact number of failures; rather, we are interested in the probability of at most x failures occurring, that is, in the cumulative probabilities P(X ≤ x). The cumulative probabilities for the binomial distribution are obtained in R using the pbinom function. The main arguments of pbinom include size (for n), prob (for p), and q (the x argument). For the same problem, we now look at the cumulative probabilities for various p values:

> n <- 83; p <- seq(0.05,0.95,0.05)
> x <- seq(0,83,5)
> i <- 1
> plot(x, pbinom(x,n,p[i]), "l", col=1, xlab="x",
+ ylab=expression(P(X<=x)))
> for(i in 2:length(p)) { points(x, pbinom(x,n,p[i]), "l", col=i) }

Figure 6: Cumulative binomial probabilities

Try to interpret the preceding screenshot.
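To round off the d-p-q-r quartet for this keyboard example, here is a small indicative sketch of the q and r facets; it is not part of the book's numbered examples.

# Indicative use of the quantile and random-number facets for n = 83, p = 0.01.
n <- 83; p <- 0.01
qbinom(0.95, n, p)          # smallest x with P(X <= x) >= 0.95; equals 3 here
rbinom(5, n, p)             # five simulated counts of failed keys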
Hypergeometric distribution
A box of N = 200 pieces of 12 GB pen drives arrives at a sales center, and the carton contains M = 20 defective pen drives. A random sample of n units is drawn from the carton. Let X denote the number of defective pen drives obtained in the sample of n units; the task is to obtain the probability distribution of X. The number of possible ways of obtaining the sample of size n is C(N, n). In this problem we have M defective units and N − M working pen drives; x defective units can be sampled in C(M, x) different ways and (n − x) good units can be obtained in C(N − M, n − x) distinct ways. Thus, the probability distribution of the RV X is calculated as:

P(X = x) = C(M, x) C(N − M, n − x) / C(N, n) = h(x; n, M, N),

where x is an integer between max(0, n − N + M) and min(n, M). The RV is called the hypergeometric RV and its probability distribution is called the hypergeometric distribution. Suppose that we draw a sample of n = 10 units. The function dhyper in R can be used to find the probabilities of the RV X assuming different values; note that its second argument is the number of non-defective units, N − M.

> N <- 200; M <- 20
> n <- 10
> x <- 0:11
> round(dhyper(x,M,N-M,n),3)
[1] 0.340 0.397 0.198 0.055 0.009 0.001 0.000 0.000 0.000 0.000 0.000 0.000

The mean and variance of a hypergeometric distribution are stated as follows:

E(X) = nM/N and Var(X) = n (M/N)(1 − M/N)(N − n)/(N − 1)

Negative binomial distribution
Consider a variant of the problem described in the previous subsection. Ten new desktops need to be fitted with add-on 5-megapixel external cameras to help the students attend a certain online course. Assume that the probability of a camera unit being non-defective is p. As an administrator, you keep placing orders until you receive 10 non-defective cameras. Now, let X denote the number of orders placed for obtaining the 10 good units. We denote the required number of successes by k, which in this discussion is k = 10. The goal in this unit is to obtain the probability distribution of X.

Suppose that the xth order placed results in the procurement of the kth non-defective unit. This implies that we have received (k − 1) non-defective units among the first (x − 1) orders placed, which is possible in C(x − 1, k − 1) distinct ways. At the xth order, the instant of having received the kth non-defective unit, we have k successes and (x − k) failures. Hence, the probability distribution of the RV is calculated as:

P(X = x) = C(x − 1, k − 1) (1 − p)^(x−k) p^k, x = k, k + 1, …

Such an RV is called the negative binomial RV and its probability distribution is the negative binomial distribution. Technically, this RV has no upper bound, as the next required success may never turn up. We state the mean and variance of this distribution as follows:

E(X) = k/p and Var(X) = k(1 − p)/p²

A particular and important special case of the negative binomial RV occurs for k = 1, and it is known as the geometric RV. In this case, the pmf is calculated as:

P(X = x) = (1 − p)^(x−1) p, x = 1, 2, …

Example 1.3.3. (Baron (2007), page 77) Sequential Testing: In a certain setup, the probability of an item being defective is (1 − p) = 0.05. To complete the lab setup, 12 non-defective units are required. We need to compute the probability that more than 15 units need to be tested. Here we make use of the cumulative distribution function of the negative binomial distribution, the pnbinom function available in R. Similar to the pbinom function, the main arguments that we require here are size, prob, and q. This problem is solved in a single line of code:

> 1-pnbinom(3,size=12,0.95)
[1] 0.005467259

Note that we have specified 3 as the quantile point (the q argument): in R's parameterization, pnbinom works with the number of defective units obtained before the 12th non-defective one, and needing more than 15 units in total means obtaining at least 4 defective units. The function pnbinom computes the cumulative distribution function, and the requirement is actually its complement; hence the expression in the code is 1 - pnbinom(3, …).
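As a cross-check, and not as part of the book's examples, the same probability can be reached through the binomial distribution: needing more than 15 tests is the same event as obtaining fewer than 12 non-defective units among the first 15 tested.

# Indicative cross-check via the binomial distribution.
pbinom(11, size = 15, prob = 0.95)   # P(at most 11 successes in 15 trials)
# agrees with 1 - pnbinom(3, size = 12, prob = 0.95)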
We may equivalently solve the problem using the dnbinom function, which straightforwardly computes the individual probabilities:

> 1-(dnbinom(3,size=12,0.95)+dnbinom(2,size=12,0.95)+dnbinom(1,
+ size=12,0.95)+dnbinom(0,size=12,0.95))
[1] 0.005467259

Poisson distribution
The number of accidents on a 1 km stretch of road, the total number of calls received during a one-hour slot on your mobile, the number of "likes" received for a status update on a social networking site in a day, and similar other cases are some of the examples addressed by the Poisson RV. The probability distribution of a Poisson RV is calculated as:

P(X = x) = e^(−λ) λ^x / x!, x = 0, 1, 2, …, λ > 0

Here λ is the parameter of the Poisson RV, with X denoting the number of events. The Poisson distribution is sometimes also referred to as the law of rare events. The mean and variance of the Poisson RV are, surprisingly, the same and equal λ, that is, E(X) = Var(X) = λ.

Example 1.3.4: Suppose that Santa commits errors in a software program with a mean of three errors per A4-size page. Santa's manager wants to know the probability of Santa committing 0, 5, and 20 errors per page. The R function dpois helps to determine the answer:

> dpois(0,lambda=3); dpois(5,lambda=3); dpois(20,lambda=3)
[1] 0.04978707
[1] 0.1008188
[1] 7.135379e-11

Note that Santa's probability of committing 20 errors is almost 0. We will next focus on continuous distributions.

Continuous distributions
The numeric variables in the survey, Age, Mileage, and Odometer, can take any value over a continuous interval, and these are examples of continuous RVs. In the previous section we dealt with RVs which have discrete output; in this section we deal with RVs whose output is continuous. A distinction from the previous section needs to be pointed out explicitly. In the case of a discrete RV, there is a positive probability of the RV taking on a certain value, which is determined by the pmf. In the continuous case, an RV necessarily assumes any specific value with zero probability; these technical issues will not be discussed in this book. In the discrete case, the probabilities of certain values are specified by the pmf, whereas in the continuous case the probabilities over intervals are decided by the probability density function, abbreviated as pdf.

Suppose that we have a continuous RV X with the pdf f(x) defined over the possible x values; that is, we assume that the pdf f(x) is well defined over the range of the RV X, denoted by R_X. It is necessary that the integral of f(x) over the range R_X equals 1, that is, ∫_{R_X} f(s) ds = 1. The probability that the RV X takes a value in an interval [a, b] is defined by:

P(X ∈ [a, b]) = ∫_a^b f(x) dx

In general we are interested in the cumulative probabilities of a continuous RV, that is, probabilities of events of the form P(X ≤ x).

Uniform distribution
A random variable X is said to be a uniform RV over the interval [0, θ], θ > 0, if its probability density function is given by:

f(x; θ) = 1/θ, 0 ≤ x ≤ θ, θ > 0

In fact, it is not necessary to restrict our focus to the positive real line. For any two real numbers a and b from the real line, with b > a, the uniform RV can be defined by:

f(x; a, b) = 1/(b − a), a ≤ x ≤ b, b > a

The uniform distribution has a very important role to play in simulation, as will be seen in Chapter 6, Linear Regression Analysis. As with the discrete counterpart, in the continuous case any two intervals of the same length have an equal probability of occurring.
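The equal-probability property of equal-length intervals is easy to verify numerically; the following two lines are only an indicative check, not from the book, using the punif function for the uniform RV over [0, 1].

# Two intervals of the same length under the uniform RV on [0, 1]
# carry the same probability.
punif(0.4) - punif(0.1)   # P(0.1 <= X <= 0.4) = 0.3
punif(0.9) - punif(0.6)   # P(0.6 <= X <= 0.9) = 0.3 as well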
The mean and variance of a uniform RV over the interval [a, b] are respectively given by:

E(X) = (a + b)/2 and Var(X) = (b − a)²/12

Example 1.4.1. Horgan's (2008) Example 15.3: The International Journal of Circuit Theory and Applications reported in 1990 that researchers at the University of California, Berkeley, had designed a switched capacitor circuit for generating random signals whose trajectory is uniformly distributed over the unit interval [0, 1]. Suppose that we are interested in the probability that the trajectory falls in the interval [0.35, 0.58]. Though the answer is straightforward, we will obtain it using the punif function:

> punif(0.58)-punif(0.35)
[1] 0.23

Exponential distribution
The exponential distribution is probably one of the most important probability distributions in Statistics, and more so for computer scientists. The number of arrivals in a queuing system, the time between two incoming calls on a mobile, the lifetime of a laptop, and so on, are some of the important applications where this distribution has a lasting utility value. The pdf of an exponential RV is specified by f(x; λ) = λe^(−λx), x ≥ 0, λ > 0. The parameter λ is sometimes referred to as the failure rate. The exponential RV enjoys a special property called the memory-less property, which conveys that:

P(X ≥ t + s | X ≥ s) = P(X ≥ t), for all t, s > 0

This mathematical statement says that the probability of an exponential RV surviving a further duration t does not depend on the age s it has already survived. In simple words, the probability of failure over a given duration is constant in time and does not depend on the age of the system. Let us obtain the plots of a few exponential densities:

> par(mfrow=c(1,2))
> curve(dexp(x,1),0,10,ylab="f(x)",xlab="x",cex.axis=1.25)
> curve(dexp(x,0.2),add=TRUE,col=2)
> curve(dexp(x,0.5),add=TRUE,col=3)
> curve(dexp(x,0.7),add=TRUE,col=4)
> curve(dexp(x,0.85),add=TRUE,col=5)
> legend(6,1,paste("Rate = ",c(1,0.2,0.5,0.7,0.85)),col=1:5,pch=
+ "___")
> curve(dexp(x,50),0,0.5,ylab="f(x)",xlab="x")
> curve(dexp(x,10),add=TRUE,col=2)
> curve(dexp(x,20),add=TRUE,col=3)
> curve(dexp(x,30),add=TRUE,col=4)
> curve(dexp(x,40),add=TRUE,col=5)
> legend(0.3,50,paste("Rate = ",c(50,10,20,30,40)),col=1:5,pch=
+ "___")

Figure 7: The exponential densities

The mean and variance of the exponential distribution are as follows:

E(X) = 1/λ and Var(X) = 1/λ²

Normal distribution
The normal distribution is in some sense an all-pervasive distribution that arises sooner or later in almost any statistical discussion. In fact, it is very likely that the reader is already familiar with certain aspects of the normal distribution; for example, the normal density curve is bell-shaped. Its mathematical appeal is probably reflected in the fact that, though its density has a fairly simple expression, it involves the three most famous irrational numbers π, e, and √2. Suppose that X is normally distributed with mean µ and variance σ². Then the probability density function of the normal RV is given by:

f(x; µ, σ) = (1/(√(2π)σ)) exp{−(x − µ)²/(2σ²)}, −∞ < x < ∞, −∞ < µ < ∞, σ > 0

If the mean is zero and the variance is one, the normal RV is referred to as the standard normal RV, and the convention is to denote it by Z.
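Before shading any regions, a couple of standard normal probabilities can be computed directly with pnorm; this is a small indicative aside and not one of the book's numbered examples.

# Indicative computations for the standard normal RV Z.
pnorm(0)                     # P(Z <= 0) = 0.5, hence P(Z > 0) = 0.5
pnorm(1.96) - pnorm(-1.96)   # P(-1.96 < Z < 1.96), approximately 0.95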
Example 1.4.2. Shady Normal Curves: We will again consider a standard normal random variable, which is more popularly denoted in Statistics by Z. Some of the most needed probabilities are P(Z > 0) and P(−1.96 < Z < 1.96). These probabilities are now shaded:

> par(mfrow=c(3,1))
> # Probability Z Greater than 0
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(0,4,0.02)
> lines(z,dnorm(z),type="h",col="grey")
> # 95% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(-1.96,1.96,0.001)
> lines(z,dnorm(z),type="h",col="grey")
> # 99% Coverage
> curve(dnorm(x,0,1),-4,4,xlab="z",ylab="f(z)")
> z <- seq(-2.58,2.58,0.001)
> lines(z,dnorm(z),type="h",col="grey")

Figure 8: Shady normal curves

Summary
You should now be clear about the distinct nature of variables that arise in different scenarios, and in R you should be able to verify that the data is in the correct format. Further, the important families of random variables were introduced in this chapter, which should help you in dealing with them when they crop up in your experiments. Computation of simple probabilities was also introduced and explained.

In the next chapter you will learn how to perform basic R computations, create data objects, and so on. As data can seldom be constructed completely within R, we need to import data from external, foreign files; the methods explained will help you import data in file formats such as .csv and .xls. Similar to importing, it is also important to be able to export data or output to other software. Finally, R session management will conclude the next chapter.

2 Import/Export Data

The main goals of this chapter are to familiarize you with the various classes of objects in R, help you extract data from various popular formats, connect R with popular databases such as MySQL, and finally present the best export options for R output. The motivation is that the practitioner frequently has data available in a fixed format, and sometimes the dataset resides in popular database systems. This chapter helps you extract data from various sources and then recommends the best export options for the R output. We will, though, begin with a better understanding of the various formats in which R stores data. Updated information about the import/export options is maintained at http://cran.r-project.org/doc/manuals/R-data.html. To summarize, the main learning from this chapter would be the following:

Basic and essential computations in R
Importing data from CSV, XLS, and a few more formats
Exporting data for other software
R session management

data.frame and other formats
Any software comes with its own structure and nuances. The Questionnaire and its component section of Chapter 1, Data Characteristics, introduced various facets of data. In the current section we will go into the details of how R works with data of different characteristics. Depending on the need, we have different formats for the data. In this section, we will begin with simpler objects and move up the ladder towards some of the more complex ones.

Constants, vectors, and matrices
R has five inbuilt objects which store certain constant values. The five objects are LETTERS, letters, month.abb, month.name, and pi. The first two objects contain the letters A-Z in upper and lower case. The third and fourth objects contain the abbreviated and complete month names. Finally, the object pi contains the value of the famous irrational number; the exercise here is for you to find the value of the irrational number e. The details about these R constant objects may be obtained using ?Constants or example(Constants), of course by executing these commands in the console.
There is also another class of constants in R which is very useful. These constants are called NumericConstants and include Inf for infinite numbers, NaN for not a number, and so on. You are encouraged to find more details and other useful constants. R can handle numerical, character, logical, integer, and complex kind of vectors and it is the class of the object which characterizes the vector. Typically, we deal with vectors which may be numeric, characters and so on. A vector of the desired class and number of elements may be initiated using the vector function. The length argument declares the size of the vector, which is the number of elements for the vector, whereas mode characterizes the vector to take one of the required classes. The elements of a vector can be assigned names if required. The R names function comes handy for this purpose. Arithmetic on numeric vector objects can be performed in an easier way. The operators (+, -, *, /, and ^) are respectively used for (addition, subtraction, multiplication, division, and power). The characteristics of a vector object may be obtained using functions such as sum, prod, min, max, and so on. Accuracy of a vector up to certain decimals may be fixed using options in digits, round, and so on. Now, two vectors need not have the same number of elements and we may carry the arithmetic operation between them, say addition. In a mathematical sense two vectors of unequal length cannot be added. However, R goes ahead and performs the operations just the same. Thus, there is a necessity to understand how operations are carried out in such cases. To begin with the simpler case, let us consider two vectors with an equal number of elements. Suppose that we have a vector x = (1, 2, 3, …, 9, 10), and y = (11, 12, 13, …, 19, 20). If we add these two vectors, x + y, the result is an element-wise addition of the respective elements in x and y, that is, we will get a new vector with elements 12, 14, 16, …, 28, 30. Now, let us increase the number of elements of y from 10 to 12 with y = (11, 12, 13, …, 19, 20, 21, 22). The operation is carried out in the order that the elements of x (the smaller object one) are element-wise added to the first ten elements of y. Now, R finds that there are two more elements of y in 11 and 12 which have not been touched as of now. It now picks the first two elements of x in 1 and 2 and adds them to 11 and 12. Hence, the 11 and 12 elements of the output are 11+1 =12 and 12 + 2 = 14. The warning says that longer object length is not a multiple of shorter object length, which has now been explained. [ 34 ] Chapter 2 Let us have a brief peep at a few more operators related to the vectors. The operator %% on two objects, say x and y, returns a remainder following an integer division, and the operator %/% returns the integer division. Time for action – understanding constants, vectors, and basic arithmetic We will look at a few important and interesting examples. You will understand the structure of vectors in R and would also be able to perform the basic arithmetic related to this requirement. 1. 2. 3. Key in LETTERS at the R console and hit the Enter key. 4. Month names and their abbreviations are available in the base package and explore them using ?Constants at the console. 5. Selected month names and their abbreviations can be obtained using month. abb[c(1:3,8:10)] and month.name[c(1:3,8:10)]. Also, the value of pi in R can found by entering pi at the console. 6. 
To generate a vector of length 4, without specifying the class, try vector(length=4). In specific classes, vector objects can be generated by declaring the "mode" values for a vector object. That is, a numeric vector (with default values 0) is obtained by the code vector(mode = "numeric", length=4). You can similarly generate logical, complex, character, and integer vectors by specifying them as options in the mode argument. Key in letters at the R console and hit the Enter key. To obtain the first five and the last five alphabets, try the following code: LETTERS[c(1:5,22:26)] and letters[c(1:5,22:26)]. The next screenshot shows the results as you run the preceding codes in R. 7. Creating new vector objects and name assignment: A generated vector can be assigned to new objects using either =, <-, or ->. The last two assignments are in the order from the generated vector of the tail end to the new variables at the end of the arrow. 1. First assign the integer vector 1:10 to x by using x <- 1:10. 2. Check the names of x by using names(x). 3. Assign first 10 letters of the alphabets as names for elements of x by using names(x)<- letters[1:10], and verify that the assignment is done using names(x). 4. Finally, display the numeric vector x by simply keying in x at the console. [ 35 ] Import/Export Data 8. Basic arithmetic: Create new R objects by entering x<-1:10; y<-11:20; a<10; b<--4; c<-0.5 at the console. In a certain sense, x and y are vectors while a, b, and c are constants. 1. Perform simple addition of numeric vectors with x + y. 2. Scalar multiplication of vectors and then summing the resulting vectors is easily done by using a*x + b*y. 3. Verify the result (a + b)x = ax + bx by checking that the result of ((a+b)*x == a*x + b*x results is a logical vector of length 10, each having TRUE value. 4. Vector multiplication is carried by x*y. 5. Vector division in R is simply element-wise division of the two vectors, and does not have an interpretation in mathematics. We obtain the accuracy up to 4 digits using round(x/y,4). 6. Finally, (element-wise) exponentiation of vectors is carried out through x^2. 9. Adding two unequal length vectors: The arithmetic explained before applies to unequal length vectors in a slightly different way. Run the following operations: x=1:10; x+c(1:12), length(x+c(1:12)), c(1:3)^c(1,2), and (9:11)-c(4,6). 10. The integer divisor and remainder following integer division may be obtained respectively using %/% and %% operators. Key in -3:3 %% 2, -3:3 %% 3, and -3:3 %% c(2,3) to find remainders between the sequence -3, -2, …, 2, 3 and 2, 3, and c(2,3). Replace the operator %% by %/% to find the integer divisors. Now, we will first give the required R codes so you can execute them in the software: LETTERS letters LETTERS[c(1:5,22:26)] letters[c(1:5,22:26)] ?Constants [ 36 ] Chapter 2 month.abb[c(1:3,8:10)] month.name[c(1:3,8:10)] pi vector(length=4) vector(mode="numeric",length=4) vector(mode="logical",length=4) vector(mode="complex",length=4) vector(mode="character",length=4) vector(mode="integer",length=4) x=1:10 names(x) names(x)<- letters[1:10] names(x) x x=1:10 y=11:20 a=10 b=-4 x + y a*x + b*y sum((a+b)*x == a*x + b*x) x*y round(x/y,4) x^2 x=1:10 x+c(1:12) length(x+c(1:12)) c(1:3)^c(1,2) (9:11)-c(4,6) -3:3 %% 2 -3:3 %% 3 -3:3 %% c(2,3) -3:3 %/% 2 -3:3 %/% 3 -3:3 %/% c(2,3) Execute the preceding code in your R session. [ 37 ] Import/Export Data What just happened? We have split the output into multiple screenshots for ease of explanation. 
Introducing constants and vectors functioning in R LETTERS is a character vector available in R that consists of the 26 uppercase letters of the English language, whereas letters contains the alphabets in smaller letters. We have used the integer vector c(1:5,22:26) in the index to extract the first and last five elements of both the character vectors. When the ?Constants command is executed, R pops out an HTML file in your default internet browser and opens a page with the link http://my_IP/ library/base/html/Constants.html. You can find more details about Constants from the base package on this web page. Months, as in January-December, are available in the character vector month.name whereas the popular abbreviated forms of the months are available in the character vector month.abb. Finally, the numeric object pi contains the value of pi up to the first three decimals only. [ 38 ] Chapter 2 Next, we consider, generation of various types of vector using the R vector function. Now, the code vector(mode="numeric",length=4) creates a numeric vector with default values of 0 and required length of four. Similarly, the other vectors are created. Vector arithmetic in R An integer vector object is created by the code x = 1:10. We could have alternatively used options such as x<- 1:10 or 1:10 -> x. The final result is of course the same. The choice of the assignment operator—find more details by running ?assignOps at the R console<-— is far more popular in the R community and it can be used during any part of R programming. By default, there won't be any names assigned for either vectors or matrices. Thus, the output NULL. names is a function in R which is useful for assigning appropriate names. Our task is to assign the first 10 smaller letters of the alphabets to the vector x. Hence, we have the code names(x) <- letters[1:10]. We verify if the names have been properly assigned and the change on the display of x following the assignment of the names using names(x) and x. [ 39 ] Import/Export Data Next, we create two integer vectors in x and y, and two objects a and b, which may be treated as scalars. Now, x + y; a*x + b*y; sum((a+b)*x == a*x + b*x) performs three different tasks. First, it performs addition of vectors and returns the result of element-wise addition of the two vectors leading to the answer 12, 14, …, 28, 30. Second, we are verifying the result of scalar multiplication of vectors, and third, the result of (a + b)x = ax + bx. In the next round of R codes, we ran x*y; round(x/y,4); x^2. Similar to the addition operator, the * operator performs element-wise multiplication for the two vectors. Thus, we get the output as 11, 24, …, 171, 200. In the next line, recall that ; executes the code on the next line/operation, first the element-wise division is carried out. For the resulting vector (a numeric one), the round function gives the accuracy up to four digits as specified. Finally, x^2 gives us the square of each element of x. Here, 2 can be replaced by any other real number. In the last line of code, we repeat some of the earlier operations with a minor difference that the two vectors are not of the same length. As predicted earlier, R issues a warning that the length of the longer vector is not a multiple of the length of the shorter vector. Thus, for the operation x+c(1:12);, first all the elements of x (which is the shorter length vector here) are added with the first 10 elements of 1:12. 
Then the last two elements of 1:12 at 11 and 12 need to be added with elements from x, and for this purpose R picks the first two elements of x. If the longer length vector is a multiple of the shorter one, the entire elements of the shorter vector are repeatedly added over the in cycles. The remaining results as a consequence of running c(1:3)^c(1,2); (9:11)-c(4,6) are left to the reader for interpretation. Let us look at the output after the R codes for an integer and a remainder between two objects are carried out. Integer divisor and remainder operations [ 40 ] Chapter 2 In the segment -3:3 %% 2, we are first creating a sequence -3, -2, …, 2, 3 and then we are asking for the remainder if we divide each of them by 2. Clearly, the remainder for any integer if divided by 2 is either 0 or 1, and for a sequence of consecutive integers, we expect an alternate sequence of 0s and 1s, which is the output in this case. Check the expected result for -3:3 %% 3. Now, for the operation -3:3 %% c(2,3), first look at the complete sequence -3:3 as -3, -2, -1, 0, 1, 2, 3. Here, the elements -3, -1, 1, 3 are divided by 2 and the remainder is returned, whereas -2, 0, 2 are divided by 3 and the remainders are returned. The operator %/% returns the integer divisor and interpretation of the results are left to the reader. Please refer to the previous screenshot for the results. We now look at the matrix objects. Similar to the vector function in R, we have matrix as a function, that creates matrix objects. A matrix is an array of numbers with a certain number of rows and columns. By default, the elements of a matrix are generated as NA, that is, not available. Let r be the number of rows and c the number of columns. The order of a matrix is then r x c. A vector object of length rc in R can be converted into a matrix by the code matrix(vector, nrow=r, ncol=c, byrow=TRUE). The rows and columns of a matrix may be assigned names using the dimnames option in the matrix function. The mathematics of matrices in R is preserved in relation to the matrix arithmetic. Suppose we have two matrices A and B with respective dimensions m x n and n x o. The cross-product A x B is then a matrix of order m x o, which is obtained in R by the operation A %*% B. We are also interested in the determinant of square matrix, the number of rows being equal to the number of columns, and this is obtained in R using the det function on the matrix, say det(A). Finally, we also more often than not require the computation of the inverse of a square matrix. The first temptation is to obtain the same by using A^{-1}. This will give a wrong answer, as this leads to an element-wise reciprocal and not the inverse of a matrix. The solve function in R if executed on a square matrix gives the inverse of a matrix. Fine! Let us now do these operations using R. Time for action – matrix computations We will see the basic matrix computations in the forthcoming steps. The matrix computations such as the cross-product of matrices, transpose, and inverse will be illustrated. 1. 2. Generate a 2 x 2 matrix with default values using matrix(nrow=2, ncol=2). 3. Assign row and column names for the preceding matrix by using the option dimnames, that is, by running A <- matrix(data=1:4, nrow=2, ncol=2, Create a matrix from the 1:4 vector by running matrix(1:4,nrow=2,ncol=2, byrow="TRUE"). byrow=TRUE, dimnames = list(c("R_1", "R_2"),c("C_1", "C_2"))) at the R console. [ 41 ] Import/Export Data 4. 
Find the properties of the preceding matrices by using the commands nrow, ncol, dimnames, and few more, with dim(A); nrow(A); ncol(A); dimnames(A). 5. Create two matrices X and Y of order 3 * 4 and 4 * 3, and obtain their cross-product with the code X <- matrix(c(1:12),nrow=3,ncol=4); Y = matrix(13:24, nrow=4) and X %*% Y. 6. 7. The transpose of a matrix is obtained using the t function, t(X). Create a new matrix A <- matrix(data=c(13,24,34,23,67,32, 45,23,11), nrow=3) and find its determinant and inverse by using det(A) and solve(A) respectively. The R code for the preceding action list is given in the following code snippet: matrix(nrow=2,ncol=2) matrix(1:4,nrow=2,ncol=2, byrow="TRUE") A <- matrix(data=1:4, nrow=2, ncol=2, byrow=TRUE, dimnames = list(c("R_1", "R_2"),c("C_1", "C_2"))) dim(A); nrow(A); ncol(A); dimnames(A) X <- matrix(c(1:12),nrow=3,ncol=4) Y <- matrix(13:24, nrow=4) X %*% Y t(Y) A <- matrix(data=c(13,24,34,23,67,32,45,23,11),nrow=3) det(A) solve(A) Note the use of a semicolon (;) in line 5 of the preceding code. The result of this usage is that the code separated by a semicolon is executed as if it was entered on a new line. Execute the preceding code in your R console. The output of the R code is given in the following screenshot: [ 42 ] Chapter 2 Matrix computations in R What just happened? You were able to create matrices in R and learned the basic operations. Remember that solve and not (^-1) gives you the inverse of a matrix. It is now seen that matrix computations in R are really easy to carry out. The options, nrow and ncol, are used to specify the dimensions of a matrix. Data for a matrix can be specified through the data argument. The first two lines of code in the previous screenshot create a bare-bone matrix. Using the dimnames argument, we have created a more elegant matrix and assigned the matrix to a matrix object named A. We next focus on the list object. It has already been used earlier to specify the dimnames of a matrix. [ 43 ] Import/Export Data The list object In the preceding subsection we saw different kinds of objects such as constants, vectors, and matrices. Sometimes it is required that we pool them together in a single object. The framework for this task is provided by the list object. From the online source http:// cran.r-project.org/doc/manuals/R-intro.html#Lists-and-data-frames, we define a list as "an object consisting of an ordered collection of objects known as its components." Basically, various types of objects can be brought under a single umbrella using the list function. Let us create list which contains a character vector, an integer vector, and a matrix. Time for action – creating a list object Here, we will have a first look at the creation of list objects, which can contain in them objects of different classes: 1. Create a character vector containing the first six capital letters with A write.csv(Titanic,"Titanic.csv",row.names=FALSE) The Titanic.csv file will be exported to the current working directory. The reader can open the CSV file in either Excel or LibreOffice Calc and confirm that it is of the desired format. The other write/export options are also available in the foreign package. The write. xport, write.systat, write.dta, and write.arff functions are useful if the destination software is any of the following: SAS, SYSTAT, STATA, and Weka. Exporting graphs In Chapter 3, Data Visualization, we will be generating a lot of graphs. Here, we will explain how to save the graphs in a desired format. 
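Before turning to the graph export steps, here is a quick indicative round trip for the write.csv export discussed above; it is not from the book, and the data frame and file name are made up for illustration.

# Indicative sketch: export a small data frame to CSV and read it back.
small_df <- data.frame(id = 1:3, rating = c("Good", "Average", "Poor"))
write.csv(small_df, "small_df.csv", row.names = FALSE)   # hypothetical file name
read.csv("small_df.csv")   # confirms the round trip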
In the next screenshot, we have a graph generated by execution of the code plot(sin, -pi, 2*pi) at the terminal. This line of code generates the sine wave over the interval [-π, 2π]. [ 60 ] Chapter 2 Time for action – exporting a graph Exporting of graph will be explored here: 1. 2. 3. Plot the sin function over the range [-π, 2π] by running plot(sin, -pi, 2*pi). A new window pops up with the title R Graphics Device 2 (ACTIVE). In the menu bar, go to File | Save as | Png. Saving graphs 4. Save the file as sin_plot.png, or any other name felt appropriate by you. What just happened? A file named sin_plot.png would have been created in the directory as specified by you in the preceding Step 4. Unix users do not have the luxury of saving the graph in the previously mentioned way. If you are using Unix, you have different options of saving a file. Suppose we wish to save the file when running R in a Linux environment. [ 61 ] Import/Export Data The grDevices library gives different ways of saving a graph. Here, the user can use the pdf, jpeg, bmp, png, and a few more functions to save the graph. An example is given in the following code: > jpeg("My_File.jpeg") > plot(sin, -pi, 2*pi) > dev.off() null device 1 > ?jpeg > ?pdf Here, we first invoke the required device and specify the file name to save the output, the path directory may be specified as well along with the file name. Then, we plot the function and finally close the device with dev.off. Fortunately, this technique works on both Linux and Windows platforms. Managing an R session We will close the chapter with a discussion of managing the R session. In many ways, this section is similar to what we do to a dining table after we have completed the dinner. Now, there are quite a few aspects about saving the R session. We will first explain how to save the R codes executed during a session. Time for action – session management Managing a session is very important. Any well developed software gives multiple options for managing a technical session and we explore some of the methods available in R. 1. You have decided to stop the R session! At this moment, we would like to save all the R code executed at the console. In the File menu, we have an option in Save History. Basically, it is the action File | Save History…. After selecting the option, as with previous section, we can save the history of that R session in a new text file. Save the history with the filename testhist. Basically, R saves it as a RHISTORY file which may be easily viewed/modified through any text editor. You may also save the R history in any appropriate directory, which is the destination. 2. Now, you want to reload the history testhist at the beginning of a new R session. The direction is File | Load History…, and choose the testhist file. [ 62 ] Chapter 2 3. In an R session, you would have created many objects with different characteristics. All of them can be saved in an .Rdata file with File | Save Workspace…. In a new session, this workspace can be loaded with File | Load Workspace…. R session management 4. Another way of saving the R codes (history and workspace) is when we close the session either with File | Exit, or by clicking on the X of the R window; a window will pop up as displayed in the previous screenshot. If you click on Yes, R will append the RHISTORY file in the working directory with the codes executed in the current session and also save the workspace. 5. If you want to save only certain objects from the current list, you can use the save function. 
As an example if you wish to save the object x, run save(x,file="x. Rdata"). In a later session, you can reinstate the object x with load("x.Rdata"). However, the libraries that were invoked in the previous session are not available again. They again need to be explicitly invoked using the library() function. Therefore, you should be careful about this fact. Saving the R session [ 63 ] Import/Export Data What just happened? The session history is very important, and also the objects created during a session. As you get deeper into the subject, it is soon realized that it is not possible to complete all the tasks in a single session. Hence, it is vital to manage the sessions properly. You learned how to save code history, workspace, and so on. Have a go hero 2 1 You have two matrices A = 16 25 30 and B = 91 -126 . Obtain the cross-product AB and find the inverse of AB. Next, find (BTAT) then the transpose of its inverse. What will be your observation? Summary In this chapter we learned how to carry out the essential computations. We also learned how to import the data from various foreign formats and then to export R objects and output suitable for other software. We also saw how to effectively manage an R session. Now that we know how to create R data objects, the next step is the visualization of such data. In the spirit of Chapter 1, Data Characteristics, we consider graph generation according to the nature of the data. Thus, we will see specialized graphs for data related to discrete as well as continuous random variables. There is also a distinction made for graphs required for univariate and multivariate data. The next chapter must be pleasing on the eyes! Special emphasis is made on visualization techniques related to categorical data, which includes bar charts, dot charts, and spine plots. Multivariate data visualization is more than mere 3D plots and the R methods such as pairs plot discussed in the next chapter will be useful. [ 64 ] 3 Data Visualization Data is possibly best understood, wherever plausible, if it is displayed in a reasonable manner. Chen, et. al. (2008) has compiled articles where many scientists of data visualization give a deeper, historical, and modern trend of data display. Data visualization may probably be as historical as data itself. It emerges across all the dimensions of science, history, and every stream of life where data is captured. The reader may especially go through the rich history of data visualization in the article of Friendly (2008) from Chen, et. al. (2008). The aesthetics of visualization has been elegantly described in Tufte (2001). The current chapter will have a deep impact on the rest of the book, and moreover this chapter aims to provide the guidance and specialized graphics in the appropriate context in the rest of the book. This chapter provides the necessary stimulus for understanding the gist that discrete and continuous data needs appropriate tools, and the validation may be seen through the distinct characteristics of such plots. Further, this chapter is also more closely related to Chapter 4, Exploratory Analysis, and many visualization techniques here indeed are "exploratory" in nature. Thus, the current chapter and the next complement mutually. It has been observed that in many preliminary courses/text, a lot of emphasis is on the type of the plots, say histogram, boxplot, and so on, which are more suitable for data arising for continuous variables. 
Thus, we need to make a distinction among the plots for discrete and continuous variable data, and towards this we first begin with techniques which give more insight on visualization of discrete variable data. In R there are four main frameworks for producing graphics: basic graphs, grids, lattice, and ggplot2. In the current chapter, the first three are used mostly and there is a brief peek at ggplot2 at the end. Data Visualization This chapter will mainly cover the details on effective data visualization: Visualization of categorical data using a bar chart, dot chart, spine and mosaic plots, and the pie chart and its variants Visualization of continuous data using a boxplot, histogram, scatter plot and its variants, and the Pareto chart A very brief introduction to the rich school of ggplot2 Visualization techniques for categorical data In Chapter 1, Data Characteristics, we came across many variables whose outcomes are categorical in nature. Gender, Car_Model, Minor_Problems, Major_Problems, and Satisfaction_Rating are examples of categorical data. In a software product development cycle, various issues or bugs are raised at different severity levels such as Minor and Show Stopper. Visualization methods for the categorical data require special attention and techniques, and the goal of this section is to aid the reader with some useful graphical tools. In this section, we will mainly focus on the dataset related to bugs, which are of primary concern for any software engineer. The source of the datasets is http://bug.inf.usi.ch/ and the reader is advised to check the website before proceeding further in this section. We will begin with the software system Eclipse JDT Core, and the details for this system may be found at http://www.eclipse.org/jdt/core/index.php. The files for download are available at http://bug.inf.usi.ch/download.php. Bar charts It is very likely that you are familiar with bar charts, though you may not be aware of categorical variables. Typically, in a bar chart one draws the bars proportional to the frequency of the category. An illustration will begin with the dataset Severity_Counts related to the Eclipse JDT Core software system. The reader may also explore the built-in examples in R. Going through the built-in examples of R The bar charts may be obtained using two options. The function barplot, from the graphics library, is one way of obtaining the bar charts. The built-in examples for this plot function may be reviewed with example(barplot). The second option is to load the package lattice and then use the example(barchart) function. The sixth plot, after you click for the sixth time on the prompt, is actually an example of the barchart function. [ 66 ] Chapter 3 The main purpose of this example is to help the reader get flair of the bar charts that may be obtained using R. It happens that often we have a specific variant of a plot in our mind and find it difficult to recollect it. Hence, it is a suggestion to explore the variety of bar charts you can produce using R. Of course, there are a lot more possibilities than the mere samples given by example(). Example 3.1.2. Bar charts for the Bug Metrics dataset: The software system Eclipse JDT Core has 997 different class environments related to the development. The bug identified on each occasion is classified by its severity as Bugs, NonTrivial, Major, Critical, and High. 
We need to plot the frequency of the severity level, and also require the frequencies to be highlighted by Before and After release of the software to be neatly reflected in the graph. The required data is available in the RSADBE package in the Severity_Counts object. Example 3.1.3. Bar charts for the Bug Metrics of the five software: In the previous example, we had considered the frequencies only on the JDT software. Now, it will be a tedious exercise if we need to have five different bar plots for different software. The frequency table for the five software is given in the Bug_Metrics_Software dataset of the RSADBE package. Software JDT PDE Equinox Lucene Mylyn BA_Ind Bugs NonTrivial Bugs Major Bugs Critical Bugs High Priority Bugs Before 11,605 10,119 1,135 432 459 After 374 17 35 10 3 Before 5,803 4,191 362 100 96 After 341 14 57 6 0 Before 325 1,393 156 71 14 After 244 3 4 1 0 Before 1,714 1,714 0 0 0 After 97 0 0 0 0 Before 14,577 6,806 592 235 8,804 After 340 187 18 3 36 It would be nice if we could simply display the frequency table across two graphs only. This is achieved using the option beside in the barplot function. The data from the preceding table is copied from an XLS/CSV file, and then we execute the first line of the following R program in the Time for action – bar charts in R section. Let us begin the action and visualize the bar charts. [ 67 ] Data Visualization Time for action – bar charts in R Different forms of bar charts will be displayed with datasets. The type of bar chart also depends on the problem (and data) on hand. 1. 2. Enter example(barplot) in the console and hit the Return key. 3. 4. Load the lattice package with library(lattice). 5. Load the dataset on severity counts for the JDT software from the RSADBE package with data(Severity_Counts). Also, check for this data. A view of this object is given in the screenshot in step 7. We have five severities of bugs: general bugs (Bugs), non-trivial bugs (NT.Bugs), major bugs (Major.Bugs), critical bugs (Critical), and high priority bugs (H.Priority). For the JDT software, these bugs are counted before and after release, and these are marked in the object with suffixes BR and AR. We need to understand this count data and as a first step, we use the bar plots for the purpose. 6. To obtain the bar chart for the severity-wise comparison before and after release of the JDT software, run the following R code: A new window pops up with the heading Click or hit Enter for next page. Click (and pause between the clicks) your way until it stops changing. Try example(barchart) in the console. The sixth plot is an example of the bar chart. barchart(Severity_Counts,xlab="Bug Count",xlim=c(0,12000), col=rep(c(2,3),5)) The barchart function is available from the lattice package. The range for the count is specified with xlim=c(0,12000). Here, the argument col=rep(c(2,3),5) is used to tell R that we need two colors for BR and AR and that this should be repeated five times for the five severity levels of the bugs. [ 68 ] Chapter 3 Figure 1: Bar graph for the Bug Metrics dataset 7. An alternative method is to use the barplot function from the graphics package: barplot(Severity_Counts,xlab="Bug Count",xlim=c(0,12000), horiz =TRUE,col=rep(c(2,3),5)) Here, we use the argument horiz = TRUE to get a horizontal display of the bar plot. A word of caution here that the argument horizontal = TRUE in barchart of the lattice package works very differently. 
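As a small indicative aside, not from the book, the colour argument used above can be inspected on its own to see how the two colours recycle across the five severity levels:

# The colour vector supplied to col= in the preceding calls.
rep(c(2, 3), 5)
# [1] 2 3 2 3 2 3 2 3 2 3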
[ 69 ] Data Visualization We will now focus on Bug_Metrics_Software, which has the bug count data for all the five software: JDT, PDE, Equinox, Lucene, and Mylyn. Figure 2: View of Severity_Counts and Bug_Metrics_Software 8. 9. Load the dataset related to all the five software with data(Bug_Metrics_ Software). To obtain the bar plots for before and after release of the software on the same window, run par(mfrow=c(1,2)). What is the par function? It is a function frequently used to set the parameters of a graph. Let us consider a simple example. Recollect that when you tried the code example(dotchart), R would ask you to Click or hit Enter for next page and post the click or Enter action, the next graph will be displayed. However, this prompt did not turn up when you ran barplot(Severity_Counts,xlab="Bug Count",xlim=c(0,12000), horiz =TRUE,col=rep(c(2,3),5)). Now, let us try using par, which will ask us to first click or hit Enter so that we get the bar plot. First run par(ask=TRUE), and then follow it with the bar plot code. You will now be asked to either click or hit Enter. Find more details of the par function with ?par. Let us now get into the mfrow argument. The default plot options displays the output on one device and on the next one, the former will be replaced with the new one. We require the bar plots of before and after release count to be displayed in the same window. The option, mfrow = c(1,2), ensures that both the bar plots are displayed in the same window with one row and two columns. [ 70 ] Chapter 3 10. To obtain the bar plot of bug frequencies before release where each of the software bug frequencies are placed side by side for each type of the bug severity, run the following: barplot(Bug_Metrics_Software[,,1],beside=TRUE,col = c("lightblue", "mistyrose", "lightcyan", "lavender", "cornsilk"),legend = c("JDT" ,"PDE","Equinox","Lucene", "Mylyn")) title(main = "Before Release Bug Frequency", font.main = 4) Here, the code Bug_Metrics_Software[,,1] ensures that only before release are considered. The option beside = TRUE ensures that the columns are displayed as juxtaposed bars, otherwise, the frequencies will be distributed in a single bar with areas proportional to the frequency of each software. The option col = c("lightblue", …) assigns the respective colors for the software. Finally, the title command is used to designate an appropriate title for the bar plot. 11. Similarly, to obtain the bar plot for after release bug frequency, run the following: barplot(Bug_Metrics_Software[,,2],beside=TRUE,col = c("lightblue", "mistyrose", "lightcyan", "lavender", "cornsilk"),legend = c("JDT" ,"PDE","Equinox","Lucene", "Mylyn")) title(main = "After Release Bug Frequency",font.main = 4) [ 71 ] Data Visualization The reader can extend the code interpretation for the before release to the after release bug frequencies. Figure 3: Bar plots for the five software First notice that the scale on the y-axis for before and after release bug frequencies is drastically different. In fact, before release bug frequencies are in thousands while after release are in hundreds. This clearly shows that the engineers have put a lot of effort to ensure that the released products are with minimum bugs. However, the comparison of bug counts is not fair since the frequency scales of the bar plots in the preceding screenshot are entirely different. 
Though we don't expect the results to be different under any case, it is still appropriate that the frequency scales remain the same for both before and after release bar plots. A common suggestion is to plot the diagrams with the same range on the y-axes (or x-axes), or take an appropriate transformation such as a logarithm. In our problem, neither of them will work, and we resort to another variant of the bar chart from the lattice package. Now, we will use the formula structure for the barchart function and bring the BR and AR on the same graph. [ 72 ] Chapter 3 12. Run the following code in the R console: barchart(Software~Freq|Bugs,groups=BA_Ind, data= as.data. frame(Bug_Metrics_Software),col=c(2,3)) The formula Software~Freq|Bugs requires that we obtain the bar chart for the software count Freq according to the severity of Bugs. We further specify that each of the bar chart a be further grouped according to BA_Ind. This will result in the following screenshot: Figure 4: Bar chart for Before and After release bug counts on the same scale To find the colors available in R, run try colors() in the console and you will find the names of 657 colors. [ 73 ] Data Visualization What just happened? barplot and barchart were the two functions we used to obtain the bar charts. For common recurring factors, AR and BR here, the colors can be correctly specified through the rep function. The argument beside=TRUE helped us to keep the bars for various software together for the different bug types. We also saw how to use the formula structure of the lattice package. We saw the diversity of bar charts and learned how to create effective bar charts depending on the purpose of the day. Have a go hero Explore the option stack=TRUE in the barchart(Software~Freq|Bugs,groups= BA_ Ind,…). Also, observe that Freq for bars in the preceding screenshot begins a little earlier than 0. Reobtain the plots by specifying the range for Freq with xlim=c(0,15000). Dot charts Cleveland (1993) proposed an alternative to the bar chart where dots are used to represent the frequency associated with the categorical variables. The dot charts are useful for small to moderate size datasets. The dot charts are also an alternative to the pie chart, refer to The default examples section. The dot charts may be varied to accommodate continuous variable dataset too. The dot charts are known to obey the Tukey's principle of achieving an as high as possible information-to-ink ratio. Example 3.1.4. (Continuation of Example 3.1.2): In the screenshot in step 6 in the Time for action – bar charts in R section, we saw that the bar charts for the frequencies of bugs for after release are almost non-existent. This is overcome using the dot chart, see the following action list on the dot chart. Time for action – dot charts in R The dotchart function from the graphics package and dotplot from the lattice package will be used to obtain the dot charts. 1. To view the default examples of dot charts, enter example(dotplot); example(dotchart); in the console and hit the Return key. [ 74 ] Chapter 3 2. 
To obtain the dot chart of the before and after release bug frequencies, run the following code: dotchart(Severity_Counts,col=15:16,lcolor="black",pch=2:3,labels =names(Severity_Counts),main="Dot Plot for the Before and After Release Bug Frequency",cex=1.5) Here, the option col=15:16 is used to specify the choice of colors; lcolor is used for the color of the lines on the dot chart which gives a good assessment of the relative positions of frequencies for the human eye. The option pch=2:3 picks up circles and squares for indicating the positions of after and before frequencies. The options labels and main are trivial to understand, whereas cex magnifies the size of all labels by 1.5 times. On execution of the preceding R code, we get a graph as displayed in the following screenshot: Figure 5: Dot chart for the Bug Metrics dataset [ 75 ] Data Visualization 3. The dot plot can be easily extended for all the five software as we did with the bar charts. > par(mfrow=c(1,2)) > dotchart(Bug_Metrics_Software[,,1],gcolor=1:5,col=6: 10,lcolor= "black",pch=15:19,labels=names(Bug_Metrics_ Software[,,1]),main="Before Release Bug Frequency",xlab="Frequency Count") > dotchart(Bug_Metrics_Software[,,2],gcolor=1:5,col=6: 10,lcolor= "black",pch=15:19,labels=names(Bug_Metrics_ Software[,,2]),main="After Release Bug Frequency",xlab="Frequency Count") Figure 6: Dot charts for the five software bug frequency For a matrix input in barchart, the gcolor option gets the same color each column. Note that though the class of Bug_Metrics_Software is both xtabs and table, the class of Bug_Metrics_Software[,,1] is a matrix, and hence we create a dot chart of it. This means that the R code dotchart(Bug_Metrics_Software) leads to errors! The dot chart is able to display the bug frequency in a better way as compared to the bar chart. What just happened? Two different ways of obtaining the dot plot were seen, and a host of other options were also clearly indicated in the current section. [ 76 ] Chapter 3 Spine and mosaic plots In the bar plot, the length (height) of the bar varies, while the width for each bar is kept the same. In a spine/mosaic plot the height is kept constant for the categories and the width varies in accordance with the frequency. The advantages of a spine/mosaic plot becomes apparent when we have frequencies tabulated for several variables via a contingency table. The spline plot is a particular case of the mosaic plot. We first consider an example for understanding the spine plot. Example 3.1.5. Visualizing Shift and Operator Data (Page 487, Ryan, 2007): In a manufacturing factory, operators are rotated across shifts and it is a concern to find out whether the time of shift affects the operator's performance. In this experiment, there are three operators who in a given month work in a particular shift. Over a period of three months, data is collected for the number of nonconforming parts an operator obtains during a given shift. The data is obtained from page 487 of Ryan (2007) and is reproduced in the following table: Operator 1 Operator 2 Operator 3 Shift 1 40 35 28 Shift 2 26 40 22 Shift 3 52 46 49 We will obtain a spine plot towards an understanding of the spread of the number of non-conforming units an operator does during the shifts in the forthcoming action time. Let us ask the following questions: Does the total number of non-conforming parts depend on the operators? Does it depend on the shift? Can we visualize the answers to the preceding questions? 
Time for action – the spine plot for the shift and operator data Spine plots are drawn using the spineplot function. 1. 2. Explore the default examples for the spine plot with example(spineplot). Enter the data for the shift and operator example with: ShiftOperator <- matrix(c(40, 35, 28, 26, 40, 22, 52, 46, 49),nro w=3,dimnames=list(c("Shift 1", "Shift 2", "Shift 3"), c("Operator 1", "Opereator 2", "Operator 3")),byrow=TRUE) 3. Find the number of non-conforming parts of the operators with the colSums function: > colSums(ShiftOperator) Operator 1 Opereator2 Operator 3 118 121 99 [ 77 ] Data Visualization The non-conforming parts for operators 1 and 2 are close enough, and it is lesser by about 20 percent for the third operator. 4. Find the number of non-conforming parts according to the shifts using the rowSums function: > rowSums(ShiftOperator) Shift 1 Shift 2 Shift 3 103 88 147 Shift 3 appears to have about 50 percent more non-conforming parts in comparison with shifts 1 and 2. Let us look out for the spine plot. 5. Obtain the spine plot for the ShiftOperator data with spineplot(ShiftOperator). Now, we will attempt to make the spine plot a bit more interpretable. Under the absence of any external influence, we would expect the shifts and operators to have a near equal number of non-conforming objects. 6. Thus, on the overall x and y axes, we plot lines at approximately the one-thirds and check if we get approximate equal regions/squares. abline(h=0.33,lwd=3,col="red") abline(h=0.67,lwd=3,col="red") abline(v=0.33,lwd=3,col="green") abline(v=0.67,lwd=3,col="green") The output in the graphics device window will be the following screenshot: Figure 7: Spine plot for the Shift Operator problem [ 78 ] Chapter 3 It appears from the partition induced by the red lines that all the operators have a nearly equal number of non-conforming parts. However, the spine chart shows that most of the non-conforming parts occur during Shift 3. What just happened? Data summaries were used to understand the behavior of the problem, and the spine plot helped in clear identification of Shift 3 as a major source of the non-conforming units manufactured. The use of abline function was particularly more insightful for this dataset and needs to be explored whenever there is a scope for it. Spine plot is a special case of the mosaic plot. Friendly (2001) has pioneered the concept of mosaic plots and Chapter 4, Exploratory Analysis, is an excellent reference for the same. For a simple understanding of the construction of mosaic plot, you can go through slides 7-12 at http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture17. pdf. As explained there, suppose that there are three categorical variables, each with two levels. Then, the mosaic plot begins with a square and divides it into two parts with each part having an area proportional to the frequency of the two levels of the first categorical variable. Next, each of the preceding two parts is divided into two parts each according to the predefined frequency of the two levels of the second categorical variable. Note that we now have four divisions of the total area. Finally, each of the four areas are further divided into two more parts, each with an area reflecting the predefined frequency of the two levels of the third categorical variable. Example 3.1.6. The Titanic dataset: In the The table object section in Chapter 2, Import/ Export Data, we came across the Titanic dataset. 
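Before returning to the Titanic data, the construction described in the preceding paragraph can be tried out on a small made-up table. The following is only an illustrative sketch: the array toy and its counts are invented purely to show the three successive splits of the square.

# A hypothetical 2 x 2 x 2 contingency table with made-up counts
toy <- array(c(30, 10, 20, 40, 15, 25, 35, 5),
             dim = c(2, 2, 2),
             dimnames = list(First = c("A1", "A2"),
                             Second = c("B1", "B2"),
                             Third = c("C1", "C2")))
# The square is first split by First, each part is then split by Second,
# and each of those by Third, giving eight regions in all
mosaicplot(toy, col = c("red", "green"), main = "Mosaic plot of a toy table")

The areas of the eight regions are proportional to the eight cell counts, which is exactly the recursive division described above.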
The dataset was seen in two different forms and we also constructed the data from scratch. Let us now continue the example. The main problems in this dataset are the following: The distribution of the passengers by Class, and then the spread of Survived across Class. The distribution of the passengers by Sex and its distribution across the survivors. The distribution by Age followed by the survivors among them. We now want to visualize the distribution of Survived first by Class, then by Sex, and finally by the Age group. Let us see the detailed action. [ 79 ] Data Visualization Time for action – the mosaic plot for the Titanic dataset The goal here is to understand the survival percentages of the Titanic ship with respect to Class of the crew, Sex, and Age. We use first xtabs and prop.table to gain the insight for each of these variables, and then visualize the overall picture using mosaicplot. 1. Get the frequencies of Class for the Titanic dataset with xtabs(Freq~Class,data=Titanic). 2. Obtain the Survived proportions across Class with prop.table( xtabs(Freq ~Class+Survived,data=Titanic),margin=1). 3. Repeat the preceding two steps for Sex: xtabs(Freq~Sex,data=Titanic) and prop.table(xtabs(Freq~Sex+Survived,data=Titanic),margin=1). 4. Repeat this exercise for Age: xtabs(Freq~Age,data=Titanic) and prop.tab le(xtabs(Freq~Age+Survived,data=Titanic), margin=1). 5. Obtain the mosaic plot for the dataset with mosaicplot(Titanic,col=c("red" ,"green")). The entire output is given in the following screenshot: Figure 8: Mosaic plot for the Titanic dataset [ 80 ] Chapter 3 The preceding output shows that the people traveling in higher classes survived better than the lower class ones. The analysis also shows that females were given more priority over males when the rescue system was in action. Finally, it may be seen that children were given priority over adults. The mosaic plot division process proceeds as follows. First, it divides the region into four parts with the regions proportional to the frequencies of each Class; that is, the width of the regions are proportionate to the Class frequencies. Each of the four regions are further divided using the predefined frequencies of the Sex categories. Now, we have eight regions. Next, each of these regions is divided using the predefined frequencies of the Age group leading to 16 distinct regions. Finally, each of the regions is divided into two parts according to the Survived status. The Yes regions of Child for the first two classes are larger than the No regions. The third Class has more non-survivors than survivors, and this appears to be true across Age and Gender. Note that there are no children among the Crew class. The rest of the regions' interpretation is left to the reader. What just happened? A clear demystification of the working of the mosaic plot has been provided. We applied it to the Titanic dataset and saw how it obtains clear regions which enable to deep dive into a categorical problem. Pie charts and the fourfold plot Pie charts are hugely popular among many business analysts. One reason for its popularity is of course its simplicity. That pie chart is easy to interpret is actually not a fact. In fact, the pie chart is seriously discouraged for analysis and observations, refer to the caution of Cleveland and McGill, and also Sarkar (2008), page 57. However, we will still continue an illustration of it. Example 3.1.7. Pie chart for the Bugs Severity problem: Let us obtain the pie chart for the bug severity levels. 
> > > > pie(Severity_Counts[1:5]) title("Severity Counts Post-Release of JDT Software") pie(Severity_Counts[6:10]) title("Severity Counts Pre-Release of JDT Software") [ 81 ] Data Visualization Can you find the drawback of the pie chart? Figure 9: Pie chart for the Before and After Bug counts (output edited) The main drawback of the pie chart stems from the fact that humans have a problem in deciphering relative areas. A common recommendation is the use of a bar chart or a dot chart instead of the pie chart, as the problem of judging relative areas does not exist when comparing linear measures. The fourfold plot is a novel way of visualizing a 2 × 2 × k contingency table. In this method, we obtain k plots for each 2 × 2 frequency table. Here, the cell frequency of each of the four cells is represented by a quarter circle whose radius is proportional to the square root of the frequency. In contrast to the pie chart where the radius is constant and area is varied by the perimeter, the radius in a fourfold plot is varied to represent the cell. [ 82 ] Chapter 3 Example 3.1.8. The fourfold plot for the UCBAdmissions dataset: An in-built R function which generates the required plot is fourfoldplot. The R code and its resultant screenshots are displayed as follows: > fourfoldplot(UCBAdmissions,mfrow=c(2,3),space=0.4) Figure 10: The fourfold plot of the UCBAdmissions dataset In this section, we focused on graphical techniques for categorical data. In many books, the graphical methods begin with tools which are more appropriate for data arising for continuous variables. Such tools have many shortcomings if applied to categorical data. Thus, we have taken a different approach where the categorical data gets the right tools, which it truly deserves. In the next section, we deal with tools which are seemingly more appropriate for data related to continuous variables. [ 83 ] Data Visualization Visualization techniques for continuous variable data Continuous variables have a different structure and hence, we need specialized methods for displaying them. Fortunately, many popular graphical techniques are suited very well for continuous variables. As the continuous variables can arise from different phenomenon, we consider many techniques in this section. The graphical methods discussed in this section may also be considered as a part of the next chapter on exploratory analysis. Boxplot The boxplot is based on five points: minimum, lower quartile, median, upper quartile, and maximum. The median forms the thick line near the middle of the box, and the lower and upper quartiles complete the box. The lower and upper quartiles along with the median, which is the second quartile, divide the data into four regions with each containing equal number of observations. The median is the middle-most value among the data sorted in the increasing (decreasing) order of magnitude. On similar lines, the lower quartile may be interpreted as the median of observations between the minimum and median data values. These concepts are dealt in more detail in Chapter 4, Exploratory Analysis. The boxplot is generally used for two purposes: understanding the data spread and identifying the outliers. For the first purpose, we set the range value at zero, which will extend the whiskers to the extremes at minimum and maximum, and give the overall distribution of the data. If the purpose of boxplot is to identify outliers, we extend the whiskers in a way which accommodate tolerance limits to enable us to capture the outliers. 
Thus, the whiskers are extended 1.5 times the default value, the interquartile range (IQR), which is the difference between the third and first quartiles from the median. The default setting of boxplot is the identification of the outliers. If any point is found beyond the whiskers, such observations may be marked as outliers. The boxplot is also sometimes called a box-and-whisker plot, and it is the whiskers, which are obtained by drawing lines from the box, ends to the minimum and maximum points. We will begin with an example of the boxplot. Example 3.2.1. Example (boxplot): For a quick tutorial on the various options of the boxplot function, the user may run the following code at the R console. Also, the reader is advised to explore the bwplot function from the lattice package. Try example(boxplot) and example(bwplot) from the respective graphics and lattice packages. Example 3.2.2. Boxplot for the resistivity data: Gunst (2002) has 16 independent observations from eight pairs on the resistivity of a wire. There are two processes under which these observations are equally distributed. We would like to see if resistivity of the wires depends on the processes, and which of the processes leads to a higher resistivity. A numerical comparison based on the summary function will be first carried out, and then we will visualize the two processes through boxplot to conclude whether the effects are same, and if not which process leads to higher resistivity. [ 84 ] Chapter 3 Example 3.2.3. The Michelson-Morley experiment: This is a famous physics experiment in the late nineteenth century, which helped in proving the non-existence of ether. If the ether existed, one expects a shift of about 4 percent in the speed of light. The speed of light is measured 20 times in five different experiments. We will use this dataset for two purposes: is the drift of 4 percent evidenced in the data, and setting the whiskers at the extremes. The first one is a statistical issue and the latter is a software setting. For the preceding three examples, we will now read the required data into R, and then look at the necessary summary functions, and finally visualize them using the boxplots. Time for action – using the boxplot Boxplots will be obtained here using the function boxplot from the graphics package as well as bwplot from the lattice package. 1. Check the variety of boxplots with example(boxplot) from the graphics package and example(bwplot) for the variants in the lattice package. 2. The resistivity data from the RSADBE package contains two processes' information which we need to compare. Load in to the current session with data(resistivity). 3. Obtain the summary of the two processes with the following: > summary(resistivity) Process.1 Process.2 Min. 0.138 0.142 1st Qu. 0.140 0.144 Median 0.142 0.146 Mean 0.142 0.146 3rd Qu. 0.143 0.148 Max. 0.146 0.150 Clearly, Process 2 has approximately 0.004 higher resistivity as compared to Process 1 across all the essential summaries. Let us check if the boxplot captures the same. 4. Obtain the boxplot for the two processes with boxplot(resistivity, range=0). The argument range=0 is to ensure that the whiskers are extended to the minimum and maximum data values. The boxplot diagram (left-hand side of the next screenshot) clearly shows that Process.2 has higher resistivity in comparison with Process.1. Next, we will consider the bwplot function from the lattice package. 
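Before doing so, the two purposes of the boxplot described earlier, displaying the overall spread versus flagging outliers, can be contrasted side by side. This is a minimal sketch, assuming the resistivity data frame is still loaded in the session.

# Left: whiskers drawn to the extremes; right: the default outlier rule
par(mfrow = c(1, 2))
boxplot(resistivity, range = 0, main = "range = 0: whiskers at the extremes")
boxplot(resistivity, main = "Default: points beyond the whiskers are flagged")
par(mfrow = c(1, 1))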
A slightly different rearrangement of the resistivity data frame will be required, in that we will specify all the resistivity values in a single column and their corresponding processes in another column.

An important option for boxplots is that of notch, which is especially useful for the comparison of medians. The top and bottom notches for a set of observations are placed at median ± 1.57(IQR)/√n. If the notches of two boxplots do not overlap, it can be concluded that the medians of the groups are significantly different. Such an option can be specified in both the boxplot and bwplot functions.

5. Convert resistivity to another useful form which will help the application of the bwplot function with resistivity2 <- data.frame(rep(names(resistivity),each=8), c(resistivity[,1],resistivity[,2])). Assign variable names to the new data frame with names(resistivity2).

> data(galton)
> names(galton)
[1] "child"  "parent"
> dim(galton)
[1] 928   2
> head(galton)
  child parent
1  61.7   70.5
2  61.7   68.5
3  61.7   65.5
4  61.7   64.5
5  61.7   64.0
6  62.2   67.5
> sapply(galton,range)
     child parent
[1,]  61.7     64
[2,]  73.7     73
> summary(galton)
     child          parent
 Min.   :61.7   Min.   :64.0
 1st Qu.:66.2   1st Qu.:67.5
 Median :68.2   Median :68.5
 Mean   :68.1   Mean   :68.3
 3rd Qu.:70.2   3rd Qu.:69.5
 Max.   :73.7   Max.   :73.0

We need to cover all the 928 observations in the intervals, also known as bins, which need to cover the range of the variable. The natural question is: how does one decide on the number of intervals and the width of these intervals? If the bin width, denoted by h, is known, the number of bins, denoted by k, can be determined by:

k = ⌈(max(x) − min(x)) / h⌉

Here, ⌈ ⌉ denotes the ceiling of the number. Similarly, if the number of bins k is known, the width is determined by h = (max(x) − min(x)) / k. There are many guidelines for these problems. The three options available for the hist function in R are the formulas given by Sturges, Scott, and Freedman-Diaconis, the details of which may be obtained by running ?nclass.Sturges, ?nclass.FD, and ?nclass.scott in the R console. The default setting runs the Sturges option. The Sturges formula for the number of bins is given by:

k = ⌈log2(n) + 1⌉

This formula works well when the underlying distribution is approximately a normal distribution. Scott's normal reference rule for the bin width, using the sample standard deviation σ̂, is:

h = 3.5 σ̂ / n^(1/3)

Finally, the Freedman-Diaconis rule for the bin width is given by:

h = 2 (IQR) / n^(1/3)

In the following, we will construct a few histograms describing the problems through their examples and their R setup in the Time for action – understanding the effectiveness of histograms section.

Example 3.2.4. The default examples: To get a first preview of the generation of histograms, we suggest that the reader go through the built-in examples; try example(hist) and example(histogram).

Example 3.2.5. The Galton dataset: We will obtain histograms for the height of child and parent from the Galton dataset. We will use the Freedman-Diaconis and Sturges choices of bin width.

Example 3.2.6. Octane rating of gasoline blends: An experiment is conducted where the octane rating of gasoline blends can be obtained using two methods. Two samples are available for testing each type of blend, and Snee (1981) obtains 32 different blends over an appropriate spectrum of the target octane ratings. We obtain histograms for the ratings under the two different methods.
Example 3.2.7. Histogram with a dummy dataset: A dummy dataset has been created by the author. Here, we need to obtain histograms for the two samples in the Samplez data from the RSADBE package. Time for action – understanding the effectiveness of histograms Histograms are obtained using the hist and histogram functions. The choice of bin widths is also discussed. 1. Have a buy-in of the R capability of the histograms through example(hist) and example(histogram) for the respective histogram functions from the graphics and lattice packages. 2. 3. Invoke the graphics editor with par(mfrow=c(2,2)). Create the histogram for the height of Child and Parent from the galton dataset seen in the earlier part of the section for the Freedman-Diaconis and Sturges choice of bin widths: par(mfrow=c(2,2)) hist(galton$parent,breaks="FD",xlab="Height of Parent", main="Histogram for Parent Height with Freedman-Diaconis Breaks",xlim=c(60,75)) hist(galton$parent,xlab="Height of Parent",main= "Histogram for Parent Height with Sturges Breaks",xlim=c(60,75)) hist(galton$child,breaks="FD",xlab="Height of Child", main="Histogram for Child Height with Freedman-Diaconis Breaks",xlim=c(60,75)) hist(galton$child,xlab="Height of Child",main="Histogram for Child Height with Sturges Breaks",xlim=c(60,75)) [ 90 ] Chapter 3 Consequently, we get the following screenshot: Figure 13: Histograms for the Galton dataset Note that a few people may not like histograms for the height of parent for the Freedman-Diaconis choice of bin width. 4. For the experiment mentioned in Example 3.2.9. Octane rating of gasoline blends, first load the data into R with data(octane). 5. Invoke the graphics editor for the ratings under the two methods with par(mfrow=c(2,2)). 6. Create the histograms for the ratings under the two methods for the Sturges choice of bin widths with: hist(octane$Method_1,xlab="Ratings Under Method I",main="Histogram of Octane Ratings for Method I",col="mistyrose") hist(octane$Method_2,xlab="Ratings Under Method II",main="Histogram of Octane Ratings for Method II",col=" cornsilk") The resulting histogram plot will be the first row of the next screenshot. A visual inspection suggests that under Method_I, the mean rating is around 90 while under Method_II it is approximately 95. Moreover, the Method_II ratings look more symmetric than the Method_I ratings. [ 91 ] Data Visualization 7. 8. Load the required data here with data(Samplez). Create the histogram for the two samples under the Samplez data frame with: hist(Samplez$Sample_1,xlab="Sample 1",main="Histogram: Sample 1" ,col="magenta") hist(Samplez$Sample_2,xlab="Sample 2",main="Histogram: Sample 2" ,col="snow") We obtain the following histogram plot: Figure 14: Histogram for the Octane and Samples dummy dataset The lack of symmetry is very apparent in the second row display of the preceding screenshot. It is very clear from the preceding screenshot that the left histogram exhibits an example of a positive skewed distribution for Sample_1, while the right histogram for Sample_2 shows that the distribution is a negatively skewed distribution. What just happened? Histograms have traditionally provided a lot of insight into the understanding of the distribution of variables. In this section, we dived deep into the intricacies of its construction, especially related to the options of bin widths. We also saw how the different nature of variables are clearly brought out by their histogram. 
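Before moving on to scatter plots, the three bin-count rules discussed earlier can also be compared numerically. This is a small sketch, assuming the galton data frame is still loaded; it uses the nclass.* helpers that hist itself relies on for breaks = "Sturges", "Scott", and "FD".

# Number of bins suggested by each rule for the child heights
nclass.Sturges(galton$child)
nclass.scott(galton$child)
nclass.FD(galton$child)
# The implied Freedman-Diaconis bin width
diff(range(galton$child)) / nclass.FD(galton$child)
# hist() can be run with plot = FALSE to inspect the breaks it actually chose
h <- hist(galton$child, breaks = "FD", plot = FALSE)
h$breaks
h$counts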
[ 92 ] Chapter 3 Scatter plots In the previous subsection, we used histograms for understanding the nature of the variables. For multiple variables, we need multiple histograms. However, we need different tools for understanding the relationship between two or more variables. A simple, yet effective technique is the scatter plot. When we have two variables, the scatter plot simply draws the two variables across the two axes. The scatter plot is powerful in reflecting the relationship between the variables as in it reveals if there is a linear/nonlinear relationship. If the relationship is linear, we may get an insight if there is a positive or negative relationship among the variables, and so forth. Example 3.2.8. The drain current versus the ground-to-source voltage: Montgomery and Runger (2003) report an article from IEEE (Exercise 11.64) about an experiment where the drain current is measured against the ground-to-source voltage. In the scatter plot, the drain current values are plotted against each level of the ground-to-source voltage. The former value is measured in milliamperes and the latter in volts. The R function plot is used for understanding the relationship. We will soon visualize the relationship between the current values against the level of the ground-to-source voltage. This data is available as DCD in the RSADBE package. The scatter plot is very flexible when we need to understand the relationship between more than two variables. In the next example, we will extend the scatter plot to multiple variables. Example 3.2.9. The Gasoline mileage performance data: The mileage of a car depends on various factors; in fact, it is a very complex problem. In the next table, the various variables x1 to x11 are described which are believed to have an influence on the mileage of the car. We need a plot which explains the inter-relationships between the variables and the mileage. The exercise of repeating the plot function may be done 11 times, though most people may struggle to recollect the influence of the first plot when they are looking at the sixth or maybe the seventh plot. The pairs function returns a matrix of scatter plots, which is really useful. Let us visualize the matrix of scatter plots: > data(Gasoline) > pairs(Gasoline) # Output suppressed It may be seen that this matrix of scatter plots is a symmetric plot in the sense that the upper and lower triangle of this matrix are simply copies of each other (transposed copies actually). We can be more effective in representing the data in the matrix of scatter plots by specifying additional parameters. Even as we study the inter-relationships, it is important to understand the variable distribution itself. Since the diagonal elements are just indicating the name of the variable, we can instead replace them by their histograms. Further, if we can give the measure of the relationship between two variables, say the correlation coefficient, we can be more effective. In fact, we do a step better by displaying the correlation coefficient by increasing the font size according to its stronger value. We first define the necessary functions and then use the pairs function. [ 93 ] Data Visualization Variable Notation Variable Description Variable Notation Variable Description Y Miles/gallon x6 Carburetor (barrels) x1 Displacement (cubic inches) x7 No. 
of transmission speeds x2 Horsepower (foot-pounds) x8 Overall length (inches) x3 Torque (foot-pounds) x9 Width (inches) x4 Compression ratio x10 Weight (pounds) x5 Rear axle ratio x11 Type of transmission (A-automatic, M-manual) Time for action – plot and pairs R functions The scatter plot and its important multivariate extension with pairs will be considered in detail now. 1. Consider the data data(DCD). Use the options xlab and ylab to specify the right labels for the axes. We specify xlim and ylim to get a good overview of the relationship. 2. Obtain the scatter plot for Example 3.2.8. The Drain current versus the groundto-source voltage using plot(DCD$Drain_Current, DCD$GTS_Voltage,t ype="b",xlim=c(1,2.2),ylim=c(0.6,2.4),xlab="Current Drain", ylab="Voltage"). Figure 15: The scatter plot for DCD [ 94 ] Chapter 3 We can easily see from the preceding scatter plot that as the ground-to-source voltage increases, there is an appropriate increase in the drain current. This is an indication of a positive relationship between the two variables. However, the lab assistant now comes to you and says that the measurement error of the instrument has actually led to 15 percent higher recordings of the ground-to-source voltage. Now, instead of dropping the entire diagram, we may simply prefer to add the corrected figures to the existing one. The points option helps us to add the new corrected data points to the figure. 3. Now, first obtain the correct GTS_Voltage readings with DCD$GTS_ Voltage/1.15 and add them to the existing plot with points(DCD$Drain_ Current,DCD$GTS_Voltage/1.15,type="b",col="green"). 4. We first create two functions panel.hist and panel.cor defined as follows: panel.hist<- function(x, ...) { usr<- par("usr"); on.exit(par(usr)) par(usr = c(usr[1:2], 0, 1.5) ) h <- hist(x, plot = FALSE) breaks<- h$breaks; nB<- length(breaks) y <- h$counts; y <- y/max(y) rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...) } panel.cor<- function(x, y, digits=2, prefix="", cex.cor, ...) { usr<- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x,y,use="complete.obs")) txt<- format(c(r, 0.123456789), digits=digits)[1] txt<- paste(prefix, txt, sep="") if(missing(cex.cor)) cex.cor<- 0.8/strwidth(txt) text(0.5, 0.5, txt, cex = cex.cor * r) } The preceding two defined functions are taken from the code of example(pairs). [ 95 ] Data Visualization 5. It is time to put these two functions in to action: pairs(Gasoline,diag.panel=panel.hist,lower.panel=panel. smooth,upper.panel=panel.cor) Figure 16: The pairs plot for the Gasoline dataset In the upper triangle of the display, we can see that the mileage has strong association with the displacement, horsepower, torque, number of transmission speeds, the overall length, width, weight, and the type of transmission. We can say a bit more too. The first three variables x1 to x3 relate to the engine characteristics, and there is a strong association within these three variables. Similarly, there is a strong association between x8 to x10 and together they form the vehicle dimension. Also, we have done a bit more than simply obtaining the scatter plots in the lower triangle of the display. A smooth approximation of the relationship between the variables is provided here. 6. Finally, we resort to the usual trick by looking at the capabilities of the plot and pairs functions with example(plot), example(pairs), and example(xyplot). We have seen how multi-variables can be visualized. In the next subsection, we will explore more about Pareto charts. 
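Before that, as a numeric cross-check of the correlations printed by panel.cor in the upper panel of Figure 16, the plain correlation matrix can be obtained directly. This is a small sketch, assuming the Gasoline data frame is loaded; num_cols is just a helper name used here to drop the non-numeric columns.

# Correlation matrix of the numeric columns of Gasoline, rounded to two decimals
num_cols <- sapply(Gasoline, is.numeric)
round(cor(Gasoline[, num_cols], use = "complete.obs"), 2)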
[ 96 ] Chapter 3 What just happened? Starting with a simple scatter plot and its effectiveness, we went to great lengths for the extension to the pairs function. The pairs function has been greatly explored using the panel.hist and panel.cor functions for truly understanding the relationships between a set of multiple variables. Pareto charts The Pareto rule, also known as the 80-20 rule or the law of vital few, says that approximately 80 percent of the defects are due to 20 percent of the causes. It is important as it can identify 20 percent vital causes whose elimination annihilates 80 percent of the defects. The qcc package contains the function pareto.chart, which helps in generating the Pareto chart. We will give a simple illustration of this chart. The Pareto chart is a display of the cause frequencies along two axes. Suppose that we have 10 causes C1 to C10 which have occurred with defect counts 5, 23, 7, 41, 19, 4, 3, 4, 2, and 1. Causes 2, 4, and 5 have high frequencies (dominating?) and other causes look a bit feeble. Now, let us sort these causes by decreasing the order and obtaining their cumulative frequencies. We will also obtain their cumulative percentages. > > > > > > Cause_Freq <- c(5, 23, 7, 41, 19, 4, 3, 4, 2, 1) names(Cause_Freq) <- paste("C",1:10,sep="") Cause_Freq_Dec <- sort(Cause_Freq,dec=TRUE) Cause_Freq_Cumsum <- cumsum(Cause_Freq_Dec) Cause_Freq_Cumsum_Perc <- Cause_Freq_Cumsum/sum(Cause_Freq) cbind(Cause_Freq_Dec,Cause_Freq_Cumsum,Cause_Freq_Cumsum_Perc) Cause_Freq_Dec Cause_Freq_Cumsum Cause_Freq_Cumsum_Perc C4 41 41 0.3761 C2 23 64 0.5872 C5 19 83 0.7615 C3 7 90 0.8257 C1 5 95 0.8716 C6 4 99 0.9083 C8 4 103 0.9450 C7 3 106 0.9725 C9 2 108 0.9908 C10 1 109 1.0000 This appears to be a simple trick, and yet it is very effective in revealing that causes 2, 4, and 5 are contributing more than 75 percent of the defects. A Pareto chart completes the preceding table with bar chart in a decreasing count of the causes with a left vertical axis for the frequencies and a cumulative curve on the right vertical axis. We will see the Pareto chart in action for the next example. [ 97 ] Data Visualization Example 3.2.10. The Pareto chart for incomplete applications: A simple step-by-step illustration of Pareto chart is available on the Web at http://personnel.ky.gov/nr/ rdonlyres/d04b5458-97eb-4a02-bde1-99fc31490151/0/paretochart.pdf. The reader can go through the clear steps mentioned in the document. In the example from the precedingly mentioned web document, a bank which issues credit cards rejects application forms if they are deemed incomplete. An application form may be incomplete if information is not provided for one or more of the details sought in the form. For example, an application can't be processed further if the customer/applicant has not provided address, has illegible handwriting, there is a signature missing, or if the customer is an existing credit card holder among other reasons. The concern of the manager of the credit card wing is to ensure that the rejections for incomplete applications should decline, since a cost is incurred on issuing the form which is generally not charged for. The manager wants to focus on certain reasons which may be leading to the rejection of the forms. Here, we consider the frequency of the different causes which lead to the rejection of the applications. >library(qcc) >Reject_Freq = c(9,22,15,40,8) >names(Reject_Freq) = c("No Addr.", "Illegible", "Curr. Customer", "No Sign.", "Other") >Reject_Freq No Addr. 
Illegible Curr.CustomerNo Sign. Other 9 22 15 40 8 >options(digits=2) >pareto.chart(Reject_Freq) Pareto chart analysis for Reject_Freq Frequency Cum.Freq.Percentage Cum.Percent. No Sign. 40.0 40.0 42.6 42.6 Illegible 22.0 62.0 23.4 66.0 Curr. Customer 15.0 77.0 16.0 81.9 No Addr. 9.0 86.0 9.6 91.5 Other 8.0 94.0 8.5 100.0 [ 98 ] Chapter 3 Figure 17: The Pareto chart for incomplete applications In the screenshot given in step 5 of the Time for action – plot and pairs R functions section, the frequency of the five reasons of rejections is represented by bars as in a bar plot with the distinction of being displayed in a decreasing magnitude of frequency. The frequency of the reasons is indicated on the left vertical axis. At the mid-point of each bar, the cumulative frequency up to that reason is indicated, and the reference for this count is the right vertical axis. Thus, we can see that more than 75 percent of the rejections is due to the three causes No Signature, Illegible, and Current Customer. This is the main strength of a Pareto chart. A brief peek at ggplot2 Tufte (2001) and Wilkinson (2005) emphasize a lot on the aesthetics of graphics. There is indeed more to graphics than mere mathematics, and the subtle changes/corrections in a display may lead to an improved, enhanced, and pleasing feeling on the eye diagrams. Wilkinson emphasizes on what he calls grammar of graphics, and an R adaptation of it is given by Wickham (2009). [ 99 ] Data Visualization Thus far, we have used various functions, such as barchart, dotchart, spineplot, fourfoldplot, boxplot, plot, and so on. The grammar of graphics emphasizes that a statistical graphic is a mapping from data to aesthetic attributes of geometric objects. The aesthetics aspect consists of color, shape, and size, while the geometric objects are composed of points, lines, and bars. A detailed discussion of these aspects is unfortunately not feasible in our book, and we will have to settle with a quick introduction to the grammar of graphics. To begin with, we will simply consider the qplot function from the ggplot2 package. Time for action – qplot Here, we will first use the qplot function for obtaining various kinds of plots. To keep the story short, we are using the earlier datasets only and hence, a reproduction of the similar plots using qplot won't be displayed. The reader is encouraged to check ?qplot and its examples. 1. 2. Load the library with library(ggplot2). Rearrange the resistivity dataset in a different format and obtain the boxplots: test <- data.frame(rep(c("R1","R2"),each=8),c(resistivity[,1], resistivity[,2])) names(test) <- c("RES","VALUE") qplot(factor(RES),VALUE,data=test,geom="boxplot") The gplot function needs to be explicitly specified that RES is a factor variable and according to its levels, we need to obtain the boxplot for the resistivity values. 3. For the Gasoline dataset, we would like to obtain a boxplot of the mileage accordingly as the gear system could be manual or automatic. Thus, the qplot can be put to action with qplot(factor(x11),y,data=Gasoline,geom= "boxplot"). 4. A histogram is also one of the geometric aspects of ggplot2, and we next obtain the histogram for the height of child with qplot(child,data=galton,geom=" histogram", binwidth = 2,xlim=c(60,75),xlab="Height of Child", ylab="Frequency"). 5. The scatter plot for the height of parent against child is fetched with qplot(pare nt,child,data=galton,xlab="Height of Parent", ylab="Height of Child", main="Height of Parent Vs Child"). 
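The same displays can also be written in the fuller ggplot syntax that the next action section introduces. The following is only a hedged preview: geom_histogram and geom_point are used here in place of the lower-level layer calls, and it assumes ggplot2 and the galton data frame are loaded.

# ggplot() equivalents of two of the preceding qplot() displays
ggplot(galton, aes(x = child)) +
  geom_histogram(binwidth = 2) +
  xlim(60, 75) + xlab("Height of Child") + ylab("Frequency")
ggplot(galton, aes(x = parent, y = child)) +
  geom_point() +
  xlab("Height of Parent") + ylab("Height of Child") +
  ggtitle("Height of Parent Vs Child")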
[ 100 ] Chapter 3 What just happened? The qplot under the argument of geom allows a good family of graphics under a single function. This is particularly advantageous for us to perform a host of tricks under a single umbrella. Of course, there is the all the more important ggplot function from the ggplot2 library, which is the primary reason for the flexibility of grammar of graphics. We will close the chapter with a very brief exposition to it. The main strength of ggplot stems from the fact that you can build a plot layer by layer. We will illustrate this with a simple example. Time for action – ggplot ggplot, aes, and layer will be put in action to explore the power of the grammar of graphics. 1. 2. 3. Load the library with library(ggplot2). Using the aes and ggplot functions, first create a ggplot object with galton_gg <- ggplot(galton,aes(child,parent)) and find the most recent creation in R by running galton_gg. You will get an error, and the graphics device will show a blank screen. Create a scatter plot by adding a layer to galton_gg with galton_gg quantile(TheWALL$Score) 0% 25% 50% 75% 100% 100.0 111.8 140.0 165.8 270.0 > diff(quantile(TheWALL$Score)) 25% 50% 75% 100% 11.75 28.25 25.75 104.25 [ 105 ] Exploratory Analysis As we are considering Rahul Dravid's centuries only, the beginning point is 100. The median of his centuries is 140.0, where the first and third quartiles are respectively 111.8 and 165.8. The median of the centuries is 140 runs, which can be interpreted as having a 50 percent chance of The Wall reaching 140 runs if he scores a century. The highest ever score of Dravid is of course 270. Interpret the difference between the quantiles. 4. The percentiles of Dravid's centuries can be obtained by using the quantile function again: quantile(TheWALL$Score,seq(0,1,.1)), here seq(0,1,.1) creates a vector which incrementally increases 0.1 beginning with 0 until 1, and the inter-difference between the percentiles with diff(quantile(TheWALL$Score, seq(0,1,.1))): >quantile(TheWALL$Score,seq(0,1,.1)) 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 100.0 103.5 111.0 116.0 128.0 140.0 146.0 154.0 180.0 208.5 270.0 > diff(quantile(TheWALL$Score,seq(0,1,.1))) 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 3.5 7.5 5.0 12.0 12.0 6.0 8.0 26.0 28.5 61.5 The Wall is also known for his resolve of performing well in away Test matches. Let us verify that using the data on the centuries score. 5. The Home and Away number of centuries is obtained using the table function. Further, we obtain a boxplot of the home and away centuries. > table(HA_Ind) HA_Ind Away Home 21 15 The R function table returns frequencies of the various categories for a categorical variable. In fact, it is more versatile in obtaining frequencies for more than one categorical variable. The Wall is also known for his resolve of performing well in away Test matches. This is partly confirmed by the fact that 21 of his 36 centuries came in away Tests. However, the boxplot says otherwise: >boxplot(Score~HA_Ind,data=TheWALL) [ 106 ] Chapter 4 Figure 1: Box plot for Home/Away centuries of The Wall It may be tempting for The Wall's fans to believe that if they remove the outliers of scores above 200, the result may say that his performance of Away Test centuries is better/equal to Home ones. However, this is not the case, which may be verified as follows. 6. Generate the boxplot for centuries whose score is less than 200 with boxplot(Score~HA_Ind,subset=(Score<200),data=TheWALL). 
Figure 2: Box plot for Home/Away centuries of The Wall (less than 200 runs)

What do you conclude from the preceding screenshot?

7. The fivenum summary for the centuries is:
> fivenum(TheWALL$Score)
[1] 100.0 111.5 140.0 169.5 270.0
The fivenum function returns the minimum, lower-hinge, median, upper-hinge, and maximum values for the input data. The numbers 111.5 and 169.5 are the lower and upper hinges, and it may be seen that they are certainly different from the lower and upper quartiles of 111.8 and 165.8 obtained earlier.

Thus far, we have focused on measures of location, so let us now look at some measures of dispersion. The range function in R actually returns the minimum and maximum values of the data frame. Thus, to obtain the range as a measure of spread, we use diff(range()). We use the IQR function to obtain the interquartile range.

8. Using the range, diff, and IQR functions, obtain the spread of Dravid's centuries as follows:
> range(TheWALL$Score)
[1] 100 270
> diff(range(TheWALL$Score))
[1] 170
> IQR(TheWALL$Score)
[1] 54
> IQR(TheWALL$Score[TheWALL$HA_Ind=="Away"])
[1] 36
> IQR(TheWALL$Score[TheWALL$HA_Ind=="Home"])
[1] 63.5
Here, we are extracting the home centuries from Score using the logic of considering only those elements of Score for which HA_Ind is Home.

What just happened?
The data summaries in the EDA framework are slightly different. Here, we first used the quantile function to obtain the quartiles and the deciles (10 percent steps) of a numeric variable. The diff function has been used to find the difference between consecutive elements of a vector. The boxplot function has been used to compare the home and away Test centuries, which led to the conclusion that the median score of Dravid's centuries at home is higher than that of the away centuries. The restriction to Test centuries under 200 runs further confirmed, in particular, that Dravid's centuries at home have a higher median value than those in away series, and in general that the median is robust to outliers. The IQR function gives us the interquartile range for a vector, and the fivenum function gives us the hinges. Though it intuitively appears that hinges and quartiles are similar, this is not always true. In this section, you also learned the usage of functions such as quantile, fivenum, IQR, and so on. We will now move to the main techniques of exploratory analysis.

The stem-and-leaf plot
The stem-and-leaf plot is considered one of the seven important tools of Statistical Process Control (SPC); refer to Montgomery (2005). It is a bit similar in nature to the histogram. The stem-and-leaf plot is an effective method of displaying data in a (partial) tree form. Here, each datum is split into two parts: the stem part and the leaf part. In general, the last digit of a datum forms the leaf part; the rest form the stem. Now, consider a datum 235. If the split criterion is the units place, the stem and leaf parts will respectively be 23 and 5; if it is tens, then 2 and 3; and finally, if it is hundreds, it will be 0 and 2. The left-hand side of the split datum is called the leading digits and the right-hand side the trailing digits. In the next step, all the possible leading digits are arranged in an increasing order. This includes even those stems for which we may not have data for the leaf part, which ensures that the final stem-and-leaf plot truly depicts the distribution of the data. All the possible leading digits are called stems.
The leaves are then displayed to the right-hand side of the stems, and for each stem the leaves are again arranged in an increasing order. Example 4.2.1. A simple illustration: Consider a data of eight elements as 12, 22, 42, 13, 27, 46, 25, and 52. The leading digits for this data are 1, 2, 4, and 5. On inserting 3, the leading digits complete the required stems to be 1 to 5. The leaves for stem 1 are 2 and 3. The unordered leaves for stem 2 are 2, 7, and 5. The display leaves for stem 2 are then 2, 5, and 7. There are no leaves for stem 3. Similarly, the leaves for stems 4 and 5 respectively are the sets {2, 6} and {2} only. The stem function in R will be used for generating the stem-and-leaf plots. Example 4.2.1. Octane Rating of Gasoline Blends: (Continued from the Vizualization techniques for continuous variable data section of Chapter 3, Data Vizualization): In the earlier study, we used the histogram for understanding the octane ratings under two different methods. We will use the stem function in the forthcoming Time for action – the stem function in play section for displaying the octane ratings under Method_1 and Method_2. [ 109 ] Exploratory Analysis Tukey (1977), being the benchmark book for EDA, produces the stem-and-leaf plot in a slightly different style. For example, the stem plots for Method_1 and Method_2 are better understood if we can put both the stem and leaf sides adjacent to each other instead of one below the other. It is possible to achieve this using the stem.leaf.backback function from the aplpack package. Time for action – the stem function in play The R function stem from the base package and stem.leaf.backback from aplpack are fundamental for our purpose to create the stem-and-leaf plots. We will illustrate these two functions for the examples discussed earlier. 1. As mentioned in Example 4.2.1. Octane Rating of Gasoline Blends, first create the x vector, x<-c(12,22,42,13,27,46,25,52). 2. Obtain the stem-and-leaf plot with: > stem(x) The decimal point is 1 digit(s) to the right of the | 1 | 23 2 | 257 3 | 4 | 26 5 | 2 To obtain the median from the stem display we proceed as follows. Remove one point each from either side of the display. First we remove 52 and 12, and then remove 46 and 13. The trick is to proceed until we are left with either one point, or two. In the former case, the remaining point is the median, and in the latter case, we simply take the average. Finally, we are left with 25 and 27 and hence, their average 26 is the median of x. We will now look at the octane dataset. 3. Obtain the stem plots for both the methods: data(octane), stem(octane$Method_1,scale=2) and stem(octane$Method_2, scale=2). [ 110 ] Chapter 4 The output will be similar to the following screenshot: Figure 3: The stem plot for the octane dataset (R output edited) Of course, the preceding screenshot has been edited. To generate such a back-to-back display, we need a different function. 4. Using the stem.leaf.backback function from aplpack and the code library(aplpack) and stem.leaf.backback(Method_1, Method_2,back. to.back=FALSE, m=5), we get the output in the desired format. Figure 4: Tukey's stem plot for octane data [ 111 ] Exploratory Analysis The preceding screenshot has many unexplained, and a bit mysterious, symbols! Prof. J. W. Tukey has taken a very pragmatic approach when developing EDA. You are is strongly suggested to read Tukey (1977), as this brief chapter barely does justice to it. Note that 18 of the 32 observations for Method_1 are in the range 80.4 to 89.3. 
Now, if we have stems as 8, 9, and 10, the spread at stem 8 will be 18, which will not give a meaningful understanding of the data. The stems can have substems, or be stretched out, and for which a very novel solution is provided by Prof. Tukey. For very high frequency stems, the solution is to squeeze out five more stems. For the stem 8 here, we have the trailing digits at 0, 1, 2, …, 9. Now, adopt a scheme of tagging lines which leads towards a clear reading of the stem-and-leaf display. Tukey suggests to use * for zero and one, t for two and three, f for four and five, s for six and seven, and a period (.) for eight and nine. Truly ingenious! Thus, if you are planning to write about stem-and-leaf in your local language, you may not require *, t, f, s, .! Go back to the preceding screenshot and now it looks much more beautiful. Following the leaf part for each method, we are given cumulative frequencies from the top and the bottom too. Why? Now, we know that the stem-and-leaf display has increasing values from the top and decreasing values from the bottom. In this particular example, we have n: 32 observations. Thus, in a sorted order, we know that the median is a value between the sixteenth and seventeenth sorted observation. The cumulative frequencies when exceeds 16 from either direction, lead to the median. This is indicated by (2) for Method_1 and (6) for Method_2. Can you now make an approximate guess of the median values? Obviously, depending on the dataset, we may require m = 5, 1, or 2. We have used the argument back.to.back=FALSE to ensure that the two stem-and-leafs can be seen independently. Now, it is fairly easy to compare these two displays by setting back.to.back=TRUE, in which case the stem line will be common for both the methods and thus, we can simply compare their leaf distributions. That is, you need to run stem. leaf.backback(octane$Method_1,octane$Method_2,back.to.back=TRUE, m=5) and investigate the results. We can clearly see that the median for Method_2 is higher than that of Method_1. What just happened? Using the basic stem function and stem.leaf.backback from the aplpack, we got two efficient exploratory displays of the datasets. The latter function can be used to compare two stem-and-leaf displays. Stems can be further squeezed to reveal more information with the options of m as 1, 2, and 5. We will next look at the EDA technique which extends the scope of hinges. [ 112 ] Chapter 4 Letter values The median, quartiles, and the extremes (maximum and minimum) indicate how the data is spread over the range of the data. These values can be used to examine two or more samples. There is another way of understanding the data offered by letter values. This small journey begins with the use of a concept called depth, which measures the minimum position of the datum in the ordered sample from either of the extremes. Thus, the extremes have a depth of 1, the second largest and smallest datum have a depth of 2, and so on. Now, consider a sample data of size n, assumed to be an odd number for convenience sake. Then, the depth of the median is (n + 1)/2. The depth of a datum is denoted by d, and for the median it is indicated by d(M). Since the hinges, lower and upper, do the same to the divided samples (by median), the depth of the hinges is given by d ( H ) = ( d ( M ) + 1) / 2 . Here, [ ] denotes the integer part of the argument. 
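These depth formulas are easy to verify numerically. The following is a small sketch for a sample of size n = 32, the size of the octane data used shortly, with floor standing in for the integer-part brackets; the eighths appearing in the next paragraph follow the same halving pattern.

# Depth bookkeeping for a sample of size n = 32
n <- 32
d_M <- (n + 1) / 2            # depth of the median: 16.5
d_H <- (floor(d_M) + 1) / 2   # depth of the hinges: 8.5
d_E <- (floor(d_H) + 1) / 2   # depth of the eighths (defined next): 4.5
c(median = d_M, hinge = d_H, eighth = d_E)

These values agree with the depth column reported by lval for the octane data in the next example.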
As hinges, including the mid-hinge which is the median, divide the data into four equal parts, we can define eights as the values which divide the data into eight equal parts. The eights are denoted by E. The depth of eights is given by the formula d ( E ) = ( d ( H ) + 1) / 2 . It may be seen that the depth of the median, hinges, and eights of the datum depends on the sample size. Using the eights, we can further carry out the division of the data for obtaining the sixteenths, and then thirty seconds, and so on. The process of division should continue till we end up with the extremes where we cannot further proceed with the division any longer. The letter values continue the search until we end at the extremes. The process of the division is well understood when we augment the lower and upper values for the hinges, the eights, the sixteenths, the thirty seconds, and so on. The difference between the lower and upper values of these metrics, concept similar to mid-range, is also useful for understanding the data. The R function lval from the LearnEDA package gives the letter values for the data. Example 4.3.1. Octane Rating of Gasoline Blends (Continued): We will now obtain the letter values for the octane dataset: >library(LearnEDA) >lval(octane$Method_1) depth lo hi mids spreads M 16.5 88.00 88.0 88.000 0.00 H 8.5 85.55 91.4 88.475 5.85 E 4.5 84.25 94.2 89.225 9.95 D 2.5 82.25 101.5 91.875 19.25 C 1.0 80.40 105.5 92.950 25.10 > lval(octane$Method_2) depth lo hi mids spreads M 16.5 97.20 97.20 97.200 0.00 H 8.5 93.75 99.60 96.675 5.85 E 4.5 90.75 101.60 96.175 10.85 D 2.5 87.35 104.65 96.000 17.30 C 1.0 83.30 106.60 94.950 23.30 [ 113 ] Exploratory Analysis The letter values, look at the lo and hi in the preceding code, clearly show that the corresponding values for Method_1 are always lower than those under Method_2. Particularly, note that the lower hinge of Method_2 is greater than the higher hinge of Method_1. However, the spread under both the methods are very identical. Data re-expression The presence of an outlier or overspread of the data may lead to an incomplete picture of the graphical display and hence, statistical inference may be inappropriate in these scenarios. In many such scenarios, re-expression of the data on another scale may be more useful, refer to Chapter 3, The Problem of Scale, Tukey (1977). Here, we list the scenarios from Tukey where the data re-expression may help circumvent the limitations cited in the beginning. The first scenario where re-expression is useful is when the variables assume non-negative values, that is, the variables never assume a value lesser than zero. Examples of such variables are age, height, power, area, and so on. A thumb rule for the application of re-expression is when the ratio of the largest to the smallest value in the data is very large, say 100. This is one reason that most regression analysis variables such as age are almost always re-expressed on the logarithmic scale. The second scenario explained by Tukey is about variables like balance and profit-and-loss. If there is a deposit to an account, the balance increases, and if there is a withdrawal it decreases. Thus, the variables can assume positive as well as negative values. Since re-expression of these variables like balance rarely helps; re-expression of the amount or quantity before subtraction helps on some occasions. Fraction and percentage counts form the third scenario where re-expression of the data is useful, though you need special techniques. 
The scenarios mentioned are indicative and not exhaustive. We will now look at the data re-expression techniques which are useful. Example 4.4.1. Re-expression for the Power of 62 Hydroelectric Stations: We need to understand the distribution of the ultimate power in megawatts of 62 hydroelectric stations and power stations of the Corps of Engineers. The data for our illustration has actually been regenerated from the Exhibit 3 of Tukey (1977). First, we simply look at the stem-and-leaf display of the original data on the power of 62 hydroelectric stations. We use the stem. leaf function from the aplpack package. > hydroelectric <- c(14,18,28,26,36,30,30,34,30,43,45,54,52,60,68, + 68,61,75,76,70,76,86,90,96,100,100,100,100,100,100,110,112, + 118,110,124,130,135,135,130,175,165,140,250,280,204,200,270, + 40,320,330,468,400,518,540,595,600,810,810,1728,1400,1743,2700) >stem.leaf(hydroelectric,unit=1) 1 | 2: represents 12 leaf unit: 1 [ 114 ] Chapter 4 n: 62 2 4 9 … 24 30 (4) 28 27 23 22 21 20 1 | 48 2 | 68 3 | 00046 9 10 11 12 13 14 15 16 17 18 19 20 21 24 | | | | | | | | | | | | | | 06 000000 0028 4 0055 0 5 5 04 … 7 60 | 0 HI: 810 810 1400 1728 1743 2700 The data begins with values as low as 14, and grows modestly to hundreds, such as 100, 135, and so on. Further, the data grows to five hundreds and then literally explodes into thousands running up to 2700. If all the leading digits must be mandatorily displayed, we have 270 leading digits. With an average of 35 lines per page, the output requires approximately eight pages, and between the last two values of 1743 and 2700, we will have roughly 100 empty leading digits. The stem.leaf function has already removed all the leading digits after the hydroelectric plant producing 600 megawatts. Let us look at the ratio of the largest to smallest value, which is as follows: >max(hydroelectric)/min(hydroelectric) [1] 192.8571 By the thumb rule, it is an indication that a data re-expression is in order. Thus, we take the log transformation (with base 10) and obtain the stem-and-leaf display for the transformed data. >stem.leaf(round(log(hydroelectric,10),2),unit=0.01) 1 | 2: represents 0.12 leaf unit: 0.01 n: 62 [ 115 ] Exploratory Analysis 1 2 … 24 (11) 27 22 20 … 11 | 5 12 | 6 13 | 19 20 21 22 23 | | | | | 358 00000044579 11335 24 110 30 | 4 31 | 5 3 32 | 44 HI: 3.43 The compactness of the stem-and-leaf display for the transformed data is indeed more useful, and we can further see that the leading digits are just about 30. Also, the display is more elegant and comprehensible. Have a go hero The stem-and-leaf plot is considered a particular case of the histogram from a certain point of view. You can attempt to understand the hydroelectric distribution using histogram too. First, obtain the histogram of the hydroelectric variable, and then repeat the exercise on its logarithmic re-expression. Bagplot – a bivariate boxplot In Chapter 3, Data Visualization, we saw the effectiveness of boxplot. For independent variables, we can simply draw separate boxplots for the variables and visualize the distribution. However, when there is dependency between two variables, distinct boxplot loses the dependency among the two variables. Thus, we need to see if there is a way to visualize the data through a boxplot. The answer to the question is provided by bagplot or bivariate boxplot. The bagplot characteristic is described in the following steps: The depth median, denoted by * in the bagplot, is the point with the highest half space depth. 
[ 116 ] Chapter 4 The depth median is surrounded by a polygon, called bag, which covers n/2 observations with the largest depth. The bag is then magnified by a factor of 3 which gives the fence. The fence is not plotted since it will drive the attention away from the data. The observations between the bag and fence are covered by a loop. Points outside the fence are flagged as outliers. For technical details of the bagplot, refer to the paper (http://venus.unive.it/ romanaz/ada2/bagplot.pdf) by Rousseeuw, Ruts, and Tukey (1999). Example 4.5.1. Bagplot for the Gasoline mileage problem: The pairs plot of the gasoline mileage problem in Example 3.2.9. Octane rating of gasoline blends gave a good insight in to understanding the nature of the data. Now, we will modify that plot and replace the upper panel with the bagplot function for a cleaner comparison of the bagplot with the scatter plot. However, in the original dataset, variables x4, x5, and x11 are factors that we remove from the bagplot study. The bagplot function is available in the aplpack package. We first define panel.bagplot, and then generate the matrix of the scatter plot with the bagplots produced in the upper matrix display. Time for action – the bagplot display for a multivariate dataset 1. 2. 3. Load the aplpack package with library(aplpack). Check the default examples of the bagplot function with example(bagplot). Create the panel.bagplot function with: panel.bagplot <- function(x,y) { require(aplpack) bagplot(x,y,verbose=FALSE,create.plot = TRUE,add=TRUE) } Here, the panel.bagplot function is defined to enable us to obtain the bagplot for the upper panel region of the pairs function. [ 117 ] Exploratory Analysis 4. Apply the panel.bagplot function within the pairs function on the Gasoline dataset: pairs(Gasoline[-19,-c(1,4,5,13)],upper.panel =panel. bagplot). We obtain the following display: Figure 5: Bagplot for the Gasoline dataset What just happened? We created the panel.bagplot function and augmented it in the pairs function for effective display of the multivariate dataset. The bagplot is an important EDA tool towards getting exploratory insight in the important case of multivariate dataset. The resistant line In Example 3.2.3. The Michelson-Morley experiment of Chapter 3, Data Visualization, we visualized data through the scatter plot which indicates possible relationships between the dependent variable (y) and independent variable (x). The scatter plot or the x-y plot is again an EDA technique. However, we would like a more quantitative model which explains the interrelationship in a more precise manner. The traditional approach would be taken in Chapter 7, The Logistic Regression Model. In this section, we would take an EDA approach for building our first regression model. [ 118 ] Chapter 4 Consider a pair of n observations: ( x1 , y1 ) ,( x2 , y2 ) ,...,( xn , yn ) . We can easily visualize the data using the scatter plot. We need to obtain a model of the form y = a + bx , where a is the intercept term while b is the slope. This model is an attempt to explain the relationship between the variables x and y. Basically, we need to obtain the values of the slope and intercept from the data. In most real data, a single line will not pass through all the n pairs of observations. In fact, it is even a difficult task for the determined line to pass through even a very few observations. As a simple task, we may choose any two observations and determine the slope and intercept. 
However, the difficulty lies in the choice of the two points. We will now explain how the resistant line determines the two required terms. The scatter plot (part A of the following figure) is divided into three regions using the x-values, with each region containing approximately the same number of data points; refer to part B of the figure. The three regions, from the left-hand to the right-hand side, are called the lower, middle, and upper regions. Note that the y-values are distributed among the three regions according to their x-values. Hence, some y-values of the lower region may well be higher than a few values in the upper region. Within each region, we find the medians of the x- and y-values independently. That is, for the lower region, the median $y_L$ is determined by the y-values falling in this region, and similarly the median $x_L$ is determined by the x-values of the region. In the same way, the medians $x_M$, $x_H$, $y_M$, and $y_H$ are determined; refer to part C of the figure. Using these median values, we now form three pairs: $(x_L, y_L)$, $(x_M, y_M)$, and $(x_H, y_H)$. Note that these pairs need not be actual data points. To determine the slope b, two points suffice. The resistant line determines the slope using the two pairs $(x_L, y_L)$ and $(x_H, y_H)$. Thus, we obtain the following:

$$b = \frac{y_H - y_L}{x_H - x_L}$$

For obtaining the intercept a, we use all three pairs of medians. The value of a is determined using:

$$a = \frac{1}{3}\left[(y_L + y_M + y_H) - b\,(x_L + x_M + x_H)\right]$$

Note that it is exactly the properties of the median that make these solutions resistant. For example, the lower and upper medians are not affected by outliers at the extreme ends.

Figure 6: Understanding the resistant line. Panel A shows the x-y scatter plot; panel B divides it into three portions by the x-values; panel C marks the medians of the x- and y-values within each region; panel D shows how the a and b values are obtained from the three median pairs.

We will use the rline function from the LearnEDA package. Example 4.6.1. Resistant line for the IO-CPU time: The CPU time is known to depend on the number of IO processes running at any given point of time. A simple dataset is available at http://www.cs.gmu.edu/~menasce/cs700/files/SimpleRegression.pdf. We aim to fit a resistant line to this dataset. Time for action – the resistant line as a first regression model We use the rline function from the LearnEDA package for fitting the resistant line on a dataset. 1. Load the LearnEDA package: library(LearnEDA). 2. Understand the default example with example(rline). 3. Load the dataset: data(IO_Time). 4. Create the IO_rline resistant line, with CPU_Time as the output and No_of_IO as the input, using IO_rline <- rline(IO_Time$No_of_IO, IO_Time$CPU_Time, iter=10) for 10 iterations. 5. Find the intercept and slope with IO_rline$a and IO_rline$b. The output will then be:
> IO_rline$a
[1] 0.2707
> IO_rline$b
[1] 0.03913
6. Obtain the scatter plot of CPU_Time against No_of_IO with plot(IO_Time$No_of_IO, IO_Time$CPU_Time). 7. Add the resistant line to the generated scatter plot with abline(a=IO_rline$a, b=IO_rline$b). 8. Finally, give a title to the plot: title("Resistant Line for the IO-CPU Time"). The individual steps are consolidated in the short sketch that follows.
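The following sketch simply strings the preceding steps together so that they can be run in one go. It assumes, as in the action list, that the LearnEDA package is installed and that the IO_Time data frame with the columns No_of_IO and CPU_Time is available in the session.
# Consolidation of steps 1 to 8; LearnEDA provides rline()
library(LearnEDA)
data(IO_Time)                                  # the IO-CPU time dataset used above
IO_rline <- rline(IO_Time$No_of_IO, IO_Time$CPU_Time, iter = 10)
IO_rline$a; IO_rline$b                         # median-driven intercept and slope
plot(IO_Time$No_of_IO, IO_Time$CPU_Time)       # the x-y scatter plot
abline(a = IO_rline$a, b = IO_rline$b)         # overlay the resistant line
title("Resistant Line for the IO-CPU Time")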
We then get the following plot: Figure 7: Resistant line for CPU_Time What just happened? The rline function from the LearnEDA package fits a resistant line given the input and output vectors. It calculates the slope and intercept terms, which are driven by medians. The main advantage of the rline fit is that the model is not susceptible to outliers. We can see from the preceding plot that the resistant line model, IO_rline, provides a very good fit for the dataset. Well! You have created your first exploratory regression model. Smoothing data In The resistant line section, we constructed our first regression model for the relationship between two variables. In some instances, the x-values are so systematic that their values are almost redundant, and yet we need to understand the behavior of the y-values with respect to them. Consider the case where the x-values are equally spaced; the share price (y) at the end of each day (x) is an example where the difference between two consecutive x-values is exactly one. Here, we are more interested in smoothing the data along the y-values, as one expects more variation in their direction. Time series data is a very good example of this type. In time series data, we typically have $x_{n+1} = x_n + 1$, and hence we can precisely define the data in the compact form $y_t$, $t = 1, 2, \ldots$. The general model may then be specified by $y_t = a + bt + \epsilon_t$, $t = 1, 2, \ldots$. In the standard EDA notation, this is simply expressed as: data = fit + residual In the context of time series data, the model is succinctly expressed as: data = smooth + rough The fundamental concept of data smoothing makes use of running medians. In a freehand curve, we can simply draw a smooth curve using our judgment, ignoring the out-of-curve points, and complete the picture. A computer finds this task difficult unless it is given specific instructions for obtaining the smooth points across which it needs to draw a curve. For a sequence of points, such as the sequence $y_t$, the smoothing needs to be carried out over a sequence of overlapping segments. Such segments are predefined to be of a specific length. As a simple example, we may have a three-length overlapping segment sequence in $\{y_1, y_2, y_3\}$, $\{y_2, y_3, y_4\}$, $\{y_3, y_4, y_5\}$, and so on. Four-length or five-length overlapping segment sequences may be defined on similar lines as required. It is within each segment that smoothing needs to be carried out. Two popular summary choices are the mean and the median. Of course, in exploratory analysis our natural choice is the median. Note that the median of the segment $\{y_1, y_2, y_3\}$ is always one of the values $y_1$, $y_2$, or $y_3$. General smoothing techniques, such as LOESS, are nonparametric and require good expertise in the subject. The ideas discussed here are mainly driven by the median as the core technique. A moving median of span 3 cannot correct two or more consecutive outliers, and similarly a moving median of span 5 cannot correct three or more consecutive outliers, and so on. A solution, or workaround in an engineer's language, is to continue smoothing the sequence obtained in the previous iteration until there is no further change in the smooth part. We may also consider a moving median of span 4. Here, the median is the average of the two mid-points. However, considering that the x-values are integers, the four-moving median does not correspond to any of the time points t. A small sketch of running medians in R is given next, before we address this re-centering.
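To make the idea of a running median concrete, here is a minimal sketch on a made-up sequence; the vector y below is purely illustrative, and runmed from the base stats package is used since it handles odd spans such as 3 and 5.
# A toy sequence with a single outlier at position 5
y <- c(2, 3, 4, 3, 15, 4, 5, 4, 6, 5)
# Span-3 running median: each interior point is replaced by the median of
# itself and its two neighbours, so the lone outlier gets smoothed away
runmed(y, k = 3)
# The same idea written out explicitly for the interior points
sapply(2:(length(y) - 1), function(i) median(y[(i - 1):(i + 1)]))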
Using the simplicity principle, it is easily possible to re-center the points at t by taking a two-moving median of the values obtained in the step of the four-moving median. A notation for the first iteration in EDA is simply the number 3, or 5 and 7 as used. The notation for repeated smoothing is denoted by 3R where R stands for repetitions. For a four-moving median re-centered by a two-moving median, the notation will be 42. On many occasions, a smoother operation giving more refinement than 42 may be desired. It is on such occasion that we may use the running weighted average, which gives different weights to the points under a span. Here, each point is replaced by a weighted average of the neighboring points. A popular choice of weights for a running weighted average of 3 is (¼, ½, ¼), and this smoothing process is referred to as hanning. The hanning process is denoted by H. Since the running median smoothens the data sequence a bit more than appropriate and hence removes any interesting patterns, patterns can be recovered from the residuals which in this context are called rough. This is achieved by smoothing the rough sequence and adding them back to the smooth sequence. This operation is called as reroughing. Velleman and Hoaglin (1984) point out that the smoothing process which performs better in general is 4253H. That is, here we first begin with running median of 4 which is re-centered by 2. The re-smoothing is then done by 5 followed by 3, and the outliers are removed by H. Finally, re-roughing is carried out by smoothing the roughs and then adding them to the smoothed sequence. This full cycle is denoted by 4253H, twice. Unfortunately, we are not aware of any R function or package which implements the 4253H smoother. The options available in the LearnEDA package are 3RSS and 3RSSH. We have not explained what the smoothers 3RSS and 3RSSH are. The 3R smoothing chops off peaks and valleys and leaves behind mesas and dales two points long. What does this mean? Mesa refers to an area of high land with a flat top and two or more steep cliff-like sides, whereas dale refers a valley. To overcome this problem, a special splitting is used at each two-point mesa and dale where the data is split into three pieces: a two-point flat segment, the smooth data to the left of the two points, and the smooth sequence to their right. Now, let yf-1, yf refer to the two-point flat segment, and yf+1 , yf+2 , … refer to the smooth sequence to the left of these two-point flat segments. Then the S technique predicts the value of yf-1 if it were on the straight line formed by yf+1 and yf+2. A simple method is to obtain yf-1 as 3yf+1 – 2yf+2. The yf value is obtained as the median of the predicted yf-1, yf+1, and yf+2. After removing all the mesas and dales, we again repeat the 3R cycle. Thus, we have the notation 3RSS and the reader can now easily connect with what 3RSSH means. Now, we will obtain the 3RSS for the cow temperature dataset of Velleman and Hoaglin. [ 123 ] Exploratory Analysis Example 4.7.1. Smoothing Data for the Cow temperature: The temperature of a cow is measured at 6:30 a.m. for 75 consecutive days. We will use the smooth function from the base package and the han function from the LearnEDA package to achieve the required smoothing sequence. We will build the necessary R program in the forthcoming action list. Time for action – smoothening the cow temperature data First we use the smooth function from the stats package on the cow temperature dataset. 
Next; we will use the han function from LearnEDA. 1. 2. 3. 4. 5. 6. 7. Load the cow temperature data in R by data(CT). Plot the time series data using the plot.ts function: plot.ts(CT$Temperature ,col="red",pch=1). Create a 3RSS object for the cow temperature data using the smooth function and the kind option: CT_3RSS <- smooth(CT$Temperature,kind="3RSS"). Han the preceding 3RSS object using the han function from the LearnEDA package: CT_3RSSH <- han(smooth(CT$Temperature,kind="3RSS")). Impose a line of the 3RSS data points with lines. ts(CT_3RSS,col="blue",pch=2). Impose a line of the hanned 3RSS data points with lines. ts(CT_3RSSH,col="green",pch=3). Add a meaningful legend to the plot: legend(20,90,c("Original","3RSS","3 RSSH"),col=c("red","blue","green"),pch="___"). We get a useful smoothened plot of the cow temperature data as follows: Smoothening cow temperature data [ 124 ] Chapter 4 The original plot shows a lot of variation for the cow temperature measurements. The edges of the 3RSS smoother shows many sharp edges in comparison with the 3RSSH smoother, though it is itself a lot smoother than the original display. The plot further indicates that there has been a lot of decrease in the cow temperature measurements from the fifteenth day of observation. This is confirmed by all the three displays. What just happened? The discussion of the smoothing function looked very promising in the theoretical development. We took a real dataset and saw its time series plot. Then we plotted two versions of the smoothening process and found both to be very smooth over the original plot. Median polish In Example 4.6.1. Resistant line for the IO-CPU time, we had IO as the only independent variable which explained the variations of the CPU time. In many practical problems, the dependent variable depends on more than one independent variable. In such cases, we need to factor the effect of such independent variables using a single model. When we have two independent variables, and median polish helps in building a robust model. A data display in which the rows and columns hold different factors of two variables is called a two-way table. Here, the table entries are values of the independent variables. An appropriate model for the two-way table is given by: yij = a + βi + γ j + εij Here, α is the intercept term, βi denotes the effect of the i-th row, yj the effect of the j-th column, and εij is the error term. All the parameters are unknown. We need to find the unknown parameters through the EDA approach. The basic idea is to use row-medians and column-medians for obtaining the row- and column-effect, and then find the basic intercept term. Any unexplained part of the data is considered as the residual. Time for action – the median polish algorithm The median polish algorithm (refer to http://www-rohan.sdsu.edu/~babailey/ stat696/medpolishalg.html) is given next: 1. Obtain the row medians of the two-way table and upend it to the right-hand side of the data matrix. From each element of every row, subtract the respective row median. [ 125 ] Exploratory Analysis 2. Find the median of the row median and record it as the initial grand effect value. Also, subtract the initial grand effect value from each row median. 3. For the original data columns in the previously upended matrix, obtain the column median and append it with the previous matrix at the bottom. As in step 1, subtract from each column element their corresponding column median. 4. 
For the bottom row of column medians in the previous table, obtain the median, and then add the obtained value to the initial grand effect value. Next, subtract the modified grand effect median value from each of the column medians. 5. Iterate steps 1-4 until the changes in row or column median is insignificant. We use the medpolish function from the stats library for the computations involved in median polish. For more details about the model, you can refer to Chapter 8, Velleman and Hoaglin (1984). Example 4.7.1. Male death rates: The dataset related to the male death rate per 1000 by the cause of death and the average amount of tobacco smoked daily is available on page 221 of Velleman and Hoaglin (1984). Here, the row effect is due to the cause of death, whereas the column constitutes the amount of tobacco smoked (in grams). We are interested in modeling the effect of these two variables on the male death rates in the region. > data(MDR) > MDR2 <- as.matrix(MDR[,2:5]) >rownames(MDR2) <- c("Lung", "UR","Sto","CaR","Prost","Other_ Lung","Pul_TB","CB","RD_Other", "CT","Other_ Cardio","CH","PU","Viol","Other_Dis") > MDR_medpol <- medpolish(MDR2) 1 : 8.38 2 : 8.17 Final: 8.1625 >MDR_medpol$row Lung UR StoCaR Prost Other_ LungPul_TB CB RD_Other CT Other_Cardio 0.1200 -0.4500 -0.2800 -0.0125 -0.3050 0.2050 -0.3900 -0.2050 0.0000 4.0750 1.6875 CH PU Viol Other_Dis [ 126 ] Chapter 4 1.4725 -0.4325 0.0950 >MDR_medpol$col G0 G14 G24 G25 -0.0950 0.0075 -0.0050 0.1350 >MDR_medpol$overall [1] 0.545 >MDR_medpol$residuals G0 G14 G24 Lung -0.5000 -0.2025 2.000000e-01 UR 0.0000 0.0275 0.000000e+00 Sto 0.2400 0.0875 -1.600000e-01 CaR 0.0025 0.0000 -1.575000e-01 Prost 0.4050 0.0125 -1.500000e-02 Other_Lung -0.0150 -0.0375 1.500000e-02 Pul_TB -0.0600 -0.0025 3.000000e-02 CB -0.1250 -0.0575 5.500000e-02 RD_Other 0.2400 -0.0025 1.387779e-17 CT -0.3050 0.0125 -1.500000e-02 Other_Cardio 0.0925 -0.0900 2.425000e-01 CH 0.0875 -0.0850 -1.525000e-01 PU -0.0175 0.0200 5.250000e-02 Viol -0.1250 0.1725 -1.850000e-01 Other_Dis 0.0350 0.2925 -3.500000e-02 0.9650 G25 0.8600 -0.0200 -0.0900 0.0725 -0.0350 0.1350 0.0000 0.2450 -0.2800 1.2350 -0.1175 0.1775 -0.0275 0.1250 -0.0750 What just happened? The output associated with MDR_medpol$row gives the row effect, while MDR_medpol$col gives the column effect. The negative value of -0.0950 for the non-consumers of tobacco shows that the male death rate is lesser for this group, whereas the positive values of 0.0075 and 0.1350 for the group under 14 grams and above 25 grams respectively is an indication that tobacco consumers are more prone to death. Have a go hero For the variables G0 and G25 in the MDR2 matrix object, obtain a back-to-back stem-leaf display. [ 127 ] Exploratory Analysis Summary Median and its variants form the core measures of EDA and you would have got a hang of it by the first section. The visualization techniques of EDA also compose more than just the stem-and-leaf plot, letter values, and bagplot. As EDA is basically about your attitude and approach, it is important to realize that you can (and should) use any method that is instinctive and appropriate for the data on hand. We have also built our first regression model in the resistant line and seen how robust it is to the outliers. Smoothing data and median polish are also advanced EDA techniques which the reader is acquainted in their respective sections. EDA is exploratory in nature and its findings may need further statistical validations. 
The next chapter on statistical inference addresses which Tukey calls confirmatory analysis. Especially, we look at techniques which give good point estimates of the unknown parameters. This is then backed with further techniques such as goodness-of-fit and confidence intervals for the probability distribution and the parameters respectively. Post the estimation method, it is a requirement to verify whether the parameters meet certain specified levels. This problem is addressed through hypotheses testing in the next chapter. [ 128 ] 5 Statistical Inference In the previous chapter, we came across numerous tools that gave first insights of exploratory evidence into the distribution of datasets through visual techniques as well as quantitative methods. The next step is the translation of these exploratory results to confirmatory ones and the topics of the current chapter pursue this goal. In the Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics, we came across many important families of probability distribution. In practical scenarios, we have data on hand and the goal is to infer about the unknown parameters of the probability distributions. This chapter focuses on one method of inference for the parameters using the maximum likelihood estimator (MLE). Another way of approaching this problem is by fitting a probability distribution for the data. The MLE is a point estimate of the unknown parameter that needs to be supplemented with a range of possible values. This is achieved through confidence intervals. Finally, the chapter concludes with the important topic of hypothesis testing. You will learn the following things after reading through this chapter: Visualizing the likelihood function and identifying the MLE Fitting the most plausible statistical distribution for a dataset Confidence intervals for the estimated parameters Hypothesis testing of the parameters of a statistical distribution Statistical Inference Using exploratory techniques we had our first exposition with the understanding of a dataset. As an example in the octane dataset we found that the median of Method_2 was larger than that of Method_1. As explained in the previous chapter, we need to confirm whatever exploratory findings we had with a dataset. Recall that the histograms and stem-and-leaf displays suggest a normal distribution. A question that arises then is how do we assert the center values, typically the mean, of a normal distribution and how do we conclude that average of the Method_2 procedure exceeds that of Method_1. The former question is answered by estimation techniques and the later with testing hypotheses. This forms the core of Statistical Inference. Maximum likelihood estimator Let us consider the discrete probability distributions as seen in the Discrete distributions section of Chapter 1, Data Characteristics. We saw that a binomial distribution is characterized by the parameters in n and p, the Poisson distribution by λ , and so on. Here, the parameters completely determine the probabilities of the x values. However, when the parameters are unknown, which is the case in almost all practical problems, we collect data for the random experiment and try to infer about the parameters. This is essentially inductive reasoning, and the subject of Statistics is essentially inductive driven as opposed to the deductive reasoning of Mathematics. This forms the core difference between the two beautiful subjects. 
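The contrast between the two directions of reasoning can be seen with a couple of one-line computations in R. With the parameter known we deduce probabilities; with data in hand we turn the question around and ask which parameter value makes the observed data most probable. The numbers below are purely illustrative and anticipate the binomial example used later in this chapter.
# Deduction: with p known to be 0.6, the chance of 7 non-defectives in 10 trials
dbinom(7, size = 10, prob = 0.6)
# Induction: with the observation 7 out of 10 fixed, compare candidate values of p;
# larger values indicate more plausible parameter values
sapply(c(0.3, 0.5, 0.7, 0.9), function(p) dbinom(7, size = 10, prob = p))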
Assume that we have n observations X1, X2,…, Xn from an unknown probability distribution f ( x, θ ) , where θ may be a scalar or a vector whose values are not known. Let us consider a few important definitions that form the core of statistical inference. Random sample: If the observations X1, X2,…, Xn are independent of each other, we say that it forms a random sample from f ( x,θ ) . A technical consequence of the observations forming a random sample is that their joint probability density (mass) function can be written as product of the individual density (mass) function. If the unknown parameter θ is same for all the n observations we say that we have an independent and identical distributed (iid) sample. Let X denote the score of Rahul Dravid in a century innings, and let Xi denote the runs scored in the i th century, i = 1, 2, …, 36. The assumption of independence is then appropriate for all the values of Xi. Consider the problem of the R software installation on 10 different computers of same configuration. Let X denote the time it takes for the software to install. Here, again, it may be easily seen that the installation time on the 10 machines, X1, …, X10, are identical (same configuration of the computers) and independent. We will use the vector notation here to represent a sample of size n, X = (X1, X2,…, Xn) for the random variables, and denote the realized values of random variable with the small case x = (x1, x2,…, xn ) with xi representing the realized value of random variable Xi . All the required tools are now ready, which enable us to define the likelihood function. [ 130 ] Chapter 5 Likelihood function: Let f ( x,θ ) be the joint pmf (or pdf) for an iid sample of n observations of X. Here, the pmf and pdf respectively correspond to the discrete and continuous random variables. The likelihood function is then defined by: L (θ | x ) = f ( x | θ ) = Π in=1 f ( xi | θ ) Of course, the reader may be amused about the difference between a likelihood function and a pmf (or pdf). The pmf is to be seen as a function of x given that the parameters are known, whereas in the likelihood function we look at a function where the parameters are unknown with x being known. This distinction is vital as we are looking for a tool where we do not know the parameters. The likelihood function may be interpreted as the probability function of θ conditioned on the value of x and this is the main reason for identifying that value of θ , say θ , which leads to the maximum of L (θ | x ) , that is, L (θˆ | x ) ≥ L (θ | x ) . Let us visualize the likelihood function for some important families of probability distribution. The importance of visualizing the likelihood function is emphasized in Chapter 7, The Logistic Regression Model and Chapters 1-4 of Pawitan (2001). Visualizing the likelihood function We had seen a few plots of the pmf/pdf in Discrete distributions and Continuous distributions sections of Chapter 1, Data Characteristics. Recall that we were plotting the pmf/pdf over the range of x. In those examples, we had assumed certain values for the parameters of the distributions. For the problems of statistical inference, we typically do not know the parameter values. Thus, the likelihood functions are plotted against the plausible parameter values θ . What does this mean? For example, the pmf for a binomial distribution is plotted for x values ranging from 0 to n. However, the likelihood function needs to be plot against p values ranging over the unit interval [0, 1]. Example 5.1.1. 
The likelihood function of a binomial distribution: A box of electronic chips is known to contain a certain number of defective chips. Suppose we take a random sample of n chips from the box and make a note of the number of non-defective chips. The probability of a non-defective chip is p, and that of a defective one is 1 - p. Let X be a random variable that takes the value 1 if the chip is non-defective and 0 if it is defective. Then X ~ b(1, p), where p is not known. Define $t_x = \sum_{i=1}^{n} x_i$. The likelihood function is then given by:

$$L(p \mid t_x, n) = \binom{n}{t_x} p^{t_x} (1-p)^{n-t_x}$$

Suppose that the observed value of $t_x$ is 7, that is, we have 7 successes out of 10 trials. Now, the purpose of likelihood inference is to understand the probability distribution of p given the data $t_x$. This gives us an idea about the most plausible value of p, and hence it is worthwhile to visualize the likelihood function $L(p \mid t_x, n)$. Example 5.1.2. The likelihood function of a Poisson distribution: The number of accidents at a particular traffic signal of a city, the number of flight arrivals during a specific time interval at an airport, and so on, are some of the scenarios where the assumption of a Poisson distribution is appropriate for explaining the counts. Now let us consider a sample from a Poisson distribution. Suppose that the number of flight arrivals at an airport during an hour follows a Poisson distribution with an unknown rate $\lambda$, and that we have the number of arrivals over ten distinct hours as 1, 2, 2, 1, 0, 2, 3, 1, 2, and 4. Using this data, we need to infer about $\lambda$. Towards this, we will first plot the likelihood function. The likelihood function for a random sample of size n is given by:

$$L(\lambda \mid x) = \frac{e^{-n\lambda}\,\lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}$$

Before we consider the R program for visualizing the likelihood function for the samples from the binomial and Poisson distributions, let us look at the likelihood function for a sample from the normal distribution. Example 5.1.3. The likelihood function of a normal distribution: The CPU_Time variable from IO_Time may be assumed to follow a normal distribution. For this problem, we will simulate n = 25 observations from a normal distribution; for more details about simulation, refer to the next chapter. Though we simulate the n observations with mean 10 and standard deviation 2, we will pretend that we do not actually know the mean value, while assuming that the standard deviation is known to be 2. The likelihood function for a sample from a normal distribution with a known standard deviation is given by:

$$L(\mu \mid x, \sigma) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right\}$$

In our particular example, with $\sigma = 2$, it is:

$$L(\mu \mid x, 2) = \left(\frac{1}{2\sqrt{2\pi}}\right)^{n} \exp\left\{-\frac{1}{8}\sum_{i=1}^{n}(x_i - \mu)^2\right\}$$

It is time for action! Time for action – visualizing the likelihood function We will now visualize the likelihood function for the binomial, Poisson, and normal distributions discussed before: 1. Initialize the graphics window for the three plots using par(mfrow=c(1,3)). 2. Declare the number of trials n and the number of successes x with n <- 10; x <- 7. 3. Set the sequence of p values with p_seq <- seq(0,1,0.01), and for p_seq obtain the probabilities for n = 10 and x = 7 using the dbinom function: dbinom(x=7,size=n,prob=p_seq). 4. Next, obtain the likelihood function plot by running plot(p_seq, dbinom(x=7,size=n,prob=p_seq), xlab="p", ylab="Binomial Likelihood Function", "l"). 5. 6. 7.
Enter the data for the Poisson random sample into R using x <- c(1,2,2,1, 0,2,3,1,2,4) and the number of observations by n <- length(x). Declare the sequence of possible seq(0,5,0.1). λ values through lambda_seq <- Plot the likelihood function for the Poisson distribution with plot( lambda_seq, dpois(x=sum(x),lambda=n*lambda_seq)…). We are generating random observations from a normal distribution using the rnorm function. Each run of the rnorm function results in different values and hence to ensure that you are able to reproduce the exact output as produced here, we will set the initial seed for random generation tool with set.seed(123). 8. For generation of random numbers, fix the seed value with set.seed(123). This is to simply ensure that we obtain the same result. 9. Simulate 25 observations from the normal distribution with mean 10 and standard deviation 2 using n<-25; xn <- rnorm(n,mean=10,sd=2). 10. 11. Consider the following range of µ values mu_seq <- seq(9,11,0.05). Plot the normal likelihood function with plot(mu_seq,dnorm(x= mean(xn), mean=mu_seq,sd=2) ). The detailed code for the preceding action is now provided: # Time for Action: Visualizing the Likelihood Function par(mfrow=c(1,3)) # # Visualizing the Likelihood Function of a Binomial Distribution. n <- 10; x <- 7 p_seq <- seq(0,1,0.01) [ 133 ] Statistical Inference plot(p_seq, dbinom(x=7,size=n,prob=p_seq), xlab="p", ylab="Binomial Likelihood Function", "l") # Visualizing the Likelihood Function of a Poisson Distribution. x <- c(1, 2, 2, 1, 0, 2, 3, 1, 2, 4); n = length(x) lambda_seq <- seq(0,5,0.1) plot(lambda_seq,dpois(x=sum(x),lambda=n*lambda_seq), xlab=expression(lambda),ylab="Poisson Likelihood Function", "l") # Visualizing the Likelihood Function of a Normal Distribution. set.seed(123) n <- 25; xn <- rnorm(n,mean=10,sd=2) mu_seq <- seq(9,11,0.05) plot(mu_seq,dnorm(x=mean(xn),mean=mu_seq,sd=2),"l", xlab=expression(mu),ylab="Normal Likelihood Function") Run the preceding code in your R session. You will find an identical copy of the next plot on your computer screen too. What does the plot tells us? The likelihood function for the binomial distribution has very small values up to 0.4, and then it gradually peaks up to 0.7 and then declines sharply. This means that the values in the neighborhood of 0.7 are more likely to be the true value of p than the points away from it. Similarly, the likelihood function plot for the Poisson distribution says that λ values lesser than 1 and greater than 3 are very unlikely to be the true value of the actual λ . The peak of the likelihood function appears at a value little lesser than 2. The interpretation for the normal likelihood function is left as an exercise to the reader. Figure 1: Some likelihood functions [ 134 ] Chapter 5 What just happened? We took our first step in the problem of estimation of parameters. Visualization of the likelihood function is a very important aspect and is often overlooked in many introductory textbooks. Moreover, and as it is natural, we did it in R! Finding the maximum likelihood estimator The likelihood function plot indicates the plausibility of the data generating mechanism for different values of the parameters. Naturally, the value of the parameter for which the likelihood function has the highest value is the most likely value of the parameter. This forms the crux of maximum likelihood estimation. 
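Before formalizing the idea, a quick numerical check can be made on the binomial likelihood that was just plotted by locating the grid point at which it is largest. This is only a crude grid search over the same p_seq grid used in the preceding action, not the formal estimator developed next.
# Crude grid search over the p values used in the binomial likelihood plot
n <- 10; x <- 7                       # redefined locally for this check
p_seq <- seq(0, 1, 0.01)
lik <- dbinom(x = x, size = n, prob = p_seq)
p_seq[which.max(lik)]                 # the grid point with the largest likelihood
The value returned is 0.7, which is exactly x/n and anticipates the closed-form answer derived next.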
The value of $\theta$ that leads to the maximum value of the likelihood function $L(\theta \mid x)$ is referred to as the maximum likelihood estimate, abbreviated as MLE. For the reader familiar with numerical optimization, it is no surprise that calculus is useful for finding the optimum of a function. However, we will not indulge in more mathematics than is required here; we will only note some finer aspects of numerical optimization. For an independent sample of size n, the likelihood function is a product of n functions, and we may very soon end up in the mathematical world of intractable functions. To a large extent, we can circumvent this problem by taking the logarithm, which transforms the problem of optimizing a product of functions into optimizing a sum of functions. That is, we will focus on optimizing $\log L(\theta \mid x)$ instead of $L(\theta \mid x)$. An important consequence of using the logarithm is that the maximization of a product translates into that of a sum, since log(ab) = log(a) + log(b). It may also be seen that the maximum point of the likelihood function is preserved under the logarithm transformation, since for a > b we have log(a) > log(b). Further, many numerical techniques minimize a function rather than maximize it. Thus, instead of maximizing the log-likelihood function $\log L(\theta \mid x)$, we will minimize $-\log L(\theta \mid x)$ in R. The R package stats4 provides an mle function, which returns the MLE, and there are a host of probability distributions for which it is possible to obtain the MLE this way. We will continue the illustrations for the examples considered earlier in the chapter. Example 5.1.4. Finding the MLE of a binomial distribution (continuation of Example 5.1.1): The negative log-likelihood function of the binomial distribution, sans the constant terms (the combinatorial term is excluded since its value does not depend on p), is given by:

$$-\log L(p \mid t_x, n) = -t_x \log p - (n - t_x)\log(1-p)$$

The maximum likelihood estimator of p, obtained by differentiating the preceding equation with respect to p, equating the result to zero, and solving, is the sample proportion:

$$\hat{p} = \frac{t_x}{n}$$

An estimator of a parameter is denoted by accenting the parameter with a hat. Though this is very easy to compute, we will still resort to the useful function mle. Example 5.1.5. MLE of a Poisson distribution (continuation of Example 5.1.2): The negative log-likelihood function, again up to constants, is given by:

$$-\log L(\lambda \mid x) = n\lambda - \sum_{i=1}^{n} x_i \log \lambda$$

The MLE for $\lambda$ admits a closed form, which can be obtained from calculus arguments, and it is given by:

$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n}$$

To obtain the MLE with mle, we need to write explicit code for the negative log-likelihood function. For the normal distribution, we will also use the mle function. There is another way of finding the MLE besides the mle function available in the stats4 package. We consider it next. Using the fitdistr function In the previous examples, we needed to specify the negative log-likelihood function explicitly. The fitdistr function from the MASS package can be used to obtain the unknown parameters of a probability distribution (for a list of the probability distributions to which it applies, see ?fitdistr), and the fact that it uses maximum likelihood fitting complements our approach in this section. Example 5.1.6.
MLEs for Poisson and normal distributions: In the next action, we will use the fitdistr function from the MASS package for obtaining the MLEs in Example 5.1.2 and Example 5.1.3. In fact, using the function, we get the answers readily without the need to specify the negative log-likelihood explicitly. [ 136 ] Chapter 5 Time for action – finding the MLE using mle and fitdistr functions The mle function from the stats4 package will be used for obtaining the MLE from popular distributions such as binomial, normal, and so on. The fitdistr function will be used too, which fits the distributions using the MLEs. 1. 2. Load the library package CIT with library(stats4). Specify the number of success in a vector format and the number of observations with x<-rep(c(0,1),c(3,7)); n <- length(x). 3. Define the negative log-likelihood function with a function: binomial_nll <- function(prob) -sum(dbinom(x,size=1,prob, log=TRUE)) The code works as follows. The dbinom function is invoked from the stats package and the option log=TRUE is exercised to indicate that we need a log of the probability (actually likelihood) values. The dbinom function returns a vector of probabilities for all the values of x. The sum, multiplied by -1, now returns us the value of negative log-likelihood. 4. Now, enter fit_binom <- mle(binomial_nll,start=list( prob=0.5),nobs=n) on the R console. Now, mle as a function optimizes the binomial_nll function defined in the previous step. Initial values, a guess or a legitimate value are specified for the start option, and we also declare the number of observations available for this problem. 5. summary(fit_binom) will give details of the mle function applied on binomial_nll. The output is displayed in the next screenshot. 6. Specify the data for the Poisson distribution problem in x <- c(1,2,2,1,0, 2,3,1,2,4); n <- length(x). 7. Define the negative log-likelihood function on parallel lines of a binomial distribution: pois_nll <- function(lambda) -sum(dpois(x,lambda,log=TRUE)) 8. Explore different options of the mle function by specifying the method, a guess of the least and most values of the parameter, and the initial value as the median of the observations: fit_poisson <- mle(pois_nll,start=list(lambda=median(x)),nobs=n, method = "Brent", lower = 0, upper = 10) 9. 10. Get the answer by entering summary(fit_poisson). Define the negative log-likelihood function for the normal distribution by: normal_nll <- function(mean) -sum(dnorm(xn,mean,sd=2,log=TRUE)) [ 137 ] Statistical Inference 11. 12. 13. 14. 15. Find the MLE of the normal distribution with fit_normal <- mle( normal_nll ,start=list(mean=8),nobs=n). Get the final answer with summary(fit_normal). Load the MASS package: library(MASS). Fit the x vector with a Poisson distribution by running fitdistr( x,"poisson") in R. Fit the xn vector with a normal distribution by running fitdistr(xn,"normal"). Figure 2: Finding the MLE and related summaries What just happened? You have explored the possibility of finding the MLEs for many standard distributions using mle from the stats4 package and fitdistr from the MASS package. The main key in obtaining the MLE is the right construction of the log-likelihood function. [ 138 ] Chapter 5 Confidence intervals The MLE is a point estimate and as such on its own it is almost not of practical use. It would be more appropriate to give coverage of parameter points, which is most likely to contain the true unknown parameter. 
A general practice is to specify the coverage of the points through an interval and then consider specific intervals that have a specified probability. A formal definition is in order. A confidence interval for a population parameter is an interval that is predicted to contain the parameter with a certain probability. The common choice is to obtain either 95 percent or 99 percent confidence intervals. It is common to specify the coverage of the confidence interval through a significance level $\alpha$ (more about this in the next section), which is a small number close to 0. The 95 percent and 99 percent confidence intervals then correspond to $100(1-\alpha)$ percent intervals with $\alpha$ equal to 0.05 and 0.01 respectively. In general, a $100(1-\alpha)$ percent confidence interval says that if the experiment is performed many times over, we expect the intervals so computed to contain the true parameter about $100(1-\alpha)$ percent of the time. Example 5.2.1. Confidence interval for binomial proportion: Consider n Bernoulli trials $X_1, \ldots, X_n$ with the probability of success being p. We saw earlier that the MLE of p is:

$$\hat{p} = \frac{t_x}{n}, \quad \text{where } t_x = \sum_{i=1}^{n} x_i$$

Theoretically, the expected value of $\hat{p}$ is p and its standard deviation is $\sqrt{p(1-p)/n}$. An estimate of the standard deviation is $\sqrt{\hat{p}(1-\hat{p})/n}$. For large n, and when both np and np(1 - p) are greater than 5, using a normal approximation by virtue of the central limit theorem, a $100(1-\alpha)$ percent confidence interval for p is given by:

$$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. The confidence intervals obtained by using the normal approximation are not reliable when the value of p is near 0 or 1. Thus, if the lower confidence limit falls below 0, or the upper confidence limit exceeds 1, we will adopt the convention of taking them as 0 and 1 respectively. Example 5.2.2. Confidence interval for normal mean with known variance: Consider a random sample of size n from a normal distribution with an unknown mean $\mu$ and a known standard deviation $\sigma$. It may be shown that the MLE of the mean $\mu$ is the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$, and that the distribution of $\bar{X}$ is again normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. Then, the $100(1-\alpha)$ percent confidence interval is given by:

$$\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$

where $z_{\alpha/2}$ is again the upper $\alpha/2$ quantile of the standard normal distribution. The width of the preceding confidence interval is $2 z_{\alpha/2}\sigma/\sqrt{n}$. Thus, when the sample size is increased fourfold, the width decreases by half. Example 5.2.3. Confidence interval for normal mean with unknown variance: We continue with a sample of size n. When the variance is not known, the steps become quite different. Since the variance is not known, we replace it by the sample variance:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

The denominator is n - 1 since we have already estimated $\mu$ using the n observations. To develop the confidence interval for $\mu$, consider the following statistic:

$$T = \frac{\bar{X} - \mu}{\sqrt{S^2/n}}$$

This new statistic T has a t-distribution with n - 1 degrees of freedom. The $100(1-\alpha)$ percent confidence interval for $\mu$ is then given by the following interval:

$$\left(\bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}\right)$$

where $t_{n-1,\alpha/2}$ is the upper $\alpha/2$ quantile of a t random variable with n - 1 degrees of freedom. We will create functions for obtaining the confidence intervals for the preceding three examples. Many statistical tests in R return confidence intervals at the desired levels; a quick preview of this is given next.
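As a quick preview, the t-based interval of Example 5.2.3 is exactly what the t.test function reports through its conf.int component. The sketch below assumes that the simulated vector xn from the earlier likelihood action (generated with set.seed(123)) is still available in the session.
# The t-based confidence intervals for the mean of xn, straight from t.test
t.test(xn, conf.level = 0.95)$conf.int
t.test(xn, conf.level = 0.99)$conf.int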
However, we will be encountering these tests in the last section of the chapter, and hence up to that point, we will contain ourselves to defined functions and applications. [ 140 ] Chapter 5 Time for action – confidence intervals We create functions that will enable us to obtain confidence intervals of the desired size: 1. Create a function for obtaining the confidence intervals of the proportion from a binomial distribution with the following function: binom_CI = function(x, n, alpha) { phat = x/n ll=phat-qnorm(alpha/2,lower.tail=FALSE)*sqrt(phat*(1-phat)/n) ul=phat+qnorm(alpha/2,lower.tail=FALSE)*sqrt(phat*(1-phat)/n) return(paste("The ", 100*(1-alpha),"% Confidence Interval for Binomial Proportion is (", round(ll,4),",", round(ul,4),")", sep='')) } The arguments of the functions are x, n, and alpha. That is, the user of the function needs to specify the x number of success out of the n Bernoulli trials, and the significance level α . First, we obtain the MLE p̂ of the proportion p by calculating phat = x/n. To obtain the value of zα /2 , we use the qnorm quantile function qnorm(alpha/2,lower.tail= FALSE). The quantity pˆ (1 − pˆ ) / n is computed with sqrt(phat*(1-phat)/n). The rest of the code for ll and ul is self-explanatory. We use the paste function to get the output in a convenient format along with the return function. 2. Consider the data in Example 5.2.1 where we have x = 7, and n = 10. Suppose that we require 95 percent and 99 percent confidence intervals. The respective α values for these confidence intervals are 0.05 and 0.01. Let us execute the binom_CI function on this data. That is, we need to run binom_CI(x=7,n=10,alpha =0.05) and binom_CI(x=7,n=10,alpha=0.01) on the R console. The output will be as shown in the next screenshot. Thus, (0.416, 0.984) is the 95 percent confidence interval for p and (0.3267, 1.0733) is the 99 percent confidence interval for it. Since the upper confidence limit exceeds 1, we will use (0.3267, 1) as the 99 percent confidence interval for p. 3. We first give the function for construction of confidence intervals for the mean µ of a normal distribution when the standard deviation is known: normal_CI_ksd = function(x,sigma,alpha) { xbar = mean(x) n = length(x) ll = xbar-qnorm(alpha/2,lower.tail=FALSE)*sigma/sqrt(n) ul = xbar+qnorm(alpha/2,lower.tail=FALSE)*sigma/sqrt(n) return(paste("The ", 100*(1-alpha),"% Confidence Interval for the Normal mean is (", round(ll,4),",",round(ul,4),")",sep='')) } [ 141 ] Statistical Inference The function normal_CI_ksd works differently from the earlier binomial one. Here, we provide the entire data to the function and specify the known value of standard deviation and the significance level. First, we obtain the MLE X of the mean µ with xbar = mean(x). The R code qnorm(alpha/2,lower.tail=FALSE) is used to obtain zα /2 . Next, σ / n is computed by sigma/sqrt(n). The code for ll and ul is straightforward to comprehend. The return and paste have the same purpose as in the previous example. Compile the code for the normal_CI_ksd function. 4. Let us see a few examples, continuation of Example 5.1.3, for obtaining the confidence interval for the mean of a normal distribution with the standard deviation known. To obtain the 95 percent and 99 percent confidence interval for the xn data, where the standard deviation was known to be 2, run normal_CI_ksd(x=xn,sigma=2,alpha=0.05) and normal_CI_ksd(x= xn,sigma=2,alpha=0.01) on the R console. The output is consolidated in the next screenshot. 
Thus, the 95 percent confidence interval for µ is (9.1494, 10.7173) and the 99 percent confidence interval is (8.903, 10.9637). 5. Create a function, normal_CI_uksd, for obtaining the confidence intervals for of a normal distribution when the standard deviation is unknown: µ normal_CI_uksd = function(x,alpha) { xbar = mean(x); s = sd(x) n = length(x) ll = xbar-qt(alpha/2,n-1,lower.tail=FALSE)*s/sqrt(n) ul = xbar+qt(alpha/2,n-1,lower.tail=FALSE)*s/sqrt(n) return(paste("The ", 100*(1-alpha),"% Confidence Interval for the Normal mean is (", round(ll,4),",",round(ul,4),")",sep='')) } We have an additional computation here in comparison with the earlier function. Since the standard deviation is unknown, we estimate it with s = sd(x). Furthermore, we need to obtain the quantile from t-distribution with n - 1 degrees of freedom, and hence we have qt(alpha/2,n-1,lower.tail=FALSE) for the computation of tn −1,α /2. The rest of the details follow the previous function. 6. Let us obtain confidence intervals, 95 percent and 99 percent, for the vector xn under the assumption that the variance is not known. The codes for achieving the results are given in normal_CI_uksd(x=xn,alpha=0.05) and normal_CI_ uksd(x=xn,alpha=0.01). [ 142 ] Chapter 5 Thus, the 95 percent confidence interval for the mean is (9.1518, 10.7419) and the 99 percent confidence interval is (8.8742, 10.9925). Figure 3: Confidence intervals: Some raw programs What just happened? We created special functions for obtaining the confidence intervals and executed them for three different cases. However, our framework is quite generic in nature and with a bit of care and caution, it may be easily extended to other distributions too. [ 143 ] Statistical Inference Hypotheses testing "Best consumed before six months from date of manufacture", "Two years warranty", "Expiry date: June 20, 2015", and so on, are some of the likely assurances that you would have easily come across. An analyst will have to arrive at such statements using the related data. Let us first define a hypothesis. Hypothesis: A hypothesis is an assertion about the unknown parameter of the probability distribution. For the quote of this section, denoting the least time (in months) till which an eatery will not be losing its good taste by θ , the hypothesis of interest will be H 0 : θ ≥ 6 . It is common to denote the hypothesis of interest by H 0 and it called the null hypothesis. We want to test the null hypothesis against the alternative hypothesis that the consumption time is well before the six months' time, which in symbols is denoted by H1 : θ < 6 . We will begin with some important definitions followed by related examples. Test statistic: A statistic that is a function of the random sample is called as test statistic. For an observation X following a binomial distribution b(n, p), the test statistic for p will be X/n, whereas for a random sample from the normal distribution, the test statistic may be 2 n mean X = ∑ i =1 X i / n or the sample variance S 2 = ∑ in=1 ( X i − X ) / ( n − 1) depending on whether the testing problem is for µ or σ 2 . The statistical solution to reject (or not) the null hypothesis depends on the value of test statistic. This leads us to the next definition. Critical region: The set of values of the test statistic which leads to the rejection of the null hypothesis is known as the critical region. We have made various kinds of assumptions for the random experiments. 
Naturally, depending on the type of the probability family, namely binomial, normal, and so on, we will have an appropriate testing tool. Let us look at the very popular tests arising in statistics. Binomial test A binomial random variable X, distribution represented by b(n, p), is characterized by two parameters n and p. Typically, n represents the number of trials and is known in most cases and it is the probability of success p that one is generally interested in the hypotheses related to it. For example, an LCD panel manufacturer would like to test if the number of defectives is at most four percent. The panel manufacturer has randomly inspected 893 LCDs and found 39 to be defective. Here the hypotheses testing problem would be H 0 : p ≤ 0.04 vs H1 : p ≤ 0.04 . [ 144 ] Chapter 5 A doctor would like to test whether the proportion of people in a drought-effected area having a viral infection such as pneumonia is 0.2, that is, H 0 : p = 0.2 vs H1 : p ≠ 0.2 . The drought-effected area may encompass a huge geographical area and as such it becomes really difficult to carry out a census over a very short period of a day or two. Thus the doctor selects the second-eldest member of a family and inspects 119 households for pneumonia. He records that 28 out of 119 inspected people are suffering from pneumonia. Using this information, we need to help the doctor in testing the hypothesis of interest for him. In general, the hypothesis-testing problems for the binomial distribution will be along the lines of H 0 : p ≤ p0 vs H1 : p > p0 , H 0 : p ≥ p0 vs H1 : p > p0 , or H 0 : p = p0 vs H1 : p ≠ p0 . Let us see how the binom.test function in R helps in testing hypotheses problems related to the binomial distribution. Time for action – testing the probability of success We will use the R function binom.test for testing hypotheses problems related to p. This function takes as arguments n number of trials, x number of successes, p as the probability of interest, and alternative as one of greater, less, or greater. 1. Discover the details related to binom.test using ?binom.test, and then run example(binom.test) and ensure that you understand the default example. 2. For the LCD panel manufacturer, we have n = 893 and x = 39. The null hypothesis occurs at p = 0.04. Enter this data first in R with the following code: n_lcd <- 893; x_lcd <- 39; p_lcd <- 0.04 3. The alternative hypothesis is that the proportion of success p is greater than 0.04, which is listed to the binom.test function with the option alternative="greater", and hence the complete binom.test function for the LCD panel is delivered by: binom.test(n=n_lcd,x=x_lcd,p=p_lcd,alternative="greater") The output, following screenshot, shows that the estimated probability of success is 0.04367, which is certainly greater than 0.04. However, the p-value = 0.3103 indicates that we do not have enough evidence in the data to reject the null hypothesis H 0 : p ≤ 0.04 . Note that the binom.test also gives us a 95 percent confidence interval for p as (0.033, 1.000) and since the hypothesized probability lies in this interval we arrive at the same conclusion. This confidence interval is recommended over the one developed in the previous section, and in particular we don't have to worry about the confidence limits either being lesser than 0 or greater than one. Also, you may obtain any confidence interval of your choice 100 (1 − α ) percent CI with the argument conf.int. [ 145 ] Statistical Inference 4. 
For the doctors problem, we have the data as: n_doc <- 119; x_doc <- 28; p_doc <- 0.2 5. We need to test the null hypothesis against a two-sided alternative hypothesis and though this is the default setting of the binom.test, it is a good practice to specify it explicitly, at least until the expertise is felt by the user: binom.test(n=n_doc,x=x_doc,p=p_doc,alternative="two.sided") The estimated probability of success, actually a patient's probability of having the viral infection, is 0.3193. Since the p-value associated with the test is p-value = 0.001888, we reject the null hypothesis that H 0 : p = 0.2 . All the output is given in the following screenshot. The 95 percent confidence interval is (0.2369, 0.4110), again given by the binom.test, which does not contain the hypothesized value of 0.2, and hence we can reject the null hypothesis based on the confidence interval. Figure 4: Binomial tests for probability of success What just happened? Binomial distribution arises in a large number of proportionality test problems. In this section we used the binom.test for testing problems related to the probability of success. We also note that the confidence intervals for p are given as a side-product of the application of the binom.test. The confidence intervals are also given as a by-product of the application of the binom.test. [ 146 ] Chapter 5 Tests of proportions and the chi-square test In Chapter 3, Data Visualization, we came across the Titanic and UCBAdmissions datasets. For the Titanic dataset, we may like to test if the Survived proportion across the Class is same for the two Sex groups. Similarly, for the UCBAdmissions dataset, we may wish to know if the proportion of the Admitted candidates for the Male and Female group is same across the six Dept. Thus, there is a need to generalize the binom.test function to a group of proportions. In this problem, we may have k-proportions and the probability vector is specified by p = ( p1 ,..., pk ) . The hypothesis problem may be specified as testing the null hypothesis H 0 : p = p0 against the alternative hypothesis H1 : p ≠ p0 . Equivalently, in the vector form the problem is testing H 0 : ( p1 ,..., pk ) = ( p01 ,..., p0 k ) against H1 : ( p1 ,..., pk ) ≠ ( p01 ,..., p0 k ). The R extension of binom.test is given in prop.test. Time for action – testing proportions We will use the prop.test R function here for testing the equality of proportions for the count data problems. 1. Load the required dataset with data(UCBAdmissions). For the UCBAdmissions dataset, first obtain the Admitted and Rejected frequencies for both the genders across the six departments with: UCBA.Dept <- ftable(UCBAdmissions, row.vars="Dept", col.vars = c("Gender", "Admit")) 2. Calculate the Admitted proportions for Female across the six departments with: p_female <- prop.table(UCBA.Dept[,3:4],margin=1)[,1] Check p_female! 3. Test whether the proportions across the departments for Male matches with Female using the prop.test: prop.test(UCBA.Dept[,1:2],p=p_female) The proportions are not equal across the Gender as p-value < 2.2e-16 rejects the null hypothesis that they are equal. 4. Next, we want to investigate whether the Male and Female survivors proportions are the same in the Titanic dataset. The approach is similar to the UCBAdmissions problem; run the following code: T.Class <- ftable(Titanic, row.vars="Class", col.vars = c("Sex", "Survived")) [ 147 ] Statistical Inference 5. 
Compute the Female survivor proportions across the four classes with p_female <- prop.table(T.Class[,3:4],margin=1)[,1]. Note that this new variable, p_female, will overwrite the same named variable from the earlier steps. 6. Display p_female and then carry out the comparison across the two genders: prop.test(T.Class[,1:2],p=p_female) The p-value < 2.2e-16 clearly shows that the survivor proportions are not the same across the genders. Figure 5: prop.test in action Indeed, there is more complexity to the two datasets than mere proportions for the two genders. The web page http://www-stat.stanford.edu/~sabatti/ Stat48/UCB.R has detailed analysis of the UCBAdmissions dataset and here we will simply apply the chi-square test to check if the admission percentage within each department is independent of the gender. 7. The data for the admission/rejection for each department is extractable through the third index in the array, that is, UCBAdmissions[,,i] across the six departments. Now, we apply the chisq.test to check if the admission procedure is independent of the gender by running chisq.test( UCBAdmissions[,,i]) six times. The result has been edited in a foreign text editor and then a screenshot of it is provided next. [ 148 ] Chapter 5 It appears that the Dept = A admits more males than females. Figure 6: Chi-square tests for the UCBAdmissions problem What just happened? We used prop.test and chisq.test to test the proportions and independence of attributes. Functions such as ftable and prop.table and arguments such as row. vars, col.vars, and margin were useful to get the data in the right format for the analysies purpose. We will now look at important family of tests for the normal distribution. Tests based on normal distribution – one-sample The normal distribution pops up in many instances of statistical analysis. In fact Whittaker and Robinson have quoted on the popularity of normal distribution as follows: Everybody believes in the exponential law of errors [that is, the normal distribution]: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation. We will not make an attempt to find out whether the experimenters are correct or the mathematicians, well, at least not in this section. [ 149 ] Statistical Inference In general we will be dealing with either one-sample or two-sample tests. In the one-sample problem we have a random sample of size n from N ( µ , σ 2 ) in ( X 1 , X 2 ,..., X n ). The hypotheses testing problem may be related to either or both of the parameters ( µ , σ 2 ). The interesting and most frequent hypotheses testing problems for the normal distribution are listed here: Testing for mean with known variance σ 2 : H 0 : µ < µ0 vs H1 : µ ≥ µ0 H 0 : µ > µ0 vs H1 : µ ≤ µ0 H 0 : µ = µ0 vs H1 : µ ≠ µ0 Testing for mean with unknown variance σ 2 : this is the same set of hypotheses problems as in the preceding point Testing for the variance with unknown mean: H 0 : σ > σ 0 vs H1 : σ ≤ σ 0 H 0 : σ < σ 0 vs H1 : σ ≥ σ 0 H 0 : σ = σ 0 vs H1 : σ ≠ σ 0 In the case of known variance, the hypotheses testing problem for the mean is based on the Z-statistic given by: Z= X − µ0 σ/ n where X = ∑ i =1 X i / n . The test procedure, known as Z-test, for the hypotheses testing problem H 0 : µ < µ0 vs H1 : µ ≥ µ0 is to reject the null hypothesis at α -level of significance H 0 : µ > µ0 if X > zα σ / n + µ0 , where zα is the α percentile of a standard normal distribution. 
For the hypotheses testing problem H 0 : µ > µ0 vs H1 : µ ≤ µ0 , the critical/reject region is X < zα σ / n + µ0 . Finally, for the testing problem of H 0 : µ = µ0 vs H1 : µ ≠ µ0 , we reject the null hypothesis if: n X − µ0 σ/ n ≥ zα /2 [ 150 ] Chapter 5 An R function, z.test, is available in the PASWR package, which carries out the Z-test for each type of the hypotheses testing problem. Now, we consider the case when the variance n 2 σ 2 is not known. In this case, we first find an estimate of the variance using S 2 = ∑ ( X i − X ) / ( n − 1) . i =1 The test procedure is based on the well-known t-statistic: t= X −µ S2 / n The test procedure based on the t-statistic is highly popular as the t-test or student's t-test, and its implementation is there in R with the t.test function in the base package. The distribution of the t-statistic, under the null hypothesis is the t-distribution with (n - 1) degrees of freedom. The rationale behind the application of the t-test for the various types of hypotheses remains the same as the Z-test. For the hypotheses testing problem concerning the variance σ 2 of the normal n distribution, we need to first compute the sample variance using S 2 = ∑ ( X i − X )2 / ( n − 1) i =1 and define the chi-square statistic: χ2 = ( n − 1) S 2 σ 02 Under the null hypothesis, the chi-square statistic is distributed as a chi-square random variable with n - 1 degrees of freedom. In the case of knownnmean, which is seldom the case, the test procedure is based on the test statistic χ = i∑=1( X i − µ ) /σ , which follows a chi-square random variable with n degrees of freedom. For the hypotheses problem H 0 : σ > σ 0 vs H1 : σ ≤ σ 0 , the test procedure is to reject H 0 : σ > σ 0 if X 2 < χ12−α . Similarly, for the hypotheses problem H 0 : σ < σ 0 vs H1 : σ ≥ σ 0 , the procedure is to reject H 0 : σ < σ 0 if X 2 > χα2 ; and finally for the problem H 0 : σ = σ 0 vs H1 : σ ≠ σ 0, the test procedure rejects H 0 : σ = σ 0 if either X 2 < χ12−α /2 or X 2 > χα2 /2 . 2 2 2 0 Test examples. Let us consider some situations when the preceding set of hypotheses arise in a natural way: A certain chemical experiment requires that the solution used as a reactant has a pH level greater than 8.4. It is known that the manufacturing process gives measurements which follow a normal distribution with a standard deviation of 0.05. The ten random observations are 8.30, 8.42, 8.44, 8.32, 8.43, 8.41, 8.42, 8.46, 8.37, and 8.42. Here, the hypotheses testing problem of interest is H 0 : µ > 8.4 vs H1 : µ ≤ 8.4 . This problem is adopted from page 408 of Ross (2010). [ 151 ] Statistical Inference Following a series of complaints that his company's LCD panels never last more than a year, the manufacturer wants to test if his LCD panels indeed fail within a year. Using historical data, he knows the standard deviation of the panel life due to the manufacturing process is two years. A random sample of 15 units from a freshly manufactured lot gives their lifetimes as 13.37, 10.96, 12.06, 13.82, 12.96, 10.47, 10.55, 16.28, 12.94, 11.43, 14.51, 12.63, 13.50, 11.50, and 12.87. You need to help the manufacturer validate his hypothesis. Freund and Wilson (2003). Suppose that the mean weight of peanuts put in jars is required to be 8 oz. The variance of the weights is known to be 0.03, and the observed weights for 16 jars are 8.08, 7.71, 7.89, 7.72, 8.00, 7.90, 7.77, 7.81, 8.33, 7.67, 7.79, 7.79, 7.94, 7.84, 8.17, and 7.87. Here, we are interested in testing H 0 : µ = 8.0 vs H1 : µ ≠ 8.0 . 
New managers have been appointed at the respective places in the preceding bullets. As a consequence the new managers are not aware about the standard deviation for the processes under their control. As an analyst, help them! Suppose that the variance in the first example is not known and that it is a critical requirement that the variance be lesser than 7, that is, the null hypothesis is H 0 : σ 2 < 7 while the alternative is H1 : σ 2 ≥ 7 . Suppose that the variance test needs to be carried out for the third method, 2 2 that is, the hypotheses testing problem is then H 0 : σ = 0.03 vs H1 : σ ≠ 0.03 . We will perform the necessary test for all the problems described before. Time for action – testing one-sample hypotheses We will require R packages PASWR and PairedData here. The R functions such as t.test, z.test, and var.test will be useful for testing one-sample hypotheses problems related to a random sample from normal distributions. 1. 2. Load the library library(PASWR). Enter the data for pH in R: pH_Data <-c(8.30,8.42,8.44,8.32,8.43,8.41,8.42,8.46,8.37,8.42) 3. 4. Specify the known variance of pH in pH_sigma <- 0.05. Use z.test from the PASWR library to test the hypotheses described in the first example with: z.test(x=pH_Data,alternative="less",sigma.x=pH_sigma,mu=8.4) [ 152 ] Chapter 5 The data is specified in the x option, the type of the hypotheses problem is specified by stating the form of the alternative hypothesis, the known variance is fed through the sigma.x option, and finally, the mu option is used to specify the value of the mean under the null hypothesis. The output of the complete R program is collected in the forthcoming two screenshots. The p-value is 0.4748 which means that we do not have enough evidence to reject the null hypothesis H 0 : µ > 8.4 and hence we conclude that mean pH value is above 8.4. 5. Get the data of LCD panel in your session with: LCD_Data <- c(13.37, 10.96, 12.06, 13.82, 12.96, 10.47, 10.55, 16.28, 12.94, 11.43, 14.51, 12.63, 13.50, 11.50, 12.87) 6. Specify the known variance LCD_sigma <- 2 and run the z.test with: z.test(x=LCD_Data,alternative="greater",sigma.x=LCD_sigma,mu=12) The p-value is seen to be 0.1018 and hence we again do not have enough data evidence to reject the null hypothesis that the average mean lifetime of an LCD panel is at least a year. 7. The complete program for third problem can be given as follows: peanuts <- c(8.08, 7.71, 7.89, 7.72, 8.00, 7.90, 7.77, 7.81, 8.33, 7.67, 7.79, 7.79, 7.94, 7.84, 8.17, 7.87) peanuts_sigma <- 0.03 z.test(x=peanuts,sigma.x=peanuts_sigma,mu=8.0) Since the p-value associated with this test 2.2e-16, that is, it is very close to zero, we reject the null hypothesis H 0 : µ = 8.0 . 8. If the variance(s) are not known and a test of the sample means is required, we need to move from the z.test (in the PASWR library) to the t.test (in the base library): t.test(x=pH_Data,alternative="less",mu=8.4) t.test(x=LCD_Data,alternative="greater",mu=12) t.test(x=peanuts,mu=8.0) If the variance is not known, the conclusions for the problems related to pH and peanuts do not change. However, the conclusion changes for the LCD panel problem, and here the null hypothesis is rejected as p-value is 0.06414. [ 153 ] Statistical Inference For the problem of testing variances related to the one-sample problem, my initial idea was to write raw R codes as there did not seem to be a function, package, and so on, which readily gives the answers. 
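Writing such raw code is actually not difficult. The following is a rough sketch of what a hand-made one-sample variance test might look like; the function name my_var_test and its arguments are entirely our own invention, and the example call simply reuses the pH data and the hypothesized variance of 7 from the problems stated earlier.

# A hand-rolled chi-square test for H0: sigma^2 = sigma0_sq (illustrative only)
my_var_test <- function(x, sigma0_sq, alternative = "two.sided") {
  n <- length(x)
  chisq_stat <- (n - 1) * var(x) / sigma0_sq
  p_value <- switch(alternative,
    less      = pchisq(chisq_stat, df = n - 1),
    greater   = pchisq(chisq_stat, df = n - 1, lower.tail = FALSE),
    two.sided = 2 * min(pchisq(chisq_stat, df = n - 1),
                        pchisq(chisq_stat, df = n - 1, lower.tail = FALSE)))
  list(statistic = chisq_stat, df = n - 1, p.value = p_value)
}
my_var_test(x = pH_Data, sigma0_sq = 7, alternative = "greater")

A function along these lines is also a natural starting point for the Have a go hero exercise that follows.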
However, a more appropriate search at google.com revealed that an R package titled PairedData and created by Stephane Champely did certainly have a function, var.test, not to be confused with the same named function in the stats library, which is appropriate for testing problems related to the variance of a normal distribution. The problem is that the routine method of fetching the package using install.packages("PairedData") gives a warning message, namely package 'PairedData' is not available (for R version 2.15.1). This is the classic case of "so near, yet so far…". However, a deeper look into this will lead us to http://cran.r-project.org/src/contrib/ Archive/PairedData/. This web page shows the various versions of the PairedData package. A Linux user should have no problem in using it, though the other OS users can't be helped right away. A Linux user needs to first download one of the zipped file, say PairedData_1.0.0.tar.gz, to a specific directory and with the path of GNOME Terminal in that directory execute R CMD INSTALL PairedData_1.0.0.tar.gz. Now, we are ready to carry out the tests related to the variance of a normal distribution. A Windows user need not be discouraged with this scenario, and the important function var1.test is made available in the RSADBE package of the book. A more recent check on the CRAN website reveals that the PairedData package is again available for all OS platforms since April 18, 2013. Figure 7: z.test and t.test for one-sample problem [ 154 ] Chapter 5 9. 10. Load the required library with library(PairedData). Carry out the two testing problems in the fifth problem with: var.test(x=pH_Data,alternative="greater",ratio=7) var.test(x=peanuts,alternative="two.sided",ratio=0.03) 11. It may be seen from the next screenshot that the data does not lead to rejection of the null hypotheses. For a Windows user, the alternative is to use the var1.test function from the RSADBE package. That is, you need to run: var1.test(x=pH_Data,alternative="greater",ratio=7) var1.test(x=peanuts,alternative="two.sided",ratio=0.03) You'll get same results: Figure 8: var.test from the PairedData library What just happened? The tests z.test, t.test, and var.test (from the PairedData library) have been used for the testing hypotheses problems under varying degrees of problems. [ 155 ] Statistical Inference Have a go hero Consider the testing problem H 0 : σ = σ 0 vs H1 : σ ≠ σ 0 . The test statistic for this hypothesis testing problem is given by: ∑in=1 ( X i − X i ) χ = 2 2 σ 02 which follows a chi-square distribution with n - 1 degrees of freedom. Create your own new function for the testing problems and compare it with the results given by var.test of PairedData package. With the testing problem of parameters of normal distribution in the case of one sample behind us, we will next focus on the important two-sample problem. Tests based on normal distribution – two-sample The two-sample problem has data from two populations where ( X 1 , X 2 ,..., X n ) are n1 observations from N ( µ1 , σ 12 ) and (Y1 , Y2 ,..., Yn ) are n2 observations from N ( µ2 , σ 22 ) . We assume that the samples within each population are independent of each other and further that the samples across the two populations are also independent. Similar to the one-sample problem, we have the following set of recurring and interesting hypotheses testing problems. 
2 Mean comparison with known variances σ 1 and σ 22 : 2 H 0 : µ1 > µ2 vs H1 : µ1 ≤ µ2 H 0 : µ1 < µ2 vs H1 : µ1 ≥ µ2 H 0 : µ1 = µ2 vs H1 : µ1 ≠ µ2 Mean comparison with unknown variances σ 12 and σ 22 : the same set of hypotheses problems as before. We make an additional assumption here that the variances σ 12 and σ 22 are assumed to be equal, though unknown. The variances comparison: H 0 : σ 1 > σ 2 vs H1 : σ 1 ≤ σ 2 H 0 : σ 1 < σ 2 vs H1 : σ 1 ≥ σ 2 H 0 : σ 1 = σ 2 vs H1 : σ 1 ≠ σ 2 First define the sample means for the two populations with X = ∑ i =11 X i / n1 n and Y = ∑ i =21 X i / n2 . For the case of known variances σ 12 and σ 22 , the test statistic is defined by: n [ 156 ] Chapter 5 Z= X − Y − ( µ1 − µ2 ) σ 12 / n1 + σ 22 / n2 Under the null hypotheses, Z = ( X − Y ) / σ 12 / n1 + σ 22 / n2 follows a standard normal distribution. The test procedure for the problem H 0 : µ1 > µ2 vs H1 : µ1 ≤ µ2 is to reject H0 if z ≥ zα , and the procedure for H 0 : µ1 < µ2 vs H1 : µ1 ≥ µ2 is to reject H0 if z < zα . As expected and on earlier intuitive lines, the test procedure for the hypotheses problem H 0 : µ1 = µ2 vs H1 : µ1 ≠ µ2 is to reject H0 as z ≥ zα /2 . Let us now consider the case when the variances σ 12 and σ 22 are not known and assumed (or known) to be equal. In this case, we can't use the Z-test any further and need to look at the estimator of the common variance. For this, we defined the pooled variance estimator as follows: S p2 = n1 − 1 n2 − 1 S x2 + S y2 n1 + n2 − 2 n1 + n2 − 2 where S x2 and S y2 are the sampling variances of the two populations. Define the t-statistic as follows: t= X −Y S (1/ n1 + 1/ n2 ) 2 p The test procedure for the set of the three hypotheses testing problems is then to reject the null hypotheses if t < tn1 + n2 − 2,α , t > tn1 + n2 − 2,α , or t < tn + n − 2,α /2 . 1 2 Finally, we focus on the problem of testing variances across two samples. Here, the test statistic is given by: F= S x2 S y2 The test procedures would be to respectively reject the null hypotheses of the testing problems H 0 : σ 1 > σ 2 vs H1 : σ 1 ≤ σ 2 , H 0 : σ 1 < σ 2 vs H1 : σ 1 ≥ σ 2, and H 0 : σ 1 = σ 2 vs H1 : σ 1 ≠ σ 2 if F < Fn1 −1,n2 −1,α , F > Fn1 −1,n2 −1,α , F < Fn1 −1,n2 −1,α /2 . [ 157 ] Statistical Inference Let us now consider some scenarios where we have the previously listed hypotheses testing problems. Test examples. Let us consider some situations when the preceding set of hypotheses arise in a natural way. In continuation of the chemical experiment problem, let us assume that the chemists have come up with a new method of obtaining the same solution as discussed in the previous section. For the new technique, the standard deviation continues to be 0.05 and 12 observations for the new method yield the following measurements: 8.78, 8.85, 8.74, 8.83, 8.82, 8.79, 8.82, 8.74, 8.84, 8.78, 8.75, 8.81. Now, this new solution is acceptable if its mean is greater than that for the earlier one. Thus, the hypotheses testing problem is now H 0 : µ NEW > µOLD vs H1 : µ NEW ≤ µOLD . Ross (2008), page 451. The precision of instruments in metal cutting is a much serious business and the cut pieces can't be significantly lesser than the target nor be greater than it. Two machines are used to cut 10 pieces of steel, and their measurements are respectively given in 122.4, 123.12, 122.51, 123.12, 122.55, 121.76, 122.31, 123.2, 122.48, 121.96, and 122.36, 121.88, 122.2, 122.88, 123.43, 122.4, 122.12, 121.78, 122.85, 123.04. 
The standard deviation of the length of a cut is known to be equal to 0.5. We need to test if the average cut length is same for the two machines. For both the preceding problems assume that though the variances are equal, they are not known. Complete the hypotheses testing problems using t.test. Freund and Wilson (2003), page 199. The monitoring of the amount of peanuts being put in jars is an important issue in quality control viewpoint. The consistency of the weights is of prime importance and the manufacturer has been introduced to a new machine, which is supposed to give more accuracy in the weights of the peanuts put in the jars. With the new device, 11 jars were tested for their weights and found to be 8.06, 8.64, 7.97, 7.81, 7.93, 8.57, 8.39, 8.46, 8.28, 8.02, 8.39, whereas a sample of nine jars from the previous machine weighed at 7.99, 8.12, 8.34, 8.17, 8.11, 8.03, 8.14, 8.14, 7.87. Now, the task is to test H 0 : σ NEW = σ OLD vs H1 : σ NEW ≥ σ OLD . Let us do the tests for the preceding four problems in R. [ 158 ] Chapter 5 Time for action – testing two-sample hypotheses For the problem of testing hypotheses for the means arising from two populations, we will be using the functions z.test and t.test. 1. 2. As earlier, load the library(PASWR) library. Carry out the Z-test using z.test and the options x, y, sigma.x, and sigma.y: pH_Data <- c(8.30, 8.42, 8.44, 8.32, 8.43, 8.41, 8.42, 8.46, 8.37, 8.42) pH_New <- c(8.78, 8.85, 8.74, 8.83, 8.82, 8.79, 8.82, 8.74, 8.84, 8.78, 8.75, 8.81) z.test(x=pH_Data,y=pH_New,sigma.x=sigma.y=0.05,alternative="less") The p-value is very small (2.2e-16) indicating that we reject the null hypothesis that H 0 : µ NEW > µOLD . 3. For the steel length cut data problem, run the following code: length_M1 <- c(122.4, 123.12, 122.51, 123.12, 122.55, 121.76, 122.31, 123.2, 122.48, 121.96) length_M2 <- c(122.36, 121.88, 122.2, 122.88, 123.43, 122.4, 122.12, 121.78, 122.85, 123.04) z.test(x=length_M1,y=length_M2,sigma.x=0.5,sigma.y=0.5) The display of p-value = 0.8335 shows that the machines do not cut the steel in different ways. 4. If the variances are equal but not known, we need to use t.test instead of the z.test: t.test(x=pH_Data,y=pH_New,alternative="less") t.test(x=length_M1,y=length_M2) 5. The p-values for the two hypotheses problems are p-value = 3.95e-13 and p-value = 0.8397. We leave the interpretation aspect to the reader. 6. For the fourth problem, we have the following R program: machine_new <- c(8.06, 8.64, 7.97, 7.81, 7.93, 8.57, 8.39, 8.46, 8.28, 8.02, 8.39) machine_old <- c(7.99, 8.12, 8.34, 8.17, 8.11, 8.03, 8.14, 8.14, 7.87) t.test(machine_new,machine_old, alternative="greater") Again, we have p-value = 0.1005! [ 159 ] Statistical Inference What just happened? The functions t.test and z.test were simply extensions from the one-sample case to the two-sample test. Have a go hero In the one-sample case you used var.test for the same datasets, which needed a comparison of means with some known standard deviation. Now, test for the variance in the two-sample case using var.test using appropriate hypotheses for them. For example, test whether the variances are equal for pH_Data and pH_New. Find more details of the test with ?var.test. Summary In this chapter we have introduced "statistical inference", which in a common usage term consists of three parts: estimation, confidence intervals, and hypotheses testing. 
We began the chapter with the importance of likelihood and to obtain the MLE in many of the standard probability distributions using built-in modules. Later, simply to maintain the order of concepts, we defined functions exclusively for obtaining the confidence intervals. Finally, the chapter considered important families of tests that are useful across many important stochastic experiments. In the next chapter we will introduce the linear regression model, which more formally constitutes the applied face of the subject. [ 160 ] 6 Linear Regression Analysis In the Visualization techniques for continuous variable data section of Chapter 3, Data Visualization, we have seen different data visualization techniques which help in understanding the data variables (boxplots and histograms) and their interrelationships (matrix of scatter plots). We had seen in Example 4.6.1. Resistant line for the IO-CPU time an illustration of the resistant line, where CPU_Time depends linearly on the No_of_IO variable. The pair function's output in Example 3.2.9. Octane rating of gasoline blends indicated that the mileage of a car has strong correlations with the engine-related characteristics, such as displacement, horsepower, torque, the number of transmission speeds, and the type of transmission being manual or automatic. Further, the mileage of a car also strongly depends on the vehicle dimensions, such as its length, width, and weight. The question addressed in this chapter is meant to further these initial findings through a more appropriate model. Now, we take the next step forward and build linear regression models for the problems. Thus, in this chapter we will provide more concrete answers for the mileage problem. Linear Regression Analysis The first linear regression model was built by Sir Francis Galton in 1908. The word regression implies towards-the-center. The covariates, also known as independent variables, features, or regressors, have a regressive effect on the output, also called dependent or regressand variable. Since the covariates are allowed, actually assumed, to affect the output in linear increments, we call the model the linear regression model. The linear regression models provide an answer for the correlation between the regressand and the regressors, and as such do not really establish causation. As it will be seen later in the chapter, using data, we will be able to understand the mileage of a car as a linear function of the car-related dynamics. From a pure scientific point of view, the mileage should really depend on complicated formulas of the car's speed, road conditions, the climate, and so on. However, it will be seen that linear models work just fine for the problem despite not really going into the technical details. However, there will also be a price to pay, in the sense that most regression models work well when the range of the variables is well defined, and that an attempt to extrapolate the results usually does not result in satisfactory answers. We will begin with a simple linear regression model where we have one dependent variable and one covariate. 
At the conclusion of the chapter, you will be able to build a regression model through the following steps: Building a linear regression model and their interpretation Validation of the model assumptions Identifying the effect of every single observation, covariates, as well as the output Fixing the problem of dependent covariates Selection of the optimal linear regression model The simple linear regression model In Example 4.6.1. Resistant line for the IO-CPU time of Chapter 4, Exploratory Analysis, we built a resistant line for CPU_Time as a function of the No_of_IO processes. The results were satisfactory in the sense that the fitted line was very close to covering all the data points, refer to Figure 7 of Chapter 4, Exploratory Analysis. However, we need more statistical validation of the estimated values of the slope and intercept terms. Here we take a different approach and state the linear regression model in more technical details. [ 162 ] Chapter 6 The simple linear regression model is given by Y = β 0 + β1 X + ε ,where X is the covariate/ independent variable, Y is the regressand/dependent variable, and ε is the unobservable error term. The parameters of the linear model are specified by β 0 and β1. Here β 0 is the intercept term and corresponds to the value of Y when x = 0. The slope term β1 reflects the change in the Y value for a unit change in X. It is also common to refer to the β 0 and β1 values as regression coefficients. To understand the regression model, we begin with n pairs of observations (Y1 , X 1 ) ,...,(Yn , X n ) with each pair being completely independent of the other. We make an assumption of normal and independent and identically distributed (iid) for the error term ε , specifically, ε ∼ N ( 0,σ 2 ) , where σ 2 is the variance of the errors. The core assumptions of the model are listed as follows: All the observations are independent The regressand depends linearly on the regressors The errors are normally distributed, that is ε ∼ N ( 0 ,σ 2 ) We need to find all the unknown parameters in β 0 , β1 and σ 2 . Suppose we have n independent observations. Statistical inference for the required parameters may be carried out using the maximum likelihood function as described in the Maximum likelihood estimator section of Chapter 5, Statistical Inference. The popular technique for the linear regression model is the least squares method which identifies the parameters by minimizing the error sum of squares for the model, and under the assumptions made thus far agrees with the MLE. Let β 0 and β1 be a choice of parameters. Then the residuals, the distance between the actual points and the model predictions, made by using the proposed choice of β 0 , β1 on the i-th pair of observation (Yi , X i ) is defined by: ei = Yi − β0 + β1 X i ,i = 1, 2 ,...,n Let us now specify different values for the pair ( β 0 , β1 ) and visualize the residuals for them. What happens to the arbitrary choice of parameters? For the IO_Time dataset, the scatter plot suggests that the intercept term is about 0.05. Further, the resistant line gives an estimate of the slope at about 0.04. We will have three pairs for guesses for ( β 0 , β1 ) as (0.05, 0.05), (0.1, 0.04), and (0.15, 0.03). We will now plot the data and see the different residuals for the three pairs of guesses. [ 163 ] Linear Regression Analysis Time for action – the arbitrary choice of parameters 1. We begin with reasonable guesses of the slope and intercept terms for a simple linear regression model. 
The idea is to inspect the difference between the fitted line and the actual observations. Invoke the graphics windows using par(mfrow=c(1,3)). 2. Obtain the scatter plot of the CPU_Time against No_of_IO with: plot(No_of_IO,CPU_Time,xlab="Number of Processes",ylab="CPU Time", ylim=c(0,0.6),xlim=c(0,11)) 3. For the guessed regression line with the values of ( β 0 , β1 ) being (0.05, 0.05), plot a line on the scatter plot with abline(a=0.05,b=0.05,col= "blue"). 4. Define a function which will find the y value for the guess of the pair (0.05, 0.05) using myline1 = function(x) 0.05*x+0.05. 5. Plot the error (residuals) made due to the choice of the pair (0.05, 0.05) from the actual points using the following loop and give a title for the first pair of the guess: for(i in 1:length(No_of_IO)){ lines(c(No_of_IO[i], No_of_IO[i]), c(CPU_Time[i], myline1(No_of_IO[i])),col="blue", pch=10) } title("Residuals for the First Guess") 6. Repeat the preceding exercise for the last two pairs of guesses for the regression coefficients ( β 0 , β1 ) . The complete R program is given as follows: par(mfrow=c(1,3)) plot(IO_Time$No_of_IO,IO_Time$CPU_Time,xlab="Number of Processes",ylab="CPU Time",ylim=c(0,0.6),xlim=c(0,11)) abline(a=0.05,b=0.05,col="blue") myline1 <- function(x) 0.05*x+0.05 for(i in 1:length(IO_Time$No_of_IO)){ lines(c(IO_Time$No_of_IO[i],IO_Time$No_of_IO[i]),c(IO_Time$CPU_ Time[i],myline1(IO_Time$No_of_IO[i])),col="blue",pch=10) } title("Residuals for the First Guess") plot(IO_Time$No_of_IO,IO_Time$CPU_Time,xlab="Number of Processes",ylab="CPU Time",ylim=c(0,0.6),xlim=c(0,11)) abline(a=0.1,b=0.04,col="green") myline2 <- function(x) 0.04*x+0.1 for(i in 1:length(IO_Time$No_of_IO)){ lines(c(IO_Time$No_of_IO[i],IO_Time$No_of_IO[i]),c(IO_Time$CPU_ Time[i],myline2(IO_Time$No_of_IO[i])),col="green",pch=10) [ 164 ] Chapter 6 } title("Residuals for the Second Guess") plot(IO_Time$No_of_IO,IO_Time$CPU_Time,xlab="Number of Processes",ylab="CPU Time",ylim=c(0,0.6),xlim=c(0,11)) abline(a=0.15,b=0.03,col="yellow") myline3 <- function(x) 0.03*x+0.15 for(i in 1:length(IO_Time$No_of_IO)){ lines(c(IO_Time$No_of_IO[i],IO_Time$No_of_IO[i]),c(IO_Time$CPU_ Time[i],myline3(IO_Time$No_of_IO[i])),col="yellow",pch=10) } title("Residuals for the Third Guess") Figure 1: Residuals for the three choices of regression coefficients What just happened? We have just executed an R program which displays the residuals for arbitrary choices of the regression parameters. The displayed result is the preceding screenshot. In the preceding R program, we first plot CPU_Time against No_of_IO. The first choice of the line is plotted by using the abline function, and we specify the required intercept and slope through a = 0.05 and b = 0.05. From this straight line (color blue), we need to obtain the magnitude of error, through perpendicular lines from the points to the line, from the original points. This is achieved through the for loop, where the lines function joins the points and the line. [ 165 ] Linear Regression Analysis For the pair (0.05, 0.05) as a guess of ( β 0 , β1 ), we see that there is a progression in the residual values as x increases, and it is the other way for the guess of (0.15, 0.03). In either case, we are making large mistakes (residuals) for certain x values. The middle plot for the guess (0.1, 0.04) does not seem to have large residual values. This choice may be better over the other two choices. Thus, we need to define a criterion which enables us to find the best values ( β 0 , β1 ) in some sense. 
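A quick numerical comparison supports this visual impression. The sketch below (assuming the IO_Time data frame used in the preceding program is still loaded) simply adds up the squared residuals for each guessed line; the guess with the smallest total is the best of the three.

# Sum of squared residuals for each guessed line
sse <- function(b0, b1) sum((IO_Time$CPU_Time - (b0 + b1 * IO_Time$No_of_IO))^2)
sse(0.05, 0.05)   # first guess
sse(0.10, 0.04)   # second guess
sse(0.15, 0.03)   # third guess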
The criterion is to minimize the sum of squared errors: min ( β0 ,β1 ) n ∑e i =1 2 i Where: n n ∑ e = ∑{ y − ( β i =1 2 i i =1 i + β1 xi )} 2 0 Here, the summation is over all the observed pairs ( yi ,xi ) ,i = 1, 2,...,n . The technique of minimizing the error sum of squares is known as the method of least squares, and for the simple linear regression model, the values of ( β 0 , β1 ) which meet the criterion are given by: βˆ1 = S xy S xx , βˆ 0 = y − βˆ1 x Where: x= ∑in−1 xi ∑n y , and y = i −1 i n n And: n n S xx = ∑ ( xi − x ) , and S xx = ∑ yi ( xi − x ) 2 i =1 2 i =1 We will now learn how to use R for building a simple linear regression model. [ 166 ] Chapter 6 Building a simple linear regression model We will use the R function lm for the required construction. The lm function creates an object of class lm, which consists of an ensemble of the fitted regression model. Through the following exercise, you will learn the following: The basic construction of an lm object The criteria which signifies the model significance The criteria which signifies the variable significance The variation of the output explained by the inputs The relationship is specified by a formula in R, and the details related to the generic form may be obtained by entering ?formula in the R console. That is, the lm function accepts a formula object for the model that we are attempting to build. data.frame may also be explicitly specified, which consists of the required data. We need to model CPU_Time as a function of No_of_IO, and this is carried out by specifying CPU_Time ~ No_of_IO. The function lm is wrapped around the formula to obtain our first linear regression model. Time for action – building a simple linear regression model We will build the simple linear regression model using the lm function with its useful arguments. 1. Create a simple linear regression model for CPU_Time as a function of No_of_IO by keying in IO_lm <- lm(CPU_Time ~ No_of_IO, data=IO_Time). 2. 3. Verify that IO_lm is of the lm class with class(IO_lm). Find the details of the fitted regression model using the summary function: summary(IO_lm). [ 167 ] Linear Regression Analysis The output is given in the following screenshot: Figure 2: Building the first simple linear regression model The first question you should ask yourself is, "Is the model significant overall?". The answer is provided by the p-value of the F-statistic for the overall model. This appears in the final line of summary(IO_lm). If the p-value is closer to 0, it implies that the model is useful. A rule of thumb for the significance of the model is that it should be less than 0.05. The general rule is that if you need the model significance at a certain percentage, say P, then the p-value of the F-statistic should be lesser than (1-P/100). Now that we know that the model is useful, we can ask whether the independent variable as well as the intercept term, are significant or not. The answer to this question is provided by Pr(>|t|) for the variables in the summary. R has a general way of displaying the highest significance level of a term by using ***, **, * and . in the Signif. codes:. This display may be easily compared with the review of a movie or a book! Just as with general ratings, where more stars indicate a better product, in our context, the higher the number of stars indicate the variables are more significant for the built model. In our linear model, we find No_of_IO to be highly significant. The estimate value of No_of_IO is given as 0.04076. 
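Incidentally, the numbers reported by summary can also be pulled out programmatically, which is convenient for later calculations; a brief sketch:

coef(IO_lm)            # the estimated intercept and slope
coef(summary(IO_lm))   # estimates with standard errors, t values, and p-values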
This coefficient has the interpretation that for a unit increase in the number of IOs; CPU_Time is expected to increase by 0.04076. [ 168 ] Chapter 6 Now that we know that the model, as well as the independent variable, are significant, we need to know how much of the variability in CPU_Time is explained by No_of_IO. The answer to this question is provided by the measure R2, not to be confused with the letter R for the software, which when multiplied by 100 gives the percentage of variation in the regressand explained by the regressor. The term R2 is also called as the coefficient of determination. In our example, 98.76 percent of the variation in CPU_Time is explained by No_of_IO, see the value associated with Multiple R-squared in the summary(IO_lm). The R2 measure does not consider the number of parameters estimated or the number of observations n in a model. A more robust explanation, which takes into consideration the number of parameters and observations, is provided by Adjusted R-squared which is 98.6 percent. We have thus far not commented on the first numerical display as a result of using the summary function. This relates to the residuals and the display is about the basic summary of the residual values. The residuals vary from -0.016509 to 0.024006, which are not very large in comparison with the CPU_Time values, check with summary(CPU_Time) for instance. Also, the median of the residual values is very close to zero, and this is an important criterion as the median of the standard normal distribution is 0. What just happened? You have fitted a simple linear regression model where the independent variable is No_of_ IO and the dependent variable (output) is CPU_Time. The important quantities to look for the model significance, the regression coefficients, and so on have been clearly illustrated. Have a go hero Load the dataset anscombe from the datasets package. The anscombe dataset has four pairs of datasets in x1, x2, x3, x4, y1, y2, y3, and y4. Fit a simple regression model for all the four pairs and obtain the summary for each pair. Make your comments on the summaries. Pay careful attention to the details of the summary function. If you need further help, simply try out example(anscombe). We will next look at the ANOVA (Analysis of Variance) method for the regression model, and also obtain the confidence intervals for the model parameters. ANOVA and the confidence intervals The summary function of the lm object specifies the p-value for each variable in the model, including the intercept term. Technically, the hypothesis problem is testing H 0j : β j = 0, j = 0,1 j against the corresponding alternative hypothesis, H1 : β j ≠ 0 , j = 0 ,1. This testing problem is technically different from the simultaneous hypothesis testing H 0 : β 0 = β1 = 0 against the alternative that at least one of the regression coefficients is different from 0. The ANOVA technique gives the answer to the latter null hypothesis of interest. [ 169 ] Linear Regression Analysis For more details about the ANOVA technique, you may refer to http://en.wikipedia. org/wiki/Analysis_of_variance. Using the anova function, it is very simple in R to obtain the ANOVA table for a linear regression model. Let us apply it for our IO_lm linear regression model. Time for action – ANOVA and the confidence intervals The R functions anova and confint respectively help obtain the ANOVA table and confidence intervals from the lm objects. Here, we use them for the IO_lm regression object. 1. 
Use the anova function on the IO_lm lm object to obtain the ANOVA table by using IO_anova <- anova(IO_lm). 2. 3. Display the ANOVA table by keying in IO_anova in the console. The 95 percent confidence intervals for the intercept and the No_of_IO variable is obtained by confint(IO_lm). The output in R is as follows: Figure 3: ANOVA and the confidence intervals for the simple linear regression model The ANOVA table confirms that the variable No_of_IO is significant indeed. Note the difference of the criteria for confirming this with respect to summary(IO_lm). In the former case, the significance was arrived at using the t-statistics and here we have used the F-statistic. Precisely, we check for the variance significance of the input variable. We now give the tool for obtaining confidence intervals. [ 170 ] Chapter 6 Check whether or not the estimated value of the parameters fall within the 95 percent confidence intervals. The preceding results show that we indeed have a very good linear regression model. However, we also made a host of assumptions in the beginning of the section, and a good practice is to ask how valid they are in the experiment. We next consider the problem of validation of the assumptions. What just happened? The ANOVA table is a very fundamental block for a regression model, and it gives the split of the sum of squares for the variable(s) and the error term. The difference between ANOVA and the summary of the linear model object is in the respectively p-values reported by them as Pr(>F) and Pr(>|t|). You also found a method for obtaining the confidence intervals for the independent variables of the regression model. Model validation The violations of the assumptions may arise in more than one way. Tattar, et. al. (2012), Kutner, et. al. (2005) discusses the numerous ways in which the assumptions are violated and an adaption of the methods mentioned in these books is now considered: The regression function is not linear. In this case, we expect the residuals to have a pattern which is not linear when viewed against the regressors. Thus, a plot of the residuals against the regressors is expected to indicate if this assumption is violated. The error terms do not have constant variance. Note that we made an assumption stating that ε ∼ N ( 0,σ 2 ) , that is the magnitude of errors do not depend on the corresponding x or y value. Thus, we expect the plot of the residuals against the predicted y values to reveal if this assumption is violated. The error terms are not independent. A plot of the residuals against the serial number of the observations indicated if the error terms are independent or not. We typically expect this plot to exhibit a random walk if the errors are independent. If any systematic pattern is observed, we conclude that the errors are not independent. The model fits all but one or a few outlier observations. Outliers are a huge concern in any analytical study as even a single outlier has a tendency to destabilize the entire model. A simple boxplot of the residuals indicates the presence of an outlier. If any outlier is present, such observations need to be removed and the model needs to be rebuilt. The current step of model validation needs to be repeated for the rebuilt model. In fact the process needs to be iterated until there are no more outliers. However, we need to caution the reader that if the subject experts feel that such outliers are indeed expected values, it may convey that some appropriate variables are missing in the regression model. 
[ 171 ] Linear Regression Analysis The error terms are not normally distributed. This is one of the most crucial assumptions of the linear regression model. The violation of this assumption is verified using the normal probability plot in which the predicted values (actually cumulative probabilities) are plotted against the observed values. If the values fall along a straight line, the normality assumption for errors holds true. The model is to be rejected if this assumption is violated. The next section shows you how to obtain the residual plots for the purpose of model validation. Time for action – residual plots for model validation The R functions resid and fitted can be used to extract residuals and fitted values from an lm object. 1. Find the residuals of the fitted regression model using the resid function: IO_lm_ resid <- resid(IO_lm). 2. We need six plots, and hence we invoke the graphics editor with par(mfrow = c(3,2)). 3. Sketch the plot of residuals against the predictor variable with plot(No_of_IO, IO_lm_resid). 4. To check whether the regression model is linear or not, obtain the plots of absolute residual values against the predictor variable and also that of squared residual values against the predictor variable respectively with plot(No_of_IO, abs(IO_ lm_resid),…) and plot(No_of_IO, IO_lm_resid^2,…). 5. The assumption that errors have constant variance may be verified by the plot of residuals against the fitted values of the regressand. The required plot is obtained by using plot(IO_lm$fitted.values,IO_lm_resid). 6. The assumption that the errors are independent of each other may be verified plotting the residuals against their index numbers: plot.ts(IO_lm_resid). [ 172 ] Chapter 6 7. Finally, the presence of outliers is investigated by the boxplot of the residuals: boxplot(IO_lm_resid). 8. Finally, the assumption of normality for the error terms is verified through the normal probability plot. This plot is on a new graphics editor. The complete R program is as follows: IO_lm_resid <- resid(IO_lm) par(mfrow=c(3,2)) plot(No_of_IO, IO_lm_resid,main="Plot of Residuals Vs Predictor Variable",ylab="Residuals",xlab="Predictor Variable") plot(No_of_IO, abs(IO_lm_resid), main="Plot ofAbsolute Residual Values Vs Predictor Variable",ylab="Absolute Residuals", xlab="Predictor Variable") # Equivalently plot(No_of_IO, IO_lm_resid^2,main="Plot of Squared Residual Values Vs Predictor Variable", ylab="Squared Residuals", xlab="Predictor Variable") plot(IO_lm$fitted.values,IO_lm_resid, main="Plot of Residuals Vs Fitted Values",ylab="Residuals", xlab="Fitted Values") plot.ts(IO_lm_resid, main="Sequence Plot ofthe Residuals") boxplot(IO_lm_resid,main="Box Plot of the Residuals") rpanova = anova(IO_lm) IO_lm_resid_rank=rank(IO_lm_resid) tc_mse=rpanova$Mean[2] IO_lm_resid_expected=sqrt(tc_mse)*qnorm((IO_lm_resid_rank-0.375)/ (length(CPU_Time)+0.25)) plot(IO_lm_resid,IO_lm_resid_expected,xlab="Expected",ylab="Residuals" ,main="The Normal Probability Plot") abline(0,1) [ 173 ] Linear Regression Analysis The resulting plot for the model validation plot is given next. If you run the preceding R program up to the rpanova code, you will find the plot similar to the following: Figure 4: Checking for violations of assumptions of IO_lm We have used the resid function to extract the residuals out of the lm object. 
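As an aside, the hand-constructed normal probability plot in the last few lines of the program can be cross-checked against R's built-in functions; a two-line sketch on the same residuals:

qqnorm(IO_lm_resid)   # normal Q-Q plot of the residuals
qqline(IO_lm_resid)   # reference line for comparison

Both versions should tell the same story about the normality assumption.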
The first plot of residuals against the predictor variable No_of_IO shows that more the number of IO processes, the larger is the residual value, as is also confirmed by Plot of Absolute Residual Values Vs Predictor Variable and Plot of Squared Residual Values Vs Predictor Variable. However, there is no clear non-linear pattern suggested here. The Plot of Residuals Vs Fitted Values is similar to the first plot of residuals against the predictor. The time series plot of residuals does not indicate a strict deterministic trend and appears a bit similar to the random walk. Thus, these plots do not give any evidence of any kind of dependence among the observations. The boxplot does not indicate any presence of an outlier. The normal probability plot for the residuals is given next: [ 174 ] Chapter 6 Figure 5: The normal probability plot for IO_lm As all the points fall close to the straight line, the normality assumption for the errors does not appear to be violated. What just happened? The R program given earlier gives various residual plots, which help in validation of the model. It is important that these plots are always checked whenever a linear regression model is built. For CPU_Time as a function of No_of_IO, the linear regression model is a fairly good model. [ 175 ] Linear Regression Analysis Have a go hero From a theoretical perspective and my own experience, the seven plots obtained earlier were found to be very useful. However, R, by default, also gives a very useful set of residual plots for an lm object. For example, plot(my_lm) generates a powerful set of model validation plots. Explore the same for IO_lm with plot(IO_lm). You can explore more about plot and lm with the plot.lm function. We will next consider the general multiple linear regression model for the Gasoline problem considered in the earlier chapters. Multiple linear regression model In the The simple linear regression model section, we considered an almost (un)realistic problem of having only one predictor. We need to extend the model for the practical problems when one has more than a single predictor. In Example 3.2.9. Octane rating of gasoline blends we had a graphical study of mileage as a function of various vehicle variables. In this section, we will build a multiple linear regression model for the mileage. If we have X1, X2, …, Xp independent set of variables which have a linear effect on the dependent variable Y, the multiple linear regression model is given by: Y = β 0 + β1 X 1 + β 2 X 2 + ... + + β p X p + ε This model is similar to the simple linear regression model, and we have the same interpretation as earlier. Here, we have additional independent variables in X2, …, Xp and their effect on the regressand Y are respectively through the additional regression parameters β 2 ,..., β p. Now, suppose we have n pairs of random observations (Y1 , X 1 ) ,...,(Yn , X n ) for understanding the multiple linear regression model, here X i = ( X i ,..., X ip ) ,i = ,...,n . A matrix form representation of the multiple linear regression model is useful in the understanding of the estimator for the vector of regression coefficients. 
We define the following quantities:

Y = (Y_1, \ldots, Y_n)', \quad \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)', \quad \beta = (\beta_0, \beta_1, \ldots, \beta_p)',

and

X = \begin{pmatrix}
1 & X_{11} & X_{12} & \cdots & X_{1p} \\
1 & X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{n1} & X_{n2} & \cdots & X_{np}
\end{pmatrix}

The multiple linear regression model for the n observations can then be written in the compact matrix form:

Y = X\beta + \varepsilon

The least squares estimate of \beta is given by:

\hat{\beta} = (X'X)^{-1} X'Y

Let us fit a multiple linear regression model for the Gasoline mileage data considered earlier.

Averaging k simple linear regression models or a multiple linear regression model

We already know how to build a simple linear regression model, so why learn new theory when a seemingly straightforward extension is available? Intuitively, we could build k simple models, one for each covariate, and then simply average over the k models. Such an averaging can also be considered in the univariate case: we may divide the covariate into distinct intervals, build a simple linear regression model over each interval, and finally average over the different models. Montgomery, et al. (2001), pages 120-122, highlight the drawbacks of such an approach. Typically, the simple linear regression models may indicate the wrong sign for the regression coefficients. The wrong sign from such a naive approach may arise for several reasons: the range of some regressors being restricted, critical regressors being omitted from the model building, or computational errors creeping in. To drive home the point, we consider an example from Montgomery, et al.

Time for action – averaging k simple linear regression models

We will build three models here. We have a vector of regressand y and two covariates: x1 and x2.

1. Enter the dependent variable and the independent variables with y <- c(1,5,3,8,5,3,10,7), x1 <- c(2,4,5,6,8,10,11,13), and x2 <- c(1,2,2,4,4,4,6,6).

2. Visualize the relationships among the variables with:
par(mfrow=c(1,3))
plot(x1,y)
plot(x2,y)
plot(x1,x2)

3. Build the individual simple linear regression models and our first multiple regression model with:
summary(lm(y~x1))
summary(lm(y~x2))
summary(lm(y~x1+x2)) # Our first multiple regression model

Figure 6: Averaging k simple linear regression models

What just happened?

The scatter plots indicate that both x1 and x2 have a positive impact on y, and this is also captured by lm(y~x1) and lm(y~x2); see the R output in the preceding screenshot. We have omitted the scatter plots themselves, though you should see them on your screen after running the code in step 2. However, each of these simple models is built under the assumption that the information contained in its single covariate, x1 or x2 alone, is complete. Each variable is also seen to have a significant effect on the output. However, metrics such as Multiple R-squared and Adjusted R-squared are very poor for both simple linear regression models. This is one indication that more information needs to be brought into the model, and thus we include both variables and build our first multiple linear regression model; the next section takes this up in more detail. There are two important changes worth registering now. First, the sign of the regression coefficient of x1 becomes negative, contradicting the intuition from the scatter plots. Second, the R-squared value increases greatly. To summarize our observations, it suffices to say that the sum of the parts may sometimes fall well short of the entire picture.
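To connect this small example back to the matrix form given earlier, the least squares estimate can also be reproduced by hand. The following sketch reuses the y, x1, and x2 vectors entered in step 1; the object names are our own.

X <- cbind(1, x1, x2)                            # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% y)   # (X'X)^{-1} X'y
beta_hat
coef(lm(y ~ x1 + x2))                            # the same estimates from lm

The two sets of estimates agree, which is a useful sanity check on the matrix formula.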
Building a multiple linear regression model The R function lm remains the same as earlier. We will continue Example 3.2.9. Octane rating of gasoline blends from the Visualization techniques for continuous variable data section of Chapter 3, Data Visualization. Recall that the variables, independent and dependent as well, are stored in the dataset Gasoline in the RSADBE package. Now, we tell R that y, which is the mileage, is the dependent variable, and we need to build a multiple linear regression model which includes all other variables of the Gasoline object. Thus, the formula is specified by y~. indicating that all other variables from the Gasoline object need to be treated as the independent variables. We proceed as earlier to obtain the summary of the fitted multiple linear regression model. Time for action – building a multiple linear regression model The method of building a multiple linear regression model remains the same as earlier. If all the variables in data.frame are to be used, we use the formula y ~ .. However, if we need specific variables, say x1 and x3, the formula would be y ~ x1 + x3. 1. Build the multiple linear regression model with gasoline_lm <- lm(y~., data=Gasoline). Here, the formula y~. considers the variable y as the dependent variable and all the remaining variables in the Gasoline data frame as independent variables. 2. Get the details of the fitted multiple linear regression model with summary(gasoline_lm). [ 179 ] Linear Regression Analysis The R screen then appears as follows: Figure 7: Building the multiple linear regression model As with the simple model, we need to first check whether the overall model is significant by looking at the p-value of the F-statistics, which appears as the last line of the summary output. Here, the value 0.0003 being very close to zero, the overall model is significant. Of the 11 variables specified for modeling, only x1 and x3, that is, the engine displacement and torque, are found to have a meaningful linear effect on the mileage. The estimated regression coefficient values indicate that the engine displacement has a negative impact on the mileage, whereas the torque impacts positively. These results are in confirmation with the basic science of vehicle mileage. [ 180 ] Chapter 6 We have a tricky output for the eleventh independent variable, which for some strange reasons R has been renamed as x11M. We need to explain this. You should verify the output as a consequence of running sapply(Gasoline,class) on the console. Now, the x11 variable is a factor variable assuming two possible values A and M, which stand for the transmission box being Automatic or Manual. As the categorical variables are of a special nature, they need to be handled differently. The user may be tempted to skip this, as the variable is seen to be insignificant in this case. However, the interpretation is very useful and the "skip" part may prove really expensive later. For computational purposes, an m-level factor variable is used to create m-1 new different variables. If the variable assumes the level l, the lth variable takes value 1, else 0, for l = 1, 2, …, m-1. If the variable assumes level m, all the (m-1) new variables take the value 0. Now, R takes the lth factor level and names that vector by concatenating the variable name and the factor level. Hence, we have x11M as the variable name in the output. Here, we found the factor variable to be insignificant. 
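The dummy coding just described can be inspected directly. A short sketch follows, assuming the Gasoline data frame is still loaded and that x11 is stored as a factor, as sapply(Gasoline, class) indicates.

contrasts(Gasoline$x11)                     # how the levels A and M are coded as 0/1
head(model.matrix(y ~ ., data = Gasoline))  # the column x11M carries this 0/1 coding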
If in certain experiments we find some factor levels to be significant at certain p-value, we can't ignore the other factor levels even if their p-values suggest them as insignificant. What just happened? The building of a multiple linear regression model is a straightforward extension of the simple linear regression model. The interpretation is where one has to be more careful with the multiple linear regression model. We will now look at the ANOVA and confidence intervals for the multiple linear regression model. It is to be noted that the usage is not different from the simple linear regression model, as we are still dealing with the lm object. The ANOVA and confidence intervals for the multiple linear regression model Again, we use the anova and confint functions to obtain the required results. Here, the null hypothesis of interest is whether all the regression coefficients equal 0, that is H 0 : β 0 = β1 = = β p = 0 against the alternative that at least one of the regression coefficients is different from 0, that is H1 : β 0 ≠ 0 for at least one j = 0, 1, …, p. [ 181 ] Linear Regression Analysis Time for action – the ANOVA and confidence intervals for the multiple linear regression model The use of anova and confint extend in a similar way as lm is used for simple and multiple linear regression models. 1. The ANOVA table for the multiple regression model is obtained in the same way as for the simple regression model, after all we are dealing with the object of class lm: gasoline_anova<-anova(gasoline_lm). 2. The confidence intervals for the independent variables are obtained by using confint(gasoline_lm). The R output is given as follows: Figure 8: The ANOVA and confidence intervals for the multiple linear regression model [ 182 ] Chapter 6 Note the difference between the anova and summary results. Now, we find only the first variable to be significant. The interpretation of the confidence intervals is left to you. What just happened? The extension from simple to multiple linear regression model in R, especially for the ANOVA and confidence intervals, is really straightforward. Have a go hero The ANOVA table in the preceding screenshot and the summary of gasoline_lm in the screenshot given in step 2 of the Time for action – building a multiple linear regression model section build linear regression models using the significant variables only. Are you amused? Useful residual plots In the context of multiple linear regression models, modifications of the residuals have been found to be more useful than the residuals themselves. We have assumed the residuals to follow a normal distribution with mean zero and unknown variance. An estimator of the unknown variance is provided by the mean residual sum of squares. There are four useful types of residuals for the current model: Standardized Residuals: We know that the residuals have zero mean. Thus, the standardized residuals are obtained by scaling the residuals with the estimator of the standard deviation, that is the square root of the mean residual sum of squares. The standardized residuals are defined by: di = ei MS Re s Here, MS Re s = ∑ in=1 ei2 / ( n − p ), and p is the number of covariates in the model. The residual is expected to have mean 0 and MS Re s is an estimate of its variance. Hence, we expect the standardized residuals to have a standard normal distribution. This in turn helps us to verify whether the normality assumption for the residuals is meaningful or not. 
Semi-studentized Residuals: The semi-studentized residuals are defined by:

r_i = \frac{e_i}{\sqrt{MS_{Res}(1 - h_{ii})}}, \quad i = 1, \ldots, n

Here, h_{ii} is the ith diagonal element of the matrix H = X(X'X)^{-1}X'. The variance of a residual depends on the covariate value, and hence a flat scaling by \sqrt{MS_{Res}} is not appropriate. A correction is provided by (1 - h_{ii}), and MS_{Res}(1 - h_{ii}) turns out to be an estimate of the variance of e_i. This is the motivation for the semi-studentized residual r_i.

PRESS Residuals: The predicted residual, PRESS, for observation i is the difference between the actual value y_i and the value predicted for it by a regression model fitted on the remaining (n - 1) observations. Let \hat{\beta}_{(i)} be the estimator of the regression coefficients based on the (n - 1) observations that exclude the ith observation. Then, the PRESS residual for observation i is given by:

e_{(i)} = y_i - x_i'\hat{\beta}_{(i)}, \quad i = 1, \ldots, n

The idea here is that the residual for an observation is more appropriately estimated from a model which is not influenced by that observation's own value.

R-student Residuals: This residual is especially useful for the detection of outliers:

t_i = \frac{e_i}{\sqrt{MS_{Res(i)}(1 - h_{ii})}}

Here, MS_{Res(i)} is an estimator of the variance \sigma^2 based on the remaining (n - 1) observations. The rescaling is on similar lines as for the studentized residuals.

The task of building n linear models may look daunting! However, there are very useful formulas in statistics and functions in R which save the day for us. It is appropriate that we use those functions to develop the residual plots for the Gasoline dataset. Let us set ourselves up for some action.

Time for action – residual plots for the multiple linear regression model

The R functions resid, hatvalues, rstandard, and rstudent are available, and can be applied on an lm object to obtain the required residuals.

1. Get the MSE of the regression model with gasoline_lm_mse <- gasoline_anova$Mean[length(gasoline_anova$Mean)].

2. Extract the residuals with the resid function, and standardize them using stan_resid_gasoline <- resid(gasoline_lm)/sqrt(gasoline_lm_mse).

3. To obtain the semi-studentized residuals, we first need the h_{ii} elements, which are obtainable using the hatvalues function: hatvalues(gasoline_lm). The remaining code is given at the end of this list.

4. The PRESS residuals are calculated using the rstandard function available in R.

5. The R-student residuals can be obtained using the rstudent function in R.
The detailed code is as follows:

# Useful Residual Plots
gasoline_lm_mse <- gasoline_anova$Mean[length(gasoline_anova$Mean)]
stan_resid_gasoline <- resid(gasoline_lm)/sqrt(gasoline_lm_mse) # standardizing the residuals
studentized_resid_gasoline <- resid(gasoline_lm)/(sqrt(gasoline_lm_mse*(1-hatvalues(gasoline_lm)))) # semi-studentizing the residuals
pred_resid_gasoline <- resid(gasoline_lm)/(1-hatvalues(gasoline_lm)) # the PRESS residuals
pred_student_resid_gasoline <- rstudent(gasoline_lm) # returns the R-student residuals
par(mfrow=c(2,2))
plot(gasoline_fitted,stan_resid_gasoline,xlab="Fitted",ylab="Residuals")
title("Standardized Residual Plot")
plot(gasoline_fitted,studentized_resid_gasoline,xlab="Fitted",ylab="Residuals")
title("Studentized Residual Plot")
plot(gasoline_fitted,pred_resid_gasoline,xlab="Fitted",ylab="Residuals")
title("PRESS Plot")
plot(gasoline_fitted,pred_student_resid_gasoline,xlab="Fitted",ylab="Residuals")
title("R-Student Residual Plot")

All the four residual plots in the screenshot given in the Time for action – residual plots for the multiple linear regression model section look identical, though there is a difference in their y-scaling. It is apparent from the residual plots that there are no patterns which show the presence of non-linearity, that is, the linearity assumption appears valid. In the standardized residual plot, all the observations are well within -3 and 3. Thus, it is correct to say that there are no outliers in the dataset.

Figure 9: Residual plots for the multiple linear regression model

What just happened?
Using the resid, rstudent, hatvalues, and other functions, we have obtained useful residual plots for the multiple linear regression model.

Regression diagnostics
In the Useful residual plots subsection, we saw how outliers may be identified using the residual plots. If there are outliers, we need to ask the following questions:

Is the observation an outlier due to an anomalous value in one or more covariate values?
Is the observation an outlier due to an extreme output value?
Is the observation an outlier because of both the covariate and the output values being extreme?

The distinction in the nature of an outlier is vital, as one needs to be sure of its type. The techniques for outlier identification are certainly different, as is their impact. If the outlier is due to the covariate value, the observation is called a leverage point, and if it is due to the y value, we call it an influential point. The rest of the section covers the exact statistical techniques for such outlier identification.

Leverage points
As noted, a leverage point has an anomalous x value. The leverage points may be theoretically proved not to impact the estimates of the regression coefficients. However, these points are known to drastically affect the R² value. The question then is, how do we identify such points? The answer is by looking at the diagonal elements of the hat matrix H = X(X'X)⁻¹X'. Note that the matrix H is of the order n × n. The (i, i) element of the hat matrix, h_ii, may be interpreted as the amount of leverage of the observation i on the fitted value ŷ_i. The average size of a leverage is h̄ = p/n, where p is the number of covariates and n is the number of observations. It is better to leave out an observation if its leverage value is greater than 2p/n, and we then conclude that the observation is a leverage point. Let us go back to the Gasoline problem and see the leverage of all the observations.
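As a minimal sketch of how this rule can be applied (assuming, as before, that gasoline_lm was fitted with lm(y ~ ., data = Gasoline)):

h <- hatvalues(gasoline_lm)            # diagonal elements of H
p <- length(coef(gasoline_lm)) - 1     # number of fitted coefficients excluding the intercept
n <- nrow(Gasoline)
which(h > 2 * p / n)                   # observations flagged as leverage points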
In R, we have the function hatvalues, which readily extracts the diagonal elements of H. The R output is given in the next screenshot. Clearly, we have 10 observations which are leverage points. This is indeed a matter of concern, as we have only about 25 observations. Thus, the results of the linear model need to be interpreted with caution! Let us now identify the influential points for the Gasoline linear model.

Influential points
An influential point has a tendency to pull the regression line (plane) towards its own direction, and hence such points drastically affect the values of the regression coefficients. We want to identify the impact of an observation on the regression coefficients, and one approach is to consider how much the regression coefficient values change if the observation were not considered. The relevant mathematics for the identification of influential points is beyond the scope of this book, so we simply help ourselves with the metric Cook's distance, which finds the influential points. The R function cooks.distance returns the value of Cook's distance for each observation, and the thumb rule is that if the distance is greater than 1, the observation is an influential point. Let us use the R function and identify the influential points for the Gasoline problem.

Figure 10: Leverage and influential points of the gasoline_lm fitted model

For this dataset, we have only one influential point in Eldorado. The plot of Cook's distance against the observation numbers and that of Cook's distance against the leverages may be easily obtained with plot(gasoline_lm, which=c(4,6)).

DFFITS and DFBETAS
Belsley, Kuh, and Welsch (1980) proposed two additional metrics for finding the influential points. The DFBETAS metric indicates the change in the regression coefficients (in standard deviation units) if the ith observation is removed. Similarly, DFFITS is a metric which gives the impact on the fitted value ŷ_i. The rule which indicates the presence of an influential point using DFFITS is |DFFITS_i| > 2√(p/n), where p is the number of covariates and n is the number of observations. Finally, an observation i is influential for the regression coefficient j if |DFBETAS_j,i| > 2/√n.

Figure 11: DFFITS and DFBETAS for the Gasoline problem

We have given the DFFITS and DFBETAS values for the Gasoline dataset. It is left as an exercise to the reader to identify the influential points from the outputs given above.

The multicollinearity problem
One of the important assumptions of the multiple linear regression model is that the covariates are linearly independent. The linear independence here is in the sense of Linear Algebra, namely that a vector (a covariate in our context) cannot be expressed as a linear combination of the others. Mathematically, this assumption translates into the requirement that X'X is nonsingular, that is, its determinant is non-zero, so that (X'X)⁻¹ exists. If this is not the case, then we have one or more of the following problems:

The estimated regression coefficients will be unreliable, and there is a great chance of the regression coefficients having the wrong sign
The relevant significant factors will not be identified by either the t-tests or the F-tests
The importance of certain predictors will be undermined

Let us first obtain the correlation matrix for the predictors of the Gasoline dataset. We will exclude the final covariate, as it is a factor variable.
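A minimal sketch of this step (it assumes the Gasoline data frame with the mileage y in column 1 and the factor covariate in column 12, the same indexing used in the vif() calls below):

round(cor(Gasoline[, -c(1, 12)]), 2)   # correlation matrix of the numeric predictors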
Figure 12: The correlation matrix of the Gasoline covariates

We can see that the covariate x1 is strongly correlated with all the other predictors except x4. Similarly, x8 to x10 are also strongly correlated. This is a strong indication of the presence of the multicollinearity problem.

Define C = (X'X)⁻¹. Then it can be proved, refer to Montgomery, et. al. (2003), that the jth diagonal element of C can be written as C_jj = 1/(1 − R_j²), where R_j² is the coefficient of determination obtained by regressing x_j, treated as the output, on all the other covariates. Now, if the variable x_j is independent of all the other covariates, we expect this coefficient of determination to be zero, and hence C_jj to be close to unity. However, if the covariate depends on the others, we expect the coefficient of determination to be a high value, and hence C_jj to be a large number. The quantity C_jj is also called the variance inflation factor, denoted by VIF_j. A general guideline for a covariate to be considered linearly independent of the other covariates is that its VIF_j should be less than 5 or 10.

Time for action – addressing the multicollinearity problem for the Gasoline data
The multicollinearity problem is addressed using the vif function, which is available from two libraries: car and faraway. We will use it from the faraway package.
1. Load the faraway package with library(faraway).
2. We need to find the variance inflation factor (VIF) of the independent variables only. The covariate x11 is a character variable, and the first column of the Gasoline dataset is the regressand. Hence, run vif(Gasoline[,-c(1,12)]) to find the VIF of the eligible independent variables.
3. The VIF for x3 is the highest at 217.587. Hence, we remove it and find the VIF among the remaining variables with vif(Gasoline[,-c(1,4,12)]). Remember that x3 is the fourth column in the Gasoline data frame.
4. In the previous step, we find x10 having the maximum VIF at 77.810. Now, run vif(Gasoline[,-c(1,4,11,12)]) to check if all the VIFs are less than 10.
5. For the first variable, x1, the VIF is 31.956, and we now remove it with vif(Gasoline[,-c(1,2,4,11,12)]).
6. At the end of the previous step, we have the VIF of x2 at 10.383. Thus, run vif(Gasoline[,-c(1,2,3,4,11,12)]).
7. Now, all the independent variables have a VIF of less than 10. Hence, we stop at this step.
8. Removing all the independent variables with VIF greater than 10, we arrive at the final model, summary(lm(y~x4+x5+x6+x7+x8+x9,data=Gasoline)).

Figure 13: Addressing the multicollinearity problem for the Gasoline dataset

What just happened?
We used the vif function from the faraway package to overcome the problem of multicollinearity in the multiple linear regression model. This helped to reduce the number of independent variables from 10 to 6, which is a huge 40 percent reduction. The function vif from the faraway package is applied to the set of covariates. Indeed, there is another function of the same name from the car package, which can be directly applied on an lm object.
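As a quick check of the VIF_j = 1/(1 − R_j²) relation above, here is a minimal sketch for x1 (it assumes the same Gasoline indexing, that is, the regressand in column 1 and the factor covariate in column 12); the value should agree with the one reported by faraway's vif:

r2_x1 <- summary(lm(x1 ~ ., data = Gasoline[, -c(1, 12)]))$r.squared  # regress x1 on the other numeric covariates
1 / (1 - r2_x1)                                                       # the VIF of x1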
Model selection
The method of removal of covariates in The multicollinearity problem section depended solely on the covariates themselves. However, it happens more often that the covariates in the final model are selected with respect to the output. Computational cost is almost a non-issue these days, especially for not-so-very-large datasets! The question that arises then is, can one retain all the possible covariates in the model, or do we have some choice of covariates which meets a certain regression metric, say R² > 60 percent? The problem is that having more covariates increases the variance of the model, while having fewer of them leads to a larger bias. The philosophical Occam's razor principle applies here, and the best model is the simplest model. In our context, the smallest model which fits the data is the best. There are two types of model selection: stepwise procedures and criterion-based procedures. In this section, we will consider both of these procedures.

Stepwise procedures
There are three methods of selecting covariates for inclusion in the final model: backward elimination, forward selection, and stepwise regression. We will first describe the backward elimination approach and develop the R function for it.

The backward elimination
In this method, one first begins with all the available covariates. Suppose we wish to retain all the covariates for which the p-value is at most α. The value α is referred to as the critical alpha. Now, we first eliminate the covariate whose p-value is the maximum among all the covariates having a p-value greater than α. The model is refitted with the remaining covariates. We continue the process until all the retained covariates have a p-value less than α. In summary, the backward elimination algorithm is as explained next:
1. Consider all the available covariates.
2. Remove the covariate with the maximum p-value among all the covariates which have a p-value greater than α.
3. Refit the model and go to the previous step.
4. Continue the process until all the p-values are less than α.
Typically, the user investigates the p-values in the summary output and then carries out the preceding algorithm. Tattar, et. al. (2013) gives a function which executes the entire algorithm right away, and we adapt the same function here and apply it on the linear regression model gasoline_lm.

The forward selection
In the previous procedure, we started with all the covariates. Here, we begin with an empty model and look for the most significant covariate with a p-value less than α. That is, if there are k candidate covariates, we build k new single-covariate linear models, the kth model using the kth covariate. Naturally, by "most significant" we mean that the p-value should be the least among all the covariates whose p-value is less than α. Then, we build the model with the selected covariate. A second covariate is selected by treating the previous model as the initial model. The model selection is continued until we fail to add any more covariates. This is summarized in the following algorithm:
1. Begin with an empty model.
2. For each covariate, obtain the p-value if it is added to the model. Select the covariate with the least p-value among all the covariates whose p-value is less than α.
3. Repeat the preceding step until no more covariates can be added to the model.
We again use the function created in Tattar, et. al. (2013) and apply it to the Gasoline problem.

There is yet another method of model selection. Here, we begin with the empty model. We add a covariate as in the forward selection step and then perform a backward elimination to remove any unwanted covariate. The forward and backward steps are then continued until we can neither add a new covariate nor remove an existing one. Of course, the critical alpha values for the forward and backward steps are specified distinctly. This method is called stepwise regression. This method is, however, skipped here for the purpose of brevity.
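Base R also ships helpers that report exactly the information a single backward or forward step needs; a minimal sketch (it assumes gasoline_lm and the Gasoline data from earlier, and the choice of x1, x2, and x3 in the forward scope is only for illustration):

drop1(gasoline_lm, test = "F")                                         # effect of deleting each term from the full model
add1(lm(y ~ 1, data = Gasoline), scope = ~ x1 + x2 + x3, test = "F")   # effect of adding terms to an empty model

The functions defined in the next Time for action automate the repeated application of exactly this kind of step.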
Criterion-based procedures
A useful tool for the model selection problem is to evaluate all the possible models and select one of them according to a certain criterion. The Akaike Information Criterion (AIC) is one such criterion which can be used to select the best model. Let log L(β̂_0, β̂_1, ..., β̂_p, σ̂² | y) denote the log-likelihood function of the fitted regression model. Define K = p + 2, which is the total number of estimated parameters. The AIC for the fitted regression model is given by:

AIC = −2 log L(β̂_0, β̂_1, ..., β̂_p, σ̂² | y) + 2K

Now, the model which has the least AIC among the candidate models is the best model. The step function available in R gets the job done for us, and we will close the chapter with the continued illustration of the Gasoline problem.

Time for action – model selection using the backward, forward, and AIC criteria
For the forward and backward selection procedures under the stepwise procedures of the model selection problem, we first define two functions: backwardlm and forwardlm. For the criterion-based model selection, say AIC, we use the step function, which can be applied to fitted linear models.
1. Create a function pvalueslm which extracts the p-values related to the covariates of an lm object:
pvalueslm <- function(lm) {summary(lm)$coefficients[-1,4]}
2. Create a backwardlm function defined as follows:
backwardlm <- function(lm,criticalalpha) {
  lm2 <- lm
  while(max(pvalueslm(lm2))>criticalalpha) {
    lm2 <- update(lm2,paste(".~.-",attr(lm2$terms,"term.labels")[which(pvalueslm(lm2)==max(pvalueslm(lm2)))],sep=""))
  }
  return(lm2)
}
The code needs to be explained in more detail. There are two new functions created here for the implementation of the backward elimination procedure. Let us have a detailed look at them. The function pvalueslm extracts the p-values related to the covariates of an lm object. The choice of summary(lm)$coefficients[-1,4] is vital, as we are interested in the p-values of the covariates and not the intercept term. The p-values are available once the summary function is applied on the lm object. Now, let us focus on the backwardlm function. Its arguments are the lm object and the value of the critical α. Our goal is to carry out the iterations until we no longer have any covariate with a p-value greater than α. Thus, we use the while function, which is typical of such algorithms, where the stopping condition of the final step is checked at the beginning of each iteration. We want our function to work for all linear models and not just gasoline_lm, and so we need to get the names of the covariates specified in the lm object. Remember, we conveniently used the formula lm(y~.), and this will come back to haunt us! Thankfully, attr(lm2$terms, "term.labels") extracts all the covariate names of an lm object. The index [which(pvalueslm(lm2)==max(pvalueslm(lm2)))] identifies the covariate which has the maximum p-value above α. Next, paste(".~.-", ..., sep="") returns the formula which removes the unwanted covariate. The explanation of the formula is lengthier than the function itself, which is not surprising, as R is object-oriented and a few lines of code do more than detailed prose.
3. Obtain the efficient linear regression model by applying the backwardlm function, with the critical alpha at 0.20, on the gasoline_lm object:
gasoline_lm_backward <- backwardlm(gasoline_lm,criticalalpha=0.20)
4. Find the details of the final model obtained in the previous step:
summary(gasoline_lm_backward)
The output as a result of applying the backward selection algorithm is the following:

Figure 14: The backward selection model for the Gasoline problem

5. The forwardlm function is given by:
forwardlm <- function(y,x,criticalalpha) {
  yx <- data.frame(y,x)
  mylm <- lm(y~-.,data=yx)
  avail_cov <- attr(mylm$terms,"dataClasses")[-1]
  minpvalues <- 0
  while(minpvalues

The Pr(>|t|) associated with the Sat variable is 0.00018, which is again significant. However, the predicted values for SAT-M marks of 400 and 700 are respectively seen to be -0.4793 and 1.743. The problem with the model is that the predicted values can be negative as well as greater than 1. It is essentially these limitations which restrict the use of the linear regression model when the regressand is a binary outcome.

What just happened?
We used the simple linear regression model for the probability prediction of a binary outcome and observed that the probabilities are not bound in the unit interval [0, 1], as they are expected to be. This shows that we need special/different statistical models for understanding the relationship between the covariates and the binary output. We will use two regression models that are appropriate for a binary regressand: probit regression and logistic regression. The former model will continue the use of the normal distribution for the error through a latent variable, whereas the latter uses the binomial distribution and is a popular member of the more generic generalized linear models.

Probit regression model
The probit regression model is constructed as a latent variable model. Define a latent variable, also called an auxiliary random variable, Y* as follows:

Y* = X'β + ε

which is the same as the earlier linear regression model with Y replaced by Y*. The error term ε is assumed to follow a normal distribution N(0, σ²). Then Y can be considered as 1 if the latent variable is positive, that is:

Y = 1, if Y* > 0 (equivalently, X'β > −ε)
Y = 0, otherwise

Without loss of generality, we can assume that ε ~ N(0, 1). Then, the probit model is obtained by:

P(Y = 1 | X) = P(Y* > 0) = P(ε > −X'β) = P(ε < X'β) = Φ(X'β)

The method of maximum likelihood estimation is used to determine β. For a random sample of size n, the log-likelihood function is given by:

log L(β) = Σ_{i=1}^n [ y_i log Φ(x_i'β) + (1 − y_i) log(1 − Φ(x_i'β)) ]

Numerical optimization techniques can be deployed to find the MLE of β. Fortunately, we don't have to undertake this daunting task, and R helps us out with the glm function. Let us fit the probit model for the Sat dataset seen earlier.

Time for action – understanding the constants
The probit regression model is built for the Pass variable as a function of the Sat score using the glm R function and the argument binomial(probit).
1. Using the glm function and the binomial(probit) option, we can fit the probit model for Pass as a function of the Sat score:
pass_probit <- glm(Pass~Sat,data=sat,binomial(probit))
However, certain pseudo-R2 measures are available and we will use pR2 function from the pscl package. This package has been developed at the Political Science Computational Laboratory, Stanford University, which explains the name of the package as pscl. 3. Load the pscl package with library(pscl), and apply the pR2 function on pass_probit to obtain the measures of pseudo R2. Finally, we check how the probit model overcomes the problems posed by application of the linear regression model. 4. Find the probability of passing the course for students with a SAT-M score of 400 and 700 with the following code: predict(pass_probit,newdata=list(Sat=400),type = "response") predict(pass_probit,newdata=list(Sat=700),type = "response") The following picture is the screenshot of R action: Figure 2: The probit regression model for SAT problem [ 205 ] The Logistic Regression Model The Pr(>|z|) for Sat is 0.0052, which shows that the variable has a significant say in explaining whether the student successfully completes the course or not. The regression coefficient value for the Sat variable indicates that if the Sat variable increases by one mark, the influence on the probit link increases by 0.0334. In easy words, the SAT-M variable has a positive impact on the probability of success for the student. Next, the pseudo R2 value of 0.3934 for the McFadden metric indicates that approximately 39.34 percent of the output is explained by the Sat variable. This appears to suggest that we need to collect more information about the students. That is, the experimenter may try to get information on how many hours did the student spend exclusively for the course/examination, the students' attendance percentages, and so on. However, the SAT-M score, which may have been obtained nearly two years before the final exam of the course, continues to have a good explanatory power! Finally, it may be seen that the probability of completing the course for students with SAT-M scores of 400 and 700 are respectively 2.019e-06 and 1. It is important for the reader to note the importance of the type = "response" option. More details may be obtained running ?predict.glm at the R terminal. What just happened? The probit regression model is appropriate for handling the binary outputs and is certainly much more appropriate than the simple linear regression model. The reader learned how to build the probit model using the glm function which is in fact more versatile as will be seen in the rest of the chapter. The prediction probabilities were also seen to be in the range of 0 to 1. The glm function can be conveniently used for more than one covariate. In fact, the formula structure of glm remains the same as lm. Model-related issues have not been considered in full details till now. The reason being that there is more interest in the logistic regression model, as it will be focus for the rest of the chapter, and the logic does not change. In fact we will return to the probit model diagnostics in parallel with the logistic regression model. Logistic regression model The binary outcomes may be easily viewed as failures or successes, and we have done the same on many earlier occasions. Typically, it is then common to assume that we have a binomial distribution for the probability of an observation to be successful. The logistic regression model specifies the linear effect of the covariates as a specific function of the probability of success. 
Logistic regression model
The binary outcomes may be easily viewed as failures or successes, and we have done the same on many earlier occasions. Typically, it is then common to assume that we have a binomial distribution for the probability of an observation being successful. The logistic regression model specifies the linear effect of the covariates through a specific function of the probability of success. The probability of success for an observation is denoted by π(x) = P(Y = 1), and the model is specified through the logistic function:

π(x) = e^(β_0 + β_1 x_1 + ... + β_p x_p) / (1 + e^(β_0 + β_1 x_1 + ... + β_p x_p))

The choice of this function is made for fairly good reasons. Define w = β_0 + β_1 x_1 + ... + β_p x_p. Then, it may be easily seen that π(x) = e^w / (1 + e^w) = 1 / (1 + e^(−w)). Thus, as w decreases towards negative infinity, π(x) approaches 0, and as w increases towards infinity, π(x) approaches 1. For w = 0, π(x) takes the value 0.5. The ratio of the probability of success to that of failure is known as the odds ratio, denoted by OR, and following some simple arithmetic steps, it may be shown that:

OR = π(x) / (1 − π(x)) = e^(β_0 + β_1 x_1 + ... + β_p x_p)

Taking the logarithm of the odds ratio gets us:

log OR = log[ π(x) / (1 − π(x)) ] = β_0 + β_1 x_1 + ... + β_p x_p

And thus, we finally see the logarithm of the odds ratio as a linear function of the covariates. It is actually the second term, log(π(x_i) / (1 − π(x_i))), which has the form of a logit function, that this model derives its name from. The log-likelihood function based on the data (y_1, x_1), (y_2, x_2), ..., (y_n, x_n) is then:

log L(β) = Σ_{i=1}^n y_i Σ_{j=0}^p β_j x_ij − Σ_{i=1}^n log( 1 + e^(Σ_{j=0}^p β_j x_ij) )

The preceding expression is indeed a bit too complex in nature to obtain an explicit form for an estimate of β. Indeed, a specialized algorithm is required here, and it is known as the iteratively reweighted least-squares (IRLS) algorithm. We will not go into the details of the algorithm and refer the readers to an online paper by Scott A. Czepiel, available at http://czep.net/stat/mlelr.pdf. A raw R implementation of IRLS is provided in Chapter 19 of Tattar, et. al. (2013). For our purpose, we will be using the solution as provided by the glm function. Let us now fit the logistic regression model for the SAT-M dataset considered hitherto.

Time for action – fitting the logistic regression model
The logistic regression model is built using the glm function with the family = 'binomial' option. We will obtain the pseudo-R² values using the pR2 function from the pscl package.
1. Fit the logistic regression model for Pass as a function of Sat using the option family = 'binomial' in the glm function:
pass_logistic <- glm(Pass~Sat,data=sat,family = 'binomial')
2. The details of the fitted logistic regression model are obtained using the summary function: summary(pass_logistic). In the summary, you will see two statistics called Null deviance and Residual deviance. In general, a deviance is a measure useful for assessing the goodness-of-fit, and for the logistic regression model it plays a role analogous to that of the residual sum of squares for the linear regression model. The null deviance is the measure for a model that is built without using any covariate information, such as Sat, and thus we would expect such a model to have a large value. If the Sat variable is influencing Pass, we expect the residual deviance of the fitted model to be significantly smaller than the null deviance. If the residual deviance is significantly smaller than the null deviance, we conclude that the covariates have significantly improved the model fit.
3. Find the pseudo-R² with pR2(pass_logistic) from the pscl package.
4. The overall model significance of the fitted logistic regression model is obtained with:
with(pass_logistic, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
The p-value is 0.0001496, which shows that the model is indeed significant. The p-value Pr(>|z|) for the Sat covariate is 0.011, which means that this variable is indeed valuable for understanding Pass. The estimated regression coefficient for Sat of 0.0578 indicates that an increase of a single mark increases the log-odds of the candidate passing the course by 0.0578. A brief explanation of this R code! It may be seen from the output following summary.glm(pass_logistic) that we have all the terms null.deviance, deviance, df.null, and df.residual. So, the with function extracts all these terms from the pass_logistic object and finds the p-value using the pchisq function, based on the difference between the deviances (null.deviance - deviance) and the correct degrees of freedom (df.null - df.residual).

Figure 3: Logistic regression model for the Sat dataset

5. The confidence intervals for the regression coefficients, with a default 95 percent requirement, are extracted using the confint function: confint(pass_logistic). The ranges of the 95 percent confidence intervals do not contain 0, and hence we conclude that the intercept term and the Sat variable are both significant.
6. The predictions for the unknown scores are obtained as in the probit regression model:
predict.glm(pass_logistic,newdata=list(Sat=400),type = "response")
predict.glm(pass_logistic,newdata=list(Sat=700),type = "response")
7. Let us compare the logistic and the probit model. Consider a sequence of hypothetical SAT-M scores: sat_x = seq(400,700,10). For the new sequence sat_x, we predict the probability of course completion using both the pass_logistic and pass_probit models and visualize whether their predictions are vastly different:
pred_l <- predict(pass_logistic,newdata=list(Sat=sat_x), type="response")
pred_p <- predict(pass_probit,newdata=list(Sat=sat_x), type="response")
plot(sat_x,pred_l,type="l",ylab="Probability",xlab="Sat_M")
lines(sat_x,pred_p,lty=2)
The prediction says that a candidate with a SAT-M score of 400 is very unlikely to complete the course successfully, while one with a SAT-M score of 700 is almost guaranteed to complete it. Predictions with probabilities close to 0 or 1 need to be taken with a bit of caution, since we rarely have enough observations at the boundaries of the covariates.

Figure 4: Prediction using the logistic regression

What just happened?
We fitted our first logistic regression model and viewed the various measures which tell us whether the fitted model is a good model or not. Next, we learned how to interpret the estimated regression coefficients and also had a peek at the pseudo-R² value. The importance of confidence intervals was also emphasized. Finally, the model was used to make predictions for some unobserved SAT-M scores too.

Hosmer-Lemeshow goodness-of-fit test statistic
We may be satisfied with the analysis thus far, but there is always a lot more that we can do. The testing hypothesis problem here is of the form H_0: E(Y) = e^(Σ_{j=0}^p β_j x_ij) / (1 + e^(Σ_{j=0}^p β_j x_ij)) versus H_1: E(Y) ≠ e^(Σ_{j=0}^p β_j x_ij) / (1 + e^(Σ_{j=0}^p β_j x_ij)). An answer to this hypothesis testing problem is provided by the Hosmer-Lemeshow goodness-of-fit test statistic.
The steps of the construction of this test statistic are discussed first:
1. Order the fitted values using the sort and fitted functions.
2. Group the fitted values into g classes; the preferred values of g vary between 6 and 10.
3. Find the observed and expected numbers in each group.
4. Perform a chi-square goodness-of-fit test on these groups. That is, denote by O_jk the number of observations of class k, k = 0, 1, in the group j, j = 1, 2, ..., g, and by E_jk the corresponding expected numbers. The chi-square test statistic is then given by:

χ² = Σ_{j=1}^g Σ_{k=0}^1 (O_jk − E_jk)² / E_jk

And it may be proved that under the null hypothesis, χ² ~ χ²_{g−2}. We will use an R program available at http://sas-and-r.blogspot.in/2010/09/example-87-hosmer-and-lemeshow-goodness.html. It is important to note here that when we use code available on the web, we verify and understand that such code is indeed correct.

Time for action – the Hosmer-Lemeshow goodness-of-fit statistic
The Hosmer-Lemeshow goodness-of-fit statistic for logistic regression is one of the very important metrics for evaluating a logistic regression model. The hosmerlem function from the preceding web link will be used for the pass_logistic regression model.
1. Extract the fitted values for the pass_logistic model with pass_hat

2*(length(pass_logistic$coefficients)-1)/length(pass_logistic$y)

4. An observation is considered to have a great influence on the parameter estimates if the Cook's distance, as given by cooks.distance, is greater than the 10 percent quantile of the F_{p+1, n−(p+1)} distribution, and it is considered highly influential if it exceeds the 50 percent quantile of the same distribution. In terms of an R program, we need to execute:
cooks.distance(pass_logistic)>qf(0.1,length(pass_logistic$coefficients),length(pass_logistic$y)-length(pass_logistic$coefficients))
cooks.distance(pass_logistic)>qf(0.5,length(pass_logistic$coefficients),length(pass_logistic$y)-length(pass_logistic$coefficients))

Figure 8: Identifying the outliers

The previous screenshot shows that there are eight high leverage points. We also see that at the 10 percent quantile of the F-distribution we have two influential points, whereas we don't have any highly influential points.
5. Use the plot function to identify the influential observations suggested by the DFFITS and DFBETAS measures:
par(mfrow=c(1,3))
plot(dfbetas(pass_logistic)[,1],ylab="DFBETAS - INTERCEPT")
plot(dfbetas(pass_logistic)[,2],ylab="DFBETAS - SAT")
plot(dffits(pass_logistic),ylab="DFFITS")

Figure 9: DFFITS and DFBETAS for the logistic regression model

As with the linear regression model, the DFFITS and DFBETAS are measures of the influence of the observations on the regression coefficients. The thumb rule for the DFBETAS is that if their absolute value exceeds 1, the observations have a significant influence on the covariates. In our case, none of the values do, and we conclude that we do not have influential observations. The interpretation of DFFITS is left as an exercise.

What just happened?
We adapted the influence measures to the context of generalized linear models, and especially to the context of logistic regression.

Have a go hero
The influence and leverage measures were executed on the logistic regression model, the pass_logistic object in particular. You also have the pass_probit object!
Repeat the entire exercise of hatvalues, cooks.distance, dffits, and dfbetas on the pass_probit fitted probit model and draw your inferences.

Receiver operator characteristic curves
In the binary classification problem, we have certain scenarios where the comparison between the predicted and the actual class is of great importance. For example, there is a genuine problem in the banking industry of identifying fraudulent transactions against non-fraudulent transactions. There is another problem of sanctioning loans to customers who may successfully repay the entire loan versus customers who will default at some stage during the loan tenure. Given the historical data, we will build a classification model, for example the logistic regression model. Now, with the logistic regression model, or any other classification model for that matter, if the predicted probability is greater than 0.5, the observation is predicted as a success, and as a failure otherwise. We remind ourselves again that success/failure is defined according to the experiment. At least with the data on hand, we know the true labels of the observations, and hence a comparison of the true labels with the predicted labels makes a lot of sense. In an ideal scenario, we expect the predicted labels to match perfectly with the actual labels, that is, whenever the true label stands for success/failure, the predicted label is also success/failure. However, in a real scenario this is rarely the case. This means that there are some observations which are predicted as success/failure when the true labels are actually failure/success. In other words, we make mistakes! It is possible to put these notes in the form of a table widely known as the confusion matrix.

                         Observed
Predicted      Success                Failure
Success        True Positive (TP)     False Positive (FP)
Failure        False Negative (FN)    True Negative (TN)

Table 1: The confusion matrix

Each cell contains the count of the corresponding cases. It may be seen from the preceding table that the diagonal cells (TP and TN) are the correct predictions made by the model, whereas the off-diagonal cells (FP and FN) are the mistakes. The following metrics may be considered for comparison across multiple models:

Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)

However, it is known that these metrics have a lot of limitations, and a more robust approach is required. The answer is provided by the receiver operator characteristic (ROC) curve. We need two important metrics for the construction of an ROC. The true positive rate (tpr) and the false positive rate (fpr) are respectively defined by:

tpr = TP / (TP + FN), fpr = FP / (TN + FP)

The ROC graphs are constructed by plotting the tpr against the fpr. We will now explain this in detail. Our approach will be to explain the algorithm in an Action framework.
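Before the step-by-step ROC construction, here is a minimal sketch of the confusion matrix and the metrics defined above, using small hypothetical vectors of labels and predicted probabilities (none of these numbers come from the models fitted earlier):

obs  <- c(1, 0, 1, 1, 0, 0, 1, 0)                    # hypothetical true labels
prob <- c(0.9, 0.4, 0.6, 0.2, 0.1, 0.7, 0.8, 0.3)    # hypothetical predicted probabilities
pred <- as.numeric(prob >= 0.5)                      # predictions at the 0.5 cut-off
tab  <- table(Predicted = pred, Observed = obs)      # the confusion matrix
accuracy  <- sum(diag(tab)) / sum(tab)
precision <- tab["1", "1"] / sum(tab["1", ])
recall    <- tab["1", "1"] / sum(tab[, "1"])         # recall is also the tpr
fpr       <- tab["1", "0"] / sum(tab[, "0"])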
Time for action – ROC construction
A simple dataset is considered, and the ROC construction is explained in a very simple step-by-step approach:
1. Suppose that the predicted probabilities of n = 10 observations are 0.32, 0.62, 0.19, 0.75, 0.18, 0.18, 0.95, 0.79, 0.24, and 0.59. Create a vector of them as follows:
pred_prob <- c(0.32, 0.62, 0.19, 0.75, 0.18, 0.18, 0.95, 0.79, 0.24, 0.59)
2. Sort the predicted probabilities in decreasing order:
> (pred_prob=sort(pred_prob,decreasing=TRUE))
[1] 0.95 0.79 0.75 0.62 0.59 0.32 0.24 0.19 0.18 0.18
3. Normalize the predicted probabilities of the preceding step to the unit interval:
> pred_prob <- (pred_prob-min(pred_prob))/(max(pred_prob)-min(pred_prob))
> pred_prob
[1] 1.00000 0.79221 0.74026 0.57143 0.53247 0.18182 0.07792 0.01299 0.00000 0.00000
Now, at each percentage of the previously sorted probabilities, we commit false positives as well as false negatives. Thus, we want to check, at each part of our prediction percentiles, the quantum of tpr and fpr. Since ten points are very few, we now consider a dataset of predicted probabilities and the true labels.
4. Load the illustrative dataset from the RSADBE package with data(simpledata).
5. Set up the threshold vector with threshold <- seq(1,0,-0.01).
6. Find the number of positive (success) and negative (failure) cases in the dataset with P <- sum(simpledata$Label==1) and N <- sum(simpledata$Label==0).
7. Initialize the fpr and tpr with tpr <- fpr <- threshold*0.
8. Set up the following loop, which computes the tpr and fpr at each point of the threshold vector:
for(i in 1:length(threshold)) {
  FP=TP=0
  for(j in 1:nrow(simpledata)) {
    if(simpledata$Predictions[j]>=threshold[i]) {
      if(simpledata$Label[j]==1) TP=TP+1 else FP=FP+1
    }
  }
  tpr[i]=TP/P
  fpr[i]=FP/N
}
9. Plot the tpr against the fpr with:
plot(fpr,tpr,"l",xlab="False Positive Rate",ylab="True Positive Rate",col="red")
abline(a=0,b=1)

Figure 10: An ROC illustration

The diagonal line represents the performance of a random classifier, in that it simply says "Yes" or "No" without looking at any characteristic of an observation. Any good classifier must sit, or rather be displayed, above this line. The classifier here, albeit an unknown one, appears to be a much better classifier than the random one. The ROC curve is useful for comparing competing classifiers in the sense that if one classifier is always above another, we select the former. An excellent introductory exposition of ROC curves is available at http://ns1.ce.sharif.ir/courses/90-91/2/ce725-1/resources/root/Readings/Model%20Assessment%20with%20ROC%20Curves.pdf.

What just happened?
The construction of the ROC has been demystified! The preceding program is a very primitive one. In the later chapters, we will use the ROCR package for the construction of the ROC. We will next look at a real-world problem.

Logistic regression for the German credit screening dataset
Millions of applications are made to a bank for a variety of loans! The loan may be a personal loan, a home loan, a car loan, and so forth. From a bank's perspective, loans are an asset for them, as obviously the customer pays them interest and over a period of time the bank makes a profit. If all the customers promptly pay back their loan amount, all their tenure equated monthly installments (EMI) or the complete amount on preclosure of the principal amount, there is only money to be made. Unfortunately, it is not always the case that the customers pay back the entire amount. In fact, the fraction of people who do not complete the loan duration may also be very small, say about five percent. However, a bad customer may take away the profits of maybe 20 or more customers. In this hypothetical case, the bank eventually makes more losses than profit, and this may eventually lead to its own bankruptcy. Now, a loan application form seeks a lot of details about the applicant.
The data from these details in the application can help the bank build appropriate classifiers, such as a logistic regression model, and make predictions about which customers are most likely to turn out to be fraudulent. The customers who have been predicted to default in the future are then declined the loan. A real dataset of 1,000 customers who had borrowed a loan from a bank is available on the web at http://www.stat.auckland.ac.nz/~reilly/credit-g.arff and http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data). This data has been made available by Prof. Hofmann, and it contains details on 20 variables related to the customer. It is also known whether the customers defaulted or not. A detailed analysis of the dataset using R has been done by Sharma, and his very useful document can be downloaded from cran.r-project.org/doc/contrib/SharmaCreditScoring.pdf. The variables are described in the following table:

No  Variable   Characteristic  Description
1   checking   integer         Status of existing checking account
2   duration   integer         Duration in month
3   history    integer         Credit history
4   purpose    factor          Purpose
5   amount     numeric         Credit amount
6   savings    integer         Savings account/bonds
7   employed   integer         Present employment since
8   installp   integer         Installment rate in percentage of disposable income
9   marital    integer         Personal status and sex
10  coapp      integer         Other debtors/guarantors
11  resident   integer         Present residence since
12  property   factor          Property
13  age        numeric         Age in years
14  other      integer         Other installment plans
15  housing    integer         Housing
16  existcr    integer         Number of existing credits at this bank
17  job        integer         Job
18  depends    integer         Number of people being liable to provide maintenance for
19  telephon   integer         Telephone
20  foreign    integer         Foreign worker
21  good_bad   factor          Loan defaulter
22  default    integer         good_bad in numeric

We have the German credit dataset with us in the GC data from the RSADBE package. Let us build a classifier for identifying the good customers from the bad ones.

Time for action – logistic regression for the German credit dataset
The logistic regression model will be built as a credit application scoring model, and an ROC curve will be fit to evaluate the model.
1. Invoke the ROCR library with library(ROCR).
2. Get the German credit dataset in your current session with data(GC).
3. Build the logistic regression model for good_bad with GC_LR <- glm(good_bad~.,data=GC,family=binomial()).
4. Run summary(GC_LR) and identify the significant variables. Also answer the question of whether the model is significant.
5. Get the predictions using the predict function:
LR_Pred <- predict(GC_LR,type='response')
6. Use the prediction function from the ROCR package to set up a prediction object:
GC_pred <- prediction(LR_Pred,GC$good_bad)
The function prediction sets up the different manipulations and computations required for constructing the ROC curve. Get more details related to it with ?prediction.
7. Set up the performance vector required to obtain the ROC curve with GC_perf

break1 <- X[which(X>=12 & X<=18)]
break2 <- X[which(X>=27 & X<=33)]
4. Get the number of points that are candidates for being the breakpoints with n1

=break1[i] & X =break2[j]))
MSE_MAT[curriter,3] <- as.numeric(summary(piecewise1)[6])
} }
Note the use of the formula ~ in the specification of the piecewise linear regression model.
6. The time has arrived to find the pair of breakpoints:
MSE_MAT[which(MSE_MAT[,3]==min(MSE_MAT[,3])),]
The pair of breakpoints is hence (14.000, 30.000). Let us now look at how good the model fit is!
7. First, reobtain the scatter plot with plot(PW_Illus). Fit the piecewise linear regression model with breakpoints at (14, 30) with pw_final <- lm(Y ~ X*(X<14)+X*(X>=14 & X<30)+X*(X>=30)). Add the fitted values to the scatter plot with points(PW_Illus$X,pw_final$fitted.values,col="red"). Note that the fitted values are a very good reflection of the original data values (Figure 4 (B)). The fact that linear models can be extended to such different scenarios makes it very promising to study this in even more detail, as will be seen in the later part of this section.

Figure 4: Scatterplot of a dataset (A) and the fitted values using the piecewise linear regression model (B)

What just happened?
The piecewise linear regression model has been explored for a hypothetical scenario, and we investigated how to identify breakpoints by using the criterion of the mean residual sum of squares. The piecewise linear regression model shows a useful flexibility, and it is indeed a very useful model when there is a genuine reason to believe that there are certain breakpoints in the model. This has some advantages and certain limitations too. From a technical perspective, the model is not continuous, whereas from an applied perspective, the model poses the problems of guessing the breakpoint values and of extending to multi-dimensional cases. It is thus required to look for a more general framework, where we need not be bothered about these issues. Some answers are provided in the following sections.

Natural cubic splines and the general B-splines
We will first consider the polynomial regression splines model. As noted in the previous discussion, we have a lot of discontinuity in the piecewise regression model. In some sense, "greater continuity" can be achieved by using cubic functions of x and then constructing regression splines in what are known as "piecewise cubics"; see Berk (2008), Section 2.2. Suppose that there are K data points at which we require the knots. Suppose that the knots are located at the points ξ_1, ξ_2, ..., ξ_K, which lie between the boundary points ξ_0 and ξ_{K+1}, such that ξ_0 < ξ_1 < ξ_2 < ... < ξ_K < ξ_{K+1}. The piecewise cubic polynomial regression model is given as follows:

Y = β_0 + β_1 X + β_2 X² + β_3 X³ + Σ_{j=1}^K θ_j (X − ξ_j)³_+ + ε

Here, the function (.)_+ indicates that only positive values of the argument are accepted, with the cube power then applied; that is:

(X − ξ_j)³_+ = (X − ξ_j)³ if X > ξ_j, and 0 otherwise

For this model, the K + 4 basis functions are as follows:

h_1(X) = 1, h_2(X) = X, h_3(X) = X², h_4(X) = X³, h_{j+4}(X) = (X − ξ_j)³_+, j = 1, ..., K

We will now consider an example from Montgomery, et. al. (2005), pages 231-3. It is known that the battery voltage drop in a guided missile motor has a different behavior as a function of time. The next screenshot displays the scatterplot of the battery voltage drop for different time points; see ?VD from the RSADBE package. We need to build a piecewise cubic regression spline for this dataset with knots at time t = 6.5 and t = 13 seconds, since it is known that the missile changes its course at these points.
If we denote the battery voltage drop by Y and the time by t, the model for this problem is then given as follows:

Y = β_0 + β_1 t + β_2 t² + β_3 t³ + θ_1 (t − 6.5)³_+ + θ_2 (t − 13)³_+ + ε

It is not possible within the math scope of this book to look into the details of the natural cubic spline regression models or the B-spline regression models. However, we can fit them by using the ns and bs options in the formula of the lm function, along with the knots at the appropriate places. These models will be built and their fit will be visualized too. Let us now fit the models!

Time for action – fitting the spline regression models
A natural cubic spline regression model will be fitted for the voltage drop problem.
1. Read the required dataset into R by using data(VD).
2. Invoke the graphics editor by using par(mfrow=c(1,2)).
3. Plot the data and give an appropriate title:
plot(VD)
title(main="Scatter Plot for the Voltage Drop")
4. Build the piecewise cubic polynomial regression model by using the lm function and related options:
VD_PRS <- lm(Voltage_Drop~Time+I(Time^2)+I(Time^3)+I(((Time-6.5)^3)*(sign(Time-6.5)==1))+I(((Time-13)^3)*(sign(Time-13)==1)),data=VD)
The sign function returns the sign of a numeric vector as 1, 0, or -1, according to whether the argument is positive, zero, or negative respectively. The operator I is an inhibit-interpretation operator, in that its argument is taken as is; check ?I. This operator is especially useful in data.frame and in the formula interface of R.
5. To obtain the fitted plot along with the scatterplot, run the following code:
plot(VD)
points(VD$Time,fitted(VD_PRS),col="red","l")
title("Piecewise Cubic Polynomial Regression Model")

Figure 5: Voltage drop data - scatter plot and a cubic polynomial regression model

6. Obtain the details of the fitted model with summary(VD_PRS). The R output is given in the next screenshot. The summary output shows that each of the basis functions is indeed significant here.

Figure 6: Details of the fitted piecewise cubic polynomial regression model

7. Fit the natural cubic spline regression model using the ns option (the ns and bs functions are provided by the splines package, which may need to be loaded first with library(splines)); note that ns fits cubic splines by construction and does not take a degree argument:
VD_NCS <- lm(Voltage_Drop~ns(Time,knots=c(6.5,13),intercept=TRUE),data=VD)
8. Obtain the fitted plot as follows:
par(mfrow=c(1,2))
plot(VD)
points(VD$Time,fitted(VD_NCS),col="green","l")
title("Natural Cubic Regression Model")
9. Obtain the details related to VD_NCS with the summary function: summary(VD_NCS); see Figure 8: Details of the natural cubic and B-spline regression models.
10. Fit the B-spline regression model by using the bs option:
VD_BS <- lm(Voltage_Drop~bs(Time,knots=c(6.5,13),intercept=TRUE,degree=3),data=VD)
11. Obtain the fitted plot for VD_BS with the R program:
plot(VD)
points(VD$Time,fitted(VD_BS),col="brown","l")
title("B-Spline Regression Model")

Figure 7: Natural Cubic and B-Spline Regression Modeling

12. Finally, get the details of the fitted B-spline regression model by using summary(VD_BS). The main purpose of the B-spline regression model is to illustrate that the splines are smooth at the boundary points, in contrast with the natural cubic regression model. This can be clearly seen in Figure 8: Details of the natural cubic and B-spline regression models.
Both the models, VD_NCS and VD_BS, have good summary statistics and have really modeled the data well.

Figure 8: Details of the natural cubic and B-spline regression models

What just happened?
We began with the fitting of a piecewise polynomial regression model and then had a look at the natural cubic spline regression and B-spline regression models. All three models provide a very good fit to the actual data. Thus, with a good guess or experimental/theoretical evidence, the linear regression model can be extended in an effective way.

Ridge regression for linear models
In Figure 3: Regression coefficients of polynomial regression models, we saw that the magnitude of the regression coefficients increases in a drastic manner as the polynomial degree increases. The right tweaking of the linear regression model, as seen in the previous section, gives us the right results. However, the models considered in the previous section had just one covariate, and the problem of identifying the knots in the multiple regression model becomes an overly complex issue. That is, if we have a problem where there is a large number of covariates, naturally there may be some dependency amongst them, which cannot be investigated for certain reasons. In such problems, it may happen that certain covariates dominate other covariates in terms of the magnitude of their regression coefficients, and this may mar the overall usefulness of the model. Furthermore, even in the univariate case, we have the problem that the choice of the number of knots, their placements, and the polynomial degree may be manipulated by the analyzer. We have an alternative to this problem in the way we minimize the residual sum of squares, min_β Σ_{i=1}^n e_i². The least-squares solution leads to the estimator of β:

β̂ = (X'X)⁻¹ X'Y

We saw in Chapter 6, Linear Regression Analysis, how to guard ourselves against outliers, the measures of model fit, and the model selection techniques. However, these methods come into action after the construction of the model, and hence, though they offer a certain protection against the problem of overfitting, we need more robust methods. The question that arises is, can we guard ourselves against overfitting while building the model itself? This will go a long way in addressing the problem. The answer is obviously an affirmative, and we will check out this technique. The least-squares solution is the optimal solution when we have the squared loss function. The idea then is to modify this loss function by incorporating a penalty term, which will give us additional protection against the overfitting problem. Mathematically, we add a penalty term for the size of the regression coefficients; in fact, the constraint ensures that the sum of squares of the regression coefficients is also minimized. Formally, the goal is to obtain an optimal solution of the following problem:

min_β [ Σ_{i=1}^n e_i² + λ Σ_{j=1}^p β_j² ]

Here, λ > 0 is the control factor, also known as the tuning parameter, and Σ_{j=1}^p β_j² is the penalty. If the λ value is zero, we get the earlier least-squares solution. Note that the intercept has been deliberately kept out of the penalty term! Now, for large values of Σ_{j=1}^p β_j², the penalized sum of squares will be large. Thus, loosely speaking, for Σ_{i=1}^n e_i² + λ Σ_{j=1}^p β_j² to attain its minimum value, we will require Σ_{j=1}^p β_j² to be small too. The optimal solution for the preceding minimization problem is given as follows:

β̂_Ridge = (X'X + λI)⁻¹ X'Y
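A minimal numerical sketch of this closed-form solution on simulated, centered-and-scaled data (the data, the value of λ, and the absence of an intercept are all assumptions of this illustration; lm.ridge and linearRidge, introduced next, handle the intercept and the scaling internally):

set.seed(1)
X <- scale(matrix(rnorm(50 * 3), 50, 3))            # 50 observations, 3 standardized covariates
y <- X %*% c(1, 0.5, -0.25) + rnorm(50)
lambda <- 2
beta_ridge <- solve(t(X) %*% X + lambda * diag(3)) %*% t(X) %*% y
beta_ols   <- solve(t(X) %*% X) %*% t(X) %*% y
sum(beta_ridge^2) < sum(beta_ols^2)                 # the penalty shrinks the coefficient vector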
The choice of λ is a critical one. There are multiple options to obtain it:

Find the value of λ by using the cross-validation technique (discussed in the last section of this chapter)
Find the value of λ by the semi-automated method described at http://arxiv.org/pdf/1205.0686.pdf

For the first technique, we can use the function lm.ridge from the MASS package, and the second method of semi-automatic detection is available from the linearRidge function of the ridge package. In the following R session, we use the functions lm.ridge and linearRidge.

Time for action – ridge regression for the linear regression model
The linearRidge function from the ridge package and lm.ridge from the MASS package are two good options for developing ridge regression models.
1. Though the OF object may still be there in your session, let us load it again by using data(OF).
2. Load the MASS and ridge packages by using library(MASS); library(ridge).
3. For a polynomial regression model of degree 3 and various values of lambda, including 0, 0.5, 1, 1.5, 2, 5, 10, and 30, obtain the ridge regression coefficients with the following single line of R code:
LR <- linearRidge(Y~poly(X,3),data=as.data.frame(OF),lambda=c(0,0.5,1,1.5,2,5,10,30))
LR
The function linearRidge from the ridge package performs the ridge regression for a linear model. We have two options. First, we specify the values of lambda, which may be either a scalar or a vector. In the case of a scalar lambda, it will simply return the set of (ridge) regression coefficients. If it is a vector, it returns the related sets of regression coefficients.
4. Compute the value of Σ_{j=1}^p β_j² for the different lambda values:
LR_Coef <- LR$coef
colSums(LR_Coef^2)
Note that as the lambda value increases, the value of Σ_{j=1}^p β_j² decreases. However, this is not to say that a higher lambda value is preferable, since the sum Σ_{j=1}^p β_j² will decrease to 0, and eventually none of the variables will have any significant explanatory power about the output. The selection of the lambda value will be discussed in the last section.
5. The linearRidge function also finds the "best" lambda value: linearRidge(Y~poly(X,3),data=as.data.frame(OF),lambda="automatic").
6. Fetch the details of the "best" ridge regression model with the following line of code: summary(linearRidge(Y~poly(X,3),data=as.data.frame(OF),lambda="automatic")). The summary shows that the value of lambda is chosen at 0.07881, and that it used three PCs. Now, what is a PC? PC is an abbreviation of principal component, and unfortunately we can't really go into the details of this aspect. Enthusiastic readers may refer to Chapter 17 of Tattar, et. al. (2013). Compare these results with those in the first section.
7. For the same choice of different lambda values, use the lm.ridge function from the MASS package:
LM <- lm.ridge(Y~poly(X,3),data=as.data.frame(OF),lambda=c(0,0.5,1,1.5,2,5,10,30))
LM
8. The lm.ridge function obviously works a bit differently from linearRidge. The results are given in the next image. Comparison of the results is left as an exercise to the reader. As with the linearRidge model, let us compute the value of Σ_{j=1}^p β_j² for the lm.ridge fitted models too.
9. Use the colSums function to get the required result:
LM_Coef <- LM$coef
colSums(LM_Coef^2)

Figure 09: A first look at the linear ridge regression
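MASS also reports some data-driven choices of lambda for an lm.ridge fit; a small sketch (it assumes the OF data used above, and relies on the select and plot methods that MASS provides for ridgelm objects):

LM_grid <- lm.ridge(Y~poly(X,3), data=as.data.frame(OF), lambda=seq(0, 30, 0.1))
select(LM_grid)   # prints the modified HKB and L-W estimates of lambda and the smallest-GCV value
plot(LM_grid)     # ridge traces of the coefficients as lambda varies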
However, we need to consider the multiple linear regression models and see how ridge regression helps us. To do this, we will return to the gasoline mileage considered in Chapter 6, Linear Regression Analysis. [ 246 ] Chapter 8 1. 2. Read the Gasoline data into R by using data(Gasoline). Fit the ridge regression model (and the multiple linear regression model again) for the mileage as a function of other variables: gasoline_lm <- lm(y~., data=Gasoline) gasoline_rlm <- linearRidge(y~., data=Gasoline,lambda= "automatic") 3. Compare the lm coefficients with the linearRidge coefficients: sum(coef(gasoline_lm)[-1]^2)-sum(coef(gasoline_rlm)[-1]^2) 4. Look at the summary of the fitted ridge linear regression model by using summary(gasoline_rlm). 5. The difference between the sum of squares of the regression coefficients for the linear and ridge linear model is indeed very large. Further more, the gasoline_ rlm details reveal that there are four variables, which have significant explanatory power for the mileage of the car. Note that the gasoline_lm model had only one significant variable for the car's mileage. The output is given in the following figure: Figure 10: Ridge regression for the gasoline mileage problem [ 247 ] Regression Models with Regularization What just happened? We made use of two functions, namely lm.ridge and linearRidge, for fitting ridge regression models for the linear regression model. It is observed that the ridge regression models may sometimes reveal more significant variables. In the next section, we will fit consider the ridge penalty for the logistic regression model. Ridge regression for logistic regression models We will not be able to go into the math of the ridge regression for the logistic regression model, though we will happily make good use of the logisticRidge function from the ridge package, to illustrate how to build the ridge regression for logistic regression model. For more details, we refer to the research paper of Cule and De Iorio (2012) available at http:// arxiv.org/pdf/1205.0686.pdf. In the previous section, we saw that gasoline_rlm found more significant variables than gasoline_lm. Now, in Chapter 7, Logistic Regression Model, we fit a logistic regression model for the German credit data problem in GC_LR. The question that arises is if we obtain a ridge regression model of the related logistic regression model, say GC_RLR, can we expect to find more significant variables? Time for action – ridge regression for the logistic regression model We will use the logisticRidge function here from the ridge package to fit the ridge regression, and check if we can obtain more significant variables. 1. 2. Load the German credit dataset with data(German). Use the logisticRidge function to obtain GC_RLR, a small manipulation required here, by using the following line of code: GC_RLR<-logisticRidge(as.numeric(good_bad)-1~.,data= as.data. frame(GC), lambda = "automatic") [ 248 ] Chapter 8 3. Obtain the summaries of GC_LR and GC_RLR by using summary(GC_LR) and summary(GC_RLR). The detailed summary output is given in the following screenshot: Figure 11: Ridge regression with the logistic regression model It can be seen that the ridge regression model offers a very slight improvement over the standard logistic regression model. What just happened? The ridge regression concept has been applied to the important family of logistic regression models. 
Although in the case of the German credit data problem we found slight improvement in identification of the significant variables, it is vital that we should always be on the lookout to fit better models, as in sensitiveness to outliers, and the logisticRidge function appears as a good alternative to the glm function. [ 249 ] Regression Models with Regularization Another look at model assessment In the previous two sections, we used the automatic option for obtaining the optimum λ values, as discussed in the work of Cule and De Iorio (2012). There is an iterative technique for finding the penalty factor λ . This technique is especially useful when we do not have sufficient well-developed theory for regression models beyond the linear and logistic regression model. Neural networks, support vector machines, and so on, are some very useful regression models, where the theory may not have been well developed; well at least to the best known practice of the author. Hence, we will use the iterative method in this section. For both the linearRidge and lm.ridge fitted models in the Ridge regression for linear models section, we saw that for an increasing value of λ , the sum of squares of regression p 2 coefficients, ∑ j =1 β j , decreases. The question then is how to select the "best" λ value. A popular technique in the data mining community is to split the dataset into three parts, namely Train, Validate, and Test part. There are no definitive answers for what needs to be the split percentage for the three parts and a common practice is to split them into either 60:20:20 percentages or 50:25:25 percentages. Let us now understand this process: Training dataset: The models are built on the data available in this data part. Validation dataset: For this part of the data, we pretend as though we do not know the output values and make predictions based upon the covariate values. This step is to ensure that overfitting is minimized. The errors (residual squares for regression model and accuracy percentages for classification model) are then compared with respect to the counterpart errors in the training part. If the errors decrease in the training set while they remain the same for the validation part, it means that we are overfitting the data. A threshold, after which this is observed, may be chosen as the better lambda value. Testing dataset: In practice, these are really unobserved cases for which the model is applied for forecasting purposes. For the gasoline mileage problem, we will split the data into three parts and use the training and validation part to select the λ value. Time for action – selecting lambda iteratively and other topics Iterative selection of the penalty parameter for ridge regression will be covered in this section. The useful framework of train + validate + test will also be considered for the German credit data problem. 1. For the sake of simplicity, we will remove the character variable of the dataset by using Gasoline <- Gasoline[,-12]. [ 250 ] Chapter 8 2. Set the random seed by using set.seed(1234567). This step is to ensure that the user can validate the results of the program. 3. Randomize the observations to enable the splitting part: data_part_label = c("Train","Validate","Test") indv_label=sample(data_part_label,size=nrow(Gasoline),replace=TRUE ,prob=c(0.6,0.2,0.2)) 4. Now, split the gasoline dataset: G_Train <- Gasoline[indv_label=="Train",] G_Validate <- Gasoline[indv_label=="Validate",] G_Test <- Gasoline[indv_label=="Test",] 5. 6. 
Define the λ vector with lambda <- seq(0,10,0.2). Initialize the training and validation errors: Train_Errors <- vector("numeric",length=length(lambda)) Val_Errors <- vector("numeric",length=length(lambda)) 7. Run the following loop to get the required errors: 8. Plot the training and validation errors: plot(lambda,Val_Errors,"l",col="red",xlab=expression(lambda),ylab= "Training and Validation Errors",ylim=c(0,600)) points(lambda,Train_Errors,"l",col="green") legend(6,500,c("Training Errors","Validation Errors"),col=c( "green","red"),pch="-") [ 251 ] Regression Models with Regularization The final output will be the following: Figure 12: Training and validation errors The preceding plot suggests that the lambda value is between 0.5 and 1.5. Why? The technique of train + validate + test is not simply restricted to selecting the lambda value. In fact, for any regression/classification model, we can try to understand if the selected model really generalizes or not. For the German credit data problem in the previous chapter, we will make an attempt to see what the current technique suggests. 9. The program and its output (ROC curves) is displayed following it. [ 252 ] Chapter 8 10. The ROC plot is given in the following screenshot: Figure 13: ROC plot for the train + validate + test partition of the German data We will close the chapter with a short discussion. In the train + validate + test partitioning, we had one technique of avoiding overfitting. A generalization of this technique is the well-known cross-validation method. In an n-fold cross-validation approach, the data is randomly partitioned into n divisions. In the first step, the first part is held for validation and the model is built using the remaining n-1 parts and the accuracy percentage is calculated. Next, the second part is treated as the validation dataset and the remaining 1, 3, …, n-1, n parts are used to build the model and then tested for accuracy on the second part. This process is then repeated for the remaining n-2 parts. Finally, an overall accuracy metric is reported. At the surface, this process is complex enough and hence we will resort to the well-defined functions available in the DAAG package. [ 253 ] Regression Models with Regularization 11. As the cross-validation function itself carries out the n-fold partitioning, we build it over the entire dataset: library(DAAG) data(VD) CVlm(df=VD,form.lm=formula(Voltage_Drop~Time+I(Time^2)+I( Time^3)+I(((Time-6.5)^3)*(sign(Time-6.5)==1)) +I(((Time-13)^3)*(sign(Time-13)==1))),m=10,plotit="Observed") The VD data frame has 41 observations, and the output in Figure 14: Crossvalidation for the voltage-drop problem shows that the 10-fold cross-validation has 10 partitions with fold 2 containing five observations and the rest of them having four each. Now, for each fold, the cubic polynomial regression model fits the model by using the data in the remaining folds: Figure 14: Cross-validation for the voltage-drop problem [ 254 ] Chapter 8 Using the fitted polynomial regression model, a prediction is made for the units in the fold. The observed versus predicted regressand values plot is given in Figure 15: Predicted versus observed plot using the cross-validation technique. A close examination of the numerical predicted values and the plot indicate that we have a very good model for the voltage drop phenomenon. The generalized cross-validation (GCV) errors are also given with the details of a lm.ridge fit model. 
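The loop referred to in step 7 can be written along the following lines; this is a sketch rather than the book's exact program, and it assumes the lambda grid, the G_Train and G_Validate splits, and the Train_Errors and Val_Errors vectors created in the earlier steps, as well as that all remaining Gasoline covariates are numeric. Because lm.ridge does not ship with a predict method, the predictions are formed directly from the fitted coefficients. The last two lines also show how the GCV errors just mentioned can be turned into a concrete choice of λ.

# A sketch of the step 7 loop: training and validation errors over the
# lambda grid (assumes lambda, G_Train, G_Validate, Train_Errors, and
# Val_Errors from the earlier steps).
library(MASS)
covars <- names(G_Train)[names(G_Train) != "y"]
X_tr  <- as.matrix(cbind(1, G_Train[, covars]))
X_val <- as.matrix(cbind(1, G_Validate[, covars]))
for (i in seq_along(lambda)) {
  fit  <- lm.ridge(y ~ ., data = G_Train, lambda = lambda[i])
  beta <- coef(fit)                       # intercept first, then the slopes
  Train_Errors[i] <- sum((G_Train$y    - X_tr  %*% beta)^2)
  Val_Errors[i]   <- sum((G_Validate$y - X_val %*% beta)^2)
}
# The GCV criterion reported by lm.ridge can also be used directly:
LM_GT <- lm.ridge(y ~ ., data = G_Train, lambda = lambda)
lambda[which.min(LM_GT$GCV)]              # lambda with the smallest GCV error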
We can use this information to arrive at the better value for the ridge regression models: Figure 15: Predicted versus observed plot using the cross-validation technique 12. For the OF and G_Train data frames, use the lm.ridge function to obtain the GCV errors: > LM_OF <- lm.ridge(Y~poly(X,3),data=as.data.frame(OF), + lambda=c(0,0.5,1,1.5,2,5,10,30)) > LM_OF$GCV 0.0 0.5 1.0 1.5 2.0 5.0 10.0 30.0 5.19 5.05 5.03 5.09 5.21 6.38 8.31 12.07 > LM_GT <- lm.ridge(y~.,data=G_Train,lambda=seq(0,10,0.2)) > LM_GT$GCV 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 1.777 0.798 0.869 0.889 0.891 0.886 0.877 0.868 0.858 0.848 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 [ 255 ] Regression Models with Regularization 0.838 4.0 0.769 6.0 0.731 8.0 0.710 10.0 0.697 0.830 4.2 0.764 6.2 0.729 8.2 0.708 0.821 4.4 0.760 6.4 0.726 8.4 0.707 0.813 4.6 0.755 6.6 0.723 8.6 0.705 0.806 4.8 0.751 6.8 0.721 8.8 0.704 0.798 5.0 0.748 7.0 0.719 9.0 0.703 0.792 5.2 0.744 7.2 0.717 9.2 0.701 0.786 5.4 0.740 7.4 0.715 9.4 0.700 0.780 5.6 0.737 7.6 0.713 9.6 0.699 0.774 5.8 0.734 7.8 0.711 9.8 0.698 For the OF data frame, the value appears to lie in the interval (1.0, 1.5). On the other hand for the GT data frame, the value appears in (0.2, 0.4). What just happened? The choice of the penalty factor is indeed very crucial for the success of a ridge regression model, and we saw different methods for obtaining this. This step included the automatic choice of Cule and De Iorio (2012) and the cross-validation technique. Further more, we also saw the application of the popular train + validate + test approach. In practical applications, these methodologies will go a long way to obtain the best models. Pop quiz What do you expect as the results if you perform the model selection task step function on a polynomial regression model? That is, you are trying to select the variables for the polynomial model lm(Y~poly(X,9,raw=TRUE),data=OF), or say VD_PRS. Verify your intuition by completing the R programs. Summary In this chapter, we began with a hypothetical dataset and highlighted the problem of overfitting. In case of a breakpoint, also known as knots, the extensions of the linear model in the piecewise linear regression model and the spline regression model were found to be very useful enhancements. The problem of overfitting can also sometimes be overcome by using the ridge regression. The ridge regression solution has been extended for the linear and logistic regression models. Finally, we saw a different approach of model assessment by using the train + validate + test approach and the cross-validation approach. In spite of the developments where we have intrinsically non-linear data, it becomes difficult for the models discussed in this chapter to emerge as useful solutions. The past two decades has witnessed a powerful alternative in the so-called Classification and Regression Trees (CART). The next chapter discusses CART in greater depth and the final chapter considers modern development related to it. [ 256 ] 9 Classification and Regression Trees In the previous chapters, we focused on regression models, and the majority of the emphasis was on the linearity assumption. In what appears as the next extension must be non-linear models, we will instead deviate to recursive partitioning techniques, which are a bit more flexible than the non-linear generalization of the models considered in the earlier chapters. Of course, the recursive partitioning techniques, in most cases, may be viewed as non-linear models. 
We will first introduce the notion of recursive partitions through a hypothetical dataset. It is apparent that the earlier approach of the linear models changes in an entirely different way with the functioning of the recursive partitions. Recursive partitioning depends upon the type of problem we have in hand. We develop a regression tree for the regression problem when the output is a continuous variable, as in the linear models. If the output is a binary variable, we develop a classification tree. A regression tree is first created by using the rpart function from the rpart package. A very raw R program is created, which clearly explains the unfolding of a regression tree. A similar effort is repeated for the classification tree. In the final section of this chapter, a classification tree is created for the German credit data problem along with the use of ROC curves for understanding the model performance. The approach in this chapter will be on the following lines: Understanding the basis of recursive partitions and the general CART. Construction of a regression tree Construction of a classification tree Application of a classification tree to the German credit data problem The finer aspects of CART Classification and Regression Recursive partitions The name of the library package rpart, shipped along with R, stands for Recursive Partitioning. The package was first created by Terry M Therneau and Beth Atkinson, and is currently maintained by Brian Ripley. We will first have a peek at means recursive partitions are. A complex and contrived relationship is generally not identifiable by linear models. In the previous chapter, we saw the extensions of the linear models in piecewise, polynomial, and spline regression models. It is also well known that if the order of a model is larger than 4, then interpretation and usability of the model becomes more difficult. We consider a hypothetical dataset, where we have two classes for the output Y and two explanatory variables in X1 and X2. The two classes are indicated by filled-in green circles and red squares. First, we will focus only on the left display of Figure 1: A complex classification dataset with partitions, as it is the actual depiction of the data. At the outset, it is clear that a linear model is not appropriate, as there is quite an overlap of the green and red indicators. Now, there is a clear demarcation of the classification problem accordingly, as X1 is greater than 6 or not. In the area on the left side of X1=6, the mid-third region contains a majority of green circles and the rest are red squares. The red squares are predominantly identifiable accordingly, as the X2 values are either lesser than or equal to 3 or greater than 6. The green circles are the majority values in the region of X2 being greater than 3 and lesser than 6. A similar story can be built for the points on the right side of X1 greater than 6. Here, we first partitioned the data according to X1 values first, and then in each of the partitioned region, we obtained partitions according to X2 values. This is the act of recursive partitioning. Figure 1: A complex classification dataset with partitions Let us obtain the preceding plot in R. [ 258 ] Chapter 9 Time for action – partitioning the display plot We first visualize the CART_Dummy dataset and then look in the next subsection at how CART gets the patterns, which are believed to exist in the data. 1. Obtain the dataset CART_Dummy from the RSADBE package by using data( CART_Dummy). 2. 
Convert the binary output Y into a factor variable, and attach the data frame:
CART_Dummy$Y <- as.factor(CART_Dummy$Y)
attach(CART_Dummy)
In Figure 1: A complex classification dataset with partitions, the red squares refer to 0 and the green circles to 1.
3. Initialize the graphics window for the two plots by using par(mfrow=c(1,2)).
4. Create a blank scatter plot: plot(c(0,12),c(0,10),type="n",xlab="X1",ylab="X2").
5. Plot the green circles and red squares:
points(X1[Y==0],X2[Y==0],pch=15,col="red")
points(X1[Y==1],X2[Y==1],pch=19,col="green")
title(main="A Difficult Classification Problem")
6. Repeat the previous two steps to obtain the identical plot on the right side of the graphics window.
7. First, partition according to the X1 values by using abline(v=6,lwd=2).
8. Add segments on the graph with the segments function:
segments(x0=c(0,0,6,6),y0=c(3.75,6.25,2.25,5),x1=c(6,6,12,12),y1=c(3.75,6.25,2.25,5),lwd=2)
title(main="Looks a Solvable Problem Under Partitions")

What just happened?

A complex problem is simplified through partitioning! A more generic function, segments, has nicely slipped into our program, which you may use for many other scenarios.

Now, this approach of recursive partitioning is not feasible all the time! Why? We seldom deal with just two or three explanatory variables and as few data points as in the preceding hypothetical example. The question is how one creates a recursive partitioning of the dataset. Breiman, et. al. (1984) and Quinlan (1988) have invented tree building algorithms, and we will follow the Breiman, et. al. approach in the rest of the book. The CART discussion in this book is heavily influenced by Berk (2008).

Splitting the data

In the earlier discussion, we saw that partitioning the dataset can benefit a lot in reducing the noise in the data. The question is how one begins with it. The explanatory variables can be discrete or continuous. We will begin with the continuous (numeric objects in R) variables. For a continuous variable, the task is a bit simpler. First, identify the unique distinct values of the numeric object. Let us say, for example, that the distinct values of a numeric object, say height in cms, are 160, 165, 170, 175, and 180. The data partitions are then obtained as follows:
data[Height<=160,], data[Height>160,]
data[Height<=165,], data[Height>165,]
data[Height<=170,], data[Height>170,]
data[Height<=175,], data[Height>175,]
The reader should try to understand the rationale behind the code, and certainly this is just an indicative one. Now, we consider the discrete variables. Here, we have two types of variables, namely categorical and ordinal. In the case of ordinal variables, we have an order among the distinct values. For example, in the case of the economic status variable, the order may be among the classes Very Poor, Poor, Average, Rich, and Very Rich. Here, the splits are similar to the case of a continuous variable, and if there are m distinct ordered values, we consider m-1 distinct splits of the overall data. In the case of a categorical variable with m categories, for example the departments A to F of the UCBAdmissions dataset, the number of possible splits becomes 2^(m-1) - 1. However, the benefit of using software like R is that we do not have to worry about these issues.

The first tree

In the CART_Dummy dataset, we can easily visualize the partitions for Y as a function of the inputs X1 and X2. Obviously, we have a classification problem, and hence we will build the classification tree.
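Before growing that first tree, the enumeration of candidate splits described under Splitting the data can be sketched in a few lines of R. The Height vector below is a made-up illustration rather than one of the book's datasets, and the last line evaluates the 2^(m-1) - 1 count for a categorical variable with six categories.

# Candidate splits for a continuous variable: m distinct values give m - 1 splits
Height <- c(160, 165, 170, 175, 180, 170, 165, 160, 175)
cutpoints <- sort(unique(Height))
cutpoints <- cutpoints[-length(cutpoints)]      # no split at the largest value
for (cp in cutpoints) {
  cat("Split at", cp, ":", sum(Height <= cp), "versus", sum(Height > cp), "observations\n")
}
# For a categorical variable with m categories, the number of possible
# two-way splits is 2^(m - 1) - 1; for the six departments of UCBAdmissions:
2^(6 - 1) - 1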
[ 260 ] Chapter 9 Time for action – building our first tree The rpart function from the library rpart will be used to obtain the first classification tree. The tree will be visualized by using the plot options of rpart, and we will follow this up with extracting the rules of a tree by using the asRules function from the rattle package. 1. 2. 3. Load the rpart package by using library(rpart). Create the classification tree with CART_Dummy_rpart <- rpart (Y~ X1+X2,data=CART_Dummy). Visualize the tree with appropriate text labels by using plot(CART_Dummy_ rpart); text(CART_Dummy_rpart). Figure 2: A classification tree for the dummy dataset Now, the classification tree flows as follows. Obviously, the tree using the rpart function does not partition as simply as we did in Figure 1: A complex classification dataset with partitions, the working of which will be dealt within the third section of this chapter. First, we check if the value of the second variable X2 is less than 4.875. If the answer is an affirmation, we move to the left side of the tree; the right side in the other case. Let us move to the right side. A second question asked is whether X1 is lesser than 4.5 or not, and then if the answer is yes it is identified as a red square, and otherwise a green circle. You are now asked to interpret the left side of the first node. Let us look at the summary of CART_Dummy_rpart. [ 261 ] Classification and Regression 4. Apply the summary, an S3 method, for the classification tree with summary( CART_ Dummy_rpart). That one is a lot of output! Figure 3: Summary of a classification tree Our interests are in the nodes numbered 5 to 9! Why? The terminal nodes, of course! A terminal node is one in which we can't split the data any further, and for the classification problem, we arrive at a class assignment as the class that has a majority count at the node. The summary shows that there are indeed some misclassifications too. Now, wouldn't it be great if R gave the terminal nodes asRules. The function asRules from the rattle package extracts the rules from an rpart object. Let's do it! [ 262 ] Chapter 9 5. Invoke the rattle package library(rattle) and using the asRules function, extract the rules from the terminal nodes with asRules(CART_Dummy_rpart). The result is the following set of rules: Figure 4: Extracting "rules" from a tree! We can see that the classification tree is not according to our "eye-bird" partitioning. However, as a final aspect of our initial understanding, let us plot the segments using the naïve way. That is, we will partition the data display according to the terminal nodes of the CART_Dummy_rpart tree. 6. The R code is given right away, though you should make an effort to find the logic behind it. Of course, it is very likely that by now you need to run some of the earlier code that was given previously. abline(h=4.875,lwd=2) segments(x0=4.5,y0=4.875,x1=4.5,y1=10,lwd=2) abline(h=1.75,lwd=2) segments(x0=3.5,y0=1.75,x1=3.5,y1=4.875,lwd=2) title(main="Classification Tree on the Data Display") [ 263 ] Classification and Regression It can be easily seen from the following that rpart works really well: Figure 5: The terminal nodes on the original display of the data What just happened? We obtained our first classification tree, which is a good thing. Given the actual data display, the classification tree gives satisfactory answers. We have understood the "how" part of a classification tree. 
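Before turning to the "why", the "how" can also be checked numerically with a quick confusion table. This is a small sketch assuming the CART_Dummy data frame and the CART_Dummy_rpart tree from the preceding steps:

# Fitted classes for the training data versus the observed classes
pred_class <- predict(CART_Dummy_rpart, type = "class")
table(Observed = CART_Dummy$Y, Predicted = pred_class)
# The off-diagonal counts are the misclassifications noted in the summary.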
The "why" aspect is very vital in science, and the next section explains the science behind the construction of a regression tree, and it will be followed later by a detailed explanation of the working of a classification tree. [ 264 ] Chapter 9 The construction of a regression tree In the CART_Dummy dataset, the output is a categorical variable, and we built a classification tree for it. In Chapter 6, Linear Regression Analysis, the linear regression models were built for a continuous random variable, while in Chapter 7, The Logistic Regression Model, we built a logistic regression model for a binary random variable. The same distinction is required in CART, and we thus build classification trees for binary random variables, where regression trees are for continuous random variables. Recall the rationale behind the estimation of regression coefficients for the linear regression model. The main goal was to find the estimates of the regression coefficients, which minimize the error sum of squares between the actual regressand values and the fitted values. A similar approach is followed here, in the sense that we need to split the data at the points that keep the residual sum of squares to a minimum. That is, for each unique value of a predictor, which is a candidate for the node value, we find the sum of squares of y's within each partition of the data, and then add them up. This step is performed for each unique value of the predictor, and the value, which leads to the least sum of squares among all the candidates, is selected as the best split point for that predictor. In the next step, we find the best split points for each of the predictors, and then the best split is selected across the best split points across the predictors. Easy! Now, the data is partitioned into two parts according to the best split. The process of finding the best split within each partition is repeated in the same spirit as for the first split. This process is carried out in a recursive fashion until the data can't be partitioned any further. What is happening here? The residual sum of squares at each child node will be lesser than that in the parent node. At the outset, we record that the rpart function does the exact same thing. However, as a part of cleaner understanding of the regression tree, we will write raw R codes and ensure that there is no ambiguity in the process of understanding CART. We will begin with a simple example of a regression tree, and use the rpart function to plot the regression function. Then, we will first define a function, which will extract the best split given by the covariate and dependent variable. This action will be repeated for all the available covariates, and then we find the best overall split. This will be verified with the regression tree. The data will then be partitioned by using the best overall split, and then the best split will be identified for each of the partitioned data. The process will be repeated until we reach the end of the complete regression tree given by the rpart. First, the experiment! [ 265 ] Classification and Regression The cpus dataset available in the MASS package contains the relative performance measure of 209 CPUs in the perf variable. It is known that the performance of a CPU depends on factors such as the cycle time in nanoseconds (syct), minimum and maximum main memory in kilobytes (mmin and mmax), cache size in kilobytes (cach), and minimum and maximum number of channels (chmin and chmax). 
The task in hand is to model the perf as a function of syct, mmin, mmax, cach, chmin, and chmax. The histogram of perf—try hist(cpus$perf)—will show a highly skewed distribution, and hence we will build a regression tree for the logarithm transformation log10(perf). Time for action – the construction of a regression tree A regression tree is first built by using the rpart function. The getNode function is introduced, which helps in identifying the split node at each stage, and using it we build a regression tree and verify that we had the same tree as returned by the rpart function. 1. 2. Load the MASS library by using library(MASS). Create the regression tree for the logarithm (to the base 10) of perf as a function of the covariates explained earlier, and display the regression tree: cpus.ltrpart <- rpart(log10(perf)~syct+mmin+mmax+cach+chmin+chmax, data=cpus) plot(cpus.ltrpart); text(cpus.ltrpart) The regression tree will be indicated as follows: Figure 6: Regression tree for the "perf" of a CPU [ 266 ] Chapter 9 We will now define the getNode function. Given the regressand and the covariate, we need to find the best split in the sense of the sum of squares criterion. The evaluation needs to be done for every distinct value of the covariate. If there are m distinct points, we need m-1 evaluations. At each distinct point, the regressand needs to be partitioned accordingly, and the sum of squares should be obtained for each partition. The two sums of squares (in each part) are then added to obtain the reduced sum of squares. Thus, we create the required function to meet all these requirements. 3. Create the getNode function in R by running the following code: getNode <- function(x,y) { xu <- sort(unique(x),decreasing=TRUE) ss <- numeric(length(xu)-1) for(i in 1:length(ss)) { partR <- y[x>xu[i]] partL <- y[x<=xu[i]] partRSS <- sum((partR-mean(partR))^2) partLSS <- sum((partL-mean(partL))^2) ss[i]<-partRSS + partLSS } return(list(xnode=xu[which(ss==min(ss,na.rm=TRUE))], minss = min(ss,na.rm=TRUE),ss,xu)) } The getNode function gives the best split for a given covariate. It returns a list consisting of four objects: xnode, which is a datum of the covariate x that gives the minimum residual sum of squares for the regressand y The value of the minimum residual sum of squares The vector of the residual sum of squares for the distinct points of the vector x The vector of the distinct x values We will run this function for each of the six covariates, and find the best overall split. The argument na.rm=TRUE is required, as at the maximum value of x we won't get a numeric value. [ 267 ] Classification and Regression 4. 
We will first execute the getNode function on the syct covariate, and look at the output we get as a result: > getNode(cpus$syct,log10(cpus$perf))$xnode [1] 48 > getNode(cpus$syct,log10(cpus$perf))$minss [1] 24.72 > getNode(cpus$syct,log10(cpus$perf))[[3]] [1] 43.12 42.42 41.23 39.93 39.44 37.54 37.23 36.87 36.51 36.52 35.92 34.91 [13] 34.96 35.10 35.03 33.65 33.28 33.49 33.23 32.75 32.96 31.59 31.26 30.86 [25] 30.83 30.62 29.85 30.90 31.15 31.51 31.40 31.50 31.23 30.41 30.55 28.98 [37] 27.68 27.55 27.44 26.80 25.98 27.45 28.05 28.11 28.66 29.11 29.81 30.67 [49] 28.22 28.50 24.72 25.22 26.37 28.28 29.10 33.02 34.39 39.05 39.29 > getNode(cpus$syct,log10(cpus$perf))[[4]] [1] 1500 1100 900 810 800 700 600 480 400 350 330 320 300 250 240 [16] 225 220 203 200 185 180 175 167 160 150 143 140 133 125 124 [31] 116 115 112 110 105 100 98 92 90 84 75 72 70 64 60 [46] 59 57 56 52 50 48 40 38 35 30 29 26 25 23 17 The least sum of squares at a split for the best split value of the syct variable is 24.72, and it occurs at a value of syct greater than 48. The third and fourth list objects given by getNode, respectively, contain the details of the sum of squares for the potential candidates and the unique values of syct. The values of interest are highlighted. Thus, we will first look at the second object from the list output for all the six covariates to find the best split among the best split of each of the variables, by the residual sum of squares criteria. 5. Now, run the getNode function for the remaining five covariates: getNode(cpus$syct,log10(cpus$perf))[[2]] getNode(cpus$mmin,log10(cpus$perf))[[2]] getNode(cpus$mmax,log10(cpus$perf))[[2]] getNode(cpus$cach,log10(cpus$perf))[[2]] getNode(cpus$chmin,log10(cpus$perf))[[2]] getNode(cpus$chmax,log10(cpus$perf))[[2]] getNode(cpus$cach,log10(cpus$perf))[[1]] sort(getNode(cpus$cach,log10(cpus$perf))[[4]],decreasing=FALSE) [ 268 ] Chapter 9 The output is as follows: Figure 7: Obtaining the best "first split" of regression tree The sum of squares for cach is the lowest, and hence we need to find the best split associated with it, which is 24. However, the regression tree shows that the best split is for the cach value of 27. The getNode function says that the best split occurs at a point greater than 24, and hence we take the average of 24 and the next unique point at 30. Having obtained the best overall split, we next obtain the first partition of the dataset. 6. Partition the data by using the best overall split point: cpus_FS_R <- cpus[cpus$cach>=27,] cpus_FS_L <- cpus[cpus$cach<27,] The new names of the data objects are clear with _FS_R indicating the dataset obtained on the right side for the first split, and _FS_L indicating the left side. In the rest of the section, the nomenclature won't be further explained. 7. 
Identify the best split in each of the partitioned datasets: getNode(cpus_FS_R$syct,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$mmin,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$cach,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$chmin,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$chmax,log10(cpus_FS_R$perf))[[2]] getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[1]] sort(getNode(cpus_FS_R$mmax,log10(cpus_FS_R$perf))[[4]], decreasing=FALSE) getNode(cpus_FS_L$syct,log10(cpus_FS_L$perf))[[2]] [ 269 ] Classification and Regression getNode(cpus_FS_L$mmin,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$cach,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$chmin,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$chmax,log10(cpus_FS_L$perf))[[2]] getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[1]] sort(getNode(cpus_FS_L$mmax,log10(cpus_FS_L$perf))[[4]], decreasing=FALSE) The following screenshot gives the results of running the preceding R code: Figure 8: Obtaining the next two splits [ 270 ] Chapter 9 Thus, for the first right partitioned data, the best split is for the mmax value as the mid-point between 24000 and 32000; that is, at mmax = 28000. Similarly, for the first left-partitioned data, the best split is the average value of 6000 and 6200, which is 6100, for the same mmax covariate. Note the important step here. Even though we used cach as the criteria for the first partition, it is still used with the two partitioned data. The results are consistent with the display given by the regression tree, Figure 6: Regression tree for the "perf" of a CPU. The next R program will take care of the entire first split's right side's future partitions. 8. Partition the first right part cpus_FS_R as follows: cpus_FS_R_SS_R <- cpus_FS_R[cpus_FS_R$mmax>=28000,] cpus_FS_R_SS_L <- cpus_FS_R[cpus_FS_R$mmax<28000,] Obtain the best split for cpus_FS_R_SS_R and cpus_FS_R_SS_L by running the following code: cpus_FS_R_SS_R <- cpus_FS_R[cpus_FS_R$mmax>=28000,] cpus_FS_R_SS_L <- cpus_FS_R[cpus_FS_R$mmax<28000,] getNode(cpus_FS_R_SS_R$syct,log10(cpus_FS_R_SS_R$perf))[[2]] getNode(cpus_FS_R_SS_R$mmin,log10(cpus_FS_R_SS_R$perf))[[2]] getNode(cpus_FS_R_SS_R$mmax,log10(cpus_FS_R_SS_R$perf))[[2]] getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[2]] getNode(cpus_FS_R_SS_R$chmin,log10(cpus_FS_R_SS_R$perf))[[2]] getNode(cpus_FS_R_SS_R$chmax,log10(cpus_FS_R_SS_R$perf))[[2]] getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[1]] sort(getNode(cpus_FS_R_SS_R$cach,log10(cpus_FS_R_SS_R$perf))[[4]], decreasing=FALSE) getNode(cpus_FS_R_SS_L$syct,log10(cpus_FS_R_SS_L$perf))[[2]] getNode(cpus_FS_R_SS_L$mmin,log10(cpus_FS_R_SS_L$perf))[[2]] getNode(cpus_FS_R_SS_L$mmax,log10(cpus_FS_R_SS_L$perf))[[2]] getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))[[2]] getNode(cpus_FS_R_SS_L$chmin,log10(cpus_FS_R_SS_L$perf))[[2]] getNode(cpus_FS_R_SS_L$chmax,log10(cpus_FS_R_SS_L$perf))[[2]] getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf))[[1]] sort(getNode(cpus_FS_R_SS_L$cach,log10(cpus_FS_R_SS_L$perf)) [[4]],decreasing=FALSE) [ 271 ] Classification and Regression For the cpus_FS_R_SS_R part, the final division is according to cach being greater than 56 or not (average of 48 and 64). If the cach value in this partition is greater than 56, then perf (actually log10(perf)) ends in the terminal leaf 3, else 2. 
However, for the region cpus_FS_R_SS_L, we partition the data further by the cach value being greater than 96.5 (average of 65 and 128). In the right side of the region, log10(perf) is found as 2, and a third level split is required for cpus_FS_R_SS_L with cpus_FS_R_SS_L_TS_L. Note that though the final terminal leaves of the cpus_FS_R_SS_L_TS_L region shows the same 2 as the final log10(perf), this may actually result in a significant variability reduction of the difference between the predicted and the actual log10(perf) values. We will now focus on the first main split's left side. [ 272 ] Chapter 9 Figure 9: Partitioning the right partition after the first main split [ 273 ] Classification and Regression 9. Partition cpus_FS_L accordingly, as the mmax value being greater than 6100 or otherwise: cpus_FS_L_SS_R <- cpus_FS_L[cpus_FS_L$mmax>=6100,] cpus_FS_L_SS_L <- cpus_FS_L[cpus_FS_L$mmax<6100,] The rest of the partition for cpus_FS_L is completely given next. 10. The details will be skipped and the R program is given right away: cpus_FS_L_SS_R <- cpus_FS_L[cpus_FS_L$mmax>=6100,] cpus_FS_L_SS_L <- cpus_FS_L[cpus_FS_L$mmax<6100,] getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[2]] getNode(cpus_FS_L_SS_R$mmin,log10(cpus_FS_L_SS_R$perf))[[2]] getNode(cpus_FS_L_SS_R$mmax,log10(cpus_FS_L_SS_R$perf))[[2]] getNode(cpus_FS_L_SS_R$cach,log10(cpus_FS_L_SS_R$perf))[[2]] getNode(cpus_FS_L_SS_R$chmin,log10(cpus_FS_L_SS_R$perf))[[2]] getNode(cpus_FS_L_SS_R$chmax,log10(cpus_FS_L_SS_R$perf))[[2]] getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[1]] sort(getNode(cpus_FS_L_SS_R$syct,log10(cpus_FS_L_SS_R$perf))[[4]], decreasing=FALSE) getNode(cpus_FS_L_SS_L$syct,log10(cpus_FS_L_SS_L$perf))[[2]] getNode(cpus_FS_L_SS_L$mmin,log10(cpus_FS_L_SS_L$perf))[[2]] getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))[[2]] getNode(cpus_FS_L_SS_L$cach,log10(cpus_FS_L_SS_L$perf))[[2]] getNode(cpus_FS_L_SS_L$chmin,log10(cpus_FS_L_SS_L$perf))[[2]] getNode(cpus_FS_L_SS_L$chmax,log10(cpus_FS_L_SS_L$perf))[[2]] getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf))[[1]] sort(getNode(cpus_FS_L_SS_L$mmax,log10(cpus_FS_L_SS_L$perf)) [[4]],decreasing=FALSE) cpus_FS_L_SS_R_TS_R <- cpus_FS_L_SS_R[cpus_FS_L_SS_R$syct<360,] getNode(cpus_FS_L_SS_R_TS_R$syct,log10(cpus_FS_L_SS_R_TS_R$ perf))[[2]] getNode(cpus_FS_L_SS_R_TS_R$mmin,log10(cpus_FS_L_SS_R_TS_R$ perf))[[2]] getNode(cpus_FS_L_SS_R_TS_R$mmax,log10(cpus_FS_L_SS_R_TS_R$ perf))[[2]] getNode(cpus_FS_L_SS_R_TS_R$cach,log10(cpus_FS_L_SS_R_TS_R$ perf))[[2]] getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$ perf))[[2]] getNode(cpus_FS_L_SS_R_TS_R$chmax,log10(cpus_FS_L_SS_R_TS_R$ perf))[[2]] getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_R$ perf))[[1]] sort(getNode(cpus_FS_L_SS_R_TS_R$chmin,log10(cpus_FS_L_SS_R_TS_ R$perf))[[4]],decreasing=FALSE) [ 274 ] Chapter 9 We will now see how the preceding R code gets us closer to the regression tree: Figure 10: Partitioning the left partition after the first main split We leave it to you to interpret the output arising from the previous action. [ 275 ] Classification and Regression What just happened? Using the rpart function from the rpart library we first built the regression tree for log10(perf). Then, we explored the basic definitions underlying the construction of a regression tree and defined the getNode function to obtain the best split for a pair of regressands and a covariate. 
This function is then applied to all the covariates, and the best overall split is obtained; using this, we get our first partition of the data, which is in agreement with the tree given by the rpart function. We then recursively partitioned the data by using the getNode function and verified that all the best splits in each partitioned dataset agree with those provided by the rpart function. The reader may wonder if the preceding tedious task was really essential. However, it has been the experience of the author that users/readers seldom remember the rationale behind directly using code/functions for any software after some time. Moreover, CART is a difficult concept, and it is imperative that we clearly understand our first tree and return to the preceding program whenever the understanding of the science behind CART is forgotten. The construction of a classification tree uses entirely different metrics, and hence its working is also explained in considerable depth in the next section.

The construction of a classification tree

We first need to set up the splitting criteria for a classification tree. In the case of a regression tree, we saw the sum of squares as the splitting criterion. For identifying the split for a classification tree, we need to define certain measures known as impurity measures. The three popular measures of impurity are the Bayes error, the cross-entropy function, and the Gini index. Let p denote the proportion of successes in a dataset of size n. The formulae of these impurity measures are given in the following table:

Measure                     Formula
Bayes error                 $\varphi_B(p) = \min(p, 1 - p)$
The cross-entropy measure   $\varphi_{CE}(p) = -p \log p - (1 - p) \log(1 - p)$
Gini index                  $\varphi_G(p) = p(1 - p)$

We will write a short program to understand the shape of these impurity measures as a function of p:

p <- seq(0.01,0.99,0.01)
plot(p,pmin(p,1-p),"l",col="red",xlab="p",xlim=c(0,1),ylim=c(0,1),ylab="Impurity Measures")
points(p,-p*log(p)-(1-p)*log(1-p),"l",col="green")
points(p,p*(1-p),"l",col="blue")
title(main="Impurity Measures")
legend(0.6,1,c("Bayes Error","Cross-Entropy","Gini Index"),col=c("red","green","blue"),pch="-")

The preceding R code, when executed in an R session, gives the following output:

Figure 11: Impurity metrics – Bayes error, cross-entropy, and Gini index

Basically, we have these three choices of impurity metrics as a building block of a classification tree. The popular choice is the Gini index, and there are detailed discussions about the reason in the literature; see Breiman, et. al. (1984). However, we will not delve into this aspect, and for the development in this section, we will be using the cross-entropy function. Now, for a given predictor, assume that we have a node denoted by A. In the initial stage, where there are no partitions, the impurity is based on the proportion p. The impurity of node A is taken to be a non-negative function of the probability of y = 1, which is mathematically written as p(y = 1 | A). The impurity of node A is defined as follows:

$I(A) = \varphi\left(p(y = 1 \mid A)\right)$

Here, $\varphi$ is one of the three impurity measures. When A is one of the internal nodes, the tree gets bifurcated into the left- and right-hand sides; that is, we now have a left daughter $A_L$ and a right daughter $A_R$. For the moment, we will take the split according to the predictor variable x; that is, if $x \leq c$, the observation moves to $A_L$, otherwise to $A_R$.
Then, according to the split criteria, we have the following table; this is the same as Table 3.2 of Berk (2008):

Split criterion        Failure (0)   Success (1)   Total
A_L: x <= c            n11           n12           n1.
A_R: x > c             n21           n22           n2.
Total                  n.1           n.2           n..

Using the frequencies in the preceding table, the impurities of the daughter nodes $A_L$ and $A_R$, based on the cross-entropy metric, are given as follows:

$I(A_L) = -\frac{n_{11}}{n_{1.}}\log\frac{n_{11}}{n_{1.}} - \frac{n_{12}}{n_{1.}}\log\frac{n_{12}}{n_{1.}}$

and:

$I(A_R) = -\frac{n_{21}}{n_{2.}}\log\frac{n_{21}}{n_{2.}} - \frac{n_{22}}{n_{2.}}\log\frac{n_{22}}{n_{2.}}$

The probabilities of an observation falling in the left- and right-hand daughter nodes are respectively given by $p(A_L) = n_{1.}/n_{..}$ and $p(A_R) = n_{2.}/n_{..}$. Then, the benefit of splitting the node A is given as follows:

$\Delta(A) = I(A) - p(A_L)\, I(A_L) - p(A_R)\, I(A_R)$

Now, we compute $\Delta(A)$ for all unique values of a predictor, and choose as the best split for that predictor the value for which $\Delta(A)$ is a maximum. This step is repeated across all the variables, and the best overall split is the one with the maximum $\Delta(A)$. According to the best split, the data is partitioned, and as seen earlier during the construction of the regression tree, a similar search is performed in each of the partitioned datasets. The process continues until the gain from a split falls below a threshold minimum in each of the partitioned datasets.

We will begin with the classification tree as delivered by the rpart function. The illustrative dataset kyphosis is selected from the rpart library itself. The data relates to children who had corrective spinal surgery. This medical problem is about the exaggerated outward curvature of the thoracic region of the spine, which results in a rounded upper back. In this study, 81 children underwent spinal surgery, and after the surgery, information is captured about whether the children still have the kyphosis problem in the column named Kyphosis. The value of Kyphosis="absent" indicates that the child has been cured of the problem, and Kyphosis="present" means that the child has not been cured of kyphosis. The other information captured relates to the age of the children, the number of vertebrae involved, and the number of the first (topmost) vertebra operated on. The task for us is building a classification tree, which gives the Kyphosis status dependent on the described variables. We will first build the classification tree for Kyphosis as a function of the three variables Age, Start, and Number. The tree will then be displayed and rules will be extracted from it. The getNode function will be defined based on the cross-entropy function; it will be applied on the raw data, and the first overall optimal split will be obtained to partition the data. The process will be recursively repeated until we get the same tree as returned by the rpart function.

Time for action – the construction of a classification tree

The getNode function is now defined here to help us identify the best split for the classification problem. For the kyphosis dataset from the rpart package, we plot the classification tree by using the rpart function. The tree is then reobtained by using the getNode function.

1. Using the option of split="information", construct a classification tree based on the cross-entropy information for the kyphosis data with the following code:
ky_rpart <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, parms=list(split="information"))
2. Visualize the classification tree by using plot(ky_rpart); text(ky_rpart):
Figure 12: Classification tree for the kyphosis problem
3.
Extract rules from ex_rpart by using asRules: > asRules(ky_rpart) Rule number: 15 [Kyphosis=present cover=13 (16%) prob=0.69] Start< 12.5 Age>=34.5 Number>=4.5 Rule number: 14 [Kyphosis=absent cover=12 (15%) prob=0.42] Start< 12.5 Age>=34.5 Number< 4.5 Rule number: 6 [Kyphosis=absent cover=10 (12%) prob=0.10] Start< 12.5 Age< 34.5 Rule number: 2 [Kyphosis=absent cover=46 (57%) prob=0.04] Start>=12.5 4. Define the getNode function for the classification problem: [ 280 ] Chapter 9 In the preceding function, the key functions would be unique, table, and log. We use unique to ensure that the search is carried for the distinct elements of the predictor values in the data. table gets the required counts as discussed earlier in this section. The if condition ensures that neither the p nor 1-p values become 0, in which case the logs become minus infinity. The rest of the coding is self-explanatory. Let us now get our first best split. 5. We will need a few data manipulations to ensure that our R code works on the expected lines: KYPHOSIS <- kyphosis KYPHOSIS$Kyphosis_y <- (kyphosis$Kyphosis=="absent")*1 6. To find the first best split among the three variables, execute the following code; the output is given in a consolidated screenshot after all the iterations: getNode(KYPHOSIS$Age,KYPHOSIS$Kyphosis_y)[[2]] getNode(KYPHOSIS$Number,KYPHOSIS$Kyphosis_y)[[2]] getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[2]] getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[1]] sort(getNode(KYPHOSIS$Start,KYPHOSIS$Kyphosis_y)[[4]], decreasing=FALSE) [ 281 ] Classification and Regression Now, getNode indicates that the best split occurs for the Start variable, and the point for the best split is 12. Keeping in line with the argument of the previous section, we split the data into two parts accordingly, as the Start value is greater than the average of 12 and 13. For the partitioned data, the search proceeds in a recursive fashion. 7. Partition the data accordingly, as the Start values are greater than 12.5, and find the best split for the right daughter node, as the tree display shows that a search in the left daughter node is not necessary: KYPHOSIS_FS_R <- KYPHOSIS[KYPHOSIS$Start<12.5,] KYPHOSIS_FS_L <- KYPHOSIS[KYPHOSIS$Start>=12.5,] getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[2]] getNode(KYPHOSIS_FS_R$Number,KYPHOSIS_FS_R$Kyphosis_y)[[2]] getNode(KYPHOSIS_FS_R$Start,KYPHOSIS_FS_R$Kyphosis_y)[[2]] getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[1]] sort(getNode(KYPHOSIS_FS_R$Age,KYPHOSIS_FS_R$Kyphosis_y)[[4]], decreasing=FALSE) The maximum incremental value occurs for the predictor Age, and the split point is 27. Again, we take the average of the 27 and next highest value of 42, which turns out as 34.5. The (first) right daughter node region is then partitioned in two parts accordingly, as the Age values are greater than 34.5, and the search for the next split continues in the current right daughter part. 8. 
The following code completes our search: KYPHOSIS_FS_R_SS_R <- KYPHOSIS_FS_R[KYPHOSIS_FS_R$Age>=34.5,] KYPHOSIS_FS_R_SS_L <- KYPHOSIS_FS_R[KYPHOSIS_FS_R$Age<34.5,] getNode(KYPHOSIS_FS_R_SS_R$Age,KYPHOSIS_FS_R_SS_R$Kyphosis_y)[[2]] getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$ Kyphosis_y)[[2]] getNode(KYPHOSIS_FS_R_SS_R$Start,KYPHOSIS_FS_R_SS_R$ Kyphosis_y)[[2]] getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$ Kyphosis_y)[[1]] sort(getNode(KYPHOSIS_FS_R_SS_R$Number,KYPHOSIS_FS_R_SS_R$ Kyphosis_y)[[4]], decreasing=FALSE) [ 282 ] Chapter 9 We see that the final split occurs for the predictor Number and the split is 4, and we again stop at 4.5. We see that the results from our raw code completely agree with the rpart function. Thus, the efforts of writing custom code for the classification tree have paid the right dividends. We now have enough clarity for the construction of the classification tree: Figure 13: Finding the best splits for classification tree using the getnode function [ 283 ] Classification and Regression What just happened? A deliberate attempt has been made at demystifying the construction of a classification tree. As with the earlier attempt of understanding a regression tree, we first deployed the rpart function, and saw a display of the classification tree for the Kyphosis as a function of Age, Start, and Number, for the choice of the cross-entropy impurity metric. The getNode function is defined on the basis of the same impurity metric and in a very systematic fashion; we reproduced the same tree as obtained by the rpart function. With the understanding of the basic construction behind us, we will now build the classification tree for the German credit data problem. Classification tree for the German credit data In Chapter 7, The Logistic Regression Model, we constructed a logistic regression model, and in the previous chapter, we obtained the ridge regression version for the German credit data problem. However, problems such as these and many others may have non-linearity built in them, and it is worthwhile to look at the same problem by using a classification tree. Also, we saw another model performance of the German credit data using the train, validate, and test approach. We will have the following approach. First, we will partition the German dataset into three parts, namely train, validate, and test. The classification tree will be built by using the data in the train set and then it will be applied on the validate part. The corresponding ROC curves will be visualized, and if we feel that the two curves are reasonably similar, we will apply it on the test region, and take the necessary action of sanctioning the customers their required loan. Time for action – the construction of a classification tree A classification tree is built now for the German credit data by using the rpart function. The approach of train, validate, and test is implemented, and the ROC curves are obtained too. [ 284 ] Chapter 9 1. The following code has been used earlier in the book, and hence there won't be an explanation of it: set.seed(1234567) data_part_label <- c("Train","Validate","Test") indv_label = sample(data_part_label,size=1000,replace=TRUE,prob =c(0.6,0.2,0.2)) library(ROCR) data(GC) GC_Train <- GC[indv_label==»Train»,] GC_Validate <- GC[indv_label==»Validate»,] GC_Test <- GC[indv_label=="Test",] 2. Create the classification tree for the German credit data, and visualize the tree. 
We will also extract the rules from this classification tree: GC_rpart <- rpart(good_bad~.,data=GC_Train) plot(GC_rpart); text(GC_rpart) asRules(GC_rpart) The classification tree for the German credit data appears as in the following screenshot: Figure 14: Classification tree for the test part of the German credit data problem [ 285 ] Classification and Regression By now, we know how to find the rules of this tree. An edited version of the rules is given as follows: Figure 15: Rules for the German credit data 3. We use the tree given in the previous step on the validate region, and plot the ROC for both the regions: Pred_Train_Class <- predict(GC_rpart,type='class') Pred_Train_Prob <- predict(GC_rpart,type='prob') Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad) Perf_Train <- performance(Train_Pred,»tpr»,»fpr») plot(Perf_Train,col=»green»,lty=2) Pred_Validate_Class<-predict(GC_rpart,newdata=GC_Validate[,21],type='class') Pred_Validate_Prob<-predict(GC_rpart,newdata=GC_Validate[,21],type='prob') Validate_Pred<-prediction(Pred_Validate_Prob[,2], GC_ Validate$good_bad) [ 286 ] Chapter 9 Perf_Validate <- performance(Validate_Pred,"tpr","fpr") plot(Perf_Validate,col="yellow",lty=2,add=TRUE) We will go ahead and predict for the test part too. 4. The necessary code is the following: Pred_Test_Class<-predict(GC_rpart,newdata=GC_Test[,21],type='class') Pred_Test_Prob<-predict(GC_rpart,newdata=GC_Test[,21],type='prob') Test_Pred<- prediction(Pred_Test_Prob[,2],GC_Test$good_bad) Perf_Test<- performance(Test_Pred,"tpr","fpr") plot(Perf_Test,col="red",lty=2,add=TRUE) legend(0.6,0.5,c("Train Curve","ValidateCurve","Test Curve"),col=c ("green","yellow","red"),pch="-") The final ROC curve looks similar to the following screenshot: Figure 16: ROC Curves for German Credit Data [ 287 ] Classification and Regression The performance of the classification tree is certainly not satisfactory with the validate group itself. The only solace here is that the test curve is a bit similar to the validate curve. We will look at the more modern ways of improving the basic classification tree in the next chapter. The classification tree in Figure 14: Classification tree for the test part of the German credit data problem is very large and complex, and we sometimes need to truncate the tree to make the classification method a bit simpler. Of course, one of the things that we should suspect whenever we look at very large trees is that maybe we are again having the problem of overfitting. The final section deals with a simplistic method of overcoming this problem. What just happened? A classification tree has been built for the German credit dataset. The ROC curve shows that the tree does not perform well on the validate data part. In the next and concluding section, we look at the two ways of improving this tree. Have a go hero Using the getNode function, verify the first five splits of the classification tree for the German credit data. Pruning and other finer aspects of a tree Recall from Figure 14: Classification tree for the test part of the German credit data problem that the rules numbered 21, 143, 69, 165, 142, 70, 40, 164, and 16, respectively, covered only 20, 25, 11, 11, 14, 12, 28, 19, and 22. If we look at the total number of observations, we have about 600, and individually these rules do not cover even about five percent of them. This is one reason to suspect that maybe we overfitted the data. 
Using the minsplit option, we can restrict the minimum number of observations that each rule must cover. Another technical way of reducing the complexity of a classification tree is by "pruning" the tree. Here, the least important splits are recursively snipped off according to the complexity parameter; for details, refer to Breiman, et al. (1984), or Section 3.6 of Berk (2008). We will illustrate the action through the R program.

Time for action – pruning a classification tree
A CART is improved by using the minsplit and cp arguments of the rpart function.

1. Invoke the graphics editor with par(mfrow=c(1,2)).
2. Specify minsplit=30, and redo the ROC plots by using the new classification tree:

GC_rpart_minsplit <- rpart(good_bad~.,data=GC_Train,minsplit=30)
Pred_Train_Class <- predict(GC_rpart_minsplit,type='class')
Pred_Train_Prob <- predict(GC_rpart_minsplit,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart_minsplit,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart_minsplit,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
Pred_Test_Class <- predict(GC_rpart_minsplit,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart_minsplit,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
title(main='Improving a Classification Tree with "minsplit"')

3. For the pruning factor cp=0.02, repeat the ROC plot exercise:

GC_rpart_prune <- prune(GC_rpart,cp=0.02)
Pred_Train_Class <- predict(GC_rpart_prune,type='class')
Pred_Train_Prob <- predict(GC_rpart_prune,type='prob')
Train_Pred <- prediction(Pred_Train_Prob[,2],GC_Train$good_bad)
Perf_Train <- performance(Train_Pred,"tpr","fpr")
plot(Perf_Train,col="green",lty=2)
Pred_Validate_Class <- predict(GC_rpart_prune,newdata=GC_Validate[,-21],type='class')
Pred_Validate_Prob <- predict(GC_rpart_prune,newdata=GC_Validate[,-21],type='prob')
Validate_Pred <- prediction(Pred_Validate_Prob[,2],GC_Validate$good_bad)
Perf_Validate <- performance(Validate_Pred,"tpr","fpr")
plot(Perf_Validate,col="yellow",lty=2,add=TRUE)
Pred_Test_Class <- predict(GC_rpart_prune,newdata=GC_Test[,-21],type='class')
Pred_Test_Prob <- predict(GC_rpart_prune,newdata=GC_Test[,-21],type='prob')
Test_Pred <- prediction(Pred_Test_Prob[,2],GC_Test$good_bad)
Perf_Test <- performance(Test_Pred,"tpr","fpr")
plot(Perf_Test,col="red",lty=2,add=TRUE)
legend(0.6,0.5,c("Train Curve","Validate Curve","Test Curve"),col=c("green","yellow","red"),pch="-")
title(main="Improving a Classification Tree with Pruning")

The choice of cp=0.02 has been drawn from the plot of the complexity parameter against the relative error; try it yourself with plotcp(GC_rpart). A small sketch of this idea follows the figure.

Figure 17: Pruning the CART
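The following lines are a small illustrative sketch, not part of the original exercise, of how the cross-validated errors stored by rpart can guide this choice; the one-standard-error rule at the end is one common convention, and the object name GC_rpart_1se is introduced here purely for illustration.

printcp(GC_rpart)   # tabulates CP, nsplit, rel error, xerror, and xstd
plotcp(GC_rpart)    # plots the cross-validated error against cp
# One common convention: choose the largest cp whose cross-validated error
# (xerror) lies within one standard error (xstd) of the smallest xerror.
cpt <- GC_rpart$cptable
best <- which.min(cpt[, "xerror"])
cp_1se <- cpt[cpt[, "xerror"] <= cpt[best, "xerror"] + cpt[best, "xstd"], "CP"][1]
GC_rpart_1se <- prune(GC_rpart, cp = cp_1se)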
What just happened?
Using the minsplit and cp options, we have managed to obtain a reduced set of rules, and in that sense the fitted model does not appear to be an overfit. The ROC curves show that there has been considerable improvement in the performance on the validate part. Again, as earlier, the validate and test parts have similar ROC curves, and it is hence preferable to use GC_rpart_prune or GC_rpart_minsplit over GC_rpart.

Pop quiz
With the experience of model selection from the previous chapter, justify the choice of cp=0.02 from the plot obtained as a result of running plotcp(GC_rpart).

Summary
We began with the idea of recursive partitioning and gave a legitimate reason as to why such an approach is practical. The CART technique was completely demystified by using the getNode function, which was defined appropriately depending upon whether we require a regression or a classification tree. With that conviction behind us, we applied the rpart function to the German credit data, and with its results we faced basically two problems. First, the fitted classification tree appeared to overfit the data; this problem can often be overcome by using the minsplit and cp options. The second problem was that the performance was really poor on the validate part. Though the reduced classification trees had slightly better performance than the initial tree, we still need to improve the classification tree. The next chapter will focus more on this aspect and discuss the modern developments of CART.

10
CART and Beyond

In the previous chapter, we studied CART as a powerful recursive partitioning method, useful for building (non-linear) models. Despite its overall generality, CART does have certain limitations that necessitate some enhancements. It is these extensions that form the crux of the final chapter of this book. For some technical reasons, we will focus solely on classification trees in this chapter. We will also briefly look at some limitations of the CART tool.

The first improvement that can be made to CART is provided by the bagging technique. In this technique, we build multiple trees on bootstrap samples drawn from the actual dataset. An observation is put through each of the trees, a prediction of its class is made by each tree, and the observation is assigned to the class that receives the majority of these predictions. A different approach is provided by random forests, where a random pool of covariates is considered along with the resampled observations. We finally consider another important enhancement of CART by using boosting algorithms. The chapter will discuss the following:

- Cross-validation errors for CART
- The bootstrap aggregation (bagging) technique for CART
- Extending CART with random forests
- A consolidation of the applications developed from Chapter 6 to Chapter 10, CART and Beyond

Improving CART
In the Another look at model assessment section of Chapter 8, we saw that the technique of train + validate + test may be further enhanced by using the cross-validation technique. In the case of the linear regression model, we used the CVlm function from the DAAG package for the purpose of cross-validation of linear models. The cross-validation technique for logistic regression models may be carried out by using the CVbinary function from the same package.

Profs. Therneau and Atkinson created the rpart package, and detailed documentation of the entire rpart package is available on the Web at http://www.mayo.edu/hsr/techrpt/61.pdf. Recall the slight improvement provided in the Pruning and other finer aspects of a tree section of the previous chapter.
The two aspects considered there related to the complexity parameter cp and the minimum split criterion minsplit. Now, the problem of overfitting with CART may be reduced to an extent by using the cross-validation technique. In the ridge regression model, we had the problem of selecting the penalty factor λ. Similarly, here we have the problem of selecting the complexity parameter, though not in an analogous way. That is, for the complexity parameter, which is a number between 0 and 1, we need to obtain predictions based on the cross-validation technique. This may lead to a small loss of apparent accuracy; however, we gain in terms of generality.

An object of the rpart class has many summaries contained within it, and the various complexity parameters are stored in the cptable matrix. This matrix has values for the following metrics: CP, nsplit, rel error, xerror, and xstd. Let us understand this matrix through the default example in the rpart package, obtained with example(xpred.rpart); see Figure 1:

Figure 1: Understanding the example for the "xpred.rpart" function

Here the tree has CP at four values, namely 0.595, 0.135, 0.013, and 0.010. The corresponding nsplit numbers are 0, 1, 2, and 3, and similarly, the relative error, xerror, and xstd values are given in the last part of the previous screenshot. The interpretation of the CP value is slightly different, the reason being that these have to be considered as ranges and not values; that is, the performance is not with respect to the CP values mentioned previously, but rather with respect to the intervals [0.595, 1], [0.135, 0.595), [0.013, 0.135), and [0.010, 0.013); see ?xpred.rpart for more information. Now, the function xpred.rpart returns the predictions based on the cross-validation technique. Therefore, we will use this function for the German credit data problem and for different CP values (actually ranges) to obtain the accuracy of the cross-validation technique.

Time for action – cross-validation predictions
We will use the xpred.rpart function from rpart to obtain the cross-validation predictions from an rpart object.

1. Load the German dataset and the rpart package using data(GC); library(rpart).
2. Fit the classification tree with GC_Complete <- rpart(good_bad~., data=GC).
3. Check cptable with GC_Complete$cptable:

        CP  nsplit  rel error  xerror     xstd
1  0.05167       0     1.0000  1.0000  0.04830
2  0.04667       3     0.8400  0.9833  0.04807
3  0.01833       4     0.7933  0.8900  0.04663
4  0.01667       6     0.7567  0.8933  0.04669
5  0.01556       8     0.7233  0.8800  0.04646
6  0.01000      11     0.6767  0.8833  0.04652

4. Obtain the cross-validation predictions using GC_CV_Pred <- xpred.rpart(GC_Complete).
5. Find the accuracy of the cross-validation predictions:

sum(diag(table(GC_CV_Pred[,2],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,3],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,4],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,5],GC$good_bad)))/1000
sum(diag(table(GC_CV_Pred[,6],GC$good_bad)))/1000

The accuracy output is as follows:

> sum(diag(table(GC_CV_Pred[,2],GC$good_bad)))/1000
[1] 0.71
> sum(diag(table(GC_CV_Pred[,3],GC$good_bad)))/1000
[1] 0.744
> sum(diag(table(GC_CV_Pred[,4],GC$good_bad)))/1000
[1] 0.734
> sum(diag(table(GC_CV_Pred[,5],GC$good_bad)))/1000
[1] 0.74
> sum(diag(table(GC_CV_Pred[,6],GC$good_bad)))/1000
[1] 0.741

It is natural that when you execute the same code, you will most likely get a different output. Why is that? You also need to ask yourself why we did not check the accuracy for GC_CV_Pred[,1]. In general, for decreasing CP ranges, we expect higher accuracy. We have checked the cross-validation predictions for various CP ranges. There are also other techniques to enhance the performance of a CART.

What just happened?
We used the xpred.rpart function to obtain the cross-validation predictions for a range of CP values. The accuracy of a prediction model has been assessed by using simple functions such as table and diag.

However, the control actions of minsplit and cp are of a reactive nature, applied after the splits have already been decided. In that sense, when we have a large number of covariates, the CART may overfit the data and try to capture all the local variations of the data, thus losing sight of the overall generality. Therefore, we need useful mechanisms to overcome this problem. The classification or regression tree considered in the previous chapter is a single model; that is, we are seeking the opinion (prediction) of a single model. Wouldn't it be nice if we could extend this? Alternatively, we can seek multiple models instead of a single model. What does this mean? In the forthcoming sections, we will see the use of multiple models for the same problem.

Bagging
Bagging is an abbreviation of bootstrap aggregation. The important underlying concept here is the bootstrap, which was invented by the eminent scientist Bradley Efron. We will first digress a bit from the CART technique and consider a very brief illustration of the bootstrap technique.

The bootstrap
Consider a random sample X_1, ..., X_n of size n from f(x, θ), and let T(X_1, ..., X_n) be an estimator of θ. To begin with, we draw a random sample of size n from X_1, ..., X_n with replacement; that is, we obtain a random sample X_1*, X_2*, ..., X_n*, where some of the observations from the original sample may be repeated and some may not be present at all. There is no one-to-one correspondence between X_1, ..., X_n and X_1*, X_2*, ..., X_n*. Using X_1*, X_2*, ..., X_n*, we compute T^(1) = T(X_1*, ..., X_n*). We repeat this exercise a large number of times, say B. The inference for θ is then carried out by using the sampling distribution of the bootstrap estimates T^(1), ..., T^(B).
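Before moving on, here is a tiny self-contained sketch of the resampling idea using base R only; the simulated data and the choice of the sample mean as the statistic are purely illustrative and are not part of the aspirin example that follows.

set.seed(123)
x <- rexp(50, rate = 1/10)             # an iid sample standing in for X_1, ..., X_n
B <- 1000                              # number of bootstrap replications
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))  # a 95 percent bootstrap interval for the mean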
Let us illustrate the concept of the bootstrap with the famous aspirin example; see Chapter 8 of Tattar, et al. (2013). A surprising double-blind experiment, reported in the New York Times, indicated that aspirin consumed on alternate days significantly reduces the number of heart attacks among men. In that experiment, 104 out of 11037 healthy middle-aged men consuming small doses of aspirin suffered a fatal or non-fatal heart attack, whereas 189 out of 11034 individuals on the placebo had an attack. Therefore, the odds ratio of the aspirin-to-placebo heart attack possibility is (104/11037)/(189/11034) = 0.55. This indicates that only 55 percent of the number of heart attacks observed for the group taking the placebo is likely to be observed for men consuming small doses of aspirin. That is, the chances of having a heart attack while taking aspirin are almost halved. The experiment being scientific, the results look very promising. However, we would like to obtain a confidence interval for the odds ratio of the heart attack. If we don't know the sampling distribution of the odds ratio, we can use the bootstrap technique to obtain one.

There is another aspect of the aspirin study. It has been observed that the aspirin group had 119 individuals who had strokes, while the corresponding number for the placebo group is 98. Therefore, the odds ratio of a stroke is (119/11037)/(98/11034) = 1.21. This is shocking! It says that though aspirin reduces the possibility of a heart attack, about 21 percent more people are likely to have a stroke when compared with the placebo group. Now, let us use the bootstrap technique to obtain the confidence intervals for the heart attack as well as the stroke odds ratios.

Time for action – understanding the bootstrap technique
The boot package, which comes shipped with base R, will be used for bootstrapping the odds ratio.

1. Get the boot package using library(boot). The boot package is shipped with the R software itself, and thus it does not require separate installation. The main components of the boot function will be explained soon.
2. Define the odds-ratio function:

OR <- function(data,i) {
  x <- data[,1]; y <- data[,2]
  odds.ratio <- (sum(x[i]==1,na.rm=TRUE)/length(na.omit(x[i])))/
                (sum(y[i]==1,na.rm=TRUE)/length(na.omit(y[i])))
  return(odds.ratio)
}

The name OR stands, of course, for odds-ratio. The data for this function consists of two columns, one of which may have more observations than the other. The option na.rm is used to ignore the NA data values, whereas the na.omit function removes them. It is easy to see that the odds.ratio object indeed computes the odds ratio. Note that we have specified i as an input to the function OR, since this function will be used within boot. Therefore, i indicates that the odds ratio will be calculated for the i-th bootstrap sample; it is a vector of resampled indices, so x[i] is the bootstrap resample of x rather than a single element.

3. Get the data for both the aspirin and placebo groups (the heart attack and stroke data) with the following code:

aspirin_hattack <- c(rep(1,104),rep(0,11037-104))
placebo_hattack <- c(rep(1,189),rep(0,11034-189))
aspirin_strokes <- c(rep(1,119),rep(0,11037-119))
placebo_strokes <- c(rep(1,98),rep(0,11034-98))

4. Combine the data groups and run 1000 bootstrap replicates, calculating the odds ratio for each of the bootstrap samples. Use the following boot function:

hattack <- cbind(aspirin_hattack,c(placebo_hattack,NA,NA,NA))
hattack_boot <- boot(data=hattack,statistic=OR,R=1000)
strokes <- cbind(aspirin_strokes,c(placebo_strokes,NA,NA,NA))
strokes_boot <- boot(data=strokes,statistic=OR,R=1000)

We are using three options of the boot function, namely data, statistic, and R.
The first option accepts the data frame of interest; the second accepts the statistic, either an existing R function or a function defined by the user; and the third accepts the number of bootstrap replications. The boot function creates an object of the boot class, and in this case we are obtaining the odds ratio for the various bootstrap samples.

5. Using the bootstrap samples and the odds ratios computed from them, obtain a 95 percent confidence interval by using the quantile function:

quantile(hattack_boot$t,c(0.025,0.975))
quantile(strokes_boot$t,c(0.025,0.975))

The 95 percent confidence interval for the odds ratio of the heart attack rate is (0.4763, 0.6269), while that for the strokes is (1.126, 1.333). Since both intervals exclude the value 1, the data support the conclusion that aspirin reduces the odds of a heart attack to roughly 55 percent of the placebo odds, while genuinely increasing the odds of a stroke.

What just happened?
We used the boot function from the boot package and obtained bootstrap samples for the odds ratio. Now that we have an understanding of the bootstrap technique, let us check out how the bagging algorithm works.

The bagging algorithm
Breiman (1996) proposed the following extension of CART. Suppose that the values of the n random observations for the classification problem are (y_1, x_1), (y_2, x_2), ..., (y_n, x_n). As with our setup, the dependent variables y_i are binary. As with the bootstrap technique explained earlier, we obtain a bootstrap sample of size n from the data with replacement and build a tree. If we prune the tree, it is very likely that we end up with the same tree on most occasions; hence, pruning is not advisable here. Now, using the tree based on the (first) bootstrap sample, a prediction is made for the class of the i-th observation and the predicted value is noted. This process is repeated a large number of times, say B. A general practice is to take B = 100. Therefore, we have B predictions for every observation. The decision process is to classify an observation to the category that has the majority of class predictions. That is, if it has been predicted to belong to a particular class more than 50 times out of B = 100, we say that the observation is predicted to belong to that class. Let us formally state the bagging algorithm.

1. Draw a sample of size n with replacement from the data (y_1, x_1), (y_2, x_2), ..., (y_n, x_n), and denote the first bootstrap sample by (y_1, x_1)^1, (y_2, x_2)^1, ..., (y_n, x_n)^1.
2. Create a classification tree with (y_1, x_1)^1, (y_2, x_2)^1, ..., (y_n, x_n)^1. Do not prune the classification tree. Such a tree may be called a bootstrapped tree.
3. For each terminal node, assign a class; put each observation down the tree and find its predicted class.
4. Repeat steps 1 to 3 a large number of times, say B.
5. Find the number of times each observation is classified to a particular class out of the B bootstrapped trees. The bagging procedure classifies an observation to the class that has the majority count.
6. Compute the confusion table from the predictions made in step 5.

The advantage of multiple trees is that the problem of overfitting, which happens in the case of a single tree, is overcome to a large extent, as we expect that resampling will ensure that the general features are captured and the impact of local features is minimized.
Therefore, if an observation is classified to a particular class because of a local issue, that classification will not be repeated across the other bootstrapped trees. Hence, with predictions based on a large number of trees, it is expected that the final prediction of an observation really depends upon its general features and not on a particular local feature.

There are some measures that are important to consider with the bagging algorithm. A good classifier, whether a single tree or a bunch of them, should be able to predict the class of an observation with conviction. For example, we use a probability threshold of 0.5 and above as a prediction of success when using a logistic regression model. If the model predicts most observations in the neighborhood of either 0 or 1, we will have more confidence in the predictions. As a consequence, we will be a bit hesitant to classify an observation as either a success or a failure if the predicted probability is in the vicinity of 0.5. This precarious situation applies to the bagging algorithm too. Suppose we choose B = 100 for the number of bagged trees. Assume that an observation belongs to the class Yes, and let the overall classes for the study be {"Yes", "No"}. If a large number of trees predict the observation to belong to the Yes class, we are confident about the prediction. On the other hand, if only approximately B/2 of the trees classify the observation to the Yes class, the decision can easily get swapped if a few more trees predict the observation to belong to the No class. Therefore, we introduce a measure called the margin, defined as the difference between the proportion of times an observation is correctly classified and the proportion of times it is incorrectly classified. If the bagging algorithm is a good model, we expect the average margin over all the observations to be well away from 0. If bagging is not appropriate, we expect the average margin to be near 0. Let us prepare ourselves for action. The bagging algorithm is available in the ipred and randomForest packages.

Time for action – the bagging algorithm
The bagging function from the ipred package will be used for bagging a CART. The options coob=FALSE and nbagg=200 are used to specify the appropriate settings.

1. Get the ipred package by using library(ipred).
2. Load the German credit data by using data(GC).
3. For B=200, fit the bagging procedure for the GC data:

GC_bagging <- bagging(good_bad~.,data=GC,coob=FALSE,nbagg=200,keepX=T)

We know that we have fit B=200 trees. Would you like to see them? Fine, here we go.

4. The B=200 trees are stored in the mtrees list of the classbagg object GC_bagging. That is, GC_bagging$mtrees[[i]] gives us the i-th bootstrapped tree, and plot(GC_bagging$mtrees[[i]]$btree) displays that tree. Adding text(GC_bagging$mtrees[[i]]$btree,pretty=1,use.n=T) is also important. Next, put the entire thing in a loop, execute it, and simply sit back and enjoy the display of the B trees:

for(i in 1:200) {
  plot(GC_bagging$mtrees[[i]]$btree)
  text(GC_bagging$mtrees[[i]]$btree,pretty=1,use.n=T)
}

We hope you understand that we can't publish all 200 trees! The next goal is to obtain the margin of the bagging algorithm.

5. Predict the class probabilities of all the observations with the predict.classbagg function by using GCB_Margin = round(predict(GC_bagging,type="prob")*200,0). Let us understand the preceding code.
The predict function returns the probabilities of an observation belonging to the good and bad classes. We have used 200 trees, and hence multiplying these probabilities by 200 gives us the expected number of times an observation is predicted to belong to each class. The round function with the 0 argument rounds the predictions to integers.

6. Check the first six predicted class counts with head(GCB_Margin):

     bad good
[1,]  17  183
[2,] 165   35
[3,]  11  189
[4,] 123   77
[5,] 101   99
[6,]  95  105

7. To obtain the overall margin of the bagging technique, use the R code:

mean(pmax(GCB_Margin[,1],GCB_Margin[,2]) - pmin(GCB_Margin[,1],GCB_Margin[,2]))/200

The overall margin for the author's execution turns out to be 0.5279. You may, though, get a different answer. Why?

Thus far, the bagging technique has made predictions for the observations from which it built the model. In the earlier chapters, we championed the need for a validate group and cross-validation techniques; that is, we did not always rely on model measures computed solely from the data on which the model was built. There is always the possibility of failure on unforeseen examples. Can the bagging technique be built to take care of unforeseen observations? The answer is a definite yes, and this is well known as out-of-bag validation. In fact, such an option was suppressed when building the bagging model in step 3 through the option coob=FALSE; coob stands for an out-of-bag estimate of the error rate. So, now rebuild the bagging model with the coob=TRUE option.

8. Build an out-of-bag bagging model with GC_bagging_oob <- bagging(good_bad~.,data=GC,coob=TRUE,nbagg=200,keepX=T). Find the error rate with GC_bagging_oob$err:

> GC_bagging_oob <- bagging(good_bad~.,data=GC,coob=TRUE,nbagg=200,keepX=T)
> GC_bagging_oob$err
[1] 0.241

What just happened?
We have seen an important extension of the CART model in the bagging algorithm. To an extent, this enhancement is vital and vastly different from the improvements made to earlier models. The bagging algorithm is different in the sense that we rely on predictions based on more than a single model. This ensures that the overfitting problem, which occurs due to local features, is almost eliminated. It is important to note that the bagging technique is not without limitations; refer to Section 4.5 of Berk (2008). We now move to the final model of the book, which is an important technique of the CART school.

Random forests
In the previous section, we built multiple models for the same classification problem. The bootstrapped trees were generated by using resamples of the observations. Breiman (2001) suggested an important variation (actually, there is more to it than just a variation) where a CART is built with the covariates (features) also being resampled for each of the bootstrap samples of the dataset. Since the final tree of each bootstrap sample uses different covariates, the ensemble of the collective trees is called a random forest. A formal algorithm is given next.

1. As with the bagging algorithm, draw a sample of size n1, with n1 < n, with replacement from the data (y_1, x_1), (y_2, x_2), ..., (y_n, x_n), and denote the first resampled dataset by (y_1, x_1)^1, (y_2, x_2)^1, ..., (y_{n1}, x_{n1})^1. The remaining n - n1 observations form the out-of-bag dataset.
2. From the covariate vector x, select a random subset of covariates without replacement. Note that the same covariates are selected for all the observations.
3. Create the CART tree from the data in steps 1 and 2, and, as earlier, do not prune the tree.
4. For each terminal node, assign a class. Put each out-of-bag observation down the tree and find its predicted class.
5. Repeat steps 1 to 4 a large number of times, say 200 or 500.
6. For each observation, count the number of times it is predicted to belong to a class, counting only the replications in which it is part of the out-of-bag dataset.
7. The class with the majority count for an observation is taken as its predicted class.

This is quite a complex algorithm. Luckily, the randomForest package helps us out. We will continue with the German credit data problem.

Time for action – random forests for the German credit data
The function randomForest from the package of the same name will be used to build a random forest for the German credit data problem.

1. Get the randomForest package by using library(randomForest).
2. Load the German credit data by using data(GC).
3. Create a random forest with 500 trees:

GC_RF <- randomForest(good_bad~.,data=GC,keep.forest=TRUE,ntree=500)

It is very difficult to visualize a single tree of the random forest. A rather ad hoc approach can be found at http://stats.stackexchange.com/questions/2344/best-way-to-present-a-random-forest-in-a-publication. We now reproduce the necessary function to extract and display the trees; as the solution is not exactly perfect, you may skip this part (steps 4 and 5).

4. Define the to.dendrogram function.
5. Use the getTree function, and with the to.dendrogram function defined previously, visualize the first 20 trees of the forest:

for(i in 1:20) {
  tree <- getTree(GC_RF,i,labelVar=T)
  d <- to.dendrogram(tree)
  plot(d,center=TRUE,leaflab='none',edgePar=list(t.cex=1,p.col=NA,p.lty=0))
}

The error rate is of primary concern. As we increase the number of trees in the forest, we expect a decrease in the error rate. Let us investigate this for the GC_RF object.

6. Plot the out-of-bag error rate against the number of trees with:

plot(1:500,GC_RF$err.rate[,1],"l",xlab="No.of.Trees",ylab="OOB Error Rate")

Figure 2: Performance of a random forest

The covariates (features) are selected differently for different trees. It is then of interest to know which variables are important. The important variables are obtained using the varImpPlot function.

7. The function varImpPlot produces a display of the importance of the variables by using varImpPlot(GC_RF).

Figure 3: Important variables of the German credit data problem

Thus, we can see which variables have more relevance than others.

What just happened?
Random forests are a very important extension of the CART concept. In this technique, we need to know how the error rate behaves as the number of trees increases; it is expected to decrease as more trees are added. varImpPlot also gives a very important display of the importance of the covariates for classifying the customers as good or bad.
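As a small numerical supplement (not part of the original steps), the fitted forest can also be inspected directly; GC_RF$confusion, GC_RF$err.rate, and importance() are standard components of a randomForest classification object, and the code below simply assumes the GC_RF object created in step 3.

GC_RF$confusion                    # out-of-bag confusion matrix with class errors
tail(GC_RF$err.rate[, "OOB"], 1)   # OOB error rate after all 500 trees
head(importance(GC_RF))            # mean decrease in the Gini index per covariate
# predict(GC_RF), with no new data supplied, returns the out-of-bag predictions.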
In conclusion, we will take up a classification dataset and revise all the techniques seen in the book, especially in Chapter 6 to Chapter 10. We will now consider the problem of low birth weight among infants.

The consolidation
The goal of this section is to quickly review all of the techniques learnt in the latter half of the book. Towards this, a dataset has been selected in which we have ten variables, including the output. Low birth weight is a serious concern, and it needs to be understood as a function of many other variables. If the weight of a child at birth is less than 2500 grams, it is considered a low birth weight. This problem has been studied in Chapter 19 of Tattar, et al. (2013). The following table gives a description of the variables. Since the dataset may be studied as a regression problem (variable BWT) as well as a classification problem (LOW), you can choose any path(s) that you deem fit. Let the final action begin.

Serial number  Description                                              Abbreviation
1              Identification code                                      ID
2              Low birth weight                                         LOW
3              Age of mother                                            AGE
4              Weight of mother at last menstrual period                LWT
5              Race                                                     RACE
6              Smoking status during pregnancy                          SMOKE
7              History of premature labor                               PTL
8              History of hypertension                                  HT
9              Presence of uterine irritability                         UI
10             Number of physician visits during the first trimester    FTV
11             Birth weight                                             BWT

Time for action – random forests for the low birth weight data
The techniques learnt from Chapter 6 to Chapter 10 will now be put to the test. That is, we will use the linear regression model, logistic regression, as well as CART.

1. Read the dataset into R with data(lowbwt).
2. Visualize the dataset with the options diag.panel, lower.panel, and upper.panel:

pairs(lowbwt,diag.panel=panel.hist,lower.panel=panel.smooth,upper.panel=panel.cor)

Interpret the matrix of scatter plots. Which statistical model seems most appropriate to you?

Figure 4: Multivariable display of the "lowbwt" dataset

As the correlations look weak, it seems that a regression model may not be appropriate. Let us check.

3. Create (sub) datasets for the regression and classification problems:

LOW <- lowbwt[,-10]
BWT <- lowbwt[,-1]

4. First, we will check whether a linear regression model is appropriate:

BWT_lm <- lm(BWT~., data=BWT)
summary(BWT_lm)

Interpret the output of the linear regression model; refer to Chapter 6, Linear Regression Analysis, if necessary.

Figure 5: Linear model for the low birth weight data

The low R2 makes it difficult for us to use this model. Let us check out the logistic regression model.

5. Fit the logistic regression model as follows:

LOW_glm <- glm(LOW~., data=LOW, family='binomial')
summary(LOW_glm)

The summary of the model is given in the following screenshot:

Figure 6: Logistic regression model for the low birth weight data

6. The Hosmer-Lemeshow goodness-of-fit test for the logistic regression model is given by hosmerlem(LOW_glm$y,fitted(LOW_glm)). The p-value obtained is 0.7813, which shows that there is no significant difference between the fitted values and the observed values. Therefore, we conclude that the logistic regression model is a good fit. However, we will go ahead and fit CART models for this problem as well. Note that the estimated regression coefficients are not huge values, and hence we do not need to consider ridge regression.

7. Fit a classification tree with the rpart function:

LOW_rpart <- rpart(LOW~.,data=LOW)
plot(LOW_rpart)
text(LOW_rpart,pretty=1)

Does the classification tree appear more appropriate than the logistic regression fitted earlier?

Figure 7: Classification tree for the low birth weight data

8. Get the rules of the classification tree using asRules(LOW_rpart).

Figure 8: Rules for the low birth weight problem

You can see that these rules are of great importance to the attending physician.
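As a quick check (again, not part of the original exercise), the apparent accuracy of the fitted tree can be summarized with a confusion table; the code assumes the LOW data frame and the LOW_rpart object from step 7, and a resubstitution accuracy of this kind is, of course, optimistic.

LOW_pred <- predict(LOW_rpart, type = "class")    # fitted classes on the training data
table(Observed = LOW$LOW, Predicted = LOW_pred)   # confusion table
mean(LOW_pred == LOW$LOW)                         # apparent accuracy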
Let us check the bagging effect on the classification tree.

9. Using the bagging function, find the error rate of the bagging technique with the following code:

LOW_bagging <- bagging(LOW~., data=LOW,coob=TRUE,nbagg=50,keepX=T)
LOW_bagging$err

The error rate is 0.3228, which seems very high. Let us see if random forests help us out.

10. Using the randomForest function, find the out-of-bag error rate:

LOW_RF <- randomForest(LOW~.,data=LOW,keep.forest=TRUE,ntree=50)
LOW_RF$err.rate

The error rate is still around 0.34. The initial idea was that, with fewer than 200 observations, 50 trees would suffice. Repeat the task with 150 trees and check whether the error rate decreases.

11. Increase the number of trees to 150 and obtain the error-rate plot:

LOW_RF <- randomForest(LOW~.,data=LOW,keep.forest=TRUE,ntree=150)
plot(1:150,LOW_RF$err.rate[,1],"l",xlab="No.of.Trees",ylab="OOB Error Rate")

The error rate of about 0.32 seems to be the best we can obtain for this problem.

Figure 9: The error rate for the low birth weight problem

What just happened?
We had a very quick look back at all the techniques used over the last five chapters of the book.

Summary
The chapter began with two important variations of the CART technique: the bagging technique and random forests. Random forests are a particularly modern technique, invented in 2001 by Breiman. The goal of the chapter was to familiarize you with these modern techniques. Together with the German credit data and the complete revision of the earlier techniques with the low birth weight problem, it is hoped that you have benefited from the book and gained enough confidence to apply these tools in your own analytical problems.

References
The book has been influenced by many of the classical texts on the subject, from Tukey (1977) to Breiman, et al. (1984). The modern texts of Hastie, et al. (2009) and Berk (2008) have particularly influenced the later chapters of the book. We have alphabetically listed only the books and monographs that have been cited in the text; however, the reader may go beyond the current list too.

Agresti, A. (2002), Categorical Data Analysis, Second Edition, J. Wiley
Baron, M. (2007), Probability and Statistics for Computer Scientists, Chapman and Hall/CRC
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, J. Wiley
Berk, R. A. (2008), Statistical Learning from a Regression Perspective, Springer
Breiman, L. (1996), Bagging predictors, Machine Learning, 24(2), 123-140
Breiman, L. (2001), Random forests, Machine Learning, 45(1), 5-32
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Wadsworth
Chen, C., Härdle, W., and Unwin, A. (2008), Handbook of Data Visualization, Springer
Cleveland, W. S. (1985), The Elements of Graphing Data, Monterey, CA: Wadsworth
Cule, E., and De Iorio, M. (2012), A semi-automatic method to guide the choice of ridge parameter in ridge regression, arXiv:1205.0686v1 [stat.AP]
Freund, R. F., and Wilson, W. J. (2003), Statistical Methods, Second Edition, Academic Press
Friendly, M. (2001), Visualizing Categorical Data, SAS
Friendly, M. (2008), A brief history of data visualization, in Handbook of Data Visualization (pp. 15-56), Springer
Gunst, R. F. (2002), Finding confidence in statistical significance, Quality Progress, 35(10), 107-108
Gupta, A. (1997), Establishing Optimum Process Levels of Suspending Agents for a Suspension Product, Quality Engineering, 10, 347-350
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning, Second Edition, Springer
Horgan, J. M. (2008), Probability with R - An Introduction with Computer Science Applications, J. Wiley
Johnson, V. E., and Albert, J. H. (1999), Ordinal Data Modeling, Springer
Kutner, M. H., Nachtsheim, C., and Neter, J. (2004), Applied Linear Regression Models, McGraw Hill
Montgomery, D. C. (2007), Introduction to Statistical Quality Control, J. Wiley
Montgomery, D. C., Peck, E. A., and Vining, G. G. (2012), Introduction to Linear Regression Analysis, J. Wiley
Montgomery, D. C., and Runger, G. C. (2003), Applied Statistics and Probability for Engineers, J. Wiley
Pawitan, Y. (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, Oxford University Press
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann
Ross, S. M. (2010), Introductory Statistics, Third Edition, Academic Press
Rousseeuw, P. J., Ruts, I., and Tukey, J. W. (1999), The bagplot: a bivariate boxplot, The American Statistician, 53(4), 382-387
Ryan, T. P. (2007), Modern Engineering Statistics, J. Wiley
Sarkar, D. (2008), Lattice: Multivariate Data Visualization with R, Springer
Tattar, P. N., Suresh, R., and Manjunath, B. G. (2013), A Course in Statistics with R, Narosa
Tufte, E. R. (2001), The Visual Display of Quantitative Information, Graphics Press
Tukey, J. W. (1977), Exploratory Data Analysis, Addison-Wesley
Velleman, P. F., and Hoaglin, D. C. (1981), Applications, Basics, and Computing of Exploratory Data Analysis, available at http://dspace.library.cornell.edu/
Wickham, H. (2009), ggplot2, Springer
Wilkinson, L. (2005), The Grammar of Graphics, Second Edition, Springer