(A Small) R Guide For The Beginners
With Applications for Decision Making
Intended to be used on the course "Data Science for Business", Aalto University, School of Business

Contents

0. Preparing the working space
   0.1. Using RStudio
1. Data types, data classes and files
   1.1. Logical operators, comparison and truth values
   1.2. Scalars
   1.3. Arithmetic
   1.4. Vectors
   1.5. Matrices
   1.6. Characters
   1.7. Factors
   1.8. Data frames
   1.9. List
   1.10. Loading the data from an external file
   1.11. Referring to cells, rows and columns, choosing partial data
2. Statistical functions
   2.1. Descriptives
   2.2. Basic functions for normal distribution
   2.3. Apply() function and its sister functions
   2.4. Creating functions
   2.5. Algorithms
        2.5.1. For loops
        2.5.2. If else structure
3. Methods of numerical inference
   3.1. Cross-tabulating
   3.2. Classification of the data
   3.3. T-test and confidence intervals
   3.4. F-test (incomplete)
   3.5. Tests for the variance (independence)
   3.6. Bayesian inference
4. Graphics
   4.1. Basic graphics
   4.2. Ggplot2
   4.3. Plotly (incomplete)
5. Decision-making methods for business
   5.1. Simple linear regression
   5.2. Multiple linear regression
   5.3. Logistic regression
   5.4. Confounding variables (incomplete)
   5.5. Multinomial logistic regression
   5.6. Time series
        5.6.1. General techniques
        5.6.2. ARIMA models
        5.6.3. VAR models
   5.7. Bayesian inference (incomplete)
   5.8. Tree-based methods (incomplete)
   5.9. Support vector classifier and support vector machine
   5.10. Depicting the data
        Linear scale vs. log scale
        Linear growth rates vs. log growth rate
        Cutting and scaling the axes
        Spans of time series
        Proportions vs. counts
        Palettes for everyone
6. References
   6.1. Function reference
   6.2. Literature reference

0. Preparing the working space

As of now there are several open-source environments for working with R, and at least two of them are meant to work only with R. In any case you need to install the base system first.

0. Go to http://cran.r-project.org/mirrors.html and choose one of the mirrors. Usually, the closer the mirror is to you geographically, the higher the download speed.
1. Click the mirror link you prefer. In the section "Download and Install R" you can choose the preferred version. Currently R is available for Mac, Windows, and Linux platforms.
2. Choose a version according to your platform. If you are installing R for the first time on Windows, the easiest way is to click "This is what you want to install R for the first time." and then "Download R X.X.X for Windows". If you are using a Mac, choose the topmost link "R-X.X.X.pkg"; if you are using an older version of macOS, use one of the other links instead. For Linux, follow the instructions found in /debian/README.html.
3. Unless you are using Linux, the file you download will include the GUI (graphical user interface) for R, so you can start using R immediately after the installation.
4. Now let's proceed with the installation of RStudio, which is a more user-friendly piece of software. Go to the page https://www.rstudio.com/products/rstudio/download/ (you need to have base R installed to use RStudio).
5. In the section "Installers for Supported Platforms" choose the version that suits your platform (Mac, Windows or Linux). Download it and install it.

0.1. Using RStudio

After installing RStudio, the default view is divided into four panes, referred to below as Boxes 1-4. In Box 1 you can write your code, functions and algorithms, and load saved code files. By default the file extension of the code files should be .R, but you can load code from text files with different extensions, for instance .txt (caveat: unless the extension is .R you won't be able to use all the available features, one of them being code completion). In Box 2 you can see all the data that has been created, loaded and modified previously. All the data and functions (at least in RStudio version 0.99.465) can be viewed by clicking them, and they open in Box 1.
In the Box 2 you can also view all the commands typed previously by choosing the tab “History”. The values stored in the environment can be viewed by typing their names in the Box 3. In the Box 3 you can also type all the commands and write algorithms in the same way as in Box 1, even though it may be more practical to do that in the Box 1. Box 3 is also a log console; hence you can see all the actions being performed in the current session. Box 4 is one of the biggest advantages of RStudio compared to the original R GUI. You can view files in your working directory, view the plots you created, manage the libraries that you are using, and surf through the vast help documentation. It’s a multifunctional section that is connected with the Box 3. If you feel unsatisfied with the default setup of the working interface, you can change it by choosing from the upper menu: “Tools” -> “Global Options...” -> “Pane Layout”. RStudio, just like R GUI is using a default working directory to load, save and store all the files that you need to use. To change the default working directory go to “Tools” -> “Global Options…” -> “General” -> “Default working directory”. You can redefine the working directory later and change it if you create a new project. Creating new projects is simple and practical especially when one has multiple data sets preloaded. By creating separate projects you can ensure that you won’t run out of memory. (Take this warning seriously in case you are working with bug datasets. Loading a dataset with the size of, say, 1 Gb, RStudio will reserve an equal amount of operative memory in your machine) Go to “File” -> “New Project…” to create new projects. 1. Data types, data classes and files “R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years.” -Wikipedia. This section is meant to introduce a reader to different types of data that can be processed in RStudio. Data types are introduced parsimoniously and one is encouraged to practice more independently. 1.1. Logical operators, comparison and truth values In R there is a possibility to compare values with the following operators: “<” standing for less, “>” for more, “>=” for more or equal, “<=” for less or equal, and “==” for equal. Operator “=” is meant for inserting the values and therefore it can’t be used for the comparison. The comparison operators print out TRUE or FALSE and the result can be stored in the variable. Also, logical operators such as “&” standing for AND, “|” for OR, and “!” for negation are in use. All logical and comparison operators can be used in the same sentence. All variables have different types depending on the values stored in them. You can always check the type of the variable with the function class(). Example 1. .................................................................................................................................... > 0>1 [1] FALSE > 0=1 Error in 0 = 1 : invalid (do_set) left-hand side to assignment > 0==1 [1] FALSE > a <- 0 != 0 > a [1] FALSE > a == FALSE [1] TRUE Example 2. .................................................................................................................................... 
> (1==0) & (1>0) [1] FALSE > (1==0) | (1>0) [1] TRUE > !(1==0) [1] TRUE > TRUE | FALSE [1] TRUE > !TRUE [1] FALSE > TRUETRUE>FALSE [1] TRUE 1.2. Scalars Scalars are the simplest data types and they can be input in R by simply typing the values. Any variable can be assigned any value by using the operator “<-“. Later the variable can be printed out by typing its name. Example 3. .................................................................................................................................... > 42 [1] 42 > 0.42 [1] 0.42 Example 4. .................................................................................................................................... > a <- 42 > a [1] 42 > b <- .42 > b [1] 0.42 > c(a,b) [1] 42.00 0.42 1.3. Arithmetic In R it is possible to perform basic arithmetic operations such as addition, subtraction, multiplication, division, exponentiation, integer division and modulus calculation with real and complex numbers. Arithmetic operations can be performed with predefined existing variables. Example 5. .................................................................................................................................... > c<-5 > d<-3 > c-d [1] 2 > c+d [1] 8 > e<-c+d > e [1] 8 > c-10 [1] -5 > c*d [1] 15 > c/d [1] 1.666667 > e%%d [1] 2 > (e%%d)^d [1] 8 > (e%%d)**d [1] 8 > e%/%c [1] 1 > 42%%5 #Division remainder calculator [1] 2 > 42%/%5 #Division quotient calculator [1] 8 1.4. Vectors Vectors are one-dimensional matrices. Both scalars and vectors can be combined into new vectors with a combination function c(). A new vector variable can be saved into another variable in the same way as scalars. In fact, scalars are also vectors in R. Some simple data is better represented in vectors, which can be thought as values of a single variable. A function seq() can be used to create generic arithmetic sequences. Example 6. .................................................................................................................................... > a<-c(0,2,4,6,8,10) > a [1] 0 2 4 6 8 10 > d<-c(12,3,4,a) > d [1] 12 3 4 0 2 > 0:10 [1] 0 1 2 3 4 4 5 6 6 8 10 7 8 9 10 > seq(0,10,by=0.3) [1] 0.0 0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4 2.7 3.0 3.3 [13] 3.6 3.9 4.2 4.5 4.8 5.1 5.4 5.7 6.0 6.3 6.6 6.9 [25] 7.2 7.5 7.8 8.1 8.4 8.7 9.0 9.3 9.6 9.9 > 10:0 [1] 10 9 8 7 6 5 4 3 2 1 0 Vector’s cell can be referred to by giving its index in the square brackets after the name of the vector. N.B.: indexing starts from 1. Referring to an index that doesn’t exist returns NA value. It is possible to refer to multiple indices at the same time by inputting a vector as an index. Function length() returns the length of a vector. Example 7. .................................................................................................................................... > e<-c(5,10,5) > e[2] [1] 10 > e[10] [1] NA > length(e) [1] 3 > e[length(e)] [1] 5 > e[c(1,2)] [1] 5 10 > e[c(2,3)] [1] 10 5 > e[1:3] [1] 5 10 5 > e[1:length(e)] [1] 5 10 5 > e[1:4] [1] 5 10 5 NA > e[c(F,T,T)] [1] 10 5 > e[c(-2)] [1] 5 5 Changing all odd numbers to 42: Example 8. .................................................................................................................................... > a<-seq(1:10) > a [1] 1 2 3 4 > a%%2==0 [1] FALSE 5 6 TRUE FALSE > a[!a%%2==0]=42 > a [1] 42 2 42 4 42 > a==42 [1] TRUE FALSE 7 8 9 10 TRUE FALSE 6 42 TRUE FALSE TRUE FALSE TRUE 8 42 10 TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE 1.5. 
Matrices

A matrix can be created from a vector or a set of vectors. For example, the matrix

    1 2 3
    4 5 6
    7 8 9

can be created with the command matrix() in the following manner:

> m <- matrix(c(1,2,3,4,5,6,7,8,9), ncol=3, nrow=3, byrow=TRUE)
> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

where the argument "byrow" takes the values TRUE and FALSE depending on how we would like to fill the matrix, by rows or by columns. The argument "ncol" gives the number of columns and "nrow" respectively the number of rows.

Matrices can be handled in the same way as vectors. It is possible to refer to their cells, rows, and columns by using the square brackets after their name. Adding more rows or columns to a matrix can be done with the commands rbind() and cbind() respectively. Removing rows and columns can be done by using truth values in the same way as with vectors.

> rbind(m, c(42,42,42))
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   42   42   42
> cbind(rbind(m, c(42,42,42)), c(42,42,42,42))
     [,1] [,2] [,3] [,4]
[1,]    1    2    3   42
[2,]    4    5    6   42
[3,]    7    8    9   42
[4,]   42   42   42   42

Example 9. ..................................................................................................................................
> M <- array(1:10, dim=c(5,5))
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6    1    6    1
[2,]    2    7    2    7    2
[3,]    3    8    3    8    3
[4,]    4    9    4    9    4
[5,]    5   10    5   10    5

For example, removing the first three rows and columns can be done as follows:

> M <- M[-1:-3,-1:-3]
> M
     [,1] [,2]
[1,]    9    4
[2,]   10    5

Or alternatively:

> M <- M[c(F,F,F,T,T),c(F,F,F,T,T)]
> M
     [,1] [,2]
[1,]    9    4
[2,]   10    5

Hence, both approaches give the same 2 x 2 result.

Matrix multiplication can be done using the operator "%*%", element-wise addition and subtraction can be performed with "+" and "-", and the command t() transposes a matrix.

Multiplying:

Example 10. ..................................................................................................................................
> M1 <- matrix(c(1,2,3,4,5,6,7,8,9), ncol=3, nrow=3, byrow=T)
> M1
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> M2 <- matrix(c(1,2,3,4,5,6,7,8,9), ncol=3, nrow=3, byrow=T)
> M2
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> M3 <- M1%*%M2
> M3
     [,1] [,2] [,3]
[1,]   30   36   42
[2,]   66   81   96
[3,]  102  126  150

Transposing:

Example 11. ..................................................................................................................................
> t(M3)
     [,1] [,2] [,3]
[1,]   30   66  102
[2,]   36   81  126
[3,]   42   96  150

The solve() function can be used to solve systems of linear equations whose parameters are in matrix form. The system of linear equations has to be in the form Ax = B. If B is not defined, then it is assumed to be the identity matrix, and the function will return the inverse of A.

Example 12. ..................................................................................................................................
Say we want to solve the following system of equations:

$$\begin{cases} 3x + 2y - 4z = -4 \\ x - 6y + 3z = -4 \\ 2x - y + z = 5 \end{cases}$$

which corresponds to the matrix form

$$\begin{pmatrix} 3 & 2 & -4 \\ 1 & -6 & 3 \\ 2 & -1 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} -4 \\ -4 \\ 5 \end{pmatrix}$$

> A <- matrix(c(3,1,2,2,-6,-1,-4,3,1), ncol=3, nrow=3, byrow=F)
> B <- matrix(c(-4,-4,5), ncol=1, nrow=3, byrow=F)
> A
     [,1] [,2] [,3]
[1,]    3    2   -4
[2,]    1   -6    3
[3,]    2   -1    1
> B
     [,1]
[1,]   -4
[2,]   -4
[3,]    5
> solve(A,B)
     [,1]
[1,]    2
[2,]    3
[3,]    4
> solve(A)
            [,1]        [,2]      [,3]
[1,]  0.06976744 -0.04651163 0.4186047
[2,] -0.11627907 -0.25581395 0.3023256
[3,] -0.25581395 -0.16279070 0.4651163

1.6. Characters

In addition to numeric variables and Booleans, it is also possible to store character values that consist of letters and numbers as text. Sequences of characters can be input in single or double quotes and they can be stored in the same way as other values.

Example 13. ..................................................................................................................................
> string <- "RStudio"
> string
[1] "RStudio"
> data_structures <- c("matrix", "table", "list")
> data_structures
[1] "matrix" "table"  "list"

1.7. Factors

Factors are the type of values that are meant to be used with classified (categorical) variables, such as the sex "M" or "F". A variable can be converted to a factor variable using the function as.factor().

Example 14. ..................................................................................................................................
> answer <- c("Y","N","N","N","Y")
> answer
[1] "Y" "N" "N" "N" "Y"
> answer <- as.factor(answer)
> answer
[1] Y N N N Y
Levels: N Y

1.8. Data frames

Data frames are used as the fundamental data structure by most of R's modeling software. They are tightly coupled collections of variables which share many of the properties of matrices and lists. For instance, combining numeric and character vectors with data.frame() produces a data frame as an output. Data frames can be created using the function data.frame(), and the argument "stringsAsFactors = FALSE/TRUE" defines whether string values are stored as characters or as factors. In fact, strings are characters in R terminology. This is another confusing part of R terminology, apart from the library/package distinction.

Example 15. ..................................................................................................................................
> pizza <- c('Margherita', 'Kebab', 'Quattro stagioni','Frutti di mare','Hawaii')
> isItalian <- c('Y','N','Y','Y','N')
> price <- c(4,5,5,6,4.5)
> pizzeria <- data.frame(pizza, isItalian, price, stringsAsFactors=TRUE)
> pizzeria
             pizza isItalian price
1       Margherita         Y   4.0
2            Kebab         N   5.0
3 Quattro stagioni         Y   5.0
4   Frutti di mare         Y   6.0
5           Hawaii         N   4.5
> str(pizzeria)
'data.frame':   5 obs. of  3 variables:
 $ pizza    : Factor w/ 5 levels "Frutti di mare",..: 4 3 5 1 2
 $ isItalian: Factor w/ 2 levels "N","Y": 2 1 2 2 1
 $ price    : num  4 5 5 6 4.5

1.9. List

List is a data structure that provides a flexible method to store variables of different types, classes, and lengths, which is not possible in a vector or a table. Trying to combine variables from different classes in a vector converts them to the same class.

“If you are using a data frame the data types must all be the same otherwise they will be subjected to type conversion. This may or may not be what you want, if the data frame has string/character data as well as numeric data, the numeric data will be converted to strings/characters and numerical operations will probably not give what you expected.
“ Source: r-bloggers.com Example 16. .................................................................................................................................. > vector <- c('test', 1:5, 42) > vector [1] "test" "1" "2" "3" "4" "5" "42" > class(vector) [1] "character" Example 17. .................................................................................................................................. > pizza <- c('Margherita','Quattro stagioni','Frutti di mare') > ingredients <- c('Tomato sauce', 'Cheese', 'Basilica','Olive oil') > price <- c(4:10) > owner_height <- sqrt(32400) > pizzeria <- list(pizza,ingredients,price,owner_height,3.14, c(T,F,T)) > pizzeria [[1]] [1] "Margherita" "Quattro stagioni" "Frutti di mare" [[2]] [1] "Tomato sauce" "Cheese" [[3]] [1] 4 5 6 7 8 9 10 "Basilica" "Olive oil" [[4]] [1] 180 [[5]] [1] 3.14 [[6]] [1] TRUE FALSE TRUE > pizzeria[3] [[1]] [1] 4 5 6 7 8 > pizzeria[1:3] [[1]] [1] "Margherita" 9 10 "Quattro stagioni" "Frutti di mare" [[2]] [1] "Tomato sauce" "Cheese" [[3]] [1] 4 5 6 7 8 "Basilica" "Olive oil" 9 10 Subgroups of lists are lists as well: Example 18. .................................................................................................................................. > second <- pizzeria[2] > second [[1]] [1] "Tomato sauce" "Cheese" > second[1] [[1]] [1] "Tomato sauce" "Cheese" > class(second) [1] "list "Basilica" "Olive oil" "Basilica" "Olive oil" "Basilica" "Olive oil" > is.list(second) [1] TRUE > second[2] [[1]] NULL > second <- pizzeria[[2]] > second [1] "Tomato sauce" "Cheese" > second[2] [1] "Cheese" > class(second) [1] "character" Example 19. .................................................................................................................................. > cars <- list(brand=c("Ford"), model=c("Mustang", "Taurus", "Tourneo", "Transit"), body=as.factor(c("passenger","van")), intro_year=c(1964,1986,1995,1965), generation=1:6, serialno=sample(x=10000:10000000, size=4)) > cars $brand [1] "Ford" $model [1] "Mustang" "Taurus" "Tourneo" "Transit" $body [1] passenger van Levels: passenger van $intro_year [1] 1964 1986 1995 1965 $generation [1] 1 2 3 4 5 6 $serialno [1] 8181771 879340 1885626 9812503 > cars[[2]] [1] "Mustang" "Taurus" "Tourneo" "Transit" > cars$model[2] [1] "Taurus" > cars$model <- c(cars$model, "Focus") > cars$model [1] "Mustang" "Taurus" "Tourneo" "Transit" "Focus" > cars$engine <- "V12" > cars$engine [1] "V12" > class(cars$engine) [1] "character" > cars$engine <- as.factor(c(cars$engine, "V4", "V6", "V8", "V10")) > cars$engine [1] V12 V4 V6 V8 V10 Levels: V10 V12 V4 V6 V8 > class(cars$engine) [1] "factor" Removing an object from the list can be done in a following manner. Example 20. .................................................................................................................................. > remove(cars$engine) Error in remove(cars$engine) : ... must contain names or character strings > cars$engine <- NULL > cars$engine NULL > cars <- cars[-7] > cars$engine NULL 1.10. Loading the data from an external file By default R software is using a predefined directory for loading and storing the files, this directory can be checked with the command getwd(). A new default directory can be set using a command setwd() or by clicking “File” -> “Change dir” in the default R GUI and “Tools” -> “Global Options…” -> “General” in RStudio. 
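As a quick illustration of these working-directory commands (the folder path below is only a placeholder; use a directory that actually exists on your machine), the directory can also be inspected and changed straight from the console:

getwd()        # print the current working directory
setwd("C:/R")  # point R to another folder (hypothetical path; it must already exist)
list.files()   # list the files R can now see without typing a full path

Example 21 below uses getwd() and setwd() in the same way before loading a data file.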
A data file can be loaded into R in many different ways, for example using the following command:

X <- read.table("exampledata.txt", header = TRUE, stringsAsFactors=FALSE)

where X will be the name of the data frame and the argument header defines whether the first row of the file already contains the variable/column names. If the data being loaded doesn't have column names or you set "header = FALSE", then R will use default column names that look like "V1", "V2", and so on. By default, strings in the data are treated as factors. Setting the option "stringsAsFactors=FALSE" makes sure that the strings are loaded as characters. A column can be converted to factors afterwards by typing:

X$Column <- factor(X$Column)

where "Column" is any variable in the data set. Files can also be loaded from outside of the working directory in the following way:

X <- read.table(file.choose(), header = TRUE)

After that the software will ask you to choose a file manually from the system. For CSV (comma-separated values) files use the following format instead:

X <- read.csv(file.choose(), header = TRUE)

In RStudio you can load the file interactively by clicking "Environment" -> "Import Dataset" -> "From Text File…" and by choosing a file from a directory. After choosing a file, you will be provided with several options to import the file. The good thing is that you can see all the changes you make before the import in the window "Data Frame".

The default data extensions in R are .csv and .txt. If you want to load .xlsx files you have to import them into R using special importing functions. The 'xlsx' package has a function read.xlsx() for reading Excel files:

install.packages("xlsx")
library(xlsx)
data <- read.xlsx("data.xlsx", 1)  #instead of "data.xlsx" you can write
                                   #file.choose() to choose a file from a
                                   #different directory. The second argument
                                   #defines the number of the sheet being loaded

The package 'xlsx' depends on the package 'rJava', which is loaded automatically. If you get an error loading 'rJava', it is probably due to an outdated version of Java. Head to http://www.java.com/en/download/ to download the latest version. If it seems like RStudio can't find where Java is installed, use the following code to point to it (change the location if needed):

Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre7')       # for the 64-bit version of Windows
Sys.setenv(JAVA_HOME='C:\\Program Files (x86)\\Java\\jre7') # for the 32-bit version of Windows

The package 'gdata' has a method to load older .xls files:

install.packages("gdata")
library(gdata)
data <- read.xls("datafile.xls")  #reads the first sheet by default.
                                  #Otherwise use the argument sheet to define
                                  #the number of the sheet needed

Excel itself can also save the file in .csv or .txt format.

Loading data from other file formats is possible as well. The package 'foreign' can handle files from SPSS and other sources:

install.packages("foreign")
library(foreign)

Use the following functions to load files depending on your source:
SPSS: read.spss()
Stata: read.dta()
MATLAB and Octave: read.octave()
SAS: read.xport()

Example 21. ..................................................................................................................................
data: Auto.csv from http://www-bcf.usc.edu/~gareth/ISL/data.html > getwd() [1] "C:/WINDOWS/system32" > setwd("C:/R") > getwd() [1] "C:/R" > data = read.csv(file.choose(), header = TRUE, stringsAsFactors=FALSE) > data[1:5,1:5] mpg cylinders displacement horsepower weight 1 18 8 307 130 3504 2 15 8 350 165 3693 3 18 8 318 150 3436 4 16 8 304 150 3433 5 17 8 302 140 3449 1.11. Referring to cells, rows and columns, choosing partial data A data frame’s cells, rows and columns can be pointed in the same way as the ones of matrices. The columns of the data frame are vectors and they can be referred using numbers or their names (headers). The partial data can be chosen using different conditions. When applying a condition to a specific vector/column in the table the operator $ should be used for referring to this column. Example 22. .................................................................................................................................. data: Auto.csv from http://www-bcf.usc.edu/~gareth/ISL/data.html > data[5,4] [1] "140" > data[1:5,4] [1] "130" "165" "150" "150" "140" > data[5,] mpg cylinders displacement horsepower 5 17 8 302 140 > data$weight[1:5] [1] 3504 3693 3436 3433 3449 > View(data) > class(data$cylinders) [1] "integer" > class(data$acceleration) [1] "numeric" > class(data$name) [1] "character" > data$name[data$cylinders==8 & data$weight<3300] [1] "buick estate wagon (sw)" "chevrolet monza 2+2" "ford futura" "ford mustang ii" > data$horsepower/100 Error in data$horsepower/100 : non-numeric argument to binary operator > class(data$horsepower) [1] "character" > data$horsepower <- as.numeric(data$horsepower) > data$horsepower[1:5]/100 [1] 1.30 1.65 1.50 1.50 1.40 Example 23. .................................................................................................................................. > > > > > > > 1 2 3 4 model <- c("Mustang", "Taurus", "Tourneo", "Transit") body <- c(1,NA,2,NA) body <- factor(body,labels=c("passenger","van")) intro_year <- c(1964,1986,1995,1965) serialno <- sample(x=10000:10000000, size=length(model)) ford <- data.frame(model, body, intro_year, serialno) ford model body intro_year serialno Mustang passenger 1964 901917 Taurus 1986 2676266 Tourneo van 1995 660185 Transit 1965 5949726 Example 24. .................................................................................................................................. > ford[ford$body=='passenger',] model body intro_year serialno 1 Mustang passenger 1964 901917 NA NA NA NA.1 NA NA > nrow(ford[ford$body == 'passenger', ]) [1] 3 > subset(ford, body == 'passenger') model body intro_year serialno 1 Mustang passenger 1964 901917 > nrow(subset(ford, body == 'passenger')) [1] 1 Sometimes empty or NA values in the data can produce false results, especially if we want to dismiss the NA values completely. In such cases which() or subset() function can fix the problem. Example 25. .................................................................................................................................. > v < 4 [1] TRUE TRUE > v[v<4] [1] 1 2 NA 3 NA NA FALSE TRUE NA > length(v[v<4]) [1] 5 > sum(v<4, na.rm=T) [1] 3 Example 26. .................................................................................................................................. > which(v<4) [1] 1 2 5 > v[which(v<4)] [1] 1 2 3 > length(v[which(v<4)]) [1] 3 > length(which(v<4)) [1] 3 Example 27. 
.................................................................................................................................. > subset(v, v<4) [1] 1 2 3 > length(subset(v, v<4)) [1] 3 Example 28. .................................................................................................................................. > M42 <- data.frame(matrix(1:42, nrow=7)) > colnames(M42) <- c("V1", "V2", "V3", + "V4", "V5", "V6") > M42 V1 V2 V3 V4 V5 V6 1 1 8 15 22 29 36 2 2 9 16 23 30 37 3 3 10 17 24 31 38 4 4 11 18 25 32 39 5 5 12 19 26 33 40 6 6 13 20 27 34 41 7 7 14 21 28 35 42 > M42.345 <- subset(M42, select=c("V3", "V4", "V5")) > M42.345 V3 V4 V5 1 15 22 29 2 16 23 30 3 17 24 31 4 18 25 32 5 19 26 33 6 20 27 34 7 21 28 35 > M42.345.over10 <- subset(M42, V2 > 10, select=c("V3", "V4", "V5")) > M42.345.over10 V3 V4 V5 4 18 25 32 5 19 26 33 6 20 27 34 7 21 28 35 Function paste() can be useful if there’s a need to create several columns that have consequent generic names. Example 29. .................................................................................................................................. > paste("column", 1:3) [1] "column 1" "column 2" "column 3" > paste("column", 1:3, sep="") [1] "column1" "column2" "column3" > paste("column", 1:3, sep="_") [1] "column_1" "column_2" "column_3" > paste("column", 1, letters[1:5], sep="") [1] "column1a" "column1b" "column1c" "column1d" [5] "column1e" > paste("column", 1, LETTERS[1:5], sep="") [1] "column1A" "column1B" "column1C" "column1D" [5] "column1E" Example 30. .................................................................................................................................. > M42.345 <- subset(M42, select=paste("V",3:5, sep="")) > M42.345 V3 V4 V5 1 15 22 29 2 16 23 30 3 17 24 31 4 18 25 32 5 19 26 33 6 20 27 34 7 21 28 35 > M42.135 <- subset(M42, select=paste("V", seq(1,6, by=2), sep="")) > M42.135 V1 V3 V5 1 1 15 29 2 2 16 30 3 3 17 31 4 4 18 32 5 5 19 33 6 6 20 34 7 7 21 35 > M42.345.V2even <- subset(M42, V2%%2==0, select=paste("V", seq(3,5,by=1), sep="")) > M42.345.V2even V3 V4 V5 1 15 22 29 3 17 24 31 5 19 26 33 7 21 28 35 2. Statistical functions There is a vast collection of functions for data analysis in R. The functions are included in different libraries which tend to have different purposes. The standard libraries are always included in R package, the rest can be downloaded by clicking “Packages” -> “Load package” in the standard R framework or simply by typing library() with the name of the package in the brackets. If you don’t have a package you need to use, then you have to install it first. You can do it by typing install.packages() in the dialog box with the name of the package in the brackets. Also you can do it interactively by choosing “Packages” -> “Install”: You may be prompted to select a download mirror. You can either choose the one geographically closest to you, or, if you want to make sure you have the most up-to-date version of your package, choose the Austrian site, which is the primary CRAN server. Get back to the dialog box and pick the Austrian repository by typing: setRepositories() and by choosing a preferred action from the dialog box. When you tell R to install a package, it will automatically install any other packages that the first package depends on. 
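For instance, here is a minimal sketch of installing and loading a package from the console, using the 'pastecs' package that also appears later in Section 2.1:

install.packages("pastecs")    # downloads the package and its dependencies from CRAN
library(pastecs)               # loads the package into the current session
stat.desc(c(1, 2, 3, 4, 5))    # a function from the package is now available

Note that install.packages() has to be run only once per machine, while library() has to be called again in every new R session.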
The content (all the functions) available in a package can be checked using a function ls(): ls("package: the_name_of_the_package") Please be aware that in addition to methods, tests and functions many packages include data sets which can be very useful for practicing. Although one has to use the library() function to load a package, a package is not a library. A library is a directory that contains a set of packages. You might, for example, have a system-wide library as well as a library for each user. 2.1. Descriptives There are plenty of built-in functions to get the descriptive statistics in R. Some of them are introduced below. Example 31. .................................................................................................................................. data: Auto.csv from http://www-bcf.usc.edu/~gareth/ISL/data.html > mean(data$weight) [1] 2970.262 > median(data$weight) [1] 2800 > max(data$weight) [1] 5140 > min(data$weight) [1] 1613 > sd(data$weight) [1] 847.9041 You may find a package “pastecs” useful as it includes a function for a table of comprised descriptive statistics. Example 32. .................................................................................................................................. > > > > library(pastecs) options(digits=2) options(scipen=100) stat.desc(data$weight, basic=F) median 2800.00 mean 2970.26 std.dev 847.90 coef.var 0.29 2.2. SE.mean CI.mean.0.95 42.56 83.66 var 718941.40 Basic functions for normal distribution Example 33. .................................................................................................................................. > rnorm(mean = 0, sd=1, n=10) #Simulating a normal distribution with zero #mean,the standard deviation (sd) of one, n=10 [1] 0.1611 -0.6463 0.0080 -0.0831 0.1452 -1.2257 -1.2628 > rnorm(mean = 0, sd=1, n=10) [1] 1.09 -1.16 0.14 0.92 1.31 -0.55 -0.39 1.46 0.0092 1.50 0.0437 -0.5330 0.61 > pnorm(q = -1, mean=0, sd = 1) #Cumulative distribution function [1] 0.1586552539 #the point at quantile -1 > qnorm(mean=0, sd = 1, p = 1/3, lower=T) #quantile functions [1] -0.43 #1/3 of the probability mass is found to the #left of this point > dnorm(x=1, mean=0, sd = 1) [1] 0.24 2.3. #Density functions #the solution of the equation f(x) with x=1 Apply() function and its sister functions apply() function is especially practical in case of matrices or data frames. It allows to run operations row by row or column by column for the matrices. apply() function has a following syntax: apply(X, MARGIN, FUN, ...), where X is the data being elaborated, MARGIN indicates row if equal to 1 and columns if equal to 2, FUN is the function being used, … may include optional arguments for the function FUN. lapply() performs the similar actions for the lists or vectors. The main difference between apply() and lapply() is what these two different functions print out. lapply() has a following syntax: lapply(X, FUN, ...), where arguments correspond to the same purpose as in the case of apply(). Descriptives of the partial data (especially factors) from a matrix or a data frame can be most easily done using the function tapply() which a following syntax: tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE), where X is an array, typically a column of the matrix, INDEX is an argument for factors, typically another column. If “simplify=FALSE”, the output is a list, otherwise the output is a scalar. FUN is a function being applied. Example 34. 
.................................................................................................................................. > m <- matrix(rep(1:3,3),ncol=3) > m [1,] [2,] [3,] [,1] [,2] [,3] 1 1 1 2 2 2 3 3 3 > apply(m,1,sum) [1] 3 6 9 > a <- apply(m,2,sum) > a [1] 6 6 6 > class(a) [1] "integer" Example 35. .................................................................................................................................. > l <- list(1:3,4:6,7:9) > l [[1]] [1] 1 2 3 [[2]] [1] 4 5 6 [[3]] [1] 7 8 9 > lapply(l,mean) [[1]] [1] 2 [[2]] [1] 5 [[3]] [1] 8 Example 36. .................................................................................................................................. > > > > ford$weight <- c(3139, 2700, 2600, 3200) ford$body[ford$model=="Taurus"] <- 'passenger' ford$body[ford$model=="Transit"] <- 'van' ford model body intro_year serialno weight 1 Mustang passenger 1964 901917 3139 2 Taurus passenger 1986 2676266 2700 3 Tourneo van 1995 660185 2600 4 Transit van 1965 5949726 3200 > tapply(ford$weight, ford$body, mean) passenger van 2920 2900 2.4. Creating functions Even though R has a big set of functions, sometimes one may need to combine these functions. Writing own functions in R can be done most conveniently in the script environment. You can create a new script - In RStudio: File -> New File -> R Script or Ctrl+Shift+N - In R console: File -> New Script Example 37. .................................................................................................................................. theTruth <- function(N) { #A function that returns an N amount of 42’s x <- rep(42,N) return(x) } > theTruth(10) [1] 42 42 42 42 42 42 42 42 42 42 Example 38. .................................................................................................................................. gm_mean <- function(a){ #a function to calculate a geometric average prod(a)^(1/length(a)) } > x<-seq(1:10) > gm_mean(x) [1] 4.528729 2.5. Algorithms 2.5.1. For loops For loops can be used to create repetitive algorithms in R. It is important to keep in mind that for is a function in R like any other function and therefore it outputs a value at the end of its execution. Therefore it is important to use return() function to print out the value of the for function when it’s needed. A misuse of the for function can lead to errors without the return() function. Let’s use a simple for structure to demonstrate its action. Example 39. .................................................................................................................................. > for(i in 1:5){ + print(i) + } [1] 1 [1] 2 [1] 3 [1] 4 [1] 5 In the previous example for function repeats itself 5 times according to the instructions in braces. Hence, as long as the index i is in the range from 1 to 5 (that is the condition for for function to continue the execution), the function prints it out. The instructions in braces don’t necessarily have to be related to the index i itself. We could instead create the following algorithm as well: > for(i in 1:5){ + print(paste("This + } [1] "This is number [1] "This is number [1] "This is number [1] "This is number [1] "This is number is number 42, it is being printed ", i ," times")) 42, 42, 42, 42, 42, it it it it it is is is is is being being being being being printed printed printed printed printed 1 2 3 4 5 times" times" times" times" times" Example 40. 
..................................................................................................................................
totSum <- function(data){  #A function that calculates the sum of an array of numbers
  sum = data[1]
  N = length(data) - 1
  for (i in 1:N){
    sum = sum + data[i+1]
  }
  return(sum)
}

Calculating the total account balance of all customers (data: Credit.csv from http://www-bcf.usc.edu/~gareth/ISL/data.html):

> totSum(credit$Balance)
[1] 208006

For loops work well with matrices and vectors, but when the output is in a matrix or vector form it is sometimes more practical to use the sapply() function instead. The next example demonstrates the trick.

Example 41. ..................................................................................................................................
M42 <- matrix(1:42, nrow=6)

# "values" is an empty vector for the minimum values
values <- rep(0,7)

min_value <- function(x) {  #The algorithm to calculate the minimum value of
                            #a matrix column
  return(min(M42[,x]))
}

for(i in 1:7){
  values[i] <- min_value(i)
}

> M42
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    7   13   19   25   31   37
[2,]    2    8   14   20   26   32   38
[3,]    3    9   15   21   27   33   39
[4,]    4   10   16   22   28   34   40
[5,]    5   11   17   23   29   35   41
[6,]    6   12   18   24   30   36   42
> values
[1]  1  7 13 19 25 31 37
> sapply(1:7, function(x) min_value(x))
[1]  1  7 13 19 25 31 37

2.5.2. If else structure

Conditional statements can be programmed using the if else structure.

Example 42. ..................................................................................................................................
numCompare <- function(x,y) if(x > y) {
  cat(x,"is bigger than",y)
} else {
  cat(x,"is not bigger than", y)
}

#OR

numCompare <- function(x,y){
  if(x > y) cat(x,"is bigger than",y)
  else cat(x,"is not bigger than", y)
}

> numCompare(10,12)
10 is not bigger than 12
> numCompare(12,10)
12 is bigger than 10

Example 43. ..................................................................................................................................
numCompare2 <- function(x,y) if(x > y) {
  cat(x,"is bigger than", y)
} else if(x < y) {
  cat(x,"is smaller than", y)
} else {
  cat(x,"is equal to", y)
}

#OR

numCompare2 <- function(x,y){
  if(x > y) cat(x,"is bigger than", y)
  else if(x < y) cat(x,"is smaller than", y)
  else cat(x,"is equal to", y)
}

> numCompare2(10,10)
10 is equal to 10

3. Methods of numerical inference

3.1. Cross-tabulating

The function table() returns the number of occurrences of each value in the data.

Example 44. ..................................................................................................................................
> v <- c(3,4,5,5,5,6,8,9,7,6,4,2,78,90,6,1,3,5,7,4,23)
> table(v)
v
 1  2  3  4  5  6  7  8  9 23 78 90
 1  1  2  3  4  3  2  1  1  1  1  1
> table(ford$body)

passenger       van
        2         2

The same function table() can also be used to cross-tabulate two variables. The apply() function and its sister functions can be used in the same way on crosstabs, and the function prop.table() is a quick way to print out the proportions of the values row-wise or column-wise. The second argument of prop.table() defines the direction of the proportions, where 1 stands for the rows and 2 for the columns.

Example 45. ..................................................................................................................................
> table(ford$body, ford$intro_year)

            1964 1965 1986 1995
  passenger    1    0    1    0
  van          0    1    0    1
> table(ford$body, ford$intro_year)[2,2]
[1] 1
> table(ford$body, ford$intro_year)[2,]
1964 1965 1986 1995
   0    1    0    1
> cross.ford <- table(ford$body, ford$intro_year)
> cross.ford[1,1]
[1] 1
> cross.ford[,1]
passenger       van
        1         0

Example 46.
.................................................................................................................................. > apply(cross.ford, 1, sum) passenger van 2 2 > > prop.table(cross.ford, 1) passenger van 1964 1965 1986 1995 0.5 0.0 0.5 0.0 0.0 0.5 0.0 0.5 > > prop.table(cross.ford, 2) passenger van 3.2. 1964 1965 1986 1995 1 0 1 0 0 1 0 1 Classification of the data The classification of the data is especially useful in case of continuous variables. Cut() function is one of the options that can be used. It returns the input as factors. Example 47. .................................................................................................................................. > cut(M42$V2, breaks=c(0,10,14)) [1] (0,10] (0,10] (0,10] (10,14] (10,14] [6] (10,14] (10,14] Levels: (0,10] (10,14] > table(cut(M42$V2, breaks=c(0,10,14))) (0,10] (10,14] 3 4 > cut(M42$V2, breaks=c(0,4,8,12)) [1] (4,8] (8,12] (8,12] (8,12] (8,12] [7] Levels: (0,4] (4,8] (8,12] > table(cut(M42$V2, breaks=c(0,4,8,12))) (0,4] 0 (4,8] (8,12] 1 4 Tapply() function can be used to perform operations for the columns that are being classified with the classes from another column. Example 48. .................................................................................................................................. > class <- cut(M42$V2, breaks=c(0,10,14)) > tapply(M42$V3, class, mean) (0,10] (10,14] 16.0 19.5 3.3. T-test and confidence intervals Example 49. .................................................................................................................................. > norm <- rnorm(mean = 0, sd=1, n=1000) > t.test(norm, conf.level=0.99) One Sample t-test data: norm t = -0.1961, df = 999, p-value = 0.8446 alternative hypothesis: true mean is not equal to 0 99 percent confidence interval: -0.08721620 0.07490038 sample estimates: mean of x -0.006157911 > t.test(norm, conf.level=0.99)$conf.int [1] -0.08721620 0.07490038 attr(,"conf.level") [1] 0.99 # replicate confidence interval test 100 times confidence <- replicate(100, t.test(rnorm(n=100, mean=0, sd=1))$conf.int) 3.4. F-test (incomplete) 3.5. Tests for the variance (independence) Example 50. .................................................................................................................................. Let’s assume a hypothetical situation where a company wants to test the efficiency of it’s marketing method. In the first approach the company is using a conventional marketing method to reach its customers and in another approach it is using a new method. After trying both methods the company divides customers into two groups based on the marketing applied on them and picks 1000 random people from both groups. As a result the manager of the company gets the following data: in the group of a new marketing method 745 committed to buy a product and 255 didn’t, while in another group 690 committed and 310 didn’t. Assuming all other factors being fixed and both groups of people being homogeneous, how can we test whether there has been a significant positive impact using a new method? 
Let’s create a matrix with values as follows: > sales <- matrix(c(745,255,690,310), ncol=2) Naming the rows and the columns: > dimnames(sales) <- list(Result = c("Commit", "No commit"), + Marketing_method = c("New","Old")) > sales Marketing_method Result New Old Commit 745 690 No commit 255 310 Applying Pearson’s Chi-squared test: > chisq.test(sales, corr=F) Pearson's Chi-squared test data: sales X-squared = 7.462, df = 1, p-value = 0.006302 Here the argument “corr” means Yates' continuity correction. As we can see the p-value is very small, less than 1%. The null hypothesis is that there’s no difference in distributions of two groups and hence the small p-value speaks against the null hypothesis. Therefore the manager has a strong support for the new marketing method. 3.6. Bayesian inference Let’s augment the previous marketing example to calculate the conditional probability of purchasing the product and being reached by the new marketing campaign. Let’s mark B=1 if a person has purchased the product and B=0 if she hasn’t. Next let’s mark M=1 if the person has been reached by a new marketing campaign (new and conventional combined) and M=0 if she has been reached by the old campaign. Based on the marketing data (customers reached by both campaigns) our manager estimated that a person buys the product when he has been reached by a new marketing campaign with a probability of 74.5%, in other words P(B=1│M=1)=0.745. And he didn’t buy the product when he wasn’t reached by the new campaign (but he has been reached by the conventional campaign) with a probability of 31%, in other words P(B=0│M=0)=0.31. Example 51. .................................................................................................................................. Assume that 12345 people were reached by the campaigns and out of them 34% were reached by the new campaign. Hence, 4197 out of customers reached by the campaigns were reached by the new campaign. We pick a random person from the customers who were reached by the marketing campaigns and notice that this person has purchased the product, what is the probability that this person has been reached by the new campaign? You want to calculate the following probability: 𝑃(𝑀 = 1|𝐵 = 1) = ∑ 𝑃(𝑀 = 1⋂𝐵 = 1) ∑ 𝑃(𝐵 = 1) n <- 12345 newCamp <- rbinom(n=n, size=1, prob=0.34) purchased <- numeric(n) for(i in 1:n) { if(newCamp[i] == 1) { purchased[i] <- rbinom(n=1, size=1, prob=0.745) } else { purchased[i] <- rbinom(n=1, size=1, prob=1-0.31) } } > > sum(newCamp * purchased) / sum(purchased) [1] 0.3586286 NB: in real life you wouldn’t have to simulate the distribution. You could instead take the binary vector of the purchasing customers and run the function: > sum(newCamp * purchased) / sum(purchased) Example 52. .................................................................................................................................. Based on the whole pool of potential customers (e.g. segmented by age or income) our manager has estimated that 11% of them have purchased the product. This group also includes people not being reached by the marketing campaign. The manager estimates that as much as 23% of the people from the whole pool of customers have been reached by a marketing campaign. We pick a random person the customer pool and notice that this has not purchased a product. What is the probability that this person has been reached by a marketing campaign? 
Let's mark the probability that the person has been reached by a marketing campaign $P(M=1)$, and $P(M=0)$ if she hasn't; $P(B=1)$ and $P(B=0)$ respectively for the purchases. Now we want to calculate the following probability:

$$P(M=1 \mid B=0) = \frac{P(M=1 \cap B=0)}{P(B=0)}$$

We know that $P(B=0) = 1 - 0.11 = 0.89$ and $P(M=1) = 0.23$, and the whole customer pool is 12345/0.23 = 53673. In addition, based on the data the manager estimated that if customers are not reached by a marketing campaign they purchase the product with 37% probability.

total <- 53673
camp <- rbinom(n=total, size=1, prob=0.23)
no_purchase <- numeric(total)
for(i in 1:total) {
  if(camp[i] == 1) {
    no_purchase[i] <- rbinom(n=1, size=1, prob=(12345 - sum(purchased))/12345)
  } else {
    no_purchase[i] <- rbinom(n=1, size=1, prob=1-0.37)
  }
}

> sum(camp * no_purchase) / sum(no_purchase)
[1] 0.1230405

4. Graphics

4.1. Basic graphics

You will never feel alone with the collection of graphical tools in R.

Example 53. ..................................................................................................................................
Let's draw a graph of trigonometric functions as follows.

> plot(NULL, xlim=c(-1,1), ylim=c(-1,1), main="[-1,1]x[-1,1] plot", xlab="sin(y)", ylab="cos(y)")
> y <- sapply(1:100, function(y) c(sin(y), cos(y)))
> lines(y[1,],y[2,])

Example 54. ..................................................................................................................................
Data from: http://www-bcf.usc.edu/~gareth/ISL/Credit.csv

> credit=read.csv(file.choose(), header=TRUE, stringsAsFactors=F)
> View(credit)
>
> layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
>
> plot(credit$Age, credit$Income, main="Income vs. age", xlab="Age", ylab="Income", xlim=c(22, 80), ylim=c(20,150))
>
> hist(credit$Income, main="Income")
>
> boxplot(credit$Age, main="Age")

4.2. Ggplot2

Start by installing 'ggplot2'; you will be doing yourself a big favor.

install.packages("ggplot2")

Ggplot2 is probably one of the most useful, intuitive and versatile packages for drawing graphics in R. It is built on top of the grid graphics system (the same foundation as the 'lattice' package) and it is very useful for multilayered plots. Ggplot2 has very good documentation at http://docs.ggplot2.org/current/ and there's no way we could cover everything in this material, as ggplot2 includes more than 250 functions and hundreds of other parameter variables. Instead, we'll go through the most commonly used functions of ggplot2.

4.3. Plotly (incomplete)

5. Decision-making methods for business

5.1. Simple linear regression

Linear regression is one of the most applied methods in statistical modeling. It has been in use for years, but it is still powerful in the modern business environment and its applications. Simple linear regression allows us to estimate the relationship between two variables. Let's assume that we want to estimate how the budget spent on TV ads affects the sales of a company, in other words how much the sales change when spending on TV ads increases. An important prerequisite of estimating the relationship is that the relationship is indeed linear. In the case of two variables this can be checked graphically by plotting both variables in a 2-dimensional plane. Graphically, a linear relationship between the variables would show up as a straight line going up and to the right (in case of a positive relationship) or down and to the right (in case of a negative relationship).
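As a minimal sketch of this graphical check, using simulated (hypothetical) data rather than any of the course data sets, you can plot the two variables and overlay a fitted line; Example 55 below does the same with the real advertising data:

x <- runif(100, 0, 10)          # hypothetical explanatory variable
y <- 2 + 0.5*x + rnorm(100)     # hypothetical response with a linear trend plus noise
plot(x, y)                      # the points should scatter around a straight line
abline(lm(y ~ x), col="red")    # add the fitted OLS line for reference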
Mathematically the relationship can be represented as

$$Y = \beta_0 + \beta_1 X,$$

where $Y$ is the (dependent) vector that contains the data points of the sales variable (the one we want to explain with TV ads), $X$ is the vector that contains the data points of the (explanatory) variable measuring spending on TV ads, and $\beta_0$ and $\beta_1$ are unknown but fixed coefficients, the intercept and the slope, that we want to estimate with our model. Estimating these coefficients leads to the following model:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \qquad \text{or} \qquad Y = \hat{\beta}_0 + \hat{\beta}_1 X + \varepsilon,$$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimators of the unknown $\beta_0$ and $\beta_1$, $\hat{y}$ is the prediction of $Y$ on the basis of $x$ using the pairs of observations $(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)$, and $\varepsilon$ is a vector of error terms, which geometrically simply represent the distances between the observations (points) and the straight line. In other words, $sales = \beta_0 + \beta_1 \times TV$. This model can be represented graphically as a straight line:

Figure 1. Source: Hastie et al. (2015)

There are many methods to estimate the coefficients of the model, but probably the most used one minimizes the distance between the observations and the straight line with ordinary least squares (OLS). Here $\beta_0$ is simply the point where the straight line cuts the vertical axis, so changing $\beta_0$ would mean moving the line up or down.

Apart from the graphical inspection we can set up a statistical hypothesis framework to test whether there is a (linear) relationship between the variables. Our hypothesis framework takes the following form:

$H_0$: There is no relationship between $X$ and $Y$ versus $H_1$: There is a relationship between $X$ and $Y$.

Mathematically this corresponds to $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$, since if $\beta_1 = 0$ our model reduces to $Y = \beta_0 + \varepsilon$. In practice $\hat{\beta}_1$ can be assessed by computing the t-statistic

$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)},$$

which simply measures how far away (how many standard errors) $\hat{\beta}_1$ is from 0 (recall that the normal distribution has a bell shape). From the t-statistic we can deduce a p-value, which is the probability of obtaining such a coefficient estimate simply by chance. Hence, the smaller the p-value, the stronger the evidence of a relationship between the variables in question. Typically we reject the null hypothesis when the p-value is lower than 0.05. Recall that the correlation

$$Corr(X,Y) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

is also a measure of the linear relationship between the variables.

If $\hat{\beta}_1$ turns out to be statistically significant, one can move further and assess the model we created. Again, graphical inspection is one of the methods that works well for a start. The most common statistics for model assessment are the residual standard error (RSE) and R squared ($R^2$). The residual standard error measures the average amount by which the response deviates from the true regression line, and it is computed using the formula

$$RSE = \sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2},$$

where

$$RSS = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2.$$

And R squared is defined as

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS},$$

where

$$TSS = \sum_{i=1}^{n}(y_i-\bar{y})^2$$

is the total sum of squares. In a simple linear regression $R^2 = [Corr(X,Y)]^2$. However, this doesn't necessarily apply in the framework of multiple linear regression.

N.B. A model is "linear" as long as it is linear in its coefficients. For instance, the model $Y = \beta_0 + \beta_1 X^2$ is also a linear model.

Example 55.
.................................................................................................................................. We start by loading the data into R: >adv=read.csv(file.choose(), header=TRUE) This is the data set provided by Gareth James, you can download it here (Advertising.csv): http://wwwbcf.usc.edu/~gareth/ISL/data.html >head(adv, n=5) TV Radio Newspaper Sales 1 230.1 37.8 69.2 22.1 2 44.5 39.3 45.1 10.4 3 17.2 45.9 69.3 9.3 4 151.5 41.3 58.5 18.5 5 180.8 10.8 58.4 12.9 >plot(adv) We want to explore how the variables “TV” and “Sales” are correlated to each other. To do that we’ll run a linear OLS regression: 𝑠𝑎𝑙𝑒𝑠 = 𝜷0 + 𝜷1 × 𝑇𝑉. But first, let’s plot TV vs. Sales in a scatterplot. >plot(adv$TV,adv$Sales, xlab="TV", ylab="Sales", main="Scatterplot of TV vs. Sales") >abline(lm(adv$Sales ~ adv$TV)) As you can see, there is some positive (why?) correlation. And now the regression statistics: >SalesTV=lm(Sales~TV, adv) >summary(SalesTV) Call: lm(formula = Sales ~ TV, data = adv) Residuals: Min 1Q Median -8.3860 -1.9545 -0.1913 3Q 2.0671 Max 7.2124 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.032594 0.457843 15.36 <2e-16 *** TV 0.047537 0.002691 17.67 <2e-16 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.259 on 198 degrees of freedom Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099 F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16 Both 𝛽0 and 𝛽1 are statistically significant at the 99,9% level. Hence, we can conclude that there is a correlation based on OLS regression. Using the maximum likelihood estimators which are 𝛽0 = 7.033 and 𝛽1 = 0.048 we can draw the regression line manually using lines() function: > x <- c(0,300) > y <- 7.033 + 0.048*x > lines(x,y, col="red") Or curve() function: > curve(7.033 + 0.048*x, from=0, to=50, add=T, col="blue") Let’s put 90% confidence intervals for the regression line: newx <- seq(0,300) pred <- predict(SalesTV, data.frame(TV=newx), interval = "confidence", level = 0.90, type="response") lines(newx, pred[,2], col="black") lines(newx, pred[,3], col="black") lines(newx, pred[,1], col="red") What are the confidence intervals for the coefficients? >confint(SalesTV) 2.5 % 97.5 % (Intercept) 6.12971927 7.93546783 TV 0.04223072 0.05284256 Hence, for example the 95% confidence interval for 𝛽0 is [6.13,7.94]. Predicting three first values of “Sales” based on “TV”: >predict(SalesTV,data.frame(TV=(c(1,2,3))),interval="confidence") fit lwr upr 1 7.080130 6.181837 7.978423 2 7.127667 6.233947 8.021387 3 7.175203 6.286048 8.064359 How well is sales explained with this linear model? Plot of the residuals: >hist(SalesTV$residuals) Shapiro-Wilk normality test: >shapiro.test(SalesTV$residuals) Shapiro-Wilk normality test data: SalesTV$residuals W = 0.9905, p-value = 0.2133 The null hypothesis of this test is that the data is normally distributed. Recall that we reject the null hypothesis at 95% level if p-value is less than 0.05. Keep in mind that the residuals represent the variation in “Sales” that can’t be explained with “TV”. Based on this result “Sales” can be fairly well explained by “TV” in a linear setup. How do residuals change depending on the predicted value of “Sales” based on “TV”? >plot(SalesTV$fitted.values, SalesTV$residuals) What can we conclude from the graph above? 5.2. 
Multiple linear regression

In the same way as simple linear regression, multiple regression is used to predict a variable based on other variables, but this time we want to measure the impact of several variables in the same regression. The multiple regression takes the form

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon,$$

where each vector $X_j$ represents a different predictor and $\beta_j$ is the average effect on $Y$ of a one unit increase in $X_j$, holding all other predictors fixed. The coefficients $\beta_0, \beta_1, \dots, \beta_p$ are chosen to minimize the sum of squared residuals

$$RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\big(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_p x_{ip}\big)^2.$$

As in the case of the simple regression, we are interested in whether the estimated coefficients are statistically significantly different from zero, in other words whether our predictors explain the behavior of the response variable well. The null hypothesis takes the following form:

$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$

and the alternative hypothesis is

$H_1$: at least one $\beta_j$ is non-zero.

This hypothesis test is performed by computing the F-statistic

$$F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)},$$

where TSS and RSS are defined as previously. When there is no relationship between the response and the predictors, the F-statistic is close to 1, and if $H_1$ holds, then $E\{(TSS-RSS)/p\} > \sigma^2$. Nevertheless, the F-statistic doesn't necessarily have to be much larger than 1 to provide evidence against the null hypothesis; the rule of thumb is that the larger $n$ is, the smaller an F-statistic is enough to provide such evidence. Luckily, R, like any other statistical software, provides tools to calculate the p-value associated with any given F-statistic.

5.3. Logistic regression

The purpose of logistic regression is similar to the linear one, but it is intended to be used with qualitative variables. Logistic regression belongs to the classification models, as it helps to predict which class the values of the model will fall into. Some examples of its applications:

- Predicting the probability of default in the banking industry based on income, transaction history and other related variables
- Predicting the disease of a patient based on her symptoms
- Predicting the risk of a car accident based on the driver's personal factors
- Finding the most suitable marketing approach based on the customer's age, income, and sex

The outcome variables of prediction models based on logistic regression are discrete and categorical, and hence there is no uniform distance between them. This means that in most cases linear regression can't be used to estimate them, as it produces continuous values. As in the case of linear regression, we want to predict a response variable with explanatory variables, whose number can range from one upwards. For the sake of example, let's say we want to predict the default risk of a customer based on her credit card balance. Hence, our response variable will have binary values, 0 or 1:

$$Y = \begin{cases} 0 & \text{if no default} \\ 1 & \text{if default.} \end{cases}$$

One way to predict the default risk is to estimate the probability of the response variable, $\Pr(Y=1 \mid X)$, where $X$ stands for the credit card balance. Using the linear model the probability could be predicted in the conventional way:

$$p(X) = \Pr(Y=1 \mid X) = \beta_0 + \beta_1 X,$$

producing the following result (graph X.X). As becomes clear from the graph, the linear regression approach can produce somewhat bizarre results: for instance, the predicted probability of default ranges from negative values to values over 1, which is not possible. A small simulated sketch of this problem is shown below.
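A minimal sketch with simulated data of why the linear approach misbehaves (the names balance and default and all numbers are made up for illustration; the actual Default data from the ISLR package is introduced only in Example 56):

set.seed(2)
balance <- runif(400, 0, 2500)                               # hypothetical credit card balances
default <- rbinom(400, size=1,
                  prob = 1/(1 + exp(-(-10 + 0.0055*balance)))) # simulated 0/1 default indicator
linfit  <- lm(default ~ balance)                             # "linear probability model"
range(fitted(linfit))                                        # the fitted values dip below 0 for small balances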
We could avoid the problem by using the logistic regression, whose output values lie between 0 and 1. The logistic probability is estimated using the following setup:

$$p(X) = \frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}},$$

which can easily be rearranged into

$$\frac{p(X)}{1-p(X)} = e^{\beta_0+\beta_1 X}.$$

The quantity $p(X)/[1-p(X)]$ is called the odds, and it can take on any value between 0 and $\infty$. Values of the odds close to 0 and $\infty$ indicate very low and very high probabilities of default, respectively. For example, on average 1 in 5 people with an odds of 1/4 will default, since $p(X)=0.2$ implies an odds of $0.2/(1-0.2)=1/4$. Likewise, on average nine out of every ten people with an odds of 9 will default, since $p(X)=0.9$ implies an odds of $0.9/(1-0.9)=9$. By taking logarithms of the previous equation we get the log-odds or logit:

$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X.$$

The coefficients $\beta_0$ and $\beta_1$ are usually estimated using the maximum likelihood approach, maximizing

$$l(\beta_0,\beta_1) = \prod_{i:\, y_i = 1} p(x_i)\prod_{i':\, y_{i'} = 0}\big(1-p(x_{i'})\big).$$

Hence, $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen so that the predicted probability is close to 1 for the individuals who defaulted and close to 0 for those who didn't.

Example 56. ..................................................................................................................................

We use partial data from the dataset "Default" that is included in the package "ISLR".

> head(Balance2)
   Balance Default Income
1 1861.624     Yes  14891
2 1688.673     Yes 106025
3 1395.560     Yes 104593
4 1560.478     Yes 148924
5 1997.980     Yes  55882
6 1834.381     Yes  80180

Balance3$Default <- ifelse(Balance3$Default=="Yes", 1, 0)

> head(Balance3)
   Balance Default Income
1 1861.624       1  14891
2 1688.673       1 106025
3 1395.560       1 104593
4 1560.478       1 148924
5 1997.980       1  55882
6 1834.381       1  80180

p <- ggplot(Balance3, aes(x=Balance, y=Default))
#Logistic
p + geom_point(shape=1, size=3) +
  geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE) +
  theme_grey(base_size = 18) +
  scale_x_continuous(name="Balance") +
  scale_y_continuous(name="Probability of Default")
#Linear
p + geom_point(shape=1, size=3) +
  geom_smooth(method=lm, se=FALSE) +
  theme_grey(base_size = 18) +
  scale_x_continuous(name="Balance") +
  scale_y_continuous(name="Probability of Default")

The first plot depicts the probability modeling using the logistic regression; as you can notice from the second one, the standard linear regression produces nonsensical negative probabilities.

Example 57. ..................................................................................................................................

> reg=glm(Default~Balance, data=Balance3, family = "binomial")
> summary(reg)

The statistics of the regression:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.003e+01  1.092e+00  -9.180   <2e-16 ***
Balance      6.937e-03  7.202e-04   9.631   <2e-16 ***

Coefficients:
(Intercept)      Balance
 -10.028759     0.006937

Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
Null Deviance:     554.5
Residual Deviance: 263.1        AIC: 267.1

Using the values of the regression we can estimate, for example, the probability of default for an individual with a credit balance of 1500:

> eq <- exp(reg$coefficients[1] + reg$coefficients[2]*1500)
> eq/(1+eq)

$$\hat{p}(X) = \frac{e^{\hat\beta_0+\hat\beta_1 X}}{1+e^{\hat\beta_0+\hat\beta_1 X}} = \frac{e^{-10.028759+0.006937\times 1500}}{1+e^{-10.028759+0.006937\times 1500}} = 0.5930868$$

Thus, an individual with a credit balance of 1500 would default with a probability of around 60%.

Example 58. ..................................................................................................................................

How does the sex of a client affect the default probability?
In other words we would like to model: Pr(𝐷𝑒𝑓𝑎𝑢𝑙𝑡 = 𝑌𝑒𝑠|𝑆𝑒𝑥 = 𝑀𝑎𝑙𝑒) and Pr(𝐷𝑒𝑓𝑎𝑢𝑙𝑡 = 𝑌𝑒𝑠|𝑆𝑒𝑥 = 𝐹𝑒𝑚𝑎𝑙𝑒) >sex <- rbinom(n=400, size=1, prob=0.5) >Balance3<-cbind(Balance3,sex) >reg2=glm(Default~sex, data=Balance3, family = "binomial") >summary(reg2) We get the following estimates from the regression: Deviance Residuals: Min 1Q -1.27076 -1.07438 Median 0.00623 3Q 1.08684 Max 1.28405 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.2472 0.1474 -1.678 0.0934 . sex 0.4640 0.2018 2.300 0.0215 * --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 554.52 Residual deviance: 549.19 AIC: 553.19 on 399 on 398 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 3 Inserting the values in the equation yields the following estimates (sex=1 if male, 0 otherwise): eq2 <- exp(reg2$coefficients[1] + reg2$coefficients[2]*1) eq2/(1+eq2) eq3 <- exp(reg2$coefficients[1] + reg2$coefficients[2]*0) eq3/(1+eq3) 𝑒 −0.2472+0.4640 × 1 = 0.5539887 1 + 𝑒 −0.2472+0.4640 × 1 𝑒 −0.2472+0.4640 × 0 ̂ (𝐷𝑒𝑓𝑎𝑢𝑙𝑡 = 𝑌𝑒𝑠|𝑆𝑒𝑥 = 𝑓𝑒𝑚𝑎𝑙𝑒) = 𝑃𝑟 = 0.4385128 1 + 𝑒 −0.2472+0.4640 × 0 ̂ (𝐷𝑒𝑓𝑎𝑢𝑙𝑡 = 𝑌𝑒𝑠|𝑆𝑒𝑥 = 𝑚𝑎𝑙𝑒) = 𝑃𝑟 The probability of a male individual to default is around 55% while for a female individual it is around 44% holding all other variables fixed. Warning: the default probabilities of male and female individuals are independent and hence, their sum doesn’t yield 1. This can be seen by assigning 0.5 for sex: ̂ (𝐷𝑒𝑓𝑎𝑢𝑙𝑡 = 𝑌𝑒𝑠|𝑆𝑒𝑥 = 0.5) = 𝑃𝑟 𝑒 −0.2472+0.4640 × 0.5 = 0.4962001 1 + 𝑒 −0.2472+0.4640 × 0.5 Let’s consider a case with several explanatory variables. This approach is referred to as multiple logistic regression. Say, we want to explain the probability of default with credit balance, income, and sex. In this case our predictor set will be 𝑋 = (𝑋1 , 𝑋2 , 𝑋3 ) = (𝑏𝑎𝑙𝑎𝑛𝑐𝑒, 𝑖𝑛𝑐𝑜𝑚𝑒, 𝑠𝑒𝑥). The multinomial model takes the form as follows: log ( 𝑝(𝑋) ) = 𝑿′𝜷 1 − 𝑝(𝑋) where 𝑿′𝜷 = 𝛽0 + 𝛽1 𝑋1 + ⋯ + 𝛽𝑝 𝑋𝑝 and in our case it becomes Pr(𝑌 = 1|𝑋) = 𝑝(𝑋) = 𝑒 𝛽0 +𝛽1 𝑋1 +𝛽2 𝑋2 +𝛽3 𝑋3 𝑒𝑥𝑝{𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 } = 𝛽 +𝛽 𝑋 +𝛽 𝑋 +𝛽 𝑋 0 1 1 2 2 3 3 1 + 𝑒𝑥𝑝{𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 } 1+𝑒 Example 59. .................................................................................................................................. The estimation of the same data yields the following results: >reg3=glm(Default~Balance+Income+sex, data=Balance3, family = "binomial") >summary(reg3) Deviance Residuals: Min 1Q Median 3Q Max -2.64500 -0.28106 0.05239 0.53101 1.65437 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.025e+01 1.139e+00 -8.999 <2e-16 *** Balance 6.911e-03 7.220e-04 9.572 <2e-16 *** Income 1.390e-06 4.612e-06 0.301 0.763 sex 3.749e-01 3.117e-01 1.203 0.229 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 554.52 Residual deviance: 261.51 AIC: 269.51 on 399 on 396 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 6 What can we conclude from these estimates above? When we include balance, income, and sex as the explanatory variables only balance remains statistically significant, while income and sex are both too weak to explain the default probability. Hence, even though sex is a strong explanatory variable by itself, it is not as strong together with balance and income variables. But it doesn’t mean that it’s needless! 
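One way to back up that last remark is to test formally whether adding sex improves on a model with balance and income alone, using a likelihood-ratio test between the nested models. A minimal sketch, assuming reg3 fitted above is still in the workspace (the name reg3b is made up here):

reg3b <- glm(Default ~ Balance + Income, data=Balance3, family="binomial")  # reduced model without sex
anova(reg3b, reg3, test="Chisq")  # a small p-value would indicate that sex adds explanatory power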
Applying the coefficient values we could estimate the default probability for a male individual with income of 50000 and a credit balance of 1000: eq <- exp(reg3$coefficients[1] + reg3$coefficients[2]*1000 + reg3$coefficients[3]*50000 + reg3$coefficients[4]*1 ) eq/(1+eq) ̂ (𝑌 = 1|𝑋) = 𝑝̂ (𝑋) = 𝑃𝑟 𝑒 −10.25 + 0.0069 × 1000 + 0.000001390 × 50000 + 0.3749 × 1 = 0.0519 1 + 𝑒 −10.25 + 0.0069 × 1000 + 0.000001390 × 50000 + 0.3749 × 1 and a female individual with the same balance and income: eq <- exp(reg3$coefficients[1] + reg3$coefficients[2]*1000 + reg3$coefficients[3]*50000 + reg3$coefficients[4]*0 ) eq/(1+eq) ̂ (𝑌 = 1|𝑋) = 𝑝̂ (𝑋) = 𝑃𝑟 𝑒 −10.25 + 0.0069 × 1000 + 0.000001390 × 50000 + 0.3749 × 0 = 0.0362 1 + 𝑒 −10.25 + 0.0069 × 1000 + 0.000001390 × 50000 + 0.3749 × 0 Hence, a male individual with income of 50000 and a credit balance of 1000 will default with a probability of 5.19% and a female individual with the same balance and income with a probability of 3.62%. 5.4. Confounding variables (incomplete) 5.5. Multinomial logistic regression Logistic regression is a nice tool to predict discrete outcomes that are binary. But what if we had a dependent or explanatory variable that would have more than just two possible discrete values? Say, we want to predict whether an individual would prefer to buy a BMW, Mercedes, Ford or Toyota based on her income. And how do choices change when income changes? How should we deal with this problem? Let’s look at an example of the heating system choice related to the number of rooms in the houses. The households have five options of heating systems: gas central, gas room, electric central, electric room, and heat pump. The number of rooms has three classes: 2-3 rooms, 4-5 rooms and 6-7 rooms. This data is taken from the package “mlogit”. The cross-table of options looks as follows: Rooms Heating choice Total gc gr ec er hp 2-3 208 48 19 32 19 326 4-5 182 39 25 20 15 281 6-7 Total 183 42 20 32 16 293 573 129 64 84 50 900 Here we can see how the number of rooms is related to the heating choice of the house. Obviously our response variable is the heating system choice and we want to know whether it can be explained with the number of rooms in the house. The original data includes observations for 900 households that can be indexed from 1 to 900, namely 𝑖 = 1, … ,900. Next, we have 5 different heating system options, that can be indexed from 1 to 5 (gc=1, gr=2, ec=3, er=4, hp=5), namely 𝑗 = 1, … ,5. The probability that a household 𝑖 would choose the option 𝑗 as a heating system can be denoted as 𝜋𝑖𝑗 = 𝑃𝑟{𝑌𝑖 = 𝑗}. But since we have grouped the observations we only need to know which group a particular observation falls in, and instead we could index the groups from 1 to 3: 𝑖 = 1,2,3 (1=2-3 rooms, 2=4-5 rooms, 3=6-7 rooms). Say, a household has a house with 4 rooms, what is the probability that it chooses electric central as a heating system (assuming that the number of rooms is the only factor affecting the choice of the heating system)? The answer is 25/281 ≈ 0.089, or about 8.9%. The probabilities of heating system choices for a certain household always add up to 1, hence ∑𝐽𝑗=1 𝜋𝑖𝑗 = 1 and consequently we need to have 𝐽 − 1 parameters to have all the information needed. Next, we denote #𝑖 the number of observations in a particular group, here we have for example #3 = 293 and consequently 𝑦𝑖1 , … , 𝑦𝑖5 represent the numbers of households per each heating system in the group 𝑖, which, of course, sum up as ∑𝑗 𝑦𝑖𝑗 = #𝑖 . 
And in general, $y_{ij}$ represents the number of observations in a particular group for a particular heating system. Since a single household can obviously choose only one heating system, in the individual-level case $n_i = 1$ and $Y_{ij}$ can be further interpreted as a dummy variable. Combining the information that we assigned to the variables, we can write it in the multinomial distribution form (probability mass function), namely

$$\Pr\{Y_{i1}=y_{i1},\dots,Y_{iJ}=y_{iJ}\} = \binom{n_i}{y_{i1},\dots,y_{iJ}}\,\pi_{i1}^{y_{i1}}\cdots\pi_{iJ}^{y_{iJ}}.$$

In the multinomial logistic model we assume that the log-odds can be represented by a linear model, as in the logistic regression model:

$$\log\frac{\pi_{ij}}{\pi_{iJ}} = \beta_0 + \boldsymbol{x}_i'\boldsymbol{\beta}_j,$$

where $j = 1,\dots,J-1$. By exponentiating the equation above, noting that $\log(\pi_{iJ}/\pi_{iJ}) = 0$ and that $\sum_j \pi_{ij} = 1$, we obtain $\pi_{iJ} = 1/\sum_j \exp\{\log(\pi_{ij}/\pi_{iJ})\}$, which leads to the equation below:

$$\pi_{ij} = \frac{\exp\{\beta_0 + \boldsymbol{x}_i'\boldsymbol{\beta}_j\}}{\sum_{k=1}^{J}\exp\{\beta_0 + \boldsymbol{x}_i'\boldsymbol{\beta}_k\}}.$$

The log-odds of the multinomial logistic model can also be derived through a utility representation of the individuals' choices. Let $U_{ij}$ denote the utility of the $j$-th choice for the $i$-th individual. Now let's split $U_{ij}$ into a systematic component $\eta_{ij}$ and a random component $\epsilon_{ij}$ (error term), so that $U_{ij} = \eta_{ij} + \epsilon_{ij}$. If we further assume that individuals maximize their utility by choosing the largest of $U_{i1},\dots,U_{iJ}$, then the probability that individual $i$ chooses alternative $j$ to maximize her utility is

$$\pi_{ij} = \Pr\{Y_i = j\} = \Pr\{\max(U_{i1},\dots,U_{iJ}) = U_{ij}\}.$$

Now, because the error terms $\epsilon_{ij}$ have extreme value distributions with density $f(\epsilon) = \exp\{-\epsilon - \exp\{-\epsilon\}\}$, then (Maddala, 1983, pp. 60-61)

$$\pi_{ij} = \frac{\exp\{\eta_{ij}\}}{\sum_{k=1}^{J}\exp\{\eta_{ik}\}}.$$

Another useful representation of the model above is the conditional form (conditional logistic model), stated in terms of the attributes of the alternatives rather than the attributes of the individuals. Let $\boldsymbol{w}_j$ represent a vector of attributes of the $j$-th alternative; then the utility of the individual can be represented as $\eta_{ij} = \boldsymbol{w}_j'\boldsymbol{\gamma}$, and by combining it with the multinomial logistic model we obtain a general or mixed multinomial logistic model

$$\eta_{ij} = \boldsymbol{x}_i'\boldsymbol{\beta}_j + \boldsymbol{w}_{ij}'\boldsymbol{\gamma},$$

where the vector of independent variables $\boldsymbol{x}_i$ is called alternative-invariant and $\boldsymbol{w}_{ij}$ is called alternative-variant. The theory of the multinomial logistic model can be rather tricky and it is not always clear what the conditional multinomial logistic model is meant for. Nevertheless, with the help of the following examples you should be able to understand the basic principles of the multinomial logistic model.

Example 60. ..................................................................................................................................

The data being used here is taken from the package "mlogit". It is called "Mode Choice for the Montreal-Toronto Corridor", where individuals choose the travelling mode between Montreal and Toronto in Canada. Most of them have four options: airplane, bus, car, and train, while some of them have fewer options. We will subset the data to include only the ones who have all four options available.
> library(mlogit) #this is the recommended package (nnet is another one, but with different functions) > data("ModeCanada") #the data set > View(ModeCanada) > table(ModeCanada$noalt) 2 462 3 4 3942 11116 > ModeCanada <- subset(ModeCanada, ModeCanada$noalt>3) > table(ModeCanada$noalt) 4 11116 > > head(ModeCanada) case alt choice dist cost ivt ovt freq income urban noalt nchoice 304 109 train 0 377 58.25 215 74 4 45 0 4 4 305 109 air 1 377 142.80 56 85 9 45 0 4 4 306 109 bus 0 377 27.52 301 63 8 45 0 4 4 307 109 car 0 377 71.63 262 0 0 45 0 4 4 308 110 train 0 377 58.25 215 74 4 70 0 4 4 309 110 air 1 377 142.80 56 85 9 70 0 4 4 This form of data is called ‘long’. Here the rows are organized according to the choices of the individuals. The actual chosen mode of transport is assigned 1 for each individual and the rest are 0. Hence, there are four rows of data for each individual. As you can see some of the rows repeat, such as income, because it doesn’t change for an individual no matter which mode s/he chooses. The data could include other variables of choice as well and they would require the same type of dummy column of choice as above. Another form of data that could be used is called ‘wide’. We will see later how one can transform from one to another and back, but it’s really important to keep in mind that the form of data really affects the outcome. First, let’s see how we can operate with the data in the form ‘long’. Warning: the coefficients of the pure, conditional or mixed multinomial logistic model are difficult to interpret. Neither the sign, nor the magnitude of the coefficients has an intuitive meaning. Nevertheless, careful analysis should always help. Example 61. .................................................................................................................................. Say, we are interested to know how distance (dist), income, and cost affect the choice of the transport mode. > table(ModeCanada$choice,ModeCanada$alt) 0 1 train air bus car 2316 1740 2769 1512 463 1039 10 1267 As we see ‘bus’ was chosen really seldom while ‘car’ and ‘air’ are the most frequent choices. > ModeCanada <- mlogit.data(ModeCanada, alt.var="alt", choice = "alt", shape = "long")[,c(2,3,4,5,9)] > > ModeCanada[1:12,] alt choice dist cost income 1.train train 0 377 58.25 45 1.air air 1 377 142.80 45 1.bus bus 0 377 27.52 45 1.car car 0 377 71.63 45 2.train train 0 377 58.25 70 2.air air 1 377 142.80 70 2.bus bus 0 377 27.52 70 2.car car 0 377 71.63 70 3.train train 0 377 58.25 35 3.air air 1 377 142.80 35 3.bus bus 0 377 27.52 35 3.car car 0 377 71.63 35 > > mlogit.model1 <- mlogit(choice ~ 1 | cost, data=ModeCanada, shape="long", alt.var="alt", reflevel = "car") > > summary(mlogit.model1) Call: mlogit(formula = choice ~ 1 | cost, data = ModeCanada, reflevel = "car", shape = "long", alt.var = "alt", method = "nr", print.level = 0) Frequencies of alternatives: car train air bus 0.4559194 0.1666067 0.3738755 0.0035984 nr method 8 iterations, 0h:0m:0s g'(-H)^-1g = 7.19E-07 gradient close to zero Coefficients : Estimate Std. Error t-value Pr(>|t|) train:(intercept) -1.6019197 0.2067141 -7.7494 9.326e-15 air:(intercept) -10.5412593 0.4736322 -22.2562 < 2.2e-16 bus:(intercept) -4.1906113 1.1378782 -3.6828 0.0002307 train:cost 0.0115781 0.0038360 3.0183 0.0025421 air:cost 0.0668507 0.0030441 21.9606 < 2.2e-16 bus:cost -0.0273963 0.0472261 -0.5801 0.5618412 --Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ *** *** *** ** *** 1 Log-Likelihood: -2533.4 McFadden R^2: 0.12743 Likelihood ratio test : chisq = 739.96 (p.value = < 2.22e-16) > This is the pure multinomial model. Our dependent variable is ‘choice’ and we fix at 1, since it is not varying with alternatives. We want to see how cost affects the choice of the transport mode. Next, we choose “car” as our mode of transport of reference. This means that we observe how other modes of transport change compared to “car” when the cost is changing. For example growing the cost of driving a car by one unit will grow the log-odds of taking a train by approximately 1.16% compared to driving a car. The coefficient of train is significant at 95% level. Growing the cost of driving a car by one unit results in increasing the log-odds of taking the airplane by approximately 6.69% and its coefficient is significant at 99% level. Nevertheless the coefficient of bus is not significant (the modulus of the standard error exceeds the size of the coefficient). > exp(coef(mlogit.model1)) train:(intercept) air:(intercept) 2.015093e-01 2.642343e-05 train:cost air:cost 1.011645 1.069136 attr(,"fixed") train:(intercept) air:(intercept) FALSE FALSE train:cost air:cost FALSE FALSE bus:(intercept) 1.513703e-02 bus:cost 0.9729756 bus:(intercept) FALSE bus:cost FALSE In the table above the same observation is obvious the value of x:cost tells the multiplier of the probability of taking the transport after increasing the cost of “car” by one unit. Example 62. .................................................................................................................................. Now, let’s use another data set to demonstrate the conditional multinomial logistic model. data("Car", package=”mlogit”) #the data set View(Car) Car <- mlogit.data(Car, alt.levels = 1:6, varying = 5:70, choice = "choice", shape = "wide", sep = "") #alt.levels: possible options, 6 for each person #varying: nominal parameters Remember, in the conditional model we estimate the coefficients that are alternative-invariant, so we don’t have to choose a reference choice for the regression. For the sake of example let’s regress the variable of choice on all available nominal variables in the data set. # Conditional model > mlogit.model2 <- mlogit(choice ~ price+range+acc+speed+pollution+size+space+cost+station, data=Car, shape="long", alt.var="alt") > summary(mlogit.model2) Call: mlogit(formula = choice ~ price + range + acc + speed + pollution + size + space + cost + station, data = Car, shape = "long", alt.var = "alt", method = "nr", print.level = 0) Frequencies of alternatives: 1 2 3 4 5 6 0.190589 0.057800 0.288999 0.074989 0.322089 0.065535 nr method 5 iterations, 0h:0m:1s g'(-H)^-1g = 4.91E-06 successive function values within tolerance limits Coefficients : Estimate Std. 
Error t-value Pr(>|t|) 2:(intercept) -1.19313360 0.06960508 -17.1415 < 2.2e-16 3:(intercept) -0.01605586 0.06236401 -0.2575 0.7968283 4:(intercept) -1.36513323 0.07753132 -17.6075 < 2.2e-16 5:(intercept) -0.32556327 0.09205583 -3.5366 0.0004053 6:(intercept) -1.91780499 0.10528938 -18.2146 < 2.2e-16 price -0.18712971 0.02699695 -6.9315 4.164e-12 range 0.00401531 0.00025783 15.5737 < 2.2e-16 acc -0.07281306 0.01095333 -6.6476 2.980e-11 speed 0.00340031 0.00076759 4.4298 9.431e-06 pollution -0.18147772 0.09521123 -1.9061 0.0566432 size 0.07203239 0.02916950 2.4694 0.0135324 space 0.91481193 0.17538722 5.2160 1.829e-07 cost -0.07233509 0.00744226 -9.7195 < 2.2e-16 station 0.25023356 0.07047056 3.5509 0.0003839 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 *** *** *** *** *** *** *** *** . * *** *** *** ‘ ’ 1 Log-Likelihood: -7080.8 McFadden R^2: 0.035344 Likelihood ratio test : chisq = 518.87 (p.value = < 2.22e-16) How to interpret the coefficients of the conditional model? If we added for example the “choice 3” as a reference level in the previous setup we’d notice that the coefficients of the independent variables are not changing. Why so? In fact the coefficients of the conditional model can be interpreted to explain whether these factors are affecting the choice of the car in the first hand. For example the coefficient of the price (price of a vehicle divided by the logarithm of income) is negative and significant at 99% level. This tells us that increasing the price should reduce the willingness to buy a car, which is quite a realistic result. The same logic applies to the coefficient of cost (cost per mile of travel (tens of cents)). But on the other hand since the coefficient of the space variable is positive and statistically significant, increasing the space of the car (fraction of luggage space in comparable new gas vehicle) is likely to increase its attractiveness. Example 63. .................................................................................................................................. Next, let’s proceed with the mixed multinomial model. # Mixed model > mlogit.model3 <- mlogit(choice ~ cost+speed | price, data=Car, shape="long", alt.var="alt", reflevel = "3") > summary(mlogit.model3) Call: mlogit(formula = choice ~ cost + speed | price, data = Car, reflevel = "3", shape = "long", alt.var = "alt", method = "nr", print.level = 0) Frequencies of alternatives: 3 1 2 4 5 6 0.288999 0.190589 0.057800 0.074989 0.322089 0.065535 nr method 5 iterations, 0h:0m:1s g'(-H)^-1g = 0.000266 successive function values within tolerance limits Coefficients : Estimate Std. Error t-value 1:(intercept) -0.1865314 0.1051452 -1.7740 2:(intercept) -0.8616399 0.1633765 -5.2740 4:(intercept) -0.9934246 0.1418352 -7.0041 5:(intercept) 0.0370236 0.0891179 0.4154 6:(intercept) -1.1843830 0.1536278 -7.7094 cost -0.0713879 0.0072506 -9.8458 speed 0.0035951 0.0007413 4.8497 1:price -0.0488609 0.0221505 -2.2059 2:price -0.1783347 0.0382607 -4.6610 4:price -0.0870310 0.0322393 -2.6995 5:price 0.0171088 0.0189882 0.9010 6:price -0.0726519 0.0348173 -2.0867 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ Pr(>|t|) 0.076057 1.335e-07 2.486e-12 0.677816 1.266e-14 < 2.2e-16 1.237e-06 0.027394 3.146e-06 0.006944 0.367577 0.036919 . *** *** *** *** *** * *** ** * 0.05 ‘.’ 0.1 ‘ ’ 1 Log-Likelihood: -7260.9 McFadden R^2: 0.010807 Likelihood ratio test : chisq = 158.65 (p.value = < 2.22e-16) The coefficients of the mixed multinomial model are alternative-variant and alternative-invariant. 
In this setup the coefficients of cost and speed are alternative-invariant, and the coefficients of the price vector are interpreted with reference to choice 3. If you compare this regression to the conditional one, you should notice that the coefficients of cost and speed are pretty similar. The coefficients of price should be interpreted in the same manner as with the 'ModeCanada' data set: for example, increasing the price of an alternative changes the log-odds of choosing it relative to choice 3 by the corresponding price coefficient.

Example 64. ..................................................................................................................................

In order to measure how much a change in an explanatory variable affects the probability of making a specific choice, we need to calculate the marginal effects. The marginal effects for the conditional model can be calculated as follows. As you recall, we created the conditional model with the following function:

mlogit.model2 <- mlogit(choice ~ price+range+acc+speed+pollution+size+space+cost+station, data=Car, shape="long", alt.var="alt", reflevel = "4")

First, we need to calculate the means of the independent variables for each choice. This can be done:

z <- with(Car, data.frame(price=tapply(price, index(mlogit.model2)$alt, mean),
                          range=tapply(range, index(mlogit.model2)$alt, mean),
                          acc=tapply(acc, index(mlogit.model2)$alt, mean),
                          speed=tapply(speed, index(mlogit.model2)$alt, mean),
                          pollution=tapply(pollution, index(mlogit.model2)$alt, mean),
                          size=tapply(size, index(mlogit.model2)$alt, mean),
                          space=tapply(space, index(mlogit.model2)$alt, mean),
                          cost=tapply(cost, index(mlogit.model2)$alt, mean),
                          station=tapply(station, index(mlogit.model2)$alt, mean)))

> effects(mlogit.model2, covariate = "cost", data = z)
             4            1             2            3            5             6
4 -0.004972598  0.001006346  0.0003051940  0.001537216  0.001764767  0.0003590753
1  0.001006346 -0.011013015  0.0007699902  0.003878325  0.004452424  0.0009059301
2  0.000305194  0.000769990 -0.0038763863  0.001176177  0.001350284  0.0002747409
3  0.001537217  0.003878325  0.0011761776 -0.014776727  0.006801179  0.0013838289
5  0.001764767  0.004452424  0.0013502842  0.006801179 -0.015957328  0.0015886736
6  0.000359075  0.000905930  0.0002747409  0.001383829  0.001588673 -0.0045122481

The values in the table above are the marginal effects of the cost of one choice on the probabilities of all the choices. The first important observation is that the diagonal values are negative and the rest of the values are positive. Let's take the first diagonal cell (row 4, column 4) as an example. Remember, our independent variable is cost. The negative value in this cell means that if the cost of choice 4 increases, then choice 4 becomes less desirable. The values can be interpreted roughly as percentage changes, since the marginal effects are calculated as

$$\frac{\partial p_{ij}}{\partial \boldsymbol{w}_i} = p_{ij}(\gamma_j - \bar{\gamma}_i).$$

We can also calculate the marginal effects of the mixed model with respect to the variable that has a reference choice. Recall the mixed multinomial model:

mlogit.model3 <- mlogit(choice ~ cost+speed | price, data=Car, shape="long", alt.var="alt", reflevel = "1")

Now the marginal effects w.r.t. price:

> effects(mlogit.model3, covariate = "price", data = z)
           1            2            3            4            5            6
 0.008701082 -0.008388572 -0.006544084 -0.004722780  0.014035903 -0.003081549

Surprisingly, increasing the price of choice 1 also increases its attractiveness (irrationality or a Giffen good?), and the same holds for choice 5. For choices 2, 3, 4 and 6, increasing the price decreases their attractiveness. Finally, a few words about an assumption called "Independence of irrelevant alternatives" (IIA).
By the definition of the multinomial logistic regression it is assumed that introducing new choices should not have any effect on the willingness to select older choices. This is rather a restrictive assumption and therefore it is not believed to hold in practice. Nevertheless, it is possible to control for it to some degree by running the Hausman-McFadden test. The basic idea is to subset a collection of choices by dismissing a choice on purpose. Example 65. .................................................................................................................................. Say, we want test whether IIA is holding for the mixed logistic model. We first subset a new set of choices by dropping the choice 1 and choice 2 and then perform a test. m.1 <- mlogit(choice ~ cost+speed | price, data=Car, shape="long", alt.var="alt", reflevel = "4") m.2 <- mlogit(choice ~ cost+speed | price, data=Car, alt.subset = c("3","4","5","6"), reflevel = "4") # Hausman-McFadden test hmftest(m.1,m.2) Hausman-McFadden test data: Car chisq = 13.085, df = 8, p-value = 0.109 alternative hypothesis: IIA is rejected Recall, small p-value is an argument against the null hypothesis; hence we can’t reject the null hypothesis, which is “IIA holds”. Changing the composition of choices may change the result of the test significantly. Try it! 5.6. Time series Time series technics are widely applied in macro and microeconomics. It represents a collection of data in which the observations correspond to consecutive time periods that can vary in length and frequency. Time series can be forecasted and related to other data using their own or other time series historical data. Despite several modern branches and technics, the most applied technique in time series analysis nowadays is still the linear regression. Time series analysis has been developing rapidly since the late 70’s due to applications in macroeconomics and finance; later on it was adapted in other business industries as well. Time series analysis is famous for models trying to catch the dynamics of the error terms and volatility (linearly and non-linearly) of the variables. Nevertheless, it still relies on the basic concepts of statistics, such as hypothesis testing, regression analysis, maximum likelihood estimators, and correlation and variance technics. Other than business or economics fields such as meteorology and signal processing have also largely benefited from the development of the time series analysis. In this paragraph you will be introduced to the most important and profound time series technics that should provide you the basic grip in this field of statistics. As in the case of data analytics in general, inspecting the graphical representation of a time series is the best way to gain an intuition of what could be the starting point for the modeling. Some popular time series graphed: One of the most important observations in the time series analysis could be to check whether time series values are fluctuating around a certain mean, like zero. For example the daily returns of the S&P500 index have a clear mean around zero, while some of the time series presented have a trend or a seasonal variation. Most of the times it is necessary to untrend the time series or remove a cyclical part from it in order to model it or to analyze it in comparison to other time series. 
Perhaps the most common techniques for untrending is taking a difference or a log-difference as follows: ∆𝑦𝑡 = 𝑦𝑡 − 𝑦𝑡−1 OR ∆ log 𝑦𝑡 = log(𝑦𝑡 ⁄𝑦𝑡−1 ) = log 𝑦𝑡 − log 𝑦𝑡−1 ≈ (𝑦𝑡 − 𝑦𝑡−1 )⁄𝑦𝑡−1 where 𝑦𝑡 is a common notation for time series data at the time period 𝑡 = 1, … , 𝑇, where 𝑇 is the total number of observations. 𝑦𝑡−1 is usually called the first lag of the time series, 𝑦𝑡−2 is the second lag and so on. 𝑦𝑡+1 is the first lead, 𝑦𝑡+2 is the second lead and so on. In case of a linear trend it’s possible to fit a trend line using least squares regression and to consider modeling the resulting residual series. Then, the observed time series 𝑦𝑡 would be replaced by the time series (𝜀̂𝑡 = 𝑦𝑡 − 𝛽̂0 − 𝛽̂1 𝑡, 𝑡 = 1, … , 𝑇), where the estimates 𝛽̂0 and 𝛽̂1 are obtained by regressing 𝑦𝑡 on a constant and 𝑡. If the time series seem to have a cyclical behavior, then most of the times it’s not a bad idea to identify a seasonal (repetitive) pattern to produce a relevant analysis. A possible solution could be to use trigonometric functions of time, but unfortunately especially in the case of economic and business time series the seasonality is not uniform across the data and therefore trigonometric functions demand a lot of parametric adjustment. Hence, an often used alternative is to use seasonal indicators. For example in case of quarter-annual data one could use the residual series: 𝜀̂𝑡 = 𝑦𝑡 − 𝛽̂1 𝑑1𝑡 − ⋯ − 𝛽̂4 𝑑4𝑡 , 𝑡 = 1, … , 𝑇, where 𝑖th seasonal indicator 𝑑𝑖𝑡 takes value one when the observation in question corresponds to quarter 𝑖 and zero otherwise (𝑖 = 1, … ,4). Very often, the aim of time series analysis is to build a statistical model that represents the observed time series and its fluctuations and the dependence structure of consecutive observations. This dependence is often quite apparent, either by inspecting a graph of the series, or because of an underlying theory explaining how the phenomenon generating the time series behaves. After a statistical model has been chosen, one can estimate the unknown parameters of the model, check whether the estimated model and the observations are compatible, test hypotheses concerning the model parameters, and finally use the model for the intended purpose. One purpose could be to simply provide a summarized description of the observed data set, which may be useful in understanding the underlying data generation process producing the data. Another central use of a time series model is to forecast the future values of the time series (or several time series). On the other hand, a series of interest may be forecasted making use of information contained in other explanatory time series. In case of multiple time series, often exploring potential temporal and contemporaneous dependencies between the time series is another issue of interest. 5.6.1. General techniques Example 66. .................................................................................................................................. 
We've got monthly data on construction permits in Finland spanning 1990-2013:

# Data: construction permits in Finland 1990M01-2013M12
no.permits <- ts(constr$number, start = c(1990,1), frequency=12)
# Plotting the original series
plot(no.permits, col="green4", main="Number of construction permits in Finland 1990I-2013XII")
# Plotting the differences of the original series
plot(diff(no.permits), col="green4", main="Difference of number of construction permits in Finland 1990I-2013XII")

> mean(no.permits)
[1] 2784.997
> mean(diff(no.permits))
[1] -0.630662

Example 67. ..................................................................................................................................

The original series, as well as its differences, seems to have a seasonal behavior. Let's check the means of each month first to see if there are differences between them:

> colMeans(permits)
     Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
1629.750 2397.042 3042.750 3656.375 3796.458 3988.917 2048.042 3088.000
     Sep      Oct      Nov      Dec
2980.292 2621.208 2215.625 1955.500

June has the most permits on average, and there is another spike in August. Otherwise the series gradually declines towards January, which has the lowest number of permits. Another bottom is in July. In fact we can identify two cycles here, since the series has its lowest values roughly every half a year. We can fit the cyclical part at different window lengths (quarterly, half-yearly, yearly) using a hands-on algorithm:

plot(no.permits, col="red", main="Number of construction permits in Finland 1990I-2013XII (incl. seasonal variation)")
# Quarterly, a=3
permits.cycle.3 <- filter(no.permits, filter=rep(1/7,7))
lines(permits.cycle.3, col="forestgreen", lwd=2)
# Every half a year, a=6
permits.cycle.6 <- filter(no.permits, filter=rep(1/13,13))
lines(permits.cycle.6, col="blue", lwd=2)
# Every year, a=12
permits.cycle.12 <- filter(no.permits, filter=rep(1/25,25))
lines(permits.cycle.12, col="black", lwd=2)

How should the coefficients of filter() be interpreted? This algorithm represents a decomposition with a simple class of linear filters, namely moving averages with equal weights:

$$T_t = \sum_{i=-\infty}^{\infty}\lambda_i y_{t+i} = \frac{1}{2a+1}\sum_{i=-a}^{a} y_{t+i}.$$

In this case, the filtered value of the time series at a given period $t$ is the average of the values $\{y_{t-a},\dots,y_t,\dots,y_{t+a}\}$, and the filter coefficients are $\lambda_i = \tfrac{1}{2a+1}$ for $i = -a,\dots,a$. For instance,

if $a = 3$, then $\lambda = (\tfrac{1}{7},\dots,\tfrac{1}{7})$ (7 values),
if $a = 12$, then $\lambda = (\tfrac{1}{25},\dots,\tfrac{1}{25})$ (25 values).

Example 68. ..................................................................................................................................
Another beautiful way to fit a trend line is to use a quadratic function of time: # Quadratic function of time fitted to the series lpermits<-log(constr$number) t<-seq(from=1,to=length(constr$number),by=1) t2<-t^2 plot(lpermits,type="o", pch=1, lty=1, col=2, main="Log of n of construction permits in Finland 1990I-2013XII") lines(lm(lpermits~t+t2)$fit,col="black",lwd=2) If it seems that a quadratic equation log 𝑦𝑡 = 𝛽0 + 𝛽1 𝑡 + 𝛽2 𝑡 2 + 𝜀𝑡 doesn’t describe the data well enough, then you can always try higher degree polynomials, such as a cubic one log 𝑦𝑡 = 𝛽0 + 𝛽1 𝑡 + 𝛽2 𝑡 2 + 𝛽3 𝑡 3 + 𝜀𝑡 : # Cubic function of time fitted to the series lpermits<-log(constr$number) t<-seq(from=1,to=length(constr$number),by=1) t2<-t^2 t3<-t^3 plot(lpermits,type="o", pch=1, lty=1, col=2, main="Log of n of construction permits in Finland 1990I-2013XII") lines(lm(lpermits~t+t2+t3)$fit,col="black",lwd=2) Example 69. .................................................................................................................................. The decomposition of the time series into cyclical, trending and main data components can be easily done by using for example a function stl(): permits.stl <- stl(log(no.permits),s.window="periodic") plot(permits.stl, main="Decomposition of time series of construction permits") plot(no.permits, col="cyan4", main="Number of construction permits in Finland 1990I-2013XII + fitted Holt-Winters") lines(HoltWinters(no.permits)$fit[,1],col="red", lwd=2) Predicting the future values using Holt-Winters values can be done using a generic function predict().It only requires that we rescale the x-axis so that the new values can be fitted in the graph. # Predicting using Holt-Winters plot(no.permits, col="cyan4", main="Number of construction permits in Finland 1990I-2013XII + forecast of Holt-Winters", xlim=c(1990,2018)) lines(predict(HoltWinters(no.permits),n.ahead=48),col="black", lwd=2) 5.6.2. ARIMA models ARIMA models are used for describing the evolution of a single (univariate) time series. ARIMA are constructed using two kinds of components: autoregressive (AR) and moving average (MA). They are applied for forecasting. In ARIMA modeling often one prefer to use Box-Jenkins approach consisting of three stages (in the application order): 1) Model identification 2) Parameter estimation 3) Diagnostic checking These stages can be repeated until the model suits the data considerably well. The goal is to find the most parsimonious model, i.e. the model with the smallest amount of parameters 𝑝 and 𝑞. Autocorrelation (AC) and partial autocorrelation (PAC) functions play a key role, because they help to identify the number of parameters (𝑝 and 𝑞 in 𝐴𝑅𝑀𝐴(𝑝, 𝑞)) or the order that may be needed. In partial autocorrelation function the impact of values 𝑦𝑡−1 , … , 𝑦𝑡−𝑠+1 is not included. A rule of thumb is that for 𝐴𝑅𝑀𝐴(𝑝, 𝑞) AC function’s values decay after lag 𝑞 (but may alternate in sign) PAC function’s values decay after lag 𝑝 (but may alternate in sign) For 𝐴𝑅(𝑝) PAC function’s values are zeroes for 𝑠 > 𝑝 Example 70. .................................................................................................................................. 
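Example 70 below refers to a fitted object fit.auto whose creation is not shown in the text; presumably it is the result of automatic model selection with auto.arima() from the 'forecast' package. A minimal sketch of the identification step and of that fit, assuming the 'forecast' package is installed and no.permits is the series created earlier:

# Identification: inspect the autocorrelation and partial autocorrelation functions
acf(no.permits)    # how quickly the autocorrelations decay hints at the MA order q
pacf(no.permits)   # the partial autocorrelations hint at the AR order p
# Automatic model selection; fit.auto is the object used below
library(forecast)
fit.auto <- auto.arima(no.permits)
summary(fit.auto)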
This example demonstrates how the model estimated by auto.arima() compares with the original series:

plot(no.permits, col="cyan4", main="Number of construction permits in Finland 1990I-2013XII + fitted auto.ARIMA (in red)")
lines(fitted(fit.auto), col="red", lwd=2)

And the fitted auto.ARIMA against ARIMA(1,0,1):

nobs=length(no.permits)
manual.fit <- arima(no.permits, order=c(1,0,1), xreg=1:nobs)
summary(manual.fit)
plot(fitted(fit.auto), col="red", main="Fitted auto.ARIMA (in red) + ARIMA(1,0,1) (in green)")
lines(fitted(manual.fit), col="green", lwd=2)

How to forecast using ARIMA? The simplest way is to use the generic function predict():

no.permits.pred <- predict(fit.auto, n.ahead=48)
plot(no.permits, col="cyan4", main="Number of construction permits in Finland 1990I-2013XII + forecast 48 months ahead using auto.ARIMA model + 95% confidence interval of the forecast", xlim=c(1990,2017))
lines(no.permits.pred$pred, col="red")
lines(no.permits.pred$pred+1.96*no.permits.pred$se, col="blue", lty=3)
lines(no.permits.pred$pred-1.96*no.permits.pred$se, col="blue", lty=3)

And here comes the forecast by ARIMA(1,0,1) without seasonal parameters:

no.permits.pred2 <- predict(manual.fit, n.ahead=48, newxreg=(nobs+1):(nobs+48))
plot(no.permits, col="cyan4", main="Number of construction permits in Finland 1990I-2013XII + forecast 48 months ahead using ARIMA(1,0,1) model + 95% confidence interval of the forecast", xlim=c(1990,2017))
lines(no.permits.pred2$pred, col="red")
lines(no.permits.pred2$pred+1.96*no.permits.pred2$se, col="blue", lty=3)
lines(no.permits.pred2$pred-1.96*no.permits.pred2$se, col="blue", lty=3)

5.6.3. VAR models

How can we test for the impact of one variable on other variables over time? A $VAR(p)$ model is a system of $k$ equations with $k \times p$ explanatory variables ($p$ lags of each of the $k$ variables included in $y$), plus $k$ constants (if the means are not zero). Say we have two variables $y_{1,t}$ and $y_{2,t}$; then the VAR(1) model is

$$\begin{cases} y_{1,t} = \varphi_{11}\, y_{1,t-1} + \varphi_{12}\, y_{2,t-1} + \varepsilon_{1,t} \\ y_{2,t} = \varphi_{21}\, y_{1,t-1} + \varphi_{22}\, y_{2,t-1} + \varepsilon_{2,t}, \end{cases}$$

which can be expressed as $\boldsymbol{y}_t = \boldsymbol{\Phi}\boldsymbol{y}_{t-1} + \boldsymbol{\varepsilon}_t$, where

$$\boldsymbol{y}_t = \begin{bmatrix} y_{1,t} \\ y_{2,t} \end{bmatrix},\qquad \boldsymbol{\varepsilon}_t = \begin{bmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{bmatrix},\qquad \boldsymbol{\Phi} = \begin{bmatrix} \varphi_{11} & \varphi_{12} \\ \varphi_{21} & \varphi_{22} \end{bmatrix}.$$

In general the $VAR(p)$ model can be expressed as

$$\boldsymbol{y}_t = \boldsymbol{\Phi}_1\boldsymbol{y}_{t-1} + \boldsymbol{\Phi}_2\boldsymbol{y}_{t-2} + \dots + \boldsymbol{\Phi}_p\boldsymbol{y}_{t-p} + \boldsymbol{\varepsilon}_t.$$

The equation above is called a reduced-form VAR. The error terms of the equations are allowed to be correlated with each other, but since the explanatory variables are the same in every equation, this setup can be estimated using OLS. The VAR model can be expressed using lag polynomials as $\Phi(L)y_t = \varepsilon_t$, or equivalently

$$y_t = \Phi(L)^{-1}\varepsilon_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j},$$

where $\psi_0 = I$ is an identity matrix. This is a particularly practical form, since one is often interested in how a shock in one variable is reflected in another; the matrices $\psi_j$ thus contain the impulse responses. The impact of a one unit (or one standard deviation) shock in one error term on each of the variables in the model produces the impulse response functions. Sometimes the shocks $\varepsilon_{1,t}, \varepsilon_{2,t},\dots$ can be correlated, which makes it difficult to distinguish the shocks from each other. Hence, the model should be transformed (orthogonalized) to make the error terms uncorrelated.

Example 71. ..................................................................................................................................

Does advertising create more sales, or is it the other way around? Let's investigate.
library(vars) library(tseries) lydia=ts(lydia[,],start=c(1907,1),deltat=1) lydia colnames(lydia)=c("Sales","Advertising") lydia1955<-window(lydia,end=1955) dim(lydia1955) var3_1955=VAR(lydia1955, p = 3, type = c("const"),season = NULL, exogen = NULL, lag.max = NULL) summary(var3_1955) fcast=predict(var3_1955,n.ahead=10,ci=0.95) fcast$fcst fanchart(fcast) > cause_sales=causality(var_lag3, cause = "Sales") > cause_adv=causality(var_lag3, cause = "Advertising") > cause_sales $Granger Granger causality H0: Sales do not Granger-cause Advertising data: VAR object var_lag3 F-Test = 9.6761, df1 = 3, df2 = 88, p-value = 1.385e-05 $Instant H0: No instantaneous causality between: Sales and Advertising data: VAR object var_lag3 Chi-squared = 10.2754, df = 1, p-value = 0.001348 > cause_adv $Granger Granger causality H0: Advertising do not Granger-cause Sales data: VAR object var_lag3 F-Test = 1.8284, df1 = 3, df2 = 88, p-value = 0.1478 $Instant H0: No instantaneous causality between: Advertising and Sales data: VAR object var_lag3 Chi-squared = 10.2754, df = 1, p-value = 0.001348 Based on our model it seems that advertising is not causing more sales! Example 72. .................................................................................................................................. Impulse response functions is a nice way to see how changes in one variable get reflected in another variable. impulse=irf(var_lag3, n.ahead=20, ortho=TRUE, boot=TRUE,ci=0.95,runs=100, cumulative=FALSE) plot(impulse) 5.7. 5.8. Bayesian inference (incomplete) Tree-based methods (incomplete) Tree-based methods are powerful yet simple to interpret classification tools. 5.9. Support vector classifier and support vector machine The support vector machine (SVM) is an extension of the support vector classifier which is an extension of the maximal margin classifier. All definitions aside, SVM is a hands-on technique with a rather engineering flavor to classify the variables that are correlated in a more complex way than the one that would be possible to model with a logistic regression. SVM has gained its reputation because of its functionality and flexibility, and that is the reason why it is being introduced here. SVM can be used to classify the data into many classes, but in this chapter we will only go through the binary classification setting. The concept of a hyperplane is a foundation of the technique being employed in the classification methods applied in SVM. Hyperplane a 𝑝 − 1 –dimensional subspace of the 𝑝 –dimensional space. Hence, in the case of the ordinary 3-dimensional space its hyperplane is 2-dimensional. This means that the hyperplane of the 3-dimensional space is a plane. From that, the reader can easily infer that the hyperplane of a 2-dimensional space is a line, which is a 1-dimensional object. Mathematically a 𝑝 -dimensional hyperplane in a 𝑝 + 1 –dimensional space is defined as 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 = 0 and a 2-dimensional hyperplane stands for a linear equation 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 = 0 where 𝑋 = (𝑋1 , 𝑋2 )𝑇 stands for a vector that assigning the points the hyperplane. This case can be showcased graphically as follows. 
v1<- rep(seq(0,1,0.1), times=10, each=1) v2<- rep(seq(0,1,0.1), times=1, each=10) m.svm<-cbind(v1,v2) plot(m.svm, xlim=c(0,1), ylim=c(0,1), pch=19, col="dodgerblue4") abline(coef = c(1,-1), lwd=3, col='orangered') plot(1,xlim=c(0.035,0.965), ylim=c(0.035,0.965)) polygon(x=c(0,0,1), y=c(0,1,0), col = 'skyblue') polygon(x=c(1,0,1), y=c(1,1,0), col = 'deeppink') abline(coef = c(1,-1), lwd=3) The line in the graph represents a hyperplane 𝑋1 + 𝑋2 − 1 = 0, where 𝑋1 is represented by a vector 𝑣1 and 𝑋2 is represented by 𝑣2 . Thus, 𝛽0 = −1, 𝛽1 = 1, and 𝛽2 = 1. All points which lie above or below the line in the left-hand side of the graph satisfy an equation 𝑋1 + 𝑋2 − 1 > 0 or 𝑋1 + 𝑋2 − 1 < 0 respectively, and are represented by a purple and a blue area in the right-hand side of the graph. For instance, in case of 3 − 2𝑋1 − 4𝑋2 = 0, the graph would look as follows. Now, say, that we’d like to classify two sets of points in a 2-dimensional setting, then the hyperplane can be used for that purpose (i.e. decision making): v1=c(0.12,0.32,0.38,0.24,0.17,0.25,.64,.58,.73,.67,.90) v2=c(0.34,0.05,0.21,0.16,0.17,0.50,.67,.96,.85,.56,.78) y=c(rep(-1,6) ,rep (1,5)) m.svm2<-cbind(v1,v2) plot(m.svm2, col =(3-y), pch=19) abline(coef = c(1,-1), lwd=3) abline(coef = c(-4.5,10), lwd=3) abline(coef = c(1.5,-2.5), lwd=3) Recall from the linear algebra course or high school mathematics, the formula for line equation is 𝑦0 − 𝑦 = 𝑦 − 𝑦0 (𝑥 − 𝑥0 ), 𝑥 − 𝑥0 where (𝑦 − 𝑦0 )/(𝑥 − 𝑥0 ) is the slope coefficient / first derivative of the equation. Based on the example above we may be able construct many different separating planes (not always the case). From the result above we may as well conclude that for a line to be a separating plane for two classes it has to satisfy the following property: 𝑦𝑖 (𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 ) > 0, where 𝑦𝑖 = {−1,1}. And from this result we may generalize further that 𝑦𝑖 (𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑝 𝑥𝑖𝑝 ) > 0, for a 𝑝 –dimensional hyperplane, for all 𝑖 = 1, … , 𝑛 observations. After being able to find several separating hyperplanes the natural question arises, what would be the best one to classify the data? One approach is to search for a separating hyperplane with the largest possible 𝑀, that represents the minimum distance from each point to that particular plane. Formally this becomes a maximization problem with the following setup maximize 𝑀 𝛽0 ,𝛽1 ,…,𝛽𝑝 𝑝 subject to ∑ 𝛽𝑗2 = 1, 𝑗=1 𝑦𝑖 (𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑝 𝑥𝑖𝑝 ) ≥ 𝑀 ∀𝑖 = 1, … , 𝑛. In case one is able to find such 𝑀 we call it the maximal margin hyperplane. Say, we have a two class set of observations, then the distance of observations from the hyperplane looks graphically as follows: The two points connected with arrows are the ones with minimal distance from the hyperplane in case. Sometimes it is not possible to separate the set of points linearly, i.e. 𝑀 > 0 does not exist for that case as shown in the graph below. Nevertheless, it is possible to apply a support vector classifier technique in that case. On the other hand, even when there’s a possibility to separate observations into classes, introducing new observations may lead to contradictions with a defined hyperplane, which is showcased in the graph below. In case a new observation doesn’t fit a predefined hyperplane, it can be a sign of an inappropriate hyperplane. Since a separating hyperplane is intended to perfectly classify the observations, there’s a chance for it to be sensitive to individual observations. 
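As a side note, the two-class example above can already be handed to a support vector classifier in R. A minimal sketch using the e1071 package (the same package the SVM examples at the end of this section rely on; the object names dat.lin and fit.lin are made up here), assuming m.svm2 and y from the code above are still in the workspace:

library(e1071)
dat.lin <- data.frame(x = m.svm2, y = as.factor(y))   # the two separable classes plotted above
fit.lin <- svm(y ~ ., data = dat.lin, kernel = "linear",
               cost = 10, scale = FALSE)               # a large cost tolerates few margin violations
summary(fit.lin)
plot(fit.lin, dat.lin)                                 # the fitted linear boundary separates the classes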
Adjusting the hyperplane every time new observations are introduced may work as well, but this operation tends to reduce the margins of the classifier, while the margins can be seen as a measure of confidence. On top of that, real-life data is rarely perfectly classifiable. Therefore, instead of searching for a perfect classifier, one could look for techniques that are more robust to single observations and focus on finding a classifier that classifies most observations at a satisfactory level – often just enough to make decisions while accepting a certain level of uncertainty. Accepting that a classifier may not be perfect brings us to the concept of the support vector classifier. Its maximization problem is defined as follows:

maximize M over β0, β1, …, βp and ε1, …, εn
subject to ∑j βj² = 1 (j = 1, …, p),
yi(β0 + β1xi1 + β2xi2 + ⋯ + βpxip) ≥ M(1 − εi) for all i = 1, …, n,
εi ≥ 0 and ε1 + ε2 + ⋯ + εn ≤ C,

where C is a nonnegative tuning parameter, M is the width of the margin, and ε1, …, εn are slack variables that allow individual observations to end up on the wrong side of the margin. The slack variable εi tells us where the i-th observation lies with respect to the classifier:

εi = 0         observation i is on the correct side of the margin;
0 < εi ≤ 1     observation i violates the margin;
εi > 1         observation i is on the wrong side of the hyperplane.

In the light of the slack variables' definition, the parameter C can be seen as a budget for the total amount of violations the observations are allowed to cause with respect to the hyperplane. On the other hand, C is positively related to the choice of M, as widening the margin inevitably increases the total violation across the observations, which is demonstrated in the following graph.

So far we have focused on classifiers that are linear. Unfortunately, this set of classifiers has big limitations when the data is not clustered linearly; in these cases even accepting a non-perfect classifier may not be sufficient in terms of performance. Have a look at the following graph. In the graph above one can clearly identify three clusters of data points belonging to either class. Trying to fit a linear classifier in this case is rather disappointing, which becomes obvious in the right-hand side of the picture. One solution to this problem is to use non-linear restrictions with more features: instead of using only the original observations X1, X2, …, Xp we could also use their second powers X1, X1², X2, X2², …, Xp, Xp², or even higher powers up to some m in general,

X1, X1², …, X1^m, X2, X2², …, X2^m, …, Xp, Xp², …, Xp^m,

thereby doubling or multiplying the number of parameters even further. In that case the familiar maximization problem becomes

maximize M over β0, β11, …, β1m, …, βp1, …, βpm and ε1, …, εn
subject to ∑j ∑k βjk² = 1 (j = 1, …, p; k = 1, …, m),
yi(β0 + ∑j βj1 xij + ∑j βj2 xij² + ⋯ + ∑j βjm xij^m) ≥ M(1 − εi) for all i = 1, …, n,
εi ≥ 0 and ε1 + ε2 + ⋯ + εn ≤ C.

Even though the decision boundary produced this way is linear in the enlarged feature space, it is in general non-linear in the original feature space. Unfortunately, increasing the number of parameters brings another drawback: the computations become heavy. Luckily, there is a more efficient way to deal with non-linearities while keeping the required computations under control. The support vector machine (SVM) is one more extension designed to deal with this set of challenges.
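Before turning to kernels, the feature-enlargement idea above can be sketched directly: the original predictors are simply augmented with their higher powers before any linear classifier is fitted. A minimal sketch on assumed toy data (the variable names are illustrative only):

set.seed(1)
x1 <- rnorm(40); x2 <- rnorm(40)            # two assumed predictors
X.enlarged <- cbind(x1, x1^2, x2, x2^2)     # p = 2 features become 2*m = 4 columns for m = 2
colnames(X.enlarged) <- c("x1", "x1.sq", "x2", "x2.sq")
head(X.enlarged, 3)

A classifier that is linear in these four columns is quadratic, and hence non-linear, in the original two predictors.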
The idea of the support vector machine is to model non-linear boundaries between the classes, but instead of relying on additional linear parameters its foundation lies in kernels. The theory of kernels is involved, so we omit it here; instead, some of the most commonly used kernels are introduced in a general manner, and examples of their use are provided after that. Let K(xi, xi*) be the kernel, where the observations represent training vectors. The most basic kernel used in SVM is the linear kernel, which is in fact the inner product of the vectors xi and xi*, ⟨xi, xi*⟩ = ∑j xij xi*j. Its extension is the polynomial kernel (1 + ∑j xij xi*j)^d. Another commonly used kernel is called the radial kernel, and it takes the form exp(−γ ∑j (xij − xi*j)²). Finally, the last kernel that deserves our attention is called the hyperbolic tangent or sigmoid kernel, and it takes the form tanh(κ ∑j xij xi*j + c). Thus, the four most commonly applied kernels K(xi, xi*) are (all sums run over the predictors j = 1, …, p):

linear:       ∑j xij xi*j
polynomial:   (1 + ∑j xij xi*j)^d,           for d > 1
radial:       exp(−γ ∑j (xij − xi*j)²),      for γ > 0
sigmoid:      tanh(κ ∑j xij xi*j + c),       for κ > 0 and c < 0.

You need to have the package "e1071" installed in order to run the following examples; it is an easy-to-use package for fitting SVMs.

Example 73. ..................................................................................................................................

V<-matrix(rnorm(80,0,1), ncol = 2)
y=c(rep(-1,20), rep(1,20))
V[21:30,]=V[21:30,] + 3
V[31:40,]=V[31:40,] - 3
dat=data.frame(x=V, y=as.factor(y))
library(e1071)
tune.out=tune(svm, y~., data=dat, kernel="radial", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
tune.out$best.model

Call:
best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), kernel = "radial")

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.5

Number of Support Vectors:  14

svmfit=svm(y~., data=dat, kernel="radial", cost=1, gamma=0.5, scale=FALSE)
plot(svmfit, dat, fill = TRUE, svSymbol = 42, dataSymbol = 19)

This is a radial classification, where the star symbols stand for the support vectors and the dots for the data points.

Example 74. ..................................................................................................................................

V.poly<-matrix(rnorm(80,0,1), ncol = 2)
y=c(rep(-1,20), rep(1,20))
V.poly[1:20,]=(V.poly[1:20,])^2
dat=data.frame(x=V.poly, y=as.factor(y))
tune.out=tune(svm, y~., data=dat, kernel="polynomial", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
bestmod=tune.out$best.model
bestmod

Call:
best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), kernel = "polynomial")

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  polynomial
       cost:  1
     degree:  3
      gamma:  0.5
     coef.0:  0

Number of Support Vectors:  28

svmfit=svm(y~., data=dat, kernel="polynomial", cost=1, degree=3, gamma=0.5, coef0=0, scale=FALSE)
plot(svmfit, dat, fill = TRUE, svSymbol = 42, dataSymbol = 19)

This is a polynomial classification method.
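To connect the kernel formulas above with actual numbers, the polynomial kernel as written in the text, (1 + ⟨xi, xi*⟩)^d, can be evaluated by hand for two rows of the training data. A minimal sketch (note, as an aside, that e1071 parameterizes its polynomial kernel slightly differently, using gamma and coef0):

d <- 3
k.poly <- function(xi, xj) (1 + sum(xi * xj))^d   # polynomial kernel from the formula above
k.poly(V.poly[1, ], V.poly[2, ])                  # kernel value for the first two training vectors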
Example 75. ..................................................................................................................................

V.tan<-matrix(rnorm(80,0,8), ncol = 2)
y=c(rep(-1,20), rep(1,20))
V.tan[1:20,]=tan(V.tan[1:20,])
plot(V.tan, col=(3-y), pch=19)
dat=data.frame(x=V.tan, y=as.factor(y))
tune.out=tune(svm, y~., data=dat, kernel="sigmoid", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
tune.out$best.model

Call:
best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), kernel = "sigmoid")

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  sigmoid
       cost:  10
      gamma:  0.5
     coef.0:  0

Number of Support Vectors:  27

svmfit=svm(y~., data=dat, kernel="sigmoid", cost=10, gamma=0.5, coef0=0, scale=FALSE)
tmp<-dat[,1]
dat[,1]<-dat[,2]
dat[,2]<-tmp
plot(svmfit, dat, fill = TRUE, svSymbol = 42, dataSymbol = 19)

This is a hyperbolic tangent (sigmoid) attempt to classify the (complicated) data.

5.10. Depicting the data
This section raises issues related to displaying data.

Linear scale vs. log scale

# World population
library(gcookbook)
library(gridExtra)
library(ggplot2)
data("worldpop")
# Linear scale
linS <- ggplot(worldpop, aes(x=Year, y=Population)) + geom_line() + geom_point()
# Log scale
logS <- ggplot(worldpop, aes(x=Year, y=Population)) + geom_line() + geom_point() + scale_y_log10()
grid.arrange(linS, logS, nrow=1, ncol=2)

Linear growth rates vs. log growth rate

# World population growth
diff.p<-numeric(length(worldpop$Population)-1)
for (i in 1:length(diff.p)){
  diff.p[i]<-((worldpop$Population[i+1]-worldpop$Population[i])/worldpop$Population[i])
}
diff.log.p<-diff(log(worldpop$Population))
worldpop2<-cbind(worldpop[2:length(worldpop$Population),],diff.p,diff.log.p)
# Linear scale
diff.linS <- ggplot(worldpop2, aes(x=Year, y=diff.p)) + geom_line() + geom_point() + ylab("Percentage growth")
# Log scale
diff.logS <- ggplot(worldpop2, aes(x=Year, y=diff.log.p)) + geom_line() + geom_point() + ylab("Logarithmic growth")
grid.arrange(diff.linS, diff.logS, nrow=1, ncol=2)

Cutting and scaling the axes

pop <- ts(china$Pop, start = c(1960), frequency = 1)
library(ggplot2)
library(ggfortify)
p1 <- autoplot(pop/10^6, ts.colour = 'red2', size = 1)
p1 <- p1 + theme_grey(base_size = 15) + scale_x_continuous(name="Years") + scale_y_continuous(name="Population (millions)") + labs(title = "Population in China 1960-2014", size=3)
p2 <- autoplot(diff(log(pop/10^6)), ts.colour = 'red2', size = 1)
p2 <- p2 + theme_grey(base_size = 15) + scale_x_continuous(name="Years") + scale_y_continuous(name="Annual growth rate") + labs(title = "Population growth rate in China 1960-2014", size=3)
grid.arrange(p1,p2,nrow=1,ncol=2)
grid.arrange(p1,p2,nrow=2,ncol=1)

# Average prices of fishing using 'charter' and 'beach'
library(mlogit)
data(Fishing)
prices<-cbind(c("Beach","Charter"), c(round(mean(Fishing$price.beach)), round(mean(Fishing$price.charter))))
colnames(prices)<-c("Mode", "Av.price")
prices<- as.data.frame(prices)
prices$Av.price<- as.numeric(as.character(prices$Av.price))
gg <- ggplot(prices, aes(x=reorder(Mode, Av.price), y=Av.price, fill=Mode)) + geom_bar(stat="identity", colour="black")
gg
# Average prices of fishing using 'charter' and 'beach' with a rescaled y axis
gg + scale_y_continuous(limits=c(0, 5000))
# Average prices of fishing using 'charter' and 'beach' with a limited y axis
gg + coord_cartesian(ylim = c(80, 105))

Spans of time series

library(scales)
library(grid)
library(ggplot2)
library(gridExtra)
library(ggfortify)
library(plyr)
# Global barley index 3.1.2000-10.2.2015
gg1 <- ggplot(barley, aes(x=Date2, y=P)) + geom_line(aes(colour = P)) + scale_colour_gradient(high="red", low="dodgerblue3",
name="") + scale_x_datetime(breaks = "800 days", minor_breaks = "100 days", labels = date_format("%m/%Y")) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12), axis.text.y = element_text(size=15), axis.title.x = element_text(size = 12), axis.title.y = element_text(size = 12), plot.title = element_text(size = 15), legend.text = element_text(size=10), legend.title = element_text(size=1), legend.key.size = unit(0.25, "cm"), legend.position = c(.3,.6)) + labs(title="3.1.2000-10.2.2015", x="", y="") gg1 #Global barley index 1.6.2007-30.9.2007 gg2 <- ggplot(barley[1934:2019,], aes(x=Date2, y=P)) + geom_line(aes(colour = P), size=1) + scale_colour_gradient(high="red", low="dodgerblue3", name="Price index") + scale_x_datetime(breaks = date_breaks("30 days"), labels = date_format("%d/%m/%Y")) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12), axis.text.y = element_text(size=15), axis.title.x = element_text(size = 12), axis.title.y = element_text(size = 12), plot.title = element_text(size = 15), legend.text = element_text(size=10), legend.title = element_text(size=10), legend.key.size = unit(0.25, "cm")) + labs(title="1.6.2007-30.9.2007", x="", y="") gg2 #Global barley index 1.1.2009-31.12.2011 gg3 <- ggplot(barley[2348:3129,], aes(x=Date2, y=P)) + geom_line(aes_string(col = "P", fill = NULL)) + scale_colour_gradient(high="red", low="dodgerblue3", name="") + scale_x_datetime(breaks = date_breaks("100 days"), labels = date_format("%m/%Y")) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12), axis.text.y = element_text(size=15), axis.title.x = element_text(size = 12), axis.title.y = element_text(size = 12), plot.title = element_text(size = 15), legend.text = element_text(size=10), legend.title = element_text(size=1), legend.key.size = unit(0.25, "cm"), legend.position = c(.3,.6)) + labs(title="1.1.2009-31.12.2011", x="", y="") gg3 #Global barley index 14.9.2005-14.4.2006 gg4 <- ggplot(barley[1487:1639,], aes(x=Date2, y=P)) + geom_path(col="dodgerblue", size=1) + scale_x_datetime(breaks = date_breaks("30 days"), labels = date_format("%m/%Y")) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12), axis.text.y = element_text(size=15), axis.title.x = element_text(size = 12), axis.title.y = element_text(size = 12), plot.title = element_text(size = 15), legend.text = element_text(size=15), legend.title = element_text(size=12)) + labs(title="14.9.2005-14.4.2006", x="", y="") + ylim(c(120,140)) gg4 #Combining the plots gg.l=list(gg1,gg2, gg3, gg4) aG <- arrangeGrob(grobs=gg.l, nrow=2,ncol=2, top="Global barley index, daily basis: January 2000=100") grid.arrange(aG) Proportions vs. 
counts library(ggplot2) library(grid) library(gridExtra) weight.int <- cut(ChickWeight$weight,5) ChickWeight <- data.frame(ChickWeight, weight.int) t <- theme(plot.title = element_text(size = 15), axis.text.x = element_text(size = 12), axis.text.y = element_text(size=12), axis.title.x = element_text(size = 15), axis.title.y = element_text(size = 15), legend.text = element_text(size=13), legend.title = element_text(size=13), legend.key.size = unit(0.5, "cm"), legend.position="right") gg1 <- ggplot(ChickWeight, aes(weight.int, fill=Diet)) + geom_bar(position="fill") + labs(title="Proportions based on diet with 5 weight intervals of equal length", x="", y="Proportion") + t gg2 <- ggplot(ChickWeight, aes(weight.int, fill=Diet)) + geom_bar() + labs(title="Number of observations based on diet with 5 weight intervals of equal length", x="Weight intervals", y="Count") + t gg.l=list(gg1,gg2) aG <- arrangeGrob(grobs=gg.l, nrow=2,ncol=1) grid.arrange(aG) Palettes for everyone It is estimated that 8% of all men and 0.5% of all women are affected by this problem. It is making their lives difficult and handicapped. This problem is not curable (as of Aug 4, 2015), but it is possible to help these people. The issue is color blindness. Even though the problem is called color blindness, in fact most of the people affected are not completely color blind. Instead, they are not able to distinguish some of the hues that people with the standard vision are able to see. Dichromacy is the most common type of color blindness and it is possible to split it into three separate generalized deficiencies: protanopia, deuteranopia, and tritanopia. People suffering from the first deficiency are less sensitive to red light, while the ones suffering from deuteranopia have issues with green light. Nevertheless, both are having problems with red and green hues. People affected by tritanopia confuse blue with green and yellow with violet. Let’s see how these issues compare to the normal vision. # Color blindness #### library(grid) library(gridExtra) library(dichromat) data("dalton") df <- data.frame(V1=rep(1:2, each=128),V2=1:128, V3=1:256) gg <- ggplot(df, aes(V2,V1, fill=factor(V3))) + geom_raster(hjust = 0, vjust = 0) + theme(legend.position ="none", axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank()) + xlab("")+ ylab("") gg.1 <- gg + scale_fill_manual(values = dalton.colors$normal)+ ggtitle("Normal vision") gg.2 <- gg + scale_fill_manual(values = dalton.colors$deutan)+ ggtitle("Deutan vision") gg.3 <- gg + scale_fill_manual(values = dalton.colors$protan)+ ggtitle("Protan vision") gg.4 <- gg + scale_fill_manual(values = dalton.colors$tritan)+ ggtitle("Tritan vision") grid.arrange(arrangeGrob(grobs=list(gg.1, gg.2,gg.3,gg.4), nrow=4, ncol=1, top="")) 6. References 6.1. Function reference abline {graphics}........................................................ adds one or more straight lines through the current plot acf {stats}......... computes estimates of the autocovariance or autocorrelation function of a given time series Acf {forecast}.................. computes estimates of the (partial) autocorrelation function of a given time series aes {ggplot2} ........................................................................ creates a list of unevaluated expressions in ggplot apply {base}............................. 
returns a vector or an array or a list of values obtained by applying a function arima {stats} ............................................................................. fits an ARIMA model to a univariate time series arima.sim {stats} .......................................................................................................simulates an ARIMA model arrangeGrob {gridExtra} .......................................................................... produces plots with combined graphs array {base} ................................................................................................................. creates or tests for arrays as.data.frame {base} .................................................... coerces its argument to an object of type “data frame” as.factor {base} ................................................................................................. coerces its argument to a factor as.numeric {base}.......................................................... coerces its argument to a an object of type “numeric” attach {base} ............................................................................... attaches the databases to the R environment attr {base}......................................................................................... gets or sets specific attributes of an object auto.arima {forecast} .............................. returns the best ARIMA model according to a lag selection criterion autoplot {ggplot2} ..........................................................................draws a plot based on the class of an object boxplot {graphics} ............................................ produces box-and-whisker plots of the given (grouped) values c {base} .................................................................................................... combines values into a vector or a list cat {base}................................................................ outputs the objects and concatenates the representations cbind {base}....................................................................................................... combines R Objects by columns chisq.test {stats} ................................. performs chi-squared contingency table tests and goodness-of-fit tests class {base} .......................................................... prints the vector of names of classes an object inherits from coef {stats} ...................................... extracts model coefficients from objects returned by modeling functions colMeans {base} .................................................................................. forms column means for numeric arrays colnames {base} ...................................................... retrieves or sets the column names of a matrix-like object confint {stats} ............................ computes confidence intervals for one or more parameters in a fitted model cos {base} ............................................................................................................................computes the cosine curve {graphics}......................................................draws a curve corresponding to a function over an interval cut {base} .............................................. 
divides and codes the values in x according to which interval they fall cut2 {Hmisc} .................................................................................................................extended version of cut() data {utils} ................................................................... loads specified data sets, or lists the available data sets data.frame {base}................................................................................................................. creates data frames ddply {plyr} ..................................... applies a function and combines results into a data frame for each subset diff {base} ................................................................................ returns suitably lagged and iterated differences dimnames {base}............................................................................ retrieves or sets the dimnames of an object dnorm {stats} ............................................................................... gives the density for the normal distribution duplicated {base} ................................................ determines duplicated elements and returns a logical vector effects {stats} .................................... returns (orthogonal) effects from a fitted model, usually a linear model exp {base} .................................................................................................... computes the exponential function facet_grid {ggplot2} ............................................................................... factorizes the plots by a given variable factor {base} ................................................................................................... returns an object of class "factor" file.choose {base} ........................................................................................................ choose a file interactively filter {stats} ........................... applies linear filtering to a time series or to a multivariate time series searately fitted {stats} .............................................. extracts fitted values from objects returned by modeling functions getwd {base} ........................................................................... returns a path of the current working directory geom_bar {ggplot2} .......... produces bar charts for categorical x, and histograms for continuous y of a ggplot geom_errorbar {ggplot2} ......................................................................................... draws error bars in ggplot2 geom_hline {ggplot2} ....................................................................................... draws horizontal lines in ggplot2 geom_line {ggplot2} .......................................................................................... connects observations in ggplot geom_point {ggplot2} ........................................................................................... creates scatterplots in ggplot geom_raster {ggplot2} ............................................................................................. draws rectangles in ggplot2 geom_smooth {ggplot2} .............................................................................adds a smooth mean to the ggplot2 ggplot {ggplot2}........................................................................................................... initializes a ggplot object ggtitle {ggplot2}............................................................................... 
changes labels of axes and titles in ggplot2 grid.arrange {gridExtra}................................ sets up a gtable layout to place multiple ggplot graphs on a page head {utils} ...................................... returns the first parts of a vector, matrix, table, data frame or a function hist {graphics}............................................................................ computes a histogram of the given data values hmftest {mlogit} ............................................................. tests the IIA hypothesis for a multinomial logit model HoltWinters {stats}......................................................... computes Holt-Winters filtering of a given time series is.list {base} ........................................................................................................... checks if its argument is a list labs {ggplot2} ..................................................... changes the axis labels and the legend title of a ggplot object lapply {base} ..................................applies and a function for each corresponding element of a list or a vector layout {graphics} ........................ divides the device up into as many rows and columns as there are in matrix length {base} ................................gets or sets the length of an R object for which a method has been defined levels {base} ....................................................................... provides access to the levels attribute of a variable library {base} ................................................................................................................... loads add-on packages lines {graphics} ..................................................................... joins the corresponding points with line segments list {base} ..................................................................................................................................... constructs a list lm {stats} ................................................................................................................................. fits linear models log {base}..........................................................................computes logarithms (by default natural logarithms) ls {base} ................................................................gives the names of the objects in the specified environment matrix {base} ......................................................................................................................... constructs a matrix max {base}............................................................................................. returns the maxima of the input values mean {base} ............................................................................ returns the arithmetic mean of the input values median {stats} ...................................................................................................... computes the sample median min {base} .............................................................................................. returns the minima of the input values mlogit {mlogit} ....................................................................................... estimates the multinomial logit model mlogit.data {mlogit} ........................ shapes a data.frame in a suitable form for the use of the mlogit function nrow {base} ............................................. 
return the number of rows present in a vector, array or data frame numeric {base} ............................................................................................... creates objects of type “numeric” options {base} .............................................sets a variety of global options to compute and display the results pacf {stats} ............................................................ computes estimates of the partial autocorrelation function Pacf {forecast} ......... computes an estimate of the partial autocorrelation function of a univariate time series par {graphics} ............................................................................................ sets or queries graphical parameters paste {base}........................................................................ concatenates vectors after converting to character pchisq {stats} ........................................................... gives the distribution function of Chi-Squared distribution plot {graphics} ...............................................................................................................................plots R objects pnorm {stats} .......................................................... gives the distribution function for the normal distribution polygon {graphics}.......................................................... draw polygons manually with predefined coordinates position_dodge {ggplot2}....................................... adjusts position by dodging overlaps to the side in ggplot2 predict {stats} ......................................................... predicts from the results of various model fitting functions print {base}..................................................................................................... prints its argument and returns it prop.table {base}.................................................................. prints out the proportions of each cell of an array qnorm {stats} ................................................................ gives the quantile function for the normal distribution qplot {ggplot2} ................................................... allows to use the syntax of plot() in the ggplot2 environment reorder {stats} .................................................... reorders its levels based on the values of the second variable rbind {base} ............................................................................................................. combines R Objects by rows rbinom {stats}............................................................ generates random deviates for the binomial distribution read.csv {utils} ........................................................ reads a file in CSV format and creates a data frame from it read.octave {foreign} ................................................................. reads a MATLAB/Octave file into a data frame read.spss {foreign} ...................................................................................... reads an SPSS file into a data frame read.stata {foreign} .................................................................................... reads a STATA file into a data frame read.xls {gdata} .......................................................................................... reads an excel file into a data frame read.xlsx {xlsx} ............................................................................................ 
reads an excel file into a data frame read.table {utils}................................................... reads a file in table format and creates a data frame from it remove OR rm {base} .................................................. removes a given object from working environment in R reorder {stats} ................................................................................................ reorder factors and other objects rep {base} ............................................................................................................................ replicates the values replicate {base} .........................................................................................................................see sapply {base} return {base} ...................................................................................................................... returns a given value rnorm {stats} ........................................................................ gives random deviates for the normal distribution row.names {base} ......................................................................... changes the names of rows in a given object round {base} ....................... rounds the values in its first argument to the specified number of decimal places sample {base} ................................................................................. gives a random sample of the specified size sapply {base} ............................................................................. a user-friendly version and wrapper of lapply() scale {base} ................................................................................................ scales the columns of a given matrix scale_y_log10 {ggplot2} ................................................................................ logarithmizes the y-scale of ggplot sd {stats} ................................................................................computes the standard deviation of a given array setRepositories() {utils} ........................ interacts with the user to choose the package repositories to be used seq {base} ............................................................................................................... generates regular sequences setwd() {base} ........................................................................... sets a given path as the new working directory shapiro.test {stats} ........................................................................ performs the Shapiro-Wilk test of normality sin {base} ................................................................................................................................ computes the sine scale_colour_gradient {ggplot2} .................................................... creates gradient transition for a given color scale_colour_hue {ggplot2} ...................................... creates gradient colours to different variables in ggplot2 scale_fill_gradient {ggplot2} ................................................. creates gradient color transitions for given colors scale_x_continuous {ggplot2} ..................................................................... sets the continuous position x scale scale_x_datetime {ggplot} .................................................................................................. scales the time scale scale_y_continuous {ggplot2} ..................................................................... 
sets the continuous position y scale solve {base} ............................................... solves the equation 𝑎𝑥 = 𝑏 for 𝑥 ; calculates the inverse of a matrix split {base}.............................................................................................................. split the data into subgroups sqrt {base} .................................................................................................................. computes the square root stat.desc {pastecs} ................................ compute a table giving various descriptive statistics of the given data stat_smooth {ggplot2} ............................................... helps to distinguish the objects in case of heavy plotting stl {stats}..........................................decomposes a time series into seasonal, trend and irregular components str {utils} ....................................................................... compactly display the internal structure of an R object subset {base} ...................................................................... returns subsets of vectors, matrices or data frames sum {base}................................................................ returns the sum of all the values present in its arguments summary {base} .......................... produces result summaries of the results of various model fitting functions svm {e1071} ........................................................................................................ trains support vector machine t {base} ........................................................................................................transposes a matrix or a data frame t.test {stats} ................................................................ performs one and two sample t-tests on vectors of data table {base} ............................... builds a contingency table of the counts at each combination of factor levels tan {base} ......................................................................................................................... calculates the tangent tapply {base} ............................................................................ applies a function to each cell of a ragged array theme {ggplot2} ...................................................................................... works with generic themes of ggplot2 theme_grey {ggplot2} ............................ sets the general aspect of a ggplot, check ggtheme for more themes ts {stats} .................................................................................................................... creates time-series objects tune {e1071} ................................................................. tunes the parameters of a function, for exampe svm() unit {grid} ............................................................................................................................ creates a unit vector unique {base} ....................................................................................... removes duplicate elements in a vector unlist {base}.............................................................................................................. converts a list into a vector View {utils} .................................................. invokes a spreadsheet-style data viewer on a matrix-like R object which {base} ....................................................................................... 
gives the TRUE indices of a logical object
with {base} ............................................. evaluates an R expression in an environment constructed from data
xlim {ggplot2} ...................................... limits the range of observations to the given range on the x axis of a ggplot
ylim {ggplot2} ...................................... limits the range of observations to the given range on the y axis of a ggplot
xlab {ggplot2} ............................................................................ changes the label of the x-axis in a ggplot object
ylab {ggplot2} ............................................................................ changes the label of the y-axis in a ggplot object

6.2. Literature reference
Chang, W. 2012. R Graphics Cookbook. O'Reilly Media.
Hastie, T., James, G., Tibshirani, R., Witten, D. 2015. An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics.
Provost, F. and Fawcett, T. 2013. Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O'Reilly Media.