A Data Scientist’s Guide to Acquiring,
Cleaning, and Managing Data in R
Samuel E. Buttrey and Lyn R. Whitaker
Naval Postgraduate School, California, United States
This edition first published 2018
© 2018 John Wiley & Sons Ltd
Library of Congress Cataloging-in-Publication Data applied for
Hardback ISBN: 9781119080022
Contents
Preface
About the Companion Website
1 R
1.1 Introduction
1.1.1 What Is R?
1.1.2 Who Uses R and Why?
1.1.3 Acquiring and Installing R
1.1.4 Starting and Quitting R
1.2 Data
1.2.1 Acquiring Data
1.2.2 Cleaning Data
1.2.3 The Goal of Data Cleaning
1.2.4 Making Your Work Reproducible
1.3 The Very Basics of R
1.3.1 Top Ten Quick Facts You Need to Know about R
1.3.2 Vocabulary
1.3.3 Calculating and Printing in R
1.4 Running an R Session
1.4.1 Where Your Data Is Stored
1.4.2 Options
1.4.3 Scripts
1.4.4 R Packages
1.4.5 RStudio and Other GUIs
1.4.6 Locales and Character Sets
1.5 Getting Help
1.5.1 At the Command Line
1.5.2 The Online Manuals
1.5.3 On the Internet
1.5.4 Further Reading
1.6 How to Use This Book
1.6.1 Syntax and Conventions in This Book
1.6.2 The Chapters
2 R Data, Part 1: Vectors
2.1 Vectors
2.1.1 Creating Vectors
2.1.2 Sequences
2.1.3 Logical Vectors
2.1.4 Vector Operations
2.1.5 Names
2.2 Data Types
2.2.1 Some Less-Common Data Types
2.2.2 What Type of Vector Is This?
2.2.3 Converting from One Type to Another
2.3 Subsets of Vectors
2.3.1 Extracting
2.3.2 Vectors of Length 0
2.3.3 Assigning or Replacing Elements of a Vector
2.4 Missing Data (NA) and Other Special Values
2.4.1 The Effect of NAs in Expressions
2.4.2 Identifying and Removing or Replacing NAs
2.4.3 Indexing with NAs
2.4.4 NaN and Inf Values
2.4.5 NULL Values
2.5 The table() Function
2.5.1 Two- and Higher-Way Tables
2.5.2 Operating on Elements of a Table
2.6 Other Actions on Vectors
2.6.1 Rounding
2.6.2 Sorting and Ordering
2.6.3 Vectors as Sets
2.6.4 Identifying Duplicates and Matching
2.6.5 Finding Runs of Duplicate Values
2.7 Long Vectors and Big Data
2.8 Chapter Summary and Critical Data Handling Tools
3 R Data, Part 2: More Complicated Structures
3.1 Introduction
3.2 Matrices
3.2.1 Extracting and Assigning
3.2.2 Row and Column Names
3.2.3 Applying a Function to Rows or Columns
3.2.4 Missing Values in Matrices
3.2.5 Using a Matrix Subscript
3.2.6 Sparse Matrices
3.2.7 Three- and Higher-Way Arrays
3.3 Lists
3.3.1 Extracting and Assigning
3.3.2 Lists in Practice
3.4 Data Frames
3.4.1 Missing Values in Data Frames
3.4.2 Extracting and Assigning in Data Frames
3.4.3 Extracting Things That Aren’t There
3.5 Operating on Lists and Data Frames
3.5.1 Split, Apply, Combine
3.5.2 All-Numeric Data Frames
3.5.3 Convenience Functions
3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames
3.6 Date and Time Objects
3.6.1 Formatting Dates
3.6.2 Common Operations on Date Objects
3.6.3 Differences between Dates
3.6.4 Dates and Times
3.6.5 Creating POSIXt Objects
3.6.6 Mathematical Functions for Date and Times
3.6.7 Missing Values in Dates
3.6.8 Using Apply Functions with Dates and Times
3.7 Other Actions on Data Frames
3.7.1 Combining by Rows or Columns
3.7.2 Merging Data Frames
3.7.3 Comparing Two Data Frames
3.7.4 Viewing and Editing Data Frames Interactively
3.8 Handling Big Data
3.9 Chapter Summary and Critical Data Handling Tools
4 R Data, Part 3: Text and Factors
4.1 Character Data
4.1.1 The length() and nchar() Functions
4.1.2 Tab, New-Line, Quote, and Backslash Characters
4.1.3 The Empty String
4.1.4 Substrings
4.1.5 Changing Case and Other Substitutions
4.2 Converting Numbers into Text
4.2.1 Formatting Numbers
4.2.2 Scientific Notation
4.2.3 Discretizing a Numeric Variable
4.3 Constructing Character Strings: Paste in Action
4.3.1 Constructing Column Names
4.3.2 Tabulating Dates by Year and Month or Quarter Labels
4.3.3 Constructing Unique Keys
4.3.4 Constructing File and Path Names
4.4 Regular Expressions
4.4.1 Types of Regular Expressions
4.4.2 Tools for Regular Expressions in R
4.4.3 Special Characters in Regular Expressions
4.4.4 Examples
4.4.5 The regexpr() Function and Its Variants
4.4.6 Using Regular Expressions in Replacement
4.4.7 Splitting Strings at Regular Expressions
4.4.8 Regular Expressions versus Wildcard Matching
4.4.9 Common Data Cleaning Tasks Using Regular Expressions
4.4.10 Documenting and Debugging Regular Expressions
4.5 UTF-8 and Other Non-ASCII Characters
4.5.1 Extended ASCII for Latin Alphabets
4.5.2 Non-Latin Alphabets
4.5.3 Character and String Encoding in R
4.6 Factors
4.6.1 What Is a Factor?
4.6.2 Factor Levels
4.6.3 Converting and Combining Factors
4.6.4 Missing Values in Factors
4.6.5 Factors in Data Frames
4.7 R Object Names and Commands as Text
4.7.1 R Object Names as Text
4.7.2 R Commands as Text
4.8 Chapter Summary and Critical Data Handling Tools
5 Writing Functions and Scripts
5.1 Functions
5.1.1 Function Arguments
5.1.2 Global versus Local Variables
5.1.3 Return Values
5.1.4 Creating and Editing Functions
5.2 Scripts and Shell Scripts
5.2.1 Line-by-Line Parsing
5.3 Error Handling and Debugging
5.3.1 Debugging Functions
5.3.2 Issuing Error and Warning Messages
5.3.3 Catching and Processing Errors
5.4 Interacting with the Operating System
5.4.1 File and Directory Handling
5.4.2 Environment Variables
5.5 Speeding Things Up
5.5.1 Profiling
5.5.2 Vectorizing Functions
5.5.3 Other Techniques to Speed Things Up
5.6 Chapter Summary and Critical Data Handling Tools
5.6.1 Programming Style
5.6.2 Common Bugs
5.6.3 Objects, Classes, and Methods
6 Getting Data into and out of R
6.1 Reading Tabular ASCII Data into Data Frames
6.1.1 Files with Delimiters
6.1.2 Column Classes
6.1.3 Common Pitfalls in Reading Tables
6.1.4 An Example of When read.table() Fails
6.1.5 Other Uses of the scan() Function
6.1.6 Writing Delimited Files
6.1.7 Reading and Writing Fixed-Width Files
6.1.8 A Note on End-of-Line Characters
6.2 Reading Large, Non-Tabular, or Non-ASCII Data
6.2.1 Opening and Closing Files
6.2.2 Reading and Writing Lines
6.2.3 Reading and Writing UTF-8 and Other Encodings
6.2.4 The Null Character
6.2.5 Binary Data
6.2.6 Reading Problem Files in Action
6.3 Reading Data From Relational Databases
6.3.1 Connecting to the Database Server
6.3.2 Introduction to SQL
6.4 Handling Large Numbers of Input Files
6.5 Other Formats
6.5.1 Using the Clipboard
6.5.2 Reading Data from Spreadsheets
6.5.3 Reading Data from the Web
6.5.4 Reading Data from Other Statistical Packages
6.6 Reading and Writing R Data Directly
6.7 Chapter Summary and Critical Data Handling Tools
7 Data Handling in Practice
7.1 Acquiring and Reading Data
7.2 Cleaning Data
7.3 Combining Data
7.3.1 Combining by Row
7.3.2 Combining by Column
7.3.3 Merging by Key
7.4 Transactional Data
7.4.1 Example of Transactional Data
7.4.2 Combining Tabular and Transactional Data
7.5 Preparing Data
7.6 Documentation and Reproducibility
7.7 The Role of Judgment
7.8 Data Cleaning in Action
7.8.1 Reading and Cleaning BedBath1.csv
7.8.2 Reading and Cleaning BedBath2.csv
7.8.3 Combining the BedBath Data Frames
7.8.4 Reading and Cleaning EnergyUsage.csv
7.8.5 Merging the BedBath and EnergyUsage Data Frames
7.9 Chapter Summary and Critical Data Handling Tools
8 Extended Exercise
8.1 Introduction to the Problem
8.1.1 The Goal
8.1.2 Modeling Considerations
8.1.3 Examples of Things to Check
8.2 The Data
8.3 Five Important Fields
8.4 Loan and Application Portfolios
8.4.1 Layout of the Beachside Lenders Data
8.4.2 Layout of the Wilson and Sons Data
8.4.3 Combining the Two Portfolios
8.5 Scores
8.5.1 Scores Layout
8.6 Co-borrower Scores
8.6.1 Co-borrower Score Examples
8.7 Updated KScores
8.7.1 Updated KScores Layout
8.8 Loans to Be Excluded
8.8.1 Sample Exclusion File
8.9 Response Variable
8.10 Assembling the Final Data Sets
8.10.1 Final Data Layout
8.10.2 Concluding Remarks
A Hints and Pseudocode
A.1 Loan Portfolios
A.1.1 Things to Check
A.2 Scores Database
A.2.1 Things to Check
A.3 Co-borrower Scores
A.3.1 Things to Check
A.4 Updated KScores
A.4.1 Things to Check
A.5 Excluder Files
A.5.1 Things to Check
A.6 Payment Matrix
A.6.1 Things to Check
A.7 Starting the Modeling Process
Bibliography
Index
Preface
Statisticians use data to build models, and they use models to describe the world
and to make predictions about what will happen next. There have been a large
number of very good books that describe statistical modeling, but these modeling
efforts usually start with a set of “clean,” well-behaved data in which nothing
is missing or anomalous.
In real life, data is messy. There will be missing values, impossible values,
and typographical errors. Data is gathered from multiple sources, leading to
both duplication and inconsistency. Data that should be categorical is coded
as numeric; data that should be numeric can appear categorical; data can be
hidden inside free-form text; and data can be in the form of dates in a wide
number of possible formats. We estimate that 80% of the time taken in any
data analysis problem is taken up just in reading and preparing the data. So, any
analyst needs to know how to acquire data and how to prepare it for modeling,
and the steps taken should be automatic, as far as possible, and reproducible.
This book describes how to handle data using the R software. R is the most
widely used software in statistics, and it has the advantage of being free,
open-source, and available on every major computing platform. Whatever
software you use, you will find yourself facing the issues of acquiring, cleaning,
and merging data, and documenting the steps you took. We hope this book
will help you do these things efficiently.
Sam Buttrey and Lyn Whitaker
Monterey, California, USA
November 30, 2016
About the Companion Website
Don’t forget to visit the companion website for this book:
www.wiley.com/go/buttrey/datascientistsguide
There you will find valuable material designed to enhance your learning,
including:
A complete listing of all the R code in the book
Example datasets used in the exercises
1 R
1.1 Introduction
This book focuses on one problem that is common to almost every statistical
problem – indeed, to almost any problem involving any sort of analysis. That
problem is acquiring and preparing the data. Across our many years of data
analysis, we have learned that seemingly 80% of our time – maybe more – goes
into the data preparation steps (a belief echoed by others such as Dasu and
Johnson, 2003). Collectively, we call these actions data cleaning, although,
as we will discuss later, we sometimes use that term for something a little
more specific. Regardless of the name, almost any analysis requires that you
(i) acquire the data, that is, read it into the computer program; (ii) clean the
data, that is, identify entries that are duplicated or clearly erroneous or
anomalous, and take other preparation steps (e.g., combining entries such as
“Female,” “female,” and “F”); (iii) merge data from different sources; and
(iv) prepare the data for modeling, which might involve dividing a set of
numeric values into subsets, combining states into regions, and so on. This book
discusses some approaches for accomplishing these four steps in the R language
(R Core Team, 2013). A fifth problem, which receives less emphasis, is the
problem of long-term curation of the data. Which parts of the data must be saved
and in what way? We address that question by reference to the idea of
reproducible research, which we discuss later in this chapter, and later in the
book as well.
1.1.1 What Is R?
R is a computer program that lets you analyze data. By “analyze” we mean, first,
read the data into the program and then operate on it – drawing graphs and
charts, manipulating values, fitting statistical models, and so on. (Notice that
we prefer to call data “it” rather than “them.” We discuss this choice briefly
toward the end of the chapter.) R is both a statistical “environment” and also
a programming language, and it is very widely used both in commercial and
academic settings. R is free and open-source and runs on Windows, Apple, and
Linux operating systems. It is maintained by a group of volunteers who release
bug fixes and new features regularly.
1.1.2 Who Uses R and Why?
R started as a tool for statisticians, evolving from a language called S that
was created in the 1970s. Today, R remains the primary language of academic
statisticians, and it has a prominent place among analysts in business
and government as well. It is used not only for building statistical models
but also for handling and cleaning data, as in this book, and for developing
new statistical methods, building simulations, visualizing data, and generally
for all the data-handling tools the statistician and the data scientist require.
Because of the ease with which users can develop and distribute new methods,
R has also become the tool of choice in certain fast-growing fields such as
biostatistics and genetics. Articles on “surveys of the top tools used by data
scientists” inevitably name R as one of the important tools with which data
scientists, as well as statisticians, should be familiar. Moreover, R’s popularity is
such that there are extensions to R (see “packages” in Section 1.4.4) that allow
you to connect to other programs such as the Python and Java languages, the
H2O machine-learning system, the ArcGIS geographical information system,
and many more.
1.1.3 Acquiring and Installing R
The primary way to acquire R is to download it from the Internet. The main
website for R is www.r-project.org, and the www.cran.r-project.org
page (“CRAN” standing for “Comprehensive R Archive Network”) is
where you can download R itself. There are in fact dozens of “mirror” sites for
CRAN – that is, websites that are essentially copies of the CRAN site – so as
to reduce the load on the CRAN site. You can probably find a mirror near you
on the “mirrors” page. After you download R, install it in the way you would
normally install a program on your operating system.
At any one time, users around the world will be running slightly different
versions of R, since new ones are released fairly frequently. For example, at this
writing the current version of R was called 3.3.2, but many users are still using
3.2 or earlier versions. This will almost never cause problems, but it is a good
idea to update your version of R from time to time.
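To see which version of R you are running, you can ask R itself. The following is a minimal sketch; the object name is_recent is chosen here for illustration, and the threshold "3.3.2" is simply the version mentioned above.

```r
# A character string describing the running version of R.
R.version.string

# getRversion() returns a version object that can be compared with a string.
is_recent <- getRversion() >= "3.3.2"
is_recent
```

If is_recent is FALSE, your installation is older than the version that was current at this writing.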
There are also several slightly different versions of R distributed other than at
CRAN. Microsoft R Open is a particular version of R that uses a different set
of math libraries intended to make certain computations faster. Like “regular”
R, Microsoft R Open is free, although it does not run on OS X. Other versions
of R are intended to communicate with relational databases or with other
big-data platforms. For this book, we will assume you are running “regular”
R – but in any case for our purposes all versions of R should behave exactly the
same way.
1.1.4 Starting and Quitting R
The way you start R depends on your operating system. Normally double-clicking
on an R icon will be enough to get R started. In the command-line
interface of many Linux systems, or using the OS X terminal window, it
may be enough just to type the upper-case letter R (or, for Windows command
lines, Rgui). When R has started, you will see the command prompt >. This is
the R console, the place where commands are entered. At this point, you can
start typing commands to R. When it comes time to quit R, you can either
“kill” the window in the usual way (for OS X, the red dot, the lightswitch in the
top right, or via the File dialog; for Windows, the red X or File dialog) or you
can type the q() command. In either case, R will then ask you if you want to
“Save workspace image.” If you answer “yes” to this question, R will save to the
disk any changes you made during the current session, whereas if you answer
“no,” R will return its workspace to the condition it was in when R was last
started. We almost always want to answer “yes” to this question!
1.2 Data
Data is information about the elements of whatever problem we are investigating.
Data comes in many forms, but for our purposes it will always be presented
in a set of computer-ready values. For example, a database concerning birds
might include text about the habits of the birds, numbers giving lengths and
weights of the individuals, maps showing migration patterns, images showing
the birds themselves, sound recordings of the birds’ calls, and so on. Although
they look very different, all of these different pieces of information can be
represented in the computer in digital form in one way or another. In this example,
one of our primary tasks might be to ensure that each bird’s description is
correctly matched with the correct map, image, and song file. Our data analysis
projects rarely include data quite so disparate, but in almost every case we need
to acquire data, clean it (a process we start to describe in what follows and
continue throughout the book), and prepare it for modeling, and in almost every
case we expect our data to consist of both numeric and textual values.
1.2.1 Acquiring Data
The first step in a data analysis project, of course, is to get the data into R where
it can be manipulated. We are old enough to remember the days when this
involved typing all the data from the back of a book or journal paper into a
statistics package by hand, but happily this is not necessary today. On the other
hand, data now comes in a variety of formats, few of which were created with
the convenience of the data scientist in mind. In Chapter 6, we describe some
of these common formats and how to use R to read data effectively.
1.2.2 Cleaning Data
We “clean” data when we detect (and, in many cases, remove) anomalies.
Anomalies will very often be missing values, but they might also be absurd
ones, as when people’s ages are reported as 999 or -1. Sometimes, as in our
earlier example, we might have genders reported as “Female,” “female,” and “F”
and we want to combine these three values. In the cleaning process we might
learn, for example, that one data source produced no data at all in August 2016;
this sort of fact will need to be brought to the attention of the data provider.
The data cleaning process also involves merging data from different sources,
extracting subsets or reshaping the data in some way. All in all, data cleaning
is the process of turning raw data, received from one or more providers, into a
data set that can be used in visualization, modeling, and decision-making.
In practice these steps are iterative. Our cleaning process not only informs the
modeling, but it sometimes leads us to re-acquire the data in a different, more
usable form. Similarly, insights from modeling will often lead us to prepare the
data in a new and more revealing way – because it is when we model that we
often discover anomalies or other interesting attributes of the data.
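As a minimal sketch of the kind of cleaning described in this section, the following code flags absurd ages as missing and combines the different spellings of a gender label. The data frame people, its column names, its values, and the plausibility limits (below 0 or above 120) are all invented here for illustration.

```r
# Hypothetical raw data; the column names and values are invented for illustration.
people <- data.frame(
  age    = c(34, 999, 27, -1, 51),
  gender = c("Female", "female", "F", "Male", "M"),
  stringsAsFactors = FALSE
)

# Treat impossible ages as missing. The limits 0 and 120 are an assumption.
people$age[people$age < 0 | people$age > 120] <- NA

# Combine the different spellings of each gender into one label.
people$gender[people$gender %in% c("Female", "female", "F")] <- "Female"
people$gender[people$gender %in% c("Male", "M")] <- "Male"

table(people$gender)     # counts for each cleaned label
sum(is.na(people$age))   # number of ages flagged as missing
```

Replacing an absurd value with NA, rather than deleting the row, keeps the rest of that record available and leaves a visible trace of the anomaly.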
1.2.3 The Goal of Data Cleaning
What a “clean” data set should look like depends on what your goals are. One
useful perspective is given by Wickham (2014), who describes what he calls
“tidy” data. A tidy data set is rectangular (or tabular); each row describes one
unit of analysis (an observation), and each column gives one measurement (a
variable). For example, in a data set giving measurements about people, each
row would concern itself with a person, and the columns might give height,
weight, age, blood type, and so on.
In some problems, it is not immediately clear what the unit of analysis is.
For example, imagine data that describes the locations of boats over the course
of a month, as recorded by GPS. For some purposes, a “tidy” data set would
have one row per GPS ping, each row giving a ship identifier, a location, and
a time. For other purposes, we might prefer a data set with one row per boat,
each row giving the southernmost point that ship reaches, or perhaps giving a
binary indicator of whether the ship did, or did not, spend time in international
waters. Some data – images and sound, for example – do not lend themselves
to this “tidy” approach.
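The two layouts in the boat example can be sketched as follows. The data frame pings, its column names, and all of its values are hypothetical; the point is only to show one row per GPS ping being reduced to one row per boat.

```r
# Hypothetical GPS pings: one row per ping (all values invented for illustration).
pings <- data.frame(
  boat = c("A", "A", "A", "B", "B"),
  lat  = c(36.6, 35.9, 36.2, 33.1, 34.0),
  lon  = c(-121.9, -122.4, -122.0, -118.5, -119.2),
  time = as.POSIXct(c("2016-08-01 06:00", "2016-08-01 12:00", "2016-08-02 06:00",
                      "2016-08-01 06:00", "2016-08-01 12:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# One row per boat: the southernmost (smallest) latitude each boat reached.
southernmost <- aggregate(lat ~ boat, data = pings, FUN = min)
names(southernmost)[2] <- "min_lat"
southernmost
```

Both pings and southernmost are “tidy” in the sense above; they simply use different units of analysis.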
The exact layout of your final data will depend on what you plan to do with
it – and in some cases this won’t be known until after you have operated on
the data.
1.2.4 Making Your Work Reproducible
It is vital that other people be able to reproduce the actions you took on your
data. Ideally, you or another analyst should be able to start with your raw data,
run all the steps you applied to it, and emerge with exactly the same clean,
prepared data sets. This will be useful to you when you encounter a situation similar
to the one in the previous paragraph, where the form of the new data needs to
be designed. But it is even more important for another analyst, since if you
or another analyst can reproduce your results there will be no disagreement
about the data. The act of making research reproducible has, in recent years,
been rightfully recognized as a cornerstone of scientific progress. Record and
document every step you take so that others can repeat them.
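One simple way to work toward this, sketched here under our own assumptions, is to keep every cleaning step in a function stored in a script, so that the raw data is never edited by hand. The function name clean_ages, the data frame raw, and the age limits are hypothetical.

```r
# All cleaning steps live in one function; the raw data itself is left untouched.
clean_ages <- function(raw) {
  out <- raw
  out$age[out$age < 0 | out$age > 120] <- NA   # assumed plausibility limits
  out
}

# Hypothetical raw data, invented for illustration.
raw <- data.frame(id = 1:4, age = c(34, 999, -1, 51))

first_run  <- clean_ages(raw)
second_run <- clean_ages(raw)

# Running the same steps on the same raw data gives the same result.
identical(first_run, second_run)
```

Saving the output of sessionInfo() alongside such a script also records which version of R and which packages were in use.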
1.3 The Very Basics of R
This book is about handling data in R. It cannot teach you the very basics of R in
detail – although, happily, there are many good books and online resources that
can. (We give a few examples at the end of this chapter.) In this section, we list a
few of the most basic facts about R, but, again, this book is not intended to teach
you R. Rather, we focus on the details of R and of the way data is represented
in R, in order to help you understand some of the ways to acquire, clean, and
handle data inside R.
1.3.1 Top Ten Quick Facts You Need to Know about R
In this section, we give a few of the most important facts about R a beginner
needs to know. There will be more detail on these facts later in the chapter and
throughout the book.
1) The prompt is (by default) >. If you leave a command incomplete, maybe
because there is an unclosed parenthesis or quotation mark, R gives you
the continuation prompt, which is +. The Esc key (Windows) or control-C
(other systems) produces the break command, which will take you back to
the regular prompt. In this example, we show what a completed command
looks like – in this case, R is computing the value of 3 divided by 2.
> 3/2
[1] 1.5
Here, R produced the prompt (>), and we typed 3/2 and pressed the
Enter (or “Return”) key. R then produced the output. We will talk about
the [1] part in Chapter 2, but the computed value of 1.5 is shown. In
the following example, we show what happens when we press Enter after
typing the slash character:
> 3/
+ 2
[1] 1.5
Here, since the expression on the first line was incomplete, R produced the
continuation prompt, +. When we typed 2 and hit Enter, the expression
was complete and the result was shown. In case of confusion, press break
until the original > prompt is showing.
In examples in this book where we want to show the R output, we also
show the > prompt in front of our code. Remember that > is produced by
R; you don’t need to type that yourself. (At the end of the chapter, we tell
you where you can get all the code from the book in electronic form.)
2) R is case-sensitive, which means that upper- and lower-case letters
are different in R. For example, the built-in R object LETTERS gives
all 26 upper-case letters. A different item called letters contains the
lower-case versions of the alphabet. There is no built-in object called
Letters.
3) Show an object by typing its name. For example, if you type ls by itself,
you see the contents of the function whose name is ls, the one that lists all
the objects in your workspace (which we define later). To actually run the
function and see the objects, you need to type the function’s name together
with parentheses. In this case, list your objects by typing ls().
4) Get help for a function or object named thing with the command
help(thing) or ?thing. For example, to see the help for the
ls() function, type help(ls). If you don’t know the name, try
help.search() with a relevant word in quotation marks; for example,
try help.search("matrices") to see functions that handle
matrices.
5) Assign a value or object to a name with the left-arrow (less-than plus
hyphen): for example, the command a <- 1 creates a new object named
a with value 1. (You can also assign with a command such as a = 1,
but we don’t recommend it.) The assignment will over-write any existing
object named a you might have had. Once you create an object, it is in
your “workspace,” and your workspace can be saved when you quit. So
unless your computer crashes, when you create an object it will persist
until you delete it. Display the set of objects in your workspace with
objects() or ls(); remove an object with remove() or rm(). Not
every character is permitted in the name of an R object. Start a name
with a letter or a dot, and then stick to numbers, letters, underscores,
and dots. Names cannot contain spaces. In this example, we show some
assignments that succeed and some that do not.
> a <- 1
> a.1 <- 1
> 2a <- 1
Error: unexpected symbol in "2a"
> a 2 <- 1
Error: unexpected numeric constant in "a 2"
The first two of these assignments succeed, because a and a.1 are valid
names. The last two fail because they refer to invalid names.
6) The comment character is #. A comment ends at the end of the line. If you
want a comment to span multiple lines, you need to start each comment
line with #.
7) Recall earlier commands with the up-arrow. You can edit an earlier
command and then press the Enter key to run the new version. The
history() command shows a list of your recent commands; put a
number in (as in history(500)) to see more.
8) When referring to file names, R itself uses the forward slash in the console.
The Windows file system uses the backward slash, so Windows users may
use that, too, but in that case you have to type \\ (we talk more about
this later on). For example, a Windows user who wants to access a file
named c:\temp\mycode.R in an R command will need to type either
c:/temp/mycode.R or c:\\temp\\mycode.R. You’ll need to use a
regular, single backslash if you are interacting with the Windows operating
system and not R – if, for example, you are presented with a graphical
“select file” window. The file systems for OS X and Linux users use the
forward slash at all times.
9) Just about any function you want is built into R, so R makes an excellent
calculator. For example,
> sin(log(34))
[1] -0.375344
This says that the sine (using radians) of the logarithm (base e) of 34 is
-0.375344. Most functions allow you to specify “arguments,” values you
pass to the function to modify its behavior. Some must be specified; others
have default values. For example, log(34, 10) produces the base 10
logarithm instead of the natural logarithm. If a function accepts multiple
arguments, you will need to specify them in the proper order – or by name.
In this example, the arguments to log are named x and base (see the
help at ?log), so we could have entered log(base = 10, x = 34)
too.
10) R’s operators include the comparison operators != for “not equal,” == for
“is equal to,” <= and >= for “less than or equal to” and “greater than or equal
to,” and the arithmetic operators * for “multiplied by” and ^ for “raised to
the power of.”
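Several of the facts above can be tried together in a short session. The following is only a sketch; the object names a, a.1, by_position, and by_name are chosen here for illustration.

```r
# Fact 2: names are case-sensitive, so these are two different objects.
LETTERS[1:3]
letters[1:3]

# Fact 5: assign with <-, list the workspace with ls(), remove with rm().
a <- 1
a.1 <- 2
"a" %in% ls()   # is an object named a in the current workspace?
rm(a.1)         # a.1 is removed; a remains

# Fact 9: arguments can be matched by position or by name.
by_position <- log(34, 10)
by_name     <- log(base = 10, x = 34)
identical(by_position, by_name)

# Fact 10: comparison and arithmetic operators.
3 != 2
2^10
```

Lines beginning with # are comments (fact 6), so the whole block can be pasted into the console as it stands.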
1.3.2 Vocabulary
As we get started, it will be worthwhile for us to repeat some of the vocabulary
of R, and of data, that you should be familiar with. In this section, we define
some of the terms that are commonly used in discussion of R, both in this book
and elsewhere.
vector A vector is the simplest piece of data in R. It consists of one or more
entries (also called “items” or “elements”) that are all either text or all
numbers or all “logical” (i.e., TRUE or FALSE). (Technically, a vector might have
length 0, and there are some other types, but that last sentence covers 99%
of what you will do with R.) For example, the value of the famous constant 𝜋
is built into R as the object pi, and the R object pi is a numeric vector with
length 1. We talk about vectors in Chapter 2.
matrix A matrix is just a two-dimensional vector in rectangular shape. While
matrices are important in statistics, they are less important in the data
cleaning process. Still, it is useful to know about matrices in preparation for using
data frames (below). We discuss matrices at the start of Chapter 3.
list A list is an R object that can hold other R objects. Lists are everywhere in
R and you will need to know how to create them and access their elements.
We discuss lists starting in Section 3.3.
data frame A data frame is a cross between a matrix and a list. Like a matrix, it
is rectangular, but like a list it can contain items of different sorts – numeric,
text, and so on – as its columns. You can think of a data frame as a list of
vectors all of which are the same length. Most of the data we encounter will
be in the form of data frames, and, if it isn’t, we will usually try to put it into
a data frame. We talk about data frames starting in Section 3.4.
object An object is a general word for anything in R. Usually, we will use this to
refer to data objects such as vectors, matrices, lists, or data frames, but we
might use “object” to refer to a function, a file handle, or anything else with
a name in R.
rows and columns A data frame and a matrix are two-dimensional rectangular
objects, consisting of rows and columns. Our goal, in a data cleaning
problem, is almost always to produce one or more data frames whose rows
correspond to the things being measured, and whose columns give the
different measurements. For example, in a military manpower problem each
row might represent a soldier, and the columns would give measurements
such as age, sex, rank, and years in service. Statisticians sometimes call
rows and columns “observations” and “variables” (although that second
word has another meaning in R, see the following discussion). Confusingly,
other terms exist too: authors in machine learning talk of “instances” (or
“entities”) and “attributes” (“features”). We will use “rows” and “columns”
when the emphasis is on the representation of the data in a data frame, and
“observations” and “variables” when the emphasis is on the role being played
by the data.
variable A variable is also a generic term for an R object, especially one of
the objects in our workspace. The name is slightly misleading because the
object’s value doesn’t have to change. We would call pi a “variable,” at least
in casual conversation.
operator An operator describes an action on one or two objects – often
vectors – and produces a result. For example, the * operator, placed between two
numbers, produces their product. Most operators act on two things – we say
they are “binary.” The + and - operators can also be “unary,” meaning they
act on one number. So in the expression -3, the - is a unary operator.
Operations are often “vectorized,” meaning they act separately on each item of a
vector.
function A function is a kind of R object that can take an action. Functions
often accept arguments to control the computations they make, and pro-
duce “return values,” the results of the computation. For example, the cos()
function takes as its one argument the size of an angle, in radians, and pro-
duces, as its return value, the cosine of that angle. So typing cos(1) invokes
a function and produces a value of about 0.54. Operators are functions, too,
although they don’t look like it. For example, you can multiply two numbers
by calling the * function explicitly with two arguments, though you’ll need
quotation marks; "*"(3, 4) operates * on 3 and 4 and produces 12. Functions
are covered in detail in Chapter 5.
expression An expression is a legal R “phrase” that would produce an action if
you entered it into R. For example, a <- 3 is an expression that, if evaluated,
would cause an item a to be created and given the value 3. That expression
is called an assignment. pi > 3 is an expression that would produce TRUE,
since the number pi is greater than 3. This is an example of a comparison.
Just typing 2 is also an expression; the system interprets this as being the
same as print(2), and prints out the value 2. Most expressions involve the
use of functions or operators, as well as R variables.
command We often use the word “command” as a casual shortcut to mean
“function,” “operator,” or “expression.” For example, we might say “use the
help command” instead of “run the help function.”
script A script is a text file that can list R commands. We use script files in all
of our projects and we recommend that you do, too. We discuss scripts in
Chapter 5.
workspace e workspace is the set of objects (data and functions) in our cur-
rent environment. ese are objects we have created.
working directory e working directory is the folder on your computer where
your R data is stored. By default, R will look in this directory for any exter-
nal files you might ask for. We talk more about the working directory in the
following section.
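Several of these terms can be seen together at the prompt. The following is a minimal sketch rather than anything definitive; the object names a, angle_cosine, and product are our own choices for illustration.

```r
# An assignment expression: creates the object a and gives it the value 3
a <- 3

# A comparison expression
pi > 3                   # TRUE

# A function call with one argument (an angle in radians)
angle_cosine <- cos(1)   # about 0.54

# Operators are functions too: these two lines compute the same product
3 * 4
product <- "*"(3, 4)     # 12

# A unary operator, and a vectorized operation on each element of a vector
-a
c(1, 2, 3) * 2           # 2 4 6
```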
With this vocabulary in mind it is easier to discuss some of the ways that R
operates. As an example, it’s not always obvious what the different operators
in R will do in weird cases. We know that 3 < 10 is TRUE. What is the value
of 3 < "10"? The answer is FALSE. R cannot compare a number to a character,
so it converts both values into characters. Then the comparison is made
alphabetically. So just as "Apple" < "Banana" is TRUE because "Apple"
comes first in alphabetical order, so too does "10" come before "3" – since,
as always, we compare the initial characters first, and the 1 character precedes
the 3 character in our computer’s sorting system. We talk much more about
the different types of data in R, and converting between them, in Chapter 2.
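The comparison just described can be tried directly. This is a small sketch; the names mixed and fixed are ours, and the explicit conversion with as.numeric() is one possible remedy, not the only one.

```r
3 < 10                          # TRUE: both sides are numbers

# The number 3 is converted to the character "3", and "1" sorts before "3"
mixed <- 3 < "10"               # FALSE
"10" < "3"                      # TRUE, for the same reason

# Converting the text to a number first restores the numeric comparison
fixed <- 3 < as.numeric("10")   # TRUE
```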
Another example of unexpected behavior has to do with the way R reads
commands typed in at the command line. We saw that the command a <- 3
assigns the value 3 to an object a. However, what happens when you type
a < -3, with a space between < and -? The answer is that R attaches the
hyphen to the value 3, and then compares the value of a to the number -3. In
general, spaces will not affect your R commands – but in this case the space
“broke” the assignment operator <-.
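A short sketch of the difference follows. The name still_three is ours and is there only to record the value of a after the comparison.

```r
a <- 3              # assignment: a now holds 3
a < -3              # comparison: is a less than -3? FALSE
still_three <- a    # the comparison did not change a

a<-5                # with no spaces at all, this is again an assignment
a                   # 5
```

Putting spaces on both sides of <- makes the intent clear to a reader, too.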
R objects have names and names have to conform to a small set of rules. If
data is brought in from outside R, perhaps from a spreadsheet, names will be
changed if they need to be made valid (details can be seen in the help for the
make.names() function). Technically it is possible to force R to use invalid
names, but don’t do that. A few names in R are reserved, meaning they cannot
be used as the name of an R variable. For example, you cannot name an object
TRUE; that name is reserved. (You may name an object T, because that name
isn’t reserved, but we don’t recommend it.) It is also wise to try to avoid giv-
ing an object the name of an existing R function (although there are lots of R
functions and some are obscure). If you name a vector sum, and then use the
sum() function to add things up, R will be smart enough to differentiate your
vector from the system’s function. But if you create a function called sum() in
your workspace, R will use that one (since your function will appear first on the
search path; see “search path” in Section 1.4.1). This is almost never what you
want. The R functions c() and t() provide good examples of names to avoid.
Finally, R can operate in an “object-oriented” way. A number of R functions
are “generic,” meaning that they have specific methods to handle specific data types.
For example, the summary() function applied to a numeric vector gives some
information about the values in the vector, but the same function applied to
the output of a modeling function will often give summary statistics about the
model. e exact action that the generic function takes depends on the “class”
(i.e., the type) of the object passed to it. We run across a few of these generic
functions in the following few chapters and discuss object-oriented programming
briefly in Section 5.6.3.
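As an illustration of a generic function, the sketch below applies summary() to a numeric vector and to a fitted model. It assumes the cars data set that ships with R; the names nums, fit, and fit_summary are ours.

```r
nums <- c(2, 4, 4, 10)
summary(nums)                  # quartiles, mean, and range of the values

# A simple linear model fitted to the built-in cars data
fit <- lm(dist ~ speed, data = cars)
class(fit)                     # "lm"

# The same generic, but a different method is chosen because of the class
fit_summary <- summary(fit)
class(fit_summary)             # "summary.lm"
```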
1.3.3 Calculating and Printing in R
R performs calculations and prints results. In this section, we talk about some
of the differences between what R computes and what it prints, as well as how
text data is represented.
Floating-Point Error
This is a good place to discuss an issue that arises in a lot of data cleaning
problems and has caught us and our students off-guard more than once. For almost
all computations, R uses “double-precision floating-point” arithmetic, as most
other systems do. What this means is that R can represent numbers up to about
±1.79 × 10^±308 with at least some accuracy. However, double precision is not
exact. Consider this example, in which we multiply together the numbers (1/49)
and 49.
> 1/49 * 49
[1] 1 # as expected
> 1 - (1/49 * 49)
[1] 1.110223e-16
> (49 * 1/49) == (1/49 * 49) # should be TRUE
[1] FALSE
The first computation shows the “expected” product of (1/49) and 49 – the
value 1. In fact, though, the second computation shows that this product
is not exactly 1; it differs from 1 by a tiny amount that we might call
“floating-point error.” That amount was so small that it wasn’t displayed in the
first computation, according to R’s default display conditions. (The command
print(1/49 * 49, digits = 16) will reveal that this product is
computed as a number very slightly less than 1.) This is not a bug in R; it’s a
statement about the way double-precision floating-point arithmetic works,
analogous to the way that in ordinary arithmetic, the number 0.333333
is not quite 1/3. The final computation shows the practical effect of this: if
you compare two floating-point values directly, they might be recorded as
being different just because of floating-point error. You will need to be aware
of this when you compare the results of doing the same computation in two
different ways.
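One common way to guard against this is to compare with a tolerance instead of using == directly. The sketch below uses base R's all.equal(); the names x, y, same, and close_enough are ours, and the threshold 1e-8 is an arbitrary choice for illustration.

```r
x <- 1/49 * 49
y <- 49 * 1/49

x == y                                # FALSE, because of floating-point error

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a description, not FALSE, when the values differ
same <- isTRUE(all.equal(x, y))       # TRUE

# An explicit tolerance works as well
close_enough <- abs(x - y) < 1e-8     # TRUE
```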
Significant Digits
In the above-mentioned example, we saw how R printed 1 even though
the number in question was slightly different. While R’s computations use
double-precision floating point, its display will generally print a smaller
number of digits than are available. Moreover, R formats outputs in a neat
way, so that typing 2.00 produces 2, but typing 2.01 prints out as 2.01.
These formatting choices are most noticeable when many values are being
shown. The display that R chooses does not affect the precision with which
it does calculations. Of course you can force R to round off the results of
its calculation; we discuss formatting, rounding, and scientific notation in
Chapter 4.
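The distinction between the stored value and the displayed value can be seen in a short sketch. The names x, shown, and rounded are ours.

```r
x <- 1/49 * 49
print(x)                         # displays 1 under the default settings
print(x, digits = 16)            # shows more of the stored value

# Values printed together share a common format
shown <- format(c(2.00, 2.01))   # "2.00" "2.01"
format(2.00)                     # "2" when shown on its own

# Rounding, unlike display, changes the value itself
rounded <- round(2/3, 2)
```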
Character Strings
We will spend a lot of time in this book handling text or character data, data
in the form of letters such as "Oakland" or "Missing". Sometimes, as is
common, we will call a set of characters a string. In R, strings are enclosed by
quotation marks, and either the double-quotation mark " or the single one '
can be used. A string delineated by single-quotation marks is converted
into the other kind. The two kinds of quotation marks make it possible to
insert a quote into a string, such as this: "She said 'No.' " (If you
typed "She said "No." ", you would see R produce an error.) If you type
'She said "No." ', the outside quotes are converted to double quotes.
Then, since there are double quotes on the inside, too, those interior quotation
marks are “protected” by preceding them with the backslash character. The
result is converted into "She said \"No.\" "
This idea of “protecting” certain special characters goes beyond quotation
marks. The character that marks the end of a line of text is called “new-line” and
is written as \n, backslash followed by n. Typing this character requires two
keystrokes, but it counts as only one character. In general, special characters
are “protected” by the backslash character. Besides the quotation mark and the
new-line, the important special characters are \t, the tab, and \\, the backslash
itself. The backslash also serves to introduce strings in special formats, such as
hexadecimal (e.g., "\xb1" produces the character with hexadecimal value b1,
which displays as the plus-minus sign, ±) or Unicode (e.g., "\U20ac" uses
Unicode to display the Euro currency symbol). We talk much more about text
in general and Unicode in particular in Chapter 4.
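The quoting and escaping rules can be checked with a few lines. This is a minimal sketch; the names s1 and s2 are ours.

```r
# Single-quoted and escaped double-quoted forms give the same string
s1 <- 'She said "No." '
s2 <- "She said \"No.\" "
identical(s1, s2)        # TRUE

print(s1)                # print() shows the protecting backslashes
cat(s1, "\n")            # cat() shows the characters themselves

# An escape sequence takes two keystrokes but counts as one character
nchar("\n")              # 1
nchar("a\tb")            # 3
cat("one\\two\n")        # displays a single backslash
```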
1.4 Running an R Session
Once you start using R you may find yourself using it for lots of different
projects. Although this is partly a matter of taste, we find it useful to keep
separate sets of data for separate projects. In this section, we describe where R
keeps your data, and some other aspects of R with which you will need to be
familiar.
1.4.1 Where Your Data Is Stored
When you start R, you start it in a working directory, and this directory forms
the starting point for where R looks for, and stores, data. For example, typing
list.files() will list all of the files in your working directory. When you
quit R and save the workspace, a file with all of your R objects will be created
in that same directory. This file is named .RData. The leading dot in the name
is important, because some terminal programs, such as the “bash” command
interpreter, do not by default list files whose names start with a dot. We don’t
recommend changing the name of the .RData file.
This provides a natural mechanism for project management. To prepare for a
new project on a system with a command-line interface, just create a new direc-
tory and start R from there (see “starting R” above). On systems with desktop
icons, copy an existing R icon, edit the properties to point to the new directory,
and add the project name to the icon. The details of this operation will depend
on your operating system. In this way, you can keep the different .RData files
for your different projects separate.
When you start R, it will use an existing .RData file if there is one in the
working directory, or create a new, empty one if there is not. Often we have a
certain number of objects from earlier projects that we want in the new project.
There are two mechanisms for acquiring those existing R objects. In one case,
we literally copy all the objects from another .RData in a different project’s
directory into the existing workspace, using the load() function. This can
be dangerous because objects being copied will over-write existing ones with
the same names. A second mechanism uses attach(), which puts the other
.RData on the “search path.” The search path is a list of places where R looks
for objects when you mention them. You can examine your current search path
with the search() command. The first entry on the search path is the current
.RData file (although it carries the confusing name .GlobalEnv); most of
the other entries on the search path are put there by R itself. When you use a
name such as pi, R looks for that object in your workspace, and then in each
of the packages or directories named in the search path until it finds one by
that name. You can attach other .RData files anywhere in the search path,
except in the first position; usually we put them into position two so that they
are searched right after the local workspace. We talk more about getting data
into and out of R in Chapter 6.
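The two mechanisms can be compared in a small sketch. To keep it self-contained it writes a throwaway file with tempfile() instead of using a real project directory; the names demo_file, x_demo, found_on_path, and the label "demo_rdata" are all hypothetical.

```r
getwd()                          # the current working directory
head(list.files())               # some of the files found there
first_entry <- search()[1]       # ".GlobalEnv", the workspace

# Create a small saved-data file to stand in for another project's .RData
demo_file <- tempfile(fileext = ".RData")
x_demo <- 1:3
save(x_demo, file = demo_file)
rm(x_demo)

# Mechanism 2: attach the file in position two of the search path
attach(demo_file, pos = 2, name = "demo_rdata")
found_on_path <- exists("x_demo")    # TRUE, although it is not in the workspace
detach("demo_rdata", character.only = TRUE)

# Mechanism 1: copy the objects into the workspace, over-writing any
# existing objects that have the same names
load(demo_file)
x_demo
```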
1.4.2 Options
R maintains a list of what it calls “options,” which describe aspects of your inter-
action with it. For example, one option sets the text editor that R calls when
you edit a function, one describes how much memory is set aside for R, one
lets you change the prompt character from its default, and so on. Generally, we
find the default values reasonable, but the help for the options() function
describes the possible values and running options() shows you the current
ones. Changes to the options last only for this R session. Section 3.3.2 shows an
example of setting one of the options.
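As a brief sketch of the mechanism, the digits option (which controls how many significant digits are displayed) can be changed and then restored. The names old and shown are ours.

```r
getOption("digits")           # the current setting; 7 unless it was changed

# options() returns the previous values, which makes them easy to restore
old <- options(digits = 4)
print(pi)                     # 3.142
shown <- format(pi)           # "3.142"

options(old)                  # put the earlier setting back
```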
1.4.3 Scripts
Most of the work we do with R is interactive – that is, we issue commands
and wait for R’s response. This use of R is best when we are exploring data and
developing approaches to handling and modeling it. As we develop sets of com-
mands for a particular project, we can combine these into “scripts,” which are
simply files full of commands. Having a set of commands together allows us
to execute them in exactly the same way every time, and it allows us to add
comments and other notes that will be useful to us and to other users whom
we share the code with. This approach, while still interactive, is best when we
have developed an approach and want to use it repeatedly. Scripts also provide
a natural mechanism for project management: often we start with an empty
workspace and use scripts to populate the workspace by reading and preparing
data, loading from other sources, or attaching other directories, before starting
on the modeling steps.
R can also be run in batch mode – that is, it can start, run a single set of
commands, and then stop. This approach can be used when the same task needs
to be performed repeatedly, perhaps on different data – say, every day to process
data gathered overnight. We talk about scripts and batch use of R in Chapter 5.
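A script can be run from inside an R session with source(). The sketch below creates a tiny script in a temporary file so that it is self-contained; the file contents and the names script_file, ages, and mean_age are hypothetical.

```r
# Write a three-line script to a temporary file
script_file <- tempfile(fileext = ".R")
writeLines(c(
  "# Demo script: create some data and summarize it",
  "ages <- c(34, NA, 51, 29)",
  "mean_age <- mean(ages, na.rm = TRUE)"
), script_file)

# Run every command in the script, in order
source(script_file)
mean_age                      # 38
```

For batch use, the Rscript program that is installed with R can run a script file from the operating system's command line.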
1.4.4 R Packages
A package is a set of functions (and maybe data and other stuff too) that provides
an extension to R. R comes with a set of packages, some of which are
automatically placed onto the search path, and others of which are not. If a
package is present on your computer but not in your search path, you can access
(or “load”) it with the library() or require() command (these two differ
only in how they react if a package cannot be found). A package only needs to
be loaded once per R session, but when you re-start R you will need to re-load
packages. ere are also thousands of additional packages that have been con-
tributedbyRusersthatcanbefoundontheInternet,primarilyatthemain
repository at cran.r-project.org and its mirror sites. If your computer
is connected to the Internet, you can install a package (if you know its name)
with the install.packages() command. If that works, the package will
still need to be loaded with the library() command. If your computer is
not connected to the Internet, you can still install packages from a disk file if
one is available. Most of the code in this book requires no additional packages,
although we will point out cases where additional packages make
particular tasks easier, more efficient, or, in rare cases, possible.
It is possible to force certain packages to be loaded whenever you start R.
When we anticipate needing a package, our preference is to include a call to
library() or require() inside our scripts.
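The difference between library() and require() shows up when a package is missing. In this sketch the package name notARealPackage12345 is deliberately made up, and have_it and lib_result are our own names.

```r
library(stats)        # a package that comes with R; loading it again is harmless

# require() returns FALSE (with a warning) when the package cannot be found
have_it <- suppressWarnings(
  require("notARealPackage12345", character.only = TRUE, quietly = TRUE)
)
have_it               # FALSE

# library() stops with an error instead
lib_result <- tryCatch(
  library("notARealPackage12345", character.only = TRUE),
  error = function(e) "error"
)
lib_result            # "error"

# Installing needs an Internet connection, so it is left as a comment here
# install.packages("somePackageName")
```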
1.4.5 RStudio and Other GUIs
The “look” of R depends on your operating system. At its most basic – and we
often see this when we are connecting to remote servers – R consists only of a
command line. On the most popular platforms – Windows and OS X – running
R produces a graphical user interface, or GUI. This is a set of windows containing
a number of menu items giving selections, or buttons that help you
perform common tasks. Most of the GUI, though, consists of the console. A
few enhanced GUIs are available. Perhaps the most widely used among these
is RStudio (RStudio Team, 2015), a development environment that includes a
console window, a set of script window tabs, and better handling of multiple
graphics windows. RStudio comes in free and paid versions for all operating
systems and is available from its maker at rstudio.com. We have found that
many of our students prefer the more interactive, perhaps more modern feel of
RStudio to the standard R interface – but underneath, the R language is exactly
the same.
1.4.6 Locales and Character Sets
R is essentially the same program whether you run it on Windows, OS X, or
Linux. (ere are minor differences in the way you access external files and
in some low-level technical functions that will not be relevant in data clean-
ing.) In particular, R is an English-language program, so a “for” loop is always
indicated by for(). Speakers of many languages can arrange to have error
messages delivered in their language, if this ability is configured at the time R is
installed – see the help for the Sys.setenv() function and for “environment
variables.”
Even though R is in English, it is possible to set the “locale” of R. This allows
you to change the way that R does things such as format currency values.
English speakers use the dot as the decimal separator and the comma to set
off thousands from hundreds, but many Europeans use those two characters
in reverse. Other locale settings affect the abbreviations in use for days of
the week and months of the year. We discuss some of these in Chapter 3, but
one important one to note here is the “collation” setting. This describes how
R sorts alphabetical items. Under the usual choices on Windows and OS X,
lower- and upper-case letters are sorted together, so that “a” precedes “A”
in alphabetical order, but both precede “b.” To continue an earlier example,
this ensures that "apple" < "banana" and "apple" < "Banana"
are both TRUE. However, on some Linux systems the so-called “C” collation
sequence is used. In that scheme, all the upper-case letters come before
any of the lower-case ones – so that "apple" < "banana" is TRUE, but
"apple" < "Banana" is FALSE. Moreover, as the help for Comparison
points out, “in Estonian, Z comes between S and T.” You have to be aware of
both your locale and the relevant language whenever you compare strings.
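The sketch below shows how to inspect the collation setting and how to obtain the “C” ordering regardless of locale. It relies on the radix method of sort(), which orders character vectors as in the C locale; the names fruit and c_order are ours.

```r
Sys.getlocale("LC_COLLATE")      # the collation setting currently in use

fruit <- c("banana", "apple", "Banana")
sort(fruit)                      # the result depends on the locale

# The radix method always uses the C ordering: upper-case letters first
c_order <- sort(fruit, method = "radix")
c_order                          # "Banana" "apple" "banana"
```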
Another aspect of character handling is the use of different character sets.
Text in non-Roman languages such as Hebrew or Korean requires some special
considerations. We discuss these at some length in Chapter 4.
1.5 Getting Help
R has a number of ways of getting help to you. “Help” can mean information
about the specific syntax of individual R commands, about putting the pieces
of R together in programs, or about the details of the various statistical models
and tools that R provides. In this section, we describe some of the resources
available to help you learn about R.
1.5.1 At the Command Line
The most basic help is provided at the command line, through the commands
help(), ? and help.search(). The first two commands act identically and
will be most useful when you need information on a particular R function or
operator whose name you know. In most cases, the argument doesn’t need to
be in quotation marks, though it may be – so help(matrix) or ?"matrix"
both bring up a page about some matrix functions. Quotation marks will
be required when looking for help on some elements of the R language – so
?"for" gives the help page for the for looping term and help("==")
produces the page on comparison operators. The help.search() command
is useful when the subject, rather than the name, is known; this command
opens a window (depending on your operating system) that gives links to
associated R objects. A related command is the apropos() function, which
takes a character argument (as in apropos("matrix")) and returns a
vector of names of objects containing that string (in this example, every object
with matrix in its name). A final piece of command-line help is provided by
the args() function, which takes a function and displays the set of arguments
expected by, and default values provided by, that function.
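A few of these help tools return values that can be examined directly, as in this sketch. The help page calls are shown as comments because they open a viewer; the names matrix_names and arg_names are ours.

```r
# help(matrix)       # the help page for matrix()
# ?"for"             # quotation marks are needed for language elements
# help("==")         # the page on comparison operators

# Names of objects on the search path that contain "matrix"
matrix_names <- apropos("matrix")
head(matrix_names)

# The arguments of sd(), with their default values
args(sd)
arg_names <- names(formals(sd))
arg_names            # "x" "na.rm"
```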
1.5.2 The Online Manuals
When you install R, you are given the opportunity to install the online manuals
with it. ese manuals are generally correct and complete, but they are intended
as references, and are not always useful as tutorials.
1.5.3 On the Internet
The main page for the R project is r-project.org. This is the central repository
for R and its documentation. If you are interested in participating in a
community of R users, you might consider joining one of the mailing lists,
which you can find under mail.html at that page.
R is very popular and there are lots and lots of blogs, pages, and other web
documents that address R and solve specific problems. Your favorite Internet
search engine will be able to find dozens of these.
1.5.4 Further Reading
A lot of documentation comes with R when you install in the usual way. You
can find a list of these manuals under Help |Manuals in Windows, or Help |R
Help on OS X, or with help.start(). The “Introduction to R” manual is a
good place to start.
e book “e Art of R Programming” (Matloff, 2011) is a nice tour of many R
features ranging from beginning to advanced. As its name suggests, the empha-
sis is on writing powerful and efficient R programs. Many other books introduce
the use of R, or describe its application in specific fields such as economics or
genomics. e r-project website has a list of over 150 books using R. As we
mentioned earlier, that site also maintains mailing lists for interested users, and
a quick web search will reveal scores of blogs and web pages devoted to R and
to answering R questions.
The recent book by Wickham and Grolemund (2016) describes those authors’
approach to not only data cleaning but a set of additional tasks, including visual-
ization and modeling, which we think of as beyond the scope of data acquisition
and cleaning. at approach requires an entire set of tools from packages out-
side R – although they come conveniently bundled together – as well as a new
vocabulary. is ecosystem has its adherents, but we prefer to use base R where
possible.
1.6 How to Use This Book
1.6.1 Syntax and Conventions in This Book
We reproduce a lot of R code in this book. R code is indicated in a fixed-width
font like this. Since R is case-sensitive, our text will exactly match what is
typed into R – except that in the prose we capitalize letters of R objects if they
appear at the beginning of sentences. Inside a paragraph, or when we want to
show a sequence of commands, we reproduce exactly what we type, like this:
sqrt(pi). When we also want to show what R returns, the code will be shown
with the prompt and the literal R output, like this:
> sqrt (pi)
[1] 1.772454
Unlike the example in the “top ten quick fact” #1, we suppress the continuation
prompt +, so that it is not confused with the ordinary plus sign.
There are several different schemes for formatting code that you can find
described on the Internet, and they do not always agree. To us the most important
rule is to make your code easy to read. This means, first, use spacing and
indenting in a helpful and consistent way, and second, add plenty of comments
to help the reader. There is always a temptation to write code as quickly as possible,
with an eye toward worrying about neatness later. Resist that temptation!
Code is for sharing and for re-use.
On a lighter note, we know that the word “data” originated as the plural of
the singular “datum,” but it has long been permitted to construe “data” in the
singular, and we do that in this book. You will find us saying “the data is...” rather
than “are.” This is intentional.
1.6.2 The Chapters
In order to use R wisely, you have to understand what data looks like to R. The
following three chapters describe the sorts of data that R recognizes, and how
to manipulate R’s objects. We start by describing vectors, the simplest form of
data in R, in Chapter 2. This chapter describes the common types of vectors,
the different ways to extract subsets from them, and how to change values in
vectors. It also describes how R stores missing values, an integral part of almost
every data cleaning problem. The chapter concludes with a look at the important
table() function and some of the basic operations on vectors – sorting,
identifying duplicates, computing unions and intersections of sets, and so on.
Chapter 3 describes more complicated data structures: matrices, lists, and
finally data frames. Understanding how data frames work is critical to using R
intelligently. We defer until this chapter discussion of how R handles times and
dates, because part of that discussion requires an understanding of lists.
The final data chapter, Chapter 4, discusses the last important data type – text
or character data. Text data is stored in vectors and data frames like other
kinds, but there are a number of operations specific to text. This chapter
describes how to manipulate text in R – changing case, extracting and
assembling pieces of strings, formatting numbers into strings, and so on. One
important topic is regular expressions, a set of tools for finding strings that
contain a pattern of characters. This chapter also discusses the UTF-8 system
of encoding non-Roman alphabets such as Greek or Chinese and R’s concept
of factors, which are important in modeling but often cause problems during
the data cleaning process.
Chapter 5 discusses two types of tool used to automate computations in R:
functions and scripts. ese different, but related, tools, will be part of every
analysis you ever do, so you should understand how to construct them intelli-
gently. We also look briefly at “shell scripts,” which are a special sort of script
that let you run R in batch, rather than interactive mode, and discuss some of
the tools available in R for debugging.
This is a book about cleaning data, but the data to be cleaned needs to
come from somewhere. Chapter 6 describes the different ways to bring
data into R: from other R sessions, from spreadsheet-like text files, from
relational databases, and so on. We describe two of the formats in which data
is commonly found in modern applications: XML and JSON. We also describe
how to acquire data programmatically from web pages.
Chapter 7 takes a bigger view of the data cleaning process. While the earlier
chapters focus on the nuts and bolts of R as they relate to data cleaning, this
chapter describes the sort of challenges in a real-life data cleaning project. We
talk about how to combine data from different sources and give examples of the
sort of anomalies that you have to expect in dealing with real data. In almost
every case you will have to rely on judgment, rather than just on a cookbook
of techniques. We spend some time discussing the role of judgment in data
cleaning.
The Exercise
The culmination of the book is the data cleaning exercise presented in
Chapter 8. This chapter presents a complicated data acquisition and cleaning
problem that, while artificial, reflects many of the problems and challenges we
have seen over our years of real-life data handling experience. If you can find
your way through to the end of the exercise, we expect that you will be well
prepared to handle the data the real world sends your way.
Critical Data Handling Tools
In every chapter, we have set aside the final section to recap commands and
tools we think are particularly important when it comes to data handling and
manipulation. If you can master the use of these tools, and apply them wisely,
you can reduce the risk of missing important information in your data.
The Code
All of the code reproduced in this book appears in scripts in the cleaningBook
package you can download from the CRAN website. You can open these
scripts in R and run the code from there – although since most examples are
very short, we suggest that you consider typing them in yourself, to get a feel
for the R language.
2 R Data, Part 1: Vectors
The basic unit of computation in R is the vector. A vector is a set of one or
more basic objects of the same kind. (Actually, it is even possible to have a
vector with no objects in it, as we will see, and this happens sometimes.) Each
of the entries in a vector is called an element. In this chapter, we talk about
the different sorts of vectors that you can have in R. Then, we describe the
very important topic of subsetting, which is our word for extracting pieces of
vectors – all of the elements that are greater than 10, for example. That topic
goes together with assigning, or replacing, certain elements of a vector. We
describe the way missing values are handled in R; this topic arises in almost
every data cleaning problem. The rest of the chapter gives some tools that are
useful when handling vectors.
2.1 Vectors
By a “basic” object, we mean an object of one of R’s so-called “atomic” classes.
These classes, which you can find in help(vector), are logical (values
TRUE or FALSE, although T and F are provided as synonyms); integer;
numeric (also called double); character, which refers to text; raw,
which can hold binary data; and complex. Some of these, such as complex,
probably won’t arise in data cleaning.
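The atomic types can be inspected with typeof(). This is a small sketch; the vector name types and the example values are ours.

```r
# One example value of each atomic type, labeled by the type we expect
types <- c(
  logical   = typeof(TRUE),
  integer   = typeof(5L),          # the L suffix requests an integer
  double    = typeof(5),           # plain numbers are stored as doubles
  character = typeof("Oakland"),
  raw       = typeof(as.raw(255)),
  complex   = typeof(1 + 2i)
)
types

class(5)                           # "numeric", the other name for double
```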
2.1.1 Creating Vectors
We are mostly concerned with vectors that have been given to us as data. However,
there are a number of situations when you will need to construct your own
vectors. Of course, since a scalar is a vector of length 1, you can construct one
directly, by typing its value:
> 5
[1] 5
R displays the [1] before the answer to show you that the 5 is the first element
of the resulting vector. Here, of course, the resulting vector only had one entry,
but R displays the [1] nonetheless. There is no such thing as a “scalar” in R;
even 𝜋, represented in R by the built-in value pi, is a vector of length 1. To
combine several items into a vector, use the c() function, which combines as
many items as you need.
> c(1, 17)
[1] 1 17
> c(-1, pi, 17)
[1] -1.000000 3.141593 17.000000
> c(-1, pi, 1700000)
[1] -1.000000e+00 3.141593e+00 1.700000e+06
R has formatted the numbers in the vectors in a consistent way. In the second
example, the number of digits of pi is what determines the formatting;
see Section 1.3.3. In example three, the same number of digits is used, but
the large number has caused R to use scientific notation. We discuss that in
Section 4.2.2. Analogous formatting rules are applied to non-numeric vectors
as well; this makes output much more readable. The c() function can also be
used to combine vectors, as long as all the vectors are of the same sort.
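As a short sketch of combining vectors of the same sort, the following joins two numeric vectors and a single value. The names first, second, and both are ours.

```r
first  <- c(1, 17)
second <- c(-1, pi)

# c() accepts vectors as well as single values
both <- c(first, second, 100)
both
length(both)         # 5
```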
Another vector-creation function is rep(), which repeats a value as many
times as you need. For example, rep(3, 4) produces a vector of four 3s. In
this example, we show some more of the abilities of rep().
> rep (c(2, 4), 3) # repeat a vector
[1] 2 4 2 4 2 4
> rep (c("Yes", "No"), c(3, 1)) # repeat elements of vector
[1] "Yes" "Yes" "Yes" "No"
> rep (c("Yes", "No"), each = 8)
[1] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "No"
[10] "No" "No" "No" "No" "No" "No" "No"
The last two examples show rep() operating on a character vector. The final
one shows how R displays longer vectors – by giving the number of the first
element on each line. Here, for example, the [10] indicates that the first "No"
on the second line is the 10th element of the vector.
2.1.2 Sequences
We also very often create vectors of sets of consecutive integers. For example,
we might want the first 10 integers, so that we can get hold of the first 10 rows
in a table. For that task we can use the colon operator, :. Actually, the colon
operator doesn’t have to be confined to integers; you can also use it to produce
a sequence of non-integers that are one unit apart, as in the following example,
but we haven’t found that to be very useful.
> 1:5
[1] 1 2 3 4 5
> 6:-2
[1]  6  5  4  3  2  1  0 -1 -2   # Can go in reverse, by 1
> 2.3:5.9
[1] 2.3 3.3 4.3 5.3              # Permitted (but unusual)
> 3 + 2:7                        # Watch out here! This is 3 +
[1]  5  6  7  8  9 10            # (vector produced by 2:7)
> (3 + 2):7
[1] 5 6 7                        # This is 5:7
In that last pair of examples, we see that R evaluates the 2:7 operation before
adding the 3. This is because : has a higher precedence in the order of operations
than addition. The list of operators and their precedences can be found at
?Syntax, and precedence can always be overridden with parentheses, as in
the example – but this is the only example of operator precedence that is likely
to trip you up. Also notice that adding 3 to a vector adds 3 to each element of
that vector; we talk more about vector operations in Section 2.1.4.
Finally, we sometimes need to create vectors whose entries differ by a num-
ber other than one. For that, we use seq(), a function that allows much finer
control of starting points, ending points, lengths, and step sizes.
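As a brief illustration of seq(), the following sketch shows a few common calls. The values are our own and were chosen only for illustration; by and length.out are standard arguments of seq(), and seq_len() is a related base R function.

```r
# Illustrative values only; these are not taken from a data set.
by_three   <- seq(2, 20, by = 3)         # 2 5 8 11 14 17 20
descending <- seq(10, 0, by = -2.5)      # 10.0 7.5 5.0 2.5 0.0
five_steps <- seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
first_four <- seq_len(4)                 # 1 2 3 4

print(by_three)
print(descending)
print(five_steps)
print(first_four)
```

Unlike the colon operator, seq() lets the step be negative or fractional, and length.out fixes the number of elements rather than the step size.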
2.1.3 Logical Vectors
We can create logical vectors using the c() function, but most often they
are constructed by R in response to an operation on other vectors. We saw
examples of operators back in Section 1.3.2; the R operators that perform
comparisons are <, <=, >, >=, == (for “is equal to”) and != (for “not equal to”).
In this example, we do some simple comparisons on a short vector.
> 101:105 >= 102 # Which elements are >= 102?
[1] FALSE TRUE TRUE TRUE TRUE
> 101:105 == 104 # Which equal (==) 104?
[1] FALSE FALSE FALSE TRUE FALSE
Of course, when you compare two floating-point numbers for equality, you
can get unexpected results. In this example, we compute 1 - 1/46 * 46,
which is zero; 1 - 1/47 * 47, and so on up through 50. We have seen this
example before!
> 1 - 1/46:50 * 46:50 == 0
[1] TRUE TRUE TRUE FALSE TRUE
We noted earlier that R provides T and F as synonyms for TRUE and FALSE.
We sometimes use these synonyms in the book. However, it is best to beware
of using these shortened forms in code. It is possible to create objects named
T or F, which might interfere with their usage as logical values. In contrast,
the full names TRUE and FALSE are reserved words in R. This means that you
cannot directly assign one of these names to an object and, therefore, that they
are never ambiguous in code.
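The following minimal sketch shows the hazard. The variable names masked, reserved, and attempt are ours, and the attempted assignment to TRUE is wrapped in tryCatch() so that the script keeps running after the error.

```r
T <- 0                       # legal: T is an ordinary object name
masked   <- isTRUE(T)        # FALSE, because T now holds the number 0
reserved <- isTRUE(TRUE)     # TRUE: the reserved word is unaffected

# Assigning to TRUE is an error; we capture it rather than halting.
attempt <- tryCatch(eval(parse(text = "TRUE <- 0")),
                    error = function(e) "error")

rm(T)                        # remove our object; T is base R's TRUE again
print(c(masked, reserved))
print(attempt)
```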
The Number and Proportion of Elements That Meet a Criterion
One task that comes up a lot in data cleaning is to count the number (or pro-
portion) of events that meet some criterion. We might want to know how many
missing values there are in a vector, for example, or the proportion of elements
that are less than 0.5. For these tasks, computing the sum() or mean() of a
logical vector is an excellent approach. In our earlier example, we might have
been interested in the number of elements that are ≥ 102, or the proportion that
are exactly 104.
> 101:105 >= 102
[1] FALSE TRUE TRUE TRUE TRUE
> sum (101:105 >= 102)
[1] 4 # Four elements are >= 102
> 101:105 == 104
[1] FALSE FALSE FALSE TRUE FALSE
> mean (101:105 == 104)
[1] 0.2 # 20% are == 104
It may be worth pondering this last example for a moment. We start with the
logical vector that is the result of the comparison operator. In order to apply
a mathematical function to that vector, R needs to convert the logical ele-
ments to numeric ones. FALSE values get turned into zeros and TRUE values
into ones (we discuss conversion further in Section 2.2.3). Then, sum() adds
up those 0s and 1s, producing the total number of 1s in the converted vector –
that is, the number of TRUE values in the logical vector or the number
of elements of the original vector that meet the criterion by being ≥ 102. The
mean() function computes the sum of the 1s and then divides that
sum by the total number of elements, and that operation produces the proportion
of TRUE values in the logical vector, that is, the proportion of elements
in the original vector that meet the criterion.
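The same idea applies to the two tasks mentioned at the start of this section. This is a minimal sketch on made-up data; is.na() and the na.rm argument are standard R, but the vector x and the result names are ours.

```r
x <- c(0.2, NA, 0.7, 0.4, NA, 0.9)          # made-up data, two values missing

n_missing  <- sum(is.na(x))                 # number of missing values: 2
prop_small <- mean(x < 0.5, na.rm = TRUE)   # proportion of non-missing values < 0.5

print(n_missing)
print(prop_small)                           # 0.5
```

Without na.rm = TRUE the second result would itself be NA, because the comparison x < 0.5 is NA wherever x is missing.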
2.1.4 Vector Operations
Understanding how vectors work is crucial to using R properly and efficiently.
Arithmetic operations on vectors produce vectors, which means you very often
do not have to write an explicit loop to perform an operation on a vector. Sup-
pose we have a vector of six integers, and we want to perform some operations
on them. We can do this:
> 5:10
[1]  5  6  7  8  9 10
> (5:10) + 4
[1]  9 10 11 12 13 14
> (5:10)^2 # Square each element;
[1] 25 36 49 64 81 100 # parentheses necessary
Just to repeat, arithmetic and most other mathematical operations operate on
vectors and return vectors. So if you want the natural logarithm of every item
in a vector named x, for example, you just enter log(x). If you want the
square of the cosine of the logarithm of every element of x, you would use
cos(log(x))^2, and so on. There are functions, such as length(), sum(),
mean(), sd(), min(), and max(), that operate on a vector and produce a
single number (which, to be sure, is also a vector in R). There are also functions
such as range(), which returns a vector containing the smallest and
largest values, and summary(), which returns a vector of summary statistics,
but one of the sources of R’s power is the ability to perform computations on
every element of a vector at once.
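A short sketch contrasting the two kinds of function follows. The vector x is an arbitrary example of ours.

```r
x <- c(1, 10, 100)            # illustrative values

logs   <- log(x)              # natural log of every element: length 3
cos_sq <- cos(log(x))^2       # square of the cosine of each logarithm

n   <- length(x)              # a single number: 3
rng <- range(x)               # smallest and largest values: 1 100

print(logs)
print(cos_sq)
print(n)
print(rng)
```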
In the last two examples above, we operated on a vector and a single number
simultaneously. R handles this in the natural way: by repeating the 4 (in the first
example) or the 2 (in the second) as many times as needed. R calls this recycling.
In the following example, we see what R does in the case of operating on two
vectors of the same length. The answer is, it performs the operation between
the first elements of each vector, then the second elements, and so on. In the
opening command, we have the usual assignment, using <-, and also an additional
set of parentheses outside that command. These additional parentheses
cause the result of the assignment to be printed. Without them, we would have
created thing1, but its value would not have been displayed.
> (thing1 <- c(20, 15, 10, 5, 0)^2)
[1] 400 225 100 25 0
> (thing2 <- 105:101)
[1] 105 104 103 102 101
> thing2 + thing1
[1] 505 329 203 127 101
> thing2 / thing1
[1] 0.2625000 0.4622222 1.0300000 4.0800000 Inf
In the last lines, R computes the ratios element by element. The final ratio,
101/0, yields the result Inf, referring to an infinite value. We discuss Inf more
in Section 2.4.4. The following example compares a function that returns a single,
summary value to one that operates element by element.
> max (thing2, thing1)
[1] 400
> pmax (thing2, thing1)
[1] 400 225 103 102 101
The max() function produces the largest value anywhere in any of its
arguments – in this case, the 400 from the first element of thing1. The
pmax() (“parallel maximum”) function finds the larger of the first elements of
the two vectors, and the larger of the second elements of the two vectors, and
so on.
Two logical vectors can also be combined element by element, using the |
logical operator for “or” (i.e., returning TRUE if either element is TRUE) and
the & operator for “and” (i.e., returning TRUE only if both elements are TRUE).
These operators differ in a subtle way from their doubled versions || and &&.
The single versions evaluate the condition for every pair of elements from both
vectors, whereas the doubled versions evaluate multiple TRUE/FALSE conditions
from left to right, stopping as soon as possible. These doubled versions
are most useful in, for example, if() statements.
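The difference can be sketched as follows. The vectors a, b, and x are our own examples; the doubled operator is applied only to conditions of length 1, which is how it is intended to be used.

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

both   <- a & b    # element by element: TRUE FALSE FALSE
either <- a | b    # element by element: TRUE TRUE FALSE

x <- numeric(0)
# && stops at the first FALSE, so x[1] > 0 (which would be NA) is never evaluated.
ok <- length(x) > 0 && x[1] > 0

print(both)
print(either)
print(ok)          # FALSE
```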
Recycling
There can be a complication, though: what if two vectors being operated on are
not of the same length?
> 5:10 + c(0, 10, 100, 1000, 10000, 100000) # Two 6-vectors
[1] 5 16 107 1008 10009 100010 # Add by element
> 5:10 + c(1, 10, 100) # A 6-vector and a 3-vector
[1] 6 16 107 9 19 110 # The 3-vector is replicated
> 5:10 + 3:7 # A 6-vector and a 5-vector
[1] 8 10 12 14 16 13 # 5+3, 6+4, ..., 9+7, 10+3
Warning message:
In 5:10 + 3:7 :
longer object length is not a multiple of shorter length
It is important to understand these last two examples because the problem
of mismatched vector lengths arises often. In the first of the two examples, the
3-vector (1, 10, 100) was added to the first three elements of the 6-vector,
and then added again to the second three elements. Once again R is recycling.
No warning was issued because 3 is a factor of 6, so the shorter vector was
recycled an exact number of times. In the final example, the 5-vector was
added to the first five elements of the 6-vector. In order to finish the addition,
R recycled the first element of the 5-vector, the value 3. That value was added
to the last entry of the 6-vector, 10, to produce the final element of the result,
13. The recycling only used part of the 5-vector; since 5 is not a factor of 6, a
warning was issued.
Recycling a vector of length 1, as we did when we computed (5:10) + 4,
is very common. Recycling vectors of other lengths is rarer, and we suggest you
avoid it unless you are certain you know what you are doing. When you see
the longer object length... warning as we did in the last example, we
recommend you treat that as an error and get to the root of that problem.
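One way to act on this advice is sketched below. The helper add_strict() is hypothetical (it is our own illustration, not a function from R or from this book); tryCatch(), try(), and stopifnot() are standard R.

```r
x <- 5:10
y <- 3:7

# Capture the recycling warning as text instead of letting it scroll past.
msg <- tryCatch(x + y, warning = function(w) conditionMessage(w))
print(msg)

# add_strict() is a hypothetical helper: it permits equal lengths, or
# recycling of a length-1 vector, and stops otherwise.
add_strict <- function(u, v) {
  stopifnot(length(u) == length(v) || length(u) == 1 || length(v) == 1)
  u + v
}

print(add_strict(x, 4))      # 9 10 11 12 13 14
failed <- inherits(try(add_strict(x, y), silent = TRUE), "try-error")
print(failed)                # TRUE
```

A blunter alternative is options(warn = 2), which turns every warning in the session into an error.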
Tools for Handling Character Vectors
Almost every data cleaning problem requires some handling of characters.
Either the data contains characters to start with – maybe names and addresses,
or dates, or fields that indicate sex, for example – or we will need to construct
some (perhaps turning sexes labeled 1 or 2 into M and F). We also often need to
search through character strings to find ones that match a particular pattern;
remove commas or currency signs that have been put into formatted numbers
(such as “$2500.00”); or discretize a numeric variable into a smaller number
of groups (such as turning an Age field into levels Child, Teen, Adult,
Senior). Character data is so important, and so common, that we have
devoted an entire chapter (Chapter 4) to special techniques for handling it.
2.1.5 Names
A vector may have names, a vector of character strings that act to identify the
individual entries. It is possible to add names to a vector, and in this section we
give examples of that. More commonly, though, R adds names to a table when
you tabulate a vector using the table() function. We will have more to say
about table(), and the names it produces, in Section 2.5. In the meantime,
here is a simple example of a vector with names. Notice that the third name
has an embedded space. This name is not “syntactically valid” according to R’s
rules. A syntactically valid name has only letters, numbers, dots, and under-
scores and starts either with a letter or a dot and then a non-numeric character.
It is usually a bad practice to have a vector’s names be invalid, but, as we show in
the following example, it is possible. See Section 3.4.2 for information on how
to ensure that your names are valid.
> vec <- c(101, 102, 103)
> names(vec)
NULL
> names(vec) <- c("a", "b", "Long name")
> names(vec)
[1] "a" "b" "Long name"
After the second line, R returned the special value NULL to indicate that the vector
had no names. (We talk more about NULL in Section 2.4.5.) The names()
function then assigned names to the elements of the vector. We can also assign
names directly in the c() function, as in this example.
> c(a = 101, b = 102, Long.name = 103)
        a         b Long.name
      101       102       103
In this case, we used a syntactically valid name; an invalid one would have had
to be enclosed in quotation marks.
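For comparison, here is a minimal sketch of the quoted form. The use of make.names() at the end is our own addition; it is a base R function that converts strings into syntactically valid names.

```r
v <- c(a = 101, b = 102, "Long name" = 103)   # quotes permit the invalid name

third <- v[["Long name"]]                     # extract by the invalid name: 103
valid <- make.names(names(v))                 # "a" "b" "Long.name"

print(v)
print(third)
print(valid)
```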
2.2 Data Types
The three data types we have mentioned so far – numeric, logical, and character –
are the ones we most often use. R does support several other data types. In
this section, we mention these data types briefly, and then discuss the important
topic of converting data from one type to another. Sometimes this is an opera-
tion we do explicitly and intentionally; other times R performs the conversion
automatically.
2.2.1 Some Less-Common Data Types
Integers
R can represent as integers the values between −(2^31 − 1) and 2^31 − 1. (The
latter number is 2,147,483,647.) Values outside this range may be displayed as if they were
integers, but they will be stored as doubles. When doing calculations, R automatically
converts values that are too big to be integers into doubles, so the only
time integer storage will matter is if you explicitly convert a really large value
into an integer (see Section 2.2.3). If you need R to regard an item as an integer
for some reason, you can append L to its end. So, for example, 123 is numeric
but 123L is regarded as an integer value. Of course, it only makes sense to add
L to a thing that really is an integer.
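These points can be checked directly, as in the sketch below. The result names are ours; typeof(), as.integer(), and .Machine$integer.max are standard R.

```r
type_plain  <- typeof(123)      # "double": a plain number is stored as a double
type_suffix <- typeof(123L)     # "integer": the L suffix requests integer storage

largest  <- .Machine$integer.max           # 2147483647, that is, 2^31 - 1
type_sum <- typeof(largest + 1)            # "double": adding a double gives a double

# Explicit conversion of a value that is too large yields NA (with a warning).
too_big <- suppressWarnings(as.integer(2^31))

print(c(type_plain, type_suffix, type_sum))
print(largest)
print(too_big)
```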
Raw
“Raw” refers to data kept in binary (hexadecimal) form. This is the format that
data from images, sound, or video will take in R. We rarely need to handle that
kind of file in a data cleaning problem. However, we do sometimes resort to
using raw data when a file has unexpected characters in it, or at the beginning
of an analysis when we do not know what sort of data a file might have. In that
case, the data will be read into R and held as a vector of class raw. A raw vector
is a string of bytes represented in hexadecimal form. It can be converted into
character data (when that makes sense) with the rawToChar() function. We
talk more about reading raw data, particularly to handle the case of unexpected
characters, in Section 6.2.5.
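A small sketch of the round trip between character and raw follows. The string "R data" is an arbitrary example of ours; charToRaw() is the base R counterpart of rawToChar().

```r
bytes <- charToRaw("R data")   # one byte per character
print(bytes)                   # 52 20 64 61 74 61

kind <- class(bytes)           # "raw"
text <- rawToChar(bytes)       # back to the original string
print(kind)
print(text)
```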
Complex Numbers
R has the ability to manipulate complex numbers (numbers such as 1.3 − 2.4i,
where i is √−1). Since complex numbers almost never arise in data cleaning,
we will not discuss them in this book.
2.2.2 What Type of Vector Is This?
You can usually tell what sort of vector you have by looking at a few of its entries.
Character data has entries surrounded by quotes; numeric entries have
no quotes; and logical entries are either TRUE or FALSE. So, for example,
the value "TRUE", with quotation marks, can only belong to a character vector.
There are also several functions in R that tell you explicitly what sort of
thing you have. Two of these functions, mode() and typeof(), tell you the
basic type of vector. They are essentially identical for our purposes, except that
typeof() differentiates between integer and double, whereas mode()
calls them both numeric. The str() function (for “structure”) not only tells
you the type of vector but also shows you the first few entries. A related func-
tion, class(), is a more general operator for complex types.
A second group of functions gives a TRUE/FALSE answer as to whether
a specific vector has a specific mode. These functions are named
is.logical(), is.integer(), is.numeric(), and is.character(),
and each returns a single logical value describing the type of the vector.
A more general version, is(), lets you specify the class as an argument:
so is.numeric(pi) is identical to is(pi, "numeric"). This more
general form is particularly useful when testing for more complicated, possibly
user-defined classes.
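The sketch below exercises the functions just described on two small vectors of our own. We call is() as methods::is() only to make explicit that it lives in the methods package, which R normally attaches at startup.

```r
x <- 1:3          # stored as integer
y <- c(1.5, 2)    # stored as double

types  <- c(typeof(x), typeof(y))   # "integer" "double"
modes  <- c(mode(x), mode(y))       # "numeric" "numeric"
checks <- c(is.numeric(x), is.integer(y), is.character("a"))   # TRUE FALSE TRUE
same   <- identical(is.numeric(pi), methods::is(pi, "numeric"))

print(types)
print(modes)
print(checks)
print(same)       # TRUE
str(x)            # shows the type and the first few entries
```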
2.2.3 Converting from One Type to Another
It is important to remember that a vector can contain elements of only one type.
When types are mixed – for example, if you inadvertently insert a character
element into a numeric vector – R modifies the entire vector to be of the more
complicated type. Here is an example:
> c(1, 4, 7, 2, 5) # Create numeric vector
[1] 1 4 7 2 5
> c(1, 4, 7, 2, 5, "3") # What if one element is character?
[1] "1" "4" "7" "2" "5" "3"
In this example, the entire vector got converted to character. The rule is
that R will convert every element of a vector to the “most complicated” type
of any of the elements. Logical is the least complicated type, followed by
raw, numeric, complex, and then character. (Raw vectors behave a little
differently from the others. See Section 6.2.5.)
It is important to know what values the less complicated types get when they
are converted to more complicated ones. Logical elements that are converted
into numeric become 0 where they have the value FALSE and 1 where they are
TRUE. A logical converted into a character, however, gets values "FALSE" and
"TRUE". A number gets converted into a high-accuracy text representation of
itself, as we see in these examples.
> 1/7
[1] 0.1428571 # by default, 7 digits are displayed
> c(1/7, "a")
[1] "0.142857142857143" "a"
One instance where R frequently performs conversions automatically is from
integer to numeric types.
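These automatic conversions can be seen in a short sketch. The values are arbitrary examples of ours.

```r
mixed_num  <- c(TRUE, FALSE, 2)     # logical joins numeric: 1 0 2
mixed_chr  <- c(TRUE, "a")          # logical joins character: "TRUE" "a"
int_to_dbl <- typeof(c(1L, 2.5))    # integer joins double: "double"
count_true <- TRUE + TRUE           # arithmetic treats TRUE as 1, giving 2

print(mixed_num)
print(mixed_chr)
print(int_to_dbl)
print(count_true)
```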
Conversion Functions
R will convert less complicated types into more complicated ones where
required. Sometimes you need to force the elements of a vector back into a
less complicated representation. Just as there are functions whose names start
with is. for testing the type of an object, there is a set of as. functions for
converting from one type to another. The rules are these: a character will be
successfully converted to a numeric if it has the syntax of a number. It may
have leading or trailing spaces (or new-lines or tabs), but no embedded ones; it
may have leading zeros; it may have a decimal point (but only one); it may not
have embedded commas; it may have a leading minus or plus sign, and, if it is
in scientific notation, the exponent character E may be in upper- or lower-case
and may also be followed by a minus or plus sign. In this example, we show
some character strings that do and do not get converted to numbers. Notice
that the elements of the vector that do not get converted turn into missing
values (NA). We discuss missing values in Section 2.4.
> as.numeric (c(" 123.5 ", "-123e-2", "4,355", "45. 6",
"$23", "75%"))
[1] 123.50 -1.23 NA NA NA NA
Warning message:
NAs introduced by coercion
In this case, the first two elements were successfully converted. The third has a
comma, the fourth has an embedded space, and the last two have non-numeric
characters. In order to convert strings such as those into numbers, you would
have to remove the offending characters. We describe how to manipulate text
in Chapter 4.
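As a preview, here is one possible sketch of that removal using gsub(), a base R function. The pattern is our own choice and assumes that commas, dollar signs, percent signs, and spaces carry no information; whether that is true (for example, whether "75%" should become 75 or 0.75) depends on the data.

```r
raw_text <- c("4,355", "45. 6", "$23", "75%")

# Delete every comma, dollar sign, percent sign, and space.
stripped <- gsub("[$,% ]", "", raw_text)
numbers  <- as.numeric(stripped)

print(stripped)   # "4355" "45.6" "23" "75"
print(numbers)
```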
The warning message you see here is a very common one. Unlike most
warning messages, this one will often arise naturally in the course of data
cleaning – but make sure you understand exactly where it’s coming from.
The only character values that can be successfully converted into
logical are "T", "TRUE", "True", and "true" and "F", "FALSE",
"False", and "false". In this case, no extraneous spaces are permitted.
All other character values are converted into NAs.
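A quick sketch of this rule follows, using test strings of our own.

```r
candidates <- c("TRUE", "true", "T", "False", "f", " TRUE", "yes", "0")
converted  <- as.logical(candidates)

# Lower-case "f", the padded " TRUE", "yes", and "0" all become NA.
print(converted)   # TRUE TRUE TRUE FALSE NA NA NA NA
```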
The rule is simple for converting numeric values into logical ones.
Numeric values that are zero become FALSE; all other numbers become TRUE.
The only issue is that sometimes numbers you expect to be zero aren’t quite
because of floating-point error. In this example, we convert some numbers and
expressions to logical.
> as.logical (c(123, 5 - 5, 1e-300, 1e-400, 1 - 1/49 * 49))
[1] TRUE FALSE TRUE FALSE TRUE
The first element here is clearly non-zero, so it gets converted to TRUE. The
second evaluates to exactly zero and produces FALSE. The third is non-zero,
but the fourth counts as zero since it is outside the range of double precision
(see Section 1.3.3). The last element is our running example of an expression
that “should” be zero but is not (again, see Section 1.3.3). Since it is not zero,
it gets converted to TRUE. Numeric, non-missing values never produce NA
when converted to logical.
2.3 Subsets of Vectors
We very often need to pull out just a piece of a vector. This is called subsetting
or extracting. In most cases, where we extract a subset, we can use a similar
expression to replace (or assign) new values to a subset of the elements in a
vector. Knowing how to do this is crucial to data cleaning in R; you cannot
work efficiently in R without understanding this material.
2.3.1 Extracting
We constantly perform this operation in one form or another when cleaning
data: we look at subsets of rows or columns, we examine a vector for anoma-
lous entries, we extract all the elements of one vector for which another has a
specific value, and so on. There are three methods by which we can extract a
subset of a vector. First, we can use a numeric vector to specify which elements
to extract. This numeric vector is an example of a “subscript” and its entries
are called “indices.” Second, we can use a logical subscript; and, third, we can
extract elements using their names.
Numeric Subscripts
The most basic way to extract a piece of a vector is to use a numeric subscript
inside square brackets. For example, if you have a vector named a, the command
a[1] will extract the first element of a. The result of that command is a
vector of length 1, of the same mode as the original a. The command a[2:5]
will produce a vector of length 4, with the second through fifth elements of
a. If you ask for elements that aren’t there – if, for example, a only had three
elements – then R will fill up the missing spots with missing (NA) values. We discuss
those further in Section 2.4. In this example, we have a vector a containing
the numbers from 101 to 105.
> (a <- 101:105)
[1] 101 102 103 104 105
> a[3]
[1] 103
It’s possible to pull out elements in any order, just by preparing the subscript
properly. You can even use a numeric expression to compute your subscript,
but only do this if you’re sure your expression is an integer. If the result of your
expression isn’t an integer, even if it misses by just a tiny bit, you will get some-
thing you might not expect.
> a[c(4, 2)]
[1] 104 102
> a[1+1] # A simple expression; this works
[1] 102
> a[2.999999999999999] # This is truncated to 2, but...
[1] 102
> a[2.9999999999999999] # exactly 3 in double-precision.
[1] 103
> a[49 * (1/49)] # This index gets truncated to zero;
integer(0) # R produces a vector of length zero
There are two kinds of special values in numeric subscripts: negative values
and zeros. Negative values tell R to omit those values, instead of extracting
them – so a[-1], for example, returns everything except the first element of a.
You can have more than one negative number in your subscript, but you cannot
mix positive and negative numbers, and that makes sense. (For example, in the
expression a[c(-1, 3)], should the second element be returned or not?)
Zeros are another special value in a subscript. They are simply ignored by
R. Zeros appear primarily as a result of the match() function; you will rarely
use them intentionally yourself. Knowing that zeros are permitted helps make
sense of the error message in the following example, though.
> a[-2] # Omit element 2
[1] 101 103 104 105
> a[c(-1, 3)] # Illegal
Error in a[c(-1, 3)] : only 0's may be mixed
with negative subscripts
> a[-1:2] # Illegal, because -1:2 evaluates to -1, 0, 1, 2
Error in a[-1:2] : only 0's may be mixed
with negative subscripts
> a[-(1:2)] # -(1:2) is (-1, -2): omit elements 1 and 2.
[1] 103 104 105
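To see how zeros in a subscript can come from match(), consider this sketch. It assumes the nomatch = 0 argument is used; by default match() returns NA, not zero, for values that are not found. The lookup values are our own.

```r
a <- 101:105

# Positions of 103, 999, and 101 within a; 999 is absent, so it yields 0.
idx <- match(c(103, 999, 101), a, nomatch = 0)
found <- a[idx]        # the zero is silently ignored
none  <- a[0]          # a subscript of only zero gives a length-0 vector

print(idx)             # 3 0 1
print(found)           # 103 101
print(none)            # integer(0)
```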
Logical Subscripts
Logical subscripts are also very powerful. A logical subscript is a logical vec-
tor of the same length as the thing being extracted from. Values in the original
vector that line up with TRUE elements of the subscript are returned; those that
line up with FALSE are not.
We almost never construct the logical subscript directly, using c(). Instead
it is almost always the result of a comparison operation. In this example, we
start with a vector of people’s ages, and extract just the ones that are > 60.
> age <- c(53, 26, 81, 18, 63, 34)
> age > 60
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> age[age > 60]
[1] 81 63
The age > 60 vector has one entry for each element of age, so it is easy to
use that to extract the numeric values of age that are > 60. But the power
of logical subscripting goes well beyond that. Imagine that we also knew the
names of each of the people. Here we show how to extract the names just for
the people whose ages are >60.
> people <- c("Ahmed", "Mary", "Lee", "Alex", "John", "Viv")
> age > 60 # Just as a reminder
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> people[age > 60] # Return name where (age > 60) is TRUE
[1] "Lee" "John"
This particular manipulation – extracting a subset of one vector based on
values in another – is something we do in every data cleaning problem. It is
important to be sure that you know exactly how it works.
One case where results might be unexpected is when you inadvertently cause
a logical subscript to be converted to a numeric one. In the example above,
suppose we had saved the logical vector as a new R object called age.gt.60.
In the following example, we show what happens if R is allowed to convert that
logical vector into a numeric one.
> age.gt.60 <- age > 60
> people[age.gt.60]
[1] "Lee" "John" # as expected
> people[0 + age.gt.60]
[1] "Ahmed" "Ahmed"
> people[-age.gt.60]
[1] "Mary" "Lee" "Alex" "John" "Viv"
In the 0 + age.gt.60 example, R has to convert the logical subscript to
numeric in order to perform the addition. After the addition, then, the subscript
has the values 0 0 1 0 1 0, and the extraction produces the first element of
the vector two times, ignoring the zeros. In the final command, the negative
sign once again causes R to convert the logical subscript to numeric; after the
application of the sign operator the subscript has the values 0 0 -1 0 -1 0.
The extraction drops the first element (because of the -1 value) and the rest are
returned. This is a mistake we sometimes make with a logical subscript – in this
example, we probably intended to enter people[!age.gt.60], with the !
operator, in order to return people whose ages are not greater than 60.
When using a logical subscript, it is possible for the two vectors – the data
and the subscript – to be of different lengths. In that case R recycles the shorter
one, as described in Section 2.1.4. This might be useful if, say, you wanted to
keep every third element of your original vector, but in general we recommend
that your logical subscript be the same length as the original vector.
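The every-third-element case can be sketched as follows, with an equivalent numeric subscript shown for comparison. The vector x is our own example.

```r
x <- 101:109

# The 3-element logical subscript is recycled three times across x.
every_third   <- x[c(FALSE, FALSE, TRUE)]
same_by_index <- x[seq(3, length(x), by = 3)]   # explicit positions 3, 6, 9

print(every_third)     # 103 106 109
print(same_by_index)   # 103 106 109
```

The numeric version states the intent more plainly, which is one reason to prefer it over a recycled logical subscript.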
The which() function can be used to convert a logical vector into a numeric
one. It returns the indices (i.e., the position numbers) of the elements that are
TRUE. So this is particularly useful when trying to find one or two anomalous
entries in a long vector of logical values. To find the locations of the minimum
value in a vector y, you can use which(y == min(y)), but the act of finding
the index specifically of the minimum or maximum value is so common
that there are dedicated functions, called which.min() and which.max(),
for this task. There is one difference, though: these two functions break ties by
selecting the first index for which y is at its maximum or minimum, whereas
which() returns all the matching indices.
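The tie-breaking difference shows up in a vector with repeated extremes. The vector y below is a made-up example.

```r
y <- c(4, 1, 7, 1, 7)          # the minimum and the maximum each occur twice

above_three <- which(y > 3)          # positions of TRUE values: 1 3 5
all_minima  <- which(y == min(y))    # every position of the minimum: 2 4
first_min   <- which.min(y)          # only the first: 2
first_max   <- which.max(y)          # only the first: 3

print(above_three)
print(all_minima)
print(first_min)
print(first_max)
```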
Using Names
The third kind of subscripting is to use a vector’s names. Since names are characters,
a name subscript will need to be a character as well. Here is a named
vector, together with an example of subscripting by name.
> (vec <- c(a = 12, b = 34, c = -1))
 a  b  c
12 34 -1
> vec["b"]
b
34
> vec[names (vec) != "a"]
 b  c
34 -1
To show all the values except the one named a, it is tempting to try something
like vec[-"a"]. However, R tries to compute the value of “negative a,” fails,
and produces an error. The final example above shows one way to exclude the
element with a particular name from being extracted.
Named vectors are not uncommon, but they do not come up very often in
data cleaning. e real use of names will become clearer in Chapter 3, where
we will encounter rectangular structures that have row names, column names,
or, very often, both.
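When more than one name is to be excluded, two other approaches work as well. This sketch uses %in% and setdiff(), both base R; the choice of names to drop is ours.

```r
vec <- c(a = 12, b = 34, c = -1)

# Logical subscript: keep entries whose names are not in the excluded set.
not_a_or_c <- vec[!(names(vec) %in% c("a", "c"))]

# Name subscript: keep every name except "a".
without_a <- vec[setdiff(names(vec), "a")]

print(not_a_or_c)
print(without_a)
```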
2.3.2 Vectors of Length 0
Any of these extraction methods can produce a vector of length 0, if no element
meets the criterion. This happens particularly often when all of the elements
of a logical subscript are FALSE. A vector of length 0 is displayed as integer(0),
numeric(0), character(0), or logical(0). In this example,
we show how such a vector might arise.
> (b <- c(101, 102, 103, 104))
[1] 101 102 103 104
> a <- b[b < 99] # Reasonable, but no elements of b are < 99
> a
numeric(0) # a has length 0
A zero-length vector cannot be used intelligently in arithmetic, and watch
out: the sum() of a numeric or logical vector of length 0 is itself zero. If a
zero-length vector is used as the condition in an if() statement, an error
results. This is an error that arises in data cleaning, as in this example:
> sum (a)
[1] 0 # Possibly unexpected
> sum (a + 12345)
[1] 0 # Definitely unexpected
> if (a < 2) cat ("yes\n")
Error in if (a < 2) cat("yes\n") : argument is of length zero
In the last example, we made use of the cat() function, which writes its
arguments out to the screen, or, as R calls it, the console. The \n represents the
new-line, to return the cursor to the left of the screen. When writing functions
to do data cleaning (Chapter 5), we will need to check that the conditions
being tested are not vectors of length 0.
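One simple form such a check might take is sketched below. The function check_first() and its return values are hypothetical, written only to illustrate testing length() before the condition is used.

```r
# check_first() is a hypothetical helper, not a function from R or this book.
check_first <- function(v) {
  if (length(v) == 0) {
    return("empty")            # avoid if() on a zero-length condition
  }
  if (v[1] < 2) "small" else "large"
}

b <- c(101, 102, 103, 104)
r_empty <- check_first(b[b < 99])   # no element qualifies, so length is 0
r_large <- check_first(b)

print(r_empty)   # "empty"
print(r_large)   # "large"
```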
2.3.3 Assigning or Replacing Elements of a Vector
Every operation that extracts some values can also be used to replace those val-
ues, simply using the extraction operation on the left side of an assignment. Of
course, R will require that the resulting vector have all its entries of the same
type. So, for example, a[2] <- 3 will replace the second entry of a with the
value 3. If a is logical, this operation will force it to be numeric; if a is character,
the second entry of a will be assigned the character value "3". Just as we can
extract using logical subscripts or names, we can use those subscripting techniques
for assignment as well. These examples show replacement with numeric
and logical subscripts.
> (a <- c(101, 102, -99, 104, -99, 9106)) # last item should
[1] 101 102 -99 104 -99 9106 # have been 106
> a[6] <- 106 # numeric subscript
> a
[1] 101 102 -99 104 -99 106
> a[a < 0] <- 9999 # logical subscript
> a
[1] 101 102 9999 104 9999 106
As we mentioned, a logical subscript will almost always have the same length as
the data vector on which it is operating. In the preceding example, the logical
subscript a < 0 has the same length as a itself.
These examples show how names can be used to assign new values to the
elements of a vector.
> b <- c("A", "missing", "C", "D")
> names (b) <- c("Red", "White", "Blue", "Green")
> b
      Red     White      Blue     Green
      "A" "missing"       "C"       "D"
> b["White"] <- "B" # name subscript
> b
  Red White  Blue Green
  "A"   "B"   "C"   "D"
It is also possible to assign to elements of a vector out past its end. This is one
way to combine two vectors. Elements that are not assigned will be given the
special NA value (see the following section). Another way to combine two vec-
tors is with the c() command, but either way, if two vectors of different types
are combined, R will need to convert them to the same type. In this example,
we combine two vectors.
> a <- 101:103
> b <- c(7, 2, 1, 15)
> c(a, b) # Combine two vectors
[1] 101 102 103 7 2 1 15
> a # Unchanged; no assignment made
[1] 101 102 103
> a[4:7] <- b # index non-existent values
> a
[1] 101 102 103   7   2   1  15
> b
[1]  7  2  1 15
> b[6] <- 22 # index non-existent value
> b
[1]  7  2  1 15 NA 22 # b[5] filled in with NA
In the last example, b[6] was assigned, but no instruction was given about
what to do with the newly created fifth element of b. R filled it in with the special
missing value code, NA. The following section describes how NA values operate
in R.
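The type conversion described at the start of this section is easy to check at the console. The following is a minimal sketch; the vector names lg, ch, and mixed are ours, chosen only for illustration.

```r
# Assigning a number into a logical vector forces the whole vector to numeric
lg <- c(TRUE, FALSE, TRUE)
lg[2] <- 3
lg              # 1 3 1
class(lg)       # "numeric"

# Assigning a number into a character vector stores it as text
ch <- c("x", "y", "z")
ch[2] <- 3
ch              # "x" "3" "z"

# Combining vectors of different types with c() also converts
# everything to the more general type
mixed <- c(101:103, "A")
class(mixed)    # "character"
```

In each case R converts silently, so a stray assignment can change the type of an entire vector without any warning.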
2.4 Missing Data (NA) and Other Special Values
In R, missing values are identified by NA (or, under some circumstances, by
<NA>; see Sections 2.5 and 4.6). This is a special code; it is not the two capital
letters N and A put together. Missing values are inevitable in real data, so it is
important to know the effect they have on computations, and to have tools to
identify them and replace them where necessary. In this section, we discuss NA
values in vectors; subsequent chapters expand the discussion to describe the
effect of NA values in other sorts of R objects.
Missing values arise in several ways. First, sometimes data is just missing – it
would make sense for an observation to be present, but in fact it was lost
or never recorded. Second, some observations are inherently missing. For
example, a field named MortPayRat might contain the ratio of a customer’s
monthly home mortgage payment to her monthly income. Customers with
no mortgage at all would presumably have no value for this field. An NA value
would make more sense than a zero, which would suggest a mortgage payment
of zero. ird, as we saw in the last section, missing values appear when we
try to extract an item that was never present in a vector. For example, the
built-in item letters is a vector containing the 26 lower-case letters of the
English alphabet. e expression letters[27] will return an NA. Finally, we
sometimes see other special values Inf or -Inf or NaN in response to certain
computations, like trying to divide by zero. ose special values can often be
treated as if they were NA values. We discuss these and a final special value,
NULL,inthissection.
Since all the elements of a vector must be of the same kind, there are
actually several different kinds of NA. An NA in a logical vector is a little
different from an NA in a numeric or character one. (There are actually objects
named NA_real_, NA_integer_, and NA_character_, which make this
explicit.) Normally, the difference will not matter, but there is one case where
knowing about the types of NA can explain some behavior that both arises
fairly often and also seems mysterious. We mention this in Section 2.4.3.
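The different kinds of NA can be seen with typeof(). This is a small sketch for illustration only; nothing in it goes beyond base R.

```r
typeof(NA)             # "logical"
typeof(NA_real_)       # "double"
typeof(NA_integer_)    # "integer"
typeof(NA_character_)  # "character"

# An NA placed in a vector takes on that vector's type
typeof(c(1.5, NA))     # "double"
typeof(c("a", NA))     # "character"
```

A bare NA is logical, the simplest type, which is why it can be converted to fit whatever vector it is placed in.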
2.4.1 The Effect of NAs in Expressions
A general, if imprecise, rule about NA values is that any computation with an
NA itself becomes an NA. If you add several numbers, one of which is an NA,
the sum becomes NA. If you try to compute the range of a numeric vector with
missing values, both the minimum and maximum are computed as NA. This makes
sense when you think of an NA as an unknown that could take on any value.
Basic summary functions for numeric vectors, such as sum(), mean(), and range(),
all allow you to specify the na.rm = TRUE argument, to compute the result after
omitting missing values.
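As a brief sketch of this rule, using a made-up vector x:

```r
x <- c(4, 8, NA, 15)

sum(x)                  # NA
range(x)                # NA NA
sum(x, na.rm = TRUE)    # 27
range(x, na.rm = TRUE)  # 4 15
```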
2.4.2 Identifying and Removing or Replacing NAs
In every data cleaning problem we need to determine whether there are NA
values. What you cannot do to identify missing values is to compare them
directly to the value NA. Just as adding an NA to something produces an NA,
comparing an NA to something produces NA. So if a variable thing has value
3, the expression thing == NA produces NA, and if thing has value NA, the
expression thing == NA also produces NA. To determine whether any of
your values are missing, use the anyNA() function. This operates on a vector
and returns a logical, which is TRUE if any value in the vector is NA. More
useful, perhaps, is the is.na() function: if we have a vector named vec, a
call to is.na(vec) returns a vector of logicals, one for each element in vec,
giving TRUE for the elements that are NA and FALSE for those that are not.
We can also use which(is.na(vec)) to find the numeric indices of the
missing elements. Here, we show an example of a vector with NA values and
some examples of which operations can, and cannot, be performed on them.
> (nax <- c(101, 102, NA, 104))
[1] 101 102 NA 104
> nax * 2 # Arithmetic on NAs gives NAs...
[1] 202 204 NA 208
> nax >= 102 # ...as do comparisons
[1] FALSE TRUE NA TRUE
> mean (nax) # One NA affects the computation
[1] NA
> mean (nax, na.rm = TRUE) # na.rm = TRUE excludes NAs
[1] 102.3333
> is.na (nax) # Locate NAs with logical vector
[1] FALSE FALSE TRUE FALSE
> which (is.na (nax)) # Numeric indices of NAs
[1] 3
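The point about comparing directly to NA, and the anyNA() function, can be sketched as follows. The variable thing is the one named in the text; counting missing values with sum(is.na()) is our addition.

```r
thing <- 3
thing == NA        # NA, not FALSE
thing <- NA
thing == NA        # NA, not TRUE
is.na(thing)       # TRUE

nax <- c(101, 102, NA, 104)
anyNA(nax)         # TRUE
sum(is.na(nax))    # 1, the number of missing values
```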
When your data has NA or other special values, you are faced with a deci-
sion about how to handle them. Generally they can be left alone, replaced, or
removed. Removing missing values from a single vector is easy enough; the
command vec[!is.na(vec)] will return the set of non-missing entries in
vec. A more sophisticated alternative is the na.omit() function, which not
only deletes the missing values but also keeps track of where in the vector they
used to be. is information is stored in the vectors “attributes,” which are extra
pieces of information attached to some R objects.
> nax[!is.na (nax)] # Return the non-missing values
[1] 101 102 104
> (nay <- na.omit (nax)) # This keeps track of deleted ones
[1] 101 102 104
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"
Data cleaners will very often want to record information about the original
location of discarded entries. In this example, these can be extracted with a
command like attr(nay, "na.action").
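Replacing missing values uses the same logical subscript as removing them. The sketch below is illustrative only: the replacement value 0 and the names filled and dropped are our assumptions, and whether 0 is a sensible substitute depends entirely on the data.

```r
nax <- c(101, 102, NA, 104)

# Replace missing entries in a copy, leaving the original untouched
filled <- nax
filled[is.na(filled)] <- 0
filled                           # 101 102 0 104

# Recover the positions that na.omit() removed
nay <- na.omit(nax)
dropped <- as.integer(attr(nay, "na.action"))
dropped                          # 3
```

Working on a copy keeps the original vector available, so the locations of the missing values are never lost.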
Things get more complicated when the vector is one of many that need to
be treated in parallel, perhaps because the vector is part of a more compli-
cated structure like a matrix or data frame. Often if an entry is to be deleted, it
needs to be deleted from all of these parallel items simultaneously. We talk more
about these structures, and how to handle missing values in them, in Chapter
3. (We also note that most modeling functions in R have an argument called
na.action that describes how that function should handle any NA values it
encounters. is is outside our focus on data cleaning.)
2.4.3 Indexing with NAs
When an NA appears in an index, NA is produced, but the actual effect that R
produces can be surprising. This arises often in data cleaning, since it is
common to have a vector (usually fairly long and as part of a larger data set) with
many NAs that you may not be aware of. Suppose we have a vector of data b
and another vector of indices a, and we want to extract the set of elements of
b for which a has the value 1, like this: b[a == 1]. The comparison a == 1
will return NA wherever a is missing, and b[NA] produces NA values. So the
result is a vector with both the entries of b for which a == 1 and also one NA
for every missing value in a. This is almost never what we want. If we want to
extract the values of b for which a is both not missing and also equal to 1, we
have to use the slightly clunky expression b[!is.na(a) & a == 1]. This
example shows what this might look like in practice.
> (b <- c(101, 102, 103, 104))
[1] 101 102 103 104
> (a <- c(1, 2, NA, 4))
[1] 1 2 NA 4
> b[!is.na (a) & a == 2] # We probably want this...
[1] 102
> b[a == 2] # ...and not this.
[1] 102 NA
In the following example, we show how two commands that look alike are
treated slightly differently by R.
> b[a[2]] # a[2] = 2; extract element 2 of b
[1] 102 # ... which is 102
> b[a[3]] # a[3] is NA
[1] NA
> (a <- as.logical (a)) # Now convert a to logical
[1] TRUE TRUE NA TRUE
> b[a[3]] # a[3] is NA
[1] NA NA NA NA
In the first example of b[a[3]], the value in a[3] was a numeric NA, so R
treated the subscripting operation as a numeric one. It returned only one value.
In the second example, a[3] was a logical NA, and when R subscripts with a
logical – even when that logical value is NA – it recycles the subscript to have
the same length as the vector being subscripted (we saw this in Section 2.1.4).
The lesson here is that when you have an NA in a subscript, R may return
something other than what you expect.
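One alternative to the is.na() guard, not used in the examples above, is to wrap the comparison in which(). Because which() returns only the positions where its argument is TRUE, the NA entries are dropped. A minimal sketch, reusing the vectors a and b from this section:

```r
b <- c(101, 102, 103, 104)
a <- c(1, 2, NA, 4)

b[a == 2]                # 102 NA
b[!is.na(a) & a == 2]    # 102
b[which(a == 2)]         # 102
```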
2.4.4 NaN and Inf Values
A different kind of special value can arise when a computation is so big that
it overflows the ability of the computer to express the result. Such a value is
expressed in R as Inf or -Inf. On 64-bit machines, a result larger than about
1.79 × 10^308 is represented as Inf; it most often appears when a positive number
is accidentally divided by zero. Inf values are not missing, and is.na(Inf) produces
FALSE. Another special value is NaN, “not a number,” which is the result of
certain specific computations such as 0/0 or Inf + -Inf or computing the
mean of a vector of length 0. Unlike Inf, an NaN value is considered to be
missing. As with NA values, Inf and NaN values take over nearly every computation
in which they are evaluated. There are rules for when more than one is
present – for example, Inf + NA gives NA, but NaN + NA gives NaN (although R’s
documentation warns that the result of mixing NaN and NA may depend on the
platform). From a
data cleaning perspective, all of these values cause trouble and you will
generally want to identify any of these values early on. The function is.finite()
is useful here; this produces TRUE for numbers that are neither NA nor NaN
nor Inf or -Inf. So in that sense it serves as a check on valid values. To see
whether every element of a numeric vector vec consists of values that are not
any of these special ones, use the command all(is.finite(vec)).
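The behavior of these special values can be sketched at the console. The vector vec below is made up for illustration.

```r
1 / 0                    # Inf
-1 / 0                   # -Inf
0 / 0                    # NaN
mean(numeric(0))         # NaN

is.na(Inf)               # FALSE
is.na(NaN)               # TRUE
is.nan(NA)               # FALSE

vec <- c(1, NA, NaN, Inf, -Inf, 5)
is.finite(vec)           # TRUE FALSE FALSE FALSE FALSE TRUE
all(is.finite(vec))      # FALSE
```

Note that is.na() is TRUE for NaN, while is.nan() is FALSE for NA, so is.nan() can be used to tell the two apart.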
2.4.5 NULL Values
A final sort of special value is the R value NULL. A NULL is an object with zero
length, no contents and no class. (A vector of length 0 has no contents, but
since it has a class – numeric, logical, or something else – it is not NULL.) In
data cleaning, NULLs most often arise when attempting to access an element
of a list, or a column of a data frame, which does not exist. We discuss this in
Section 3.4.3. For the moment, the important point is that we can test for NULL
values with the is.null() function, and that if you index using a NULL value
the result will be a vector of length 0.
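A short sketch of these points follows. The list lst is hypothetical and is used only to show a NULL arising from a missing element; lists themselves are covered in Chapter 3.

```r
is.null(NULL)            # TRUE
length(NULL)             # 0
is.null(numeric(0))      # FALSE: length 0, but it has a type

lst <- list(first = 1:3)
lst$second               # NULL, since there is no element named second

b <- c(101, 102, 103)
b[NULL]                  # numeric(0)
length(b[NULL])          # 0
```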
2.5 The table() Function
The table() function is so important in data cleaning that it merits its own
section. This command, as its name suggests, produces a table giving, for each
of the unique values in its argument, the number of times that value appears.
In this example, we will create a vector with some color names in it, and we will
add in an NA as well.
> vec <- rep (c("red", "blue", NA, "green"), c(3, 2, 1, 4))
> vec
[1] "red" "red" "red" "blue" "blue" NA
[7] "green" "green" "green" "green"
> table (vec)
vec
blue green red
2 4 3
There are a few things to notice here. First, the ordering of the results in
the table is alphabetical, rather than being determined by the order the entries
appear in the vector vec. Second, the resulting object is not quite a named
vector, as you can see by the word vec that appears above the word blue.
(We omit this line in many future displays to save space.) In fact, this object
has class table, but it can be treated like a named vector – so, for example,
table(vec)["green"] produces 4. Third, by default table omits NA as
well as NaN values. In data cleaning this is almost never what we want. There
are two different arguments to the table() function that serve to declare how
you want missing values to be treated. The first of these is named useNA. This
argument takes the character values "no" (meaning exclude NA values, which
was the default as seen earlier), "ifany" (meaning to show an entry for NAs
if there are any, but not if there aren’t) and "always", meaning to show an
entry for NAs whether there are any NA values or not. In our current example,
where there is one NA, the table() command with useNA set to "ifany"
or "always" will produce output like this:
> table (vec, useNA = "always")
blue green red <NA>
2 4 3 1
Notice that R displays the entry for NA values as <NA>, with angle brackets.
This makes it easier to use the characters "NA" as a regular character string,
perhaps for “North America” or possibly “sodium.” (This angle bracket usage
will appear again later.) R will not be confused if you have both NA values and
also actual character strings with the angle brackets, such as "<NA>", but it
is definitely a bad practice. To see what happens when there are no NAs, let us
look at the same vector without its missing entry, which is number 6.
> table (vec[-6], useNA="ifany")
blue green red
2 4 3
> table (vec[-6], useNA="always")
blue green red <NA>
2 4 3 0
For data cleaning purposes, we almost always want to know about missing
values, so we will almost always want useNA to be "ifany" or "always".
The second missing-value argument, exclude, allows you to exclude specific
values from the table. By default, exclude has the value c(NA, NaN),
which is why those values do not appear in tables. Most commonly we set
this value to NULL to signify that no entries should be excluded, although
sometimes we exclude certain very common values. Here we might want to
exclude the common value green while tabulating all other values, including
NAs. e following example shows how we can do that. It also shows the use
of exclude = NULL.
> table (vec, exclude="green")
blue red <NA>
2 3 1
> table (vec, exclude=NULL)
blue green red <NA>
2 4 3 1
It is possible to supply both useNA and exclude at the same time, but the
results may not be what you expect. We recommend using either useNA or
exclude to display missing values in every table.
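Because a one-way table behaves much like a named vector, its entries can be extracted, named, and sorted. The following sketch rebuilds the vec example; sorting the table by count is our addition and is not discussed above.

```r
vec <- rep(c("red", "blue", NA, "green"), c(3, 2, 1, 4))
tab <- table(vec, useNA = "ifany")

tab["green"]                   # the count for one value
names(tab)                     # "blue" "green" "red" NA
sort(tab, decreasing = TRUE)   # most common values first
sum(is.na(vec))                # 1, matching the <NA> entry
```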
2.5.1 Two- and Higher-Way Tables
If we give two vectors of the same length to the table() function, the result
is a two-way table, also called a cross-tabulation. For example, suppose we had
a vector called years, one for each transaction in our data set, with values
2015, 2016, and 2017; and suppose we also had a vector called months,
of the same length, with values such as "Jan", "Feb", and so on. Then
table(years, months) would produce a 3 × 12 table of counts, with
each cell in the table telling how many entries in the two vectors had the values
for the cell. That is, if the columns are in calendar order, the top-left cell
would give the number of entries from
January 2015; the one to the right of that would give the number of entries for
February 2015; and so on. (If there are fewer than 12 months represented in
the data, of course, there will be fewer than 12 columns in the table.) This is an
important data cleaning task – to determine whether two variables are related
in ways we expect. If, for example, we saw no transactions at all in March 2016,
we would want to know why.
In R, a two-way table is treated the same as a matrix; we discuss matrices
in detail in the following chapter. For very large vectors, the data.table()
function in the data.table package (Dowle et al., 2015) may prove more
efficient than table(). Three- and higher-way tables are produced when the
arguments to table() are three or more equal-length vectors. These tables
are treated in R as arrays; we give an example in Section 3.2.7. The xtabs()
function is also useful for creating more complex tables.
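A two-way table of years by months might be sketched like this. The six transactions are invented for illustration, and the name tab2 is ours. Because months is a character vector here, its columns come out alphabetically rather than in calendar order.

```r
# Hypothetical transaction data, invented for this sketch
years  <- c(2015, 2015, 2016, 2016, 2016, 2017)
months <- c("Jan", "Feb", "Jan", "Jan", "Mar", "Feb")

tab2 <- table(years, months)
tab2                     # columns appear alphabetically: Feb, Jan, Mar
dim(tab2)                # 3 3
tab2["2016", "Jan"]      # 2

# xtabs() builds the same cross-tabulation from a formula
xt <- xtabs(~ years + months)
all(xt == tab2)          # TRUE
```

A zero in any cell of such a table, like March 2015 here, is exactly the kind of gap worth investigating.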
2.5.2 Operating on Elements of a Table
The table() command counts the number of observations that fall into a
particular category. In the example above, the table(years, months)
command produces a two-way table of counts. Often we want to know more than
just how many observations fall into a cell. R has several special-purpose
functions that operate on tables. The prop.table() function takes, as its first
argument, the output from a call to table(), and depending on its second
argument produces proportions of the total counts in the table by cell, or by
row, or by column. In this example we set up three vectors, each of length 15.
Then we show the effect of calling table(), and of calling prop.table() on
the result. By default, prop.table() computes the proportions of
observations in each cell of the table. In the final example, we use the second argument
of 2 to compute the proportions within each column; supplying 1 would have
produced the proportions within each row.
> yr <- rep (2015:2017, each=5)
> market <- c("a", "a", "b", "a", "b", "b", "b", "a", "b",
"b", "a", "b", "a", "b", "a")
> cost <- c(64, 87, 71, 79, 79, 91, 86, 92, NA,
55, 37, 41, 60, 66, 82)
> (tab <- table (market, yr))
yr
market 2015 2016 2017
a 3 1 3
b 2 4 2
> prop.table (tab) # These proportions sum to 1
yr
market 2015 2016 2017
a 0.20000000 0.06666667 0.20000000
b 0.13333333 0.26666667 0.13333333
> prop.table (tab, 2) # Each column's proportions sum to 1
yr
market 2015 2016 2017
a 0.6 0.2 0.6
b 0.4 0.8 0.4
The margin.table() command produces the marginal totals from a
table – that is, row or column totals (controlled by the second argument) for a
two-way table, and corresponding sums for a higher-way one. The
addmargins() function incorporates those totals into the table, producing a new
row or column named Sum (or both). This is often a summary statistic we want,
but watch out – the convention regarding the second argument of
addmargins() is not the same as that of prop.table() and margin.table().
This example shows addmargins() in action.
> addmargins (tab) # append row and column sums
yr
market 2015 2016 2017 Sum
a 3 1 3 7
b 2 4 2 8
Sum 5 5 5 15
> addmargins (tab, 2) # append a Sum column of row totals
yr
market 2015 2016 2017 Sum
a 3 1 3 7
b 2 4 2 8
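The margin.table() function and the differing convention of addmargins() can be sketched with the same data. The vectors are repeated here so that the sketch stands on its own.

```r
yr     <- rep(2015:2017, each = 5)
market <- c("a", "a", "b", "a", "b", "b", "b", "a", "b",
            "b", "a", "b", "a", "b", "a")
tab <- table(market, yr)

margin.table(tab, 1)     # row totals: a = 7, b = 8
margin.table(tab, 2)     # column totals: 5 for each year
margin.table(tab)        # grand total: 15

# addmargins(tab, 1) adds a Sum row; addmargins(tab, 2) adds a Sum column
dim(addmargins(tab, 1))  # 3 3
dim(addmargins(tab, 2))  # 2 4
```

So a second argument of 1 gives row totals in margin.table(), but in addmargins() it adds a new row holding the column totals.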
We might also want to know the average, standard deviation, or maximum of
entries in a numeric variable, broken down by which cell they fall into. In our
example, we might want the maximum cost among the three observations
from 2015 with market a, and for the two from 2015 and market b, and so
on. For this purpose we use the tapply() function, whose name reminds us
that it applies a function to a table. This function’s arguments are the vector
on which to do the computation (in our example, cost), an argument named
INDEX describing the grouping (here, we might use the vector yr), and then
the function to be applied. The following example shows tapply() at work.
In the first line, we use the min() function to produce the minimum value for
each year – but an NA is produced for 2016 since one cost for that year is
NA. We can pass the na.rm = TRUE argument into tapply(), which then
passes it into min() as in the following example, if we want to compute the
minimum value among non-missing entries.
> tapply (cost, yr, min) # find minimum within each yr
2015 2016 2017
64 NA 37
> tapply (cost, yr, min, na.rm = TRUE)
2015 2016 2017
64 55 37
It is possible to extend this example to the two-way case of minimum cost,
or another statistic, by both market and year. Here the tabularization part, rep-
resented by the argument INDEX, needs to be a list. We discuss lists starting
in Section 3.3; for the moment, just know that a list is required when group-
ing with more than one vector. In the first example as follows, we compute
the mean of the cost values for each combination of market and year (using
na.rm = TRUE as above, and the list() function to construct the list). In
the second example, we show how we can supply our own function “in line,”
which makes it more transparent than if we had written a separate function.
The details of writing functions are covered in Chapter 5, but here our function
takes one argument, named x, and returns the value given by the sum of the
squares of the entries of x. (In this example, we pass the na.rm = TRUE
argument directly to sum to keep our function simpler.) The tapply() function
is in charge of calling our function six times, once for each cell of the table.
> tapply (cost, list (market, yr), mean, na.rm = TRUE)
2015 2016 2017
a 76.66667 92.00000 59.66667
b 75.00000 77.33333 53.50000
> tapply (cost, list (market, yr),
function (x) sum (x^2, na.rm = TRUE))
2015 2016 2017
a 17906 8464 11693
b 11282 18702 6037
2.6 Other Actions on Vectors
In this section, we describe additional actions on vectors that we find
particularly important for data cleaning. These include rounding numeric values,
sorting, set operations, and the important topics of identifying duplicates and
matching.
2.6.1 Rounding
R operates on numeric vectors using double-precision arithmetic, which means
that often there are more significant digits available than are useful. Results will
often need to be displayed with, say, two or three significant digits. The natural
way to prepare displays like this is through formatting the numbers – that
is, changing the way they display, but not their actual values. We discuss
formatting in Section 4.2. But sometimes we want to change the numbers
themselves, perhaps to force them to be integers or to have only a few
significant digits. The round() function and its relatives do this. The round()
function lets the
user specify the number of digits to the right of the decimal point to be saved;
the signif() function lets him or her specify the total number of significant
digits retained. So round(123.4567, 3) produces 123.457, while
signif(123.4567, 3) produces 123. A negative second argument
produces rounding to the nearest power of 10, so round(123.4567, -1) rounds
to the nearest 10 and produces 120, while round(123.4567, -2)
rounds to the nearest 100 and produces 100. The trunc() function discards
the part after the decimal point and produces an integer; floor() and ceiling()
round to the next lower and next higher integer, respectively, so floor(-3.4)
is -4 while trunc(-3.4) is -3. Rounding of problematic entries (like those
that end in a 5) can be affected by floating-point error (see Section 1.3.3).
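The rounding functions above can be compared side by side. The last two lines are our addition: R's documentation for round() says that an exact half is expected to be rounded to the even digit (the IEC 60559 convention), which is a separate effect from floating-point representation error.

```r
round(123.4567, 3)       # 123.457
signif(123.4567, 3)      # 123
round(123.4567, -1)      # 120
round(123.4567, -2)      # 100

trunc(-3.4)              # -3
floor(-3.4)              # -4
ceiling(-3.4)            # -3

# An exact half goes to the even digit
round(0.5)               # 0
round(2.5)               # 2
```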
2.6.2 Sorting and Ordering
It is common to have to sort the elements of a vector, and the sort() function
performs that task in R. By default, the sort is from smallest to largest, but the
decreasing = TRUE argument will reverse the order. There are two minor
complications. First, sort() will drop NA and NaN values by default, giving a
vector shorter than the original when these values are present. This behavior
is controlled by the na.last argument, which itself defaults to NA. If set to
TRUE, this argument will have the sort() function place NA and NaN values
at the end, and, if FALSE, at the beginning of the sorted output.
A second complication is in sorting character vectors. Sorting in this case is
alphabetical, of course, so if the characters are text representations of numbers
such as "1", "2", "5", "10", and "18", the resulting output, sorted
alphabetically, will be "1", "10", "18", "2", and "5". Moreover, the sorting order
depends on the character set and locale being used. We mentioned this in
Section 1.4.6 and address it further in Section 4.5.
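Both complications can be seen in a short sketch. The vectors x and ch are made up, and converting with as.numeric() before sorting is one possible remedy rather than the only one.

```r
x <- c(3, NA, 1, 2)
sort(x)                        # 1 2 3 (the NA is dropped)
sort(x, decreasing = TRUE)     # 3 2 1
sort(x, na.last = TRUE)        # 1 2 3 NA
sort(x, na.last = FALSE)       # NA 1 2 3

# Character sorting treats digits as text
ch <- c("1", "2", "5", "10", "18")
sort(ch)                       # "1" "10" "18" "2" "5"
sort(as.numeric(ch))           # 1 2 5 10 18
```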
The related order() function returns a set of indices that you can use to sort
a vector. This is useful when you want to re-arrange one vector’s values in the
order specified by a second vector. (If that sounds as if it wouldn’t be a common
task, wait until Section 3.5.4.) In this example, we have a vector of names, and
a vector of scores, and we want the names in ascending order of score.
> nm <- c("Freehan", "Cash", "Horton",
"Stanley", "Northrop", "Kaline")
> scores <- c(263, 263, 285, 259, 264, 287) # 2 tied at 263
> nm[order(scores)] # ascending order of score
[1] "Stanley" "Freehan" "Cash"
[4] "Northrop" "Horton" "Kaline"
> nm[order(scores, nm)] # tie broken by nm
[1] "Stanley" "Cash" "Freehan" # (alphabetically)
[4] "Northrop" "Horton" "Kaline"
> nm[order (scores, decreasing = TRUE)] # descending
[1] "Kaline" "Horton" "Northrop"
[4] "Freehan" "Cash" "Stanley"
As in the example, the order() function can be given more than one vector.
In this case, the second vector is used to break ties in the first; if a third vector
were supplied, it would be used to break any remaining ties, and so on. It is
very common to re-order a set of data that has time indicators (month and year,
maybe) from oldest to newest. The order() function has the same na.last
argument that sort() has, although its default value is TRUE.
2.6.3 Vectors as Sets
Often we need to find the extent to which two vectors have values that overlap.
For example, we might have customer data from two sources and we want
to determine the extent to which the customer IDs agree; or we might want
to find the set of states in which none of our customers reside. These call for
techniques that treat vectors as sets and that will normally be most useful
when the data is a small number of integers, character data, or factors, about
which we say more in Section 4.6. They can be used with non-integer data
as well, but as always we cannot rely on two floating-point numbers that we
expect to be equal actually being equal.
The essential set membership operation is performed by the %in% function.
R has a few functions with names like this, surrounded by percentage signs.
This allows us to use a command like a %in% b, rather than the equivalent,
but perhaps less transparent, is.element(a, b). The return value is a
vector the same length as a, with a logical indicating whether each element of a
is found anywhere in b. In data cleaning we very often tabulate the result of
this function call; so a command like table(a %in% b) produces a table of
FALSE and TRUE, giving the number of items in a that were not found in b,
and the number that were. For this purpose, an NA value in a matches only an
NA in b, and similarly an NaN value in a matches only an NaN value in b. In
this example, we compare some alphanumeric characters to the built-in data
set letters containing the 26 lower-case letters of the alphabet.
> c("g", "5", "b", "J", "!") %in% letters
[1] TRUE FALSE TRUE FALSE FALSE
> table (c("g", "5", "b", "J", "!") %in% letters)
FALSE TRUE
3 2
The union(), intersect(), and setdiff() functions produce the
union, intersection, and difference between two sets. This example shows
those functions in action.
> union (c("g", "5", "b", "J", "!"),
letters) # elements in either vector
[1] "g" "5" "b" "J" "!" "a" "c" "d" "e" "f" "h" "i" "j"
[14] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
[27] "x" "y" "z"
> intersect (c("g", "5", "b", "J", "!"),
letters) # elements in both vectors
[1] "g" "b"
> setdiff (c("g", "5", "b", "J", "!"),
letters) # elements of a not in b
[1] "5" "J" "!"
2.6.4 Identifying Duplicates and Matching
Another data cleaning task is to find duplicates in vectors. The
anyDuplicated() function tells you whether any of the elements of a vector are
duplicates; it returns the index of the first duplicated element, or 0 if there
are none. The unique() function extracts only the set of distinct values
(including, by default, NA and NaN). The distinct values appear in the output
in the order in which they appear in the input; for data cleaning purposes we
will often sort those unique values.
Often it will be important to know which elements are duplicates. The
duplicated() function returns a logical vector with the value TRUE for
the second and subsequent entries in a set of duplicates. However, the first
entry in a set of duplicates is not indicated. For example,
duplicated(c(1, 2, 1, 1)) returns FALSE FALSE TRUE TRUE; the first 1 is not
considered duplicated under this definition. (Alternatively, the fromLast
= TRUE argument reads from the end of the vector back to the
beginning, but again the “first” member of a set of duplicates is not indicated.)
Combining a call with fromLast = FALSE and one with fromLast
= TRUE, for example by taking the union() of the indices that each call flags,
identifies all duplicates.
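Here is one way the two calls might be combined, as a sketch. The names fwd and bwd are ours; the logical "or" operator gives a flag for every element, while union() applied to the which() indices gives the positions.

```r
x <- c(1, 2, 1, 1)
fwd <- duplicated(x)                     # FALSE FALSE TRUE TRUE
bwd <- duplicated(x, fromLast = TRUE)    # TRUE FALSE TRUE FALSE

fwd | bwd                                # TRUE FALSE TRUE TRUE
union(which(fwd), which(bwd))            # 3 4 1
anyDuplicated(x)                         # 3, the index of the first duplicate
```

Note that union() is applied to the index vectors, not to the logical vectors themselves; the union of two logical vectors would contain only the distinct values TRUE and FALSE.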
A common task is to find all the entries that are duplicated anywhere in the
data set (or that are never duplicated). One way to do this is via table(). Any
value that appears more than once is, of course, duplicated (but remember that
floating-point numbers might not match exactly). In this example, we construct
a vector from the lower-case letters, but add a few duplicates.
> let <- c(letters, c("j", "j", "x"))
> (tab <- table (let))
let
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
> which (tab != 1) # table locations where duplicates appear
j x
10 24 # 10th & 24th table entries aren't ones
> names (tab)[tab != 1]
[1] "j" "x"
It is often useful to use table() twice in a row. This example counts the
number of entries that appear once, twice, and so on in the original data. Consider
this example:
> table (table (let))
1 2 3
24 1 1 # 24 entries are 1, one is 2, one is 3
The last line shows that there are 24 entries in let that appear once; one entry,
x, that appears twice; and one entry, j, that appears three times. We use this in
almost every data cleaning problem to find entries that appear more often than
we expect. In a real application, we might have tens of thousands of elements
and only a few duplicates. The which(tab != 1) command shows us the
elements that are duplicated, but not how many times each one appears; the
table(table(let)) command shows us how many duplicates there are,
but not which letter goes with which count.
Another important task is matching, which is where we identify where, in a
vector, we can find the values in another vector. We will find this particularly
useful when merging data frames in Section 3.7.2. There are two ways to
handle elements that do not match; they can be returned as NA, preserving the
length of the original argument in the length of the return value, or, with the
nomatch = 0 argument, they can be returned as 0, which allows the return
value to be used as an index. In this example, we match two sets of names.
> nm <- c("Jensen", "Chang", "Johnson",
"Lopez", "McNamara", "Reese")
> nm2 <- c("Lopez", "Ruth", "Nakagawa", "Jensen", "Mays")
> match (nm, nm2)
[1] 4 NA NA 1 NA NA
> nm2[match (nm, nm2)]
[1] "Jensen" NA NA "Lopez" NA NA
The third command tells us that the first element of nm, which is Jensen,
appears in position 4 of nm2; the second element of nm, Chang, does not appear
in nm2, and so on. We can extract the elements that matched from the nm2
vector as in the last line – but the NA entries in the output of match() produce
NAs in the vector of names. An easier approach is to supply the nomatch = 0
argument, as in this example.
> match (nm, nm2, nomatch = 0)
[1] 4 0 0 1 0 0
> nm2[match (nm, nm2, nomatch = 0)]
[1] "Jensen" "Lopez"
We use match() (or its equivalent) in any data cleaning problem that requires
combining two data sets. Understanding how match() works makes data
cleaning easier. The match() function is, in fact, a more powerful version of %in%.
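The relationship between match() and %in% can be checked directly. This sketch reuses the two name vectors from the example above.

```r
nm  <- c("Jensen", "Chang", "Johnson", "Lopez", "McNamara", "Reese")
nm2 <- c("Lopez", "Ruth", "Nakagawa", "Jensen", "Mays")

nm %in% nm2                          # TRUE FALSE FALSE TRUE FALSE FALSE
match(nm, nm2, nomatch = 0) > 0      # the same logical vector
nm[nm %in% nm2]                      # "Jensen" "Lopez"
```

The %in% operator reports only whether a match exists, while match() also reports where it was found.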
2.6.5 Finding Runs of Duplicate Values
During a data cleaning problem, it often happens that a particular identifier – a
name or account number, perhaps – appears many times in an input data set. As
an example we might be given a list of payments, with each payment identified
by a customer number and each customer contributing dozens of payments.
It will be useful to count the number of times each repeated item appears. We
also use this on logical vectors to find, for example, the locations and lengths of
sets of payments that are equal to 0. The rle() function (the name stands for
“run length encoding”) does exactly this: given a vector, it returns the number
of “runs” – that is, repetitions – and each run’s length. In this example, we show
what the output of the rle() function looks like.
> rle (c("a", "b", "b", "a", "c", "c", "c"))
Run Length Encoding
lengths: int [1:4] 1 2 1 3
values : chr [1:4] "a" "b" "a" "c"
This output shows that the vector starts with a run of length 1 (the first element
in the lengths vector) with value a (the values vector); then a run of length
2 with value b; and so on. The output is actually returned in the form of a list
with two parts named lengths and values; in Section 3.3, we discuss how
to access the pieces of a list individually.
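The use of rle() on a logical vector, mentioned above for finding runs of zero payments, might look like the following sketch. The payment amounts and the names pay, runs, and starts are invented for illustration, and the $ notation for pulling pieces out of a list is explained in Section 3.3.

```r
# Hypothetical payment amounts, invented for this sketch
pay <- c(50, 0, 0, 0, 25, 0, 0, 40)
runs <- rle(pay == 0)

runs$lengths                         # 1 3 1 2 1
runs$values                          # FALSE TRUE FALSE TRUE FALSE

# Starting position and length of each run of zero payments
starts <- cumsum(c(1, head(runs$lengths, -1)))
starts[runs$values]                  # 2 6
runs$lengths[runs$values]            # 3 2
```

Each run starts one position after the previous runs end, which is why a cumulative sum of the earlier lengths gives the starting positions.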
2.7 Long Vectors and Big Data
Starting in version 3.0.0, R introduced something called a long vector, a special
mechanism that allows vectors to be much longer than before. Since there are
only 2^31 - 1 values of an integer, entries in a long vector beyond that point will
have to be indexed by double indices. Other than that, this extension should,
in principle, be invisible to users. One exception is that the match() function,
and its descendants, is.element() and %in%, do not work on long vectors.
On long vectors, table() can be very slow and the data.table package
provides some faster alternatives. R’s documentation suggests avoiding the use
of long vectors that are characters.
2.8 Chapter Summary and Critical Data Handling Tools
This chapter introduces R vectors, which come in several forms, primarily logical,
numeric, and character. The mode(), typeof(), and class() functions
give you information about the class of a vector. The set of is. functions like
is.numeric() returns a TRUE/FALSE result when an object is of the specified
mode, and the set of as. functions performs the conversion. Remember
that logicals are simpler than numerics, and numerics simpler than character,
and that converting from a simpler to a more complicated mode is straightfor-
ward. Converting from a more complicated to a simpler mode follows these
rules:
Converting character to numeric produces NA for things that aren’t numbers,
like the character strings "TRUE" or "$199.99".
Converting character to logical produces NA for any string that isn’t "TRUE",
"True", "true", "T", "FALSE", "False", "false" or "F".
Converting numeric to logical produces FALSE for a zero and TRUE for any
non-zero entry (and watch out for floating-point error here).
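The three rules above can be checked with a short sketch. The input strings are invented for illustration, and suppressWarnings() is used only to silence the warning that R issues when a conversion produces NA.

```r
# Character to numeric: things that aren't numbers become NA
x <- suppressWarnings(as.numeric(c("12", "3.5", "TRUE", "$199.99")))
x

# Character to logical: only a few spellings are recognized
y <- as.logical(c("TRUE", "true", "T", "yes", "F"))
y

# Numeric to logical: zero is FALSE and anything else is TRUE. The last
# entry "should" be zero but is not, because of floating-point error.
z <- as.logical(c(0, 1, -2, 0.1 + 0.2 - 0.3))
z
```

The final entry of z is TRUE because 0.1 + 0.2 - 0.3 evaluates to a tiny non-zero number, which is the floating-point trap the third rule warns about.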
Extracting and assigning subsets of vectors are critical parts of any data cleaning
project. We can use any of the modes as an index or “subscript” with which
to extract or assign. A logical subscript returns the values that match up with
its TRUE entries. Logical subscripts are extended by recycling where necessary
(but most often when we do this it is by mistake). A numeric subscript
returns the values specified in the subscript – and, unsurprisingly, numeric subscripts
are not recycled. The which() command identifies TRUE values in a
logical vector, so you can use that to convert a logical subscript to a numeric
one. Finally, a character subscript will extract, from a named vector, elements
whose names are present in the subscript (and, again, this kind of subscript is
not recycled).
Any kind of vector can have missing values, indicated by NA, and there are
a few other special values as well. Missing values influence computations they
are involved in, so we often want to supply an argument like na.rm = TRUE
to a function computing a sum, mean or other summary statistic on numeric
data. You should expect to encounter missing values in any data set from any
source and be prepared to accommodate them.
The table() function is critical to data cleaning. It tabulates a vector,
returning the number of times each unique value appears, with names corresponding
to the original values in the data set. Passing two or more vectors
to table() produces a two- or higher-way cross-tabulation. We recommend
adding the useNA = "ifany", useNA = "always", or exclude = NULL
arguments to ensure that table() counts and displays the number
of NA values, unless you’re certain no values are missing. Using table() on
the output of table() – as in table(table(x)) – tells us how many
items in a vector x appear once, twice, and so on. This is useful for detecting
entries that appear more often than expected.
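A minimal sketch of these uses of table() follows. The vector x of customer identifiers is hypothetical.

```r
# Hypothetical customer identifiers with one missing value
x <- c("c1", "c2", "c1", NA, "c3", "c1", "c2")

# By default the NA is silently left out of the tabulation
plain <- table(x)
plain

# With useNA = "ifany" the NA gets a cell of its own
counts <- table(x, useNA = "ifany")
counts

# How many distinct values appear once, twice, and three times?
table(table(x))
```

Comparing sum(plain) with length(x) is a quick check for whether any entries were dropped from the first tabulation.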
Using names() on the output of table() will produce the unique entries
in a vector, but we also use the unique() function to find these. We spend
a lot of energy in identifying duplicates, and the duplicated() function is
useful here – although, remember, it does not return TRUE for the first item in a
set of duplicates. The is.element() and %in% functions help determine the
extent to which two sets of values overlap; both of these are simpler versions
of the match() function, which is critical to combining data from different
sources.
3 R Data, Part 2: More Complicated Structures
3.1 Introduction
R data is made up of vectors, but, as you already know, there are more
complicated structures that consist of a group of vectors put together. In
this chapter, we talk about the three major structures in R that data handlers
need to know about. The most important of these is the data frame, in which,
eventually, almost all of our data will be held. But in order to build up to the
data frame, we first need to describe matrices and lists. A data frame is part
matrix, part list, and in order to use data frames most efficiently, you need to
be able to think of it in both ways. Furthermore, we do encounter matrices in
the data cleaning world, since the table() command can produce something
that is basically a matrix.
3.2 Matrices
A matrix (plural matrices) is essentially a vector, arrayed in a (two-dimensional)
rectangle. As with a vector, every element of a matrix needs to be of the same
type – logical, numeric, or character. Most of the matrices we will see will be
numeric, but it is also possible to have a logical matrix, typically for subscripting,
as we shall see. We start using the vector of 15 numbers, 101, 102, ..., 115,
to produce a 5 × 3 (i.e., five rows by three columns) numeric matrix.
> (a <- matrix (101:115, nrow = 5, ncol = 3))
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 102 107 112
[3,] 103 108 113
[4,] 104 109 114
[5,] 105 110 115
There are a couple of points to mention here. First, the matrix is filled column
by column, with the first column being filled before the second one starts. We
often intuitively expect the matrix to be filled row by row, because our data
comes in rows, and we read English left-to-right, but this is not how R works. If
you need to load your data into your matrix by rows, use the byrow = TRUE
argument. This arises when you copy a matrix off of a web page, for example;
we expect the entries to be read along the top line, but R stores them down the
first column. (We come back to this example in Section 6.5.3.)
Second, notice the row and column indicators such as [4,] and [,2]. In
the following section, we will see how to use those numbers to extract elements
from the matrix, or to assign new ones.
Third, the length() operator can be used on a matrix, but it returns the
total number of elements in the matrix. More often we want to know the num-
ber of rows and columns; that information is returned by the nrow() and
ncol() functions, or jointly by the dim() function, which gives the numbers
of rows and columns in that order:
> length (a)
[1] 15
> dim (a)
[1] 5 3
Fourth, in our example, we used the matrix() function to create the
matrix from one long vector. An alternative is to create a matrix from a set
of equal-length vectors. The cbind() function (“c” for column) combines a
set of vectors into a matrix column by column, while rbind() performs the
operation row by row. If the vectors are of unequal length, R will use the usual
recycling rules (Section 2.1.4). Again, all of the elements of a matrix need to be
of the same sort, so if any vector is of type character, the entire matrix will be
character.
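The first and fourth points can be illustrated with a short sketch. All of the values and names here are invented for the example.

```r
# Six values typed in reading order, left to right and top to bottom
vals <- c(1, 2, 3,
          4, 5, 6)

m_col <- matrix(vals, nrow = 2)                # filled down the columns
m_row <- matrix(vals, nrow = 2, byrow = TRUE)  # filled along the rows
m_col
m_row

# cbind() builds a matrix column by column, rbind() row by row
ids <- 1:3
nm  <- c("x", "y", "z")
cb <- cbind(ids, nm)   # 3 x 2, and character because nm is character
rb <- rbind(ids, nm)   # 2 x 3, also character
cb
```

Only m_row matches the layout in which the numbers were typed. In cb the numbers 1, 2, 3 have been converted to the character strings "1", "2", "3".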
As with vectors, arithmetic operations on matrices are performed element by
element, so A^2 squares each element of A and A * B multiplies two matrices
element by element. There are special symbols for matrix-specific operations:
for example, A %*% B performs the usual kind of matrix multiplication, t(A)
transposes a matrix, and solve() inverts a matrix. These operations do not
tend to come up much in data cleaning, but often, we want to perform an oper-
ation on a matrix row by row or column by column. We come back to these row
and column operations in Section 3.2.3.
3.2.1 Extracting and Assigning
Since a matrix is just a vector, it is possible to use a subscript just like the
one we used in Section 2.3.1 to pull out or replace an element. In the example
above, a[6] would produce 106 (remember that we count by columns first),
and a[6] <- 999 would replace that element with 999. However, it is much
more common to identify elements of a matrix by two subscripts, one for the
row and one for the column. These two subscripts are separated by a comma.
In our example, a[1,2] would produce 106, and a[1,2] <- 999 would
replace that value.
Of course, it is possible to ask for more than one entry at a time. In this
example, we ask for a 2 × 2 sub-matrix from our original matrix a:
> a[c(4, 2), c(3, 1)]
[,1] [,2]
[1,] 114 104
[2,] 112 102
The two rows we asked for, numbers 4 and 2, in that order, are returned, with
the corresponding entries from columns 3 and 1, in that order. Just as when we
use subscripts on a vector, we may use duplicate subscripts; a vector of negative
numbers indicates that the corresponding entries should be removed.
If you leave one of the two subscripts empty, you are asking for an entire row
or column. This command says “give me all the rows except for number 2, and
all the columns.”
> a[-2,]
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 103 108 113
[3,] 104 109 114
[4,] 105 110 115
Notice here that some rows have been renumbered. The row that had been
number 5 in the original a is now the fourth row. This is not surprising, but it
raises the question as to whether we might be able to keep track of rows that
have been deleted, since that would help us audit changes we have made to the
data. We will describe one way to do that using row names in Section 3.2.2.
In addition to using a numeric subscript, we can use a logical one. Logical
subscripts for rows or columns act exactly as logical subscripts for vectors (see
Section 2.3.1). Whether you use numeric or logical subscripts, subscripting a
matrix with row and column indices will return a rectangular object. To extract
values from, or assign new values to, a non-rectangular set of entries, you can
use a matrix subscript, which we describe in Section 3.2.5.
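As a brief sketch of logical and negative subscripts on the matrix a from above, we can select rows by a condition on one column. The cutoff of 102 is arbitrary.

```r
a <- matrix(101:115, nrow = 5, ncol = 3)

# Logical row subscript: keep the rows whose first column exceeds 102
keep <- a[, 1] > 102
sub1 <- a[keep, ]
sub1

# Negative subscripts: drop rows 1 and 2 and column 3
sub2 <- a[-c(1, 2), -3]
sub2
```

Both results are rectangular: sub1 has three rows and three columns, and sub2 has three rows and two columns.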
Demoting a Matrix to a Vector
In order to turn a matrix into a vector, use the c() function on it. Just as c()
creates vectors from individual elements (see, e.g., Section 2.1.1), it also cre-
ates vectors from matrices. In our example, c(a) will produce a vector of 15
numbers. The entries in that vector come from the first column, followed by
the second column, and so on. In order to extract data row by row, transpose
the matrix first, using the t() function in a command like c(t(a)).
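A short sketch shows the difference between the two orderings, using the same matrix a.

```r
a <- matrix(101:115, nrow = 5, ncol = 3)

by_col <- c(a)      # first column, then second column, and so on
by_row <- c(t(a))   # first row, then second row, and so on

head(by_col, 4)     # 101 102 103 104
head(by_row, 4)     # 101 106 111 102
```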
Sometimes, though, R produces a vector from a matrix when we did not
expect it. In this example, see what happens when we ask for, say, the second
column of a. Remember that a has five rows and three columns.
> a[,2]
[1] 106 107 108 109 110
The result of this operation is not a matrix with five rows and one column;
it is a vector of length 5. This reduction – or “demotion” – from a matrix to
a vector follows a general rule in R under which dimensions of length 1 are
usually removed (“dropped”) by default. This can cause trouble when you have
a function that is expecting a matrix, perhaps because it plans to use dim() to
find the number of rows. If you pass a single column of a matrix, that is, a vector,
such a function would call dim() on a vector, which returns the value NULL.
The way around this is to specify that this dropping should not take place, using
the drop = FALSE argument, like this:
> a[,2,drop = FALSE]
[,1]
[1,] 106
[2,] 107
[3,] 108
[4,] 109
[5,] 110
The result of that operation is a matrix with five rows and one column. When
building functions that take subsets of matrices, it is often a good idea to use
drop = FALSE to ensure that the resulting subset is itself a matrix and not a
vector value.
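To make the advice about functions concrete, here is a minimal sketch. The helper count_rows() is hypothetical, written only to show where drop = FALSE matters.

```r
# Hypothetical helper: report the number of rows in a subset of columns
count_rows <- function(m, cols) {
  sub <- m[, cols, drop = FALSE]   # stays a matrix even for a single column
  nrow(sub)
}

a <- matrix(101:115, nrow = 5, ncol = 3)
count_rows(a, 2)          # 5
count_rows(a, c(1, 3))    # 5

# Without drop = FALSE the single column is a plain vector with no dimensions
dim(a[, 2])               # NULL
```

Had count_rows() been written without drop = FALSE, the call with a single column would have returned NULL rather than 5.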
3.2.2 Row and Column Names
It is very convenient to have a matrix whose rows and columns have names.
We can assign (and extract) row and column names with the dimnames()
function, described in Section 3.3.2, and there are also functions named
rownames() and colnames() to do the same job. (There is also an equivalent
row.names() function, spelled with a dot, but, interestingly, there is
no col.names() function.) Rows and columns are named automatically
by the table() function (technically, a two-way table has class table, not
matrix, but that distinction will not matter here). We start this extended
example by constructing a table.
> yr <- rep (2015:2017, each = 5)
> market <- c(2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2)
> (tbl <- table (market, yr))
      yr
market 2015 2016 2017
     2    3    1    3
     3    2    4    2
Notice that the row-name entries ("2" and "3" under market) are not a
column of the table; they are merely identifiers. This table has three columns,
not four. Now we show the column names and demonstrate how they can be
changed using the colnames() function.
> colnames (tbl)
[1] "2015" "2016" "2017"
> colnames (tbl) <- c("FY15", "FY16", "FY17")
> tbl
      yr
market FY15 FY16 FY17
     2    3    1    3
     3    2    4    2
Once row or column names have been assigned, we can refer to them by name
as well as by number. This makes it possible to refer to a row or column in a consistent
way, without having to know its location. Notice, though, that dimension
names are characters, even if they look numeric. So, for example, tbl[2,] will
produce the second row of the matrix tbl, while tbl["2",] will produce the
row whose name is "2", regardless of what number that row has – and even if
earlier rows have been removed.
> tbl[2,]
FY15 FY16 FY17
   2    4    2
> tbl["2",]
FY15 FY16 FY17
   3    1    3
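Row names also give one way to keep track of deleted rows, the question raised in Section 3.2.1. In this sketch, the labels "r1" through "r5" are hypothetical names that we attach to the matrix a before removing a row.

```r
a <- matrix(101:115, nrow = 5, ncol = 3)
rownames(a) <- paste0("r", 1:5)    # hypothetical row labels

b <- a[-2, ]                       # remove the second row
rownames(b)                        # "r1" "r3" "r4" "r5"

# The row that was fifth is still found under its own name
b["r5", ]                          # 105 110 115

# Which of the original rows are no longer present?
removed <- setdiff(rownames(a), rownames(b))
removed                            # "r2"
```

Comparing the row names before and after a cleaning step provides a simple audit of which rows were dropped.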
3.2.3 Applying a Function to Rows or Columns
There are lots and lots of operations on matrices supported by R, but many
of them are mathematical and not useful in data cleaning. One operation that
does come up, though, is running a function separately on each row or column
of a matrix. A few of these are so common that they are built in. Specifically,
there are functions named colSums() and rowSums(), which compute all
of the column sums or row sums, and corresponding functions for the means,
colMeans() and rowMeans(). Very often, though, you want to apply some
custom function, such as the one that tells how many entries are NA or missing.
The facility for doing this is the apply() function, to which you supply the
matrix, the direction of travel (1 for across rows, 2 for down columns), and
then the function that is to be applied to each row or column. This last can be
a function built into R, a function you have written yourself, or even a function
defined on the fly.
> a <- matrix (101:115, 5, 3)
# These four commands produce identical results
> rowSums (a)
> apply (a, 1, sum)
# Pass argument na.rm into the sum() function
> apply (a, 1, sum, na.rm = T)
> apply (a, 1, function (x) sum (x))
[1] 318 321 324 327 330
# User-written command selects the second-smallest entry
# in each column
> apply (a, 2, function (x) sort(x)[2])
[1] 102 107 112
If each call to the function returns a vector of the same length, apply()
creates a matrix. In this example, we use the range() function to produce
two values for each column of a.
> apply (a, 2, range)
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 105 110 115
When apply() is used with a vector-valued function, such as range() in
the last example, the output is arranged in columns, regardless of whether the
operation was performed on the rows or the columns of the original matrix.
This does not always match our intuition, particularly when the operation was
performed on rows. In this example, we show the row-by-row ranges of the a
matrix and then transpose using the t() function.
> apply(a, 1, range)
[,1] [,2] [,3] [,4] [,5]
[1,] 101 102 103 104 105
[2,] 111 112 113 114 115
# Use t() to transpose that matrix
> t(apply(a, 1, range))
[,1] [,2]
[1,] 101 111
[2,] 102 112
[3,] 103 113
[4,] 104 114
[5,] 105 115
A difficulty arises when different calls to the function produce vectors of dif-
ferent lengths. In that case, R cannot construct a matrix and has to return the
results in the form of a list (we discuss lists in Section 3.3). This might arise, say,
when looking for the locations of unusual values by column. In this example, we
look for the locations in each column of values greater than 109 in the matrix a.
> apply (a, 2, function (x) which (x > 109))
[[1]]
integer(0)
[[2]]
[1] 5
[[3]]
[1] 1 2 3 4 5
This result tells us that the first column has no entries >109, the second column’s
fifth entry is >109, and all five entries in the third column are >109. In
general, you have to be aware that apply() might return a list if the function
being applied can return vectors of different lengths.
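One defensive habit, sketched here under the assumption that the code calling apply() needs to handle both cases, is to check the type of the result before using it. The cutoff of 100 in the second call is chosen only so that every column returns the same number of indices.

```r
a <- matrix(101:115, nrow = 5, ncol = 3)

# Different columns return different numbers of indices: the result is a list
res <- apply(a, 2, function(x) which(x > 109))
is.list(res)       # TRUE
lengths(res)       # 0 1 5

# Every column returns five indices: the result is a matrix
res2 <- apply(a, 2, function(x) which(x > 100))
is.matrix(res2)    # TRUE
```

The lengths() function, described in Section 3.3, gives the number of qualifying entries in each column whenever the result is a list.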
3.2.4 Missing Values in Matrices
One very common use of apply() is to count the number of missing values in
each row or column, since missing values always affect how we do data cleaning.
This code shows how to count the number of NA values in each column. To show
off some more of R’s capabilities, we use the semicolon, which allows multiple
commands on one line, and the multiple assignment operation, which lets us
assign several things at once.
> a <- matrix (101:115, 5, 3); a[5, 3] <- a[3, 1] <- NA; a
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 102 107 112
[3,] NA 108 113
[4,] 104 109 114
[5,] 105 110 NA
> apply (a, 2, function (x) sum (is.na (x)))
[1] 1 0 1
From the last command, we see there is one missing value in each of columns
1 and 3.
We saw how to use which() to identify missing values in a vector back
in Section 2.4, and the same command can also identify missing values in a
matrix. By default, which(is.na(vec)) will return the indices of vec with
missing values as if vec had been stretched out into a long vector (column by
column, as always). However, the arr.ind = TRUE argument will supply the
row and column indices of the items selected by which(). This is extremely
useful in tracking down a small number of missing values. In this example, we
use which() to identify the missing entries in a.
> which (is.na (a))
[1] 3 15
> which (is.na (a), arr.ind = TRUE)
row col
[1,] 3 1
[2,] 5 3
Here, which() returns a matrix with two named columns and two unnamed
rows. Of course, this approach is not limited to finding NAs. It can also be used
to find negative values, or anything else that is unexpected and needs to be
cleaned.
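The same approach applied to negative values might look like the following sketch. The matrix pay of payment amounts is hypothetical.

```r
# Hypothetical payments with two impossible negative amounts
pay <- matrix(c(10, 20, -5,
                40, 50, 60,
                70, -1, 90), nrow = 3)

# Row and column locations of every negative entry
bad <- which(pay < 0, arr.ind = TRUE)
bad
nrow(bad)    # 2
```

Remember that pay is filled column by column, so the two negative amounts sit in row 3 of column 1 and row 2 of column 3.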
3.2.5 Using a Matrix Subscript
In the last example, we saw how which() with arr.ind = TRUE returns
a matrix giving a vector of rows and a vector of columns that, together,
identify the cells that had NA values. One underused feature of R is that
we can use a matrix subscript, such as the one returned by which() with
arr.ind = TRUE to extract from, or assign to, another matrix. We can
also use the vector returned by the ordinary use of which(), but the matrix
approach sometimes makes it much easier to extract the necessary rows or
columns. In this example, we construct a matrix with five columns of data and
a sixth column named "Use". This final column tells us which of the data
columns should be extracted for each of the rows.
> b <- matrix (1:20, nrow = 4, byrow = TRUE)
> b <- cbind (b, c(3, 2, 0, 5))
> colnames (b) <- c("P1", "P2", "P3", "P4", "P5", "Use")
> b
     P1 P2 P3 P4 P5 Use
[1,]  1  2  3  4  5   3
[2,]  6  7  8  9 10   2
[3,] 11 12 13 14 15   0
[4,] 16 17 18 19 20   5
Since the first row’s value of Use is 3, we want to extract the third element of
that row; since the second row’s value of Use is 2, we want the second element
of that row; and so on. Without the ability to use a matrix subscript, we might
be forced to loop through the rows of b, but in R we can extract all these items
in one call. Our matrix subscript has two columns, one giving the rows from
which we are extracting (in this case, all the rows of b in order) and another
giving the column from which to extract (in this case, the values in the Use
column of b). Here we show how we can construct this matrix subscript and use it
to extract the relevant entries of b.
> (subby <- cbind (1:nrow(b), b[,"Use"]))
[,1] [,2]
[1,] 1 3
[2,] 2 2
[3,] 3 0
[4,] 4 5
> b[subby]
[1] 3 7 20
Notice that in this example the value of Use in the third row was zero – and
therefore no value was produced for that row of the matrix subscript (see “zero
subscripts” in Section 2.3.1). Negative values cannot be used in a matrix sub-
script.
As a real-life example of where this might occur, we were recently given a
matrix of customer payments. The first 96 columns contained monthly payment
amounts. The last column gave the number of the month with the last payment
in it. Our task was to extract the payment amount whose month appeared
in that final column. So if, in the first row, that column had the value 15, we
would have extracted the amount from the 15th column of the payment matrix;
and so on for the second and subsequent rows.
Two points here. First, notice that in our earlier example we extracted using
b[subby], with no additional commas. The matrix subscript defines both
rows and columns. Second, remember to use cbind() to construct the
subscript argument (our subby above). Make sure that the matrix subscript
really is a matrix, and not two separate vectors, or you will extract rows and
columns separately. Matrix subscripting works with names, too. If our matrix b
had had both row and column names, we could have used a character matrix in
exactly the same way as the numeric subby. In that case, b would need both
row and column names so that both columns of the subscript argument could
be character. We cannot have one vector be numeric and the other character,
because we need to combine them into a matrix, and all the entries in a matrix
have to be of the same type. It is also possible to have a logical matrix act as a
subscript – but the results are surprising and we do not recommend it.
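A character matrix subscript, and the use of a matrix subscript for assignment, can be sketched as follows. The matrix m and all of its names are hypothetical.

```r
# Hypothetical matrix with both row and column names
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("A", "B", "C")))

# Each row of the subscript names one cell: (row name, column name)
idx <- cbind(c("r1", "r2"), c("C", "A"))
vals <- m[idx]
vals             # 5 2

# The same subscript can be used on the left-hand side of an assignment
m[idx] <- 0L
m
```

After the assignment only the two named cells have changed; the other four entries of m keep their original values.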
3.2.6 Sparse Matrices
A sparse matrix is one whose entries are largely zero. For example, in a language
processing application we might form a matrix with words in the rows and documents
in the columns. Then a particular cell, say, the ijth one, would have a
zero if word i did not appear in document j, and since, in many examples, most
words do not appear in most documents that matrix might have a high proportion
of zeros. There are a number of schemes for representing sparse matrices.
The recommended Matrix package (Bates and Maechler, 2016) implements
many of these. We encounter sparse matrices in our work, but rarely in the
context of data cleaning, so we will not discuss them in this book.
3.2.7 Three- and Higher-Way Arrays
Three-way (and higher-way) matrices are called arrays in R. An array looks like
a matrix in that all of its elements need to be of the same type, but a three-way
array requires three subscripts, a four-way array requires four subscripts, and so
on. The only time we seem to have encountered such a thing in data cleaning is
when constructing a three- or higher-way table(). In this example, we show
a three-way table made from three vectors each of length 8, and then we extract
the value 3 from the second row of the first column of the first “panel.”
> who <- rep (c("George", "Sally"), c(2, 6))
> when <- rep (c("AM", "PM"), 4)
> worked <- c(T, T, F, T, F, T, F, T)
> (sched <- table (who, when, worked))
, , worked = FALSE
when
who AM PM
George 0 0
Sally 3 0
, , worked = TRUE
when
who AM PM
George 1 1
Sally 0 3
> sched[2,1,1]
[1] 3
Many commands that work on matrices, like apply() and prop.table(),
operate on arrays as well. You can also use c() on an array to produce a vector
– in this case, the first column of the first panel is followed by the second
column of the first panel, and so on. The aperm() function plays the role of
t() for higher-way arrays.
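As a sketch of apply() and aperm() at work on the three-way table from this section, we rebuild sched and then summarize and rearrange it. The choice of dimension order c(3, 1, 2) is just one possibility.

```r
who    <- rep(c("George", "Sally"), c(2, 6))
when   <- rep(c("AM", "PM"), 4)
worked <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
sched  <- table(who, when, worked)

# Sum over everything except the first dimension: entries per person
per_person <- apply(sched, 1, sum)
per_person                         # George 2, Sally 6

# Rearrange the dimensions so that worked comes first
flipped <- aperm(sched, c(3, 1, 2))
dim(flipped)                       # 2 2 2
flipped["FALSE", "Sally", "AM"]    # 3, the same cell as sched[2, 1, 1]
```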
3.3 Lists
A list is the most general type of R object. A list is a collection of things that
might be of different types or sizes; a list might include a numeric matrix, a
character vector, a function, another list, or any other R object. Almost every
modeling function in R returns a list, so it is important to understand lists when
using R for modeling, but we also need to describe lists because one special sort
of list is the data frame, which we describe in the following section.
Normally, we will encounter lists as return values from functions, but we can
create a list with the list() function, like this:
> (mylist <- list (alpha = 1:3, b = "yes", funk = log, 45))
$alpha
[1] 1 2 3
$b
[1] "yes"
$funk
function (x, base = exp(1)) .Primitive("log")
[[4]]
[1] 45
Lists also appear as the output from the split() function, which divides a
vector into (possibly unequal-length) pieces according to the value of another
vector. We use this frequently in data cleaning. For example, we might divide
a vector of people’s ages according to their gender. In this simple example, we
show how split() produces a list; later, in Section 3.5.1, we show how that
list can be put to use.
> ages <- c(26, 45, 33, 61, 22, 71, 43)
> gender <- c("F", "M", "F", "M", "M", "F", "F")
> split (ages, gender)
$F
[1] 26 33 71 43
$M
[1] 45 61 22
> split (ages, ages > 60)
$`FALSE`
[1] 26 45 33 22 43
$`TRUE`
[1] 61 71
It is worth noting that if the second argument – gender in this case – has missing
values, those values will be dropped from the output of split(). Notice
also that in the second example the names of the list elements have been surrounded
by backward quotes. This is for display, because FALSE and TRUE are
not valid names here, but the character strings "FALSE" and "TRUE" are.
The length of a list, as found using length(), is the number of elements,
regardless of how big each individual element is. The lengths() function
returns a vector of lengths, one for each element on the list. In our example,
length(mylist) returns the value 4, whereas lengths(mylist) returns
a vector with four lengths in it (including the length of 1 that is returned for the
function funk()). The str() command we described in Section 2.2.2 works
on lists as well. The resulting value printed to the screen gives a description of
every element on the list – one line for atomic elements and multiple lines for
lists within lists. This is one way to help understand the structure of your data
quickly.
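These three functions can be tried on the mylist example, as in this short sketch.

```r
mylist <- list(alpha = 1:3, b = "yes", funk = log, 45)

length(mylist)           # 4: the number of elements
len <- lengths(mylist)   # one length for each element
len

# A compact description of every element
str(mylist)
```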
3.3.1 Extracting and Assigning
In the first example in this section, the first three elements were given names
and the fourth was not. That output hints at how to extract items from a list. You
can use double square brackets – so mylist[[4]] will return 45 – or, if an
element has a name, you can use the dollar sign and the name – so mylist$b
will return "yes", and split(ages, ages > 60)$"TRUE" will return the
vector of ages > 60. Single square brackets can be used, with a numeric, logical,
or name subscript, but there’s a catch – single square brackets return a list,
not the contents of the list. This is useful if you want only a couple of pieces
of a list. For example, mylist[1:2] will return a list with the first two elements
of mylist, and mylist[1] will return a list with the first element of
mylist – not as a vector but as a list. A logical subscript will also work here:
mylist[c(T, T, F, F)] will return the same list as mylist[c(1,2)]
or mylist[c("alpha", "b")]. Most of the lists we run into will have
names, and we usually extract elements one at a time with the dollar sign, but
the distinction between single and double brackets is still important. Single
brackets create lists; double brackets extract contents. And what happens in our
example if you ask for mylist[[2:3]] or mylist[[c(F, T, T, F)]]?
Unsurprisingly, these commands generate errors.
When you request a list item using single brackets and a name that is not
present on the list, R returns a list with one NULL element; with double brackets
or the dollar sign, it returns the NULL itself. This is consistent with the rule
that says single brackets produce lists, while double brackets and dollar signs
extract contents. Using double brackets with a numeric subscript greater than
the number of elements in the list, such as mylist[[11]] in our example,
produces an error rather than a NULL.
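The rules in the last two paragraphs can be verified with a sketch like this one. The name "nope" is simply a name that does not appear on the list, and tryCatch() is used so that the error from the final extraction does not stop the script.

```r
mylist <- list(alpha = 1:3, b = "yes", funk = log, 45)

# Single brackets give a list; double brackets give the contents
class(mylist["alpha"])       # "list"
class(mylist[["alpha"]])     # "integer"

# A name that is not on the list
missing1 <- mylist["nope"]   # a list holding one NULL element
length(missing1)             # 1
is.null(mylist[["nope"]])    # TRUE
is.null(mylist$nope)         # TRUE

# A number beyond the end of the list is an error with double brackets
oob <- tryCatch(mylist[[11]], error = function(e) "error")
oob                          # "error"
```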
Of course, to extract elements of a list by name, we need to know the names.
We can determine the names a list has using the names() function. If the
list has no names at all, this function will return NULL; if some elements have
names, the names() function will return an empty string for those elements
with no names. This example shows the names of the mylist list.
> names(mylist)
[1] "alpha" "b" "funk" ""
We can also use the names() function on the left-hand side of an assignment
to change the names of the elements on a list. For example, the command
names(mylist)[4] <- "RPM" would change the name of the fourth element
of mylist to "RPM".
Unlike when you use single or double square brackets, when using the dollar
sign to extract an item, you don’t need its full name. (Technically, you can pass
the exact argument into double square brackets to control this behavior, but
we don’t.) You only need enough to identify the item unambiguously. In this
example, mylist$a would be enough to produce the same numeric vector
returned by mylist$alpha, but if there were two items on the list, say alpha
and algorithm, typing mylist$a would produce NULL. You would need to
specify at least mylist$alp in order to be unambiguous. It’s often convenient
to use these abbreviated names, but that approach is best suited for quick work
at the command line. We recommend using full names in functions and scripts,
to avoid confusion or even an error if new items get added to the list later.
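The behavior of abbreviated names can be seen in a small sketch. The list lst, with its two similarly named items, is hypothetical.

```r
# Hypothetical list with two items whose names share a prefix
lst <- list(alpha = 1:3, algorithm = "rle")

lst$alp                        # 1 2 3: unambiguous abbreviation
is.null(lst$al)                # TRUE: "al" matches both names
is.null(lst[["alp"]])          # TRUE: double brackets want the full name
lst[["alp", exact = FALSE]]    # 1 2 3: partial matching switched on
```

Note that the ambiguous request returns NULL silently rather than producing an error, which is one more reason to spell names out in full in scripts.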
To replace an item on a list, just re-assign it. If you want to add a new
item to a list, just assign the new item to a new name. Here, naturally, you
need to use the item’s full name. If your mylist has an item called alpha
and you use the command mylist$alp <- 3, you will create a new item
named alp and leave the old one, alpha, unchanged. To delete an item from
a list, you can use subscripting as we did for a vector. For example, either
mylist <- mylist[c(1, 2, 4)] or mylist <- mylist[-3] will
drop the third entry. But another, possibly easier, way is to assign NULL
to the name or number. In this example, mylist$funk <- NULL or
mylist[[3]] <- NULL would remove the item named funk from
mylist. This behavior means that it is difficult to intentionally store a NULL
value in a list, but this does not seem to be much of a limitation.
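The following sketch walks through adding and deleting items. The item names gamma and empty are made up, and the final command shows one idiom, single brackets with list(NULL) on the right-hand side, that does store a NULL should you ever need to.

```r
mylist <- list(alpha = 1:3, b = "yes", funk = log, 45)

mylist$funk <- NULL            # delete the item named funk
names(mylist)                  # "alpha" "b" ""

mylist$gamma <- c(2.5, 3.5)    # add a new item under a new name
mylist$alp <- 3                # creates alp; alpha is left unchanged

# Assigning list(NULL) through single brackets stores an actual NULL
mylist["empty"] <- list(NULL)
names(mylist)
```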
Another useful function for operating on lists is unlist(), which, as its
name suggests, tries to turn your entire list into a vector. When the list contains
unusual objects, such as the function element of the mylist list in our
example, the results of unlist() can be difficult to predict. This example
shows the effect of unlist() operating on a list of regular vectors, which we
create by excluding the function element of mylist.
> unlist (mylist[-3])
alpha1 alpha2 alpha3      b
   "1"    "2"    "3"  "yes"   "45"
Here we can see that R has produced names for each of the elements from the
vector mylist$a, and in a well-behaved list these names can be useful.
3.3.2 Lists in Practice
Generally, we do not need lists much when data cleaning. As we have noted,
lists arise as the output of many R functions – a function in R cannot return
more than one result, so if a function computes two things of different sizes,
it will need to return a list. For example, the rle() function we described in
Chapter 2 returns the lengths of runs and, separately, the value associated with
each run. It is then your job to extract the pieces from the list. The pieces will
almost always be named, so they will be able to be extracted using the $ operator.
(In the case of rle(), the pieces are called lengths and values.) Lists
also arise as the output from the split() command. Normally, after calling
split() we would then call an apply()-type function on each element of
the resulting list. We describe this in Section 3.5.1. And, of course, the apply
functions can themselves produce lists, as we saw in Section 3.2.3.
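As a small illustration of pulling named pieces out of a returned list, here is rle() applied to a made-up character vector; the data and the variable names are ours.

```r
x <- c("a", "a", "b", "b", "b", "a")   # made-up data
runs <- rle(x)                         # a list with two named pieces

runs$lengths                  # 2 3 1
runs$values                   # "a" "b" "a"

# The longest run and the value it is made of
longest <- which.max(runs$lengths)
runs$values[longest]          # "b"
runs$lengths[longest]         # 3
```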
Another common context in which lists arise concerns the dimension names
of a matrix. e dimnames() function returns NULL when applied to a matrix
without row or column names. Otherwise, it returns a list with two elements:
the vector of row names and the vector of column names. In general, this return
has to be a list, rather than a matrix, because the number of rows and number
of columns will be different. Either of the two entries may be NULL,becausea
matrix may have row names without column names, or vice versa. e dim-
names() function may be used to assign, as well as extract, dimension names.
ese examples continue the earlier ones using the two-way table tbl and the
three-way table sched from Sections 3.2.2 and 3.2.7, respectively, and show
dimnames() at work. Notice that dimnames() produces a list with three
vectors of names from the three-way table.
> dimnames(tbl)
$market
[1] "2" "3"
$yr
[1] "FY15" "FY16" "FY17"
> dimnames (sched)
$who
[1] "George" "Sally"
$when
[1] "AM" "PM"
$worked
[1] "FALSE" "TRUE"
As we have seen before, dimension names are always characters. So in the
three-way array example, the names for the worked dimension are the char-
acter strings "FALSE" and "TRUE", not the logical values. In the following
example, we show how we can modify an element of the dimnames() list.
> dimnames(tbl)[[2]][1] <- "Archive"
> tbl
      yr
market Archive FY16 FY17
     2       3    1    3
     3       2    4    2
In the dimnames() assignment we change the first column name. Here,
dimnames(tbl) produces a list, the [[2]] part extracts the vector
of column names, and the [1] part accesses the element we want to
change. Of course, we could have achieved the same result with
dimnames(tbl)$yr[1] <- "Archive".
Another list that arises from R itself is the list of session options, returned
from a call to the options() function. This list includes dozens of elements
describing things such as the number of digits to be displayed, the current
choice of editor, the settings that control scientific notation, and many more.
Calling names(options()) will produce a vector of the names of the
current options. You can examine a particular option, once you know its
name, with a command like options()$digits. To set an option, pass
its name and value into the options() function, with a command like
options(digits = 9).
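A short sketch of examining and setting an option follows. It relies on the fact that options() returns the previous values of whatever it changes, which makes it easy to put things back; the choice of digits = 4 is arbitrary.

```r
head(names(options()))        # a few of the many option names

old <- options(digits = 4)    # set an option; the old value is returned
getOption("digits")           # 4
options()$digits              # 4, the same value taken from the list
print(pi)                     # 3.142

options(old)                  # restore the earlier setting
```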
3.4 Data Frames
Now that we understand how matrices and lists work, we can focus on the
most important object of all, the data frame. A data frame (written with a dot
in R, as data.frame) is a list of vectors, all of which are the same length, so
that they can be arrayed in a matrix-like rectangle. (Technically, the elements
of a data frame can also be matrices, as long as they are of the right size, but
let us avoid that complication. For our purposes, the elements of a data frame
will be ordinary vectors.) The vectors in the list serve as the columns in the
rectangle. A data frame looks like a matrix, with the critical difference that
the different columns can be of different types. One column can be numeric,
another character, a third factor, a fourth logical, and so on. Each vector
has elements of one type, as usual, but the data frame allows us to store the
sort of data we get in real life. So a data frame about people might contain
their names (which would probably be character), their ages (often numeric,
but possibly factor), their gender (possibly character, possibly factor), their
eligibility for a particular program (which might be logical), and so on. In this
example, we use the data.frame() function to construct a data frame. In
data cleaning, our data frames are very often produced by functions that read
in data from the disk, a database, or some other source. We describe methods
of acquiring data in Chapter 6, but for the moment we will use this simple
example.
> (mydf <- data.frame (
Who = letters[1:5], Cost = c(3, 2, 11, 4, 0),
Paid = c(F, T, T, T, F), stringsAsFactors = FALSE))
Who Cost Paid
1 a 3 FALSE
2 b 2 TRUE
3 c 11 TRUE
4 d 4 TRUE
5 e 0 FALSE
There are a few points worth noting here. First, R has provided row names
(visible as 1 through 5 on the left) to the data frame automatically. A matrix
need not have row names or column names, and a list need not have names,
but a data frame must always have both row names and column names.
R will create them if they are not explicitly assigned, as it did here. The
data.frame() function ensures (unless you specify otherwise) that column
names are valid and not duplicated. You may specify row names explicitly,
using the row.names argument, in which case they must be not duplicated
and not missing. Column names can be examined and set using the names()
command, as with a list, or with the colnames() or dimnames() com-
mands, as with a matrix. Generally, you will probably find the names() or
colnames() approaches to be easier, since they involve vectors and not a list.
For row names, the rownames() and row.names() functions allow the
row names of a data frame to be examined or assigned. Section 3.2.2 describes
how row names can be useful when handling matrices, and those points are
true for data frames as well.
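The naming functions mentioned above can be tried on a small data frame. This is only a sketch with made-up contents; the last line shows the default repair of a duplicated column name.

```r
df <- data.frame(Who = c("a", "b", "c"), Cost = c(3, 2, 11),
                 stringsAsFactors = FALSE)

names(df)                        # "Who" "Cost"
rownames(df)                     # "1" "2" "3", created automatically
names(df)[2] <- "Price"          # rename the second column
rownames(df) <- c("r1", "r2", "r3")
dimnames(df)                     # a list: row names, then column names

# data.frame() repairs duplicated column names unless told otherwise
names(data.frame(x = 1, x = 2))  # "x" "x.1"
```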
A second point is that, in versions of R before 4.0.0, the data.frame()
function turns character vectors into factors by default. (Since R 4.0.0 the
default is stringsAsFactors = FALSE.) Factors are discussed in Section 4.6,
and, as we mention there, they are useful, even required, in some modeling
contexts. They are rarely what we want in data cleaning, however. The best way
to keep factors out of data frames is to not allow them in the first place; we
accomplished this in the example above by passing the
stringsAsFactors = FALSE argument to the data.frame() function. Without that
argument, and under the older default, the Who column of mydf would have been
a factor variable with five levels. Another way to prevent factors from being
created is to set the stringsAsFactors global option to be FALSE, using the
options(stringsAsFactors = FALSE) command. However, we cannot rely on all of
the users of our code having that setting in place, so we always try to
remember to turn this option off explicitly when we call data.frame(). This
issue will arise again when we talk about combining data frames later in this
section, and about reading data in from outside sources in Chapter 6.
There are several functions that help you examine your data frame. Of
course, in many cases, it will be too big to simply print out and examine. The
head() and tail() functions display only the first or last six rows of a data
frame, by default, but this can be changed by the second argument, named n.
So head(mydf, n = 10) will show the first 10 rows, tail(mydf, 12)
will show the last 12, and, using a negative argument, head(mydf, -120)
will show all but the last 120. The str() function prints a compact
representation of a data frame that includes the type of each column, as well as
the first few entries. Other useful functions include dim(), to report the numbers of
rows and columns, and summary(), which gives a brief description of each
column.
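Here is a brief sketch of these inspection functions at work on a made-up data frame of 500 rows (the name big and its columns are ours).

```r
big <- data.frame(id = 1:500, score = (1:500) %% 7)   # made-up data

dim(big)                 # 500 2
head(big, n = 3)         # first three rows
tail(big, 2)             # last two rows
nrow(head(big, -120))    # 380: all but the last 120 rows
str(big)                 # type and first few entries of each column
summary(big)             # brief description of each column
```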
3.4.1 Missing Values in Data Frames
Because the columns of a data frame can be of different classes, missing val-
ues can be of different classes, too. A missing value in a numeric column will
be a numeric missing value, while in a character column, the missing value
will be of the character type. We discussed missing values at some length in
Section 2.4. It is always good to know where missing values come from and
why they exist – often investigating the causes of “missingness” will lead to
discoveries about the data. The is.na() function operates on a data frame and
returns a logical-valued matrix showing which elements (if any) are missing;
the anyNA() function operates on data frames as well. One approach to han-
dling missing data is to simply omit any observations (rows) of the data frame in
which one or more elements is missing. R’s na.omit() function does exactly
that. (For this purpose, NaN is missing but Inf and -Inf are not.) This is the
default behavior for a number of R’s modeling functions, but in general we do
not recommend deleting records with missing values until the reason for the
values being missing is understood.
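The following sketch, on made-up data, shows the functions just mentioned. It also illustrates the point about NaN and Inf: the row holding Inf survives na.omit(), while the rows holding NA or NaN do not.

```r
df <- data.frame(x = c(1, NA, 3, NaN, Inf),
                 y = c("a", "b", NA, "d", "e"),
                 stringsAsFactors = FALSE)

is.na(df)                # logical matrix, one entry per cell
anyNA(df)                # TRUE
colSums(is.na(df))       # x: 2 (the NA and the NaN), y: 1

clean <- na.omit(df)     # drops rows 2, 3 and 4; keeps the Inf row
rownames(clean)          # "1" "5"
```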
3.4.2 Extracting and Assigning in Data Frames
Since a data frame is matrix-like and also list-like, we can use both matrix-style
and list-style subsetting operations on a data frame. One difference appears
when we select a single row. With a matrix, selecting a single row returns a
vector, unless you specify drop = FALSE (see Section 3.2.1). However, with
a data frame, even a single row is returned as a data frame with one row because
in general even one row of a data frame will contain entries of different types.
With that one difference, we extract rows from a data frame just as we
extract rows from a matrix – by number, including negatives; using a logical
vector; or by names (as we mentioned, the rows of a data frame, and the
columns, always have names). We can extract columns using either list-style
access or matrix-style access. List-style access uses single brackets to produce
sub-lists, which in this case means that using single brackets will produce
a data frame. Double bracket subscripts, or the dollar sign, will produce a
vector. e difference is that double brackets require an exact name, unless
exact = FALSE is set, whereas the dollar sign only requires enough of the
name to be unambiguous. If there are two columns with similar names, and
your request is not sufficient to determine a unique answer, nothing at all
(i.e., NULL) is returned. Therefore, it makes sense, particularly when writing
functions for other people, to use full names for columns.
Matrix-style access uses column names or numbers; just as with a matrix,
selecting only one column will produce a vector unless you explicitly set
drop = FALSE. is example shows a number of ways of extracting columns
from data frames. We start by showing list-style access using single brackets.
> mydf[2] # Numeric subscript
  Cost
1    3
2    2
3   11
4    4
5    0
> mydf["Cost"] # Subscript by name
  Cost
1    3
2    2
3   11
4    4
5    0
> mydf[c(F, T, F)] # Logical subscript
  Cost
1    3
2    2
3   11
4    4
5    0
Each of those operations produced a data frame with five rows and one column
(which is, of course, a list). In the following examples, we use double brack-
ets together with a numeric or character subscript and produce a vector. As
with a list, a logical subscript with more than one TRUE inside a pair of
double brackets will produce an error. (You might have expected the same result
with a numeric subscript; in fact, a numeric subscript of length 2 can be used;
it is applied recursively, so mydf[[c(2, 3)]] is the third element of the
second column.) When using a character index inside
double brackets, you can specify exact = FALSE to permit the same sort of
matching that we get with the dollar sign.
> mydf[[2]]
[1]  3  2 11  4  0
> mydf[["Cost"]]
[1]  3  2 11  4  0
> mydf[["C"]]
NULL
> mydf[["C", exact = FALSE]]
[1]  3  2 11  4  0
Notice that the result in each of these cases is a vector. In the following
examples, we show the use of the dollar sign to extract a column. In this
case, as we mentioned, we need to specify only enough of the name to be
unambiguous.
> mydf$W # Extracts the "Who" column
[1] "a" "b" "c" "d" "e"
The dollar sign can only refer to one column at a time. To extract more than one
column, we can use single brackets as above, or matrix-style access in which
we explicitly specify rows and columns. As with a matrix, leaving one of those
two indices blank will select all of them, and R will produce vectors from
single columns unless the drop = FALSE argument is specified. This example
shows extraction using matrix-style syntax.
> mydf[1:2, c("Cost", "Paid")]
Cost Paid
1 3 FALSE
2 2 TRUE
> mydf[,"Who", drop = FALSE] # Example of drop = FALSE
  Who
1   a
2   b
3   c
4   d
5   e
Removing a column from a data frame is exactly like removing an element
from a list and is accomplished in the same way – by assigning NULL to
the column reference. Running the command mydf$Paid <- NULL will
remove the column Paid from the data frame using list-style notation, and
mydf[,"Paid"] <- NULL performs the same task using matrix-style
notation.
To replace subsets of elements you can once again use the matrix-style
or list-style syntax. So, for example, mydf[c(1,3), "b"] <- "A" and
mydf$b[c(1,3)] <- "A" both replace the first and third entries of the b
column of mydf with "A". (Of course, if that column had been numeric or
logical before, this operation will force R to convert it to character.)
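Removal and replacement can be sketched on a small data frame. The data frame below is hypothetical (it has a column named b, to match the commands in the text), and the final lines show the type conversion just mentioned.

```r
df <- data.frame(id = 1:4, b = c("w", "x", "y", "z"),
                 Paid = c(TRUE, FALSE, TRUE, TRUE),
                 stringsAsFactors = FALSE)

df$Paid <- NULL                 # list-style removal of a column
names(df)                       # "id" "b"

df[c(1, 3), "b"] <- "A"         # matrix-style replacement
df$b[c(2, 4)] <- "B"            # list-style replacement
df$b                            # "A" "B" "A" "B"

df[2, "id"] <- "two"            # a character value converts the whole column
class(df$id)                    # "character"
```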
3.4.3 Extracting Things That Aren’t There
The critical difference between a matrix and a data frame is that the columns
of a data frame can be vectors of different types. Another difference manifests
itself when you try to access an element that isn’t there, maybe because you
asked for a row or column number that was too big or a row or column name
that didn’t exist. In a vector, attempts to extract an item beyond the end of the
vector will produce NAs. But if you ask a matrix for a row or column that doesn’t
exist, R will produce an error. This example shows the difference:
> (mat <- matrix (1001:1006, 2, 3)) # Matrix with six items
# Ask for a non-existent entry, using vector-like indexing
> mat[8]
[1] NA
> mat[,4] # Ask for a non-existent column
Error in mat[, 4] : subscript out of bounds
In general, we prefer the error. A function that sees an NA will often try to carry
on, whereas an error will force you to stop and figure out what has happened.
The situation with data frames (and lists) is different. Supplying subscripts for
which there are no rows produces one row with all NAs for every unusable
subscript. The entries in these rows will have the same classes (numeric, character,
etc.) that the data frame had. This arises when some rows have been deleted,
and then you, or a program, try to access one of the deleted rows by name. In
this example, we show how asking for rows that don’t exist can cause trouble.
> mydf2 <- data.frame (alpha = 1:5, b = c(T, T, F, T, F),
NX = c("NA", "NB", "NC", "ND", "NE"),
stringsAsFactors = FALSE,
row.names = c("Red", "Blue", "White", "Reddish", "Black"))
> mydf2
alpha b NX
Red 1 TRUE NA
Blue 2 TRUE NB
White 3 FALSE NC
Reddish 4 TRUE ND
Black 5 FALSE NE
# Let's ask for rows that don't exist.
> mydf2[c(9, 4, 7, 1),]
alpha b NX
NA NA NA <NA>
Reddish 4 TRUE ND
NA.1 NA NA <NA>
Red 1 TRUE NA
In this example, we see that the resulting data frame has four rows, two of
which contain only NA values. The character column’s NAs are represented with
angle brackets, as <NA>, to make it easy to distinguish a missing value from
the legitimate character string "NA" in the row named Red (row 1 of mydf2).
The first two columns’ NAs are
numeric and logical. As elsewhere (e.g., Section 2.4.3), logical subscripts will
recycle – which is rarely what you want – and usually produce unwanted results
when they contain NAs.
In the following example, we show one more operation that can produce
rows with NAs in them. Since our data frame has rows named both "Red"
and "Reddish", asking for a row named "Re" is ambiguous and produces
a row of NAs. (In contrast, the row names of a matrix may not be abbreviated;
supplying a name that is not an exact row name produces an error.)
> mydf2["Re",] # Not enough to be unambiguous
   alpha  b   NX
NA    NA NA <NA>
A much more frequent problem happens when accessing columns. If you
access a non-existent column in the matrix or list styles, or use an
abbreviation, R produces an error. In our example, mydf2[,"gamma"] (referring to a
non-existent column), mydf2[,"N"] (referring to an abbreviated name, with
a comma), and mydf2["N"] (without the comma) all produce errors. In
contrast, when using the double-bracket notation, NULL is returned when a name
is abbreviated or non-existent. (As we mentioned with lists, there is in fact
an exact argument to the double brackets that we do not use.) Just like the
NA returned when accessing a non-existent element of a vector, this NULL has
the potential to be more trouble than an error would have been. The use of
the dollar sign, as we mentioned, permits the use of unambiguously
abbreviated names but produces a NULL when used with a non-existent name. In this
example, we show how asking for a non-existent name can produce an unex-
pected result.
# Ask for the first column by abbreviated name.
> mydf2$alph
[1] 1 2 3 4 5 # No problem
# Create another column with a similar name
> mydf2$alpha.plus.1 <- mydf2$alpha + 1
> mydf2$alph
NULL
> mydf2$alph + 1 # No error, but..
numeric(0) # probably unexpected
The second-to-last operation produced NULL because alph was not sufficient
to differentiate between the columns alpha and alpha.plus.1. If a row
or column name matches exactly, R will extract it properly (so if you have
alpha and alpha.plus.1 and ask for alpha, there is no ambiguity). It
is a good practice to use complete names, unless there is a strong reason
not to.
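One way to turn a silent NULL into a visible error is to check the name before extracting. The helper below, get_col(), is hypothetical (it is not part of R); it is only a sketch of the idea, applied to a made-up data frame with two similarly named columns.

```r
df <- data.frame(alpha = 1:3, alpha.plus.1 = 2:4)

df$alph                       # NULL: the abbreviation is ambiguous
length(df$alph + 1)           # 0, and no error is raised

# Hypothetical helper that fails loudly instead of returning NULL
get_col <- function(data, name) {
  if (!name %in% names(data)) {
    stop("no column named '", name, "'")
  }
  data[[name]]
}

get_col(df, "alpha")                            # 1 2 3
res <- try(get_col(df, "alph"), silent = TRUE)  # an error, caught by try()
class(res)                                      # "try-error"
```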
3.5 Operating on Lists and Data Frames
Very often we will want to operate on each of the elements of a list or each
of the rows or columns of a data frame. For example, we might want to know
how many missing values are in each column. In Section 3.2.3, we saw how to
operate on a matrix using apply(), but apply() does not work on a list (since a
list doesn’t have dimensions). The apply() function does work on data frames,
but it first converts the data frame into a matrix. This conversion will only
be sensible when all the columns are of the same type, as with the all-numeric
data frame described in Section 3.5.2. In other cases, the results can be quite
unexpected. In this example, we operate on the rows of a data frame, using
apply(), to show how this can go wrong.
> (dd <- data.frame (a = c(TRUE, FALSE), b = c(1, 123),
cc = c("a", "b"), stringsAsFactors = FALSE))
      a   b cc
1  TRUE   1  a
2 FALSE 123  b
> apply (dd, 1, function (x) x)
   [,1]    [,2]
a  " TRUE" "FALSE"
b  "  1"   "123"
cc "a"     "b"
Here the function passed to apply() does nothing but return whatever is
passed to it. Since data frame dd has a character column, apply() converted
the whole data frame into a character matrix. It does this in part by calling
the format() function column by column, producing the results seen here:
a value " TRUE" with a leading space in row 1 (formatted to be the same
length as the string "FALSE"), and "  1" with two leading spaces in row 2
(formatted to be the same length as the string "123"). Analogous conversions
happen whenever a data frame with at least one column that is neither logical
nor numeric is passed to apply(), used in other matrix functions such as t()
(transpose), or accessed with a matrix subscript.
A general approach to this sort of operation (element by element for a list, col-
umn by column for a data frame) is supplied by sapply() and lapply().
The lapply() function always returns a list, whereas sapply() runs
lapply() and then tries to make the output into a vector (if the function
always returns a vector of length 1) or a matrix (if the function returns a
vector of constant length). Be careful, though, because if the different function
calls return items of different lengths, sapply() will need to return a list,
just as the ordinary apply() function did back in Section 3.2.3. Moreover,
if the function returns elements of different types (perhaps as a row of a
data frame), sapply() will try to convert these to a common type. In these
cases, use lapply(). The following example shows one very common use of
sapply(), which is to return the classes of each column in a data frame.
> sapply (mydf2, class)
alpha b NX alpha.plus.1
"integer" "logical" "character" "numeric"
In this example, the regular apply() function will convert the whole data
frame to character first, before computing the classes, which it would report
as all character.
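Another frequent column-by-column task, mentioned at the start of this section, is counting the missing values in each column. A minimal sketch on made-up data:

```r
df <- data.frame(id = 1:4,
                 name = c("ann", NA, "cy", NA),
                 ok = c(TRUE, NA, FALSE, TRUE),
                 stringsAsFactors = FALSE)

# One count per column, simplified by sapply() to a named vector
na_count <- sapply(df, function(col) sum(is.na(col)))
na_count                       # id 0, name 2, ok 1

# lapply() gives the same counts, but as a list
str(lapply(df, function(col) sum(is.na(col))))
```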
It is easy to operate on the columns of a data frame (or the elements of a list)
with lapply() and sapply() functions. As we have seen, it is more difficult
to operate on the rows. These two functions provide a solution to this problem.
They can be used with an ordinary numeric vector as their first argument,
in which case they act like a for() loop, applying their function to each
element of the vector. The for()-like behavior of lapply() and sapply()
is most useful when using a complicated function on each row of a data frame.
The command sapply (1:nrow(ourdf), function (i) fancy (ourdf[i,]))
runs a user-written function called fancy() on each row of a data frame.
The argument to fancy() really is a data frame, and not one that has been
converted into a matrix. In this example, we show how we might identify rows
that contain the number 1. Note that the naïve use of apply() does not find
the number 1 in the first row.
> apply (dd, 1, function (x) any (x == 1))
[1] FALSE FALSE
> sapply (1:2, function (i) any (dd[i,] == 1))
[1] TRUE FALSE
3.5.1 Split, Apply, Combine
The family of apply() functions all operate as part of a strategy that Wickham
(2011) calls “split-apply-combine.” The data is split (possibly by row, possibly
by column), a function is applied to each piece, and the results are recombined.
We have already met the tapply() function (Section 2.5.2), which performs
exactly this set of operations on vectors. We can also do this explicitly via
split() and sapply() or lapply(). We start the following example
by constructing a data frame with some people’s ages, genders, and ages of
spouses, and computing the average value of Age by Gender. In this example,
we do not specify stringsAsFactors = FALSE.
> age <- data.frame (Age = c(35, 37, 56, 24, 72, 65),
Spouse = c(34, 33, 49, 28, 70, 66),
Gender = c("F", "M", "F", "M", "F", "F"))
> split (age$Age, age$Gender)
$F
[1] 35 56 72 65
$M
[1] 37 24
> sapply (split (age$Age, age$Gender), mean)
   F    M
57.0 30.5
Here the split() function returns a list with the elements of Age divided
by value of Gender. Then sapply() applies the mean() function to
each element of the list and returns a vector (i.e., it performs both the
“apply” and “combine” operations). In this example, we could have used
tapply(age$Age, age$Gender, mean) to produce an identical result.
However, unlike tapply(), split() can operate on a data frame,
producing a list of data frames. We can then write a function to operate on each
data frame. In this example, we split our data frame by Gender and then use
summary() on each of the resulting data frames to return some information
about every column. The summary() function applied to the factor column Gender
is more informative than when applied to a character column; this is why we did
not specify stringsAsFactors = FALSE earlier. (Under R 4.0.0 or later, where
strings are no longer converted by default, pass stringsAsFactors = TRUE to
data.frame() to reproduce this output.) The result of the calls to
summary() appears as a specially formatted table.
> split (age, age$Gender)
$F
  Age Spouse Gender
1  35     34      F
3  56     49      F
5  72     70      F
6  65     66      F
$M
  Age Spouse Gender
2  37     33      M
4  24     28      M
> lapply (split (age, age$Gender), summary)
$F
Age Spouse Gender
Min. :35.00 Min. :34.00 F:4
1st Qu.:50.75 1st Qu.:45.25 M:0
Median :60.50 Median :57.50
Mean :57.00 Mean :54.75
3rd Qu.:66.75 3rd Qu.:67.00
Max. :72.00 Max. :70.00
$M
Age Spouse Gender
Min. :24.00 Min. :28.00 F:0
1st Qu.:27.25 1st Qu.:29.25 M:2
Median :30.50 Median :30.50
Mean :30.50 Mean :30.50
3rd Qu.:33.75 3rd Qu.:31.75
Max. :37.00 Max. :33.00
Using sapply() in this case produces an unexpected result (try it!). That
function tries hard to construct a vector or matrix whenever it can. A
single command that produces essentially the same final result, without
letting you save the list, is the by() function. In this example, by(age,
age$Gender, summary) performs the summary() operation on each
column, broken down by gender.
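When the function applied to each piece returns a vector of constant length, sapply() combines the pieces into a matrix. The sketch below rebuilds the age data frame from above and uses colMeans() (our choice of function, not one used in the text) on the two numeric columns of each piece.

```r
age <- data.frame(Age = c(35, 37, 56, 24, 72, 65),
                  Spouse = c(34, 33, 49, 28, 70, 66),
                  Gender = c("F", "M", "F", "M", "F", "F"))

# Split the numeric columns by Gender: a list of two data frames
pieces <- split(age[, c("Age", "Spouse")], age$Gender)

# Apply colMeans() to each piece; sapply() combines into a 2 x 2 matrix
means <- sapply(pieces, colMeans)
means    # Age: F 57, M 30.5; Spouse: F 54.75, M 30.5
```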
Under some circumstances, the three tasks of split, apply, and combine might
require separate functions, each of which may have its own arguments and
conventions. The dplyr package (Wickham and Francois, 2015) presents a set of
tools that aim to make this sort of processing more consistent. Although this
package is intended for data frames, the earlier plyr package (Wickham, 2011)
handles lists and arrays as well. Both are intended to be fast and efficient and
to permit parallel computation, which we address in Section 5.5. We have been
accustomed to performing these tasks in regular R, and we recommend that
users know how to perform these tasks there, since lots of existing code and
users take that approach.
3.5.2 All-Numeric Data Frames
We noted above that it is difficult to apply a function to the rows of a data
frame because the entries of a row may have different classes. All-numeric data
frames, though – those whose columns are all logical or numeric – behave
specially in these situations. When one of these data frames is converted to
a matrix the numeric nature of the columns is preserved (with logicals being
converted to numeric). These data frames can also be transposed, or accessed
with a matrix subscript, without losing their numeric nature. All-numeric data
frames provide a useful way of storing numbers in a matrix-like way while being
able to use data-frame-like syntax – but, again, as soon as one character column
(perhaps an ID) is added, the nature of the data frame changes.
Just as there are functions as.numeric() and so on to convert vectors
from one class to another (see Section 2.2.3), R provides as.matrix() and
as.data.frame() functions to convert data frames to matrices and vice
versa. is is mostly useful for all-numeric data frames or for older functions
that require numeric matrices.
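The contrast between all-numeric and mixed data frames can be checked directly. In this sketch the data frames num_df and mixed_df are made up for the purpose.

```r
num_df <- data.frame(a = c(1.5, 2.5), b = c(TRUE, FALSE))
m <- as.matrix(num_df)
is.numeric(m)                 # TRUE: the logical column became 1 and 0
t(num_df)                     # transposing also gives a numeric matrix

# Adding one character column changes the nature of the conversion
mixed_df <- data.frame(num_df, id = c("x", "y"), stringsAsFactors = FALSE)
is.character(as.matrix(mixed_df))   # TRUE: everything is now text

back <- as.data.frame(m)      # from matrix back to data frame
sapply(back, class)           # both columns are "numeric"
```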
3.5.3 Convenience Functions
We encourage users to use long names for their data objects and for their col-
umn names, for increased readability. However, this often leads to a situation
where to use a simple expression we need a long line like the one in this example:
CustPayment2016$JanDebt + CustPayment2016$FebPurch -
CustPayment2016$FebPmt
The with() and within() functions provide an easier way to perform oper-
ations such as these, and they are particularly useful when the same operation
needs to be done multiple times on multiple data objects, usually data frames.
For each of these functions, we pass the data frame’s name and then the expres-
sion to be performed, like this:
with (CustPayment2016, JanDebt + FebPurch - FebPmt)
One issue is that if the expression includes an assignment, the assignment is
not saved in the data frame. In order to create a new column in CustPayment2016 we would
need code like this:
CustPayment2016$FebDebt <- with (CustPayment2016,
JanDebt + FebPurch - FebPmt)
As an alternative, the within() function can perform assignments; it returns
a copy of the data with the expression evaluated. In this case, we could add a
new column called FebDebt to the data frame with a command like this:
CustPayment2016 <- within (CustPayment2016,
FebDebt <- JanDebt + FebPurch - FebPmt)
Notice that in this example within() returns a copy, which then needs to be
saved.
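To make the commands above runnable, here is a sketch with a made-up CustPayment2016 of two rows; only the column names are taken from the text.

```r
# Made-up payment data; the column names follow the text
CustPayment2016 <- data.frame(JanDebt = c(100, 250),
                              FebPurch = c(40, 0),
                              FebPmt = c(60, 50))

with(CustPayment2016, JanDebt + FebPurch - FebPmt)    # 80 200

# An assignment inside with() does not change the data frame
with(CustPayment2016, FebDebt <- JanDebt + FebPurch - FebPmt)
"FebDebt" %in% names(CustPayment2016)                 # FALSE

# within() returns a modified copy, which has to be saved
CustPayment2016 <- within(CustPayment2016,
                          FebDebt <- JanDebt + FebPurch - FebPmt)
CustPayment2016$FebDebt                               # 80 200
```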
Two more convenience functions are the subset() and transform()
functions. Much beloved of beginners, they make the subsetting and trans-
formation process easier to follow by helping do away with square brackets.
For example, we might extract all the rows of data frame d for which column
Price is positive with a command like d[d$Price > 0,]; subset()
allows us to use the alternative subset(d, Price > 0). It is also possible
to extract a subset of columns at the same time. The transform() function
allows the user to specify transformations to existing columns in a data frame
and returns the updated version. The help pages for both of these functions are
accompanied by warnings that recommend using them interactively only, not
for programming, and we generally avoid them.
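For interactive use, the two functions look like this on a small made-up data frame d (the Item and Qty columns are ours). The select argument of subset() is what picks out columns.

```r
d <- data.frame(Item = c("pen", "ink", "pad"), Price = c(2.5, 0, 4),
                Qty = c(10, 3, 2), stringsAsFactors = FALSE)

d[d$Price > 0, ]                               # square-bracket version
subset(d, Price > 0)                           # the same rows
subset(d, Price > 0, select = c(Item, Qty))    # rows and columns together

d2 <- transform(d, Total = Price * Qty)        # a copy with a new column
d2$Total                                       # 25 0 8
```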
A final convenience function is the ability to “pipe” provided by the %>%
function in the magrittr package (Bache and Wickham, 2014). This
is intended to make code more readable by allowing one function’s out-
put to serve as another’s input directly at the command line, rather than
requiring nested calls. For example, consider this evaluation of a mathematical
expression:
> cos (log (sqrt (8 - 3)))
[1] 0.6933138
In R, we have to read this from the inside out: we compute 8 - 3; take the
square root of the result; take the logarithm of that result; and finally compute
the cosine of the result from the log() function. Using the pipe notation, we
can pass the results of one computation to the next in the order in which they
are performed. This example shows the same computation performed using
the pipe notation.
> (8 - 3) %>% sqrt %>% log %>% cos
[1] 0.6933138
The pipe notation is particularly useful for nested functions and can be brought
to bear on data frames. However, be aware that not every function is suitable
for piping, and notice that the order of precedence required that we surround
the 8 - 3 with parentheses.
3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames
Data frames can be re-ordered (i.e., sorted) using a command that extracts
all the rows in a new order. This ordering will usually be a vector of row
indices constructed with the order() function (see Section 2.6.2). So if a
data frame named cust has columns ID and Date, then
ord <- order(cust$ID, cust$Date) (or the slightly more convenient alternative,
ord <- with(cust, order (ID, Date))) will produce a vector ord
that shows the ordering of the data frame’s rows by increasing
ID, and then by increasing Date within ID. Therefore, the command
cust <- cust[ord,] will replace the old cust with the newly
ordered one.
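A short sketch of this re-ordering follows, using a made-up cust with four rows. The dates are given in the ISO format so that as.Date() can read them without further arguments.

```r
cust <- data.frame(ID = c(2, 1, 2, 1),
                   Date = as.Date(c("2017-03-01", "2017-05-09",
                                    "2017-01-15", "2017-02-20")))

ord <- with(cust, order(ID, Date))
ord                        # 4 2 3 1
cust <- cust[ord, ]        # rows sorted by ID, then by Date within ID
rownames(cust)             # "4" "2" "3" "1": the original labels are kept
```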
In Section 2.6.4, we saw that the unique() function returns the unique
entries in a vector, while its counterpart duplicated() returns a logical
vector that is TRUE for any entry that appears earlier in the vector. These two
functions operate directly on matrices and data frames as well. So the com-
mand unique(mydf) takes a data frame named mydf and returns the set of
non-duplicated rows. As always, floating-point error can be a problem when
detecting whether two things are identical.
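De-duplication can be sketched as follows on made-up data. The last command, which removes duplicates based on a single column, is a common variation and not something the text above prescribes.

```r
df <- data.frame(id = c(1, 2, 2, 3, 1),
                 grp = c("a", "b", "b", "c", "z"),
                 stringsAsFactors = FALSE)

duplicated(df)                    # FALSE FALSE TRUE FALSE FALSE
unique(df)                        # drops row 3, the repeat of row 2
keep <- df[!duplicated(df$id), ]  # de-duplicate on the id column only
rownames(keep)                    # "1" "2" "4"
```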
One more operation that comes up is random sampling from a data frame.
This is a good plan when the original data set is so big that it cannot be easily
used for testing, for example, or plotting. As with re-ordering, the idea is to
construct a sample of row indices and then to subset the data frame with that
sample. The sample() command is useful here. In its most basic form, we
pass an integer named x giving the number of rows in the data frame and an
argument named size giving the desired sample size. The result is a random
set of integers selected without replacement with each value from 1 to x being
equally likely. To sample 200 rows from a data frame named mydf, we could use
the command sam <- sample(nrow(mydf), 200) to get a vector of 200
row numbers, and then mydf[sam,] to do the sampling. (This presumes that
there are 200 or more rows in mydf. If not, R produces an error.) Of course, the
new data frame’s rows will maintain the numbers they had in the original mydf,
so the row names of the new version will be out of order. If that bothers you, a
quick sam <- sort(sam) prior to subsetting will fix that. The sample()
function also has a number of more sophisticated features, including sampling
with replacement and the ability to specify different probabilities for different
choices.
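The sampling recipe can be sketched as follows. The data frame here is simulated, and the call to set.seed() is included only so that the sketch gives the same sample each time it is run.

```r
set.seed(1)   # only so that the sketch is repeatable

# Hypothetical data frame with 1000 rows
mydf <- data.frame(x = 1:1000, y = rnorm(1000))

sam <- sample(nrow(mydf), size = 200)   # 200 distinct row numbers
sam <- sort(sam)                        # keep the rows in original order
small <- mydf[sam, ]
```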
3.6 Date and Time Objects
Most data cleaning problems will include dates (and sometimes times). The
most important tasks we face with dates in data cleaning are doing arithmetic
(e.g., adding a number of days to a date or finding the number of days between
two dates) and extracting each date’s day, day of the week, month, calendar
quarter, or year. Objects representing dates and times come in several forms in
R, but since one of them takes the form of a list, we have postponed discussion
of those objects until here.
3.6.1 Formatting Dates
There are lots of ways to display a date in text, and during data cleaning it will
feel like you meet all of them. Americans might write July 4, 2017 as “7/4/17,”
but to most of the rest of the world, this indicates April 7th. Furthermore,
this representation leaves unclear precisely where in the string the day starts;
it starts in the third character for an American’s “7/4/17” but in the first for
an internationally formatted date like “26/05/17.” The two unambiguous
formats “2017-07-04” and “2017/07/04” are good starting points for storing dates,
especially in text files outside of R. (The value “2017-7-4” is permitted, but this
format leads to date strings of different lengths; 20170704 is easy to mistake for
an integer.)
The simplest date class in R is called Date, and an object of this class is
represented internally as an integer representing the number of days since a
particular “origin” date. The as.Date() function creates objects of
class Date in two ways. First, it can convert an integer number of days since
the origin into a date. The usual origin date in R is January 1, 1970, or,
unambiguously, “1970-01-01.” In this example, we show how a vector of integers can
be converted into a Date object.
> (dvec <- as.Date (c(0, 17250:17252),
origin = "1970-01-01"))
[1] "1970-01-01" "2017-03-25" "2017-03-26" "2017-03-27"
Notice that the value 0 is converted into the origin date, “1970-01-01.” If we
are given integer dates, we need to know what the origin is supposed to be.
This concern arises when reading data in from the Excel spreadsheet program.
Excel uses integer dates, but the origins are different between Windows and
Mac, and Excel mistakenly treats 1900 as a leap year. We describe this in more
detail in Section 6.5.2 when we describe reading data in.
The second conversion that as.Date() can perform is to convert text-based
representations such as “7/4/17” or “July 4, 2017,” using a format
string that describes the way the input text is formatted. Each piece of the
format string that starts with % identifies one part of the date or time; other
pieces represent characters such as space, comma, /, or - between pieces
of the input text. For example, %B matches the name of the month and %a
matches the name of the day of the week. The most important pieces of the
format are %d for day of the month, %m for the month, and %y and %Y for
two- and four-digit year, respectively. (Two-digit years between 69 and 99
are assumed to be twentieth-century ones starting with 19, and the rest,
twenty-first-century ones.) The help page for as.Date() refers us to the help
page for strptime(), which lists all of the possibilities. For example, this
command uses the format string "%b %d, %Y" to convert text dates such as
"September 30, 2017" into a Date object.
> as.Date (c("Feb 29, 2016", "Feb 29, 2017",
"September 30, 2017"), format = "%b %d, %Y")
[1] "2016-02-29" NA "2017-09-30"
Notice that the format string had to contain the same pattern of spaces
and comma that the input text had. R was able to read both the three-letter
abbreviation Feb and the full name September – but it produced an NA for
Feb 29, 2017 which was not a legitimate date.
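The two slash-separated styles mentioned earlier can be read in the same way. This sketch uses only numeric format pieces, so it does not depend on the locale; the third date is an invented one that illustrates the two-digit-year rule.

```r
# American style: month/day/two-digit year
us <- as.Date("7/4/17", format = "%m/%d/%y")

# International style: day/month/two-digit year
intl <- as.Date("26/05/17", format = "%d/%m/%y")

# Two-digit years from 69 to 99 are read as 19xx
old <- as.Date("7/4/69", format = "%m/%d/%y")

format(c(us, intl, old), "%Y-%m-%d")
# "2017-07-04" "2017-05-26" "1969-07-04"
```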
The names of the days of the week, and the months of the year, are set by the
computer’s locale (see Section 1.4.6). By changing locales R can be made to read
in days or months in other languages, as well, which is useful when data comes
from international sources. In this example, we have some dates in which the
month has been given in Spanish. By changing the locale we can read these in;
then by re-setting the locale we can use as.character() to convert them
into English.
> sp.dates <- c("3 octubre 2016", "26 febrero 2017",
"5 mayo 2017")
> as.Date (sp.dates, format = "%d %B %Y")
[1] NA NA NA
# Not understood in English locale; use Spanish for now
> Sys.setlocale ("LC_TIME", "Spanish")
[1] "Spanish_Spain.1252" # Setting was successful
> (dts <- as.Date (sp.dates, format = "%d %B %Y"))
[1] "2016-10-03" "2017-02-26" "2017-05-05"
> Sys.setlocale ("LC_TIME", "USA") # Change back
[1] "English_United States.1252" # Setting was successful
> as.character (dts, "%d %B %Y")
[1] "03 October 2016" "26 February 2017" "05 May 2017"
3.6.2 Common Operations on Date Objects
There are a number of convenience functions to manipulate date objects. The
months() and weekdays() functions act on Date objects and return the
names of the corresponding months and days of the week. Each has an
abbreviate argument that defaults to FALSE; when set to TRUE these arguments
produce three-letter abbreviations. This example shows these
convenience functions in action.
> d1 <- as.Date ("2017-01-02")
> d2 <- as.Date ("2017-06-15")
> weekdays (c(d1, d2))
[1] "Monday" "Thursday"
> months (c(d1, d2))
[1] "January" "June"
> months (c(d1, d2), abbreviate = TRUE)
[1] "Jan" "Jun"
> quarters (c(d1, d2))
[1] "Q1" "Q2"
There is no function to extract the numeric day, month, or year from a Date
object. These operations are performed using the format() function, which
calls format.Date() to produce character output that can then be converted
to numeric using as.numeric(). The elements of the format string are like
those that are used in as.Date(). This example shows how to extract some
of those pieces from a vector of Date objects – but, again, note that the output
of format() is text.
> format (c(d1, d2), "%Y")
[1] "2017" "2017"
> format (c(d1, d2), "%d")
[1] "02" "15"
> format (d1, "%A, %B %d, %Y")
[1] "Monday, January 02, 2017"
The final command shows a more sophisticated formatting operation, using a
format string like the one in as.Date().
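Since the output of format() is text, the numeric pieces come from wrapping the call in as.numeric(). A minimal sketch, using the same two dates as above:

```r
d1 <- as.Date("2017-01-02")
d2 <- as.Date("2017-06-15")

yr <- as.numeric(format(c(d1, d2), "%Y"))   # 2017 2017
mo <- as.numeric(format(c(d1, d2), "%m"))   # 1 6
dy <- as.numeric(format(c(d1, d2), "%d"))   # 2 15
```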
It is permitted to use decimals in a Date object to represent times of
day. If you want to create a date object to represent 1:00 p.m. on July 29,
2015, as.Date(16645 + 13/24, origin = "1970-01-01") will
return a numeric, non-integer object that can be used as a date. However,
as.Date("2017-07-29 13:00:00") produces a Date that is represented
internally by the integer 17,376 – the time portion is ignored. Moreover,
non-integer parts are never displayed and can even be truncated by some
operations. When times of day are required, it is a better idea to use a POSIXt
object (Section 3.6.4).
3.6.3 Differences between Dates
Very often we need to know how far apart two dates are. The difference between
two Date objects is not a date; it is instead a period of time. In R, one of
these differences is stored as a difftime object. Some functions, such as
mean() and range(), handle difftime objects in the expected way.
Others, such as hist() (to produce a histogram) or summary(), fail or produce
unhelpful results. Normally, we will convert difftime objects into numeric
items with as.numeric(). Be careful, though: the units that R uses for the
conversion can depend on the size of the difference, whereas for data
cleaning we almost always want to use one consistent choice of unit. Therefore, it
is a good habit, when converting difftime objects to numbers, to specify
units = "days" (or whichever unit we want) explicitly. In this example,
which continues the one above, we show addition on dates plus an example of
a difftime object.
# Date objects are numeric; we can add and subtract them
> d1 + 30
[1] "2017-02-01"
> (d <- d2 - d1)
Time difference of 164 days # an object of class difftime
> as.numeric (d) # convert to numeric, in days
[1] 164
> units (d)
[1] "days"
> as.numeric (d, units = "weeks")
[1] 23.42857
In the last pair of commands, we saw that as.numeric() produced an out-
put in days by default, the units being revealed by the units() command.
We can also set the units of a difftime object explicitly, with a command
like units(d) <- "weeks", or use the difftime() function directly, like
difftime(d2, d1, units = "weeks").
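The following sketch gathers these ways of controlling units in one place, using the dates 2017-01-02 and 2017-06-15, which are 164 days apart. The variable names are ours.

```r
d1 <- as.Date("2017-01-02")
d2 <- as.Date("2017-06-15")

d <- d2 - d1                                # difftime, here in days
in.days <- as.numeric(d, units = "days")    # 164

# Ask difftime() for weeks directly, then convert with explicit units
w <- difftime(d2, d1, units = "weeks")
in.weeks <- as.numeric(w, units = "weeks")  # 164 / 7

units(d) <- "hours"                         # re-express the same span
in.hours <- as.numeric(d, units = "hours")  # 164 * 24 = 3936
```

Naming the unit in every as.numeric() call costs a few keystrokes but makes the result independent of whatever unit the difftime object happens to carry.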
3.6.4 Dates and Times
If you don’t need to do computations with times – only with dates – the Date
class will be enough, at least back to 1752, when Britain switched from the Julian
to the Gregorian calendar. If you need to do computations on times, there is a
second set of objects that are stronger at storing and computing those. These
are named POSIXct and POSIXlt objects, after the POSIX set of standards.
Collectively, these two types of objects are called POSIXt objects. POSIXt
objects measure the number of seconds (possibly with a decimal part) since
the beginning of January 1, 1970 using Coordinated Universal Time (UTC),
which is identical to Greenwich Mean Time (GMT). (Technically, the POSIX
standard does not include leap seconds, a vector of which is given by R’s built-in
.leap.seconds variable. This has never affected us.)
The POSIXlt object is implemented as a list, which makes it easy to extract
pieces; the POSIXct object acts more like a number, which makes it the choice
for storing as a column in a data frame. We start with an example of a POSIXlt
object. It prints out as a character string, but it behaves like a list. One unusual
feature is that, to see the names of the list, you need to unlist() the object
first. For example,
> (start <- as.POSIXlt("2017-01-17 14:51:23"))
[1] "2017-01-17 14:51:23 PST" # R has inferred time zone PST
> unlist (start)
sec min hour mday mon year wday yday
"23" "51" "14" "17" "0" "117" "2" "16"
isdst zone gmtoff
"0" "PST" NA
Here start really is a list, and we can extract components in the usual way,
with a dollar sign or double brackets (but, although you can use its names,
names(start) is NULL, and you cannot extract a subset of components with
single brackets). Notice also that the first day of the month gets number 1, but
the first month of the year, January, carries the number 0, and that the year
element counts the number of years since 1900. The advantage of a list is that,
given a vector of POSIXlt objects named date.vec, say, you can extract all
the months at once with date.vec$mon – but again, January is month 0 and
December is month 11. Weekdays are given in the list by wday, with 0–6
representing Sunday through Saturday, respectively. The weekdays() function
from above, and the other Date functions, also work on POSIXt objects – but
be aware that the results are displayed in the locale of the user. Notice that the
time zone above, PST, is deduced by our computer from its locale. The help
for DateTimeClasses gives more information on the niceties of time zones,
many of which are system specific.
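The offsets in the mon and year components are easy to forget, so here is a small sketch of the usual adjustments. It specifies tz = "UTC" explicitly, which is our choice here, so that the result does not depend on the computer's time zone.

```r
start <- as.POSIXlt("2017-01-17 14:51:23", tz = "UTC")

month.num <- start$mon + 1       # 1: January is stored as 0
year.num <- start$year + 1900    # 2017: years are counted from 1900
day.num <- start$mday            # 17: days of the month start at 1
wday.num <- start$wday           # 2: Tuesday, since Sunday is 0
```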
Although we can use the weekdays(), months(), and quarters()
functions on POSIXct objects, we extract other components, such as years
or hours, via the format() function, as we did for Date objects. This is
slightly less efficient than the list-type extraction from a POSIXlt object,
but we recommend using POSIXct objects where possible, because we have
encountered unexpected behavior when changing time zones with POSIXlt
objects.
It is worth noting that although a POSIXt object may have a time, a time
is not required. When a Date object is converted into a POSIXt object, the
resulting object is given a time of 00:00 (i.e., midnight) in UTC. A vector of
POSIXt objects that are all at midnight display without the time visible, but
they do contain a time value. When a POSIXt object is converted to a Date
object, the time is truncated.
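A brief sketch of both conversions follows. The second timestamp is an invented one, created in UTC so that the result is the same on any computer.

```r
d <- as.Date("2017-05-01")
p <- as.POSIXct(d)               # midnight UTC on that date
secs <- as.numeric(p)            # seconds since 1970-01-01 00:00 UTC

late <- as.POSIXct("2017-05-01 23:59:59", tz = "UTC")
back <- as.Date(late)            # the time of day is dropped
```

Here secs is exactly 86,400 times the number of days that Date stores internally, and back is still May 1 even though late is one second before midnight.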
3.6.5 Creating POSIXt Objects
R’s as.POSIXct() and as.POSIXlt() functions convert text that is unam-
biguously formatted into POSIXt objects just as as.Date() does. Here the
date can be followed by a 24-hour clock time like 17:13:14 or a 12-hour time
with an AM/PM indicator. More usefully, perhaps, these functions allow the use
of a format string such as the one used by as.Date(). This format string,
documented in the help for strptime(), allows times, time zones, and AM/PM
indicators, attributes that are also accepted by as.Date(). Often we discard
time information, since we are only interested in dates, but sometimes discard-
ing time information can lead to incorrect conclusions. In this example, we
construct two POSIXct objects that represent the same moment expressed in
two different time zones.
> (ct1 <- as.POSIXct ("Mar 31, 2017 10:26:08 pm",
format = "%b %d, %Y %I:%M:%S %p"))
[1] "2017-03-31 22:26:08 PDT"
> (ct2 <- as.POSIXct ("2017-04-01 05:26:08", tz = "UTC"))
[1] "2017-04-01 05:26:08 UTC"
> as.numeric (ct1 - ct2, units = "secs")
[1] 0
The first date, ct1, is not given an explicit time zone, so the system selects the
local one (shown here as PDT). In the second example, we explicitly provide the
UTC indicator with the tz argument. The as.numeric() command shows
that the two times are identical. There are a few confusing properties of POSIXt
objects. All the objects in a vector of length >1 will be displayed with the local
time zone, and their weekdays() and months() will be, too. For a single
object, though, these functions refer to the time zone of the object, although,
as this example shows, there is a complication.
> c(ct1, ct2)
[1] "2017-03-31 22:26:08 PDT" "2017-03-31 22:26:08 PDT"
> weekdays (c(ct1, ct2))
[1] "Friday" "Friday"
> weekdays (ct2)
[1] "Saturday"
> weekdays (c(ct2))
[1] "Friday"
The top command shows that the vector of dates is displayed in our locale.
That date refers to a moment that was on a Friday locally. When weekdays()
acts on ct2 by itself, though, it shows that that moment was on a Saturday in
Greenwich. In the final command, the c() causes ct2 to be converted to local
time, where its date falls on a Friday.
To explicitly convert the time zone of a POSIXct object, you can set its
tzone attribute, with a command like attr(ct1, "tzone") <- "UTC", or,
equivalently, with "GMT" in place of "UTC"; see the help for Sys.timezone() for
a way to determine the names of time zones. (The approach for POSIXlt
objects is more complicated and we do not discuss it here.) Note that when
POSIXct objects are converted to Date objects, they are rendered in UTC,
so as.Date(ct1) and as.Date(ct2) both produce dates with value
"2017-04-01".
The format string that is passed to as.POSIXct() allows for a lot of
flexibility in the way dates are formatted. This example shows how you might convert
R’s own date stamp, produced by the date() function, into a POSIXct object
and then a Date object.
> (curdate <- date())
[1] "Wed Sep 21 00:36:47 2016"
> (now <- as.POSIXct (curdate,
format = "%A %B %d %H:%M:%S %Y"))
[1] "2016-09-21 00:36:47 PDT" # POSIXct object
> as.Date (now)
[1] "2016-09-21"
As long as the format of the dates in your data is consistent, it will probably
be possible to read them in using as.POSIXct(). In some cases, dates
may appear with extraneous text. If the content of the text is known exactly,
the text can be matched. For example, the string Wednesday, the 17th
of March, 2017 at 6:30 pm can be read in with the format string
"%A, the %dth of %B, %Y at %I:%M %p". But this formatting will fail
for the 21st or the 22nd or if the input string ends with p.m. (with
periods). In cases where there is variable, extraneous text, you may have to
resort to manipulating the text strings using the tools in Chapter 4.
3.6.6 Mathematical Functions for Date and Times
Since Date and POSIXt objects are numeric, many functions intended to
work on numeric data also work on these date objects. In particular, range(),
max(), min(), mean(), and median() all produce vectors
of date objects. The diff() function computes differences between adjacent
elements in a vector, so diff(range(x)) produces the range of dates in the
vector x as a difftime object. The summary() function acts on a vector
of date objects, producing an object that is slightly different from a vector of
dates but still usable. You can also tabulate Date and POSIXct objects with
table() – but table() does not work on the list-like POSIXlt objects.
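A small sketch of these numeric-style functions on a Date vector follows; the dates are invented and one of them appears twice.

```r
x <- as.Date(c("2017-03-01", "2017-01-15", "2017-03-01", "2017-02-10"))

rng <- range(x)                              # earliest and latest, still Dates
span <- diff(rng)                            # a difftime object
span.days <- as.numeric(span, units = "days")   # 45

tab <- table(x)                              # counts for each distinct date
```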
The seq() function can also be used to generate a sequence of dates. This
is useful for generating the endpoints of “bins” for histograms or other
summaries. As we mentioned, Date objects are implemented in units of days, so
a sequence of Date objects one unit apart has values 1 day apart by default.
However, POSIXt objects are in units of seconds, so a sequence of POSIXt
objects one unit apart are 1 second apart. One way to create a sequence of
POSIXt objects representing consecutive days is to use by = 86400, since
there are 86,400 seconds in a day. However, R has a better approach. When
called with a vector of Date or POSIXt objects, the seq() function invokes
one of the functions, seq.Date() or seq.POSIXt(), that is smarter about
date objects. These functions let you use the by argument with a word like
"hour", "day" and so on. An additional value "DSTday" (for POSIXt only)
ignores daylight saving time to produce the same clock time every day. In this
example, we generate some sequences of Date and POSIXt objects. Notice
that R suppresses times for POSIXt dates when all of the times in the vector
are midnight.
> seq (as.Date ("2016-11-04"), by = 1, length = 4)
[1] "2016-11-04" "2016-11-05" "2016-11-06" "2016-11-07"
# Create and save a POSIXct object, for convenience
> ourPos <- as.POSIXct ("2016-11-04 00:00:00")
> seq (ourPos, by = 1, length = 3)
[1] "2016-11-04 00:00:00 PDT" "2016-11-04 00:00:01 PDT"
[3] "2016-11-04 00:00:02 PDT"
> seq (ourPos, by = "day", length = 3)
[1] "2016-11-04 PDT" "2016-11-05 PDT" "2016-11-06 PDT"
> seq (ourPos, by = "day", length = 4)
[1] "2016-11-04 00:00:00 PDT" "2016-11-05 00:00:00 PDT"
[3] "2016-11-06 00:00:00 PDT" "2016-11-06 23:00:00 PST"
> seq (ourPos, by = "DSTday", length = 4)
[1] "2016-11-04 PDT" "2016-11-05 PDT" "2016-11-06 PDT"
[4] "2016-11-07 PST"
> seq (ourPos, by = "month", length = 4)
[1] "2016-11-04 PDT" "2016-12-04 PST" "2017-01-04 PST"
[4] "2017-02-04 PST"
In the top example, we see a sequence of Date objects 1 day apart (as specified
by by = 1). That same specification produces POSIXt dates 1 second apart.
Using by = "day" moves the clock by 24 hours, but since the Pacific Time
Zone, where these examples were generated, switched from daylight saving to
standard time on November 6, 2016, the old time of midnight daylight time
was advanced 24 hours to 11 p.m. standard time. With by = "DSTday"
the clock time is preserved across days. The final example shows how we can
advance 1 month at a time – the help for seq.POSIXt() shows how these
functions adjust for the case when advancing by month starting at January 31,
for example.
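The word-style by argument works for Date objects too, where time zones do not come into play. This sketch, with dates of our choosing, builds weekly and monthly sequences and a set of monthly bin endpoints.

```r
start <- as.Date("2016-11-04")

wk <- seq(start, by = "week", length.out = 3)    # 7 days apart
mo <- seq(start, by = "month", length.out = 3)   # same day of each month

# Monthly bin endpoints covering the first quarter of 2017
brks <- seq(as.Date("2017-01-01"), as.Date("2017-04-01"), by = "month")
```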
Differences between two POSIXt objects, like differences between Date
objects, are represented by difftime objects in R. Here, though, you need
to be even more careful to specify the units when converting the difftime
object to numeric. This example shows how neglecting that specification can
cause problems.
> d1 <- as.POSIXct ("2017-05-01 12:00:00")
> d2 <- as.POSIXct ("2017-05-01 12:00:06") # d1 + 6 seconds
> d3 <- as.POSIXct ("2017-05-07 12:00:00") # d1 + 6 days
> (d2 - d1) == (d3 - d1)
[1] FALSE # expected
> as.numeric (d2 - d1) == as.numeric (d3 - d1)
[1] TRUE # possibly unexpected
Here the d2 - d1 difference has the value 6 seconds, while the d3 - d1
difference has the value 6 days. The units are preserved in the difftime
objects but discarded by as.numeric(). It is a good practice to always
specify units = "days" or whatever your preferred unit is, whenever you
convert a difftime object to a numeric value.
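With the units named explicitly, the same two differences convert to numbers that can safely be compared. This is a minimal sketch; it sets tz = "UTC", which is our addition, so that the results do not depend on the local time zone.

```r
d1 <- as.POSIXct("2017-05-01 12:00:00", tz = "UTC")
d2 <- as.POSIXct("2017-05-01 12:00:06", tz = "UTC")   # d1 + 6 seconds
d3 <- as.POSIXct("2017-05-07 12:00:00", tz = "UTC")   # d1 + 6 days

short <- as.numeric(d2 - d1, units = "secs")   # 6
long <- as.numeric(d3 - d1, units = "secs")    # 6 * 86400 = 518400
```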
3.6.7 Missing Values in Dates
Dates of different classes should not be combined in a vector. It is always wise
to use an explicit function to force all the elements of a date vector to have the
same class. This also applies to missing values in date objects – they need to be
of the proper class. In this example, we combine an NA with the d1 date from
above, using the c() function. The c() function can call a second function
depending on the class of its first argument – c.Date(), c.POSIXct() or
c.POSIXlt().
> c(d1, NA)
[1] "2017-05-01 12:00:00 PDT" NA
> c(NA, d1)
[1] NA 1493665200
> c(as.POSIXct (NA), d1)
[1] NA "2017-05-01 12:00:00 PDT"
> c(NA, as.Date (d1))
[1] NA 17287
The first command succeeds, as expected, because c.POSIXct() is able to
convert the NA value into a POSIXct object. In the second command, though,
c() sees the NA and does not call a class-specific function. Instead, it converts
both values to numeric. The resulting second element is the number of seconds
since the POSIXt origin date. The way around this is to explicitly specify an NA
value of class POSIXct, as in the third command. The final command shows
that this problem exists for Date objects as well – here, d1 is converted into
the number of days since the origin date. The lesson is that you should ensure
that every date element, even the NA ones, in your vector has the same class.
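The same repair works for Date objects: lead with an NA that already has class Date. A minimal sketch, with variable names of our own:

```r
d <- as.Date("2017-05-01")

bad <- c(NA, d)              # plain numeric: the Date class is lost
good <- c(as.Date(NA), d)    # still a Date vector, first element missing
```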
3.6.8 Using Apply Functions with Dates and Times
Often a data set will arrive as a data frame with a series of dates in each row.
These might be dates on which a phenomenon is repeatedly recorded –
monthly manpower data, for example, or payment information. If an operation
needs to be performed on each row – say, finding the range of the dates in each
one – it is tempting to use apply() on such a data frame. As with earlier
examples (Section 3.5), this will not succeed – even (perhaps surprisingly) if
the data frame’s columns are all Date or all POSIXct. A better approach is
to operate on each row via the lapply() or sapply() functions. Here we
show an example of a data frame whose columns are both Date objects.
> date.df <- data.frame (
Start = as.Date (c("2017-05-03", "2017-04-16")))
> date.df$End <- as.Date (c("2018-06-01", "2018-02-16"))
> date.df
Start End
1 2017-05-03 2018-06-01
2 2017-04-16 2018-02-16
> apply (date.df, 1, function (x) x[2] - x[1])
Error in x[2] - x[1]: non-numeric arg. to binary operator
Here, the apply() function converts the data frame to a character matrix.
(Why it does not convert it to a numeric one is not clear.) So the mathematical
operation fails. One way to apply the function to each row is via sapply(), as
in this example:
> sapply (1:2, function (i)
as.numeric (date.df[i,2] - date.df[i,1],
units = "days"))
[1] 394 306
Using sapply() to index the rows, we can compute each difference in days in
a straightforward way. In general, you will need to pay attention when dealing
with data frames of dates row by row.
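When the row-wise operation is as simple as a difference between two columns, an alternative that we sketch here (it is not the approach shown above) is to subtract the columns directly, which never leaves the Date class.

```r
date.df <- data.frame(
    Start = as.Date(c("2017-05-03", "2017-04-16")),
    End   = as.Date(c("2018-06-01", "2018-02-16")))

# Column-wise subtraction gives one difftime per row
days <- as.numeric(date.df$End - date.df$Start, units = "days")
```

This gives the same values, 394 and 306, as the sapply() call. The row-by-row approach remains the more general one when each row needs a function, such as range(), applied across many date columns.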
3.7 Other Actions on Data Frames
It is a rare data cleaning task that does not involve manipulating data frames,
and one very common operation is to combine two data frames. There are
essentially three ways in which we might want to combine data frames: by
columns (i.e., combining horizontally); by rows (i.e., stacking vertically); and
matching up rows using a key (which we call merging). The first two of these are
straightforward and the third is only a little more complicated. In this section,
we describe these tasks, as well as some other actions you can perform on data
frames. We show some more detailed examples in Chapter 7.
3.7.1 Combining by Rows or Columns
When we talk about “combining data frames by columns,” we mean combining
them side by side, creating a “wide” result whose number of columns is the
sum of the numbers of columns in the things being combined. We have seen
the cbind() function, which is the preferred function for joining matrices. We
can also supply two data frames as arguments to the data.frame() function
and R will join them. Both cbind() and data.frame() can incorporate
vectors and matrices in their arguments as well – but they will convert characters
to factors unless you explicitly provide the stringsAsFactors = FALSE
argument. R is prepared to recycle some inputs, but it is best if the things being
combined have the same numbers of rows.
Recall that a data frame needs to have column names, and that we (almost)
always want these to be distinct. If two columns have the same name, R will use
the make.names() function with the unique = TRUE argument to con-
struct a set of distinct names. If three data frames each have a column named
a, for example, the result will have columns a, a.1, and a.2. It is always a
good idea to examine the set of column names for duplication (perhaps using
intersect() as in Section 2.6.3) to ensure that you know what action R
will take.
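A sketch of that check, with two invented data frames that share a column name:

```r
df1 <- data.frame(a = 1:2, b = 3:4)
df2 <- data.frame(a = 5:6, c = 7:8)

shared <- intersect(names(df1), names(df2))    # "a" is in both
joined <- names(data.frame(df1, df2))          # "a" "b" "a.1" "c"
fixed <- make.names(c("a", "a", "a"), unique = TRUE)   # "a" "a.1" "a.2"
```

One caution: when cbind() is given data frames it calls data.frame() with check.names = FALSE, so cbind(df1, df2) leaves both columns named a. That is one more reason to look for shared names before combining.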
Combining data frames by rows means stacking them vertically, creating
a “tall” result whose number of rows is the sum of the numbers of rows in
the things being combined. The rbind() function combines data frames in
this way. We can only operate rbind() on things with the same number of
columns; moreover, the columns need to have the same names, but they need
not be in the same order; R will match the names up. You will almost always
want the columns being joined to be of the same sort – numeric with numeric,
character with character, POSIXct with POSIXct, and so on – otherwise,
R will convert each column to a common class. We usually check the classes
explicitly and recommend you pass the stringsAsFactors = FALSE
argument to rbind(). If we have two data frames called df1 and df2, we
start by comparing the names, using code like this:
> n1 <- names (df1)
> n2 <- names (df2)
> all (sort (n1) == sort (n2)) # should be TRUE
We sort the names of each data frame to account for the fact that they might be
out of order. Next, we extract the class of each column. The results, c1 and c2
as follows, will often be vectors, although they might be lists if some columns
produce a vector of length 2 or more. (This will be the case if any columns are
POSIXct objects.) We compare these two objects as in this example:
> c1 <- sapply (df1, class)
> c2 <- sapply (df2, class)
> isTRUE (all.equal (c1, c2[names (c1)])) # should be TRUE
Notice that we re-order the names of c2 so that they match the order of
the names of c1. The all.equal() function compares two objects and
returns TRUE if they match, and a small report (a vector of character strings)
describing their differences if they do not. This report is useful, but to test for
equality in, for example, an if() statement, the isTRUE() function is useful.
This function produces TRUE if its argument is a single TRUE, as returned
by all.equal() when its arguments match, and FALSE if its argument is
anything else, like the character strings produced by all.equal() when its
arguments differ.
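Putting the two checks together, a guarded rbind() might look like the following sketch. The two data frames are invented, and their columns are deliberately in different orders.

```r
df1 <- data.frame(id = 1:2,
                  when = as.Date(c("2017-01-02", "2017-06-15")))
df2 <- data.frame(when = as.Date("2018-02-16"), id = 3L)   # columns swapped

n1 <- names(df1)
n2 <- names(df2)
c1 <- sapply(df1, class)
c2 <- sapply(df2, class)

# Stack only if the names agree and the classes agree column by column
if (all(sort(n1) == sort(n2)) && isTRUE(all.equal(c1, c2[names(c1)]))) {
    both <- rbind(df1, df2)    # rbind() matches the columns by name
}
```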
If the data frames being combined have the usual unmodified numeric row
names, R will adjust them so that the resulting row names go from 1 upward,
but if there are non-numeric or modified row names, R will try to keep them,
again deconflicting matches to ensure that row names are distinct.
When combining a large number of data frames, the do.call() function
will often be useful. This function takes the name of a function to be run,
and a list of arguments and runs the function with those arguments. For
example, the command log(x = 32, base = 2) produces the result
5, because log2(32) = 5. We get the exact same result with the command
do.call("log", list(x = 32, base = 2)). Notice that the
arguments are specified in the form of a list. This mechanism allows us to
combine a large number of data frames in a fairly simple way. Suppose we
have a list of data frames named list.of.df (such a list arises frequently
as the output from lapply()). Extracting the individual data frames from
the list can be tedious, but we can rbind() them all with a command like
do.call ("rbind", list.of.df) (assuming the data frames meet the
rbind() criteria). If the data frames are not already on a list, we can construct
such a list with a command like list(first.df, second.df, ...).
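As a small sketch, here lapply() produces a list of three invented two-row data frames, which do.call() then stacks in a single step.

```r
# A list of small data frames, as lapply() might return (values invented)
list.of.df <- lapply(1:3, function(i)
    data.frame(batch = i, value = c(10, 20) * i))

stacked <- do.call("rbind", list.of.df)   # 6 rows: 2 from each of the 3

five <- do.call("log", list(x = 32, base = 2))   # same as log(32, base = 2)
```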
3.7.2 Merging Data Frames
Merging is a more complicated and powerful operation. In the usual type of
merging, each data frame has a “key” field, typically a unique one. The merge()
matches up the k