A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R

Samuel E. Buttrey and Lyn R. Whitaker
Naval Postgraduate School, California, United States

This edition first published 2018
© 2018 John Wiley & Sons Ltd
Library of Congress Cataloging-in-Publication Data applied for
Hardback ISBN: 9781119080022

Contents

Preface
About the Companion Website

1 R
1.1 Introduction
1.1.1 What Is R?
1.1.2 Who Uses R and Why?
1.1.3 Acquiring and Installing R
1.1.4 Starting and Quitting R
1.2 Data
1.2.1 Acquiring Data
1.2.2 Cleaning Data
1.2.3 The Goal of Data Cleaning
1.2.4 Making Your Work Reproducible
1.3 The Very Basics of R
1.3.1 Top Ten Quick Facts You Need to Know about R
1.3.2 Vocabulary
1.3.3 Calculating and Printing in R
1.4 Running an R Session
1.4.1 Where Your Data Is Stored
1.4.2 Options
1.4.3 Scripts
1.4.4 R Packages
1.4.5 RStudio and Other GUIs
1.4.6 Locales and Character Sets
1.5 Getting Help
1.5.1 At the Command Line
1.5.2 The Online Manuals
1.5.3 On the Internet
1.5.4 Further Reading
1.6 How to Use This Book
1.6.1 Syntax and Conventions in This Book
1.6.2 The Chapters

2 R Data, Part 1: Vectors
2.1 Vectors
2.1.1 Creating Vectors
2.1.2 Sequences
2.1.3 Logical Vectors
2.1.4 Vector Operations
2.1.5 Names
2.2 Data Types
2.2.1 Some Less-Common Data Types
2.2.2 What Type of Vector Is This?
2.2.3 Converting from One Type to Another
2.3 Subsets of Vectors
2.3.1 Extracting
2.3.2 Vectors of Length 0
2.3.3 Assigning or Replacing Elements of a Vector
2.4 Missing Data (NA) and Other Special Values
2.4.1 The Effect of NAs in Expressions
2.4.2 Identifying and Removing or Replacing NAs
2.4.3 Indexing with NAs
2.4.4 NaN and Inf Values
2.4.5 NULL Values
2.5 The table() Function
2.5.1 Two- and Higher-Way Tables
2.5.2 Operating on Elements of a Table
2.6 Other Actions on Vectors
2.6.1 Rounding
2.6.2 Sorting and Ordering
2.6.3 Vectors as Sets
2.6.4 Identifying Duplicates and Matching
2.6.5 Finding Runs of Duplicate Values
2.7 Long Vectors and Big Data
2.8 Chapter Summary and Critical Data Handling Tools

3 R Data, Part 2: More Complicated Structures
3.1 Introduction
3.2 Matrices
3.2.1 Extracting and Assigning
3.2.2 Row and Column Names
3.2.3 Applying a Function to Rows or Columns
3.2.4 Missing Values in Matrices
3.2.5 Using a Matrix Subscript
3.2.6 Sparse Matrices
3.2.7 Three- and Higher-Way Arrays
3.3 Lists
3.3.1 Extracting and Assigning
3.3.2 Lists in Practice
3.4 Data Frames
3.4.1 Missing Values in Data Frames
3.4.2 Extracting and Assigning in Data Frames
3.4.3 Extracting Things That Aren’t There
3.5 Operating on Lists and Data Frames
3.5.1 Split, Apply, Combine
3.5.2 All-Numeric Data Frames
3.5.3 Convenience Functions
3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames
3.6 Date and Time Objects
3.6.1 Formatting Dates
3.6.2 Common Operations on Date Objects
3.6.3 Differences between Dates
3.6.4 Dates and Times
3.6.5 Creating POSIXt Objects
3.6.6 Mathematical Functions for Dates and Times
3.6.7 Missing Values in Dates
3.6.8 Using Apply Functions with Dates and Times
3.7 Other Actions on Data Frames
3.7.1 Combining by Rows or Columns
3.7.2 Merging Data Frames
3.7.3 Comparing Two Data Frames
3.7.4 Viewing and Editing Data Frames Interactively
3.8 Handling Big Data
3.9 Chapter Summary and Critical Data Handling Tools

4 R Data, Part 3: Text and Factors
4.1 Character Data
4.1.1 The length() and nchar() Functions
4.1.2 Tab, New-Line, Quote, and Backslash Characters
4.1.3 The Empty String
4.1.4 Substrings
4.1.5 Changing Case and Other Substitutions
4.2 Converting Numbers into Text
4.2.1 Formatting Numbers
4.2.2 Scientific Notation
4.2.3 Discretizing a Numeric Variable
4.3 Constructing Character Strings: Paste in Action
4.3.1 Constructing Column Names
4.3.2 Tabulating Dates by Year and Month or Quarter Labels
4.3.3 Constructing Unique Keys
4.3.4 Constructing File and Path Names
4.4 Regular Expressions
4.4.1 Types of Regular Expressions
4.4.2 Tools for Regular Expressions in R
4.4.3 Special Characters in Regular Expressions
4.4.4 Examples
4.4.5 The regexpr() Function and Its Variants
4.4.6 Using Regular Expressions in Replacement
4.4.7 Splitting Strings at Regular Expressions
4.4.8 Regular Expressions versus Wildcard Matching
4.4.9 Common Data Cleaning Tasks Using Regular Expressions
4.4.10 Documenting and Debugging Regular Expressions
4.5 UTF-8 and Other Non-ASCII Characters
4.5.1 Extended ASCII for Latin Alphabets
4.5.2 Non-Latin Alphabets
4.5.3 Character and String Encoding in R
4.6 Factors
4.6.1 What Is a Factor?
4.6.2 Factor Levels
4.6.3 Converting and Combining Factors
4.6.4 Missing Values in Factors
4.6.5 Factors in Data Frames
4.7 R Object Names and Commands as Text
4.7.1 R Object Names as Text
4.7.2 R Commands as Text
4.8 Chapter Summary and Critical Data Handling Tools

5 Writing Functions and Scripts
5.1 Functions
5.1.1 Function Arguments
5.1.2 Global versus Local Variables
5.1.3 Return Values
5.1.4 Creating and Editing Functions
5.2 Scripts and Shell Scripts
5.2.1 Line-by-Line Parsing
5.3 Error Handling and Debugging
5.3.1 Debugging Functions
5.3.2 Issuing Error and Warning Messages
5.3.3 Catching and Processing Errors
5.4 Interacting with the Operating System
5.4.1 File and Directory Handling
5.4.2 Environment Variables
5.5 Speeding Things Up
5.5.1 Profiling
5.5.2 Vectorizing Functions
5.5.3 Other Techniques to Speed Things Up
5.6 Chapter Summary and Critical Data Handling Tools
5.6.1 Programming Style
5.6.2 Common Bugs
5.6.3 Objects, Classes, and Methods

6 Getting Data into and out of R
6.1 Reading Tabular ASCII Data into Data Frames
6.1.1 Files with Delimiters
6.1.2 Column Classes
6.1.3 Common Pitfalls in Reading Tables
6.1.4 An Example of When read.table() Fails
6.1.5 Other Uses of the scan() Function
6.1.6 Writing Delimited Files
6.1.7 Reading and Writing Fixed-Width Files
6.1.8 A Note on End-of-Line Characters
6.2 Reading Large, Non-Tabular, or Non-ASCII Data
6.2.1 Opening and Closing Files
6.2.2 Reading and Writing Lines
6.2.3 Reading and Writing UTF-8 and Other Encodings
6.2.4 The Null Character
6.2.5 Binary Data
6.2.6 Reading Problem Files in Action
6.3 Reading Data From Relational Databases
6.3.1 Connecting to the Database Server
6.3.2 Introduction to SQL
6.4 Handling Large Numbers of Input Files
6.5 Other Formats
6.5.1 Using the Clipboard
6.5.2 Reading Data from Spreadsheets
6.5.3 Reading Data from the Web
6.5.4 Reading Data from Other Statistical Packages
6.6 Reading and Writing R Data Directly
6.7 Chapter Summary and Critical Data Handling Tools

7 Data Handling in Practice
7.1 Acquiring and Reading Data
7.2 Cleaning Data
7.3 Combining Data
7.3.1 Combining by Row
7.3.2 Combining by Column
7.3.3 Merging by Key
7.4 Transactional Data
7.4.1 Example of Transactional Data
7.4.2 Combining Tabular and Transactional Data
7.5 Preparing Data
7.6 Documentation and Reproducibility
7.7 The Role of Judgment
7.8 Data Cleaning in Action
7.8.1 Reading and Cleaning BedBath1.csv
7.8.2 Reading and Cleaning BedBath2.csv
7.8.3 Combining the BedBath Data Frames
7.8.4 Reading and Cleaning EnergyUsage.csv
7.8.5 Merging the BedBath and EnergyUsage Data Frames
7.9 Chapter Summary and Critical Data Handling Tools

8 Extended Exercise
8.1 Introduction to the Problem
8.1.1 The Goal
8.1.2 Modeling Considerations
8.1.3 Examples of Things to Check
8.2 The Data
8.3 Five Important Fields
8.4 Loan and Application Portfolios
8.4.1 Layout of the Beachside Lenders Data
8.4.2 Layout of the Wilson and Sons Data
8.4.3 Combining the Two Portfolios
8.5 Scores
8.5.1 Scores Layout
8.6 Co-borrower Scores
8.6.1 Co-borrower Score Examples
8.7 Updated KScores
8.7.1 Updated KScores Layout
8.8 Loans to Be Excluded
8.8.1 Sample Exclusion File
8.9 Response Variable
8.10 Assembling the Final Data Sets
8.10.1 Final Data Layout
8.10.2 Concluding Remarks

A Hints and Pseudocode
A.1 Loan Portfolios
A.1.1 Things to Check
A.2 Scores Database
A.2.1 Things to Check
A.3 Co-borrower Scores
A.3.1 Things to Check
A.4 Updated KScores
A.4.1 Things to Check
A.5 Excluder Files
A.5.1 Things to Check
A.6 Payment Matrix
A.6.1 Things to Check
A.7 Starting the Modeling Process

Bibliography
Index

Preface

Statisticians use data to build models, and they use models to describe the world
and to make predictions about what will happen next. There are many very
good books that describe statistical modeling, but these modeling efforts usually start with a set of “clean,” well-behaved data in which nothing
is missing or anomalous.
In real life, data is messy. There will be missing values, impossible values,
and typographical errors. Data is gathered from multiple sources, leading to
both duplication and inconsistency. Data that should be categorical is coded
as numeric; data that should be numeric can appear categorical; data can be
hidden inside free-form text; and data can be in the form of dates in a wide
number of possible formats. We estimate that 80% of the time spent on any
data analysis problem goes into just reading and preparing the data. So any
analyst needs to know how to acquire data and how to prepare it for modeling,
and the steps taken should be, as far as possible, automatic and reproducible.
This book describes how to handle data using the R software. R is the most
widely used software in statistics, and it has the advantage of being free,
open-source, and available on every major computing platform. Whatever
software you use, you will find yourself facing the issues of acquiring, cleaning,
and merging data, and documenting the steps you took. We hope this book
will help you do these things efficiently.
Monterey, California, USA
November 30, 2016

Sam Buttrey and Lyn Whitaker

Companion Website

Don’t forget to visit the companion website for this book:
www.wiley.com/go/buttrey/datascientistsguide
There you will find valuable material designed to enhance your learning,
including:
• A complete listing of all the R code in the book
• Example datasets used in the exercises

1 R

1.1 Introduction
This book focuses on one problem that is common to almost every statistical
study – indeed, to almost any problem involving any sort of analysis. That
problem is acquiring and preparing the data. Across our many years of data
analysis, we have learned that seemingly 80% of our time – maybe more – goes
into the data preparation steps (a belief echoed by others such as Dasu and
Johnson, 2003). Collectively, we call these actions data cleaning, although,
as we will discuss later, we sometimes use that term for something a little
more specific. Regardless of the name, almost any analysis requires that you
(i) acquire the data, that is, read it into the computer program; (ii) clean the
data, that is, identify entries that are duplicated or clearly erroneous or anomalous, and take other preparation steps (e.g., combining entries such as “Female,”
“female,” and “F”); (iii) merge data from different sources; and (iv) prepare
the data for modeling, which might involve dividing a set of numeric values
into subsets, combining states into regions, and so on. This book discusses
some approaches for accomplishing these four steps in the R language (R Core
Team, 2013). A fifth problem, which receives less emphasis, is the problem of
long-term curation of the data. Which parts of the data must be saved and in
what way? We address that question by reference to the idea of reproducible
research, which we discuss later in this chapter, and later in the book as well.
1.1.1 What Is R?

R is a computer program that lets you analyze data. By “analyze” we mean, first,
read the data into the program and then operate on it – drawing graphs and
charts, manipulating values, fitting statistical models, and so on. (Notice that
we prefer to call data “it” rather than “them.” We discuss this choice briefly
toward the end of the chapter.) R is both a statistical “environment” and also
a programming language, and it is very widely used both in commercial and
academic settings. R is free and open-source and runs on Windows, Apple, and
Linux operating systems. It is maintained by a group of volunteers who release
bug fixes and new features regularly.
1.1.2 Who Uses R and Why?

R started as a tool for statisticians, evolving from a language called S that
was created in the 1970s. Today, R remains the primary language of academic
statisticians, and it also has a prominent place among analysts in business
and government as well. It is used not only for building statistical models
but also for handling and cleaning data, as in this book, and for developing
new statistical methods, building simulations, visualizing results, and generally
for all the data-handling tasks the statistician and the data scientist require.
Because of the ease with which users can develop and distribute new methods,
R has also become the tool of choice in certain fast-growing fields such as
biostatistics and genetics. Articles on “surveys of the top tools used by data
scientists” inevitably name R as one of the important tools with which data
scientists, as well as statisticians, should be familiar. Moreover, R’s popularity is
such that there are extensions to R (see “packages” in Section 1.4.4) that allow
you to connect to other programs such as the Python and Java languages, the
H2O machine-learning system, the ArcGIS geographical information system,
and many more.
1.1.3 Acquiring and Installing R

The primary way to acquire R is to download it from the Internet. The main
website for R is www.r-project.org, and the www.cran.r-project.org
page (“CRAN” standing for “Comprehensive R Archive Network”) is
where you can download R itself.
CRAN – that is, websites that are essentially copies of the CRAN site – so as
to reduce the load on the CRAN site. You can probably find a mirror near you
on the “mirrors” page. After you download R, install it in the way you would
normally install a program on your operating system.
At any one time, users around the world will be running slightly different
versions of R, since new ones are released fairly frequently. For example, at this
writing the current version of R is 3.3.2, but many users are still using
3.2 or earlier versions. This will almost never cause problems, but it is a good
idea to update your version of R from time to time.
There are also several slightly different versions of R distributed other than at
CRAN. Microsoft R Open is a particular version of R that uses a different set
of math libraries intended to make certain computations faster. Like “regular”
R, Microsoft R Open is free, although it does not run on OS X. Other versions of R are intended to communicate with relational databases or with other
big-data platforms. For this book, we will assume you are running “regular”
R – but in any case for our purposes all versions of R should behave exactly the
same way.
1.1.4 Starting and Quitting R

The way you start R depends on your operating system. Normally, double-clicking on an R icon will be enough to get R started. In the command-line interface of many Linux systems, or using the OS X terminal window, it
may be enough just to type the upper-case letter R (or, for Windows command
lines, Rgui). When R has started, you will see the command prompt >. This is
the R console, the place where commands are entered. At this point, you can
start typing commands to R. When it comes time to quit R, you can either
“kill” the window in the usual way (for OS X, the red dot, the lightswitch in the
top right, or via the File dialog; for Windows, the red X or File dialog) or you
can type the q() command. In either case, R will then ask you if you want to
“Save workspace image.” If you answer “yes” to this question, R will save to the
disk any changes you made during the current session, whereas if you answer
“no,” R will return its workspace to the condition it was in when R was last
started. We almost always want to answer “yes” to this question!
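For example, quitting from command-line R looks like this (the exact wording of the prompt varies a little by platform):

> q()
Save workspace image? [y/n/c]: y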

1.2 Data
Data is information about the elements of whatever problem we are investigating. Data comes in many forms, but for our purposes it will always be presented
in a set of computer-ready values. For example, a database concerning birds
might include text about the habits of the birds, numbers giving lengths and
weights of the individuals, maps showing migration patterns, images showing
the birds themselves, sound recordings of the birds’ calls, and so on. Although
they look very different, all of these different pieces of information can be represented in the computer in digital form in one way or another. In this example,
one of our primary tasks might be to ensure that each bird’s description is correctly matched with its map, image, and song file. Our data analysis
projects rarely include data quite so disparate, but in almost every case we need
to acquire data, clean it (a process we start to describe in what follows and continue throughout the book), and prepare it for modeling, and in almost every
case we expect our data to consist of both numeric and textual values.
1.2.1 Acquiring Data

The first step in a data analysis project, of course, is to get the data into R where
it can be manipulated. We are old enough to remember the days when this
involved typing all the data from the back of a book or journal paper into a
statistics package by hand, but happily this is not necessary today. On the other
hand, data now comes in a variety of formats, few of which were created with
the convenience of the data scientist in mind. In Chapter 6, we describe some
of these common formats and how to use R to read data effectively.
1.2.2 Cleaning Data

We “clean” data when we detect (and, in many cases, remove) anomalies.
Anomalies will very often be missing values, but they might also be absurd
ones, as when people’s ages are reported as 999 or −1. Sometimes, as in our
earlier example, we might have genders reported as “Female,” “female,” and “F”
and we want to combine these three values. In the cleaning process we might
learn, for example, that one data source produced no data at all in August 2016;
this sort of fact will need to be brought to the attention of the data provider.
The data cleaning process also involves merging data from different sources,
extracting subsets or reshaping the data in some way. All in all, data cleaning
is the process of turning raw data, received from one or more providers, into a
data set that can be used in visualization, modeling, and decision-making.
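As a small illustration of the kind of recoding involved, here is one way to combine those three values in base R (a sketch of our own; the vector and the coding rule are invented for the example):

> gender <- c("Female", "female", "F", "Male", "male", "M")
> # anything whose first letter is f or F becomes "F"; the rest become "M"
> gender <- ifelse(toupper(substr(gender, 1, 1)) == "F", "F", "M")
> gender
[1] "F" "F" "F" "M" "M" "M"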
In practice these steps are iterative. Our cleaning process not only informs the
modeling, but it sometimes leads us to re-acquire the data in a different, more
usable form. Similarly, insights from modeling will often lead us to prepare the
data in a new and more revealing way – because it is when we model that we
often discover anomalies or other interesting attributes of the data.
1.2.3 The Goal of Data Cleaning

What a “clean” data set should look like depends on what your goals are. One
useful perspective is given by Wickham (2014), who describes what he calls
“tidy” data. A tidy data set is rectangular (or tabular); each row describes one
unit of analysis (an observation), and each column gives one measurement (a
variable). For example, in a data set giving measurements about people, each
row would concern itself with a person, and the columns might give height,
weight, age, blood type, and so on.
In some problems, it is not immediately clear what the unit of analysis is.
For example, imagine data that describes the locations of boats over the course
of a month, as recorded by GPS. For some purposes, a “tidy” data set would
have one row per GPS ping, each row giving a ship identifier, a location, and
a time. For other purposes, we might prefer a data set with one row per boat,
each row giving the southernmost point that ship reaches, or perhaps giving a
binary indicator of whether the ship did, or did not, spend time in international
waters. Some data – images and sound, for example – do not lend themselves
to this “tidy” approach.


The exact layout of your final data will depend on what you plan to do with
it – and in some cases this won’t be known until after you have operated on
the data.
1.2.4 Making Your Work Reproducible

It is vital that other people be able to reproduce the actions you took on your
data. Ideally, you or another analyst should be able to start with your raw data,
run all the steps you applied to it, and emerge with exactly the same clean, prepared data sets. This will be useful to you when you encounter a situation similar
to the one in the previous paragraph, where the form of the new data needs to
be designed. But it is even more important for others, since if you
or another analyst can reproduce your results there will be no disagreement
about the data. The act of making research reproducible has, in recent years,
been rightfully recognized as a cornerstone of scientific progress. Record and
document every step you take so that others can repeat them.

1.3 The Very Basics of R
This book is about handling data in R. It cannot teach you the very basics of R in
detail – although, happily, there are many good books and online resources that
can. (We give a few examples at the end of this chapter.) In this section, we list a
few of the most basic facts about R, but, again, this book is not intended to teach
you R. Rather, we focus on the details of R and of the way data is represented
in R, in order to help you understand some of the ways to acquire, clean, and
handle data inside R.
1.3.1 Top Ten Quick Facts You Need to Know about R

In this section, we give a few of the most important facts about R a beginner
needs to know. There will be more detail on these facts later in the chapter and
throughout the book.
1) The prompt is (by default) >. If you leave a command incomplete, maybe
because there is an unclosed parenthesis or quotation mark, R gives you
the continuation prompt, which is +. The Esc key (Windows) or control-C
(other systems) produces the break command, which will take you back to
the regular prompt. In this example, we show what a completed command
looks like – in this case, R is computing the value of 3 divided by 2.
> 3 / 2
[1] 1.5


Here, R produced the prompt (>), and we typed 3 / 2 and pressed the
Enter (or “Return”) key. R then produced the output. We will talk about
the [1] part in Chapter 2, but the computed value of 1.5 is shown. In
the following example, we show what happens when we press Enter after
typing the slash character:
> 3/
+ 2
[1] 1.5

Here, since the expression on the first line was incomplete, R produced the
continuation prompt, +. When we typed 2 and hit Enter, the expression
was complete and the result was shown. In case of confusion, press break
until the original > prompt is showing.
In examples in this book where we want to show the R output, we also
show the > prompt in front of our code. Remember that the > is produced by
R; you don’t need to type it yourself. (At the end of the chapter, we tell
you where you can get all the code from the book in electronic form.)
2) R is case-sensitive, which means that upper- and lower-case letters
are different in R. For example, the built-in R object LETTERS gives
all 26 upper-case letters. A different item called letters contains the
lower-case versions of the alphabet. There is no built-in object called
Letters.
3) Show an object by typing its name. For example, if you type ls by itself,
you see the contents of the function whose name is ls, the one that lists all
the objects in your workspace (which we define later). To actually run the
function and see the objects, you need to type the function’s name together
with parentheses. In this case, list your objects by typing ls().
4) Get help for a function or object named thing with the command
help(thing) or ?thing. For example, to see the help for the
ls() function, type help(ls). If you don’t know the name, try
help.search() with a relevant word in quotation marks; for example,
try help.search ("matrices") to see functions that handle
matrices.
5) Assign a value or object to a name with the left-arrow (less-than plus
hyphen): for example, the command a <- 1 creates a new object named
a with value 1. (You can also assign with a command such as a = 1,
but we don’t recommend it.) The assignment will over-write any existing
object named a you might have had. Once you create an object, it is in
your “workspace,” and your workspace can be saved when you quit. So
unless your computer crashes, when you create an object it will persist
until you delete it. Display the set of objects in your workspace with
objects() or ls(); remove an object with remove() or rm(). Not
every character is permitted in the name of an R object. Start a name
with a letter or a dot, and then stick to numbers, letters, underscores,
and dots. Names cannot contain spaces. In this example, we show some
assignments that succeed and some that do not.
> a <- 1
> a.1 <- 1
> 2a <- 1
Error: unexpected symbol in "2a"
> a 2 <- 1
Error: unexpected numeric constant in "a 2"


The first two of these assignments succeed, because a and a.1 are valid
names. The last two fail because they refer to invalid names.
6) The comment character is #. A comment ends at the end of the line. If you
want a comment to span multiple lines, you need to start each comment
line with #.
7) Recall earlier commands with the up-arrow. You can edit an earlier
command and then press the Enter key to run the new version. The
history() command shows a list of your recent commands; put a
number in (as in history(500)) to see more.
8) When referring to file names, R itself uses the forward slash in the console.
The Windows file system uses the backward slash, so Windows users may
use that, too, but in that case you have to type \\ (we talk more about
this later on). For example, a Windows user who wants to access a file
named c:\temp\mycode.R in an R command will need to type either
c:/temp/mycode.R or c:\\temp\\mycode.R. You’ll need to use a
regular, single backslash if you are interacting with the Windows operating system and not R – if, for example, you are presented with a graphical
“select file” window. The file systems for OS X and Linux users use the
forward slash at all times.
9) Just about any function you want is built into R, so R makes an excellent
calculator. For example,
> sin (log (34))
[1] -0.375344

This says that the sine (using radians) of the logarithm (base e) of 34 is
−0.375344. Most functions allow you to specify “arguments,” values you
pass to the function to modify its behavior. Some must be specified; others
have default values. For example, log (34, 10) produces the base 10
logarithm instead of the natural logarithm. If a function accepts multiple
arguments, you will need to specify them in the proper order – or by name.
In this example, the arguments to log are named x and base (see the
help at ?log), so we could have entered log(base = 10, x = 34)
too.


10) R’s operators include the comparison operators != for “not equal,” == for
“is equal to,” <= and >= for “less than or equal to” and “greater than or equal
to,” and the arithmetic operators * for “multiplied by” and ^ for “raised to
the power of.”
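A few of these operators in action:

> 3 != 4
[1] TRUE
> 3 <= 3
[1] TRUE
> 2 ^ 10
[1] 1024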
1.3.2 Vocabulary

As we get started, it will be worthwhile for us to repeat some of the vocabulary
of R, and of data, that you should be familiar with. In this section, we define
some of the terms that are commonly used in discussion of R, both in this book
and elsewhere.
vector A vector is the simplest piece of data in R. It consists of one or more
entries (also called “items” or “elements”) that are all either text or all numbers or all “logical” (i.e., TRUE or FALSE). (Technically, a vector might have
length 0, and there are some other types, but that last sentence covers 99%
of what you will do with R.) For example, the value of the famous constant 𝜋
is built into R as the object pi, and the R object pi is a numeric vector with
length 1. We talk about vectors in Chapter 2.
matrix A matrix is just a two-dimensional vector in rectangular shape. While
matrices are important in statistics, they are less important in the data cleaning process. Still, it is useful to know about matrices in preparation for using
data frames (below). We discuss matrices at the start of Chapter 3.
list A list is an R object that can hold other R objects. Lists are everywhere in
R and you will need to know how to create them and access their elements.
We discuss lists starting in Section 3.3.
data frame A data frame is a cross between a matrix and a list. Like a matrix, it
is rectangular, but like a list it can contain items of different sorts – numeric,
text, and so on – as its columns. You can think of a data frame as a list of
vectors all of which are the same length. Most of the data we encounter will
be in the form of data frames, and, if it isn’t, we will usually try to put it into
a data frame. We talk about data frames starting in Section 3.4.
object An object is a general word for anything in R. Usually, we will use this to
refer to data objects such as vectors, matrices, lists, or data frames, but we
might use “object” to refer to a function, a file handle, or anything else with
a name in R.
rows and columns A data frame and a matrix are two-dimensional rectangular objects, consisting of rows and columns. Our goal, in a data cleaning
problem, is almost always to produce one or more data frames whose rows
correspond to the things being measured, and whose columns give the
different measurements. For example, in a military manpower problem each
row might represent a soldier, and the columns would give measurements
such as age, sex, rank, and years in service. Statisticians sometimes call
rows and columns “observations” and “variables” (although that second
word has another meaning in R, see the following discussion). Confusingly,
other terms exist too: authors in machine learning talk of “instances” (or
“entities”) and “attributes” (“features”). We will use “rows” and “columns”
when the emphasis is on the representation of the data in a data frame, and
“observations” and “variables” when the emphasis is on the role being played
by the data.
variable A variable is also a generic term for an R object, especially one of
the objects in our workspace. The name is slightly misleading because the
object’s value doesn’t have to change. We would call pi a “variable,” at least
in casual conversation.
operator An operator describes an action on one or two objects – often vectors – and produces a result. For example, the * operator, placed between two
numbers, produces their product. Most operators act on two things – we say
they are “binary.” The + and - operators can also be “unary,” meaning they
act on one number. So in the expression -3, the - is a unary operator. Operations are often “vectorized,” meaning they act separately on each item of a
vector.
function A function is a kind of R object that can take an action. Functions
often accept arguments to control the computations they make, and produce “return values,” the results of the computation. For example, the cos()
function takes as its one argument the size of an angle, in radians, and produces, as its return value, the cosine of that angle. So typing cos(1) invokes
a function and produces a value of about 0.54. Operators are functions, too,
although they don’t look like it. For example, you can multiply two numbers
by calling the * function explicitly with two arguments, though you’ll need
quotation marks; "*"(3, 4) applies * to 3 and 4 and produces 12. Functions are covered in detail in Chapter 5.
expression An expression is a legal R “phrase” that would produce an action if
you entered it into R. For example, a <- 3 is an expression that, if evaluated,
would cause an item a to be created and given the value 3. That expression
is called an assignment. pi > 3 is an expression that would produce TRUE,
since the number pi is greater than 3. This is an example of a comparison.
Just typing 2 is also an expression; the system interprets this as being the
same as print(2), and prints out the value 2. Most expressions involve the
use of functions or operators, as well as R variables.
command We often use the word “command” as a casual shortcut to mean
“function,” “operator,” or “expression.” For example, we might say “use the
help command” instead of “run the help function.”
script A script is a text file that can list R commands. We use script files in all
of our projects and we recommend that you do, too. We discuss scripts in
Chapter 5.
workspace The workspace is the set of objects (data and functions) in our current environment. These are objects we have created.


working directory The working directory is the folder on your computer where
your R data is stored. By default, R will look in this directory for any external files you might ask for. We talk more about the working directory in the
following section.
With this vocabulary in mind it is easier to discuss some of the ways that R
operates. As an example, it’s not always obvious what the different operators
in R will do in weird cases. We know that 3 < 10 is TRUE. What is the value
of 3 < "10"? The answer is FALSE. R cannot compare a number to a character, so it converts both values into characters. Then the comparison is made
alphabetically. So just as "Apple" < "Banana" is TRUE because "Apple"
comes first in alphabetical order, so too does "10" come before "3" – since,
as always, we compare the initial characters first, and the 1 character precedes
the 3 character in our computer’s sorting system. We talk much more about
the different types of data in R, and converting between them, in Chapter 2.
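You can verify this behavior directly (assuming the usual alphabetical collation; see Section 1.4.6):

> 3 < "10"
[1] FALSE
> "10" < "3"
[1] TRUE
> "Apple" < "Banana"
[1] TRUE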
Another example of unexpected behavior has to do with the way R reads
commands typed in at the command line. We saw that the command a <- 3
assigns the value 3 to an object a. However, what happens when you type
a < - 3, with a space between < and -? The answer is that R attaches the
hyphen to the value 3, and then compares the value of a to the number -3. In
general, spaces will not affect your R commands – but in this case the space
“broke” the assignment operator <-.
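Here is that trap in action:

> a <- 3
> a < - 3    # parsed as: is a less than -3?
[1] FALSE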
R objects have names and names have to conform to a small set of rules. If
data is brought in from outside R, perhaps from a spreadsheet, names will be
changed if they need to be made valid (details can be seen in the help for the
make.names() function). Technically it is possible to force R to use invalid
names, but don’t do that. A few names in R are reserved, meaning they cannot
be used as the name of an R variable. For example, you cannot name an object
TRUE; that name is reserved. (You may name an object T, because that name
isn’t reserved, but we don’t recommend it.) It is also wise to try to avoid giving an object the name of an existing R function (although there are lots of R
functions and some are obscure). If you name a vector sum, and then use the
sum() function to add things up, R will be smart enough to differentiate your
vector from the system’s function. But if you create a function called sum() in
your workspace, R will use that one (since your function will appear first on the
search path; see “search path” in Section 1.4.1). This is almost never what you
want. The R functions c() and t() provide good examples of names to avoid.
Finally, R can operate in an “object-oriented” way. A number of R functions
are “generic,” meaning that they have specific methods to handle specific data types.
For example, the summary() function applied to a numeric vector gives some
information about the values in the vector, but the same function applied to
the output of a modeling function will often give summary statistics about the
model. The exact action that the generic function takes depends on the “class”
(i.e., the type) of the object passed to it. We run across a few of these generic
functions in the following few chapters and discuss object-oriented programming briefly in Section 5.6.3.
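For example, applied to a numeric vector, summary() produces numeric summaries:

> summary(c(1, 2, 3, 10))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.75    2.50    4.00    4.75   10.00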
1.3.3 Calculating and Printing in R

R performs calculations and prints results. In this section, we talk about some
of the differences between what R computes and what it prints, as well as how
text data is represented.
Floating-Point Error

This is a good place to discuss an issue that arises in a lot of data cleaning problems and has caught us and our students off-guard more than once. For almost
all computations, R uses “double-precision floating-point” arithmetic, as most
other systems do. What this means is that R can represent numbers up to about
±1.79 × 10±308 with at least some accuracy. However, double precision is not
exact. Consider this example, in which we multiply together the numbers (1/49)
and 49.
> 1/49 * 49                  # as expected
[1] 1
> 1 - (1/49 * 49)
[1] 1.110223e-16
> (49 * 1/49) == (1/49 * 49) # should be TRUE
[1] FALSE

The first computation shows the “expected” product of (1/49) and 49 – the
value 1. In fact, though, the second computation shows that this product is not exactly 1; it differs from 1 by a tiny amount that we might call
“floating-point error.” That amount was so small that it wasn’t displayed in the
first computation, according to R’s default display conditions. (The command
print(1/49 * 49, digits = 16) will reveal that this product is
computed as a number very slightly less than 1.) This is not a bug in R; it’s a
statement about the way double-precision floating-point arithmetic works,
analogous to the way that in ordinary arithmetic, the number 0.333333 …
is not quite 1/3. The final computation shows the practical effect of this: if
you compare two floating-point values directly, they might be recorded as
being different just because of floating-point error. You will need to be aware
of this when you compare the results of doing the same computation in two
different ways.
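The standard remedy is to compare floating-point numbers up to a small tolerance rather than exactly; base R supplies all.equal() for this purpose:

> x <- 1/49 * 49
> x == 1
[1] FALSE
> isTRUE(all.equal(x, 1))   # equal, up to a small tolerance
[1] TRUE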
Significant Digits

In the above-mentioned example, we saw how R printed 1 even though
the number in question was slightly different. While R’s computations use
double-precision floating point, its display will generally print a smaller
number of digits than are available. Moreover, R formats outputs in a neat
way, so that typing 2.00 produces 2, but typing 2.01 prints out as 2.01.
These formatting choices are most noticeable when many values are being
shown. The display that R chooses does not affect the precision with which
it does calculations. Of course you can force R to round off the results of
its calculation; we discuss formatting, rounding, and scientific notation in
Chapter 4.
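For example, the display is rounded, but the stored value retains full double precision:

> x <- 2/3
> x                      # default display uses 7 significant digits
[1] 0.6666667
> print(x, digits = 15)
[1] 0.666666666666667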
Character Strings

We will spend a lot of time in this book handling text or character data, data
in the form of letters such as "Oakland" or "Missing". Sometimes, as is
common, we will call a set of characters a string. In R, strings are enclosed by
quotation marks, and either the double-quotation mark " or the single one
' can be used. A string delineated by single-quotation marks is converted
into the other kind. The two kinds of quotation marks make it possible to
insert a quote into a string, such as this: "She said 'No.' " (If you
typed "She said "No." ", you would see R produce an error.) If you type
'She said "No." ', the outside quotes are converted to double quotes.
Then, since there are double quotes on the inside, too, those interior quotation
marks are “protected” by preceding them with the backslash character. The
result is converted into "She said \"No.\" "
This idea of “protecting” certain special characters goes beyond quotation
marks. The character that marks the end of a line of text is called “new-line” and
is written as \n, backslash followed by n. Typing this character requires two
keystrokes, but it counts as only one character. In general, special characters
are “protected” by the backslash characters. Besides the quotation mark and the
new-line, the important special characters are \t, the tab, and \\, the backslash
itself. The backslash also serves to introduce strings in special formats, such as
hexadecimal (e.g., "\xb1" produces the character with hexadecimal value b1,
which displays as the plus-minus sign, ±) or Unicode (e.g., "\U20ac" uses
Unicode to display the Euro currency symbol). We talk much more about text
in general and Unicode in particular in Chapter 4.
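A few of these special characters in action:

> s <- 'She said "No."'
> s                 # print() shows the interior quotes protected
[1] "She said \"No.\""
> cat(s, "\n")      # cat() renders the string as it really is
She said "No."
> nchar("a\tb")     # the tab counts as a single character
[1] 3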

1.4 Running an R Session
Once you start using R you may find yourself using it for lots of different
projects. Although this is partly a matter of taste, we find it useful to keep
separate sets of data for separate projects. In this section, we describe where R
keeps your data, and some other aspects of R with which you will need to be
familiar.


1.4.1 Where Your Data Is Stored

When you start R, you start it in a working directory, and this directory forms
the starting point for where R looks for, and stores, data. For example, typing
list.files() will list all of the files in your working directory. When you
quit R and save the workspace, a file with all of your R objects will be created
in that same directory. This file is named .RData. The leading dot in the name
is important, because some terminal programs, such as the “bash” command
interpreter, do not by default list files whose names start with a dot. We don’t
recommend changing the name of the .RData file.
This provides a natural mechanism for project management. To prepare for a
new project on a system with a command-line interface, just create a new directory and start R from there (see “starting R” above). On systems with desktop
icons, copy an existing R icon, edit the properties to point to the new directory,
and add the project name to the icon. The details of this operation will depend
on your operating system. In this way, you can keep the different .RData files
for your different projects separate.
When you start R, it will use an existing .RData file if there is one in the
working directory, or create a new, empty one if there is not. Often we have a
certain number of objects from earlier projects that we want in the new project.
There are two mechanisms for acquiring those existing R objects. In one case,
we literally copy all the objects from another .RData in a different project’s
directory into the existing workspace, using the load() function. This can
be dangerous because objects being copied will over-write existing ones with
the same names. A second mechanism uses attach(), which puts the other
.RData on the “search path.” The search path is a list of places where R looks
for objects when you mention them. You can examine your current search path
with the search() command. The first entry on the search path is the current
.RData file (although it carries the confusing name .GlobalEnv); most of
the other entries on the search path are put there by R itself. When you use a
name such as pi, R looks for that object in your workspace, and then in each
of the packages or directories named in the search path until it finds one by
that name. You can attach other .RData files anywhere in the search path,
except in the first position; usually we put them into position two so that they
are searched right after the local workspace. We talk more about getting data
into and out of R in Chapter 6.
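A sketch of the two mechanisms (the path to the other project’s workspace is hypothetical):

> load("../oldproject/.RData")     # copies objects in, over-writing any name clashes
> attach("../oldproject/.RData")   # attaches at position 2 of the search path
> search()                         # .GlobalEnv comes first, then the attached file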
1.4.2 Options

R maintains a list of what it calls “options,” which describe aspects of your interaction with it. For example, one option sets the text editor that R calls when
you edit a function, one describes how much memory is set aside for R, one
lets you change the prompt character from its default, and so on. Generally, we
find the default values reasonable, but the help for the options() function
describes the possible values and running options() shows you the current
ones. Changes to the options last only for this R session. Section 3.3.2 shows an
example of setting one of the options.
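For example, to inspect and change the option controlling default display width:

> options("digits")      # query the current value
$digits
[1] 7

> options(digits = 10)   # the change lasts only for this session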
1.4.3 Scripts

Most of the work we do with R is interactive – that is, we issue commands
and wait for R’s response. This use of R is best when we are exploring data and
developing approaches to handling and modeling it. As we develop sets of commands for a particular project, we can combine these into “scripts,” which are
simply files full of commands. Having a set of commands together allows us
to execute them in exactly the same way every time, and it allows us to add
comments and other notes that will be useful to us and to other users whom
we share the code with. This approach, while still interactive, is best when we
have developed an approach and want to use it repeatedly. Scripts also provide
a natural mechanism for project management: often we start with an empty
workspace and use scripts to populate the workspace by reading and preparing
data, loading from other sources, or attaching other directories, before starting
on the modeling steps.
R can also be run in batch mode – that is, it can start, run a single set of
commands, and then stop. This approach can be used when the same task needs
to be performed repeatedly, perhaps on different data – say, every day to process
data gathered overnight. We talk about scripts and batch use of R in Chapter 5.
1.4.4 R Packages

A package is a set of functions (and maybe data and other stuff too) that provides an extension to R. R comes with a set of packages, some of which are
automatically placed onto the search path, and others of which are not. If a
package is present on your computer but not in your search path, you can access
(or “load”) it with the library() or require() command (these two differ
only in how they react if a package cannot be found). A package only needs to
be loaded once per R session, but when you re-start R you will need to re-load
packages. There are also thousands of additional packages that have been contributed by R users that can be found on the Internet, primarily at the main
repository at cran.r-project.org and its mirror sites. If your computer
is connected to the Internet, you can install a package (if you know its name)
with the install.packages() command. If that works, the package will
still need to be loaded with the library() command. If your computer is
not connected to the Internet, you can still install packages from a disk file if
one is available. Most of the code in this book requires no additional packages,
although in some cases we will point out cases where additional packages make
particular tasks easier, more efficient, or, in rare cases, possible.


It is possible to force certain packages to be loaded whenever you start R.
When we anticipate needing a package, our preference is to include a call to
library() or require() inside our scripts.
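For example, to install and then load a contributed package (we use data.table purely as an illustration):

> install.packages("data.table")   # once; requires an Internet connection
> library(data.table)              # once per R session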
1.4.5 RStudio and Other GUIs

The “look” of R depends on your operating system. At its most basic – and we
often see this when we are connecting to remote servers – R consists only of a
command line. On the most popular platforms – Windows and OS X – running
R produces a graphical user interface, or GUI. This is a set of windows containing menus and buttons that help you perform common tasks. Most of the GUI, though, consists of the console. A
few enhanced GUIs are available. Perhaps the most widely used among these
is RStudio (RStudio Team, 2015), a development environment that includes a
console window, a set of script window tabs, and better handling of multiple
graphics windows. RStudio comes in free and paid versions for all operating
systems and is available from its maker at rstudio.com. We have found that
many of our students prefer the more interactive, perhaps more modern feel of
RStudio to the standard R interface – but underneath, the R language is exactly
the same.
1.4.6 Locales and Character Sets

R is essentially the same program whether you run it on Windows, OS X, or
Linux. (There are minor differences in the way you access external files and
in some low-level technical functions that will not be relevant in data cleaning.) In particular, R is an English-language program, so a “for” loop is always
indicated by for(). Speakers of many languages can arrange to have error
messages delivered in their language, if this ability is configured at the time R is
installed – see the help for the Sys.setenv() function and for “environment
variables.”
Even though R is in English, it is possible to set the “locale” of R. This allows
you to change the way that R does things such as format currency values.
English speakers use the dot as the decimal separator and the comma to set
off thousands from hundreds, but many Europeans use those two characters
in reverse. Other locale settings affect the abbreviations in use for days of
the week and months of the year. We discuss some of these in Chapter 3, but
one important one to note here is the “collation” setting. This describes how
R sorts alphabetical items. Under the usual choices on Windows and OS X,
lower- and upper-case letters are sorted together, so that “a” precedes “A”
in alphabetical order, but both precede “b.” To continue an earlier example,
this ensures that "apple" < "banana" and "apple" < "Banana"
are both TRUE. However, on some Linux systems the so-called “C” collation
sequence is used. In that scheme, all the upper-case letters come before
any of the lower-case ones – so that "apple" < "banana" is TRUE, but
"apple" < "Banana" is FALSE. Moreover, as the help for Comparison
points out, “in Estonian, Z comes between S and T.” You have to be aware of
both your locale and the relevant language whenever you compare strings.
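You can inspect the collation setting yourself; the output here is from one of our systems and will differ on yours:

> Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"
> "apple" < "Banana"   # TRUE here, but FALSE under the "C" collation
[1] TRUE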
Another aspect of character handling is the use of different character sets.
Text in non-Roman languages such as Hebrew or Korean requires some special
considerations. We discuss these at some length in Chapter 4.

1.5 Getting Help
R has a number of ways of getting help to you. “Help” can mean information
about the specific syntax of individual R commands, about putting the pieces
of R together in programs, or about the details of the various statistical models
and tools that R provides. In this section, we describe some of the resources
available to help you learn about R.
1.5.1 At the Command Line

The most basic help is provided at the command line, through the commands
help(), ? and help.search(). The first two commands act identically and
will be most useful when you need information on a particular R function or
operator whose name you know. In most cases, the argument doesn’t need to
be in quotation marks, though it may be – so help(matrix) or ?"matrix"
both bring up a page about some matrix functions. Quotation marks will
be required when looking for help on some elements of the R language – so
?"for" gives the help page for the for looping term and help("==")
produces the page on comparison operators. The help.search() command
is useful when the subject, rather than the name, is known; this command
opens a window (depending on your operating system) that gives links to
associated R objects. A related command is the apropos() function, which
takes a character argument (as in apropos("matrix")) and returns a
vector of names of objects containing that string (in this example, every object
with matrix in its name). A final piece of command-line help is provided by
the args() function, which takes a function and displays the set of arguments expected by, and default values provided by, that function.
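For example (the exact output depends on your version of R):

> args(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
NULL
> apropos("matrix")    # a character vector of names containing "matrix"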
1.5.2 The Online Manuals

When you install R, you are given the opportunity to install the online manuals
with it. These manuals are generally correct and complete, but they are intended
as references, and are not always useful as tutorials.


1.5.3 On the Internet

The main page for the R project is r-project.org. This is the central repository for R and its documentation. If you are interested in participating in a
community of R users, you might consider joining one of the mailing lists,
which you can find under mail.html at that page.
R is very popular and there are lots and lots of blogs, pages, and other web
documents that address R and solve specific problems. Your favorite Internet
search engine will be able to find dozens of these.
1.5.4 Further Reading

A lot of documentation comes with R when you install it in the usual way. You
can find a list of these manuals under Help | Manuals in Windows, or Help | R
Help on OS X, or with help.start(). The “Introduction to R” manual is a
good place to start.
The book “The Art of R Programming” (Matloff, 2011) is a nice tour of many R
features ranging from beginning to advanced. As its name suggests, the emphasis is on writing powerful and efficient R programs. Many other books introduce
the use of R, or describe its application in specific fields such as economics or
genomics. The r-project website has a list of over 150 books using R. As we
mentioned earlier, that site also maintains mailing lists for interested users, and
a quick web search will reveal scores of blogs and web pages devoted to R and
to answering R questions.
The recent book by Wickham and Grolemund (2016) describes those authors’
approach to not only data cleaning but a set of additional tasks, including visualization and modeling, which we think of as beyond the scope of data acquisition
and cleaning. That approach requires an entire set of tools from packages outside base R – although they come conveniently bundled together – as well as a new
possible.

1.6 How to Use This Book
1.6.1 Syntax and Conventions in This Book

We reproduce a lot of R code in this book. R code is indicated in a fixed-width
font like this. Since R is case-sensitive, our text will exactly match what is
typed into R – except that in the prose we capitalize letters of R objects if they
appear at the beginning of sentences. Inside a paragraph, or when we want to
show a sequence of commands, we reproduce exactly what we type, like this:
sqrt(pi). When we also want to show what R returns, the code will be shown
with the prompt and the literal R output, like this:
> sqrt (pi)
[1] 1.772454

Unlike the example in “top ten quick facts” item 1, we suppress the continuation
prompt +, so that it is not confused with the ordinary plus sign.
There are several different schemes for formatting code that you can find
described on the Internet, and they do not always agree. To us the most important rule is to make your code easy to read. This means, first, use spacing and
indenting in a helpful and consistent way, and second, add plenty of comments
to help the reader. There is always a temptation to write code as quickly as possible, with an eye toward worrying about neatness later. Resist that temptation!
Code is for sharing and for re-use.
On a lighter note, we know that the word “data” originated as the plural of
the singular “datum,” but it has long been permitted to construe “data” in the
singular, and we do that in this book. You will find us saying “the data is...” rather
than “are.” This is intentional.
1.6.2 The Chapters

In order to use R wisely, you have to understand what data looks like to R. The
following three chapters describe the sorts of data that R recognizes, and how
to manipulate R’s objects. We start by describing vectors, the simplest form of
data in R, in Chapter 2. This chapter describes the common types of vectors,
the different ways to extract subsets from them, and how to change values in
vectors. It also describes how R stores missing values, an integral part of almost
every data cleaning problem. The chapter concludes with a look at the important table() function and some of the basic operations on vectors – sorting,
identifying duplicates, computing unions and intersections of sets, and so on.
Chapter 3 describes more complicated data structures: matrices, lists, and
finally data frames. Understanding how data frames work is critical to using R
intelligently. We defer until this chapter discussion of how R handles times and
dates, because part of that discussion requires an understanding of lists.
The final data chapter, Chapter 4, discusses the last important data type – text
or character data. Text data is stored in vectors and data frames like other
kinds of data, but there are a number of operations specific to text. This chapter
describes how to manipulate text in R – changing case, extracting and
assembling pieces of strings, formatting numbers into strings, and so on. One
important topic is regular expressions, a set of tools for finding strings that
contain a pattern of characters. This chapter also discusses the UTF-8 system
of encoding non-Roman alphabets such as Greek or Chinese and R’s concept
of factors, which are important in modeling but often cause problems during
the data cleaning process.

Chapter 5 discusses two types of tools used to automate computations in R:
functions and scripts. These different, but related, tools will be part of every
analysis you ever do, so you should understand how to construct them intelligently. We also look briefly at “shell scripts,” a special sort of script
that lets you run R in batch, rather than interactive, mode, and we discuss some of
the tools available in R for debugging.
This is a book about cleaning data, but the data to be cleaned needs to
come from somewhere. Chapter 6 describes the different ways to bring
data into R: from other R sessions, from spreadsheet-like text files, from
relational databases, and so on. We describe two of the formats in which data
is commonly found in modern applications: XML and JSON. We also describe
how to acquire data programmatically from web pages.
Chapter 7 takes a bigger view of the data cleaning process. While the earlier
chapters focus on the nuts and bolts of R as they relate to data cleaning, this
chapter describes the sorts of challenges that arise in a real-life data cleaning project. We
talk about how to combine data from different sources and give examples of the
sort of anomalies that you have to expect in dealing with real data. In almost
every case you will have to rely on judgment, rather than just on a cookbook
of techniques. We spend some time discussing the role of judgment in data
cleaning.
The Exercise

The culmination of the book is the data cleaning exercise presented in
Chapter 8. This chapter presents a complicated data acquisition and cleaning
problem that, while artificial, reflects many of the problems and challenges we
have seen over our years of real-life data handling experience. If you can find
your way through to the end of the exercise, we expect that you will be well
prepared to handle the data the real world sends your way.
Critical Data Handling Tools

In every chapter, we have set aside the final section to recap commands and
tools we think are particularly important when it comes to data handling and
manipulation. If you can master the use of these tools, and apply them wisely,
you can reduce the risk of missing important information in your data.
The Code

All of the code reproduced in this book appears in scripts in the cleaningBook
package, which you can download from the CRAN website. You can open these
scripts in R and run the code from there – although since most examples are
very short, we suggest that you consider typing them in yourself, to get a feel
for the R language.

2 R Data, Part 1: Vectors
The basic unit of computation in R is the vector. A vector is a set of one or
more basic objects of the same kind. (Actually, it is even possible to have a
vector with no objects in it, as we will see, and this happens sometimes.) Each
of the entries in a vector is called an element. In this chapter, we talk about
the different sorts of vectors that you can have in R. Then, we describe the
very important topic of subsetting, which is our word for extracting pieces of
vectors – all of the elements that are greater than 10, for example. That topic
goes together with assigning, or replacing, certain elements of a vector. We
describe the way missing values are handled in R; this topic arises in almost
every data cleaning problem. The rest of the chapter gives some tools that are
useful when handling vectors.

2.1 Vectors
By a “basic” object, we mean an object of one of R’s so-called “atomic” classes.
These classes, which you can find in help(vector), are logical (values
TRUE or FALSE, although T and F are provided as synonyms); integer;
numeric (also called double); character, which refers to text; raw,
which can hold binary data; and complex. Some of these, such as complex,
probably won’t arise in data cleaning.
2.1.1 Creating Vectors

We are mostly concerned with vectors that have been given to us as data. However, there are a number of situations when you will need to construct your own
vectors. Of course, since a scalar is a vector of length 1, you can construct one
directly, by typing its value:
> 5
[1] 5

R displays the [1] before the answer to show you that the 5 is the first element
of the resulting vector. Here, of course, the resulting vector only had one entry,
but R displays the [1] nonetheless. There is no such thing as a “scalar” in R;
even 𝜋, represented in R by the built-in value pi, is a vector of length 1. To
combine several items into a vector, use the c() function, which combines as
many items as you need.
> c(1, 17)
[1] 1 17
> c(-1, pi, 17)
[1] -1.000000 3.141593 17.000000
> c(-1, pi, 1700000)
[1] -1.000000e+00 3.141593e+00 1.700000e+06

R has formatted the numbers in the vectors in a consistent way. In the second example, the number of digits of pi is what determines the formatting;
see Section 1.3.3. In example three, the same number of digits is used, but
the large number has caused R to use scientific notation. We discuss that in
Section 4.2.2. Analogous formatting rules are applied to non-numeric vectors
as well; this makes output much more readable. The c() function can also be
used to combine vectors, as long as all the vectors are of the same sort.
Another vector-creation function is rep(), which repeats a value as many
times as you need. For example, rep(3, 4) produces a vector of four 3s. In
this example, we show some more of the abilities of rep().
> rep (c(2, 4), 3)              # repeat a vector
[1] 2 4 2 4 2 4
> rep (c("Yes", "No"), c(3, 1)) # repeat elements of vector
[1] "Yes" "Yes" "Yes" "No"
> rep (c("Yes", "No"), each = 8)
 [1] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "No"
[10] "No"  "No"  "No"  "No"  "No"  "No"  "No"

The last two examples show rep() operating on a character vector. The final
one shows how R displays longer vectors – by giving the number of the first
element on each line. Here, for example, the [10] indicates that the first "No"
on the second line is the 10th element of the vector.
2.1.2 Sequences

We also very often create vectors of sets of consecutive integers. For example,
we might want the first 10 integers, so that we can get hold of the first 10 rows
in a table. For that task we can use the colon operator, : . Actually, the colon
operator doesn’t have to be confined to integers; you can also use it to produce
a sequence of non-integers that are one unit apart, as in the following example,
but we haven’t found that to be very useful.

> 1:5
[1] 1 2 3 4 5
> 6:-2        # Can go in reverse, by 1
[1]  6  5  4  3  2  1  0 -1 -2
> 2.3:5.9     # Permitted (but unusual)
[1] 2.3 3.3 4.3 5.3
> 3 + 2:7     # Watch out here! This is 3 +
[1]  5  6  7  8  9 10   # (vector produced by 2:7)
> (3 + 2):7   # This is 5:7
[1] 5 6 7

In that last pair of examples, we see that R evaluates the 2:7 operation before
adding the 3. This is because : has a higher precedence in the order of operations than addition. The list of operators and their precedences can be found at
?Syntax, and precedence can always be over-ridden with parentheses, as in
the example – but this is the only example of operator precedence that is likely
to trip you up. Also notice that adding 3 to a vector adds 3 to each element of
that vector; we talk more about vector operations in Section 2.1.4.
Finally, we sometimes need to create vectors whose entries differ by a number other than one. For that, we use seq(), a function that allows much finer
control of starting points, ending points, lengths, and step sizes.
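For instance, here is a brief, invented illustration; the by and length.out
arguments control the step size and the number of elements, respectively:
> seq (0, 1, by = 0.25)
[1] 0.00 0.25 0.50 0.75 1.00
> seq (2, 3, length.out = 5)
[1] 2.00 2.25 2.50 2.75 3.00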
2.1.3 Logical Vectors

We can create logical vectors using the c() function, but most often they
are constructed by R in response to an operation on other vectors. We saw
examples of operators back in Section 1.3.2; the R operators that perform
comparisons are <, <=, >, >=, == (for “is equal to”) and != (for “not equal to”).
In this example, we do some simple comparisons on a short vector.
> 101:105 >= 102  # Which elements are >= 102?
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> 101:105 == 104  # Which equal (==) 104?
[1] FALSE FALSE FALSE  TRUE FALSE

Of course, when you compare two floating-point numbers for equality, you
can get unexpected results. In this example, we compute 1 - 1/46 * 46,
which is zero; 1 - 1/47 * 47, and so on up through 50. We have seen this
example before!
> 1 - 1/46:50 * 46:50 == 0
[1]  TRUE  TRUE  TRUE FALSE  TRUE

We noted earlier that R provides T and F as synonyms for TRUE and FALSE.
We sometimes use these synonyms in the book. However, it is best to beware
of using these shortened forms in code. It is possible to create objects named
T or F, which might interfere with their usage as logical values. In contrast,
the full names TRUE and FALSE are reserved words in R. This means that you
cannot directly assign one of these names to an object and, therefore, that they
are never ambiguous in code.
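As a quick, invented illustration of the danger (the exact wording of the error
may differ by R version):
> F <- 1                 # permitted, and a source of bugs
> F + 1                  # F is no longer FALSE
[1] 2
> TRUE <- 1              # reserved words are protected
Error in TRUE <- 1 : invalid (do_set) left-hand side to assignment
> rm (F)                 # restore the built-in meaning of F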
The Number and Proportion of Elements That Meet a Criterion

One task that comes up a lot in data cleaning is to count the number (or proportion) of events that meet some criterion. We might want to know how many
missing values there are in a vector, for example, or the proportion of elements
that are less than 0.5. For these tasks, computing the sum() or mean() of a
logical vector is an excellent approach. In our earlier example, we might have
been interested in the number of elements that are ≥102, or the proportion that
are exactly 104.
> 101:105 >= 102
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> sum (101:105 >= 102)  # Four elements are >= 102
[1] 4
> 101:105 == 104
[1] FALSE FALSE FALSE  TRUE FALSE
> mean (101:105 == 104) # 20% are == 104
[1] 0.2

It may be worth pondering this last example for a moment. We start with the
logical vector that is the result of the comparison operator. In order to apply
a mathematical function to that vector, R needs to convert the logical elements to numeric ones. FALSE values get turned into zeros and TRUE values
into ones (we discuss conversion further in Section 2.2.3). Then, sum() adds
up those 0s and 1s, producing the total number of 1s in the converted vector – that is, the number of TRUE values in the logical vector or the number
of elements of the original vector that meet the criterion by being ≥ 102. The
mean() function computes the sum of the 0s and 1s and then divides that
sum by the total number of elements; that operation produces the proportion of TRUE values in the logical vector, that is, the proportion of elements
in the original vector that meet the criterion.
2.1.4 Vector Operations

Understanding how vectors work is crucial to using R properly and efficiently.
Arithmetic operations on vectors produce vectors, which means you very often
do not have to write an explicit loop to perform an operation on a vector. Suppose we have a vector of six integers, and we want to perform some operations
on them. We can do this:
> 5:10
[1]  5  6  7  8  9 10
> (5:10) + 4
[1]  9 10 11 12 13 14
> (5:10)^2                  # Square each element;
[1]  25  36  49  64  81 100 # parentheses necessary

Just to repeat, arithmetic and most other mathematical operations operate on
vectors and return vectors. So if you want the natural logarithm of every item
in a vector named x, for example, you just enter log(x). If you want the
square of the cosine of the logarithm of every element of x, you would use
cos(log(x))^2, and so on. There are functions, such as length(), sum(),
mean(), sd(), min(), and max(), that operate on a vector and produce a
single number (which, to be sure, is also a vector in R). There are also functions such as range(), which returns a vector containing the smallest and
largest values, and summary(), which returns a vector of summary statistics,
but one of the sources of R’s power is the ability to perform computations on
every element of a vector at once.
In the last two examples above, we operated on a vector and a single number
simultaneously. R handles this in the natural way: by repeating the 4 (in the first
example) or the 2 (in the second) as many times as needed. R calls this recycling.
In the following example, we see what R does in the case of operating on two
vectors of the same length. The answer is, it performs the operation between
the first elements of each vector, then the second elements, and so on. In the
opening command, we have the usual assignment, using <-, and also an additional set of parentheses outside that command. These additional parentheses
cause the result of the assignment to be printed. Without them, we would have
created thing1, but its value would not have been displayed.
> (thing1 <- c(20, 15, 10, 5, 0)^2)
[1] 400 225 100  25   0
> (thing2 <- 105:101)
[1] 105 104 103 102 101
> thing2 + thing1
[1] 505 329 203 127 101
> thing2 / thing1
[1] 0.2625000 0.4622222 1.0300000 4.0800000       Inf

In the last lines, R computes the ratios element by element. The final ratio,
101/0, yields the result Inf, referring to an infinite value. We discuss Inf more
in Section 2.4.4. The following example compares a function that returns a single, summary value to one that operates element by element.
> max (thing2, thing1)
[1] 400
> pmax (thing2, thing1)
[1] 400 225 103 102 101

The max() function produces the largest value anywhere in any of its
arguments – in this case, the 400 from the first element of thing1. The
pmax() (“parallel maximum”) function finds the larger of the first element of
the two vectors, and the larger of the second element of the two vectors, and
so on.
Two logical vectors can also be combined element by element, using the |
logical operator for “or” (i.e., returning TRUE if either element is TRUE) and
the & operator for “and” (i.e., returning TRUE only if both elements are TRUE).
These operators differ in a subtle way from their doubled versions || and &&.
The single versions evaluate the condition for every pair of elements from both
vectors, whereas the doubled versions evaluate multiple TRUE/FALSE conditions from left to right, stopping as soon as possible. These doubled versions
are most useful in, for example, if() statements.
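A small example of the difference; the second command is contrived, but
it shows that || never evaluates its right-hand side once the answer is known:
> c(TRUE, FALSE) | c(FALSE, FALSE)  # element by element
[1]  TRUE FALSE
> TRUE || stop ("never reached")    # stops at the first TRUE
[1] TRUE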
Recycling

There can be a complication, though: what if two vectors being operated on are
not of the same length?
> 5:10 + c(0, 10, 100, 1000, 10000, 100000) # Two 6-vectors
[1]      5     16    107   1008  10009 100010 # Add by element
> 5:10 + c(1, 10, 100)  # A 6-vector and a 3-vector
[1]   6  16 107   9  19 110 # The 3-vector is replicated
> 5:10 + 3:7            # A 6-vector and a 5-vector
[1]  8 10 12 14 16 13   # 5+3, 6+4, ..., 9+7, 10+3
Warning message:
In 5:10 + 3:7 :
  longer object length is not a multiple of shorter object length

It is important to understand these last two examples because the problem
of mismatched vector lengths arises often. In the first of the two examples, the
3-vector (1, 10, 100) was added to the first three elements of the 6-vector,
and then added again to the second three elements. Once again R is recycling.
No warning was issued because 3 is a factor of 6, so the shorter vector was
recycled an exact number of times. In the final example, the 5-vector was
added to the first five elements of the 6-vector. In order to finish the addition,
R recycled the first element of the 5-vector, the value 3. That value was added
to the last entry of the 6-vector, 10, to produce the final element of the result,
13. The recycling only used part of the 5-vector; since 5 is not a factor of 6, a
warning was issued.
Recycling a vector of length 1, as we did when we computed (5:10) + 4,
is very common. Recycling vectors of other lengths is rarer, and we suggest you
avoid it unless you are certain you know what you are doing. When you see
the longer object length... warning as we did in the last example, we
recommend you treat that as an error and get to the root of that problem.
Tools for Handling Character Vectors

Almost every data cleaning problem requires some handling of characters.
Either the data contains characters to start with – maybe names and addresses,
or dates, or fields that indicate sex, for example – or we will need to construct
some (perhaps turning sexes labeled 1 or 2 into M and F). We also often need to
search through character strings to find ones that match a particular pattern;
remove commas or currency signs that have been put into formatted numbers
(such as “$2500.00”); or discretize a numeric variable into a smaller number
of groups (such as turning an Age field into levels Child, Teen, Adult,
Senior). Character data is so important, and so common, that we have
devoted an entire chapter (Chapter 4) to special techniques for handling it.
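As a preview of the discretization task just mentioned, base R’s cut() function
does exactly this; the break points and labels here are invented for illustration.
The result is a factor, R’s representation of categorical data:
> age <- c(5, 14, 40, 70)
> cut (age, breaks = c(0, 12, 19, 64, Inf),
       labels = c("Child", "Teen", "Adult", "Senior"))
[1] Child  Teen   Adult  Senior
Levels: Child Teen Adult Senior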
2.1.5 Names

A vector may have names, a vector of character strings that act to identify the
individual entries. It is possible to add names to a vector, and in this section we
give examples of that. More commonly, though, R adds names to a table when
you tabulate a vector using the table() function. We will have more to say
about table(), and the names it produces, in Section 2.5. In the meantime,
here is a simple example of a vector with names. Notice that the third name
has an embedded space. This name is not “syntactically valid” according to R’s
rules. A syntactically valid name has only letters, numbers, dots, and underscores, and starts either with a letter or with a dot followed by a non-numeric character.
It is usually a bad practice to have a vector’s names be invalid, but, as we show in
the following example, it is possible. See Section 3.4.2 for information on how
to ensure that your names are valid.
> vec <- c(101, 102, 103)
> names(vec)
NULL
> names(vec) <- c("a", "b", "Long name")
> names(vec)
[1] "a"         "b"         "Long name"

After the second line, R returned the special value NULL to indicate that the vector had no names. (We talk more about NULL in Section 2.4.5.) The names()
function then assigned names to the elements of the vector. We can also assign
names directly in the c() function, as in this example.
> c(a = 101, b = 102, Long.name = 103)
        a         b Long.name
      101       102       103

In this case, we used a syntactically valid name; an invalid one would have had
to be enclosed in quotation marks.

2.2 Data Types
The three data types we have mentioned so far – numeric, logical, and character – are the ones we most often use. R does support several other data types. In
this section, we mention these data types briefly, and then discuss the important

27

28

A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R

topic of converting data from one type to another. Sometimes this is an operation we do explicitly and intentionally; other times R performs the conversion
automatically.
2.2.1 Some Less-Common Data Types

Integers

R can represent, as integers, values between −(2³¹ − 1) and 2³¹ − 1. (This number is 2,147,483,647.) Values outside this range may be displayed as if they were
integers, but they will be stored as doubles. When doing calculations, R automatically converts values that are too big to be integers into doubles, so the only
time integer storage will matter is if you explicitly convert a really large value
into an integer (see Section 2.2.3). If you need R to regard an item as an integer
for some reason, you can append L on its end. So, for example, 123 is numeric
but 123L is regarded as an integer value. Of course, it only makes sense to add
L to a thing that really is an integer.
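For example:
> typeof (123)
[1] "double"
> typeof (123L)
[1] "integer"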
Raw

“Raw” refers to data kept in binary (hexadecimal) form. This is the format that
data from images, sound, or video will take in R. We rarely need to handle that
kind of file in a data cleaning problem. However, we do sometimes resort to
using raw data when a file has unexpected characters in it, or at the beginning
of an analysis when we do not know what sort of data a file might have. In that
case, the data will be read into R and held as a vector of class raw. A raw vector
is a string of bytes represented in hexadecimal form. It can be converted into
character data (when that makes sense) with the rawToChar() function. We
talk more about reading raw data, particularly to handle the case of unexpected
characters, in Section 6.2.5.
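A tiny illustration, using charToRaw() to go in the other direction (the
two-character string here is made up):
> (r <- charToRaw ("R!"))
[1] 52 21
> rawToChar (r)
[1] "R!"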
Complex Numbers

R has the ability to manipulate complex numbers (numbers such as 1.3 − 2.4i,
where i is √−1). Since complex numbers almost never arise in data cleaning,
we will not discuss them in this book.
2.2.2 What Type of Vector Is This?

You can usually tell what sort of vector you have by looking at a few of its entries.
Character data has entries surrounded by quotes; numeric entries have
no quotes; and logical entries are either TRUE or FALSE. So, for example,
the value "TRUE", with quotation marks, can only belong to a character vector. There are also several functions in R that tell you explicitly what sort of
thing you have. Two of these functions, mode() and typeof(), tell you the
basic type of vector. They are essentially identical for our purposes, except that
typeof() differentiates between integer and double, whereas mode()

R Data, Part 1: Vectors

calls them both numeric. The str() function (for “structure”) not only tells
you the type of vector but also shows you the first few entries. A related function, class(), is a more general operator for complex types.
A second group of functions gives a TRUE/FALSE answer as to whether
a specific vector has a specific mode. These functions are named
is.logical(), is.integer(), is.numeric(), and is.character(),
and each returns a single logical value describing the type of the vector.
A more general version, is(), lets you specify the class as an argument:
so is.numeric(pi) is identical to is(pi, "numeric"). This more
general form is particularly useful when testing for more complicated, possibly
user-defined classes.
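Here are these functions applied to a small integer vector, just for illustration:
> typeof (1:3)
[1] "integer"
> mode (1:3)
[1] "numeric"
> is (1:3, "numeric")
[1] TRUE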
2.2.3 Converting from One Type to Another

It is important to remember that a vector can contain elements of only one type.
When types are mixed – for example, if you inadvertently insert a character
element into a numeric vector – R modifies the entire vector to be of the more
complicated type. Here is an example:
> c(1, 4, 7, 2, 5)      # Create numeric vector
[1] 1 4 7 2 5
> c(1, 4, 7, 2, 5, "3") # What if one element is character?
[1] "1" "4" "7" "2" "5" "3"

In this example, the entire vector got converted to character. The rule is
that R will convert every element of a vector to the “most complicated” type
of any of the elements. Logical is the least complicated type, followed by
raw, numeric, complex, and then character. (Raw vectors behave a little
differently from the others. See Section 6.2.5.)
It is important to know what values the less complicated types get when they
are converted to more complicated ones. Logical elements that are converted
into numeric become 0 where they have the value FALSE and 1 where they are
TRUE. A logical converted into a character, however, gets values "FALSE" and
"TRUE". A number gets converted into a high-accuracy text representation of
itself, as we see in these examples.
> 1/7          # by default, 7 digits are displayed
[1] 0.1428571
> c(1/7, "a")
[1] "0.142857142857143" "a"

One instance where R frequently performs conversions automatically is from
integer to numeric types.
Conversion Functions

R will convert less complicated types into more complicated ones where
required. Sometimes you need to force the elements of a vector back into a

29

30

A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R

less complicated representation. Just as there are functions whose names start
with is. for testing the type of an object, there is a set of as. functions for
converting from one type to another. The rules are these: a character will be
successfully converted to a numeric if it has the syntax of a number. It may
have leading or trailing spaces (or new-lines or tabs), but no embedded ones; it
may have leading zeros; it may have a decimal point (but only one); it may not
have embedded commas; it may have a leading minus or plus sign, and, if it is
in scientific notation, the exponent character E may be in upper- or lower-case
and may also be followed by a minus or plus sign. In this example, we show
some character strings that do and do not get converted to numbers. Notice
that the elements of the vector that do not get converted turn into missing
values (NA). We discuss missing values in Section 2.4.
> as.numeric (c(" 123.5 ", "-123e-2", "4,355", "45. 6",
                "$23", "75%"))
[1] 123.50  -1.23     NA     NA     NA     NA
Warning message:
NAs introduced by coercion

In this case, the first two elements were successfully converted. The third has a
comma, the fourth has an embedded space, and the last two have non-numeric
characters. In order to convert strings such as those into numbers, you would
have to remove the offending characters. We describe how to manipulate text
in Chapter 4.
The warning message you see here is a very common one. Unlike most
warning messages, this one will often arise naturally in the course of data
cleaning – but make sure you understand exactly where it’s coming from.
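As a preview of those techniques, one of the text tools described in Chapter 4
is gsub(), which substitutes one piece of text for another; here we strip the
offending characters from some of the strings above:
> as.numeric (gsub (",", "", "4,355"))
[1] 4355
> as.numeric (gsub ("[$%]", "", c("$23", "75%")))
[1] 23 75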
The only character values that can be successfully converted into
logical are "T", "TRUE", "True", and "true" and "F", "FALSE",
"False", and "false". In this case, no extraneous spaces are permitted.
All other character values are converted into NAs.
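For example (the last two strings are invented to show failures; note that the
leading space defeats the conversion):
> as.logical (c("TRUE", "true", "T", " TRUE", "yes"))
[1] TRUE TRUE TRUE   NA   NA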
The rule is simple for converting numeric values into logical ones.
Numeric values that are zero become FALSE; all other numbers become TRUE.
The only issue is that sometimes numbers you expect to be zero aren’t quite
because of floating-point error. In this example, we convert some numbers and
expressions to logical.
> as.logical (c(123, 5 - 5, 1e-300, 1e-400, 1 - 1/49 * 49))
[1] TRUE FALSE TRUE FALSE TRUE

The first element here is clearly non-zero, so it gets converted to TRUE. The
second evaluates to exactly zero and produces FALSE. The third is non-zero,
but the fourth counts as zero since it is outside the range of double precision
(see Section 1.3.3). The last element is our running example of an expression
that “should” be zero but is not (again, see Section 1.3.3). Since it is not zero,
it gets converted to TRUE. Numeric, non-missing values never produce NA
when converted to logical.

2.3 Subsets of Vectors
We very often need to pull out just a piece of a vector. This is called subsetting
or extracting. In most cases, where we extract a subset, we can use a similar
expression to replace (or assign) new values to a subset of the elements in a
vector. Knowing how to do this is crucial to data cleaning in R; you cannot
work efficiently in R without understanding this material.
2.3.1 Extracting

We constantly perform this operation in one form or another when cleaning
data: we look at subsets of rows or columns, we examine a vector for anomalous entries, we extract all the elements of one vector for which another has a
specific value, and so on. There are three methods by which we can extract a
subset of a vector. First, we can use a numeric vector to specify which elements
to extract. This numeric vector is an example of a “subscript” and its entries
are called “indices.” Second, we can use a logical subscript; and, third, we can
extract elements using their names.
Numeric Subscripts

The most basic way to extract a piece of a vector is to use a numeric subscript
inside square brackets. For example, if you have a vector named a, the command a[1] will extract the first element of a. The result of that command is a
vector of length 1, of the same mode as the original a. The command a[2:5]
will produce a vector of length 4, with the second through fifth elements of
a. If you ask for elements that aren’t there – if, for example, a only had three
elements – then R will fill up the missing spots with missing (NA) values. We discuss those further in Section 2.4. In this example, we have a vector a containing
the numbers from 101 to 105.
> (a <- 101:105)
[1] 101 102 103 104 105
> a[3]
[1] 103

It’s possible to pull out elements in any order, just by preparing the subscript
properly. You can even use a numeric expression to compute your subscript,
but only do this if you’re sure your expression is an integer. If the result of your
expression isn’t an integer, even if it misses by just a tiny bit, you will get something you might not expect.

> a[c(4, 2)]
[1] 104 102
> a[1+1]                  # A simple expression; this works
[1] 102
> a[2.999999999999999]    # This is truncated to 2, but...
[1] 102
> a[2.9999999999999999]   # exactly 3 in double-precision.
[1] 103
> a[49 * (1/49)]          # This index gets truncated to zero;
integer(0)                # R produces a vector of length zero

There are two kinds of special values in numeric subscripts: negative values
and zeros. Negative values tell R to omit those values, instead of extracting
them – so a[-1], for example, returns everything except the first element of a.
You can have more than one negative number in your subscript, but you cannot
mix positive and negative numbers, and that makes sense. (For example, in the
expression a[c(-1, 3)], should the second element be returned or not?)
Zeros are another special value in a subscript. They are simply ignored by
R. Zeros appear primarily as a result of the match() function; you will rarely
use them intentionally yourself. Knowing that zeros are permitted helps make
sense of the error message in the following example, though.
> a[-2]       # Omit element 2
[1] 101 103 104 105
> a[c(-1, 3)] # Illegal
Error in a[c(-1, 3)] : only 0's may be mixed with negative subscripts
> a[-1:2]     # Illegal, because -1:2 evaluates to -1, 0, 1, 2
Error in a[-1:2] : only 0's may be mixed with negative subscripts
> a[-(1:2)]   # -(1:2) is (-1, -2): omit elements 1 and 2.
[1] 103 104 105
Logical Subscripts

Logical subscripts are also very powerful. A logical subscript is a logical vector of the same length as the thing being extracted from. Values in the original
vector that line up with TRUE elements of the subscript are returned; those that
line up with FALSE are not.
We almost never construct the logical subscript directly, using c(). Instead
it is almost always the result of a comparison operation. In this example, we
start with a vector of people’s ages, and extract just the ones that are > 60.
> age <- c(53, 26, 81, 18, 63, 34)
> age > 60
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> age[age > 60]
[1] 81 63

The age > 60 vector has one entry for each element of age, so it is easy to
use that to extract the numeric values of age that are > 60. But the power
of logical subscripting goes well beyond that. Imagine that we also knew the
names of each of the people. Here we show how to extract the names just for
the people whose ages are >60.
> people <- c("Ahmed", "Mary", "Lee", "Alex", "John", "Viv")
> age > 60          # Just as a reminder
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
> people[age > 60]  # Return name where (age > 60) is TRUE
[1] "Lee"  "John"

This particular manipulation – extracting a subset of one vector based on
values in another – is something we do in every data cleaning problem. It is
important to be sure that you know exactly how it works.
One case where results might be unexpected is when you inadvertently cause
a logical subscript to be converted to a numeric one. In the example above,
suppose we had saved the logical vector as a new R object called age.gt.60.
In the following example, we show what happens if R is allowed to convert that
logical vector into a numeric one.
> age.gt.60 <- age > 60
> people[age.gt.60]      # as expected
[1] "Lee"  "John"
> people[0 + age.gt.60]
[1] "Ahmed" "Ahmed"
> people[-age.gt.60]
[1] "Mary" "Lee"  "Alex" "John" "Viv"

In the 0 + age.gt.60 example, R has to convert the logical subscript to
numeric in order to perform the addition. After the addition, then, the subscript
has the values 0 0 1 0 1 0, and the extraction produces the first element of
the vector two times, ignoring the zeros. In the following example, the negative
sign once again causes R to convert the logical subscript to numeric; after the
application of the sign operator the subscript has the values 0 0 -1 0 -1 0.
The extraction drops the first element (because of the -1 value) and the rest are
returned. This is a mistake we sometimes make with a logical subscript – in this
example, we probably intended to enter people[!age.gt.60], with the !
operator, in order to return people whose ages are not greater than 60.
When using a logical subscript, it is possible for the two vectors – the data
and the subscript – to be of different lengths. In that case R recycles the shorter
one, as described in Section 2.1.4. This might be useful if, say, you wanted to
keep every third element of your original vector, but in general we recommend
that your logical subscript be the same length as the original vector.
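For instance, recycling the subscript c(FALSE, FALSE, TRUE) keeps every
third element (the data vector here is invented):
> x <- 101:109
> x[c(FALSE, FALSE, TRUE)]  # subscript recycled to length 9
[1] 103 106 109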
The which() function can be used to convert a logical vector into a numeric
one. It returns the indices (i.e., the position numbers) of the elements that are
TRUE. So this is particularly useful when trying to find one or two anomalous
entries in a long vector of logical values. To find the locations of the minimum
value in a vector y, you can use which(y == min(y)), but the act of finding the index specifically of the minimum or maximum value is so common
that there are dedicated functions, called which.min() and which.max(),
for this task. There is one difference, though: these two functions break ties by
selecting the first index for which y is at its maximum or minimum, whereas
which() returns all the matching indices.
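This small, invented example shows the difference:
> y <- c(3, 1, 4, 1)
> which (y == min (y))  # every position of the minimum
[1] 2 4
> which.min (y)         # only the first one
[1] 2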
Using Names

The third kind of subscripting is to use a vector’s names. Since names are characters, a name subscript will need to be a character as well. Here is a named
vector, together with an example of subscripting by name.
> (vec <- c(a = 12, b = 34, c = -1))
 a  b  c
12 34 -1
> vec["b"]
 b
34
> vec[names (vec) != "a"]
 b  c
34 -1

To show all the values except the one named a, it is tempting to try something
like vec[-"a"]. However, R tries to compute the value of “negative a,” fails,
and produces an error. The final example above shows one way to exclude the
element with a particular name from being extracted.
Named vectors are not uncommon, but they do not come up very often in
data cleaning. The real use of names will become clearer in Chapter 3, where
we will encounter rectangular structures that have row names, column names,
or, very often, both.
2.3.2 Vectors of Length 0

Any of these extraction methods can produce a vector of length 0, if no element
meets the criterion. This happens particularly often when all of the elements
of a logical subscript are FALSE. A vector of length 0 is displayed as integer(0), numeric(0), character(0), or logical(0). In this example,
we show how such a vector might arise.
> (b <- c(101, 102, 103, 104))
[1] 101 102 103 104
> a <- b[b < 99]  # Reasonable, but no elements of b are < 99
> a
numeric(0)        # a has length 0

A zero-length vector cannot be used intelligently in arithmetic, and watch
out: the sum() of a numeric or logical vector of length 0 is itself zero. If a
zero-length vector is used as the condition in an if() statement, an error
results. This is an error that arises in data cleaning, as in this example:
> sum (a)                  # Possibly unexpected
[1] 0
> sum (a + 12345)          # Definitely unexpected
[1] 0
> if (a < 2) cat ("yes\n")
Error in if (a < 2) cat("yes\n") : argument is of length zero

In the last example, we made use of the cat() function, which writes its
arguments out to the screen, or, as R calls it, the console. The \n represents the
new-line, to return the cursor to the left of the screen. When writing functions
to do data cleaning (Chapter 5), we will need to check that the conditions
being tested are not vectors of length 0.
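A minimal sketch of such a check, using the && operator from Section 2.1.4 so
that the comparison is never attempted when a is empty:
> a <- numeric (0)
> if (length (a) > 0 && all (a < 2)) cat ("yes\n")  # no error; nothing printed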
2.3.3 Assigning or Replacing Elements of a Vector

Every operation that extracts some values can also be used to replace those values, simply using the extraction operation on the left side of an assignment. Of
course, R will require that the resulting vector have all its entries of the same
type. So, for example, a[2] <- 3 will replace the second entry of a with the
value 3. If a is logical, this operation will force it to be numeric; if a is character,
the second entry of a will be assigned the character value "3". Just as we can
extract using logical subscripts or names, we can use those subscripting techniques for assignment as well. These examples show replacement with numeric
and logical subscripts.
> (a <- c(101, 102, -99, 104, -99, 9106)) # last item should
[1]  101  102  -99  104  -99 9106         # have been 106
> a[6] <- 106                             # numeric subscript
> a
[1]  101  102  -99  104  -99  106
> a[a < 0] <- 9999                        # logical subscript
> a
[1]  101  102 9999  104 9999  106

As we mentioned, a logical subscript will almost always have the same length as
the data vector on which it is operating. In the preceding example, the logical
subscript a < 0 has the same length as a itself.
These examples show how names can be used to assign new values to the
elements of a vector.
> b <- c("A", "missing", "C", "D")
> names (b) <- c("Red", "White", "Blue", "Green")
> b
      Red     White      Blue     Green
      "A" "missing"       "C"       "D"
> b["White"] <- "B"   # name subscript
> b
  Red White  Blue Green
  "A"   "B"   "C"   "D"

It is also possible to assign to elements of a vector out past its end. This is one
way to combine two vectors. Elements that are not assigned will be given the
special NA value (see the following section). Another way to combine two vectors is with the c() command, but either way, if two vectors of different types
are combined, R will need to convert them to the same type. In this example,
we combine two vectors.
> a <- 101:103
> b <- c(7, 2, 1, 15)
> c(a, b)        # Combine two vectors
[1] 101 102 103   7   2   1  15
> a              # Unchanged; no assignment made
[1] 101 102 103
> a[4:7] <- b    # index non-existent values
> a
[1] 101 102 103   7   2   1  15
> b
[1]  7  2  1 15
> b[6] <- 22     # index non-existent value
> b              # b[5] filled in with NA
[1]  7  2  1 15 NA 22

In the last example, b[6] was assigned, but no instruction was given about
what to do with the newly created fifth element of b. R filled it in with the special
missing value code, NA. The following section describes how NA values operate
in R.

2.4 Missing Data (NA) and Other Special Values
In R, missing values are identified by NA (or, under some circumstances, by
<NA>; see Sections 2.5 and 4.6). This is a special code; it is not the two capital
letters N and A put together. Missing values are inevitable in real data, so it is
important to know the effect they have on computations, and to have tools to
identify them and replace them where necessary. In this section, we discuss NA
values in vectors; subsequent chapters expand the discussion to describe the
effect of NA values in other sorts of R objects.
Missing values arise in several ways. First, sometimes data is just missing – it
would make sense for an observation to be present, but in fact it was lost
or never recorded. Second, some observations are inherently missing. For
example, a field named MortPayRat might contain the ratio of a customer’s
monthly home mortgage payment to her monthly income. Customers with
no mortgage at all would presumably have no value for this field. An NA value
would make more sense than a zero, which would suggest a mortgage payment
of zero. Third, as we saw in the last section, missing values appear when we
try to extract an item that was never present in a vector. For example, the
built-in item letters is a vector containing the 26 lower-case letters of the
English alphabet. The expression letters[27] will return an NA. Finally, we
sometimes see other special values Inf or -Inf or NaN in response to certain
computations, like trying to divide by zero. Those special values can often be
treated as if they were NA values. We discuss these and a final special value,
NULL, in this section.
Since all the elements of a vector must be of the same kind, there are
actually several different kinds of NA. An NA in a logical vector is a little
different from an NA in a numeric or character one. (There are actually objects
named NA_real_, NA_integer_, and NA_character_, which make this
explicit.) Normally, the difference will not matter, but there is one case where
knowing about the types of NA can explain some behavior that both arises
fairly often and also seems mysterious. We mention this in Section 2.4.3.
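The typeof() function from Section 2.2.2 makes the distinction visible:
> typeof (NA)
[1] "logical"
> typeof (NA_real_)
[1] "double"
> typeof (c(1.5, NA))  # here the NA becomes a numeric NA
[1] "double"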
2.4.1 The Effect of NAs in Expressions

A general, if imprecise, rule about NA values is that any computation with an
NA itself becomes an NA. If you add several numbers, one of which is an NA,
the sum becomes NA. If you try to compute the range of a numeric with missing values, both the minimum and maximum are computed as NA. This makes
sense when you think of an NA as an unknown that could take on any value.
Basic mathematical computations for numeric vectors all allow you to specify
the na.rm = TRUE argument, to compute the result after omitting missing
values.
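For example, with an invented vector:
> sum (c(1, NA, 3))
[1] NA
> sum (c(1, NA, 3), na.rm = TRUE)
[1] 4
> range (c(1, NA, 3), na.rm = TRUE)
[1] 1 3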
2.4.2 Identifying and Removing or Replacing NAs

In every data cleaning problem we need to determine whether there are NA
values. What you cannot do to identify missing values is to compare them
directly to the value NA. Just as adding an NA to something produces an NA,
comparing an NA to something produces NA. So if a variable thing has value
3, the expression thing == NA produces NA, and if thing has value NA, the
expression thing == NA also produces NA. To determine whether any of
your values are missing, use the anyNA() function. This operates on a vector
and returns a logical, which is TRUE if any value in the vector is NA. More
useful, perhaps, is the is.na() function: if we have a vector named vec, a
call to is.na(vec) returns a vector of logicals, one for each element in vec,
giving TRUE for the elements that are NA and FALSE for those that are not.
We can also use which(is.na(vec)) to find the numeric indices of the
missing elements. Here, we show an example of a vector with NA values and
some examples of what operations can, and cannot, be performed on them.
> (nax <- c(101, 102, NA, 104))
[1] 101 102  NA 104
> nax * 2                  # Arithmetic on NAs gives NAs...
[1] 202 204  NA 208
> nax >= 102               # ...as do comparisons
[1] FALSE  TRUE    NA  TRUE
> mean (nax)               # One NA affects the computation
[1] NA
> mean (nax, na.rm = TRUE) # na.rm = TRUE excludes NAs
[1] 102.3333
> is.na (nax)              # Locate NAs with logical vector
[1] FALSE FALSE  TRUE FALSE
> which (is.na (nax))      # Numeric indices of NAs
[1] 3

When your data has NA or other special values, you are faced with a decision about how to handle them. Generally they can be left alone, replaced, or
removed. Removing missing values from a single vector is easy enough; the
command vec[!is.na(vec)] will return the set of non-missing entries in
vec. A more sophisticated alternative is the na.omit() function, which not
only deletes the missing values but also keeps track of where in the vector they
used to be. This information is stored in the vector’s “attributes,” which are extra
pieces of information attached to some R objects.
> nax[!is.na (nax)]      # Return the non-missing values
[1] 101 102 104
> (nay <- na.omit (nax)) # This keeps track of deleted ones
[1] 101 102 104
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"

Data cleaners will very often want to record information about the original
location of discarded entries. In this example, these can be extracted with a
command like attr(nay, "na.action").
Things get more complicated when the vector is one of many that need to
be treated in parallel, perhaps because the vector is part of a more complicated structure like a matrix or data frame. Often if an entry is to be deleted, it
needs to be deleted from all of these parallel items simultaneously. We talk more
about these structures, and how to handle missing values in them, in Chapter
3. (We also note that most modeling functions in R have an argument called
na.action that describes how that function should handle any NA values it
encounters. This is outside our focus on data cleaning.)
2.4.3 Indexing with NAs

When an NA appears in an index, an NA is produced in the result, but the exact
effect can be surprising. This arises often in data cleaning, since it is common to have a vector (usually fairly long and part of a larger data set) with
many NAs that you may not be aware of. Suppose we have a vector of data b
and another vector of indices a, and we want to extract the set of elements of
b for which a has the value 1, like this: b[a == 1]. The comparison a == 1
will return NA wherever a is missing, and b[NA] produces NA values. So the
result is a vector with both the entries of b for which a == 1 and also one NA
for every missing value in a. This is almost never what we want. If we want to
extract the values of b for which a is both not missing and also equal to 1, we
have to use the slightly clunky expression b[!is.na(a) & a == 1]. This
example shows what this might look like in practice.
> (b <- c(101, 102, 103, 104))
[1] 101 102 103 104
> (a <- c(1, 2, NA, 4))
[1]  1  2 NA  4
> b[!is.na (a) & a == 2] # We probably want this...
[1] 102
> b[a == 2]              # ...and not this.
[1] 102  NA

In the following example, we show how two commands that look alike are
treated slightly differently by R.
> b[a[2]]               # a[2] = 2; extract element 2 of b
[1] 102                 # ... which is 102
> b[a[3]]               # a[3] is NA
[1] NA
> (a <- as.logical (a)) # Now convert a to logical
[1]  TRUE  TRUE    NA  TRUE
> b[a[3]]               # a[3] is NA
[1] NA NA NA NA

In the first example of b[a[3]], the value in a[3] was a numeric NA, so R
treated the subscripting operation as a numeric one. It returned only one value.
In the second example, a[3] was a logical NA, and when R subscripts with a
logical – even when that logical value is NA – it recycles the subscript to have
the same length as the vector being indexed (we saw this in Section 2.1.4).
The lesson here is that when you have an NA in a subscript, R may return
something other than what you expect.

2.4.4 NaN and Inf Values

A different kind of special value can arise when a computation is so big that
it overflows the ability of the computer to express the result. Such a value is
expressed in R as Inf or -Inf. On 64-bit machines Inf is a bit bigger than
1.79 × 10³⁰⁸; it most often appears when a positive number is accidentally
divided by zero. Inf values are not missing, and is.na(Inf) produces
FALSE. Another special value is NaN, “not a number,” which is the result of
certain specific computations such as 0/0 or Inf + -Inf or computing the
mean of a vector of length 0. Unlike Inf, an NaN value is considered to be
missing. As with NA values, Inf and NaN values take over every computation
in which they are evaluated. There are rules for when more than one is
present – for example, Inf + NA gives NA, but NaN + NA gives NaN. From a
data cleaning perspective, all of these values cause trouble and you will generally want to identify any of these values early on. The function is.finite()
is useful here; this produces TRUE for numbers that are neither NA nor NaN
nor Inf nor -Inf. So in that sense it serves as a check on valid values. To see
whether every element of a numeric vector vec consists of values that are not
any of these special ones, use the command all(is.finite(vec)).
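For example:
> vec <- c(2, NA, NaN, Inf, -Inf)
> is.finite (vec)
[1]  TRUE FALSE FALSE FALSE FALSE
> all (is.finite (vec))
[1] FALSE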
2.4.5 NULL Values

A final sort of special value is the R value NULL. A NULL is an object with zero
length, no contents and no class. (A vector of length 0 has no contents, but
since it has a class – numeric, logical, or something else – it is not NULL.) In
data cleaning, NULLs most often arise when attempting to access an element
of a list, or a column of a data frame, which does not exist. We discuss this in
Section 3.4.3. For the moment, the important point is that we can test for NULL
values with the is.null() function, and that if you index using a NULL value
the result will be a vector of length 0.
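For example:
> is.null (NULL)
[1] TRUE
> is.null (numeric (0))  # length 0, but it has a class, so not NULL
[1] FALSE
> names (c(1, 2))        # a vector with no names (Section 2.1.5)
NULL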

2.5 The table() Function
The table() function is so important in data cleaning that it merits its own
section. This command, as its name suggests, produces a table giving, for each
of the unique values in its argument, the number of times that value appears.
In this example, we will create a vector with some color names in it, and we will
add in an NA as well.
> vec <- rep (c("red", "blue", NA, "green"), c(3, 2, 1, 4))
> vec
 [1] "red"   "red"   "red"   "blue"  "blue"  NA
 [7] "green" "green" "green" "green"
> table (vec)
vec
 blue green   red
    2     4     3

There are a couple of things to notice here. First, the ordering of the results in
the table is alphabetical, rather than being determined by the order the entries
appear in the vector vec. Second, the resulting object is not quite a named
vector, as you can see by the word vec that appears above the word blue.
(We omit this line in many future displays to save space.) In fact, this object
has class table, but it can be treated like a named vector – so, for example,
table(vec)["green"] produces 4. Third, by default table omits NA as
well as NaN values. In data cleaning this is almost never what we want. There
are two different arguments to the table() function that serve to declare how
you want missing values to be treated. The first of these is named useNA. This
argument takes the character values "no" (meaning exclude NA values, which
was the default as seen earlier), "ifany" (meaning to show an entry for NAs
if there are any, but not if there aren’t) and "always", meaning to show an
entry for NAs whether there are any NA values or not. In our current example,
where there is one NA, the table() command with useNA set to "ifany"
or "always" will produce output like this:
> table (vec, useNA = "always")
 blue green   red  <NA>
    2     4     3     1

Notice that R displays the entry for NA values as <NA>, with angle brackets.
This makes it easier to use the characters "NA" as a regular character string,
perhaps for “North America” or possibly “sodium.” (This angle bracket usage
will appear again later.) R will not be confused if you have both NA values and
also actual character strings with the angle brackets, such as "<NA>", but it
is definitely a bad practice. To see what happens when there are no NAs, let us
look at the same vector without its missing entry, which is number 6.
> table (vec[-6], useNA="ifany")
 blue green   red
    2     4     3
> table (vec[-6], useNA="always")
 blue green   red  <NA>
    2     4     3     0

For data cleaning purposes, we almost always want to know about missing
values, so we will almost always want useNA to be "ifany" or "always".
The second missing-value argument, exclude, allows you to exclude specific
values from the table. By default, exclude has the value c(NA, NaN),
which is why those values do not appear in tables. Most commonly we set
this value to NULL to signify that no entries should be excluded, although
sometimes we exclude certain very common values. Here we might want to

41

42

A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R

exclude the common value green while tabulating all other values, including
NAs. The following example shows how we can do that. It also shows the use
of exclude = NULL.
> table (vec, exclude="green")
 blue   red  <NA>
    2     3     1
> table (vec, exclude=NULL)
 blue green   red  <NA>
    2     4     3     1

It is possible to supply both useNA and exclude at the same time, but the
results may not be what you expect. We recommend using either useNA or
exclude to display missing values in every table.
2.5.1 Two- and Higher-Way Tables

If we give two vectors of the same length to the table() function, the result
is a two-way table, also called a cross-tabulation. For example, suppose we had
a vector called years, one for each transaction in our data set, with values
2015, 2016, and 2017; and suppose we also had a vector called months,
of the same length, with values such as "Jan", "Feb", and so on. Then
table(years, months) would produce a 3 × 12 table of counts, with
each cell in the table telling how many entries in the two vectors had the values
for the cell. That is, the top-left cell would give the number of entries from
January 2015; the one to the right of that would give the number of entries for
February 2015; and so on. (If there are fewer than 12 months represented in
the data, of course, there will be fewer than 12 columns in the table.) This is an
important data cleaning task – to determine whether two variables are related
in ways we expect. If, for example, we saw no transactions at all in March 2016,
we would want to know why.
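Here is a small, invented version of that example:
> years <- c(2015, 2015, 2016, 2016, 2016)
> months <- c("Jan", "Feb", "Jan", "Jan", "Feb")
> table (years, months)
      months
years  Feb Jan
  2015   1   1
  2016   1   2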
In R, a two-way table is treated the same as a matrix; we discuss matrices
in detail in the following chapter. For very large vectors, the data.table()
function in the data.table package (Dowle et al., 2015) may prove more
efficient than table(). Three- and higher-way tables are produced when the
arguments to table() are three or more equal-length vectors. These tables
are treated in R as arrays; we give an example in Section 3.2.7. The xtabs()
function is also useful for creating more complex tables.
2.5.2 Operating on Elements of a Table

The table() command counts the number of observations that fall into a particular category. In the example above, the table(years, months) command produces a two-way table of counts. Often we want to know more than
just how many observations fall into a cell. R has several special-purpose functions that operate on tables. The prop.table() function takes, as its first
argument, the output from a call to table(), and depending on its second
argument produces proportions of the total counts in the table by cell, or by
row, or by column. In this example we set up three vectors, each of length 15.
Then we show the effect of calling table(), and of calling prop.table() on
the result. By default, prop.table() computes the proportions of observations in each cell of the table. In the final example, we use the second argument
of 2 to compute the proportions within each column; supplying 1 would have
produced the proportions within each row.
> yr <- rep (2015:2017, each=5)
> market <- c("a", "a", "b", "a", "b", "b", "b", "a", "b",
              "b", "a", "b", "a", "b", "a")
> cost <- c(64, 87, 71, 79, 79, 91, 86, 92, NA,
            55, 37, 41, 60, 66, 82)
> (tab <- table (market, yr))
      yr
market 2015 2016 2017
     a    3    1    3
     b    2    4    2
> prop.table (tab)    # These proportions sum to 1
      yr
market       2015       2016       2017
     a 0.20000000 0.06666667 0.20000000
     b 0.13333333 0.26666667 0.13333333
> prop.table (tab, 2) # Each column's proportions sum to 1
      yr
market 2015 2016 2017
     a  0.6  0.2  0.6
     b  0.4  0.8  0.4

The margin.table() command produces the marginal totals from a
table – that is, row or column totals (controlled by the second argument) for a
two-way table, and corresponding sums for a higher-way one. The addmargins() function incorporates those totals into the table, producing a new
row or column named Sum (or both). This is often a summary statistic we want,
but watch out – the convention regarding the second argument of addmargins() is not the same as that of prop.table() and margin.table().
This example shows addmargins() in action.
> addmargins (tab)    # append row and column sums
      yr
market 2015 2016 2017 Sum
  a       3    1    3   7
  b       2    4    2   8
  Sum     5    5    5  15
> addmargins (tab, 2) # append a Sum column of row totals
      yr
market 2015 2016 2017 Sum
  a       3    1    3   7
  b       2    4    2   8
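For comparison, margin.table() returns just the totals; its second argument follows the prop.table() convention (1 for rows, 2 for columns):
> margin.table (tab, 1)  # row totals
market
a b
7 8
> margin.table (tab, 2)  # column totals
yr
2015 2016 2017
   5    5    5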

We might also want to know the average, standard deviation, or maximum of
entries in a numeric variable, broken down by which cell they fall into. In our
example, we might want the maximum cost among the three observations
from 2015 with market a, and for the two from 2015 and market b, and so
on. For this purpose we use the tapply() function, whose name reminds us
that it applies a function to a table. This function’s arguments are the vector
on which to do the computation (in our example, cost), an argument named
INDEX describing the grouping (here, we might use the vector yr), and then
the function to be applied. The following example shows tapply() at work.
In the first line, we use the min() function to produce the minimum value for
each year – but an NA is produced for 2016 since one cost for that year is
NA. We can pass the na.rm = TRUE argument into tapply(), which then
passes it into min() as in the following example, if we want to compute the
minimum value among non-missing entries.
> tapply (cost, yr, min)   # find minimum within each yr
2015 2016 2017
  64   NA   37
> tapply (cost, yr, min, na.rm = TRUE)
2015 2016 2017
  64   55   37

It is possible to extend this example to the two-way case of minimum cost,
or another statistic, by both market and year. Here the tabularization part, represented by the argument INDEX, needs to be a list. We discuss lists starting
in Section 3.3; for the moment, just know that a list is required when grouping with more than one vector. In the first example as follows, we compute
the mean of the cost values for each combination of market and year (using
na.rm = TRUE as above, and the list() function to construct the list). In
the second example, we show how we can supply our own function “in line,”
which makes it more transparent than if we had written a separate function.
The details of writing functions are covered in Chapter 5, but here our function
takes one argument, named x, and returns the value given by the sum of the
squares of the entries of x. (In this example, we pass the na.rm = TRUE argument directly to sum to keep our function simpler.) The tapply() function
is in charge of calling our function six times, once for each cell of the table.
> tapply (cost, list (market, yr), mean, na.rm = TRUE)
      2015     2016     2017
a 76.66667 92.00000 59.66667
b 75.00000 77.33333 53.50000
> tapply (cost, list (market, yr),
          function (x) sum (x^2, na.rm = TRUE))
   2015  2016  2017
a 17906  8464 11693
b 11282 18702  6037

2.6 Other Actions on Vectors
In this section, we describe additional actions on vectors that we find particularly important for data cleaning. This includes rounding numeric values,
sorting, set operations, and the important topics of identifying duplicates and
matching.
2.6.1 Rounding

R operates on numeric vectors using double-precision arithmetic, which means
that often there are more significant digits available than are useful. Results will
often need to be displayed with, say, two or three significant digits. The natural
way to prepare displays like this is through formatting the numbers – that
is, changing the way they display, but not their actual values. We discuss
formatting in Section 4.2. But sometimes we want to change the numbers
themselves, perhaps to force them to be integers or to have only a few significant digits. The round() function and its relatives do this. Round() lets the
user specify the number of digits to the right of the decimal place to be saved;
the signif() function lets him or her specify the total number of significant
digits retained. So round(123.4567, 3) produces 123.457, while
signif(123.4567, 3) produces 123. A negative second argument produces rounding to the nearest power of 10, so round(123.4567, -1) rounds
to the nearest 10 and produces 120, while round(123.4567, -2)
rounds to the nearest 100 and produces 100. The trunc() function discards
the part after the decimal point and produces an integer; floor() and ceiling()
round to the next lower and next higher integer, respectively, so floor(-3.4)
is -4 while trunc(-3.4) is -3. Rounding of problematic entries (like those
that end in a 5) can be affected by floating-point error (see Section 1.3.3).
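These behaviors are easy to verify at the command line:
> round (123.4567, 3)
[1] 123.457
> signif (123.4567, 3)
[1] 123
> round (123.4567, -1)
[1] 120
> floor (-3.4); trunc (-3.4)
[1] -4
[1] -3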
2.6.2 Sorting and Ordering

It is common to have to sort the elements of a vector, and the sort() function
performs that task in R. By default, the sort is from smallest to largest, but the
decreasing = TRUE argument will reverse the order. There are two minor
complications. First, sort() will drop NA and NaN values by default, giving a
vector shorter than the original when these values are present. This behavior
is controlled by the na.last argument, which itself defaults to NA. If set to
TRUE, this argument will have the sort() function place NA and NaN values
at the end, and, if FALSE, at the beginning of the sorted output.
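A quick illustration:
> sort (c(3, NA, 1, 2))
[1] 1 2 3
> sort (c(3, NA, 1, 2), na.last = TRUE)
[1]  1  2  3 NA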


A second complication is in sorting character vectors. Sorting in this case is
alphabetical, of course, so if the characters are text representations of numbers
such as "1", "2", "5", "10", and "18", the resulting output, sorted alphabetically, will be "1", "10", "18", "2", and "5". Moreover, the sorting order
depends on the character set and locale being used. We mentioned this in
Section 1.4.6 and address it further in Section 4.5.
The related order() function returns a set of indices that you can use to sort
a vector. This is useful when you want to re-arrange one vector’s values in the
order specified by a second vector. (If that sounds as if it wouldn’t be a common
task, wait until Section 3.5.4.) In this example, we have a vector of names, and
a vector of scores, and we want the names in ascending order of score.
> nm <- c("Freehan", "Cash", "Horton",
"Stanley", "Northrop", "Kaline")
> scores <- c(263, 263, 285, 259, 264, 287) # 2 tied at 263
> nm[order(scores)]
# ascending order of score
[1] "Stanley" "Freehan" "Cash"
[4] "Northrop" "Horton"
"Kaline"
> nm[order(scores, nm)]
# tie broken by nm
[1] "Stanley" "Cash"
"Freehan"
# (alphabetically)
[4] "Northrop" "Horton"
"Kaline"
> nm[order (scores, decreasing = TRUE)] # descending
[1] "Kaline"
"Horton"
"Northrop"
[4] "Freehan" "Cash"
"Stanley"

As in the example, the order() function can be given more than one vector.
In this case, the second vector is used to break ties in the first; if a third vector
were supplied, it would be used to break any remaining ties, and so on. It is
very common to re-order a set of data that has time indicators (month and year,
maybe) from oldest to newest. The order() function has the same na.last
argument that sort() has, although its default value is TRUE.
2.6.3 Vectors as Sets

Often we need to find the extent to which two vectors have values that overlap.
For example, we might have customer data from two sources and we want
to determine the extent to which the customer IDs agree; or we might want
to find the set of states in which none of our customers reside. These call for
techniques that treat vectors as sets and that will normally be most useful
when the data is a small number of integers, character data, or factors, about
which we say more in Section 4.6. They can be used with non-integer data
as well, but as always we cannot rely on two floating-point numbers that we
expect to be equal actually being equal.
The essential set membership operation is performed by the %in% function.
R has a few functions with names like this, surrounded by percentage signs.


This allows us to use a command like a %in% b, rather than the equivalent,
but perhaps less transparent, is.element(a, b). The return value is a logical vector the same length as a, indicating whether each element of a
is found anywhere in b. In data cleaning we very often tabulate the result of
this function call; so a command like table(a %in% b) produces a table of
FALSE and TRUE, giving the number of items in a that were not found in b,
and the number that were. For this purpose, an NA value in a matches only an
NA in b, and similarly an NaN value in a matches only an NaN value in b. In
this example, we compare some alphanumeric characters to the built-in data
set letters containing the 26 lower-case letters of the alphabet.
> c("g", "5", "b", "J", "!") %in% letters
[1] TRUE FALSE TRUE FALSE FALSE
> table (c("g", "5", "b", "J", "!") %in% letters)
FALSE  TRUE
    3     2

The union(), intersect(), and setdiff() functions produce the
union, intersection, and difference between two sets. This example shows
those functions in action.
> union (c("g", "5", "b", "J", "!"),
letters)
# elements
[1] "g" "5" "b" "J" "!" "a" "c" "d" "e"
[14] "k" "l" "m" "n" "o" "p" "q" "r" "s"
[27] "x" "y" "z"
> intersect (c("g", "5", "b", "J", "!"),
letters)
# elements
[1] "g" "b"
> setdiff (c("g", "5", "b", "J", "!"),
letters)
# elements
[1] "5" "J" "!"

2.6.4

in either vector
"f" "h" "i" "j"
"t" "u" "v" "w"

in both vectors

of a not in b

Identifying Duplicates and Matching

Another data cleaning task is to find duplicates in vectors. The anyDuplicated() function reports the position of the first duplicated element in a vector, or zero if there are no duplicates. The unique() function extracts only the set of distinct values
(including, by default, NA and NaN). The distinct values appear in the output
in the order in which they appear in the input; for data cleaning purposes we
will often sort those unique values.
Often it will be important to know which elements are duplicates. The
duplicated() function returns a logical vector with the value TRUE for
the second and subsequent entries in a set of duplicates. However, the first
entry in a set of duplicates is not indicated. For example, duplicated
(c(1, 2, 1, 1)) returns FALSE FALSE TRUE TRUE; the first 1 is not


considered duplicated under this definition. (Alternatively, the fromLast
= TRUE argument reads from the end of the vector back to the beginning, but again the “first” member of a set of duplicates is not indicated.)
Combining a call with fromLast = FALSE and one with fromLast
= TRUE – with the element-wise "or" operator |, or by taking the union() of the two sets of indices from which() – identifies all duplicates.
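A minimal sketch, reusing the example vector from above:
> x <- c(1, 2, 1, 1)
> duplicated (x) | duplicated (x, fromLast = TRUE)
[1]  TRUE FALSE  TRUE  TRUE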
A common task is to find all the entries that are duplicated anywhere in the
data set (or that are never duplicated). One way to do this is via table(). Any
value that appears more than once is, of course, duplicated (but remember that
floating-point numbers might not match exactly). In this example, we construct
a vector from the lower-case letters, but add a few duplicates.
> let <- c(letters, c("j", "j", "x"))
> (tab <- table (let))
let
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
> which (tab != 1) # table locations where duplicates appear
 j  x
10 24    # 10th & 24th table entries aren't ones
> names (tab)[tab != 1]
[1] "j" "x"

It is often useful to use table() twice in a row. This example counts the number of entries that appear once, twice, and so on in the original data. Consider
this example:
> table (table (let))
 1  2  3
24  1  1    # 24 entries are 1, one is 2, one is 3

The last line shows that there are 24 entries in let that appear once; one entry,
x, that appears twice; and one entry, j, that appears three times. We use this in
almost every data cleaning problem to find entries that appear more often than
we expect. In a real application, we might have tens of thousands of elements
and only a few duplicates. The which(tab != 1) command shows us the
elements that are duplicated, but not how many times each one appears; the
table(table(let)) command shows us how many duplicates there are,
but not which letter goes with which count.
Another important task is matching, which is where we identify where, in a
vector, we can find the values in another vector. We will find this particularly
useful when merging data frames in Section 3.7.2. There are two ways to handle elements that do not match; they can be returned as NA, preserving the
length of the original argument in the length of the return value, or, with the
nomatch = 0 argument, they can be returned as 0, which allows the return
value to be used as an index. In this example, we match two sets of names.


> nm <- c("Jensen", "Chang", "Johnson",
"Lopez", "McNamara", "Reese")
> nm2 <- c("Lopez", "Ruth", "Nakagawa", "Jensen", "Mays")
> match (nm, nm2)
[1] 4 NA NA 1 NA NA
> nm2[match (nm, nm2)]
[1] "Jensen" NA       NA       "Lopez"  NA       NA

The third command tells us that the first element of nm, which is Jensen,
appears in position 4 of nm2; the second element of nm, Chang, does not appear
in nm2, and so on. We can extract the elements that matched from the nm2 vector as in the last line – but the NA entries in the output of match() produce
NAs in the vector of names. An easier approach is to supply the nomatch = 0
argument, as in this example.
> match (nm, nm2, nomatch = 0)
[1] 4 0 0 1 0 0
> nm2[match (nm, nm2, nomatch = 0)]
[1] "Jensen" "Lopez"

We use match() (or its equivalent) in any data cleaning problem that requires
combining two data sets. Understanding how match() works makes data
cleaning easier. Match() is, in fact, a more powerful version of %in%.
2.6.5 Finding Runs of Duplicate Values

During a data cleaning problem, it often happens that a particular identifier – a
name or account number, perhaps – appears many times in an input data set. As
an example we might be given a list of payments, with each payment identified
by a customer number and each customer contributing dozens of payments.
It will be useful to count the number of times each repeated item appears. We
also use this on logical vectors to find, for example, the locations and lengths of
sets of payments that are equal to 0. The rle() function (the name stands for
“run length encoding”) does exactly this: given a vector, it returns the values
of its “runs” – maximal stretches of repeated values – and each run’s length. In this example, we show
what the output of the rle() function looks like.
> rle (c("a", "b", "b", "a", "c", "c", "c"))
Run Length Encoding
lengths: int [1:4] 1 2 1 3
values : chr [1:4] "a" "b" "a" "c"

This output shows that the vector starts with a run of length 1 (the first element
in the lengths vector) with value a (the values vector); then a run of length
2 with value b; and so on. The output is actually returned in the form of a list
with two parts named lengths and values; in Section 3.3, we discuss how
to access the pieces of a list individually.
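As a preview, the two pieces can be extracted by name with the $ operator (here we save the result as r):
> r <- rle (c("a", "b", "b", "a", "c", "c", "c"))
> r$lengths
[1] 1 2 1 3
> r$values
[1] "a" "b" "a" "c"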


2.7 Long Vectors and Big Data
Starting in version 3.0.0, R introduced something called a long vector, a special
mechanism that allows vectors to be much longer than before. Since there are
only 2^31 − 1 values of an integer, entries in a long vector beyond that point will
have to be indexed by double indices. Other than that, this extension should,
in principle, be invisible to users. One exception is that the match() function,
and its descendants, is.element() and %in%, do not work on long vectors.
On long vectors, table() can be very slow and the data.table package
provides some faster alternatives. R’s documentation suggests avoiding the use
of long vectors that are characters.

2.8 Chapter Summary and Critical Data Handling Tools
This chapter introduces R vectors, which come in several forms, primarily logical, numeric, and character. The mode(), typeof(), and class() functions
give you information about the class of a vector. The set of is. functions like
is.numeric() returns a TRUE/FALSE result when an object is of the specified mode, and the set of as. functions performs the conversion. Remember
that logicals are simpler than numerics, and numerics simpler than character,
and that converting from a simpler to a more complicated mode is straightforward. Converting from a more complicated to a simpler mode follows these
rules:
• Converting character to numeric produces NA for things that aren’t numbers,
like the character strings "TRUE" or "$199.99".
• Converting character to logical produces NA for any string that isn’t "TRUE",
"True", "true", "T", "FALSE", "False", "false" or "F".
• Converting numeric to logical produces FALSE for a zero and TRUE for any
non-zero entry (and watch out for floating-point error here).
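A quick sketch of these rules at the console (the first command also triggers a warning that NAs were introduced by coercion):
> as.numeric (c("12", "$199.99", "TRUE"))
[1] 12 NA NA
> as.logical (c("T", "true", "yes"))
[1] TRUE TRUE   NA
> as.logical (c(0, 0.001, -2))
[1] FALSE  TRUE  TRUE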
Extracting and assigning subsets of vectors are critical parts of any data cleaning project. We can use any of the modes as an index or “subscript” with which
to extract or assign. A logical subscript returns the values that match up with
its TRUE entries. Logical subscripts are extended by recycling where necessary (but most often when we do this it is by mistake). A numeric subscript
returns the values specified in the subscript – and, unsurprisingly, numeric subscripts are not recycled. The which() command identifies TRUE values in a
logical vector, so you can use that to convert a logical subscript to a numeric
one. Finally, a character subscript will extract, from a named vector, elements
whose names are present in the subscript (and, again, this kind of subscript is
not recycled).


Any kind of vector can have missing values, indicated by NA, and there are
a few other special values as well. Missing values influence computations they
are involved in, so we often want to supply an argument like na.rm = TRUE
to a function computing a sum, mean or other summary statistic on numeric
data. You should expect to encounter missing values in any data set from any
source and be prepared to accommodate them.
The table() function is critical to data cleaning. It tabulates a vector,
returning the number of times each unique value appears, with names corresponding to the original values in the data set. Passing two or more vectors
to table() produces a two- or higher-way cross-tabulation. We recommend adding the useNA = "ifany", useNA = "always", or exclude
= NULL arguments to ensure that table() counts and displays the number
of NA values, unless you’re certain no values are missing. Using table() on
the output of table() – as in table(table(x)) – tells us how many
items in a vector x appear once, twice, and so on. This is useful for detecting
entries that appear more often than expected.
Using names() on the output of table() will produce the unique entries
in a vector, but we also use the unique() function to find these. We spend
a lot of energy in identifying duplicates, and the duplicated() function is
useful here – although, remember, it does not return TRUE for the first item in a
set of duplicates. The is.element() and %in% functions help determine the
extent to which two sets of values overlap; both of these are simpler versions
of the match() function, which is critical to combining data from different
sources.


3
R Data, Part 2: More Complicated Structures

3.1 Introduction
R data is made up of vectors, but, as you already know, there are more
complicated structures that consist of a group of vectors put together. In
this chapter, we talk about the three major structures in R that data handlers
need to know about. The most important of these is the data frame, in which,
eventually, almost all of our data will be held. But in order to build up to the
data frame, we first need to describe matrices and lists. A data frame is part
matrix, part list, and in order to use a data frame most efficiently, you need to
be able to think of it in both ways. Furthermore, we do encounter matrices in
the data cleaning world, since the table() command can produce something
that is basically a matrix.

3.2 Matrices
A matrix (plural matrices) is essentially a vector, arrayed in a (two-dimensional)
rectangle. As with a vector, every element of a matrix needs to be of the same
type – logical, numeric, or character. Most of the matrices we will see will be
numeric, but it is also possible to have a logical matrix, typically for subscripting, as we shall see. We start using the vector of 15 numbers, 101, 102, …, 115,
to produce a 5 × 3 (i.e., five rows by three columns) numeric matrix.
> (a <- matrix (101:115, nrow = 5, ncol = 3))
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 102 107 112
[3,] 103 108 113
[4,] 104 109 114
[5,] 105 110 115


There are a couple of points to mention here. First, the matrix is filled column
by column, with the first column being filled before the second one starts. We
often intuitively expect the matrix to be filled row by row, because our data
comes in rows, and we read English left-to-right, but this is not how R works. If
you need to load your data into your matrix by rows, use the byrow = TRUE
argument. This arises when you copy a matrix off of a web page, for example;
we expect the entries to be read along the top line, but R stores them down the
first column. (We come back to this example in Section 6.5.3.)
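For instance, with a small made-up vector:
> matrix (1:6, nrow = 2, byrow = TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6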
Second, notice the row and column indicators such as [4,] and [,2]. In
the following section, we will see how to use those numbers to extract elements
from the matrix, or to assign new ones.
Third, the length() operator can be used on a matrix, but it returns the
total number of elements in the matrix. More often we want to know the number of rows and columns; that information is returned by the nrow() and
ncol() functions, or jointly by the dim() function, which gives the numbers
of rows and columns in that order:
> length (a)
[1] 15
> dim (a)
[1] 5 3

Fourth, in our example, we used the matrix() function to create the
matrix from one long vector. An alternative is to create a matrix from a set
of equal-length vectors. The cbind() function (“c” for column) combines a
set of vectors into a matrix column by column, while rbind() performs the
operation row by row. If the vectors are of unequal length, R will use the usual
recycling rules (Section 2.1.4). Again, all of the elements of a matrix need to be
of the same sort, so if any vector is of type character, the entire matrix will be
character.
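As a small sketch (the names first and second here are arbitrary; cbind() turns them into column names):
> cbind (first = 1:3, second = c(10, 20, 30))
     first second
[1,]     1     10
[2,]     2     20
[3,]     3     30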
As with vectors, arithmetic operations on matrices are performed element by
element, so A^2 squares each element of A and A * B multiplies two matrices
element by element. There are special symbols for matrix-specific operations:
for example, A %*% B performs the usual kind of matrix multiplication, t(A)
transposes a matrix, and solve() inverts a matrix. These operations do not
tend to come up much in data cleaning, but often, we want to perform an operation on a matrix row by row or column by column. We come back to these row
and column operations in Section 3.2.3.
3.2.1 Extracting and Assigning

Since a matrix is just a vector, it is possible to use a subscript just like the
one we used in Section 2.3.1 to pull out or replace an element. In the example
above, a[6] would produce 106 (remember that we count by columns first),
and a[6] <- 999 would replace that element with 999. However, it is much

R Data, Part 2: More Complicated Structures

more common to identify elements of a matrix by two subscripts, one for the
row and one for the column. These two subscripts are separated by a comma.
In our example, a[1,2] would produce 106, and a[1,2] <- 999 would
replace that value.
Of course, it is possible to ask for more than one entry at a time. In this
example, we ask for a 3 × 2 sub-matrix from our original matrix a:
> a[c(4, 2), c(3, 1)]
[,1] [,2]
[1,] 114 104
[2,] 112 102

The two rows we asked for, numbers 4 and 2, in that order, are returned, with
the corresponding entries from columns 3 and 1, in that order. Just as when we
use subscripts on a vector, we may use duplicate subscripts; a vector of negative
numbers indicates that the corresponding entries should be removed.
If you leave one of the two subscripts empty, you are asking for an entire row
or column. This command says “give me all the rows except for number 2, and
all the columns.”
> a[-2,]
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 103 108 113
[3,] 104 109 114
[4,] 105 110 115

Notice here that some rows have been renumbered. The row that had been
number 5 in the original a is now the fourth row. This is not surprising, but it
raises the question as to whether we might be able to keep track of rows that
have been deleted, since that would help us audit changes we have made to the
data. We will describe one way to do that using row names in Section 3.2.2.
In addition to using a numeric subscript, we can use a logical one. Logical
subscripts for rows or columns act exactly as logical subscripts for vectors (see
Section 2.3.1). Whether you use numeric or logical subscripts, subscripting a
matrix with row and column indices will return a rectangular object. To extract
values from, or assign new values to, a non-rectangular set of entries, you can
use a matrix subscript, which we describe in Section 3.2.5.
Demoting a Matrix to a Vector

In order to turn a matrix into a vector, use the c() function on it. Just as c()
creates vectors from individual elements (see, e.g., Section 2.1.1), it also creates vectors from matrices. In our example, c(a) will produce a vector of 15
numbers. The entries in that vector come from the first column, followed by
the second column, and so on. In order to extract data row by row, transpose
the matrix first, using the t() function in a command like c(t(a)).


Sometimes, though, R produces a vector from a matrix when we did not
expect it. In this example, see what happens when we ask for, say, the second
column of a. Remember that a has five rows and three columns.
> a[,2]
[1] 106 107 108 109 110

The result of this operation is not a matrix with five rows and one column;
it is a vector of length 5. This reduction – or “demotion” – from a matrix to
a vector follows a general rule in R under which dimensions of length 1 are
usually removed (“dropped”) by default. This can cause trouble when you have
a function that is expecting a matrix, perhaps because it plans to use dim() to
find the number of rows. If you pass a single column of a matrix, that is, a vector,
such a function would call dim() on a vector, which returns the value NULL.
The way around this is to specify that this dropping should not take place, using
the drop = FALSE argument, like this:
> a[,2,drop = FALSE]
[,1]
[1,] 106
[2,] 107
[3,] 108
[4,] 109
[5,] 110

The result of that operation is a matrix with five rows and one column. When
building functions that take subsets of matrices, it is often a good idea to use
drop = FALSE to ensure that the resulting subset is itself a matrix and not a
vector.
3.2.2 Row and Column Names

It is very convenient to have a matrix whose rows and columns have names.
We can assign (and extract) row and column names with the dimnames()
function, described in Section 3.3.2, and there are also functions named
rownames() and colnames() to do the same job. (There is also an equivalent row.names() function, spelled with a dot, but, interestingly, there is
no col.names() function.) Rows and columns are named automatically
by the table() function (technically, a two-way table has class table, not
matrix, but that distinction will not matter here). We start this extended
example by constructing a table.
> yr <- rep (2015:2017, each = 5)
> market <- c(2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2)
> (tbl <- table (market, yr))
      yr
market 2015 2016 2017
     2    3    1    3
     3    2    4    2

Notice that the row-name entries ("2" and "3" under market) are not a
column of the table; they are merely identifiers. This table has three columns,
not four. Now we show the column names and demonstrate how they can be
changed using the colnames() function.
> colnames (tbl)
[1] "2015" "2016" "2017"
> colnames (tbl) <- c("FY15", "FY16", "FY17")
> tbl
      yr
market FY15 FY16 FY17
     2    3    1    3
     3    2    4    2

Once row or column names have been assigned, we can refer to them by name
as well as by number. This makes it possible to refer to a row or column in a consistent way, without having to know its location. Notice, though, that dimension
names are characters, even if they look numeric. So, for example, tbl[2,] will
produce the second row of the matrix tbl, while tbl["2",] will produce the
row whose name is "2", regardless of what number that row has – and even if
earlier rows have been removed.
> tbl[2,]
FY15 FY16 FY17
   2    4    2
> tbl["2",]
FY15 FY16 FY17
   3    1    3

3.2.3 Applying a Function to Rows or Columns

There are lots and lots of operations on matrices supported by R, but many
of them are mathematical and not useful in data cleaning. One operation that
does come up, though, is running a function separately on each row or column
of a matrix. A few of these are so common that they are built in. Specifically,
there are functions named colSums() and rowSums(), which compute all
of the column sums or row sums, and corresponding functions for the means,
colMeans() and rowMeans(). Very often, though, you want to apply some
custom function, such as the one that tells how many entries are NA or missing.
The facility for doing this is the apply() function, to which you supply the
matrix, the direction of travel (1 for across rows, 2 for down columns), and
then the function that is to be applied to each row or column. This last can be


a function built into R, a function you have written yourself, or even a function
defined on the fly.
> a <- matrix (101:115, 5, 3)
# These four commands produce identical results
> rowSums (a)
> apply (a, 1, sum)
# Pass argument na.rm into the sum() function
> apply (a, 1, sum, na.rm = T)
> apply (a, 1, function (x) sum (x))
[1] 318 321 324 327 330
# User-written command selects the second-smallest entry
# in each column
> apply (a, 2, function (x) sort(x)[2])
[1] 102 107 112

If each call to the function returns a vector of the same length, apply()
creates a matrix. In this example, we use the range() function to produce
two values for each column of a.
> apply (a, 2, range)
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 105 110 115

When apply() is used with a vector-valued function, such as range() in
the last example, the output is arranged in columns, regardless of whether the
operation was performed on the rows or the columns of the original matrix.
This does not always match our intuition, particularly when the operation was
performed on rows. In this example, we show the row-by-row ranges of the a
matrix and then transpose using the t() function.
> apply(a, 1, range)
[,1] [,2] [,3] [,4] [,5]
[1,] 101 102 103 104 105
[2,] 111 112 113 114 115
# Use t() to transpose that matrix
> t(apply(a, 1, range))
[,1] [,2]
[1,] 101 111
[2,] 102 112
[3,] 103 113
[4,] 104 114
[5,] 105 115

A difficulty arises when different calls to the function produce vectors of different lengths. In that case, R cannot construct a matrix and has to return the
results in the form of a list (we discuss lists in Section 3.3). This might arise, say,

R Data, Part 2: More Complicated Structures

when looking for the locations of unusual values by column. In this example, we
look for the locations in each column of values greater than 109 in the matrix a.
> apply (a, 2, function (x) which (x > 109))
[[1]]
integer(0)
[[2]]
[1] 5
[[3]]
[1] 1 2 3 4 5

This result tells us that the first column has no entries >109, the second column’s fifth entry is >109, and all five entries in the third column are >109. In
general, you have to be aware that apply() might return a list if the function
being applied can return vectors of different lengths.
3.2.4 Missing Values in Matrices

One very common use of apply() is to count the number of missing values in
each row or column, since missing values always affect how we do data cleaning.
This code shows how to count the number of NA values in each column. To show
off some more of R’s capabilities, we use the semicolon, which allows multiple
commands on one line, and the multiple assignment operation, which lets us
assign several things at once.
> a <- matrix (101:115, 5, 3); a[5, 3] <- a[3, 1] <- NA; a
     [,1] [,2] [,3]
[1,]  101  106  111
[2,]  102  107  112
[3,]   NA  108  113
[4,]  104  109  114
[5,]  105  110   NA
> apply (a, 2, function (x) sum (is.na (x)))
[1] 1 0 1

From the last command, we see there is one missing value in each of columns
1 and 3.
We saw how to use which() to identify missing values in a vector back
in Section 2.4, and the same command can also identify missing values in a
matrix. By default, which(is.na(vec)) will return the indices of vec with
missing values as if vec had been stretched out into a long vector (column by
column, as always). However, the arr.ind = TRUE argument will supply the
row and column indices of the items selected by which(). This is extremely
useful in tracking down a small number of missing values. In this example, we
use which() to identify the missing entries in a.


> which (is.na (a))
[1]  3 15
> which (is.na (a), arr.ind = TRUE)
     row col
[1,]   3   1
[2,]   5   3

Here, which() returns a matrix with two named columns and two unnamed
rows. Of course, this approach is not limited to finding NAs. It can also be used
to find negative values, or anything else that is unexpected and needs to be
cleaned.
3.2.5 Using a Matrix Subscript

In the last example, we saw how which() with arr.ind = TRUE returns
a matrix giving a vector of rows and a vector of columns that, together,
identify the cells that had NA values. One underused feature of R is that
we can use a matrix subscript, such as the one returned by which() with
arr.ind = TRUE to extract from, or assign to, another matrix. We can
also use the vector returned by the ordinary use of which(), but the matrix
approach sometimes makes it much easier to extract the necessary rows or
columns. In this example, we construct a matrix with five columns of data and
a sixth column named "Use". This final column tells us which of the data
columns should be extracted for each of the rows.
> b <- matrix (1:20, nrow = 4, byrow = TRUE)
> b <- cbind (b, c(3, 2, 0, 5))
> colnames (b) <- c("P1", "P2", "P3", "P4", "P5", "Use")
> rownames (b) <- c("Spring", "Summer", "Fall", "Winter")
> b
       P1 P2 P3 P4 P5 Use
Spring  1  2  3  4  5   3
Summer  6  7  8  9 10   2
Fall   11 12 13 14 15   0
Winter 16 17 18 19 20   5

Since the first row’s value of Use is 3, we want to extract the third element of
that row; since the second row’s value of Use is 2, we want the second element
of that row; and so on. Without the ability to use a matrix subscript, we might
be forced to loop through the rows of b, but in R we can extract all these items
in one call. Our matrix subscript has two columns, one giving the rows from
which we are extracting (in this case, all the rows of b in order) and another
giving the column from which to extract (in this case, the values in the Use
column of b). Here we show how we can construct this matrix subscript and use it
to extract the relevant entries of b.


> (subby <- cbind (1:nrow(b), b[,"Use"]))
       [,1] [,2]
Spring    1    3
Summer    2    2
Fall      3    0
Winter    4    5
> b[subby]
[1]  3  7 20

Notice that in this example the value of Use in the third row was zero – and
therefore no value was produced for that row of the matrix subscript (see “zero
subscripts” in Section 2.3.1). Negative values cannot be used in a matrix subscript.
As a real-life example of where this might occur, we were recently given a
matrix of customer payments. The first 96 columns contained monthly payment amounts. The last column gave the number of the month with the last payment in it. Our task was to extract the payment amount whose month appeared
in that final column. So if, in the first row, that column had the value 15, we
would have extracted the amount from the 15th column of the payment matrix;
and so on for the second and subsequent rows.
Two points here: first, notice that in the example above we extracted using
b[subby], with no additional commas. The matrix subscript defines both
rows and columns. Second, remember to use cbind() to construct the
subscript argument (our subby above). Make sure that the matrix subscript
really is a matrix, and not two separate vectors, or you will extract rows and
columns separately. Matrix subscripting works with names, too. Since our matrix b
has both row and column names, we could have used a character matrix in
exactly the same way as the numeric subby; in that case, both columns of the
subscript argument must be character. We cannot have one vector be numeric and the other character,
because we need to combine them into a matrix, and all the entries in a matrix
have to be of the same type. It is also possible to have a logical matrix act as a
subscript – but the results are surprising and we do not recommend it.
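For instance, using the row and column names assigned to b above, a character matrix subscript extracts in the same way:
> b[cbind (c("Spring", "Winter"), c("P3", "P5"))]
[1]  3 20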
3.2.6 Sparse Matrices

A sparse matrix is one whose entries are largely zero. For example, in a language
processing application we might form a matrix with words in the rows and documents in the columns. Then a particular cell, say, the ijth one, would have a
zero if word i did not appear in document j, and since, in many examples, most
words do not appear in most documents that matrix might have a high proportion of zeros. There are a number of schemes for representing sparse matrices.
The recommended Matrix package (Bates and Maechler, 2016) implements


many of these. We encounter sparse matrices in our work, but rarely in the
context of data cleaning, so we will not discuss them in this book.
3.2.7 Three- and Higher-Way Arrays

Three-way (and higher-way) matrices are called arrays in R. An array looks like
a matrix in that all of its elements need to be of the same type, but a three-way
array requires three subscripts, a four-way array requires four subscripts, and so
on. The only time we seem to have encountered such a thing in data cleaning is
when constructing a three- or higher-way table(). In this example, we show
a three-way table made from three vectors each of length 8, and then we extract
the value 3 from the second row of the first column of the first “panel.”
> who <- rep (c("George", "Sally"), c(2, 6))
> when <- rep (c("AM", "PM"), 4)
> worked <- c(T, T, F, T, F, T, F, T)
> (sched <- table (who, when, worked))
, , worked = FALSE

        when
who      AM PM
  George  0  0
  Sally   3  0

, , worked = TRUE

        when
who      AM PM
  George  1  1
  Sally   0  3

> sched[2,1,1]
[1] 3

Many commands that work on matrices, like apply() and prop.table(),
operate on arrays as well. You can also use c() on an array to produce a vector – in this case, the first column of the first panel is followed by the second
column of the first panel, and so on. The aperm() function plays the role of
t() for higher-way arrays.

3.3 Lists
A list is the most general type of R object. A list is a collection of things that
might be of different types or sizes; a list might include a numeric matrix, a
character vector, a function, another list, or any other R object. Almost every
modeling function in R returns a list, so it is important to understand lists when

R Data, Part 2: More Complicated Structures

using R for modeling, but we also need to describe lists because one special sort
of list is the data frame, which we describe in the following section.
Normally, we will encounter lists as return values from functions, but we can
create a list with the list() function, like this:
> (mylist <- list (alpha = 1:3, b = "yes", funk = log, 45))
$alpha
[1] 1 2 3
$b
[1] "yes"
$funk
function (x, base = exp(1))  .Primitive("log")

[[4]]
[1] 45

Lists also appear as the output from the split() function, which divides a
vector into (possibly unequal-length) pieces according to the value of another
vector. We use this frequently in data cleaning. For example, we might divide
a vector of people’s ages according to their gender. In this simple example, we
show how split() produces a list; later, in Section 3.5.1, we show how that
list can be put to use.
> ages <- c(26, 45, 33, 61, 22, 71, 43)
> gender <- c("F", "M", "F", "M", "M", "F", "F")
> split (ages, gender)
$F
[1] 26 33 71 43
$M
[1] 45 61 22
> split (ages, ages > 60)
$`FALSE`
[1] 26 45 33 22 43
$`TRUE`
[1] 61 71

It is worth noting that if the second argument – gender in this case – has missing values, those values will be dropped from the output of split(). Notice
also that in the second example the names of the list elements have been surrounded by backward quotes. This is for display, because FALSE and TRUE are
not valid names here, but the character strings "FALSE" and "TRUE" are.
The length of a list, as found using length(), is the number of elements,
regardless of how big each individual element is. The lengths() function

63

64

A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R

returns a vector of lengths, one for each element on the list. In our example,
length(mylist) returns the value 4, whereas lengths(mylist) returns
a vector with four lengths in it (including the length of 1 that is returned for the
function funk()). The str() command we described in Section 2.2.2 works
on lists as well. The resulting value printed to the screen gives a description of
every element on the list – one line for atomic elements and multiple lines for
lists within lists. This is one way to help understand the structure of your data
quickly.
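For example, on the mylist object created above (the fourth element is unnamed, so its entry prints with a blank name):
> length (mylist)
[1] 4
> lengths (mylist)
alpha     b  funk
    3     1     1     1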
3.3.1 Extracting and Assigning

In the first example in this section, the first three elements were given names
and the fourth was not. That output hints at how to extract items from a list. You
can use double square brackets – so mylist[[4]] will return 45 – or, if an
element has a name, you can use the dollar sign and the name – so mylist$b
will return "yes", and split(ages, ages > 60)$"TRUE" will return the
vector of ages >60. Single square brackets can be used, with a numeric, logical, or name subscript, but there’s a catch – single square brackets return a list,
not the contents of the list. This is useful if you want only a couple of pieces
of a list. For example, mylist[1:2] will return a list with the first two elements of mylist, and mylist[1] will return a list with the first element of
mylist – not as a vector but as a list. A logical subscript will also work here:
mylist[c(T, T, F, F)] will return the same list as mylist[c(1,2)]
or mylist[c("alpha", "b")]. Most of the lists we run into will have
names, and we usually extract elements one at a time with the dollar sign, but
the distinction between single and double brackets is still important. Single
brackets create lists; double brackets extract contents. And what happens in our
example if you ask for mylist[[2:3]] or mylist[[c(F, T, T, F)]]?
Unsurprisingly, these commands generate errors.
When you request a list item using single brackets and a name that is not
present on the list, R returns a list with one NULL element; with double brackets
or the dollar sign, it returns the NULL itself. This is consistent with the rule
that says single brackets produce lists, while double brackets and dollar signs
extract contents. Using double brackets with a numeric subscript greater than
the number of elements in the list, such as mylist[[11]] in our example,
produces an error rather than a NULL.
Of course, to extract elements of a list by name, we need to know the names.
We can determine the names a list has using the names() function. If the
list has no names at all, this function will return NULL; if some elements have
names, the names() function will return an empty string for those elements
with no names. This example shows the names of the mylist list.
> names(mylist)
[1] "alpha" "b"     "funk"  ""


We can also use the names() function on the left-hand side of an assignment
to change the names of the elements on a list. For example, the command
names(mylist)[4] <- "RPM" would change the name of the fourth element of mylist to "RPM".
Unlike when you use single or double square brackets, when using the dollar
sign to extract an item, you don’t need its full name. (Technically, you can pass
the exact argument into double square brackets to control this behavior, but
we don’t.) You only need enough to identify the item unambiguously. In this
example, mylist$a would be enough to produce the same numeric vector
returned by mylist$alpha, but if there were two items on the list, say alpha
and algorithm, typing mylist$a would produce NULL. You would need to
specify at least mylist$alp in order to be unambiguous. It’s often convenient
to use these abbreviated names, but that approach is best suited for quick work
at the command line. We recommend using full names in functions and scripts,
to avoid confusion or even an error if new items get added to the list later.
To replace an item on a list, just re-assign it. If you want to add a new
item to a list, just assign the new item to a new name. Here, naturally, you
need to use the item’s full name. If your mylist has an item called alpha
and you use the command mylist$alp <- 3, you will create a new item
named alp and leave the old one, alpha, unchanged. To delete an item from
a list, you can use subscripting as we did for a vector. For example, either
mylist <- mylist[c(1, 2, 4)] or mylist <- mylist[-3] will
drop the third entry. But another, possibly easier, way is to assign NULL
to the name or number. In this example, mylist$funk <- NULL or
mylist[[3]] <- NULL would remove the item named funk from
mylist. This behavior means that it is difficult to intentionally store a NULL
value in a list, but this does not seem to be much of a limitation.
Another useful function for operating on lists is unlist(), which, as its
name suggests, tries to turn your entire list into a vector. When the list contains unusual objects, such as the function element of the mylist list in our
example, the results of unlist() can be difficult to predict. This example
shows the effect of unlist() operating on a list of regular vectors, which we
create by excluding the function element of mylist.
> unlist (mylist[-3])
alpha1 alpha2 alpha3      b
   "1"    "2"    "3"  "yes"   "45"

Here we can see that R has produced names for each of the elements from the
vector mylist$alpha, and in a well-behaved list these names can be useful.
3.3.2 Lists in Practice

Generally, we do not need lists much when data cleaning. As we have noted,
lists arise as the output of many R functions – a function in R cannot return

65

66

A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R

more than one result, so if a function computes two things of different sizes,
it will need to return a list. For example, the rle() function we described in
Chapter 2 returns the lengths of runs and, separately, the value associated with
each run. It is then your job to extract the pieces from the list. The pieces will
almost always be named, so they will be able to be extracted using the $ operator. (In the case of rle(), the pieces are called lengths and values.) Lists
also arise as the output from the split() command. Normally, after calling
split() we would then call an apply()-type function on each element of
the resulting list. We describe this in Section 3.5.1. And, of course, the apply
functions can themselves produce lists, as we saw in Section 3.2.3.
Another common context in which lists arise concerns the dimension names
of a matrix. The dimnames() function returns NULL when applied to a matrix
without row or column names. Otherwise, it returns a list with two elements:
the vector of row names and the vector of column names. In general, this return
has to be a list, rather than a matrix, because the number of rows and number
of columns will be different. Either of the two entries may be NULL, because a
matrix may have row names without column names, or vice versa. The dimnames() function may be used to assign, as well as extract, dimension names.
These examples continue the earlier ones using the two-way table tbl and the
three-way table sched from Sections 3.2.2 and 3.2.7, respectively, and show
dimnames() at work. Notice that dimnames() produces a list with three
vectors of names from the three-way table.
> dimnames(tbl)
$market
[1] "2" "3"
$yr
[1] "FY15" "FY16" "FY17"
> dimnames (sched)
$who
[1] "George" "Sally"
$when
[1] "AM" "PM"
$worked
[1] "FALSE" "TRUE"

As we have seen before, dimension names are always characters. So in the
three-way array example, the names for the worked dimension are the character strings "FALSE" and "TRUE", not the logical values. In the following
example, we show how we can modify an element of the dimnames() list.


> dimnames(tbl)[[2]][1] <- "Archive"
> tbl
      yr
market Archive FY16 FY17
     2       3    1    3
     3       2    4    2

In the dimnames() assignment we change the first column name. Here,
dimnames(tbl) produces a list, the [[2]] part extracts the vector
of column names, and the [1] part accesses the element we want to
change. Of course, we could have achieved the same result with dimnames
(tbl)$yr[1] <- "Archive".
Another list that arises from R itself is the list of session options, returned
from a call to the options() function. This list includes dozens of elements
describing things such as the number of digits to be displayed, the current
choice of editor, the choices going into scientific notation, and many more.
Calling names(options()) will produce a vector of the names of the
current options. You can examine a particular option, once you know its
name, with a command like options()$digits. To set an option, pass
its name and value into the options() function, with a command like
options(digits = 9).

3.4 Data Frames
Now that we understand how matrices and lists work, we can focus on the
most important object of all, the data frame. A data frame (written with a dot
in R, as data.frame) is a list of vectors, all of which are the same length, so
that they can be arrayed in a matrix-like rectangle. (Technically, the elements
of a data frame can also be matrices, as long as they are of the right size, but
let us avoid that complication. For our purposes, the elements of a data frame
will be ordinary vectors.) The vectors in the list serve as the columns in the
rectangle. A data frame looks like a matrix, with the critical difference that
the different columns can be of different types. One column can be numeric,
another character, a third factor, a fourth logical, and so on. Each vector
has elements of one type, as usual, but the data frame allows us to store the
sort of data we get in real life. So a data frame about people might contain
their names (which would probably be character), their ages (often numeric,
but possibly factor), their gender (possibly character, possibly factor), their
eligibility for a particular program (which might be logical), and so on. In this
example, we use the data.frame() function to construct a data frame. In
data cleaning, our data frames are very often produced by functions that read
in data from the disk, a database, or some other source. We describe methods


of acquiring data in Chapter 6, but for the moment we will use this simple
example.
> (mydf <- data.frame (
      Who = letters[1:5], Cost = c(3, 2, 11, 4, 0),
      Paid = c(F, T, T, T, F), stringsAsFactors = FALSE))
  Who Cost  Paid
1   a    3 FALSE
2   b    2  TRUE
3   c   11  TRUE
4   d    4  TRUE
5   e    0 FALSE

There are a few points worth noting here. First, R has provided row names
(visible as 1 through 5 on the left) to the data frame automatically. A matrix
need not have row names or column names, and a list need not have names,
but a data frame must always have both row names and column names.
R will create them if they are not explicitly assigned, as it did here. The
data.frame() function ensures (unless you specify otherwise) that column
names are valid and not duplicated. You may specify row names explicitly,
using the row.names argument, in which case they must not be duplicated
or missing. Column names can be examined and set using the names()
command, as with a list, or with the colnames() or dimnames() commands, as with a matrix. Generally, you will probably find the names() or
colnames() approaches to be easier, since they involve vectors and not a list.
For row names, the rownames() and row.names() functions allow the
row names of a data frame to be examined or assigned. Section 3.2.2 describes
how row names can be useful when handling matrices, and those points are
true for data frames as well.
A second point is that, by default, the data.frame() function turns character vectors into factors. Factors are discussed in Section 4.6, and, as we mention there, they are useful, even required, in some modeling contexts. They are
rarely what we want in data cleaning, however. The best way to keep factors out
of data frames is to not allow them in the first place; we accomplished this in
the example above by passing the stringsAsFactors = FALSE argument
to the data.frame() function. Without that argument, the Who column of
mydf would have been a factor variable with five levels. Another way to prevent
factors from being created is to set the stringsAsFactors global option
to be FALSE, using the options(stringsAsFactors = FALSE) command. However, we cannot rely on all of the users of our code having that setting
in place, so we always try to remember to turn this option off explicitly when
we call data.frame(). This issue will arise again when we talk about combining data frames later in this section, and about reading data in from outside
sources in Chapter 6.

R Data, Part 2: More Complicated Structures

There are several functions that help you examine your data frame. Of
course, in many cases, it will be too big to simply print out and examine. The
head() and tail() functions display only the first or last six rows of a data
frame, by default, but this can be changed by the second argument, named n.
So head(mydf, n = 10) will show the first 10 rows, tail(mydf, 12)
will show the last 12, and, using a negative argument, head(mydf, -120)
will show all but the last 120. The str() function prints a compact representation of a data frame that includes the type of each column, as well as the first
few entries. Other useful functions include dim(), to report the numbers of
rows and columns, and summary(), which gives a brief description of each
column.
3.4.1 Missing Values in Data Frames

Because the columns of a data frame can be of different classes, missing values can be of different classes, too. A missing value in a numeric column will
be a numeric missing value, while in a character column, the missing value
will be of the character type. We discussed missing values at some length in
Section 2.4. It is always good to know where missing values come from and
why they exist – often investigating the causes of “missingness” will lead to discoveries about the data. The is.na() function operates on a data frame and
returns a logical-valued matrix showing which elements (if any) are missing;
the anyNA() function operates on data frames as well. One approach to handling missing data is to simply omit any observations (rows) of the data frame in
which one or more elements is missing. R’s na.omit() function does exactly
that. (For this purpose, NaN is missing but Inf and -Inf are not.) This is the
default behavior for a number of R’s modeling functions, but in general we do
not recommend deleting records with missing values until the reason for the
values being missing is understood.
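A minimal sketch, using a small made-up data frame:
> df <- data.frame (x = c(1, NA, 3), y = c(4, 5, NA))
> na.omit (df)
  x y
1 1 4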
3.4.2 Extracting and Assigning in Data Frames

Since a data frame is matrix-like and also list-like, we can use both matrix-style
and list-style subsetting operations on a data frame. One difference appears
when we select a single row. With a matrix, selecting a single row returns a
vector, unless you specify drop = FALSE (see Section 3.2.1). However, with
a data frame, even a single row is returned as a data frame with one row because
in general even one row of a data frame will contain entries of different types.
With that one difference, we extract rows from a data frame just as we
extract rows from a matrix – by number, including negatives; using a logical
vector; or by names (as we mentioned, the rows of a data frame, and the
columns, always have names). We can extract columns using either list-style
access or matrix-style access. List-style access uses single brackets to produce
sub-lists, which in this case means that using single brackets will produce
a data frame. Double bracket subscripts, or the dollar sign, will produce a
vector. The difference is that double brackets require an exact name, unless
exact = FALSE is set, whereas the dollar sign only requires enough of the
name to be unambiguous. If there are two columns with similar names, and
your request is not sufficient to determine a unique answer, nothing at all
(i.e., NULL) is returned. Therefore, it makes sense, particularly when writing
functions for other people, to use full names for columns.
Matrix-style access uses column names or numbers; just as with a matrix,
selecting only one column will produce a vector unless you explicitly set
drop = FALSE. This example shows a number of ways of extracting columns
from data frames. We start by showing list-style access using single brackets.
> mydf[2]           # Numeric subscript
  Cost
1    3
2    2
3   11
4    4
5    0
> mydf["Cost"]      # Subscript by name
  Cost
1    3
2    2
3   11
4    4
5    0
> mydf[c(F, T, F)]  # Logical subscript
  Cost
1    3
2    2
3   11
4    4
5    0

Each of those operations produced a data frame with five rows and one column
(which is, of course, a list). In the following examples, we use double brackets together with a numeric or character subscript and produce a vector. As
with a list, a logical subscript with more than one TRUE inside a pair of double brackets will produce an error. (You might have expected the same result
with a numeric subscript; in fact, a numeric subscript of length 2 can be used;
it acts as a one-row matrix subscript.) When using a character index inside
double brackets, you can specify exact = FALSE to permit the same sort of
matching that we get with the dollar sign.

> mydf[[2]]
[1] 3 2 11 4 0
> mydf[["Cost"]]
[1] 3 2 11 4 0
> mydf[["C"]]
NULL
> mydf[["C", exact = FALSE]]
[1] 3 2 11 4 0

Notice that the result in each of these cases is a vector. In the following
examples, we show the use of the dollar sign to extract a column. In this
case, as we mentioned, we need to specify only enough of the name to be
unambiguous.
> mydf$W            # Extracts the "Who" column
[1] "a" "b" "c" "d" "e"

The dollar sign can only refer to one column at a time. To extract more than one
column, we can use single brackets as above, or matrix-style access in which
we explicitly specify rows and columns. As with a matrix, leaving one of those
two indices blank will select all of them, and R will produce vectors from single columns unless the drop = FALSE argument is specified. This example
shows extraction using matrix-style syntax.
> mydf[1:2, c("Cost", "Paid")]
  Cost  Paid
1    3 FALSE
2    2  TRUE
> mydf[,"Who", drop = FALSE]  # Example of drop = FALSE
  Who
1   a
2   b
3   c
4   d
5   e

Removing a column from a data frame is exactly like removing an element
from a list and is accomplished in the same way – by assigning NULL to
the column reference. Running the command mydf$Paid <- NULL will
remove the column Paid from the data frame using list-style notation, and
mydf[,"Paid"] <- NULL performs the same task using matrix-style
notation.
To replace subsets of elements you can once again use the matrix-style
or list-style syntax. So, for example, mydf[c(1,3), "b"] <- "A" and
mydf$b[c(1,3)] <- "A" both replace the first and third entries of the b
column of mydf with "A". (Of course, if that column had been numeric or
logical before, this operation will force R to convert it to character.)
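For instance, working on a copy so that mydf itself is unchanged:
> tmp <- mydf
> tmp$Paid <- NULL             # list-style removal of a column
> names (tmp)
[1] "Who"  "Cost"
> tmp[c(1, 3), "Who"] <- "A"   # matrix-style replacement
> tmp$Who
[1] "A" "b" "A" "d" "e"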

3.4.3 Extracting Things That Aren’t There

The critical difference between a matrix and a data frame is that the columns
of a data frame can be vectors of different types. Another difference manifests
itself when you try to access an element that isn’t there, maybe because you
asked for a row or column number that was too big or a row or column name
that didn’t exist. In a vector, attempts to extract an item beyond the end of the
vector will produce NAs. But if you ask a matrix for a row or column that doesn’t
exist, R will produce an error. This example shows the difference:
> (mat <- matrix (1001:1006, 2, 3))  # Matrix with six items
     [,1] [,2] [,3]
[1,] 1001 1003 1005
[2,] 1002 1004 1006
# Ask for a non-existent entry, using vector-like indexing
> mat[8]
[1] NA
> mat[,4]  # Ask for a non-existent column
Error in mat[, 4] : subscript out of bounds

In general, we prefer the error. A function that sees an NA will often try to carry
on, whereas an error will force you to stop and figure out what has happened.
The situation with data frames (and lists) is different. Supplying subscripts for
which there are no rows produces one row with all NAs for every unusable subscript. The entries in these rows will have the same classes (numeric, character,
etc.) that the data frame had. This arises when some rows have been deleted,
and then you, or a program, try to access one of the deleted rows by name. In
this example, we show how asking for rows that don’t exist can cause trouble.
> mydf2 <- data.frame (alpha = 1:5, b = c(T, T, F, T, F),
      NX = c("NA", "NB", "NC", "ND", "NE"),
      stringsAsFactors = FALSE,
      row.names = c("Red", "Blue", "White", "Reddish", "Black"))
> mydf2
        alpha     b NX
Red         1  TRUE NA
Blue        2  TRUE NB
White       3 FALSE NC
Reddish     4  TRUE ND
Black       5 FALSE NE
# Let's ask for rows that don't exist.
> mydf2[c(9, 4, 7, 1),]
        alpha    b   NX
NA         NA   NA <NA>
Reddish     4 TRUE   ND
NA.1       NA   NA <NA>
Red         1 TRUE   NA

In this example, we see that the resulting data frame has four rows, two of
which contain only NA values. The character column’s NAs are represented with
angle brackets, as <NA>, to make it easy to distinguish a missing value from
the legitimate character string NA in row 1. The first two columns’ NAs are
numeric and logical. As elsewhere (e.g., Section 2.4.3), logical subscripts will
recycle – which is rarely what you want – and usually produce unwanted results
when they contain NAs.
In the following example, we show one more operation that can produce
rows with NAs in them. Since our data frame has rows named both "Red"
and "Reddish", asking for a row named "Re" is ambiguous and produces
a row of NAs. (In contrast, the row names of a matrix may not be abbreviated;
supplying a name that is not an exact row name produces an error.)
> mydf2["Re",]   # Not enough to be unambiguous
   alpha  b   NX
NA    NA NA <NA>

A much more frequent problem happens when accessing columns. If you
access a non-existent column in the matrix or list styles, using an abbreviation, R produces an error. In our example, mydf2[,"gamma"] (referring to a
non-existent column), mydf2[,"N"] (referring to an abbreviated name, with
a comma), and mydf2["N"] (without the comma) all produce errors. In contrast, when using the double-bracket notation, NULL is returned when a name
is abbreviated or non-existent. (As we mentioned with lists, there is in fact
an exact argument to the double brackets that we do not use.) Just like the
NA returned when accessing a non-existent element of a vector, this NULL has
the potential to be more trouble than an error would have been. The use of
the dollar sign, as we mentioned, permits the use of unambiguously abbreviated names but produces a NULL when used with a non-existent name. In this
example, we show how asking for a non-existent name can produce an unexpected result.
# Ask for the first column by abbreviated name.
> mydf2$alph
[1] 1 2 3 4 5
# No problem
# Create another column with a similar name
> mydf2$alpha.plus.1 <- mydf2$alpha + 1
> mydf2$alph
NULL
> mydf2$alph + 1
# No error, but..
numeric(0)
# probably unexpected

The second-to-last operation produced NULL because alph was not sufficient
to differentiate between the columns alpha and alpha.plus.1. If a row
or column name matches exactly, R will extract it properly (so if you have
alpha and alpha.plus.1 and ask for alpha, there is no ambiguity). It
is a good practice to use complete names, unless there is a strong reason
not to.

3.5 Operating on Lists and Data Frames
Very often we will want to operate on each of the elements of a list or each
of the rows or columns of a data frame. For example, we might want to know
how many missing values are in each column. In Section 3.2.3, we saw how to
operate on a matrix using apply(), but apply() does not work on a list (since a list doesn’t have
dimensions). The apply() function does work on data frames, but it first converts the data frame into a matrix. This conversion will only be sensible when all
the columns are of the same type, as with the all-numeric data frame described
in Section 3.5.2. In other cases, the results can be quite unexpected. In this
example, we operate on the rows of a data frame, using apply(), to show how
this can go wrong.
> (dd <- data.frame (a = c(TRUE, FALSE), b = c(1, 123),
      cc = c("a", "b"), stringsAsFactors = FALSE))
      a   b cc
1  TRUE   1  a
2 FALSE 123  b
> apply (dd, 1, function (x) x)
   [,1]    [,2]
a  " TRUE" "FALSE"
b  "  1"   "123"
cc "a"     "b"

Here the function passed to apply() does nothing but return whatever it
passed to it. Since data frame dd has a character column, apply() converted
the whole data frame into a character matrix. It does this in part by calling
the format() function column by column, producing the results seen here:
a value " TRUE" with a leading space in row 1 (formatted to be the same
length as the string "FALSE"), and "  1" with two leading spaces in row 2
(formatted to be the same length as the string "123"). Analogous conversions
happen whenever a data frame with at least one column that is neither logical
nor numeric is passed to apply(), used in other matrix functions such as t()
(transpose), or accessed with a matrix subscript.
A general approach to this sort of operation (element by element for a list, column by column for a data frame) is supplied by sapply() and lapply().
The lapply() function always returns a list, whereas sapply() runs
lapply() and then tries to make the output into a vector (if the function
always returns a vector of length 1) or a matrix (if the function returns a
vector of constant length). Be careful, though, because if the different function
calls return items of different lengths, sapply() will need to return a list,
just as the ordinary apply() function did back in Section 3.2.3. Moreover,
if the function returns elements of different types (perhaps as a row of a
data frame), sapply() will try to convert these to a common type. In these
cases, use lapply(). The following example shows one very common use of
sapply(), which is to return the classes of each column in a data frame.
> sapply (mydf2, class)
       alpha            b           NX alpha.plus.1
   "integer"    "logical"  "character"    "numeric"

In this example, the regular apply() function will convert the whole data
frame to character first, before computing the classes, which it would report
as all character.
It is easy to operate on the columns of a data frame (or the elements of a list)
with the lapply() and sapply() functions. As we have seen, it is more difficult
to operate on the rows. These two functions provide a solution to this problem.
They can be used with an ordinary numeric vector as their first argument,
in which case they act like a for() loop, applying their function to each
element of the vector. The for()-like behavior of lapply() and sapply()
is most useful when using a complicated function on each row of a data frame.
The command sapply (1:nrow(ourdf), function (i) fancy
(ourdf[i,])) runs a user-written function called fancy() on each row
of a data frame. The argument to
fancy() really is a data frame, and not one that has been converted into a
matrix. In this example, we show how we might identify rows that contain the
number 1. Note that the naïve use of apply() does not find the number 1 in
the first row.
> apply (dd, 1, function (x) any (x == 1))
[1] FALSE FALSE
> sapply (1:2, function (i) any (dd[i,] == 1))
[1] TRUE FALSE

3.5.1

Split, Apply, Combine

The family of apply() functions all operate as part of strategy that Wickham
(2011) calls “split-apply-combine.” The data is split (possibly by row, possibly
by column), a function is applied to each piece, and the results recombined.
We have already met the tapply() function (Section 2.5.2), which performs
exactly this set of operations on vectors. We can also do this explicitly via
split() and sapply() or lapply(). We start the following example
by constructing a data frame with some people’s ages, genders, and ages of
spouses, and computing the average value of Age by Gender. In this example,
we do not specify stringsAsFactors = FALSE.
> age <- data.frame (Age = c(35, 37, 56, 24, 72, 65),
Spouse = c(34, 33, 49, 28, 70, 66),
Gender = c("F", "M", "F", "M", "F", "F"))

> split (age$Age, age$Gender)
$F
[1] 35 56 72 65

$M
[1] 37 24

> sapply (split (age$Age, age$Gender), mean)
   F    M
57.0 30.5

Here the split() function returns a list with the elements of Age divided
by value of Gender. Then sapply() operates the mean() function on
each element of the list and returns a vector (i.e., it performs both the
“apply” and “combine” operations). In this example, we could have used
tapply(age$Age, age$Gender, mean) to produce an identical result.
However, unlike tapply(), split() can operate on a data frame, producing a list of data frames. We can then write a function to operate on each
data frame. In this example, we split our data frame by Gender and then use
summary() on each of the resulting data frames to return some information about every column. Applied to the factor column Gender, summary() is
more informative than when applied to a character column; this is why we did
not specify stringsAsFactors = FALSE earlier. The result of the calls to
summary() appears as a specially formatted table.
> split (age, age$Gender)
$F
  Age Spouse Gender
1  35     34      F
3  56     49      F
5  72     70      F
6  65     66      F

$M
  Age Spouse Gender
2  37     33      M
4  24     28      M
> lapply (split (age, age$Gender), summary)
$F
      Age            Spouse      Gender
 Min.   :35.00   Min.   :34.00   F:4
 1st Qu.:50.75   1st Qu.:45.25   M:0
 Median :60.50   Median :57.50
 Mean   :57.00   Mean   :54.75
 3rd Qu.:66.75   3rd Qu.:67.00
 Max.   :72.00   Max.   :70.00

$M
      Age            Spouse      Gender
 Min.   :24.00   Min.   :28.00   F:0
 1st Qu.:27.25   1st Qu.:29.25   M:2
 Median :30.50   Median :30.50
 Mean   :30.50   Mean   :30.50
 3rd Qu.:33.75   3rd Qu.:31.75
 Max.   :37.00   Max.   :33.00

Using sapply() in this case produces an unexpected result (try it!). That
function tries hard to construct a vector or matrix whenever it can. A
single command that produces essentially the same final result, without
letting you save the list, is the by() function. In this example, by(age,
age$Gender, summary) performs the summary() operation on each
column, broken down by gender.
Under some circumstances, the three tasks of split, apply, and combine might
require separate functions, each of which may have its own arguments and conventions. The dplyr package (Wickham and Francois, 2015) presents a set of
tools that aim to make this sort of processing more consistent. Although this
package is intended for data frames, the earlier plyr package (Wickham, 2011)
handles lists and arrays as well. Both are intended to be fast and efficient and
to permit parallel computation, which we address in Section 5.5. We have been
accustomed to performing these tasks in regular R, and we recommend that
users know how to perform these tasks there, since lots of existing code and
users take that approach.
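For readers who want a taste of that approach, here is a minimal dplyr sketch of the same split-apply-combine computation as the tapply() call above (this assumes the dplyr package is installed; we do not rely on it elsewhere):
library (dplyr)
age %>%
  group_by (Gender) %>%
  summarise (MeanAge = mean (Age))   # MeanAge: 57.0 for F, 30.5 for M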
3.5.2 All-Numeric Data Frames

We noted above that it is difficult to apply a function to the rows of a data
frame because the entries of a row may have different classes. All-numeric data
frames, though – those whose columns are all logical or numeric – behave
specially in these situations. When one of these data frames is converted to
a matrix the numeric nature of the columns is preserved (with logicals being
converted to numeric). These data frames can also be transposed, or accessed
with a matrix subscript, without losing their numeric nature. All-numeric data
frames provide a useful way of storing numbers in a matrix-like way while being
able to use data-frame-like syntax – but, again, as soon as one character column
(perhaps an ID) is added, the nature of the data frame changes.
Just as there are functions as.numeric() and so on to convert vectors
from one class to another (see Section 2.2.3), R provides as.matrix() and
as.data.frame() functions to convert data frames to matrices and vice
versa. This is mostly useful for all-numeric data frames or for older functions
that require numeric matrices.
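Here is a small sketch with a hypothetical all-numeric data frame; note that the logical column survives as 0s and 1s:
> num.df <- data.frame (x = c(1.5, 2.5), flag = c(TRUE, FALSE))
> as.matrix (num.df)   # the logical column becomes numeric
    x flag
1 1.5    1
2 2.5    0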

3.5.3 Convenience Functions

We encourage users to use long names for their data objects and for their column names, for increased readability. However, this often leads to a situation
where to use a simple expression we need a long line like the one in this example:
CustPayment2016$JanDebt +
    CustPayment2016$FebPurch -
    CustPayment2016$FebPmt
The with() and within() functions provide an easier way to perform operations such as these, and they are particularly useful when the same operation
needs to be done multiple times on multiple data objects, usually data frames.
For each of these functions, we pass the data frame’s name and then the expression to be performed, like this:
with (CustPayment2016, JanDebt + FebPurch - FebPmt)

One issue is that if the expression includes an assignment, the assignment is
ignored. In order to create a new column in CustPayment2016 we would
need code like this:
CustPayment2016$FebDebt <- with (CustPayment2016,
JanDebt + FebPurch - FebPmt)

As an alternative, the within() function can perform assignments; it returns
a copy of the data with the expression evaluated. In this case, we could add a
new column called FebDebt to the data frame with a command like this:
CustPayment2016 <- within (CustPayment2016,
FebDebt <- JanDebt + FebPurch - FebPmt)

Notice that in this example within() returns a copy, which then needs to be
saved.
Two more convenience functions are the subset() and transform()
functions. Much beloved of beginners, they make the subsetting and transformation process easier to follow by helping do away with square brackets.
For example, we might extract all the rows of data frame d for which column
Price is positive with a command like d[ d$Price > 0,]; subset()
allows us to use the alternative subset(d, Price > 0). It is also possible
to extract a subset of columns at the same time. The transform() function
allows the user to specify transformations to existing columns in a data frame
and returns the updated version. The help pages for both of these functions are
accompanied by warnings that recommend using them interactively only, not
for programming, and we generally avoid them.
A final convenience is the ability to “pipe,” provided by the %>%
function in the magrittr package (Bache and Wickham, 2014). This
is intended to make code more readable by allowing one function’s output to
serve as another’s input directly at the command line, rather than requiring
nested calls. For example, consider this evaluation of a mathematical
expression:
> cos (log (sqrt (8 - 3)))
[1] 0.6933138

In R, we have to read this from the inside out: we compute 8 − 3; take the
square root of the result; take the logarithm of that result; and finally compute
the cosine of the result from the log() function. Using the pipe notation, we
can pass the results of one computation to the next in the order in which they
are performed. This example shows the same computation performed using
the pipe notation.
> (8 - 3) %>% sqrt %>% log %>% cos
[1] 0.6933138

The pipe notation is particularly useful for nested functions and can be brought
to bear on data frames. However, be aware that not every function is suitable
for piping, and notice that the order of precedence required that we surround
the 8 − 3 with parentheses.
3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames

Data frames can be re-ordered (i.e., sorted) using a command that extracts
all the rows in a new order. This ordering will usually be a vector of row
indices constructed with the order() function (see Section 2.6.2). So if a
data frame named cust has columns ID and Date, then ord <- order
(cust$ID, cust$Date) (or the slightly more convenient alternative,
ord <- with(cust, order (ID, Date))) will produce a vector ord that shows the ordering of the data frame’s rows by increasing
ID, and then by increasing Date within ID. Therefore, the command
cust <- cust[ord,] will replace the old cust with the newly
ordered one.
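For example, with a small hypothetical cust data frame:
> cust <- data.frame (ID = c(2, 1, 2, 1),
      Date = as.Date (c("2017-03-02", "2017-01-15",
                        "2017-01-20", "2017-02-01")))
> cust[with (cust, order (ID, Date)),]
  ID       Date
2  1 2017-01-15
4  1 2017-02-01
3  2 2017-01-20
1  2 2017-03-02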
In Section 2.6.4, we saw that the unique() function returns the unique
entries in a vector, while its counterpart duplicated() returns a logical vector that is TRUE for any entry that appears earlier in the vector. These two
functions operate directly on matrices and data frames as well. So the command unique(mydf) takes a data frame named mydf and returns the set of
non-duplicated rows. As always, floating-point error can be a problem when
detecting whether two things are identical.
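For example:
> dup.df <- data.frame (a = c(1, 1, 2), b = c("x", "x", "y"),
      stringsAsFactors = FALSE)
> duplicated (dup.df)
[1] FALSE  TRUE FALSE
> unique (dup.df)   # note that row names are preserved
  a b
1 1 x
3 2 y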
One more operation that comes up is random sampling from a data frame.
This is a good plan when the original data set is so big that it cannot be easily
used for testing, for example, or plotting. As with re-ordering, the idea is to
construct a sample of row indices and then to subset the data frame with that
sample. The sample() command is useful here. In its most basic form, we
pass an integer named x giving the number of rows in the data frame and an
argument named size giving the desired sample size. The result is a random
set of integers selected without replacement with each value from 1 to x being
equally likely. To sample 200 rows from a data frame named mydf, we could use
the command sam <- sample(nrow(mydf), 200) to get a vector of 200
row numbers, and then mydf[sam,] to do the sampling. (This presumes that
there are 200 or more rows in mydf. If not, R produces an error.) Of course, the
new data frame’s rows will maintain the numbers they had in the original mydf,
so the row names of the new version will be out of order. If that bothers you, a
quick sam <- sort(sam) prior to subsetting will fix that. The sample()
function also has a number of more sophisticated features, including sampling
with replacement and the ability to specify different probabilities for different
choices.
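Putting the pieces together, a minimal sketch (whose result depends, of course, on the state of the random number generator) looks like this:
sam <- sort (sample (nrow (mydf), 3))  # three distinct row numbers, in order
mydf[sam,]                             # the sampled rows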

3.6 Date and Time Objects
Most data cleaning problems will include dates (and sometimes times). The
most important tasks we face with dates in data cleaning are doing arithmetic
(e.g., adding a number of days to a date or finding the number of days between
two dates) and extracting each date’s day, day of the week, month, calendar
quarter, or year. Objects representing dates and times come in several forms in
R, but since one of them takes the form of a list, we have postponed discussion
of those objects until here.
3.6.1 Formatting Dates

There are lots of ways to display a date in text, and during data cleaning it will
feel like you meet all of them. Americans might write July 4, 2017 as “7/4/17,”
but to most of the rest of the world, this indicates April 7th. Furthermore,
this representation leaves unclear precisely where in the string the day starts;
it starts in the third character for an American’s “7/4/17” but in the first for
an internationally formatted date like “26/05/17.” The two unambiguous formats “2017-07-04” and “2017/07/04” are good starting points for storing dates,
especially in text files outside of R. (The value “2017-7-4” is permitted, but this
format leads to date strings of different lengths; 20170704 is easy to mistake for
an integer.)
The simplest date class in R is called Date, and an object of this class is
represented internally as an integer representing the number of days since a
particular “origin” date. The as.Date() function converts text into objects of
class Date in two ways. First, it can convert an integer number of days since
the origin into a date. The usual origin date in R is January 1, 1970, or, unambiguously, “1970-01-01.” In this example, we show how a vector of integers can
be converted into a Date object.

> (dvec <- as.Date (c(0, 17250:17252),
origin = "1970-01-01"))
[1] "1970-01-01" "2017-03-25" "2017-03-26" "2017-03-27"

Notice that the value 0 is converted into the origin date, “1970-01-01.” If we
are given integer dates, we need to know what the origin is supposed to be.
This concern arises when reading data in from the Excel spreadsheet program.
Excel uses integer dates, but the origins are different between Windows and
Mac, and Excel mistakenly treats 1900 as a leap year. We describe this in more
detail in Section 6.5.2 when we describe reading data in.
The second conversion that as.Date() can perform is to convert textbased representations such as “7/4/17” or “July 4, 2017,” using a format
string that describes the way the input text is formatted. Each piece of the
format string that starts with % identifies one part of the date or time; other
pieces represent characters such as space, comma, /, or - between pieces
of the input text. For example, %B matches the name of the month and %a
matches the name of the day of the week. The most important pieces of the
format are %d for day of the month, %m for the month, and %y and %Y for
two- and four-digit year, respectively. (Two-digit years between 69 and 99
are assumed to be twentieth-century ones starting with 19, and the rest,
twenty-first-century ones.) The help page for as.Date() refers us to the help
page for strptime(), which lists all of the possibilities. For example, this
command uses the format string "%b %d, %Y" to convert text dates such as
"Feb 29, 2016" and "September 30, 2017" into Date objects.
> as.Date (c("Feb 29, 2016", "Feb 29, 2017",
      "September 30, 2017"), format = "%b %d, %Y")
[1] "2016-02-29" NA           "2017-09-30"

Notice that the format string had to contain the same pattern of spaces
and comma that the input text had. R was able to read both the three-letter
abbreviation Feb and the full name September – but it produced an NA for
Feb 29, 2017, which is not a legitimate date.
The names of the days of the week, and the months of the year, are set by the
computer’s locale (see Section 1.4.6). By changing locales R can be made to read
in days or months in other languages, as well, which is useful when data comes
from international sources. In this example, we have some dates in which the
month has been given in Spanish. By changing the locale we can read these in;
then by re-setting the locale we can use as.character() to convert them
into English.
> sp.dates <- c("3 octubre 2016", "26 febrero 2017",
"5 mayo 2017")
> as.Date (sp.dates, format = "%d %B %Y")
[1] NA NA NA
# Not understood in English locale; use Spanish for now

> Sys.setlocale ("LC_TIME", "Spanish")
[1] "Spanish_Spain.1252"            # Setting was successful
> (dts <- as.Date (sp.dates, format = "%d %B %Y"))
[1] "2016-10-03" "2017-02-26" "2017-05-05"
> Sys.setlocale ("LC_TIME", "USA")  # Change back
[1] "English_United States.1252"    # Setting was successful
> as.character (dts, "%d %B %Y")
[1] "03 October 2016"  "26 February 2017" "05 May 2017"

3.6.2 Common Operations on Date Objects

There are a number of convenience functions to manipulate date objects. The
months() and weekdays() functions act on Date objects and return the
names of the corresponding months and days of the week. Each has an abbreviate argument that defaults to FALSE; when set to TRUE, it produces
three-letter abbreviations. This example demonstrates these convenience
functions.
> d1 <- as.Date ("2017-01-02")
> d2 <- as.Date ("2017-06-15")
> weekdays (c(d1, d2))
[1] "Monday"   "Thursday"
> months (c(d1, d2))
[1] "January" "June"
> months (c(d1, d2), abbreviate = TRUE)
[1] "Jan" "Jun"
> quarters (c(d1, d2))
[1] "Q1" "Q2"

There is no function to extract the numeric day, month, or year from a Date
object. These operations are performed using the format() function, which
calls format.Date() to produce character output that can then be converted
to numeric using as.numeric(). The elements of the format string are like
those that are used in as.Date(). This example shows how to extract some
of those pieces from a vector of Date objects – but, again, note that the output
of format() is text.
> format (c(d1, d2), "%Y")
[1] "2017" "2017"
> format (c(d1, d2), "%d")
[1] "02" "15"
> format (d1, "%A, %B %m, %Y")
[1] "Monday, January 01, 2017"

The final command shows a more sophisticated formatting operation, using a
format string like the one in as.Date().
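Wrapping the output of format() in as.numeric() completes the extraction:
> as.numeric (format (c(d1, d2), "%Y"))
[1] 2017 2017
> as.numeric (format (c(d1, d2), "%d"))
[1] 2 15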
It is permitted to use decimals in a Date object to represent times of
day. If you want to create a date object to represent 1:00 p.m. on July 29,
2015, as.Date(16645 + 13/24, origin = "1970-01-01") will
return a numeric, non-integer object that can be used as a date. However,
as.Date("2017-07-29 13:00:00") produces a Date that is represented
internally by the integer 17,376 – the time portion is ignored. Moreover
non-integer parts are never displayed and can even be truncated by some
operations. When times of day are required, it is a better idea to use a POSIXt
object (Section 3.6.4).
3.6.3 Differences between Dates

Very often we need to know how far apart two dates are. The difference between
two Date objects is not a date; it is instead a period of time. In R, one of
these differences is stored as a difftime object. Some functions, such as
mean() and range(), handle difftime objects in the expected way. Others, such as hist() (to produce a histogram) or summary(), fail or produce
unhelpful results. Normally, we will convert difftime objects into numeric
items with as.numeric(). Be careful, though: the units that R uses for the
conversion can depend on the size of the difference, whereas for data cleaning we almost always want to use one consistent choice of unit. Therefore, it
is a good habit, when converting difftime objects to numbers, to specify
units = "days" (or whichever unit we want) explicitly. In this example, with
d1 set to as.Date("2017-06-02") and d2 to as.Date("2017-06-15") (the values
needed to reproduce the output shown), we show addition on dates plus an
example of a difftime object.
# Date objects are numeric; we can add and subtract them
> d1 + 30
[1] "2017-07-02"
> (d <- d2 - d1)
Time difference of 13 days # an object of class difftime
> as.numeric (d)            # convert to numeric, in days
[1] 13
> units (d)
[1] "days"
> as.numeric (d, units = "weeks")
[1] 1.857143

In the last pair of commands, we saw that as.numeric() produced an output in days by default, the units being revealed by the units() command.
We can also set the units of a difftime object explicitly, with a command
like units(d) <- "weeks", or use the difftime() function directly, like
difftime(d2, d1, units = "weeks").
3.6.4 Dates and Times

If you don’t need to do computations with times – only with dates – the Date
class will be enough, at least back to 1752, when Britain switched from the Julian
to the Gregorian calendar. If you need to do computations on times, there is a
second set of objects that are stronger at storing and computing those. These
are named POSIXct and POSIXlt objects, after the POSIX set of standards.
Collectively, these two types of objects are called POSIXt objects. POSIXt
objects measure the number of seconds (possibly with a decimal part) since
the beginning of January 1, 1970 using Coordinated Universal Time (UTC),
which is identical to Greenwich Mean Time (GMT). (Technically, the POSIX
standard does not include leap seconds, a vector of which is given by R’s built-in
.leap.seconds variable. This has never affected us.)
The POSIXlt object is implemented as a list, which makes it easy to extract
pieces; the POSIXct object acts more like a number, which makes it the choice
for storing as a column in a data frame. We start with an example of a POSIXlt
object. It prints as a character string, but it behaves like a list. One unusual
feature is that, to see the names of the list, you need to unlist() the object
first. For example,
> (start <- as.POSIXlt("2017-01-17 14:51:23"))
[1] "2017-01-17 14:51:23 PST" # R has inferred time zone PST
> unlist (start)
   sec    min   hour   mday    mon   year   wday   yday
  "23"   "51"   "14"   "17"    "0"  "117"    "2"   "16"
 isdst   zone gmtoff
   "0"  "PST"     NA

Here start really is a list, and we can extract components in the usual way,
with a dollar sign or double brackets (but, although you can use its names,
names(start) is NULL, and you cannot extract a subset of components with
single brackets). Notice also that the first day of the month gets number 1, but
the first month of the year, January, carries the number 0, and that the year
element counts the number of years since 1900. The advantage of a list is that,
given a vector of POSIXlt objects named date.vec, say, you can extract all
the months at once with date.vec$mon – but again, January is month 0 and
December is month 11. Weekdays are given in the list by wday, with 0–6 representing Sunday through Saturday, respectively. The weekdays() function
from above, and the other Date functions, also work on POSIXt objects – but
be aware that the results are displayed in the locale of the user. Notice that the
time zone above, PST, is deduced by our computer from its locale. The help
for DateTimeClasses gives more information on the niceties of time zones,
many of which are system specific.
Although we can use the weekdays(), months(), and quarters()
functions on POSIXct objects, we extract other components, such as years
or hours, via the format() function, as we did for Date objects. This is
slightly less efficient than the list-type extraction from a POSIXlt object,
but we recommend using POSIXct objects where possible, because we have
encountered unexpected behavior when changing time zones with POSIXlt
objects.
It is worth noting that although a POSIXt object may have a time, a time
is not required. When a Date object is converted into a POSIXt object, the
resulting object is given a time of 00:00 (i.e., midnight) in UTC. A vector of
POSIXt objects that are all at midnight display without the time visible, but
they do contain a time value. When a POSIXt object is converted to a Date
object, the time is truncated.
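A short sketch of that round trip:
> (p <- as.POSIXct (as.Date ("2017-05-01")))
[1] "2017-05-01 UTC"
> as.Date (p)   # the (midnight) time is dropped
[1] "2017-05-01"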
3.6.5 Creating POSIXt Objects

R’s as.POSIXct() and as.POSIXlt() functions convert text that is unambiguously formatted into POSIXt objects just as as.Date() does. Here the
date can be followed by a 24-hour clock time like 17:13:14 or a 12-hour time
with an AM/PM indicator. More usefully, perhaps, these functions allow the use
of a format string such as the one used by as.Date(). This format string, documented in the help for strptime(), allows times, time zones, and AM/PM
indicators, attributes that are also accepted by as.Date(). Often we discard
time information, since we are only interested in dates, but sometimes discarding time information can lead to incorrect conclusions. In this example, we
construct two POSIXct objects that represent the same moment expressed in
two different time zones.
> (ct1 <- as.POSIXct ("Mar 31, 2017 10:26:08 pm",
format = "%b %d, %Y %I:%M:%S %p"))
[1] "2017-03-31 22:26:08 PDT"
> (ct2 <- as.POSIXct ("2017-04-01 05:26:08", tz = "UTC"))
[1] "2017-04-01 05:26:08 UTC"
> as.numeric (ct1 - ct2, units = "secs")
[1] 0

The first date, ct1, is not given an explicit time zone, so the system selects the
local one (shown here as PDT). In the second example, we explicitly provide the
UTC indicator with the tz argument. The as.numeric() command shows
that the two times are identical. There are a few confusing properties of POSIXt
objects. All the objects in a vector of length >1 will be displayed with the local
time zone, and their weekdays() and months() will be, too. For a single
object, though, these functions refer to the time zone of the object, although,
as this example shows, there is a complication.
> c(ct1, ct2)
[1] "2017-03-31 22:26:08 PDT" "2017-03-31 22:26:08 PDT"
> weekdays (c(ct1, ct2))
[1] "Friday" "Friday"
> weekdays (ct2)
[1] "Saturday"
> weekdays (c(ct2))
[1] "Friday"

The top command shows that the vector of dates is displayed in our locale.
That date refers to a moment that was on a Friday locally. When weekdays()
acts on ct2 by itself, though, it shows that that moment was on a Saturday in
Greenwich. In the final command, the c() causes ct2 to be converted to local
time, where its date falls on a Friday.
To explicitly convert the time zone of a POSIXct object, you can set its
tzone attribute, with a command like attr(ct1, "tzone") <- "UTC" or,
equivalently, attr(ct1, "tzone") <- "GMT"; see the help for Sys.timezone() for
a way to determine the names of time zones. (The approach for POSIXlt
objects is more complicated and we do not discuss it here.) Note that when
POSIXct objects are converted to Date objects, they are rendered in UTC,
so as.Date(ct1) and as.Date(ct2) both produce dates with value
"2017-04-01".
The format string that is passed to as.POSIXct() allows for a lot of flexibility in the way dates are formatted. This example shows how you might convert
R’s own date stamp, produced by the date() function, into a POSIXct object
and then a Date object.
> (curdate <- date())
[1] "Wed Sep 21 00:36:47 2016"
> (now <- as.POSIXct (curdate,
format = "%A %B %d %H:%M:%S %Y"))
[1] "2016-09-21 00:36:47 PDT" # POSIXct object
> as.Date (now)
[1] "2016-09-21"

As long as the format of the dates in your data is consistent, it will probably
be possible to read them in using as.POSIXct(). In some cases, dates
may appear with extraneous text. If the contents of the text is known exactly,
the text can be matched. For example, the string Wednesday, the 17th
of March, 2017 at 6:30 pm can be read in with the format string
"%A, the %dth of %B, %Y at %I:%M %p". But this formatting will fail
for the 21st or the 22nd or if the input string ends with p.m. (with
periods). In cases where there is variable, extraneous text, you may have to
resort to manipulating the text strings using the tools in Chapter 4.
3.6.6 Mathematical Functions for Dates and Times

Since Date and POSIXt objects are numeric, many functions intended to
work on numeric data also work on these date objects. In particular, range(),
max(), min(), mean(), and median() all operate on date objects and produce vectors

R Data, Part 2: More Complicated Structures

of date objects. The diff() function computes differences between adjacent
elements in a vector, so diff(range(x)) produces the range of dates in the
vector x as a difftime object. The summary() function acts on a vector
of date objects, producing an object that is slightly different from a vector of
dates but still usable. You can also tabulate Date and POSIXct objects with
table() – but table() does not work on the list-like POSIXlt objects.
The seq() function can also be used to generate a sequence of dates. This
is useful for generating the endpoints of “bins” for histograms or other summaries. As we mentioned, Date objects are implemented in units of days, so
a sequence of Date objects one unit apart has values 1 day apart by default.
However, POSIXt objects are in units of seconds, so a sequence of POSIXt
objects one unit apart are 1 second apart. One way to create a sequence of
POSIXt objects representing consecutive days is to use by = 86400, since
there are 86,400 seconds in a day. However, R has a better approach. When
called with a vector of Date or POSIXt objects, the seq() function invokes
one of the functions, seq.Date() or seq.POSIXt(), that is smarter about
date objects. These functions let you use the by argument with a word like
"hour", "day" and so on. An additional value "DSTday" (for POSIXt only)
ignores daylight saving time to produce the same clock time every day. In this
example, we generate some sequences of Date and POSIXt objects. Notice
that R suppresses times for POSIXt dates when all of the times in the vector
are midnight.
> seq (as.Date ("2016-11-04"), by = 1, length = 4)
[1] "2016-11-04" "2016-11-05" "2016-11-06" "2016-11-07"
# Create and save a POSIXct object, for convenience
> ourPos <- as.POSIXct ("2016-11-04 00:00:00")
> seq (ourPos, by = 1, length = 3)
[1] "2016-11-04 00:00:00 PDT" "2016-11-04 00:00:01 PDT"
[3] "2016-11-04 00:00:02 PDT"
> seq (ourPos, by = "day", length = 3)
[1] "2016-11-04 PDT" "2016-11-05 PDT" "2016-11-06 PDT"
> seq (ourPos, by = "day", length = 4)
[1] "2016-11-04 00:00:00 PDT" "2016-11-05 00:00:00 PDT"
[3] "2016-11-06 00:00:00 PDT" "2016-11-06 23:00:00 PST"
> seq (ourPos, by = "DSTday", length = 4)
[1] "2016-11-04 PDT" "2016-11-05 PDT" "2016-11-06 PDT"
[4] "2016-11-07 PST"
> seq (ourPos, by = "month", length = 4)
[1] "2016-11-04 PDT" "2016-12-04 PST" "2017-01-04 PST"
[4] "2017-02-04 PST"

In the top example, we see a sequence of Date objects 1 day apart (as specified
by by = 1). That same specification produces POSIXt dates 1 second apart.
Using by = "day" moves the clock by 24 hours, but since the Pacific Time
Zone, where these examples were generated, switched from daylight saving to
standard time on November 6, 2016, the old time of midnight daylight saving time
was advanced 24 hours to 11 p.m. standard time. With by = "DSTday"
the clock time is preserved across days. The final example shows how we can
advance 1 month at a time – the help for seq.POSIXt() shows how these
functions adjust for the case when advancing by month starting at January 31,
for example.
Differences between two POSIXt objects, like differences between Date
objects, are represented by difftime objects in R. Here, though, you need
to be even more careful to specify the units when converting the difftime
object to numeric. This example shows how neglecting that specification can
cause problems.
> d1 <- as.POSIXct ("2017-05-01 12:00:00")
> d2 <- as.POSIXct ("2017-05-01 12:00:06") # d1 + 6 seconds
> d3 <- as.POSIXct ("2017-05-07 12:00:00") # d1 + 6 days
> (d2 - d1) == (d3 - d1)
[1] FALSE # expected
> as.numeric (d2 - d1) == as.numeric (d3 - d1)
[1] TRUE # possibly unexpected

Here the d2 − d1 difference has the value 6 seconds, while the d3 − d1
difference has the value 6 days. The units are preserved in the difftime
objects but discarded by as.numeric(). It is a good practice to always
specify units = "days" or whatever your preferred unit is, whenever you
convert a difftime object to a numeric value.
3.6.7 Missing Values in Dates

Dates of different classes should not be combined in a vector. It is always wise
to use an explicit function to force all the elements of a date vector to have the
same class. This also applies to missing values in date objects – they need to be
of the proper class. In this example, we combine an NA with the d1 date from
above, using the c() function. The c() function can call a second function
depending on the class of its first argument – c.Date(), c.POSIXct() or
c.POSIXlt().
> c(d1, NA)
[1] "2017-05-01 12:00:00 PDT" NA
> c(NA, d1)
[1]         NA 1493665200
> c(as.POSIXct (NA), d1)
[1] NA                        "2017-05-01 12:00:00 PDT"
> c(NA, as.Date (d1))
[1]    NA 17287

R Data, Part 2: More Complicated Structures

The first command succeeds, as expected, because c.POSIXct() is able to
convert the NA value into a POSIXct object. In the second command, though,
c() sees the NA and does not call a class-specific function. Instead, it converts
both values to numeric. The resulting second element is the number of seconds
since the POSIXt origin date. The way around this is to explicitly specify an NA
value of class POSIXct, as in the third command. The final command shows
that this problem exists for Date objects as well – here, d1 is converted into
the number of days since the origin date. The lesson is that you should ensure
that every date element, even the NA ones, in your vector has the same class.
3.6.8 Using Apply Functions with Dates and Times

Often a data set will arrive as a data frame with a series of dates in each row.
These might be dates on which a phenomenon is repeatedly recorded –
monthly manpower data, for example, or payment information. If an operation
needs to be performed on each row – say, finding the range of the dates in each
one – it is tempting to use apply() on such a data frame. As with earlier
examples (Section 3.5), this will not succeed – even (perhaps surprisingly) if
the data frame’s columns are all Date or all POSIXct. A better approach is
to operate on each row via the lapply() or sapply() functions. Here we
show an example of a data frame whose columns are both Date objects.
> date.df <- data.frame (
Start = as.Date (c("2017-05-03", "2017-04-16")))
> date.df$End <- as.Date (c("2018-06-01", "2018-02-16"))
> date.df
       Start        End
1 2017-05-03 2018-06-01
2 2017-04-16 2018-02-16
> apply (date.df, 1, function (x) x[2] - x[1])
Error in x[2] - x[1]: non-numeric arg. to binary operator

Here, the apply() function converts the data frame to a character matrix.
(Why it does not convert it to a numeric one is not clear.) So the mathematical
operation fails. One way to apply the function to each row is via sapply(), as
in this example:
> sapply (1:2, function (i)
as.numeric (date.df[i,2] - date.df[i,1],
units = "days"))
[1] 394 306

Using sapply() to index the rows, we can compute each difference in days in
a straightforward way. In general, you will need to pay attention when dealing
with data frames of dates row by row.

3.7 Other Actions on Data Frames
It is a rare data cleaning task that does not involve manipulating data frames,
and one very common operation is to combine two data frames. There are
essentially three ways in which we might want to combine data frames: by
columns (i.e., combining horizontally); by rows (i.e., stacking vertically); and
matching up rows using a key (which we call merging). The first two of these are
straightforward and the third is only a little more complicated. In this section,
we describe these tasks, as well as some other actions you can perform on data
frames. We show some more detailed examples in Chapter 7.
3.7.1 Combining by Rows or Columns

When we talk about “combining data frames by columns,” we mean combining
them side by side, creating a “wide” result whose number of columns is the
sum of the numbers of columns in the things being combined. We have seen
the cbind() function, which is the preferred function for joining matrices. We
can also supply two data frames as arguments to the data.frame() function
and R will join them. Both cbind() and data.frame() can incorporate
vectors and matrices in their arguments as well – but they will convert characters
to factors unless you explicitly provide the stringsAsFactors = FALSE
argument. R is prepared to recycle some inputs, but it is best if the things being
combined have the same numbers of rows.
Recall that a data frame needs to have column names, and that we (almost)
always want these to be distinct. If two columns have the same name, R will use
the make.names() function with the unique = TRUE argument to construct a set of distinct names. If three data frames each have a column named
a, for example, the result will have columns a, a.1, and a.2. It is always a
good idea to examine the set of column names for duplication (perhaps using
intersect() as in Section 2.6.3) to ensure that you know what action R
will take.
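For example, when two data frames each contribute a column named a:
> names (data.frame (data.frame (a = 1:2), data.frame (a = 3:4)))
[1] "a"   "a.1"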
Combining data frames by rows means stacking them vertically, creating
a “tall” result whose number of rows is the sum of the numbers of rows in
the things being combined. The rbind() function combines data frames in
this way. We can only operate rbind() on things with the same number of
columns; moreover, the columns need to have the same names, but they need
not be in the same order; R will match the names up. You will almost always
want the columns being joined to be of the same sort – numeric with numeric,
character with character, POSIXct with POSIXct, and so on – otherwise,
R will convert each column to a common class. We usually check the classes
explicitly and recommend you pass the stringsAsFactors = FALSE
argument to rbind(). If we have two data frames called df1 and df2, we
start by comparing the names, using code like this:

> n1 <- names (df1)
> n2 <- names (df2)
> all (sort (n1) == sort (n2)) # should be TRUE

We sort the names of each data frame to account for the fact that they might be
out of order. Next, we extract the class of each column. The results, c1 and c2
as follows, will often be vectors, although they might be lists if some columns
produce a vector of length 2 or more. (This will be the case if any columns are
POSIXct objects.) We compare these two objects as in this example:
> c1 <- sapply (df1, class)
> c2 <- sapply (df2, class)
> isTRUE (all.equal (c1, c2[names (c1)])) # should be TRUE

Notice that we re-order the names of c2 so that they match the order of
the names of c1. The all.equal() function compares two objects and
returns TRUE if they match, and a small report (a vector of character strings)
describing their differences if they do not. This report is useful, but to test for
equality in, for example, an if() statement, the isTRUE() function is useful.
This function produces TRUE if its argument is a single TRUE, as returned
by all.equal() when its arguments match, and FALSE if its argument is
anything else, like the character strings produced by all.equal() when its
arguments differ.
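For example:
> all.equal (1, 1 + 1e-10)   # within the default tolerance
[1] TRUE
> all.equal (1, 2)           # a report, not FALSE
[1] "Mean relative difference: 1"
> isTRUE (all.equal (1, 2))
[1] FALSE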
If the data frames being combined have the usual unmodified numeric row
names, R will adjust them so that the resulting row names go from 1 upward,
but if there are non-numeric or modified row names, R will try to keep them,
again deconflicting matches to ensure that row names are distinct.
When combining a large number of data frames, the do.call() function
will often be useful. This function takes the name of a function to be run,
and a list of arguments and runs the function with those arguments. For
example, the command log(x = 32, base = 2) produces the result
5, because log2 (32) = 5. We get the exact same result with the command
do.call("log", list(x = 32, base = 2)). Notice that the
arguments are specified in the form of a list. This mechanism allows us to
combine a large number of data frames in a fairly simple way. Suppose we
have a list of data frames named list.of.df (such a list arises frequently
as the output from lapply()). Extracting the individual data frames from
the list can be tedious, but we can rbind() them all with a command like
do.call ("rbind", list.of.df) (assuming the data frames meet the
rbind() criteria). If the data frames are not already on a list, we can construct
such a list with a command like list(first.df, second.df, ...).
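For example, with a small hypothetical list of compatible data frames:
> list.of.df <- list (data.frame (x = 1), data.frame (x = 2:3))
> do.call ("rbind", list.of.df)
  x
1 1
2 2
3 3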
3.7.2 Merging Data Frames

Merging is a more complicated and powerful operation. In the usual type of
merging, each data frame has a “key” field, typically a unique one. The merge()
function matches up the keys and produces a data frame with one row per key, with all of
the columns from both of the data frames. There are three main complications
here: what to do when keys are present in one data set but not in the other,
what to do when keys are duplicated, and what to do when keys match only
approximately.
The action when keys are present in one data set, but not in the other, is controlled by the all.x and all.y arguments, both of which default to FALSE.
For this purpose, x refers to the first-named data set and y to the second. By
default, the result of the merge has one row for each key that appears in both x
and y (except when there are duplicated keys). Database users call this an “inner
join.” When all.x = TRUE and all.y = FALSE, the result has one row
for each key in x (this is a “left join”). Columns of the corresponding keys that
do not appear in y are filled with NA values. Naturally, the converse is true when
all.x = FALSE and all.y = TRUE – the result has one row for each key
in y and the result has NAs for those columns contributed from x for those
keys that did not appear in y. When all.x = TRUE and all.y = TRUE,
the result has one row for every key in either x or y (this is an “outer join”). In
this example, we merge two small data sets to show the behavior brought about
by all.x and all.y.
> (df1 <- data.frame (Key = letters[1:3], Value = 1:3,
      stringsAsFactors = FALSE))
  Key Value
1   a     1
2   b     2
3   c     3
> (df2 <- data.frame (Key = c("a", "c", "f"),
      Origin = 101:103, stringsAsFactors = FALSE))
  Key Origin
1   a    101
2   c    102
3   f    103
> merge (df1, df2, by = "Key")                # inner join
  Key Value Origin
1   a     1    101
2   c     3    102
> merge (df1, df2, by = "Key", all.x = TRUE)  # left join
  Key Value Origin
1   a     1    101
2   b     2     NA
3   c     3    102
> merge (df1, df2, by = "Key", all.y = TRUE)  # right join
  Key Value Origin
1   a     1    101
2   c     3    102
3   f    NA    103
> merge (df1, df2, by = "Key", all.x = TRUE,
      all.y = TRUE)                           # outer join
  Key Value Origin
1   a     1    101
2   b     2     NA
3   c     3    102
4   f    NA    103

The behavior of merge() when keys are duplicated is straightforward, but
it is rarely what we want. It is best to remove rows with duplicate keys, or to
create a new column with a unique key, before merging. The number of rows
produced by merge() when there are duplicates is the number of pairs of keys
that match between the two data frames. In this example, we establish some
duplicated keys and show the behavior of merge in the left join case.
> (df3 <- data.frame (Key = c("b", "b", "f", "f"),
      Origin = 101:104, stringsAsFactors = FALSE))
  Key Origin
1   b    101
2   b    102
3   f    103
4   f    104
> merge (df1, df3, by = "Key", all.x = TRUE)
  Key Value Origin
1   a     1     NA
2   b     2    101
3   b     2    102
4   c     3     NA

Here the merge produces one row every time a key in df1 matches a key in
df3 – even if that happens more than once – in addition to producing rows
for every key that does not match. If df1 had included two rows with the Key
value of b, as df3 does, then the result would have had four rows with Key
value b.
The issue of matching when keys match only approximately is a thornier
one. This arises when matching on people’s names, for example, since these are
often represented in slightly different ways – think about the slightly differing
strings “George H. W. Bush,” “George HW Bush,” “George Bush 41,” and so on.
The adist() and agrep() functions (see also the discussion of grep() in
Section 4.4.2) help find keys that match approximately, but this sort of “fuzzy
matching” (also called “entity resolution” or “record linkage”) is beyond the
scope of this book.
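To give the flavor, adist() returns a matrix of edit distances; here three single-character deletions turn the first string into the second:
> adist ("George H. W. Bush", "George HW Bush")
     [,1]
[1,]    3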

3.7.3 Comparing Two Data Frames

At some point you will have two versions of a data frame, and you will want
to know if they are identical. “Identical” can mean slightly different things
here. For example, if two numeric vectors differ only by floating-point error,
we would probably consider them identical. If a character vector has the
same values as a factor, that might be enough to be identical, but it might
not. The identical() function tests for very strict equivalence and can
be used on any R objects. It returns a single logical value, which is TRUE
when the two items are equal. The help page notes that this function should
usually not be applied to POSIXlt objects (nor, presumably, to data frames
containing them), partly because two times might represent the same
value expressed in two different time zones.
The all.equal() function described above compares two objects but
with slightly more room for difference. The tolerance argument lets you
decide how different two numbers need to be before R declares them to be
different. By default, R requires that the two data frames’ names and attributes
match, but those rules can be overridden. Moreover, two POSIXlt items
that represent the same time are judged equal. When two items are equal
under these criteria, all.equal() returns TRUE. Since the return value
of all.equal() when its two arguments are not equal is a vector of text
strings, one correct way to compare data frames a and b for equality is with
isTRUE(all.equal (a, b)).
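A minimal sketch of the difference (the data frames here are our own):
> a <- data.frame (x = c(1, 2, 3))
> b <- data.frame (x = c(1, 2, 3 + 1e-9))
> identical (a, b)
[1] FALSE
> isTRUE (all.equal (a, b))  # difference is within the default tolerance
[1] TRUE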
3.7.4 Viewing and Editing Data Frames Interactively

R has a couple of functions that will let you edit a matrix, list, or data frame in
an interactive, spreadsheet-like form. The View() function shows a read-only
representation of a data frame, whereas edit() allows changes to be made.
The return value from the edit() function can be saved to reflect the changes.
A more dangerous option is provided by data.entry(); changes made by
that function are saved automatically. If you use these functions to clean your
data, of course, your steps will not be reproducible, and we strongly recommend
using commented scripts and functions, which we describe in Chapter 5.

3.8 Handling Big Data
The ability to acquire, clean, handle, and model with big data sets will surely
become more and more important in coming years. From its beginning, R has
assumed that all relevant data will fit into main memory on the machine being
used, and although the amount of memory installed in a computer has certainly
grown over time, the size of data sets has been growing much faster. Handling
data sets too big for the computer is not part of this book’s focus, but in what
follows we lay out some ideas for dealing with data sets that are just
too big to hold in memory.
Given data that requires more storage than main memory can provide, we
often proceed by breaking the data into pieces outside of R. For text data we use
the command-line tools provided by the bash program (Free Software Foundation, 2016), a widespread command interpreter that comes standard on OS X
and Linux systems and which is available for Windows as well. Bash includes
tools such as split, which breaks up a data set by rows; cut, which extracts
specific columns; and shuf, which permutes the lines in a file (which helps
when taking random samples). These tools provide the ability to break the data
into manageable pieces.
Another approach for manipulating large data, this time inside R (in main
memory), was noted in Section 2.7. R has support for “long vectors,” those
whose lengths exceed 2^31 − 1, but these are not recommended for character
data. Moreover, they are vectors rather than data frames, so the long vector
approach does not mirror the data frame approach.
Sometimes the data can fit into memory, but the system is very slow performing
any actions on it. In this case, the data.table package might be useful;
it advertises very fast subsetting and tabulation. Unfortunately, the syntax of
the calls inside the data.table package is just foreign enough to be confusing.
We will not cover the use of data.table in this book. If specific actions
are slow, we can often gain insight by “profiling,” that is, by determining
which actions are using up large amounts of time. The “Writing R Extensions”
manual has a section on profiling (Section 3) that might be useful here.
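As a minimal sketch of R’s built-in profiling tools (the file name and the work being profiled are our own inventions):
> Rprof ("profile.out")                 # start recording profiling data
> result <- lapply (biglist, slowfun)   # hypothetical slow computation
> Rprof (NULL)                          # stop profiling
> summaryRprof ("profile.out")$by.self  # time spent, by function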
Other ways to speed up computations include compiling functions and running in parallel. We discuss these and other ways to make your functions faster
in Section 5.5.
There are several add-in packages that provide the ability to maintain
“pointers” to data on disk, rather than reading the data into main memory. The
advantage of this approach is that the size of objects it can handle is limited
only by disk storage, which can be expected to be huge. In exchange, of course,
we have to expect processing to be much slower because so much disk access
can be required. Packages with this approach include bigmemory (Kane et al.,
2013) and its relatives, and ff (Adler et al., 2014). The tm package (Feinerer
et al., 2008) does something similar for large bodies of text.
R is so popular in the data science world that there are many other programs,
including big data storage mechanisms, for which R interfaces are available.
This allows you to use familiar R commands to access these other mechanisms
without having to understand the details of those programs. In this way, you
can keep your data in a relational database or some storage facility that uses,
for example, distributed memory for efficient retrieval. These approaches are
beyond the scope of this book, but we have some discussion of acquiring data
from a relational database in Chapter 6.


3.9 Chapter Summary and Critical Data Handling Tools
Matrices are important in many mathematical and statistical contexts, but they
do not play an important role in data cleaning. However, learning about matrices makes learning about data frames more natural. Data frames also have the
attributes of lists, so we have discussed lists as well in this chapter. But the
important type of object in this chapter, and in R generally, is the data frame.
Data frames are often created by reading data in from outside R. We can also
create them directly by combining vectors, matrices, or other data frames with
the cbind(), rbind(), or merge() functions. We can add a character vector to a data frame with the dollar-sign notation, but whenever we supply a
character vector as a column to be added to a data frame via data.frame()
or cbind(), we need to specify the stringsAsFactors = FALSE argument.
Once our data has been placed into a data frame (say, one called data1), we
often start by recording the classes of each column in a vector, using a command like col.cl <- sapply(data1, function(x) class(x)[1]).
We use the function shown here rather than simply using the class()
function, as we did earlier, to account for columns with a vector of two or
more classes – usually these would be columns with one of the date classes like
POSIXct. Keeping the names data1 for the data and col.cl for the vector
of classes in this example, we use commands such as these as part of our data
cleaning process:
• table(col.cl) to tabulate the column classes. Often we will have an
expectation that some proportion of the columns will be numeric, or that
we will have, say, exactly 10 date columns. This is a good starting point to see
if the data frame looks as we expect.
• sapply(data1, function(x) sum(is.na(x))) to count missing
values by column. If the number of columns is large, we would often use
table() on the result of the sapply() call to see if there are a few
columns with a large number of missing values. It is also interesting if many
columns report the same number of missing values. For example, if there
are 56 different columns each with exactly 196 missing values, we might
hypothesize that those are the very same 196 records in every column –
and investigate them. In some cases, we might also count the number
of negative values or values equal to 99 or some other “missing” code.
Instead of the function above, then, we might substitute function
(x) sum (x < 0, na.rm = TRUE) or something analogous.
• sapply(data1[,col.cl == "numeric"], range) to compute
the ranges of numeric columns in a search for outliers or anomalies. If some
columns have class “integer” we will need to address those as well, perhaps
using col.cl %in% c("numeric", "integer"). We might have to
add the na.rm = TRUE argument, and we might also use other functions
here such as mean(), median(), or sd().
• sapply(data1, function(x) length(unique(x))) to count
unique values by column. Since these numbers will count NA values, we
might instead use function(x) length(unique(na.omit(x))).
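To make these commands concrete, here is a minimal sketch on a tiny, invented data frame:
> data1 <- data.frame (age = c(21, NA, 35), code = c(-1, 99, 3),
          when = as.Date (c("2016-01-01", NA, "2016-03-01")),
          stringsAsFactors = FALSE)
> (col.cl <- sapply (data1, function (x) class (x)[1]))
      age      code   when 
"numeric" "numeric" "Date" 
> table (col.cl)
col.cl
   Date numeric 
      1       2 
> sapply (data1, function (x) sum (is.na (x)))
 age code when 
   1    0    1 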
The apply() family of functions provides a lot of power, but they need to be
exercised carefully on data frames. The apply() function itself converts the
data frame to a matrix first, and should only be used if all the columns of a data
frame are of the same type. Sapply() tries to return a vector or matrix if it can,
so if the return elements are of different classes they will often be converted. We
suggest using lapply() unless you know that one of the other functions will
succeed.
Another important focus of this chapter was on date (and time) objects.
Although Date and POSIXct objects are implemented inside R in a numeric
fashion, they are not quite numeric items in the usual ways. Similarly, while
the POSIXlt object has some numeric features, it is best thought of as a list.
So we deferred description of these objects until this chapter. Date and time
data take up a lot of energy in the data cleaning process because the number of
formats is large and variable and because of complications such as time zones
and date arithmetic.

4 R Data, Part 3: Text and Factors

A lot of data comes in character (“string”) form, sometimes because it really
is text, and sometimes because it was originally intended to be numeric but
included a small number of non-numeric items such as, for example, the word
“Missing.” Almost every data cleaning problem requires manipulating text in
some way, to find entries that include particular strings, to modify column
names, or something else. In this chapter, we describe some of the operations
you can perform on character data. This includes extracting pieces of strings,
formatting numbers as text, and searching for matches inside text.
However, there are really two ways that character data can be stored in R. One
is as a vector of character strings, as we saw in Chapter 2. The tools we mentioned above are primarily for this sort of data. A second way text can appear
in R is as a factor, which is a way of storing individual text entries as integers,
together with a set of character labels that match the integers back to the text.
Factors are important in many R modeling functions, but they can cause trouble. We discuss factors in Section 4.6.
One consideration has become much more important in recent years: handling text from alphabets other than the English one. We are very often called
on to deal with text containing accented characters from Western European
languages, and increasingly, particularly as a result of data from social media
sources, we find ourselves with text in other alphabets such as Cyrillic, Arabic,
or Korean. The Unicode system of representing all the characters from all the
world’s alphabets (together with other symbols such as emoji) is implemented
in R through encodings including the very popular UTF-8; Section 4.5 discusses
how we can handle non-English texts in R.


4.1 Character Data
4.1.1 The length() and nchar() Functions

The length of a character vector is, as with other vectors, the number of
elements it has. In the case of a character vector, you also might want to know
how many characters are in each element. We use the nchar() function
for that. Remember that some characters require two keystrokes to type (see
Section 1.3.3 for a discussion), but they still count as only one character. In this
example, we construct a character vector and observe how many characters
each element has.
> (planets <- c("Mercury", "Venus", NA, "Mars"))
[1] "Mercury" "Venus"   NA        "Mars"
> length (planets)   # Four elements
[1] 4
> nchar (planets)    # Count characters
[1]  7  5 NA  4

Notice that the number of characters in the missing value is itself missing. In
older versions of R (through 3.2), nchar() reported the lengths of missing
values as 2 – as if those entries were made up of the two characters NA. Starting
with version 3.3, returning NA for that string’s length is the default, though the
older behavior can be requested by passing the keepNA = FALSE argument.
4.1.2 Tab, New-Line, Quote, and Backslash Characters

There are a few characters in R that need special treatment. We discussed
this in Section 1.3.3, but it is worth repeating that if you want to enter a tab,
a new-line character, a double quotation mark, or a backslash character, it
needs to be “protected” – we say “escaped” – by a backslash. The leading
backslash does not count as a character and is not part of the string – it’s just
a way to enter these characters that otherwise would be taken literally. As an
example, consider entering into R this text: She wrote, "To enter a
'new-line,' type "\n"." Normally, of course, we enclose text in quotation
marks, but here R will think that the character string ends at the quotation
mark preceding To. To remedy that, we escape the two inner double quotation marks. (Alternatively, we could enclose the entire quote in single, instead
of double quotation marks. Then we would have to escape the two inner single
quotation marks.) Moreover, the backslash is a special character. It, too, needs
to be escaped. So to enter our quote into an R object, we need to type this:
> (quo <- "She wrote, \"To enter a 'new-line,' type \"\\n\".\"")
[1] "She wrote, \"To enter a 'new-line,' type \"\\n\".\""
> nchar (quo)
[1] 47
> cat (quo, "\n")
She wrote, "To enter a 'new-line,' type "\n"."

Notice the length of the string as given by nchar(). Even though it takes 52
keystrokes to type it in, there are only 47 characters in the string. There is no
real difference between single and double quotes in R; if you create a string
with single quotes, it will be displayed just as if it had been created with double
quotes.
The backslash also introduces hexadecimal (base 16) and Unicode entries.
Hexadecimal values refer to entries in the ASCII table, which maps numeric
values to characters. For example, if you type "\x45", R returns the character
from the ASCII table that has been given value 45 in base 16 (69 in decimal):
the upper-case E. Passing the string "\x45" to nchar() returns the value 1.
Unicode entries can be one or more characters, and arguments to nchar()
help control what that function will return in more complicated examples. We
talk more about Unicode in Section 4.5.
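For example, at the console (output shown as R prints it):
> "\x45"
[1] "E"
> nchar ("\x45")
[1] 1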
4.1.3 The Empty String

In an earlier chapter (Section 2.3.2), we saw that some vectors have length
0. We could create a character vector of length 0 with a command like
character(0). However, something that is much more common in text
handling is the empty string, which is a regular character string that does
not happen to have any characters in it. This is indicated by "", two quote
characters with nothing between them. That empty string has length() 1 but
nchar() 0. Often the empty string will correspond to a missing value but not
always. It is very common to see empty strings when, for example, reading data
in from spreadsheets. In our experience, spreadsheets will sometimes produce
empty strings, and other times produce strings of spaces (e.g., sometimes,
when all the other entries in a column are two characters long, the empty cells
of the spreadsheet may contain two spaces). Naturally, these different types of
empty or blank strings will need to be addressed in any data cleaning task.
One area of confusion is when using table() on a character vector. The
names() of the table will always be exactly right, but since those names are
displayed without quotation marks, leading spaces are impossible to see.
> vec <- c (" ", " ", "", "   ", "", "2016", "",
            " 2016", "2016", "   ")
> table (vec)
vec
         2016 2016 
3 2   2     1    2 
> names (table (vec))
[1] ""      " "     "   "   " 2016" "2016" 


In this example, we have items that are empty, items that consist of one space
and three spaces, and items that look like 2016 but sometimes with a leading
space.
The output of the table() function is not enough to determine the values
being tabulated because of the leading spaces. We need the names() function applied to the table (or, equivalently, something like unique(vec)) to
determine what the values are.
The nzchar() function is a fast way to determine whether a string is empty or not;
it returns TRUE for strings that have non-zero length and FALSE for empty
strings (think of “nz” as indicating “non-zero”).
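Applied to the vec object above, for instance, it flags the truly empty strings but not the all-blank ones:
> nzchar (vec)
 [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE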
4.1.4 Substrings

Another action we perform frequently in data cleaning is to extract a piece
of a string. This might be extracting a year from a text-formatted date,
for example, or grabbing the last five characters of a US mailing address,
which hold the ZIP code. The tool for this is the substring() function, which takes a piece of text, an argument first giving the position
of the first character to extract, and an argument last that gives the
last position. The last argument defaults to 1 million, so unless your
strings exceed that length, last can be omitted when we seek the end of
a string. For example, to extract characters three and four from a string
named dt containing "2017-02-03", we use the command substring
(dt, 3, 4); the result is the string "17". (If the string has fewer than three
characters, the empty string is returned.) To extract the final five characters we
could use substring(dt, nchar(dt) - 4). This extracts characters
6–10 from a string of length 10, characters 21–25 from a string of length 25,
and so on.
The substring() function works on vectors, so substring(vec,
nchar(vec) - 4) will produce a vector the same length as vec, giving the
last five (or up to five) characters of each of its entries. In this example,
the first argument was a numeric vector, and in general both first and
last may be vectors. This lets us use substring() to pull out parts of each
element in a string vector depending on its contents (e.g., “all the characters
after the first parenthesis”).
We can exploit this vectorization to use substring() to break a string
into its individual characters. The command substring(a, 1:nchar(a),
1:nchar(a)) does exactly that, just as if we had called substring(a, 1,
1), substring(a, 2, 2), and so on. Another, slightly more efficient, way
to break a string into its characters is mentioned below under strsplit()
(Section 4.4.7).
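For instance (the string here is our own):
> a <- "text"
> substring (a, 1:nchar (a), 1:nchar (a))
[1] "t" "e" "x" "t"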
One of the strengths of substring() is that it can be used on the left side
of an assignment operation. For example, to change the last two letters of each
month’s name to "YZ", you could do this, using R’s built-in month.name
object, as we do in this example.
> new.month <- month.name
> substring (new.month, nchar (new.month) - 1) <- "YZ"
> new.month
 [1] "JanuaYZ"   "FebruaYZ"  "MarYZ"     "AprYZ"
 [5] "MYZ"       "JuYZ"      "JuYZ"      "AuguYZ"
 [9] "SeptembYZ" "OctobYZ"   "NovembYZ"  "DecembYZ"

4.1.5 Changing Case and Other Substitutions

R is case-sensitive, and so we often need to manipulate the case of characters
(i.e., change upper-case letters to lower-case or vice versa). The tolower()
and toupper() functions perform those operations, as does the equivalent
casefold() function, which takes an argument called upper that describes
the direction of the intended change (upper = TRUE means “change to
upper case,” with FALSE, the default, indicating “change to lower case”). Note
that case-folding works with non-Roman alphabets in which that operation is
defined, such as Cyrillic and Greek. The help page for these functions describes
a more complicated approach that can capitalize the first letter in each word,
which is particularly useful for multi-word names such as “kuala lumpur” or
“san luis obispo.”
A more general substitution facility is provided by the chartr() (“character
translation”) function. This takes two strings of characters, plus the text to be
translated; it changes each character in the first string into the corresponding
character in the second.
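A small sketch (the example is ours):
> chartr ("aeiou", "AEIOU", "kuala lumpur")
[1] "kUAlA lUmpUr"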

4.2 Converting Numbers into Text
Numbers get special treatment when they are converted into text because R
needs to decide how they should be formatted. As we have seen earlier, R formats entries in a numeric vector for display, but that formatting is part of the
print-out, not part of the vector, and the formatting can change when the vector changes. In this section, we describe some of the details of those formatting
choices. We also describe how R uses scientific notation, and how you can create categorical versions of numeric vectors.
4.2.1 Formatting Numbers

Often it is convenient to represent a series of numbers in a consistent format for
reporting. The primary tools for formatting are format() and sprintf().
Format() provides a number of useful options, particularly for lining up decimal
points and commas. (The European usage, with a comma to denote the start
of the decimal and a period to separate thousands, is also supported.) Of course,
formatting strings nicely in R doesn’t guarantee that those strings will line up
nicely in a report; that will depend on which font is used to display the formatted strings. Still, format() is a fast and easy way to format a set of numbers in
a common way. Important arguments are digits, to determine the number
of digits, nsmall to determine the number of digits in the “small” part (i.e., to
the right of the decimal point), big.mark to determine whether a comma is
used in the “big” part, drop0trailing, which removes trailing zeros in the
small part, and zero.print, which, if FALSE, causes zeros to be printed as
blanks. (You can also specify an alternative character like a dot, which might be
useful when most entries are zero.)
This example shows some of these arguments at work.
# Seven digits by default, decimals aligned
> format (c(1.2, 1234.56789, 0))
[1] "   1.200" "1234.568" "   0.000"
# Add comma separator
> format (c(1.2, 1234.56789, 0), big.mark=",")
[1] "    1.200" "1,234.568" "    0.000"
# Currency style, blank zeros
> format (c(1.2, 1234.56789, 0), digits = 6, nsmall=2,
          zero.print=F, width=2)
[1] "   1.20" "1234.57" "       "

In the last example, the digits and nsmall arguments had to be chosen
carefully in order to produce exactly two digits to the right of the decimal point.
(The nsmall argument describes the minimum, not the maximum, number of
digits to be printed.) There are a few formatting tasks, including incorporating
text and adding leading zeros, that format() is not prepared for, and these
are handled by sprintf().
The sprintf() function takes its name from a common function in the
C language (the name evokes “string print, formatted”). This powerful function is complicated, and we just give a few examples here. The important point
about sprintf() is the format string argument, which describes how each
number is to be treated. (In R fashion, the format string can be a vector, in
which case either that argument or the numerics being formatted may have
to be recycled in the manner of Section 2.1.4.) A format string contains text,
which gets reproduced in the function’s output (this is useful for things such as
dollar signs) and conversion strings, which describe how numbers and other
variables should appear in that output. Conversion strings start with a percent sign and contain some optional modifiers and then a conversion character,
which describes the manner of object being formatted. Although sprintf()
can produce hexadecimal values and scientific notation (see the following discussion)
with the proper conversion characters, the most useful are i (or d)
for integer values, f for double-precision numerics, and s for character strings.

So, for example, sprintf("%f", 123) formats 123 as a double precision
using its default conversion options and produces the text "123.000000",
while sprintf("%f", 123.456) produces "123.456000".
Much of the power in sprintf() comes from the modifiers. Primary
among these are the field width and precision, two numbers separated by a
period that give the minimum width (the total number of characters, including
sign and decimal point) and the number of digits to the right of the decimal
points, respectively. Other modifiers include the 0, to pad with leading zeros;
the space modifier, which leaves a space for the sign if there isn’t one (so that
negative and positive numbers line up), and the + modifier, which produces
plus signs for positive numbers. So, to continue the example, the format
string in sprintf("%9.1f", 123.456) asks for a field width of 9 and a
precision of 1, and the result is the nine-character string "    123.5". The
command sprintf("%09.1f", 123.456) asks for leading zeros and
therefore produces "0000123.5". The items to be formatted, and even the
format string itself, can be vectors. This vectorization makes it straightforward
to insert numbers into sentences like this:
> costs <- c(1, 333, 555.55, 123456789.012)
# Format as integers using %d; note that %d accepts only
# integer-valued numbers, so we truncate the cents first
> sprintf ("I spent $%d in %s", trunc (costs), month.name[1:4])
[1] "I spent $1 in January"       "I spent $333 in February"
[3] "I spent $555 in March"       "I spent $123456789 in April"

In this example, each element of costs and month.name[1:4] is used, in
turn, with the format string.
The format strings are very flexible. We show two more examples here.
# Format as double-precision (%f) with default precision
> sprintf ("I spent $%f in %s", costs, month.name[1:4])
[1] "I spent $1.000000 in January"
[2] "I spent $333.000000 in February"
[3] "I spent $555.550000 in March"
[4] "I spent $123456789.012000 in April"
# Format as currency, without specifying field width
> sprintf ("I spent $%.2f in %s", costs, month.name[1:4])
[1] "I spent $1.00 in January"
[2] "I spent $333.00 in February"
[3] "I spent $555.55 in March"
[4] "I spent $123456789.01 in April"

One final feature of sprintf() is that field width or precision (but not both)
can themselves be passed as an argument by specifying an asterisk in the format conversion string. This allows fine-tuning of the widths of output, which
is useful in reporting. Suppose we wanted all four output strings from the last
example to have the same length. We can compute the length of the largest
number in costs (after rounding to two decimal places) and supply that length
as the field width, as seen in this example.
> biggest <- max (nchar (sprintf ("%.2f", costs)))
> sprintf ("I spent $%*.2f in %s",
           biggest, costs, month.name[1:4])
[1] "I spent $        1.00 in January"
[2] "I spent $      333.00 in February"
[3] "I spent $      555.55 in March"
[4] "I spent $123456789.01 in April"

Although sprintf() is complicated, it is very handy for at least one
job – generating labels that look like 001, 002, 003, and so on. The command
sprintf("%03d", 1:100) will generate 100 labels of that sort.
4.2.2 Scientific Notation

Scientific notation is the practice of representing every number by an optional
sign, a number between 1 and 10, and a multiplier of a power of 10. The choice
that R makes to put a number into scientific notation depends on the number
of significant digits required. For example, the number 123,000,050 is written
1.23e+08, but 123,000,051 is written 123000051. When one number in a
vector (or matrix) needs to be represented in scientific notation, R represents
them all that way, which can be helpful or annoying, depending on the job at
hand. In this example, we show some of the effects of the way R displays numbers in scientific notation. Notice that the rules are slightly different for integer
and for floating-point values.
> 100000                     # Big enough to start scientific notation
[1] 1e+05
> c(1, 100000)               # Both numbers get scientific notation
[1] 1e+00 1e+05
> c(1, 100000, 123456)       # R keeps precision here
[1]      1 100000 123456
> as.integer (10000000 + 1)  # Integers are a little different
[1] 10000001

There is no easy way to change scientific notation for a single command.
The R option scipen controls the “crossover” points between regular
(“fixed”) and scientific notation, which depends on the number of characters required to print the vector out. (This, in turn, depends on the
number of digits R is prepared to display, which depends on the digits
option.) Set options(scipen = 999) to disable all scientific notation,
and options(scipen = -999) to require scientific notation everywhere – but don’t forget to set it back to the default value of 0 as needed.
(As with other options() calls, the value of scipen is re-set when you
close and re-open R.) An alternative is to use the format() command with
the scientific = FALSE option. This example shows the format()
command at work on a large number.
> format (10000000)          # scientific notation
[1] "1e+07"
> format (100000000, scientific = FALSE)
[1] "100000000"

Notice that, like sprintf(), format() always produces a character string,
which makes further numeric computation difficult.
4.2.3 Discretizing a Numeric Variable

Very often we construct a discretized, categorical version of a numeric vector
with just a few levels for exploration or modeling purposes (we sometimes call
this procedure “binning”). For example, we might want to convert a numeric
vector into a categorical with levels “Small,” “Medium,” and “Large.” The natural tool for this in R is the cut() function. The arguments are the vector to be
discretized, the breakpoints, and, optionally, some labels to be applied to the
new levels. The result of a call to cut() is a factor vector; we discuss factors
in Section 4.6, but for the moment we will simply convert the result back to
characters. In this example, we start with a numeric vector and bin its values
into three groups. We will set the boundary points at 4 and 7.
> vec <- c(1, 5, 6, 2, 9, 3, NA, 7)
> as.character (cut (vec, c(1, 4, 7, 10)))
[1] NA       "(4,7]"  "(4,7]"  "(1,4]"  "(7,10]" "(1,4]"
[7] NA       "(4,7]"

The cut() function has some distracting quirks. By default, intervals
do not include their left endpoint, so that in this example, the value 1
does not belong to any interval. This produces the NA in element 1 of the
output; the second, of course, arises from the missing value in vec. The
include.lowest = TRUE argument will force the leftmost breakpoint to
belong to the leftmost bin. In this example, the number 1 would be found in
the leftmost bin if include.lowest = TRUE were specified. Alternatively,
the right = FALSE argument makes intervals include their left end and
exclude their right (in which case include.lowest = TRUE actually refers
to the largest of the breakpoints). In this example, any values larger than 10
would have produced NAs as well. This requires that you know the lower and
upper limits of the data before deciding what the breakpoints need to be.
If the exact locations of the breakpoints are not important, cut() provides
a straightforward way to produce bins of equal width or of approximately
equal counts. The first is accomplished by specifying the breaks argument as
an integer. In this case, cut() assigns every non-missing observation (even
the lowest) to one of the bins. For bins of approximately equal counts, we can
compute the quantiles of the numeric vector and use those as breakpoints.
The following two examples show both of these approaches on a set of 100
numbers generated from R’s random number generator with the standard
normal distribution. We use the set.seed() function to initialize the
random number generator; if you use this command your generator and ours
should produce the same numbers. First we pass breaks as an integer to
produce bins of approximately equal width.
> set.seed (246)
> vecN <- rnorm (100)
> table (cut (vecN, breaks = 5))
 (-3.18,-1.94] (-1.94,-0.709] (-0.709,0.522]   (0.522,1.75]    (1.75,2.99] 
             2             21             52             20              5 

In the following example, we use R’s quantile() function to compute the
minimum, quartiles and maximum of the vecN vector. (Other choices are possible through the use of the probs argument.) Once the quantiles are computed we can pass them as breakpoints to produce four bins with approximately
equal counts – but, as before, cut() produces NA for the smallest value unless
include.lowest = TRUE – and by default, table() omits NA values.
> quantile (vecN)
         0%         25%         50%         75%        100% 
-3.17217563 -0.61809034 -0.06712505  0.45347704  2.98469969 
> table (cut (vecN, quantile (vecN)))   # lowest value omitted
  (-3.17,-0.618] (-0.618,-0.0671]  (-0.0671,0.453]     (0.453,2.98] 
              24               25               25               25 
> table (cut (vecN, quantile (vecN), include.lowest = TRUE))
  [-3.17,-0.618] (-0.618,-0.0671]  (-0.0671,0.453]     (0.453,2.98] 
              25               25               25               25 

Notice how supplying include.lowest = TRUE changed the first bin from
a half-open interval (indicated by the label starting with () to a closed one
(label starting with [). In general, the default labels are somewhat unwieldy – a
character value like "(-0.618,-0.0671]" will be difficult to manage. The
cut() function allows us to pass in a vector of text labels using the labels
argument.
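For instance, continuing the example above, we might supply short labels of our own (a minimal sketch):
> table (cut (vecN, quantile (vecN), include.lowest = TRUE,
          labels = c("Q1", "Q2", "Q3", "Q4")))
Q1 Q2 Q3 Q4 
25 25 25 25 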

4.3 Constructing Character Strings: Paste in Action
Character strings arise in data we bring in from other sources, but very often we
need to construct our own. The primary tool for building character strings is the
paste() function, plus its sibling paste0(). In its simplest form, paste()
sticks together two character vectors, converting either or both, as necessary,
from another class into a character vector first. By default R inserts a space
in between the two. For example, paste ("a", "b") produces the result
"a b", while paste (1 == 2, 1 + 2) evaluates the two arguments, converts
them to character (see Section 2.2.3), and produces "FALSE 3".
In practice, we prefer to control the character that gets inserted. We want
a space sometimes, for example, when constructing diagnostic messages, but
more often we want some other separator, in order to construct valid column
names, for example. The sep argument to paste() allows us to specify the
separator. Very often in our work we use a period, by setting sep = ".", or
no separator at all, by setting sep = "". In the latter case, we can also use the
paste0() function, which according to the help file operates more efficiently
in this case.
What really gives paste() its power is that it handles vectors. If any of its
arguments is a vector, paste() returns a vector of character strings, recycling
(Section 2.1.4) shorter ones as needed. This gives us great flexibility in constructing sets of strings. For example, the command paste0("Col", LETTERS) produces a vector of the strings "ColA", "ColB", and so on, up to
"ColZ".
A final useful argument to paste() is collapse, which combines all the
strings of the vector into one long string, using the separator specified by the
value of the collapse argument. Common choices are the empty string "",
which joins the pieces directly, and the new-line and tab characters, when formatting text for output to tables.
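For example (our own small illustration):
> paste (month.abb[1:3], collapse = ", ")
[1] "Jan, Feb, Mar"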
Paste() is such a big part of character manipulation in R that we think it
important to show a few examples of how it works and where it can be useful.
4.3.1 Constructing Column Names

When a data frame is constructed from data without header names, R constructs names such as V1 and V2. Normally, we will want to replace these
with meaningful names of our own, but in big data sets the act of typing
in those names is tedious and error-prone. Moreover it is often true that
the names follow a pattern – for example, we might have a customer ID
followed by 36 months of balance data from 2016 to 2018, followed by 36
months of payment data for the same years. One way to generate those latter
72 names is through the outer() function. This function operates on two
vectors and applies another function to each pair of elements from the
two vectors, producing a matrix of results. For example, outer(1:10,
1:10, "*") produces a 10 × 10 multiplication table. The command
outer(month.abb, 2016:2018, paste, sep="."),
similarly,
produces a matrix. In this example, we show the first few rows of that matrix
using the head() command.
> head (outer (month.abb, 2016:2018, paste, sep = "."), 3)
     [,1]       [,2]       [,3]
[1,] "Jan.2016" "Jan.2017" "Jan.2018"
[2,] "Feb.2016" "Feb.2017" "Feb.2018"
[3,] "Mar.2016" "Mar.2017" "Mar.2018"

To construct column labels with Bal. on the front, we can simply paste that
string onto the elements of the matrix. Remember that paste() converts its
arguments into character vectors before operating, so the result of this operation is a vector of column labels, as shown in this example, where again we
show only a few of the elements of the result.
> myout <- outer (month.abb, 2016:2018, paste, sep = ".")
> paste0 ("Bal.", myout)[1:3]
[1] "Bal.Jan.2016" "Bal.Feb.2016" "Bal.Mar.2016"

So to construct a vector with all 73 desired names, we could use a single command as in this example.
newnames <- c("ID", paste0 ("Bal.", myout),
paste0 ("Pay.", myout))

We note that outer() is not very efficient. For very big data sets, we might
create separate vectors from the year part and the month part, and then paste
them together. Suppose that the balance and payment values alternated, so the
first two columns gave balance and payment for January 2016, the next two for
February 2016, and so on. Then a straightforward way to construct the labels
using paste() is by repeating the components as needed, with rep(), and
then pasting the resulting vectors together:
# 2 values x 12 months x 3 years
part1 <- rep (c("Bal", "Pay"), 12 * 3)
# Double each month, repeat x 3
part2 <- rep (rep (month.abb, each = 2), 3)
part3 <- rep (2016:2018, each = 24)
newnames <- c("ID", paste (part1, part2, part3, sep = "."))

The task in this example could actually have been done more easily with
expand.grid(). This function takes, as arguments, vectors of values
and produces a data frame containing all combinations of all the values.

Since the output is a data frame, for many purposes you will want to specify
stringsAsFactors = FALSE. The next step is to use paste() on the
columns of the data frame. We use paste() regularly and in many contexts.
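Here is a minimal sketch of that approach (the column names are our own):
> eg <- expand.grid (type = c("Bal", "Pay"), mon = month.abb,
          yr = 2016:2018, stringsAsFactors = FALSE)
> head (paste (eg$type, eg$mon, eg$yr, sep = "."), 4)
[1] "Bal.Jan.2016" "Pay.Jan.2016" "Bal.Feb.2016" "Pay.Feb.2016"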
4.3.2 Tabulating Dates by Year and Month or Quarter Labels

Often we want to summarize vectors of dates (Section 3.6) across, for example,
years, months, or calendar quarters. An easy way to do this is by pasting
together identifiers of year and month, then using table() or tapply() to
compute the relevant numbers of interest. We use paste() here because the
built-in months() and quarters() functions do not produce the year as
well (and the format() function does not extract quarters). In this example,
we first generate 600 dates at random between January 1, 2015 and December
31, 2016 (a period of 731 days) and then tabulate them by quarter.
> set.seed (2016)
> dts <- as.Date (sample (0:730, size = 600),
origin = "2015-01-01")
> table (quarters (dts)) # Shows calendar quarter
 Q1  Q2  Q3  Q4 
134 153 151 162 

To combine both year and quarter, we can use substring() to extract
the years, then paste them together with the quarters. (We could also have
extracted the years with format(dts, "%Y").) We put the years first in
these labels so that the table labels are ordered chronologically. In this example,
we paste the year and quarter, and then tabulate.
> table (paste0 (substring (dts, 1, 4), ".", quarters (dts)))
2015.Q1 2015.Q2 2015.Q3 2015.Q4 2016.Q1 2016.Q2 2016.Q3 2016.Q4 
     72      71      75      81      62      82      76      81 

To add months to the years, we could call the months() function and once
again use paste() to combine the year and month information. Alternatively, we can use format() directly, as in this example. Notice, however, that
table() sorts its entries alphabetically by name.
> (mtbl <- table (format (dts, "%Y.%B")))
   2015.April   2015.August 2015.December ...
           24            23            24 ...

To put these entries into calendar order explicitly, we can use paste0() to
construct a vector of names to be used as an index. Then we can use that index
to re-arrange the entries in the mtbl table. We show that in this example.


> (month.order <- paste0 (rep (2015:2016, each = 12),
          ".", month.name))
 [1] "2015.January"  "2015.February" "2015.March" ...
> mtbl[month.order]
 2015.January ...
           24 ...

4.3.3 Constructing Unique Keys

Often we need to construct a single column that uniquely labels each row in a
data frame. For example, we might have a table with one row for each customer
in every month in which a transaction took place. Neither customer number
nor month is enough to uniquely identify a transaction, but we can construct
a unique key by pasting account number, year, and month. We would probably
use a two-digit numeric month and put year before month. That way an
alphabetical ordering of the keys would put every customer’s transactions
together in increasing date order.
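A minimal sketch (the account numbers and dates are invented):
> acct <- c(103, 103, 21)
> yr <- c(2016, 2016, 2017)
> mon <- c(3, 11, 2)
> sprintf ("%s.%d.%02d", acct, yr, mon)
[1] "103.2016.03" "103.2016.11" "21.2017.02" 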
4.3.4 Constructing File and Path Names

In many data processing applications, our data is spread out over many files and
we need to process all the files automatically. This might require constructing
file names by pasting together a path name, a separator like /, and a file name. R
can then loop over the set of file names to operate on each one. As an example,
one way to get the full (absolute) file names of all the files in your working
directory is by combining the name of the directory (retrieved with getwd())
and the names of the files (retrieved with list.files()). The command
paste(getwd(), list.files(), sep = "/") produces a character
vector of the absolute file names of files in the working directory. This is not
quite the same as the output from list.files(full.names = TRUE);
we discuss interacting with the file system in more detail in Section 5.4.

4.4 Regular Expressions
A regular expression is a pattern used in a tool to find strings that match
the pattern. The patterns can be very complicated and perform surprisingly
sophisticated matches, and in fact entire books have been written about regular
expressions. While we cannot cover all the complexities of regular expressions
in this book, we can make you knowledgeable enough to do powerful things.
We use regular expressions to find strings that match a rule, or set of rules,
called a pattern. For example, the pattern a matches strings that include one
or more instances of a anywhere in them. The pattern a8 matches strings with
a8, with no intervening characters. Most characters, such as a and 8 in this
example, match themselves. What gives regular expressions their power is the
ability to add certain other characters that have special meaning to the pattern.
The exact set of special characters differs across the different kinds of regular expression, but as a first example, the character ^ means “at the start of a
line,” and $ means “at the end of a line.” So the pattern ^The matches every
string that starts with The; end$ matches every string that ends with end, and
^No$ matches every string that consists entirely of No. By default, patterns are
case-sensitive, but shortly we will see how to ignore case.
4.4.1 Types of Regular Expressions

The details of regular expressions differ from one implementation to the next,
so a regular expression you write for R may not work in, for example, Python
or another language. Actually, R supports two sorts of regular expressions:
one is POSIX-style (named for the same POSIX standards group that gave us
the POSIXt date objects), and the other is Perl-style, referring to the regular
expressions used in the Perl language. (Specifically, if you need to look this up
somewhere, the POSIX style incorporates the GNU extensions and the Perl
style comes via the PCRE library.) By default, POSIX regular expressions are
used in R.
4.4.2 Tools for Regular Expressions in R

There are three primary tools for regular expressions in R: grep() and its variants, regexpr() and its variants, and sub() and its variants. These three
are similar in implementation. We start by describing grep() in some detail.
The grep() function takes a pattern and a vector of strings, and returns a
numeric vector giving the indices of the strings that match the pattern. With the
value = TRUE argument, grep() returns the matching strings themselves.
The related function grepl() (the letter l on the end standing for “logical”)
returns a logical vector with TRUE indicating the elements that match. In this
example, we look through R’s built-in state.name vector to find elements
with the capital letter C.
> grep ("C", state.name)
[1] 5 6 7 33 40
> grep ("C", state.name, value = TRUE)
[1] "California"
"Colorado"
"Connecticut"
[4] "North Carolina" "South Carolina"
> grep ("^C", state.name, value = TRUE)
[1] "California" "Colorado"
"Connecticut"

The first call to grep() produces a vector of indices. These five numbers show
the locations in state.name where strings containing C can be found. With
value = TRUE, the names of the matching states are returned. In the final
example, we search only for strings that start with C.


Several other arguments are also important. First, the ignore.case argument defaults to FALSE, but when set to TRUE, it allows the search to ignore
whether letters are in upper- or lower-case. Second, setting invert = TRUE
reverses the search – that is, grep() produces the indices of strings that do
not match the pattern. (The invert argument is not available for grepl(),
but of course you can use ! applied to the output of grepl() to produce
a logical vector that is TRUE for non-matchers.) Third, fixed = TRUE
suspends the rules about patterns and simply searches for an exact text string.
This is particularly useful when the text you are searching for contains a
special character, for example, a negative amount indicated
with parentheses, such as ($1,634.34). To continue an earlier example,
grep("^The", vec) finds the entries of vec that start with the three
characters The, whereas grep("^The", vec, fixed = TRUE) finds
the entries that include the four characters ^The anywhere in the string.
A fourth useful argument is perl, which, when set to TRUE, leads the grep
functions to use Perl-type regular expressions. Perl-type regular expressions
have many strengths, but for this development we will describe the default,
POSIX style. Finally, all of these regular expression functions permit the use of
the useBytes argument, which specifies that matching should be done byte
by byte, rather than character by character. This can make a difference when
using character sets in which some characters are represented by more than
one byte, such as UTF-8 (see Section 4.5).
4.4.3 Special Characters in Regular Expressions

We have seen how ^ and $ match, respectively, the beginning and end of a line.
There are a number of other special characters that have specific meanings in
a regular expression. In order for one of these special characters to be used to
stand for itself, it needs to be “protected” by a backslash. We talk more about
the way backslashes multiply in R regular expressions in the following section.
Table 4.1 lists the special characters in R’s implementation of POSIX regular expressions. Many implementations of regular expressions work on lines of
text in a file, so we use the word “line” here synonymously with “element of a
character vector.”
4.4.4 Examples

In this section, we show our first examples of using regular expressions to locate
matching strings. Remember that, by default, grep() gives the indices of the
matching strings; pass value = TRUE to get the strings themselves and use
grepl() to get a logical indication of which strings match. In these functions, a string matches or does not – there is no notion of the position within
a string where a match takes place. The tool for that is regexpr(), described
in Section 4.4.5.
Table 4.1 Special characters in R (POSIX) regular expressions

Char  Name           Purpose                     Example

Matching characters
.     Period         Match any character         t.e matches strings with tae, tbe,
                                                 t9e, t;e, and so on, anywhere
[ ]   Brackets       Match any character         t[135]e matches t1e, t3e, t5e;
                     between them                t[1-5]e matches t1e, t2e, ..., t5e,
                                                 but note: [a-d] might mean [abcd]
                                                 or [aAbBcCdD], depending on your
                                                 computer. See “character ranges”
^     Caret          (i) Start of line           ^L matches lines starting with L
                     (ii) “Not” when appearing   t[^h]e matches lines containing t,
                     first in brackets           then something not an h, then e
$     Dollar         End of line                 the$ matches lines ending with the
|     Pipe           “Or” operator               th|sc matches either th or sc
( )   Parentheses    Grouping operators
\\    Backslash      Escape character            See text
{ }   Braces         Enclose repetition          (a|b){3} matches lines with three
                     operators                   (a or b)’s in a row, for example,
                                                 aba, bab, ...
,     Comma          Separate repetition         j{2,4} matches jj, jjj, jjjj
                     operators

Repetition characters
*     Asterisk       Match 0 or more             ab* matches a, ab, abb, ...
+     Plus           Match 1 or more             ab+ matches ab, abb, abbb, ...
?     Question mark  Match 0 or 1                ab? matches a or ab

We start by creating a vector of strings that contain the string
sen in different cases and locations.
> sen <- c("US Senate", "Send", "Arsenic", "sent", "worsen")
> grep ("Sen", sen)                   # which elements have "Sen"?
[1] 1 2
> grep ("Sen", sen, value = TRUE)     # elements with "Sen"
[1] "US Senate" "Send"
> grep ("[sS]en", sen, value = TRUE)  # either case "S"
[1] "US Senate" "Send"      "Arsenic"   "sent"      "worsen"
> grep ("sen", sen, value = TRUE,     # upper- or lower-case
        ignore.case = T)
[1] "US Senate" "Send"      "Arsenic"   "sent"      "worsen"
> grep ("^[sS]en", sen, value = T)    # start "Sen" or "sen"
[1] "Send" "sent"
> grep ("sen$", sen, value = T)       # end with "sen"
[1] "worsen"

The first grep() produces the indices of elements that match the pattern – this
is useful for extracting the subset of items that match. The second grep() uses
value = TRUE to return the elements themselves. These simple examples
start to show the power of regular expressions. That power is multiplied by the
ability to detect repetition, as we see next.
Repetition

The second part of Table 4.1 describes some repetition operators. Often we seek
not a single character, but a set of matching characters – a series of digits, for
example. The regular expression repetition operator ? allows for zero or one
matches, essentially making the match optional; the * allows for zero or more,
and the + operator allows for one or more matches. So the pattern 0+ matches
one or more consecutive zeros. Since the dot character matches any character,
the combination .+ means “a sequence of one or more characters”; this combination appears frequently in regular expressions. We also often see the similar
pattern .* for “zero or more characters.” The following example shows how we
can match strings with extraneous text using repetition operators. We start by
creating a vector of strings, and our goal is to find elements of the vector that
start with Reno and are followed at some point later in the string by a ZIP code
(five digits).
> reno <- c("Reno", "Reno, NV 895116622", "Reno 911",
"Reno Nevada 89507")
> grep ("Reno.+[0-9]{5}", reno, value = TRUE)
[1] "Reno, NV 895116622" "Reno Nevada 89507"

Here, the .+ accounts for any text after the o in Reno and the [0-9]{5}
describes the set of five digits. It is tempting to add spaces to your pattern to
make it more readable, but this is a mistake; the regular expression will then
take the spaces literally and require that they appear. Notice that the nine-digit
number was also matched by the {5} repetition, since the first five of the nine
digits satisfy the requirement.
We end this section with a more complicated example. Here, we search for
strings with dates in the form of a one- or two-digit numeric day, a month name
(as a three-letter abbreviation), and a four-digit year number, when there might
be text between any of these pieces. This example shows the text to be matched.
> dt <- c("Balance due 16 Jun or earlier in 2017",
          "26 Aug or any day in 3018",
          "'76 Trombones' marched in a 1962 film",
          "4 Apr 2018", "9Aug2006",
          "99 Voters May Register in 20188")

The pieces of the regular expression to detect the dates are these. First, we can
have leading text, so .* will match that if it is present. Second, [0-3]?[0-9]
matches a one-digit number (since the first digit is optional, as indicated by the
?) or a two-digit number less than 40. Next, there is (optional) additional text,
followed by a set of month names. The month-related part of the pattern looks
like (Jan|Feb|Mar...|Dec), where the pipes denote that any month will
match and the parentheses make this a single pattern. (The abbreviations in
the pattern will match a full name in the text.) Finally, we match some more
additional text, followed by four digits that have to start with a 1 or a 2. We
construct the month-related part of the pattern first by using paste() with
the collapse argument.
> (mo <- paste (month.abb, collapse = "|"))
[1] "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
> re <- paste0 (".*[0-3]?[0-9].*(", mo, ").*[1-2][0-9]{3}")
> grep (re, dt, value = TRUE)
[1] "Balance due 16 Jun or earlier in 2017"
[2] "4 Apr 2018"
[3] "9Aug2006"
[4] "99 Voters May Register in 20188"

Notice that the mar in marched does not match the month abbreviation Mar.
However, the 99 in the final string matches the day portion of our pattern. That
is because the [0-3] is optional; the first 9 matches in the [0-9] pattern and
the second, in the .* pattern. Moreover, the five-digit year 20188 in that string
matches the pattern [1-2][0-9]{3} because its first four digits do. We see
how to refine this example later in the section – but regular expressions are
tricky!
The Pain of Escape Sequences

Special characters give regular expression much of their power. But sometimes
we need to use special characters literally – for example, we might want to find
strings that contain the actual dollar sign $. A dollar sign in a pattern normally
indicates the end of a line; to use it literally in a pattern it needs to be “escaped”
with a backslash. So in order for the regular expression “engine” to look for a
dollar sign, we need to pass it the pattern \$. But remember that to type a backslash into R, we need to type two backslashes, since R also uses the backslash
as the character that “protects” certain other characters (in strings like \n for
new-line). That is, we have to type \\$ in R so that the engine can see \$ and
know to search for a dollar sign. In this example, we create a vector of character
strings and search for a dollar sign among them. Remember that $ matches the
end of a string. In the first grep() command below, the pattern $ matches
every element in the vector that has an end – all of them.


> vec <- c("c:\\temp", "/bin/u", "$5", "\n", "2 back: \\\\")
> grep ("$", vec)   # Indices of elements with ends
[1] 1 2 3 4 5
> grep ("\$", vec, value = TRUE)
Error: '\$' is an unrecognized escape...
> grep ("\\$", vec, value = TRUE)
[1] "$5"

The pattern \$ looks to R as if we are constructing a special “protected” character such as \n or \t. Since there is no such character in R, we see an error. The
next command produces the elements of vec that contain dollar signs since
value = TRUE; in this case, the only element that matches is $5.
Other special characters also need to be escaped. To search for a dot, use \\.;
to search for a left parenthesis, use \\(, and so on. The “pain” of this section’s
title refers to searching for the backslash itself. Since a backslash is represented
as \\, and we need to pass two of them to the engine, the pattern for finding
backslashes in a string is \\\\. This looks like four characters, but it’s actually
two (as nchar ("\\\\") will confirm). The first tells the regular expression
engine to take the second literally.
Backslashes are fortunately pretty rare in text, but they do arise in path names
in the Windows operating system. In this example, we show how we can locate
strings containing the backslash character.
> grep ("\\", vec)
Error in grep("\\", vec) :
invalid regular expression '\', reason 'Trailing backslash'
> grep ("\\\\", vec, value = TRUE)
# elements with \
[1] "c:\\temp"
"2 back: \\\\"
> grep ("\\\\\\\\", vec, value = TRUE)
# two backslashes
[1] "2 back: \\\\"

In the first command, the backslash \\ is valid in R, but because the regular
expression engine uses the backslash as well, it expects a second character (like
$ in our example above). When no second character is found, grep() produces an error. The second example shows the elements of vec that contain a
backslash. Notice that the \n character in position 4 is a single character. The
backslash depicts its special nature but is not part of the actual character. The
final pattern matches the string with two backslashes.
The fixed = TRUE argument can alleviate some of the pain when searching for text that includes special characters. In this example, we repeat the
searches above using fixed = TRUE.
> grep ("\\", vec, value = TRUE, fixed = TRUE)
[1] "c:\\temp"
"2 back: \\\\"
> grep ("\\\\", vec, value = TRUE, fixed = TRUE)
[1] "2 back: \\\\"

# one \
# two \

As a final example in this section, we show how we can use the pipe character
| to find elements of vec with either forward slashes or backslashes.
> grep ("\\\\|/", vec, value = TRUE)
[1] "c:\\temp"
"/bin/u"
"2 back: \\\\"
> grep ("\\|/", vec, value = TRUE, fixed = TRUE)
character(0)

In the first command, we found strings containing either a backslash (\\\\)
or forward slash (/), the two separated by the pipe character indicating “or.” In
the second command, we used fixed = TRUE to look for strings containing
the literal text \|/ in that order – and of course none was found.
Character Ranges and Classes

We saw earlier how we can match ranges of digits by enclosing them in square
brackets as [0-9]. This extends to other sets of characters. For example,
we might want to match any of the lower-case letters, or any punctuation
character, or any of the letters A–G of the musical scale. It is easy to specify a
range of characters using square brackets and a hyphen, so [a-z] matches a
lower-case letter and [A-G] matches an upper-case musical note. To match
musical notes given in either case, we can combine a range with the pipe
character: [A-G]|[a-g] matches any of those seven letters in either case.
To include a hyphen in the pattern, put it first or last in the brackets. (You can
also put an opening square bracket in a set or range, but to include a closing
square bracket, you will need to escape it so that the set isn’t seen as ending
with that character.) So, for example, the range [X-Z] matches “any of the
letters X, Y or Z,” and includes Y, whereas the set [XZ-] matches “any of X, Z,
or hyphen” and does not.
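For instance, using a small vector we make up here, we can pick out musical
notes in either case, and then elements containing X, Z, or a hyphen:
> notes <- c("A", "g", "H", "-", "x")
> grep ("[A-G]|[a-g]", notes, value = TRUE)   # notes in either case
[1] "A" "g"
> grep ("[XZ-]", notes, value = TRUE)         # X, Z, or hyphen
[1] "-"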
We can negate a character class or range by preceding it with the caret character ^. So the set [^XZ-] matches any characters other than X, Z, or hyphen.
Notice that the caret must be inside the brackets; outside, it matches the start of
the line as we saw above. A caret inside the brackets in any position other
than the first is interpreted literally – that is, it matches the caret character.
There is a predefined set of character classes that makes it easy to specify certain common sets. These include [:lower:], [:upper:], and
[:alpha:] for lower-case, upper-case, and any letters; [:digit:] for digits; [:alnum:] for alphanumeric characters (letters or numbers); [:punct:]
for punctuation; and a few more (see the help pages). Notice that the name of
the class includes the square brackets; to use these in a regular expression they
need to be enclosed in another set of square brackets. So, for example, the
pattern [[:digit:]] matches one digit, and [^[:digit:]] matches any
character that is not a digit. We start this example by showing how to identify
strings that include, or do not include, upper-case letters.
> vec <- c("1234", "6", "99 Balloons", "Catch 22", "Mannix")
> grep ("[[:upper:]]", vec, value = TRUE)
# any upper-case

119

120

A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R

[1] "99 Balloons" "Catch 22"
"Mannix"
> grep ("[^[:upper:]]", vec, value = TRUE) # any non-upper
[1] "1234"
"6"
"99 Balloons" "Catch 22"
[5] "Mannix"
> grep ("^[^[:upper:]]+$", vec, value = TRUE) # no upper
[1] "1234" "6"

The first grep() uses the [:upper:] character class to identify strings with
at least one upper-case character in them. It is tempting to think that the second regular expression, [^[:upper:]], will find strings consisting only of
non-upper-case characters, but as you can see the result is something different.
In fact, this pattern matches every string that has at least one non-upper-case
character. The last example shows how to identify strings consisting entirely of
these characters – we specify that a sequence of one or more non-upper-case
characters (i.e., [^[:upper:]] followed by +) should be all that can be found
on the line (i.e., between the ^ and the $).
Some classes are so commonly used that they have aliases. We can use \d for
[:digit:] and \s for [:space:], and \D and \S for “not a digit,” “not a
space.” (Here, “space” includes tab and possibly other more unusual characters.)
This makes it easy to, for example, find strings that contain no digits at all, as
we see in this example.
> grep ("^[^[:digit:]]+$", vec, value = TRUE) # long way
[1] "Mannix"
> grep ("^\\D*$", vec, value = TRUE)
# shorter
[1] "Mannix"
Word Boundaries

Often we require that a match take place on a word boundary, that is, at the
beginning or end of a word (which can be a space or related character such as
tab or new-line, or at the beginning or end of the string). Word boundaries are
indicated by \b, or by the pair \< and \>. The characters that are considered to
go into a word include all the alphanumeric (non-space) characters. Recall our
earlier example where we tried to locate strings with dates included – such as,
for example, "4 Apr 2018". Our earlier effort inadvertently matched a year
with the value 20188. Using the word-boundary characters, we can specify
that, in order to match, a string must include a word with exactly four digits.
This example shows how that might be done.
> (newvec <- grep ("\\<\\d{4}\\>", dt, value = TRUE))
[1] "Balance due 13 Jun or earlier in 2017"
[2] "26 Aug or any day in 2018"
[3] "'76 Trombones' marched in a 1962 film"
[4] "4 Apr 3018"
> grep (mo, newvec, value = TRUE)
[1] "Balance due 13 Jun or earlier in 2017"
[2] "26 Aug or any day in 2018"
[3] "4 Apr 3018"

In the first command, we found strings that contained words of exactly four digits and saved the result into the new item newvec. The second command then
searched for the month string in that new vector. We did not look for the days
in this example, but in practice we very often use multiple passes (sometimes
with invert = TRUE) in order to extract the set of strings we need. It may
be more computationally efficient to call grep() only once for any problem,
but that may not be the fastest route overall.
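As a sketch of that multi-pass style, using the dt and mo objects from above,
we might first require a month abbreviation and then use invert = TRUE to
discard strings containing a run of five digits:
> with.month <- grep (mo, dt, value = TRUE)
> grep ("\\d{5}", with.month, value = TRUE, invert = TRUE)
[1] "Balance due 13 Jun or earlier in 2017"
[2] "26 Aug or any day in 2018"
[3] "4 Apr 3018"
[4] "9Aug2006"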
4.4.5 The regexpr() Function and Its Variants

While grep() identifies strings that match patterns, the regexpr()
function is more precise: it returns the location of the (first) match within the
string – that is, the number of the first character of the match. We can use this
information to not only identify strings that contain numbers but also extract
the number itself. This example shows the result of calling regexpr() with
a pattern that looks for the first stand-alone integer in each string. It is not
enough to extract a set of digits because that would match strings such as
11-dimensional or B2B. Word boundaries provide the mechanism for
specifying an integer, as seen here:
> (regout <- regexpr ("\\<\\d+\\>", dt))
[1] 13 1 2 1 -1 1
attr(,"match.length")
[1] 2 2 2 1 -1 2
attr(,"useBytes")
[1] TRUE

The regexpr() function returns a vector (plus some other information we
describe below). The vector, which starts with 13, shows the number of the
character where the first integer begins. For example, the number 13 in the first
element of dt appears starting at the 13th character in that string, the number
26 starts in the 1st character of the second element, and so on. The -1 in the
fifth position indicates that string does not contain an integer as a word.
The function also returns attributes, extra pieces of information attached to
its output. The match.length attribute in this case gives the length of the
match – so the first element is 2 because the integer in the first string is two
characters long; the fourth is 1 because the integer in the fourth string is one
character long. (We will not need the useBytes attribute.) We could extract
the match.length vector using the attr() function, and then use substring() to extract the numbers in the strings. But a more convenient alternative is provided by regmatches(), which takes the initial string and the
output of regexpr() and performs the extraction for us, as in this example.

> regmatches (dt, regout)
[1] "13" "26" "76" "4" "99"

There are five entries in this vector because only five of the strings contained
integers. (The -1 values in the original vector remind you of which strings did
not produce values here.)
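For comparison, here is a sketch of the more manual route mentioned above,
using attr() and substring(); the final subscript drops the element with
no match:
> len <- attr (regout, "match.length")
> substring (dt, regout, regout + len - 1)[regout > 0]
[1] "13" "26" "76" "4"  "99"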
Finding All Matches

The regexpr() function finds the first instance of a match in a vector of
strings. To find all the matches is only a little more complicated. We use the
gregexpr() function, the g evoking “global.” The return value of gregexpr() is a list, not a vector, because some strings may contain many integers.
However, regmatches() works on this return value just as it does for regexpr(). In this example, we extract all of the integers from each of our strings
in one command.
# Note that some output from this command is suppressed
> (gout <- gregexpr ("\\<\\d+\\>", dt))
[[1]]
[1] 13 34
attr(,"match.length")
[1] 2 4
...
[[2]]
[1]  1 22
attr(,"match.length")
[1] 2 4
...
[[6]]
[1]  1 27
attr(,"match.length")
[1] 2 5
...
> regmatches (dt, gout)
[[1]]
[1] "13"   "2017"

[[2]]
[1] "26"   "2018"

[[3]]
[1] "76"   "1962"

[[4]]
[1] "4"    "3018"

[[5]]
character(0)

[[6]]
[1] "99"    "20188"

> matrix (as.numeric (unlist (regmatches (dt, gout))),
          ncol = 2, byrow = TRUE)
     [,1]  [,2]
[1,]   13  2017
[2,]   26  2018
[3,]   76  1962
[4,]    4  3018
[5,]   99 20188

Here, the result of the call to regmatches() is a list of length 6, one for each
string in the original dt. The fifth entry in the list is empty because the fifth
entry of dt had no integers that were words. The final command shows one way
you might form the list into a two-column numeric matrix, a useful step on the
way to constructing a data frame. A second approach would use do.call()
and rbind().
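Here is a sketch of that second approach; we add a lengths() filter to guard
against entries with no matches. The result is a character matrix, so a
conversion with as.numeric() would still be needed:
> hits <- regmatches (dt, gout)
> hits <- hits[lengths (hits) > 0]    # drop strings with no integers
> do.call (rbind, hits)
     [,1] [,2]
[1,] "13" "2017"
[2,] "26" "2018"
[3,] "76" "1962"
[4,] "4"  "3018"
[5,] "99" "20188"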
Greedy Matching

By default, regular expression matching is “greedy” – that is, matches are as
long as possible. As an example consider using the pattern \\d.*\\d to find a
digit, zero or more characters, and a second digit in the string "4 Apr 3018".
You might expect the regular expression engine to find the string "4 Apr 3",
but in fact it gathers as much as possible: "4 Apr 3018", stopping at the last
8. Adding a question mark makes the match “ungreedy” (or “lazy”) – so that
\\d.*?\\d produces "4 Apr 3".
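A short demonstration, extracting the match with regexpr() and
regmatches():
> x <- "4 Apr 3018"
> regmatches (x, regexpr ("\\d.*\\d", x))    # greedy
[1] "4 Apr 3018"
> regmatches (x, regexpr ("\\d.*?\\d", x))   # lazy
[1] "4 Apr 3"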
4.4.6 Using Regular Expressions in Replacement

In addition to finding matches, R has tools that allow you to replace the part
of the string that matches a pattern with a new string. These are sub(), which
replaces the first matching pattern, and gsub(), which replaces all the matching patterns. The replacement text is not a regular expression. For example, here
is a vector of four character strings. In the first example, we replace the first
lower-case i with the number 9. In the second, we replace the first instance
of either i or I with 9, and in the last, we replace all instances of either one
with 9.
> (mytxt <- c("This is", "what I write.",
"Is it good?", "I'm not sure."))
[1] "This is" "what I write." "Is it good?" "I'm not sure."
> sub("i", "9", mytxt)
# replace first i with 9
[1] "Th9s is" "what I wr9te." "Is 9t good?" "I'm not sure."
> sub("[iI]", "9", mytxt)
# replace first (i or I) with 9
[1] "Th9s is" "what 9 write." "9s it good?" "9'm not sure."

123

124

A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R

> gsub("[iI]", "9", mytxt)
# replace all (i or I) with 9
[1] "Th9s 9s" "what 9 wr9te." "9s 9t good?" "9'm not sure."

Sometimes the text being matched is needed in the replacement. This can
often be done very neatly using “backreferences.” When a regular
expression is enclosed in parentheses, its matching strings get labeled by
integers and can be re-used in the replacement string by referring to them as
\1, \2, and so on – of course, to be typed into R as \\1, \\2, and so on. In
this example, we are given names in the form “Firstname Lastname” and asked
to produce names of the form “Lastname, Firstname.”
> folks <- c("Norman Bethune", "Ralph Bunche",
"Lech Walesa", "Nelson Mandela")
> sub ("([[:alpha:]]+) ([[:alpha:]]+)", "\\2, \\1", folks)
[1] "Bethune, Norman" "Bunche, Ralph"
"Walesa, Lech"
[4] "Mandela, Nelson"

The first argument to the sub() command gave the pattern: a series of one or
more letters (captured as backreference 1), a space, and another series of letters
(backreference 2). The replacement part gives the second backreference, then
a comma and space, and then the first backreference. We note that this task is
more complicated with people whose names use three words, since sometimes
the second word is a middle or maiden name (as with John Quincy Adams or
Clare Boothe Luce) and sometimes it is part of the last name (Martin Van Buren,
Arthur Conan Doyle) – and of course some people’s names require four or more
words (Edna St Vincent Millay, Aung San Suu Kyi).
4.4.7 Splitting Strings at Regular Expressions

It is common to want to split a string whenever a particular character
occurs. This is more or less the opposite of the paste() operation. For
example, in our work we often construct a unique key to identify each of
our observations, using paste(). We might combine a company identifier,
transaction identifier, and date, with a command like key <- paste
(co.id, tr.id, date, sep = "-"). Of course, in this example, the
sep = "-" argument specifies a hyphen as the separator.
At a later time, it might be necessary to “unpaste” those keys into their individual parts. The strsplit() function performs this duty. In this example,
strsplit(key, "-") produces a list with one entry for each string in key.
Each entry is a vector of parts that result when the key is broken at its hyphens;
so if one key looked like 00147-NY-2016-K before the split, the corresponding entry in the output of strsplit() would be a vector with four elements
(and no hyphens). If the key had two hyphens in a row, there would have been
an empty string in the output vector. In this example, we show the effect of
strsplit() on several keys constructed using hyphens.

> keys <- c("CA-2017-04-02-66J-44", "MI-2017-07-17-41H-72",
"CA-2017-08-24-Missing-378")
> (key.list <- strsplit (keys, "-"))
[[1]]
[1] "CA"
"2017" "04"
"02"
"66J" "44"
[[2]]
[1] "MI"
[[3]]
[1] "CA"

"2017" "07"

"2017"

"17"

"08"

"41H"

"72"

"24"

"Missing" "378"

In cases like these, where the number of pieces is the same in every key, it
is common to construct a matrix or data frame from the parts. We saw a
similar example using the output of regmatches() in an earlier section.
Here, we construct a character matrix in the same way. We can then use
data.frame() to make the matrix into a data frame, although the columns
of the latter will be character unless you then convert them explicitly.
> matrix (unlist (key.list), ncol = 6, byrow = TRUE)
     [,1] [,2]   [,3] [,4] [,5]      [,6]
[1,] "CA" "2017" "04" "02" "66J"     "44"
[2,] "MI" "2017" "07" "17" "41H"     "72"
[3,] "CA" "2017" "08" "24" "Missing" "378"

Note that the alternative do.call("rbind", key.list) produces the
same character matrix.
Unlike the sep argument to paste(), which is a character string, the
second argument to strsplit() can be a regular expression. The strsplit() function also accepts the fixed, perl, and useBytes arguments
as the other regular expression operators do. Because that second argument
can be a regular expression, extra work is required to split at periods. The
command strsplit(key, ".") produces a split at every character, since
the period can represent any character, so it returns an unhelpful vector of
empty strings. The command strsplit(key, "\\.") or its alternatives,
strsplit(key, "[.]") or strsplit(key, ".", fixed = TRUE)
will split at periods. Remember that the output of strsplit() is always a
list, even if only one character string is being split.
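A brief demonstration of the trap and one of its fixes, on a made-up string:
> strsplit ("a.b.c", ".")                 # every character matches
[[1]]
[1] "" "" "" "" ""
> strsplit ("a.b.c", ".", fixed = TRUE)   # literal period
[[1]]
[1] "a" "b" "c"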
4.4.8 Regular Expressions versus Wildcard Matching

The patterns used in regular expressions are more complicated, and more
powerful, than the sort of wildcard matching that many users will have
seen as part of a command-line interpreter. In wildcard matching, the only
special characters are *, meaning “match zero or more characters,” and ?,
meaning “match exactly one character.” So, for example, the wildcard-type
pattern *an? matches any string that includes an followed by exactly one
more character. R does not use wildcard matching, but it does allow you to
convert a wildcard pattern, which R calls a “glob,” into a regular expression,
by means of the glob2rx() function. For example, glob2rx("*an?")
produces "^.*an.$". Notice the ^ and $ sign; glob2rx() adds those
by default, but they can be omitted with the trim.head = TRUE and
trim.tail = TRUE arguments.
4.4.9 Common Data Cleaning Tasks Using Regular Expressions

Regular expressions make it possible to do many complicated things to text,
specific to your particular problem and data. There are a few operations,
though, that seem to be called for in a lot of data cleaning tasks. In these
sections, we describe how to do some of these.
Removing Leading and Trailing Spaces

One frequent need in text handling is removing leading and trailing spaces from
text. The regular expression "^ *" matches any string with leading spaces,
while " *$" matches one with trailing spaces. To match either or both of these,
we combine them with the pipe character, and use gsub() instead of sub()
since some strings will have both kinds of matches – as in this example.
> gsub ("^ *| *$", "", c(" Both Kinds ", "Trailing
",
"Neither", "
Leading"))
[1] "Both Kinds" "Trailing"
"Neither"
"Leading"

Here, the embedded space inside "Both Kinds" does not match – it is
neither leading nor trailing – and is not deleted. The command gsub(" ",
"", vec) will remove all spaces in every element of vec.
Converting Formatted Currency into Numeric

We see something similar in formatted currency amounts such as
$12,345.67. Here, we need to remove the currency symbol and the
comma before converting to numeric. If the only currency sign we expected to
encounter was the dollar sign, we might do this:
> as.numeric (gsub ("\\$|,", "", "$12,345.67"))
[1] 12345.67

Recall that the as.numeric() will accept and ignore leading and trailing
spaces. More generally, we might delete any non-numeric leading characters
like this:
> as.numeric (gsub ("^[^0-9.]|,", "", "$12,345.67"))
[1] 12345.67
> as.numeric (gsub ("^[^[:digit:]]|,", "", "$12,345.67"))
[1] 12345.67

In this example, the first ^ indicates “a string that starts with ….” The [^0-9.]
bracketed expression starts with a ^, meaning “not,” so that part means “anything except a number or a dot.” The |, sequence says “or a comma,” so the
regular expression will find a leading non-numeric (and non-period) character, as well as any commas anywhere, and delete them all.
Removing HTML Tags

Occasionally, we come across text formatted with HTML tags. These are
instructions to the browser regarding display of the material, so, for example,
<b>Bold</b> formats the word “Bold” in bold face. Other tags indicate
headings, delineate cells of tables, paragraphs, and so on. It can be useful to
strip out all of the formatting information as a first step toward processing
the text. Every tag starts with the angle bracket < and ends with >. So, given
a character string txt, the command gsub("<.*?>", "", txt) will
delete all the brackets (the < and > are treated literally) and all the text between
any pair (here, the .* indicates “zero or more characters” and the ? instructs
the engine to match in a lazy way).
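A small demonstration on a made-up string:
> txt <- "<b>Bold</b> and <i>italic</i> text"
> gsub ("<.*?>", "", txt)
[1] "Bold and italic text"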
Converting Linux/OS X File Paths to R and Windows Ones

The Windows file system uses the backward slash, \, to separate directories
in a file path, whereas Linux and Mac operating systems use the forward
one, /. Suppose we are given a Linux-style path like /usr/local/bin, and
we want to switch the direction of the slashes. The command gsub("/",
"\\\\", "/usr/local/bin") will produce the desired result. To make
the change in the other direction, the command gsub("\\\\", "/",
"\\usr\\local\\bin") will convert Windows path separators to Linux
ones. As an alternative in this case, we can specify the matching pattern exactly
with a command like gsub("\\", "/", "\\usr\\local\\bin",
fixed = TRUE).
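Both directions, shown at the command line:
> gsub ("/", "\\\\", "/usr/local/bin")
[1] "\\usr\\local\\bin"
> gsub ("\\\\", "/", "\\usr\\local\\bin")
[1] "/usr/local/bin"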
4.4.10 Documenting and Debugging Regular Expressions

Regular expressions are complicated, and debugging them is hard. It is annoying (and time-consuming) to try to fix a regular expression that you know is
wrong, but you’re not sure why. It is worse to have one that is wrong and not
know it. There are online aids to diagnosing problems with regular expressions that we have found useful. An Internet search will turn up a number of
helpful sites – but make sure that the site you find describes the regular expression type (POSIX with GNU extensions or PCRE) that you use. Because regular
expressions are complicated, be sure to document them as well as you can.
Write out the patterns you expect to match and the rules you use to match
them.
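One simple habit is to build a pattern from named, commented pieces; here is
a sketch (the object names are ours) using the date pattern from earlier in
this section:
day.re  <- "[0-3]?[0-9]"      # day: optional tens digit, then a digit
year.re <- "[1-2][0-9]{3}"    # year: four digits starting with 1 or 2
date.re <- paste0 (".*", day.re, ".*(", mo, ").*", year.re)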

4.5 UTF-8 and Other Non-ASCII Characters
4.5.1 Extended ASCII for Latin Alphabets

Up until now we have implicitly been dealing only with the “usual” characters,
those found on a keyboard used in English-speaking countries. The starting
point for the way characters are displayed is ASCII, a table that gives characters and the corresponding standard digital representations. ASCII provides
representations of only 128 characters, many of which are unprintable “control” characters, such as tab, new-line, or the command to ring the bell of an
old-fashioned teleprinter. Much of the text we handle in our work is of this
sort, but ASCII does not include codes for letters with accents or other diacritical marks, required for many Western European languages. Every computer
today honors a much broader character set, often based on a standard named
latin1, but realized in slightly different ways by different manufacturers. For
example, Windows uses its own “Win-1252” table, which includes some characters not found in latin1, such as the Euro currency symbol and the curly
“smart quotes,” and Apple OS X uses a table called “Mac OS Roman.” Each
character has a hexadecimal representation – for example, the upper-case E
with a circumflex, Ê, has code ca, and typing "\xca" into R (with quotation marks because this is text) will produce that character. The \x is used to
introduce hexadecimal notation in R, and it is case-sensitive – \X may not be
used – but the hexadecimal digits themselves are not case-sensitive. Entering
text in hexadecimal is different from entering numeric values in hexadecimal.
Typing 0xca produces the number whose hexadecimal value is ca, that is, the
number 202. Typing "\xca" produces the character whose code in the ASCII
table is ca, that is, Ê.
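For example (the display of the accented character depends on your fonts and
locale):
> "\xca"    # the character whose code is ca
[1] "Ê"
> 0xca      # the number whose hexadecimal value is ca
[1] 202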
Characters represented by their hexadecimal codes can be used just like regular characters, as arguments to grep() or other functions. (They can also
be entered in other, different ways that depend on your computer and keyboard.) Almost all of these characters will display on your screen, depending
on which fonts you have installed. One exceptional character is the so-called
null character, which has code 00 (following the convention that every character requires two hexadecimal digits). This character is not permitted in R text;
if needed nulls can be held in objects of class raw, but they should be avoided.
In Chapter 6, we describe how you can skip null characters when reading data
from outside sources.
The Windows and OS X character codes generally coincide. The one commonly encountered character for which the two disagree is the Euro currency
symbol, €, which was introduced after the latin1 standard was decided. In
the Win-1252 table, the symbol has hexadecimal value 80, whereas in OS X, it
has db.

4.5.2 Non-Latin Alphabets

Of course, the need for standardization goes beyond the Euro sign. Increasingly,
with the availability of data from social media and other sources, analysts
need methods to read, store, and process characters from very different languages such as Chinese, Arabic, and Russian. The computing communities have
settled on Unicode, which is a system that intends to describe all the symbols
in all the world’s alphabets. Unicode values are shown in R by preceding them
with \U (or \u, but the upper-case U is more general). Unicode includes ASCII
as a subset. For example, the lower-case letter k has an ASCII and Unicode representation as the hexadecimal value 6b, so typing "\U6b" or "\U006B" into
R will produce a lower-case k. As with other hexadecimal encodings, Unicode
characters may be in either case.
As a non-Latin example, the two Chinese characters 中国 represent the word
“China” in (simplified) Chinese. These cannot be represented in ASCII, but
their Unicode representations are (from left) "\U4E2D" and "\U56FD", and
these values can be entered (inside quotation marks) directly in R, as in this
example:
> "\U4e2d\U56fd"
# If fonts permit
[1]
> nchar ("\U4e2d\U56fd")
[1] 2
# Two characters...
> nchar ("\U4e2d\U56fd", type = "bytes")
[1] 6
# ...requiring six bytes in UTF-8

There are several ways to represent Unicode, but the most popular, particularly in web pages, is UTF-8. In this encoding, each character in Unicode is
represented by one or more bytes. For our purposes, it is not important to
know how the encoding works, but it is important to be aware that some characters, particularly those in non-European alphabets, require more than one
byte. In the example above, the two Chinese characters take up six bytes in
UTF-8.
Depending on your computer, its fonts, and the windowing system, the Chinese characters may not appear. Instead, you might see the Unicode representation (such as \U4e2d), an empty square indicating an unprintable character, an
empty space, or even, on some computers, some seemingly garbled characters
such as ä¸­å›½. Sometimes these characters indicate the latin1 encoding, but
on some computers the very same representation will be used for UTF-8. You
can ensure that the computer knows these characters are UTF-8 by examining
their encoding (the following section). The important point is that the display
of UTF-8 characters can be inconsistent from one machine to the next, even
when the encodings are correctly preserved. We talk about reading and writing
UTF-8 text in Section 6.2.3.

4.5.3 Character and String Encoding in R

Handling Unicode in R requires knowledge of one more detail, which is “encoding.” R assigns an encoding to every element in a character vector (and
different elements in a vector may have different encodings). ASCII strings are
unencoded (so their encoding is marked as unknown); strings with latin1
characters (but no non-Latin Unicode) are encoded as latin1 and strings
with non-Latin Unicode are encoded as UTF-8. The Encoding() function
returns the encoding of the strings in a vector and iconv() will convert the
encodings. In the following first example, we create a latin1 string and look
for the à character using regexpr(). The search succeeds whether the regular
expression is entered in latin1 style (as "\xe0"), Unicode style (as "\Ue0"),
or directly with the keyboard. In each case, the à is found in location 9 as we
expect.
> (yogi <- "It's d\xe9j\xe0 vu all over again.")
’j‘
[1] "It's d e
a vu all over again."
> Encoding (yogi)
[1] "latin1"
> c(regexpr ("\xe0", yogi), regexpr ("\ue0", yogi),
regexpr ("‘
a ", yogi))
[1] 9 9 9

Different encodings only cause problems in the rare cases where the Win-1252
and Mac OS Roman pages disagree with Unicode, and the primary example of
this issue is, again, the Euro sign. In this example, we create a string containing
the Euro sign using the Windows value "\x80" (to repeat this example with OS
X, use "\xdb"). We then use grepl() to check to see if the sign is found in the
string. R encodes the string as latin1 when it sees the non-ASCII character.
Here, the Euro sign in the latin1 string is not matched by the Unicode Euro,
but after the string is converted into UTF-8, the Unicode Euro is matched.
> (bob <- "bob owes me \x80123")
[1] "bob owes me €123"
> Encoding (bob)
[1] "latin1"
> (euro <- "\U20ac")
[1] "€"
> Encoding (euro)
[1] "UTF-8"
> grepl (euro, bob)
[1] FALSE
> (bob <- iconv (bob, to = "UTF-8"))
[1] "bob owes me €123"
> grepl (euro, bob)
[1] TRUE

# Assign Unicode Euro

# Is it there?
# Convert to UTF-8
# Is it there?

Notice that iconv() has no effect on strings that contain only ASCII text.
These will continue to have encoding “unknown.”
UTF-8 is vital for handling non-European text. Although the display is not
always perfect, R is usually intelligent about handling UTF-8 once it is read
in. UTF-8 text behaves as expected in regular expressions, paste() and other
string manipulation tools. R’s functions to read from, and write to, files also support the notion of encoding in UTF-8 and other formats. We talk more about
reading and writing UTF-8 in Chapter 6.
We have noted that the display of UTF-8 strings can be unexpected on some
computers (at least, for some characters). Even on computers equipped with
the correct fonts, though, an issue arises when UTF-8 characters are part of a
data frame. When the print() function is applied to a data frame, it calls the
print.data.frame() function, which in turn calls format(). The latter,
though, reacts poorly to UTF-8, often converting it into a form like <U+4E2D>.
In this example, we create a data frame with those Chinese characters and show
the results of printing the data frame.
> data.frame (a = "\U4e2d\U56fd", stringsAsFactors = FALSE)
                 a
1 <U+4E2D><U+56FD>

Here, the data.frame() command produced a data frame whose
one entry was two characters. The data frame, as displayed by the
print.data.frame() function, shows the <U+4E2D>-type notation.
Despite the display, the underlying values of the characters are unchanged – as
seen in the next command.
> data.frame (a = "\U4e2d\U56fd",
              stringsAsFactors = FALSE)[1,1]
[1] "中国"

R shows the expected result because print() is being called on a vector, not
a data frame.
Sometimes, UTF-8 is inadvertently saved to disk in the <U+4E2D> form as
literal characters – < followed by U, and so on. At the end of the chapter, we
show one way to reconstruct the original UTF-8 from this representation.

4.6 Factors
4.6.1 What Is a Factor?

A factor is a special type of R vector that looks like text but in many cases
behaves like an integer. Factors are important in modeling, but they often cause
trouble in data entry and cleaning. In this section, we describe how factors are
created, how they behave, and how to get them to do what you want them to do.

Factors arise in several ways. You can create a factor vector from some other
vector using the factor() or equivalent as.factor() function; this will
often be a final step, after data cleaning has been completed and modeling is
about to start. Factors are also created automatically by R when constructing
data frames, or when character vectors are added into data frames, with the
data.frame() or cbind() functions (Sections 3.4 and 3.7.1), or when reading data into R from other formats (Section 6.1.2). In both of these cases, the
behavior can be changed through a function argument or global option.
Factors are useful in a number of places in R but particularly in modeling.
They provide a natural and powerful way of representing categorical variables
in a statistical model. However, we recommend that you only turn character
vectors into factors when all the data cleaning is finished and it is time to start
modeling. Chapter 7 shows a complete data cleaning example in depth and
there we ensure that our character data starts out and remains as character.
Still, it is important to understand how factors work in R.
Think of a factor as having two parts. One part is the set of possible values, the
“levels.” In a manpower example, the levels of a factor named “Gender” might
be “Male” and “Female,” and perhaps a third called “Not Recorded.” The second part is a set of integer codes that R uses to represent and store the levels.
These codes start at 1 and go up. By default, R assigns codes to levels alphabetically – so in this example, “Female” would be represented by 1, “Male” by 2,
and “Not Recorded” by 3. The class() of a factor vector is factor, showing its special nature, but the mode() of a factor vector is numeric, and the
typeof() is integer, referring to the underlying codes that R stores. The
advantage of this representation is efficiency: in a data set of a million observations, it is clearly much more efficient to store a million small integers than to
store millions of copies of longer strings.
4.6.2 Factor Levels

Once a set of levels is defined for a factor, it is resistant to change. If you try to
change a value of one of the elements of a factor vector to a new value that is
not already a level, R sets that value to NA and issues a warning. Conversely, if
you remove all the elements with a particular value from the vector, that value
is still one of the levels. In this example, we create a factor whose levels are the
three colors of a traffic light.
> (cols <- factor (c("red", "yellow", "green", "red",
                     "green", "red", "red")))
[1] red    yellow green  red    green  red    red
Levels: green red yellow
> table (cols)
cols
 green    red yellow
     2      4      1

We can tell that the result is showing factor levels, rather than character strings,
because there are no quotation marks and because R also prints out the levels
themselves. Notice that the levels consist of the unique values in the vector,
sorted alphabetically, and that the table() command performs as expected
on the factor vector. The levels and labels arguments to the factor()
function control the setting and ordering of the factor’s levels. In the following
example, we show what happens when we exclude the elements whose values
are green from the vector.
> cols[cols != "green"]
[1] red
yellow red
red
red
Levels: green red yellow
> table (cols[cols != "green"])
green
red yellow
0
4
1

In this example, we see that the green level is still present in the vector,
even though none of the elements in the vector have that value. Moreover,
the table() command acknowledges the empty level. This can be annoying
when many levels are empty, but it can also be helpful when, for example,
levels are months of the year and some sources omit some months. In this
case, tables constructed from the different sources can be expected to line up
nicely.
Another way in which factor levels are resistant to change is shown in this
example, where we try to change the value yellow to amber. We start by
making a copy of cols called cols2.
> cols2 <- cols
> cols2[2] <- "amber"
Warning message:
In `[<-.factor`(`*tmp*`, 2, value = "amber") :
  invalid factor level, NA generated
> cols2
[1] red    <NA>   green  red    green  red    red
Levels: green red yellow

This assignment failed because amber is not one of the levels of the factor
vector cols2. It would have been okay to assign to the yellow element of
our vector the value green or red because those levels existed in the factor.
But trying to assign a new value, such as amber, generates an NA. Notice how
that NA is displayed with angle brackets, as <NA>, to help distinguish it from a
legitimate level value of NA.
The levels() function shows you the set of levels in a factor, and you can
use that function in an assignment to change the levels. Here, we show how we
might have changed the yellow level to have a different label.

> levels(cols)
[1] "green"  "red"    "yellow"
> levels(cols)[3] <- "amber"
> cols
[1] red   amber green red   green red   red
Levels: green red amber

This operation changes only the level labels; the underlying integer values are
not changed. Here, then, the labels are no longer in alphabetical order. We
often want to control the order of the levels in our factors; a good example is
when we tabulate a factor whose levels are the names of the months. By default,
the levels are set alphabetically (April, then August, and so on, up to September) – this affects the output of the table() function (and more, like the way
plots are laid out). The order of the levels can be specified in the original call to
the factor() function, and we can re-order the levels using another call to
factor(), as in this example:
> levels(cols)
[1] "green" "red"   "amber"
> factor (cols, levels = c("red", "amber", "green"))
[1] red   amber green red   green red   red
Levels: red amber green

In this example, we changed the order of the levels through use of the factor() function.
To repeat, assigning to the levels() function changes only the labels, not the
underlying integers. The following example shows one common error in factor
handling, which is assigning levels directly.
> (bad.idea <- cols)
[1] red   amber green red   green red   red
Levels: green red amber
> levels(bad.idea) <- c("red", "amber", "green")
> bad.idea
[1] amber green red   amber red   amber amber
Levels: red amber green

Here, the elements of bad.idea that used to be red are now amber. If you
use this approach, make sure this is what you wanted.
The feature of R that causes more data-cleaning problems than any other,
we think, is this: Factor values are easy to convert into their integer codes but
we almost never want this. In the following section, we see an example of how
having a factor can produce unexpected results.
4.6.3 Converting and Combining Factors

To convert a factor f to character, simply use as.character(f). Actually,
the help files tell us that it is “slightly more efficient” to use levels(f)[f].

Here, the interior [f] is indexing the set of levels after converting f,
internally, to its underlying integer codes. Usually, we use the slightly less
efficient approach because we think it is easier to read. R’s conversion of factors
to integers can be useful when exploited carefully; this arises more often in
plotting than in data cleaning.
When a factor f has text labels that look like integers, it is tempting to try to
convert it directly into a numeric vector using as.numeric(). This is almost
always a mistake; convert levels to numeric only after first converting to character. This example shows how this conversion can go wrong. We start by creating
a factor containing levels that look numeric, except that one of the values in the
vector (and therefore one of the levels of the factor) is the text string Missing.
This factor is intended to give the indices of elements to be extracted from the
vector src.
> wanted <- factor (c(2, 6, 15, 44, "Missing"))   # indices
> src <- 101:200                # vector to extract from
> as.numeric (wanted)           # ...but this happens
[1] 2 4 1 3 5
> src[wanted]
[1] 102 104 101 103 105

When wanted is created, its text labels ("2", "6", …, "Missing") are
stored, together with its integer codes. By default, these are assigned according
to the alphabetical order of the labels; so "15" gets code 1, "2" gets code 2,
"44" gets code 3, and so on. When we enter src[wanted], R uses these
integer codes to extract elements from src. If we actually want the 2nd, 6th,
15th, and so on elements of src, we have to convert the elements of wanted
to character first, and then to numeric, as in this example.
> src[as.numeric (as.character (wanted))]
[1] 102 106 115 144 NA
Warning message:
NAs introduced by coercion

Here, the warning message is harmless – it indicates that the text Missing
could not be converted to a numeric value.
One time that the behavior of factors can be helpful is when we need to
convert text into numeric labels for whatever reason. For example, given
a character vector sex containing the values "F" and "M", we might be
called on to produce a numeric vector with 0 for "F" and 1 for "M". In
this case, factor(sex) creates a factor whose underlying codes are 1 and 2; as.numeric
(factor(sex)) creates an integer vector with values 1 and 2; and therefore
as.numeric(factor(sex)) - 1 produces a numeric vector with values
0 and 1.
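A minimal sketch, with a made-up vector:
> sex <- c("F", "M", "M", "F")
> as.numeric (factor (sex)) - 1
[1] 0 1 1 0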

It is surprisingly difficult to combine two factor vectors, even if they have the
same levels. R will convert both vectors to their underlying integer codes before
combining them. Our recommendation is to always convert factors to characters before doing anything else to them. There is one happy exception, though,
when two or more data frames containing factors are being combined with
rbind() (Section 6.5). Other than in this case, however, combining factor vectors will usually end badly. We recommend converting factors into character,
combining, and then, if necessary, calling factor() to return the new vector
to factor form.
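Here is a sketch of that recipe on two small factors of our own:
> f1 <- factor (c("a", "b"))
> f2 <- factor (c("b", "c"))
> factor (c(as.character (f1), as.character (f2)))
[1] a b b c
Levels: a b c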
4.6.4 Missing Values in Factors

Like other vectors, factors may have missing values. Missing values look like
NA values in most vectors, but in factors they are represented by <NA> with
angle brackets. This level is special and does not prevent you from having a real
level whose value is actually <NA>, but you should avoid that. (Analogously, it’s
permitted to have the string value "NA", and it is a good idea to avoid that, too.)
Values of a factor that are missing have no level. In this example, we create a
vector with missing values, and also with values that are legitimately "NA" and
"".
> (f <- factor (c("b", "a", "NA", "b", NA, "a", "c",
"b", NA, "")))
[1] b
a
NA
b
 a
c
b
 
Levels:  a b c NA # alphabetized by default
> table (f, exclude=NULL)

a
b
c
NA 
1
2
3
1
1
2
levels (f)
[1] "" "a"
"b"
"c"
"NA"

The levels() function makes no mention of the true missing values, since
they do not have a level. The first count in the output of table(), under the
empty heading, describes the final element of the vector, whereas the last, under <NA>, refers to the two items
that really were missing. Clearly there is a possibility of confusion here.
When elements of a factor vector are missing, the addNA() function can be
used to add an explicit level (which is itself NA) to the factor. More often we
want to replace the NA values with an explicit level so that, for example, those
entries are accounted for in the result of table(). In this example, we show
one way to add such a level.
> (f <- factor (c("b", "a", NA, "b", "b", NA, "c", "a")))
[1] b
a
 b
b
 c
a
Levels: a b c
> f <- as.character(f)
# Convert to character
> f[is.na (f)] <- "Missing"
# Replace missings

> (f <- factor (f))            # Re-factorize
[1] b       a       Missing b       b       Missing c       a
Levels: a b c Missing

Here, the factor is converted to character, missing values replaced by a value
like Missing, and then the vector converted back to factor.
4.6.5 Factors in Data Frames

Factors routinely appear in data frames, and, as we have mentioned, they are
important in R modeling functions. Factors inside data frames act just like factors outside them (except sometimes when printing, as we saw with Chinese
characters in an earlier example) – they have a fixed set of levels and they are
represented internally as integers. A few points should be noticed here. First,
as we mentioned above, R is not good at combining factor vectors on their
own, but when data frames containing factors are combined with rbind(),
R creates new factors from the factors in the input, extending the set of levels
to include all the levels from both data frames. The set of levels is formed by
concatenating the two initial sets; the levels are not re-sorted. (If a column is
factor but its corresponding column in another data frame is character, then the
resulting combined column will have the class of the column in the first data
frame passed to rbind().) Second, applying functions to the rows of a data
frame containing factors can produce unexpected results. We discuss applying functions to the rows of a data frame in Section 3.5 and the concerns there
apply even more to data frames containing factor levels. Our recommendation
is to not use apply() functions on data frames, particularly with columns of
different types. Instead, use sapply() or lapply() on columns. If you need
to process rows separately, loop over the rows with a command like lapply
(1:nrow(mydf), function(i) ...) where your function operates on
mydf[i,], the ith row of the data frame mydf.
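As a sketch of that idiom, on a tiny made-up data frame, each element of the
result is a one-row data frame whose columns keep their classes:
> mydf <- data.frame (x = 1:2, g = factor (c("a", "b")))
> lapply (1:nrow (mydf), function (i) mydf[i, ])
[[1]]
  x g
1 1 a

[[2]]
  x g
2 2 b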

4.7 R Object Names and Commands as Text
4.7.1 R Object Names as Text

In some data cleaning problems, a large set of related objects need to be created or processed. For example, there might be 500 tables stored in disk files,
and we want to read them all into R, saving them in objects with names such as
M2.2013.Jan, M2.2013.Feb, · · ·, and M2.2016.Dec. Or, we might have
data frames named p001, p002, · · ·, p100 and we want to run a function
on each one. It is easy enough to construct the set of names using paste()
and sprintf() (see Section 4.2.1). But there is a distinction between the
name "p001" (a character string with four characters) and the R object p001

(a data frame). The R command get() accepts a character string and returns
the object with that name. (If there is no object by that name, an error is produced; the related function exists() can test to see whether such an object
exists, and get0() allows a value to be specified in place of the error.)
One place where get() is useful is when examining the contents of your
workspace. The ls() command returns the names of the objects there; by
using get() in a loop we can apply a function to every object in the workspace.
For example, the object.size() function reports the size of an object in
your workspace in bytes (by default). This function operates on an object, not
a name in character form. So often we do something like this: first, we produce the set of names of the objects of interest, perhaps with a command like
projNames <- ls(pattern = "^projA") to identify all the names of
objects that start with projA. Then, the command sapply(projNames,
function(i) object.size(get(i))) passes each name to the function, and the function uses get() to produce the object itself and report its
size. The result is a named vector of the sizes of every object in the workspace
whose name starts with projA.
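A sketch of that pair of commands, using two illustrative objects of our own:
projA.nums <- rnorm (1000)    # two objects whose names start with projA
projA.text <- letters
projNames <- ls (pattern = "^projA")
sapply (projNames, function (i) object.size (get (i)))
# returns a named vector of sizes in bytes; the values depend on your platform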
The complement of get() is assign(). This takes a name and a value and
creates a new R object with that name and value. Be careful; it will over-write
an existing object with that name. Assigning is useful when each iteration of
a loop produces a new object. In the following example, we use a for() loop
to create an item named AA whose value is 1, one named BB with value 2, and
so on, up to an object ZZ with value 26. (We used double letters here to avoid
creating an item F that might conflict with the alias for FALSE.)
> for (i in 1:26)
      assign (paste0 (LETTERS[i], LETTERS[i]), i, pos = 1)
> get ("WW")                   # Example
[1] 23
> # Remove the 26 new objects from the workspace
> remove (list = grep ("^[A-Z]{2}$", ls (), value = T))

The final command uses a regular expression to remove all items in the
workspace whose names start (^) with a letter ([A-Z]) that is repeated
({2}) and then come to an end ($). The remove() and rm() commands
operate identically. Notice the pos = 1 argument to assign(). At the
command line this has no effect. Inside a function it creates a variable in your
R workspace, not one local to the function. We discuss the notions of global
and local variables in Section 5.1.2.
4.7.2 R Commands as Text

It is also possible to construct R commands as text and then execute them. Suppose in our earlier example that we have objects p001, p002, · · ·, p100 and
we want to run a function report() on each one, producing results res001,

· · ·, res100. We could use the get() and assign() approach from above,
like this:
nm <- paste0 ("p", sprintf ("%03d", 1:100))
#
res <- paste0 ("res", sprintf ("%03d", 1:100))#
for (i in nm) {
#
result <- report (get (i))
#
assign (res[i], result, pos=1)
}

Object names
Result names
Begin loop
Run function

But it is easy to think of more complicated examples where each call is different.
Perhaps the caller needs to supply the month and year associated with a file as
an argument, or perhaps each call requires an additional argument whose name
also varies. In these cases, it can be useful to construct a vector of R commands
using, say, paste0(), and then execute them. This requires a two-step process: first the text is passed to parse() with the text argument, to create an
R “expression” object; then the eval() function executes the expression. For
example, to compute the logarithm of 11 and assign it to log.11 we can use
the command eval(parse(text = "log.11 <- log(11)")). After
this command runs, our R workspace has a new variable called log.11 whose
value is about 2.4.
Imagine having objects p001, p002, …, p100 and also q001, q002, …,
q100, and suppose we wanted to run res001 <- report(p001, q001),
res002 <- report(p002, q002) and so on. It is simple to construct a
set of 100 character strings containing these commands:
> num <- sprintf ("%03d", 1:100)   # 001, 002, etc.
> pnm <- paste0 ("p", num)
> qnm <- paste0 ("q", num)
> rnm <- paste0 ("res", num)
> cmd <- paste0 (rnm, " <- report (", pnm, ", ", qnm, ")")
> cmd[45]                          # as an example
[1] "res045 <- report (p045, q045)"

Now all 100 reports can be run with the command eval(parse(text =
cmd)). This approach can both save time and cut down on the errors associated with copying and modifying dozens – or hundreds – of similar lines
of code.
As a final example, we encountered a problem with some UTF-8 data
(Section 4.5), which we solved with regular expressions (Section 4.4) and
eval(). Under some circumstances, UTF-8 can be saved to disk as ASCII in
a form like "<U+4E2D><U+56FD>" – that is, with a literal representation of
<, U, +, and so on. To convert this into “real” UTF-8, we used regular expressions and the gsub() command to delete each > and to replace each <U+ with
\U; we then added quotation marks and used parse() and eval() to turn
the result into a real UTF-8 string.
> inp <- "<U+4E2D><U+56FD>"              # ASCII (not UTF-8)
> (out <- gsub (">", "", inp))           # remove > chars
[1] "<U+4E2D<U+56FD"
> (out <- gsub ("<U\\+", "\\\\U", out))  # change <U+ to \U
[1] "\\U4E2D\\U56FD"
> (out <- paste0 ("\"", out, "\""))      # add quotation marks
[1] "\"\\U4E2D\\U56FD\""
> eval (parse (text = out))
[1] "中国"
> funk <- function (x, y = 2) {
      # Example function to compute sqrt (x) + y
      # Arguments: x, numeric; y, numeric
      if (x < 0) {
          warning ("Negative number cannot be funk-i-fied")
          return (NA)
      }
      return (sqrt(x) + y)
  }
> funk (x = 9, y = 3)    # run with x = 9, y = 3
[1] 6

In the final command we run the function, passing in both arguments explicitly. Just typing the name funk, without parentheses, causes R to print out the
function itself. Notice that our function includes some comments. In this case,
we have listed the arguments together with their types. It is important to document all of your functions, even the ones you only plan to use yourself. In
addition to the arguments, this documentation might include the date, version
number, author, or other relevant information.
There are several aspects to a function, and you will need to understand
all of them to use functions properly. These include the arguments (information passed into the function), the return value (information computed and
returned), and side effects (actions by the function beyond simply computing
the return value). This section describes the different pieces of an R function.
5.1.1 Function Arguments

An argument is a value passed to a function. The aforementioned funk function has two arguments named x and y. R functions may have as many arguments as needed (or none at all). Arguments may be vectors or data frames or
lists or functions or any other R object. This means that if you develop a function, particularly for other users, the first thing it should do is to ensure that
the arguments are of the types expected by the user. For example, the funk
function checks to see that the argument x is a non-negative number. But what
happens if the user passes a data frame or a character string or a list? The function developer needs to detect these unexpected inputs and stop gracefully. We
discuss error handling in Section 5.3.
Argument Matching

When a user calls a function, he or she may specify the arguments by name.
In this case, R knows unambiguously which input goes with which argument.

So, in our example, the call funk(x = 4, y = 1) will produce the result
3, and so too will the call funk(y = 1, x = 4). The user may also specify
arguments without naming them. In this case, R matches arguments in the call
to arguments in the function declaration by position. Since funk listed x before
y, the call funk(9, 3) is equivalent to funk(x = 9, y = 3).
Arguments are matched by partial names in a way similar to the way that the
elements of a list can be matched by an unambiguous substring (see Section
3.3.1). If a function g has arguments dimension, data, and subset, for
example, the user may specify s, su, or sub to supply the subset argument.
An argument d is ambiguous and will produce an error, but da would suffice
to supply the data argument. While this is permitted, we recommend using
full names where possible. This enhances readability and lessens the chance of
confusion if a function is updated later to include more arguments.
If some arguments are named and others are not, named arguments are
matched by name and the others by position. In the example of the previous paragraph, the call g(data = 2, 5, 3) will assign 2 as the data
argument, then 5 as the dimension argument, and then 3 as subset. In
interactive work we often match arguments by position, but when constructing
code we plan to save, re-use, distribute or archive we try to specify arguments
by name.
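Using the funk() function from above:
> funk (9, 3)        # both arguments matched by position
[1] 6
> funk (y = 3, 9)    # y by name, x by position
[1] 6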
The Ellipsis

Some functions are defined with one special argument, the ellipsis (...). This
is R’s mechanism for allowing functions to accept variable numbers of arguments. The ellipsis captures all of the arguments that are not otherwise matched
and presents them to the function as a list. One complication is that the names
of function arguments defined after the ellipsis may not be abbreviated. For
example, the table() function takes the ellipsis as its first argument; this
allows us to pass as many vectors as needed into the function. Subsequent arguments include the exclude and useNA arguments described in Section 2.5. In
this example, we show what happens when one of those names is abbreviated.
> table (c(1, 1, 2, 3, 1, NA, 2, 1), use = "always")
Error in table(c(1, 1, 2, 3, 1, NA, 2, 1), use = "always") :
  all arguments must have the same length
> table (c(1, 1, 2, 3, 1, NA, 2, 1), useNA = "always")
   1    2    3 <NA>
   4    2    1    1

In the first command, table() sees two vectors to tabulate. The use argument is insufficient to act as the useNA instruction, so it is treated as an input
vector of length 1 (with the value "always"). Unable to create a two-way table
from vectors of different lengths, table() produces an error. In the second
command, the complete name successfully indicates the action to perform with

the NA. We can examine the arguments of table() using the args() function, as follows:
> args (table)
function (..., exclude = if (useNA == "no") c(NA, NaN),
    useNA = c("no", "ifany", "always"),
    dnn = list.names(...), deparse.level = 1)
NULL

This output shows that the ellipsis precedes the useNA argument in the definition. Therefore, the name useNA needs to be specified in its entirety.
Very often the ellipsis is one of the final arguments in the function definition.
This has two consequences. First, arguments defined after the ellipsis must be
matched by name exactly. If not – if you supply an argument whose name is
not in the declaration – that argument will often “fall into” the ellipsis and be
ignored. For example, when you type the name of a data frame at the command
line, R invokes a particular function called print.data.frame(). This
function takes an argument called digits that specifies the precision of the
printing of numeric columns – but this argument is defined only after the
ellipsis, so its name must be matched exactly. Specifying digit = 3, for
example, has no effect – the name digit does not match an argument exactly,
so the argument digit = 3 is given to the ellipsis, which ignores it. In this
example, we construct a data frame and then use print.data.frame() to
display it.
> (newdf <- data.frame (a = c(1, 2.345)))
      a
1 1.000
2 2.345
> print.data.frame (newdf, digits = 2)
    a
1 1.0
2 2.3
> print.data.frame (newdf, digit = 2)
      a
1 1.000
2 2.345

Here, the mis-typed argument name digit has had no effect on the output,
but no error or warning is produced. This points up the importance of comparing the output you see to what you expect. Similarly, arguments that are
unmatched are often ignored in functions with the ellipsis; in this example, we
show how a made-up argument name does not produce an error or warning
in print.data.frame() – but it does in the log() function, which is
defined with no ellipsis.
> print.data.frame (newdf, NOTANARGUMENT = 1)
      a
1 1.000
2 2.345
> log (newdf)   # this is valid, but...
          a
1 0.0000000
2 0.8522854
> log (newdf, NOTANARGUMENT = 1)   # ...this is not.
Error in log(newdf, NOTANARGUMENT = 1) :
  unused argument (NOTANARGUMENT = 1)

In this example, we used the print.data.frame() function to print a
data frame. However, we could have just used print(); R will detect that
the object being printed is a data frame and use the appropriate printing
function. We saw this in Section 4.5.3. This practice of having “generically”
named functions such as print() call functions specific to a data type is an
example of “object-oriented programming.” In this approach, the class of an
object being operated on helps to determine the operation being performed.
We have seen other examples of this behavior earlier in the book. For example,
we saw how seq() calls seq.POSIXt() in Section 3.6.6. We talk a little
more about object-oriented programming in Section 5.6.3.
Most commonly, ellipses are used in functions that will call other functions
and pass arguments down to the function being called. For example, lots of
functions that we write draw plots using the plot() function. The plot()
function accepts dozens of possible arguments reflecting the values of “graphical parameters” such as colors, line sizes, typefaces, and axes. Rather than
prepare our plot function for every possible argument, we will often create a
function similar to the following one.
myplot <- function (x, y, ...) {
    # Do stuff here
    plot (x, y, ...)   # Call plot()
}

In this example, any arguments after x and y are passed to plot() just as they
were supplied by myplot(), in the same order with the same names. If we
needed access to the individual arguments passed in the ellipsis, we could capture those arguments with a command such as mylist <- list(...) and
use mylist like any other R list. In this example, we examine the arguments
passed to our function myplot() to see if the argument xlab is among them,
and, if so, print it.
> myplot <- function (x, y, ...) {
      mylist <- list (...)   # grab extra arguments
      plot (x, y, ...)       # Call plot()
      if (any (names (mylist) == "xlab"))
          cat ("xlab was supplied as ", mylist$xlab, "\n")
  }
> myplot (1:5, 1:5, xlab = "Plot of x vs y")
xlab was supplied as Plot of x vs y

Modifying an argument and passing it along as part of the ellipsis requires
some thought. Suppose our standard required that x-axis labels always be in
upper case. To modify the value of the xlab argument, the easiest way is to add
the x and y arguments to mylist, and then to invoke the plot() function
via do.call(plot, mylist). We rarely need to do this, but for advanced
users we give here an example of how it might be done.
myplot <- function (x, y, ...) {
    mylist <- list (...)   # grab extra arguments
    if (any (names (mylist) == "xlab"))
        mylist$xlab <- casefold (mylist$xlab, upper = TRUE)
    mylist$x <- x; mylist$y <- y
    do.call (plot, mylist)
}
Missing Arguments

Sometimes users do not pass an argument, because it is optional, because they
are satisfied with the default value, or by mistake. The missing() function
returns TRUE if an argument is missing. If an argument named arg has not
been supplied and has no default value, then any reference to arg in the
code, other than a call to the missing(arg) function, will produce an error.
Here, missing(arg) will produce TRUE and the code can then determine
what action to take. When used inside a function, the missing() function
should be used only near the beginning of a function (since, e.g., if arg gets
assigned in the code, then missing(arg) will subsequently be FALSE). An
argument that is not passed explicitly is considered to be missing even if it has
a default value.
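As a small illustration (this function is our own sketch, not from earlier in the chapter), here is a function that computes a weighted mean only when a weights argument is supplied:

wmean <- function (x, weights) {
    if (missing (weights))   # no weights supplied: ordinary mean
        return (mean (x))
    return (sum (x * weights) / sum (weights))
}
wmean (c(1, 2, 3))                        # returns 2
wmean (c(1, 2, 3), weights = c(1, 1, 2))  # returns 2.25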
5.1.2 Global versus Local Variables

A local variable is one that exists only inside your function. All of the variables
inside your function are local. That is, when the function starts, it creates a special area of memory where local variables are stored and manipulated, and when
the function ends, that area of memory (and the variables in it) is destroyed.
Other variables are global; most often global variables are in your workspace.
The R variables in the workspace that are supplied as values to the function are
untouched (but see “side effects,” as follows). This example demonstrates how
workspace values passed to a function are unchanged.
> new <- function (a = 5) {
      a <- a + 1
      cat ("a is now", a, "\n")
      return (a)
  }
> a <- 11
> new (a)
a is now 12
[1] 12
> a
[1] 11

The value of the global variable a in the workspace is 11 before the call and 11
after. The local a, inside the function, changes, but that change has no effect on
the global one. Another way to say this is that R uses the “call by value” approach.
(In one alternative scheme, “call by reference,” the item passed into the argument is a reference to the global variable, so changes to the reference produce
changes in the global variable.) There are some packages that implement “call
by reference” in R, and this approach can bring efficiencies, but discussion of
this strategy is beyond the scope of this book.
It is possible to access global variables directly from inside a function, simply by referring to them by name. We might use system-defined global variables such as pi or letters in our functions, but you should avoid using
your own workspace objects directly in your functions. The objects in your
workspace can change, and in any case such a function would not be usable
by another user. Instead, pass into the function all of the values it will need as
arguments. The one exception to this informal rule is when writing a function
to be used by a one-time application of apply() or its relatives, when operating on, for example, the rows of a data frame. Recall from Section 3.5 that
it is unwise to use apply() directly on a data frame’s rows. In that section,
we operated on the rows of a data frame named dd with code such as sapply(1:nrow(dd), function (i) any (dd[i,] == 1)). Although
the embedded function uses the workspace variable dd, this approach is powerful and convenient.
As a side observation, note the cat() statement designed to print out some
text and the value of a in the aforementioned example. At the command line,
you can simply type a to have the system print its value. Inside a function,
though, a line with only a on it has no effect – the value of a is evaluated, which
in more complicated examples might require some processing, but nothing is
assigned or printed. To print a value to the screen from inside a function, use
cat() or print() explicitly.
5.1.3 Return Values

The most important thing functions do is to produce a return value, which is
the result of the computation done by the function. In R, a function always
produces one return value, so if you want to compute different things inside a
function, and return them, you have to combine them into a vector, matrix, data
frame, or list. A function returns whatever is inside the first call to return()
that it encounters; if there isn’t a call to return(), the function returns whatever the last line it executes produces. You can hide a function’s return value
with the invisible() command, but every function has an output. If you
assign the output of a function with an invisible return value, that value is preserved in the usual way.
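This hypothetical example sketches the behavior:

f <- function (x) {
    invisible (x^2)   # compute and return x^2, but do not auto-print it
}
f (3)        # nothing appears on the screen...
y <- f (3)   # ...but the return value can still be assigned
y            # [1] 9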
Side Effects

Functions can do other things beyond producing return values. These things are
called side effects. One important side effect is to produce graphics. It is also
possible to change global variables with a function, separate from the return
value (see the discussion of assign() in Section 4.7.1), and indeed sometimes R modifies global variables without using the return value, intentionally.
For example, the fix() function (Section 5.1.4) creates and modifies functions, and the edit() function (Section 3.7.4) modifies data frames and other
R objects. We do not recommend changing global variables in your code.
A third, more benign, side effect is to print out information for debugging
purposes. This is very handy, particularly when the function needs to run hundreds or thousands of times. We talk more about debugging in Section 5.3.
Another common side effect is to write files to the disk. These might be text
files with data in them, they might be tables of results, they might be graphics,
or something else. We discuss the ways to get data out of R in Chapter 6.
Side effects can be good or bad. It is important to recognize when one of your
functions produces a side effect. Such a function might even behave differently
at a different time, or on another machine.
When a Function Produces Errors or Warnings

A function that has been successfully created or edited is always syntactically
correct. However, functions can still produce errors: because the inputs
were unexpected, because you tried to read a file that didn’t exist, because
R encountered an NA for which it was unprepared, because R tried to perform
arithmetic on a character vector, or for one of many other reasons. When an
error results, an error message is produced and the function terminates. Local
variables are lost and no return value is produced. However, any side effects
produced by code before the error will already have occurred.
Sometimes, functions produce warnings rather than errors. A function that
produces warnings will nonetheless return a value. Still, unless you’re sure you
understand the cause of the warnings, we recommend that you not ignore them.
We talk more about errors, warnings, and debugging in Section 5.3.3.
Cleaning Up

When a function completes, there might be a few bookkeeping-type tasks
that remain before the result can be returned. For example, files or other
connections (Chapter 6) might need to be closed, graphics devices reset, and
warnings re-armed. Moreover, we usually want these actions performed even if
the function exits prematurely because of an error. The on.exit() function
allows you to specify one or more expressions to be evaluated however the
function ends. We show an example in Section 6.2.6.
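As a sketch of the idea, a function that opens a connection might arrange for it to be closed no matter how the function exits:

firstline <- function (fname) {
    con <- file (fname, "r")   # open a connection to the file
    on.exit (close (con))      # close it on exit -- even after an error
    readLines (con, n = 1)     # return the file's first line
}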
5.1.4 Creating and Editing Functions

It is possible to type a function in at the command line, but this is almost never
practical. Instead, R provides a simple interface to one of your system’s editing
programs to allow you to create and edit functions. The R function is called
fix(). You can create a new function newf by entering fix(newf) and the
editor will appear with an empty function skeleton that looks like this:
function ()
{
}

If the function newf already exists, the fix(newf) command will open
the existing function for editing. The exact editor that R uses depends
on what is available on your system and its name can be displayed by
options()$editor. When you are done editing, exit the editor. In Windows, with the default editor, you can do that by clicking the red X in the top
right and choosing “Yes” to keep modifications; in OS X’s default editor, click the red
dot in the top left and choose “Save.” Do not choose “Save As” from the File
menu. In Linux using the default “vi” editor, type :wq to save and quit, or
:q! to exit without saving your changes. Other editors may require different
keystrokes, of course. If you asked for modifications to be saved, R checks to
see if the new version of the function has no errors and, if so, saves it so that
the function is ready to use.
If R detects that errors have been introduced, it will produce an error message
that, apart from the details, might look something like this:
Error in .External2(C_edit, name, file, title, editor) :
unexpected symbol occurred on line 3
use a command like
x <- edit()
to recover

The second line contains the meat of the error message: the cause (in this case,
“unexpected symbol,” but many other choices are possible) and the location
(here, line 3). The last part of the error message is a specific instruction that
is unclear to many R beginners. It says that to resume editing the flawed function newf, we should enter the command newf <- edit(). The edit()
command, without any argument, edits the function most recently operated on;
here this command re-edits our newf function. If our modifications produce
a valid function, that function is saved to the object newf. Otherwise, the next
error detected will be reported, and we can enter newf <- edit() to edit
the function once again. You will need to produce an error-free version of your
function before R will save it. Edit() without an argument always operates on
the most recently edited function. We recommend using it only immediately
after encountering an editing error and otherwise using fix().
Reading and Writing Functions to and from Disk

It is easy to store functions in readable text files, so this is a natural way to
save and distribute them. We prefer to save functions using the dump()
command. Dump() creates a disk file with the exact text of your function
(including comments, blank lines, and any UTF-8 characters). The first argument to dump()
is a vector of the names of the items to be dumped. The second argument
gives the name of the disk file to be created. For example, to dump a single
function named newf to a file named newf.txt, you can use the command
dump("newf", "newf.txt").
Importantly, dump() also adds an assignment line at the top of the disk
file – so if you dump a function named newf, the first line of the file looks
like newf <-; the second line starts with function and begins the function
definition.
Once a function is on disk, it can be read back into R using the
source() command. This command reads disk files containing R code
and executes the code in your R session. In this example, the command
source("newf.txt") will read the file and re-create the function in your
workspace with its original name. This will over-write an R object named
newf if one exists.
It is equally possible to call dump() with a vector of names of R objects. This
vector can include the names of lists, data frames, or other R objects as well
as functions. So, this is one quick way to transport a set of objects from one
machine or user to another.
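For example, assuming a function newf and a data frame newdf both exist in the workspace (the file name here is our own), one call transports both:

dump (c("newf", "newdf"), "mywork.R")  # write both objects to one file
source ("mywork.R")                    # later: re-create both, by name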
However, dump() is not equipped to handle certain complicated objects.
The saveRDS() function (the letters RDS evoke “R data serialization”)
takes an object and produces a binary disk file with all of that object’s data
and attributes. Each call to saveRDS() applies to exactly one R object.
So, for example, saveRDS(myobj, "newfile") produces a disk file
named "newfile" with a binary representation of the R object myobj.
The complementary action is performed by readRDS(): in this case,
readRDS("newfile") returns the object just as it was saved. Note
that readRDS() does not replace the existing myobj; instead, it simply
returns the object. You can assign the return value with a command such
as newobj <- readRDS("newfile"), which will create a new object
newobj that is identical to the original myobj.
The save() function operates on sets of objects. We call it by passing the
names of the objects (in quotation marks) rather than passing the objects themselves.
The resulting file should be readable by R on any machine or operating system.
(Make sure that the two machines share the same character set, like UTF-8.)
Moreover, the file can be compressed at the time it is created. The complement
of save() is load(); this function reads in a file created by save() and
re-creates the items specified at the time the file was created. An important
distinction between load() and readRDS() is that with load(), all the
objects are automatically re-created with their original names. This means that
R will over-write any existing objects with those names.
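A minimal sketch, using the hypothetical workspace objects results and params (the file name is also ours):

save ("results", "params", file = "mysession.RData")  # names in quotation marks
load ("mysession.RData")  # re-creates results and params with their
                          # original names, over-writing any existing objects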

5.2 Scripts and Shell Scripts
A script is a text file that contains R commands and possibly other lines as well.
We like to differentiate between two sorts of these text files: files we call scripts
and files we call shell scripts. (The word “shell” here refers to a command line,
like Terminal on OS X or cmd.exe on Windows. Admittedly, these two names
are more similar than perhaps they should be.) A script is just a text file with
stuff in it. New scripts are created through the File | New script (or New Document) dialog, and existing ones can be opened under File | Open script (or
Open Document). A script file will contain R commands, but it might also contain comments, musings, invalid pieces of code you plan to fix someday, or
other notes. In normal, interactive usage, the script will be visible in a separate window. You can run the line that the cursor is sitting on with control-R
(command-R in OS X), or highlight a few lines with the mouse and run those,
also with control-R, or run the entire script using “Run all” from the “Edit”
menu. (Those who prefer keyboard commands can run an entire script using
control-A to select all the lines, and then control-R to run them.) After you
modify a script you can save the updated version however your system requires.
We use scripts to store a lot of our commands, and we often run them bit by
bit, interactively, a few lines at a time. Running lines from a script is not like
running a function. Instead it is like typing commands, one at a time, into the
console. There are no “local” variables in a script; every assignment is made
immediately in the global workspace. If you run several lines of a script at once,
and one produces an error, R will still attempt to run all the others. (In contrast,
remember, when a function encounters an error, it quits.)
A script can also be run, all at once, like a function, through the source()
command we described earlier (Section 5.1.4). In this case, like a script run
interactively, there are no local variables. An error terminates a script being
run via source(), and only the commands prior to the error are executed.
Also as with a function, if you want to print intermediate results to the screen
during a script file being source()-d, you need to explicitly call cat() or
print() inside the script. A script to be run in this way is often a good product
to deliver to another R user; you can deliver the data as one part (perhaps using
one of the techniques in Chapter 6) and a script to read or manipulate it, make
computations, draw pictures, or anything else as another. In fact, we sometimes deliver several scripts for performing the different tasks of the project,
together with one parent or “wrapper” script that allows the user to invoke all
of the other scripts in the proper order. The scripts for the extended exercise in
Chapter 8 operate in this way, with a wrapper that can invoke a number of other
scripts. These scripts and the wrapper can be found in the cleaningBook
package.
Ordinary scripts are good for developing techniques to handle data, or for
tasks that will only be done one or two times. A shell script is useful in a production environment where a specific task needs to be performed every day,
for example – frequently and automatically. A shell script is also a text file, but
it is a unit intended to be run all at once by R or another program, not as part of
an interactive session but from an outside “shell” program. In this way, a shell
script is like a script to be run via source(). The difference is that a script is
run from within R, either interactively or via source(), whereas a shell script
is run from a command line without having to open R.
A particular version of the R program called “Rscript” (included when you
acquire R) runs shell scripts. Shell scripts differ from regular scripts in that
the very first line will always look like #!Rscript. Those first two characters,
#!, are sometimes whimsically called “shebang” or “hash bang.” They indicate
that this line is a comment (the #), but that it’s a very special sort of comment that may only be placed on line 1 of the script (the !). The Rscript
after the shebang tells the operating system that the file that contains it is filled
with commands intended to be executed by Rscript. Lots of other programs,
besides Rscript, also support shell scripts, so you might see files that start
#!python or #!ruby or something else.
One note here is that the operating system needs to know how to find
the Rscript program. Generally, the set of folders the operating system
will search is set by the PATH environment variable (see Section 5.4.2 for a
discussion of environment variables). If the PATH has not been set to include
the folder that holds the Rscript program, the first line of the shell script
will need to specify the complete path to Rscript. For example, on one of our Linux machines, a shell script might start with the line #!/usr/bin/Rscript.
It is common for details – information about, say, input files, output location,
numbers of replications, and so on – to be passed into the shell script via environment variables. An alternative, perhaps more R-like, mechanism is given
by the commandArgs() function, which produces a vector of the arguments
passed to Rscript at the time it was called. Either approach requires that the
shell script know where to look for the information it needs, of course, whereas
with an interactive script this information will often be entered by the user at
the command line.
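Here is a minimal sketch of such a shell script using commandArgs(); the path to Rscript, the script name, and the argument layout are our own and will vary by system:

#!/usr/bin/Rscript
# Usage from the shell, for example:  Rscript myscript.R input.csv 100
args <- commandArgs (trailingOnly = TRUE)  # only the user's arguments
infile <- args[1]
reps <- as.numeric (args[2])               # arguments arrive as text
cat ("Processing", infile, "with", reps, "replications\n")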
Once Rscript reaches the end of a shell script, it stops and control returns
to the command-line interface from which it was started. Normally, the point
of the shell script will be to produce some output files, graphics, or informative messages. You can also have the shell script modify the contents of your
R workspace, but this is not the default and we avoid it because it feels messy.
The command Rscript --help will show you some of the options available
when running shell scripts.
5.2.1 Line-by-Line Parsing

If you use scripts, you will probably some day copy a piece of a function into
a script window and run it from there. Remember that a function is an entire
unit, but a script is a sequence of lines. This distinction can cause problems.
Suppose that you have some code like this:
if (i > 100)
    x <- x + 200
else
    x <- x - 200

In a function, this will do just what you think it will; it will look at the value of
i and use that to decide what to do with x. In a script, this code will fail. Here’s
why: remember, the script executes line by line. The first line is clearly incomplete, so R waits for the second line. At the end of the second line, the script
interpreter has seen an entire R expression, and it executes it. If i is > 100
then x is set to x + 200. Now it comes to the third line and it thinks a new
expression has started with an else that has no paired if. Executing a script
is just like typing the lines of the script into the console.
In this example, you would have to use braces to “protect” the else. In the
following code, the second line doesn’t end the expression because there is still
an open brace.
if (i > 100) {
    x <- x + 200
} else {
    x <- x - 200
}

The choice of whether to construct a function or script is a personal one.
As we have said, functions operate on local variables, so there is less danger of
them over-writing items in your global workspace or of them creating unnecessary copies of large data sets. Moreover, R examines functions for errors before
saving them. However, functions are difficult to transport and need to be run
all at once, not in pieces. Scripts are less formal; they can be run bit by bit and
passed around as simple text files. However, they only create global variables,
over-writing the existing objects with the same name in the process.
5.3 Error Handling and Debugging
At some point, sad to say, your function or script will fail unexpectedly or produce unexpected results. You will need to debug it. Debugging is a difficult task;
there are many more ways to do something wrongly than there are to do it correctly, and it is difficult to anticipate all the possible ways your function might
be called. Still, there are several ways to approach debugging, with different levels of complexity. This section describes some of these approaches, from least
to most complicated.
5.3.1 Debugging Functions

cat() Statements

The easiest way to debug is to insert cat() statements into your code at strategic locations, to print out intermediate results. Of course it’s difficult to know in
advance what the best locations will be. If your function is failing with an error
message, then of course you want your cat() to be higher up than the place
where you think the error takes place. We recommend labeling your cat()
statements, particularly if you insert several, to display where in the program
they are placed and what the program is about to do. So, for example, you might
have cat() statements that look like the following:
cat ("A: Start setup, i is", i, " dim (X):", dim (X), "\n")
cat ("B: Top of loop, xcount is", xcount, "\n")
cat ("C: End loop, result[1:3] is", result[1:3], "\n")

and so on. The cat() statements are easy to put in and take out (the labeling
scheme helps ensure you remove them all), but for them to be informative,
you must have selected the proper thing to display. We have sometimes
found it useful to include an argument to the function being debugged that
controls whether this diagnostic printing takes place. Conventionally, such
an argument is called verbose. So some of our functions include a number
of lines like, for example, if(verbose) {cat("Now operating on
file", fname, "\n")}. The verbose argument might be logical, or it
might be numeric, with higher numbers producing more detailed diagnostics.
A little note that writes a line every hundredth iteration or so can be very reassuring. The %% operator gives remainders, so if you have a loop variable named
i, the line if(i %% 100 == 0) cat("We’re on rep", i, "\n")
will print a line when i is 100, 200, and so on.
There is a complication when using a function to print out intermediate diagnostic messages. For efficiency R uses buffered output, which means that it
saves up a lot of these messages to deliver them all at once. For diagnostic purposes, we often want the messages to be shown as soon as they are generated; in
these cases, we turn buffering off. In Windows, you can do that by right-clicking
inside the console window (not on its title bar) and selecting “Buffered Output”;
control-W will toggle its setting back and forth.
The traceback() Function

When an error occurs, your first question will often be where, exactly, the problem was. Although the line number is often printed, that might be unhelpful if
the function that failed is nested inside a sequence of calls. A call to the traceback() function is often the first action to take when an error occurs. Its goal
is to show the sequence of function calls that led to the error and it often (but
perhaps not always) succeeds. We give a brief example in Section 5.3.3. The
recover() function will help advanced users; it lists the set of calls and starts
a browser session in the one that the user selects.
The browser() Function

The browser() function represents a big step forward in interactive debugging. When a function or script encounters a call to browser(), it pauses and
produces a prompt like Browse[1]. (The [1] part indicates that this prompt
arose from a function called at the command line; for a function called by a
function it would be [2], and higher numbers would indicate even more nesting of function calls.) At the browser prompt, you can type the name of an
object to display it, enter other function calls, create local variables, and do
other things you can do in a regular R session – but usually we use the browser
to display or modify the values of variables in the function. There are a few commands you can issue to the browser: c means “continue” (i.e., resume running),
s means “step” (go to the next statement, even if it’s inside another function),
n means “next” (i.e., go to the next statement, treating function calls as if they
were one step), f means “finish this loop or function,” and Q means “quit the
browser.” Since these are command names, if you need to print out the value of
a variable named c or f or another of these, you need to do that explicitly with a
command such as print(c). It is also worth knowing that, inside a function,
the ls() command shows you only the variables local to that function. To see
the variables in the global workspace, specify ls(pos = 1).
Just as we sometimes set up a verbose argument that allows the user to
specify that diagnostic messages be printed out, it is sometimes valuable to add
an argument such as browse that specifies where calls to browser() might
be made. Ensure that this argument is FALSE by default if you give your code
to other users, since the browser prompt has the potential to confuse them.
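A sketch of this pattern might look like the following (the function and its contents are hypothetical):

myfun <- function (x, browse = FALSE) {
    if (browse) browser ()   # pause here only when the caller asks
    # ... the real computation would go here ...
    sum (x)
}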
The debug() Function

Another mechanism for debugging expands on the browser() call. The
debug() function labels a function as “to be debugged,” so that, whenever the
function is run, browsing starts at its top. The label persists until it is removed
with undebug(); the debugonce() function imposes the label for only one
run of the function. Debugging produces the browser prompt; it just saves you
from having to include an explicit call to browser() in the text of the function.
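Using funk from the start of the chapter, the mechanics look like this:

debug (funk)       # label funk "to be debugged"
funk (4)           # browsing now starts at the top of funk
undebug (funk)     # remove the label
debugonce (funk)   # or: browse on the next call only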
5.3.2 Issuing Error and Warning Messages

Sometimes, you will want to stop your function early, perhaps because arguments of the wrong type were passed, or some value important to the computation is missing. Other times, you might want the function to produce a
warning in the R style and then continue processing. The stop() and warning() functions and their relatives perform these tasks in R.
Producing Errors

Stop() stops the function and prints its argument to the console, so in a
function called funk, the code stop("Integer needed") will produce
the message Error in funk() : Integer needed. (Notice that no
“new-line” character is needed in the error text.) Even if the function with the
stop() was called from inside another function, control returns all the way
out to the command line. (In the following section, we show how that behavior
can be modified.)
One very common use of stop() is to test whether all the arguments are
of the expected type. Our code often includes lines such as if(!is.matrix
(X)) stop("X must be matrix"). If multiple strings need to be combined – often a good practice since it allows you to include diagnostic information in the messages – use paste() (Section 4.3) first.
An enhancement to stop() is provided by stopifnot(), which acts
like stop() unless every one of its arguments evaluates to a vector of TRUE
values. This makes it easy to handle a set of arguments at once, as well as
the case where NAs are found inside a vector. Suppose we require that an
argument b must be present, and contain elements that are greater than zero
and also not missing. We could ensure that in a function with code such as
if(any(is.na (b)) || any(b < 0)) {stop("Illegal argument b")} but we would need to have one line for each of the several
arguments that had that requirement. Here we used the double-vertical
bar version of OR so that R would stop evaluating if the first any() were
TRUE. In this case, it would be easier to specify stopifnot(b > 0). This
function evaluates all its arguments with an implicit all() command, and
calls stop() unless all its arguments are TRUE.
Here, if any element of b is negative or NA, the implicit all() command will
return something other than TRUE, and the function would stop with the error
message Error: b > 0 is not TRUE.
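In a function, several requirements can be checked in one call; this sketch (the function is our own illustration) checks two arguments at once:

funk2 <- function (X, b) {
    stopifnot (is.matrix (X), b > 0)  # stop unless every check is all TRUE
    # ... computation using X and b would go here ...
}
funk2 (matrix (1:4, 2), c(1, 2))    # passes both checks
funk2 (matrix (1:4, 2), c(1, NA))   # stops: the b > 0 check fails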
Warnings

In situations that do not require the function to stop entirely, you can issue a
warning to your user. R has two functions with similar names that help you
manage warnings: warning() and warnings(). Like stop(), the warning() function prints out text supplied by the function (often after a call to
paste() to assemble some diagnostic information), but with warning() the
function attempts to continue.
The exact behavior of warning messages is determined by the value of the
warn option. This value can be displayed with a call to options()$warn. By
default, warn has the value 0. This means that warnings are printed after control is returned to the command line. If there are fewer than 10 warnings, their
text is printed out; if there are more, a single message indicating how
many warnings there were is displayed. In that case, as the messages point out,
the individual messages can be accessed with a call to the warnings() function. If warn is set to 1, with the command options(warn = 1), warning
messages appear as they are generated, instead of being saved up until control returns to the console. An even more rigorous choice is warn = 2, which
causes any warning to be treated as an error. This is particularly useful in a
debugging environment and we often choose this option. In your own use of R,
you should investigate every warning until you are certain as to why it is being
produced. We also recommend that you anticipate that your users will ignore
your warnings – because they will.
One final choice of value for warn is possible; if warn is set to a negative
value, all warnings will be suppressed. This is almost never a good idea,
although we do know of one exception. We often run into a case where the
contents of a character vector (say, vec) are mostly numeric, with a few
non-numeric items that we are prepared to convert to NA. For example, vec
might represent the number of days since a customer declared bankruptcy,
and it might have the value c("342", "1101", "Never", "615").
A call to as.numeric(vec) will produce a warning (“NAs introduced by
coercion”) that should be ignored, and we do not like to produce warnings
that we really do want our users to ignore. Happily, the suppressWarnings() function takes care of this by setting warn to −1 just long enough
to execute the expression passed to it. In this case, the call suppressWarnings(as.numeric(vec)) will produce a numeric vector with one
NA – and no warning.
5.3.3 Catching and Processing Errors

An error, as we mentioned earlier, causes R to stop processing and return control to the command line, even if the error takes place in a nested stack of
function calls. There are many circumstances where we would rather intercept
the error processing, take some necessary action, and then resume processing.
As R has evolved, the mechanisms for this have become more sophisticated,
but the most basic of these mechanisms is the try() function. The try()
function lets you “try” a call to an R expression. If the call fails, the return value
is an object of class “try-error,” whereas if it succeeds, the return value is the
value of the call. As an example, consider a function a that computes the square
of its argument. This function, as follows, issues an error if its argument is not
supplied:
a <- function (arg1) {
    if (missing (arg1)) stop ("Missing argument in a!")
    return (arg1^2)
}

Now suppose we have a function b that calls a, but does so without checking to
see whether the argument is supplied. In our example, b has parameters input
(which defaults to 9) and offset, the latter of which is used as the input to
a(). The b function then returns input + a(offset). This example shows
the b function, and the result of its being run with no arguments as input.
# Compute input + a (offset)
> b <- function (input = 9, offset) {
      a.result <- a (offset)
      return (input + a.result)
  }
> b ()
Error in a(offset) : Missing argument in a!

The error took place inside a(), but control was returned to the command line
immediately. If we had not known where the error occurred, we might have
called traceback(). This example shows the result of that call.
> traceback ()
3: stop("Missing argument in a!") at #2
2: a(offset) at #3
1: b()

The output of traceback() is not always helpful or easy to read. We start at
the bottom and work up. In this case, we can see that the error arose in a call
to b(), which called a() at its line 3 (i.e., the third line of the b() function).
The error took place at the second line of a().
In this next version of b(), we use try() to see whether the call to a() can
be completed successfully. If it cannot, we issue a warning and set the value of
offset to 3. This example shows an updated version of b() and the result of
running it.
> b <- function (input = 9, offset) {
      a.check <- try (a.result <- a (offset))
      if (class (a.check)[1] == "try-error") {
          warning ("Call to a() failed; setting a.result = 3")
          a.result <- 3
      }
      return (input + a.result)
  }
> b ()
Error in a(offset) : Missing argument in a!
[1] 12
Warning message:
In b() : Call to a() failed; setting a.result = 3

In this example, the function is completed, returning the value 12. The text of
the error was still produced, though; this can be disconcerting to users, and it
can be turned off with the silent = TRUE argument to the try() function.
Indeed, with both silent = TRUE and no call to warning(), this error will
be handled without notifying the user at all – which very well may not be what
you want. Notice, by the way, that when we examine the class of the a.check
variable, we use class(a.check)[1]. The [1] is there because lots of
R objects have a vector of classes. Comparing that vector to the single value
"try-error" will produce a warning, just the action we’re hoping to avoid.
We have found try() to be particularly useful when we are relying on programs and files outside R’s control. For example, if a call to an outside function
fails, or if one in a series of files cannot be read in, we will usually want to trap
the error, inform the user and continue processing. R also supplies more sophisticated error handling, which allow different treatments of errors and warnings,
which allow functions to signal unusual conditions and other functions to interpret those signals, and which allow more control over restarting. A discussion
of those is outside the scope of a book on data cleaning, but interested R programmers should start at the help page for tryCatch().
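As a taste, here is a minimal tryCatch() sketch in that spirit (the file name is hypothetical); it substitutes NULL, plus a warning, when a file cannot be read:

dat <- tryCatch (
    read.csv ("input-042.csv"),    # one of a series of files
    error = function (e) {
        warning ("Could not read file: ", conditionMessage (e))
        NULL                       # continue with a placeholder value
    })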

5.4 Interacting with the Operating System
A function or script will very often need to interact with the operating system.
For example, it might need to get a list of all the files in a particular directory
whose names end in .data so it can process them. It might need to create a
new sub-directory for results (this is an example of a “side effect”), and so on. A
number of built-in functions manage R’s access to the file system. As examples,
we saw dump() and source() in Section 5.1.4 as ways to interact with files.
Many scripts – particularly those used for cleaning data – start by acquiring
the data from an outside source such as a file or relational database. This is
such an important part of the data cleaning process that we devote the following
chapter (Chapter 6) to the topic of getting data into and out of R. In this section,
we give a couple of examples of the different ways that R can interact with the
operating system, separately from the chore of actually acquiring data from an
outside source.
5.4.1 File and Directory Handling

By default, R presumes that the files it is dealing with are in the working directory. This is normally the directory from which R was started. So, a command
such as source ("myfile.txt") is assumed to refer to a file in the working directory, and a command such as cat(message, file = "output")
will likewise create a file in that directory. The getwd() command prints the
working directory to the screen; the setwd() command allows you to change
the working directory to a new location.
The list.files() command, with no arguments, will show you all the
files in the working directory. This function has a large number of useful
arguments. By default, only the file name, without the path, is returned.
However, it is possible to list the files in a vector of directories, in which case
the full.names = TRUE argument will return a path name (relative to the
working directory) for each file. Full names are also useful when using the
recursive = TRUE argument to find files in the working directory and
all of its subdirectories. More important, perhaps, is the ability to select files
that match a regular expression (Section 4.4). So, for example, the command
list.files("..", pattern = "xlsx*$", recursive = TRUE)
will list all files whose names end in xls or xlsx in any directory underneath
the parent (..) of the working directory. OS X and Linux users may find the
ignore.case argument useful as well, and other arguments control the
inclusion of directories in the listing.
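For instance, under the assumption that the files of interest end in .data (as in the example that opened this section), we might collect their full names and loop over them:

datafiles <- list.files (pattern = "\\.data$", full.names = TRUE)
for (f in datafiles)
    cat (f, "has", length (readLines (f)), "lines\n")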
Beyond merely listing the files, we often want to determine something
about their content. The file.info() command gives some information
about a file – here, the full name will be needed for a file outside the working
directory – and a series of commands with names such as file.copy(),
file.exists() (to check for existence), and the slightly more dangerous
file.remove() are also available. For directories there are the corresponding functions dir.exists() and dir.create() that, respectively, test
for the existence of a directory and create one.
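A typical sequence (the directory and file names are hypothetical) might look like this:

if (!dir.exists ("results"))         # create an output directory once
    dir.create ("results")
file.exists ("mydata.csv")           # TRUE if the file is present
file.info ("mydata.csv")$size        # its size, in bytes
file.copy ("mydata.csv", "results")  # copy it into the new directory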
5.4.2 Environment Variables

An environment variable is a name and (single) value stored inside the operating system. These are available for programs to read and, in some cases, create or update. For example, on most systems, an environment variable named
HOME holds the location of the user’s home directory. When R starts, it reads a
few existing variables (e.g., LC_ALL describes the locale – see Section 1.4.6) and
creates a few of its own (e.g., R_HOME gives the directory where R is installed).
Environment variables are case-sensitive in OS X and Linux and not in Windows, but it is conventional to use all upper case for variable names.
Environment variables provide a convenient way for one program to communicate with another. In R, the set of environment variables is available
through the Sys.getenv() function, which returns a vector whose names
and values are the names and values of all the variables in the environment.
This list is often long and unwieldy, but you can specify particular variables
to extract with a command such as Sys.getenv("R_HOME"). Variables
can be created or updated with Sys.setenv(). So, for example, we might
create a new environment variable named REPS with a command such as
Sys.setenv(REPS = 12). Now if R starts another program, R’s environment will be available in that program, and in particular that program will
be able to determine that the value REPS had been set to "12". (Notice the
quotation marks; environment values are always treated as text by R.) Environment variables are one way to pass information from “outside” into R in,
for example, a shell script. Notice that a function that creates an environment
variable is producing a side effect (see Section 5.1.3).
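The point about text values deserves a small demonstration, using the REPS variable from the previous paragraph:

> Sys.setenv (REPS = 12)
> Sys.getenv ("REPS")
[1] "12"
> as.numeric (Sys.getenv ("REPS")) + 1   # convert before doing arithmetic
[1] 13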

5.5 Speeding Things Up
Some functions are fast and some are slow. Sometimes functions are so slow
that they keep you from doing what you need to do. The slow speed can be a
function of things outside R’s control – maybe a file is just very big, or is being
fetched over a slow network connection – but it can also be a result of inefficient programming. In this section, we talk about how to measure a function’s
performance and then give a few ideas on how to speed things up.
5.5.1 Profiling

The process of measuring how much time and memory a function uses is called
“profiling.” The very simplest way of measuring how much time a function uses
is with the system.time() function, which essentially reads the computer’s
clock at the beginning and end of a call to an R expression and reports the
difference. While this is a useful measure, it does neither divide the time used
into pieces attributable to each of the function’s components nor address memory use.
R has a much more sophisticated profiling tool based on the Rprof() function. This can help you identify both the steps that use a lot of time and also
the ones using a lot of memory. This function writes out a log file describing
what R is doing 50 times a second, by default (and so these files can get big).
After your function terminates, the summaryRprof() function can produce
a report. The help page for Rprof() and the chapter of the online manual referenced there address this topic in detail. While profiling can be important in a
production environment where a function is run thousands of times, we rarely
find need of it in our more interactive functions for data cleaning that are only
run a few times.
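For the curious, a minimal profiling session looks like this (a sketch; the file name and the computation are stand-ins of our own):

Rprof ("profile.out")                    # start writing the log file
x <- replicate (20, sort (runif (1e6)))  # the computation to be profiled
Rprof (NULL)                             # stop profiling
summaryRprof ("profile.out")$by.self     # time attributed to each function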
5.5.2 Vectorizing Functions

It is almost always useful to ensure that a function can act on vectors of any
length as well as on single values. Moreover, using a vectorized function on a
vector will almost always be more efficient than looping over the individual
entries. Often vectorization will happen automatically, at least in part, because
arithmetic functions such as + are themselves vectorized. However, the
funk() function from the start of the chapter is not properly vectorized.
The example in Figure 5.1 shows that function (we have shortened
the text of the warning) together with a vectorized version.
Here, the non-vectorized behavior was associated with the if(x < 0)
statement that was part of error checking. In the vectorized version, we
initialize the out vector with NAs, then fill only the values for which x is ≥ 0.
As a technical note, the initial out is actually a logical vector because NAs are
logical by default. The vector gets converted to a numeric as soon as the first
numeric is stored in it, but if no x values are ≥ 0 this function will return a
logical vector. If this were an issue, we would get around it by initializing out
as as.numeric(rep(NA, length(x))).
One source of slow R code is inefficient looping. If it is possible to replace a
for() or while() loop with a call to one of the apply functions, the result
will almost always be faster. The apply functions use looping internally, too,
but usually in a much more efficient way. Sometimes, replacing a for() loop
will be difficult – for example, if the action to be taken on iteration i depends
on the results from iteration i − 1. But we always try to vectorize our functions.
However, sometimes vectorization makes code harder to read and maintain.
It is gratifying to produce a function that runs faster, but sometimes it is the
total time taken to code, run, explain, and maintain that is the more important
measure of quality. In the sort of large data sets we run into, with many more
rows than columns, the critical point is to work hard to avoid looping over rows.
Looping over columns is usually much less costly. Data in other formats (e.g.,
with many more columns than rows) will need to be approached differently.

Non-vectorized:

funk <- function (x, y = 2) {
    if (x < 0) {
        warning ("Neg(s)")
        return (NA)
    }
    return (sqrt(x) + y)
}

Vectorized:

funk <- function (x, y = 2) {
    out <- rep (NA, length (x))
    if (any (x < 0)) warning ("Neg(s)")
    out[x >= 0] <- sqrt(x[x >= 0]) + y
    return (out)
}

Figure 5.1 Vectorized and non-vectorized code example.
5.5.3 Other Techniques to Speed Things Up

Vectorization is always the first thing to try when you need to speed up your
R code. But sometimes a fully vectorized function is just not fast enough. The
rest of this section suggests some avenues to try when looking for more speed
from your R computations.
Compiling

One quick way to gain some speed is through compiling, which is the
translation of R code (which, of course, looks something like English)
into “byte code,” which the machine can read much faster and more efficiently. Indeed, almost all of R’s built-in functions have already been
compiled into byte code; type the name of a function like q and you will
see near the bottom a line giving its byte-code address (a line that begins
<bytecode:). The built-in package compiler
allows you to compile your own functions, either one at a time with cmpfun()
for functions (or compile() for expressions), or package by package with
compilePKGS(). Sometimes, compilation does not appear to speed things
up, but it is easy to try. In this example, we show system.time() applied to
a simple function, which, essentially, does nothing.
> dumb <- function (n = 100) {
      for (i in 1:n) {}
  }
> system.time (dumb (6e8))
   user  system elapsed
  33.48    0.23   33.71

The function dumb() takes no action, but in our example it does so six hundred
million times. R required about 34 seconds for those operations, although the
exact time can vary. The result of system.time() will depend greatly on
how fast your computer is; we show our result as a baseline. (Our machine is
fast. If you try this example, you might want to start with a smaller number.) In
this case, anyway, compilation definitely helps. The following example shows
the result of compiling the function (an operation that used 0.02 seconds of
elapsed time) and then running it.
> dc <- cmpfun (dumb)
> system.time (dc (6e8))
   user  system elapsed
   7.33    0.00    7.34

In this example, we see a substantial savings from compiling – the compiled version required only 22% of the time required by the original. Another very useful
approach is the “just-in-time” compilation enabled by a call to enableJIT(),
with an integer argument describing the details of how the compilation should
work. Specifically, enableJIT(3) performs as much compilation as possible.
Once this call has been made, functions are compiled right before their first use
(and stay compiled). In this way, you do not need to specify specific functions
to be compiled – they all are – and this has the potential to result in substantial
time savings. In practice, it sometimes happens that the time-consuming parts
of your data cleaning tasks are being done inside built-in functions – and these
will probably already have been compiled. On the other hand, there are very few
drawbacks to compilation. To “uncompile” a function, edit it with fix() – or
“deparse” it to convert it to text, and then convert it back into a function, with
a command such as dc <- eval(parse(text = deparse(dc, control="useSource"))).
Parallel Processing

An even greater gain in processing speed can sometimes be realized using parallel processing, which exploits the fact that modern computers almost all have
multiple “cores” capable of more-or-less independent processing. The built-in
package parallel allows control over the use of multiple cores. Using the
parallel package requires three steps: first, a “cluster” of cores is created,
possibly with the makeCluster() command. (Creating the cluster may take
a few seconds, but it only needs to be done once per session.) Second, any necessary items from the global workspace need to be passed to the cluster via
a call to clusterExport(). Finally, the cluster created in the first step is
passed to one of the functions that knows how to exploit it. For example, the
parSapply() function acts as a “parallel sapply(),” running a function on
the columns of a data frame, with different cores acting on different columns.
The machine on which we wrote this book has 32 cores – we can determine this
via a call to the detectCores() function – so in this example we set aside 24
of them to act as the cluster. Creating the cluster takes about 5 seconds in this
example. Next we export the dumb() and dc() functions so that the cluster
can use them, and then we run parSapply() to run 24 separate instances of
each of these functions.
> detectCores ()   # after library (parallel)
[1] 32
> clust <- makeCluster (24)
> clusterExport (clust, c("dumb", "dc"))
> system.time (
      parSapply (clust, 1:24, function (i) dumb (6e8/24)))
   user  system elapsed
   0.00    0.00    2.29
> system.time (
      parSapply (clust, 1:24, function (i) dc (6e8/24)))
   user  system elapsed
   0.00    0.00    0.67
Because we were running 24 instances of our dumb() and dc() functions,
we only needed to run 1/24th of the iterations within each instance. Notice the
substantial time savings realized from running in parallel – even more so when
running the compiled version of the function. Interestingly, functions inside a
call to parSapply() are not automatically compiled even if enableJIT()
has been called; you will need to compile them explicitly (or run them first,
to get them to compile if enableJIT() has been set) and then export the
compiled version to the cluster.
It is a good practice to stop the cluster with stopCluster() when parallel
processing is complete, although the cluster will be stopped when R terminates.
Even More Speed

When even more speed is required, R can interface nicely with code that is compiled down to the machine level. Often this is code from somewhere else that
was originally written in, say, C or Fortran. We run code like this all the time
inside R, and in packages, without even knowing it. The ability to write this
sort of code is perhaps more valuable for production environments requiring
lots of specialized computation than for the sorts of data cleaning problems
relevant to this book, and we will not discuss this here. The help pages for
dyn.load() and .C(), and especially Chapter 5 (“System and foreign language interfaces”) of the “Writing R Extensions” manual will be useful starting
points. The Rcpp package provides another, cleaner approach to incorporating C++ code, and the paper of Eddelbuettel and Françcois (2011), and their
web page, www.rcpp.org, together with that of Wickham, adv-r.had.co
.nz/Rcpp.html, provide more details for the interested user.

5.6 Chapter Summary and Critical Data Handling Tools
This chapter discusses functions and scripts, two ways to automate actions in
R. Every data cleaning task will generate one or more of these. Functions are
self-contained but hard to transport. Unless you include a side effect (such as
plotting or writing to a file) they operate on local variables and do nothing
more than compute a return value. Scripts are easy-to-read text files that act
as sets of commands – often including lots of function calls and even function
definitions – just like those you type into the console (including commands that
may not be valid). All variables in a script are global.
Important features of functions (and scripts) include the following:
• Function arguments are matched by name and by position. Names can often
be abbreviated. One special argument is the ellipsis ..., which allows the
number of arguments to vary. However, names of arguments defined after
the ellipsis cannot be abbreviated, and the ellipsis can "consume" arguments
whose names are typed in error. It is important for you as the function
developer – and as a user – to check the arguments carefully. The missing()
function is useful here (see the sketch following this list).
• A function has a return value. This can only be one R object, so if you need
to return several items, put them into a list. If you do not want a function's
return value to be printed automatically, end it with a call to invisible().
• Side effects are when a function does something other than compute and
return a value. Sometimes these are harmless, as when a function draws a
plot, or necessary, as when a function reads in data from a disk file. Sometimes
side effects are dangerous, as when a function changes or deletes an object in
the global workspace. Avoid this sort of side effect.
• Functions can be saved to disk with dump() and restored with source(),
or via save() and restored via load(). Scripts are saved as ordinary text
files, but using source() on a script file causes its code to be run.
• Functions and scripts will have errors. Debugging is a substantial part
of every development effort. We can debug with a simple method, such
as inserting cat() statements, or via the more sophisticated interactive
debugging tools browser() and debug(). Our functions can generate
our own errors (and warnings), and these can be handled via try(). If a
function produces an error or warning, be sure to find out why – but do not
expect your users to behave similarly.
• Functions and scripts can access files and directories through ordinary R
functions.
• To speed things up, try compiling your functions, either individually with
cmpfun() from the compiler package, or via enableJIT(), which
compiles every function it runs. For problems with large loops, you can
achieve substantial gains in speed via parallel processing through the
parallel package.
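To illustrate several of these points at once, here is a minimal sketch of a function – our hypothetical summarize() – that uses missing() to test for an absent argument, returns several items in a list, and suppresses automatic printing with invisible():
> summarize <- function (x, wts) {
      if (missing (wts))               # supply equal weights if none given
          wts <- rep (1, length (x))
      if (anyNA (x))                   # refuse unexpected missing values
          stop ("x contains missing values")
      res <- list (mean = weighted.mean (x, wts), n = length (x))
      invisible (res)                  # return res without auto-printing
  }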

5.6.1 Programming Style

In any data cleaning project you should expect to deliver your functions and
scripts, as well as your results. Your code should be neatly and consistently formatted. A number of proposed R style guides can be found on the Internet
and inevitably they disagree. For example, one source suggests separating parts
of an R object’s names with underscores (as in first_sub_total) while
another recommends never using underscores. R’s code itself uses different
schemes in different places. Make sure your names are meaningful – s is an
uninformative name for a standard deviation, for example. The cost of typing
longer variable names is less than the cost of debugging code later when you
cannot remember what that variable was for.

Our recommendation is to select a style and stick with it. Most importantly,
add comments to your code – more than you think you need. Use spacing –
blank lines, spaces around operators, and indentation – for readability. For
example, in our code we do not indent commands in a function that are not
part of an if(), for(), or similar clause. Then, we indent four spaces for
each such nested clause. Other programmers indent everything inside a function. On another point, consider including dates or version numbers in your
functions and scripts.
It is as important to write readable code as it is to write efficient code.
Tailor your style to your reader. For example, suppose you want to check
whether a vector has a zero length. We would use an expression such as
if(length(x) == 0) .... This shows clearly the two things being
compared. Some programmers use the more cryptic if(!length(x))....
Here R computes the length, then converts it to a logical to be able to apply
the ! operator. If the length is 0, then !0 produces TRUE. Even if this latter
approach were marginally faster it would not be worth it.
5.6.2 Common Bugs

In this section, we mention a number of the bugs we see in our, and other
people’s, R code. Avoiding these bugs is a good start toward building useful,
re-usable code.
• Many bugs arise from unexpected input, so when you prepare a function
for someone else’s use, you should consider testing the classes, sizes, and
other attributes of input arguments to ensure they match what a function
expects. Sometimes, the “unexpected” behavior is related to missingness, so
if a computation depends on, say, the average of some input vector, you, as a
function writer, will need to decide whether an NA in the input should cause
the function to stop (perhaps testing with anyNA()) or whether the average
should be computed with the missing values excluded (using mean() with
na.rm = TRUE in this example).
• Another common input error arises from R's habit of converting a one-row
or one-column matrix into a vector (see Section 3.2.1). Users might
intend to pass a matrix to a function expecting one, using a call like
myfun(mymat[mymat[, "Price"] > 10,]) to select only those rows for
which the Price column is greater than 10. If there is only one such row,
though, R will silently convert that row into a vector. Code that relies on
matrix attributes, such as trying to determine how many columns a matrix
has using dim(), will fail. (The drop = FALSE subscript prevents this
conversion; see the sketch following this list.)
• A third common error arises from the way that missing values propagate
through computations. We very often use the if() function, but a call to
if(x) produces an error if x is NA. When you observe this error, you will
want to find the source of the missing value.
• Errors often arise when reading disk files (if they are not present) or writing them (if a folder does not exist or if permissions are not properly set).
Check for these conditions with one of the file-handling functions such as
file.exists(), to check whether a file is present, or file.info(), to
see if a file is writeable.
• We commonly see a warning when a function expects a single result and gets
a vector of length 2 or more. For example, in many cases, code will check the
class of an object with code such as if(class(obj) == "lm") ....
Since many objects have a vector of classes, this code will produce a warning
(and examine only the first element of the class vector).
• An even more serious problem can occur when sapply() (Section 3.5)
runs a function on a data frame or list that is expected to return a single value
for each column in the data frame. We saw in Section 3.2.3 how a problem
arises if the function in question can return results of different lengths.
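The following minimal sketch illustrates two of the remedies mentioned above; the function myfun() and the Price column are hypothetical, while cars is one of R's built-in data sets:
> myfun <- function (mymat) {
      # drop = FALSE keeps a one-row result as a matrix
      sub <- mymat[mymat[, "Price"] > 10, , drop = FALSE]
      nrow (sub)                # safe: sub always has a dim attribute
  }
> fit <- lm (dist ~ speed, data = cars)
> inherits (fit, "lm")          # safer than class(fit) == "lm"
[1] TRUE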
5.6.3 Objects, Classes, and Methods

We showed an earlier example where calling print() on a data frame led
to R calling another function, print.data.frame(). This is an example
of a widely used programming approach in which the type of object passed
to a function determines the action the function will take. In this example,
print() is a “generic” function and print.data.frame() is a “method”
that is applied to data frame objects. R has almost 200 different methods
for the print() function; you can see them with the command methods
("print"). As in the print.data.frame() example, the names of
methods are constructed from the generic function’s name, a dot, and the
object type. To give another example, we construct sequences of POSIXt
objects using the seq() function (Section 3.6.6). R detects the class of the
POSIXt object and runs the specific seq.POSIXt() function.
This approach, in which one function can have many methods, based on the
class of the object passed to it, is part of “object-oriented programming,” a
widespread architecture of programming languages. R has two different ways to
implement object-oriented programming. Neither is important enough to data
cleaning to be part of this book, but you should be aware that many R functions
are equipped to perform different duties based on different inputs. Sometimes,
you will need to look up the specific function, rather than the generic one, in
the help pages.
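To give the flavor of this mechanism, here is a minimal sketch in which we invent a class called tally and supply a method for the print() generic:
> myobj <- structure (list (n = 3), class = "tally")
> print.tally <- function (x, ...) cat ("tally of", x$n, "items\n")
> myobj               # auto-printing dispatches to print.tally()
tally of 3 items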

6 Getting Data into and out of R
Earlier chapters in this book have described what data in R looks like and how
to manipulate it. In this chapter, we describe how to get data into R for analysis,
and how to get updated data back out. Reading data into R is the first step in
every data cleaning project, so it is in this chapter that the real work of data
cleaning begins. But we start with a note on keeping track of your data’s provenance, which is the word we use to describe the documentation of your data’s
history. You should know where you acquired every bit of your data, from what
source, and on what date. A natural place to keep that information is in your
scripts. Often, we have one or more scripts devoted to reading in the data, and
these start with some description of the date, the source of the data, the commands to do the actual reading, and some notes on problems we encountered
reading the data in.
Keeping track of the data’s provenance is always important, but it is especially important if the underlying data is subject to change, perhaps because
you extracted it from a database, a public site, or a web page under someone
else’s control. It is through this sort of documentation that you can make your
research reproducible by others.
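As an illustration, the top of such a script might look like the following sketch; the file name, source, and notes are, of course, hypothetical:
# get-survey.R: read the customer survey extract
# Source:   extract from the client's reporting database
# Received: 2017-03-15, from J. Smith
# Notes:    pipe-delimited; some State fields are blank
survey <- read.table ("survey_2017-03-15.txt", sep = "|",
    header = TRUE, stringsAsFactors = FALSE)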

6.1 Reading Tabular ASCII Data into Data Frames
Most of the data we read – and write – in R comes to us in the form of rectangular or tabular data, that is, data arranged with observations in rows and
measurements in columns. We expect that every row will have exactly the same
number of items. Being able to get these data files into R cleanly is a critical part
of data cleaning. Unfortunately, there are a lot of ways a file can be badly put
together. In this section, we describe how to read tabular data files into R data
frames and some approaches to try when things aren’t working. Writing data
from R to disk is generally straightforward and we discuss that briefly at the end
of the section. In this section, we focus on ASCII text, and in the next section,
we describe the minor complications brought about by UTF-8.
6.1.1 Files with Delimiters

Perhaps the most common sort of file we read in is a delimited file in which each
observation is represented by a single row. Within the line, the fields are separated by a delimiter, which will usually be a single character such as a comma,
tab, semicolon, pipe character (|), or exclamation point. Sometimes, a space
or spaces are used as the delimiter. The end of each line is marked with the
end-of-line character (but, as we observe in Section 6.1.8, these can vary across
platforms). The advantage of delimited files is that, being text, they are easy
to move across systems and easy to inspect by eye (although of course fields
will generally not line up). Unlike fields in fixed-width files, fields in delimited
files need never be truncated nor made artificially too big. On the downside,
like other text files, delimited files are not good at representing numeric data
efficiently. Moreover, on occasion users will inadvertently insert the delimiter
character into regular fields (we illustrate this later in the chapter).
The principal tool for reading in delimited files in R is the read.table()
function, together with its offspring read.csv() and read.delim().
(“CSV” names the popular “Comma-Separated Values” file format.) These
three functions are identical except for their default settings. All three of these
produce data frames; if you need a matrix, the best approach is to construct
a data frame and then convert it via as.matrix(). Two more functions,
read.csv2() and read.delim2(), are also available; these are just like
read.csv() and read.delim() except that they expect the European-style
comma for the decimal point, and read.csv2() uses the semi-colon as
its delimiter.
The read.table() function and the others employ a host of arguments
that allow most delimited files to be read in. These arguments are optional and
have default values that are sometimes useful and sometimes not.
The most important of these arguments are as follows:
• header: a logical indicating whether the disk file has header labels in the
first row. If so, set this to TRUE and those labels will be used as column headers (after being passed through the make.names() function to make them
valid for that use).
• sep: the separator character. This might be a comma for a CSV, a tab (written
\t) for a tab-separated file, a semi-colon, or something else.
• quote: a set of characters to be used to surround quotes, discussed as follows.
• comment.char: the character that is understood to introduce a comment
line.
• stringsAsFactors: a logical that determines whether columns that
appear to be characters are converted to factors or left as characters.
• colClasses: a vector that explicitly gives the class of each column.
• na.strings: a vector specifying the indicator(s) of missing values in the
input data.
Other options control the number of lines skipped before the reading starts,
the maximum number of rows to read, whether blank lines are omitted or
included, and more, but the ones above are generally the arguments we worry
about first. The choice of sep character can often be inferred from the name
of the file (comma for files whose names end in CSV and tab for TSV, although
this is not a requirement). When the separator is unknown, we either try the
usual ones, use an external program to examine the first few lines of the file,
or resort to the scan() function (discussed later in this section). The default
value of sep is the empty string, "", which indicates that any amount of white
space (including tabs) serves as the delimiter. This is intended for text that has
been formatted to line up nicely on the page (so that extra spaces or tabs have
been added for readability). Setting sep to be the space character, " ", means
that read.table() will split the line at every space (and never at a tab).
We have also found that the quote, comment.char, and stringsAsFactors arguments, in particular, very often need to be set explicitly in data
cleaning tasks. By default, the set of quote characters is set to be both ' and
" in read.table(), and to just " in the other functions. This means that
a string inside quotation marks such as "Ann Arbor, MI" is treated as a
single unit. This is a valid approach when the separator is a space, since otherwise that phrase would look as if it has been made up of three separate fields.
The single quote mark would be useful in the corresponding British environment, where its use is more common. In our work, the single quote is found
most often as an apostrophe, and if the single quote is part of the quote argument, the apostrophe in the phrase Coeur d’Alene, Idaho will be taken
as starting a very long string that might not be ended until a later entry contains
Peter O’Toole or Martha’s Vineyard. We generally turn the interpretation of quotation marks off by passing quote = "", or set the argument to
recognize only the double quotation mark by passing quote = "\"".
Similarly, the comment character defaults to R’s own comment character, the
hash mark or pound sign (#). Lots of code has comments, but comments in data
are rare. Much more often we see the pound sign as legitimate text in an expression such as Giants are #1! or 241 E. 58th St. #8A. We generally
turn comments off by passing comment.char = "".
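Putting these recommendations together, a typical first attempt at reading an unfamiliar comma-delimited file might look like this sketch (the file name mydata.csv is hypothetical):
> dat <- read.table ("mydata.csv", header = TRUE, sep = ",",
    quote = "", comment.char = "", stringsAsFactors = FALSE)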
6.1.2 Column Classes

The stringsAsFactors argument is one of a set of arguments that aims
to help R figure out what to do with the data that it reads in. We have
encountered this name when constructing data frames with data.frame()
and cbind() in Section 3.4. Its default TRUE value specifies that any column
found to be character should be converted to a factor. This can cause problems
in a couple of ways. First, it is often the case that numeric columns can have
a small number of unexpected text items in them, particularly missing values
codes that are different from the default NA that R expects. In this case, the
read.table() function interprets these columns as character and then
converts them to factors. Second, even for text data we generally want the
raw text for purposes of data cleaning; we switch to factors only when we are
ready to begin modeling. One way around this automatic conversion is to set
stringsAsFactors = FALSE, and we almost always pass this argument
when we are reading in data. The exceptions are when we know that the data
is numeric (and pre-cleaned), and when we use colClasses, which we
describe in the following paragraph. In fact, this issue arises so often that R
has a built-in option to set the default behavior of stringsAsFactors.
However, we rarely use this option because we want our scripts to be portable
to other users who might not have set it, and for aesthetic reasons we hesitate
to have our scripts set other users’ options.
A more flexible approach is provided by the colClasses argument. This
allows you to specify the column type for each of the columns at the time you
read the table. Of course, this information is not always available – sometimes
you have to read the data to figure out what is in it. Passing the nrows argument
allows you to specify how many rows read.table() should read. So often
we read just a few dozen lines and inspect the resulting data set to get an idea
about what classes to expect.
The colClasses argument can be specified as a vector whose length is the
number of columns in the input file, or as a named vector, in which case you
can name specific columns and R assigns classes to the rest. When classes are
not supplied, R determines them by reading the entire file, so supplying column
classes explicitly can often speed up the reading of a big file.
In addition to the basic types of numeric, character, and logical, you
can also specify Date or POSIXct, and there is an approach by which you can
convert the input into other classes as well (see the help for read.table()).
The elements of colClasses are recycled in the usual way, so colClasses = "character" converts all columns to characters – which is
often a good place to start in a data cleaning problem where the data is poorly
documented or of suspect quality. The as.is argument provides another way
to keep columns from being converted.
As an example of where colClasses can be helpful, consider the case
where a column consists of numbers that might have leading zeros, such
as US five-digit ZIP codes. Without colClasses, such a column would
automatically be converted to numeric, producing values whose leading zeros
would be lost – so the ZIP code for Logan Airport in Boston would come out as
the number 2128 instead of the expected 02128. The stringsAsFactors
argument would have no effect here since R would see the column as numeric.
In this example, we could replace the missing zeros using sprintf() as in
Section 4.2.1, but a more direct approach is to specify "character" in the
colClasses argument.
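A minimal sketch of that remedy, assuming a hypothetical file customers.csv with a column named ZIP:
> dat <- read.table ("customers.csv", header = TRUE, sep = ",",
    colClasses = c(ZIP = "character"))
> dat$ZIP[1]        # leading zeros preserved, e.g., "02128"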
6.1.3 Common Pitfalls in Reading Tables

In this section, we describe common problems we encounter when reading text
files into R and some ways you might address them. In any particular case, it is
difficult to know which issue is causing the problem. After we describe some of
these problems, we give an example of how we might go about diagnosing them
with a small realistic data set that appears to have a deceptively simple problem.
Embedded Delimiters

We have mentioned that problems commonly arise when the table to be read
includes # or quote characters. Another problem arises when the data has the
separator character (say, a comma) included as part of some text. Data of this
sort arises particularly often when the original source of the file is a spreadsheet. Textual comments in spreadsheets very often have commas (or tabs,
which are introduced when the user enters a new-line character within a cell).
It also happens that cities get recorded in a form like “Hempstead, Long Island”
or “Westminster, Orange County.” If those text fields are surrounded by quotation marks then R can interpret them correctly using quote="\"", but this is
rarely seen in spreadsheet data. More often, embedded delimiters require some
effort to correct, and in the following section we give an example.
Unknown Missing Value Indicator

Another set of problems arises when the “missing value” indicator is unknown.
By default, R expects missing values to be indicated by NA, but the na.
strings argument allows a set of values to be supplied. Any value in the
input that matches an element of na.strings will be interpreted as a
missing value – and blank fields will also be taken as missing, except in
character or factor fields. For example, a spreadsheet from the Excel program
can have values such as #NULL!, #N/A, or #VALUE!, so these would be good
candidates for including in na.strings. If they are included, then columns
that are otherwise numeric (or logical) will be correctly interpreted, whereas if
they are not, then those columns will be interpreted as characters – and as we
discussed earlier, character columns are treated as factors by default.
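For example, a call like the following sketch (with a hypothetical file export.csv) supplies those Excel error values as missing-value indicators:
> dat <- read.table ("export.csv", header = TRUE, sep = ",",
    stringsAsFactors = FALSE,
    na.strings = c("NA", "#NULL!", "#N/A", "#VALUE!"))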
Empty or Nearly Empty Fields

Empty fields – those with nothing in them – generally do not cause problems, since they are brought in as NA values in numeric fields. As we noted
in Section 4.1.3, though, sometimes text data – particularly from spreadsheets – represents empty cells by strings with a single space (or, sometimes, a
few spaces in a row). We sometimes think of these cells as “nearly empty.” This
problem is often hard to detect ahead of time, since empty and nearly empty
cells of a spreadsheet look alike. So the first thing we do, when we read in a data
frame, is to determine the number of numeric and character or factor columns
(and logical, too, though these are rarer). For a data frame named ourdf,
for example, we run the command table(sapply(ourdf, function(x) class(x))) and compare what we see to what we expect. In
much of our work we expect 50–75% or more of the columns to be numeric,
so if there are no, or only a few, numerics, we conclude that there are textual
values – often, missing-value indicators or nearly empty values – in those
numeric columns. If we know that a particular column (say, one called
NumID) should be numeric but is being represented as character, we can
tabulate the elements that R is unable to convert, with a command like
table(ourdf$NumID[is.na(as.numeric(ourdf$NumID))]). This
is often a good starting point for examining the set of missing value indicators in
the data. These can then be included in the na.strings argument in another
call to read.table() – or we can explicitly set them to NA ourselves.
Blank Lines

A less common problem occurs when the input file has blank rows. By
default, read.table() will skip these, and this is often what we
want. However, in some cases it is important that the lines from two different
files correspond. In this case, set blank.lines.skip = FALSE and blank
lines will be included in the output. The entries in these lines will appear as NA
in numeric columns and as the empty string "" in character ones.
Row Names

In Section 3.4, we noted that a data frame in R must have row and column
names. When a data frame is produced by one of the read.table() functions, the default behavior is for R to create row names from the integers 1, 2,
and so on, unless the header has one fewer field than the data rows, in which
case R uses the first column’s values as the row names. Remember that row
names need to be unique; if yours are not, you can force R to supply integer
row names by passing row.names = NULL.
You can also specify one of the columns to be used as row names explicitly, by
passing its name or number with the row.names argument. (This argument
can also be used with a vector of row names, but we rarely use that feature.) It
can be useful to specify row names when they represent something important
because the names can be used explicitly in subsetting statements.

6.1.4 An Example of When read.table() Fails

There are some data sets for which our initial attempts at using read.table()
just will not work because of one or more of the problems we have
described – or something else. Sometimes we correct these problems,
when we find them, by modifying the original data directly and documenting
that fact. More often, we can read the data into R by suitable choice of arguments to read.table(), and sometimes read.table() just cannot be
used and we resort to the more primitive, but more flexible, function scan(),
which we describe later in the chapter.
In this example, we show a small sample of the sort of text we are often supplied with and try to read it into R. The following text has been saved as a file
named addresses.csv in our working directory.
ID,LastName,Address,City,State
001,O'Higgins,48 Grant Rd.,Des Moines,IA
011,Macina,401 1st Ave., Apt 13G,New York,NY
242,Roeder,71 Quebec Ave.,E. Thetford,VT
146,Stephens,1234 Smythe St., #5,Detroit,MI
241,Ishikawa,986 OceanView Dr.,Pacific Grove,CA

Because the file’s name ends in csv, we will presume that the file uses commas
as separators (although this is not always the case), and that the file includes a
header row. In real life, it is a good idea to examine the file first, where possible,
using a text editor or file viewer.
Using the lessons learned from the previous section, we know to specify the
arguments quote = "", comment.char = "", and stringsAsFactors=FALSE. Our first effort produces this error:
> read.table ("addresses.csv", header = TRUE, sep = ",",
quote = "", comment.char = "", stringsAsFactors = FALSE)
Error in scan(file = file, what = what, sep = sep, ...
line 1 did not have 6 elements

This error is telling us that the longest line encountered contained six elements
(fields), whereas at least one other – in this case, the header, line 1 – had fewer.
Either the header or some of the data is lacking a field – or has too many. Notice
that the error was issued by the scan() function, which is itself called by
read.table().
Using read.table()

Our next step in this process might be to examine the first row to make sure
this is a header row and to determine the number of fields. When we add the
nrows = 1 argument to read.table(), and set header to FALSE, we
extract the very first row in the file.
> read.table ("addresses.csv", header = FALSE, sep = ",",
    quote = "", comment.char = "", stringsAsFactors = FALSE,
    nrows = 1)
  V1       V2      V3   V4    V5
1 ID LastName Address City State

R has returned a data frame with one row. Since no row or column names were
provided, R has added them (the 1 at the left and the V1 through V5 at the
top). The header contains five fields, so we expect every row of the data to have
five fields as well. We can determine the number of fields that R sees in each
row with the count.fields() command. Since this command produces
one number for every row in the file, we normally pass its output directly to
table(), but here we show it explicitly.
> count.fields ("addresses.csv", sep = ",", quote = "",
comment.char = "")
[1] 5 5 6 5 6 5

The fact that the second and fourth lines of the file (after the header) contain
six fields and the others do not is now visible. We can suspect that those rows
each contain an extra delimiter. Files with embedded delimiters are particularly
painful to handle. We extract the first problematic line with read.table()
by skipping the first two and reading only the third (using skip = 2 and
nrows = 1). R returns a data frame with seven columns, as shown here.
> read.table ("addresses.csv", header = FALSE, sep = ",",
    quote = "", comment.char = "", stringsAsFactors = FALSE,
    nrows = 1, skip = 2)
  V1     V2           V3      V4       V5 V6
1 11 Macina 401 1st Ave. Apt 13G New York NY

We are prepared to conclude that the problem with the file is that some of the
entries in the Address column contain embedded commas – but we would
look at the other problem rows to be sure.
When there are delimiters embedded in fields, one quick way to get the data
into R with read.table() is by passing the fill = TRUE argument. This
is intended to produce a data frame with the largest number of columns necessary to hold any line. If it succeeds, and there are only a small number of
defective lines, the final column of the new data frame will be almost empty. The
rows where there are entries in the final column will help you figure out where
the problems in the input data are arising. Here, we show the results of using
the fill = TRUE argument, and assign the return value of read.table()
to an object named add.
> (add <- read.table ("addresses.csv", header = TRUE,
    sep = ",", quote = "", comment = "",
    stringsAsFactors = FALSE, fill = TRUE))
           ID          LastName       Address     City State
001 O'Higgins      48 Grant Rd.    Des Moines       IA
011    Macina      401 1st Ave.       Apt 13G New York    NY
242    Roeder    71 Quebec Ave.   E. Thetford       VT
146  Stephens   1234 Smythe St.            #5  Detroit    MI
241  Ishikawa 986 OceanView Dr. Pacific Grove       CA

There are two things to notice here. First, R has produced row names from the
ID column since the header had one fewer entry than the longest row. This
also causes the column names to be shifted to the right. Had there been duplicate entries in that column, read.table() would have failed and we would
have supplied row.names = NULL on our next effort. Second, the second
and fourth rows’ addresses have been broken at the extra delimiter. This causes
those rows to extend over six columns, leaving them as the only ones with
entries in the rightmost column.
If all the problematic lines appear to be broken in the same way, we would
probably fix them directly in R. In this example, we would start by identifying
the broken lines, paste columns 2 and 3 to complete the address, and then move
columns 4 and 5 into positions 3 and 4. This code shows the steps we might take,
and what the add data frame looks like at this point.
> fixers <- add$State != ""    # logical vector
> add[fixers, 2] <- paste (add[fixers, 2], add[fixers, 3])
> add[fixers, 3:4] <- add[fixers, 4:5]
> add
           ID             LastName       Address City State
001 O'Higgins         48 Grant Rd.    Des Moines   IA
011    Macina 401 1st Ave. Apt 13G      New York   NY    NY
242    Roeder       71 Quebec Ave.   E. Thetford   VT
146  Stephens   1234 Smythe St. #5       Detroit   MI    MI
241  Ishikawa    986 OceanView Dr. Pacific Grove   CA

All that remains is to adjust the column names, remove the rightmost column,
and insert the current row names as a column (replacing them, perhaps, with
integers). This code shows how this might be done.
> # Save column names, then remove last column
> mycolnames <- colnames (add)
> add$State <- NULL
> # Insert the ID column
> add <- data.frame (ID = rownames (add), add)
> colnames (add) <- mycolnames   # now assign column names
> rownames (add) <- NULL         # replace old row names
> rm (fixers, mycolnames)        # clean up!

It was convenient to save the column names before performing the other modifications and then to re-assign them at the end. We include this lengthy example
because these are the sorts of problems we face almost every time we read text
data into R. Note that in the last command we remove some temporary variables. Although we have not been showing this in our code, it is something we
do regularly, to keep the workspace clean and reduce the risk of inadvertently
re-using an existing object.
Using scan()

Sometimes, there are multiple problems with an input data set – highly
variable numbers of fields, different separators used in different places, and
so on. A good general-purpose data input tool is scan(), which with the
sep = "\n" argument, reads entire lines into an R character vector. By
default, scan() expects to encounter numbers, so in reading text we need to
pass the what = character(), or, for short, the what = "" argument.
Like read.table(), scan() has a host of arguments to let you handle all
sorts of text files. This example shows the output of using scan() on our
addresses.csv file.
> (addscan <- scan ("addresses.csv", what = "", sep = "\n",
quote = "", comment.char = ""))
Read 6 items
[1] "ID,LastName,Address,City,State"
[2] "001,O'Higgins,48 Grant Rd.,Des Moines,IA"
[3] "011,Macina,401 1st Ave., Apt 13G,New York,NY"
[4] "242,Roeder,71 Quebec Ave.,E. Thetford,VT"
[5] "146,Stephens,1234 Smythe St., #5,Detroit,MI"
[6] "241,Ishikawa,986 OceanView Dr.,Pacific Grove,CA"

We can fix addscan directly by replacing the third comma with another character – a semi-colon, say – in every row that has six fields (and hence five commas). This is less complicated
than it might appear at first, and represents the sort of data cleaning we do regularly. We start by identifying the problem rows, either with count.fields()
as before, or directly, using gregexpr() from Section 4.4.5.
> commas <- gregexpr (",", addscan)   # locate all commas
> length.5 <- lengths (commas) == 5   # identify long rows
> comma.be.gone <- sapply (commas[length.5],
    function (x) x[3])

Each element of the commas list consists of a vector giving the locations of
the commas within a line. In this last command, we have extracted the third
element of each of these vectors. So comma.be.gone gives the location of
the third comma on those lines with a total of five commas. Now we replace
those commas by semicolons.
> substring (addscan[length.5], comma.be.gone,
    comma.be.gone) <- ";"
> addscan
[1] "ID,LastName,Address,City,State"
[2] "001,O'Higgins,48 Grant Rd.,Des Moines,IA"
[3] "011,Macina,401 1st Ave.; Apt 13G,New York,NY"
[4] "242,Roeder,71 Quebec Ave.,E. Thetford,VT"
[5] "146,Stephens,1234 Smythe St.; #5,Detroit,MI"
[6] "241,Ishikawa,986 OceanView Dr.,Pacific Grove,CA"

Now that addscan is exactly as we want it to be, we have at least three choices.
First, we can write it back out to disk (using the write.table() function
we describe later) in preparation for re-reading. This approach uses extra disk
space, but it has the advantage of creating a clean data set for other users. Second, we can pass the addscan vector back to read.table() using the text
argument, and it will be interpreted just as if the text had been read from a file.
This example shows the output from that call.
> read.table (text = addscan, header = TRUE, sep = ",",
    quote = "", comment = "", stringsAsFactors = FALSE)
   ID  LastName               Address          City State
1   1 O'Higgins          48 Grant Rd.    Des Moines    IA
2  11    Macina 401 1st Ave.; Apt 13G      New York    NY
3 242    Roeder        71 Quebec Ave.   E. Thetford    VT
4 146  Stephens   1234 Smythe St.; #5       Detroit    MI
5 241  Ishikawa     986 OceanView Dr. Pacific Grove    CA

This approach is straightforward but may not be very efficient for very large
character vectors. Notice also that the ID column has been interpreted
as numeric, so that the leading zeros have been removed. We can correct
this by passing the colClasses argument. In this case, since ID is the
only column whose class needs to be specified explicitly, we would pass
colClasses = c(ID = "character").
A final method of handling an object like addscan is to use strsplit()
to create a list consisting of one character vector for each row, broken at its
commas. We can then use do.call() and rbind() to combine the elements
of the list, as described in Section 3.7.1. This approach is fast, and well suited
to large data objects, although it produces a character matrix that needs to be
converted into a data frame, with column names and classes that need to be set.
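A minimal sketch of this third approach, applied to the repaired addscan vector:
> pieces <- strsplit (addscan[-1], ",")    # split each data line at its commas
> mat <- do.call (rbind, pieces)           # combine into a character matrix
> dat <- data.frame (mat, stringsAsFactors = FALSE)
> colnames (dat) <- strsplit (addscan[1], ",")[[1]]   # names from the header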
6.1.5 Other Uses of the scan() Function

As we have seen, the scan() function is the most general way to read data
into R. In this section, we describe some other cases where its use might be
necessary.
Headers, Page Numbers, and Other Superfluous Text

One case where scan() is necessary is when the data set being read in was
formatted by some program that was expecting to produce printed material.
Often data like that will contain page headers and footers, and these need to
be detected and removed. When we encounter this, we generally use scan()
with sep = "\n" to read the data in as lines, then use a regular expression
(Section 4.4) to detect and remove the offending lines. For example, if a vector
old returned from scan() contains lines that start with Page:, a command
like new <- old[!grepl("^Page:", old)] will produce a new vector
omitting those lines. In this case, you have to examine the original data to determine the format of the header and footer lines, and ensure that no “real” lines
are deleted inadvertently. A similar problem can arise when the original program contains both data and also total and sub-total lines – these, too, will need
to be detected and deleted.
Input Records on Multiple Lines

It also happens sometimes that the headers occupy several lines (as with other
difficulties, this arises in data from spreadsheets). If only the headers, and not
the data, wrap around lines, the natural way to handle this is to omit the headers
using the skip argument; so skip = 3, for example, skips the first three rows
of the file. Then the headers can be added back in after the data frame is created.
When the individual records are broken across multiple lines, and the number
of lines is the same for every record, it is possible to persuade scan() to read
these data in pieces. This requires passing a list in the what argument – see
the help for scan() for the details. In fact, though, we usually scan() the
whole file in, as a series of lines, and then operate on the components. If every
input record takes up three lines, say, then we know that lines of the first type
are found in positions 1, 4, 7, and so on; we can generate this sequence with
seq(1, by = 3, to = n) where n is the number of lines read. (Remember to account for the header if there is one.) Then lines of the second type are
found at locations one greater than those of the first type, and so on. Having
identified the lines that make up each type, we can then paste() the strings
of different types into a character vector with one element per original observation. Finally, we can pass this vector to read.table() using the text
argument as in the previous example.
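As a minimal sketch, suppose a hypothetical file threeline.txt has no header and exactly three lines per record; then we might combine and re-read the lines like this:
> lns <- scan ("threeline.txt", what = "", sep = "\n")
> first <- seq (1, by = 3, to = length (lns))   # lines 1, 4, 7, ...
> recs <- paste (lns[first], lns[first + 1], lns[first + 2],
    sep = ",")
> dat <- read.table (text = recs, sep = ",",
    stringsAsFactors = FALSE)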
All of this may seem like a lot of work, but, as we have claimed throughout the
book, a huge proportion of our time as statisticians and data scientists is taken
up with tasks like this. The more efficiently we can take care of data issues, the
faster we can get to the modeling part.
6.1.6 Writing Delimited Files

The write.table() function acts to write a delimited file, just as
read.table() reads one. (And, just as there are the read.csv() and
read.csv2() analogs to read.table(), R also provides write.csv()
and write.csv2().) We normally pass write.table() a data frame,
though a matrix can be written as well, and we generally supply the delimiter
with the sep argument, since the default choice of a space is rarely a good
one. A few
other arguments are useful as well. First, the resulting entries from character
and factor columns are quoted by default; we often turn this behavior off with
quote = FALSE, depending on what the recipient of the output is expecting.
Quoting becomes necessary, though, when character values might contain the
delimiter, or when it is important to retain leading zeros in identifiers that
look numeric (01, 02, etc.). Second, row names are written by default; we
rarely want these, so we generally specify row.names = FALSE. In contrast,
the default setting of the col.names argument, which is TRUE, usually is
what we want. The exception is when we plan to do a number of writes to
a single file. In that case, the first write will usually specify col.names =
TRUE and append = FALSE, and subsequent ones will specify col.names
= FALSE and append = TRUE.
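For example, writing a hypothetical data frame dat to out.csv in two pieces might look like this sketch:
> write.table (dat[1:100, ], "out.csv", sep = ",",
    row.names = FALSE, col.names = TRUE, append = FALSE)
> write.table (dat[101:200, ], "out.csv", sep = ",",
    row.names = FALSE, col.names = FALSE, append = TRUE)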
6.1.7 Reading and Writing Fixed-Width Files

One alternative to the delimited file is the fixed-width file. In this approach,
each field occupies the same positions on each line. For example, an account
number might take up characters 1–9, the customer’s last name characters
10–24, and so on. This sort of output was more common in the past because it
is the preferred format of the COBOL language, which is no longer widely used.
Depending on the layout, it is possible for the fixed-width approach to use much
less space than the delimited one. For example, a comma-separated file with a
million rows and a thousand columns will have roughly a billion commas in it.
(On the other hand, each record in our example will have 15 characters for a
customer’s last name, so the design will waste space for customers with names
such as Lee, and truncate those with more than 15 characters.) Moreover, random access is possible in a fixed-width file; if the customer’s last name starts
in position 10, and each line has 351 characters, then the millionth customer’s
name must start at character 351,000,010. A program that needs that name
can “seek” directly to the relevant character, whereas with a delimited file it
would have to read a million lines of different lengths. Despite these advantages,
fixed-width files seem to arise nowadays only from older systems.
The R function to read these files is read.fwf(). It has many of the same
arguments as read.table(). The most important additional argument is
widths, a vector of integers giving the lengths of the fields. In our example
above, the first two elements of widths would be 9 and 15, those being the
lengths of the account number and customer’s name.
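A minimal sketch for the layout described above (the file name accounts.txt and the column names are hypothetical):
> acct <- read.fwf ("accounts.txt", widths = c(9, 15),
    col.names = c("AcctNum", "LastName"),
    colClasses = c("character", "character"))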
It is rare to have to write a fixed-width file – indeed, we have never had to. The
gdata package (Warnes et al., 2015) has a routine aptly named write.fwf()
that appears to do this job. We have not tested it.
6.1.8 A Note on End-of-Line Characters

For historical reasons, Windows text files use two characters to denote the
end of a line – carriage return (\r in R) and line feed (\n), whereas OS X
and Linux use only the \n character (and older Apple computers used only
\r). This can cause problems when passing text files between systems, but R
is generally forgiving. The scan() function recognizes any of these line endings, and write.table() permits some flexibility in the eol (“end of line”)
argument. Mac and Linux users can also use gsub() to remove any \r characters. Although R is more forgiving than some other applications, you should
be aware that text files’ endings can differ from platform to platform.

6.2 Reading Large, Non-Tabular, or Non-ASCII Data
Most of the text data we get into R comes via scan() or read.table(). But
there are at least two cases where we need more than these techniques alone
can provide. First, some files – and we expect this to be ever more common
in the future – are simply too big to fit into the main memory of the computer. We briefly discussed some ways around this limitation in Section 3.8.
If the entire file is needed, then the techniques of that section may be necessary. Often, though, the file is huge but the subset of records that we need is
not. In that case, we need the ability to filter the data set, without first reading
it all into memory.
The second case involves files that are not in tabular formats suitable for
read.table(). These might be text in a non-tabular format, such as the popular JSON and XML formats we discuss in this chapter; free-form text such
as log files and other documents; or binary data such as images and sounds.
For all of these cases, R provides the ability to perform basic file operations
on disk files. By “basic file operations” we mean opening a file to get access to
its contents, reading it bit by bit so that we can operate on the part that was
read, writing results to the file (usually we will read from one file and write to
another), seeking (i.e., resetting our current position in the file), and then closing the file. In fact, in most circumstances, we open two files, one for input and
another for output. We read the input file sequentially (i.e., starting at the beginning and all the way through); when we encounter records of interest we write
them, or something about them, to the output file; and then when the input
file has been read we close both files. These basic file operations are most often
performed on text files divided into discrete lines, but they also apply to files
with binary or other data. In this section, we describe these basic file operations
and how they might be used within R.
6.2.1 Opening and Closing Files

The first step in file handling is to open the file. We must first know whether
the file contains text (which might be UTF-8) or whether the data is binary (as
an image, video, or music file). This is not always easy to ascertain inside R,
although the name of the file is often informative. Most often the file format is
supplied to us by the client.
When we open a file in R, we are really creating a connection, which is a
method of communication not only to and from files but also to and from other
devices and processes. Files are very much the most common sort of connections we use, so in these sections we will focus on tools for handling files.
A file is opened with the file() function and its relatives. These related
functions, such as gzfile(), provide the ability to open files in some of the
popular zipped formats, like gzip. The help for the file() function shows the
formats supported. (There is also an open() command, which is more suited
to non-file connections.)
The file functions return a connection object that stores all the information
that R needs to have about the connection, and we use this connection object in
subsequent calls to the functions that will read from, write to, seek in, or close
files. When we open a file, we pass an argument called open that describes
whether the file is to be read from, written to, or appended to (open = "r",
"w" or "a", respectively). If a + is added, the file is opened for both reading and
writing (but open = "w+" truncates the file first), and if a b or t is added the
file is explicitly opened in binary or text mode. So, for example, open = "rb"
opens a binary file for reading, and open = "a+b" opens a binary file for
reading and appending.
Once you have finished with a connection, you should close it with the
close() function. This is a good practice even though R will close it for you
when your session terminates.
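For example, this sketch opens a hypothetical gzip-compressed text file for reading and peeks at its first few lines:
> con <- gzfile ("big.csv.gz", open = "rt")   # "rt": read, text mode
> first5 <- readLines (con, n = 5)            # look at the first lines
> close (con)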
6.2.2 Reading and Writing Lines

Once the connection object is available, we pass it to the reading and writing
functions. The readLines() function reads text lines – that is, pieces of text
terminated by the new-line character. We specify the number of lines to read
with the n argument (n = -1 meaning to read all lines). Rather than read one
line at a time, it feels as if it should be faster to read a large “chunk” of lines at
once, process them, and then read another chunk. In reality, though, the situation is complicated by the way the operating system itself prepares “caches” of
disk files in main memory. When programming, you have to balance the (possible) gain in speed of reading a chunk of lines at once with the simplicity of
code to handle just one line at a time.
As an example, consider using readLines() to read lines from the
addresses.csv file of the example from last section. (readLines() produces the same result as scan() with what = "" and sep = "\n".) Operating on a file, readLines() opens the file, reads as many lines as requested
with the n argument and then closes the file. The reading always begins at
line 1. In this example, we read one line at a time from the addresses.
csv file.
> readLines ("addresses.csv", n = 1)
[1] "ID,LastName,Address,City,State"
> readLines ("addresses.csv", n = 1)
[1] "ID,LastName,Address,City,State"

Here, the same line is returned from both calls because readLines() opens
and closes the file each time. If you want to read the data in pieces, you will need
to open a connection and pass the connection to readLines(). Operating on
a connection, readLines() reads the number of lines requested and keeps
track of its location in the file for future calls. Here, we read the first few lines
of addresses.csv via a connection.
> con <- file ("addresses.csv", open = "r")
> readLines (con, n = 2)
[1] "ID,LastName,Address,City,State"
[2] "001,O'Higgins,48 Grant Rd.,Des Moines,IA"
> readLines (con, n = 2)
[1] "011,Macina,401 1st Ave., Apt 13G,New York,NY"
[2] "242,Roeder,71 Quebec Ave.,E. Thetford,VT"
> close (con)

The readLines() and scan() functions applied to a connection provide a
natural way to read very large data files piece by piece.
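As a minimal sketch of the read-filter-write pattern described at the start of this section, suppose we keep only the lines of a hypothetical file huge.txt that end in ,CA:
> incon <- file ("huge.txt", open = "r")
> outcon <- file ("keep.txt", open = "w")
> repeat {
      chunk <- readLines (incon, n = 10000)   # read a chunk of lines
      if (length (chunk) == 0) break          # stop at end of file
      writeLines (chunk[grepl (",CA$", chunk)], outcon)
  }
> close (incon); close (outcon)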
Analogously to readLines(), writeLines() writes a vector of characters out to a file (adding in the new-line separator). Again, passing a file name
causes the file to be opened, written, and closed – so in order to append data to
a file, you will want to open a connection and pass the connection to writeLines().
Besides opening, closing, reading, and writing files, there are two other
basic file operations to know about. When a file is opened for read or write,
R maintains a “pointer” that describes the current location in the file. (Files
opened for both read and write have two of these.) The seek() function
with its default arguments returns the current location in the file (in terms of
number of bytes from the start of the file). Passing the where argument lets
you re-set the file pointer to a position of your choice. This is useful if you need
to jump to a prespecified position. However, the help tells us that “use of seek
on Windows is discouraged.”
The final operation, flush(), can be used after a file output command to
ensure that the write operation gets committed to disk right away. Without this
operation, the output from the write operations may be cached – that is, saved
up by the operating system for a convenient time. The flush() command will
be useful when you are monitoring a program’s output, or when it is important
to perform a write right away to protect against a system crash.
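A small sketch using the addresses.csv file from earlier shows both uses of seek() (assuming Unix-style line endings, the header occupies 32 bytes):
> con <- file ("addresses.csv", open = "r")
> readLines (con, n = 1)      # read the header; the pointer advances
> seek (con)                  # returns the current position in bytes
> seek (con, where = 0)       # re-set the pointer to the start
> readLines (con, n = 1)      # reads the header line again
> close (con)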

6.2.3 Reading and Writing UTF-8 and Other Encodings

We described UTF-8 and other encodings in Section 4.5, and you have to expect
that you will be called on to read some non-ASCII text soon and ever more
frequently in the future. Many of the text-reading functions we describe in this
chapter, such as read.table() and scan(), have two arguments in place to
handle UTF-8-related chores. The fileEncoding argument lets you specify
the encoding in the file that you are reading in, while the encoding argument
specifies the encoding of the R object that contains the data read in. Remember,
though, as we saw in Section 4.5.3, that UTF-8 inside a data frame will often be
displayed with the <U+xxxx>-type notation.
When reading data line by line, it is best to ascertain in advance whether
the file contains UTF-8. Then it can be opened by passing encoding =
"UTF-8" to the file() command, and read using readLines(), again
with encoding = "UTF-8". There is no cost to passing these options
to a file containing simple ASCII text, since ASCII is a subset of UTF-8.
However, if the file was created with latin1 or another different encoding,
certain characters will be handled incorrectly with the UTF-8 options.
Writing UTF-8 is best accomplished by first creating a file connection with
the file() function with open = "w" and encoding = "UTF-8". Then
text can be written to the connection using writeLines() with the useBytes = TRUE argument. (We have not always had success using write
.table() with UTF-8 data.) By default the useBytes argument is FALSE,
which tells R to convert encoded strings back to the native encoding before
writing. For non-UTF-8 locales, particularly in Windows, this conversion can
lead to unexpected results. Be aware that operating systems and locales treat
UTF-8 differently – in some cases, inconsistently.
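A minimal sketch of this read-then-write round trip (the file names here are hypothetical):
> con <- file ("cities.txt", open = "r", encoding = "UTF-8")
> txt <- readLines (con, encoding = "UTF-8")
> close (con)
> outcon <- file ("cities-out.txt", open = "w", encoding = "UTF-8")
> writeLines (txt, outcon, useBytes = TRUE)
> close (outcon)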
6.2.4 The Null Character

The “null” is the character whose hexadecimal value is 00, sometimes denoted
in R by 0x00 using R’s hexadecimal notation (so this is not the same as R’s NULL
value). Null characters should generally not be found in regular text and in fact
are not permitted in R. However, the null is used as an end-of-string marker in
the C language, and you have to expect to encounter some text with nulls in it.
Moreover, we occasionally encounter text files with nulls embedded in them,
for whatever reason. By default, scan() and read.table() stop reading
when a null character is encountered and resume with the next field. Setting the
skipNul argument to TRUE allows scan() and read.table() to skip over
nulls. For delimited data, this is generally a safe choice, although you will not
see any warning that nulls were detected and skipped. If there are “intentional”
nulls in the text, and you need to detect and keep them, you will need to read
the file as binary data using readBin(), which we discuss as follows.
6.2.5 Binary Data

The readBin() function reads binary data. We rarely use this to read actual
binary data like images because data like that is rarely part of a data cleaning
exercise. However, sometimes the data is so messy that this is one of the
few approaches that seems to work (see our next example). It is also one way to
read text data with embedded nulls when you want to preserve all the characters
in the original text – as when a fixed-width file has embedded nulls.
The readBin() function requires that you specify the number of bytes
to be read, since it does not recognize end-of-line characters. The return
value of readBin() is a vector whose class is raw – and indeed it looks
on the screen like a stream of hexadecimal bytes. There are only a couple of
things we can do with raw data in R (unless we have some custom plan for
it). We can write it back out, using writeBin(), or we can convert it into
data in one of the R formats. If we know that the raw data represents text,
we can convert it with rawToChar() – unless there are embedded nulls.
If our vector myvec is raw, then nulls can be replaced with a command like
myvec[myvec == 0x00] <- as.raw(0x20). This will replace all the
nulls by the character whose value is hexadecimal 0x20 – the space. So if
necessary we might read in such a file in chunks of arbitrary size, convert the
nulls to spaces, then write the data back out as binary or convert to text. If all
has gone well, the resulting file should then be able to be read by read.fwf()
or readLines().
To set up the next example we construct a character vector representing some
entries in a fixed-width file. We embed a null in one of the entries and then write
the resulting vector to a disk file called nully.txt.
Each entry has 16 characters, counting the new-line appended for convenience. The strings contain a 3-character id (e.g., 001), a 10-character name
(Jenkins, including three trailing spaces), and a 2-character state.
> thg <- c("001Jenkins   MI\n", "002O'FlahertyIA\n",
           "003Lee       HI\n")

Now we replace the apostrophe in the second string with a null character.
Since nulls are illegal in R, some extra steps are required to make this modification. We convert the string to a raw vector named second.bytes
using the charToRaw() function. This produces the hexadecimal values
30 30 32 representing “002” and so on. We then replace the fifth element of
second.bytes with the null character, indicated by as.raw(0x00).
> (second.bytes <- charToRaw (thg[2]))
[1] 30 30 32 4f 27 46 6c 61 68 65 72 74 79 49 41 0a
> second.bytes[5] <- as.raw (0x00)
> second.bytes
[1] 30 30 32 4f 00 46 6c 61 68 65 72 74 79 49 41 0a
> rawToChar (second.bytes)
Error in rawToChar(second.bytes) :
embedded nul in string: '002O\0FlahertyIA\n'

The final command shows that after the modification, this vector cannot be
converted back to character because it has the embedded null.
Now let us write that vector to a disk file. In order to write the second.bytes item we will need to use writeBin(), so we will use that for
the other strings as well. In this code, we write the three strings individually to
a connection opened for binary writing.
> con <- file ("nully.txt", "wb")
> writeBin (charToRaw (thg[1]), con)
> writeBin (second.bytes, con)
> writeBin (charToRaw (thg[3]), con)
> close (con)

The nully.txt file is now complete; it consists of three text lines of which the
second has an embedded null. When we read that into R, read.fwf() omits
the part of the second line following the null character and emits a warning, as
we show here.
> read.fwf ("nully.txt", c(3, 10, 2))
  V1         V2   V3
1  1 Jenkins      MI
2  2          O <NA>
3  3 Lee          HI
Warning message:
In readLines(file, n = thisblock) :
  line 2 appears to contain an embedded nul

Despite indications in the help file, read.fwf() does not respect the skipNul argument. When we use scan(), we have two unappetizing choices. If
skipNul is FALSE, scan() will stop reading the second line after the null
just as read.fwf() does. But if skipNul is TRUE, scan() will produce
only 14 characters for the second line – and the fields will no longer line up. The
following example shows the output from scan() when applied to this file.
> scan ("nully.txt", what="", sep="\n", skipNul = TRUE)
Read 3 items
[1] "001Jenkins
MI"
[2] "002OFlahertyIA"
[3] "003Lee
HI"

The removal of the null has led to the second string being one character too
short. For example, its “state” field does not line up with the others’.
The best way around this may be by reading the binary data directly using
readBin(). Here, we show how that might be done, exploiting the fact that
each line is known to contain 16 characters. We open the file for binary reading
by passing the open = "rb" argument to the file() function, read all 48
characters, and convert null characters to space (the character with hex value
0x20).
> con <- file ("nully.txt", "rb")
> lns <- readBin (con, what="raw", n = 48)
> lns[lns == as.raw (0x00)] <- as.raw (0x20)
> rawToChar (lns)
[1] "001Jenkins   MI\n002O FlahertyIA\n003Lee       HI\n"
> (lns <- strsplit (rawToChar (lns), "\n")[[1]])
[1] "001Jenkins   MI"
[2] "002O FlahertyIA"
[3] "003Lee       HI"

The final command splits the string at the new-line values to produce a vector
of three equal-length strings. Given the starting and ending locations of
the fields, we can then produce a matrix by breaking the strings into their
component pieces. This example shows how that might be done.
> start <- c(1, 4, 14)
> end <- c(3, 13, 15)
> sapply (1:3,
+         function (i) substring (lns, start[i], end[i]))
     [,1]  [,2]         [,3]
[1,] "001" "Jenkins   " "MI"
[2,] "002" "O Flaherty" "IA"
[3,] "003" "Lee       " "HI"

This example is complicated, but handling data with embedded nulls or other
problematic characters is a difficult problem that really does arise in practice.
Sometimes, reading the data as binary is the only road available.
6.2.6 Reading Problem Files in Action

In this section, we show a function we wrote to handle a real-life problem reading text data. We were given a series of large XML files. We discuss XML in
Section 6.5.3, but for this purpose think of XML as text. For unknown reasons,
these files contained embedded null characters, and also a small number of
“control” characters (non-ASCII characters with hexadecimal values ≥ 0x80). In
general, XML may contain UTF-8 and other non-ASCII characters, but never
nulls – and these control characters were known to be erroneous.
Each one of these files took up around 600 MB and consisted of one text
line with no end-of-line character anywhere. Regular text tools run into problems handling lines of this size, as does R. Using scan() or readLines(),
R would try to read one of these files into memory as a character vector with
one (enormous) entry. Officially, a character string in R can contain just about
2 GB, but we had no success in reading this object into R in order to break it
into manageable pieces.
Instead our approach was to read each file as raw data in pieces. When we
converted the pieces to character, we discovered that the XML files contained
tags such as <tag> and </tag>, which supplied a natural place into which
to insert new-line characters to break the text into lines. It was also during this
investigation that we discovered the embedded null and control characters. We
wrote a small function to read the entire file bit by bit, remove the offending null
and control characters, add the new-lines at the appropriate places, and write
the resulting bits back out. The output from this function was a text file that
could be read handily, one line at a time, in groups of lines, or using an XML
reading function of the sort we describe in Section 6.5.3.
We give the details of this function in three pieces. In the first piece of
the function, we open the input file as binary and the output file as text.
The on.exit() functions (Section 5.1.3) ensure that the files are properly
closed even if the function aborts; without the add = TRUE argument the
expression in the second call would replace the action specified in the first. The
chunk argument describes the number of characters we will read at one time.
function (xml.in, xml.out, chunk = 10000)
{
    # Open input file as read only, binary
    fi <- file (xml.in, open = "rb")
    # Open output file for write only, text
    fo <- file (xml.out, open = "wt")
    on.exit (close (fi))
    on.exit (close (fo), add = TRUE)

In the second piece of the function, we read chunk characters as raw data. The
while(1) starts an infinite loop that is ended when the function encounters
break. If no data is read, the input file has been used up (we might say
“exhausted”). If fewer than chunk characters are read, we have reached the
end of the file – but we still have to process the final chunk. We find the
unwanted characters – the null, whose hexadecimal value is 0x00, and the
non-ASCII, whose values are 0x80 and above – and replace them by spaces.
    while (1) {    # loop until "break"
        # Read text. If none is returned, the file is empty.
        txt.raw <- readBin (fi, "raw", n = chunk) # the maximum
        if (length (txt.raw) == 0) break
        # Replace bytes that are 0x00 or >= 0x80 with 0x20 (space)
        txt.raw[txt.raw == as.raw (0x00) |
                txt.raw >= as.raw (0x80)] <- as.raw (0x20)
        txt <- rawToChar (txt.raw)

At the end of this last operation, txt contains the character values from the
input, except with spaces where nulls and other non-ASCII characters used
to be. In the rest of the function, we add the new-line characters after every
instance of </tag> (using the gsub() function from Section 4.4.6) and write
the text out. We cannot use writeLines() because we do not want to insert
a new-line character at the end of the 10,000-character chunk. Instead, we use
the writeChar() function. By default, this appends a null character as an
end-of-string terminator, but when we set eos = NULL the string is written
out with no terminator. (We could alternatively have opened the file to write
binary and used writeBin() to write the raw data. Writing text data with
writeBin() produces a null character in the output.) Finally, we check to see
whether the number of characters read is less than the size of chunk. If so, we
must have exhausted the input file, and we break, relying on on.exit() to
close the files. (We use number of characters read, rather than written, because
that number does not count the number of new-line characters inserted.)
        # Replace </tag> with </tag>\n wherever the former appears
        txt <- gsub ("</tag>", "</tag>\n", txt)
        # Write output. If txt is short, input is exhausted; quit.
        writeChar (txt, fo, eos = NULL)
        if (length (txt.raw) < chunk) break
    } # end "while"
}

This example demonstrates the sorts of tasks that we, as data scientists, are
called on to perform in order to read data as part of a data cleaning exercise. As
a data scientist, you will need to be prepared to handle this sort of messy data.

6.3 Reading Data From Relational Databases
A lot of data is stored in relational databases. A database has two parts: first,
a set of tables of data together with rules that describe their content and keys
that link the tables together. The second part is software to access the data in
customized ways. The software is often called a “relational database management system,” and the acronyms RDBMS or just DBMS are often used. Popular
examples of these programs include Access and SQL Server, from the Microsoft
Corporation; Oracle and the open-source MySQL, from the Oracle Corporation; and the open-source PostgreSQL and SQLite. All of these use something
called the “Structured Query Language,” or SQL, a (mostly) standardized language designed specifically for interacting with database management systems.
In this section, we discuss how to connect to a database and extract its data into
R for cleaning and management.
Remember that database programs are specifically designed and engineered
to hold and manipulate large amounts of data. So it is almost always a good
idea to let the database do the work of filtering and merging data sets, rather
than reading all the data into R and operating on it there. Of course, when the
data tables in question are small, it doesn’t matter much one way or the other
where the work is done. But where the data is large, we recommend using the
database as much as possible – which means learning at least a little SQL.
6.3.1 Connecting to the Database Server

The first step in using a database is connecting to it. With some exceptions
discussed in what follows, a database system will run as a “server” in a process – that is, a running program – located either on your computer, or on
another computer on your network. Some person or organization will need to
be in charge of that program (starting it, stopping it, updating the data and permissions, etc.). That person (the database administrator) will know the name
of the machine on which the server is running, and any user/password information you will need to access data in the system. In order to connect your R
session to the database server, you will need a driver, which is a piece of software that lets the operating system connect to the database. In some cases,
drivers will already be present on your computer, but in others, you will need
to acquire and install one yourself. Your database administrator will be able to
tell you what driver to use and how to install it. The first time you connect to
the database you will provide a “data source name,” or DSN, that will be used
to refer to the database on subsequent occasions.
Open Database Connectivity

The Open Database Connectivity (ODBC) protocol is an effort to make
different database software appear the same to clients like R. Just about all
databases support ODBC, and the R package RODBC (Ripley and Lapsley,
2015) provides an interface to ODBC-compliant databases. (Some databases,
such as the well-known Oracle one, support additional, specific packages that
may be more efficient than RODBC.) If you have a DSN named “source” in
place, for example, the command odbcConnect("source") will make the
connection; additional arguments let you specify a username (argument uid)
and password (pwd) if required. Once the connection is made, the function
returns an object (a “handle”) that holds the details about the connection.
This handle is then passed to all the other functions that communicate with
the database. In this example, we show how we might connect to a database
via ODBC, then use the sqlTables() command to list the names of all the
tables in the database.
han <- odbcConnect (dsn = "source", uid = "us", pwd = "abc")
sqlTables (han) # list all table names

Generally, the functions whose names start with odbc are the lower-level ones,
and the ones whose names start with sql are easier to use with data frames.
Once you know the names of the tables in the database, the sqlColumns()
function will return some information about the columns in a specific table,
such as their names and types.
While the connection is open, SQL commands (see the following sections)
can be issued directly through the sqlQuery() function. Once the connection han is no longer needed, it can be closed with either the close(han) or
the odbcClose(han) commands.
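For instance, continuing with the handle han from above, inspecting a table
and then releasing the connection might look like this sketch (the table name
accounts is an illustration, anticipating the examples in the next section):

sqlColumns (han, "accounts")   # names, types, and sizes of the columns
odbcClose (han)                # equivalently, close (han)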
6.3.2 Introduction to SQL

All relational databases use SQL to access and manage data. SQL is a big subject, and there are many books and web sites devoted to teaching it. Inevitably,
perhaps, different databases can have slight differences in the SQL and data
types they support. In this section, we describe the basic SQL commands that
you might need to extract data from databases into R. These commands will
normally be passed to the database through one of the ODBC functions, such
as sqlQuery(), as a character string – see the following few examples. SQL
commands are not case-sensitive; table and column names might be, depending
on the database. Here, we follow one common convention, where we put SQL
keywords in upper-case letters and table and column names in lower-case.
The SELECT Command

The most important SQL command is SELECT, which is the command used
to create a data frame in R from a table in the database. Suppose that we
know, from our earlier call to sqlTables(), that a particular table is called
accounts. Then the SQL command SELECT * FROM accounts will
return the entire table. We would normally specify this query as a character
string passed as an argument to sqlQuery(). This example shows how we
can acquire an entire table, presuming that the handle han created above is
still valid.
acc <- sqlQuery (han, "SELECT * FROM accounts")

Here, the asterisk asks for all columns, so after this operation completes, the
new data frame acc will have all the data from the database table accounts.
Often we will want only a subset of rows and columns. We can use the
sqlColumns() function to learn the names of the columns in the table;
we can then supply the desired names explicitly in the SELECT statement.
The SELECT command has many other possible additions that control the
selection of subsets of rows. For example, the next command will return only
the three named columns, and only the rows for which accyear was ≥2016.
recent <- sqlQuery (han, "SELECT accno, accyear, accamt
                          FROM accounts WHERE accyear >= 2016",
                    stringsAsFactors = FALSE)

The sqlQuery() function supports a stringsAsFactors argument but
not the colClasses one. In addition to returning the data itself, the database
can perform simple arithmetic operations. For example, nn <- sqlQuery
(han, "SELECT COUNT(*) FROM accounts") gives the number of
rows in the table. Other operators compute maxima or minima, combine
columns arithmetically, compute logarithms, aggregate data into groups, and
so on.
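As a sketch of what such an aggregating query might look like, the following
counts the rows and totals the amounts in the accounts table by year. The
column names are the ones used above, and the SQL shown is standard:

totals <- sqlQuery (han,
    "SELECT accyear, COUNT(*) AS n, SUM(accamt) AS total
     FROM accounts
     GROUP BY accyear
     ORDER BY accyear")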
Joining Tables

Besides extracting a table, or part of a table, the most important thing a database
can do for us is to match up two tables, according to the value of a column in
each one, and return the results of the match. In SQL, we call this a “join”; in R,
we call it a “merge,” based on the merge() function that does the same thing
for data frames (see Section 3.7.2). So, for example, suppose that in our database
table accounts has a column called accno and table payment has a column
called acct. Suppose each of these two columns contains account numbers
in the same format. We want to produce a new table with all the columns of
accounts and all the columns of payment, and all the rows for which the
two account numbers match. This command shows how SQL can be used to
join the two tables.
matchers <- sqlQuery (han, "SELECT * FROM accounts, payment
    WHERE accounts.accno = payment.acct")

The database is in charge of sorting the tables to make the account numbers
line up. Typically, we expect the result of this command to have as many rows
as there are account numbers that are common to the two tables. (This can
be different, though, if there are duplicated account numbers.) This so-called
“inner join” is just one way to join tables. The “left join” produces a new table
with one row for each entry in accounts (again, if there are no duplicates in
the index used to do the join). Entries in accounts with no match in payment
are returned, but of course there are missing values for those rows’ entries in
the columns from payment. This command shows how to produce a left join
between the tables in this example.
matchers <- sqlQuery (han, "SELECT * FROM accounts
    LEFT JOIN payment ON accounts.accno = payment.acct")

“Right” and “outer” joins (Section 3.7.2) are constructed similarly.
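For reference, those might look like the following sketch; the exact syntax
varies a little across databases, and some (MySQL, for one) do not support
FULL OUTER JOIN directly.

# One row for each entry in payment, matched to accounts where possible
matchers <- sqlQuery (han, "SELECT * FROM accounts
    RIGHT JOIN payment ON accounts.accno = payment.acct")
# Rows from both tables, matched up wherever possible
matchers <- sqlQuery (han, "SELECT * FROM accounts
    FULL OUTER JOIN payment ON accounts.accno = payment.acct")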
Getting Results in Pieces

The sqlQuery() command actually performs two tasks. First, it sends the
query to the database, and then it fetches the results. For really big tables, it
makes sense to separate these tasks; the first query starts the retrieval process, and then subsequent calls retrieve rows chunk by chunk. When we want a
single table, the sqlFetch() command will get the first batch, with the number of rows given by the max argument; subsequent calls should be made to
sqlFetchMore(), again using the max argument. When executing a more
complicated query, a call to sqlQuery() with max can be followed by a call
to sqlGetResults(). (The argument rows_at_time is an instruction to
the database, which does not affect the number of rows returned.)
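A minimal sketch of fetching a large table in pieces might look like this;
the chunk size of 10,000 rows is arbitrary, and testing for a data frame is
one simple way to detect that the result set is exhausted:

acc <- sqlFetch (han, "accounts", max = 10000)   # first 10,000 rows
repeat {
    more <- sqlFetchMore (han, max = 10000)      # next 10,000
    if (!is.data.frame (more)) break             # no rows remain
    # ... process or accumulate "more" here ...
}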
Other SQL Commands

There are many other SQL commands, but most of them (e.g., INSERT,
UPDATE, DELETE, and CREATE TABLE) are intended for data management
and will normally be used by the database administrator rather than by the
users of data. There are two more SQL commands other than SELECT that
every data handler should know. One is CREATE TEMPORARY TABLE, which,
as its name suggests, creates a temporary table that will be deleted when
the SQL connection is closed. (Whether you have permission to create such
a table is decided by the database administrator.) In a recent project, for
example, we needed to extract rows for a particular set of keys from a number
of large tables. It was convenient to create a temporary table and insert those
keys into it (which we did via the sqlSave() function). We could then
use sqlQuery() to construct an inner join that returned only the rows
corresponding to keys in the temporary table. In this way, we allowed the
database, rather than R, to be the data manager.
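In outline, that workflow might look like the following sketch. The table and
column names are invented for illustration, wanted.accounts stands for a vector
of keys we already have, and the exact CREATE TEMPORARY TABLE syntax varies
with the database:

# Create a temporary table on the server to hold the keys
sqlQuery (han, "CREATE TEMPORARY TABLE wanted (accno INTEGER)")
# Insert the keys into it
sqlSave (han, data.frame (accno = wanted.accounts),
         tablename = "wanted", append = TRUE, rownames = FALSE)
# An inner join now returns only the rows for those keys
got <- sqlQuery (han, "SELECT accounts.* FROM accounts, wanted
                       WHERE accounts.accno = wanted.accno")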
The second command worth noting is EXPLAIN. This command does not
perform any action; instead, it causes the database to describe what steps it
would have taken, had the EXPLAIN not been specified. This can be useful
when designing a query against a very large database, or one that will return
a very big result set. If there are two different approaches, EXPLAIN might
give information as to which is more efficient. Different databases approach
this in different ways, returning different information, and indeed some do not
support EXPLAIN at all, so consult your database’s documentation if you want
to use this feature.
Serverless Databases

Microsoft Access and SQLite are two “serverless” databases. With these products, the entire database, together with details of table layouts and so on, is
stored in one large file. Clients such as R treat the file as a database, acting as
their own server. This makes these two databases particularly well suited to
smaller applications. For connections to Access, you can either set the Access
file up as a DSN, or use the Access-specific functions like odbcConnectAccess() in the RODBC package to connect to the file. The RSQLite package
(Wickham et al., 2014) connects R to SQLite databases; it looks much like the
RODBC package, but the function names are slightly different. In either case,
SQL commands can then be used to extract data.

Our extended exercise in Chapter 8 uses the RSQLite package, so here we
present a few of the functions you will need to communicate with RSQLite
from R; a short example session follows the list. These functions include the following:
• dbConnect() initializes an RSQLite connection. We would normally
start a session with a command like han <- dbConnect (SQLite(),
dbname = <filename>), where <filename> is the name of the file containing
the data. This command creates a handle named han, which will be passed
to the other SQLite functions.
• dbListTables() lists the tables in the database. In our example, the command dbListTables(han) would produce a character vector with the
names of the SQLite tables.
• dbListFields() lists the fields (columns) in a table.
• dbGetQuery() executes a query and captures the returned data, analogously to sqlQuery() in the RODBC package.
• dbSendQuery() and dbFetch() create a query and then receive data
from it. These are similar in use to sqlQuery() followed by sqlGetResults(), except that dbSendQuery() returns no data at all; it only prepares the database to return data in subsequent calls to dbFetch().
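Putting these together, a session might look like the sketch below. The file
name mydata.sqlite and the table name accounts are placeholders, and the final
two calls release the result set and the connection.

library (RSQLite)
han <- dbConnect (SQLite(), dbname = "mydata.sqlite")
dbListTables (han)                  # names of the tables in the file
dbListFields (han, "accounts")      # columns of one table
acc <- dbGetQuery (han, "SELECT * FROM accounts")
# For a big result, fetch in pieces instead:
res <- dbSendQuery (han, "SELECT * FROM accounts")
chunk <- dbFetch (res, n = 10000)   # repeat until no rows are returned
dbClearResult (res)
dbDisconnect (han)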
Other (NoSQL) Databases

Recent years have seen the growth of the so-called “NoSQL” databases.
Sometimes, these actually support SQL-type commands; their “NoSQL”
nature comes from the way transactions are handled within the database.
Others have a different model for storing and extracting data. For example,
the well-known MongoDB system uses a version of JSON (see Section 6.5.3).
Most of these databases have the ability to connect from R, but you will need
to find the right package for each one.
We can only barely scratch the surface of SQL, and interacting with
databases, here. If you are going to use databases regularly, it will definitely be
worth your while learning more. Remember that data-management tasks such
as subsetting and merging will almost always be more efficient in the database
than in R.

6.4 Handling Large Numbers of Input Files
R provides a set of tools that allow us to interact with the computer’s operating
system. These allow us to perform tasks such as creating and removing directories and listing files that match patterns, and these tasks typically form part of
any data cleaning project. It is particularly important to be able to handle files
and directories automatically when there are hundreds or thousands of them.
For example, we sometimes receive data in the form of a set of directories, each
filled with zipped files. We want to avoid unzipping these files manually, and
instead use R to unzip them all at once.
There are at least two ways to have R interact with the operating system.
First, R has a set of built-in commands that perform file- and directory-related
tasks. These include list.files(), which allows you to list only the files
whose names match a regular expression (Section 4.4), unzip() for unzipping zip files, and a set of functions whose names start with file – among
them, file.copy(), file.remove(), and file.rename().
The second way to interact with the operating system is by calling system
utilities directly through the system() function. This method may allow more
flexibility, but will normally be less general, since the set of utilities available will
be different between operating systems. Using this approach will make your
efforts more difficult to reproduce.
Besides manipulating existing files, R also provides tools for downloading,
particularly the download.file() function. We talk more about acquiring
data from the Internet in Section 6.5.3. The ability to download and then
access a file entirely within R contributes to the reproducibility of your code
and solutions.
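For instance, a download-then-read step might look like this; the URL is a
placeholder:

# Keeping the downloaded copy on disk documents exactly what was used
download.file ("https://www.example.com/mydata.csv", "mydata.csv")
mydata <- read.csv ("mydata.csv", stringsAsFactors = FALSE)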
Example
As an example, consider a data set consisting of a directory containing 12
sub-directories of our working directory. Suppose that the sub-directories
have names such as 2016.Jan and 2016.Feb. For this example, each
sub-directory contains a few hundred comma-separated values files, each file
individually zipped. Each file has the same named columns in the same order.
The total size of all the files is not enough to overwhelm R, so the goal is to read
all of the files into one R data frame, with the resulting data frame containing
all of the original columns, plus two more columns giving the month and year
associated with each row.
This is only an example. In practice, the files might be huge or in an odd format. They might have been prepared with another tool like gzip or tar. As a data
scientist, you have to expect to receive data in whatever way the supplier can
imagine.
In this example, we would use unz() to open each of the zip files. To do
this, we need to know not only the name of the zip file but also the original file’s
name. If the name or names of the file or files inside a zip file are unknown, they
can be retrieved via the unzip() function with the list = TRUE argument;
but in this example, we will presume that the zip file’s name is the same as the
original name, except that the ending .csv has been replaced by .zip.
Suppose that the first file in the 2016.Jan directory is called Alameda.zip.
We can open and read that file with code like this:
fi <- unz ("2016.Jan/Alameda.zip", "Alameda.csv", "r")
out <- read.delim (fi)

Notice that read.delim() and the other read.table()
offspring can read from the connection directly. If the file was too big to read all
at once, we might use scan() – interestingly, readLines() is not available
for connections from unz(). We chose read.delim() here because it sets
the separator character to the tab, the quote argument to the double-quote
only, and header to TRUE. Of course, we may have to experiment to find the
proper settings of these arguments.
Now that we can read one file, we need to establish the loop to read them
all. Finding the names of all the zip files in any subdirectory below this one
is straightforward using the list.files() command. We can restrict the
directories to search by supplying a vector of sub-directory names, which might
be the output from the list.dirs() command, but here we will assume that
the only sub-directories present are the relevant ones. We need the full names
of the files in order to open them, but we want only the file name, not the path
part, in order to reconstruct the original file name. So we call list.files()
twice. In the first call, we extract only the file names. Then a call to sub() will
produce the name of the unzipped file by replacing the ending zip of the file
name with csv. This code shows how this might be done.
zips <- list.files (pattern = "\\.zip$",
                    recursive = TRUE, full.names = FALSE)
csvs <- sub ("\\.zip$", "\\.csv", zips)

Here, of course, "\\.zip$" is a regular expression that restricts attention to
files whose names end with .zip.
Now we need to know the year and month associated with each directory.
We can find this in the vector of full names, which we produce by calling
list.files() a second time with the full.names = TRUE argument.
Then the year appears in characters 3–6 (because each file name starts with
the working directory, ./), and the month in characters 8–10. In a more complicated situation, we might need a regular expression or another approach to
find these values. In this example, our code might look like this.
fullzips <- list.files (pattern = "\\.zip$",
                        recursive = TRUE, full.names = TRUE)
year <- substring (fullzips, 3, 6)
mon <- substring (fullzips, 8, 10)

At the end of this operation, we can construct a loop that sequentially reads
each file, adds a column with month and year, and appends it to a data frame.
This example shows what that code might look like.
result <- NULL # empty object
for (i in 1:length (zips)) {
    fi <- unz (fullzips[i], csvs[i], "r")
    out <- data.frame (Year = year[i], Month = mon[i],
                       read.delim (fi, stringsAsFactors = FALSE))
    result <- rbind (result, out)
}

In this example, we used repeated calls to rbind() to construct the result. This
is an inefficient operation, since R needs to adjust the size of the data frame on
every iteration of the loop. However, it is easy to implement, and we only
need to run this code once. A more efficient approach might be to create the
data frame ahead of time (if we knew or could estimate the total number of
rows needed) or to read the files separately and join them at the end.
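The last of those alternatives might be sketched as follows, reusing the
vectors built above; each file becomes one entry in a list, and the list is
combined in a single call at the end:

pieces <- lapply (seq_along (zips), function (i) {
    fi <- unz (fullzips[i], csvs[i], "r")
    data.frame (Year = year[i], Month = mon[i],
                read.delim (fi, stringsAsFactors = FALSE))
})
result <- do.call (rbind, pieces)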

6.5 Other Formats
A number of newer formats for storing data are becoming increasingly common. Handling this data often requires the use of packages that are not
included in base R. In this section we describe some of these formats and the
tools used to read and write them – but, as always, keep in mind that packages
tend to evolve faster than R itself does.
6.5.1 Using the Clipboard

The “clipboard” is a mechanism for moving text data between programs.
In Windows, R sees the clipboard as a file named “clipboard” (so this is
a bad name for a regular disk file). There is also a larger version called
“clipboard-128.” As we mention below, the clipboard is particularly helpful
for bringing in data from spreadsheets like Excel: you can simply highlight a rectangular area in Excel and copy to the clipboard; then in R,
read.table("clipboard", sep = "\t") will bring the data in.
All of the usual arguments to read.table(), particularly header and
stringsAsFactors, will need to be set in the usual way. Conversely,
we can put an R data frame into a spreadsheet by running a command
like write.table(mydf, "clipboard", sep = "\t") in R, then
switching to the spreadsheet and using that program’s “paste” command.
The procedure is somewhat different across Windows, Mac, and Linux,
however, and it does require the user to use the copy and paste commands. So an approach that uses the clipboard might be difficult for other
users to reproduce. Still, the clipboard is a good way to get quick and easy
access to data from other programs. In Mac, the two commands above
would look like read.table(pipe("pbpaste"), sep = "\t") and
write.table(mydf, pipe("pbcopy"), sep = "\t"). Linux is
more variable, but the help for the file() function gives some suggestions.
In using the clipboard to copy text in from other programs, you need to be
aware that some word-processing software may, depending on the settings,
automatically convert certain characters into others that “look nicer.” (This
doesn’t appear to be a problem with spreadsheet programs.) Specifically, if
you type into Microsoft Word "She said it wasn’t so -- and I
believed her", that program will convert the quotation marks to the
curved so-called “smart quotes” (like the ones seen surrounding “smart
quotes”). Similarly, the apostrophe will be curved and the two hyphens will
be converted into a dash. None of those characters are ASCII, and this text is
not a valid R string, because R does not recognize the quotation marks – so
if you try to paste this text directly into an R session, you will see an error.
If the text is surrounded with ordinary R-style quotation marks, then you
will have a valid R string. You can replace these non-ASCII characters using
gsub() – for example, gsub("[^[:print:]]", "", a) removes all
the so-called “non-printable” characters from a, leaving only the printable
ones: letters, numbers, punctuation, and spaces. An alternative is to copy the text from the word processor to the
clipboard and scan() it into R from there, but the way clipboards handle
non-ASCII text is subject to change.
Be aware that many applications use the clipboard. If you copy text from a
spreadsheet to the clipboard, but then copy anything else, your clipboard contents will be changed. This includes copying text in an R script window, as, for
example, when you highlight the text of a command and right-click to execute
that command.
6.5.2 Reading Data from Spreadsheets

In our work, we very often need to read data in from spreadsheets, and by a
very large margin the most commonly used spreadsheet is Microsoft Excel.
Excel uses at least two file formats, an older one identified by names that end in
.xls and a newer one identified by names ending in .xlsx. Different packages handle these formats: the gdata package has a read.xls() function,
while xlsx (Dragulescu, 2014) provides a read.xlsx() function.
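For example, the two calls might look like this; the file names and sheet
specifications are placeholders:

library (gdata)
old <- read.xls ("mydata.xls", sheet = 1)         # older .xls format
library (xlsx)
new <- read.xlsx ("mydata.xlsx", sheetIndex = 1)  # newer .xlsx format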
But an Excel “workbook” can contain many spreadsheets, and need not be
rectangular as a data frame must be. While it is possible to specify a specific
sheet of a workbook, we have found that if there are just a few moderate-sized
data sets, it is often fastest to copy them to the clipboard and read them into
R from there. Another common approach is to use Excel to write the spreadsheet as a tab- or comma-separated file.
Copying from the clipboard has at least two obvious disadvantages: first, it
makes it harder for another user to reproduce your work, since they need to
open the Excel workbook. This supposes that the user has Excel, or at least
one of the open-source spreadsheet programs that can read the Excel formats
(and often older formats or very complicated, macro-laden spreadsheets
can cause problems with these other tools). Then the user has to find the
proper sheet, and copy the proper cells to the clipboard – and this leads to
the second problem, because, as we mentioned earlier, the Windows and
Mac/Linux implementations of the clipboard are not the same. To make your work
reproducible you should give instructions as to how to read your data that
work on any machine. Alternatively, you might use the spreadsheet program
to save the worksheet in a text format, like a tab-delimited one. That way other
users will be able to read the data directly into R.
Another issue that has caused trouble in the past when reading data from
Excel spreadsheets via the clipboard is that confusion can arise if the rows have
different lengths. When reading from a file, the fill = TRUE argument to
read.table() will add blank fields on the ends of lines, but this has not
always been successful when reading from the clipboard. As a quick manual
adjustment, you can add a new column out at the right edge of the spreadsheet
containing all zeros, say, and then read the data in.
If you are confronted with a large number of spreadsheets, you will find it
necessary to use one of the packages like xlsx to automate the reading process.
The decision about whether to use a package to read an Excel file or to read it
through the clipboard depends on the number of sheets needed, their size and
complexity, and the sorts of documentation that are going to be required. We
usually read our Excel sheets in via the clipboard – but we acknowledge the
difficulties that come with that.
One aspect of Excel that can cause confusion is its convention regarding
dates. Dates can be displayed in many user-selectable formats, but internally
a date is represented by the number of days since December 30, 1899, so that
March 1, 1900 is day 61. (There is no support for dates before January 1, 1900,
and in an additional complication, day 60 is understood by Excel to have been
February 29, 1900 – even though 1900 was not a leap year.) Times are indicated
by fractional days, so 61.75 represents 6:00 p.m. on March 1, 1900. Often when
you read in an Excel column using one of the read.table() functions, your
output will contain text-format dates that can be handled in the usual way,
assuming you know the display format. (This is another example where you
want to remember to set stringsAsFactors to FALSE.) If you read in an
Excel column (say, df$indate) with numeric dates, they can be converted
to Date objects with a command such as as.Date(df$indate, origin = "1899/12/30"). If you need times of day as well, recall from
Section 3.6.2 that Date objects have their non-integer time parts truncated
under some circumstances. So the safest plan is to use Excel to export the
dates and times in a text format. Alternatively, you might convert the numeric
days to seconds, by multiplying by 86,400, and then converting the resulting
values into a vector of one of the POSIX time classes, with a command
like as.POSIXct(86400 * df$indate, origin = "1899/12/30",
tz = "UTC"). In this example, we selected the UTC time zone explicitly
since otherwise R would have chosen the local time zone.
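In context, the two conversions described in this paragraph might look like
this, assuming df$indate holds Excel-style numeric dates:

# Dates only: Excel day numbers become R Date objects
df$date <- as.Date (df$indate, origin = "1899/12/30")
# Dates and times: convert fractional days to seconds, then to POSIXct
df$when <- as.POSIXct (86400 * df$indate,
                       origin = "1899/12/30", tz = "UTC")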
6.5.3 Reading Data from the Web

A lot of data is now stored on the World Wide Web. Some of this is stored
statically, that is, in files on a server somewhere. This will often be HTML
data, like tables in ordinary web pages. However, a lot of data is not stored in
a page; rather, it is held in a database and returned dynamically, in response
to requests. For example, think of a web page that displays next week’s flights
between San Francisco and New York. This page is generated dynamically
in response to your input parameters, and might be different if generated a
few hours from now. Moreover, some services return responses in a more
complicated format like XML or JSON. In this section, we talk about how to
read files in common web-based formats; then we look briefly at retrieving
dynamic data from web servers.
Copying Tables from the Web via the Clipboard

HTML, the “hyper-text markup language,” is the format for displaying most
web pages. So a lot of data in the world is embedded in HTML, usually inside
tables. The HTML code for a simple 2 × 2 table might look like this example.
<table>
<tr><th>Name</th><th>Amount</th></tr>
<tr><td>Dylan</td><td>116.34</td></tr>
<tr><td>Garcia</td><td>953.21</td></tr>
</table>
HTML is made up of “tags,” like <table> and </table>, that often come in pairs – so, for example, a row of a table starts with <tr> and ends with </tr>. Tables in web pages are often easy to copy to the clipboard. We can simply highlight them with the mouse and use the usual copy command (control-C in Windows, command-C on Mac). Then read.table("clipboard") with the default value of the sep argument will very often produce acceptable results. You may want to set other options, such as header and stringsAsFactors, as well.

Reading Web Pages in HTML

If we need to acquire a large number of web pages, an automatic procedure is necessary. One simple way to acquire a web page, if the address is known, is through the getURI() function of the RCurl package (Temple Lang, 2015a). For example, the command groucho <- getURI("https://en.wikipedia.org/wiki/Groucho_Marx") will retrieve the Wikipedia page on Groucho Marx, in HTML, and save it as a character vector of length 1. (The task of converting from HTML to text is not, alas, a trivial one. We give one approach after we discuss XML below.) More often, we want to extract data that has been formatted in an HTML table. Our usual tool for ingesting HTML tables is the readHTMLTable() function from the XML package (Temple Lang, 2015b). It returns a list, with one entry for each table on the web page. For some web pages, this function works flawlessly. However, a lot of web pages are designed more for display than for serious data storage. Formats can change over time and the HTML used does not always adhere strictly to standards. Even when tables are returned cleanly, it is often true that the first row contains the headers, and the columns have all been made into factors. Although the output from readHTMLTable() may require additional processing, using that function is almost always a better solution than writing a custom function to look through the HTML tags, maybe using regular expressions.

If you will need to read substantial amounts of HTML as part of your project, we recommend downloading the HTML pages to your local machine, perhaps using download.file() to make a copy on disk, or getURI() to read the text into R. That way, if the developers of the web site change the format or data, your code will continue to work on your local copy.

XML

XML, despite standing for “eXtensible Markup Language,” is not so much a language as a method of defining strict rules for document layouts. An XML document will normally be expressed as a nested list in R. The XML package handles the reading and writing of XML documents, and in fact reads XML in two distinct ways. The xmlTreeParse() function reads in a file (or, with the asText argument, takes in some text that is already in R) and produces a tree-like object of class XMLDocument. This acts like an R list, so if you know the exact names of the fields of interest, you can extract them. As an example, consider this simple XML file:

<Fields>
<Record><Name>Dylan</Name><Amt>116.34</Amt></Record>
<Record><Name>Garcia</Name><Amt>953.21</Amt></Record>
</Fields>

If we write this into a file called example1.xml, we can read that file into an XMLDocument. The XMLDocument object is a long and complicated list. With enough information, we can extract the value of the first Amt field directly, as in this example.

> xx <- xmlTreeParse (file = "example1.xml")
> xmlValue (xx$doc$children[["Fields"]][[1]][["Amt"]])
[1] "116.34"

The xmlValue() function converts the list element, which is an object of class xmlNode, into text.
But clearly this approach is difficult for large or deeply nested documents. A better approach to handling XML is to read the file into R using the xmlParse() function, which produces an object of class XMLInternalDocument. We cannot extract elements from these documents using the list-type notation above, but documents of this sort can be searched with the xpathSApply() function using syntax derived from the XPath language intended for this purpose. A description of XPath is beyond the scope of this book, but lots of documentation and examples are available on the Internet. Here, we show the use of the simplest XPath command, the // command that performs a basic search. Notice that we search for Amt without needing to know the names of its parent nodes.

> yy <- xmlParse (file = "example1.xml")
> xpathSApply (yy, "//Amt")
[[1]]
<Amt>116.34</Amt>

[[2]]
<Amt>953.21</Amt>

The output of xpathSApply() here is a list of objects of class xmlNode. In the following example, we use the ordinary sapply() function to have xmlValue() operate on each of the xmlNodes, finally producing a vector of (character) amounts.

> sapply (xpathSApply (yy, "//Amt"), xmlValue)
[1] "116.34" "953.21"

The xpathSApply() function is also useful for converting HTML text retrieved from a URL into readable text. For example, recall that in the last section we saw how we might retrieve the Wikipedia entry for Groucho Marx using the getURI() function. Recall that the entry was saved in a character vector of length 1 named groucho. Then, analogously to the xmlTreeParse() function, the XML package provides a function htmlTreeParse(). When used with the useInternalNodes = TRUE argument, this function produces an object of class HTMLInternalDocument. This class is a particular sort of XMLInternalDocument. Therefore, the xpathSApply() function can be used as before. In this example, from Luis (2011), we search for the HTML “new paragraph” tag, <p>, and apply the xmlValue() command to extract the value from each paragraph found into a list. We then use unlist() to produce a vector. The following commands show how the text might be extracted from the HTML. The output is a character vector with one entry for each paragraph in the original HTML document.

> grou.tree <- htmlTreeParse (groucho, useInternal = TRUE)
> unlist (xpathSApply (grou.tree, "//p", xmlValue))
[1] "Julius Henry Marx (October 2, 1890 - August 19, ...
[2] "He made 13 feature films with his siblings the ...
[3] "His distinctive appearance, carried over from ...
:                        :                        :

XML handles Unicode in a natural way and elements of an XML document can be expected to be encoded in UTF-8.

JSON

JSON, the “JavaScript Object Notation,” is another text-based format for containing list-like data. For example, Twitter messages are often stored in JSON, one message per JSON object. A JSON object is enclosed in braces and essentially consists of a series of pairs like "name":value, separated by commas. The “value” part of the object can be a value or another series of "name":value pairs. JSON also supports an “array” object, analogous to an R vector. So a JSON object can be directly represented by a (possibly nested) R list. The rjson (Couture-Beil, 2014), RJSONIO (Temple Lang, 2014), and jsonlite (Ooms, 2014) packages read and write JSON objects and make the underlying JSON essentially transparent to the user.

Often you will be presented with a file containing a whole set of JSON messages, one per line. If the file is small, it might be easiest to read the entire file into R using scan() with sep = "\n" and what = "", and then apply one of the conversion functions like fromJSON() to each of the messages. In this example, we show what some of the data from the previous XML file might look like as text after its JSON representation is read into an R object named zz.

> zz
[1] "{\"Name\":\"Dylan\",\"Amt\":\"116.34\"}"
[2] "{\"Name\":\"Garcia\",\"Amt\":\"953.21\"}"

Now we can use sapply() to have the fromJSON() function (we used the one from the RJSONIO package) operate on each line.

> sapply (zz, fromJSON, USE.NAMES = FALSE)
     [,1]     [,2]
Name "Dylan"  "Garcia"
Amt  "116.34" "953.21"

The USE.NAMES = FALSE argument prevents the converter from constructing unwieldy column names. If the file is too big to import into R directly, it can be opened and read one line at a time, as discussed in Section 6.2 earlier in this chapter. Like XML, JSON is able to handle Unicode in a native way, so you should expect to see UTF-8 in the data you acquire.

Reading Data Via a REST API

REST, which stands for “representational state transfer,” is the most straightforward of the ways to send data from a web server to a client like R. (Another popular approach, SOAP, will not be discussed here.) Data suppliers will often permit clients to establish REST connections to retrieve data by means of specially formatted requests. The description of those requests is often published in an API, an “applications programmer interface.” For example, Google Maps publishes an API that describes how to extract map data from its database, and Bing Translate has a different API describing how to use that tool to translate text from one language to another. If you want to use a commercial API, be sure you understand the terms of use. In many cases, developers have created R packages to make using the API as straightforward as possible.
Examples include the RgoogleMaps package (Loecher and Ropkins, 2015) for Google Maps and the translateR package (Lucas and Tingley, 2014) for Bing Translate. Each API will have its own rules, particularly regarding authentication. If no package exists, but you can see the web page, you can often emulate the actions of a user by creating your own request. This can be sent to the server as a “get” or “post” request via the getForm() and postForm() functions in the RCurl package, or with the GET() function in the httr package (Wickham, 2015). In either case, you will need to know the name-value pairs that the server expects. Often these can be deduced by reading the underlying HTML in your browser, or by examining the requests sent to the server when you click manually (since for “get” requests that information appears in your browser’s URL line).

As an example, at the time of this writing, the US Census Bureau maintained an API that lets you look at certain values associated with international trade. This API, for which documentation is available at the URL www.census.gov/data/developers/data-sets/international-trade.html, expects a “get” request to be sent to the location api.census.gov/data/timeseries/intltrade/exports, with arguments get giving the fields to be retrieved and YEAR and MONTH specifying the year and month of interest. This command shows the sending of the request with the result being stored in an object called cens. In this example, we request the year-to-date value of exports ("ALL_VAL_YR") for each of the “end-use codes” ("E_ENDUSE") and their descriptions ("E_ENDUSE_LDESC") for April of 2016.

> url <- paste0 ("https://api.census.gov/data/timeseries/",
                 "intltrade/exports/enduse")
# For GET(), enclose API arguments via "query" as a list
> cens <- GET (url, query = list (
      get = "E_ENDUSE,E_ENDUSE_LDESC,ALL_VAL_YR",
      YEAR = "2016", MONTH = "04"))

The cens object belongs to a special class called “response.” Operating on this object with the content() function produces a list that can be reshaped into a matrix via do.call() and rbind(). This example shows that call and a small part of the result.

> do.call (rbind, content (cens))[1:4,]
     [,1]        [,2]
[1,] "E_ENDUSE"  "E_ENDUSE_LDESC"
[2,] ""          "TOTAL EXPORTS FOR ALL END-USE CODES"
[3,] "0"         "FOODS, FEEDS, AND BEVERAGES"
[4,] "00000"     "WHEAT"
     [,3]           [,4]   [,5]
[1,] "ALL_VAL_YR"   "YEAR" "MONTH"
[2,] "465400958493" "2016" "04"
[3,] "38996430386"  "2016" "04"
[4,] "1606226019"   "2016" "04"

The resulting matrix will often need some cleaning before it can be converted into a data frame, particularly with regard to column headers. In this well-behaved example, all of the components of the list produced by content() had the same length. In general, that might not be true. Here, we could extract every month for every year by passing suitable values of the YEAR and MONTH parameters. As elsewhere, it might be a good idea to save the data to disk so that its provenance is assured if the provider were to change the interface or modify the data. REST APIs almost always deliver their data as JSON, XML, or HTML.

Streaming Data

Sometimes, we need to capture streaming data, like feeds from sensors or from a supplier like the Twitter platform. We have not encountered the need for this in our own data cleaning problems, but at least in some cases the connection is straightforward.
We can open a “socket,” which is a communications endpoint, once we are given an IP address for the server and a “port number,” an integer that specifies the address of the socket. The socketConnection() function establishes the connection. You will probably want to specify blocking = TRUE so that subsequent reads do not complete until something is actually read.

6.5.4 Reading Data from Other Statistical Packages

In days gone by, a statistician would often need to read data saved in a format specific to another statistical package, such as SAS, SPSS, or Minitab. The recommended package foreign (R Core Team, 2015) provides interfaces to these and some other formats. Again, though, formats evolve and very often it will be safest and maybe even easiest to receive data as tab-delimited text or in another text format. In any case, our experience has been that the need to import data from other statistical packages is very much less than it used to be.

6.6 Reading and Writing R Data Directly

R’s own format for saving objects is a binary one, which is only readable by R itself. It is often useful to save objects in R format as an archive, a backup, or so that they can be passed to another user or machine, because the format of R data files is the same across all computers and operating systems. The primary functions for saving and loading R objects are saveRDS() and readRDS(), for individual objects, and save() and load(), for groups of objects.

The saveRDS() function (the letters RDS evoke “R data serialization”) takes an object and produces a disk file with all of that object’s data and attributes. Saving a data frame to a spreadsheet-type text file can lose some information, and in any case saveRDS() works on any R object. Each call to saveRDS() applies to exactly one R object. So, for example, saveRDS (myobj, "newfile") produces a disk file named "newfile" with a binary representation of the R object myobj. The complementary action is performed by readRDS(): in this case, readRDS("newfile") returns the object just as it was saved. Note that readRDS() does not replace the existing myobj; instead, it simply returns the object. You can save it with a command like newobj <- readRDS("newfile"), which will create a new object newobj that is identical to the original myobj.

You can save a whole set of objects with the save() function. The output of save() is one big file with all the referenced objects stored in it. The save() function lets you specify the objects as objects, so, for example, save(myframe, myresults, myfunction, file = "output") saves three objects into a file called "output". But, perhaps more conveniently, it also lets you specify objects by name using the list argument. Alternatively, that last command could have been written save(list = c("myframe", "myresults", "myfunction"), file = "output"). In this case, the second command requires slightly more typing, but the savings are clear when it comes time to save every object whose name starts with projectA, with a command like save(list = ls(pattern = "^projectA"), file = "output"). Here, of course, the caret sign (^) is part of a regular expression (Section 4.4) that extracts from ls() only objects whose names start with projectA. To save all the objects in your workspace, you can use the save.image() command.
This is a useful way to move your entire workspace over to a different machine, for example, and this command is called when you quit R and ask to “save the workspace.” The save.image() function creates or updates a file named .RData by default.

The load() function serves as the complement to save(), restoring all the objects stored on disk by save(). It will over-write any existing object with the same name as one being restored, so load() can be dangerous. A useful alternative is attach(), which allows you to add an R data file to the search path, that is, the set of places that R will look for objects. In this way, a set of objects can be made read-only. This is particularly useful for large, static data sets that need not be loaded into your workspace.

6.7 Chapter Summary and Critical Data Handling Tools

This chapter discusses getting data into R. Most often this will come in the form of text files arranged in a tabular format, with one observation per row and one measurement per column. Section 6.1 discusses different methods of reading these files, primarily through the read.table() function and its offspring. These functions will be able to handle a great many of these types of files. You will need to know whether the file is delimited or fixed-width. In the former case, you will need to know the delimiter, whether there are headers, whether we expect quotation marks, and other attributes of the file. For fixed-width files, we need to know the widths of each of the fields. However we acquire a text file, we will probably want to set stringsAsFactors = FALSE so that character data are not converted into factors. We may need to look for missing value indicators or other anomalies and re-read the data. Indeed, reading data into R is very often an iterative process, requiring us to refine our approach on each step until the data is exactly what we expect.

Some data cannot be read with read.table() because it is too broken, too big, or not tabular. Data is sometimes “broken” by having extra delimiters embedded, and we give an example of how to handle a file broken in this way. We usually approach broken files using scan() or, if necessary, readBin(). Section 6.2 discusses approaches for very large or non-tabular data. We can open a file and read it piece by piece, processing the pieces one at a time until the file is exhausted. This is the usual approach for log files, JSON or XML (formats we describe later in the chapter), or any sort of text that is non-tabular. Binary files can be read in this way as well, though this is rarely needed. One time this need does arise is when, for whatever reason, the source file includes embedded null characters. This section includes some code that we used in a real-life example where some (purportedly) text files both lacked new-line characters and also included embedded null and control characters. The example describes some preprocessing steps we had to take in order to make the data readable.

In Section 6.3, we describe how R can connect to a relational database in order to extract data. In order to interact with a database you will need to know at least a little SQL. This language works with all major databases and provides a framework for extracting and merging tables and subsets of tables.

Section 6.4 describes another common problem in data cleaning – how to handle large numbers of directories and files.
We give an example of the sort of task we are called on to perform. Handling problems of this sort requires more than just understanding how to read and process individual files. It is also necessary to know how to navigate the operating system, and to understand the R tools that make it possible to perform file-system tasks such as listing the files in a directory and managing zipped files.

We spend some time on acquiring data in other formats in Section 6.5. Most of the non-text data we get is in the form of spreadsheets, so it is important to know how to access the data inside these files. Often the clipboard provides a simple way of transferring the data from a spreadsheet, though this is not very automatic. The clipboard is also a reasonable way to grab one or two tables from a web page in order to read them into R. We also show how to read a table from a web page directly, via the readHTMLTable() function in the XML package. We can also read the HTML text of a web page – but then we are faced with the problem of extracting the important information from among the HTML tags. Data scientists need to be familiar with XML and JSON, two text-based formats that are in increasingly common use. Add-on packages provide the ability to read data in these formats into R objects. These are the common formats for data returned from the Web via a REST query, and we give an example of such a query. Finally, we discuss how R stores its own objects. R's internal format is binary and not human-readable, but there is no better format for passing R data between different machines.

7 Data Handling in Practice

In our experience, a data cleaning project arises out of a modeling or data exploration problem. We are given some data (or perhaps a description of data that the project sponsor plans to eventually provide) and, usually, a problem to be solved. There is no fixed method for undertaking a data cleaning project, but we think of the process as having four parts: acquiring and reading the data, actually cleaning the data, combining the data (when it comes from multiple sources), and preparing the data for analysis. (We sometimes use "data cleaning" both in a broad sense and as the name of a specific set of tasks. Here, we are using "data handling" as the umbrella term for these four tasks.) Of course, the "cleaning" part is never really finished, and often the most important cleaning tasks are discovered as data sets are combined, or even as the modeling proceeds. In this chapter, we describe the tasks associated with each of the four parts of data cleaning. Then, we emphasize the importance of reproducibility and documentation and give a detailed example at the end.

7.1 Acquiring and Reading Data

Acquiring data is, of course, the act of actually taking delivery of the data. Very often the final data, the data that will be used for building models, will come not in one big file but from a number of sources and in varying formats. So it is important for the data cleaner to be prepared to read in text, spreadsheet data, XML, and JSON, and to handle non-standard formats as well. The exercise in Chapter 8 includes examples of all of these data types. Acquiring data turns out to be more difficult than you might expect. Data providers are sometimes reluctant to release data, fearing the loss of proprietary or sensitive information. Some providers try to be helpful by providing summarized data, or by taking it on themselves to delete or fill in records with missing values.
Sometimes, the data sets are so big that just moving them can be a challenge. We have used e-mail attachments, DVDs, secure file transfer, downloads from cloud-based servers, and even hard drives sent through the mail for transfer. Our advice is to get as much data as possible, at as detailed a level as possible. It is disruptive to your project to realize after you receive the data from your first request that an important file is missing; the person sending you the data may be less inclined to focus on your second or third request. Of course, another issue is that often we don't know what's in the data until we receive it, and only then can we evaluate whether we have what we need – and, often, once we receive some data we understand enough about the problem to determine what data we should have asked for in the first place.

Once we have taken possession of the data, the next step is almost always to read the data into an R data frame using the techniques of Chapter 6. The initial cleaning process begins here, since it is at this point that you will start to identify missing value indicators, determine column classes, and get an idea of the size and complexity of the data. This is also a good point to ensure that your column names are meaningful and tractable. For instance, we recently got a data set of credit bureau data in which the nearly 200 columns had unwieldy names like "Number Open Installment Trades with Credit Limit <5000 with bal >0 and reported within 6 months." Our first task was to change all the names to ones that would be more manageable in R. This was time-consuming, tedious, difficult to automate – and necessary.

Another important element you will usually want in any data cleaning project is a key field, an indicator that uniquely identifies each observation. For a piece of equipment, this might be a serial number; for a person from the United States, it might be a Social Security Number (although these are sensitive, we generally ask our providers to construct and provide a different, unique identifier). In many cases, one individual may have multiple records – in a medical example, each person may have many visits to a doctor. We might then construct our own key, combining Social Security Number with date. It is particularly important to have a key field when combining data from different sources; we talk more about this in Section 7.3.3.

7.2 Cleaning Data

The actual steps in data cleaning will depend on the data, and on what we expect to find in each column. Of course, before anything is changed we want to see what the data looks like. In general, whenever we receive a data set, we tabulate the keys, looking for duplicates. If there are duplicate keys, we try to see if entire records are duplicated, and, if there are, we may delete the duplicates and record this fact. But if there are duplicate keys that go with distinct records, we have to evaluate whether this describes real observations, errors, or another condition. We look at every column's type (character, numeric, etc.), and count the number of missing values in every column. Even when there are thousands of columns, it is useful to tabulate the column classes and to draw a histogram of, or get summary statistics on, the numbers of missing values. Often a set of columns will have the same number of missing values, and you will then find that they are all missing on the very same observations.
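As a minimal sketch of these checks (dat is a hypothetical name for the data frame under inspection), we might write:

table (sapply (dat, function (x) class (x)[1]))       # tabulate column classes
na.count <- sapply (dat, function (x) sum (is.na (x)))
summary (na.count)    # summary statistics on the numbers of missing values
table (na.count)      # groups of columns often share the same NA count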
Frequently, this will arise because those observations all failed to match some specific data source when the data frame was being constructed. You then have to decide whether to keep those columns, remove them, or maybe create a new logical column describing whether each observation was, or was not, missing those columns. For numeric columns, we very often want to see the mean value (among non-missing items) and, usually, the range. The range helps us spot anomalies – negative numbers where they are unexpected, or 999 codes used as missing values – quickly. For columns that are supposed to be integer or character, we will often want to know the number of unique values; if this is very different from what we expect, we would investigate further. For example, a column measuring "number of home mortgages" will most commonly have the values 0 or 1. If such a column had hundreds of distinct values, we might want to investigate. Conversely, a column named "Salesperson ID" might be expected to have hundreds of values. In that case, if there were only a few, we would want to understand why. For categorical or integer variables with only a few values, we will tabulate them, looking for unusual values. If we encounter a column with values "A," "B," "Other," and "other," for example, we will almost certainly consider combining those last two values. As with numeric columns, it is helpful to look at maxima and minima of date fields. We tabulate dates by month or quarter, looking for anomalous conditions like months with no observations.

Once we have identified missing or clearly erroneous data, we face a decision. If the data came recently from a specific provider, we might return to that provider, point out the issues, and hope for corrected data. More often, we will have to act on our own and take steps to keep those values from disrupting our analysis. For example, suppose we have a data set giving information about soldiers. If a field giving a particular soldier's age contains the value 999, and we expect to need the age in our analysis, we might make a note of the soldier's identification or other key value, then delete that record and continue. Deletion should be a rare tactic. If, say, 30% of soldiers had the 999 code, we might need to do something else. We might replace the erroneous value with a valid one using an "imputation" method, or we might mark the 999 values as NA. Alternatively, if we do not anticipate using the age in modeling or other efforts, we might let the value stay as it is – but the fact that there are some soldiers whose age is equal to 999 is still important information about the quality of the data. You will want to report data quality issues to the data provider and project sponsor.

Each data frame will need to be examined on its own, but the cleaning process also needs to consider the collective set of data frames that go into the project. For example, one data frame might indicate sex by a code like M or F and maybe a missing or unknown code, say, Z. A second data frame might have a column that gives TRUE for females and FALSE for males. Either of these schemes is adequate on its own, but if the data frames need to be combined, we will need to select a scheme from one data frame and impose it on the other, as in the sketch below.
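As a minimal sketch (assuming the second data frame is called b and its logical column is called Female, both hypothetical names), we might impose the first data frame's M/F/Z scheme like this:

b$Sex <- ifelse (b$Female, "F", "M")   # TRUE -> F, FALSE -> M
b$Sex[is.na (b$Female)] <- "Z"         # the missing/unknown code
b$Female <- NULL                       # drop the old column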
Moreover, if two data sets provide the same information – sex, in this example – for the same observations, it is generally a good idea to see how often they agree, as a measure of data quality. In the case of sex, we expect to see almost no disagreements – people do change sex, but rarely, and disagreements are probably more likely to be due to transcription or other errors. More often, in our experience, we will see two sources agreeing almost all the time when both are present, but one source reporting many more missing values than the other. This can give us information about the overall data quality of the sources. Inevitably, experience and outside knowledge will help you spot anomalies. As an example, we were given a data set of car loans in which the cars were about 70% used and 30% new. When the same company sent us an update, the new file described cars that were about 30% used and 70% new. Neither is unreasonable taken alone, but when we saw both files we were certain an error had been made – and we were right.

7.3 Combining Data

We discussed combining data frames in Section 3.7. In that section, we focused on the mechanics of using R to combine data frames. As we described there, data frames can be joined in three ways: by row (stacking vertically), by column (combining horizontally), or using a key (which we call merging). This section looks at some of the practical considerations we encounter when combining data frames in real data cleaning problems.

7.3.1 Combining by Row

As we noted earlier, most data cleaning projects involve data from a number of sources. These might be different sets of similar observations that should be combined row-wise. For example, we might get identically formatted tables from each of several laboratories that should be combined, one table on top of the next, into the final data set. Suppose we wanted to combine data sets named NYC, ATL, and PHX, representing input from our labs in New York, Atlanta, and Phoenix. First, we will need to ensure that the data sets have exactly the same column names – often names will differ in case, or in the use of dots as separators. Second, columns with the same name should have the same class – character, numeric, logical, or date. Third, we usually want the sets of values in categorical variables to match. If one data set's column Success has values Yes and No, say, and the other uses TRUE and FALSE, we will want to modify one to match the other. Fourth, we examine the key fields to ensure that they will be unique in the new, combined data set. If not, we might construct a new key that adds NYC, ATL, or PHX onto the front or rear of the existing key. Even if the keys are unique, we normally append a new column to each table, giving its source, just to be unambiguous.

We combine data sets like these with R's rbind() command, which we introduced in Section 3.7.1. This function will respond properly if columns with the same names appear in different orders in the two data sets. However, it will not complain if categorical columns do not have the same sets of values. So when we combine data sets in this row-wise manner, we often use code like that in the following examples to ensure that they do have the same values. We start by ensuring that all three data frames have the same number of columns, and that their names match. We sort the names because the columns might be in different orders in the different data frames.
# Values at right show expected output from each command
length (unique (c (ncol (NYC), ncol (ATL), ncol (PHX))))   # 1
all (sort (names (NYC)) == sort (names (ATL)))             # TRUE
all (sort (names (NYC)) == sort (names (PHX)))             # TRUE

We now check that the column classes match one another. Here we use the slightly different technique of calling all.equal() on the two vectors rather than all() on the comparison. The all.equal() approach will be necessary when comparing lists.

all.equal (sapply (NYC, class),
           sapply (ATL, class)[names (NYC)])   # TRUE
all.equal (sapply (NYC, class),
           sapply (PHX, class)[names (NYC)])   # TRUE

The class() function returns a vector of length 2 or more, rather than a single entry, for some columns (like, e.g., columns of POSIX date objects). In that case, we can replace sapply (NYC, class) by sapply (NYC, function (x) class(x)[1]).

In the next step, we identify the sets of categorical or factor variables, and extract from each one its unique values. In the following code, we explicitly convert any factor variables to character for the purpose of comparing their unique values. (Notice the vectorized | operator; the short-circuiting || operator is not appropriate here.)

cats <- names (NYC)[sapply (NYC, class) == "character" |
                    sapply (NYC, class) == "factor"]
levs.nyc <- lapply (NYC[,cats],
                    function (x) unique (sort (as.character (x))))
levs.atl <- lapply (ATL[,cats],
                    function (x) unique (sort (as.character (x))))
levs.phx <- lapply (PHX[,cats],
                    function (x) unique (sort (as.character (x))))
all.equal (levs.nyc, levs.atl) &&
    all.equal (levs.nyc, levs.phx)   # TRUE

The levs objects are lists of levels, which we sorted alphabetically (since unique() does not sort). The all.equal() functions produce logical values, which we combine with the && operator. If the result is TRUE, we know that all three data frames' sets of character or factor variables have exactly the same sets of values. The final step in this operation is to create a column identifying each data frame's source and then to combine the three data frames, as we show here.

NYC$Source <- "NYC"
ATL$Source <- "ATL"
PHX$Source <- "PHX"
big <- rbind (NYC, ATL, PHX, stringsAsFactors = FALSE)

Notice the use of the stringsAsFactors = FALSE argument in the call to rbind(). As we said in Section 4.6.5, rbind() behaves more gracefully with factors, or mixtures of factors and characters, than some other R functions. Still, we recommend using the stringsAsFactors = FALSE argument during the data cleaning process. We generally want to create factors only at the end of the cleaning process (see Section 7.5).

7.3.2 Combining by Column

Two data frames with the same number of rows can be combined "horizontally" using the data.frame() function. This straightforward operation produces a result whose number of columns is simply the sum of the numbers of columns in the components. However, the joining is done naïvely. If the components are data frames A and B, the first row of A is joined to the first row of B, the second to the second, and so on. The data.frame() function ensures that the column names in the result are distinct, so some of the column names in the second data frame may be changed. The cbind() function does not de-conflict column names, so we recommend using data.frame() for this job. The row names in the result come from the first data frame that has (non-default) ones.

7.3.3 Merging by Key

A more sophisticated operation is needed when two data sets have information about the same observations but possibly in different orders.
In this case, the data needs to be "merged," that is, combined into one wide data set. We do this by joining the tables according to their key values; in R, we use the merge() function (see Section 3.7.2). In its simplest form, merge() takes two data sets and returns a new one with the values from each key combined into one row. If keys are unique, and every key appears in both data sets, this operation is straightforward. The behavior when the keys are duplicated is slightly tricky; we make sure that our keys are always unique. We may also need to decide what to do about records that appear in one table but not both. The merge() function accepts arguments that control this behavior.

Two facts about merge() are worth remembering. First, the return from the merge() function is, by default, sorted according to the key, so the order of the rows may have changed compared to the ordering in the source tables. A second, more important point has to do with what happens when the two source tables have columns with the same name. If both of the original tables contain a column named State, for example, the output will contain two columns State.x and State.y. The name State is now not enough to identify a column unambiguously, so code that worked on the State column in either of the two original data sets will fail on the merged one. Moreover, if the merged data is now merged again with a third input, that third input will contribute a column just called State (since that name is now not a duplicate). If you intend to keep both of a pair of like-named columns after a merge, we recommend changing their names in a controlled way, ahead of time.

7.4 Transactional Data

One type of data that is worth further mention here is transactional data. In contrast to tabular data, in which we might expect one row per key, transactional data sets often have many rows for each key, with each row representing a single transaction. Think of a log of clicks at a website, for example, identified by user; each user may contribute dozens of clicks in a single session. Or think of a data set of bank transactions identified by the customer's account number. If the interest is in individual transactions, then the data set is well on its way to being useful. However, if the interest is in account numbers, we may want to summarize all the transactions for a particular customer so as to produce a data set with one row per account rather than one row per transaction. In the rest of this section, we illustrate working with transactional data by reference to a real-life problem we encountered recently. This example is lengthy, but it serves to show some of the techniques that will be useful when handling this and other sorts of data.

7.4.1 Example of Transactional Data

We were given a large data set regarding a particular survey taken by soldiers that was intended to elicit the soldier's emotional state. The key field was an identifier that was unique to each soldier. This survey was administered approximately every year, so most soldiers appear multiple times in this data set. The actual survey data set had several million rows, and each row contained the answers to perhaps 150 questions, but for this demonstration we will show a sample survey data set with only one question (to which the answer is an integer on a scale of 1–5). Our sample survey data frame survey has seven rows, as shown here.
Like the original, our sample data is sorted by Date within ID.

> survey
  ID       Date Response
1 AA 2012-09-26        3
2 AA 2014-01-16        4
3 CC 2013-03-13        3
4 CC 2014-04-30        5
5 CC 2015-03-31        4
6 DD 2013-06-03        2
7 EE 2013-12-02        4

We would consider this to be tabular data if our interest were primarily in each survey response. If our interest were primarily in each soldier, we would consider this to be transactional data. We have already seen some ways to summarize data like this. For example, we might compute the average Response for each ID using the tapply() function, with code like with (survey, tapply (Response, ID, mean)). More generally, we might be interested in building a data set with one row per soldier, where each row has the earliest date taken, perhaps, the average response, the number of responses, and so on.

One useful tool for transforming transactional data into a tabular form is the rle() function from Section 2.6.5. Since the records are sorted by ID, the rle() function, applied to the IDs, computes the "runs" of ID values. It returns a list with two elements named lengths and values. The values component is a vector of distinct soldier IDs (one entry for each ID for which there is at least one survey), and the lengths component gives the number of times that ID appeared in each run. That number is, of course, an integer giving the number of surveys taken by each soldier. This code shows the list produced as output of the call to rle(). Despite the special print format, the elements of this object can be accessed in the same way as elements of an ordinary list.

> (rle.out <- rle (survey$ID))
Run Length Encoding
  lengths: int [1:4] 2 3 1 1
  values : chr [1:4] "AA" "CC" "DD" "EE"

This output is useful for many tasks that need to be performed on transactional data. For example, since lengths gives the lengths of the runs for each ID, we can identify the ending points of the runs for each soldier with end <- cumsum (rle.out$lengths). Then the starting points can be computed with start <- c(1, end[-length(end)] + 1) – here we start the first run at 1 and start each subsequent run just after the previous run's endpoint. With these starting and ending points in hand, we can answer even difficult and complicated questions about the contents of the records within each run. For example, suppose we wanted to identify soldiers who had an increase of exactly 1 between one value of the Response and the next. Assuming that records are sorted by Date within ID, we can construct a function diff1 that extracts the set of records associated with a particular index of start and end and determines whether any of the differences are equal to 1. Then we can apply that function to each pair of corresponding start and end values, as shown in the following code.

> diff1 <- function (i) {
      recs <- survey[start[i]:end[i],]
      if (any (diff (recs$Response) == 1)) TRUE else FALSE
  }
> sapply (1:length (start), diff1)
[1]  TRUE FALSE FALSE FALSE

The output shows us that AA is the only soldier with an increase in Response of exactly 1 from one survey to the next. In this example, we could also have used tapply(), which has the additional advantage of not relying on global variables (as our diff1 relies on start, end, and survey). However, tapply() cannot be applied to a data frame. If we needed all the information for each soldier, another alternative is to combine split() with lapply(), but that approach seems to be slower and more memory-intensive.
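As a minimal sketch of that tapply() alternative applied just to the Response column (tapply() operates on a vector, so it can reproduce the diff1 result without start and end), we would expect:

> with (survey, tapply (Response, ID, function (r) any (diff (r) == 1)))
   AA    CC    DD    EE
 TRUE FALSE FALSE FALSE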
7.4.2 Combining Tabular and Transactional Data

We continue this example by reference to the real-life problem that inspired it. We were actually given two large data sets. One contained the surveys we described earlier. The second listed soldiers who had undergone deployment overseas, giving an ID and the starting and ending dates of the deployment. Many soldiers had more than one deployment. When read into R, the actual deployment data frame had some hundreds of thousands of rows, but for this example we use five deployments. This code shows what the five sample deployments look like, as stored in an R data frame called deploy.

> deploy
  ID      Start        End
1 AA 2014-05-05 2014-11-08
2 BB 2012-10-15 2013-07-19
3 CC 2013-08-16 2014-04-03
4 CC 2015-11-01 2015-05-17
5 EE 2013-02-20 2013-05-18

Like the surveys, this data set has been sorted to be in ascending order of ID. Our ultimate goal was to identify soldiers who had taken the survey both before and after deployment, to see whether deployments might be associated with changes in emotional state. To do that, we wanted to create a data set with one row per deployment. Each row of this final data will indicate the latest survey for each soldier, among those that preceded the deployment start date, and the earliest survey among those that follow the deployment end date. In this way, we would have the surveys that most closely bracketed the deployment. Soldiers might appear twice in the data set, possibly bracketed by different pairs of surveys. In fact it would, in theory, be possible for two deployments by the same soldier to be bracketed by the same pair of surveys, but we judged this to be unlikely and not damaging to the analysis.

In this example, the deployment data set is already tabular (since we are interested in deployments, not individual soldiers), and the survey data set is transactional. A natural first step is to look for missing values, especially in responses and dates. Once we were satisfied with the quality of the data, we decided to create, as an intermediate product, a data set with one row per deployed soldier, each row containing the dates of all the surveys for that soldier. Of course, different soldiers take the survey different numbers of times, so we found the maximum number of surveys taken by any soldier with a command like max (rle.out$lengths). We might have used max (table (survey$ID)), although we would expect that to take much longer. In this case, the maximum number of surveys taken is three, by soldier CC. So we created a character matrix with three columns. We used characters because Date objects in a matrix get converted to numeric. This character matrix was called datemat. This code shows how the datemat matrix might be constructed. Notice that, for the moment, this matrix is constructed without an ID.

> deploy.unique.ID <- unique (deploy$ID)
> datemat <- matrix ("", nrow = length (deploy.unique.ID),
                     ncol = max (rle.out$lengths))

In order to fill datemat, we use the rle() function from the example above and the idea of a matrix subscript from Section 3.2.5. However, we need to exclude soldiers who took the survey but are not in the set of deployers. In this code, we remove those entries from the survey data frame, producing surv.short, and then re-run rle().
> surv.short <- survey[is.element (survey$ID, deploy$ID),]
> rle.out <- rle (surv.short$ID)

We can now create a two-column matrix subscript, to be used to enter dates into the datemat matrix. Each row of the matrix subscript, as you recall, gives a row and column number of an entry to be filled. The column numbers are the values from 1 to 3 giving the column of datemat where the date of the survey will be recorded. For example, the first soldier, AA, took two surveys, so we will fill columns 1 and 2 of that soldier's row. We will want to generate a vector like 1:2 for that soldier. The second soldier took three surveys, so we will fill all three columns, using a vector like 1:3, and so on. We can compute a list with the relevant vectors using sapply(), as we show in the following code. Then unlist() extracts a vector of values from the list.

> surv.lens <- sapply (rle.out$lengths, function (i) 1:i)
> (col.surveys <- unlist (surv.lens))
[1] 1 2 1 2 3 1

The col.surveys vector gives the columns into which the dates need to be placed. The IDs that go along with each of these entries are precisely the entries in surv.short$ID. However, rather than the text IDs, we want the (numeric) row numbers of datemat, which we can get using the match() function to match the IDs from the surveys to the set of unique deployment IDs. This code shows how we construct the vector giving the row of datemat desired for each survey.

> (row.surveys <- match (surv.short$ID, deploy.unique.ID))
[1] 1 1 3 3 3 4
> (mat.subset <- cbind (row.surveys, col.surveys))
     row.surveys col.surveys
[1,]           1           1
[2,]           1           2
[3,]           3           1
[4,]           3           2
[5,]           3           3
[6,]           4           1

We can now see that this matrix can be correctly used as a subscript. (We can also see why datemat has no extraneous columns like ID; we want to refer to the first date column as column number 1.) The date of the first survey should be entered into the first row and first column of datemat; the second survey was taken by the same soldier, so its date, too, should go into the first row, but this one should go into the second column; and so on.

The final steps are these. First, we use the matrix subscript to fill the datemat matrix with the survey dates. We convert the dates to character explicitly. Then we create a data frame consisting of the unique IDs from the deployment file, together with the dates from the datemat matrix. We merge the original deploy data to this new data frame by ID, creating a data set with one row per deployment (the merge() command, you will recall, takes care of the fact that some soldiers deploy more than once). These commands show the construction of the final combined data set, which we call dd.

> datemat[mat.subset] <- as.character (surv.short$Date)
> date.df <- data.frame (ID = deploy.unique.ID, datemat,
                         stringsAsFactors = FALSE)
> (dd <- merge (deploy, date.df, by = "ID"))
  ID      Start        End         X1         X2         X3
1 AA 2014-05-05 2014-11-08 2012-09-26 2014-01-16
2 BB 2012-10-15 2013-07-19
3 CC 2013-08-16 2014-04-03 2013-03-13 2014-04-30 2015-03-31
4 CC 2015-11-01 2015-05-17 2013-03-13 2014-04-30 2015-03-31
5 EE 2013-02-20 2013-05-18 2013-12-02

With this data set in hand, we can now determine which deployments are bracketed by surveys. We need to be aware that the Start and End columns are Date objects, whereas the survey dates in the three rightmost columns are character. We also need to be aware that some entries in those last three columns are empty.
For each row, we extract the largest survey date "to the left of" the deployment (i.e., smaller than the start date) among the non-empty survey dates, using -Inf when none is found. Then we extract the smallest survey date to the right of the deployment end, using Inf if none is found. This code defines a function bracket() that performs this computation and shows the effect of it operating on each of the rows of dd. The result is named brack.

> bracket <- function (i) {
      dat <- dd[i,]            # extract ith row
      dts <- dat[,4:6]         # grab dates and...
      dts <- dts[dts != ""]    # omit empties
      left <- max (-Inf, dts[dts < as.character (dat$Start)])
      right <- min (Inf, dts[dts > as.character (dat$End)])
      c(left, right)
  }
> (brack <- sapply (1:nrow (dd), bracket))
     [,1]         [,2]   [,3]         [,4]         [,5]
[1,] "2014-01-16" "-Inf" "2013-03-13" "2015-03-31" "-Inf"
[2,] "Inf"        "Inf"  "2014-04-30" "Inf"        "2013-12-02"

As we noted in Section 5.1.2, bracket() is a function that relies on variables in the workspace (in this case, dd). We usually avoid this reliance, but this function provides a convenient way to operate on the rows of dd, and we will only use this function once. The result in brack gives one column per row of dd. Only the third row of dd – the first deployment of soldier CC – is bracketed by surveys.

Now we simplify the dd data set by combining its ID with the transpose of the brack matrix. Having identified the "bracketing" survey dates, we now want to extract the responses for those surveys. The natural way to do this is to identify every survey by a unique key consisting of the soldier's ID and the date. We construct these keys both in the new dd data set, and also in the original survey data set. Then match() can be used to bring in the responses from the two bracketing surveys. This final piece of code shows how this is accomplished. Notice that the columns from brack are now named X1 and X2.

> dd <- data.frame (ID = dd$ID, t(brack), stringsAsFactors = TRUE)
> left.key <- paste0 (dd$ID, ".", dd$X1)
> right.key <- paste0 (dd$ID, ".", dd$X2)
> # Add key, then use match
> survey$key <- paste0 (survey$ID, ".", survey$Date)
> dd$ResponseLeft <- survey$Response[match (left.key, survey$key)]
> dd$ResponseRight <- survey$Response[match (right.key, survey$key)]

Figure 7.1 Example population flowchart.

The two remaining sets of observations are then merged into one data set of 230,025 observations. In subsequent cleaning steps, we lose observations that fail to match back to a particular archive, that have unexpected values of a field called RCODE, and whose start dates are too recent. The rectangle in the bottom right gives the number of observations that have survived the different cleaning steps.

We also recommend that you create, as one of your products, a master table with every key and the disposition of the record for that key (in our example, "no payment information," "illegal RCODE," etc.). That way you and the sponsor can quickly identify the outcome associated with every record.

Perhaps the most important goal of the documentation that you produce is to make the data handling process "reproducible." That is, any R user should be able to take your data, and your scripts and functions, run through the data handling process from beginning to end, and produce exactly the same output that you did. Sometimes, this will be impossible – if, for example, you are acquiring data from a database that gets updated frequently. But usually you want to think of this as a requirement.
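A minimal sketch of such a master table follows; the names raw, Key, Payment, RCODE, and legal.rcodes are hypothetical stand-ins for the fields in this example.

master <- data.frame (Key = raw$Key,
                      Disposition = "Ready for modeling",
                      stringsAsFactors = FALSE)
master$Disposition[is.na (raw$Payment)] <- "No payment information"
master$Disposition[!is.element (raw$RCODE, legal.rcodes)] <- "Illegal RCODE"
table (master$Disposition)   # these counts should agree with the flowchart

Notice that a record failing more than one check receives the disposition assigned last, so we would apply the checks in the same order as the cleaning steps themselves.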
In fact, it may be worth moving your data, scripts, and functions to a new directory (or even a new machine, perhaps with a different operating system) and re-running your handling process there, to ensure that you aren't, for example, relying on variables in your R environment that won't be present in the new location.

Notice that the ordering of the steps will affect the numbers in the flowchart. Suppose that some of the rows were both missing payment information and also had illegal RCODEs. Then the counts associated with those two exclusions will depend on the order in which they are applied. In this case, we would expect the final count to be the same regardless of the order of the operations (although the master table may change). However, in other cases, particularly those involving judgment (next section), we have to expect that two different data cleaners will end up with two slightly different data sets, depending on the choices they make.

7.7 The Role of Judgment

The data scientist is confronted with data and with rules for preparing it, and yet is inevitably called on to exercise his or her judgment during the cleaning process. As an artificial example, imagine a data set called indat that includes city names and two-letter state abbreviations. As part of the data cleaning process you, as the analyst, construct a table of the length (i.e., nchar()) of the abbreviations, expecting to see every entry with the value 2. However, you find that a number of state abbreviations in fact have four characters each, so you tabulate the set of state abbreviations with four characters. So far, your code might look like this:

table (nchar (indat$State))
   2    4
9203   84
table (indat$State[nchar (indat$State) == 4])
CITY
  84

Something seems to have gone wrong here. You tabulate the values in the City field for the 84 records whose State field has the CITY value, and you see this:

table (indat$City[nchar (indat$State) == 4])
NEW YORK
      84

At this point, the issue seems clear: these 84 records presumably should have been recorded as "city = 'New York', state = 'NY'" but instead were recorded as "city = 'New York', state = 'CITY'." Having identified the issue, though, some steps remain. Are the data suppliers aware of this anomaly? They will need to be informed, maybe right away if fresh data is required, but maybe, for comparatively unimportant problems, in the final project documentation. Does this error suggest that other fields in those same records are less reliable? Is it acceptable to change the state for these 84 records to NY? Are there other entries correctly labeled "city = 'New York', state = 'NY'" (or "city = 'New York City', state = 'NY'"), and, if so, do these differ from the mislabeled ones, perhaps in terms of date?

Issues like this arise in almost every data cleaning problem. The analyst has the responsibility of alerting the data provider to problems, but she must also forge ahead. The most important rule is that everything must be documented. Documentation will appear in the code, but it must also be in the human-readable materials returned to the client. We are always wary of making changes to our customers' data, and yet we very often end up doing just that – informing the customer of what changes were made, why, and how many records were affected. One more place where the analyst's judgment can be required is in missing value handling.
Some missing values appear to have arisen more or less at random; the rest of the record, and entries for that field in other records, are generally not missing. Missing values like these can often be incorporated into the modeling process. Other records might have almost all missing entries. Since these carry almost no information, they will often need to be discarded. On the other hand, fields that are present for these observations might be informative about some items of interest. For example, if a set of records, missing most fields, contains age and sex information, those values might be useful when assessing the population of customers, even if they are not useful for modeling outcomes. So, at each stage, it is important to report the number of observations used for any particular graph, table, or statistic.

Often values marked as missing actually indicate something. For example, one field might carry information about whether a customer has ever declared bankruptcy. Suppose that such a field has only two unique values: Y and the missing value NA, and suppose that only a few percent are marked Y. It might be reasonable to judge that NA was recorded for customers with no bankruptcy, change those to N, and proceed (documenting this change, of course, and reporting it). Now suppose that this field had three distinct values: Y making up, say, 2% of the values, N making up 94%, and NA accounting for the remaining 4% of values. Here, the way forward is not as clear.

Judgment also arises when combining the levels of categorical values. For example, one field might record whether an applicant rents a home, owns it, lives with parents, or has some other housing situation; imagine that the documentation specifies that these choices will be represented with the values R, O, P, and X, respectively. Suppose that when you tabulate that field, you find about 60% R values, 20% O values, 10% P values, and 3% X values; but you also find 4% r, in lower case, and 3% missing values. Even though r is not listed as a possible choice, it seems reasonable to assign those entries to the R group, with corresponding documentation to be included in the final report. On the other hand, if instead of r the value had been J, we might have made a different choice. And what about the missing values? Since R is the most common group, it will under some circumstances make sense to insert an R into those records. Alternatively, since a missing value is "other than R, O, or P," we might move the items with missing values into the X group – or we might create a new label called Missing. The path to take depends on the frequency of anomalous entries, the frequencies of the existing entries (as, here, where R is the most common), and the ultimate goals of the analysis, and these call for the analyst's judgment in consultation with the data providers.

7.8 Data Cleaning in Action

In this section, we demonstrate some of our approaches on data inspired by a real problem we investigated. This data concerns somewhat more than 100 homes in the eastern part of the United States. It comes in the form of three CSV files, which you can find in the cleaningBook package. Each property is described by its size (in number of bedrooms and number of bathrooms), the state in which it is located, and its area (in square feet). This information is held in two separate files, named BedBath1.csv and BedBath2.csv.
The electrical usage, measured at a meter, for each property has been recorded over six consecutive calendar quarters. This information is held in a third file, EnergyUsage.csv, which should include all the homes in the first two files. Homes are identified by a serial number, which consists of one of the letters A, H, or P, followed by three digits and then a "trailer" made up of three more digits; trailers are one of 001 through 009. The ultimate goal of the effort was to model the changes in electrical usage across time, particularly in specific subsets of houses that are similar in terms of their size and location. The goal of the data cleaning portion is simply to combine the three data sets into one that is suitable for the modeling effort. In the following sections, we handle the files one at a time.

7.8.1 Reading and Cleaning BedBath1.csv

Reading

Since the BedBath1 file has a name ending in .csv, it seems reasonable to presume that it is comma-separated. We might start by using scan() to extract and examine a few rows of the file, but very often we wade right in with the hope that the first call will produce the desired result. In this case, we start with a call to read.csv() and then show a few rows. Recall that this function calls read.table() with quote = "\"" and comment.char = "" as well as sep = ",".

> bb1 <- read.csv ("BedBath1.csv", stringsAsFactors = FALSE)
> bb1[1:4,]
                                SerText Bed  Bath SqFt State
Prop # P477005 and land               3   3  1580   PA
H644009 as identifier                 3   2 1/2 1490   CT
House Num H260008                     2   1   910   CT
Property A834009    house and land    2   1 1/2  980   CT

We have learned a few things here. First, it appears that we have encountered an embedded-comma problem – the fourth row has one more entry than the others. As a result, the text from the first column has been made into row names. If any row had had too many commas, read.csv() would have failed. Second, the numbers for Bath must be text because they include entries like 2 1/2. We may want to convert that number to 2.5. Third, the property identifier is enclosed inside some text, and not in a consistent location from row to row. We will need to extract the ID from the surrounding text.

The first thing to handle is the embedded comma. Let us double-check by extracting the first few rows of text using scan() to ensure that that is, in fact, the issue. This code shows how we might do that.

> bb1 <- scan ("BedBath1.csv", sep = "\n", what = "")
Read 77 items
> bb1[1:5]
[1] "SerText,Bed,Bath,SqFt,State"
[2] "Prop # P477005 and land,3,3,1580,PA"
[3] "H644009 as identifier,3,2 1/2,1490,CT"
[4] "House Num H260008,2,1,910,CT"
[5] "Property A834009, house and land,2,1 1/2,980,CT"

There really is a comma after the property ID, and before the word "house," in the fifth line. In the next step, we use count.fields() to see how frequently there is one extra comma and to examine what the rows with extra commas look like.

> table (count.fields ("BedBath1.csv", sep = ",",
                       comment.char = ""))
 5  6
66 11
> bb1[6 == count.fields ("BedBath1.csv", sep = ",",
                         comment.char = "")]
[1] "Property A834009, house and land,2,1 1/2,980,CT"
[2] "Property P589004, house and land,3,3,1560,NY"
[3] "Property P450001, house and land,3,2,1400,PA"
     :                   :                  :
[8] "Property A303001, house and land,3,2,c. 1460,PA"
     :                   :                  :

All 11 of the lines with six fields include the "house and land" notation, so we will just remove the first comma from each of those 11 lines.
Notice also that in the eighth line the square footage value, c. 1460, is not numeric. We will need to address this shortly. First, though, we remove the first comma in each of the over-long lines using sub(). Then we can pass the result as the text argument to read.csv() to produce a data frame.

> comm <- (6 == count.fields ("BedBath1.csv", sep = ",",
                              comment.char = ""))
> bb1[comm] <- sub (",", "", bb1[comm])
> bb1 <- read.csv (text = bb1, stringsAsFactors = FALSE)
> bb1[1:4,]
                          SerText Bed  Bath SqFt State
1         Prop # P477005 and land   3     3 1580    PA
2           H644009 as identifier   3 2 1/2 1490    CT
3               House Num H260008   2     1  910    CT
4 Property A834009 house and land   2 1 1/2  980    CT

In the first command, we located the rows with six fields (i.e., five commas). In the second, we removed those commas. After the call to read.csv() in the third line, we can see that the fields appear to line up, at least in the first few rows. We move on to the cleaning step.

Cleaning

Now we examine the classes of the columns using a call to sapply().

> sapply (bb1, class)
    SerText         Bed        Bath        SqFt       State
"character"   "integer" "character" "character" "character"

This command shows that the Bed field has been recognized as integer, which suggests that the fields are properly lined up. It also confirms our previous observation that the Bath and SqFt columns will need to be cleaned in order to make them numeric. Now let us extract the ID from the SerText field. Since we know that the ID consists of an A, H, or P followed by six digits, we can extract the ID with a regular expression, as shown in the next code segment.

> (bb1$ID <- regmatches (bb1$SerText,
      regexpr ("[AHP][[:digit:]]{6}", bb1$SerText)))
[1] "P477005" "H644009" "H260008" "A834009" ...
> table (is.na (bb1$ID))
FALSE
   76
> table (nchar (bb1$ID))
 7
76

The second command, of course, shows us that no IDs are missing, and the third, that every ID has seven characters. We might look further into the ID values by ensuring that they all really do start with A, H, or P, that they all end with a three-digit number of the form 00x, and so on. As to the Bath column, we can use sub() to replace every instance of 1/2 with .5. Notice how the text to be replaced includes the space that separates the two parts of, for example, 2 1/2. After that substitution, we should be able to convert the column to numeric. We will create a new variable with the numeric version of Bath first, to ensure that the conversion goes successfully. This code shows how we can compare the original Bath with the newly constructed version.

> bath <- as.numeric (sub (" 1/2", ".5", bb1$Bath))
> table (bath, bb1$Bath, useNA = "ifany")
bath   1 1 1/2  2 2 1/2  3
  1    5     0  0     0  0
  1.5  0    18  0     0  0
  2    0     0 18     0  0
  2.5  0     0  0    20  0
  3    0     0  0     0 15

We can tell that this conversion has succeeded by examining the table, and we can see that there are no missing values in the bb1$Bath column. As a result, we can replace the Bath column in the data set with the bath variable just constructed, as we do here.

> bb1$Bath <- bath

The final preparatory step for this data involves addressing the non-numeric entries in the SqFt column. The non-numeric entries, of course, would be turned into NA by as.numeric(). The following command shows how we can use this fact to show the set of non-numeric SqFt entries.

> bb1[is.na (as.numeric (bb1$SqFt)), "SqFt"]
 [1] "∼ 1510"       "c. 970"       "∼ 1460"
 [4] "c. 1470"      "c. 920"       "approx. 1510"
 [7] "∼ 1580"       "c. 1460"      "∼ 1600"
1460" "∼ 1600" [10] "∼ 1560" "approx. 1440" "∼ 1460" [13] "approx. 1450" Warning message: In ‘[.data.frame‘(bb1, is.na(as.numeric(bb1$SqFt)), "SqFt"): NAs introduced by coercion In another example, there might have been too many to view, but here we can see that the non-numeric entries come in three types: those that start c., those that start ∼, and those that start approx. The next step requires some judgment. We will elect to remove the qualifiers and report the square footage using the integer given; but we would certainly report this choice. One way to extract the numeric part is, as before, with a regular expression. In the first command, as shown below, we search for a sequence of digits, and then use regmatches() to extract that sequence. This produces a character vector we name sqft. Then, we extract the values of sqft, which correspond to non-numeric entries in the original bb1$SqFt vector. We display the first, second, and the sixth elements of that subset to ensure that each different sort of “approximate” notation is corrected properly. > sqft <- regmatches (bb1$SqFt, regexpr ("([[:digit:]]+)", bb1$SqFt)) > sqft[is.na (as.numeric (bb1$SqFt))][c(1, 2, 6)] [1] "1510" "970" "1510" # warning message suppressed Now that sqft appears to contain the correct value for every row, we can convert it into a numeric vector and assign the result to the SqFt column of bb1. This code shows that operation, together with the first few rows of the modified data frame. > bb1$SqFt <- as.numeric (sqft) > bb1[1:3,] SerText Bed Bath SqFt State ID 1 Prop # P477005 and land 3 3.0 1580 PA P477005 2 H644009 as identifier 3 2.5 1490 CT H644009 3 House Num H260008 2 1.0 910 CT H260008 Our data cleaning for this file is nearly complete. The SerText column is redundant but not otherwise bothersome. However, a few checks remain. Data Handling in Practice First, we examine the classes of the column to ensure they are as we expect – character for the first and last two, numeric for the others. Second, we look for duplicated rows and duplicated ID values. We could use any(duplicated()) here, but we are in the habit of using table(). That way, if there are duplicates, we know how many are there. > sapply (bb1, class) SerText Bed Bath SqFt State "character" "integer" "numeric" "numeric" "character" ID "character" > table (duplicated (bb1)) # duplicated rows? FALSE 76 > table (duplicated (bb1$ID)) # duplicated IDs? FALSE TRUE 75 1 Apparently there is a duplicated ID. Let us determine its value and then extract the rows for that ID. > bb1$ID[duplicated (bb1$ID)] [1] "P888009" > bb1[bb1$ID == "P888009",] SerText Bed Bath SqFt State ID 23 Property ID P888009 2 1.5 920 PA P888009 49 P888009 as identifier 2 1.5 920 PA P888009 The good news is that these two rows are identical except for the original SerText. It is almost certainly safe to delete one of the two. Still, the fact of the duplication might reveal something about the process by which the data has been collected, stored, or transmitted, and it should be reported. The following commands show how we might delete one of these two rows as well as the SerText column. We examine the dimensionality, mostly for our awareness. Finally, we tabulate the State column, again just for our general awareness. > bb1 <- bb1[-23,] # using row number from above > bb1[bb1$ID == "P888009",] # double-check! 
                 SerText Bed Bath SqFt State      ID
49 P888009 as identifier   2  1.5  920    PA P888009
> bb1$SerText <- NULL           # delete column
> dim (bb1)
[1] 75  5
> table (bb1$State, useNA = "ifany")
CT NY PA
28 17 30

Deleting row 23 is a quick operation, but it is slightly dangerous, in that if it were inadvertently repeated a different row would be (wrongly) removed. It might be safer to use a command like

> bb1 <- bb1[-(which (bb1$ID == "P888009")[2]),]

Here, if there is a second row with the given ID, it will be removed. If not, the value of which (bb1$ID == "P888009")[2] will be NA, and the attempt to compute bb1[-NA,] will produce an error. As a final cleaning step for this data frame, we might examine summary statistics for our numeric columns (Bed and SqFt), looking for missing and anomalous values – but in this case, it may make sense to wait until the final data set has been constructed.

7.8.2 Reading and Cleaning BedBath2.csv

Reading

We continue the data cleaning exercise by reading the BedBath2.csv file. As before, it is not implausible that everything will be well formatted, so we call read.csv() and examine the first few lines.

> bb2 <- read.csv ("BedBath2.csv", stringsAsFactors = FALSE)
> bb2[1:3,]
[1] "P957001\t3\t2\tNew York\t1440"
[2] "H429005\t2\t1.5\tNew Jersey\t950"
[3] "P226003\t2\t1.5\tNew York\t930"

Apparently this file, despite having a name that ends in .csv, is in fact tab-separated. Naturally, we try again, adding sep = "\t" to read.csv() as shown here. (We might just as well have used read.table() or read.delim().) We also examine the dimensions of the resulting data frame and the classes of its columns.

> bb2 <- read.csv ("BedBath2.csv", sep = "\t",
                   stringsAsFactors = FALSE)
> bb2[1:3,]
   PropId Bed Bath      State Size
1 P957001   3  2.0   New York 1440
2 H429005   2  1.5 New Jersey  950
3 P226003   2  1.5   New York  930
> dim (bb2)
[1] 40  5

The data frame appears to have been constructed successfully.

Cleaning

A natural first step in the cleaning process, as with bb1, is to examine the classes of the columns. This code shows how we can do that.

> sapply (bb2, class)
     PropId         Bed        Bath       State        Size
"character"   "integer"   "numeric" "character"   "integer"

We see a few things here. First, the IDs, called PropId here, seem to be whole (at least, in the first few rows). We might use table (nchar (bb2$PropId)) to examine them further. The Bath and Size columns are properly numeric as well. The State entries are full names rather than two-letter postal codes, but this is easily handled. The data appears to have been read in properly. Let us look for duplicate keys, both within this data set and between the two.

> table (duplicated (bb2))
FALSE
   40
> table (duplicated (bb2$PropId))
FALSE
   40
> length (intersect (bb1$ID, bb2$PropId))
[1] 0

Our 40 PropId values are distinct, and they do not overlap with any of the ID values in bb1. In order to join the two bb data sets, we will need to convert the two-letter state codes into state names, or vice versa. Let us tabulate the bb2 state names.

> table (bb2$State, useNA = "ifany")
Connecticut  New Jersey    New York
         10          15          15

It seems reasonable to convert these state names to their corresponding two-letter codes. We could construct an ad hoc cross-reference table for this purpose, but we can just as easily construct a table of state names and codes from the built-in R objects state.name and state.abb. The following code shows how we can build this cross-reference table.
> state.xref <- data.frame (Name = state.name, Abbr = state.abb,
                            stringsAsFactors = FALSE)
> state.xref[1:3,]     # check
     Name Abbr
1 Alabama   AL
2  Alaska   AK
3 Arizona   AZ

Now we use match() to extract the row numbers of state.xref where the entries of bb2$State are found. Extracting the Abbr entries for those row numbers produces the desired two-letter codes. We call that vector of codes state.2 and cross-tabulate it with the original State values as a check.

> state.2 <- state.xref$Abbr[match (bb2$State, state.xref$Name)]
> table (state.2, bb2$State, useNA = "ifany")
state.2 Connecticut New Jersey New York
     CT          10          0        0
     NJ           0         15        0
     NY           0          0       15

Since that has worked, we can now replace the State values with the state.2 vector. The following command completes the data cleaning for BedBath2.csv.

> bb2$State <- state.2

It may be worth noting, particularly for readers accustomed to SQL, that we could have "joined" the bb2 and state.xref tables with merge(), as in this example.

> merge (bb2, state.xref, by.x = "State", by.y = "Name",
         all.x = TRUE)

This command produces one row for every entry in bb2, since all.x = TRUE. The State column of bb2 is matched with the Name column of state.xref.

7.8.3 Combining the BedBath Data Frames

We can now stack the two bb data frames vertically; they have the same columns, of the same types, with the same sets of values. We need only to ensure that the column names match. To do that, we can rename the PropId and Size columns of bb2 to ID and SqFt, so as to match bb1 (or, of course, vice versa). We need not re-order the columns; the rbind() function for data frames will take care of that. This code shows the renaming, and the call to rbind() that produces the final BedBath data set.

> names (bb2)[names (bb2) == "PropId"] <- "ID"
> names (bb2)[names (bb2) == "Size"] <- "SqFt"
> bedbath <- rbind (bb1, bb2, stringsAsFactors = FALSE)

Notice that to identify the first column to rename we use [names (bb2) == "PropId"] rather than [1]. The ID column is column number 1, but perhaps in another iteration it might end up in a different position. Relying on the name, rather than the number, is safer. Once we have the combined data set, we should examine it. It is always a good idea to check for missing values, to count the numbers of unique entries, to construct a few more tables, and to compute summary statistics. The following commands show a few of the computations that we might perform in this example.

> sapply (bedbath, function (x) sum (is.na (x)))
  Bed  Bath  SqFt State    ID
    0     0     0     0     0
> sapply (bedbath, function (x) length (unique (x)))
  Bed  Bath  SqFt State    ID
    3     5    35     4   115
> with (bedbath, table (Bed, Bath, useNA = "ifany"))
   Bath
Bed  1 1.5  2 2.5  3
  2  7  25  0   0  0
  3  0   0 31  21 23
  4  0   0  0   8  0
> summary (bedbath$SqFt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    860     985    1450    1334    1495    1690

The first command shows that this data set has no missing values. The second shows, first, that the ID values are unique, as intended, and, second, that there are only 35 distinct values of SqFt. These values were rounded, so this is not implausible. The cross-tabulation of Bed and Bath is also plausible, with bigger houses having both more bedrooms and more bathrooms. Finally, the summary() of the SqFt column shows no obvious anomalies.

7.8.4 Reading and Cleaning EnergyUsage.csv

Reading

Now we read in the data set with the energy usage data. We start with read.csv() as before.
> kwh <- read.csv ("EnergyUsage.csv", stringsAsFactors = FALSE)
> kwh[1:3,]
    Serial X2016.Q1  X2016.Q2  X2016.Q3 X2016.Q4
1 A594-005 2271.074  2245.196   2767.31  3713.24
2 P957-001 3565.268  3255.392  2880.485 3446.105
3 H625-003  2919.32 2334.8498 2821.3652     #N/A
   X2017.Q1  X2017.Q2
1 3647.2811 3499.3139
2   3440.72   3485.93
3      #N/A      #N/A
> sapply (kwh, class)
     Serial    X2016.Q1    X2016.Q2    X2016.Q3
"character" "character" "character" "character"
   X2016.Q4    X2017.Q1    X2017.Q2
"character" "character" "character"

The file appears to be comma-delimited, but the "energy usage" columns are all of mode character. For at least the last few, this must be, at least in part, due to the #N/A missing value code. Recall that R produces valid column names when it reads data in. In this case, it has added the letter X to the front of column names like 2016.Q1; we will continue to use those modified names. Let us read the data again, specifying that value as the missing value string. We also observe that the Serial field looks as if it contains the ID, but with a hyphen.

> kwh <- read.csv ("EnergyUsage.csv", na.strings = "#N/A", stringsAsFactors = FALSE)
> kwh[1:3,]
    Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
1 A594-005 2271.074 2245.196 2767.310 3713.240 3647.281 3499.314
2 P957-001 3565.268 3255.392 2880.485 3446.105 3440.720 3485.930
3 H625-003 2919.320 2334.850 2821.365       NA       NA       NA

The data has been properly read in. Some missing values remain, and we will keep our eye on them to see if, for example, they arise more frequently in some time periods or locations than in others.

Cleaning
We start by examining the classes just as before.

> sapply (kwh, class)
     Serial  X2016.Q1  X2016.Q2  X2016.Q3  X2016.Q4  X2017.Q1  X2017.Q2
"character" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"

Since the columns all have the expected class, we turn our attention to the Serial field. We will need to remove the hyphen to get these values to match up with bedbath. This code uses sub() to perform that task.

> kwh$Serial <- sub ("-", "", kwh$Serial)
> kwh$Serial[1:4]    # check
[1] "A594005" "P957001" "H625003" "P462004"

That conversion seems to have succeeded, although as before we might look more deeply. Let us now look for duplicate rows, duplicate keys, and duplicate values within columns.

> table (duplicated (kwh))            # duplicate rows?
FALSE
  116
> table (duplicated (kwh$Serial))     # duplicate keys?
FALSE  TRUE
  115     1
> sapply (kwh, function (x) length (unique (x)))    # duplicate values?
  Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
     115      112      108      104      103       94       95

Let us start with the item with the duplicate value of Serial. As we did before, we will extract both of the rows for this Serial value.

> kwh$Serial[duplicated (kwh$Serial)]
[1] "P888009"
> kwh[kwh$Serial == "P888009",]
    Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
16 P888009       NA       NA 2797.175 3700.685       NA 3608.844
88 P888009 2244.566 2526.494 2797.175 3628.685 4310.686 3608.844

First, we note that the duplicated key is the same one that was duplicated in the bed and bath data. Second, the two rows for this key are not identical, as they were there. Our inclination is to drop the first of these two rows, reasoning that the two agree wherever both are non-missing, except in one place (the X2016.Q4 column). Again, this is a judgment we make pending communication with the data provider. In this next piece of code we remove row 16, identified as the first of the duplicates.
> kwh <- kwh[-(which (kwh$Serial == "P888009")[1]),]

The issue of duplicates within columns is subtler. Each of the numeric columns has at least a few duplicate values, and this is probably not surprising; we might expect a few matches just by coincidence. However, there may be a trend toward more duplication as we move to the right of the data set. In the next line, we use the table(table()) approach to examine the frequency of common values. If common values arose just by coincidence, we would most likely see sets of pairs, each sharing a common value, rather than a group of four or five entries with the very same value.

> table (table (kwh$X2017.Q2))
 1  6
93  1

There is, in fact, a set of six observations with the very same value in the X2017.Q2 column. This seems unexpected. In this code, we display those six observations.

> which (6 == table (kwh$X2017.Q2))
4847.51
     50
> kwh[!is.na (kwh$X2017.Q2) & kwh$X2017.Q2 == 4847.51,]
     Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
31  P308007 4059.905 3875.015  3034.91  3990.26   4512.2  4847.51
44  H958007 3771.095 5358.575  3034.91  3990.26   4512.2  4847.51
48  H419006 5360.750 1078.055       NA       NA   4512.2  4847.51
60  H619001 3570.185 2400.890  3034.91  3990.26   4512.2  4847.51
112 A542003 2126.240 2951.270  3034.91  3990.26   4512.2  4847.51
116 A436004 1181.540 2951.270  3034.91  3990.26   4512.2  4847.51

The value 50 returned from the which() command just indicates that the entry of interest happened to be the 50th entry in the table. We are more interested in the duplicated value, which is 4847.51. We can see that a number of properties have the same value not just for the X2017.Q2 column but for others as well. Once again the analyst needs to come to some judgment as to how to proceed. In the real-life case from which this example was derived, we were able to communicate with the data provider and learn that these readings were estimated on occasions when meters could not be read, the estimate deriving from an average over a set of properties. We continued constructing the data, documenting our findings.

7.8.5 Merging the BedBath and EnergyUsage Data Frames
The two data sets are nearly ready to merge. They have keys in the same format, and the issues with the data have been resolved. All that remains is to ensure that the two sets of keys do, in fact, refer to the same properties. In this code, we tabulate the number of keys in bedbath that appear in kwh, and vice versa.

> table (is.element (bedbath$ID, kwh$Serial))
TRUE
 115
> table (is.element (kwh$Serial, bedbath$ID))
TRUE
 115

Where there are no duplicates, we need not do this in both directions – but we recommend it anyway. Since all the keys match, we can now merge the two data sets by the Serial and ID keys. This code shows how we construct the final data set for this example.

> properties <- merge (kwh, bedbath, by.x = "Serial", by.y = "ID")
> properties[1:4,]    # check
   Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2 Bed Bath SqFt State
1 A195008 4226.522 3961.223 4117.505 4683.680       NA 7022.795   2  1.0  950    NY
2 A255003 3558.008 3512.932 3645.375 4182.305 5108.536 5695.186   3  2.5 1460    CT
3 A294008 3897.872 3352.718 2073.440 3013.130 3644.720 3590.420   3  2.5 1510    NY
4 A301008 3327.296 3365.249 2997.545 5168.420 4151.660 3080.435   3  2.0 1430    NJ

Our properties data set is complete and appears to be clean, although it still has missing values.
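Before moving on, a quick sanity check on the merge may be worthwhile (a minimal sketch, using the object names above): with keys that match one for one, the merged data should have exactly one row per key, with nothing dropped from either side.

> stopifnot (nrow (properties) == nrow (bedbath),
      nrow (properties) == nrow (kwh),
      !any (duplicated (properties$Serial)))    # silence here means all is well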
Moreover, we have not examined the usage numbers very carefully. The following commands give examples of the sorts of analyses we might perform to examine these numbers. We start by creating a vector Xcols to identify the columns of usage – that is, the ones whose names start with X.

> Xcols <- grep ("^X", names (properties), value = TRUE)
> sapply (properties[,Xcols], function (x) sum (is.na (x)))
X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
       4        6        5        7       15       16
> sapply (properties[,Xcols], range, na.rm = T)
     X2016.Q1 X2016.Q2  X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
[1,]  976.010   881.09  680.1299  680.690 1138.697 2123.470
[2,] 5847.704 10267.73 5448.1001 8484.425 9307.250 8421.106

The number of NA values increases with increasing quarter, ending up at 16 of the 115 rows (about 14%). In the second command, we compute the range of each column. The maxima of the usage values vary quite a bit, but none of the values are negative or otherwise obviously anomalous. We might also compute other summary statistics – for example, typical usage by state. For one quarter's usage, this is easy to compute using the tapply() function, as the first command below demonstrates. The second command shows how this might be computed simultaneously for every quarter.

> tapply (properties$X2016.Q1, properties$State, mean, na.rm = TRUE)
      CT       NJ       NY       PA
3123.335 3321.658 3416.334 3237.445
> sapply (properties[,Xcols], function (qtr) {
      tapply (qtr, properties$State, mean, na.rm = TRUE)})
               CT       NJ       NY       PA
X2016.Q1 3123.335 3321.658 3416.334 3237.445
X2016.Q2 2776.943 2953.894 3120.687 3092.234
X2016.Q3 2871.059 2892.502 3056.265 3239.677
X2016.Q4 4005.165 3922.922 4043.530 3996.064
X2017.Q1 4461.140 4445.038 4602.037 4669.660
X2017.Q2 4918.352 4556.398 5085.058 4829.615

This last command loops over the columns whose names appear in Xcols, calling tapply() for each one. The results are assembled into a matrix automatically. We might plot this matrix; we might plot usage by square footage for each state or quarter; we might examine the distributions of values by quarter, perhaps with boxplots; and so on. After some investigation like this, we can begin the modeling process. This would be the stage at which to convert the State column into a factor variable.

We note here that our final product consists of one row per property. To conform strictly with the notion of "tidy" data from Chapter 1, in which each usage observation occupies one row, one more transformation is necessary. We know that the final data set will have 115 × 6 = 690 rows. The usage entries will come from the Xcols columns. We can extract these using the stack() function, which produces the vector of 690 usage entries together with a second, factor vector identifying the source of each usage value by name. This code shows how we might do this.

> stack.output <- stack (properties, Xcols)
> stack.output
      values      ind
1   4226.522 X2016.Q1
2   3558.008 X2016.Q1
3   3897.872 X2016.Q1
  :        :        :
115 3825.980 X2016.Q1
116 3961.223 X2016.Q2
  :        :        :

We can see the transition from values that originated in the X2016.Q1 column to ones from the X2016.Q2 column after the 115th row. Now we replicate the remaining columns six times, and attach the result to the stack.output object.
> usages <- data.frame (properties[rep (1:nrow (properties), 6), c(1, 8:11)], stack.output)
> usages[1:3,]    # check
   Serial Bed Bath SqFt State   values      ind
1 A195008   2  1.0  950    NY 4226.522 X2016.Q1
2 A255003   3  2.5 1460    CT 3558.008 X2016.Q1
3 A294008   3  2.5 1510    NY 3897.872 X2016.Q1

After this operation, our new usages data set is 690 × 7. If we need to add columns giving the year or the quarter, we can use a command like usages$Year <- substring (usages$ind, 2, 5).
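For example (a small sketch, following on from that command), both pieces can be pulled out of the ind labels, which have the form X2016.Q1:

> usages$Year <- as.numeric (substring (usages$ind, 2, 5))
> usages$Quarter <- substring (usages$ind, 7, 8)    # "Q1" through "Q4"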
7.9 Chapter Summary and Critical Data Handling Tools
This chapter describes some of the practical steps we need to take in a real data cleaning problem. These include acquiring the data, cleaning the data, combining cleaned data sets from different sources, and, finally, undertaking any necessary data preparation steps. All four of these steps require detailed documentation, both to describe the steps you took and to make them reproducible. The documentation you produce should also describe steps where you exercised judgment. This might involve deleting records with mostly missing values, or replacing a small number of missing values with another value. You might document anomalies on which you took no action at all – for example, if you observe that all the loans in a data set had dates whose days of the week were Monday or Wednesday, that might seem peculiar but not impossible, depending on the business practices of the data provider. Of course, the more you know about the organization, the data, and the way it is gathered and processed, the better your judgment can be.

Some data is transactional, meaning that it can reflect multiple observations for each item of interest. We demonstrate some approaches for handling this sort of data. While the demonstration requires a certain amount of explanation, the script that actually performs the tasks requires only a few dozen lines of code.

The final section of the chapter goes through an example of cleaning some fairly small data sets. This example demonstrates a number of the challenges that face us in our day-to-day business of handling data: embedded delimiters, inconsistent key conventions, missing or duplicate values, and so on. There are, inevitably, many more ways that data can need cleaning that the example did not touch upon. It did not, for example, involve data from non-delimited formats such as Excel, JSON, or XML, or importing data from a relational database. It did not require us to write a specialized function, nor did it require the handling of dates or times. Many of these complications are taken up in the extended exercise in the following chapter, but you should be aware that the data you get will very often involve a combination of input types and require a number of approaches.

8 Extended Exercise

In this chapter, we set up a guided data cleaning task, from beginning to end. This data (including the company and personal names and addresses) is entirely fabricated and is intended only to demonstrate some of the concepts in this book – but every quirk you see in this data is based on actual data we encountered doing real projects. Unlike the smaller examples in earlier chapters, these data sets are large enough that you cannot spot all of their anomalies by eye. However, you will be able to open and examine these outside of R – unlike some of the data sets we deal with in real life. This exercise requires time and focus to complete. You will get the most benefit from this book if you read the chapter all the way through and try to perform all the tasks in the exercise. Part of the exercise is figuring out exactly what needs to be done at each step and in which order. We have included some "pseudocode" – high-level descriptions of the algorithms we used to perform the tasks – and some hints in Appendix A. However, we recommend that you turn to those only when you find yourself stuck. The actual code we used to do the cleaning is available in the cleaningBook package. Again, you will receive the most benefit if you try to solve the problem in its entirety before looking at our code – and you may find a different, or better, way of getting the job done.

8.1 Introduction to the Problem
This data comes from a hypothetical client company called Hardy Business Loans, which lends money to small businesses. Sometimes, the borrower is the business itself; other times, the borrower is one (or more) of the business's principal owners or partners. The loans are for small- to medium-sized pieces of equipment: for a restaurant, this might be a pizza oven or espresso maker; for a trucking business, it might be a truck or a copying machine; for a gardening business, it might be a backhoe. Hardy Business Loans has acquired portfolios of loans from two different companies, "Beachside Lenders" and "Wilson and Sons," and inevitably the data for these two portfolios has different layouts and fields. Each portfolio consists of applications and loans, each identified by a key (although application keys are different from loan keys). Every loan should have a corresponding application, but some applications do not lead to loans – they are said to be "unbooked," whereas those that are made into loans are "booked." An application might end up unbooked because the lender declined the application, or because the borrower found the money elsewhere, or perhaps for other reasons.

At the time the application is made, the lender consults several other companies that provide information about the creditworthiness of the borrower. In real life, these would include various credit reporting agencies, which supply information about the credit habits and experience of individuals, and also corresponding repositories of similar information for businesses. Our "agency" data is modeled on the sorts of data made available by these reporting agencies. In many cases, the reporting agencies supply many columns containing "low-level" data for each customer, and also a single "agency score." Low-level data might include, for example, the number of times a customer was 30 days late on a credit card bill, the number of department store charge cards he or she has, whether he or she ever declared bankruptcy, and dozens of similar items. The agency score is a single number, intended to rate the customer's overall creditworthiness, that combines all the low-level information using an algorithm that is typically proprietary. Since Hardy acquires data from several agencies, each borrower might have several agency scores. Lenders like Hardy use the scores, but they might want access to the low-level data as well. For example, a used-car dealer might want to build a custom score that focuses specifically on borrowers' past record with car-loan payments. In our case, the list of variables we need in the output includes a few of these low-level variables and also all of the agency scores.
8.1.1 The Goal
The goal of the data cleaning exercise is to produce tabular data with one row per loan and with all the necessary columns of descriptive data. (The set of columns to be retained will be discussed shortly.) We will also need to report on observations that were omitted, and the reason: whether it was too much missing data, no matching records in the agency data, or something else. We may also need to show the distribution of loans – by region of the country, by calendar quarter, and perhaps by other variables as well – through tables and graphs. Our output will be in two data frames (or, if you prefer, one big data frame with records of two types).

The first data set comes from the booked loans. For these loans, we will need to construct the "response variable" for later modeling efforts. This is the measurement that describes the outcome of the loan. In other applications this might be a number, but in this case it will be one of the levels "Good" or "Bad." (We give the definitions of those terms in the material that follows.) So each row of this part of the output will contain one of those values as a response, a number of measurements on different variables (the "predictors"), and perhaps one or more keys that allow the final data to be linked to the components from which it was built. The modeling step – which is not part of this exercise – will use the predictor variables to predict the response variable. If the responses in the booked loans can be predicted accurately, Hardy Business Loans could use that model as a screening tool by which to evaluate the risk of new loans or of its portfolio.

In addition to the predictors and the response for booked records, we must also construct a similar data set using the records from unbooked applications. By applying the model to the set of unbooked applications, Hardy can learn about the rates at which it is turning away potentially successful borrowers. Of course, these unbooked records do not have response variables. But it may be possible to estimate how unbooked applications would have fared, based on the statistical model built from the booked ones. So we need to prepare the unbooked applications and the loans in a common way.

8.1.2 Modeling Considerations
The data sets that we produce at the end of this exercise are a beginning, not an end. We can expect to make further changes as the modeling proceeds. These are the steps we called "preparing data" in Section 7.5. For example, we might choose to transform a numeric variable by, say, taking its logarithm. Categorical variables may need to be re-grouped, or numeric ones binned. We might exclude more records as we determine their data to be unreliable. The processes of cleaning, preparing, and modeling are iterative ones.

Another consideration is that we might want to divide the data into pieces, one piece (the "training set") for building models and a second piece (the "test set") for evaluating them. This division might be performed at random, with, say, 30% of the entire data set retained as a test set, but we might also sample using a more complicated scheme if, say, we wanted the training set to have a higher proportion of Wilson and Sons loans, or newer loans, than the data as a whole. In any case, this important consideration is not part of our exercise.

8.1.3 Examples of Things to Check
To the extent possible, you should check every column against the data dictionary.
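A small helper can make that check systematic (a sketch; col.audit is our name, and beach stands for whichever portfolio data frame you have read in):

> col.audit <- function (x) {
      if (is.numeric (x))
          c(class = class (x)[1], missing = sum (is.na (x)),
            min = min (x, na.rm = TRUE), max = max (x, na.rm = TRUE))
      else
          c(class = class (x)[1], missing = sum (is.na (x) | x == ""),
            distinct = length (unique (x)))
  }
> lapply (beach, col.audit)    # one short summary per column, for comparison to the dictionary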
We have intentionally (and, probably, unintentionally) introduced anomalous entries into the data to mirror the sorts of problems we encounter in practice. Follow the steps mentioned in Section 7.2 in looking for duplicates and ensuring that the set of column classes is as expected. Look at the maxima and minima of numeric and date columns, and tabulate the categorical ones. Examining each column in isolation is a good start, but it is also worthwhile to compare pairs of columns. Of course, there are a lot of pairs, and often they cannot all be checked. But it may be worthwhile to build two-way tables of important categorical variables, keeping an eye out for unexpected results like months with almost no observations.

In practice, you will often find that some predictor variables are available from more than one source. When two offerings of the same variable are available, you might consult the data provider to see if they are considered to be equally reliable. As we said in Section 7.2, it is useful to compute the proportion of times that the two agree – often, we find that two versions of a variable (call them "A" and "B") have the same values when both are present, but that A is missing much more often than B. In that case we will use B, possibly augmenting its own missing entries with values taken from A. When A and B disagree on something other than missingness, we exercise our judgment in deciding how to move forward. In this exercise, we do not expect that sort of overlap in variables to occur – but it does happen in practice.

8.2 The Data
The data for this exercise needs to be constructed by combining inputs from seven sources. The sections following this one describe each of these sources in detail, and give layouts or sample data. The seven sources are as follows:
• the two loan and application portfolios, one from Beachside and one from Wilson, giving information about the borrower and the product to be purchased with the loan;
• the scores database, giving scores for the borrowers from four agencies, including a score called "KSC" or "KScore";
• the co-borrower scores, giving agency scores for co-signers of those loans that have them;
• a set of updated KScores, which may supersede the KScore in the borrowers' data;
• a set of loans to be excluded from the final data set; and
• the payment matrix, showing payment information for up to 48 prior billing periods.

Figure 8.1 shows a diagram of one process that can be used to construct the final data.

Figure 8.1 Schematic of the data cleaning process for the example data.

The loan and application portfolios (two left items, top row) contain the information about borrowers and loans. The layouts of these tables are described in Section 8.4. These tables include low-level predictors; these will then be joined to the scores database (Section 8.5), shown in the top center-right of the figure, which contains the agency scores for the loans and applications in the two portfolios. Some loans have co-borrowers; those scores are supplied in the co-borrowers data (Section 8.6), shown at center left in the figure. There are then two modifications that need to be imposed on the data.
First, a set of updated KScores (bottom left of the figure) is available; loans and applications appearing in that data set may have to have their KScore values updated (see Section 8.7). In addition, a small number of loans and applications should be excluded from the modeling effort, based on some exclusion data (middle right of the figure). We apply this exclusion as the last step in generating the predictors, because the excluded records need to be updated anyway for a later effort, one that is not part of this exercise. The payment matrix (Section 8.9), shown at top right, is used to generate the response variable for the modeling process. The algorithm described in that section can be used to determine whether the performance of a loan was "Good," "Bad," or "Indeterminate." The final data set will include only those loans whose performance was "Good" or "Bad." The final set of predictors is joined with the response (for the booked records that have one) to produce the cleaned data (gray box, mid-lower right). Finally, we exclude unbooked records and those whose response was "Indeterminate" to produce the subset of records that will be used in the initial modeling effort (bottom right of the figure).

8.3 Five Important Fields
There are five fields that are vital to this exercise:

Application Number: A 10-digit number identifying the application. Every application has a unique application number, and application numbers should be distinct between the two portfolios.

Loan Number: A field identifying the loan for booked applications. In the Wilson portfolio, the loan number is supposed to consist of a six-digit customer identification, followed by a three-digit "instance" number. The instance number starts at 001 and increases when a previous customer receives a second or third loan. In some cases, though, the instance is missing; in that case 001 can be used. The Beachside portfolio identifies loans by a six-digit number with no instance. It is possible for the same six-digit number to be used in both portfolios, purely by happenstance, which, combined with the fact that some Wilson loans do not have instance numbers, means that six-digit loan numbers are not necessarily unique.

Application Date: The date on which the application was completed. For the purposes of evaluating unbooked loans in the modeling process, we must restrict ourselves to using only information that was available by this date. So if a particular unbooked application is dated November 30, 2016, we will ignore scores, co-borrower information, and updated KScores recorded after that date.

Loan Date: The date on which a loan is "booked" (made official). For booked loans, scores and other updated information recorded after this date should be ignored.

Active Date: We will use the term "active date" to identify the application date for unbooked applications and the loan date for booked loans. This date is important because it marks the date on which a decision is made; agency scores, for example, that are recorded after the active date will need to be ignored. Unlike the other important fields, the active date has to be inferred.

8.4 Loan and Application Portfolios
The two portfolios acquired by Hardy Business Loans come from two sources: Beachside Lenders and Wilson and Sons. Both sets of data provide low-level reporting agency data recorded at or before the active date, for both unbooked applications and loans.
The data for Beachside Lenders appears as an Excel workbook named Beach.xlsx, in the newer .xlsx format. Data for the Wilson portfolio comes in an Excel workbook named wilson.xls, in the older .xls format. The Wilson data can contain multiple loans for one borrower (although each has a separate application number). We might treat these multiple loans as if they are independent (although they are not), or we might keep only the most recent ones (although in that case we presumably lose some information), or we might try combining multiple loans for the same borrower. In any case, the number of multiple borrowers is so small that it will almost certainly make no difference. For this example, we will keep all of the loans as if they are independent.

Most of the variables that we need to extract or, in some cases, construct for the modeling effort are given in Table 8.1 and are laid out in the following three sections. The table lists the desired columns both by their Beachside names and by their Wilson names, which are, inevitably, different. The two layouts in the following sections, and Table 8.1, serve as the data dictionary (Section 7.6) for the loan and application portfolios. In theory, you would hope that the data dictionary would give a complete and accurate description of the data. In our example, though – just as in real life – you may find slight discrepancies between what the dictionary says the data should look like and what is actually in the data. Finding these discrepancies and adjusting for them – and documenting them – is part of the exercise!

8.4.1 Layout of the Beachside Lenders Data
The Beachside portfolio contains information about 23 attributes of 8621 loan applications. Those 23 columns are described here. A number of columns, such as the Dbl and KAK columns in the final entry on the list, contain information that is never used. This is a common occurrence.
• APP_NO, APP_DT, LOAN_NO, LOAN_DT: Application number and application date for every entry; loan number and loan date for booked loans
• STAT: The status of the application (Booked, Decline, TUD; see below)
• CUST_NM, CUST_ADD1, CUST_CT, CUST_ST, CUST_ZIP: Name, address, city, state, and ZIP code of customer
• TOT_COST, LOAN_AMT: Total cost of equipment and amount of loan
• GCORP, Local: Used to construct "credit risk" (see Section 8.4.3)
• TIB: Time in business, in years
• ASSET_NEW: T if the equipment was to be purchased new, F if purchased used
• BUS_TYPE, ASSET_TYPE, Dbl, KAK, Folds, JL, 2BC, MM ?: Unused.

The STAT column describes the status of the loan. For booked loans, this will be "Booked." For unbooked applications, this can be either "Decline," meaning that the lender declined to make the loan, or "TUD," which indicates that the borrower "Turned Us Down." This happens either because a borrower finds more favorable terms elsewhere, or changes her mind about whether the purchase is necessary (or goes out of business, or another unusual event takes place).
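Reading these workbooks is taken up in the appendix, but a first look might run along these lines (a sketch, assuming the readxl package; sheet names and layouts may differ):

> library (readxl)
> beach <- as.data.frame (read_excel ("Beach.xlsx"))     # the newer .xlsx format
> wilson <- as.data.frame (read_excel ("wilson.xls"))    # read_excel reads .xls as well
> dim (beach)    # expect 8621 rows and 23 columns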
8.4.2 Layout of the Wilson and Sons Data
The Wilson data has 10,587 rows and 23 columns. Most of these columns carry the same information as in the Beachside portfolio, though with different names. The Wilson columns are shown here.
• Appl Number, Appl Date, Loan Number, Loan Date: Application number and date for every entry; loan number (with instance) and loan date for booked loans
• Application Status: Application status, like the STAT column from Beachside. Values are Approved (booked), Declined, and Incomplete (equivalent to "TUD" in Beachside)
• Customer Name, City, State, Zip: Identification of customer
• Total Cost, Net Loan Amount: The cost of the equipment and the size of the loan
• GC Indicator, Local Cred: Used to construct "credit risk"
• Business Start Date: Date on which the business began operating
• New/Used Indicator: Indicates whether the equipment being purchased is "New" or "Used"
• Customer Type, Equipment Class, CVR Indicator, Reclass Indicator, Minimum GM, 2nd Gen PP, GS Indicator: Unused.

8.4.3 Combining the Two Portfolios
Table 8.1 shows the columns required for the modeling effort that can be found in, or derived from, the two portfolios or the additional data. In a real example, we might keep dozens, scores, or even hundreds of columns, some formed by transforming or combining others. For each column in the final data set, we give the corresponding columns in the two loan portfolios: "derived" denotes columns that need to be constructed, using details that follow the table. Just as in real life, we will need to combine columns with different names and values that nonetheless represent the same underlying measurement.

The Time in Business field measures the amount of time that the borrower has been in business – in other words, its age (in years). In the Beachside case, this is given directly, in years, by the TIB column. In the case of Wilson, we will need to deduce this from the date on which the business was reported to have started, given by the Business Start Date column. The age is the time between the business start date and the active date (i.e., the Loan Date for booked loans, the Application Date for unbooked ones).

The Region column represents the census region associated with each state. Table 8.2 shows which states (by abbreviation) are assigned to which region. This information is also available in the cleaningBook package in a more convenient form, as a 50-row data frame called state.tbl. Observations from locations not in any region (such as Canada or Puerto Rico) should be given the value "Other" for the Region field; a sketch of this lookup appears after Table 8.2.

Table 8.1 Columns required for predictive model

Name              Beachside   Wilson              Notes
App Number        APP_NO      Appl Number         Application number
App Date          APP_DT      Appl Date           Application date
Loan Number       LOAN_NO     Loan Number         Loan number (if booked)
Loan Date         LOAN_DT     Loan Date           Loan date (if booked)
AppStat           STAT        Application Status  Booked, declined, or TUD
Time in Business  TIB         Derived             Time in business
State             CUST_ST     Customer State      Customer state/province
Region            Derived     Derived             Customer region (see notes)
Cost              TOT_COST    Total Cost          Total equipment cost
Loan Amount       LOAN_AMT    Net Loan Amount     Loan amount
Down              Derived     Derived             Down payment (numeric)
New/Used          ASSET_NEW   New/Used Indicator  New/used indicator: T ("new") or F in Beachside; New or Used for Wilson
CredRisk          Derived     Derived             Credit risk (see notes)
Source            Derived     Derived             Beachside or Wilson
Co-borrower       Derived     Derived             (see notes)

Table 8.2 State abbreviations by census region

Region     States
Midwest    IL, IN, IA, KS, MI, MN, MO, NE, ND, OH, SD, WI
Northeast  CT, ME, MA, NH, NJ, NY, PA, RI, VT
South      AL, AR, DC, DE, FL, GA, KY, LA, MD, MS, NC, OK, SC, TN, TX, VA, WV
West       AK, AZ, CA, CO, HI, ID, MT, NV, NM, OR, UT, WA, WY
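The lookup itself might run as follows (a minimal sketch; big is the combined portfolio data frame, as in the appendix, and the column names of state.tbl are assumptions):

> region <- state.tbl$Region[match (big$State, state.tbl$State)]
> region[is.na (region)] <- "Other"    # Canada, Puerto Rico, and so on
> big$Region <- region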
The loan amount is usually smaller than the equipment cost, since most customers will make a down payment or trade in an older piece of equipment. The Down column is computed as Cost − Loan Amount; this may be useful in a predictive model, since companies that can make large down payments may be healthier – or, alternatively, a large down payment may leave them short of cash.

The CredRisk credit rating is derived from two other columns present in the portfolio files. First, the GCORP or GC variable gives the borrower a letter grade from A (the highest score) to D. Second, the Local score gives a number from 1 (the highest) to 3. We will need to create a new credit risk predictor by combining those two indicators into a single column with values A1, A2, A3, B1, and so on – except that C3 and all the D ratings should be combined into a single level called X.

The co-borrower indicator shows whether an applicant has a co-borrower. This is determined by the existence of co-borrower scores in the data of Section 8.6. Application numbers for which co-borrowers appear in that data should be given a co-borrower value of TRUE, and all others should be given the value FALSE.

8.5 Scores
Each customer may have been assigned scores by the different agencies. Primary borrowers' scores are stored in an SQLite repository called scores.sqlite, accessible via SQL (see Section 6.3). Agencies report scores fairly frequently, and customers' scores can change from time to time. For each customer, we want the score that was the most recent at the time of the active date. Scores that are recorded after the active date are not to be used in our modeling efforts. However, the active date is not part of the scores database; it comes from the loan and application portfolios.

In our example, each customer can have scores from up to four agencies. These scores have the names "Rayburn," "KScore," "J&G," and "NorthEast." Valid values for scores are 200–899, so values outside this range should be ignored. In the database, customers are identified by an eight-digit customer number, which is not the same as an application or loan number. The database contains a table named CScore containing this customer number, the agency scores, and the month associated with the scores. Each customer may have many records, one for each reporting month, so this table does not have a unique key (although one could be created by combining the customer number and month). A second table, CusApp, links customer numbers to application numbers. The following section shows the layouts of these tables.

8.5.1 Scores Layout
The four agency scores are supplied in the scores database as an SQLite repository. The repository contains two tables, whose layouts are shown in Tables 8.3 and 8.4. In the "Format" columns, each 9 corresponds to a digit, and YYYYMM, of course, corresponds to a four-digit year and two-digit month. The CScore table holds the scores. Each set of four scores corresponds to a single month, given in the ScoreMonth field.

Table 8.3 CScore table

Name        Format      Description
Customer    99999999    Customer number
RAY         999         Three-digit Rayburn score
JNG         999         Three-digit J&G score
KSC         999         Three-digit KScore
NE          999         Three-digit NorthEast score
ScoreMonth  YYYYMM      Month associated with scores

Table 8.4 CusApp table

Name      Format      Description
Customer  99999999    Eight-digit customer number
Appl      9999999999  Ten-digit application number

The CusApp table cross-references the customer number to the application number.
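Pulling both tables into R might look like this (a sketch, assuming the DBI and RSQLite packages; the table and column names follow the layouts above):

> library (DBI)
> con <- dbConnect (RSQLite::SQLite(), "scores.sqlite")
> cscore <- dbGetQuery (con, "SELECT * FROM CScore")
> cusapp <- dbGetQuery (con, "SELECT * FROM CusApp")
> dbDisconnect (con)
> # Attach the application number to each score record.
> cscore$Appl <- cusapp$Appl[match (cscore$Customer, cusapp$Customer)]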
8.6 Co-borrower Scores
Some personal borrowers specify co-borrowers: other individuals who co-sign for the loan, thereby promising to pay the lender if the original borrower defaults. Borrowers with co-borrowers might be expected to be more likely to pay the loan back, on average, depending, perhaps, on the creditworthiness of the co-borrower. Co-borrowers have their own agency scores; that information (for both of the portfolios) is stored in the text file called CoBorrScores.txt.

This file consists of a series of transactional records, each co-borrower being represented by between three and a few dozen records. Records come in different types, the different types being indicated by the three-letter sequence starting in position 9. The first record for each co-borrower is the master record, with three-letter sequence J01. This J01 record contains the application number associated with that co-borrower, in character positions 20–29. Any other characters in the J01 line can be ignored. All the records for a single co-borrower will be on separate lines following the J01 record. These other records can be of several types; for our purposes, we are interested only in records of type ERS, which give co-borrower scores. Records with types other than J01 or ERS should be ignored.

Each ERS record reports a score type (with Rayburn, KScore, J&G, and NorthEast being abbreviated by RAY, KSC, JNG, and NTH, respectively), the score value, and a score date in MM/DD/YYYY format (followed by other information not needed for this exercise). For each co-borrower we want the latest value of each score that precedes the active date, except that, as before, only scores in the range 200–899 should be used. The following section shows examples of the records in the co-borrower file for a particular co-borrower.

8.6.1 Co-borrower Score Examples
Co-borrowers can have different numbers of records; the only way to tell that we have seen all the records for a particular individual is by observing that the next record has type J01 (except for the very last co-borrower in the file). As with the scores database, we want to extract from the co-borrower file the agency scores immediately before the active date, and, again, the active date is not found in the file. Here are some example records from the CoBorrScores.txt file; assume that these are the only records for this co-borrower.

SuppDat J01 AppNum 3385219170 Source 2151 Seq 661 Mod1 N ...
SuppDat BBR Check 2 Center 21 Audit F ...
SuppDat ERS NTH 720 10/31/2016 Mod N Report1 YNY
SuppDat ERS JNG 692 12/12/2016 Mod Y Report1 YYY
SuppDat ERS NTH 999 12/31/2016 Mod N Report1 NNY
SuppDat FOE OfcCode 22X OfcState NY OfcReg 22 OfcHold N ...
SuppDat ERS RAY 999 10/31/2016 Mod Y Report1 NNN
SuppDat ERS NTH 717 11/30/2016 Mod N Report1 YNY

In this example, we see scores for application number 3385219170. The second and sixth lines have codes BBR and FOE; these lines should be ignored. The JNG score has the single entry 692, and that pair of score and date should be retained. The RAY score is given as 999, so no Rayburn score is retained for this co-borrower. There are three reports of a score of type NTH; of these, the most recent has the value 999 and should be ignored. So two (score, date) pairs need to be retained for the NTH score for this co-borrower. Therefore, this co-borrower should be associated with a total of three pairs of scores and dates.
The actual NTH score entered into the final database will depend on the active date for this application. If the active date is October 30, 2016 or earlier, none of the three NTH rows in the example applies, and no NTH score will be used. If the active date is between October 31, 2016 and November 29, 2016, the third row of the example (the first of the NTH rows) applies, and the NTH score for this co-borrower should be reported as 720. Finally, if the active date of the application is November 30, 2016 or later, the last row of the example applies and the NTH score should be reported as 717.
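One way to turn this transactional file into a table of (application, agency, score, date) records might look like this (a sketch; the fixed positions follow the description above, but the assumption that the remaining ERS fields are whitespace-separated is ours, as are the object names):

> lines <- readLines ("CoBorrScores.txt")
> rectype <- substring (lines, 9, 11)
> # Carry each J01 application number forward onto the records that
> # follow it (this assumes the file begins with a J01 record).
> app <- substring (lines[rectype == "J01"], 20, 29)[cumsum (rectype == "J01")]
> flds <- strsplit (lines[rectype == "ERS"], "[[:space:]]+")
> cobor <- data.frame (App = app[rectype == "ERS"],
      Agency = sapply (flds, "[", 3),
      Score  = as.numeric (sapply (flds, "[", 4)),
      Date   = as.Date (sapply (flds, "[", 5), format = "%m/%d/%Y"))
> cobor <- cobor[cobor$Score >= 200 & cobor$Score <= 899,]    # drop invalid scores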
8.7 Updated KScores
The scores database includes a column called KSC, which reports KScores, one of the agency scores to be used in the modeling. Many of these KScores are missing in the original scores database. However, in the time since that database was assembled, the lenders have acquired a new set of KScores from the vendor of KScores, and they want to insert those scores into the predictor data frame. These scores are delivered as JSON records in one file named KScore.update.json. (Technically, a file with a set of valid JSON records is not itself valid JSON, but there you have it.) This file is small enough that it can be read into memory all at once, though if it were not we could certainly read one line at a time in the style of the examples in Section 6.2.

Updated KScores refer to primary borrowers only, not to co-borrowers. Not every customer will have an updated KScore, and some updates will refer to customers not in our analysis. The appid field in a JSON record identifies an application number, but those application numbers have been treated as numeric, with leading zeros inadvertently removed. JSON records with fields called KScore will be relevant, as long as the score is between 200 and 899. Each record will contain a field called record-date; scores whose values of record-date are more recent than the active date of the application in question should be ignored. If there is more than one update for a particular account, we should use the most recent among those whose record-date values precede the active date for this account. The KScore.update.json file may contain other fields, too, including values of other scores, but those scores should be ignored. Only the KScore and its corresponding date and application number are of interest for this portion of the data preparation.

8.7.1 Updated KScores Layout
The values of KSC retrieved from the scores database described in the previous section need to be updated for those applications that appear in the KScore.update.json file. This file consists of a set of entries ("messages") in JSON format. Messages that do not include a field named KScore are not of interest. From those that do, we will extract the application identifier (appid, with leading zeros truncated), the score itself (KScore, represented as a string), and the date associated with the score (record-date, in MM-DD-YYYY format). Example messages might look like this:

{"appid":"442480", "Rayburn":"721", "NorthEast":"999", "record-date":"09-20-2016"}
{"appid":"621877", "KScore":"999", "J&G":"744", "record-date":"03-31-2016"}
{"appid":"3826", "Rayburn":"699", "KScore":"685", "J&G":"701", "record-date":"10-03-2016"}
{"appid":"3826", "KScore":"676", "record-date":"12-19-2016"}
10 AMT-DELQ PIC S9(9)V99. 10 BILL-DATE PIC 9(8). 10 MONTHS-SINCE-PAID PIC 999. Even without knowing COBOL, you can probably deduce that this record starts with a 10-character payer identifier called APPID. The PIC X(10) tells us to expect 10 alphanumeric characters. The 11th character, indicated by FILLER, can be ignored. Then there follow 48 instances of four fields each. PIC 9 clauses indicate numeric values, so the S9(9)V99 tells us that, for each of the first two fields, we can expect a sign (positive or negative, or a space which also indicates a positive value) and a numeric value of 11 digits, with an “implied” decimal point before the last 2 digits. That is, no actual space is set aside for the decimal point; dollar amounts are represented in pennies, but the layout tells the program where the decimal point is supposed to be. (This sort of storage was very common in COBOL, a language that is no longer widely used.) The third element in each set is an eight-digit date, in YYYYMMDD format, and the fourth is a three-digit number giving the number of months that the account was past due. The months are in order in every set, with the least recent month being associated with the first set of four values and the most recent being the 48th. Examples of this data might look, in part, like the line in the following example. These lines are very long; for display purposes, we have wrapped lines around the page so that two lines in the display correspond to one line of the file. 0034344639 00000000000 0000000000000000000000 00000000000 0000000000000000000000... 261 262 A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R 0048268540 00000860623 0000000000020101214000 00000000000 0000086062320110114001... In this example, the first account did not have a loan 48 months ago, so the values for that month are all zeros. The second customer paid $8606.23 (AMT-PD with a space for the sign and the implicit decimal point) in the first month. She was not delinquent, so the AMT-DELQ field carries a space for the sign and then 11 zeros. The billing date for that payment was 2010-12-14, and the customer was up to date, so the “months since paid” field carries the value 000. In the second month, with billing date 2011-01-14, the customer paid nothing and was delinquent by $8606.23; during that month, the “months since paid” field had the value 001. Each row will have a total of 48 of these 33-character sets of payment records. We are now ready to define our response variable. Of course, this will normally be defined by the client (in this case, Hardy Business Loans) rather than the data cleaner. For this example we will implement these rules, which are handled in priority order, so that, for example, a customer who meets the criterion of rule 1 is not considered by a later rule. 1) Any customer whose loan is six or fewer months old is considered Indeterminate. 2) A customer who is ever 3 months delinquent, or who is 2 months delinquent on two different occasions, is Bad. 3) A customer who is ever more than $50,000 delinquent is Bad. 4) Customers who meet none of these criteria are Good. Since the payment file has a fixed format, it is natural to read it into R with read.fwf() (Section 6.1.7), although if the file were huge it might be more efficient to read it line by line and decode it ourselves. 
8.10 Assembling the Final Data Sets The final product from this exercise should be two data sets, one with booked loans whose outcome is not “Indeterminate,” and one with all other applications – or, if you prefer, one combined data set containing both booked and unbooked applications. 8.10.1 Final Data Layout At the conclusion of the exercise, each data set should consist of these pieces: • Identifying information for each entry: application number; application portfolio (Beachside or Wilson); loan number (if any); loan and application dates, or a single column giving the active date. Extended Exercise • The value of the response variable (“Good” or “Bad”) for loans. For the unbooked applications we can omit this column, or we might create an empty column so that the two data sets have exactly the same set of columns. • The remaining columns from Table 8.1. • Four columns of agency scores, with a suitable indicator to denote missing scores, and with KScores updated as appropriate. It might be of value to keep the corresponding score dates for documentation purposes. • Four columns of agency scores for co-borrowers, again with a suitable indicator of missingness, and possibly including corresponding dates for these scores as well. For observations without co-borrowers, of course, all of these co-borrower scores will be missing. It is also important to keep track of your code and your assumptions. You will also want to construct a table giving, for each application number found in any source, its ultimate disposition – whether it appears in the final data set or was deleted, and, if so, why. You may also want to construct a flowchart like the one in Figure 7.1. 8.10.2 Concluding Remarks The exercise in this chapter is intended to bring together a number of the skills and techniques described throughout the book. If you can complete the exercise in a satisfactory way, we think we are well on your way to mastering the important skills of data cleaning. Having said that, let us assure you that there are many ways to do data cleaning in general, and this exercise in particular. After the authors devised the data for this book, we set out to do the exercise separately. We took quite different paths: one of us started with Beachside, then moved on to Wilson, and then combined the two data sets into one to create the first big data set of the exercise. The other went column by column through Table 8.1, extracting the Beachside and Wilson versions and cleaning and combining them. One of us turned dates into Date objects to ensure that they were sorted correctly; the other used YYYYMMDD-style text objects, turning them into POSIXt-type dates as needed. One of us turned “unknown” values for the New/Used column into “Used,” and the other left them as “unknown.” Of course, in real life we would have tried to refer the question of what action to take to the data provider, and possibly to whomever will be doing to modeling as well. Despite our different approaches, our final products agreed almost exactly – “almost” because of the judgments made in handling the small numbers of unexpected values. For the purposes of evaluating our exercise, we worked separately. In real data cleaning work, though, you will probably find it helpful to work in a small group, or at least to show your work to a colleague who can understand and comment on the steps. 263 264 A Data Scientist’s Guide to Acquiring, Cleaning, and Managing Data in R When we originally wrote this exercise, it had more steps. 
8.10 Assembling the Final Data Sets
The final product from this exercise should be two data sets, one with booked loans whose outcome is not "Indeterminate," and one with all other applications – or, if you prefer, one combined data set containing both booked and unbooked applications.

8.10.1 Final Data Layout
At the conclusion of the exercise, each data set should consist of these pieces:
• Identifying information for each entry: application number; application portfolio (Beachside or Wilson); loan number (if any); loan and application dates, or a single column giving the active date.
• The value of the response variable ("Good" or "Bad") for loans. For the unbooked applications we can omit this column, or we might create an empty column so that the two data sets have exactly the same set of columns.
• The remaining columns from Table 8.1.
• Four columns of agency scores, with a suitable indicator to denote missing scores, and with KScores updated as appropriate. It might be of value to keep the corresponding score dates for documentation purposes.
• Four columns of agency scores for co-borrowers, again with a suitable indicator of missingness, and possibly including corresponding dates for these scores as well. For observations without co-borrowers, of course, all of these co-borrower scores will be missing.

It is also important to keep track of your code and your assumptions. You will also want to construct a table giving, for each application number found in any source, its ultimate disposition – whether it appears in the final data set or was deleted, and, if so, why. You may also want to construct a flowchart like the one in Figure 7.1.

8.10.2 Concluding Remarks
The exercise in this chapter is intended to bring together a number of the skills and techniques described throughout the book. If you can complete the exercise in a satisfactory way, we think you are well on your way to mastering the important skills of data cleaning. Having said that, let us assure you that there are many ways to do data cleaning in general, and this exercise in particular. After the authors devised the data for this book, we set out to do the exercise separately. We took quite different paths: one of us started with Beachside, then moved on to Wilson, and then combined the two data sets into one to create the first big data set of the exercise. The other went column by column through Table 8.1, extracting the Beachside and Wilson versions and cleaning and combining them. One of us turned dates into Date objects to ensure that they were sorted correctly; the other used YYYYMMDD-style text objects, turning them into POSIXt-type dates as needed. One of us turned "unknown" values for the New/Used column into "Used," and the other left them as "unknown." Of course, in real life we would have tried to refer the question of what action to take to the data provider, and possibly to whomever will be doing the modeling as well. Despite our different approaches, our final products agreed almost exactly – "almost" because of the judgments made in handling the small numbers of unexpected values.

For the purposes of evaluating our exercise, we worked separately. In real data cleaning work, though, you will probably find it helpful to work in a small group, or at least to show your work to a colleague who can understand and comment on the steps.

When we originally wrote this exercise, it had more steps. The additional ones, inspired though they were by real data we analyze, were tedious and repetitive, so we took them out. But just because they are not in the exercise doesn't mean your data cleaning tasks won't be filled with tedious, detail-oriented actions. For example, you might look at the BUS_TYPE and ASSET_TYPE columns in the Beachside portfolio, which match up, more or less, to the Customer Type and Equipment Class columns in the Wilson one. In order to include those columns in the final output you would need to examine the sets of labels in the two portfolios and decide how to match them up. For any one column from two sources this task is easy enough, but given dozens of these the workload can be intimidating.

We said at the beginning of the book that data cleaning might take up 80% of the time we devote to a project. It is not always fun or glamorous, but it always needs to be done properly. In order to be good at data cleaning in R, you will need to understand what data is available and how it is represented. You will need to be familiar with the different formats in which data arrives, and with how to combine data from different sources. Perhaps most importantly, you will need to know at least a little about how the data is collected and how it is intended to be used. One of the great things about being a data scientist is that you get to learn at least a little bit about a lot of fields, and a lot of that learning comes from practicing with real data.

Hints and Pseudocode

Chapter 8 described a data handling task involving acquiring data from spreadsheets, a database, JSON, XML, and fixed-width text files. The formats and layouts of the data are documented in that chapter. In this appendix, we give some extra hints about how to proceed. We recommend trying the exercise first, without referring to this appendix until you need to. Some of these hints come in the form of "pseudocode." This is the programmer's term for instructions that describe an algorithm in ordinary English, rather than in the strict form of code in R or another computer language. The actual R code we used can be found in the cleaningBook package.

A.1 Loan Portfolios
Reading, cleaning, and combining the loan portfolios (Section 8.4) is the first task in the exercise, and perhaps the most time-consuming. However, none of the steps needed to complete this task is particularly challenging from a technical standpoint. If you have a spreadsheet program such as Excel that can open the files, the very first step in this process might be to use that program to view each file. Look at the values. Are there headers? Can you see any missing value codes? Are some columns empty? Do any rows appear to be missing a lot of values? Are values unexpectedly duplicated across rows? Are there dates, currency amounts, or other values that might require special handling?

Then, it is time to read the two data sets into R and produce data frames, using one of the read.table() functions. We normally start with quote = "" or quote = "\"" and comment.char = "". You might select stringsAsFactors = FALSE, in which case numeric fields in the data will appear as numeric columns in the data frame. This sets empty entries in those columns of the spreadsheet to NA and also has the effect of stripping leading zeros in fields that look numeric, like the application number.
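If the zeros have been stripped, they can be restored, since application numbers are supposed to be exactly 10 digits (a sketch; the data frame and column names are hypothetical):

> # %010.0f pads to 10 digits; %d would overflow for 10-digit values.
> beach$APP_NO <- sprintf ("%010.0f", beach$APP_NO)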
As one alternative, you might set colClasses = "character", in which case all of the columns of the data frame are defined to be of type character. In this case, empty fields appear as the empty string "". Eventually, columns that are intended to be numeric will have to be converted. On the other hand, leading zeros in application numbers are preserved. From the standpoint of column classes, perhaps the best way to read the data in is to pass colClasses as a vector to specify the classes of individual columns – but this requires extra exploration up front to determine those classes.

Pseudocode for the step of reading, cleaning, and combining the portfolio data files might look like this. In this description, we treat the files separately. You might equally well treat them simultaneously, creating the joint data set as you go.

    for each of {Beachside, Wilson}
        read in file
        examine keys and missing values
        discard unneeded columns
        convert formatted columns (currencies, dates) to R types
        ensure categorical variables have the correct levels
        construct derived variables
    ensure categorical variable levels match between data sets
    ensure column names match between data sets
    ensure keys are unique across data sets
    add an identifying column to each data set
    join data sets row-wise (vertically)

A.1.1 Things to Check

As you will see, the portfolio data sets are messy, just as real-life data is messy. As you move forward, you will need to keep one eye on the data itself, one eye on the data dictionary, and one eye on the list of columns to construct. (We never said this would be easy.) Consider performing the sorts of checks listed here, as well as any others you can think of. Remember to specify exclude = NULL or useNA = "ifany" in calls to table() to help identify missing values.

• What do the application numbers look like? Do they all have the expected 10 digits? Are any missing? Are there duplicate rows or duplicate application numbers? If so, can rows be safely deleted?
• What do missing value counts look like by column? Are some columns missing a large proportion of entries? Are some rows missing most entries, and, if so, should they be deleted?
• Do the values of categorical variables match the values in the data dictionary? If not, are the differences large or small? Can some levels be converted into others? If two categorical variables measure similar things, cross-tabulate them to ensure that they are associated. At this stage, it might be worthwhile to consider the data from the modeling standpoint. For example, if a large number of categorical values are missing, it might be worthwhile to create a new level called Missing. If there are a number of values that apply to only a few records each, it might be wise to combine them into an Other category.
• Are there columns with special formatting, like, for example, currency values that look like $1,234.56? Convert them to numeric (see the sketch following this list).
• Across the set of numeric columns, what sorts of ranges and averages do you see? Are they plausible? For columns that look like counts, what are the most common values? Are there some values that look as if they might be special indicators (999 for a person's age, 99 or −1 for number of mortgages)?
• Consider computing the correlation matrix of numeric predictors to see if columns that should carry similar information are in fact correlated.
• What format are the date columns in? If you plan to do arithmetic on dates, like, for example, computing the number of days between two dates, you will need to ensure that dates use one of the Date or POSIXt classes. If all you need to do is to sort dates, you can also use a text format like YYYYMMDD, since these text values will sort alphabetically just as the underlying date values would. Keeping dates as characters can save you a lot of grief, since it is easy to inadvertently turn a Date or POSIXt object into numeric form. The Date and POSIXt classes are easier to summarize and to use in plots, however. Examine the dates. Are there missing values? What do the ranges look like? Consider plotting dates by month or quarter to see if there are patterns.
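As an example of that currency conversion, a minimal sketch (with made-up values) might use gsub() to strip the dollar signs and commas before converting:

    amt <- c("$1,234.56", "$587.00", "")
    as.numeric(gsub("[$,]", "", amt))
    # 1234.56 587.00 NA -- the empty string becomes NA, with a warning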
Computing derived variables often involves a table lookup. For example, in the exercise, you can determine each state's region from a lookup into a table that contains state code and region code. The cleaning issue is then a matter of identifying codes not in the table and determining what to do with them.

Once the two loan portfolios have been combined, we can start in on the remaining data sets. In the sections that follow, we will refer to the combined portfolio data set as big. When we completed this step, our big data frame had 19,196 rows and 17 columns; yours may be slightly different, depending on the choices you made in this section.

A.2 Scores Database

The agency scores (Section 8.5) are stored in an SQLite database. The first order of business is to connect to the database and look at a few of its records. Do the extracted records match the descriptions in the data dictionary in Chapter 8? Count the numbers of records. If these are manageable, we might just as well read the entire tables into R, but if there are many millions or tens of millions of records we might be better off reading them bit by bit. In this case, the tables are not particularly large. Pseudocode for the next steps might look as follows:

    read Cscore table into an R data frame called cscore
    read CusApp table into an R data frame called cusapp
    add application number from cusapp to cscore
    add active date from big to cscore
    discard records for which ScoreMonth > active date
    discard records with invalid scores
    order cscore by customer number and then by date, descending
    for each customer i, for each agency j
        find the first non-missing score and date

We did not find a particularly efficient way to perform this last step.

A.2.1 Things to Check

As with almost all data sets, the agency scores have some data quality issues. Items to consider include the following:

• What do the application numbers and customer numbers from cusapp look like? Do they have the expected numbers of digits? Are any duplicated or missing? What proportion of application numbers from cusapp appears in big, and what proportion of application numbers from big appears in cusapp? Are there customer numbers in cscore that do not appear in cusapp, or vice versa?
• Summarize the values in the agency score columns. What do missing values look like in each column? What proportion of scores is missing? Are any values outside the permitted range 200–899?
• The ScoreMonth column in the cscore table is in YYYYMM format, without the day of the month. However, the active date field from big has days of the month (and, depending on your strategy, might be in Date or POSIXt format). How should we compare these two types of date?
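One possible answer, sketched here with assumed column names, is to reduce the active date to YYYYMM text and compare the strings, since YYYYMM values order correctly as text:

    # assuming the active date has been merged into cscore as a Date column
    active.ym <- format(cscore$ActiveDate, "%Y%m")      # e.g., "201403"
    cscore <- cscore[cscore$ScoreMonth <= active.ym, ]  # string comparison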
In the final step, we merge the data set of scores with the big data set created by combining the two loan portfolios. Recall that duplicate keys are particularly troublesome in R's merge() function – so it might be worthwhile to double-check for duplicates in the key, which in this case is the application number, in the two data frames being merged. Does the merged version of big have the same number of rows as the original? If the original has, say, 17 columns and the new version 26, you might ensure that the first 17 columns of the new version match the original data, with a command like all.equal(orig, new[,1:17]).

A.3 Co-borrower Scores

The co-borrower scores (Section 8.6) make up one of the trickiest of the data sets, because of their custom transactional format (Section 7.4). We start by noting that there are 73,477 records in the data set, so it can easily be handled entirely inside R. If there had been, say, a billion records, we would have needed to devise a different approach. We start off with pseudocode as follows:

    read data into R using scan() with sep = "\n"
    discard records with codes other than ERS or J01
    discard ERS records with invalid scores

At this point, we are left with the J01 records, which name the application number, and the ERS records, which give the scores. It will now be convenient to construct a data frame with one row for each score. This will serve as an intermediate result, a step toward the final output, which will be a data frame with one row for each application number. This intermediate data frame will, of course, contain an application number field. (This is just one approach, and other equally good ones are possible.) We take the following steps:

    extract the application numbers from the J01 records
    use the rle() function on the part of the string that is
        always either J01 or ERS

Recall that each application is represented by a J01 record followed by a series of ERS records. If the data dictionary is correct, the result of the rle() function will be a run of length 1 of J01, followed by a run of ERS, followed by another run of length 1 of J01, followed by another run of ERS, and so on. So, values 2, 4, 6, and so on of the length component of the output of rle() will give the number of ERS records for each account. Call that subset of length values lens. Then, we operate as follows:

    construct a data frame with a column called Appl, produced by
        replicating the application numbers from the J01 records
        lens times each
    add the score identifier (RAY, KSC, etc.)
    add the numeric score value
    add the score date
    insert the active date from big
    delete records for which the score date > active date

At this stage, we have a data set with the information we want, except that some application numbers might still have multiple records for the same agency. A quick way to get rid of the older, unneeded rows is to construct a key consisting of a combination of application number and agency (like, e.g., 3004923564.KSC). Now sort the data set by key, and by date within key, in descending order. We need to keep only the first record for every key – that is, we can delete records for which duplicated(), acting on that newly created key field, returns TRUE. Following the deletion of older scores, we have a data set (call it co.df) with all the co-borrower scores, arrayed one per score. (When we did this, we had 4928 scores corresponding to 1567 application numbers.)
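A sketch of that de-duplication step (the column names in co.df are assumed) might look like this:

    co.df$Key <- paste(co.df$Appl, co.df$Agency, sep = ".")
    # sorting both fields in decreasing order still groups the rows by key,
    # with the most recent date first within each key
    co.df <- co.df[order(co.df$Key, co.df$ScoreDate, decreasing = TRUE), ]
    co.df <- co.df[!duplicated(co.df$Key), ]   # keep the newest score per key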
Now we can use a matrix subscript (Section 3.2.5) to assemble these into a matrix that has one row per application number. We start by creating a data frame with the five desired columns: places for the JNG, KSC, NTH, and RAY scores and the application number. It is convenient to put the application number last. Call this new data frame co.scores. Then, we continue as follows:

    construct the vector of row indices by matching the application
        number from co.df to the application number from co.scores;
        this produces a vector of length 4928 whose values range
        from 1 to 1567
    construct the vector of column indices by matching the score id
        from co.df to the vector c("JNG", "KSC", "NTH", "RAY");
        this produces a vector of length 4928 whose values range
        from 1 to 4
    combine those two vectors into a two-column matrix
    use the matrix to insert scores from co.df into co.scores

A.3.1 Things to Check

As with the other data sets, the most important issues with the co-borrower data are key matching, missing values, and illegal values. You might examine points like the following:

• What do the application numbers look like? Do they have 10 digits? What proportion of application numbers appears in big? In our experience the proportion of loans with co-borrowers has often been on the order of 20%. Are there application numbers that do not appear in big? If there are a lot of these we might wonder if the application numbers had been corrupted somehow.
• After reading the data in, check that the fields extracted are what you expect. Do the score identifiers all match JNG, and so on? Are the score values mostly in the 200–899 range? Do the dates look valid?
• We expect some scores to be 999 or out of range. What proportions of those values do we see? Are some months associated with large numbers or proportions of illegal score values? Are there any non-numeric values, which might indicate that we extracted the wrong portion of the record?

When the co-borrowers data set is complete, we can merge it into big. Since big already has column names JNG, KSC, NE, and RAY, it might be useful to rename the columns of our co.scores before the merge to Co.JNG and so on. (At this point, you have probably noticed that the NorthEast score is identified by NE in the scores database but by NTH in the co-borrower scores. It might be worthwhile making these names consistent.) A final task in this step is to add a co-borrower indicator, which is TRUE when an application number from big is found in our co.scores and FALSE otherwise. Actually, this approach has the drawback that it will report FALSE for applications with co-borrowers for which every score was invalid or more recent than the active date (if there are any). If this is an issue, you will need to go back and modify the script by creating the indicator before any ERS records are deleted. This is a good example of why documenting your scripts as you go is necessary.
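The simple version of that indicator is a one-liner; a sketch (column names assumed) is:

    big$CoBorrower <- is.element(big$Appl, co.scores$Appl)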
A.4 Updated KScores

The updated KScores (Section 8.7) are given in a series of JSON messages. For each message, we need to extract the fields named appid, KScore, and record-date, if they exist. We start with pseudocode as follows:

    read the JSON into R with scan() and sep = "\n"
    remove messages in which the string "KScore" does not appear
    write a function to handle one JSON message:
        return appid, KScore, record-date if found, and (say) "none" if not
    apply that function to each message

Using sapply() in that last command produces a three-row matrix that should be transposed. Using lapply() produces a list of character vectors of length 3 that should be combined row-wise, perhaps via do.call(). In either case, it will be convenient to produce a data frame with three columns. We continue to work on that data frame with operations as follows:

    remove rows with invalid scores
    add active date from big
    remove rows for which KScore date is > active date
    keep the most recent row for each application id

This final data set contains the KScores that are eligible to update the ones in big. Presumably, it only makes sense to update scores for which the one in big is either missing or carries an earlier date than the one in our updated KScore data. When we did this, we found 1098 applications in the updated KScore data set, and they all corresponded to scores that were missing in big. So, all 1098 of these KScore entries in big should be updated. In a real example, you would have to be prepared to update non-missing KScores as well.

A.4.1 Things to Check

As with the other data sets, the important checks involve keys and missing, duplicated, and illegal values. In this case, we might consider some of the following points:

• Are a lot of records missing application numbers? Examine a few to make sure that it is the data, not the code, at fault. Are any application numbers duplicated? What proportion of application numbers appears in big?
• Are many records missing KScores? That might be odd given the specific purpose of this data set. What do the KScores that are present look like? Are many missing or illegal?
• What proportion of update records carries dates more recent than the active date? If this proportion is very large, it can suggest an error on the part of the data supplier.

A.5 Excluder Files

The Excluder files (Section 8.8) are in XML format. We need to read each XML file and determine whether it has (i) a field called APPID and (ii) a field called EXCLUDE, found under one called DATA. In pseudocode form, the task for the excluder data might look as follows:

    acquire a vector of XML file names from the Excluders directory
    use sapply to loop over the vector of names of XML files:
        read next XML file in, convert to R list with xmlParse()
        if there is a field named "APPID", save it as appid;
            otherwise set appid to None
        if there is a field named "EXCLUDE", save it as exclude;
            otherwise set exclude to None
        return a vector of appid and exclude

The result of this call to sapply() will be a matrix with two rows. It will be convenient to transpose it, then convert it into a data frame and rename the columns, perhaps to Appl and ExcCode. Call this data frame excluder.

A.5.1 Things to Check

• Are there duplicate rows or application numbers? This is primarily just a check on data quality. Presumably, if an application number appears twice in the excluders data set, that record will be deleted, just as if it had appeared once.
• Do the columns have missing values?
• Are the values in the ExcCode column what we expect ("Yes," "No," "Contingent")? We can now drop records for which ExcCode is anything other than Yes or Contingent.
• Do the application numbers in excluder match the ones in the big (combined) data set? Do they have 10 digits? If there are records with no application number, or whose application numbers do not appear elsewhere, they can be dropped from the excluder data set.

When we are satisfied with the excluder data set, we can remove the matching rows from big.
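That removal might be sketched as follows, assuming the application number columns in both data frames are called Appl and that excluder has already been reduced to its Yes and Contingent records:

    big <- big[!is.element(big$Appl, excluder$Appl), ]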
When we did this, our data set ended up with 19,034 rows and between 30 and 40 columns, depending on exactly which columns we chose to save.

A.6 Payment Matrix

Recall from Section 8.9 that every row of the payment matrix contains a fixed-layout record consisting of a 10-digit application number, then a space, and then 48 repetitions of numeric fields of lengths 12, 12, 8, and 3 characters. Those values will go into a vector that will define the field widths. We might also set up a vector of column names, by combining names like Appl and Filler with 48 repetitions of four names like Pay, Delq, Date, and Mo. We pasted those 48 replications together with the numbers 48:1 to create unique and easily tractable names. We also know that the first two columns should be categorical; among the repeaters, we might specify that all should be numeric, or perhaps that the two amounts should be numeric and the date and number of months, character. With that preparation, we are ready to read in the file, using read.fwf() and the widths, names, and column classes just created. We called that data frame pay.
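A sketch of that preparation and the read.fwf() call might look like the following; for simplicity, this version reads everything as character, leaving the numeric conversions for later:

    widths <- c(10, 1, rep(c(12, 12, 8, 3), 48))
    nms <- c("Appl", "Filler",
             paste0(rep(c("Pay", "Delq", "Date", "Mo"), 48),
                    rep(48:1, each = 4)))
    pay <- read.fwf("Payments.txt", widths = widths,
                    col.names = nms, colClasses = "character")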
Our column naming convention makes it easy to extract the names of the date columns with a command like grep("Date", names(pay)). For example, recall that we will declare as "Indeterminate" any record with six or fewer non-zero dates. We can use apply() across the rows of our pay data frame, operating just on the subset of date columns, and run a function that determines whether the number of non-zero entries is ≤6. (Normally, we are hesitant to use apply() on the rows of a data frame. In this case, we are assured that the relevant columns are all of identical types and lengths.) Now, we create a column of status values, and wherever the result of this apply() is TRUE, we insert Indet into the corresponding spots in that vector. Our code might look something like the following:

    set up pay$Good as a character vector in the pay data frame
    apply to the rows of pay[, grep("Date", names(pay))] a function
        that counts the number of entries that are not 0 or "00000000"
    set pay$Good to Indet when that count is <= 6

Our column naming convention also makes it easy to check our work. For example, we might pick out some number haphazardly – say, 45 – and look at the entries in pay for the 45th Indet entry. (This feels wiser than using the very first one, since the first part of the file might be different from the middle part.) In this example, our code might look as follows:

    pick some number, like, say, 45
    extract the row of pay corresponding to the 45th Indet -- call that rr
    pay[rr, grep("Date", names(pay))] shows that row

Now we perform similar actions for the other possible outcomes. At this stage, it makes sense to keep the different sorts of bad outcomes separate, combining them only at the end. One bad outcome is when an account is 3 months delinquent. We apply() a function that determines whether the maximum value in any of the month fields is ≥3. For those rows that produce TRUE, we insert a value like bad3 into pay$Good unless it is already occupied with an Indet value. Again, it makes sense to examine a couple of these records to ensure that our logic is correct. Another bad outcome occurs when there are at least two instances of month values equal to 2, and a third is when the delinquency value passes $50,000. In each of these cases, we update the pay$Good vector, ensuring that we only update entries that have not already been set. It is a good idea to check a few of the records for each indicator as we did earlier. Records that are assigned neither Indet nor one of the bad values are assigned Good. Once we have tabulated the bad values separately, we can combine them into a single Bad value. This way, we can compare the frequencies of the different outcomes to see if what we see matches what we expect.

A.6.1 Things to Check

There are lots of ways that these payment records can be inconsistent. The extent of your exploration here might depend on the time available. But we give, as examples, some of the questions you might ask of this data.

• What do the application numbers look like? Do they have 10 digits? Are any missing? Does every booked record have payment information, and does every record with payment information appear in big? What is the right action to take for booked records with no payment information – set their response to Indet?
• What are the proportions of the different outcomes? Do they look reasonable? For example, if 90% of records are marked Indet we might be concerned.
• What do the amounts look like? Are they ever negative, missing, or absurd?
• Are the dates valid? We expect every entry to be a plausible value in the form YYYYMMDD or else a zero or set of zeros. Are they? Are adjacent dates 1 month apart, as we expect? Are there zeros in the interior of a set of non-zero dates?
• Are adjacent entries consistent? For example, if month 8 shows a delinquency of 4 months, and month 10 shows a delinquency of 6 months, then we expect month 9 to show a delinquency of 5 months. Are these expected patterns followed?

Once the response variable has been constructed, we can merge big with pay. Actually, since we only want the Good column from pay, we merge big with the two-column data set consisting of Appl and Good extracted from pay. In this way, we add the Good column into big.

A.7 Starting the Modeling Process

Our data set is now nearly ready for modeling, although we could normally expect to convert character columns to factor first. Moreover, we often find errors in our cleaning processes or anomalous data when the modeling begins – requiring us to modify our scripts.
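A minimal sketch of that factor conversion follows; you would probably exclude identifier columns such as the application number, whose name here is assumed:

    is.chr <- sapply(big, is.character)   # named logical vector, one per column
    is.chr["Appl"] <- FALSE               # keep the key as character
    big[is.chr] <- lapply(big[is.chr], factor)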
In the current example, let us see if the Good column constructed above is correlated with the average of the four scores. This R code shows one way we might examine that hypothesis using the techniques in the book.

> big$AvgScore <- apply (big[,c("RAY", "JNG", "KSC", "NE")], 1,
      mean, na.rm = T)
> gb <- is.element (big$Good, c("Good", "Bad")) & !is.na (big$AvgScore)

The gb vector is TRUE for those rows of big for which the Good column has one of the values Good or Bad, and AvgScore is present, meaning that at least one agency score was present. We can now divide the records into groups, based on average agency score. Here we show one way to create five groups; we then tabulate the numbers of Good and Bad responses by group.

> qq <- quantile (big$AvgScore, seq (0, 1, len=6), na.rm = TRUE)
> (tbl <- table (big$Good[gb],
      cut (big$AvgScore[gb], qq, include.lowest = TRUE),
      useNA = "ifany"))

       [420,598] (598,649] (649,696] (696,748] (748,879]
  Bad        641       620       514       449       384
  Good       607       721       811       926      1067

> round (100 * prop.table (tbl, 2), 1)

       [420,598] (598,649] (649,696] (696,748] (748,879]
  Bad       51.4      46.2      38.8      32.7      26.5
  Good      48.6      53.8      61.2      67.3      73.5

We can see a relationship between agency scores and outcome from the tables. The first shows the counts, and the second uses prop.table() to compute percentages within each column. In the group with the lowest average score (left column), there are about as many Bad entries as Good ones. As we move to the right, the proportion of Good entries increases; in the group with the highest average scores, almost 75% of entries are Good ones. This is consistent with what we expect, since higher scores are supposed to be an indication of a higher probability of a good response.

