
A Data Scientist’s Guide to Acquiring,
Cleaning, and Managing Data in R
Samuel E. Buttrey and Lyn R. Whitaker
Naval Postgraduate School, California, United States
This edition first published 2018
© 2018 John Wiley & Sons Ltd
Library of Congress Cataloging-in-Publication Data applied for
Hardback ISBN: 9781119080022
Contents
Preface xvii
About the Companion Website xxi
1 R 1
1.1 Introduction 1
1.1.1 What Is R? 1
1.1.2 Who Uses R and Why? 2
1.1.3 Acquiring and Installing R 2
1.1.4 Starting and Quitting R 3
1.2 Data 3
1.2.1 Acquiring Data 3
1.2.2 Cleaning Data 4
1.2.3 The Goal of Data Cleaning 4
1.2.4 Making Your Work Reproducible 5
1.3 The Very Basics of R 5
1.3.1 Top Ten Quick Facts You Need to Know about R 5
1.3.2 Vocabulary 8
1.3.3 Calculating and Printing in R 11
1.4 Running an R Session 12
1.4.1 Where Your Data Is Stored 13
1.4.2 Options 13
1.4.3 Scripts 14
1.4.4 R Packages 14
1.4.5 RStudio and Other GUIs 15
1.4.6 Locales and Character Sets 15
1.5 Getting Help 16
1.5.1 At the Command Line 16
1.5.2 The Online Manuals 16
1.5.3 On the Internet 17
1.5.4 Further Reading 17
1.6 How to Use This Book 17
1.6.1 Syntax and Conventions in This Book 17
1.6.2 The Chapters 18
2 R Data, Part 1: Vectors 21
2.1 Vectors 21
2.1.1 Creating Vectors 21
2.1.2 Sequences 22
2.1.3 Logical Vectors 23
2.1.4 Vector Operations 24
2.1.5 Names 27
2.2 Data Types 27
2.2.1 Some Less-Common Data Types 28
2.2.2 What Type of Vector Is This? 28
2.2.3 Converting from One Type to Another 29
2.3 Subsets of Vectors 31
2.3.1 Extracting 31
2.3.2 Vectors of Length 0 34
2.3.3 Assigning or Replacing Elements of a Vector 35
2.4 Missing Data (NA) and Other Special Values 36
2.4.1 The Effect of NAs in Expressions 37
2.4.2 Identifying and Removing or Replacing NAs 37
2.4.3 Indexing with NAs 39
2.4.4 NaN and Inf Values 40
2.4.5 NULL Values 40
2.5 The table() Function 40
2.5.1 Two- and Higher-Way Tables 42
2.5.2 Operating on Elements of a Table 42
2.6 Other Actions on Vectors 45
2.6.1 Rounding 45
2.6.2 Sorting and Ordering 45
2.6.3 Vectors as Sets 46
2.6.4 Identifying Duplicates and Matching 47
2.6.5 Finding Runs of Duplicate Values 49
2.7 Long Vectors and Big Data 50
2.8 Chapter Summary and Critical Data Handling Tools 50
3 R Data, Part 2: More Complicated Structures 53
3.1 Introduction 53
3.2 Matrices 53
3.2.1 Extracting and Assigning 54
3.2.2 Row and Column Names 56
3.2.3 Applying a Function to Rows or Columns 57
3.2.4 Missing Values in Matrices 59
3.2.5 Using a Matrix Subscript 60
3.2.6 Sparse Matrices 61
3.2.7 Three- and Higher-Way Arrays 62
3.3 Lists 62
3.3.1 Extracting and Assigning 64
3.3.2 Lists in Practice 65
3.4 Data Frames 67
3.4.1 Missing Values in Data Frames 69
3.4.2 Extracting and Assigning in Data Frames 69
3.4.3 Extracting ings at Aren’t ere 72
3.5 Operating on Lists and Data Frames 74
3.5.1 Split, Apply, Combine 75
3.5.2 All-Numeric Data Frames 77
3.5.3 Convenience Functions 78
3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames 79
3.6 Date and Time Objects 80
3.6.1 Formatting Dates 80
3.6.2 Common Operations on Date Objects 82
3.6.3 Differences between Dates 83
3.6.4 Dates and Times 83
3.6.5 Creating POSIXt Objects 85
3.6.6 Mathematical Functions for Date and Times 86
3.6.7 Missing Values in Dates 88
3.6.8 Using Apply Functions with Dates and Times 89
3.7 Other Actions on Data Frames 90
3.7.1 Combining by Rows or Columns 90
3.7.2 Merging Data Frames 91
3.7.3 Comparing Two Data Frames 94
3.7.4 Viewing and Editing Data Frames Interactively 94
3.8 Handling Big Data 94
3.9 Chapter Summary and Critical Data Handling Tools 96
4 R Data, Part 3: Text and Factors 99
4.1 Character Data 100
4.1.1 The length() and nchar() Functions 100
4.1.2 Tab, New-Line, Quote, and Backslash Characters 100
4.1.3 The Empty String 101
4.1.4 Substrings 102
4.1.5 Changing Case and Other Substitutions 103
4.2 Converting Numbers into Text 103
4.2.1 Formatting Numbers 103
4.2.2 Scientific Notation 106
4.2.3 Discretizing a Numeric Variable 107
4.3 Constructing Character Strings: Paste in Action 109
4.3.1 Constructing Column Names 109
4.3.2 Tabulating Dates by Year and Month or Quarter Labels 111
4.3.3 Constructing Unique Keys 112
4.3.4 Constructing File and Path Names 112
4.4 Regular Expressions 112
4.4.1 Types of Regular Expressions 113
4.4.2 Tools for Regular Expressions in R 113
4.4.3 Special Characters in Regular Expressions 114
4.4.4 Examples 114
4.4.5 The regexpr() Function and Its Variants 121
4.4.6 Using Regular Expressions in Replacement 123
4.4.7 Splitting Strings at Regular Expressions 124
4.4.8 Regular Expressions versus Wildcard Matching 125
4.4.9 Common Data Cleaning Tasks Using Regular Expressions 126
4.4.10 Documenting and Debugging Regular Expressions 127
4.5 UTF-8 and Other Non-ASCII Characters 128
4.5.1 Extended ASCII for Latin Alphabets 128
4.5.2 Non-Latin Alphabets 129
4.5.3 Character and String Encoding in R 130
4.6 Factors 131
4.6.1 What Is a Factor? 131
4.6.2 Factor Levels 132
4.6.3 Converting and Combining Factors 134
4.6.4 Missing Values in Factors 136
4.6.5 Factors in Data Frames 137
4.7 R Object Names and Commands as Text 137
4.7.1 R Object Names as Text 137
4.7.2 R Commands as Text 138
4.8 Chapter Summary and Critical Data Handling Tools 140
5 Writing Functions and Scripts 143
5.1 Functions 143
5.1.1 Function Arguments 144
5.1.2 Global versus Local Variables 148
5.1.3 Return Values 149
5.1.4 Creating and Editing Functions 151
5.2 Scripts and Shell Scripts 153
5.2.1 Line-by-Line Parsing 155
5.3 Error Handling and Debugging 156
5.3.1 Debugging Functions 156
5.3.2 Issuing Error and Warning Messages 158
5.3.3 Catching and Processing Errors 159
5.4 Interacting with the Operating System 161
5.4.1 File and Directory Handling 162
5.4.2 Environment Variables 162
5.5 Speeding ings Up 163
5.5.1 Profiling 163
5.5.2 Vectorizing Functions 164
5.5.3 Other Techniques to Speed ings Up 165
5.6 Chapter Summary and Critical Data Handling Tools 167
5.6.1 Programming Style 168
5.6.2 Common Bugs 169
5.6.3 Objects, Classes, and Methods 170
6 Getting Data into and out of R 171
6.1 Reading Tabular ASCII Data into Data Frames 171
6.1.1 Files with Delimiters 172
6.1.2 Column Classes 173
6.1.3 Common Pitfalls in Reading Tables 175
6.1.4 An Example of When read.table() Fails 177
6.1.5 Other Uses of the scan() Function 181
6.1.6 Writing Delimited Files 182
6.1.7 Reading and Writing Fixed-Width Files 183
6.1.8 A Note on End-of-Line Characters 183
6.2 Reading Large, Non-Tabular, or Non-ASCII Data 184
6.2.1 Opening and Closing Files 184
6.2.2 Reading and Writing Lines 185
6.2.3 Reading and Writing UTF-8 and Other Encodings 187
6.2.4 The Null Character 187
6.2.5 Binary Data 188
6.2.6 Reading Problem Files in Action 190
6.3 Reading Data From Relational Databases 192
6.3.1 Connecting to the Database Server 193
6.3.2 Introduction to SQL 194
6.4 Handling Large Numbers of Input Files 197
6.5 Other Formats 200
6.5.1 Using the Clipboard 200
6.5.2 Reading Data from Spreadsheets 201
6.5.3 Reading Data from the Web 203
6.5.4 Reading Data from Other Statistical Packages 208
6.6 Reading and Writing R Data Directly 209
6.7 Chapter Summary and Critical Data Handling Tools 210
7 Data Handling in Practice 213
7.1 Acquiring and Reading Data 213
7.2 Cleaning Data 214
7.3 Combining Data 216
7.3.1 Combining by Row 216
7.3.2 Combining by Column 218
7.3.3 Merging by Key 218
7.4 Transactional Data 219
7.4.1 Example of Transactional Data 219
7.4.2 Combining Tabular and Transactional Data 221
7.5 Preparing Data 225
7.6 Documentation and Reproducibility 226
7.7 The Role of Judgment 228
7.8 Data Cleaning in Action 230
7.8.1 Reading and Cleaning BedBath1.csv 231
7.8.2 Reading and Cleaning BedBath2.csv 236
7.8.3 Combining the BedBath Data Frames 238
7.8.4 Reading and Cleaning EnergyUsage.csv 239
7.8.5 Merging the BedBath and EnergyUsage Data Frames 242
7.9 Chapter Summary and Critical Data Handling Tools 245
8 Extended Exercise 247
8.1 Introduction to the Problem 247
8.1.1 The Goal 248
8.1.2 Modeling Considerations 249
8.1.3 Examples of Things to Check 249
8.2 The Data 250
8.3 Five Important Fields 252
8.4 Loan and Application Portfolios 252
8.4.1 Layout of the Beachside Lenders Data 253
8.4.2 Layout of the Wilson and Sons Data 254
8.4.3 Combining the Two Portfolios 254
8.5 Scores 256
8.5.1 Scores Layout 256
8.6 Co-borrower Scores 257
8.6.1 Co-borrower Score Examples 258
8.7 Updated KScores 259
8.7.1 Updated KScores Layout 259
8.8 Loans to Be Excluded 260
8.8.1 Sample Exclusion File 260
8.9 Response Variable 260
8.10 Assembling the Final Data Sets 262
8.10.1 Final Data Layout 262
8.10.2 Concluding Remarks 263
A Hints and Pseudocode 265
A.1 Loan Portfolios 265
A.1.1 ings to Check 266
A.2 Scores Database 267
A.2.1 ings to Check 268
A.3 Co-borrower Scores 269
A.3.1 ings to Check 270
A.4 Updated KScores 271
A.4.1 ings to Check 272
A.5 Excluder Files 272
A.5.1 ings to Check 272
A.6 Payment Matrix 273
A.6.1 ings to Check 274
A.7 Starting the Modeling Process 275
Bibliography 277
Index 279
Preface
Statisticians use data to build models, and they use models to describe the world
and to make predictions about what will happen next. There has been a large
number of very good books that describe statistical modeling, but these model-
ing efforts usually start with a set of “clean,” well-behaved data in which nothing
is missing or anomalous.
In real life, data is messy. There will be missing values, impossible values,
and typographical errors. Data is gathered from multiple sources, leading to
both duplication and inconsistency. Data that should be categorical is coded
as numeric; data that should be numeric can appear categorical; data can be
hidden inside free-form text; and data can be in the form of dates in a wide
number of possible formats. We estimate that 80% of the time taken in any
data analysis problem is taken up just in reading and preparing the data. So, any
analyst needs to know how to acquire data and how to prepare it for modeling,
and the steps taken should be automatic, as far as possible, and reproducible.
This book describes how to handle data using the R software. R is the most
widely used software in statistics, and it has the advantage of being free,
open-source, and available on every major computing platform. Whatever
software you use, you will find yourself facing the issues of acquiring, cleaning,
and merging data, and documenting the steps you took. We hope this book
will help you do these things efficiently.
Sam Buttrey and Lyn Whitaker
Monterey, California, USA
November 30, 2016
Companion Website
Don't forget to visit the companion website for this book:
www.wiley.com/go/buttrey/datascientistsguide
There you will find valuable material designed to enhance your learning,
including:
A complete listing of all the R code in the Book
Example datasets used in the Exercises
1
R
1.1 Introduction
This book focuses on one problem that is common to almost every statistical
problem – indeed, to almost any problem involving any sort of analysis. That
problem is acquiring and preparing the data. Across our many years of data
analysis, we have learned that seemingly 80% of our time – maybe more – goes
into the data preparation steps (a belief echoed by others such as Dasu and
Johnson, 2003). Collectively, we call these actions data cleaning, although,
as we will discuss later, we sometimes use that term for something a little
more specific. Regardless of the name, almost any analysis requires that you
(i) acquire that data, that is, read it into the computer program; (ii) clean the
data, that is, identify entries that are duplicated or clearly erroneous or anoma-
lous, and take other preparation steps (e.g., combining entries such as “Female,”
“female,” and “F”); (iii) merge data from different sources; and (iv) prepare
the data for modeling, which might involve dividing a set of numeric values
into subsets, combining states into regions, and so on. This book discusses
some approaches for accomplishing these four steps in the R language (R Core
Team, 2013). A fifth problem, which receives less emphasis, is the problem of
long-term curation of the data. Which parts of the data must be saved and in
what way? We address that question by reference to the idea of reproducible
research, which we discuss later in this chapter, and later in the book as well.
1.1.1 What Is R?
R is a computer program that lets you analyze data. By “analyze” we mean, first,
read the data into the program and then operate on it – drawing graphs and
charts, manipulating values, fitting statistical models, and so on. (Notice that
we prefer to call data “it” rather than “them.” We discuss this choice briefly
toward the end of the chapter.) R is both a statistical “environment” and also
a programming language, and it is very widely used both in commercial and
academic settings. R is free and open-source and runs on Windows, Apple, and
Linux operating systems. It is maintained by a group of volunteers who release
bug fixes and new features regularly.
1.1.2 Who Uses R and Why?
R started as a tool for statisticians, evolving from a language called S that
was created in the 1970s. Today, R remains the primary language of academic
statisticians, and it also has a prominent place among analysts in business
and government as well. It is used not only for building statistical models
but also for handling and cleaning data, as in this book, and for developing
new statistical methods, building simulations, for visualization, and generally
for all the data-handling tools the statistician and the data scientist require.
Because of the ease with which users can develop and distribute new methods,
R has also become the tool of choice in certain fast-growing fields such as
biostatistics and genetics. Articles on “surveys of the top tools used by data
scientists” inevitably name R as one of the important tools with which data
scientists, as well as statisticians, should be familiar. Moreover, R’s popularity is
such that there are extensions to R (see “packages” in Section 1.4.4) that allow
you to connect to other programs such as the Python and Java languages, the
H2O machine-learning system, the ArcGIS geographical information system,
and many more.
1.1.3 Acquiring and Installing R
The primary way to acquire R is to download it from the Internet. The main
R website is www.r-project.org, and the www.cran.r-project.org
page ("CRAN" standing for "Comprehensive R Archive Network") is
where you can download R itself. There are in fact dozens of "mirror" sites for
CRAN – that is, websites that are essentially copies of the CRAN site – so as
to reduce the load on the CRAN site. You can probably find a mirror near you
on the “mirrors” page. After you download R, install it in the way you would
normally install a program on your operating system.
At any one time, users around the world will be running slightly different
versions of R, since new ones are released fairly frequently. For example, at this
writing the current version of R was called 3.3.2, but many users are still using
3.2 or earlier versions. This will almost never cause problems, but it is a good
idea to update your version of R from time to time.
There are also several slightly different versions of R distributed other than at
CRAN. Microsoft R Open is a particular version of R that uses a different set
of math libraries intended to make certain computations faster. Like "regular"
R, Microsoft R Open is free, although it does not run on OS X. Other ver-
sions of R are intended to communicate with relational databases or with other
big-data platforms. For this book, we will assume you are running “regular”
R – but in any case for our purposes all versions of R should behave exactly the
same way.
1.1.4 Starting and Quitting R
The way you start R depends on your operating system. Normally double-
clicking on an R icon will be enough to get R started. In the command-
line interface of many Linux systems, or using the OS X terminal window, it
may be enough just to type the upper-case letter R (or, for Windows command
lines, Rgui). When R has started, you will see the command prompt >. This is
the R console, the place where commands are entered. At this point, you can
start typing commands to R. When it comes time to quit R, you can either
“kill” the window in the usual way (for OS X, the red dot, the lightswitch in the
top right, or via the File dialog; for Windows, the red X or File dialog) or you
can type the q() command. In either case, R will then ask you if you want to
“Save workspace image.” If you answer “yes” to this question, R will save to the
disk any changes you made during the current session, whereas if you answer
“no,” R will return its workspace to the condition it was in when R was last
started. We almost always want to answer “yes” to this question!
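If you prefer to skip the question, the q() function accepts the answer directly
as its save argument (a small sketch; see the help at ?q):
> q(save = "yes") # quit and save the workspace without being asked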
1.2 Data
Data is information about the elements of whatever problem we are investigat-
ing. Data comes in many forms, but for our purposes it will always be presented
in a set of computer-ready values. For example, a database concerning birds
might include text about the habits of the birds, numbers giving lengths and
weights of the individuals, maps showing migration patterns, images showing
the birds themselves, sound recordings of the birds’ calls, and so on. Although
they look very different, all of these different pieces of information can be rep-
resented in the computer in digital form in one way or another. In this example,
one of our primary tasks might be to ensure that each bird’s description is cor-
rectly matched with the correct map, image, and song file. Our data analysis
projects rarely include data quite so disparate, but in almost every case we need
to acquire data, clean it (a process we start to describe in what follows and con-
tinue throughout the book), and prepare it for modeling, and in almost every
case we expect our data to consist of both numeric and textual values.
1.2.1 Acquiring Data
The first step in a data analysis project, of course, is to get the data into R where
it can be manipulated. We are old enough to remember the days when this
involved typing all the data from the back of a book or journal paper into a
statistics package by hand, but happily this is not necessary today. On the other
hand, data now comes in a variety of formats, few of which were created with
the convenience of the data scientist in mind. In Chapter 6, we describe some
of these common formats and how to use R to read data effectively.
1.2.2 Cleaning Data
We “clean” data when we detect (and, in many cases, remove) anomalies.
Anomalies will very often be missing values, but they might also be absurd
ones, as when people's ages are reported as 999 or 1. Sometimes, as in our
earlier example, we might have genders reported as “Female,” “female,” and “F”
and we want to combine these three values. In the cleaning process we might
learn, for example, that one data source produced no data at all in August 2016;
this sort of fact will need to be brought to the attention of the data provider.
The data cleaning process also involves merging data from different sources,
extracting subsets or reshaping the data in some way. All in all, data cleaning
is the process of turning raw data, received from one or more providers, into a
data set that can be used in visualization, modeling, and decision-making.
In practice these steps are iterative. Our cleaning process not only informs the
modeling, but it sometimes leads us to re-acquire the data in a different, more
usable form. Similarly, insights from modeling will often lead us to prepare the
data in a new and more revealing way – because it is when we model that we
often discover anomalies or other interesting attributes of the data.
1.2.3 The Goal of Data Cleaning
What a “clean” data set should look like depends on what your goals are. One
useful perspective is given by Wickham (2014), who describes what he calls
“tidy” data. A tidy data set is rectangular (or tabular); each row describes one
unit of analysis (an observation), and each column gives one measurement (a
variable). For example, in a data set giving measurements about people, each
row would concern itself with a person, and the columns might give height,
weight, age, blood type, and so on.
In some problems, it is not immediately clear what the unit of analysis is.
For example, imagine data that describes the locations of boats over the course
of a month, as recorded by GPS. For some purposes, a “tidy” data set would
have one row per GPS ping, each row giving a ship identifier, a location, and
a time. For other purposes, we might prefer a data set with one row per boat,
each row giving the southernmost point that ship reaches, or perhaps giving a
binary indicator of whether the ship did, or did not, spend time in international
waters. Some data – images and sound, for example – do not lend themselves
to this “tidy” approach.
The exact layout of your final data will depend on what you plan to do with
it – and in some cases this won’t be known until after you have operated on
the data.
1.2.4 Making Your Work Reproducible
It is vital that other people be able to reproduce the actions you took on your
data. Ideally, you or another analyst should be able to start with your raw data,
run all the steps you applied to it, and emerge with exactly the same clean, pre-
pared data sets. This will be useful to you when you encounter a situation similar
to the one in the previous paragraph, where the form of the new data needs to
be designed. But it is even more important for another analyst, since if you
or another analyst can reproduce your results there will be no disagreement
about the data. e act of making research reproducible has, in recent years,
been rightfully recognized as a cornerstone of scientific progress. Record and
document every step you take so that others can repeat them.
1.3 The Very Basics of R
This book is about handling data in R. It cannot teach you the very basics of R in
detail – although, happily, there are many good books and online resources that
can. (We give a few examples at the end of this chapter.) In this section, we list a
few of the most basic facts about R, but, again, this book is not intended to teach
you R. Rather, we focus on the details of R and of the way data is represented
in R, in order to help you understand some of the ways to acquire, clean, and
handle data inside R.
1.3.1 Top Ten Quick Facts You Need to Know about R
In this section, we give a few of the most important facts about R a beginner
needs to know. ere will be more detail on these facts later in the chapter and
throughout the book.
1) The prompt is (by default) >. If you leave a command incomplete, maybe
because there is an unclosed parenthesis or quotation mark, R gives you
the continuation prompt, which is +. The Esc key (Windows) or control-C
(other systems) produces the break command, which will take you back to
the regular prompt. In this example, we show what a completed command
looks like – in this case, R is computing the value of 3 divided by 2.
> 3/2
[1] 1.5
Here, R produced the prompt (>), and we typed 3/2 and pressed the
Enter (or “Return”) key. R then produced the output. We will talk about
the [1] part in Chapter 2, but the computed value of 1.5 is shown. In
the following example, we show what happens when we press Enter after
typing the slash character:
> 3/
+ 2
[1] 1.5
Here, since the expression on the first line was incomplete, R produced the
continuation prompt, +. When we typed 2 and hit Enter, the expression
was complete and the result was shown. In case of confusion, press break
until the original > prompt is showing.
In examples in this book where we want to show the R output, we also
show the > prompt in front of our code. Remember that > is produced by
R; you don't need to type it yourself. (At the end of the chapter, we tell
you where you can get all the code from the book in electronic form.)
2) R is case-sensitive, which means that upper- and lower-case letters
are different in R. For example, the built-in R object LETTERS gives
all 26 upper-case letters. A different item called letters contains the
lower-case versions of the alphabet. There is no built-in object called
Letters.
3) Show an object by typing its name. For example, if you type ls by itself,
you see the contents of the function whose name is ls, the one that lists all
the objects in your workspace (which we define later). To actually run the
function and see the objects, you need to type the function’s name together
with parentheses. In this case, list your objects by typing ls().
4) Get help for a function or object named thing with the command
help(thing) or ?thing. For example, to see the help for the
ls() function, type help(ls). If you don’t know the name, try
help.search() with a relevant word in quotation marks; for example,
try help.search("matrices") to see functions that handle
matrices.
5) Assign a value or object to a name with the left-arrow (less-than plus
hyphen): for example, the command a <- 1 creates a new object named
a with value 1. (You can also assign with a command such as a = 1,
but we don't recommend it.) The assignment will over-write any existing
object named a you might have had. Once you create an object, it is in
your “workspace,” and your workspace can be saved when you quit. So
unless your computer crashes, when you create an object it will persist
until you delete it. Display the set of objects in your workspace with
objects() or ls(); remove an object with remove() or rm(). Not
every character is permitted in the name of an R object. Start a name
with a letter or a dot, and then stick to numbers, letters, underscores,
and dots. Names cannot contain spaces. In this example, we show some
assignments that succeed and some that do not.
> a <- 1
> a.1 <- 1
> 2a <- 1
Error: unexpected symbol in "2a"
> a 2 <- 1
Error: unexpected numeric constant in "a 2"
The first two of these assignments succeed, because a and a.1 are valid
names. The last two fail because they refer to invalid names.
6) The comment character is #. A comment ends at the end of the line. If you
want a comment to span multiple lines, you need to start each comment
line with #.
7) Recall earlier commands with the up-arrow. You can edit an earlier
command and then press the Enter key to run the new version. The
history() command shows a list of your recent commands; put a
number in (as in history(500)) to see more.
8) When referring to file names, R itself uses the forward slash in the console.
The Windows file system uses the backward slash, so Windows users may
use that, too, but in that case you have to type \\ (we talk more about
this later on). For example, a Windows user who wants to access a file
named c:\temp\mycode.R in an R command will need to type either
c:/temp/mycode.R or c:\\temp\\mycode.R. You’ll need to use a
regular, single backslash if you are interacting with the Windows operat-
ing system and not R – if, for example, you are presented with a graphical
"select file" window. The file systems for OS X and Linux users use the
forward slash at all times.
9) Just about any function you want is built into R, so R makes an excellent
calculator. For example,
> sin(log(34))
[1] -0.375344
This says that the sine (using radians) of the logarithm (base e) of 34 is
-0.375344. Most functions allow you to specify "arguments," values you
pass to the function to modify its behavior. Some must be specified; others
have default values. For example, log(34, 10) produces the base 10
logarithm instead of the natural logarithm. If a function accepts multiple
arguments, you will need to specify them in the proper order – or by name.
In this example, the arguments to log are named x and base (see the
help at ?log), so we could have entered log(base = 10, x = 34)
too.
10) R's operators include the comparison operators != for "not equal," == for
"is equal to," <= and >= for "less than or equal to" and "greater than or equal
to," and the arithmetic operators * for "multiplied by" and ^ for "raised to
the power of." (Several of these facts are shown in action just after this list.)
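A few of these facts in action at the console (a short sketch; the outputs shown
are what current versions of R print):
> LETTERS[1:3] # fact 2: the built-in upper-case alphabet
[1] "A" "B" "C"
> log(34, 10) # fact 9: the base 10 logarithm
[1] 1.531479
> log(base = 10, x = 34) # the same call, with named arguments
[1] 1.531479
> 2^3 # fact 10: "raised to the power of"
[1] 8
> 3 != 2 # fact 10: "not equal"
[1] TRUE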
1.3.2 Vocabulary
As we get started, it will be worthwhile for us to repeat some of the vocabulary
of R, and of data, that you should be familiar with. In this section, we define
some of the terms that are commonly used in discussion of R, both in this book
and elsewhere.
vector A vector is the simplest piece of data in R. It consists of one or more
entries (also called “items” or “elements”) that are all either text or all num-
bers or all “logical” (i.e., TRUE or FALSE). (Technically, a vector might have
length 0, and there are some other types, but that last sentence covers 99%
of what you will do with R.) For example, the value of the famous constant 𝜋
is built into R as the object pi, and the R object pi is a numeric vector with
length 1. We talk about vectors in Chapter 2.
matrix A matrix is just a two-dimensional vector in rectangular shape. While
matrices are important in statistics, they are less important in the data clean-
ing process. Still, it is useful to know about matrices in preparation for using
data frames (below). We discuss matrices at the start of Chapter 3.
list A list is an R object that can hold other R objects. Lists are everywhere in
R and you will need to know how to create them and access their elements.
We discuss lists starting in Section 3.3.
data frame A data frame is a cross between a matrix and a list. Like a matrix, it
is rectangular, but like a list it can contain items of different sorts – numeric,
text, and so on – as its columns. You can think of a data frame as a list of
vectors all of which are the same length. Most of the data we encounter will
be in the form of data frames, and, if it isn’t, we will usually try to put it into
a data frame. We talk about data frames starting in Section 3.4.
object An object is a general word for anything in R. Usually, we will use this to
refer to data objects such as vectors, matrices, lists, or data frames, but we
might use “object” to refer to a function, a file handle, or anything else with
a name in R.
rows and columns A data frame and a matrix are two-dimensional rectan-
gular objects, consisting of rows and columns. Our goal, in a data cleaning
problem, is almost always to produce one or more data frames whose rows
correspond to the things being measured, and whose columns give the
different measurements. For example, in a military manpower problem each
row might represent a soldier, and the columns would give measurements
such as age, sex, rank, and years in service. Statisticians sometimes call
rows and columns “observations” and “variables” (although that second
word has another meaning in R, see the following discussion). Confusingly,
other terms exist too: authors in machine learning talk of "instances" (or
"entities") and "attributes" ("features"). We will use "rows" and "columns"
when the emphasis is on the representation of the data in a data frame, and
"observations" and "variables" when the emphasis is on the role being played
by the data.
variable A variable is also a generic term for an R object, especially one of
the objects in our workspace. The name is slightly misleading because the
object’s value doesn’t have to change. We would call pi a “variable,” at least
in casual conversation.
operator An operator describes an action on one or two objects – often vec-
tors – and produces a result. For example, the * operator, placed between two
numbers, produces their product. Most operators act on two things – we say
they are "binary." The + and - operators can also be "unary," meaning they
act on one number. So in the expression -3, the - is a unary operator. Oper-
ations are often "vectorized," meaning they act separately on each item of a
vector (see the console sketch just after this vocabulary list).
function A function is a kind of R object that can take an action. Functions
often accept arguments to control the computations they make, and pro-
duce “return values,” the results of the computation. For example, the cos()
function takes as its one argument the size of an angle, in radians, and pro-
duces, as its return value, the cosine of that angle. So typing cos(1) invokes
a function and produces a value of about 0.54. Operators are functions, too,
although they don’t look like it. For example, you can multiply two numbers
by calling the * function explicitly with two arguments, though you'll need
quotation marks; "*"(3, 4) operates * on 3 and 4 and produces 12. Func-
tions are covered in detail in Chapter 5.
expression An expression is a legal R “phrase” that would produce an action if
you entered it into R. For example, a <- 3 is an expression that, if evaluated,
would cause an item a to be created and given the value 3. That expression
is called an assignment. pi > 3 is an expression that would produce TRUE,
since the number pi is greater than 3. This is an example of a comparison.
Just typing 2 is also an expression; the system interprets this as being the
same as print(2), and prints out the value 2. Most expressions involve the
use of functions or operators, as well as R variables.
command We often use the word “command” as a casual shortcut to mean
“function,” “operator,” or “expression.” For example, we might say “use the
help command” instead of “run the help function.”
script A script is a text file that can list R commands. We use script files in all
of our projects and we recommend that you do, too. We discuss scripts in
Chapter 5.
workspace e workspace is the set of objects (data and functions) in our cur-
rent environment. ese are objects we have created.
working directory e working directory is the folder on your computer where
your R data is stored. By default, R will look in this directory for any exter-
nal files you might ask for. We talk more about the working directory in the
following section.
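To make the operator and function entries concrete, here is a short console
sketch (the outputs are what R prints under its default settings):
> cos(1) # a function call; the return value is printed
[1] 0.5403023
> "*"(3, 4) # the * operator invoked explicitly as a function
[1] 12
> c(1, 2, 3) * 2 # a vectorized operation: * acts on each element
[1] 2 4 6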
With this vocabulary in mind it is easier to discuss some of the ways that R
operates. As an example, it’s not always obvious what the different operators
in R will do in weird cases. We know that 3 < 10 is TRUE. What is the value
of 3 < "10"? The answer is FALSE. R cannot compare a number to a char-
acter, so it converts both values into characters. Then the comparison is made
alphabetically. So just as "Apple" < "Banana" is TRUE because "Apple"
comes first in alphabetical order, so too does "10" come before "3" – since,
as always, we compare the initial characters first, and the 1 character precedes
the 3 character in our computer's sorting system. We talk much more about
the different types of data in R, and converting between them, in Chapter 2.
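At the console, the comparisons just described look like this:
> 3 < 10
[1] TRUE
> 3 < "10" # 3 is converted to "3", and "1" sorts before "3"
[1] FALSE
> "Apple" < "Banana"
[1] TRUE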
Another example of unexpected behavior has to do with the way R reads
commands typed in at the command line. We saw that the command a <- 3
assigns the value 3 to an object a. However, what happens when you type
a < - 3, with a space between < and -? The answer is that R attaches the
hyphen to the value 3, and then compares the value of a to the number -3. In
general, spaces will not affect your R commands – but in this case the space
“broke” the assignment operator <-.
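Here is that trap reproduced at the console:
> a <- 3 # assignment: a now holds 3
> a < - 3 # comparison: is a less than -3?
[1] FALSE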
R objects have names and names have to conform to a small set of rules. If
data is brought in from outside R, perhaps from a spreadsheet, names will be
changed if they need to be made valid (details can be seen in the help for the
make.names() function). Technically it is possible to force R to use invalid
names, but don’t do that. A few names in R are reserved, meaning they cannot
be used as the name of an R variable. For example, you cannot name an object
TRUE; that name is reserved. (You may name an object T, because that name
isn’t reserved, but we don’t recommend it.) It is also wise to try to avoid giv-
ing an object the name of an existing R function (although there are lots of R
functions and some are obscure). If you name a vector sum, and then use the
sum() function to add things up, R will be smart enough to differentiate your
vector from the system's function. But if you create a function called sum() in
your workspace, R will use that one (since your function will appear first on the
search path; see "search path" in Section 1.4.1). This is almost never what you
want. The R functions c() and t() provide good examples of names to avoid.
Finally, R can operate in an “object-oriented” way. A number of R functions
are “generic,” meaning that have specific methods to handle specific data types.
For example, the summary() function applied to a numeric vector gives some
information about the values in the vector, but the same function applied to
the output of a modeling function will often give summary statistics about the
model. e exact action that the generic function takes depends on the “class”
(i.e., the type) of the object passed to it. We run across a few of these generic
functions in the following few chapters and discuss object-oriented program-
ming briefly in Section 5.6.3.
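As a small illustration of a generic function, summary() applied to a numeric
vector prints descriptive statistics (output from R's default method):
> summary(c(1, 2, 3, 4))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.75    2.50    2.50    3.25    4.00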
1.3.3 Calculating and Printing in R
R performs calculations and prints results. In this section, we talk about some
of the differences between what R computes and what it prints, as well as how
text data is represented.
Floating-Point Error
This is a good place to discuss an issue that arises in a lot of data cleaning prob-
lems and has caught us and our students off-guard more than once. For almost
all computations, R uses "double-precision floating-point" arithmetic, as most
other systems do. What this means is that R can represent numbers up to about
±1.79 × 10^±308 with at least some accuracy. However, double precision is not
exact. Consider this example, in which we multiply together the numbers (1/49)
and 49.
> 1/49 * 49 # as expected
[1] 1
> 1 - (1/49 * 49)
[1] 1.110223e-16
> (49 * 1/49) == (1/49 * 49) # should be TRUE
[1] FALSE
The first computation shows the "expected" product of (1/49) and 49 – the
value 1. In fact, though, the second computation shows that this prod-
uct is not exactly 1; it differs from 1 by a tiny amount that we might call
"floating-point error." That amount was so small that it wasn't displayed in the
first computation, according to R's default display conditions. (The command
print(1/49 * 49, digits = 16) will reveal that this product is
computed as a number very slightly less than 1.) This is not a bug in R; it's a
statement about the way double-precision floating-point arithmetic works,
analogous to the way that in ordinary arithmetic, the number 0.333333
is not quite 1/3. The final computation shows the practical effect of this: if
you compare two floating-point values directly, they might be recorded as
being different just because of floating-point error. You will need to be aware
of this when you compare the results of doing the same computation in two
different ways.
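One common idiom for such comparisons – our suggestion, not something the
example above requires – is all.equal(), which tests equality up to a small
tolerance:
> isTRUE(all.equal(49 * 1/49, 1/49 * 49)) # TRUE despite floating-point error
[1] TRUE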
Significant Digits
In the above-mentioned example, we saw how R printed 1 even though
the number in question was slightly different. While R’s computations use
double-precision floating point, its display will generally print a smaller
number of digits than are available. Moreover, R formats outputs in a neat
way, so that typing 2.00 produces 2, but typing 2.01 prints out as 2.01.
These formatting choices are most noticeable when many values are being
shown. e display that R chooses does not affect the precision with which
it does calculations. Of course you can force R to round off the results of
its calculation; we discuss formatting, rounding, and scientific notation in
Chapter 4.
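For example (a quick sketch; the default display shows seven significant digits):
> 1/3
[1] 0.3333333
> print(1/3, digits = 16) # same stored value, more digits displayed
[1] 0.3333333333333333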
Character Strings
We will spend a lot of time in this book handling text or character data, data
in the form of letters such as "Oakland" or "Missing". Sometimes, as is
common, we will call a set of characters a string. In R, strings are enclosed by
quotation marks, and either the double-quotation mark " or the single one '
can be used. A string delineated by single-quotation marks is converted
into the other kind. The two kinds of quotation marks make it possible to
insert a quote into a string, such as this: "She said 'No.' " (If you
typed "She said "No." ", you would see R produce an error.) If you type
'She said "No." ', the outside quotes are converted to double quotes.
Then, since there are double quotes on the inside, too, those interior quotation
marks are "protected" by preceding them with the backslash character. The
result is converted into "She said \"No.\" ".
This idea of "protecting" certain special characters goes beyond quotation
marks. e character that marks the end of a line of text is called “new-line” and
is written as \n, backslash followed by n. Typing this character requires two
keystrokes, but it counts as only one character. In general, special characters
are “protected” by the backslash characters. Besides the quotation mark and the
new-line, the important special characters are \t, the tab, and \\, the backslash
itself. The backslash also serves to introduce strings in special formats, such as
hexadecimal (e.g., "\xb1" produces the character with hexadecimal value b1,
which displays as the plus-minus sign, ±) or Unicode (e.g., "\U20ac" uses
Unicode to display the Euro currency symbol). We talk much more about text
in general and Unicode in particular in Chapter 4.
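A short console sketch of quoting and protection (print() shows the protected
form; cat() writes the characters themselves):
> s <- 'She said "No."'
> s
[1] "She said \"No.\""
> cat(s, "\n")
She said "No."
> nchar("a\tb\n") # the tab and the new-line each count as one character
[1] 4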
1.4 Running an R Session
Once you start using R you may find yourself using it for lots of different
projects. Although this is partly a matter of taste, we find it useful to keep
separate sets of data for separate projects. In this section, we describe where R
keeps your data, and some other aspects of R with which you will need to be
familiar.
1.4.1 Where Your Data Is Stored
When you start R, you start it in a working directory, and this directory forms
the starting point for where R looks for, and stores, data. For example, typing
list.files() will list all of the files in your working directory. When you
quit R and save the workspace, a file with all of your R objects will be created
in that same directory. This file is named .RData. The leading dot in the name
is important, because some terminal programs, such as the “bash” command
interpreter, do not by default list files whose names start with a dot. We don’t
recommend changing the name of the .RData file.
This provides a natural mechanism for project management. To prepare for a
new project on a system with a command-line interface, just create a new direc-
tory and start R from there (see “starting R” above). On systems with desktop
icons, copy an existing R icon, edit the properties to point to the new directory,
and add the project name to the icon. The details of this operation will depend
on your operating system. In this way, you can keep the different .RData files
for your different projects separate.
When you start R, it will use an existing .RData file if there is one in the
working directory, or create a new, empty one if there is not. Often we have a
certain number of objects from earlier projects that we want in the new project.
There are two mechanisms for acquiring those existing R objects. In one case,
we literally copy all the objects from another .RData in a different project’s
directory into the existing workspace, using the load() function. This can
be dangerous because objects being copied will over-write existing ones with
the same names. A second mechanism uses attach(), which puts the other
.RData on the "search path." The search path is a list of places where R looks
for objects when you mention them. You can examine your current search path
with the search() command. The first entry on the search path is the current
.RData file (although it carries the confusing name .GlobalEnv); most of
the other entries on the search path are put there by R itself. When you use a
name such as pi, R looks for that object in your workspace, and then in each
of the packages or directories named in the search path until it finds one by
that name. You can attach other .RData files anywhere in the search path,
except in the first position; usually we put them into position two so that they
are searched right after the local workspace. We talk more about getting data
into and out of R in Chapter 6.
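For instance, in a freshly started R session the search path typically looks
something like this (your output may differ):
> search()
[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"
[7] "package:methods"   "Autoloads"         "package:base"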
1.4.2 Options
R maintains a list of what it calls “options,” which describe aspects of your inter-
action with it. For example, one option sets the text editor that R calls when
you edit a function, one describes how much memory is set aside for R, one
lets you change the prompt character from its default, and so on. Generally, we
find the default values reasonable, but the help for the options() function
describes the possible values and running options() shows you the current
ones. Changes to the options last only for this R session. Section 3.3.2 shows an
example of setting one of the options.
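For example, the digits option controls how many significant digits R prints
(a small sketch; the change lasts only for the current session):
> options("digits") # query the current value
$digits
[1] 7
> options(digits = 10) # print ten significant digits from now on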
1.4.3 Scripts
Most of the work we do with R is interactive – that is, we issue commands
and wait for R's response. This use of R is best when we are exploring data and
developing approaches to handling and modeling it. As we develop sets of com-
mands for a particular project, we can combine these into “scripts,” which are
simply files full of commands. Having a set of commands together allows us
to execute them in exactly the same way every time, and it allows us to add
comments and other notes that will be useful to us and to other users whom
we share the code with. is approach, while still interactive, is best when we
have developed an approach and want to use it repeatedly. Scripts also provide
a natural mechanism for project management: often we start with an empty
workspace and use scripts to populate the workspace by reading and preparing
data, loading from other sources, or attaching other directories, before starting
on the modeling steps.
R can also be run in batch mode – that is, it can start, run a single set of
commands, and then stop. is approach can be used when the same task needs
to be performed repeatedly, perhaps on different data – say, every day to process
data gathered overnight. We talk about scripts and batch use of R in Chapter 5.
1.4.4 R Packages
A package is a set of functions (and maybe data and other stuff too) that pro-
vides an extension to R. R comes with a set of packages, some of which are
automatically placed onto the search path, and others of which are not. If a
package is present on your computer but not in your search path, you can access
(or “load”) it with the library() or require() command (these two differ
only in how they react if a package cannot be found). A package only needs to
be loaded once per R session, but when you re-start R you will need to re-load
packages. ere are also thousands of additional packages that have been con-
tributedbyRusersthatcanbefoundontheInternet,primarilyatthemain
repository at cran.r-project.org and its mirror sites. If your computer
is connected to the Internet, you can install a package (if you know its name)
with the install.packages() command. If that works, the package will
still need to be loaded with the library() command. If your computer is
not connected to the Internet, you can still install packages from a disk file if
one is available. Most of the code in this book requires no additional packages,
although in some cases we will point out cases where additional packages make
particular tasks easier, more efficient, or, in rare cases, possible.
It is possible to force certain packages to be loaded whenever you start R.
When we anticipate needing a package, our preference is to include a call to
library() or require() inside our scripts.
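A typical sequence might look like this (the package name here is just an
example):
> install.packages("data.table") # one-time download and install
> library(data.table) # load it, once per session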
1.4.5 RStudio and Other GUIs
The "look" of R depends on your operating system. At its most basic – and we
often see this when we are connecting to remote servers – R consists only of a
command line. On the most popular platforms – Windows and OS X – running
R produces a graphical user interface, or GUI. This is a set of windows con-
taining a number of menu items giving selections, or buttons that help you
perform common tasks. Most of the GUI, though, consists of the console. A
few enhanced GUIs are available. Perhaps the most widely used among these
is RStudio (RStudio Team, 2015), a development environment that includes a
console window, a set of script window tabs, and better handling of multiple
graphics windows. RStudio comes in free and paid versions for all operating
systems and is available from its maker at rstudio.com. We have found that
many of our students prefer the more interactive, perhaps more modern feel of
RStudio to the standard R interface – but underneath, the R language is exactly
the same.
1.4.6 Locales and Character Sets
R is essentially the same program whether you run it on Windows, OS X, or
Linux. (ere are minor differences in the way you access external files and
in some low-level technical functions that will not be relevant in data clean-
ing.) In particular, R is an English-language program, so a “for” loop is always
indicated by for(). Speakers of many languages can arrange to have error
messages delivered in their language, if this ability is configured at the time R is
installed – see the help for the Sys.setenv() function and for “environment
variables.”
Even though R is in English, it is possible to set the "locale" of R. This allows
you to change the way that R does things such as format currency values.
English speakers use the dot as the decimal separator and the comma to set
off thousands from hundreds, but many Europeans use those two characters
in reverse. Other locale settings affect the abbreviations in use for days of
the week and months of the year. We discuss some of these in Chapter 3, but
one important one to note here is the "collation" setting. This describes how
R sorts alphabetical items. Under the usual choices on Windows and OS X,
lower- and upper-case letters are sorted together, so that “a” precedes “A”
in alphabetical order, but both precede “b.” To continue an earlier example,
this ensures that "apple" < "banana" and "apple" < "Banana"
are both TRUE. However, on some Linux systems the so-called “C” collation
sequence is used. In that scheme, all the upper-case letters come before
any of the lower-case ones – so that "apple" < "banana" is TRUE, but
"apple" < "Banana" is FALSE. Moreover, as the help for Comparison
points out, "in Estonian, Z comes between S and T." You have to be aware of
both your locale and the relevant language whenever you compare strings.
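You can check your collation setting with Sys.getlocale(); the value shown
here is just a typical example, and yours will differ:
> Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"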
Another aspect of character handling is the use of different character sets.
Text in non-Roman languages such as Hebrew or Korean requires some special
considerations. We discuss these at some length in Chapter 4.
1.5 Getting Help
R has a number of ways of getting help to you. “Help” can mean information
about the specific syntax of individual R commands, about putting the pieces
of R together in programs, or about the details of the various statistical models
and tools that R provides. In this section, we describe some of the resources
available to help you learn about R.
1.5.1 At the Command Line
The most basic help is provided at the command line, through the commands
help(), ? and help.search(). The first two commands act identically and
will be most useful when you need information on a particular R function or
operator whose name you know. In most cases, the argument doesn’t need to
be in quotation marks, though it may be – so help(matrix) or ?"matrix"
both bring up a page about some matrix functions. Quotation marks will
be required when looking for help on some elements of the R language – so
?"for" gives the help page for the for looping term and help("==")
produces the page on comparison operators. The help.search() command
is useful when the subject, rather than the name, is known; this command
opens a window (depending on your operating system) that gives links to
associated R objects. A related command is the apropos() function, which
takes a character argument (as in apropos("matrix")) and returns a
vector of names of objects containing that string (in this example, every object
with matrix in its name). A final piece of command-line help is provided by
the args() function, which takes a function and displays the set of arguments
expected by, and default values provided by, that function.
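For example, args() displays the arguments of log() and their default values:
> args(log)
function (x, base = exp(1))
NULL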
1.5.2 The Online Manuals
When you install R, you are given the opportunity to install the online manuals
with it. ese manuals are generally correct and complete, but they are intended
as references, and are not always useful as tutorials.
1.5.3 On the Internet
The main page for the R project is r-project.org. This is the central repos-
itory for R and its documentation. If you are interested in participating in a
community of R users, you might consider joining one of the mailing lists,
which you can find under mail.html at that page.
R is very popular and there are lots and lots of blogs, pages, and other web
documents that address R and solve specific problems. Your favorite Internet
search engine will be able to find dozens of these.
1.5.4 Further Reading
A lot of documentation comes with R when you install it in the usual way. You
can find a list of these manuals under Help | Manuals in Windows, or Help | R
Help on OS X, or with help.start(). The "Introduction to R" manual is a
good place to start.
The book "The Art of R Programming" (Matloff, 2011) is a nice tour of many R
features ranging from beginning to advanced. As its name suggests, the empha-
sis is on writing powerful and efficient R programs. Many other books introduce
the use of R, or describe its application in specific fields such as economics or
genomics. e r-project website has a list of over 150 books using R. As we
mentioned earlier, that site also maintains mailing lists for interested users, and
a quick web search will reveal scores of blogs and web pages devoted to R and
to answering R questions.
The recent book by Wickham and Grolemund (2016) describes those authors'
approach to not only data cleaning but a set of additional tasks, including visual-
ization and modeling, which we think of as beyond the scope of data acquisition
and cleaning. at approach requires an entire set of tools from packages out-
side R – although they come conveniently bundled together – as well as a new
vocabulary. is ecosystem has its adherents, but we prefer to use base R where
possible.
1.6 How to Use This Book
1.6.1 Syntax and Conventions in This Book
We reproduce a lot of R code in this book. R code is indicated in a fixed-width
font like this. Since R is case-sensitive, our text will exactly match what is
typed into R – except that in the prose we capitalize letters of R objects if they
appear at the beginning of sentences. Inside a paragraph, or when we want to
show a sequence of commands, we reproduce exactly what we type, like this:
sqrt(pi). When we also want to show what R returns, the code will be shown
with the prompt and the literal R output, like this:
> sqrt(pi)
[1] 1.772454
Unlike the example in "top ten quick facts" #1, we suppress the continuation
prompt +, so that it is not confused with the ordinary plus sign.
There are several different schemes for formatting code that you can find
described on the Internet, and they do not always agree. To us the most impor-
tant rule is to make your code easy to read. This means, first, use spacing and
indenting in a helpful and consistent way, and second, add plenty of comments
to help the reader. ere is always a temptation to write code as quickly as pos-
sible, with an eye toward worrying about neatness later. Resist that temptation!
Code is for sharing and for re-use.
On a lighter note, we know that the word “data” originated as the plural of
the singular “datum,” but it has long been permitted to construe “data” in the
singular, and we do that in this book. You will find us saying “the data is...” rather
than "are." This is intentional.
1.6.2 The Chapters
In order to use R wisely, you have to understand what data looks like to R. The
following three chapters describe the sorts of data that R recognizes, and how
to manipulate R’s objects. We start by describing vectors, the simplest form of
data in R, in Chapter 2. This chapter describes the common types of vectors,
the different ways to extract subsets from them, and how to change values in
vectors. It also describes how R stores missing values, an integral part of almost
every data cleaning problem. e chapter concludes with a look at the impor-
tant table() function and some of the basic operations on vectors – sorting,
identifying duplicates, computing unions and intersections of sets, and so on.
Chapter 3 describes more complicated data structures: matrices, lists, and
finally data frames. Understanding how data frames work is critical to using R
intelligently. We defer until this chapter discussion of how R handles times and
dates, because part of that discussion requires an understanding of lists.
The final data chapter, Chapter 4, discusses the last important data type – text
or character data. Text data is stored in vectors and data frames like other
kinds of data, but there are a number of operations specific to text. This chapter
describes how to manipulate text in R – changing case, extracting and
assembling pieces of strings, formatting numbers into strings, and so on. One
important topic is regular expressions, a set of tools for finding strings that
contain a pattern of characters. This chapter also discusses the UTF-8 system
of encoding non-Roman alphabets such as Greek or Chinese, and R's concept
of factors, which are important in modeling but often cause problems during
the data cleaning process.
Chapter 5 discusses two types of tool used to automate computations in R:
functions and scripts. These different, but related, tools will be part of every
analysis you ever do, so you should understand how to construct them intelligently.
We also look briefly at "shell scripts," a special sort of script
that lets you run R in batch, rather than interactive, mode, and discuss some of
the tools available in R for debugging.
This is a book about cleaning data, but the data to be cleaned needs to
come from somewhere. Chapter 6 describes the different ways to bring
data into R: from other R sessions, from spreadsheet-like text files, from
relational databases, and so on. We describe two of the formats in which data
is commonly found in modern applications: XML and JSON. We also describe
how to acquire data programmatically from web pages.
Chapter 7 takes a bigger view of the data cleaning process. While the earlier
chapters focus on the nuts and bolts of R as they relate to data cleaning, this
chapter describes the sorts of challenges that arise in a real-life data cleaning project. We
talk about how to combine data from different sources and give examples of the
sorts of anomalies that you have to expect in dealing with real data. In almost
every case you will have to rely on judgment, rather than just on a cookbook
of techniques. We spend some time discussing the role of judgment in data
cleaning.
The Exercise
The culmination of the book is the data cleaning exercise presented in
Chapter 8. This chapter presents a complicated data acquisition and cleaning
problem that, while artificial, reflects many of the problems and challenges we
have seen over our years of real-life data handling experience. If you can find
your way through to the end of the exercise, we expect that you will be well
prepared to handle the data the real world sends your way.
Critical Data Handling Tools
In every chapter, we have set aside the final section to recap commands and
tools we think are particularly important when it comes to data handling and
manipulation. If you can master the use of these tools, and apply them wisely,
you can reduce the risk of missing important information in your data.
The Code
All of the code reproduced in this book appears in scripts in the cleaningBook
package you can download from the CRAN website. You can open these
scripts in R and run the code from there – although since most examples are
very short, we suggest that you consider typing them in yourself, to get a feel
for the R language.
2 R Data, Part 1: Vectors
The basic unit of computation in R is the vector. A vector is a set of one or
more basic objects of the same kind. (Actually, it is even possible to have a
vector with no objects in it, as we will see, and this happens sometimes.) Each
of the entries in a vector is called an element. In this chapter, we talk about
the different sorts of vectors that you can have in R. Then, we describe the
very important topic of subsetting, which is our word for extracting pieces of
vectors – all of the elements that are greater than 10, for example. That topic
goes together with assigning, or replacing, certain elements of a vector. We
describe the way missing values are handled in R; this topic arises in almost
every data cleaning problem. The rest of the chapter gives some tools that are
useful when handling vectors.
2.1 Vectors
By a "basic" object, we mean an object of one of R's so-called "atomic" classes.
These classes, which you can find in help(vector), are logical (values
TRUE or FALSE, although T and F are provided as synonyms); integer;
numeric (also called double); character, which refers to text; raw,
which can hold binary data; and complex. Some of these, such as complex,
probably won't arise in data cleaning.
2.1.1 Creating Vectors
We are mostly concerned with vectors that have been given to us as data. However,
there are a number of situations when you will need to construct your own
vectors. Of course, since a scalar is a vector of length 1, you can construct one
directly, by typing its value:
> 5
[1] 5
R displays the [1] before the answer to show you that the 5 is the first element
of the resulting vector. Here, of course, the resulting vector only had one entry,
but R displays the [1] nonetheless. There is no such thing as a "scalar" in R;
even 𝜋, represented in R by the built-in value pi, is a vector of length 1. To
combine several items into a vector, use the c() function, which combines as
many items as you need.
> c(1, 17)
[1] 1 17
> c(-1, pi, 17)
[1] -1.000000 3.141593 17.000000
> c(-1, pi, 1700000)
[1] -1.000000e+00 3.141593e+00 1.700000e+06
R has formatted the numbers in the vectors in a consistent way. In the second
example, the number of digits of pi is what determines the formatting;
see Section 1.3.3. In the third example, the same number of digits is used, but
the large number has caused R to use scientific notation. We discuss that in
Section 4.2.2. Analogous formatting rules are applied to non-numeric vectors
as well; this makes output much more readable. The c() function can also be
used to combine vectors, as long as all the vectors are of the same sort.
Another vector-creation function is rep(), which repeats a value as many
times as you need. For example, rep(3, 4) produces a vector of four 3s. In
this example, we show some more of the abilities of rep().
> rep (c(2, 4), 3) # repeat a vector
[1] 2 4 2 4 2 4
> rep (c("Yes", "No"), c(3, 1)) # repeat elements of vector
[1] "Yes" "Yes" "Yes" "No"
> rep (c("Yes", "No"), each = 8)
[1] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "No"
[10] "No" "No" "No" "No" "No" "No" "No"
The last two examples show rep() operating on a character vector. The final
one shows how R displays longer vectors – by giving the number of the first
element on each line. Here, for example, the [10] indicates that the first "No"
on the second line is the 10th element of the vector.
2.1.2 Sequences
We also very often create vectors of sets of consecutive integers. For example,
we might want the first 10 integers, so that we can get hold of the first 10 rows
in a table. For that task we can use the colon operator, :. Actually, the colon
operator doesn't have to be confined to integers; you can also use it to produce
a sequence of non-integers that are one unit apart, as in the following example,
but we haven't found that to be very useful.
> 1:5
[1] 1 2 3 4 5
> 6:-2
[1]  6  5  4  3  2  1  0 -1 -2   # Can go in reverse, by 1
> 2.3:5.9
[1] 2.3 3.3 4.3 5.3              # Permitted (but unusual)
> 3 + 2:7                        # Watch out here! This is 3 +
[1]  5  6  7  8  9 10            # (vector produced by 2:7)
> (3 + 2):7
[1] 5 6 7                        # This is 5:7
In that last pair of examples, we see that R evaluates the 2:7 operation before
adding the 3. This is because : has a higher precedence in the order of operations
than addition. The list of operators and their precedences can be found at
?Syntax, and precedence can always be overridden with parentheses, as in
the example – but this is the only example of operator precedence that is likely
to trip you up. Also notice that adding 3 to a vector adds 3 to each element of
that vector; we talk more about vector operations in Section 2.1.4.
Finally, we sometimes need to create vectors whose entries differ by a number
other than one. For that, we use seq(), a function that allows much finer
control of starting points, ending points, lengths, and step sizes.
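For example (a quick sketch of ours, not from the text), the by argument sets
the step size and length.out sets the number of elements:
> seq (2, 10, by = 2)
[1]  2  4  6  8 10
> seq (0, 1, length.out = 5)
[1] 0.00 0.25 0.50 0.75 1.00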
2.1.3 Logical Vectors
We can create logical vectors using the c() function, but most often they
are constructed by R in response to an operation on other vectors. We saw
examples of operators back in Section 1.3.2; the R operators that perform
comparisons are <, <=, >, >=, == (for "is equal to"), and != (for "not equal to").
In this example, we do some simple comparisons on a short vector.
> 101:105 >= 102 # Which elements are >= 102?
[1] FALSE TRUE TRUE TRUE TRUE
> 101:105 == 104 # Which equal (==) 104?
[1] FALSE FALSE FALSE TRUE FALSE
Of course, when you compare two floating-point numbers for equality, you
can get unexpected results. In this example, we compute 1 - 1/46 * 46,
which is zero; 1 - 1/47 * 47, and so on up through 50. We have seen this
example before!
> 1 - 1/46:50 * 46:50 == 0
[1] TRUE TRUE TRUE FALSE TRUE
We noted earlier that R provides T and F as synonyms for TRUE and FALSE.
We sometimes use these synonyms in the book. However, it is best to beware
of using these shortened forms in code. It is possible to create objects named
T or F, which might interfere with their usage as logical values. In contrast,
the full names TRUE and FALSE are reserved words in R. This means that you
cannot directly assign one of these names to an object and, therefore, that they
are never ambiguous in code.
The Number and Proportion of Elements That Meet a Criterion
One task that comes up a lot in data cleaning is to count the number (or proportion)
of elements that meet some criterion. We might want to know how many
missing values there are in a vector, for example, or the proportion of elements
that are less than 0.5. For these tasks, computing the sum() or mean() of a
logical vector is an excellent approach. In our earlier example, we might have
been interested in the number of elements that are >= 102, or the proportion that
are exactly 104.
> 101:105 >= 102
[1] FALSE TRUE TRUE TRUE TRUE
> sum (101:105 >= 102)
[1] 4 # Four elements are >= 102
> 101:105 == 104
[1] FALSE FALSE FALSE TRUE FALSE
> mean (101:105 == 104)
[1] 0.2 # 20% are == 104
It may be worth pondering this last example for a moment. We start with the
logical vector that is the result of the comparison operator. In order to apply
a mathematical function to that vector, R needs to convert the logical elements
to numeric ones. FALSE values get turned into zeros and TRUE values
into ones (we discuss conversion further in Section 2.2.3). Then, sum() adds
up those 0s and 1s, producing the total number of 1s in the converted vector
– that is, the number of TRUE values in the logical vector, or the number
of elements of the original vector that meet the criterion by being >= 102. The
mean() function computes that same sum and then divides it by the total
number of elements, and that operation produces the proportion
of TRUE values in the logical vector, that is, the proportion of elements
in the original vector that meet the criterion.
2.1.4 Vector Operations
Understanding how vectors work is crucial to using R properly and efficiently.
Arithmetic operations on vectors produce vectors, which means you very often
do not have to write an explicit loop to perform an operation on a vector. Suppose
we have a vector of six integers, and we want to perform some operations
on them. We can do this:
> 5:10
[1]5678910
> (5:10) + 4
[1]  9 10 11 12 13 14
> (5:10)^2 # Square each element;
[1] 25 36 49 64 81 100 # parentheses necessary
Just to repeat, arithmetic and most other mathematical operations operate on
vectors and return vectors. So if you want the natural logarithm of every item
in a vector named x, for example, you just enter log(x). If you want the
square of the cosine of the logarithm of every element of x, you would use
cos(log(x))^2, and so on. There are functions, such as length(), sum(),
mean(), sd(), min(), and max(), that operate on a vector and produce a
single number (which, to be sure, is also a vector in R). There are also functions
such as range(), which returns a vector containing the smallest and
largest values, and summary(), which returns a vector of summary statistics,
but one of the sources of R's power is the ability to perform computations on
every element of a vector at once.
In the last two examples above, we operated on a vector and a single number
simultaneously. R handles this in the natural way: by repeating the 4 (in the first
example) or the 2 (in the second) as many times as needed. R calls this recycling.
In the following example, we see what R does in the case of operating on two
vectors of the same length. The answer is, it performs the operation between
the first elements of each vector, then the second elements, and so on. In the
opening command, we have the usual assignment, using <-, and also an additional
set of parentheses outside that command. These additional parentheses
cause the result of the assignment to be printed. Without them, we would have
created thing1, but its value would not have been displayed.
> (thing1 <- c(20, 15, 10, 5, 0)^2)
[1] 400 225 100 25 0
> (thing2 <- 105:101)
[1] 105 104 103 102 101
> thing2 + thing1
[1] 505 329 203 127 101
> thing2 / thing1
[1] 0.2625000 0.4622222 1.0300000 4.0800000 Inf
In the last lines, R computes the ratios element by element. The final ratio,
101/0, yields the result Inf, referring to an infinite value. We discuss Inf more
in Section 2.4.4. The following example compares a function that returns a
single, summary value to one that operates element by element.
> max (thing2, thing1)
[1] 400
> pmax (thing2, thing1)
[1] 400 225 103 102 101
The max() function produces the largest value anywhere in any of its
arguments – in this case, the 400 from the first element of thing1. The
pmax() ("parallel maximum") function finds the larger of the first elements of
the two vectors, then the larger of the second elements of the two vectors, and
so on.
Two logical vectors can also be combined element by element, using the |
logical operator for "or" (i.e., returning TRUE if either element is TRUE) and
the & operator for "and" (i.e., returning TRUE only if both elements are TRUE).
These operators differ in a subtle way from their doubled versions || and &&.
The single versions evaluate the condition for every pair of elements from both
vectors, whereas the doubled versions evaluate multiple TRUE/FALSE conditions
from left to right, stopping as soon as possible. These doubled versions
are most useful in, for example, if() statements.
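A small illustration of ours: the single & compares element by element, while
&& short-circuits, which makes it useful for guarding conditions:
> x <- c(2, 5, 9)
> (x > 1) & (x < 6)        # element by element
[1]  TRUE  TRUE FALSE
> if (length (x) > 0 && min (x) > 1) cat ("all > 1\n")
all > 1
In the if() statement, && never evaluates min(x) when the length test fails,
so the test is safe even for empty vectors.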
Recycling
There can be a complication, though: what if two vectors being operated on are
not of the same length?
> 5:10 + c(0, 10, 100, 1000, 10000, 100000) # Two 6-vectors
[1] 5 16 107 1008 10009 100010 # Add by element
> 5:10 + c(1, 10, 100) # A 6-vector and a 3-vector
[1] 6 16 107 9 19 110 # The 3-vector is replicated
> 5:10 + 3:7 # A 6-vector and a 5-vector
[1] 8 10 12 14 16 13 # 5+3, 6+4, ..., 9+7, 10+3
Warning message:
In 5:10 + 3:7 :
longer object length is not a multiple of shorter length
It is important to understand these last two examples because the problem
of mismatched vector lengths arises often. In the first of the two examples, the
3-vector (1, 10, 100) was added to the first three elements of the 6-vector,
and then added again to the second three elements. Once again R is recycling.
No warning was issued because 3 is a factor of 6, so the shorter vector was
recycled an exact number of times. In the final example, the 5-vector was
added to the first five elements of the 6-vector. In order to finish the addition,
R recycled the first element of the 5-vector, the value 3. That value was added
to the last entry of the 6-vector, 10, to produce the final element of the result,
13. The recycling only used part of the 5-vector; since 5 is not a factor of 6, a
warning was issued.
Recycling a vector of length 1, as we did when we computed (5:10) + 4,
is very common. Recycling vectors of other lengths is rarer, and we suggest you
avoid it unless you are certain you know what you are doing. When you see
the longer object length... warning as we did in the last example, we
recommend you treat that as an error and get to the root of that problem.
Tools for Handling Character Vectors
Almost every data cleaning problem requires some handling of characters.
Either the data contains characters to start with – maybe names and addresses,
or dates, or fields that indicate sex, for example – or we will need to construct
some (perhaps turning sexes labeled 1 or 2 into M and F). We also often need to
search through character strings to find ones that match a particular pattern;
remove commas or currency signs that have been put into formatted numbers
(such as "$2500.00"); or discretize a numeric variable into a smaller number
of groups (such as turning an Age field into levels Child, Teen, Adult,
Senior). Character data is so important, and so common, that we have
devoted an entire chapter (Chapter 4) to special techniques for handling it.
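As a quick preview (our sketch, with made-up data and arbitrary age cut-points),
both recodings can be done in a line or two using subscripting and cut():
> sex <- c(1, 2, 2, 1)
> c("M", "F")[sex]         # use the codes as a subscript
[1] "M" "F" "F" "M"
> age <- c(8, 15, 40, 70)
> cut (age, breaks = c(0, 12, 19, 64, Inf),
       labels = c("Child", "Teen", "Adult", "Senior"))
[1] Child  Teen   Adult  Senior
Levels: Child Teen Adult Senior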
2.1.5 Names
A vector may have names, a vector of character strings that act to identify the
individual entries. It is possible to add names to a vector, and in this section we
give examples of that. More commonly, though, R adds names to a table when
you tabulate a vector using the table() function. We will have more to say
about table(), and the names it produces, in Section 2.5. In the meantime,
here is a simple example of a vector with names. Notice that the third name
has an embedded space. This name is not "syntactically valid" according to R's
rules. A syntactically valid name has only letters, numbers, dots, and underscores,
and starts either with a letter, or with a dot followed by a non-numeric character.
It is usually a bad practice to have a vector's names be invalid, but, as we show in
the following example, it is possible. See Section 3.4.2 for information on how
to ensure that your names are valid.
> vec <- c(101, 102, 103)
> names(vec)
NULL
> names(vec) <- c("a", "b", "Long name")
> names(vec)
[1] "a" "b" "Long name"
After the second line, R returned the special value NULL to indicate that the vector
had no names. (We talk more about NULL in Section 2.4.5.) The names()
function then assigned names to the elements of the vector. We can also assign
names directly in the c() function, as in this example.
> c(a = 101, b = 102, Long.name = 103)
  a   b Long.name
101 102       103
In this case, we used a syntactically valid name; an invalid one would have had
to be enclosed in quotation marks.
2.2 Data Types
The three data types we have mentioned so far – numeric, logical, and character
– are the ones we most often use. R does support several other data types. In
this section, we mention these data types briefly, and then discuss the important
topic of converting data from one type to another. Sometimes this is an opera-
tion we do explicitly and intentionally; other times R performs the conversion
automatically.
2.2.1 Some Less-Common Data Types
Integers
R can represent, as integers, values between −(2^31 − 1) and 2^31 − 1. (This number
is 2,147,483,647.) Values outside this range may be displayed as if they were
integers, but they will be stored as doubles. When doing calculations, R automatically
converts values that are too big to be integers into doubles, so the only
time integer storage will matter is if you explicitly convert a really large value
into an integer (see Section 2.2.3). If you need R to regard an item as an integer
for some reason, you can append L on its end. So, for example, 123 is numeric
but 123L is regarded as an integer value. Of course, it only makes sense to add
L to a thing that really is an integer.
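A quick check (our example) shows the distinction, and what happens at the
boundary of the integer range:
> typeof (123)
[1] "double"
> typeof (123L)
[1] "integer"
> .Machine$integer.max
[1] 2147483647
> .Machine$integer.max + 1L
[1] NA
Warning message:
In .Machine$integer.max + 1L : NAs produced by integer overflow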
Raw
"Raw" refers to data kept in binary (hexadecimal) form. This is the format that
data from images, sound, or video will take in R. We rarely need to handle that
kind of file in a data cleaning problem. However, we do sometimes resort to
using raw data when a file has unexpected characters in it, or at the beginning
of an analysis when we do not know what sort of data a file might have. In that
case, the data will be read into R and held as a vector of class raw. A raw vector
is a string of bytes represented in hexadecimal form. It can be converted into
character data (when that makes sense) with the rawToChar() function. We
talk more about reading raw data, particularly to handle the case of unexpected
characters, in Section 6.2.5.
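For instance (our sketch), charToRaw() and rawToChar() convert in both
directions:
> (r <- charToRaw ("Hi"))
[1] 48 69
> rawToChar (r)
[1] "Hi"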
Complex Numbers
R has the ability to manipulate complex numbers (numbers such as 1.3 − 2.4i,
where i is the square root of −1). Since complex numbers almost never arise in data cleaning,
we will not discuss them in this book.
2.2.2 What Type of Vector Is This?
You can usually tell what sort of vector you have by looking at a few of its entries.
Character data has entries surrounded by quotes; numeric entries have
no quotes; and logical entries are either TRUE or FALSE. So, for example,
the value "TRUE", with quotation marks, can only belong to a character vector.
There are also several functions in R that tell you explicitly what sort of
thing you have. Two of these functions, mode() and typeof(), tell you the
basic type of vector. They are essentially identical for our purposes, except that
typeof() differentiates between integer and double, whereas mode()
calls them both numeric. The str() function (for "structure") not only tells
you the type of vector but also shows you the first few entries. A related function,
class(), is a more general operator for complex types.
A second group of functions gives a TRUE/FALSE answer as to whether
a specific vector has a specific mode. These functions are named
is.logical(), is.integer(), is.numeric(), and is.character(),
and each returns a single logical value describing the type of the vector.
A more general version, is(), lets you specify the class as an argument:
so is.numeric(pi) is identical to is(pi, "numeric"). This more
general form is particularly useful when testing for more complicated, possibly
user-defined classes.
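A brief sketch of ours showing these functions side by side:
> typeof (1:3)
[1] "integer"
> mode (1:3)
[1] "numeric"
> str (1:3)
 int [1:3] 1 2 3
> is (pi, "numeric")
[1] TRUE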
2.2.3 Converting from One Type to Another
It is important to remember that a vector can contain elements of only one type.
When types are mixed – for example, if you inadvertently insert a character
element into a numeric vector – R modifies the entire vector to be of the more
complicated type. Here is an example:
> c(1, 4, 7, 2, 5) # Create numeric vector
[1] 1 4 7 2 5
> c(1, 4, 7, 2, 5, "3") # What if one element is character?
[1] "1" "4" "7" "2" "5" "3"
In this example, the entire vector got converted to character. The rule is
that R will convert every element of a vector to the "most complicated" type
of any of the elements. Logical is the least complicated type, followed by
raw, numeric, complex, and then character. (Raw vectors behave a little
differently from the others. See Section 6.2.5.)
It is important to know what values the less complicated types get when they
are converted to more complicated ones. Logical elements that are converted
into numeric become 0 where they have the value FALSE and 1 where they are
TRUE. A logical converted into a character, however, gets values "FALSE" and
"TRUE". A number gets converted into a high-accuracy text representation of
itself, as we see in these examples.
> 1/7
[1] 0.1428571 # by default, 7 digits are displayed
> c(1/7, "a")
[1] "0.142857142857143" "a"
One instance where R frequently performs conversions automatically is from
integer to numeric types.
Conversion Functions
R will convert less complicated types into more complicated ones where
required. Sometimes you need to force the elements of a vector back into a
less complicated representation. Just as there are functions whose names start
with is. for testing the type of an object, there is a set of as. functions for
converting from one type to another. The rules are these: a character will be
successfully converted to a numeric if it has the syntax of a number. It may
have leading or trailing spaces (or new-lines or tabs), but no embedded ones; it
may have leading zeros; it may have a decimal point (but only one); it may not
have embedded commas; it may have a leading minus or plus sign, and, if it is
in scientific notation, the exponent character E may be in upper- or lower-case
and may also be followed by a minus or plus sign. In this example, we show
some character strings that do and do not get converted to numbers. Notice
that the elements of the vector that do not get converted turn into missing
values (NA). We discuss missing values in Section 2.4.
> as.numeric (c(" 123.5 ", "-123e-2", "4,355", "45. 6",
"$23", "75%"))
[1] 123.50 -1.23 NA NA NA NA
Warning message:
NAs introduced by coercion
In this case, the first two elements were successfully converted. The third has a
comma, the fourth has an embedded space, and the last two have non-numeric
characters. In order to convert strings such as those into numbers, you would
have to remove the offending characters. We describe how to manipulate text
in Chapter 4.
The warning message you see here is a very common one. Unlike most
warning messages, this one will often arise naturally in the course of data
cleaning – but make sure you understand exactly where it's coming from.
The only character values that can be successfully converted into
logical are "T", "TRUE", "True", and "true" and "F", "FALSE",
"False", and "false". In this case, no extraneous spaces are permitted.
All other character values are converted into NAs.
The rule is simple for converting numeric values into logical ones.
Numeric values that are zero become FALSE; all other numbers become TRUE.
The only issue is that sometimes numbers you expect to be zero aren't quite
zero, because of floating-point error. In this example, we convert some numbers and
expressions to logical.
> as.logical (c(123, 5 - 5, 1e-300, 1e-400, 1 - 1/49 * 49))
[1] TRUE FALSE TRUE FALSE TRUE
The first element here is clearly non-zero, so it gets converted to TRUE. The
second evaluates to exactly zero and produces FALSE. The third is non-zero,
but the fourth counts as zero since it is outside the range of double precision
(see Section 1.3.3). The last element is our running example of an expression
that "should" be zero but is not (again, see Section 1.3.3). Since it is not zero,
it gets converted to TRUE. Numeric, non-missing values never produce NA
when converted to logical.
2.3 Subsets of Vectors
We very often need to pull out just a piece of a vector. This is called subsetting
or extracting. In most cases, where we extract a subset, we can use a similar
expression to replace (or assign) new values to a subset of the elements in a
vector. Knowing how to do this is crucial to data cleaning in R; you cannot
work efficiently in R without understanding this material.
2.3.1 Extracting
We constantly perform this operation in one form or another when cleaning
data: we look at subsets of rows or columns, we examine a vector for anomalous
entries, we extract all the elements of one vector for which another has a
specific value, and so on. There are three methods by which we can extract a
subset of a vector. First, we can use a numeric vector to specify which elements
to extract. This numeric vector is an example of a "subscript" and its entries
are called "indices." Second, we can use a logical subscript; and, third, we can
extract elements using their names.
Numeric Subscripts
The most basic way to extract a piece of a vector is to use a numeric subscript
inside square brackets. For example, if you have a vector named a, the command
a[1] will extract the first element of a. The result of that command is a
vector of length 1, of the same mode as the original a. The command a[2:5]
will produce a vector of length 4, with the second through fifth elements of
a. If you ask for elements that aren't there – if, for example, a only had three
elements – then R will fill up the missing spots with missing (NA) values. We
discuss those further in Section 2.4. In this example, we have a vector a containing
the numbers from 101 to 105.
> (a <- 101:105)
[1] 101 102 103 104 105
> a[3]
[1] 103
It's possible to pull out elements in any order, just by preparing the subscript
properly. You can even use a numeric expression to compute your subscript,
but only do this if you're sure your expression is an integer. If the result of your
expression isn't an integer, even if it misses by just a tiny bit, you will get
something you might not expect.
> a[c(4, 2)]
[1] 104 102
> a[1+1] # A simple expression; this works
[1] 102
> a[2.999999999999999] # This is truncated to 2, but...
[1] 102
> a[2.9999999999999999] # exactly 3 in double-precision.
[1] 103
> a[49 * (1/49)] # This index gets truncated to zero;
integer(0)              # R produces a vector of length zero
There are two kinds of special values in numeric subscripts: negative values
and zeros. Negative values tell R to omit those values, instead of extracting
them – so a[-1], for example, returns everything except the first element of a.
You can have more than one negative number in your subscript, but you cannot
mix positive and negative numbers, and that makes sense. (For example, in the
expression a[c(-1, 3)], should the second element be returned or not?)
Zeros are another special value in a subscript. They are simply ignored by
R. Zeros appear primarily as a result of the match() function; you will rarely
use them intentionally yourself. Knowing that zeros are permitted helps make
sense of the error message in the following example, though.
> a[-2] # Omit element 2
[1] 101 103 104 105
> a[c(-1, 3)] # Illegal
Error in a[c(-1, 3)] : only 0's may be mixed
with negative subscripts
> a[-1:2] # Illegal, because -1:2 evaluates to -1, 0, 1, 2
Error in a[-1:2] : only 0's may be mixed
with negative subscripts
> a[-(1:2)] # -(1:2) is (-1, -2): omit elements 1 and 2.
[1] 103 104 105
Logical Subscripts
Logical subscripts are also very powerful. A logical subscript is a logical vector
of the same length as the thing being extracted from. Values in the original
vector that line up with TRUE elements of the subscript are returned; those that
line up with FALSE are not.
We almost never construct the logical subscript directly, using c(). Instead
it is almost always the result of a comparison operation. In this example, we
start with a vector of people's ages, and extract just the ones that are >60.
> age <- c(53, 26, 81, 18, 63, 34)
> age > 60
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> age[age > 60]
[1] 81 63
The age > 60 vector has one entry for each element of age, so it is easy to
use that to extract the numeric values of age, which are >60. But the power
of logical subscripting goes well beyond that. Imagine that we also knew the
names of each of the people. Here we show how to extract the names just for
the people whose ages are >60.
> people <- c("Ahmed", "Mary", "Lee", "Alex", "John", "Viv")
> age > 60 # Just as a reminder
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> people[age > 60] # Return name where (age > 60) is TRUE
[1] "Lee" "John"
This particular manipulation – extracting a subset of one vector based on
values in another – is something we do in every data cleaning problem. It is
important to be sure that you know exactly how it works.
One case where results might be unexpected is when you inadvertently cause
a logical subscript to be converted to a numeric one. In the example above,
suppose we had saved the logical vector as a new R object called age.gt.60.
In the following example, we show what happens if R is allowed to convert that
logical vector into a numeric one.
> age.gt.60 <- age > 60
> people[age.gt.60]
[1] "Lee" "John" # as expected
> people[0 + age.gt.60]
[1] "Ahmed" "Ahmed"
> people[-age.gt.60]
[1] "Mary" "Lee" "Alex" "John" "Viv"
In the 0 + age.gt.60 example, R has to convert the logical subscript to
numeric in order to perform the addition. After the addition, then, the subscript
has the values 0 0 1 0 1 0, and the extraction produces the first element of
the vector two times, ignoring the zeros. In the following example, the negative
sign once again causes R to convert the logical subscript to numeric; after the
application of the sign operator the subscript has the values 0 0 -1 0 -1 0.
The extraction drops the first element (because of the -1 values) and the rest are
returned. This is a mistake we sometimes make with a logical subscript – in this
example, we probably intended to enter people[!age.gt.60], with the !
operator, in order to return people whose ages are not greater than 60.
When using a logical subscript, it is possible for the two vectors – the data
and the subscript – to be of different lengths. In that case R recycles the shorter
one, as described in Section 2.1.4. This might be useful if, say, you wanted to
keep every third element of your original vector, but in general we recommend
that your logical subscript be the same length as the original vector.
The which() function can be used to convert a logical vector into a numeric
one. It returns the indices (i.e., the position numbers) of the elements that are
TRUE. So this is particularly useful when trying to find one or two anomalous
entries in a long vector of logical values. To find the locations of the minimum
value in a vector y, you can use which(y == min(y)), but the act of finding
the index specifically of the minimum or maximum value is so common
that there are dedicated functions, called which.min() and which.max(),
for this task. There is one difference, though: these two functions break ties by
selecting the first index for which y is at its maximum or minimum, whereas
which() returns all the matching indices.
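A brief sketch (our data) of the tie-breaking difference:
> y <- c(3, 1, 4, 1)
> which (y == min (y))   # all positions tied for the minimum
[1] 2 4
> which.min (y)          # only the first of the ties
[1] 2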
Using Names
The third kind of subscripting is to use a vector's names. Since names are
characters, a name subscript will need to be a character as well. Here is a named
vector, together with an example of subscripting by name.
> (vec <- c(a = 12, b = 34, c = -1))
 a  b  c
12 34 -1
> vec["b"]
 b
34
> vec[names (vec) != "a"]
 b  c
34 -1
To show all the values except the one named a, it is tempting to try something
like vec[-"a"]. However, R tries to compute the value of "negative a," fails,
and produces an error. The final example above shows one way to exclude the
element with a particular name from being extracted.
Named vectors are not uncommon, but they do not come up very often in
data cleaning. The real use of names will become clearer in Chapter 3, where
we will encounter rectangular structures that have row names, column names,
or, very often, both.
2.3.2 Vectors of Length 0
Any of these extraction methods can produce a vector of length 0, if no element
meets the criterion. This happens particularly often when all of the elements
of a logical subscript are FALSE. A vector of length 0 is displayed as
integer(0), numeric(0), character(0), or logical(0). In this example,
we show how such a vector might arise.
> (b <- c(101, 102, 103, 104))
[1] 101 102 103 104
> a <- b[b < 99] # Reasonable, but no elements of b are < 99
> a
numeric(0)       # a has length 0
k
k k
k
R Data, Part 1: Vectors 35
A zero-length vector cannot be used intelligently in arithmetic, and watch
out: the sum() of a numeric or logical vector of length 0 is itself zero. If a
zero-length vector is used as the condition in an if() statement, an error
results. This is an error that arises in data cleaning, as in this example:
> sum (a)
[1] 0 # Possibly unexpected
> sum (a + 12345)
[1] 0 # Definitely unexpected
> if (a < 2) cat ("yes\n")
Error in if (a < 2) cat("yes\n") : argument is length zero
In the last example, we made use of the cat() function, which writes its
arguments out to the screen, or, as R calls it, the console. The \n represents the
new-line, to return the cursor to the left of the screen. When writing functions
to do data cleaning (Chapter 5), we will need to check that the conditions
being tested are not vectors of length 0.
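One common defensive idiom (our sketch) checks the length first, relying on
the short-circuiting && described in Section 2.1.4:
> a <- numeric (0)
> if (length (a) > 0 && a[1] < 2) cat ("yes\n") # no error; prints nothing
Because length(a) > 0 is FALSE, the a[1] < 2 comparison is never
evaluated.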
2.3.3 Assigning or Replacing Elements of a Vector
Every operation that extracts some values can also be used to replace those values,
simply using the extraction operation on the left side of an assignment. Of
course, R will require that the resulting vector have all its entries of the same
type. So, for example, a[2] <- 3 will replace the second entry of a with the
value 3. If a is logical, this operation will force it to be numeric; if a is character,
the second entry of a will be assigned the character value "3". Just as we can
extract using logical subscripts or names, we can use those subscripting techniques
for assignment as well. These examples show replacement with numeric
and logical subscripts.
> (a <- c(101, 102, -99, 104, -99, 9106)) # last item should
[1] 101 102 -99 104 -99 9106 # have been 106
> a[6] <- 106 # numeric subscript
> a
[1] 101 102 -99 104 -99 106
> a[a < 0] <- 9999 # logical subscript
> a
[1] 101 102 9999 104 9999 106
As we mentioned, a logical subscript will almost always have the same length as
the data vector on which it is operating. In the preceding example, the logical
subscript a < 0 has the same length as a itself.
These examples show how names can be used to assign new values to the
elements of a vector.
> b <- c("A", "missing", "C", "D")
> names (b) <- c("Red", "White", "Blue", "Green")
> b
Red     White Blue Green
"A" "missing"  "C"   "D"
> b["White"] <- "B" # name subscript
> b
Red White Blue Green
"A"   "B"  "C"   "D"
It is also possible to assign to elements of a vector out past its end. This is one
way to combine two vectors. Elements that are not assigned will be given the
special NA value (see the following section). Another way to combine two vectors
is with the c() command, but either way, if two vectors of different types
are combined, R will need to convert them to the same type. In this example,
we combine two vectors.
> a <- 101:103
> b <- c(7, 2, 1, 15)
> c(a, b) # Combine two vectors
[1] 101 102 103 7 2 1 15
> a # Unchanged; no assignment made
[1] 101 102 103
> a[4:7] <- b # index non-existent values
> a
[1] 101 102 103   7   2   1  15
> b
[1]  7  2  1 15
> b[6] <- 22     # index non-existent value
> b
[1]  7  2  1 15 NA 22    # b[5] filled in with NA
In the last example, b[6] was assigned, but no instruction was given about
what to do with the newly created fifth element of b. R filled it in with the special
missing value code, NA. The following section describes how NA values operate
in R.
2.4 Missing Data (NA) and Other Special Values
In R, missing values are identified by NA (or, under some circumstances, by
<NA>; see Sections 2.5 and 4.6). This is a special code; it is not the two capital
letters N and A put together. Missing values are inevitable in real data, so it is
important to know the effect they have on computations, and to have tools to
identify them and replace them where necessary. In this section, we discuss NA
values in vectors; subsequent chapters expand the discussion to describe the
effect of NA values in other sorts of R objects.
Missing values arise in several ways. First, sometimes data is just missing – it
would make sense for an observation to be present, but in fact it was lost
or never recorded. Second, some observations are inherently missing. For
example, a field named MortPayRat might contain the ratio of a customer's
monthly home mortgage payment to her monthly income. Customers with
no mortgage at all would presumably have no value for this field. An NA value
would make more sense than a zero, which would suggest a mortgage payment
of zero. Third, as we saw in the last section, missing values appear when we
try to extract an item that was never present in a vector. For example, the
built-in item letters is a vector containing the 26 lower-case letters of the
English alphabet. The expression letters[27] will return an NA. Finally, we
sometimes see other special values Inf or -Inf or NaN in response to certain
computations, like trying to divide by zero. Those special values can often be
treated as if they were NA values. We discuss these and a final special value,
NULL, in this section.
Since all the elements of a vector must be of the same kind, there are
actually several different kinds of NA. An NA in a logical vector is a little
different from an NA in a numeric or character one. (There are actually objects
named NA_real_, NA_integer_, and NA_character_, which make this
explicit.) Normally, the difference will not matter, but there is one case where
knowing about the types of NA can explain some behavior that both arises
fairly often and also seems mysterious. We mention this in Section 2.4.3.
2.4.1 The Effect of NAs in Expressions
A general, if imprecise, rule about NA values is that any computation with an
NA itself becomes an NA. If you add several numbers, one of which is an NA,
the sum becomes NA. If you try to compute the range of a numeric with missing
values, both the minimum and maximum are computed as NA. This makes
sense when you think of an NA as an unknown that could take on any value.
Basic mathematical computations for numeric vectors all allow you to specify
the na.rm = TRUE argument, to compute the result after omitting missing
values.
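For example (our sketch):
> sum (c(1, NA, 3))
[1] NA
> sum (c(1, NA, 3), na.rm = TRUE)
[1] 4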
2.4.2 Identifying and Removing or Replacing NAs
In every data cleaning problem we need to determine whether there are NA
values. What you cannot do to identify missing values is to compare them
directly to the value NA. Just as adding an NA to something produces an NA,
comparing an NA to something produces NA. So if a variable thing has value
3, the expression thing == NA produces NA, and if thing has value NA, the
expression thing == NA also produces NA. To determine whether any of
your values are missing, use the anyNA() function. This operates on a vector
and returns a logical, which is TRUE if any value in the vector is NA. More
useful, perhaps, is the is.na() function: if we have a vector named vec, a
call to is.na(vec) returns a vector of logicals, one for each element in vec,
giving TRUE for the elements that are NA and FALSE for those that are not.
We can also use which(is.na(vec)) to find the numeric indices of the
missing elements. Here, we show an example of a vector with NA values and
some examples of what operations can, and cannot, be performed on them.
> (nax <- c(101, 102, NA, 104))
[1] 101 102 NA 104
> nax * 2 # Arithmetic on NAs gives NAs...
[1] 202 204 NA 208
> nax >= 102 # ...as do comparisons
[1] FALSE TRUE NA TRUE
> mean (nax) # One NA affects the computation
[1] NA
> mean (nax, na.rm = TRUE) # na.rm = TRUE excludes NAs
[1] 102.3333
> is.na (nax) # Locate NAs with logical vector
[1] FALSE FALSE TRUE FALSE
> which (is.na (nax)) # Numeric indices of NAs
[1] 3
When your data has NA or other special values, you are faced with a decision
about how to handle them. Generally they can be left alone, replaced, or
removed. Removing missing values from a single vector is easy enough; the
command vec[!is.na(vec)] will return the set of non-missing entries in
vec. A more sophisticated alternative is the na.omit() function, which not
only deletes the missing values but also keeps track of where in the vector they
used to be. This information is stored in the vector's "attributes," which are extra
pieces of information attached to some R objects.
> nax[!is.na (nax)] # Return the non-missing values
[1] 101 102 104
> (nay <- na.omit (nax)) # This keeps track of deleted ones
[1] 101 102 104
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"
Data cleaners will very often want to record information about the original
location of discarded entries. In this example, these can be extracted with a
command like attr(nay, "na.action").
Things get more complicated when the vector is one of many that need to
be treated in parallel, perhaps because the vector is part of a more complicated
structure like a matrix or data frame. Often if an entry is to be deleted, it
needs to be deleted from all of these parallel items simultaneously. We talk more
about these structures, and how to handle missing values in them, in Chapter
3. (We also note that most modeling functions in R have an argument called
na.action that describes how that function should handle any NA values it
encounters. This is outside our focus on data cleaning.)
2.4.3 Indexing with NAs
When an NA appears in an index, NA is produced, but the actual effect that R
produces can be surprising. This arises often in data cleaning, since it is common
to have a vector (usually fairly long and as part of a larger data set) with
many NAs that you may not be aware of. Suppose we have a vector of data b
and another vector of indices a, and we want to extract the set of elements of
b for which a has the value 1, like this: b[a == 1]. The comparison a == 1
will return NA wherever a is missing, and b[NA] produces NA values. So the
result is a vector with both the entries of b for which a == 1 and also one NA
for every missing value in a. This is almost never what we want. If we want to
extract the values of b for which a is both not missing and also equal to 1, we
have to use the slightly clunky expression b[!is.na(a) & a == 1]. This
example shows what this might look like in practice.
> (b <- c(101, 102, 103, 104))
[1] 101 102 103 104
> (a <- c(1, 2, NA, 4))
[1] 1 2 NA 4
> b[!is.na (a) & a == 2] # We probably want this...
[1] 102
> b[a == 2] # ...and not this.
[1] 102 NA
In the following example, we show how two commands that look alike are
treated slightly differently by R.
> b[a[2]] # a[2] = 2; extract element 2 of b
[1] 102 # ... which is 102
> b[a[3]] # a[3] is NA
[1] NA
> (a <- as.logical (a)) # Now convert a to logical
[1] TRUE TRUE NA TRUE
> b[a[3]] # a[3] is NA
[1] NA NA NA NA
In the first example of b[a[3]], the value in a[3] was a numeric NA, so R
treated the subscripting operation as a numeric one. It returned only one value.
In the second example, a[3] was a logical NA, and when R subscripts with a
logical – even when that logical value is NA – it recycles the subscript to have
the same length as the vector being indexed (we saw this in Section 2.1.4).
The lesson here is that when you have an NA in a subscript, R may return
something other than what you expect.
2.4.4 NaN and Inf Values
A different kind of special value can arise when a computation is so big that
it overflows the ability of the computer to express the result. Such a value is
expressed in R as Inf or -Inf. On 64-bit machines Inf is a bit bigger than
1.79 × 10^308; it most often appears when a positive number is accidentally
divided by zero. Inf values are not missing, and is.na(Inf) produces
FALSE. Another special value is NaN, "not a number," which is the result of
certain specific computations such as 0/0 or Inf + -Inf or computing the
mean of a vector of length 0. Unlike Inf, an NaN value is considered to be
missing. As with NA values, Inf and NaN values take over every computation
in which they are evaluated. There are rules for when more than one is
present – for example, Inf + NA gives NA, but NaN + NA gives NaN. From a
data cleaning perspective, all of these values cause trouble and you will generally
want to identify any of these values early on. The function is.finite()
is useful here; this produces TRUE for numbers that are neither NA nor NaN
nor Inf nor -Inf. So in that sense it serves as a check on valid values. To see
whether every element of a numeric vector vec consists of values that are not
any of these special ones, use the command all(is.finite(vec)).
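A short sketch of ours:
> vec <- c(1, NA, NaN, Inf, -Inf)
> is.finite (vec)
[1]  TRUE FALSE FALSE FALSE FALSE
> all (is.finite (vec))
[1] FALSE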
2.4.5 NULL Values
A final sort of special value is the R value NULL. A NULL is an object with zero
length, no contents and no class. (A vector of length 0 has no contents, but
since it has a class – numeric, logical, or something else – it is not NULL.) In
data cleaning, NULLs most often arise when attempting to access an element
of a list, or a column of a data frame, which does not exist. We discuss this in
Section 3.4.3. For the moment, the important point is that we can test for NULL
values with the is.null() function, and that if you index using a NULL value
the result will be a vector of length 0.
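A small sketch (anticipating the list syntax of Chapter 3):
> lst <- list (a = 1)
> lst$b                # a non-existent element
NULL
> is.null (lst$b)
[1] TRUE
> (101:105)[NULL]      # indexing with NULL
integer(0)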
2.5 The table() Function
The table() function is so important in data cleaning that it merits its own
section. This command, as its name suggests, produces a table giving, for each
of the unique values in its argument, the number of times that value appears.
In this example, we will create a vector with some color names in it, and we will
add in an NA as well.
> vec <- rep (c("red", "blue", NA, "green"), c(3, 2, 1, 4))
> vec
[1] "red" "red" "red" "blue" "blue" NA
[7] "green" "green" "green" "green"
> table (vec)
vec
 blue green   red
    2     4     3
There are a couple of things to notice here. First, the ordering of the results in
the table is alphabetical, rather than being determined by the order the entries
appear in the vector vec. Second, the resulting object is not quite a named
vector, as you can see by the word vec that appears above the word blue.
(We omit this line in many future displays to save space.) In fact, this object
has class table, but it can be treated like a named vector – so, for example,
table(vec)["green"] produces 4. Third, by default table omits NA as
well as NaN values. In data cleaning this is almost never what we want. There
are two different arguments to the table() function that serve to declare how
you want missing values to be treated. The first of these is named useNA. This
argument takes the character values "no" (meaning exclude NA values, which
was the default as seen earlier), "ifany" (meaning to show an entry for NAs
if there are any, but not if there aren't) and "always", meaning to show an
entry for NAs whether there are any NA values or not. In our current example,
where there is one NA, the table() command with useNA set to "ifany"
or "always" will produce output like this:
> table (vec, useNA = "always")
 blue green   red  <NA>
    2     4     3     1
Notice that R displays the entry for NA values as <NA>, with angle brackets.
This makes it easier to use the characters "NA" as a regular character string,
perhaps for "North America" or possibly "sodium." (This angle bracket usage
will appear again later.) R will not be confused if you have both NA values and
also actual character strings with the angle brackets, such as "<NA>", but it
is definitely a bad practice. To see what happens when there are no NAs, let us
look at the same vector without its missing entry, which is number 6.
> table (vec[-6], useNA="ifany")
 blue green   red
    2     4     3
> table (vec[-6], useNA="always")
 blue green   red  <NA>
    2     4     3     0
For data cleaning purposes, we almost always want to know about missing
values, so we will almost always want useNA to be "ifany" or "always".
The second missing-value argument, exclude, allows you to exclude specific
values from the table. By default, exclude has the value c(NA, NaN),
which is why those values do not appear in tables. Most commonly we set
this value to NULL to signify that no entries should be excluded, although
sometimes we exclude certain very common values. Here we might want to
exclude the common value green while tabulating all other values, including
NAs. The following example shows how we can do that. It also shows the use
of exclude = NULL.
> table (vec, exclude="green")
 blue   red  <NA>
    2     3     1
> table (vec, exclude=NULL)
 blue green   red  <NA>
    2     4     3     1
It is possible to supply both useNA and exclude at the same time, but the
results may not be what you expect. We recommend using either useNA or
exclude to display missing values in every table.
2.5.1 Two- and Higher-Way Tables
If we give two vectors of the same length to the table() function, the result
is a two-way table, also called a cross-tabulation. For example, suppose we had
a vector called years, one for each transaction in our data set, with values
2015, 2016, and 2017; and suppose we also had a vector called months,
of the same length, with values such as "Jan", "Feb", and so on. Then
table(years, months) would produce a 3 × 12 table of counts, with
each cell in the table telling how many entries in the two vectors had the values
for the cell. That is, the top-left cell would give the number of entries from
January 2015; the one to the right of that would give the number of entries for
February 2015; and so on. (If there are fewer than 12 months represented in
the data, of course, there will be fewer than 12 columns in the table.) This is an
important data cleaning task – to determine whether two variables are related
in ways we expect. If, for example, we saw no transactions at all in March 2016,
we would want to know why.
In R, a two-way table is treated the same as a matrix; we discuss matrices
in detail in the following chapter. For very large vectors, the data.table()
function in the data.table package (Dowle et al., 2015) may prove more
efficient than table(). Three- and higher-way tables are produced when the
arguments to table() are three or more equal-length vectors. These tables
are treated in R as arrays; we give an example in Section 3.2.7. The xtabs()
function is also useful for creating more complex tables.
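A toy sketch of ours (much smaller than the 3 × 12 table just described):
> years  <- c(2015, 2015, 2016, 2016, 2016, 2017)
> months <- c("Jan", "Feb", "Jan", "Jan", "Mar", "Feb")
> table (years, months)
     months
years Feb Jan Mar
 2015   1   1   0
 2016   0   2   1
 2017   1   0   0
> xtabs (~ years + months)   # formula interface; same counts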
2.5.2 Operating on Elements of a Table
The table() command counts the number of observations that fall into a particular
category. In the example above, the table(years, months) command
produces a two-way table of counts. Often we want to know more than
just how many observations fall into a cell. R has several special-purpose functions
that operate on tables. The prop.table() function takes, as its first
argument, the output from a call to table(), and depending on its second
argument produces proportions of the total counts in the table by cell, or by
row, or by column. In this example we set up three vectors, each of length 15.
Then we show the effect of calling table(), and of calling prop.table() on
the result. By default, prop.table() computes the proportions of observations
in each cell of the table. In the final example, we use the second argument
of 2 to compute the proportions within each column; supplying 1 would have
produced the proportions within each row.
> yr <- rep (2015:2017, each=5)
> market <- c("a", "a", "b", "a", "b", "b", "b", "a", "b",
"b", "a", "b", "a", "b", "a")
> cost <- c(64, 87, 71, 79, 79, 91, 86, 92, NA,
55, 37, 41, 60, 66, 82)
> (tab <- table (market, yr))
       yr
market 2015 2016 2017
     a    3    1    3
     b    2    4    2
> prop.table (tab) # These proportions sum to 1
yr
market 2015 2016 2017
a 0.20000000 0.06666667 0.20000000
b 0.13333333 0.26666667 0.13333333
> prop.table (tab, 2) # Each column's proportions sum to 1
yr
market 2015 2016 2017
a 0.6 0.2 0.6
b 0.4 0.8 0.4
The margin.table() command produces the marginal totals from a
table – that is, row or column totals (controlled by the second argument) for a
two-way table, and corresponding sums for a higher-way one. The
addmargins() function incorporates those totals into the table, producing a new
row or column named Sum (or both). This is often a summary statistic we want,
but watch out – the convention regarding the second argument of
addmargins() is not the same as that of prop.table() and margin.table().
This example shows addmargins() in action.
> addmargins (tab) # append row and column sums
       yr
market 2015 2016 2017 Sum
     a    3    1    3   7
     b    2    4    2   8
   Sum    5    5    5  15
> addmargins (tab, 2) # append column sums
       yr
market 2015 2016 2017 Sum
     a    3    1    3   7
     b    2    4    2   8
We might also want to know the average, standard deviation, or maximum of
entries in a numeric variable, broken down by which cell they fall into. In our
example, we might want the maximum cost among the three observations
from 2015 with market a, and for the two from 2015 and market b, and so
on. For this purpose we use the tapply() function, whose name reminds us
that it applies a function to a table. This function's arguments are the vector
on which to do the computation (in our example, cost), an argument named
INDEX describing the grouping (here, we might use the vector yr), and then
the function to be applied. The following example shows tapply() at work.
In the first line, we use the min() function to produce the minimum value for
each year – but an NA is produced for 2016 since one cost for that year is
NA. We can pass the na.rm = TRUE argument into tapply(), which then
passes it into min() as in the following example, if we want to compute the
minimum value among non-missing entries.
> tapply (cost, yr, min) # find minimum within each yr
2015 2016 2017
  64   NA   37
> tapply (cost, yr, min, na.rm = TRUE)
2015 2016 2017
  64   55   37
It is possible to extend this example to the two-way case of minimum cost,
or another statistic, by both market and year. Here the tabularization part, represented by the argument INDEX, needs to be a list. We discuss lists starting
in Section 3.3; for the moment, just know that a list is required when grouping with more than one vector. In the first example as follows, we compute
the mean of the cost values for each combination of market and year (using
na.rm = TRUE as above, and the list() function to construct the list). In
the second example, we show how we can supply our own function "in line,"
which makes it more transparent than if we had written a separate function.
The details of writing functions are covered in Chapter 5, but here our function
takes one argument, named x, and returns the value given by the sum of the
squares of the entries of x. (In this example, we pass the na.rm = TRUE argument directly to sum to keep our function simpler.) The tapply() function
is in charge of calling our function six times, once for each cell of the table.
> tapply (cost, list (market, yr), mean, na.rm = TRUE)
      2015     2016     2017
a 76.66667 92.00000 59.66667
b 75.00000 77.33333 53.50000
> tapply (cost, list (market, yr),
          function (x) sum (x^2, na.rm = TRUE))
   2015  2016  2017
a 17906  8464 11693
b 11282 18702  6037
2.6 Other Actions on Vectors
In this section, we describe additional actions on vectors that we find particularly important for data cleaning. These include rounding numeric values,
sorting, set operations, and the important topics of identifying duplicates and
matching.
2.6.1 Rounding
R operates on numeric vectors using double-precision arithmetic, which means
that often there are more significant digits available than are useful. Results will
often need to be displayed with, say, two or three significant digits. The natural
way to prepare displays like this is through formatting the numbers – that
is, changing the way they display, but not their actual values. We discuss
formatting in Section 4.2. But sometimes we want to change the numbers
themselves, perhaps to force them to be integers or to have only a few significant digits. The round() function and its relatives do this. Round() lets the
user specify the number of digits to the right of the decimal place to be saved;
the signif() function lets him or her specify the total number of significant
digits retained. So round(123.4567, 3) produces 123.457, while
signif(123.4567, 3) produces 123. A negative second argument produces rounding to the nearest power of 10, so round(123.4567, -1) rounds
to the nearest 10 and produces 120, while round(123.4567, -2)
rounds to the nearest 100 and produces 100. The trunc() function discards
the part after the decimal point and produces an integer; floor() and ceiling()
round to the next lower and next higher integer, respectively, so floor(-3.4)
is -4 while trunc(-3.4) is -3. Rounding of problematic entries (like those
that end in a 5) can be affected by floating-point error (see Section 1.3.3).
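For example, here are the commands just described at the console:
> round (123.4567, 3); signif (123.4567, 3)
[1] 123.457
[1] 123
> round (123.4567, -1); round (123.4567, -2)
[1] 120
[1] 100
> trunc (-3.4); floor (-3.4); ceiling (-3.4)
[1] -3
[1] -4
[1] -3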
2.6.2 Sorting and Ordering
It is common to have to sort the elements of a vector, and the sort() function
performs that task in R. By default, the sort is from smallest to largest, but the
decreasing = TRUE argument will reverse the order. There are two minor
complications. First, sort() will drop NA and NaN values by default, giving a
vector shorter than the original when these values are present. This behavior
is controlled by the na.last argument, which itself defaults to NA. If set to
TRUE, this argument will have the sort() function place NA and NaN values
at the end, and, if FALSE, at the beginning of the sorted output.
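This example shows the default behavior and the effect of na.last = TRUE:
> sort (c(3, NA, 1, 2))
[1] 1 2 3
> sort (c(3, NA, 1, 2), na.last = TRUE)
[1]  1  2  3 NA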
A second complication is in sorting character vectors. Sorting in this case is
alphabetical, of course, so if the characters are text representations of numbers
such as "1", "2", "5", "10", and "18", the resulting output, sorted alphabetically, will be "1", "10", "18", "2", and "5". Moreover, the sorting order
depends on the character set and locality being used. We mentioned this in
Section 1.4.6 and address it further in Section 4.5.
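For example, in the authors' locale:
> sort (c("1", "2", "5", "10", "18"))
[1] "1"  "10" "18" "2"  "5"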
The related order() function returns a set of indices that you can use to sort
a vector. This is useful when you want to re-arrange one vector's values in the
order specified by a second vector. (If that sounds as if it wouldn't be a common
task, wait until Section 3.5.4.) In this example, we have a vector of names, and
a vector of scores, and we want the names in ascending order of score.
> nm <- c("Freehan", "Cash", "Horton",
"Stanley", "Northrop", "Kaline")
> scores <- c(263, 263, 285, 259, 264, 287) # 2 tied at 263
> nm[order(scores)] # ascending order of score
[1] "Stanley" "Freehan" "Cash"
[4] "Northrop" "Horton" "Kaline"
> nm[order(scores, nm)] # tie broken by nm
[1] "Stanley" "Cash" "Freehan" # (alphabetically)
[4] "Northrop" "Horton" "Kaline"
> nm[order (scores, decreasing = TRUE)] # descending
[1] "Kaline" "Horton" "Northrop"
[4] "Freehan" "Cash" "Stanley"
As in the example, the order() function can be given more than one vector.
In this case, the second vector is used to break ties in the first; if a third vector
were supplied, it would be used to break any remaining ties, and so on. It is
very common to re-order a set of data that has time indicators (month and year,
maybe) from oldest to newest. The order() function has the same na.last
argument that sort() has, although its default value is TRUE.
2.6.3 Vectors as Sets
Often we need to find the extent to which two vectors have values that overlap.
For example, we might have customer data from two sources and we want
to determine the extent to which the customer IDs agree; or we might want
to find the set of states in which none of our customers reside. These call for
techniques that treat vectors as sets and that will normally be most useful
when the data is a small number of integers, character data, or factors, about
which we say more in Section 4.6. They can be used with non-integer data
as well, but as always we cannot rely on two floating-point numbers that we
expect to be equal actually being equal.
The essential set membership operation is performed by the %in% function.
R has a few functions with names like this, surrounded by percentage signs.
This allows us to use a command like a %in% b, rather than the equivalent,
but perhaps less transparent, is.element(a, b). The return value is a vector the same length as a, with a logical indicating whether each element of a
is found anywhere in b. In data cleaning we very often tabulate the result of
this function call; so a command like table(a %in% b) produces a table of
FALSE and TRUE, giving the number of items in a that were not found in b,
and the number that were. For this purpose, an NA value in a matches only an
NA in b, and similarly an NaN value in a matches only an NaN value in b. In
this example, we compare some alphanumeric characters to the built-in data
set letters containing the 26 lower-case letters of the alphabet.
> c("g", "5", "b", "J", "!") %in% letters
[1] TRUE FALSE TRUE FALSE FALSE
> table (c("g", "5", "b", "J", "!") %in% letters)
FALSE  TRUE
    3     2
The union(), intersect(), and setdiff() functions produce the
union, intersection, and difference between two sets. This example shows
those functions in action.
> union (c("g", "5", "b", "J", "!"),
letters) # elements in either vector
[1] "g" "5" "b" "J" "!" "a" "c" "d" "e" "f" "h" "i" "j"
[14] "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w"
[27] "x" "y" "z"
> intersect (c("g", "5", "b", "J", "!"),
letters) # elements in both vectors
[1] "g" "b"
> setdiff (c("g", "5", "b", "J", "!"),
letters) # elements of a not in b
[1] "5" "J" "!"
2.6.4 Identifying Duplicates and Matching
Another data cleaning task is to find duplicates in vectors. The anyDuplicated() function tells you whether any of the elements of a vector are
duplicates. The unique() function extracts only the set of distinct values
(including, by default, NA and NaN). The distinct values appear in the output
in the order in which they appear in the input; for data cleaning purposes we
will often sort those unique values.
Often it will be important to know which elements are duplicates. The
duplicated() function returns a logical vector with the value TRUE for
the second and subsequent entries in a set of duplicates. However, the first
entry in a set of duplicates is not indicated. For example, duplicated
(c(1, 2, 1, 1)) returns FALSE FALSE TRUE TRUE; the first 1 is not
considered duplicated under this definition. (Alternatively, the fromLast
= TRUE argument reads from the end of the vector back to the beginning, but again the "first" member of a set of duplicates is not indicated.)
Combining a call with fromLast = FALSE and one with fromLast
= TRUE, using the element-wise "or" operator |, identifies all duplicates.
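For example, with the vector used above:
> x <- c(1, 2, 1, 1)
> duplicated (x) | duplicated (x, fromLast = TRUE)
[1]  TRUE FALSE  TRUE  TRUE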
A common task is to find all the entries that are duplicated anywhere in the
data set (or that are never duplicated). One way to do this is via table(). Any
value that appears more than once is, of course, duplicated (but remember that
floating-point numbers might not match exactly). In this example, we construct
a vector from the lower-case letters, but add a few duplicates.
> let <- c(letters, c("j", "j", "x"))
> (tab <- table (let))
let
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
> which (tab != 1) # table locations where duplicates appear
 j  x
10 24    # 10th & 24th table entries aren't ones
> names (tab)[tab != 1]
[1] "j" "x"
It is often useful to use table() twice in a row. This example counts the number of entries that appear once, twice, and so on in the original data. Consider
this example:
> table (table (let))
 1  2  3
24  1  1    # 24 entries are 1, one is 2, one is 3
The last line shows that there are 24 entries in let that appear once; one entry,
x, that appears twice; and one entry, j, that appears three times. We use this in
almost every data cleaning problem to find entries that appear more often than
we expect. In a real application, we might have tens of thousands of elements
and only a few duplicates. The which(tab != 1) command shows us the
elements that are duplicated, but not how many times each one appears; the
table(table(let)) command shows us how many duplicates there are,
but not which letter goes with which count.
Another important task is matching, which is where we identify where, in a
vector, we can find the values in another vector. We will find this particularly
useful when merging data frames in Section 3.7.2. There are two ways to handle elements that do not match; they can be returned as NA, preserving the
length of the original argument in the length of the return value, or, with the
nomatch = 0 argument, they can be returned as 0, which allows the return
value to be used as an index. In this example, we match two sets of names.
> nm <- c("Jensen", "Chang", "Johnson",
"Lopez", "McNamara", "Reese")
> nm2 <- c("Lopez", "Ruth", "Nakagawa", "Jensen", "Mays")
> match (nm, nm2)
[1] 4 NA NA 1 NA NA
> nm2[match (nm, nm2)]
[1] "Jensen" NA NA "Lopez" NA NA
The third command tells us that the first element of nm, which is Jensen,
appears in position 4 of nm2; the second element of nm, Chang, does not appear
in nm2, and so on. We can extract the elements that matched from the nm2 vector as in the last line – but the NA entries in the output of match() produce
NAs in the vector of names. An easier approach is to supply the nomatch = 0
argument, as in this example.
> match (nm, nm2, nomatch = 0)
[1] 4 0 0 1 0 0
> nm2[match (nm, nm2, nomatch = 0)]
[1] "Jensen" "Lopez"
We use match() (or its equivalent) in any data cleaning problem that requires
combining two data sets. Understanding how match() works makes data
cleaning easier. Match() is, in fact, a more powerful version of %in%.
2.6.5 Finding Runs of Duplicate Values
During a data cleaning problem, it often happens that a particular identifier – a
name or account number, perhaps – appears many times in an input data set. As
an example we might be given a list of payments, with each payment identified
by a customer number and each customer contributing dozens of payments.
It will be useful to count the number of times each repeated item appears. We
also use this on logical vectors to find, for example, the locations and lengths of
sets of payments that are equal to 0. The rle() function (the name stands for
"run length encoding") does exactly this: given a vector, it returns the number
of "runs" – that is, repetitions – and each run's length. In this example, we show
what the output of the rle() function looks like.
> rle (c("a", "b", "b", "a", "c", "c", "c"))
Run Length Encoding
lengths: int [1:4] 1 2 1 3
values : chr [1:4] "a" "b" "a" "c"
This output shows that the vector starts with a run of length 1 (the first element
in the lengths vector) with value a (the values vector); then a run of length
2 with value b; and so on. The output is actually returned in the form of a list
with two parts named lengths and values; in Section 3.3, we discuss how
to access the pieces of a list individually.
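Looking ahead to Section 3.3, the two pieces can be extracted by name with the $ operator:
> r <- rle (c("a", "b", "b", "a", "c", "c", "c"))
> r$lengths
[1] 1 2 1 3
> r$values
[1] "a" "b" "a" "c"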
2.7 Long Vectors and Big Data
Starting in version 3.0.0, R introduced something called a long vector, a special
mechanism that allows vectors to be much longer than before. Since there are
only 2^31 − 1 possible values of an integer, entries in a long vector beyond that
point will have to be indexed by double indices. Other than that, this extension
should, in principle, be invisible to users. One exception is that the match()
function, and its descendants, is.element() and %in%, do not work on long
vectors. On long vectors, table() can be very slow and the data.table
package provides some faster alternatives. R's documentation suggests avoiding
the use of long vectors that are characters.
2.8 Chapter Summary and Critical Data Handling Tools
This chapter introduces R vectors, which come in several forms, primarily logical, numeric, and character. The mode(), typeof(), and class() functions
give you information about the class of a vector. The set of is. functions like
is.numeric() returns a TRUE/FALSE result when an object is of the specified mode, and the set of as. functions performs the conversion. Remember
that logicals are simpler than numerics, and numerics simpler than characters,
and that converting from a simpler to a more complicated mode is straightforward. Converting from a more complicated to a simpler mode follows these
rules:
Converting character to numeric produces NA for things that aren't numbers,
like the character strings "TRUE" or "$199.99".
Converting character to logical produces NA for any string that isn't "TRUE",
"True", "true", "T", "FALSE", "False", "false" or "F".
Converting numeric to logical produces FALSE for a zero and TRUE for any
non-zero entry (and watch out for floating-point error here).
Extracting and assigning subsets of vectors are critical parts of any data cleaning project. We can use any of the modes as an index or "subscript" with which
to extract or assign. A logical subscript returns the values that match up with
its TRUE entries. Logical subscripts are extended by recycling where necessary (but most often when we do this it is by mistake). A numeric subscript
returns the values specified in the subscript – and, unsurprisingly, numeric subscripts are not recycled. The which() command identifies TRUE values in a
logical vector, so you can use that to convert a logical subscript to a numeric
one. Finally, a character subscript will extract, from a named vector, elements
whose names are present in the subscript (and, again, this kind of subscript is
not recycled).
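A quick illustration of those three kinds of subscript on a small named vector:
> x <- c(a = 10, b = 20, c = 30)
> x[c(FALSE, TRUE, TRUE)]          # logical subscript
 b  c
20 30
> x[which (c(FALSE, TRUE, TRUE))]  # the numeric equivalent
 b  c
20 30
> x[c("a", "c")]                   # character subscript, by name
 a  c
10 30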
Any kind of vector can have missing values, indicated by NA, and there are
a few other special values as well. Missing values influence computations they
are involved in, so we often want to supply an argument like na.rm = TRUE
to a function computing a sum, mean, or other summary statistic on numeric
data. You should expect to encounter missing values in any data set from any
source and be prepared to accommodate them.
The table() function is critical to data cleaning. It tabulates a vector,
returning the number of times each unique value appears, with names corresponding to the original values in the data set. Passing two or more vectors
to table() produces a two- or higher-way cross-tabulation. We recommend adding the useNA = "ifany", useNA = "always", or exclude
= NULL arguments to ensure that table() counts and displays the number
of NA values, unless you're certain no values are missing. Using table() on
the output of table() – as in table(table(x)) – tells us how many
items in a vector x appear once, twice, and so on. This is useful for detecting
entries that appear more often than expected.
Using names() on the output of table() will produce the unique entries
in a vector, but we also use the unique() function to find these. We spend
a lot of energy in identifying duplicates, and the duplicated() function is
useful here – although, remember, it does not return TRUE for the first item in a
set of duplicates. The is.element() and %in% functions help determine the
extent to which two sets of values overlap; both of these are simpler versions
of the match() function, which is critical to combining data from different
sources.
3 R Data, Part 2: More Complicated Structures
3.1 Introduction
R data is made up of vectors, but, as you already know, there are more
complicated structures that consist of a group of vectors put together. In
this chapter, we talk about the three major structures in R that data handlers
need to know about. The most important of these is the data frame, in which,
eventually, almost all of our data will be held. But in order to build up to the
data frame, we first need to describe matrices and lists. A data frame is part
matrix, part list, and in order to use data frames most efficiently, you need to
be able to think of it in both ways. Furthermore, we do encounter matrices in
the data cleaning world, since the table() command can produce something
that is basically a matrix.
3.2 Matrices
A matrix (plural matrices) is essentially a vector, arrayed in a (two-dimensional)
rectangle. As with a vector, every element of a matrix needs to be of the same
type – logical, numeric, or character. Most of the matrices we will see will be
numeric, but it is also possible to have a logical matrix, typically for subscripting, as we shall see. We start by using the vector of 15 numbers, 101, 102, ..., 115,
to produce a 5 × 3 (i.e., five rows by three columns) numeric matrix.
> (a <- matrix (101:115, nrow = 5, ncol = 3))
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 102 107 112
[3,] 103 108 113
[4,] 104 109 114
[5,] 105 110 115
There are a couple of points to mention here. First, the matrix is filled column
by column, with the first column being filled before the second one starts. We
often intuitively expect the matrix to be filled row by row, because our data
comes in rows, and we read English left-to-right, but this is not how R works. If
you need to load your data into your matrix by rows, use the byrow = TRUE
argument. This arises when you copy a matrix off of a web page, for example;
we expect the entries to be read along the top line, but R stores them down the
first column. (We come back to this example in Section 6.5.3.)
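For example, with byrow = TRUE the same six numbers fill the first row before the second:
> matrix (101:106, nrow = 2, byrow = TRUE)
     [,1] [,2] [,3]
[1,]  101  102  103
[2,]  104  105  106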
Second, notice the row and column indicators such as [4,] and [,2]. In
the following section, we will see how to use those numbers to extract elements
from the matrix, or to assign new ones.
Third, the length() operator can be used on a matrix, but it returns the
total number of elements in the matrix. More often we want to know the number of rows and columns; that information is returned by the nrow() and
ncol() functions, or jointly by the dim() function, which gives the numbers
of rows and columns in that order:
> length (a)
[1] 15
> dim (a)
[1] 5 3
Fourth, in our example, we used the matrix() function to create the
matrix from one long vector. An alternative is to create a matrix from a set
of equal-length vectors. The cbind() function ("c" for column) combines a
set of vectors into a matrix column by column, while rbind() performs the
operation row by row. If the vectors are of unequal length, R will use the usual
recycling rules (Section 2.1.4). Again, all of the elements of a matrix need to be
of the same sort, so if any vector is of type character, the entire matrix will be
character.
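For example, combining a numeric and a character vector with cbind() silently converts everything to character:
> cbind (c(1, 2, 3), c("a", "b", "c"))
     [,1] [,2]
[1,] "1"  "a"
[2,] "2"  "b"
[3,] "3"  "c"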
As with vectors, arithmetic operations on matrices are performed element by
element, so A^2 squares each element of A and A*B multiplies two matrices
element by element. There are special symbols for matrix-specific operations:
for example, A %*% B performs the usual kind of matrix multiplication, t(A)
transposes a matrix, and solve() inverts a matrix. These operations do not
tend to come up much in data cleaning, but often, we want to perform an operation on a matrix row by row or column by column. We come back to these row
and column operations in Section 3.2.3.
3.2.1 Extracting and Assigning
Since a matrix is just a vector, it is possible to use a subscript just like the
one we used in Section 2.3.1 to pull out or replace an element. In the example
above, a[6] would produce 106 (remember that we count by columns first),
and a[6] <- 999 would replace that element with 999. However, it is much
more common to identify elements of a matrix by two subscripts, one for the
row and one for the column. These two subscripts are separated by a comma.
In our example, a[1,2] would produce 106, and a[1,2] <- 999 would
replace that value.
Of course, it is possible to ask for more than one entry at a time. In this
example, we ask for a 2 × 2 sub-matrix from our original matrix a:
> a[c(4, 2), c(3, 1)]
[,1] [,2]
[1,] 114 104
[2,] 112 102
The two rows we asked for, numbers 4 and 2, in that order, are returned, with
the corresponding entries from columns 3 and 1, in that order. Just as when we
use subscripts on a vector, we may use duplicate subscripts; a vector of negative
numbers indicates that the corresponding entries should be removed.
If you leave one of the two subscripts empty, you are asking for an entire row
or column. This command says "give me all the rows except for number 2, and
all the columns."
> a[-2,]
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 103 108 113
[3,] 104 109 114
[4,] 105 110 115
Notice here that some rows have been renumbered. The row that had been
number 5 in the original a is now the fourth row. This is not surprising, but it
raises the question as to whether we might be able to keep track of rows that
have been deleted, since that would help us audit changes we have made to the
data. We will describe one way to do that using row names in Section 3.2.2.
In addition to using a numeric subscript, we can use a logical one. Logical
subscripts for rows or columns act exactly as logical subscripts for vectors (see
Section 2.3.1). Whether you use numeric or logical subscripts, subscripting a
matrix with row and column indices will return a rectangular object. To extract
values from, or assign new values to, a non-rectangular set of entries, you can
use a matrix subscript, which we describe in Section 3.2.5.
Demoting a Matrix to a Vector
In order to turn a matrix into a vector, use the c() function on it. Just as c()
creates vectors from individual elements (see, e.g., Section 2.1.1), it also creates vectors from matrices. In our example, c(a) will produce a vector of 15
numbers. The entries in that vector come from the first column, followed by
the second column, and so on. In order to extract data row by row, transpose
the matrix first, using the t() function in a command like c(t(a)).
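Continuing with our matrix a of the numbers 101 through 115:
> c(a)[1:5]     # column by column
[1] 101 102 103 104 105
> c(t(a))[1:3]  # row by row
[1] 101 106 111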
Sometimes, though, R produces a vector from a matrix when we did not
expect it. In this example, see what happens when we ask for, say, the second
column of a. Remember that a has five rows and three columns.
> a[,2]
[1] 106 107 108 109 110
The result of this operation is not a matrix with five rows and one column;
it is a vector of length 5. This reduction – or "demotion" – from a matrix to
a vector follows a general rule in R under which dimensions of length 1 are
usually removed ("dropped") by default. This can cause trouble when you have
a function that is expecting a matrix, perhaps because it plans to use dim() to
find the number of rows. If you pass a single column of a matrix, that is, a vector,
such a function would call dim() on a vector, which returns the value NULL.
The way around this is to specify that this dropping should not take place, using
the drop = FALSE argument, like this:
> a[,2,drop = FALSE]
[,1]
[1,] 106
[2,] 107
[3,] 108
[4,] 109
[5,] 110
The result of that operation is a matrix with five rows and one column. When
building functions that take subsets of matrices, it is often a good idea to use
drop = FALSE to ensure that the resulting subset is itself a matrix and not a
vector.
3.2.2 Row and Column Names
It is very convenient to have a matrix whose rows and columns have names.
We can assign (and extract) row and column names with the dimnames()
function, described in Section 3.3.2, and there are also functions named
rownames() and colnames() to do the same job. (There is also an equivalent row.names() function, spelled with a dot, but, interestingly, there is
no col.names() function.) Rows and columns are named automatically
by the table() function (technically, a two-way table has class table, not
matrix, but that distinction will not matter here). We start this extended
example by constructing a table.
> yr <- rep (2015:2017, each = 5)
> market <- c(2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2)
> (tbl <- table (market, yr))
       yr
market 2015 2016 2017
     2    3    1    3
     3    2    4    2
Notice that the row-name entries ("2" and "3" under market) are not a
column of the table; they are merely identifiers. This table has three columns,
not four. Now we show the column names and demonstrate how they can be
changed using the colnames() function.
> colnames (tbl)
[1] "2015" "2016" "2017"
> colnames (tbl) <- c("FY15", "FY16", "FY17")
> tbl
       yr
market FY15 FY16 FY17
     2    3    1    3
     3    2    4    2
Once row or column names have been assigned, we can refer to them by name
as well as by number. This makes it possible to refer to a row or column in a consistent way, without having to know its location. Notice, though, that dimension
names are characters, even if they look numeric. So, for example, tbl[2,] will
produce the second row of the matrix tbl, while tbl["2",] will produce the
row whose name is "2", regardless of what number that row has – and even if
earlier rows have been removed.
> tbl[2,]
FY15 FY16 FY17
   2    4    2
> tbl["2",]
FY15 FY16 FY17
   3    1    3
3.2.3 Applying a Function to Rows or Columns
There are lots and lots of operations on matrices supported by R, but many
of them are mathematical and not useful in data cleaning. One operation that
does come up, though, is running a function separately on each row or column
of a matrix. A few of these are so common that they are built in. Specifically,
there are functions named colSums() and rowSums(), which compute all
of the column sums or row sums, and corresponding functions for the means,
colMeans() and rowMeans(). Very often, though, you want to apply some
custom function, such as the one that tells how many entries are NA or missing.
The facility for doing this is the apply() function, to which you supply the
matrix, the direction of travel (1 for across rows, 2 for down columns), and
then the function that is to be applied to each row or column. This last can be
a function built into R, a function you have written yourself, or even a function
defined on the fly.
> a <- matrix (101:115, 5, 3)
# These four commands produce identical results
> rowSums (a)
> apply (a, 1, sum)
# Pass argument na.rm into the sum() function
> apply (a, 1, sum, na.rm = T)
> apply (a, 1, function (x) sum (x))
[1] 318 321 324 327 330
# User-written command selects the second-smallest entry
# in each column
> apply (a, 2, function (x) sort(x)[2])
[1] 102 107 112
If each call to the function returns a vector of the same length, apply()
creates a matrix. In this example, we use the range() function to produce
two values for each column of a.
> apply (a, 2, range)
[,1] [,2] [,3]
[1,] 101 106 111
[2,] 105 110 115
When apply() is used with a vector-valued function, such as range() in
the last example, the output is arranged in columns, regardless of whether the
operation was performed on the rows or the columns of the original matrix.
This does not always match our intuition, particularly when the operation was
performed on rows. In this example, we show the row-by-row ranges of the a
matrix and then transpose using the t() function.
> apply(a, 1, range)
[,1] [,2] [,3] [,4] [,5]
[1,] 101 102 103 104 105
[2,] 111 112 113 114 115
# Use t() to transpose that matrix
> t(apply(a, 1, range))
[,1] [,2]
[1,] 101 111
[2,] 102 112
[3,] 103 113
[4,] 104 114
[5,] 105 115
A difficulty arises when different calls to the function produce vectors of different lengths. In that case, R cannot construct a matrix and has to return the
results in the form of a list (we discuss lists in Section 3.3). This might arise, say,
when looking for the locations of unusual values by column. In this example, we
look for the locations in each column of values greater than 109 in the matrix a.
> apply (a, 2, function (x) which (x > 109))
[[1]]
integer(0)
[[2]]
[1] 5
[[3]]
[1] 1 2 3 4 5
This result tells us that the first column has no entries >109, the second
column's fifth entry is >109, and all five entries in the third column are >109. In
general, you have to be aware that apply() might return a list if the function
being applied can return vectors of different lengths.
3.2.4 Missing Values in Matrices
One very common use of apply() is to count the number of missing values in
each row or column, since missing values always affect how we do data cleaning.
This code shows how to count the number of NA values in each column. To show
off some more of R's capabilities, we use the semicolon, which allows multiple
commands on one line, and the multiple assignment operation, which lets us
assign several things at once.
> a <- matrix (101:115, 5, 3); a[5, 3] <- a[2, 1] <- NA; a
     [,1] [,2] [,3]
[1,]  101  106  111
[2,]   NA  107  112
[3,]  103  108  113
[4,]  104  109  114
[5,]  105  110   NA
> apply (a, 2, function (x) sum (is.na (x)))
[1] 1 0 1
From the last command, we see there is one missing value in each of columns
1 and 3.
We saw how to use which() to identify missing values in a vector back
in Section 2.4, and the same command can also identify missing values in a
matrix. By default, which(is.na(vec)) will return the indices of vec with
missing values as if vec had been stretched out into a long vector (column by
column, as always). However, the arr.ind = TRUE argument will supply the
row and column indices of the items selected by which(). This is extremely
useful in tracking down a small number of missing values. In this example, we
use which() to identify the missing entries in a.
> which (is.na (a))
[1]  2 15
> which (is.na (a), arr.ind = TRUE)
     row col
[1,]   2   1
[2,]   5   3
Here, which() returns a matrix with two named columns and two unnamed
rows. Of course, this approach is not limited to finding NAs. It can also be used
to find negative values, or anything else that is unexpected and needs to be
cleaned.
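Looking ahead to Section 3.2.5, the matrix returned by which() can itself serve as a subscript, so, for example, all of the missing entries could be replaced in a single assignment:
> a[which (is.na (a), arr.ind = TRUE)] <- 0  # replace both NAs with 0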
3.2.5 Using a Matrix Subscript
In the last example, we saw how which() with arr.ind = TRUE returns
a matrix giving a vector of rows and a vector of columns that, together,
identify the cells that had NA values. One underused feature of R is that
we can use a matrix subscript, such as the one returned by which() with
arr.ind = TRUE, to extract from, or assign to, another matrix. We can
also use the vector returned by the ordinary use of which(), but the matrix
approach sometimes makes it much easier to extract the necessary rows or
columns. In this example, we construct a matrix with five columns of data and
a sixth column named "Use". This final column tells us which of the data
columns should be extracted for each of the rows.
> b <- matrix (1:20, nrow = 4, byrow = TRUE)
> b <- cbind (b, c(3, 2, 0, 5))
> colnames (b) <- c("P1", "P2", "P3", "P4", "P5", "Use")
> b
     P1 P2 P3 P4 P5 Use
[1,]  1  2  3  4  5   3
[2,]  6  7  8  9 10   2
[3,] 11 12 13 14 15   0
[4,] 16 17 18 19 20   5
Since the first row's value of Use is 3, we want to extract the third element of
that row; since the second row's value of Use is 2, we want the second element
of that row; and so on. Without the ability to use a matrix subscript, we might
be forced to loop through the rows of b, but in R we can extract all these items
in one call. Our matrix subscript has two columns, one giving the rows from
which we are extracting (in this case, all the rows of b in order) and another
giving the column from which to extract (in this case, the values in the Use
column of b). Here we show how we can construct this matrix subscript and
use it to extract the relevant entries of b.
> (subby <- cbind (1:nrow(b), b[,"Use"]))
     [,1] [,2]
[1,]    1    3
[2,]    2    2
[3,]    3    0
[4,]    4    5
> b[subby]
[1] 3 7 20
Notice that in this example the value of Use in the third row was zero – and
therefore no value was produced for that row of the matrix subscript (see "zero
subscripts" in Section 2.3.1). Negative values cannot be used in a matrix subscript.
As a real-life example of where this might occur, we were recently given a
matrix of customer payments. The first 96 columns contained monthly payment amounts. The last column gave the number of the month with the last payment in it. Our task was to extract the payment amount whose month appeared
in that final column. So if, in the first row, that column had the value 15, we
would have extracted the amount from the 15th column of the payment matrix;
and so on for the second and subsequent rows.
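A minimal sketch of that task, using a hypothetical three-month version of the payment matrix (the object pmts and the column names M1 through Last are ours, not the client's):
> pmts <- cbind (matrix (c(50, 60, 70, 80, 90, 100), nrow = 2), c(3, 2))
> colnames (pmts) <- c("M1", "M2", "M3", "Last")
> pmts[cbind (1:nrow (pmts), pmts[, "Last"])]  # last payment per row
[1] 90 80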
Two points here. First, notice that in the example earlier we extracted using
b[subby], with no additional commas; the matrix subscript defines both
rows and columns. Second, remember to use cbind() to construct the
subscript argument (our subby above). Make sure that the matrix subscript
really is a matrix, and not two separate vectors, or you will extract rows and
columns separately. Matrix subscripting works with names, too. If our matrix b
had had both row and column names, we could have used a character matrix in
exactly the same way as the numeric subby. In that case, b would need both
row and column names so that both columns of the subscript argument could
be character. We cannot have one vector be numeric and the other character,
because we need to combine them into a matrix, and all the entries in a matrix
have to be of the same type. It is also possible to have a logical matrix act as a
subscript – but the results are surprising and we do not recommend it.
3.2.6 Sparse Matrices
A sparse matrix is one whose entries are largely zero. For example, in a language
processing application we might form a matrix with words in the rows and documents in the columns. Then a particular cell, say, the ijth one, would have a
zero if word i did not appear in document j, and since, in many examples, most
words do not appear in most documents, that matrix might have a high proportion of zeros. There are a number of schemes for representing sparse matrices.
The recommended Matrix package (Bates and Maechler, 2016) implements
many of these. We encounter sparse matrices in our work, but rarely in the
context of data cleaning, so we will not discuss them in this book.
3.2.7 Three- and Higher-Way Arrays
Three-way (and higher-way) matrices are called arrays in R. An array looks like
a matrix in that all of its elements need to be of the same type, but a three-way
array requires three subscripts, a four-way array requires four subscripts, and so
on. The only time we seem to have encountered such a thing in data cleaning is
when constructing a three- or higher-way table(). In this example, we show
a three-way table made from three vectors each of length 8, and then we extract
the value 3 from the second row of the first column of the first "panel."
> who <- rep (c("George", "Sally"), c(2, 6))
> when <- rep (c("AM", "PM"), 4)
> worked <- c(T, T, F, T, F, T, F, T)
> (sched <- table (who, when, worked))
, , worked = FALSE
when
who AM PM
George 0 0
Sally 3 0
, , worked = TRUE
when
who AM PM
George 1 1
Sally 0 3
> sched[2,1,1]
[1] 3
Many commands that work on matrices, like apply() and prop.table(),
operate on arrays as well. You can also use c() on an array to produce a vector – in this case, the first column of the first panel is followed by the second
column of the first panel, and so on. The aperm() function plays the role of
t() for higher-way arrays.
3.3 Lists
A list is the most general type of R object. A list is a collection of things that
might be of different types or sizes; a list might include a numeric matrix, a
character vector, a function, another list, or any other R object. Almost every
modeling function in R returns a list, so it is important to understand lists when
using R for modeling, but we also need to describe lists because one special sort
of list is the data frame, which we describe in the following section.
Normally, we will encounter lists as return values from functions, but we can
create a list with the list() function, like this:
> (mylist <- list (alpha = 1:3, b = "yes", funk = log, 45))
$alpha
[1] 1 2 3
$b
[1] "yes"
$funk
function (x, base = exp(1)) .Primitive("log")
[[4]]
[1] 45
Lists also appear as the output from the split() function, which divides a
vector into (possibly unequal-length) pieces according to the value of another
vector. We use this frequently in data cleaning. For example, we might divide
a vector of people's ages according to their gender. In this simple example, we
show how split() produces a list; later, in Section 3.5.1, we show how that
list can be put to use.
> ages <- c(26, 45, 33, 61, 22, 71, 43)
> gender <- c("F", "M", "F", "M", "M", "F", "F")
> split (ages, gender)
$F
[1] 26 33 71 43
$M
[1] 45 61 22
> split (ages, ages > 60)
$`FALSE`
[1] 26 45 33 22 43
$`TRUE`
[1] 61 71
It is worth noting that if the second argument – gender in this case – has missing values, those values will be dropped from the output of split(). Notice
also that in the second example the names of the list elements have been surrounded by backward quotes. This is for display, because FALSE and TRUE are
not valid names here, but the character strings "FALSE" and "TRUE" are.
The length of a list, as found using length(), is the number of elements,
regardless of how big each individual element is. The lengths() function
returns a vector of lengths, one for each element on the list. In our example,
length(mylist) returns the value 4, whereas lengths(mylist) returns
a vector with four lengths in it (including the length of 1 that is returned for the
function funk()). The str() command we described in Section 2.2.2 works
on lists as well. The resulting value printed to the screen gives a description of
every element on the list – one line for atomic elements and multiple lines for
lists within lists. This is one way to help understand the structure of your data
quickly.
3.3.1 Extracting and Assigning
In the first example in this section, the first three elements were given names
and the fourth was not. That output hints at how to extract items from a list. You
can use double square brackets – so mylist[[4]] will return 45 – or, if an
element has a name, you can use the dollar sign and the name – so mylist$b
will return "yes", and split(ages, ages > 60)$"TRUE" will return the
vector of ages >60. Single square brackets can be used, with a numeric, logical, or name subscript, but there's a catch – single square brackets return a list,
not the contents of the list. This is useful if you want only a couple of pieces
of a list. For example, mylist[1:2] will return a list with the first two elements of mylist, and mylist[1] will return a list with the first element of
mylist – not as a vector but as a list. A logical subscript will also work here:
mylist[c(T, T, F, F)] will return the same list as mylist[c(1, 2)]
or mylist[c("alpha", "b")]. Most of the lists we run into will have
names, and we usually extract elements one at a time with the dollar sign, but
the distinction between single and double brackets is still important. Single
brackets create lists; double brackets extract contents. And what happens in our
example if you ask for mylist[[2:3]] or mylist[[c(F, T, T, F)]]?
Unsurprisingly, these commands generate errors.
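This example shows the distinction on mylist:
> mylist[1]    # single brackets: a list of length 1
$alpha
[1] 1 2 3
> mylist[[1]]  # double brackets: the contents themselves
[1] 1 2 3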
When you request a list item using single brackets and a name that is not
present on the list, R returns a list with one NULL element; with double brackets
or the dollar sign, it returns the NULL itself. This is consistent with the rule
that says single brackets produce lists, while double brackets and dollar signs
extract contents. Using double brackets with a numeric subscript greater than
the number of elements in the list, such as mylist[[11]] in our example,
produces an error rather than a NULL.
Of course, to extract elements of a list by name, we need to know the names.
We can determine the names a list has using the names() function. If the
list has no names at all, this function will return NULL; if some elements have
names, the names() function will return an empty string for those elements
with no names. This example shows the names of the mylist list.
> names(mylist)
[1] "alpha" "b" "funk" ""
We can also use the names() function on the left-hand side of an assignment
to change the names of the elements on a list. For example, the command
names(mylist)[4] <- "RPM" would change the name of the fourth element of mylist to "RPM".
Unlike when you use single or double square brackets, when using the dollar
sign to extract an item, you don't need its full name. (Technically, you can pass
the exact argument into double square brackets to control this behavior, but
we don't.) You only need enough to identify the item unambiguously. In this
example, mylist$a would be enough to produce the same numeric vector
returned by mylist$alpha, but if there were two items on the list, say alpha
and algorithm, typing mylist$a would produce NULL. You would need to
specify at least mylist$alp in order to be unambiguous. It's often convenient
to use these abbreviated names, but that approach is best suited for quick work
at the command line. We recommend using full names in functions and scripts,
to avoid confusion or even an error if new items get added to the list later.
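For example, with mylist as above:
> mylist$a        # partial matching finds "alpha"
[1] 1 2 3
> mylist[["a"]]   # double brackets match exactly by default
NULL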
To replace an item on a list, just re-assign it. If you want to add a new
item to a list, just assign the new item to a new name. Here, naturally, you
need to use the item's full name. If your mylist has an item called alpha
and you use the command mylist$alp <- 3, you will create a new item
named alp and leave the old one, alpha, unchanged. To delete an item from
a list, you can use subscripting as we did for a vector. For example, either
mylist <- mylist[c(1, 2, 4)] or mylist <- mylist[-3] will
drop the third entry. But another, possibly easier, way is to assign NULL
to the name or number. In this example, mylist$funk <- NULL or
mylist[[3]] <- NULL would remove the item named funk from
mylist. This behavior means that it is difficult to intentionally store a NULL
value in a list, but this does not seem to be much of a limitation.
Another useful function for operating on lists is unlist(), which, as its
name suggests, tries to turn your entire list into a vector. When the list contains unusual objects, such as the function element of the mylist list in our
example, the results of unlist() can be difficult to predict. This example
shows the effect of unlist() operating on a list of regular vectors, which we
create by excluding the function element of mylist.
> unlist (mylist[-3])
alpha1 alpha2 alpha3      b
   "1"    "2"    "3"  "yes"   "45"
Here we can see that R has produced names for each of the elements from the
vector mylist$alpha, and in a well-behaved list these names can be useful.
3.3.2 Lists in Practice
Generally, we do not need lists much when data cleaning. As we have noted,
lists arise as the output of many R functions – a function in R cannot return
more than one result, so if a function computes two things of different sizes,
it will need to return a list. For example, the rle() function we described in
Chapter 2 returns the lengths of runs and, separately, the value associated with
each run. It is then your job to extract the pieces from the list. The pieces will
almost always be named, so they will be able to be extracted using the $ operator. (In the case of rle(), the pieces are called lengths and values.) Lists
also arise as the output from the split() command. Normally, after calling
split() we would then call an apply()-type function on each element of
the resulting list. We describe this in Section 3.5.1. And, of course, the apply
functions can themselves produce lists, as we saw in Section 3.2.3.
Another common context in which lists arise concerns the dimension names
of a matrix. The dimnames() function returns NULL when applied to a matrix
without row or column names. Otherwise, it returns a list with two elements:
the vector of row names and the vector of column names. In general, this return
has to be a list, rather than a matrix, because the number of rows and number
of columns will be different. Either of the two entries may be NULL, because a
matrix may have row names without column names, or vice versa. The dimnames() function may be used to assign, as well as extract, dimension names.
These examples continue the earlier ones using the two-way table tbl and the
three-way table sched from Sections 3.2.2 and 3.2.7, respectively, and show
dimnames() at work. Notice that dimnames() produces a list with three
vectors of names from the three-way table.
> dimnames(tbl)
$market
[1] "2" "3"
$yr
[1] "FY15" "FY16" "FY17"
> dimnames (sched)
$who
[1] "George" "Sally"
$when
[1] "AM" "PM"
$worked
[1] "FALSE" "TRUE"
As we have seen before, dimension names are always characters. So in the
three-way array example, the names for the worked dimension are the character strings "FALSE" and "TRUE", not the logical values. In the following
example, we show how we can modify an element of the dimnames() list.
> dimnames(tbl)[[2]][1] <- "Archive"
> tbl
       yr
market Archive FY16 FY17
     2       3    1    3
     3       2    4    2
In the dimnames() assignment we change the first column name. Here,
dimnames(tbl) produces a list, the [[2]] part extracts the vector
of column names, and the [1] part accesses the element we want to
change. Of course, we could have achieved the same result with dimnames
(tbl)$yr[1] <- "Archive".
Another list that arises from R itself is the list of session options, returned
from a call to the options() function. This list includes dozens of elements
describing things such as the number of digits to be displayed, the current
choice of editor, the choices going into scientific notation, and many more.
Calling names(options()) will produce a vector of the names of the
current options. You can examine a particular option, once you know its
name, with a command like options()$digits. To set an option, pass
its name and value into the options() function, with a command like
options(digits = 9).
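For example (7 is the usual default for digits in a standard installation):
> options()$digits
[1] 7
> options (digits = 9)
> options()$digits
[1] 9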
3.4 Data Frames
Now that we understand how matrices and lists work, we can focus on the
most important object of all, the data frame. A data frame (written with a dot
in R, as data.frame) is a list of vectors, all of which are the same length, so
that they can be arrayed in a matrix-like rectangle. (Technically, the elements
of a data frame can also be matrices, as long as they are of the right size, but
let us avoid that complication. For our purposes, the elements of a data frame
will be ordinary vectors.) The vectors in the list serve as the columns in the
rectangle. A data frame looks like a matrix, with the critical difference that
the different columns can be of different types. One column can be numeric,
another character, a third factor, a fourth logical, and so on. Each vector
has elements of one type, as usual, but the data frame allows us to store the
sort of data we get in real life. So a data frame about people might contain
their names (which would probably be character), their ages (often numeric,
but possibly factor), their gender (possibly character, possibly factor), their
eligibility for a particular program (which might be logical), and so on. In this
example, we use the data.frame() function to construct a data frame. In
data cleaning, our data frames are very often produced by functions that read
in data from the disk, a database, or some other source. We describe methods
of acquiring data in Chapter 6, but for the moment we will use this simple
example.
> (mydf <- data.frame (
Who = letters[1:5], Cost = c(3, 2, 11, 4, 0),
Paid = c(F, T, T, T, F), stringsAsFactors = FALSE))
Who Cost Paid
1 a 3 FALSE
2 b 2 TRUE
3 c 11 TRUE
4 d 4 TRUE
5 e 0 FALSE
There are a few points worth noting here. First, R has provided row names
(visible as 1 through 5 on the left) to the data frame automatically. A matrix
need not have row names or column names, and a list need not have names,
but a data frame must always have both row names and column names.
R will create them if they are not explicitly assigned, as it did here. The
data.frame() function ensures (unless you specify otherwise) that column
names are valid and not duplicated. You may specify row names explicitly,
using the row.names argument, in which case they must be neither duplicated
nor missing. Column names can be examined and set using the names()
command, as with a list, or with the colnames() or dimnames() commands, as with a matrix. Generally, you will probably find the names() or
colnames() approaches to be easier, since they involve vectors and not a list.
For row names, the rownames() and row.names() functions allow the
row names of a data frame to be examined or assigned. Section 3.2.2 describes
how row names can be useful when handling matrices, and those points are
true for data frames as well.
A second point is that, by default, the data.frame() function turns character vectors into factors. Factors are discussed in Section 4.6, and, as we mention there, they are useful, even required, in some modeling contexts. They are
rarely what we want in data cleaning, however. The best way to keep factors out
of data frames is to not allow them in the first place; we accomplished this in
the example above by passing the stringsAsFactors = FALSE argument
to the data.frame() function. Without that argument, the Who column of
mydf would have been a factor variable with five levels. Another way to prevent
factors from being created is to set the stringsAsFactors global option
to be FALSE, using the options(stringsAsFactors = FALSE) command. However, we cannot rely on all of the users of our code having that setting
in place, so we always try to remember to turn this option off explicitly when
we call data.frame(). This issue will arise again when we talk about combining data frames later in this section, and about reading data in from outside
sources in Chapter 6.
There are several functions that help you examine your data frame. Of
course, in many cases, it will be too big to simply print out and examine. The
head() and tail() functions display only the first or last six rows of a data
frame, by default, but this can be changed by the second argument, named n.
So head(mydf, n = 10) will show the first 10 rows, tail(mydf, 12)
will show the last 12, and, using a negative argument, head(mydf, -120)
will show all but the last 120. The str() function prints a compact representation of a data frame that includes the type of each column, as well as the first
few entries. Other useful functions include dim(), to report the numbers of
rows and columns, and summary(), which gives a brief description of each
column.
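For example, str() describes the mydf data frame constructed above like this:
> str (mydf)
'data.frame':   5 obs. of  3 variables:
 $ Who : chr  "a" "b" "c" "d" "e"
 $ Cost: num  3 2 11 4 0
 $ Paid: logi  FALSE TRUE TRUE TRUE FALSE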
3.4.1 Missing Values in Data Frames
Because the columns of a data frame can be of different classes, missing values can be of different classes, too. A missing value in a numeric column will
be a numeric missing value, while in a character column, the missing value
will be of the character type. We discussed missing values at some length in
Section 2.4. It is always good to know where missing values come from and
why they exist – often investigating the causes of "missingness" will lead to discoveries about the data. The is.na() function operates on a data frame and
returns a logical-valued matrix showing which elements (if any) are missing;
the anyNA() function operates on data frames as well. One approach to handling missing data is to simply omit any observations (rows) of the data frame in
which one or more elements is missing. R's na.omit() function does exactly
that. (For this purpose, NaN is missing but Inf and -Inf are not.) This is the
default behavior for a number of R's modeling functions, but in general we do
not recommend deleting records with missing values until the reason for the
values being missing is understood.
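For example, if we introduce a missing value into a copy of mydf (mydf2 is just a scratch name), na.omit() drops the affected row:
> mydf2 <- mydf
> mydf2$Cost[2] <- NA
> na.omit (mydf2)
  Who Cost  Paid
1   a    3 FALSE
3   c   11  TRUE
4   d    4  TRUE
5   e    0 FALSE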
3.4.2 Extracting and Assigning in Data Frames
Since a data frame is matrix-like and also list-like, we can use both matrix-style
and list-style subsetting operations on a data frame. One difference appears
when we select a single row. With a matrix, selecting a single row returns a
vector, unless you specify drop = FALSE (see Section 3.2.1). However, with
a data frame, even a single row is returned as a data frame with one row, because
in general even one row of a data frame will contain entries of different types.
With that one difference, we extract rows from a data frame just as we
extract rows from a matrix – by number, including negatives; using a logical
vector; or by names (as we mentioned, the rows of a data frame, and the
columns, always have names). We can extract columns using either list-style
access or matrix-style access. List-style access uses single brackets to produce
sub-lists, which in this case means that using single brackets will produce
a data frame. Double bracket subscripts, or the dollar sign, will produce a
vector. The difference is that double brackets require an exact name, unless
exact = FALSE is set, whereas the dollar sign only requires enough of the
name to be unambiguous. If there are two columns with similar names, and
your request is not sufficient to determine a unique answer, nothing at all
(i.e., NULL) is returned. Therefore, it makes sense, particularly when writing
functions for other people, to use full names for columns.
Matrix-style access uses column names or numbers; just as with a matrix,
selecting only one column will produce a vector unless you explicitly set
drop = FALSE. This example shows a number of ways of extracting columns
from data frames. We start by showing list-style access using single brackets.
> mydf[2]             # Numeric subscript
  Cost
1    3
2    2
3   11
4    4
5    0
> mydf["Cost"]        # Subscript by name
  Cost
1    3
2    2
3   11
4    4
5    0
> mydf[c(F, T, F)]    # Logical subscript
  Cost
1    3
2    2
3   11
4    4
5    0
Each of those operations produced a data frame with five rows and one column
(which is, of course, a list). In the following examples, we use double brackets
together with a numeric or character subscript and produce a vector. As
with a list, a logical subscript with more than one TRUE inside a pair of
double brackets will produce an error. (You might have expected the same result
with a numeric subscript; in fact, a numeric subscript of length 2 can be used;
it acts as a one-row matrix subscript.) When using a character index inside
double brackets, you can specify exact = FALSE to permit the same sort of
matching that we get with the dollar sign.
> mydf[[2]]
[1]  3  2 11  4  0
> mydf[["Cost"]]
[1]  3  2 11  4  0
> mydf[["C"]]
NULL
> mydf[["C", exact = FALSE]]
[1]  3  2 11  4  0
Notice that the result in each of these cases is a vector. In the following
examples, we show the use of the dollar sign to extract a column. In this
case, as we mentioned, we need to specify only enough of the name to be
unambiguous.
> mydf$W # Extracts the "Who" column
[1] "a" "b" "c" "d" "e"
The dollar sign can only refer to one column at a time. To extract more than one
column, we can use single brackets as above, or matrix-style access in which
we explicitly specify rows and columns. As with a matrix, leaving one of those
two indices blank will select all of them, and R will produce vectors from single
columns unless the drop = FALSE argument is specified. This example
shows extraction using matrix-style syntax.
> mydf[1:2, c("Cost", "Paid")]
  Cost  Paid
1    3 FALSE
2    2  TRUE
> mydf[, "Who", drop = FALSE] # Example of drop = FALSE
  Who
1   a
2   b
3   c
4   d
5   e
Removing a column from a data frame is exactly like removing an element
from a list and is accomplished in the same way – by assigning NULL to
the column reference. Running the command mydf$Paid <- NULL will
remove the column Paid from the data frame using list-style notation, and
mydf[,"Paid"] <- NULL performs the same task using matrix-style
notation.
To replace subsets of elements you can once again use the matrix-style
or list-style syntax. So, for example, mydf[c(1,3), "b"] <- "A" and
mydf$b[c(1,3)] <- "A" both replace the first and third entries of the b
column of mydf with "A". (Of course, if that column had been numeric or
logical before, this operation will force R to convert it to character.)
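For instance, in this minimal sketch with a hypothetical one-column data
frame, assigning a character value coerces the whole column:
> df <- data.frame (b = c(10, 20, 30))
> df$b[c(1, 3)] <- "A"
> df$b
[1] "A"  "20" "A"
> class (df$b)
[1] "character"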
3.4.3 Extracting Things That Aren’t There
The critical difference between a matrix and a data frame is that the columns
of a data frame can be vectors of different types. Another difference manifests
itself when you try to access an element that isn’t there, maybe because you
asked for a row or column number that was too big or a row or column name
that didn’t exist. In a vector, attempts to extract an item beyond the end of the
vector will produce NAs. But if you ask a matrix for a row or column that doesn’t
exist, R will produce an error. This example shows the difference:
> (mat <- matrix (1001:1006, 2, 3)) # Matrix with six items
     [,1] [,2] [,3]
[1,] 1001 1003 1005
[2,] 1002 1004 1006
# Ask for a non-existent entry, using vector-like indexing
> mat[8]
[1] NA
> mat[,4] # Ask for a non-existent column
Error in mat[, 4] : subscript out of bounds
In general, we prefer the error. A function that sees an NA will often try to carry
on, whereas an error will force you to stop and figure out what has happened.
The situation with data frames (and lists) is different. Supplying subscripts for
which there are no rows produces one row with all NAs for every unusable subscript.
The entries in these rows will have the same classes (numeric, character,
etc.) that the data frame had. This arises when some rows have been deleted,
and then you, or a program, try to access one of the deleted rows by name. In
this example, we show how asking for rows that don’t exist can cause trouble.
> mydf2 <- data.frame (alpha = 1:5, b = c(T, T, F, T, F),
NX = c("NA", "NB", "NC", "ND", "NE"),
stringsAsFactors = FALSE,
row.names = c("Red", "Blue", "White", "Reddish", "Black"))
> mydf2
        alpha     b NX
Red         1  TRUE NA
Blue        2  TRUE NB
White       3 FALSE NC
Reddish     4  TRUE ND
Black       5 FALSE NE
# Let's ask for rows that don't exist.
> mydf2[c(9, 4, 7, 1),]
        alpha    b   NX
NA         NA   NA <NA>
Reddish     4 TRUE   ND
NA.1       NA   NA <NA>
Red         1 TRUE   NA
In this example, we see that the resulting data frame has four rows, two of
which contain only NA values. The character column’s NAs are represented with
angle brackets, as <NA>, to make it easy to distinguish a missing value from
the legitimate character string NA in row 1. The first two columns’ NAs are
numeric and logical. As elsewhere (e.g., Section 2.4.3), logical subscripts will
recycle – which is rarely what you want – and usually produce unwanted results
when they contain NAs.
In the following example, we show one more operation that can produce
rows with NAs in them. Since our data frame has rows named both "Red"
and "Reddish", asking for a row named "Re" is ambiguous and produces
a row of NAs. (In contrast, the row names of a matrix may not be abbreviated;
supplying a name that is not an exact row name produces an error.)
> mydf2["Re",] # Not enough to be unambiguous
alpha b c
NA NA NA <NA>
A much more frequent problem happens when accessing columns. If you
access a non-existent column in the matrix or list styles, using an abbreviation,
R produces an error. In our example, mydf2[,"gamma"] (referring to a
non-existent column), mydf2[,"N"] (referring to an abbreviated name, with
a comma), and mydf2["N"] (without the comma) all produce errors. In contrast,
when using the double-bracket notation, NULL is returned when a name
is abbreviated or non-existent. (As we mentioned with lists, there is in fact
an exact argument to the double brackets that we do not use.) Just like the
NA returned when accessing a non-existent element of a vector, this NULL has
the potential to be more trouble than an error would have been. The use of
the dollar sign, as we mentioned, permits the use of unambiguously abbreviated
names but produces a NULL when used with a non-existent name. In this
example, we show how asking for a non-existent name can produce an unexpected
result.
# Ask for the first column by abbreviated name.
> mydf2$alph
[1] 1 2 3 4 5 # No problem
# Create another column with a similar name
> mydf2$alpha.plus.1 <- mydf2$alpha + 1
> mydf2$alph
NULL
> mydf2$alph + 1 # No error, but..
numeric(0) # probably unexpected
The second-to-last operation produced NULL because alph was not sufficient
to differentiate between the columns alpha and alpha.plus.1. If a row
or column name matches exactly, R will extract it properly (so if you have
alpha and alpha.plus.1 and ask for alpha, there is no ambiguity). It
is a good practice to use complete names, unless there is a strong reason
not to.
3.5 Operating on Lists and Data Frames
Very often we will want to operate on each of the elements of a list or each
of the rows or columns of a data frame. For example, we might want to know
how many missing values are in each column. In Section 3.2.3, we saw a matrix
using apply(), but apply() does not work on a list (since a list doesn’t have
dimensions). The apply() function does work on data frames, but it first converts
the data frame into a matrix. This conversion will only be sensible when all
the columns are of the same type, as with the all-numeric data frame described
in Section 3.5.2. In other cases, the results can be quite unexpected. In this
example, we operate on the rows of a data frame, using apply(), to show how
this can go wrong.
> (dd <- data.frame (a = c(TRUE, FALSE), b = c(1, 123),
cc = c("a", "b"), stringsAsFactors = FALSE))
      a   b cc
1  TRUE   1  a
2 FALSE 123  b
> apply (dd, 1, function (x) x)
   [,1]    [,2]
a  " TRUE" "FALSE"
b  "  1"   "123"
cc "a"     "b"
Here the function passed to apply() does nothing but return whatever is
passed to it. Since data frame dd has a character column, apply() converted
the whole data frame into a character matrix. It does this in part by calling
the format() function column by column, producing the results seen here:
a value " TRUE" with a leading space in row 1 (formatted to be the same
length as the string "FALSE"), and "1" with two leading spaces in row 2
(formatted to be the same length as the string "123"). Analogous conversions
happen whenever a data frame with at least one column that is neither logical
nor numeric is passed to apply(), used in other matrix functions such as t()
(transpose), or accessed with a matrix subscript.
A general approach to this sort of operation (element by element for a list, column
by column for a data frame) is supplied by sapply() and lapply().
The lapply() function always returns a list, whereas sapply() runs
lapply() and then tries to make the output into a vector (if the function
always returns a vector of length 1) or a matrix (if the function returns a
vector of constant length). Be careful, though, because if the different function
calls return items of different lengths, sapply() will need to return a list,
just as the ordinary apply() function did back in Section 3.2.3. Moreover,
if the function returns elements of different types (perhaps as a row of a
data frame), sapply() will try to convert these to a common type. In these
cases, use lapply(). The following example shows one very common use of
sapply(), which is to return the classes of each column in a data frame.
> sapply (mydf2, class)
    alpha         b          NX alpha.plus.1
"integer" "logical" "character"    "numeric"
In this example, the regular apply() function will convert the whole data
frame to character first, before computing the classes, which it would report
as all character.
It is easy to operate on the columns of a data frame (or the elements of a list)
with the lapply() and sapply() functions. As we have seen, it is more difficult
to operate on the rows. These two functions provide a solution to this problem.
They can be used with an ordinary numeric vector as their first argument,
in which case they act like a for() loop, applying their function to each
element of the vector. The for()-like behavior of lapply() and sapply()
is most useful when using a complicated function on each row of a data frame.
The command sapply (1:nrow(ourdf), function (i) fancy
(ourdf[i,])) runs a user-written function called fancy() on each row
of a data frame. The argument to
fancy() really is a data frame, and not one that has been converted into a
matrix. In this example, we show how we might identify rows that contain the
number 1. Note that the naïve use of apply() does not find the number 1 in
the first row.
> apply (dd, 1, function (x) any (x == 1))
[1] FALSE FALSE
> sapply (1:2, function (i) any (dd[i,] == 1))
[1] TRUE FALSE
3.5.1 Split, Apply, Combine
The family of apply() functions all operate as part of a strategy that Wickham
(2011) calls “split-apply-combine.” The data is split (possibly by row, possibly
by column), a function is applied to each piece, and the results are recombined.
We have already met the tapply() function (Section 2.5.2), which performs
exactly this set of operations on vectors. We can also do this explicitly via
split() and sapply() or lapply(). We start the following example
by constructing a data frame with some people’s ages, genders, and ages of
spouses, and computing the average value of Age by Gender. In this example,
we do not specify stringsAsFactors = FALSE.
> age <- data.frame (Age = c(35, 37, 56, 24, 72, 65),
Spouse = c(34, 33, 49, 28, 70, 66),
Gender = c("F", "M", "F", "M", "F", "F"))
> split (age$Age, age$Gender)
$F
[1] 35 56 72 65
$M
[1] 37 24
> sapply (split (age$Age, age$Gender), mean)
   F    M
57.0 30.5
Here the split() function returns a list with the elements of Age divided
by value of Gender. Then sapply() applies the mean() function to
each element of the list and returns a vector (i.e., it performs both the
“apply” and “combine” operations). In this example, we could have used
tapply(age$Age, age$Gender, mean) to produce an identical result.
However, unlike tapply(), split() can operate on a data frame, producing
a list of data frames. We can then write a function to operate on each
data frame. In this example, we split our data frame by Gender and then use
summary() on each of the resulting data frames to return some information
about every column. The summary() function applied to the factor column Gender is
more informative than when applied to a character column; this is why we did
not specify stringsAsFactors = FALSE earlier. The result of the calls to
summary() appears as a specially formatted table.
> split (age, age$Gender)
$F
  Age Spouse Gender
1  35     34      F
3  56     49      F
5  72     70      F
6  65     66      F
$M
  Age Spouse Gender
2  37     33      M
4  24     28      M
> lapply (split (age, age$Gender), summary)
$F
      Age            Spouse      Gender
 Min.   :35.00   Min.   :34.00   F:4
 1st Qu.:50.75   1st Qu.:45.25   M:0
 Median :60.50   Median :57.50
 Mean   :57.00   Mean   :54.75
 3rd Qu.:66.75   3rd Qu.:67.00
 Max.   :72.00   Max.   :70.00
$M
      Age            Spouse      Gender
 Min.   :24.00   Min.   :28.00   F:0
 1st Qu.:27.25   1st Qu.:29.25   M:2
 Median :30.50   Median :30.50
 Mean   :30.50   Mean   :30.50
 3rd Qu.:33.75   3rd Qu.:31.75
 Max.   :37.00   Max.   :33.00
Using sapply() in this case produces an unexpected result (try it!). That
function tries hard to construct a vector or matrix whenever it can. A
single command that produces essentially the same final result, without
letting you save the list, is the by() function. In this example, by(age,
age$Gender, summary) performs the summary() operation on each
column, broken down by gender.
Under some circumstances, the three tasks of split, apply, and combine might
require separate functions, each of which may have its own arguments and conventions.
The dplyr package (Wickham and Francois, 2015) presents a set of
tools that aim to make this sort of processing more consistent. Although this
package is intended for data frames, the earlier plyr package (Wickham, 2011)
handles lists and arrays as well. Both are intended to be fast and efficient and
to permit parallel computation, which we address in Section 5.5. We have been
accustomed to performing these tasks in regular R, and we recommend that
users know how to perform these tasks there, since lots of existing code and
users take that approach.
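As a minimal sketch of the dplyr approach (assuming the package is installed;
group_by() and summarize() are dplyr functions, not base R), the mean-age
computation above might look like the following. We omit the printed result,
since its format depends on the package version.
> library (dplyr)
> age %>% group_by (Gender) %>% summarize (MeanAge = mean (Age))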
3.5.2 All-Numeric Data Frames
We noted above that it is difficult to apply a function to the rows of a data
frame because the entries of a row may have different classes. All-numeric data
frames, though – those whose columns are all logical or numeric – behave
specially in these situations. When one of these data frames is converted to
a matrix, the numeric nature of the columns is preserved (with logicals being
converted to numeric). These data frames can also be transposed, or accessed
with a matrix subscript, without losing their numeric nature. All-numeric data
frames provide a useful way of storing numbers in a matrix-like way while being
able to use data-frame-like syntax – but, again, as soon as one character column
(perhaps an ID) is added, the nature of the data frame changes.
Just as there are functions as.numeric() and so on to convert vectors
from one class to another (see Section 2.2.3), R provides as.matrix() and
as.data.frame() functions to convert data frames to matrices and vice
versa. This is mostly useful for all-numeric data frames or for older functions
that require numeric matrices.
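A minimal sketch, with a small hypothetical data frame, shows the numeric
nature surviving the conversion:
> num.df <- data.frame (x = c(1.5, 2.5), y = c(TRUE, FALSE))
> as.matrix (num.df)          # logicals become numeric
    x y
1 1.5 1
2 2.5 0
> class (t(num.df)["x", 1])   # still numeric after transpose
[1] "numeric"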
3.5.3 Convenience Functions
We encourage users to use long names for their data objects and for their column
names, for increased readability. However, this often leads to a situation
where, to use a simple expression, we need a long line like the one in this example:
CustPayment2016$JanDebt + CustPayment2016$FebPurch -
CustPayment2016$FebPmt
The with() and within() functions provide an easier way to perform operations
such as these, and they are particularly useful when the same operation
needs to be done multiple times on multiple data objects, usually data frames.
For each of these functions, we pass the data frame’s name and then the expression
to be performed, like this:
with (CustPayment2016, JanDebt + FebPurch - FebPmt)
One issue is that if the expression includes an assignment, the assignment is
ignored. In order to create a new column in CustPayment2016, we would
need code like this:
CustPayment2016$FebDebt <- with (CustPayment2016,
JanDebt + FebPurch - FebPmt)
As an alternative, the within() function can perform assignments; it returns
a copy of the data with the expression evaluated. In this case, we could add a
new column called FebDebt to the data frame with a command like this:
CustPayment2016 <- within (CustPayment2016,
FebDebt <- JanDebt + FebPurch - FebPmt)
Notice that in this example within() returns a copy, which then needs to be
saved.
Two more convenience functions are the subset() and transform()
functions. Much beloved of beginners, they make the subsetting and transformation
process easier to follow by helping do away with square brackets.
For example, we might extract all the rows of data frame d for which column
Price is positive with a command like d[d$Price > 0,]; subset()
allows us to use the alternative subset(d, Price > 0). It is also possible
to extract a subset of columns at the same time. The transform() function
allows the user to specify transformations to existing columns in a data frame
and returns the updated version. The help pages for both of these functions are
accompanied by warnings that recommend using them interactively only, not
for programming, and we generally avoid them.
A final convenience function is the ability to “pipe” provided by the %>%
function in the magrittr package (Bache and Wickham, 2014). This
is intended to make code more readable by allowing one function’s output
to serve as another’s input directly at the command line, rather than
requiring nested calls. For example, consider this evaluation of a mathematical
expression:
> cos (log (sqrt (8 - 3)))
[1] 0.6933138
In R, we have to read this from the inside out: we compute 8 − 3; take the
square root of the result; take the logarithm of that result; and finally compute
the cosine of the result from the log() function. Using the pipe notation, we
can pass the results of one computation to the next in the order in which they
are performed. This example shows the same computation performed using
the pipe notation.
> (8 - 3) %>% sqrt %>% log %>% cos
[1] 0.6933138
The pipe notation is particularly useful for nested functions and can be brought
to bear on data frames. However, be aware that not every function is suitable
for piping, and notice that the order of precedence required that we surround
the 8 − 3 with parentheses.
3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames
Data frames can be re-ordered (i.e., sorted) using a command that extracts
all the rows in a new order. This ordering will usually be a vector of row
indices constructed with the order() function (see Section 2.6.2). So if a
data frame named cust has columns ID and Date, then ord <- order
(cust$ID, cust$Date) (or the slightly more convenient alternative,
ord <- with(cust, order (ID, Date))) will produce a vector
ord that shows the ordering of the data frame’s rows by increasing
ID, and then by increasing Date within ID. Therefore, the command
cust <- cust[ord,] will replace the old cust with the newly
ordered one.
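Here is a minimal sketch with a small, hypothetical cust data frame:
> cust <- data.frame (ID = c(2, 1, 2),
    Date = as.Date (c("2017-03-01", "2017-01-15", "2017-02-01")))
> ord <- with (cust, order (ID, Date))
> cust[ord,]
  ID       Date
2  1 2017-01-15
3  2 2017-02-01
1  2 2017-03-01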
In Section 2.6.4, we saw that the unique() function returns the unique
entries in a vector, while its counterpart duplicated() returns a logical vector
that is TRUE for any entry that appears earlier in the vector. These two
functions operate directly on matrices and data frames as well. So the command
unique(mydf) takes a data frame named mydf and returns the set of
non-duplicated rows. As always, floating-point error can be a problem when
detecting whether two things are identical.
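A minimal sketch with a hypothetical data frame:
> dup <- data.frame (a = c(1, 1, 2), b = c("x", "x", "y"),
    stringsAsFactors = FALSE)
> duplicated (dup)
[1] FALSE  TRUE FALSE
> unique (dup)   # row 2 duplicates row 1
  a b
1 1 x
3 2 y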
One more operation that comes up is random sampling from a data frame.
This is a good plan when the original data set is so big that it cannot be easily
used for testing, for example, or plotting. As with re-ordering, the idea is to
construct a sample of row indices and then to subset the data frame with that
sample. The sample() command is useful here. In its most basic form, we
pass an integer named x giving the number of rows in the data frame and an
argument named size giving the desired sample size. The result is a random
set of integers selected without replacement, with each value from 1 to x being
equally likely. To sample 200 rows from a data frame named mydf, we could use
the command sam <- sample(nrow(mydf), 200) to get a vector of 200
row numbers, and then mydf[sam,] to do the sampling. (This presumes that
there are 200 or more rows in mydf. If not, R produces an error.) Of course, the
new data frame’s rows will maintain the numbers they had in the original mydf,
so the row names of the new version will be out of order. If that bothers you, a
quick sam <- sort(sam) prior to subsetting will fix that. The sample()
function also has a number of more sophisticated features, including sampling
with replacement and the ability to specify different probabilities for different
choices.
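In this minimal sketch (with a hypothetical five-row data frame), we also call
set.seed() so the sample is reproducible; we omit the output, since the rows
selected depend on the seed and on the R version's sampling algorithm.
> small <- data.frame (id = 1:5, val = c(10, 20, 30, 40, 50))
> set.seed (1)                         # make the sample reproducible
> sam <- sort (sample (nrow (small), 2))
> small[sam,]                          # two randomly chosen rows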
3.6 Date and Time Objects
Most data cleaning problems will include dates (and sometimes times). The
most important tasks we face with dates in data cleaning are doing arithmetic
(e.g., adding a number of days to a date or finding the number of days between
two dates) and extracting each date’s day, day of the week, month, calendar
quarter, or year. Objects representing dates and times come in several forms in
R, but since one of them takes the form of a list, we have postponed discussion
of those objects until here.
3.6.1 Formatting Dates
There are lots of ways to display a date in text, and during data cleaning it will
feel like you meet all of them. Americans might write July 4, 2017 as “7/4/17,”
but to most of the rest of the world, this indicates April 7th. Furthermore,
this representation leaves unclear precisely where in the string the day starts;
it starts in the third character for an American’s “7/4/17” but in the first for
an internationally formatted date like “26/05/17.” The two unambiguous formats
“2017-07-04” and “2017/07/04” are good starting points for storing dates,
especially in text files outside of R. (The value “2017-7-4” is permitted, but this
format leads to date strings of different lengths; 20170704 is easy to mistake for
an integer.)
The simplest date class in R is called Date, and an object of this class is
represented internally as an integer giving the number of days since a
particular “origin” date. The as.Date() function converts text into objects of
class Date in two ways. First, it can convert an integer number of days since
the origin into a date. The usual origin date in R is January 1, 1970, or, unambiguously,
“1970-01-01.” In this example, we show how a vector of integers can
be converted into a Date object.
> (dvec <- as.Date (c(0, 17250:17252),
origin = "1970-01-01"))
[1] "1970-01-01" "2017-03-25" "2017-03-26" "2017-03-27"
Notice that the value 0 is converted into the origin date, “1970-01-01.” If we
are given integer dates, we need to know what the origin is supposed to be.
This concern arises when reading data in from the Excel spreadsheet program.
Excel uses integer dates, but the origins are different between Windows and
Mac, and Excel mistakenly treats 1900 as a leap year. We describe this in more
detail in Section 6.5.2 when we describe reading data in.
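For example, under the Windows origin (assumed here to be "1899-12-30",
the value that compensates for the leap-year bug), Excel's serial number
42736 corresponds to New Year's Day 2017:
> as.Date (42736, origin = "1899-12-30")
[1] "2017-01-01"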
The second conversion that as.Date() can perform is to convert text-based
representations such as “7/4/17” or “July 4, 2017,” using a format
string that describes the way the input text is formatted. Each piece of the
format string that starts with % identifies one part of the date or time; other
pieces represent characters such as space, comma, /, or - between pieces
of the input text. For example, %B matches the name of the month and %a
matches the name of the day of the week. The most important pieces of the
format are %d for day of the month, %m for the month, and %y and %Y for
two- and four-digit year, respectively. (Two-digit years between 69 and 99
are assumed to be twentieth-century ones starting with 19, and the rest,
twenty-first-century ones.) The help page for as.Date() refers us to the help
page for strptime(), which lists all of the possibilities. For example, this
command uses the format string "%B %d, %Y" to convert text dates such as
"September 20, 2016" into a Date object.
> as.Date (c("Feb 29, 2016", "Feb 29, 2017",
"September 30, 2017"), format = "%b %d, %Y")
[1] "2016-02-29" NA "2017-09-30"
Notice that the format string had to contain the same pattern of spaces
and comma that the input text had. R was able to read both the three-letter
abbreviation Feb and the full name September – but it produced an NA for
Feb 29, 2017, which was not a legitimate date.
The names of the days of the week, and the months of the year, are set by the
computer’s locale (see Section 1.4.6). By changing locales, R can be made to read
in days or months in other languages as well, which is useful when data comes
from international sources. In this example, we have some dates in which the
month has been given in Spanish. By changing the locale we can read these in;
then by re-setting the locale we can use as.character() to convert them
into English.
> sp.dates <- c("3 octubre 2016", "26 febrero 2017",
"5 mayo 2017")
> as.Date (sp.dates, format = "%d %B %Y")
[1] NA NA NA
# Not understood in English locale; use Spanish for now
> Sys.setlocale ("LC_TIME", "Spanish")
[1] "Spanish_Spain.1252" # Setting was successful
> (dts <- as.Date (sp.dates, format = "%d %B %Y"))
[1] "2016-10-03" "2017-02-26" "2017-05-05"
> Sys.setlocale ("LC_TIME", "USA") # Change back
[1] "English_United States.1252" # Setting was successful
> as.character (dts, "%d %B %Y")
[1] "03 October 2016" "26 February 2017" "05 May 2017"
3.6.2 Common Operations on Date Objects
There are a number of convenience functions to manipulate date objects. The
months() and weekdays() functions act on Date objects and return the
names of the corresponding months and days of the week. Each has an abbreviate
argument that defaults to FALSE; when set to TRUE, these arguments
produce three-letter abbreviations. This example demonstrates these
convenience functions.
> d1 <- as.Date ("2017-01-02")
> d2 <- as.Date ("2017-06-15")
> weekdays (c(d1, d2))
[1] "Monday" "Thursday"
> months (c(d1, d2))
[1] "January" "June"
> months (c(d1, d2), abbreviate = TRUE)
[1] "Jan" "Jun"
> quarters (c(d1, d2))
[1] "Q1" "Q2"
There is no function to extract the numeric day, month, or year from a Date
object. These operations are performed using the format() function, which
calls format.Date() to produce character output that can then be converted
to numeric using as.numeric(). The elements of the format string are like
those that are used in as.Date(). This example shows how to extract some
of those pieces from a vector of Date objects – but, again, note that the output
of format() is text.
> format (c(d1, d2), "%Y")
[1] "2017" "2017"
> format (c(d1, d2), "%d")
[1] "02" "15"
> format (d1, "%A, %B %m, %Y")
[1] "Monday, January 01, 2017"
The final command shows a more sophisticated formatting operation, using a
format string like the one in as.Date().
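Since the output is text, wrapping the call in as.numeric() recovers numbers;
continuing the example above with the same d1 and d2:
> as.numeric (format (c(d1, d2), "%m"))
[1] 1 6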
It is permitted to use decimals in a Date object to represent times of
day. If you want to create a date object to represent 1:00 p.m. on July 29,
2015, as.Date(16645 + 13/24, origin = "1970-01-01") will
return a numeric, non-integer object that can be used as a date. However,
as.Date("2017-07-29 13:00:00") produces a Date that is represented
internally by the integer 17,376 – the time portion is ignored. Moreover,
non-integer parts are never displayed and can even be truncated by some
operations. When times of day are required, it is a better idea to use a POSIXt
object (Section 3.6.4).
3.6.3 Differences between Dates
Very often we need to know how far apart two dates are. The difference between
two Date objects is not a date; it is instead a period of time. In R, one of
these differences is stored as a difftime object. Some functions, such as
mean() and range(), handle difftime objects in the expected way. Others,
such as hist() (to produce a histogram) or summary(), fail or produce
unhelpful results. Normally, we will convert difftime objects into numeric
items with as.numeric(). Be careful, though: the units that R uses for the
conversion can depend on the size of the difference, whereas for data cleaning
we almost always want to use one consistent choice of unit. Therefore, it
is a good habit, when converting difftime objects to numbers, to specify
units = "days" (or whichever unit we want) explicitly. In this example,
which continues the one above, we show addition on dates plus an example of
a difftime object.
# Date objects are numeric; we can add and subtract them
> d1 + 30
[1] "2017-07-02"
> (d <- d2 - d1)
Time difference of 13 days # an object of class difftime
> as.numeric (d) # convert to numeric, in days
[1] 13
> units (d)
[1] "days"
> as.numeric (d, units = "weeks")
[1] 1.857143
In the last pair of commands, we saw that as.numeric() produced an output
in days by default, the units being revealed by the units() command.
We can also set the units of a difftime object explicitly, with a command
like units(d) <- "weeks", or use the difftime() function directly, like
difftime(d2, d1, units = "weeks").
3.6.4 Dates and Times
If you don’t need to do computations with times – only with dates – the Date
class will be enough, at least back to 1752, when Britain switched from the Julian
to the Gregorian calendar. If you need to do computations on times, there is a
second set of objects that are stronger at storing and computing those. These
are named POSIXct and POSIXlt objects, after the POSIX set of standards.
Collectively, these two types of objects are called POSIXt objects. POSIXt
objects measure the number of seconds (possibly with a decimal part) since
the beginning of January 1, 1970 using Coordinated Universal Time (UTC),
which is identical to Greenwich Mean Time (GMT). (Technically, the POSIX
standard does not include leap seconds, a vector of which is given by R’s built-in
.leap.seconds variable. This has never affected us.)
The POSIXlt object is implemented as a list, which makes it easy to extract
pieces; the POSIXct object acts more like a number, which makes it the choice
for storing as a column in a data frame. We start with an example of a POSIXlt
object. It prints out as a character string, but it behaves like a list. One unusual
feature is that, to see the names of the list, you need to unlist() the object
first. For example,
> (start <- as.POSIXlt("2017-01-17 14:51:23"))
[1] "2017-01-17 14:51:23 PST" # R has inferred time zone PST
> unlist (start)
sec min hour mday mon year wday yday
"23" "51" "14" "17" "0" "117" "2" "16"
isdst zone gmtoff
"0" "PST" NA
Here start really is a list, and we can extract components in the usual way,
with a dollar sign or double brackets (but, although you can use its names,
names(start) is NULL, and you cannot extract a subset of components with
single brackets). Notice also that the first day of the month gets number 1, but
the first month of the year, January, carries the number 0, and that the year
element counts the number of years since 1900. The advantage of a list is that,
given a vector of POSIXlt objects named date.vec, say, you can extract all
the months at once with date.vec$mon – but again, January is month 0 and
December is month 11. Weekdays are given in the list by wday, with 0–6 representing
Sunday through Saturday, respectively. The weekdays() function
from above, and the other Date functions, also work on POSIXt objects – but
be aware that the results are displayed in the locale of the user. Notice that the
time zone above, PST, is deduced by our computer from its locale. The help
for DateTimeClasses gives more information on the niceties of time zones,
many of which are system specific.
Although we can use the weekdays(), months(), and quarters()
functions on POSIXct objects, we extract other components, such as years
or hours, via the format() function, as we did for Date objects. This is
slightly less efficient than the list-type extraction from a POSIXlt object,
but we recommend using POSIXct objects where possible, because we have
encountered unexpected behavior when changing time zones with POSIXlt
objects.
It is worth noting that although a POSIXt object may have a time, a time
is not required. When a Date object is converted into a POSIXt object, the
resulting object is given a time of 00:00 (i.e., midnight) in UTC. A vector of
POSIXt objects that are all at midnight displays without the time visible, but
they do contain a time value. When a POSIXt object is converted to a Date
object, the time is truncated.
3.6.5 Creating POSIXt Objects
R’s as.POSIXct() and as.POSIXlt() functions convert text that is unambiguously
formatted into POSIXt objects just as as.Date() does. Here the
date can be followed by a 24-hour clock time like 17:13:14 or a 12-hour time
with an AM/PM indicator. More usefully, perhaps, these functions allow the use
of a format string such as the one used by as.Date(). This format string, documented
in the help for strptime(), allows times, time zones, and AM/PM
indicators, attributes that are also accepted by as.Date(). Often we discard
time information, since we are only interested in dates, but sometimes discarding
time information can lead to incorrect conclusions. In this example, we
construct two POSIXct objects that represent the same moment expressed in
two different time zones.
> (ct1 <- as.POSIXct ("Mar 31, 2017 10:26:08 pm",
format = "%b %d, %Y %I:%M:%S %p"))
[1] "2017-03-31 22:26:08 PDT"
> (ct2 <- as.POSIXct ("2017-04-01 05:26:08", tz = "UTC"))
[1] "2017-04-01 05:26:08 UTC"
> as.numeric (ct1 - ct2, units = "secs")
[1] 0
The first date, ct1, is not given an explicit time zone, so the system selects the
local one (shown here as PDT). In the second example, we explicitly provide the
UTC indicator with the tz argument. The as.numeric() command shows
that the two times are identical. There are a few confusing properties of POSIXt
objects. All the objects in a vector of length >1 will be displayed with the local
time zone, and their weekdays() and months() will be, too. For a single
object, though, these functions refer to the time zone of the object, although,
as this example shows, there is a complication.
> c(ct1, ct2)
[1] "2017-03-31 22:26:08 PDT" "2017-03-31 22:26:08 PDT"
> weekdays (c(ct1, ct2))
[1] "Friday" "Friday"
> weekdays (ct2)
[1] "Saturday"
> weekdays (c(ct2))
[1] "Friday"
The top command shows that the vector of dates is displayed in our locale.
That date refers to a moment that was on a Friday locally. When weekdays()
acts on ct2 by itself, though, it shows that that moment was on a Saturday in
Greenwich. In the final command, the c() causes ct2 to be converted to local
time, where its date falls on a Friday.
To explicitly convert the time zone of a POSIXct object, you can set its
tzone attribute, with a command like attr(ct1, "tzone") <- "UTC", or,
equivalently, "GMT"; see the help for Sys.timezone() for
a way to determine the names of time zones. (The approach for POSIXlt
objects is more complicated and we do not discuss it here.) Note that when
POSIXct objects are converted to Date objects, they are rendered in UTC,
so as.Date(ct1) and as.Date(ct2) both produce dates with value
"2017-04-01".
The format string that is passed to as.POSIXct() allows for a lot of flexibility
in the way dates are formatted. This example shows how you might convert
R’s own date stamp, produced by the date() function, into a POSIXct object
and then a Date object.
> (curdate <- date())
[1] "Wed Sep 21 00:36:47 2016"
> (now <- as.POSIXct (curdate,
format = "%A %B %d %H:%M:%S %Y"))
[1] "2016-09-21 00:36:47 PDT" # POSIXct object
> as.Date (now)
[1] "2016-09-21"
As long as the format of the dates in your data is consistent, it will probably
be possible to read them in using as.POSIXct(). In some cases, dates
may appear with extraneous text. If the contents of the text is known exactly,
the text can be matched. For example, the string Wednesday, the 17th
of March, 2017 at 6:30 pm can be read in with the format string
"%A, the %dth of %B, %Y at %I:%M %p". But this formatting will fail
for the 21st or the 22nd, or if the input string ends with p.m. (with
periods). In cases where there is variable, extraneous text, you may have to
resort to manipulating the text strings using the tools in Chapter 4.
3.6.6 Mathematical Functions for Date and Times
Since Date and POSIXt objects are numeric, many functions intended to
work on numeric data also work on these date objects. In particular, range(),
max(), min(), mean(), and median() all produce vectors
of date objects. The diff() function computes differences between adjacent
elements in a vector, so diff(range(x)) produces the span of dates in the
vector x as a difftime object. The summary() function acts on a vector
of date objects, producing an object that is slightly different from a vector of
dates but still usable. You can also tabulate Date and POSIXct objects with
table() – but table() does not work on the list-like POSIXlt objects.
The seq() function can also be used to generate a sequence of dates. This
is useful for generating the endpoints of “bins” for histograms or other summaries.
As we mentioned, Date objects are implemented in units of days, so
a sequence of Date objects one unit apart has values 1 day apart by default.
However, POSIXt objects are in units of seconds, so a sequence of POSIXt
objects one unit apart are 1 second apart. One way to create a sequence of
POSIXt objects representing consecutive days is to use by = 86400, since
there are 86,400 seconds in a day. However, R has a better approach. When
called with a vector of Date or POSIXt objects, the seq() function invokes
one of the functions, seq.Date() or seq.POSIXt(), that is smarter about
date objects. These functions let you use the by argument with a word like
"hour", "day", and so on. An additional value "DSTday" (for POSIXt only)
ignores daylight saving time to produce the same clock time every day. In this
example, we generate some sequences of Date and POSIXt objects. Notice
that R suppresses times for POSIXt dates when all of the times in the vector
are midnight.
> seq (as.Date ("2016-11-04"), by = 1, length = 4)
[1] "2016-11-04" "2016-11-05" "2016-11-06" "2016-11-07"
# Create and save a POSIXct object, for convenience
> ourPos <- as.POSIXct ("2016-11-04 00:00:00")
> seq (ourPos, by = 1, length = 3)
[1] "2016-11-04 00:00:00 PDT" "2016-11-04 00:00:01 PDT"
[3] "2016-11-04 00:00:02 PDT"
> seq (ourPos, by = "day", length = 3)
[1] "2016-11-04 PDT" "2016-11-05 PDT" "2016-11-06 PDT"
> seq (ourPos, by = "day", length = 4)
[1] "2016-11-04 00:00:00 PDT" "2016-11-05 00:00:00 PDT"
[3] "2016-11-06 00:00:00 PDT" "2016-11-06 23:00:00 PST"
> seq (ourPos, by = "DSTday", length = 4)
[1] "2016-11-04 PDT" "2016-11-05 PDT" "2016-11-06 PDT"
[4] "2016-11-07 PST"
> seq (ourPos, by = "month", length = 4)
[1] "2016-11-04 PDT" "2016-12-04 PST" "2017-01-04 PST"
[4] "2017-02-04 PST"
In the top example, we see a sequence of Date objects 1 day apart (as specified
by by = 1). That same specification produces POSIXt dates 1 second apart.
Using by = "day" moves the clock by 24 hours, but since the Pacific Time
Zone, where these examples were generated, switched from daylight saving to
standard time on November 6, 2016, the old time of midnight daylight time
was advanced 24 hours to 11 p.m. standard time. With by = "DSTday",
the clock time is preserved across days. The final example shows how we can
advance 1 month at a time – the help for seq.POSIXt() shows how these
functions adjust for the case when advancing by month starting at January 31,
for example.
Differences between two POSIXt objects, like differences between Date
objects, are represented by difftime objects in R. Here, though, you need
to be even more careful to specify the units when converting the difftime
object to numeric. This example shows how neglecting that specification can
cause problems.
> d1 <- as.POSIXct ("2017-05-01 12:00:00")
> d2 <- as.POSIXct ("2017-05-01 12:00:06") # d1 + 6 seconds
> d3 <- as.POSIXct ("2017-05-07 12:00:00") # d1 + 6 days
> (d2 - d1) == (d3 - d1)
[1] FALSE # expected
> as.numeric (d2 - d1) == as.numeric (d3 - d1)
[1] TRUE # possibly unexpected
Here the d2 − d1 difference has the value 6 seconds, while the d3 − d1
difference has the value 6 days. The units are preserved in the difftime
objects but discarded by as.numeric(). It is a good practice to always
specify units = "days", or whatever your preferred unit is, whenever you
convert a difftime object to a numeric value.
3.6.7 Missing Values in Dates
Dates of different classes should not be combined in a vector. It is always wise
to use an explicit function to force all the elements of a date vector to have the
same class. This also applies to missing values in date objects – they need to be
of the proper class. In this example, we combine an NA with the d1 date from
above, using the c() function. The c() function can call a second function
depending on the class of its first argument – c.Date(), c.POSIXct(), or
c.POSIXlt().
> c(d1, NA)
[1] "2017-05-01 12:00:00 PDT" NA
> c(NA, d1)
[1] NA 1493665200
> c(as.POSIXct (NA), d1)
[1] NA "2017-05-01 12:00:00 PDT"
> c(NA, as.Date (d1))
[1] NA 17287
The first command succeeds, as expected, because c.POSIXct() is able to
convert the NA value into a POSIXct object. In the second command, though,
c() sees the NA and does not call a class-specific function. Instead, it converts
both values to numeric. The resulting second element is the number of seconds
since the POSIXt origin date. The way around this is to explicitly specify an NA
value of class POSIXct, as in the third command. The final command shows
that this problem exists for Date objects as well – here, d1 is converted into
the number of days since the origin date. The lesson is that you should ensure
that every date element, even the NA ones, in your vector has the same class.
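One way to do this is to create the NA with an explicit conversion before
combining; a minimal sketch:
> na.date <- as.Date (NA)   # an NA of class Date
> c(na.date, as.Date ("2017-05-01"))
[1] NA           "2017-05-01"
> class (c(na.date, as.Date ("2017-05-01")))
[1] "Date"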
3.6.8 Using Apply Functions with Dates and Times
Often a data set will arrive as a data frame with a series of dates in each row.
These might be dates on which a phenomenon is repeatedly recorded –
monthly manpower data, for example, or payment information. If an operation
needs to be performed on each row – say, finding the range of the dates in each
one – it is tempting to use apply() on such a data frame. As with earlier
examples (Section 3.5), this will not succeed – even (perhaps surprisingly) if
the data frame’s columns are all Date or all POSIXct. A better approach is
to operate on each row via the lapply() or sapply() functions. Here we
show an example of a data frame whose columns are both Date objects.
> date.df <- data.frame (
Start = as.Date (c("2017-05-03", "2017-04-16")))
> date.df$End <- as.Date (c("2018-06-01", "2018-02-16"))
> date.df
       Start        End
1 2017-05-03 2018-06-01
2 2017-04-16 2018-02-16
> apply (date.df, 1, function (x) x[2] - x[1])
Error in x[2] - x[1]: non-numeric arg. to binary operator
Here, the apply() function converts the data frame to a character matrix.
(Why it does not convert it to a numeric one is not clear.) So the mathematical
operation fails. One way to apply the function to each row is via sapply(), as
in this example:
> sapply (1:2, function (i)
as.numeric (date.df[i,2] - date.df[i,1],
units = "days"))
[1] 394 306
Using sapply() to index the rows, we can compute each difference in days in
a straightforward way. In general, you will need to pay attention when dealing
with data frames of dates row by row.
3.7 Other Actions on Data Frames
It is a rare data cleaning task that does not involve manipulating data frames,
and one very common operation is to combine two data frames. There are
essentially three ways in which we might want to combine data frames: by
columns (i.e., combining horizontally); by rows (i.e., stacking vertically); and
matching up rows using a key (which we call merging). The first two of these are
straightforward and the third is only a little more complicated. In this section,
we describe these tasks, as well as some other actions you can perform on data
frames. We show some more detailed examples in Chapter 7.
3.7.1 Combining by Rows or Columns
When we talk about “combining data frames by columns,” we mean combining
them side by side, creating a “wide” result whose number of columns is the
sum of the numbers of columns in the things being combined. We have seen
the cbind() function, which is the preferred function for joining matrices. We
can also supply two data frames as arguments to the data.frame() function
and R will join them. Both cbind() and data.frame() can incorporate
vectors and matrices in their arguments as well – but they will convert characters
to factors unless you explicitly provide the stringsAsFactors = FALSE
argument. R is prepared to recycle some inputs, but it is best if the things being
combined have the same numbers of rows.
Recall that a data frame needs to have column names, and that we (almost)
always want these to be distinct. If two columns have the same name, R will use
the make.names() function with the unique = TRUE argument to construct
a set of distinct names. If three data frames each have a column named
a, for example, the result will have columns a, a.1, and a.2. It is always a
good idea to examine the set of column names for duplication (perhaps using
intersect() as in Section 2.6.3) to ensure that you know what action R
will take.
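A minimal sketch of the name deduplication, using two hypothetical
one-column data frames:
> df.a <- data.frame (a = 1:2)
> df.b <- data.frame (a = 3:4)
> names (data.frame (df.a, df.b)) # duplicate made unique
[1] "a"   "a.1"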
Combining data frames by rows means stacking them vertically, creating
a “tall” result whose number of rows is the sum of the numbers of rows in
the things being combined. The rbind() function combines data frames in
this way. We can only operate rbind() on things with the same number of
columns; moreover, the columns need to have the same names, but they need
not be in the same order; R will match the names up. You will almost always
want the columns being joined to be of the same sort – numeric with numeric,
character with character, POSIXct with POSIXct, and so on – otherwise,
R will convert each column to a common class. We usually check the classes
explicitly, and we recommend you pass the stringsAsFactors = FALSE
argument to rbind(). If we have two data frames called df1 and df2, we
start by comparing the names, using code like this:
> n1 <- names (df1)
> n2 <- names (df2)
> all (sort (n1) == sort (n2)) # should be TRUE
We sort the names of each data frame to account for the fact that they might be
out of order. Next, we extract the class of each column. The results, c1 and c2
as follows, will often be vectors, although they might be lists if some columns
produce a vector of length 2 or more. (This will be the case if any columns are
POSIXct objects.) We compare these two objects as in this example:
> c1 <- sapply (df1, class)
> c2 <- sapply (df2, class)
> isTRUE (all.equal (c1, c2[names (c1)])) # should be TRUE
Notice that we re-order the names of c2 so that they match the order of
the names of c1. The all.equal() function compares two objects and
returns TRUE if they match, and a small report (a vector of character strings)
describing their differences if they do not. This report is useful, but to test for
equality in, for example, an if() statement, the isTRUE() function is handy.
This function produces TRUE if its argument is a single TRUE, as returned
by all.equal() when its arguments match, and FALSE if its argument is
anything else, like the character strings produced by all.equal() when its
arguments differ.
If the data frames being combined have the usual unmodified numeric row
names, R will adjust them so that the resulting row names go from 1 upward,
but if there are non-numeric or modified row names, R will try to keep them,
again deconflicting matches to ensure that row names are distinct.
When combining a large number of data frames, the do.call() function
will often be useful. This function takes the name of a function to be run
and a list of arguments, and runs the function with those arguments. For
example, the command log(x = 32, base = 2) produces the result
5, because log₂(32) = 5. We get the exact same result with the command
do.call("log", list(x = 32, base = 2)). Notice that the
arguments are specified in the form of a list. This mechanism allows us to
combine a large number of data frames in a fairly simple way. Suppose we
have a list of data frames named list.of.df (such a list arises frequently
as the output from lapply()). Extracting the individual data frames from
the list can be tedious, but we can rbind() them all with a command like
do.call ("rbind", list.of.df) (assuming the data frames meet the
rbind() criteria). If the data frames are not already on a list, we can construct
such a list with a command like list(first.df, second.df, ...).
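A minimal sketch with a hypothetical list of two small data frames:
> list.of.df <- list (data.frame (x = 1:2), data.frame (x = 3:4))
> do.call ("rbind", list.of.df)
  x
1 1
2 2
3 3
4 4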
3.7.2 Merging Data Frames
Merging is a more complicated and powerful operation. In the usual type of
merging, each data frame has a “key” field, typically a unique one. The merge() function
matches up the keys and produces a data frame with one row per key, with all of
the columns from both of the data frames. There are three main complications
here: what to do when keys are present in one data set but not in the other,
what to do when keys are duplicated, and what to do when keys match only
approximately.
The action when keys are present in one data set, but not in the other, is controlled
by the all.x and all.y arguments, both of which default to FALSE.
For this purpose, x refers to the first-named data set and y to the second. By
default, the result of the merge has one row for each key that appears in both x
and y (except when there are duplicated keys). Database users call this an “inner
join.” When all.x = TRUE and all.y = FALSE, the result has one row
for each key in x (this is a “left join”). Columns of the corresponding keys that
do not appear in y are filled with NA values. Naturally, the converse is true when
all.x = FALSE and all.y = TRUE – the result has one row for each key
in y, and the result has NAs for those columns contributed from x for those
keys that did not appear in y. When all.x = TRUE and all.y = TRUE,
the result has one row for every key in either x or y (this is an “outer join”). In
this example, we merge two small data sets to show the behavior brought about
by all.x and all.y.
> (df1 <- data.frame (Key = letters[1:3], Value = 1:3,
stringsAsFactors = FALSE))
  Key Value
1   a     1
2   b     2
3   c     3
> (df2 <- data.frame (Key = c("a", "c", "f"),
Origin = 101:103, stringsAsFactors = FALSE))
  Key Origin
1   a    101
2   c    102
3   f    103
> merge (df1, df2, by = "Key") # inner join
  Key Value Origin
1   a     1    101
2   c     3    102
> merge (df1, df2, by = "Key", all.x = TRUE) # left join
  Key Value Origin
1   a     1    101
2   b     2     NA
3   c     3    102
> merge (df1, df2, by = "Key", all.y = TRUE) # right join
  Key Value Origin
1   a     1    101
2   c     3    102
3   f    NA    103
> merge (df1, df2, by = "Key", all.x = TRUE,
all.y = TRUE) # outer join
  Key Value Origin
1   a     1    101
2   b     2     NA
3   c     3    102
4   f    NA    103
The behavior of merge() when keys are duplicated is straightforward, but
it is rarely what we want. It is best to remove rows with duplicate keys, or to
create a new column with a unique key, before merging. The number of rows
produced by merge() when there are duplicates is the number of pairs of keys
that match between the two data frames. In this example, we establish some
duplicated keys and show the behavior of merge() in the left join case.
> (df3 <- data.frame (Key = c("b", "b", "f", "f"),
Origin = 101:104, stringsAsFactors = FALSE))
  Key Origin
1   b    101
2   b    102
3   f    103
4   f    104
> merge (df1, df3, by = "Key", all.x = TRUE)
  Key Value Origin
1   a     1     NA
2   b     2    101
3   b     2    102
4   c     3     NA
Here the merge produces one row every time a key in df1 matches a key in
df3 – even if that happens more than once – in addition to producing rows
for every key that does not match. If df1 had included two rows with the Key
value of b, as df3 does, then the result would have had four rows with Key
value b.
The issue of matching keys that match only approximately is a thornier
one. This arises when matching on people’s names, for example, since these are
often represented in slightly different ways – think about the slightly differing
strings “George H. W. Bush,” “George HW Bush,” “George Bush 41,” and so on.
The adist() and agrep() functions (see also the discussion of grep() in
Section 4.4.2) help find keys that match approximately, but this sort of “fuzzy
matching” (also called “entity resolution” or “record linkage”) is beyond the
scope of this book.
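As a small taste, here is a minimal sketch of adist(), which computes the edit
distance between strings (three single-character deletions separate these two
spellings):
> adist ("George H. W. Bush", "George HW Bush")
     [,1]
[1,]    3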
3.7.3 Comparing Two Data Frames
At some point you will have two versions of a data frame, and you will want
to know if they are identical. “Identical” can mean slightly different things
here. For example, if two numeric vectors differ only by floating-point error,
we would probably consider them identical. If a character vector has the
same values as a factor, that might be enough to be identical, but it might
not. The identical() function tests for very strict equivalence and can
be used on any R objects. It returns a single logical value, which is TRUE
when the two items are equal. The help page notes that this function should
usually be applied neither to POSIXlt objects nor, presumably, to data frames
containing these. This is partly because two times might represent the same
value expressed in two different time zones.
The all.equal() function described above compares two objects but
with slightly more room for difference. The tolerance argument lets you
decide how different two numbers need to be before R declares them to be
different. By default, R requires that the two data frames’ names and attributes
match, but those rules can be over-ridden. Moreover, two POSIXlt items
that represent the same time are judged equal. When two items are equal
under these criteria, all.equal() returns TRUE. Since the return value
of all.equal() when its two arguments are not equal is a vector of text
strings, one correct way to compare data frames a and b for equality is with
isTRUE(all.equal (a, b)).
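A minimal sketch contrasting the two tests on hypothetical data frames that
differ only by floating-point error:
> a <- data.frame (x = c(1, 2))
> b <- data.frame (x = c(1, 2 + 1e-10))
> identical (a, b)
[1] FALSE
> isTRUE (all.equal (a, b)) # within default tolerance
[1] TRUE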
3.7.4 Viewing and Editing Data Frames Interactively
R has a couple of functions that will let you edit a matrix, list, or data frame in
an interactive, spreadsheet-like form. The View() function shows a read-only
representation of a data frame, whereas edit() allows changes to be made.
The return value from the edit() function can be saved to reflect the changes.
A more dangerous option is provided by data.entry(); changes made by
that function are saved automatically. If you use these functions to clean your
data, of course, your steps will not be reproducible, and we strongly recommend
using commented scripts and functions, which we describe in Chapter 5.
3.8 Handling Big Data
The ability to acquire, clean, handle, and model with big data sets will surely
become more and more important in coming years. From its beginning, R has
assumed that all relevant data will fit into main memory on the machine being
used, and although the amount of memory installed in a computer has certainly
grown over time, the size of data sets has been growing much faster. Handling
data sets too big for the computer is not part of this book’s focus, but in the
k
k k
k
R Data, Part 2: More Complicated Structures 95
following section we lay out some ideas for dealing with data sets that are just
too big to hold in memory.
Given data that requires more storage than main memory can provide, we
often proceed by breaking the data into pieces outside of R. For text data we use
the command-line tools provided by the bash program (Free Software Foundation,
2016), a widespread command interpreter that comes standard on OS X
and Linux systems and which is available for Windows as well. Bash includes
tools such as split, which breaks up a data set by rows; cut, which extracts
specific columns; and shuf, which permutes the lines in a file (which helps
when taking random samples). These tools provide the ability to break the data
into manageable pieces.
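For example, commands along these lines (the file name big.csv is invented; they can be typed at the bash prompt or, as here, run from R via system()) carve a large text file into manageable pieces:

system ("split -l 1000000 big.csv")            # pieces of one million lines each
system ("cut -d, -f1,3 big.csv > cols13.csv")  # keep only columns 1 and 3
system ("shuf -n 10000 big.csv > sample.csv")  # random sample of 10,000 lines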
Another approach for manipulating large data, this time inside R (in main
memory), was noted in Section 2.7. R has support for "long vectors," those
whose lengths exceed 2³¹ − 1, but these are not recommended for character
data. Moreover, they are vectors rather than data frames, so the long vector
approach does not mirror the data frame approach.
Sometimes the data can fit into memory, but the system is very slow performing
any actions on it. In this case, the data.table package might be useful;
it advertises very fast subsetting and tabulation. Unfortunately, the syntax of
the calls inside the data.table package is just foreign enough to be confusing.
We will not cover the use of data.table in this book. If specific actions
are slow, we can often gain insight by "profiling," which is where we determine
which actions are using up large amounts of time. The "Writing R Extensions"
manual has a section on profiling (Section 3) that might be useful here.
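As a sketch of that approach, base R's Rprof() starts and stops the collection of timing data, and summaryRprof() reports where the time went (the computation being profiled here is just a stand-in):

Rprof ("profile.out")                                   # start profiling
junk <- lapply (1:200, function (i) sort (runif (1e5)))
Rprof (NULL)                                            # stop profiling
summaryRprof ("profile.out")                            # tabulate time by function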
Other ways to speed up computations include compiling functions and running
in parallel. We discuss these and other ways to make your functions faster
in Section 5.5.
There are several add-in packages that provide the ability to maintain
"pointers" to data on disk, rather than reading the data into main memory. The
advantage of this approach is that the size of objects it can handle is limited
only by disk storage, which can be expected to be huge. In exchange, of course,
we have to expect processing to be much slower because so much disk access
can be required. Packages with this approach include bigmemory (Kane et al.,
2013) and its relatives, and ff (Adler et al., 2014). The tm package (Feinerer
et al., 2008) does something similar for large bodies of text.
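As one hedged sketch of the style of these packages (the argument names are as we recall them from bigmemory; consult the package's documentation before relying on this):

library (bigmemory)
# A file-backed matrix whose data lives on disk rather than in RAM
x <- filebacked.big.matrix (nrow = 1e6, ncol = 10, type = "double",
                            backingfile = "x.bin", descriptorfile = "x.desc")
x[1, ] <- rnorm (10)   # indexed much like an ordinary matrix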
R is so popular in the data science world that there are many other programs,
including big data storage mechanisms, for which R interfaces are available.
This allows you to use familiar R commands to access these other mechanisms
without having to understand the details of those programs. In this way, you
can keep your data in a relational database or some storage facility that uses,
for example, distributed memory for efficient retrieval. These approaches are
beyond the scope of this book, but we have some discussion of acquiring data
from a relational database in Chapter 6.
3.9 Chapter Summary and Critical Data Handling Tools
Matrices are important in many mathematical and statistical contexts, but they
do not play an important role in data cleaning. However, learning about matrices
makes learning about data frames more natural. Data frames also have the
attributes of lists, so we have discussed lists as well in this chapter. But the
important type of object in this chapter, and in R generally, is the data frame.
Data frames are often created by reading data in from outside R. We can also
create them directly by combining vectors, matrices, or other data frames with
the cbind(), rbind(), or merge() functions. We can add a character vector
to a data frame with the dollar-sign notation, but whenever we supply a
character vector as a column to be added to a data frame via data.frame()
or cbind(), we need to specify the stringsAsFactors = FALSE argument.
Once our data has been placed into a data frame (say, one called data1), we
often start by recording the classes of each column in a vector, using a command
like col.cl <- sapply(data1, function(x) class(x)[1]).
We use the function shown here rather than simply using the class()
function, as we did earlier, to account for columns with a vector of two or
more classes – usually these would be columns with one of the date classes like
POSIXct. Keeping the names data1 for the data and col.cl for the vector
of classes in this example, we use commands such as these as part of our data
cleaning process:
- table(col.cl) to tabulate the column classes. Often we will have an
expectation that some proportion of the columns will be numeric, or that
we will have, say, exactly 10 date columns. This is a good starting point to see
if the data frame looks as we expect.
- sapply(data1, function(x) sum(is.na(x))) to count missing
values by column. If the number of columns is large, we would often use
table() on the result of the sapply() call to see if there are a few
columns with a large number of missing values. It is also interesting if many
columns report the same number of missing values. For example, if there
are 56 different columns each with exactly 196 missing values, we might
hypothesize that those are the very same 196 records in every column –
and investigate them. In some cases, we might also count the number
of negative values or values equal to 99 or some other "missing" code.
Instead of the function above, then, we might substitute
function(x) sum(x < 0, na.rm = TRUE) or something analogous.
- sapply(data1[, col.cl == "numeric"], range) to compute
the ranges of numeric columns in a search for outliers or anomalies. If some
columns have class "integer" we will need to address those as well, perhaps
using col.cl %in% c("numeric", "integer"). We might have to
add the na.rm = TRUE argument, and we might also use other functions
here such as mean(), median(), or sd().
- sapply(data1, function(x) length(unique(x))) to count
unique values by column. Since these numbers will count NA values, we
might instead use function(x) length(unique(na.omit(x))).
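To make these commands concrete, here is a sketch of the checklist run on a small invented data frame:

data1 <- data.frame (id = 1:5, amt = c(10.5, NA, 3.2, NA, 8.8),
                     code = c("a", "b", "a", "a", NA),
                     stringsAsFactors = FALSE)
col.cl <- sapply (data1, function (x) class (x)[1])
table (col.cl)                                # one integer, one numeric, one character
sapply (data1, function (x) sum (is.na (x)))  # NA counts: 0, 2, 1
sapply (data1[, col.cl %in% c("numeric", "integer")], range, na.rm = TRUE)
sapply (data1, function (x) length (unique (na.omit (x))))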
The apply() family of functions provides a lot of power, but they need to be
exercised carefully on data frames. The apply() function itself converts the
data frame to a matrix first, and should only be used if all the columns of a data
frame are of the same type. sapply() tries to return a vector or matrix if it can,
so if the return elements are of different classes they will often be converted. We
suggest using lapply() unless you know that one of the other functions will
succeed.
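The difference is easy to see on a small two-column data frame:

df <- data.frame (num = 1:3, txt = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
sapply (df, range)   # silently converts: returns a character matrix
lapply (df, range)   # returns a list; each element keeps its own class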
Another important focus of this chapter was on date (and time) objects.
Although Date and POSIXct objects are implemented inside R in a numeric
fashion, they are not quite numeric items in the usual ways. Similarly, while
the POSIXlt object has some numeric features, it is best thought of as a list.
So we deferred description of these objects until this chapter. Date and time
data take up a lot of energy in the data cleaning process because the number of
formats is large and variable and because of complications such as time zones
and date arithmetic.
4
R Data, Part 3: Text and Factors
A lot of data comes in character (“string”) form, sometimes because it really
is text, and sometimes because it was originally intended to be numeric but
included a small number of non-numeric items such as, for example, the word
“Missing.” Almost every data cleaning problem requires manipulating text in
some way, to find entries that include particular strings, to modify column
names, or something else. In this chapter, we describe some of the operations
you can perform on character data. This includes extracting pieces of strings,
formatting numbers as text, and searching for matches inside text.
However, there are really two ways that character data can be stored in R. One
is as a vector of character strings, as we saw in Chapter 2. The tools we mentioned
above are primarily for this sort of data. A second way text can appear
in R is as a factor, which is a way of storing individual text entries as integers,
together with a set of character labels that match the integers back to the text.
Factors are important in many R modeling functions, but they can cause trouble.
We discuss factors in Section 4.6.
One consideration has become much more important in recent years: handling
text from alphabets other than the English one. We are very often called
on to deal with text containing accented characters from Western European
languages, and increasingly, particularly as a result of data from social media
sources, we find ourselves with text in other alphabets such as Cyrillic, Arabic,
or Korean. The Unicode system of representing all the characters from all the
world's alphabets (together with other symbols such as emoji) is implemented
in R through encodings including the very popular UTF-8; Section 4.5 discusses
how we can handle non-English texts in R.
4.1 Character Data
4.1.1 The length() and nchar() Functions
The length of a character vector is, as with other vectors, the number of
elements it has. In the case of a character vector, you also might want to know
how many characters are in each element. We use the nchar() function
for that. Remember that some characters require two keystrokes to type (see
Section 1.3.3 for a discussion), but they still count as only one character. In this
example, we construct a character vector and observe how many characters
each element has.
> (planets <- c("Mercury", "Venus", NA, "Mars"))
[1] "Mercury" "Venus" NA "Mars"
> length (planets) # Four elements
[1] 4
> nchar (planets) # Count characters
[1] 7 5 NA 4
Notice that the number of characters in the missing value is itself missing. In
older versions of R (through 3.2), nchar() reported the lengths of missing
values as 2 – as if those entries were made up of the two characters NA. Starting
with version 3.3, returning NA for that string's length is the default, though the
older behavior can be requested by passing the keepNA = FALSE argument.
4.1.2 Tab, New-Line, Quote, and Backslash Characters
There are a few characters in R that need special treatment. We discussed
this in Section 1.3.3, but it is worth repeating that if you want to enter a tab,
a new-line character, a double quotation mark, or a backslash character, it
needs to be "protected" – we say "escaped" – by a backslash. The leading
backslash does not count as a character and is not part of the string – it's just
a way to enter these characters that otherwise would be taken literally. As an
example, consider entering into R this text: She wrote, "To enter a
'new-line,' type "\n"." Normally, of course, we enclose text in quotation
marks, but here R will think that the character string ends at the quotation
mark preceding To. To remedy that, we escape the two inner double quotation
marks. (Alternatively, we could enclose the entire quote in single, instead
of double quotation marks. Then we would have to escape the two inner single
quotation marks.) Moreover, the backslash is a special character. It, too, needs
to be escaped. So to enter our quote into an R object, we need to type this:
> (quo <-
"She wrote, \"To enter a 'new-line,' type \"\\n\".\"")
[1] "She wrote, \"To enter a 'new-line,' type \"\\n\".\""
> nchar (quo)
[1] 47
> cat (quo, "\n")
She wrote, "To enter a 'new-line,' type "\n"."
Notice the length of the string as given by nchar(). Even though it takes 52
keystrokes to type it in, there are only 47 characters in the string. There is no
real difference between single and double quotes in R; if you create a string
with single quotes, it will be displayed just as if it had been created with double
quotes.
The backslash also escapes hexadecimal (base 16) and Unicode entries.
Hexadecimal values describe entries in the ASCII table that converts binary
values into text ones. For example, if you type "\x45", R returns the value
from the ASCII table that has been given value 45 in base 16 (69 in decimal):
the upper-case E. Passing the string "\x45" to nchar() returns the value 1.
Unicode entries can be one or more characters, and arguments to nchar()
help control what that function will return in more complicated examples. We
talk more about Unicode in Section 4.5.
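A couple of quick illustrations (how the accented character displays will depend on your locale):

> "\x45"          # hexadecimal 45 is decimal 69, the letter E
[1] "E"
> nchar ("\x45")  # one character, despite four keystrokes
[1] 1
> "\u00e9"        # the Unicode code point for e with an acute accent
[1] "é"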
4.1.3 The Empty String
In an earlier chapter (Section 2.3.2), we saw that some vectors have length
0. We could create a character vector of length 0 with a command like
character(0). However, something that is much more common in text
handling is the empty string, which is a regular character string that does
not happen to have any characters in it. This is indicated by "", two quote
characters with nothing between them. That empty string has length() 1 but
nchar() 0. Often the empty string will correspond to a missing value but not
always. It is very common to see empty strings when, for example, reading data
in from spreadsheets. In our experience, spreadsheets will sometimes produce
empty strings, and other times produce strings of spaces (e.g., sometimes,
when all the other entries in a column are two characters long, the empty cells
of the spreadsheet may contain two spaces). Naturally, these different types of
empty or blank strings will need to be addressed in any data cleaning task.
One area of confusion is when using table() on a character vector. The
names() of the table will always be exactly right, but since those names are
displayed without quotation marks, leading spaces are impossible to see.
> vec <- c (" ", "   ", "", " ", "", "2016", "",
            " 2016", "2016", "   ")
> table (vec)
vec
                   2016  2016 
   3    2    2        1     2 
> names (table (vec))
[1] ""      " "     "   "   " 2016" "2016"
In this example, we have items that are empty, items that consist of one space
and three spaces, and items that look like 2016 but sometimes with a leading
space.
The output of the table() function is not enough to determine the values
being tabulated because of the leading spaces. We need the names() function
applied to the table (or, equivalently, something like unique(vec)) to
determine what the values are.
The nzchar() function is a fast way to determine whether a string is empty or not;
it returns TRUE for strings that have non-zero length and FALSE for empty
strings (think of "nz" as indicating "non-zero").
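A quick illustration:

> nzchar (c("", " ", "2016"))
[1] FALSE  TRUE  TRUE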
4.1.4 Substrings
Another action we perform frequently in data cleaning is to extract a piece
of a string. This might be extracting a year from a text-formatted date,
for example, or grabbing the last five characters of a US mailing address,
which hold the ZIP code. The tool for this is the substring() function,
which takes a piece of text, an argument first giving the position
of the first character to extract, and an argument last that gives the
last position. The last argument defaults to 1 million, so unless your
strings exceed that length, last can be omitted when we seek the end of
a string. For example, to extract characters three and four from a string
named dt containing "2017-02-03", we use the command
substring(dt, 3, 4); the result is the string "17". (If the string has fewer than three
characters, the empty string is returned.) To extract the final five characters we
could use substring(dt, nchar(dt) - 4). This extracts characters
6–10 from a string of length 10, characters 21–25 from a string of length 25,
and so on.
The substring() function works on vectors, so substring(vec,
nchar(vec) - 4) will produce a vector the same length as vec, giving the
last five (or up to five) characters of each of its entries. In this example,
the argument first was a numeric vector, and in general both first and
last may be vectors. This lets us use substring() to pull out parts of each
element in a string vector depending on its contents (e.g., "all the characters
after the first parenthesis").
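A small sketch with invented addresses:

> addr <- c("Monterey, CA 93943", "Reno, NV 89557")
> substring (addr, nchar (addr) - 4)   # last five characters: the ZIP codes
[1] "93943" "89557"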
We can exploit this vectorization to use substring() to break a string
into its individual characters. The command substring(a, 1:nchar(a),
1:nchar(a)) does exactly that, just as if we had called substring(a, 1,
1), substring(a, 2, 2), and so on. Another, slightly more efficient, way
to break a string into its characters is mentioned below under strsplit()
(Section 4.4.7).
One of the strengths of substring() is that it can be used on the left side
of an assignment operation. For example, to change the last two letters of each
month's name to "YZ", you could do this, using R's built-in month.name
object, as we do in this example.
> new.month <- month.name
> substring (new.month, nchar (new.month) - 1) <- "YZ"
> new.month
[1] "JanuaYZ" "FebruaYZ" "MarYZ" "AprYZ"
[5] "MYZ" "JuYZ" "JuYZ" "AuguYZ"
[9] "SeptembYZ" "OctobYZ" "NovembYZ" "DecembYZ"
4.1.5 Changing Case and Other Substitutions
R is case-sensitive, and so we often need to manipulate the case of characters
(i.e., change upper-case letters to lower-case or vice versa). The tolower()
and toupper() functions perform those operations, as does the equivalent
casefold() function, which takes an argument called upper that describes
the direction of the intended change (upper = TRUE means "change to
upper case," with FALSE, the default, indicating "change to lower case"). Note
that case-folding works with non-Roman alphabets in which that operation is
defined, such as Cyrillic and Greek. The help page for these functions describes
a more complicated approach that can capitalize the first letter in each word,
which is particularly useful for multi-word names such as "kuala lumpur" or
"san luis obispo."
A more general substitution facility is provided by the chartr() ("character
translation") function. This takes two arguments that are vectors of characters,
plus a string, and it changes each character in the first argument into the corresponding
character in the second argument.
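For example (a small sketch; the strings are invented):

> chartr ("aeiou", "AEIOU", "data cleaning")   # vowels to upper-case
[1] "dAtA clEAnIng"
> casefold ("kuala lumpur", upper = TRUE)
[1] "KUALA LUMPUR"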
4.2 Converting Numbers into Text
Numbers get special treatment when they are converted into text because R
needs to decide how they should be formatted. As we have seen earlier, R formats
entries in a numeric vector for display, but that formatting is part of the
print-out, not part of the vector, and the formatting can change when the vector
changes. In this section, we describe some of the details of those formatting
choices. We also describe how R uses scientific notation, and how you can create
categorical versions of numeric vectors.
4.2.1 Formatting Numbers
Often it is convenient to represent a series of numbers in a consistent format for
reporting. The primary tools for formatting are format() and sprintf().
Format() provides a number of useful options, particularly for lining up decimal
points and commas. (The European usage, with a comma to denote the start
of the decimal and a period to separate thousands, is also supported.) Of course,
formatting strings nicely in R doesn't guarantee that those strings will line up
nicely in a report; that will depend on which font is used to display the formatted
strings. Still, format() is a fast and easy way to format a set of numbers in
a common way. Important arguments are digits, to determine the number
of digits, nsmall to determine the number of digits in the "small" part (i.e., to
the right of the decimal point), big.mark to determine whether a comma is
used in the "big" part, drop0trailing, which removes trailing zeros in the
small part, and zero.print, which, if FALSE, causes zeros to be printed as
spaces. (You can also specify an alternate character like a dot, which might be
useful when most entries are zero.)
This example shows some of these arguments at work.
# Seven digits by default, decimals aligned
> format (c(1.2, 1234.56789, 0))
[1] " 1.200" "1234.568" " 0.000"
# Add comma separator
> format (c(1.2, 1234.56789, 0), big.mark=",")
[1] " 1.200" "1,234.568" " 0.000"
# Currency style, blank zeros
> format (c(1.2, 1234.56789, 0), digits = 6, nsmall=2,
zero.print=F, width=2)
[1] " 1.20" "1234.57" " "
In the last example, the digits and nsmall arguments had to be chosen
carefully in order to produce exactly two digits to the right of the decimal point.
(The nsmall argument describes the minimum, not the maximum, number of
digits to be printed.) There are a few formatting tasks, including incorporating
text and adding leading zeros, that format() is not prepared for, and these
are handled by sprintf().
The sprintf() function takes its name from a common function in the
C language (the name evokes "string print, formatted"). This powerful function
is complicated, and we just give a few examples here. The important point
about sprintf() is the format string argument, which describes how each
number is to be treated. (In R fashion, the format string can be a vector, in
which case either that argument or the numerics being formatted may have
to be recycled in the manner of Section 2.1.4.) A format string contains text,
which gets reproduced in the function's output (this is useful for things such as
dollar signs) and conversion strings, which describe how numbers and other
variables should appear in that output. Conversion strings start with a percent
sign and contain some optional modifiers and then a conversion character,
which describes the manner of object being formatted. Although sprintf()
can produce hexadecimal values and scientific notation (see the following discussion)
with the proper conversion characters, the most useful are i (or d)
for integer values, f for double-precision numerics, and s for character strings.
So, for example, sprintf("%f", 123) formats 123 as a double precision
using its default conversion options and produces the text "123.000000",
while sprintf("%f", 123.456) produces "123.456000".
Much of the power in sprintf() comes from the modifiers. Primary
among these are the field width and precision, two numbers separated by a
period that give the minimum width (the total number of characters, including
sign and decimal point) and the number of digits to the right of the decimal
point, respectively. Other modifiers include the 0, to pad with leading zeros;
the space modifier, which leaves a space for the sign if there isn't one (so that
negative and positive numbers line up); and the + modifier, which produces
plus signs for positive numbers. So, to continue the example, the format
string in sprintf("%9.1f", 123.456) asks for a field width of 9 and a
precision of 1, and the result is the nine-character string "    123.5". The
command sprintf("%09.1f", 123.456) asks for leading zeros and
therefore produces "0000123.5". The items to be formatted, and even the
format string itself, can be vectors. This vectorization makes it straightforward
to insert numbers into sentences like this:
costs <- c(1, 333, 555.55, 123456789.012)
# Format as integers using %d
> sprintf ("I spent $%d in %s", costs, month.name[1:4])
[1] "I spent $1 in January" "I spent $333 in February"
[3] "I spent $555 in March" "I spent $123456789 in April"
In this example, each element of costs and month.name[1:4] is used, in
turn, with the format string.
The format strings are very flexible. We show two more examples here.
# Format as double-precision (%f) with default precision
> sprintf ("I spent $%f in %s", costs, month.name[1:4])
[1] "I spent $1.000000 in January"
[2] "I spent $333.000000 in February"
[3] "I spent $555.550000 in March"
[4] "I spent $123456789.012000 in April"
# Format as currency, without specifying field width
> sprintf ("I spent $%.2f in %s", costs, month.name[1:4])
[1] "I spent $1.00 in January"
[2] "I spent $333.00 in February"
[3] "I spent $555.55 in March"
[4] "I spent $123456789.01 in April"
One final feature of sprintf() is that field width or precision (but not both)
can themselves be passed as an argument by specifying an asterisk in the format
conversion string. This allows fine-tuning of the widths of output, which
is useful in reporting. Suppose we wanted all four output strings from the last
example to have the same length. We can compute the length of the largest
number in costs (after rounding to two decimal points) and supply that length
as the field width, as seen in this example.
> biggest <- max (nchar (sprintf ("%.2f", costs)))
> sprintf ("I spent $%*.2f in %s",
biggest, costs, month.name[1:4])
[1] "I spent $        1.00 in January"
[2] "I spent $      333.00 in February"
[3] "I spent $      555.55 in March"
[4] "I spent $123456789.01 in April"
Although sprintf() is complicated, it is very handy for at least one
job – generating labels that look like 001, 002, 003, and so on. The command
sprintf("%03d", 1:100) will generate 100 labels of that sort.
4.2.2 Scientific Notation
Scientific notation is the practice of representing every number by an optional
sign, a number between 1 and 10, and a multiplier of a power of 10. The choice
that R makes to put a number into scientific notation depends on the number
of significant digits required. For example, the number 123,000,050 is written
1.23e+08, but 123,000,051 is written 123000051. When one number in a
vector (or matrix) needs to be represented in scientific notation, R represents
them all that way, which can be helpful or annoying, depending on the job at
hand. In this example, we show some of the effects of the way R displays numbers
in scientific notation. Notice that the rules are slightly different for integer
and for floating-point values.
> 100000 # Big enough to start scientific notation
[1] 1e+05
> c(1, 100000) # Both numbers get scientific notation
[1] 1e+00 1e+05
> c(1, 100000, 123456) # R keeps precision here
[1] 1 100000 123456
> as.integer (10000000 + 1) # Integers are a little different
[1] 10000001
There is no easy way to change scientific notation for a single command.
The R option scipen controls the "crossover" points between regular
("fixed") and scientific notation, which depends on the number of characters
required to print the vector out. (This, in turn, depends on the
number of digits R is prepared to display, which depends on the digits
option.) Set options(scipen = 999) to disable all scientific notation,
and options(scipen = -999) to require scientific notation everywhere
– but don't forget to set it back to the default value of 0 as needed.
(As with other options() calls, the value of scipen is re-set when you
close and re-open R.) An alternative is to use the format() command with
the scientific = FALSE option. This example shows the format()
command at work on a large number.
> format (10000000) # scientific notation
[1] "1e+07"
> format (100000000, scientific = FALSE)
[1] "100000000"
Notice that, like sprintf(), format() always produces a character string,
which makes further numeric computation difficult.
4.2.3 Discretizing a Numeric Variable
Very often we construct a discretized, categorical version of a numeric vector
with just a few levels for exploration or modeling purposes (we sometimes call
this procedure "binning"). For example, we might want to convert a numeric
vector into a categorical one with levels "Small," "Medium," and "Large." The
natural tool for this in R is the cut() function. The arguments are the vector to be
discretized, the breakpoints, and, optionally, some labels to be applied to the
new levels. The result of a call to cut() is a factor vector; we discuss factors
in Section 4.6, but for the moment we will simply convert the result back to
characters. In this example, we start with a numeric vector and bin its values into
three groups. We will set the boundary points at 4 and 7.
> vec <- c(1, 5, 6, 2, 9, 3, NA, 7)
> as.character (cut (vec, c(1, 4, 7, 10)))
[1] NA "(4,7]" "(4,7]" "(1,4]" "(7,10]" "(1,4]"
[7] NA "(4,7]"
The cut() function has some distracting quirks. By default, intervals
do not include their left endpoint, so that in this example, the value 1
does not belong to any interval. This produces the NA in element 1 of the
output; the second, of course, arises from the missing value in vec. The
include.lowest = TRUE argument will force the leftmost breakpoint to
belong to the leftmost bin. In this example, the number 1 would be found in
the leftmost bin if include.lowest = TRUE were specified. Alternatively,
the right = FALSE argument makes intervals include their left end and
exclude their right (in which case include.lowest = TRUE actually refers
to the largest of the breakpoints). In this example, any values larger than 10
would have produced NAs as well. This requires that you know the lower and
upper limits of the data before deciding what the breakpoints need to be.
If the exact locations of the breakpoints are not important, cut() provides
a straightforward way to produce bins of equal width or of approximately
equal counts. The first is accomplished by specifying the breaks argument as
an integer. In this case, cut() assigns every non-missing observation (even
the lowest) to one of the bins. For bins of approximately equal counts, we can
compute the quantiles of the numeric vector and use those as breakpoints.
The following two examples show both of these approaches on a set of 100
numbers generated from R's random number generator with the standard
normal distribution. We use the set.seed() function to initialize the
random number generator; if you use this command your generator and ours
should produce the same numbers. First we pass breaks as an integer to
produce bins of approximately equal width.
> set.seed (246)
> vecN <- rnorm (100)
> table (cut (vecN, breaks = 5))
(-3.18,-1.94] (-1.94,-0.709] (-0.709,0.522]
            2             21             52
 (0.522,1.75]   (1.75,2.99]
           20             5
In the following example, we use R’s quantile() function to compute the
minimum, quartiles and maximum of the vecN vector. (Other choices are pos-
sible through the use of the probs argument.) Once the quantiles are com-
puted we can pass them as breakpoints to produce four bins with approximately
equal counts – but, as before, cut() produces NA for the smallest value unless
include.lowest = TRUE –andbydefault,table() omits NA values.
> quantile (vecN)
0% 25% 50% 75% 100%
-3.17217563 -0.61809034 -0.06712505 0.45347704 2.98469969
> table (cut (vecN, quantile (vecN))) # lowest value omitted
(-3.17,-0.618] (-0.618,-0.0671] (-0.0671,0.453]
24 25 25
(0.453,2.98]
25
> table (cut (vecN, quantile (vecN), include.lowest = TRUE))
[-3.17,-0.618] (-0.618,-0.0671] (-0.0671,0.453]
25 25 25
(0.453,2.98]
25
Notice how supplying include.lowest = TRUE changed the first bin from
a half-open interval (indicated by the label starting with () to a closed one
(label starting with [). In general, the default labels are somewhat unwieldy – a
character value like "(-0.618,-0.0671]" will be difficult to manage. The
cut() function allows us to pass in a vector of text labels using the labels
argument.
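Continuing the example above, a sketch using the labels argument (with include.lowest = TRUE so that the value 1 is binned as well):

> vec <- c(1, 5, 6, 2, 9, 3, NA, 7)
> as.character (cut (vec, c(1, 4, 7, 10), include.lowest = TRUE,
                     labels = c("Small", "Medium", "Large")))
[1] "Small"  "Medium" "Medium" "Small"  "Large"  "Small"  NA       "Medium"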
4.3 Constructing Character Strings: Paste in Action
Character strings arise in data we bring in from other sources, but very often we
need to construct our own. The primary tool for building character strings is the
paste() function, plus its sibling paste0(). In its simplest form, paste()
sticks together two character vectors, converting either or both, as necessary,
from another class into a character vector first. By default R inserts a space
in between the two. For example, paste("a", "b") produces the result
"a b" while paste(1 == 2, 1 + 2) evaluates the two arguments, converts
them to character (see Section 2.2.3) and produces "FALSE 3".
In practice, we prefer to control the character that gets inserted. We want
a space sometimes, for example, when constructing diagnostic messages, but
more often we want some other separator, in order to construct valid column
names, for example. The sep argument to paste() allows us to specify the
separator. Very often in our work we use a period, by setting sep = ".", or
no separator at all, by setting sep = "". In the latter case, we can also use the
paste0() function, which according to the help file operates more efficiently
in this case.
What really gives paste() its power is that it handles vectors. If any of its
arguments is a vector, paste() returns a vector of character strings, recycling
(Section 2.1.4) shorter ones as needed. This gives us great flexibility in constructing
sets of strings. For example, the command paste0("Col", LETTERS)
produces a vector of the strings "ColA", "ColB", and so on, up to
"ColZ".
A final useful argument to paste() is collapse, which combines all the
strings of the vector into one long string, using the separator specified by the
value of the collapse argument. Common choices are the empty string "",
which joins the pieces directly, and the new-line and tab characters, when formatting
text for output to tables.
Paste() is such a big part of character manipulation in R that we think it
important to show a few examples of how it works and where it can be useful.
4.3.1 Constructing Column Names
When a data frame is constructed from data without header names, R constructs
names such as V1 and V2. Normally, we will want to replace these
with meaningful names of our own, but in big data sets the act of typing
in those names is tedious and error-prone. Moreover, it is often true that
the names follow a pattern – for example, we might have a customer ID
followed by 36 months of balance data from 2016 to 2018, followed by 36
months of payment data for the same years. One way to generate those latter
72 names is through the outer() function. This function operates on two
vectors and performs another function on each pair of elements from the
two vectors, producing a matrix of results. For example, outer(1:10,
1:10, "*") produces a 10 × 10 multiplication table. The command
outer(month.abb, 2016:2018, paste, sep = "."), similarly,
produces a matrix. In this example, we show the first few rows of that matrix
using the head() command.
> head (outer (month.abb, 2016:2018, paste, sep = "."), 3)
[,1] [,2] [,3]
[1,] "Jan.2016" "Jan.2017" "Jan.2018"
[2,] "Feb.2016" "Feb.2017" "Feb.2018"
[3,] "Mar.2016" "Mar.2017" "Mar.2018"
To construct column labels with Bal. on the front, we can simply paste that
string onto the elements of the matrix. Remember that paste() converts its
arguments into character vectors before operating, so the result of this operation
is a vector of column labels, as shown in this example, where again we
show only a few of the elements of the result.
> myout <- outer (month.abb, 2016:2018, paste, sep = ".")
> paste0 ("Bal.", myout)[1:3]
[1] "Bal.Jan.2016" "Bal.Feb.2016" "Bal.Mar.2016"
So to construct a vector with all 73 desired names, we could use a single com-
mand as in this example.
newnames <- c("ID", paste0 ("Bal.", myout),
paste0 ("Pay.", myout))
We note that outer() is not very efficient. For very big data sets, we might
create separate vectors from the year part and the month part, and then paste
them together. Suppose that the balance and payment values alternated, so the
first two columns gave balance and payment for January 2016, the next two for
February 2016, and so on. Then a straightforward way to construct the labels
using paste() is by repeating the components as needed, with rep(), and
then pasting the resulting vectors together:
# 2 values x 12 months x 3 years
part1 <- rep (c("Bal", "Pay"), 12 * 3)
# Double each month, repeat x 3
part2 <- rep (rep (month.abb, each = 2), 3)
part3 <- rep (2016:2018, each = 24)
newnames <- c("ID", paste (part1, part2, part3, sep = "."))
The task in this example could actually have been done more easily with
expand.grid(). This function takes, as arguments, vectors of values
and produces a data frame containing all combinations of all the values.
Since the output is a data frame, for many purposes you will want to specify
stringsAsFactors = FALSE. The next step is to use paste() on the
columns of the data frame. We use paste() regularly and in many contexts.
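A sketch of that expand.grid() approach for the month-by-year labels:

grid <- expand.grid (mo = month.abb, yr = 2016:2018,
                     stringsAsFactors = FALSE)
newnames <- paste (grid$mo, grid$yr, sep = ".")
newnames[1:3]   # "Jan.2016" "Feb.2016" "Mar.2016"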
4.3.2 Tabulating Dates by Year and Month or Quarter Labels
Often we want to summarize vectors of dates (Section 3.6) across, for example,
years, months, or calendar quarters. An easy way to do this is by pasting
together identifiers of year and month, then using table() or tapply() to
compute the relevant numbers of interest. We use paste() here because the
built-in months() and quarters() functions do not produce the year as
well (and the format() function does not extract quarters). In this example,
we first generate 600 dates at random between January 1, 2015 and December
31, 2016 (a period of 731 days) and then tabulate them by quarter.
> set.seed (2016)
> dts <- as.Date (sample (0:730, size = 600),
origin = "2015-01-01")
> table (quarters (dts)) # Shows calendar quarter
Q1 Q2 Q3 Q4
134 153 151 162
To combine both year and quarter, we can use substring() to extract
the years, then paste them together with the quarters. (We could also have
extracted the years with format(dts, "%Y").) We put the years first in
these labels so that the table labels are ordered chronologically. In this example,
we paste the year and quarter, and then tabulate.
> table (paste0 (substring (dts, 1, 4), ".",
quarters (dts)))
2015.Q1 2015.Q2 2015.Q3 2015.Q4 2016.Q1 2016.Q2 2016.Q3
72 71 75 81 62 82 76
2016.Q4
81
To add months to the years, we could call the months() function and once
again use paste() to combine the year and month information. Alternatively,
we can use format() directly, as in this example. Notice, however, that
table() sorts its entries alphabetically by name.
> (mtbl <- table (format (dts, "%Y.%B")))
2015.April 2015.August 2015.December ...
24 23 24 ...
To put these entries into calendar order explicitly, we can use paste0() to
construct a vector of names to be used as an index. Then we can use that index
to re-arrange the entries in the mtbl table. We show that in this example.
> (month.order <- paste0 (2015:2016, ".", month.name))
[1] "2015.January" "2016.February" "2015.March" ...
> mtbl[month.order]
2015.January 2016.February 2015.March ...
24 21 27 ...
4.3.3 Constructing Unique Keys
Often we need to construct a single column that uniquely labels each row in a
data frame. For example, we might have a table with one row for each customer
in every month in which a transaction took place. Neither customer number
nor month is enough to uniquely identify a transaction, but we can construct
a unique key by pasting account number, year, and month. In this example, we
would probably use a two-digit numeric month here and put year before month.
That way an alphabetical ordering of the keys would put every customer's transactions
together in increasing date order.
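A small sketch with invented account numbers, years, and months:

acct <- c(101, 101, 205); yr <- c(2016, 2016, 2017); mo <- c(2L, 11L, 3L)
paste (acct, yr, sprintf ("%02d", mo), sep = ".")
# [1] "101.2016.02" "101.2016.11" "205.2017.03"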
4.3.4 Constructing File and Path Names
In many data processing applications, our data is spread out over many files and
we need to process all the files automatically. This might require constructing
file names by pasting together a path name, a separator like /, and a file name. R
can then loop over the set of file names to operate on each one. As an example,
one way to get the full (absolute) file names of all the files in your working
directory is by combining the name of the directory (retrieved with getwd())
and the names of the files (retrieved with list.files()). The command
paste(getwd(), list.files(), sep = "/") produces a character
vector of the absolute file names of files in the working directory. This is not
quite the same as the output from list.files(full.names = TRUE);
we discuss interacting with the file system in more detail in Section 5.4.
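A sketch of looping over those names (here just printing each file's size; file.size() is available in R 3.2 and later):

fnames <- paste (getwd (), list.files (), sep = "/")
for (f in fnames)
    cat (f, file.size (f), "\n")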
4.4 Regular Expressions
A regular expression is a pattern used in a tool to find strings that match
the pattern. The patterns can be very complicated and perform surprisingly
sophisticated matches, and in fact entire books have been written about regular
expressions. While we cannot cover all the complexities of regular expressions
in this book, we can make you knowledgeable enough to do powerful things.
We use regular expressions to find strings that match a rule, or set of rules,
called a pattern. For example, the pattern a matches strings that include one
or more instances of a anywhere in them. The pattern a8 matches strings with
a8, with no intervening characters. Most characters, such as a and 8 in this
example, match themselves. What gives regular expressions their power is the
ability to add certain other characters that have special meaning to the pattern.
The exact set of special characters differs across the different kinds of regular
expression, but as a first example, the character ^ means "at the start of a
line," and $ means "at the end of a line." So the pattern ^The matches every
string that starts with The; end$ matches every string that ends with end, and
^No$ matches every string that consists entirely of No. By default, patterns are
case-sensitive, but shortly we will see how to ignore case.
4.4.1 Types of Regular Expressions
The details of regular expressions differ from one implementation to the next,
so a regular expression you write for R may not work in, for example, Python
or another language. Actually, R supports two sorts of regular expressions:
one is POSIX-style (named for the same POSIX standards group that gave us
the POSIXt date objects), and the other is Perl-style, referring to the regular
expressions used in the Perl language. (Specifically, if you need to look this up
somewhere, the POSIX style incorporates the GNU extensions and the Perl
style comes via the PCRE library.) By default, POSIX regular expressions are
used in R.
4.4.2 Tools for Regular Expressions in R
There are three primary tools for regular expressions in R: grep() and its variants,
regexpr() and its variants, and sub() and its variants. These three
are similar in implementation. We start by describing grep() in some detail.
The grep() function takes a pattern and a vector of strings, and returns a
numeric vector giving the indices of the strings that match the pattern. With the
value = TRUE argument, grep() returns the matching strings themselves.
The related function grepl() (the letter l on the end standing for "logical")
returns a logical vector with TRUE indicating the elements that match. In this
example, we look through R's built-in state.name vector to find elements
with the capital letter C.
> grep ("C", state.name)
[1]  5  6  7 33 40
> grep ("C", state.name, value = TRUE)
[1] "California" "Colorado" "Connecticut"
[4] "North Carolina" "South Carolina"
> grep ("^C", state.name, value = TRUE)
[1] "California" "Colorado" "Connecticut"
The first call to grep() produces a vector of indices. These five numbers show
the locations in state.name where strings containing C can be found. With
value = TRUE, the names of the matching states are returned. In the final
example, we search only for strings that start with C.
Several other arguments are also important. First, the ignore.case argument
defaults to FALSE, but when set to TRUE, it allows the search to ignore
whether letters are in upper- or lower-case. Second, setting invert = TRUE
reverses the search – that is, grep() produces the indices of strings that do
not match the pattern. (The invert argument is not available for grepl(),
but of course you can use ! applied to the output of grepl() to produce
a logical vector that is TRUE for non-matchers.) Third, fixed = TRUE
suspends the rules about patterns and simply searches for an exact text string.
This is particularly useful when you know your pattern and it has a
special character in it – such as, for example, a negative amount indicated
with parentheses, like ($1,634.34). To continue an earlier example,
grep("^The", vec) finds the entries of vec that start with the three
characters The, whereas grep("^The", vec, fixed = TRUE) finds
the entries that include the four characters ^The anywhere in the string.
A fourth useful argument is perl, which, when set to TRUE, leads the grep
functions to use Perl-type regular expressions. Perl-type regular expressions
have many strengths, but for this development we will describe the default,
POSIX style. Finally, all of these regular expression functions permit the use of
the useBytes argument, which specifies that matching should be done byte
by byte, rather than character by character. This can make a difference when
using character sets in which some characters are represented by more than
one byte, such as UTF-8 (see Section 4.5).
4.4.3 Special Characters in Regular Expressions
We have seen how ^ and $ match, respectively, the beginning and end of a line.
There are a number of other special characters that have specific meanings in
a regular expression. In order for one of these special characters to be used to
stand for itself, it needs to be "protected" by a backslash. We talk more about
the way backslashes multiply in R regular expressions in the following section.
Table 4.1 lists the special characters in R's implementation of POSIX regular
expressions. Many implementations of regular expressions work on lines of
text in a file, so we use the word "line" here synonymously with "element of a
character vector."
4.4.4 Examples
In this section, we show our first examples of using regular expressions to locate
matching strings. Remember that, by default, grep() gives the indices of the
matching strings; pass value = TRUE to get the strings themselves and use
grepl() to get a logical indication of which strings match. In these functions,
a string matches or does not – there is no notion of the position within
a string where a match takes place. The tool for that is regexpr(), described
in Section 4.4.5.
Table 4.1 Special characters in R (POSIX) regular expressions

Matching characters
.    Period         Match any character. Example: t.e matches strings with tae,
                    tbe, t9e, t;e, and so on, anywhere.
[]   Brackets       Match any character between them. Example: t[135]e matches
                    t1e, t3e, t5e; t[1-5]e matches t1e, t2e, ..., t5e; but note:
                    [a-d] might mean [abcd] or [aAbBcCdD], depending on your
                    computer. See "character ranges."
^    Caret          (i) Start of line: ^L matches lines starting with L.
                    (ii) "Not" when appearing first in brackets: t[^h]e matches
                    lines containing t, then something not an h, then e.
$    Dollar         End of line. Example: the$ matches lines ending with the.
|    Pipe           "Or" operator. Example: th|sc matches either th or sc.
()   Parentheses    Grouping operators.
\\   Backslash      Escape character. See text.

Repetition characters
{}   Braces         Enclose repetition operators. Example: (a|b){3} matches lines
                    with three (a or b)'s in a row, for example, aba, bab, ...
,    Comma          Separates repetition operators. Example: j{2,4} matches jj,
                    jjj, jjjj.
*    Asterisk       Match 0 or more. Example: ab* matches a, ab, abb, ...
+    Plus           Match 1 or more. Example: ab+ matches ab, abb, abbb, ...
?    Question mark  Match 0 or 1. Example: ab? matches a or ab.
We start by creating a vector of strings that contain the string
sen in different cases and locations.
> sen <- c("US Senate", "Send", "Arsenic", "sent", "worsen")
> grep ("Sen", sen) # which elements have "Sen"?
[1] 1 2
> grep ("Sen", sen, value = TRUE) # elements with "Sen"
[1] "US Senate" "Send"
> grep ("[sS]en", sen, value = TRUE) # either case "S"
[1] "US Senate" "Send" "Arsenic" "sent" "worsen"
> grep ("sen", sen, value = TRUE, # upper or lower-case
ignore.case = T)
[1] "US Senate" "Send" "Arsenic" "sent" "worsen"
> grep ("^[sS]en", sen, value = T) # start "Sen" or "sen"
[1] "Send" "sent"
> grep ("sen$", sen, value = T) # end with "sen"
[1] "worsen"
The first grep() produces the indices of elements that match the pattern – this
is useful for extracting the subset of items that match. The second grep() uses
value = TRUE to return the elements themselves. These simple examples
start to show the power of regular expressions. That power is multiplied by the
ability to detect repetition, as we see next.
Repetition
The second part of Table 4.1 describes some repetition operators. Often we seek
not a single character, but a set of matching characters – a series of digits, for
example. The regular expression repetition operator ? allows for zero or one
matches, essentially making the match optional; the * allows for zero or more,
and the + operator allows for one or more matches. So the pattern 0+ matches
one or more consecutive zeros. Since the dot character matches any character,
the combination .+ means "a sequence of one or more characters"; this combination
appears frequently in regular expressions. We also often see the similar
pattern .* for "zero or more characters." The following example shows how we
can match strings with extraneous text using repetition operators. We start by
creating a vector of strings, and our goal is to find elements of the vector that
start with Reno and are followed at some point later in the string by a ZIP code
(five digits).
> reno <- c("Reno", "Reno, NV 895116622", "Reno 911",
"Reno Nevada 89507")
> grep ("Reno.+[0-9]{5}", reno, value = TRUE)
[1] "Reno, NV 895116622" "Reno Nevada 89507"
Here, the .+ accounts for any text after the o in Reno and the [0-9]{5}
describes the set of five digits. It is tempting to add spaces to your pattern to
make it more readable, but this is a mistake; the regular expression will then
take the spaces literally and require that they appear. Notice that the nine-digit
number was also matched by the {5} repetition, since the first five of the nine
digits satisfy the requirement.
We end this section with a more complicated example. Here, we search for
strings with dates in the form of a one- or two-digit numeric day, a month name
(as a three-letter abbreviation), and a four-digit year number, when there might
be text between any of these pieces. This example shows the text to be matched.
> dt <- c("Balance due 13 Jun or earlier in 2017",
          "26 Aug or any day in 2018",
          "'76 Trombones' marched in a 1962 film",
          "4 Apr 2018", "9Aug2006",
          "99 Voters May Register in 20188")
The pieces of the regular expression to detect the dates are these. First, we can
have leading text, so .* will match that if it is present. Second, [0-3]?[0-9]
matches a one-digit number (since the first digit is optional, as indicated by the
?) or a two-digit number less than 40. Next, there is (optional) additional text,
followed by a set of month names. The month-related part of the pattern looks
like (Jan|Feb|Mar...|Dec), where the pipes denote that any month will
match and the parentheses make this a single pattern. (The abbreviations in
the pattern will match a full name in the text.) Finally, we match some more
additional text, followed by four digits that have to start with a 1 or a 2. We
construct the month-related part of the pattern first by using paste() with
the collapse argument.
> (mo <- paste (month.abb, collapse = "|"))
[1] "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
> re <- paste0 (".*[0-3]?[0-9].*(", mo, ").*[1-2][0-9]{3}")
> grep (re, dt, value = TRUE)
[1] "Balance due 13 Jun or earlier in 2017"
[2] "26 Aug or any day in 2018"
[3] "4 Apr 2018"
[4] "9Aug2006"
[5] "99 Voters May Register in 20188"
[4] "99 Voters May Register in 20188"
Notice that the mar in marched doesnotmatchthemonthabbreviationMar.
However, the 99 in the final string matches the day portion of our pattern. at
is because the [0-3] is optional; the first 9matches in the [0-9] pattern and
the second, in the .* pattern. Moreover, the five-digit year 20188 in that string
matches the pattern [1-2][0-9]{3}because its first four digits do. We see
how to refine this example later in the section – but regular expressions are
tricky!
The Pain of Escape Sequences
Special characters give regular expressions much of their power. But sometimes
we need to use special characters literally – for example, we might want to find
strings that contain the actual dollar sign $. A dollar sign in a pattern normally
indicates the end of a line; to use it literally in a pattern it needs to be "escaped"
with a backslash. So in order for the regular expression "engine" to look for a
dollar sign, we need to pass it the pattern \$. But remember that to type a backslash
into R, we need to type two backslashes, since R also uses the backslash
as the character that "protects" certain other characters (in strings like \n for
new-line). That is, we have to type \\$ in R so that the engine can see \$ and
know to search for a dollar sign. In this example, we create a vector of character
strings and search for a dollar sign among them. Remember that $ matches the
end of a string. In the first grep() command below, the pattern $ matches
every element in the vector that has an end – all of them.
> vec <- c("c:\\temp", "/bin/u", "$5", "\n", "2 back: \\\\")
> grep ("$", vec) # Indices of elements with ends
[1] 1 2 3 4 5
> grep ("\$", vec, value = TRUE)
Error: '\$' is an unrecognized escape...
> grep ("\\$", vec, value = TRUE)
[1] "$5"
The pattern \$ looks to R as if we are constructing a special "protected" character
such as \n or \t. Since there is no such character in R, we see an error. The
next command produces the elements of vec that contain dollar signs since
value = TRUE; in this case, the only element that matches is $5.
Other special characters also need to be escaped. To search for a dot, use \\.;
to search for a left parenthesis, use \\(, and so on. The "pain" of this section's
title refers to searching for the backslash itself. Since a backslash is represented
as \\, and we need to pass two of them to the engine, the pattern for finding
backslashes in a string is \\\\. This looks like four characters, but it's actually
two (as nchar("\\\\") will confirm). The first tells the regular expression
engine to take the second literally.
Backslashes are fortunately pretty rare in text, but they do arise in path names
in the Windows operating system. In this example, we show how we can locate
strings containing the backslash character.
> grep ("\\", vec)
Error in grep("\\", vec) :
invalid regular expression '\', reason 'Trailing backslash'
> grep ("\\\\", vec, value = TRUE) # elements with \
[1] "c:\\temp" "2 back: \\\\"
> grep ("\\\\\\\\", vec, value = TRUE) # two backslashes
[1] "2 back: \\\\"
In the first command, the backslash \\ is valid in R, but because the regular
expression engine uses the backslash as well, it expects a second character (like
$ in our example above). When no second character is found, grep() produces
an error. The second example shows the elements of vec that contain a
backslash. Notice that the \n character in position 4 is a single character. The
backslash depicts its special nature but is not part of the actual character. The
final pattern matches the string with two backslashes.
The fixed = TRUE argument can alleviate some of the pain when searching
for text that includes special characters. In this example, we repeat the
searches above using fixed = TRUE.
> grep ("\\", vec, value = TRUE, fixed = TRUE) # one \
[1] "c:\\temp" "2 back: \\\\"
> grep ("\\\\", vec, value = TRUE, fixed = TRUE) # two \
[1] "2 back: \\\\"
As a final example in this section, we show how we can use the pipe character
| to find elements of vec with either forward slashes or backslashes.
> grep ("\\\\|/", vec, value = TRUE)
[1] "c:\\temp" "/bin/u" "2 back: \\\\"
> grep ("\\|/", vec, value = TRUE, fixed = TRUE)
character(0)
In the first command, we found strings containing either a backslash (\\\\)
or a forward slash (/), the two separated by the pipe character indicating "or." In
the second command, we used fixed = TRUE to look for strings containing
the literal text \|/ in that order – and of course none was found.
Character Ranges and Classes
We saw earlier how we can match ranges of digits by enclosing them in square
brackets as [0-9]. is extends to other sets of characters. For example,
we might want to match any of the lower-case letters, or any punctuation
character, or any of the letters A–G of the musical scale. It is easy to specify a
range of characters using square brackets and a hyphen, so [a-z] matches a
lower-case letter and [A-G] matches an upper-case musical note. To match
musical notes given in either case, we can combine a range with the pipe
character: [A-G]|[a-g] matches any of those seven letters in either case.
To include a hyphen in the pattern, put it first or last in the brackets. (You can
also put an opening square bracket in a set or range, but to include a closing
square bracket, you will need to escape it so that the set isn’t seen as ending
with that character.) So, for example, the range [X-Z] matches “any of the letters X, Y, or Z,” and includes Y, whereas the set [XZ-] matches “any of X, Z, or hyphen” and does not.
We can negate a character class or range by preceding it with the caret character ^. So the set [^XZ-] matches any character other than X, Z, or hyphen. Notice that the caret must be inside the brackets; outside, it matches the start of the line, as we saw above. A caret elsewhere than the first character is interpreted literally – that is, it matches the caret character.
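As a quick check (again with made-up strings), only the string containing a character outside the set matches:
> grep ("[^XZ-]", c("X-Z", "XZ-", "mix"), value = TRUE)
[1] "mix"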
There is a predefined set of character classes that makes it easy to specify certain common sets. These include [:lower:], [:upper:], and [:alpha:] for lower-case, upper-case, and any letters; [:digit:] for digits; [:alnum:] for alphanumeric characters (letters or numbers); [:punct:] for punctuation; and a few more (see the help pages). Notice that the name of the class includes the square brackets; to use these in a regular expression they need to be enclosed in another set of square brackets. So, for example, the pattern [[:digit:]] matches one digit, and [^[:digit:]] matches any character that is not a digit. We start this example by showing how to identify strings that include, or do not include, upper-case letters.
> vec <- c("1234", "6", "99 Balloons", "Catch 22", "Mannix")
> grep ("[[:upper:]]", vec, value = TRUE) # any upper-case
[1] "99 Balloons" "Catch 22" "Mannix"
> grep ("[^[:upper:]]", vec, value = TRUE) # any non-upper
[1] "1234" "6" "99 Balloons" "Catch 22"
[5] "Mannix"
> grep ("^[^[:upper:]]+$", vec, value = TRUE) # no upper
[1] "1234" "6"
The first grep() uses the [:upper:] character class to identify strings with at least one upper-case character in them. It is tempting to think that the second regular expression, [^[:upper:]], will find strings consisting only of non-upper-case characters, but as you can see the result is something different. In fact, this pattern matches every string that has at least one non-upper-case character. The last example shows how to identify strings consisting entirely of these characters – we specify that a sequence of one or more non-upper-case characters (i.e., [^[:upper:]] followed by +) should be all that can be found on the line (i.e., between the ^ and the $).
Some classes are so commonly used that they have aliases. We can use \d for [:digit:] and \s for [:space:], and \D and \S for “not a digit” and “not a space.” (Here, “space” includes tab and possibly other more unusual characters.) This makes it easy to, for example, find strings that contain no digits at all, as we see in this example.
> grep ("^[^[:digit:]]+$", vec, value = TRUE) # long way
[1] "Mannix"
> grep ("^\\D*$", vec, value = TRUE) # shorter
[1] "Mannix"
Word Boundaries
Often we require that a match take place on a word boundary, that is, at the beginning or end of a word (which can be a space or related character such as tab or new-line, or the beginning or end of the string). Word boundaries are indicated by \b, or by the pair \< and \>. The characters that are considered to go into a word include all the alphanumeric (non-space) characters. Recall our earlier example where we tried to locate strings with dates included – such as, for example, "4 Apr 2018". Our earlier effort inadvertently matched a year with the value 20188. Using the word-boundary characters, we can specify that, in order to match, a string must include a word with exactly four digits. This example shows how that might be done.
> (newvec <- grep ("\\<\\d{4}\\>", dt, value = TRUE))
[1] "Balance due 16 Jun or earlier in 2017"
[2] "26 Aug or any day in 3018"
[3] "'76 Trombones' marched in a 1962 film"
[4] "4 Apr 2018"
> grep (mo, newvec, value = TRUE)
[1] "Balance due 16 Jun or earlier in 2017"
[2] "26 Aug or any day in 3018"
[3] "4 Apr 2018"
In the first command, we found strings that contained words of exactly four digits and saved the result into the new item newvec. The second command then searched for the month string in that new vector. We did not look for the days in this example, but in practice we very often use multiple passes (sometimes with invert = TRUE) in order to extract the set of strings we need. It may be more computationally efficient to call grep() only once for any problem, but that may not be the fastest route overall.
4.4.5 The regexpr() Function and Its Variants
While grep() identifies strings that match patterns, the regexpr()
function is more precise: it returns the location of the (first) match within the
string – that is, the number of the first character of the match. We can use this
information to not only identify strings that contain numbers but also extract
the number itself. This example shows the result of calling regexpr() with
a pattern that looks for the first stand-alone integer in each string. It is not
enough to extract a set of digits because that would match strings such as
11-dimensional or B2B. Word boundaries provide the mechanism for
specifying an integer, as seen here:
> (regout <- regexpr ("\\<\\d+\\>", dt))
[1] 13  1  2  1 -1  1
attr(,"match.length")
[1]  2  2  2  1 -1  2
attr(,"useBytes")
[1] TRUE
The regexpr() function returns a vector (plus some other information we describe below). The vector, which starts with 13, shows the number of the character where the first integer begins. For example, the number 16 in the first element of dt appears starting at the 13th character in that string, the number 26 starts in the 1st character of the second element, and so on. The -1 in the fifth position indicates that that string does not contain an integer as a word.
The function also returns attributes, extra pieces of information attached to its output. The match.length attribute in this case gives the length of the match – so the first element is 2 because the integer in the first string is two characters long; the fourth is 1 because the integer in the fourth string is one character long. (We will not need the useBytes attribute.) We could extract the match.length vector using the attr() function, and then use substring() to extract the numbers in the strings. But a more convenient alternative is provided by regmatches(), which takes the initial string and the output of regexpr() and performs the extraction for us, as in this example.
> regmatches (dt, regout)
[1] "13" "26" "76" "4" "99"
There are five entries in this vector because only five of the strings contained integers. (The -1 values in the original vector remind you of which strings did not produce values here.)
Finding All Matches
The regexpr() function finds the first instance of a match in a vector of strings. To find all the matches is only a little more complicated. We use the gregexpr() function, the g evoking “global.” The return value of gregexpr() is a list, not a vector, because some strings may contain many integers. However, regmatches() works on this return value just as it does for regexpr(). In this example, we extract all of the integers from each of our strings in one command.
# Note that some output from this command is suppressed
> (gout <- gregexpr ("\\<\\d+\\>", dt))
[[1]]
[1] 13 34
attr(,"match.length")
[1] 2 4
...
[[2]]
[1] 1 22
attr(,"match.length")
[1] 2 4
...
[[6]]
[1] 1 27
attr(,"match.length")
[1] 2 5
...
> regmatches (dt, gout)
[[1]]
[1] "13" "2017"
[[2]]
[1] "26" "2018"
[[3]]
[1] "76" "1962"
[[4]]
[1] "4" "3018"
[[5]]
character(0)
[[6]]
[1] "99" "20188"
> matrix (as.numeric (unlist (regmatches (dt, gout))),
ncol = 2, byrow = TRUE)
[,1] [,2]
[1,] 13 2017
[2,] 26 2018
[3,] 76 1962
[4,] 4 3018
[5,] 99 20188
Here, the result of the call to regmatches() is a list of length 6, one for each string in the original dt. The fifth entry in the list is empty because the fifth entry of dt had no integers that were words. The final command shows one way you might form the list into a two-column numeric matrix, a useful step on the way to constructing a data frame. A second approach would use do.call() and rbind().
Greedy Matching
By default, regular expression matching is “greedy” – that is, matches are as long as possible. As an example, consider using the pattern \\d.*\\d to find a digit, zero or more characters, and a second digit in the string "4 Apr 3018". You might expect the regular expression engine to find the string "4 Apr 3", but in fact it gathers as much as possible – "4 Apr 3018", stopping at the final 8. Adding a question mark makes the match “ungreedy” (or “lazy”), so that \\d.*?\\d produces "4 Apr 3".
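This behavior is easy to verify with regexpr() and regmatches() from above; here is a small sketch:
> x <- "4 Apr 3018"
> regmatches (x, regexpr ("\\d.*\\d", x))   # greedy
[1] "4 Apr 3018"
> regmatches (x, regexpr ("\\d.*?\\d", x))  # lazy
[1] "4 Apr 3"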
4.4.6 Using Regular Expressions in Replacement
In addition to finding matches, R has tools that allow you to replace the part of the string that matches a pattern with a new string. These are sub(), which replaces the first matching pattern, and gsub(), which replaces all the matching patterns. The replacement text is not a regular expression. For example, here is a vector of four character strings. In the first example, we replace the first lower-case i with the number 9. In the second, we replace the first instance of either i or I with 9, and in the last, we replace all instances of either one with 9.
> (mytxt <- c("This is", "what I write.",
"Is it good?", "I'm not sure."))
[1] "This is" "what I write." "Is it good?" "I'm not sure."
> sub("i", "9", mytxt) # replace first i with 9
[1] "Th9s is" "what I wr9te." "Is 9t good?" "I'm not sure."
> sub("[iI]", "9", mytxt) # replace first (i or I) with 9
[1] "Th9s is" "what 9 write." "9s it good?" "9'm not sure."
> gsub("[iI]", "9", mytxt) # replace all (i or I) with 9
[1] "Th9s 9s" "what 9 wr9te." "9s 9t good?" "9'm not sure."
Sometimes the text being matched is needed in the replacement. This can sometimes be done very neatly using “backreferences.” When a regular expression is enclosed in parentheses, its matching strings get labeled by integers and can be re-used in the replacement string by referring to them as \1, \2, and so on – of course, to be typed into R as \\1, \\2, and so on. In this example, we are given names in the form “Firstname Lastname” and asked to produce names of the form “Lastname, Firstname.”
> folks <- c("Norman Bethune", "Ralph Bunche",
"Lech Walesa", "Nelson Mandela")
> sub ("([[:alpha:]]+) ([[:alpha:]]+)", "\\2, \\1", folks)
[1] "Bethune, Norman" "Bunche, Ralph" "Walesa, Lech"
[4] "Mandela, Nelson"
The first argument to the sub() command gave the pattern: a series of one or more letters (captured as backreference 1), a space, and another series of letters (backreference 2). The replacement part gives the second backreference, then
a comma and space, and then the first backreference. We note that this task is
more complicated with people whose names use three words, since sometimes
the second word is a middle or maiden name (as with John Quincy Adams or
Claire Booth Luce) and sometimes it is part of the last name (Martin Van Buren,
Arthur Conan Doyle) – and of course some people’s names require four or more
words (Edna St Vincent Millay, Aung San Suu Kyi).
4.4.7 Splitting Strings at Regular Expressions
It is common to want to split a string whenever a particular character
occurs. is is more or less the opposite of the paste() operation. For
example, in our work we often construct a unique key to identify each of
our observations, using paste(). We might combine a company identifier,
transaction identifier, and date, with a command like key <- paste
(co.id, tr.id, date, sep = "-"). Of course, in this example, the
sep = "-" argument specifies a hyphen as the separator.
At a later time, it might be necessary to “unpaste” those keys into their individual parts. The strsplit() function performs this duty. In this example, strsplit(key, "-") produces a list with one entry for each string in key. Each entry is a vector of the parts that result when the key is broken at its hyphens; so if one key looked like 00147-NY-2016-K before the split, the corresponding entry in the output of strsplit() would be a vector with four elements (and no hyphens). If the key had two hyphens in a row, there would have been an empty string in the output vector. In this example, we show the effect of strsplit() on several keys constructed using hyphens.
> keys <- c("CA-2017-04-02-66J-44", "MI-2017-07-17-41H-72",
"CA-2017-08-24-Missing-378")
> (key.list <- strsplit (keys, "-"))
[[1]]
[1] "CA" "2017" "04" "02" "66J" "44"
[[2]]
[1] "MI" "2017" "07" "17" "41H" "72"
[[3]]
[1] "CA" "2017" "08" "24" "Missing" "378"
In cases like these, where the number of pieces is the same in every key, it is common to construct a matrix or data frame from the parts. We saw a similar example using the output of regmatches() in an earlier section. Here, we construct a character matrix in the same way. We can then use data.frame() to make the matrix into a data frame, although the columns of the latter will be character unless you then convert them explicitly.
> matrix (unlist (key.list), ncol = 6, byrow = TRUE)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "CA" "2017" "04" "02" "66J" "44"
[2,] "MI" "2017" "07" "17" "41H" "72"
[3,] "CA" "2017" "08" "24" "Missing" "378"
Note that the alternative do.call("rbind", key.list) produces the
same character matrix.
Unlike the sep argument to paste(), which is a character string, the second argument to strsplit() can be a regular expression. The strsplit() function also accepts the fixed, perl, and useBytes arguments as the other regular expression operators do. Because that second argument can be a regular expression, extra work is required to split at periods. The command strsplit(key, ".") produces a split at every character, since the period can represent any character, so it returns an unhelpful vector of empty strings. The command strsplit(key, "\\.") or its alternatives, strsplit(key, "[.]") or strsplit(key, ".", fixed = TRUE), will split at periods. Remember that the output of strsplit() is always a list, even if only one character string is being split.
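For example, a brief sketch of the difference:
> strsplit ("a.b.c", ".")[[1]]    # unescaped: every character matches
[1] "" "" "" "" ""
> strsplit ("a.b.c", "\\.")[[1]]  # escaped: split at periods only
[1] "a" "b" "c"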
4.4.8 Regular Expressions versus Wildcard Matching
The patterns used in regular expressions are more complicated, and more
powerful, than the sort of wildcard matching that many users will have
seen as part of a command-line interpreter. In wildcard matching, the only
special characters are *, meaning “match zero or more characters,” and ?,
meaning “match exactly one character.” So, for example, the wildcard-type
pattern *an? matches any string that ends with an followed by exactly one more character. R does not use wildcard matching, but it does allow you to convert a wildcard pattern, which R calls a “glob,” into a regular expression, by means of the glob2rx() function. For example, glob2rx("*an?") produces "^.*an.$". Notice the ^ and $ signs; glob2rx() adds those by default, but they can be omitted with the trim.head = TRUE and trim.tail = TRUE arguments.
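For example, a quick sketch of using the converted pattern with grepl():
> grepl (glob2rx ("*an?"), c("bane", "banjo", "ban"))
[1]  TRUE FALSE FALSE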
4.4.9 Common Data Cleaning Tasks Using Regular Expressions
Regular expressions make it possible to do many complicated things to text,
specific to your particular problem and data. There are a few operations, though, that seem to be called for in a lot of data cleaning tasks. In these sections, we describe how to do some of these.
Removing Leading and Trailing Spaces
One frequent need in text handling is removing leading and trailing spaces from
text. e regular expression "^ *" matches any string with leading spaces,
while " *$" matches one with trailing spaces. To match either or both of these,
we combine them with the pipe character, and use gsub() instead of sub()
since some strings will have both kinds of matches – as in this example.
> gsub ("^ *| *$", "", c(" Both Kinds ", "Trailing ",
"Neither", " Leading"))
[1] "Both Kinds" "Trailing" "Neither" "Leading"
Here, the embedded space inside "Both Kinds" does not match – it is neither leading nor trailing – and is not deleted. The command gsub(" ", "", vec) will remove all spaces in every element of vec.
Converting Formatted Currency into Numeric
We see something similar in formatted currency amounts such as
$12,345.67. Here, we need to remove the currency symbol and the
comma before converting to numeric. If the only currency sign we expected to
encounter was the dollar sign, we might do this:
> as.numeric (gsub ("\\$|,", "", "$12,345.67"))
[1] 12345.67
Recall that as.numeric() will accept and ignore leading and trailing spaces. More generally, we might delete a leading non-numeric character, along with any commas, like this:
> as.numeric (gsub ("^[^0-9.]|,", "", "$12,345.67"))
[1] 12345.67
> as.numeric (gsub ("^[^[:digit:]]|,", "", "$12,345.67"))
[1] 12345.67
In this example, the first ^ indicates “a string that starts with ···.” The [^0-9.] bracketed expression starts with a ^, meaning “not,” so that part means “anything except a digit or a dot.” The |, sequence says “or a comma,” so the regular expression will find a single leading non-numeric (and non-period) character, as well as any commas anywhere, and delete them all. (To delete a run of such leading characters, add a quantifier: ^[^0-9.]+.)
Removing HTML Tags
Occasionally, we come across text formatted with HTML tags. These are instructions to the browser regarding display of the material; so, for example, <b>Bold</b> formats the word “Bold” in bold face. Other tags indicate headings, delineate cells of tables or paragraphs, and so on. It can be useful to strip out all of the formatting information as a first step toward processing the text. Every tag starts with the angle bracket < and ends with >. So, given a character string txt, the command gsub("<.*?>", "", txt) will delete all the brackets (the < and > are treated literally) and all the text between any pair (here, the .* indicates “zero or more characters” and the ? instructs the engine to match in a lazy way).
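For instance, a minimal sketch (the txt string is made up for illustration):
> txt <- "<b>Bold</b> and <i>italic</i> text"
> gsub ("<.*?>", "", txt)
[1] "Bold and italic text"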
Converting Linux/OS X File Paths to R and Windows Ones
The Windows file system uses the backward slash, \, to separate directories in a file path, whereas Linux and Mac operating systems use the forward one, /. Suppose we are given a Linux-style path like /usr/local/bin, and we want to switch the direction of the slashes. The command gsub("/", "\\\\", "/usr/local/bin") will produce the desired result. To make the change in the other direction, the command gsub("\\\\", "/", "\\usr\\local\\bin") will convert Windows path separators to Linux ones. As an alternative in this case, we can specify the matching pattern exactly with a command like gsub("\\", "/", "\\usr\\local\\bin", fixed = TRUE).
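Here is how the first of those commands behaves; remember that print() shows each backslash doubled, while cat() shows the characters as stored:
> gsub ("/", "\\\\", "/usr/local/bin")
[1] "\\usr\\local\\bin"
> cat (gsub ("/", "\\\\", "/usr/local/bin"))
\usr\local\bin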
4.4.10 Documenting and Debugging Regular Expressions
Regular expressions are complicated, and debugging them is hard. It is annoying (and time-consuming) to try to fix a regular expression that you know is wrong but you’re not sure why. It is worse to have one that is wrong and not know it. There are online aids to diagnosing problems with regular expressions that we have found useful. An Internet search will turn up a number of helpful sites – but make sure that the site you find describes the regular expression type (POSIX with GNU extensions or PCRE) that you use. Because regular expressions are complicated, be sure to document them as well as you can. Write out the patterns you expect to match and the rules you use to match them.
4.5 UTF-8 and Other Non-ASCII Characters
4.5.1 Extended ASCII for Latin Alphabets
Up until now we have implicitly been dealing only with the “usual” characters,
those found on a keyboard used in English-speaking countries. The starting
point for the way characters are displayed is ASCII, a table that gives charac-
ters and the corresponding standard digital representations. ASCII provides
representations of only 128 characters, many of which are unprintable “con-
trol” characters, such as tab, new-line, or the command to ring the bell of an
old-fashioned teleprinter. Much of the text we handle in our work is of this
sort, but ASCII does not include codes for letters with accents or other diacrit-
ical marks, required for many Western European languages. Every computer today honors a much broader character set, often based on a standard named latin1, but realized in slightly different ways by different manufacturers. For example, Windows uses its own “Win-1252” table, which includes some characters not found in latin1, such as the Euro currency symbol and the curly “smart quotes,” and Apple OS X uses a table called “Mac OS Roman.” Each character has a hexadecimal representation – for example, the upper-case E with a circumflex, Ê, has code ca, and typing "\xca" into R (with quotation marks because this is text) will produce that character. The \x is used to introduce hexadecimal notation in R, and it is case-sensitive – \X may not be used – but the hexadecimal digits themselves are not case-sensitive. Entering text in hexadecimal is different from entering numeric values in hexadecimal. Typing 0xca produces the number whose hexadecimal value is ca, that is, the number 202. Typing "\xca" produces the character whose code in the character table is ca, that is, Ê.
Characters represented by their hexadecimal codes can be used just like regular characters, as arguments to grep() or other functions. (They can also be entered in other, different ways that depend on your computer and keyboard.) Almost all of these characters will display on your screen, depending on which fonts you have installed. One exceptional character is the so-called null character, which has code 00 (following the convention that every character requires two hexadecimal digits). This character is not permitted in R text; if needed, nulls can be held in objects of class raw, but they should be avoided. In Chapter 6, we describe how you can skip null characters when reading data from outside sources.
The Windows and OS X character codes generally coincide. The one commonly encountered character for which the two disagree is the Euro currency symbol, €, which was introduced after the latin1 standard was decided. In the Win-1252 table, the symbol has hexadecimal value 80, whereas in OS X, it has db.
4.5.2 Non-Latin Alphabets
Of course, the need for standardization goes beyond the Euro sign. Increasingly, with the availability of data from social media and other sources, analysts need methods to read, store, and process characters from very different languages such as Chinese, Arabic, and Russian. The computing communities have settled on Unicode, which is a system that intends to describe all the symbols in all the world’s alphabets. Unicode values are shown in R by preceding them with \U (or \u, but the upper-case U is more general). Unicode includes ASCII as a subset. For example, the lower-case letter k has an ASCII and Unicode representation as the hexadecimal value 6b, so typing "\U6b" or "\U006B" into R will produce a lower-case k. As with other hexadecimal encodings, Unicode characters may be entered in either case.
As a non-Latin example, the two Chinese characters 中国 represent the word “China” in (simplified) Chinese. These cannot be represented in ASCII, but their Unicode representations are (from left) "\U4E2D" and "\U56FD", and these values can be entered (inside quotation marks) directly in R, as in this example:
> "\U4e2d\U56fd"
[1] "中国"  # If fonts permit
> nchar ("\U4e2d\U56fd")
[1] 2 # Two characters...
> nchar ("\U4e2d\U56fd", type = "bytes")
[1] 6 # ...requiring six bytes in UTF-8
There are several ways to represent Unicode, but the most popular, particularly in web pages, is UTF-8. In this encoding, each character in Unicode is represented by one or more bytes. For our purposes, it is not important to know how the encoding works, but it is important to be aware that some characters, particularly those in non-European alphabets, require more than one byte. In the example above, the two Chinese characters take up six bytes in UTF-8.
Depending on your computer, its fonts, and the windowing system, the Chinese characters may not appear. Instead, you might see the Unicode representation (such as \U4e2d), an empty square indicating an unprintable character, an empty space, or even, on some computers, some seemingly garbled characters such as ä¸­å›½. Sometimes these characters indicate the latin1 encoding, but on some computers the very same representation will be used for UTF-8. You can ensure that the computer knows these characters are UTF-8 by examining their encoding (see the following section). The important point is that the display of UTF-8 characters can be inconsistent from one machine to the next, even when the encodings are correctly preserved. We talk about reading and writing UTF-8 text in Section 6.2.3.
4.5.3 Character and String Encoding in R
Handling Unicode in R requires knowledge of one more detail, which is “encoding.” R assigns an encoding to every element in a character vector (and different elements in a vector may have different encodings). ASCII strings are unencoded (so their encoding is marked as unknown); strings with latin1 characters (but no non-Latin Unicode) are encoded as latin1, and strings with non-Latin Unicode are encoded as UTF-8. The Encoding() function returns the encoding of the strings in a vector, and iconv() will convert the encodings. In the first example following, we create a latin1 string and look for the à character using regexpr(). The search succeeds whether the regular expression is entered in latin1 style (as "\xe0"), Unicode style (as "\Ue0"), or directly from the keyboard. In each case, the à is found in location 9 as we expect.
> (yogi <- "It's d\xe9j\xe0 vu all over again.")
[1] "It's d
ej
a vu all over again."
> Encoding (yogi)
[1] "latin1"
> c(regexpr ("\xe0", yogi), regexpr ("\ue0", yogi),
regexpr ("
a", yogi))
[1] 9 9 9
Different encodings only cause problems in the rare cases where the Win-1252
and Mac OS Roman pages disagree with Unicode, and the primary example of
this issue is, again, the Euro sign. In this example, we create a string containing
the Euro sign using the Windows value "\x80" (to repeat this example with OS X, use "\xdb"). We then use grepl() to check to see if the sign is found in the
string. R encodes the string as latin1 when it sees the non-ASCII character.
Here, the Euro sign in the latin1 string is not matched by the Unicode Euro,
but after the string is converted into UTF-8, the Unicode Euro is matched.
> (bob <- "bob owes me \x80123")
[1] "bob owes me 123"
> Encoding (bob)
[1] "latin1"
> (euro <- "\U20ac") # Assign Unicode Euro
[1] ""
> Encoding (euro)
[1] "UTF-8"
> grepl (euro, bob) # Is it there?
[1] FALSE
> (bob <- iconv (bob, to = "UTF-8")) # Convert to UTF-8
[1] "bob owes me 123"
> grepl (euro, bob) # Is it there?
[1] TRUE
Notice that iconv() has no effect on strings that contain only ASCII text. These will continue to have encoding “unknown.”
UTF-8 is vital for handling non-European text. Although the display is not
always perfect, R is usually intelligent about handling UTF-8 once it is read
in. UTF-8 text behaves as expected in regular expressions, paste() and other
string manipulation tools. R’s functions to read from, and write to, files also sup-
port the notion of encoding in UTF-8 and other formats. We talk more about
reading and writing UTF-8 in Chapter 6.
We have noted that the display of UTF-8 strings can be unexpected on some computers (at least, for some characters). Even on computers equipped with the correct fonts, though, an issue arises when UTF-8 characters are part of a data frame. When the print() function is applied to a data frame, it calls the print.data.frame() function, which in turn calls format(). This latter function, though, reacts poorly to UTF-8, often converting it into a form like <U+4E2D>. In this example, we create a data frame with those Chinese characters and show the results of printing the data frame.
> data.frame (a = "\U4e2d\U56fd", stringsAsFactors = FALSE)
a
1 <U+4E2D><U+56FD>
Here, the data.frame() command produced a data frame whose one entry was two characters. The data frame, as displayed by the print.data.frame() function, shows the <U+4E2D>-type notation. Despite the display, the underlying values of the characters are unchanged – as seen in the next command.
> data.frame (a = "\U4e2d\U56fd",
stringsAsFactors = FALSE)[1,1]
[1] "中国"
R shows the expected result because print() is being called on a vector, not a data frame.
Sometimes, UTF-8 is inadvertently saved to disk in the <U+4E2D> form as literal characters – < followed by U, and so on. At the end of the chapter, we show one way to reconstruct the original UTF-8 from this representation.
4.6 Factors
4.6.1 What Is a Factor?
A factor is a special type of R vector that looks like text but in many cases behaves like an integer. Factors are important in modeling, but they often cause trouble in data entry and cleaning. In this section, we describe how factors are created, how they behave, and how to get them to do what you want them to do.
Factors arise in several ways. You can create a factor vector from some other
vector using the factor() or equivalent as.factor() function; this will
often be a final step, after data cleaning has been completed and modeling is
about to start. Factors are also created automatically by R when constructing
data frames, or when character vectors are added into data frames, with the
data.frame() or cbind() functions (Sections 3.4 and 3.7.1), or when read-
ing data into R from other formats (Section 6.1.2). In both of these cases, the
behavior can be changed through a function argument or global option.
Factors are useful in a number of places in R but particularly in modeling.
They provide a natural and powerful way of representing categorical variables
in a statistical model. However, we recommend that you only turn character
vectors into factors when all the data cleaning is finished and it is time to start
modeling. Chapter 7 shows a complete data cleaning example in depth and
there we ensure that our character data starts out and remains as character.
Still, it is important to understand how factors work in R.
Think of a factor as having two parts. One part is the set of possible values, the “levels.” In a manpower example, the levels of a factor named “Gender” might be “Male” and “Female,” and perhaps a third called “Not Recorded.” The second part is a set of integer codes that R uses to represent and store the levels. These codes start at 1 and go up. By default, R assigns codes to levels alphabetically – so in this example, “Female” would be represented by 1, “Male” by 2, and “Not Recorded” by 3. The class() of a factor vector is factor, showing its special nature, but the mode() of a factor vector is numeric, and the typeof() is integer, referring to the underlying codes that R stores. The advantage of this representation is efficiency: in a data set of a million observations, it is clearly much more efficient to store a million small integers than to store millions of copies of longer strings.
4.6.2 Factor Levels
Once a set of levels is defined for a factor, it is resistant to change. If you try to
change a value of one of the elements of a factor vector to a new value that is
not already a level, R sets that value to NA and issues a warning. Conversely, if
you remove all the elements with a particular value from the vector, that value
is still one of the levels. In this example, we create a factor whose levels are the
three colors of a traffic light.
> (cols <- factor (c("red", "yellow", "green", "red",
"green", "red", "red")))
[1] red yellow green red green red red
Levels: green red yellow
> table (cols)
 green    red yellow
     2      4      1
We can tell that the result is showing factor levels, rather than character strings,
because there are no quotation marks and because R also prints out the levels
themselves. Notice that the levels consist of the unique values in the vector,
sorted alphabetically, and that the table() command performs as expected
on the factor vector. The levels and labels arguments to the factor()
function control the setting and ordering of the factor’s levels. In the following
example, we show what happens when we exclude the elements whose values
are green from the vector.
> cols[cols != "green"]
[1] red yellow red red red
Levels: green red yellow
> table (cols[cols != "green"])
 green    red yellow
     0      4      1
In this example, we see that the green level is still present in the vector, even though none of the elements in the vector have that value. Moreover, the table() command acknowledges the empty level. This can be annoying when many levels are empty, but it can also be helpful when, for example, levels are months of the year and some sources omit some months. In this case, tables constructed from the different sources can be expected to line up nicely.
Another way in which factor levels are resistant to change is shown in this example, where we try to change the value yellow to amber. We start by making a copy of cols called cols2.
> cols2 <- cols
> cols2[2] <- "amber"
Warning message:
In ‘[<-.factor‘(‘*tmp*‘, 2, value = "amber") :
invalid factor level, NA generated
> cols2
[1] red <NA> green red green red red
Levels: green red yellow
This assignment failed because amber is not one of the levels of the factor vector cols2. It would have been okay to assign to the yellow element of our vector the value green or red, because those levels existed in the factor. But trying to assign a new value, such as amber, generates an NA. Notice how that NA is displayed with angle brackets, as <NA>, to help distinguish it from a legitimate level value of NA.
The levels() function shows you the set of levels in a factor, and you can use that function in an assignment to change the levels. Here, we show how we might have changed the yellow level to have a different label.
> levels(cols)
[1] "green" "red" "yellow"
> levels(cols)[3] <- "amber"
> cols
[1] red amber green red green red red
Levels: green red amber
This operation changes only the level labels; the underlying integer values are not changed. Here, then, the labels are no longer in alphabetical order. We often want to control the order of the levels in our factors; a good example is when we tabulate a factor whose levels are the names of the months. By default, the levels are set alphabetically (April, then August, and so on, up to September) – this affects the output of the table() function (and more, like the way plots are laid out). The order of the levels can be specified in the original call to the factor() function, and we can re-order the levels using another call to factor(), as in this example:
> levels(cols)
[1] "green" "red" "amber"
> factor (cols, levels = c("red", "amber", "green"))
[1] red amber green red green red red
Levels: red amber green
In this example, we changed the order of the levels through use of the factor() function. To repeat, assigning to the levels() function changes only the labels, not the underlying integers. The following example shows one common error in factor handling, which is assigning levels directly.
> (bad.idea <- cols)
[1] red amber green red green red red
Levels: green red amber
> levels(bad.idea) <- c("red", "amber", "green")
> bad.idea
[1] amber green red amber red amber amber
Levels: red amber green
Here, the elements of bad.idea that used to be red are now amber. If you use this approach, make sure this is what you wanted.
The feature of R that causes more data-cleaning problems than any other, we think, is this: factor values are easy to convert into their integer codes, but we almost never want this. In the following section, we see an example of how having a factor can produce unexpected results.
4.6.3 Converting and Combining Factors
To convert a factor f to character, simply use as.character(f). Actually, the help files tell us that it is “slightly more efficient” to use levels(f)[f].
Here, the interior [f] is indexing the set of levels after converting f,
internally, to its underlying integer codes. Usually, we use the slightly less
efficient approach because we think it is easier to read. R’s conversion of factors
to integers can be useful when exploited carefully; this arises more often in
plotting than in data cleaning.
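A quick sketch comparing the two approaches on a made-up factor:
> f <- factor (c("b", "a", "b"))
> as.character (f)
[1] "b" "a" "b"
> levels(f)[f]    # same result
[1] "b" "a" "b"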
When a factor f has text labels that look like integers, it is tempting to try to convert it directly into a numeric vector using as.numeric(). This is almost always a mistake; convert levels to numeric only after first converting to character. This example shows how the direct conversion can go wrong. We start by creating a factor containing levels that look numeric, except that one of the values in the vector (and therefore one of the levels of the factor) is the text string Missing. This factor is intended to give the indices of elements to be extracted from the vector src.
> wanted <- factor (c(2, 6, 15, 44, "Missing")) # indices
> src <- 101:200 # vector to extract from
> as.numeric (wanted) # ...but this happens
[1] 2 4 1 3 5
> src[wanted]
[1] 102 104 101 103 105
When wanted is created, its text labels ("2", "6", ..., "Missing") are stored, together with its integer codes. By default, these are assigned according to the alphabetical order of the labels; so "15" gets level 1, "2" gets level 2, "44" gets level 3, and so on. When we enter src[wanted], R uses these integer codes to extract elements from src. If we actually want the 2nd, 6th, 15th, and so on elements of src, we have to convert the elements of wanted to character first, and then to numeric, as in this example.
> src[as.numeric (as.character (wanted))]
[1] 102 106 115 144 NA
Warning message:
NAs introduced by coercion
Here, the warning message is harmless – it indicates that the text Missing
could not be converted to a numeric value.
One time that the behavior of factors can be helpful is when we need to convert text into numeric labels for whatever reason. For example, given a character vector sex containing the values "F" and "M", we might be called on to produce a numeric vector with 0 for "F" and 1 for "M". In this case, factor(sex) creates a factor whose underlying codes are 1 and 2; as.numeric(factor(sex)) creates an integer vector with values 1 and 2; and therefore as.numeric(factor(sex)) - 1 produces a numeric vector with values 0 and 1.
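For example, a minimal sketch:
> sex <- c("F", "M", "M", "F")
> as.numeric (factor (sex)) - 1
[1] 0 1 1 0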
It is surprisingly difficult to combine two factor vectors, even if they have the
same levels. R will convert both vectors to their underlying integer codes before
combining them. Our recommendation is to always convert factors to charac-
ters before doing anything else to them. There is one happy exception, though,
when two or more data frames containing factors are being combined with
rbind() (Section 6.5). Other than in this case, however, combining factor vec-
tors will usually end badly. We recommend converting factors into character,
combining, and then, if necessary, calling factor() to return the new vector
to factor form.
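A short sketch of that recommended approach:
> f1 <- factor (c("a", "b")); f2 <- factor (c("b", "c"))
> factor (c(as.character (f1), as.character (f2)))
[1] a b b c
Levels: a b c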
4.6.4 Missing Values in Factors
Like other vectors, factors may have missing values. Missing values look like
NA values in most vectors, but in factors they are represented by <NA> with
angle brackets. is level is special and does not prevent you from having a real
level whose value is actually <NA>, but you should avoid that. (Analogously, it’s
permitted to have the string value "NA", and it is a good idea to avoid that, too.)
Values of a factor that are missing have no level. In this example, we create a
vector with missing values, and also with values that are legitimately "NA" and
"<NA>".
> (f <- factor (c("b", "a", "NA", "b", NA, "a", "c",
"b", NA, "<NA>")))
[1] b a NA b <NA> a c b <NA> <NA>
Levels: <NA> a b c NA # alphabetized by default
> table (f, exclude=NULL)
<NA> a b c NA <NA>
123112
> levels (f)
[1] "<NA>" "a" "b" "c" "NA"
The levels() function makes no mention of the true missing values, since they do not have a level. The first <NA> in the output of table() describes the final element of the vector, whereas the last <NA> refers to the two items that really were missing. Clearly there is a possibility of confusion here.
When elements of a factor vector are missing, the addNA() function can be
used to add an explicit level (which is itself NA) to the factor. More often we
want to replace the NA values with an explicit level so that, for example, those
entries are accounted for in the result of table(). In this example, we show
one way to add such a level.
> (f <- factor (c("b", "a", NA, "b", "b", NA, "c", "a")))
[1] b a <NA> b b <NA> c a
Levels: a b c
> f <- as.character(f) # Convert to character
> f[is.na (f)] <- "Missing" # Replace missings
> (f <- factor (f)) # Re-factorize
[1] b a Missing b b Missing c a
Levels: a b c Missing
Here, the factor is converted to character, missing values replaced by a value
like Missing, and then the vector converted back to factor.
4.6.5 Factors in Data Frames
Factors routinely appear in data frames, and, as we have mentioned, they are
important in R modeling functions. Factors inside data frames act just like fac-
tors outside them (except sometimes when printing, as we saw with Chinese
characters in an earlier example) – they have a fixed set of levels and they are
represented internally as integers. A few points should be noticed here. First,
as we mentioned above, R is not good at combining factor vectors on their
own, but when data frames containing factors are combined with rbind(),
R creates new factors from the factors in the input, extending the set of levels
to include all the levels from both data frames. The set of levels is formed by concatenating the two initial sets; the levels are not re-sorted. (If a column is factor but its corresponding column in another data frame is character, then the resulting combined column will have the class of the column in the first data frame passed to rbind().) Second, applying functions to the rows of a data frame containing factors can produce unexpected results. We discuss applying functions to the rows of a data frame in Section 3.5, and the concerns there apply even more to data frames containing factors. Our recommendation is to not use apply() functions on data frames, particularly with columns of different types. Instead, use sapply() or lapply() on columns, as sketched below. If you need to process rows separately, loop over the rows with a command like lapply (1:nrow(mydf), function(i) ...) where your function operates on mydf[i, ], the ith row of the data frame mydf.
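For example, a minimal sketch of the column-wise approach:
> mydf <- data.frame (num = c(1, 10), fac = factor (c("lo", "hi")))
> sapply (mydf, class)   # operate on columns, not rows
      num       fac
"numeric"  "factor"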
4.7 R Object Names and Commands as Text
4.7.1 R Object Names as Text
In some data cleaning problems, a large set of related objects needs to be created or processed. For example, there might be 500 tables stored in disk files, and we want to read them all into R, saving them in objects with names such as M2.2013.Jan, M2.2013.Feb, ..., M2.2016.Dec. Or, we might have data frames named p001, p002, ..., p100 and we want to run a function on each one. It is easy enough to construct the set of names using paste() and sprintf() (see Section 4.2.1). But there is a distinction between the name "p001" (a character string with four characters) and the R object p001
(a data frame). e R command get() accepts a character string and returns
theobjectwiththatname.(Ifthereisnoobjectbythatname,anerrorispro-
duced; the related function exists() cantesttoseewhethersuchanobject
exists, and get0() allows a value to be specified in place of the error.)
One place where get() is useful is when examining the contents of your workspace. The ls() command returns the names of the objects there; by using get() in a loop we can apply a function to every object in the workspace. For example, the object.size() function reports the size of an object in your workspace in bytes (by default). This function operates on an object, not a name in character form. So often we do something like this: first, we produce the set of names of the objects of interest, perhaps with a command like projNames <- ls(pattern = "^projA") to identify all the names of objects that start with projA. Then, the command sapply(projNames, function(i) object.size(get(i))) passes each name to the function, and the function uses get() to produce the object itself and report its size. The result is a named vector of the sizes of every object in the workspace whose name starts with projA.
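Here is a small sketch of that idiom; the projA objects are made up, and the sizes reported will vary by platform:
> projA.x <- 1:1000; projA.y <- letters
> projNames <- ls (pattern = "^projA")
> sapply (projNames, function(i) object.size (get (i)))
# returns a named vector of sizes in bytes, one per projA object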
The complement of get() is assign(). This takes a name and a value and creates a new R object with that name and value. Be careful; it will overwrite an existing object with that name. Assigning is useful when each iteration of a loop produces a new object. In the following example, we use a for() loop to create an item named AA whose value is 1, one named BB with value 2, and so on, up to an object ZZ with value 26. (We used double letters here to avoid creating an item F that might conflict with the alias for FALSE.)
> for (i in 1:26)
assign (paste0 (LETTERS[i], LETTERS[i]), i, pos = 1)
> get ("WW") # Example
[1] 23
# Remove the 26 new objects from the workspace
> remove (list = grep ("^[A-Z]{2}$", ls (), value = T))
The final command uses a regular expression to remove all items in the workspace whose names start (^) with two upper-case letters ([A-Z]{2}) and then come to an end ($). The remove() and rm() commands operate identically. Notice the pos = 1 argument to assign(). At the command line this has no effect. Inside a function it creates a variable in your R workspace, not one local to the function. We discuss the notions of global and local variables in Section 5.1.2.
4.7.2 R Commands as Text
It is also possible to construct R commands as text and then execute them. Sup-
pose in our earlier example that we have objects p001, p002, ..., p100 and we want to run a function report() on each one, producing results res001,
..., res100. We could use the get() and assign() approach from above,
like this:
nm <- paste0 ("p", sprintf ("%03d", 1:100))    # Object names
res <- paste0 ("res", sprintf ("%03d", 1:100)) # Result names
for (i in 1:100) {                  # Loop over indices
    result <- report (get (nm[i])) # Run function on object i
    assign (res[i], result, pos = 1)
}
But it is easy to think of more complicated examples where each call is different.
Perhaps the caller needs to supply the month and year associated with a file as
an argument, or perhaps each call requires an additional argument whose name
also varies. In these cases, it can be useful to construct a vector of R commands
using, say, paste0(), and then execute them. This requires a two-step process: first the text is passed to parse() with the text argument, to create an R “expression” object; then the eval() function executes the expression. For example, to compute the logarithm of 11 and assign it to log.11 we can use the command eval(parse(text = "log.11 <- log(11)")). After this command runs, our R workspace has a new variable called log.11 whose value is about 2.4.
Imagine having objects p001, p002, ..., p100 and also q001, q002, ..., q100, and suppose we wanted to run res001 <- report(p001, q001), res002 <- report(p002, q002), and so on. It is simple to construct a set of 100 character strings containing these commands:
> num <- sprintf ("%03d", 1:100) # 001, 002, etc.
> pnm <- paste0 ("p", num)
> qnm <- paste0 ("q", num)
> rnm <- paste0 ("res", num)
> cmd <- paste0 (rnm, " <- report (", pnm, ", ", qnm, ")")
> cmd[45] # as an example
[1] "res045 <- report (p045, q045)"
Now all 100 reports can be run with the command eval(parse(text = cmd)). This approach can both save time and cut down on the errors associated with copying and modifying dozens – or hundreds – of similar lines of code.
As a final example, we encountered a problem with some UTF-8 data
(Section 4.5), which we solved with regular expressions (Section 4.4) and
eval(). Under some circumstances, UTF-8 can be saved to disk as ASCII in
a form like "<U+4E2D><U+56FD>" – that is, with a literal representation of <, U, +, and so on. To convert this into “real” UTF-8, we used regular expressions and the gsub() command to delete each > and to replace each <U+ with \U. Of course, + and \ are special characters and will need to be escaped. Then we surrounded the entire result in quotation marks. The resulting string
contains what we might have typed in at the R command line, and when it is
executed with parse() and eval(), the UTF-8 characters are produced, as
in this example:
> inp <- "<U+4E2D><U+56FD>" # ASCII (not UTF-8)
> (out <- gsub (">", "", inp)) # remove > chars
[1] "<U+4E2D<U+56FD"
> (out <- gsub ("<U\\+", "\\\\U", out)) # change <U+ to \U
[1] "\\U4E2D\\U56FD"
> (out <- paste0 ("\"", out, "\"")) # add quotes
[1] "\"\\U4E2D\\U56FD\""
> eval (parse (text = out))
[1] "中国"
4.8 Chapter Summary and Critical Data Handling Tools
This chapter discusses character data, which forms an important part of almost every data cleaning project. Even if you have very little data in text form, you will need to be proficient at handling text in order to modify column names or to operate on multiple files across multiple directories. This chapter includes discussion of these important R tools:
• The substring() function, which extracts a piece of a string as identified by the starting and ending positions. This function and the others in this chapter are made more powerful by the fact that they are vectorized, so they can operate on a whole set of strings at once.
• The format() and sprintf() functions, which help convert numeric values into nicely formatted strings. sprintf() in particular provides a powerful interface for formatting values into report-like strings. Also handy here is the cut() function, which lets us convert a numeric variable into a categorical one for reporting or modeling.
• The paste() and paste0() functions. These combine strings into longer ones in a vectorized way. We use the paste functions in every data cleaning project.
• Regular expression functions. These functions (grep() and grepl(), regexpr() and gregexpr(), sub() and gsub(), and strsplit()) use regular expressions to find, extract, or replace parts of strings that match patterns. The power of regular expressions comes from the flexibility that the patterns provide. Regular expressions form a big subject, but we find that even a limited knowledge of them makes data cleaning much easier and more efficient.
• Tools for UTF-8. UTF-8 describes a particular, popular encoding of the set of Unicode characters. More and more data cleaning problems will involve non-Roman text, and R provides tools for handling these strings.
• Factors. Factors are indispensable in some modeling contexts in R, and they provide for efficient storage of text items. In data cleaning tasks, however, they often get in the way. Remember to convert factors, even ones that look numeric, into character before converting the result into numeric.
• The get() and assign() functions. These let us manipulate R objects by name, even when the name is held in an R object. This can make some repetitive tasks much simpler. The combination of parse() and eval() lets us construct R commands and execute them – again, allowing us to execute sequences of commands once we have created them with paste() and other tools.
5
Writing Functions and Scripts
Functions and scripts are two methods by which we can do repetitive tasks eas-
ily. ey have similar goals, but they operate in different ways. In our work, we
use both, and every data cleaning project will require that you create functions
or scripts – probably both but almost certainly scripts, since R already has lots
of functions to perform lots of necessary tasks – and, of course, we have met
many of these in earlier chapters.
Writing functions is more difficult than writing scripts because there are
strict rules about what functions can do and how they operate. In contrast, a
script is very often just a saved set of commands that you typed in to accomplish
a particular task. Of course, the commands you type in are themselves calls to
R’s built-in functions, and sometimes you need to do something for which no
function has been written. In that case, you may have to write your own. In this
chapter, we describe what functions and scripts do and their relative strengths.
5.1 Functions
A function is a special R object. If you have made it this far in the book, you probably know a lot about how functions work. But we want to repeat some of the details here, to make clear the important points that will come up when you start writing your own. A function’s text starts with the reserved word function, then it has its list of arguments inside parentheses, and then it has the body of the function enclosed in braces. If the body is only one line, the braces are unnecessary, but we recommend that you use them anyway. For example, suppose you needed a function that takes the numbers x and y as input and returns the value sqrt(x) + y. If x is negative, the value of sqrt(x) is not defined, so the function issues a warning and returns NA. In the following code, we define this function and assign it to an R object named funk. Notice that in the function “declaration” (where the arguments are specified), the y argument is given the default value of 2. If no value is entered for x, the function will fail, because x
has no default value, but if no value is entered for y, the default value of 2 is used. This code shows the definition of the function.
> funk <- function (x, y = 2) {
# Example function to compute sqrt (x) + y
# Arguments: x, numeric;
# y, numeric
if (x < 0) {
warning ("Negative number cannot be funk-i-fied")
return (NA)
}
return (sqrt(x) + y)
}
> funk (x = 9, y = 3) # run with x = 9, y = 3
[1] 6
In the final command we run the function, passing in both arguments explic-
itly. Just typing the name funk, without parentheses, causes R to print out the
function itself. Notice that our function includes some comments. In this case,
we have listed the arguments together with their types. It is important to doc-
ument all of your functions, even the ones you only plan to use yourself. In
addition to the arguments, this documentation might include the date, version
number, author, or other relevant information.
There are several aspects to a function, and you will need to understand all of them to use functions properly. These include the arguments (information passed into the function), the return value (information computed and returned), and side effects (actions by the function beyond simply computing the return value). This section describes the different pieces of an R function.
5.1.1 Function Arguments
An argument is a value passed to a function. The aforementioned funk function has two arguments named x and y. R functions may have as many arguments as needed (or none at all). Arguments may be vectors or data frames or lists or functions or any other R object. This means that if you develop a function, particularly for other users, the first thing it should do is to ensure that the arguments are of the types expected by the user. For example, the funk function checks to see that the argument x is a non-negative number. But what happens if the user passes a data frame or a character string or a list? The function developer needs to detect these unexpected inputs and stop gracefully. We discuss error handling in Section 5.3.
Argument Matching
When a user calls a function, he or she may specify the arguments by name.
In this case, R knows unambiguously which input goes with which argument.
So, in our example, the call funk(x = 4, y = 1) will produce the result 3, and so too will the call funk(y = 1, x = 4). The user may also specify arguments without naming them. In this case, R matches arguments in the call to arguments in the function declaration by position. Since funk listed x before y, the call funk(9, 3) is equivalent to funk(x = 9, y = 3).
Arguments are matched by partial names in a way similar to the way that the elements of a list can be matched by an unambiguous substring (see Section 3.3.1). If a function g has arguments dimension, data, and subset, for example, the user may specify s, su, or sub to supply the subset argument. An argument d is ambiguous and will produce an error, but da would suffice to supply the data argument. While this is permitted, we recommend using full names where possible. This enhances readability and lessens the chance of confusion if a function is updated later to include more arguments.
If some arguments are named and others are not, named arguments are
matched by name and the others by position. In the example of the previous
paragraph, the call g(data = 2, 5, 3) will assign 2 as the data
argument, then 5 as the dimension argument, and then 3 as subset. In
interactive work we often match arguments by position, but when constructing
code we plan to save, re-use, distribute or archive we try to specify arguments
by name.
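To make this concrete, here is a small sketch of the different matching styles;
the function g and its body are our own invention, used only for illustration.
> g <- function (dimension, data, subset) {
      cat ("dimension =", dimension, " data =", data,
           " subset =", subset, "\n")
  }
> g (1, 2, 3)              # purely positional
dimension = 1  data = 2  subset = 3
> g (su = 3, da = 2, 1)    # partial names; the 1 is matched by position
dimension = 1  data = 2  subset = 3
> g (data = 2, 5, 3)       # the mixed call from the text
dimension = 5  data = 2  subset = 3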
The Ellipsis
Some functions are defined with one special argument, the ellipsis (...). This
is R's mechanism for allowing functions to accept variable numbers of arguments.
The ellipsis captures all of the arguments that are not otherwise matched
and presents them to the function as a list. One complication is that the names
of function arguments defined after the ellipsis may not be abbreviated. For
example, the table() function takes the ellipsis as its first argument; this
allows us to pass as many vectors as needed into the function. Subsequent arguments
include the exclude and useNA arguments described in Section 2.5. In
this example, we show what happens when one of those names is abbreviated.
> table (c(1, 1, 2, 3, 1, NA, 2, 1), use = "always")
Error in table(c(1, 1, 2, 3, 1, NA, 2, 1), use = "always") :
all arguments must have the same length
> table (c(1, 1, 2, 3, 1, NA, 2, 1), useNA = "always")
   1    2    3 <NA>
   4    2    1    1
In the first command, table() sees two vectors to tabulate. The use argument
is insufficient to act as the useNA instruction, so it is treated as an input
vector of length 1 (with the value "always"). Unable to create a two-way table
from vectors of different lengths, table() produces an error. In the second
command, the complete name successfully indicates the action to perform with
the NA. We can examine the arguments of table() using the args()
function, as follows:
> args (table)
function (..., exclude = if (useNA == "no") c(NA, NaN),
useNA = c("no", "ifany", "always"),
dnn = list.names(...), deparse.level = 1)
NULL
This output shows that the ellipsis precedes the useNA argument in the definition.
Therefore, the name useNA needs to be specified in its entirety.
Very often the ellipsis is one of the first arguments in the function definition.
This has two consequences. First, arguments defined after the ellipsis must be
matched by name exactly. If not – if you supply an argument whose name is
not in the declaration – that argument will often “fall into” the ellipsis and be
ignored. For example, when you type the name of a data frame at the command
line, R invokes a particular function called print.data.frame(). This
function takes an argument called digits that specifies the precision of the
printing of numeric columns – but this argument is defined only after the
ellipsis, so its name must be matched exactly. Specifying digit = 3, for
example, has no effect – the name digit does not match an argument exactly,
so the argument digit = 3 is given to the ellipsis, which ignores it. In this
example, we construct a data frame and then use print.data.frame() to
display it.
> (newdf <- data.frame (a = c(1, 2.345)))
a
1 1.000
2 2.345
> print.data.frame (newdf, digits = 2)
a
1 1.0
2 2.3
> print.data.frame (newdf, digit = 2)
a
1 1.000
2 2.345
Here, the mis-typed argument name digit has had no effect on the output,
but no error or warning is produced. This points up the importance of comparing
the output you see to what you expect. Similarly, arguments that are
unmatched are often ignored in functions with the ellipsis; in this example, we
show how a made-up argument name does not produce an error or warning
in print.data.frame() – but it does in the log() function, which is
defined with no ellipsis.
> print.data.frame (newdf, NOTANARGUMENT = 1)
a
1 1.000
2 2.345
> log (newdf) # this is valid, but...
a
1 0.0000000
2 0.8522854
> log (newdf, NOTANARGUMENT = 1) # ...this is not.
Error in log(newdf, NOTANARGUMENT = 1) :
unused argument (NOTANARGUMENT = 1)
In this example, we used the print.data.frame() function to print a
data frame. However, we could have just used print(); R will detect that
the object being printed is a data frame and use the appropriate printing
function. We saw this in Section 4.5.3. This practice of having “generically
named” functions such as print() call functions specific to a data type is an
example of “object-oriented programming.” In this approach, the class of an
object being operated on helps to determine the operation being performed.
We have seen other examples of this behavior earlier in the book. For example,
we saw how seq() calls seq.POSIXt() in Section 3.6.6. We talk a little
more about object-oriented programming in Section 5.6.3.
Most commonly, ellipses are used in functions that will call other functions
and pass arguments down to the function being called. For example, lots of
functions that we write draw plots using the plot() function. The plot()
function accepts dozens of possible arguments reflecting the values of “graphical
parameters” such as colors, line sizes, typefaces, and axes. Rather than
prepare our plot function for every possible argument, we will often create a
function similar to the following one.
myplot <- function (x, y, ...) {
    # Do stuff here
    plot (x, y, ...)    # Call plot()
}
In this example, any arguments after x and y are passed to plot() just as they
were supplied by myplot(), in the same order with the same names. If we
needed access to the individual arguments passed in the ellipsis, we could capture
those arguments with a command such as mylist <- list(...) and
use mylist like any other R list. In this example, we examine the arguments
passed to our function myplot() to see if the argument xlab is among them,
and, if so, print it.
> myplot <- function (x, y, ...) {
      mylist <- list (...)    # grab extra arguments
      plot (x, y, ...)        # Call plot()
      if (any (names (mylist) == "xlab"))
          cat ("xlab was supplied as ", mylist$xlab, "\n")
  }
> myplot (1:5, 1:5, xlab = "Plot of x vs y")
xlab was supplied as Plot of x vs y
Modifying an argument and passing it along as part of the ellipsis requires
some thought. Suppose our standard required that x-axis labels always be in
upper case. To modify the value of the xlab argument, the easiest way is to add
the x and y arguments to mylist, and then to invoke the plot() function
via do.call(plot, mylist). We rarely need to do this, but for advanced
users we give here an example of how it might be done.
myplot <- function (x, y, ...) {
    mylist <- list (...)    # grab extra arguments
    if (any (names (mylist) == "xlab"))
        mylist$xlab <- casefold (mylist$xlab, upper = TRUE)
    mylist$x <- x; mylist$y <- y
    do.call (plot, mylist)
}
Missing Arguments
Sometimes users do not pass an argument, because it is optional, because they
are satisfied with the default value, or by mistake. The missing() function
returns TRUE if an argument is missing. If an argument named arg has not
been supplied and has no default value, then any reference to arg in the
code, other than a call to the missing(arg) function, will produce an error.
Here, missing(arg) will produce TRUE and the code can then determine
what action to take. When used inside a function, the missing() function
should be used only near the beginning of a function (since, e.g., if arg gets
assigned in the code, then missing(arg) will subsequently be FALSE). An
argument that is not passed explicitly is considered to be missing even if it has
a default value.
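As a small illustration, the function center() below (our own toy example)
uses missing() to choose a default behavior when no weights are supplied.
> center <- function (x, weights) {
      # use a weighted mean only if weights were actually passed
      if (missing (weights))
          return (mean (x))
      return (weighted.mean (x, weights))
  }
> center (c(1, 2, 9))
[1] 4
> center (c(1, 2, 9), weights = c(1, 1, 0))
[1] 1.5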
5.1.2 Global versus Local Variables
A local variable is one that exists only inside your function. All of the variables
inside your function are local. That is, when the function starts, it creates a special
area of memory where local variables are stored and manipulated, and when
the function ends, that area of memory (and the variables in it) is destroyed.
Other variables are global; most often global variables are in your workspace.
The R variables in the workspace that are supplied as values to the function are
untouched (but see “side effects,” as follows). This example demonstrates how
workspace values passed to a function are unchanged.
> new <- function (a = 5) {
      a <- a + 1
      cat ("a is now", a, "\n")
      return (a)
  }
> a <- 11
> new (a)
a is now 12
[1] 12
> a
[1] 11
The value of the global variable a in the workspace is 11 before the call and 11
after. The local a, inside the function, changes, but that change has no effect on
the global one. Another way to say this is that R uses the “call by value” approach.
(In one alternative scheme, “call by reference,” the item passed into the argument
is a reference to the global variable, so changes to the reference produce
changes in the global variable.) There are some packages that implement “call
by reference” in R, and this approach can bring efficiencies, but discussion of
this strategy is beyond the scope of this book.
It is possible to access global variables directly from inside a function, simply
by referring to them by name. We might use system-defined global variables,
such as pi or letters, in our functions, but you should avoid using
your own workspace objects directly in your functions. The objects in your
workspace can change, and in any case such a function would not be usable
by another user. Instead, pass into the function all of the values it will need as
arguments. The one exception to this informal rule is when writing a function
to be used by a one-time application of apply() or its relatives, when operating
on, for example, the rows of a data frame. Recall from Section 3.5 that
it is unwise to use apply() directly on a data frame's rows. In that section,
we operated on the rows of a data frame named dd with code such as
sapply(1:nrow(dd), function (i) any (dd[i,] == 1)). Although
the embedded function uses the workspace variable dd, this approach is powerful
and convenient.
As a side observation, note the cat() statement designed to print out some
text and the value of a in the aforementioned example. At the command line,
you can simply type a to have the system print its value. Inside a function,
though, a line with only a on it has no effect – the value of a is evaluated, which
in more complicated examples might require some processing, but nothing is
assigned or printed. To print a value to the screen from inside a function, use
cat() or print() explicitly.
5.1.3 Return Values
The most important thing functions do is to produce a return value, which is
the result of the computation done by the function. In R, a function always
produces one return value, so if you want to compute different things inside a
function, and return them, you have to combine them into a vector, matrix, data
frame, or list. A function returns whatever is inside the first call to return()
that it encounters; if there isn't a call to return(), the function returns whatever
the last line it executes produces. You can hide a function's return value
with the invisible() command, but every function has an output. If you
assign the output of a function with an invisible return value, that value is preserved
in the usual way.
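Here is a brief sketch, using two made-up functions, of returning several items
by combining them into a list and of hiding a return value with invisible().
> stats2 <- function (x) {
      # return two computed results by combining them into a list
      return (list (mean = mean (x), sd = sd (x)))
  }
> quiet <- function (x) {
      # the return value is hidden, but can still be assigned
      invisible (x * 2)
  }
> stats2 (c(1, 2, 3))
$mean
[1] 2

$sd
[1] 1
> quiet (10)       # prints nothing
> z <- quiet (10)
> z
[1] 20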
Side Effects
Functions can do other things beyond producing return values. These things are
called side effects. One important side effect is to produce graphics. It is also
possible to change global variables with a function, separate from the return
value (see the discussion of assign() in Section 4.7.1), and indeed sometimes
R intentionally modifies global variables without using the return value.
For example, the fix() function (Section 5.1.4) creates and modifies functions,
and the edit() function (Section 3.7.4) modifies data frames and other
R objects. We do not recommend changing global variables in your code.
A third, more benign, side effect is to print out information for debugging
purposes. This is very handy, particularly when the function needs to run hundreds
or thousands of times. We talk more about debugging in Section 5.3.
Another common side effect is to write files to the disk. These might be text
files with data in them, they might be tables of results, they might be graphics,
or something else. We discuss the ways to get data out of R in Chapter 6.
Side effects can be good or bad. It is important to recognize when one of your
functions produces a side effect. Such a function might even behave differently
at a different time, or on another machine.
When a Function Produces Errors or Warnings
A function that has been successfully created or edited is always syntactically
correct. However, functions can still produce errors, either because the inputs
were unexpected, because you tried to read a file that didn't exist, because
R encountered an NA for which it was unprepared, because R tried to perform
arithmetic on a character vector, or for one of many other reasons. When an
error results, an error message is produced and the function terminates. Local
variables are lost and no return value is produced. However, any side effects
produced by code before the error still occur.
Sometimes, functions produce warnings rather than errors. A function that
produces warnings will nonetheless return a value. Still, unless you’re sure you
understand the cause of the warnings, we recommend that you not ignore them.
We talk more about errors, warnings, and debugging in Section 5.3.3.
Cleaning Up
When a function completes, there might be a few bookkeeping-type tasks
that remain before the result can be returned. For example, files or other
connections (Chapter 6) might need to be closed, graphics devices reset, and
warnings re-armed. Moreover, we usually want these actions performed even if
the function exits prematurely because of an error. The on.exit() function
allows you to specify one or more expressions to be evaluated however the
function ends. We show an example in Section 6.2.6.
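As a minimal sketch (the function and file name are our own invention),
on.exit() guarantees that cleanup happens even if an error interrupts the
function.
readfirst <- function (fname) {
    con <- file (fname, "r")    # open a connection
    on.exit (close (con))       # closed even if readLines() fails
    readLines (con, n = 1)
}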
5.1.4 Creating and Editing Functions
It is possible to type a function in at the command line, but this is almost never
practical. Instead, R provides a simple interface to one of your system's editing
programs to allow you to create and edit functions. The R function is called
fix(). You can create a new function newf by entering fix(newf) and the
editor will appear with an empty function skeleton that looks like this:
function ()
{
}
If the function newf already exists, the fix(newf) command will open
the existing function for editing. The exact editor that R uses depends
on what is available on your system, and its name can be displayed by
options()$editor. When you are done editing, exit the editor. In Windows,
with the default editor, you can do that by clicking the red X in the top
right and choosing “Yes” to keep modifications; in OS X's default, click the red
dot in the top left and choose “Save.” Do not choose “Save As” from the File
menu. In Linux using the default “vi” editor, type :wq to save and quit, or
:q! to exit without saving your changes. Other editors may require different
keystrokes, of course. If you asked for modifications to be saved, R checks to
see if the new version of the function has no errors and, if so, saves it so that
the function is ready to use.
If R detects that errors have been introduced, it will produce an error message
that, apart from the details, might look something like this:
Error in .External2(C_edit, name, file, title, editor) :
unexpected symbol occurred on line 3
use a command like
x <- edit()
to recover
The second line contains the meat of the error message: the cause (in this case,
“unexpected symbol,” but many other choices are possible) and the location
(here, line 3). The last part of the error message is a specific instruction that
is unclear to many R beginners. It says that to resume editing the flawed function
newf, we should enter the command newf <- edit(). The edit()
command, without any argument, edits the function most recently operated on;
here this command re-edits our newf function. If our modifications produce
a valid function, that function is saved to the object newf. Otherwise, the next
error detected will be reported, and we can enter newf <- edit() to edit
the function once again. You will need to produce an error-free version of your
function before R will save it. Edit() without an argument always operates on
the most recently edited function. We recommend using it only immediately
after encountering an editing error and otherwise using fix().
Reading and Writing Functions to and from Disk
It is easy to store functions in readable text files, so this is a natural way to
save and distribute them. We prefer to save functions using the dump()
command. Dump() creates a disk file with the exact text of your function
(including comments, blank lines, and any UTF-8 characters). The first argument
to dump() is a vector of the names of the items to be dumped. The second
argument gives the name of the disk file to be created. For example, to dump a
single function named newf to a file named newf.txt, you can use the command
dump("newf", "newf.txt").
Importantly, dump() also adds an assignment line at the top of the disk
file – so if you dump a function named newf, the first line of the file looks
like newf <-; the second line starts with function and begins the function
definition.
Once a function is on disk, it can be read back into R using the
source() command. This command reads disk files containing R code
and executes the code in your R session. In this example, the command
source("newf.txt") will read the file and re-create the function in your
workspace with its original name. This will over-write an R object named
newf if one exists.
It is equally possible to call dump() with a vector of names of R objects. This
vector can include the names of lists, data frames, or other R objects as well
as functions. So, this is one quick way to transport a set of objects from one
machine or user to another.
However, dump() is not equipped to handle certain complicated objects.
The saveRDS() function (the letters RDS evoke “R data serialization”)
takes an object and produces a binary disk file with all of that object's data
and attributes. Each call to saveRDS() applies to exactly one R object.
So, for example, saveRDS(myobj, "newfile") produces a disk file
named "newfile" with a binary representation of the R object myobj.
The complementary action is performed by readRDS(): in this case,
readRDS("newfile") returns the object just as it was saved. Note
that readRDS() does not replace the existing myobj; instead, it simply
returns the object. You can assign the return value with a command such
as newobj <- readRDS("newfile"), which will create a new object
newobj that is identical to the original myobj.
The save() function operates on sets of objects. We call it by passing the
names of objects (in quotation marks) rather than passing the objects themselves.
The resulting file should be readable by R on any machine or operating system.
(Make sure that the two machines share the same character set, like UTF-8.)
Moreover, the file can be compressed at the time it is created. The complement
of save() is load(); this function reads in a file created by save() and
re-creates the items specified at the time the file was created. An important
distinction between load() and readRDS() is that with load(), all the
objects are automatically re-created with their original names. This means that
R will over-write any existing objects with those names.
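A short sketch of the round trip (the object and file names are hypothetical):
> x <- 1:5; y <- letters[1:3]
> save ("x", "y", file = "xy.RData")   # names in quotation marks
> rm (x, y)
> load ("xy.RData")                    # re-creates x and y by name
> x
[1] 1 2 3 4 5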
5.2 Scripts and Shell Scripts
A script is a text file that contains R commands and possibly other lines as well.
We like to differentiate between two sorts of these text files: files we call scripts
and files we call shell scripts. (The word “shell” here refers to a command line,
like Terminal on OS X or cmd.exe on Windows. Admittedly, these two names
are more similar than perhaps they should be.) A script is just a text file with
stuff in it. New scripts are created through the File | New script (or New Document)
dialog, and existing ones can be opened under File | Open script (or
Open Document). A script file will contain R commands, but it might also contain
comments, musings, invalid pieces of code you plan to fix someday, or
other notes. In normal, interactive usage, the script will be visible in a separate
window. You can run the line that the cursor is sitting on with control-R
(command-R in OS X), or highlight a few lines with the mouse and run those,
also with control-R, or run the entire script using “Run all” from the “Edit”
menu. (Those who prefer keyboard commands can run an entire script using
control-A to select all the lines, and then control-R to run them.) After you
modify a script you can save the updated version however your system requires.
We use scripts to store a lot of our commands, and we often run them bit by
bit, interactively, a few lines at a time. Running lines from a script is not like
running a function. Instead it is like typing commands, one at a time, into the
console. There are no “local” variables in a script; every assignment is made
immediately in the global workspace. If you run several lines of a script at once,
and one produces an error, R will still attempt to run all the others. (In contrast,
remember, when a function encounters an error, it quits.)
A script can also be run, all at once, like a function, through the source()
command we described earlier (Section 5.1.4). In this case, like a script run
interactively, there are no local variables. An error terminates a script being
run via source(), and only the commands prior to the error are executed.
Also as with a function, if you want to print intermediate results to the screen
while a script file is being source()-d, you need to explicitly call cat() or
print() inside the script. A script to be run in this way is often a good product
to deliver to another R user; you can deliver the data as one part (perhaps using
one of the techniques in Chapter 6) and a script to read or manipulate it, make
computations, draw pictures, or anything else as another. In fact, we sometimes
deliver several scripts for performing the different tasks of the project,
together with one parent or “wrapper” script that allows the user to invoke all
of the other scripts in the proper order. The scripts for the extended exercise in
Chapter 8 operate in this way, with a wrapper that can invoke a number of other
scripts. These scripts and the wrapper can be found in the cleaningBook
package.
Ordinary scripts are good for developing techniques to handle data, or for
tasks that will only be done one or two times. A shell script is useful in a production
environment where a specific task needs to be performed every day,
for example – frequently and automatically. A shell script is also a text file, but
it is a unit intended to be run all at once by R or another program, not as part of
an interactive session but from an outside “shell” program. In this way, a shell
script is like a script to be run via source(). The difference is that a script is
run from within R, either interactively or via source(), whereas a shell script
is run from a command line without having to open R.
A particular version of the R program called “Rscript” (included when you
acquire R) runs shell scripts. Shell scripts differ from regular scripts in that
the very first line will always look like #!Rscript. Those first two characters,
#!, are sometimes whimsically called “shebang” or “hash bang.” They indicate
that this line is a comment (the #), but that it's a very special sort of comment
that may only be placed on line 1 of the script (the !). The Rscript
after the shebang tells the operating system that the file that contains it is filled
with commands intended to be executed by Rscript. Lots of other programs,
besides Rscript, also support shell scripts, so you might see files that start
#!python or #!ruby or something else.
One note here is that the operating system needs to know how to find
the Rscript program. Generally, the set of folders the operating system
will search is set by the PATH environment variable (see Section 5.4.2 for a
discussion of environment variables). If the PATH has not been set to include
the folder that holds the Rscript program, the first line of the shell script
will need to specify the complete path to Rscript. For example, on one of
our Linux machines, a shell script might start with the line
#!/usr/bin/Rscript.
It is common for details – information about, say, input files, output location,
numbers of replications, and so on – to be passed into the shell script via envi-
ronment variables. An alternative, perhaps more R-like, mechanism is given
by the commandArgs() function, which produces a vector of the arguments
passed to Rscript at the time it was called. Either approach requires that the
shell script know where to look for the information it needs, of course, whereas
with an interactive script this information will often be entered by the user at
the command line.
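For example, a small shell script might look like the following sketch; the file
name and argument layout are our own invention, and trailingOnly = TRUE
keeps only the arguments that follow the script name.
#!/usr/bin/Rscript
# dailyreport.R -- hypothetical; run as: Rscript dailyreport.R 12 out.csv
args <- commandArgs (trailingOnly = TRUE)
reps <- as.numeric (args[1])    # number of replications
outfile <- args[2]              # where to write results
cat ("Running", reps, "replications; writing to", outfile, "\n")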
Once Rscript reaches the end of a shell script, it stops and control returns
to the command-line interface from which it was started. Normally, the point
of the shell script will be to produce some output files, graphics, or informative
messages. You can also have the shell script modify the contents of your
R workspace, but this is not the default and we avoid it because it feels messy.
The command Rscript --help will show you some of the options available
when running shell scripts.
5.2.1 Line-by-Line Parsing
If you use scripts, you will probably some day copy a piece of a function into
a script window and run it from there. Remember that a function is an entire
unit, but a script is a sequence of lines. This distinction can cause problems.
Suppose that you have some code like this:
if (i > 100)
    x <- x + 200
else
    x <- x - 200
In a function, this will do just what you think it will; it will look at the value of
i and use that to decide what to do with x. In a script, this code will fail. Here's
why: remember, the script executes line by line. The first line is clearly incomplete,
so R waits for the second line. At the end of the second line, the script
interpreter has seen an entire R expression, and it executes it. If i is > 100
then x is set to x + 200. Now it comes to the third line and it thinks a new
expression has started with an else that has no paired if. Executing a script
is just like typing the lines of the script into the console.
In this example, you would have to use braces to “protect” the else. In the
following code, the second line doesn't end the expression because there is still
an open brace.
if (i > 100) {
    x <- x + 200
} else {
    x <- x - 200
}
The choice of whether to construct a function or script is a personal one.
As we have said, functions operate on local variables, so there is less danger of
them over-writing items in your global workspace or of them creating unnecessary
copies of large data sets. Moreover, R examines functions for errors before
saving them. However, functions are difficult to transport and need to be run
all at once, not in pieces. Scripts are less formal; they can be run bit by bit and
passed around as simple text files. However, they only create global variables,
over-writing the existing objects with the same name in the process.
5.3 Error Handling and Debugging
At some point, sad to say, your function or script will fail unexpectedly or produce
unexpected results. You will need to debug it. Debugging is a difficult task;
there are many more ways to do something wrongly than there are to do it correctly,
and it is difficult to anticipate all the possible ways your function might
be called. Still, there are several ways to approach debugging, with different levels
of complexity. This section describes some of these approaches, from least
to most complicated.
5.3.1 Debugging Functions
cat() Statements
The easiest way to debug is to insert cat() statements into your code at strate-
gic locations, to print out intermediate results. Of course it’s difficult to know in
advance what the best locations will be. If your function is failing with an error
message, then of course you want your cat() to be higher up than the place
where you think the error takes place. We recommend labeling your cat()
statements, particularly if you insert several, to display where in the program
they are placed and what the program is about to do. So, for example, you might
have cat() statements that look like the following:
cat ("A: Start setup, i is", i, " dim (X):", dim (X), "\n")
cat ("B: Top of loop, xcount is", xcount, "\n")
cat ("C: End loop, result[1:3] is", result[1:3], "\n")
and so on. e cat() statements are easy to put in and take out (the labeling
scheme helps ensure you remove them all), but for them to be informative,
you must have selected the proper thing to display. We have sometimes
found it useful to include an argument to the function being debugged that
controls whether this diagnostic printing takes place. Conventionally, such
an argument is called verbose. So some of our functions include a number
of lines like, for example, if(verbose) {cat("Now operating on
file", fname, "\n")}.everbose argument might be logical, or it
might be numeric, with higher numbers producing more detailed diagnostics.
A little note that writes a line every hundredth iteration or so can be very reas-
suring. e %% operator gives remainders, so if you have a loop variable named
i, the line if(i %% 100 == 0) cat(We’re on rep", i, "\n")
will print a line when iis 100, 200, and so on.
There is a complication when using a function to print out intermediate diagnostic
messages. For efficiency R uses buffered output, which means that it
saves up a lot of these messages to deliver them all at once. For diagnostic purposes,
we often want the messages to be shown as soon as they are generated; in
these cases, we turn buffering off. In Windows, you can do that by right-clicking
inside the console window (not on its title bar) and selecting “Buffered Output”;
control-W will toggle its setting back and forth.
The traceback() Function
When an error occurs, your first question will often be where, exactly, the problem
was. Although the line number is often printed, that might be unhelpful if
the function that failed is nested inside a sequence of calls. A call to the
traceback() function is often the first action to take when an error occurs. Its goal
is to show the sequence of function calls that led to the error and it often (but
perhaps not always) succeeds. We give a brief example in Section 5.3.3. The
recover() function will help advanced users; it lists the set of calls and starts
a browser session in the one that the user selects.
The browser() Function
The browser() function represents a big step forward in interactive debugging.
When a function or script encounters a call to browser(), it pauses and
produces a prompt like Browse[1]. (The [1] part indicates that this prompt
arose from a function called at the command line; for a function called by a
function it would be [2], and higher numbers would indicate even more nesting
of function calls.) At the browser prompt, you can type the name of an
object to display it, enter other function calls, create local variables, and do
other things you can do in a regular R session – but usually we use the browser
to display or modify the values of variables in the function. There are a few commands
you can issue to the browser: c means “continue” (i.e., resume running),
s means “step” (go to the next statement, even if it's inside another function),
n means “next” (i.e., go to the next statement, treating function calls as if they
were one step), f means “finish this loop or function,” and Q means “quit the
browser.” Since these are command names, if you need to print out the value of
a variable named c or one of the others, you need to do that explicitly with a
command such as print(c). It is also worth knowing that, inside a function,
the ls() command shows you only the variables local to that function. To see
the variables in the global workspace, specify ls(pos = 1).
Just as we sometimes set up a verbose argument that allows the user to
specify that diagnostic messages be printed out, it is sometimes valuable to add
an argument such as browse that specifies where calls to browser() might
be made. Ensure that this argument is FALSE by default if you give your code
to other users, since the browser prompt has the potential to confuse them.
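One way to set this up, as a sketch (the function and its processing step are
hypothetical; browse is just an ordinary argument name):
cleanfile <- function (fname, browse = FALSE) {
    dat <- read.csv (fname)
    if (browse) browser ()    # pause here only when debugging
    # ... continue processing dat ...
    dat
}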
The debug() Function
Another mechanism for debugging expands on the browser() call. The
debug() function labels a function as “to be debugged,” so that, whenever the
function is run, browsing starts at its top. The label persists until it is removed
with undebug(); the debugonce() function imposes the label for only one
run of the function. Debugging produces the browser prompt; it just saves you
from having to include an explicit call to browser() in the text of the function.
5.3.2 Issuing Error and Warning Messages
Sometimes, you will want to stop your function early, perhaps because arguments
of the wrong type were passed, or some value important to the computation
is missing. Other times, you might want the function to produce a
warning in the R style and then continue processing. The stop() and
warning() functions and their relatives perform these tasks in R.
Producing Errors
Stop() stops the function and prints its argument to the console, so in a
function called funk, the code stop("Integer needed") will produce
the message Error in funk() : Integer needed. (Notice that no
“new-line” character is needed in the error text.) Even if the function with the
stop() was called from inside another function, control returns all the way
out to the command line. (In the following section, we show how that behavior
can be modified.)
One very common use of stop() is to test whether all the arguments are
of the expected type. Our code often includes lines such as if (!is.matrix
(X)) stop("X must be matrix"). If multiple strings need to be combined
– often a good practice since it allows you to include diagnostic information
in the messages – use paste() (Section 4.3) first.
An enhancement to stop() is provided by stopifnot(), which acts
like stop() unless every one of its arguments evaluates to a vector of TRUE
values. This makes it easy to handle a set of arguments at once, as well as
the case where NAs are found inside a vector. Suppose we require that an
argument b must be present, and contain elements that are greater than zero
and also not missing. We could ensure that in a function with code such as
if (any (is.na (b)) || any (b < 0)) {stop ("Illegal argument
b")} but we would need to have one line like this for each of the several
arguments that had that requirement. Here we used the double-vertical-bar
version of OR so that R would stop evaluating if the first any() were
TRUE. In this case, it would be easier to specify stopifnot(b > 0). This
function evaluates all its arguments with an implicit all() command, and
calls stop() unless all its arguments are TRUE.
Here, if any element of b is negative or NA, the implicit all() command will
return something other than TRUE, and the function would stop with the error
message Error: b > 0 is not TRUE.
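A compact sketch (the function and its checks are our own toy example)
combining these ideas at the top of a function:
safe.logb <- function (X, b) {
    # hypothetical argument checking, done before any real work
    if (!is.matrix (X)) stop ("X must be matrix")
    stopifnot (!is.na (b), b > 0)    # stops unless all checks are TRUE
    log (X) / log (b)                # logarithm of X in base b
}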
Warnings
In situations that do not require the function to stop entirely, you can issue a
warning to your user. R has two functions with similar names that help you
manage warnings: warning() and warnings(). Like stop(), the
warning() function prints out text supplied by the function (often after a call to
paste() to assemble some diagnostic information), but with warning() the
function attempts to continue.
The exact behavior of warning messages is determined by the value of the
warn option. This value can be displayed with a call to options()$warn. By
default, warn has the value 0. This means that warnings are printed after control
is returned to the command line. If there are fewer than 10 warnings, their
text is printed out; if there are more than 10, a single message indicating how
many messages there are is displayed. In that case, as the messages point out,
the individual messages can be accessed with a call to the warnings() function.
If warn is set to 1, with the command options(warn = 1), warning
messages appear as they are generated, instead of being saved up until control
returns to the console. An even more rigorous choice is warn = 2, which
causes any warning to be treated as an error. This is particularly useful in a
debugging environment and we often choose this option. In your own use of R,
you should investigate every warning until you are certain as to why it is being
produced. We also recommend that you anticipate that your users will ignore
your warnings – because they will.
One final choice of value for warn is possible; if warn is set to a negative
value, all warnings will be suppressed. This is almost never a good idea,
although we do know of one exception. We often run into a case where the
contents of a character vector (say, vec) are mostly numeric, with a few
non-numeric items that we are prepared to convert to NA. For example, vec
might represent the number of days since a customer declared bankruptcy,
and it might have the value c("342", "1101", "Never", "615").
A call to as.numeric(vec) will produce a warning (“NAs introduced by
coercion”) that should be ignored, and we do not like to produce warnings
that we really do want our users to ignore. Happily, the
suppressWarnings() function takes care of this by suppressing warnings just long
enough to execute the expression passed to it. In this case, the call
suppressWarnings(as.numeric(vec)) will produce a numeric vector with one
NA – and no warning.
5.3.3 Catching and Processing Errors
An error, as we mentioned earlier, causes R to stop processing and return control
to the command line, even if the error takes place in a nested stack of
function calls. There are many circumstances where we would rather intercept
the error processing, take some necessary action, and then resume processing.
As R has evolved, the mechanisms for this have become more sophisticated,
but the most basic of these mechanisms is the try() function. The try()
function lets you “try” a call to an R expression. If the call fails, the return value
is an object of class “try-error,” whereas if it succeeds, the return value is the
value of the call. As an example, consider a function a that computes the square
of its argument. This function, as follows, issues an error if its argument is not
supplied:
a <- function (arg1) {
    if (missing (arg1)) stop ("Missing argument in a!")
    return (arg1^2)
}
Now suppose we have a function b that calls a, but does so without checking to
see whether the argument is supplied. In our example, b has parameters input
(which defaults to 9) and offset, the latter of which is used as the input to
a(). The b function then returns input + a(offset). This example shows
the b function, and the result of its being run with no arguments as input.
# Compute input + a (offset)
> b <- function (input = 9, offset) {
      a.result <- a (offset)
      return (input + a.result)
  }
> b()
Error in a(offset) : Missing argument in a!
The error took place inside a(), but control was returned to the command line
immediately. If we had not known where the error occurred, we might have
called traceback(). This example shows the result of that call.
> traceback ()
3: stop("Missing argument in a!") at #2
2: a(offset) at #3
1: b()
The output of traceback() is not always helpful or easy to read. We start at
the bottom and work up. In this case, we can see that the error arose in a call
to b(), which called a() at its line 3 (i.e., the third line of the b() function).
The error took place at the second line of a().
In this next version of b(), we use try() to see whether the call to a() can
be completed successfully. If it cannot, we issue a warning and set the value of
a.result to 3. This example shows an updated version of b() and the result of
running it.
> b <- function (input = 9, offset) {
      a.check <- try (a.result <- a (offset))
      if (class (a.check)[1] == "try-error") {
          warning ("Call to a() failed; setting a.result to 3")
          a.result <- 3
      }
      return (input + a.result)
  }
> b()
Error in a(offset) : Missing argument in a!
[1] 12
Warning message:
In b() : Call to a() failed; setting a.result to 3
In this example, the function is completed, returning the value 12. The text of
the error was still produced, though; this can be disconcerting to users, and it
can be turned off with the silent = TRUE argument to the try() function.
Indeed, with both silent = TRUE and no call to warning(), this error will
be handled without notifying the user at all – which very well may not be what
you want. Notice, by the way, that when we examine the class of the a.check
variable, we use class(a.check)[1]. The [1] is there because lots of
R objects have a vector of classes. Comparing that vector to the single value
"try-error" will produce a warning, just the action we're hoping to avoid.
We have found try() to be particularly useful when we are relying on programs
and files outside R's control. For example, if a call to an outside function
fails, or if one in a series of files cannot be read in, we will usually want to trap
the error, inform the user, and continue processing. R also supplies more sophisticated
error handling, which allows different treatments of errors and warnings,
allows functions to signal unusual conditions and other functions to interpret
those signals, and allows more control over restarting. A discussion
of those is outside the scope of a book on data cleaning, but interested R programmers
should start at the help page for tryCatch().
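As a brief taste of that machinery (the file name here is hypothetical),
tryCatch() lets you attach separate handlers for errors and warnings, plus a
finally clause that runs in every case:
result <- tryCatch ({
    as.numeric (readLines ("values.txt"))    # may fail or warn
}, warning = function (w) {
    cat ("Warning caught:", conditionMessage (w), "\n")
    NA
}, error = function (e) {
    cat ("Error caught:", conditionMessage (e), "\n")
    NA
}, finally = {
    cat ("Done trying.\n")
})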
5.4 Interacting with the Operating System
A function or script will very often need to interact with the operating system.
For example, it might need to get a list of all the files in a particular directory
whose names end in .data so it can process them. It might need to create a
new sub-directory for results (this is an example of a “side effect”), and so on. A
number of built-in functions manage R’s access to the file system. As examples,
we saw dump() and source() in Section 5.1.4 as ways to interact with files.
Many scripts – particularly those used for cleaning data – start by acquiring
the data from an outside source such as a file or relational database. is is
such an important part of the data cleaning process that we devote the following
chapter (Chapter 6) to the topic of getting data into and out of R. In this section,
we give a couple of examples of the different ways that R can interact with the
operating system, separately from the chore of actually acquiring data from an
outside source.
5.4.1 File and Directory Handling
By default, R presumes that the files it is dealing with are in the working directory.
This is normally the directory from which R was started. So, a command
such as source ("myfile.txt") is assumed to refer to a file in the working
directory, and a command such as cat(message, file = "output")
will likewise create a file in that directory. The getwd() command prints the
working directory to the screen; the setwd() command allows you to change
the working directory to a new location.
The list.files() command, with no arguments, will show you all the
files in the working directory. This function has a large number of useful
arguments. By default, only the file name, without the path, is returned.
However, it is possible to list the files in a vector of directories, in which case
the full.names = TRUE argument will return a path name (relative to the
working directory) for each file. Full names are also useful when using the
recursive = TRUE argument to find files in the working directory and
all of its subdirectories. More important, perhaps, is the ability to select files
that match a regular expression (Section 4.4). So, for example, the command
list.files("..", pattern = "xlsx*$", recursive = TRUE)
will list all files whose names end in xls or xlsx in any directory underneath
the parent (..) of the working directory. OS X and Linux users may find the
ignore.case argument useful as well, and other arguments control the
inclusion of directories in the listing.
Beyond merely listing the files, we often want to determine something
about their content. The file.info() command gives some information
about a file – here, the full name will be needed for a file outside the working
directory – and a series of commands with names such as file.copy(),
file.exists() (to check for existence), and the slightly more dangerous
file.remove() are also available. For directories there are the corresponding
functions dir.exists() and dir.create() that, respectively, test
for the existence of a directory and create one.
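A small sketch combining these tools (the directory name and file pattern are
hypothetical):
> outdir <- "results"
> if (!dir.exists (outdir)) dir.create (outdir)    # side effect: new directory
> datafiles <- list.files (pattern = "\\.data$")   # files ending in .data
> file.info (datafiles)$size                       # sizes in bytes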
5.4.2 Environment Variables
An environment variable is a name and (single) value stored inside the operating
system. These are available for programs to read and, in some cases, create
or update. For example, on most systems, an environment variable named
HOME holds the location of the user's home directory. When R starts, it reads a
few existing variables (e.g., LC_ALL describes the locale – see Section 1.4.6) and
creates a few of its own (e.g., R_HOME gives the directory where R is installed).
Environment variables are case-sensitive in OS X and Linux and not in Windows,
but it is conventional to use all upper case for variable names.
Environment variables provide a convenient way for one program to com-
municate with another. In R, the set of environment variables is available
through the Sys.getenv() function, which returns a vector whose names
and values are the names and values of all the variables in the environment.
This list is often long and unwieldy, but you can specify particular variables
to extract with a command such as Sys.getenv("R_HOME"). Variables
can be created or updated with Sys.setenv(). So, for example, we might
create a new environment variable named REPS with a command such as
Sys.setenv(REPS = 12). Now if R starts another program, R's environment
will be available in that program, and in particular that program will
be able to determine that the value of REPS had been set to "12". (Notice the
quotation marks; environment values are always treated as text by R.) Environment
variables are one way to pass information from “outside” into R in,
for example, a shell script. Notice that a function that creates an environment
variable is producing a side effect (see Section 5.1.3).
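For instance, a shell script might read REPS back, supplying a default when the
variable is unset (a sketch; REPS is the made-up variable from above):
reps <- as.numeric (Sys.getenv ("REPS", unset = "10"))
cat ("Using", reps, "replications\n")    # environment values arrive as text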
5.5 Speeding Things Up
Some functions are fast and some are slow. Sometimes functions are so slow
that they keep you from doing what you need to do. The slow speed can be a
function of things outside R's control – maybe a file is just very big, or is being
fetched over a slow network connection – but it can also be a result of inefficient
programming. In this section, we talk about how to measure a function's
performance and then give a few ideas on how to speed things up.
5.5.1 Profiling
The process of measuring how much time and memory a function uses is called
“profiling.” The very simplest way of measuring how much time a function uses
is with the system.time() function, which essentially reads the computer's
clock at the beginning and end of a call to an R expression and reports the
difference. While this is a useful measure, it neither divides the time used
into pieces attributable to each of the function's components nor addresses
memory use.
R has a much more sophisticated profiling tool based on the Rprof() function.
This can help you identify both the steps that use a lot of time and also
the ones using a lot of memory. This function writes out a log file describing
what R is doing 50 times a second, by default (and so these files can get big).
After your function terminates, the summaryRprof() function can produce
a report. The help page for Rprof() and the chapter of the online manual referenced
there address this topic in detail. While profiling can be important in a
production environment where a function is run thousands of times, we rarely
find need of it in our more interactive functions for data cleaning that are only
run a few times.
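A minimal profiling sketch (the expression being timed is arbitrary and the
log file name is our own choice):
> Rprof ("prof.out")                   # start writing samples to prof.out
> for (i in 1:50) d <- dist (matrix (rnorm (3000), ncol = 3))
> Rprof (NULL)                         # stop profiling
> summaryRprof ("prof.out")$by.self    # time attributed to each function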
5.5.2 Vectorizing Functions
It is almost always useful to ensure that a function can act on vectors of any
length as well as on single values. Moreover, using a vectorized function on a
vector will almost always be more efficient than looping over the individual
entries. Often vectorization will happen automatically, at least in part, because
arithmetic functions such as + are themselves vectorized. However, the
funk() function from the start of the chapter is not properly vectorized.
The example in Figure 5.1 shows that function on the left (we have shortened
the text of the warning) and a vectorized version on the right.
Here, the non-vectorized behavior was associated with the if (x < 0)
statement that was part of error checking. In the vectorized version, we
initialize the out vector with NAs, then fill in only the values for which x is >= 0.
As a technical note, the initial out is actually a logical vector because NAs are
logical by default. The vector gets converted to a numeric as soon as the first
numeric is stored in it, but if no x values are >= 0 this function will return a
logical vector. If this were an issue, we would get around it by initializing out
as as.numeric(rep(NA, length(x))).
One source of slow R code is inefficient looping. If it is possible to replace a
for() or while() loop with a call to one of the apply functions, the result
will almost always be faster. The apply functions use looping internally, too,
but usually in a much more efficient way. Sometimes, replacing a for() loop
will be difficult – for example, if the action to be taken on iteration i depends
on the results from iteration i - 1. But we always try to vectorize our functions.
However, sometimes vectorization makes code harder to read and maintain.
It is gratifying to produce a function that runs faster, but sometimes it is the
total time taken to code, run, explain, and maintain that is the more important
measure of quality. In the sort of large data sets we run into, with many more
rows than columns, the critical point is to work hard to avoid looping over rows.
Looping over columns is usually much less costly. Data in other formats (e.g.,
with many more columns than rows) will need to be approached differently.
Non-vectorized:
funk <- function (x, y = 2) {
    if (x < 0) {
        warning ("Neg(s)")
        return (NA)
    }
    return (sqrt(x) + y)
}

Vectorized:
funk <- function (x, y = 2) {
    out <- rep (NA, length (x))
    if (any (x < 0)) warning ("Neg(s)")
    out[x >= 0] <- sqrt (x[x >= 0]) + y
    return (out)
}

Figure 5.1 Vectorized and non-vectorized code example.
5.5.3 Other Techniques to Speed Things Up
Vectorization is always the first thing to try when you need to speed up your
R code. But sometimes a fully vectorized function is just not fast enough. The
rest of this section suggests some avenues to try when looking for more speed
from your R computations.
Compiling
One quick way to gain some speed is through compiling, which is the
translation of R code (which, of course, looks something like English)
into “byte code,” which the machine can read much faster and more efficiently.
Indeed almost all of R's built-in functions have already been
compiled into byte code; type the name of a function like q and you will
see near the bottom a line with a byte-code address, like, for example,
<bytecode: 0x00000000067c6368>. The built-in package compiler
allows you to compile your own functions, either one at a time with cmpfun()
for functions (or compile() for expressions), or package by package with
compilePKGS(). Sometimes, compilation does not appear to speed things
up, but it is easy to try. In this example, we show system.time() applied to
a simple function, which, essentially, does nothing.
> dumb <- function (n = 100) {
      for (i in 1:n) {}
  }
> system.time (dumb (6e8))
   user  system elapsed
  33.48    0.23   33.71
The function dumb() takes no action, but in our example it does so six hundred
million times. R required about 34 seconds for those operations, although the
exact time can vary. The result of system.time() will depend greatly on
how fast your computer is; we show our result as a baseline. (Our machine is
fast. If you try this example, you might want to start with a smaller number.) In
this case, anyway, compilation definitely helps. The following example shows
the result of compiling the function (an operation that used 0.02 seconds of
elapsed time) and then running it.
> dc <- cmpfun (dumb)
> system.time (dc (6e8))
   user  system elapsed
   7.33    0.00    7.34
In this example, we see a substantial savings from compiling – the compiled ver-
sion required only 22% of the time required by the original. Another very useful
approach is the “just-in-time” compilation enabled by a call to enableJIT(),
with an integer argument describing the details of how the compilation should
work. Specifically, enableJIT(3) performs as much compilation as possible.
Once this call has been made, functions are compiled right before their first use
(and stay compiled). In this way, you do not need to specify specific functions
to be compiled – they all are – and this has the potential to result in substantial
time savings. In practice, it sometimes happens that the time-consuming parts
of your data cleaning tasks are being done inside built-in functions – and these
will probably already have been compiled. On the other hand, there are very few
drawbacks to compilation. To “uncompile” a function, edit it with fix() – or
“deparse” it to convert it to text, and then convert it back into a function, with
a command such as
dc <- eval(parse(text = deparse(dc, control = "useSource"))).
Parallel Processing
An even greater gain in processing speed can sometimes be realized using parallel
processing, which exploits the fact that modern computers almost all have
multiple “cores” capable of more-or-less independent processing. The built-in
package parallel allows control over the use of multiple cores. Using the
parallel package requires three steps: first, a “cluster” of cores is created,
possibly with the makeCluster() command. (Creating the cluster may take
a few seconds, but it only needs to be done once per session.) Second, any necessary
items from the global workspace need to be passed to the cluster via
a call to clusterExport(). Finally, the cluster created in the first step is
passed to one of the functions that knows how to exploit it. For example, the
parSapply() function acts as a “parallel sapply(),” running a function on
the columns of a data frame, with different cores acting on different columns.
The machine on which we wrote this book has 32 cores – we can determine this
via a call to the detectCores() function – so in this example we set aside 24
of them to act as the cluster. Creating the cluster takes about 5 seconds in this
example. Next we export the dumb() and dc() functions so that the cluster
can use them, and then we run parSapply() to run 24 separate instances of
each of these functions.
> detectCores ()    # after library (parallel)
[1] 32
> clust <- makeCluster (24)
> clusterExport (clust, c("dumb", "dc"))
> system.time (
      parSapply (clust, 1:24, function (i) dumb (6e8/24)))
   user  system elapsed
   0.00    0.00    2.29
> system.time (
      parSapply (clust, 1:24, function (i) dc (6e8/24)))
   user  system elapsed
   0.00    0.00    0.67
Because we were running 24 instances of our dumb() and dc() functions,
we only needed to run 1/24th of the iterations within each instance. Notice the
substantial time savings realized from running in parallel – even more so when
running the compiled version of the function. Interestingly, functions inside a
call to parSapply() are not automatically compiled even if enableJIT()
has been called; you will need to compile them explicitly (or run them first,
to get them to compile if enableJIT() has been set) and then export the
compiled version to the cluster.
It is a good practice to stop the cluster with stopCluster() when parallel
processing is complete, although the cluster will be stopped when R terminates.
Even More Speed
When even more speed is required, R can interface nicely with code that is
compiled down to the machine level. Often this is code from somewhere else that
was originally written in, say, C or Fortran. We run code like this all the time
inside R, and in packages, without even knowing it. The ability to write this
sort of code is perhaps more valuable for production environments requiring
lots of specialized computation than for the sorts of data cleaning problems
relevant to this book, and we will not discuss it here. The help pages for
dyn.load() and .C(), and especially Chapter 5 (“System and foreign language
interfaces”) of the “Writing R Extensions” manual, will be useful starting
points. The Rcpp package provides another, cleaner approach to incorporating
C++ code, and the paper of Eddelbuettel and François (2011), and their
web page, www.rcpp.org, together with that of Wickham, adv-r.had.co
.nz/Rcpp.html, provide more details for the interested user.
5.6 Chapter Summary and Critical Data Handling Tools
This chapter discusses functions and scripts, two ways to automate actions in
R. Every data cleaning task will generate one or more of these. Functions are
self-contained but hard to transport. Unless you include a side effect (such as
plotting or writing to a file) they operate on local variables and do nothing
more than compute a return value. Scripts are easy-to-read text files that act
as sets of commands – often including lots of function calls and even function
definitions – just like those you type into the console (including commands that
may not be valid). All variables in a script are global.
Important features of functions (and scripts) include the following:
- Function arguments are matched by name and by position. Names can often
be abbreviated. One special argument is the ellipsis ..., which allows the
number of arguments to vary. However, names of arguments defined after
the ellipsis cannot be abbreviated, and the ellipsis can “consume” arguments
whose names are typed in error. It is important for you as the function
developer – and as a user – to check the arguments carefully. The missing()
function is useful here.
- A function has a return value. This can only be one R object, so if you need
to return several items, put them into a list. If you do not want a function to
return any value, end it with a call to invisible().
- Side effects are when a function does something other than compute and
return a value. Sometimes these are harmless, as when a function draws a
plot, or necessary, as when a function reads in data from a disk file. Sometimes
side effects are dangerous, as when a function changes or deletes an
object in the global workspace. Avoid this sort of side effect.
- Functions can be saved to disk with dump() and restored with source(),
or via save() and restored via load(). Scripts are saved as ordinary text
files, but using source() on a script file causes its code to be run.
- Functions and scripts will have errors. Debugging is a substantial part
of every development effort. We can debug with a simple method, such
as inserting cat() statements, or via the more sophisticated interactive
debugging tools browser() and debug(). Our functions can generate
our own errors (and warnings), and these can be handled via try(). If a
function produces an error or warning, be sure to find out why – but do not
expect your users to behave similarly.
- Functions and scripts can access files and directories through ordinary R
functions.
- To speed things up, try compiling your functions, either individually with
cmpfun() from the compiler package, or via enableJIT(), which
compiles every function it runs. For problems with large loops, you can
achieve substantial gains in speed via parallel processing through the
parallel package.
5.6.1 Programming Style
In any data cleaning project you should expect to deliver your functions and
scripts, as well as your results. Your code should be neatly and consistently
formatted. A number of proposed R style guides can be found on the Internet
and inevitably they disagree. For example, one source suggests separating parts
of an R object’s names with underscores (as in first_sub_total) while
another recommends never using underscores. R’s code itself uses different
schemes in different places. Make sure your names are meaningful – s is an
uninformative name for a standard deviation, for example. The cost of typing
longer variable names is less than the cost of debugging code later when you
cannot remember what that variable was for.
Our recommendation is to select a style and stick with it. Most importantly,
add comments to your code – more than you think you need. Use spacing –
blank lines, spaces around operators, and indentation – for readability. For
example, in our code we do not indent commands in a function that are not
part of an if(), for(), or similar clause. Then, we indent four spaces for
each such nested clause. Other programmers indent everything inside a
function. On another point, consider including dates or version numbers in your
functions and scripts.
It is as important to write readable code as it is to write efficient code.
Tailor your style to your reader. For example, suppose you want to check
whether a vector has a zero length. We would use an expression such as
if(length(x) == 0) .... This shows clearly the two things being
compared. Some programmers use the more cryptic if(!length(x)) ....
Here R computes the length, then converts it to a logical to be able to apply
the ! operator. If the length is 0, then !0 produces TRUE. Even if this latter
approach were marginally faster it would not be worth it.
5.6.2 Common Bugs
In this section, we mention a number of the bugs we see in our, and other
people’s, R code. Avoiding these bugs is a good start toward building useful,
re-usable code.
Many bugs arise from unexpected input, so when you prepare a function
for someone else’s use, you should consider testing the classes, sizes, and
other attributes of input arguments to ensure they match what a function
expects. Sometimes, the “unexpected” behavior is related to missingness, so
if a computation depends on, say, the average of some input vector, you, as a
function writer, will need to decide whether an NA in the input should cause
the function to stop (perhaps testing with anyNA()) or whether the average
should be computed with the missing values excluded (using mean() with
na.rm = TRUE in this example).
Another common input error arises from R’s habit of converting a one-row
or one-column matrix into a vector (see Section 3.2.1). Users might
intend to pass a matrix to a function expecting one, using a call like
myfun(mymat[mymat[, "Price"] > 10, ]) to select only those rows for
which Price is greater than 10. If there is only one such row, though, R will
silently convert that row into a vector. Code that relies on matrix attributes,
such as trying to determine how many columns a matrix has using dim(),
will fail.
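A small sketch of the trap, using a hypothetical matrix m; the cure is the drop = FALSE argument, which tells the subscript not to discard the matrix's dimensions.
m <- matrix (1:6, nrow = 3, dimnames = list (NULL, c("Price", "Qty")))
one <- m[m[, "Price"] > 2, ]                  # exactly one row matches ...
dim (one)                                     # ... so R returns a vector: NULL
two <- m[m[, "Price"] > 2, , drop = FALSE]    # the matrix survives
dim (two)                                     # 1 2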
A third common error arises from the way that missing values propagate
through computations. We very often use the if() function, but a call to
if(x) produces an error if x is NA. When you observe this error, you will
want to find the source of the missing value.
Errors often arise when reading disk files (if they are not present) or writing
them (if a folder does not exist or if permissions are not properly set).
Check for these conditions with one of the file-handling functions such as
file.exists(), to check whether a file is present, or file.info(), to
see if a file is writeable.
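As a sketch, a data-reading script might begin with a check like this one; the file name is hypothetical.
infile <- "input.csv"                     # hypothetical input file
if (!file.exists (infile))
    stop ("cannot find ", infile, " in ", getwd())
file.info (infile)$size                   # a size of 0 would also be suspicious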
We commonly see a warning when a function expects a single result and gets
a vector of length 2 or more. For example, in many cases, code will check the
class of an object with code such as if(class(obj) == "lm") ....
Since many objects have a vector of classes, this code will produce a warning
(and examine only the first element of the class vector).
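One common way to avoid this particular warning is inherits(), which handles class vectors of any length; a sketch, with a hypothetical fitted model:
fit <- glm (dist ~ speed, data = cars)    # class(fit) is c("glm", "lm")
class (fit) == "lm"                       # length-2 result: the source of the warning
inherits (fit, "lm")                      # TRUE, and always a single logical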
An even more serious problem can occur when sapply() (Section 3.5)
runs a function on a data frame or list that is expected to return a single value
for each column in the data frame. We saw in Section 3.2.3 how a problem
arises if the function in question can return results of different lengths.
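One defensive alternative, sketched here with a hypothetical data frame, is vapply(), which insists on a result of a declared shape and fails loudly otherwise.
df <- data.frame (a = 1:3, when = Sys.time () + 1:3)
sapply (df, class)                    # a list: the "when" column has two classes
vapply (df, class, character (1))     # stops with an error instead of surprising us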
5.6.3 Objects, Classes, and Methods
We showed an earlier example where calling print() on a data frame led
to R calling another function, print.data.frame(). This is an example
of a widely used programming approach in which the type of object passed
to a function determines the action the function will take. In this example,
print() is a “generic” function and print.data.frame() is a “method”
that is applied to data frame objects. R has almost 200 different methods
for the print() function; you can see them with the command
methods("print"). As in the print.data.frame() example, the names of
methods are constructed from the generic function’s name, a dot, and the
object type. To give another example, we construct sequences of POSIXt
objects using the seq() function (Section 3.6.6). R detects the class of the
POSIXt object and runs the specific seq.POSIXt() function.
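As a quick sketch (the exact counts vary with the R version and the packages loaded):
length (methods ("print"))   # nearly 200 in a basic session, as noted above
methods ("seq")              # seq.Date, seq.default, seq.POSIXt, ...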
This approach, in which one function can have many methods, based on the
class of the object passed to it, is part of “object-oriented programming,” a
widespread architecture of programming languages. R has two different ways to
implement object-oriented programming. Neither is important enough to data
cleaning to be part of this book, but you should be aware that many R functions
are equipped to perform different duties based on different inputs. Sometimes,
you will need to look up the specific function, rather than the generic one, in
the help pages.
6 Getting Data into and out of R
Earlier chapters in this book have described what data in R looks like and how
to manipulate it. In this chapter, we describe how to get data into R for analysis,
and how to get updated data back out. Reading data into R is the first step in
every data cleaning project, so it is in this chapter that the real work of data
cleaning begins. But we start with a note on keeping track of your data's
provenance, which is the word we use to describe the documentation of your data's
history. You should know where you acquired every bit of your data, from what
source, and on what date. A natural place to keep that information is in your
scripts. Often, we have one or more scripts devoted to reading in the data, and
these start with some description of the date, the source of the data, the
commands to do the actual reading, and some notes on problems we encountered
reading the data in.
Keeping track of the data's provenance is always important, but it is
especially important if the underlying data is subject to change, perhaps because
you extracted it from a database, a public site, or a web page under someone
else's control. It is through this sort of documentation that you can make your
research reproducible by others.
6.1 Reading Tabular ASCII Data into Data Frames
Most of the data we read – and write – in R comes to us in the form of
rectangular or tabular data, that is, data arranged with observations in rows and
measurements in columns. We expect that every row will have exactly the same
number of items. Being able to get these data files into R cleanly is a critical part
of data cleaning. Unfortunately, there are a lot of ways a file can be badly put
together. In this section, we describe how to read tabular data files into R data
frames and some approaches to try when things aren't working. Writing data
from R to disk is generally straightforward and we discuss that briefly at the end
of the section. In this section, we focus on ASCII text, and in the next section,
we describe the minor complications brought about by UTF-8.
6.1.1 Files with Delimiters
Perhaps the most common sort of file we read in is a delimited file in which each
observation is represented by a single row. Within the line, the fields are
separated by a delimiter, which will usually be a single character such as a comma,
tab, semicolon, pipe character (|), or exclamation point. Sometimes, a space
or spaces are used as the delimiter. The end of each line is marked with the
end-of-line character (but, as we observe in Section 6.1.8, these can vary across
platforms). The advantage of delimited files is that, being text, they are easy
to move across systems and easy to inspect by eye (although of course fields
will generally not line up). Unlike fields in fixed-width files, fields in delimited
files need never be truncated nor made artificially too big. On the downside,
like other text files, delimited files are not good at representing numeric data
efficiently. Moreover, on occasion users will inadvertently insert the delimiter
character into regular fields (we illustrate this later in the chapter).
The principal tool for reading in delimited files in R is the read.table()
function, together with its offspring read.csv() and read.delim().
(“CSV” names the popular “Comma-Separated Values” file format.) These
three functions are identical except for their default settings. All three of these
produce data frames; if you need a matrix, the best approach is to construct
a data frame and then convert it via as.matrix(). Two more functions,
read.csv2() and read.delim2(), are also available; these are just like
read.csv() and read.delim() except that they expect the European-style
comma for the decimal point and that read.csv2() uses the semi-colon as
its delimiter.
The read.table() function and the others employ a host of arguments
that allow most delimited files to be read in. These arguments are optional and
have default values that are sometimes useful and sometimes not.
The most important of these arguments are as follows:
- header: a logical indicating whether the disk file has header labels in the
first row. If so, set this to TRUE and those labels will be used as column
headers (after being passed through the make.names() function to make them
valid for that use).
- sep: the separator character. This might be a comma for a CSV, a tab (written
\t) for a tab-separated file, a semi-colon, or something else.
- quote: the set of characters to be treated as quotation marks, discussed
below.
- comment.char: the character that is understood to introduce a comment
line.
- stringsAsFactors: a logical that determines whether columns that
appear to be characters are converted to factors or left as characters.
- colClasses: a vector that explicitly gives the class of each column.
- na.strings: a vector specifying the indicator(s) of missing values in the
input data.
Other options control the number of lines skipped before the reading starts,
the maximum number of rows to read, whether blank lines are omitted or
included, and more, but the ones above are generally the arguments we worry
about first. The choice of sep character can often be inferred from the name
of the file (comma for files whose names end in CSV and tab for TSV, although
this is not a requirement). When the separator is unknown, we either try the
usual ones, use an external program to examine the first few lines of the file,
or resort to the scan() function (discussed later in this section). The default
value of sep is the empty string, "", which indicates that any amount of white
space (including tabs) serves as the delimiter. This is intended for text that has
been formatted to line up nicely on the page (so that extra spaces or tabs have
been added for readability). Setting sep to be the space character, " ", means
that read.table() will split the line at every space (and never at a tab).
We have also found that the quote, comment.char, and stringsAsFactors
arguments, in particular, very often need to be set explicitly in data
cleaning tasks. By default, the set of quote characters is set to be both ' and
" in read.table(), and to just " in the other functions. This means that
a string inside quotation marks such as "Ann Arbor, MI" is treated as a
single unit. This is a valid approach when the separator is a space, since
otherwise that phrase would look as if it had been made up of three separate fields.
The single quote mark would be useful in the corresponding British
environment, where its use is more common. In our work, the single quote is found
most often as an apostrophe, and if the single quote is part of the quote
argument, the apostrophe in the phrase Coeur d’Alene, Idaho will be taken
as starting a very long string that might not be ended until a later entry contains
Peter O’Toole or Martha’s Vineyard. We generally turn the
interpretation of quotation marks off by passing quote = "", or set the argument to
recognize only the double quotation mark by passing quote = "\"".
Similarly, the comment character defaults to R’s own comment character, the
hash mark or pound sign (#). Lots of code has comments, but comments in data
are rare. Much more often we see the pound sign as legitimate text in an
expression such as Giants are #1! or 241 E. 58th St. #8A. We generally
turn comments off by passing comment.char = "".
6.1.2 Column Classes
The stringsAsFactors argument is one of a set of arguments that aims
to help R figure out what to do with the data that it reads in. We have
encountered this name when constructing data frames with data.frame()
and cbind() in Section 3.4. Its default TRUE value specifies that any column
found to be character should be converted to a factor. This can cause problems
in a couple of ways. First, it is often the case that numeric columns can have
a small number of unexpected text items in them, particularly missing value
codes that are different from the default NA that R expects. In this case, the
read.table() function interprets these columns as character and then
converts them to factors. Second, even for text data we generally want the
raw text for purposes of data cleaning; we switch to factors only when we are
ready to begin modeling. One way around this automatic conversion is to set
stringsAsFactors = FALSE, and we almost always pass this argument
when we are reading in data. The exceptions are when we know that the data
is numeric (and pre-cleaned), and when we use colClasses, which we
describe in the following paragraph. In fact, this issue arises so often that R
has a built-in option to set the default behavior of stringsAsFactors.
However, we rarely use this option because we want our scripts to be portable
to other users who might not have set it, and for aesthetic reasons we hesitate
to have our scripts set other users' options.
A more flexible approach is provided by the colClasses argument. This
allows you to specify the column type for each of the columns at the time you
read the table. Of course, this information is not always available – sometimes
you have to read the data to figure out what is in it. Passing the nrows argument
allows you to specify how many rows read.table() should read. So often
we read just a few dozen lines and inspect the resulting data set to get an idea
about what classes to expect.
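A sketch of this peek-then-read pattern, with a hypothetical file big.csv:
peek <- read.table ("big.csv", header = TRUE, sep = ",", nrows = 50,
                    stringsAsFactors = FALSE)
sapply (peek, class)   # use these to build a colClasses vector for the full read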
The colClasses argument can be specified as a vector whose length is the
number of columns in the input file, or as a named vector, in which case you
can name specific columns and R assigns classes to the rest. When classes are
not supplied, R determines them by reading the entire file, so supplying column
classes explicitly can often speed up the reading of a big file.
In addition to the basic types of numeric, character, and logical, you
can also specify Date or POSIXct, and there is an approach by which you can
convert the input into other classes as well (see the help for read.table()).
The elements of colClasses are recycled in the usual way, so
colClasses = "character" converts all columns to characters – which is
often a good place to start in a data cleaning problem where the data is poorly
documented or of suspect quality. The as.is argument provides another way
to keep columns from being converted.
As an example of where colClasses can be helpful, consider the case
where a column consists of numbers that might have leading zeros, such
as US five-digit ZIP codes. Without colClasses, such a column would
automatically be converted to numeric, producing values whose leading zeros
would be lost – so the ZIP code for Logan Airport in Boston would come out as
the number 2128 instead of the expected 02128. The stringsAsFactors
argument would have no effect here since R would see the column as numeric.
In this example, we could replace the missing zeros using sprintf() as in
Section 4.2.1, but a more direct approach is to specify "character" in the
colClasses argument.
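A sketch, assuming a hypothetical file with a column named ZIP:
zips <- read.csv ("zips.csv", stringsAsFactors = FALSE,
                  colClasses = c(ZIP = "character"))   # 02128 stays "02128"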
6.1.3 Common Pitfalls in Reading Tables
In this section, we describe common problems we encounter when reading text
files into R and some ways you might address them. In any particular case, it is
difficult to know which issue is causing the problem. After we describe some of
these problems, we give an example of how we might go about diagnosing them
with a small realistic data set that appears to have a deceptively simple problem.
Embedded Delimiters
We have mentioned that problems commonly arise when the table to be read
includes # or quote characters. Another problem arises when the data has the
separator character (say, a comma) included as part of some text. Data of this
sort arises particularly often when the original source of the file is a
spreadsheet. Textual comments in spreadsheets very often have commas (or tabs,
which are introduced when the user enters a new-line character within a cell).
It also happens that cities get recorded in a form like “Hempstead, Long Island”
or “Westminster, Orange County.” If those text fields are surrounded by
quotation marks then R can interpret them correctly using quote = "\"", but this is
rarely seen in spreadsheet data. More often, embedded delimiters require some
effort to correct, and in the following section we give an example.
Unknown Missing Value Indicator
Another set of problems arises when the “missing value” indicator is unknown.
By default, R expects missing values to be indicated by NA, but the
na.strings argument allows a set of values to be supplied. Any value in the
input that matches an element of na.strings will be interpreted as a
missing value – and blank fields will also be taken as missing, except in
character or factor fields. For example, a spreadsheet from the Excel program
can have values such as #NULL!, #N/A, or #VALUE!, so these would be good
candidates for including in na.strings. If they are included, then columns
that are otherwise numeric (or logical) will be correctly interpreted, whereas if
they are not, then those columns will be interpreted as characters – and as we
discussed earlier, character columns are treated as factors by default.
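A sketch, with a hypothetical file exported from Excel:
dat <- read.csv ("export.csv", stringsAsFactors = FALSE,
                 na.strings = c("NA", "#NULL!", "#N/A", "#VALUE!"))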
Empty or Nearly Empty Fields
Empty fields – those with nothing in them – generally do not cause
problems, since they are brought in as NA values in numeric fields. As we noted
in Section 4.1.3, though, sometimes text data – particularly from
spreadsheets – represents empty cells by strings with a single space (or, sometimes, a
few spaces in a row). We sometimes think of these cells as “nearly empty.” This
problem is often hard to detect ahead of time, since empty and nearly empty
cells of a spreadsheet look alike. So the first thing we do, when we read in a data
frame, is to determine the number of numeric and character or factor columns
(and logical, too, though these are rarer). For a data frame named ourdf,
for example, we run the command table(sapply(ourdf,
function(x) class(x))) and compare what we see to what we expect. In
much of our work we expect 50–75% or more of the columns to be numeric,
so if there are no, or only a few, numerics, we conclude that there are textual
values – often, missing-value indicators or nearly empty values – in those
numeric columns. If we know that a particular column (say, one called
NumID) should be numeric but is being represented as character, we can
tabulate the elements that R is unable to convert, with a command like
table(ourdf$NumID[is.na(as.numeric(ourdf$NumID))]). This
is often a good starting point for examining the set of missing value indicators in
the data. These can then be included in the na.strings argument in another
call to read.table() – or we can explicitly set them to NA ourselves.
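Laid out as a script fragment, the two diagnostics just described look like this (ourdf and NumID being the names used above; expect a coercion warning from as.numeric()):
table (sapply (ourdf, function (x) class (x)))          # census of column types
table (ourdf$NumID[is.na (as.numeric (ourdf$NumID))])   # values R cannot convert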
Blank Lines
A less common problem occurs when the input file has blank rows. By default,
read.table() will skip these, and this is often what we
want. However, in some cases it is important that the lines from two different
files correspond. In this case, set blank.lines.skip = FALSE and blank
lines will be included in the output. The entries in these lines will appear as NA
in numeric columns and as the empty string "" in character ones.
Row Names
In Section 3.4, we noted that a data frame in R must have row and column
names. When a data frame is produced by one of the read.table()
functions, the default behavior is for R to create row names from the integers 1, 2,
and so on, unless the header has one fewer field than the data rows, in which
case R uses the first column's values as the row names. Remember that row
names need to be unique; if yours are not, you can force R to supply integer
row names by passing row.names = NULL.
You can also specify one of the columns to be used as row names explicitly, by
passing its name or number with the row.names argument. (This argument
can also be used with a vector of row names, but we rarely use that feature.) It
can be useful to specify row names when they represent something important
because the names can be used explicitly in subsetting statements.
6.1.4 An Example of When read.table() Fails
There are some data sets for which our initial attempts at using read.table()
just will not work because of one or more of the problems we have
described – or something else. Sometimes we correct these problems,
when we find them, by modifying the original data directly and documenting
that fact. More often, we can read the data into R by suitable choice of
arguments to read.table(), and sometimes read.table() just cannot be
used and we resort to the more primitive, but more flexible, function scan(),
which we describe later in the chapter.
In this example, we show a small sample of the sort of text we are often
supplied with and try to read it into R. The following text has been saved as a file
named addresses.csv in our working directory.
ID,LastName,Address,City,State
001,O'Higgins,48 Grant Rd.,Des Moines,IA
011,Macina,401 1st Ave., Apt 13G,New York,NY
242,Roeder,71 Quebec Ave.,E. Thetford,VT
146,Stephens,1234 Smythe St., #5,Detroit,MI
241,Ishikawa,986 OceanView Dr.,Pacific Grove,CA
Because the file’s name ends in csv, we will presume that the file uses commas
as separators (although this is not always the case), and that the file includes a
header row. In real life, it is a good idea to examine the file first, where possible,
using a text editor or file viewer.
Using the lessons learned from the previous section, we know to specify the
arguments quote = "", comment.char = "", and
stringsAsFactors = FALSE. Our first effort produces this error:
> read.table ("addresses.csv", header = TRUE, sep = ",",
quote = "", comment.char = "", stringsAsFactors = FALSE)
Error in scan(file = file, what = what, sep = sep, ...
line 1 did not have 6 elements
This error is telling us that the longest line encountered contained six elements
(fields), whereas at least one other – in this case, the header, line 1 – had fewer.
Either the header or some of the data is lacking a field – or has too many. Notice
that the error was issued by the scan() function, which is itself called by
read.table().
Using read.table()
Our next step in this process might be to examine the first row to make sure
this is a header row and to determine the number of fields. When we add the
nrows = 1 argument to read.table(), and set header to FALSE, we
extract the very first row in the file.
> read.table ("addresses.csv", header = FALSE, sep = ",",
quote = "", comment.char = "", stringsAsFactors = FALSE,
nrows = 1)
V1 V2 V3 V4 V5
1 ID LastName Address City State
R has returned a data frame with one row. Since no row or column names were
provided, R has added them (the 1 at the left and the V1 through V5 at the
top). The header contains five fields, so we expect every row of the data to have
five fields as well. We can determine the number of fields that R sees in each
row with the count.fields() command. Since this command produces
one number for every row in the file, we normally pass its output directly to
table(), but here we show it explicitly.
> count.fields ("addresses.csv", sep = ",", quote = "",
comment.char = "")
[1] 5 5 6 5 6 5
The fact that the second and fourth lines of the file (after the header) contain
six fields and the others do not is now visible. We can suspect that those rows
each contain an extra delimiter. Files with embedded delimiters are particularly
painful to handle. We extract the first problematic line with read.table()
by skipping the first two and reading only the third (using skip = 2 and
nrows = 1). R returns a data frame with six columns, as shown here.
> read.table ("addresses.csv", header = FALSE, sep = ",",
quote = "", comment.char = "", stringsAsFactors = FALSE,
nrows = 1, skip = 2)
V1 V2 V3 V4 V5 V6
1 11 Macina 401 1st Ave. Apt 13G New York NY
We are prepared to conclude that the problem with the file is that some of the
entries in the Address column contain embedded commas – but we would
look at the other problem rows to be sure.
When there are delimiters embedded in fields, one quick way to get the data
into R with read.table() is by passing the fill = TRUE argument. This
is intended to produce a data frame with the largest number of columns
necessary to hold any line. If it succeeds, and there are only a small number of
defective lines, the final column of the new data frame will be almost empty. The
rows where there are entries in the final column will help you figure out where
the problems in the input data are arising. Here, we show the results of using
the fill = TRUE argument, and assign the return value of read.table()
to an object named add.
> (add <- read.table ("addresses.csv", header = TRUE,
sep = ",", quote = "", comment = "",
stringsAsFactors = FALSE, fill = TRUE))
ID LastName Address City State
001 O'Higgins 48 Grant Rd. Des Moines IA
011 Macina 401 1st Ave. Apt 13G New York NY
242 Roeder 71 Quebec Ave. E. Thetford VT
146 Stephens 1234 Smythe St. #5 Detroit MI
241 Ishikawa 986 OceanView Dr. Pacific Grove CA
There are two things to notice here. First, R has produced row names from the
ID column since the header had one fewer entry than the longest row. This
also causes the column names to be shifted to the right. Had there been
duplicate entries in that column, read.table() would have failed and we would
have supplied row.names = NULL on our next effort. Second, the second
and fourth rows' addresses have been broken at the extra delimiter. This causes
those rows to extend over six columns, leaving them as the only ones with
entries in the rightmost column.
If all the problematic lines appear to be broken in the same way, we would
probably fix them directly in R. In this example, we would start by identifying
the broken lines, paste columns 2 and 3 to complete the address, and then move
columns 4 and 5 into positions 3 and 4. This code shows the steps we might take,
and what the add data frame looks like at this point.
and what the add data frame looks like at this point.
> fixers <- add$State != "" # logical vector
> add[fixers, 2] <- paste (add[fixers,2], add[fixers,3])
> add[fixers, 3:4] <- add[fixers, 4:5]
> add
ID LastName Address City State
001 O'Higgins 48 Grant Rd. Des Moines IA
011 Macina 401 1st Ave. Apt 13G New York NY NY
242 Roeder 71 Quebec Ave. E. Thetford VT
146 Stephens 1234 Smythe St. #5 Detroit MI MI
241 Ishikawa 986 OceanView Dr. Pacific Grove CA
All that remains is to adjust the column names, remove the rightmost column,
and insert the current row names as a column (replacing them, perhaps, with
integers). This code shows how this might be done.
# Save column names, then remove last column
> mycolnames <- colnames (add)
> add$State <- NULL
# Insert the ID column
> add <- data.frame (ID = rownames (add), add)
> colnames (add) <- mycolnames # now assign column names
> rownames (add) <- NULL # replace old row names
> rm (fixers, mycolnames) # clean up!
It was convenient to save the column names before performing the other
modifications and then to re-assign them at the end. We include this lengthy example
because these are the sorts of problems we face almost every time we read text
data into R. Note that in the last command we remove some temporary
variables. Although we have not been showing this in our code, it is something we
do regularly, to keep the workspace clean and reduce the risk of inadvertently
re-using an existing object.
Using scan()
Sometimes, there are multiple problems with an input data set – highly
variable numbers of fields, different separators used in different places, and
so on. A good general-purpose data input tool is scan(), which, with the
sep = "\n" argument, reads entire lines into an R character vector. By
default, scan() expects to encounter numbers, so in reading text we need to
pass the what = character(), or, for short, the what = "" argument.
Like read.table(), scan() has a host of arguments to let you handle all
sorts of text files. This example shows the output of using scan() on our
addresses.csv file.
addresses.csv file.
> (addscan <- scan ("addresses.csv", what = "", sep = "\n",
quote = "", comment.char = ""))
Read 6 items
[1] "ID,LastName,Address,City,State"
[2] "001,O'Higgins,48 Grant Rd.,Des Moines,IA"
[3] "011,Macina,401 1st Ave., Apt 13G,New York,NY"
[4] "242,Roeder,71 Quebec Ave.,E. Thetford,VT"
[5] "146,Stephens,1234 Smythe St., #5,Detroit,MI"
[6] "241,Ishikawa,986 OceanView Dr.,Pacific Grove,CA"
We can fix addscan directly by replacing the third comma by another
character – a semi-colon, say – in every row with six fields (that is, five commas).
This is less complicated than it might appear at first, and represents the sort of
data cleaning we do regularly. We start by identifying the problem rows, either
with count.fields() as before, or directly, using gregexpr() from
Section 4.4.5.
> commas <- gregexpr (",", addscan) # locate all commas
> length.5 <- lengths(commas) == 5 # identify long rows
> comma.be.gone <- sapply (commas[length.5],
function (x) x[3])
Each element of the commas list consists of a vector giving the locations of
the commas within a line. In this last command, we have extracted the third
element of each of these vectors. So comma.be.gone gives the location of
the third comma on those lines with a total of five commas. Now we replace
those commas by semicolons.
> substring (addscan[length.5], comma.be.gone,
comma.be.gone) <- ";"
> addscan
[1] "ID,LastName,Address,City,State"
[2] "001,O'Higgins,48 Grant Rd.,Des Moines,IA"
[3] "011,Macina,401 1st Ave.; Apt 13G,New York,NY"
[4] "242,Roeder,71 Quebec Ave.,E. Thetford,VT"
[5] "146,Stephens,1234 Smythe St.; #5,Detroit,MI"
[6] "241,Ishikawa,986 OceanView Dr.,Pacific Grove,CA"
Now that addscan is exactly as we want it to be, we have at least three choices.
First, we can write it back out to disk (using the write.table() function
we describe later) in preparation for re-reading. This approach uses extra disk
space, but it has the advantage of creating a clean data set for other users.
Second, we can pass the addscan vector back to read.table() using the text
argument, and it will be interpreted just as if the text had been read from a file.
This example shows the output from that call.
> read.table (text = addscan, header = TRUE, sep = ",",
quote = "", comment = "", stringsAsFactors = FALSE)
ID LastName Address City State
1 1 O'Higgins 48 Grant Rd. Des Moines IA
2 11 Macina 401 1st Ave.; Apt 13G New York NY
3 242 Roeder 71 Quebec Ave. E. Thetford VT
4 146 Stephens 1234 Smythe St.; #5 Detroit MI
5 241 Ishikawa 986 OceanView Dr. Pacific Grove CA
This approach is straightforward but may not be very efficient for very large
character vectors. Notice also that the ID column has been interpreted
as numeric, so that the leading zeros have been removed. We can correct
this by passing the colClasses argument. In this case, since ID is the
only column whose class needs to be specified explicitly, we would pass
colClasses = c(ID = "character").
A final method of handling an object like addscan is to use strsplit()
to create a list consisting of one character vector for each row, broken at its
commas. We can then use do.call() and rbind() to combine the elements
of the list, as described in Section 3.7.1. This approach is fast, and well suited
to large data objects, although it produces a character matrix that needs to be
converted into a data frame, with column names and classes that need to be set.
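Continuing with the repaired addscan vector, this last approach might look like the following sketch (add3 is a hypothetical name).
rows <- strsplit (addscan[-1], ",")       # one character vector per data line
add3 <- as.data.frame (do.call (rbind, rows), stringsAsFactors = FALSE)
colnames (add3) <- strsplit (addscan[1], ",")[[1]]   # the header was the first line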
6.1.5 Other Uses of the scan() Function
As we have seen, the scan() function is the most general way to read data
into R. In this section, we describe some other cases where its use might be
necessary.
Headers, Page Numbers, and Other Superfluous Text
One case where scan() is necessary is when the data set being read in was
formatted by some program that was expecting to produce printed material.
Often data like that will contain page headers and footers, and these need to
be detected and removed. When we encounter this, we generally use scan()
with sep = "\n" to read the data in as lines, then use a regular expression
(Section 4.4) to detect and remove the offending lines. For example, if a vector
old returned from scan() contains lines that start with Page:, a command
like new <- old[!grepl("^Page:", old)] will produce a new vector
omitting those lines. In this case, you have to examine the original data to
determine the format of the header and footer lines, and ensure that no “real” lines
are deleted inadvertently. A similar problem can arise when the original
program contains both data and also total and sub-total lines – these, too, will need
to be detected and deleted.
Input Records on Multiple Lines
It also happens sometimes that the headers occupy several lines (as with other
difficulties, this arises in data from spreadsheets). If only the headers, and not
the data, wrap around lines, the natural way to handle this is to omit the headers
using the skip argument; so skip = 3, for example, skips the first three rows
of the file. Then the headers can be added back in after the data frame is created.
When the individual records are broken across multiple lines, and the number
of lines is the same for every record, it is possible to persuade scan() to read
these data in pieces. This requires passing a list in the what argument – see
the help for scan() for the details. In fact, though, we usually scan() the
whole file in, as a series of lines, and then operate on the components. If every
input record takes up three lines, say, then we know that lines of the first type
are found in positions 1, 4, 7, and so on; we can generate this sequence with
seq(1, by = 3, to = n) where n is the number of lines read. (Remember
to account for the header if there is one.) Then lines of the second type are
found at locations one greater than those of the first type, and so on. Having
identified the lines that make up each type, we can then paste() the strings
of different types into a character vector with one element per original
observation. Finally, we can pass this vector to read.table() using the text
argument as in the previous example.
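As a sketch, suppose a hypothetical file three.txt holds records of exactly three lines each and no header:
lines <- scan ("three.txt", what = "", sep = "\n")
n <- length (lines)
first  <- lines[seq (1, by = 3, to = n)]   # lines 1, 4, 7, ...
second <- lines[seq (2, by = 3, to = n)]   # lines 2, 5, 8, ...
third  <- lines[seq (3, by = 3, to = n)]   # lines 3, 6, 9, ...
onepiece <- paste (first, second, third, sep = ",")
# now read.table (text = onepiece, sep = ",", ...) as in the previous example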
All of this may seem like a lot of work, but, as we have claimed throughout the
book, a huge proportion of our time as statisticians and data scientists is taken
up with tasks like this. The more efficiently we can take care of data issues, the
faster we can get to the modeling part.
6.1.6 Writing Delimited Files
The write.table() function acts to write a delimited file, just as
read.table() reads one. (And, just as there are the read.csv() and
read.csv2() analogs to read.table(), R also provides write.csv() and
write.csv2().) We normally pass write.table() a data frame, though
a matrix can be written as well, and we generally supply the delimiter with the
sep argument, since the default choice of a space is rarely a good one. A few
other arguments are useful as well. First, the resulting entries from character
and factor columns are quoted by default; we often turn this behavior off with
quote = FALSE, depending on what the recipient of the output is expecting.
Quoting becomes necessary, though, when character values might contain the
delimiter, or when it is important to retain leading zeros in identifiers that
look numeric (01, 02, etc.). Second, row names are written by default; we
rarely want these, so we generally specify row.names = FALSE. In contrast,
the default setting of the col.names argument, which is TRUE, usually is
what we want. The exception is when we plan to do a number of writes to
a single file. In that case, the first write will usually specify col.names =
TRUE and append = FALSE, and subsequent ones will specify col.names
= FALSE and append = TRUE.
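A typical call might look like this sketch, which writes the add data frame from the earlier example to a hypothetical file name:
write.table (add, "addresses-clean.csv", sep = ",",
             quote = FALSE, row.names = FALSE)   # col.names = TRUE is the default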
6.1.7 Reading and Writing Fixed-Width Files
One alternative to the delimited file is the fixed-width file. In this approach,
each field occupies the same positions on each line. For example, an account
number might take up characters 1–9, the customer's last name characters
10–24, and so on. This sort of output was more common in the past because it
is the preferred format of the COBOL language, which is no longer widely used.
Depending on the layout, it is possible for the fixed-width approach to use much
less space than the delimited one. For example, a comma-separated file with a
million rows and a thousand columns will have roughly a billion commas in it.
(On the other hand, each record in our example will have 15 characters for a
customer's last name, so the design will waste space for customers with names
such as Lee, and truncate those with more than 15 characters.) Moreover,
random access is possible in a fixed-width file; if the customer's last name starts
in position 10, and each line has 351 characters, then the millionth customer's
name must start at character 350,999,659 (999,999 complete lines of 351
characters, plus the offset of 10). A program that needs that name
can “seek” directly to the relevant character, whereas with a delimited file it
would have to read a million lines of different lengths. Despite these advantages,
fixed-width files seem to arise nowadays only from older systems.
The R function to read these files is read.fwf(). It has many of the same
arguments as read.table(). The most important additional argument is
widths, a vector of integers giving the lengths of the fields. In our example
above, the first two elements of widths would be 9 and 15, those being the
lengths of the account number and the customer's name.
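A sketch of reading the account layout described above; the file and column names are hypothetical (read.fwf() passes extra arguments such as col.names on to read.table()).
accts <- read.fwf ("accounts.txt", widths = c(9, 15),
                   col.names = c("Account", "LastName"),
                   stringsAsFactors = FALSE)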
It is rare to have to write a fixed-width file – indeed, we have never had to. The
gdata package (Warnes et al., 2015) has a routine aptly named write.fwf()
that appears to do this job. We have not tested it.
6.1.8 A Note on End-of-Line Characters
For historical reasons, Windows text files use two characters to denote the
end of a line – carriage return (\r in R) and line feed (\n), whereas OS X
and Linux use only the \n character (and older Apple computers used only
\r). This can cause problems when passing text files between systems, but R
is generally forgiving. The scan() function recognizes any of these line
endings, and write.table() permits some flexibility in the eol (“end of line”)
argument. Mac and Linux users can also use gsub() to remove any \r
characters. Although R is more forgiving than some other applications, you should
be aware that text files' endings can differ from platform to platform.
6.2 Reading Large, Non-Tabular, or Non-ASCII Data
Most of the text data we get into R comes via scan() or read.table(). But
there are at least two cases where we need more than these techniques alone
can provide. First, some files – and we expect this to be ever more common
in the future – are simply too big to fit into the main memory of the
computer. We briefly discussed some ways around this limitation in Section 3.8.
If the entire file is needed, then the techniques of that section may be
necessary. Often, though, the file is huge but the subset of records that we need is
not. In that case, we need the ability to filter the data set, without first reading
it all into memory.
The second case involves files that are not in tabular formats suitable for
read.table(). These might be text in a non-tabular format, such as the
popular JSON and XML formats we discuss in this chapter; free-form text such
as log files and other documents; or binary data such as images and sounds.
For all of these cases, R provides the ability to perform basic file operations
on disk files. By “basic file operations” we mean opening a file to get access to
its contents, reading it bit by bit so that we can operate on the part that was
read, writing results to the file (usually we will read from one file and write to
another), seeking (i.e., resetting our current position in the file), and then
closing the file. In fact, in most circumstances, we open two files, one for input and
another for output. We read the input file sequentially (i.e., starting at the
beginning and all the way through); when we encounter records of interest we write
them, or something about them, to the output file; and then when the input
file has been read we close both files. These basic file operations are most often
performed on text files divided into discrete lines, but they also apply to files
with binary or other data. In this section, we describe these basic file operations
and how they might be used within R.
6.2.1 Opening and Closing Files
The first step in file handling is to open the file. We must first know whether
the file contains text (which might be UTF-8) or whether the data is binary (as
an image, video, or music file). This is not always easy to ascertain inside R,
although the name of the file is often informative. Most often the file format is
supplied to us by the client.
When we open a file in R, we are really creating a connection, which is a
method of communication not only to and from files but also to and from other
devices and processes. Files are very much the most common sort of
connections we use, so in these sections we will focus on tools for handling files.
A file is opened with the file() function and its relatives. These related
functions, such as gzfile(), provide the ability to open files in some of the
popular zipped formats, like gzip. The help for the file() function shows the
formats supported. (There is also an open() command, which is more suited
to non-file connections.)
The file functions return a connection object that stores all the information
that R needs to have about the connection, and we use this connection object in
subsequent calls to the functions that will read from, write to, seek in, or close
files. When we open a file, we pass an argument called open that describes
whether the file is to be read from, written to, or appended to (open = "r",
"w" or "a", respectively). If a + is added, the file is opened for both reading and
writing (but open = "w+" truncates the file first), and if a b or t is added the
file is explicitly opened in binary or text mode. So, for example, open = "rb"
opens a binary file for reading, and open = "a+b" opens a binary file for
reading and appending.
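For example, a sketch of appending one line of text to a hypothetical log file:
con <- file ("run.log", open = "a")   # open for appending text
writeLines ("data read completed", con)
close (con)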
Once you have finished with a connection, you should close it with the
close() function. This is a good practice even though R will close it for you
when your session terminates.
6.2.2 Reading and Writing Lines
Once the connection object is available, we pass it to the reading and writing
functions. The readLines() function reads text lines – that is, pieces of text
terminated by the new-line character. We specify the number of lines to read
with the n argument (n = -1 meaning to read all lines). Rather than read one
line at a time, it feels as if it should be faster to read a large “chunk” of lines at
once, process them, and then read another chunk. In reality, though, the
situation is complicated by the way the operating system itself prepares “caches” of
disk files in main memory. When programming, you have to balance the
(possible) gain in speed of reading a chunk of lines at once with the simplicity of
code to handle just one line at a time.
As an example, consider using readLines() to read lines from the
addresses.csv file of the example in the last section. (readLines()
produces the same result as scan() with what = "" and sep = "\n".)
Operating on a file, readLines() opens the file, reads as many lines as requested
with the n argument and then closes the file. The reading always begins at
line 1. In this example, we read one line at a time from the addresses.csv
file.
> readLines ("addresses.csv", n = 1)
[1] "ID,LastName,Address,City,State"
> readLines ("addresses.csv", n = 1)
[1] "ID,LastName,Address,City,State"
Here, the same line is returned from both calls because readLines() opens
and closes the file each time. If you want to read the data in pieces, you will need
to open a connection and pass the connection to readLines(). Operating on
a connection, readLines() reads the number of lines requested and keeps
track of its location in the file for future calls. Here, we read the first few lines
of addresses.csv via a connection.
> con <- file ("addresses.csv", open = "r")
> readLines (con, n = 2)
[1] "ID,LastName,Address,City,State"
[2] "001,O'Higgins,48 Grant Rd.,Des Moines,IA"
> readLines (con, n = 2)
[1] "011,Macina,401 1st Ave., Apt 13G,New York,NY"
[2] "242,Roeder,71 Quebec Ave.,E. Thetford,VT"
> close (con)
The readLines() and scan() functions applied to a connection provide a
natural way to read very large data files piece by piece.
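A sketch of that pattern: copy only the lines containing a pattern of interest from a huge, hypothetical input file to an output file, 10,000 lines at a time.
infile  <- file ("huge.csv", open = "r")
outfile <- file ("keepers.csv", open = "w")
repeat {
    chunk <- readLines (infile, n = 10000)
    if (length (chunk) == 0) break          # end of file reached
    writeLines (chunk[grepl ("CA", chunk)], outfile)
}
close (infile); close (outfile)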
Analogously to readLines(), writeLines() writes a vector of characters
out to a file (adding in the new-line separator). Again, passing a file name
causes the file to be opened, written, and closed – so in order to append data to
a file, you will want to open a connection and pass the connection to
writeLines().
Besides opening, closing, reading, and writing files, there are two other
basic file operations to know about. When a file is opened for read or write,
R maintains a “pointer” that describes the current location in the file. (Files
opened for both read and write have two of these.) The seek() function
with its default arguments returns the current location in the file (in terms of
number of bytes from the start of the file). Passing the where argument lets
you re-set the file pointer to a position of your choice. This is useful if you need
to jump to a prespecified position. However, the help tells us that “use of seek
on Windows is discouraged.”
The final operation, flush(), can be used after a file output command to
ensure that the write operation gets committed to disk right away. Without this
operation, the output from the write operations may be cached – that is, saved
up by the operating system for a convenient time. The flush() command will
be useful when you are monitoring a program's output, or when it is important
to perform a write right away to protect against a system crash.
6.2.3 Reading and Writing UTF-8 and Other Encodings
We described UTF-8 and other encodings in Section 4.5, and you have to expect
that you will be called on to read some non-ASCII text soon and ever more
frequently in the future. Many of the text-reading functions we describe in this
chapter, such as read.table() and scan(), have two arguments in place to
handle UTF-8-related chores. The fileEncoding argument lets you specify
the encoding in the file that you are reading in, while the encoding argument
specifies the encoding of the R object that contains the data read in. Remember,
though, as we saw in Section 4.5.3, that UTF-8 inside a data frame will often be
displayed with the <U+0000>-type notation.
When reading data line by line, it is best to ascertain in advance whether
the file contains UTF-8. Then it can be opened by passing encoding =
"UTF-8" to the file() command, and read using readLines(), again
with encoding = "UTF-8". There is no cost to passing these options
to a file containing simple ASCII text, since ASCII is a subset of UTF-8.
However, if the file was created with latin1 or another different encoding,
certain characters will be handled incorrectly with the UTF-8 options.
Writing UTF-8 is best accomplished by first creating a file connection with
the file() function with open = "w" and encoding = "UTF-8". Then
text can be written to the connection using writeLines() with the
useBytes = TRUE argument. (We have not always had success using write
.table() with UTF-8 data.) By default the useBytes argument is FALSE,
which tells R to convert encoded strings back to the native encoding before
writing. For non-UTF-8 locales, particularly in Windows, this conversion can
lead to unexpected results. Be aware that operating systems and locales treat
UTF-8 differently – in some cases, inconsistently.
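Putting those pieces together, a sketch of writing a hypothetical character vector utext as UTF-8:
con <- file ("out-utf8.txt", open = "w", encoding = "UTF-8")
writeLines (utext, con, useBytes = TRUE)   # utext holds UTF-8 text
close (con)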
6.2.4 The Null Character
The “null” is the character whose hexadecimal value is 00, sometimes denoted
in R by 0x00 using R's hexadecimal notation (so this is not the same as R's NULL
value). Null characters should generally not be found in regular text and in fact
are not permitted in R. However, the null is used as an end-of-string marker in
the C language, and you have to expect to encounter some text with nulls in it.
Moreover, we occasionally encounter text files with nulls embedded in them,
for whatever reason. By default, scan() and read.table() stop reading
when a null character is encountered and resume with the next field. Setting the
skipNul argument to TRUE allows scan() and read.table() to skip over
nulls. For delimited data, this is generally a safe choice, although you will not
see any warning that nulls were detected and skipped. If there are “intentional”
nulls in the text, and you need to detect and keep them, you will need to read
the file as binary data using readBin(), which we discuss below.
6.2.5 Binary Data
The readBin() function reads binary data. We rarely use this to read actual
binary data like images because data like that is rarely part of a data cleaning
exercise. However, sometimes the data is so messy that this is one of the
few approaches that seems to work (see our next example). It is also one way to
read text data with embedded nulls if you want to preserve all the characters in
the original text. This arises when a fixed-width file has embedded nulls.
The readBin() function requires that you specify the number of bytes
to be read, since it does not recognize end-of-line characters. The return
value of readBin() is a vector whose class is raw – and indeed it looks
on the screen like a stream of hexadecimal bytes. There are only a couple of
things we can do with raw data in R (unless we have some custom plan for
it). We can write it back out, using writeBin(), or we can convert it into
data in one of the R formats. If we know that the raw data represents text,
we can convert it with rawToChar() – unless there are embedded nulls.
If our vector myvec is raw, then nulls can be replaced with a command like
myvec[myvec == 0x00] <- as.raw(0x20). This will replace all the
nulls by the character whose value is hexadecimal 0x20 – the space. So if
necessary we might read in such a file in chunks of arbitrary size, convert the
nulls to spaces, then write the data back out as binary or convert to text. If all
has gone well, the resulting file should then be able to be read by read.fwf()
or readLines().
To set up the next example we construct a character vector representing some
entries in a fixed-width file. We embed a null in one of the entries and then write
the resulting vector to a disk file called nully.txt.
Each entry has 16 characters, counting the new-line appended for conve-
nience. The strings contain a 3-character id (e.g., 001), a 10-character name
(Jenkins, including three trailing spaces), and a 2-character state.
> thg <- c("001Jenkins   MI\n", "002O'FlahertyIA\n",
           "003Lee       HI\n")
Now we replace the apostrophe in the second string with a null character.
Since nulls are illegal in R, some extra steps are required to make this mod-
ification. We convert the string to a raw vector named second.bytes
using the charToRaw() function. This produces the hexadecimal values
30 30 32 representing “002” and so on. We then replace the fifth element of
second.bytes with the null character, indicated by as.raw(0x00).
> (second.bytes <- charToRaw (thg[2]))
[1] 30 30 32 4f 27 46 6c 61 68 65 72 74 79 49 41 0a
> second.bytes[5] <- as.raw (0x00)
> second.bytes
[1] 30 30 32 4f 00 46 6c 61 68 65 72 74 79 49 41 0a
> rawToChar (second.bytes)
Error in rawToChar(second.bytes) :
embedded nul in string: '002O\0FlahertyIA\n'
The final command shows that after the modification, this vector cannot be
converted back to character because it has the embedded null.
Now let us write that vector to a disk file. In order to write the
second.bytes item we will need to use writeBin(), so we will use that for
the other strings as well. In this code, we write the three strings individually to
a connection opened for binary writing.
> con <- file ("nully.txt", "wb")
> writeBin (charToRaw (thg[1]), con)
> writeBin (second.bytes, con)
> writeBin (charToRaw (thg[3]), con)
> close (con)
The nully.txt file is now complete; it consists of three text lines of which the
second has an embedded null. When we read that into R, read.fwf() omits
the part of the second line following the null character and emits a warning, as
we show here.
> read.fwf ("nully.txt", c(3, 10, 2))
V1 V2 V3
1 1 Jenkins MI
2 2 O <NA>
3 3 Lee HI
Warning message:
In readLines(file, n = thisblock) :
line 2 appears to contain an embedded nul
Despite indications in the help file, read.fwf() does not respect the
skipNul argument. When we use scan(), we have two unappetizing choices. If
skipNul is FALSE, scan() will stop reading the second line after the null
just as read.fwf() does. But if skipNul is TRUE, scan() will produce
only 14 characters for the second line – and the fields will no longer line up. The
following example shows the output from scan() when applied to this file.
> scan ("nully.txt", what="", sep="\n", skipNul = TRUE)
Read 3 items
[1] "001Jenkins MI"
[2] "002OFlahertyIA"
[3] "003Lee HI"
The removal of the null has led to the second string being one character too
short. For example, its “state” field does not line up with the others’.
The best way around this may be by reading the binary data directly using
readBin(). Here, we show how that might be done, exploiting the fact that
each line is known to contain 16 characters. We open the file for binary reading
by passing the open = "rb" argument to the file() function, read all 48
characters, and convert null characters to space (the character with hex value
0x20).
> con <- file ("nully.txt", "rb")
> lns <- readBin (con, what="raw", n = 48)
> lns[lns == as.raw (0x00)] <- as.raw (0x20)
> rawToChar (lns)
[1] "001Jenkins MI\n002O FlahertyIA\n003Lee HI\n"
> (lns <- strsplit (rawToChar (lns), "\n")[[1]])
[1] "001Jenkins MI"
[2] "002O FlahertyIA"
[3] "003Lee HI"
The final command splits the string at the new-line values to produce a vector
of three equal-length strings. Given the starting and ending locations of
the fields, we can then produce a matrix by breaking the strings into their
component pieces. This example shows how that might be done.
> start <- c(1, 4, 14)
> end <- c(3, 13, 15)
> sapply (1:3,
function (i) substring (lns, start[i], end[i]))
[,1] [,2] [,3]
[1,] "001" "Jenkins " "MI"
[2,] "002" "O Flaherty" "IA"
[3,] "003" "Lee " "HI"
This example is complicated, but handling data with embedded nulls or other
problematic characters is a difficult problem that really does arise in practice.
Sometimes, reading the data as binary is the only road available.
6.2.6 Reading Problem Files in Action
In this section, we show a function we wrote to handle a real-life problem read-
ing text data. We were given a series of large XML files. We discuss XML in
Section 6.5.3, but for this purpose think of XML as text. For unknown reasons,
these files contained embedded null characters, and also a small number of
“control” characters (non-ASCII characters with hexadecimal values of 0x80
and above). In general, XML may contain UTF-8 and other non-ASCII
characters, but never nulls – and these control characters were known to be
erroneous.
Each one of these files took up around 600 MB and consisted of one text
line with no end-of-line character anywhere. Regular text tools run into prob-
lems handling lines of this size, as does R. Using scan() or readLines(),
R would try to read one of these files into memory as a character vector with
one (enormous) entry. Officially, a character string in R can contain just about
2 GB, but we had no success in reading this object into R in order to break it
into manageable pieces.
Instead our approach was to read each file as raw data in pieces. When we
converted the pieces to character, we discovered that the XML files contained
tags such as <ROW> and </ROW>, which supplied a natural place into which
to insert new-line characters to break the text into lines. It was also during this
investigation that we discovered the embedded null and control characters. We
wrote a small function to read the entire file bit by bit, remove the offending null
and control characters, add the new-lines at the appropriate places, and write
the resulting bits back out. The output from this function was a text file that
could be read handily, one line at a time, in groups of lines, or using an XML
reading function of the sort we describe in Section 6.5.3.
We give the details of this function in three pieces. In the first piece of
the function, we open the input file as binary and the output file as text.
The on.exit() functions (Section 5.1.3) ensure that the files are properly
closed even if the function aborts; without the add = TRUE argument the
expression in the second call would replace the action specified in the first. The
chunk argument describes the number of characters we will read at one time.
function (xml.in, xml.out, chunk = 10000)
{
# Open input file as read only, binary
fi <- file (xml.in, open = "rb")
# Open output file for write only, text
fo <- file (xml.out, open = "wt")
on.exit (close (fi))
on.exit (close (fo), add = TRUE)
In the second piece of the function, we read chunk characters as raw data. The
while(1) starts an infinite loop that is ended when the function encounters
break. If no data is read, the input file has been used up (we might say
“exhausted”). If fewer than chunk characters are read, we have reached the
end of the file – but we still have to process the final chunk. We find the
unwanted characters – the null, whose hexadecimal value is 0x00, and the
non-ASCII, whose values are 0x80 and above – and replace them by spaces.
while (1) {# loop until "break"
# Read text. If none is returned, the file is empty.
txt.raw <- readBin (fi, "raw", n = chunk) # the maximum
if (length (txt.raw) == 0) break
# Replace bytes that are 0x00 or >= 0x80 with 0x20 (space)
txt.raw[txt.raw == as.raw (0x00) |
txt.raw >= as.raw (0x80)] <- as.raw (0x20)
txt <- rawToChar (txt.raw)
At the end of this last operation, txt contains the character values from the
input, except with spaces where nulls and other non-ASCII characters used
to be. In the rest of the function, we add the new-line characters after every
instance of </ROW> (using the gsub() function from Section 4.4.6) and write
the text out. We cannot use writeLines() because we do not want to insert
a new-line character at the end of the 10,000-character chunk. Instead, we use
the writeChar() function. By default, this adds a null character after its
end-of-string character, but when we set eos = NULL the string is written
out with no terminator. (We could alternatively have opened the file to write
binary and used writeBin() to write the raw data. Writing text data with
writeBin() produces a null character in the output.) Finally, we check to see
whether the number of characters read is less than the size of chunk. If so, we
must have exhausted the input file, and we break, relying on on.exit() to
close the files. (We use the number of characters read, rather than written,
because that count does not include the new-line characters just inserted.)
# Replace </ROW> with </ROW>\n wherever the former appears
txt <- gsub ("</ROW>", "</ROW>\n", txt)
# Write output. If txt is short, input is exhausted; quit.
writeChar (txt, fo, eos=NULL)
if (length (txt.raw) < chunk) break
}# end "while"
}
This example demonstrates the sorts of tasks that we, as data scientists, are
called on to perform in order to read data as part of a data cleaning exercise. As
a data scientist, you will need to be prepared to handle this sort of messy data.
6.3 Reading Data From Relational Databases
A lot of data is stored in relational databases. A database has two parts: first,
a set of tables of data together with rules that describe their content and keys
that link the tables together. The second part is software to access the data in
customized ways. The software is often called a “relational database manage-
ment system,” and the acronyms RDBMS or just DBMS are often used. Popular
examples of these programs include Access and SQL Server, from the Microsoft
Corporation; Oracle and the open-source MySQL, from the Oracle Corpora-
tion; and the open-source PostgreSQL and SQLite. All of these use something
called the “Structured Query Language,” or SQL, a (mostly) standardized lan-
guage designed specifically for interacting with database management systems.
In this section, we discuss how to connect to a database and extract its data into
R for cleaning and management.
Remember that database programs are specifically designed and engineered
to hold and manipulate large amounts of data. So it is almost always a good
idea to let the database do the work of filtering and merging data sets, rather
than reading all the data into R and operating on it there. Of course, when the
data tables in question are small, it doesn’t matter much one way or the other
where the work is done. But where the data is large, we recommend using the
database as much as possible – which means learning at least a little SQL.
6.3.1 Connecting to the Database Server
The first step in using a database is connecting to it. With some exceptions
discussed in what follows, a database system will run as a “server” in a pro-
cess – that is, a running program – located either on your computer, or on
another computer on your network. Some person or organization will need to
be in charge of that program (starting it, stopping it, updating the data and per-
missions, etc.). That person (the database administrator) will know the name
of the machine on which the server is running, and any user/password infor-
mation you will need to access data in the system. In order to connect your R
session to the database server, you will need a driver, which is a piece of soft-
ware that lets the operating system connect to the database. In some cases,
drivers will already be present on your computer, but in others, you will need
to acquire and install one yourself. Your database administrator will be able to
tell you what driver to use and how to install it. The first time you connect to
the database you will provide a “data source name,” or DSN, that will be used
to refer to the database on subsequent occasions.
Open Database Connectivity
The Open Database Connectivity (ODBC) protocol is an effort to make
different database software appear the same to clients like R. Just about all
databases support ODBC, and the R package RODBC (Ripley and Lapsley,
2015) provides an interface to ODBC-compliant databases. (Some databases,
such as the well-known Oracle one, support additional, specific packages that
may be more efficient than RODBC.) If you have a DSN named “source” in
place, for example, the command odbcConnect("source") will make the
connection; additional arguments let you specify a username (argument uid)
and password (pwd) if required. Once the connection is made, the function
returns an object (a “handle”) that holds the details about the connection.
This handle is then passed to all the other functions that communicate with
the database. In this example, we show how we might connect to a database
via ODBC, then use the sqlTables() command to list the names of all the
tables in the database.
han <- odbcConnect (dsn = "source", uid = "us", pwd = "abc")
sqlTables (han) # list all table names
Generally, the functions whose names start with odbc are the lower-level ones,
and the ones whose names start with sql are easier to use with data frames.
Once you know the names of the tables in the database, the sqlColumns()
function will return some information about the columns in a specific table,
such as their names and types.
While the connection is open, SQL commands (see the following sections)
can be issued directly through the sqlQuery() function. Once the connec-
tion han is no longer needed, it can be closed with either the close(han) or
the odbcClose(han) commands.
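For example, we might inspect the columns of a table and then tidy up; the
table name here is an assumption.
cols <- sqlColumns (han, "accounts")
cols[, c("COLUMN_NAME", "TYPE_NAME")]   # column names and types
odbcClose (han)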
6.3.2 Introduction to SQL
All relational databases use SQL to access and manage data. SQL is a big sub-
ject, and there are many books and web sites devoted to teaching it. Inevitably,
perhaps, different databases can have slight differences in the SQL and data
types they support. In this section, we describe the basic SQL commands that
you might need to extract data from databases into R. These commands will
normally be passed to the database through one of the ODBC functions, such
as sqlQuery(), as a character string – see the following few examples. SQL
commands are not case-sensitive; table and column names might be, depending
on the database. Here, we follow one common convention, where we put SQL
keywords in upper-case letters and table and column names in lower-case.
The SELECT Command
The most important SQL command is SELECT, which is the command used
to create a data frame in R from a table in the database. Suppose that we
know, from our earlier call to sqlTables(), that a particular table is called
accounts. Then the SQL command SELECT * FROM accounts will
return the entire table. We would normally specify this query as a character
string passed as an argument to sqlQuery(). This example shows how we
can acquire an entire table, presuming that the handle han created above is
still valid.
acc <- sqlQuery (han, "SELECT * FROM accounts")
Here, the asterisk asks for all columns, so after this operation completes, the
new data frame acc will have all the data from the database table accounts.
Often we will want only a subset of rows and columns. We can use the
sqlColumns() function to learn the names of the columns in the table;
we can then supply the desired names explicitly in the SELECT statement.
The SELECT command has many other possible additions that control the
selection of subsets of rows. For example, the next command will return only
the three named columns, and only the rows for which accyear was at
least 2016.
recent <- sqlQuery (han, "SELECT accno, accyear, accamt
FROM accounts WHERE accyear >= 2016",
stringsAsFactors = FALSE)
The sqlQuery() function supports a stringsAsFactors argument but
not the colClasses one. In addition to returning the data itself, the database
can perform simple arithmetic operations. For example, nn <- sqlQuery
(han, "SELECT COUNT(*) FROM accounts") gives the number of
rows in the table. Other operators compute maxima or minima, combine
columns arithmetically, compute logarithms, aggregate data into groups, and
so on.
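As a hypothetical illustration of grouping, this query counts the rows and
totals the amounts in accounts, separately for each year.
by.year <- sqlQuery (han, "SELECT accyear, COUNT(*) AS n,
    SUM(accamt) AS total FROM accounts GROUP BY accyear")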
Joining Tables
Besides extracting a table, or part of a table, the most important thing a database
can do for us is to match up two tables, according to the value of a column in
each one, and return the results of the match. In SQL, we call this a “join”; in R,
we call it a “merge,” based on the merge() function that does the same thing
for data frames (see Section 3.7.2). So, for example, suppose that in our database
table accounts has a column called accno and table payment has a column
called acct. Suppose each of these two columns contains account numbers
in the same format. We want to produce a new table with all the columns of
accounts and all the columns of payment, and all the rows for which the
two account numbers match. This command shows how SQL can be used to
join the two tables.
matchers <- sqlQuery (han, "SELECT * FROM accounts, payment
WHERE accounts.accno = payment.acct")
The database is in charge of sorting the tables to make the account numbers
line up. Typically, we expect the result of this command to have as many rows
as there are account numbers that are common to the two tables. (This can
be different, though, if there are duplicated account numbers.) This so-called
“inner join” is just one way to join tables. The “left join” produces a new table
with one row for each entry in accounts (again, if there are no duplicates in
the index used to do the join). Entries in accounts with no match in payment
are returned, but of course there are missing values for those rows’ entries in
the columns from payment. This command shows how to produce a left join
between the tables in this example.
matchers <- sqlQuery (han, "SELECT * FROM accounts
LEFT JOIN payment ON accounts.accno = payment.acct")
“Right” and “outer” joins (Section 3.7.2) are constructed similarly.
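For example, a right join might be requested as follows, although support for
the RIGHT JOIN and outer-join syntax varies across databases, so check the
documentation of yours.
matchers <- sqlQuery (han, "SELECT * FROM accounts
    RIGHT JOIN payment ON accounts.accno = payment.acct")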
Getting Results in Pieces
The sqlQuery() command actually performs two tasks. First, it sends the
query to the database, and then it fetches the results. For really big tables, it
makes sense to separate these tasks; the first query starts the retrieval pro-
cess, and then subsequent calls retrieve rows chunk by chunk. When we want a
single table, the sqlFetch() command will get the first batch, with the num-
ber of rows given by the max argument; subsequent calls should be made to
sqlFetchMore(), again using the max argument. When executing a more
complicated query, a call to sqlQuery() with max can be followed by a call
to sqlGetResults(). (The argument rows_at_time is an instruction to
the database, which does not affect the number of rows returned.)
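A sketch of the piece-by-piece approach for a single table might look like this;
the table name and chunk size are assumptions.
acc <- sqlFetch (han, "accounts", max = 10000)
repeat {
    nxt <- sqlFetchMore (han, max = 10000)
    if (!is.data.frame (nxt)) break   # no more rows to fetch
    acc <- rbind (acc, nxt)
}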
Other SQL Commands
There are many other SQL commands, but most of them (e.g., INSERT,
UPDATE, DELETE, and CREATE TABLE) are intended for data management
and will normally be used by the database administrator rather than by the
users of data. There are two more SQL commands other than SELECT that
every data handler should know. One is CREATE TEMPORARY TABLE, which,
as its name suggests, creates a temporary table that will be deleted when
the SQL connection is closed. (Whether you have permission to create such
a table is decided by the database administrator.) In a recent project, for
example, we needed to extract rows for a particular set of keys from a number
of large tables. It was convenient to create a temporary table and insert those
keys into it (which we did via the sqlSave() function). We could then
use sqlQuery() to construct an inner join that returned only the rows
corresponding to keys in the temporary table. In this way, we allowed the
database, rather than R, to be the data manager.
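A sketch of that approach might look like the following; the table name,
column type, and vector of keys are all assumptions.
my.keys <- c(1001, 1008, 1044)   # hypothetical account numbers of interest
sqlQuery (han, "CREATE TEMPORARY TABLE wanted (accno INTEGER)")
sqlSave (han, data.frame (accno = my.keys), tablename = "wanted",
         append = TRUE, rownames = FALSE)
subs <- sqlQuery (han, "SELECT accounts.* FROM accounts, wanted
    WHERE accounts.accno = wanted.accno")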
The second command worth noting is EXPLAIN. This command does not
perform any action; instead, it causes the database to describe what steps it
would have taken, had the EXPLAIN not been specified. This can be useful
when designing a query against a very large database, or one that will return
a very big result set. If there are two different approaches, EXPLAIN might
give information as to which is more efficient. Different databases approach
this in different ways, returning different information, and indeed some do not
support EXPLAIN at all, so consult your database’s documentation if you want
to use this feature.
Serverless Databases
Microsoft Access and SQLite are two “serverless” databases. With these prod-
ucts, the entire database, together with details of table layouts and so on, is
stored in one large file. Clients such as R treat the file as a database, acting as
their own server. This makes these two databases particularly well suited to
smaller applications. For connections to Access, you can either set the Access
file up as a DSN, or use the Access-specific functions like odbcConnectAc-
cess() in the RODBC package to connect to the file. The RSQLite package
(Wickham et al., 2014) connects R to SQLite databases; it looks much like the
RODBC package, but the function names are slightly different. In either case,
SQL commands can then be used to extract data.
Our extended exercise in Chapter 8 uses the RSQLite package, so here we
present a few of the functions you will need to communicate with RSQLite
from R. These functions include the following (a short session sketch follows
the list):
dbConnect() initializes an RSQLite connection. We would normally
start a session with a command like han <- dbConnect (SQLite(),
dbname = <fname>), where <fname> is the name of the file containing
the data. This command creates a handle named han, which will be passed
to the other SQLite functions.
dbListTables() lists the tables in the database. In our example, the com-
mand dbListTables(han) would produce a character vector with the
names of the SQLite tables.
dbListFields() lists the fields (columns) in a table.
dbGetQuery() executes a query and captures the returned data, analo-
gously to sqlQuery() in the RODBC package.
dbSendQuery() and dbFetch() create a query and then receive data
from it. These are similar in use to sqlQuery() followed by sqlGetRe-
sults(), except that dbSendQuery() returns no data at all; it only pre-
pares the database to return data in subsequent calls to dbFetch().
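Here is the session sketch promised above; the file name, table name, and
chunk size are assumptions.
library (RSQLite)
han <- dbConnect (SQLite(), dbname = "accounts.sqlite")
dbListTables (han)                # names of the tables in the file
dbListFields (han, "accounts")    # columns of one table
acc <- dbGetQuery (han, "SELECT * FROM accounts WHERE accyear >= 2016")
res <- dbSendQuery (han, "SELECT * FROM accounts")
while (!dbHasCompleted (res)) {
    chunk <- dbFetch (res, n = 5000)   # process each chunk here
}
dbClearResult (res)
dbDisconnect (han)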
Other (NoSQL) Databases
Recent years have seen the growth of the so-called “NoSQL” databases.
Sometimes, these actually support SQL-type commands; their “NoSQL”
nature comes from the way transactions are handled within the database.
Others have a different model for storing and extracting data. For example,
the well-known MongoDB system uses a version of JSON (see Section 6.5.3).
Most of these databases can be reached from R, but you will need
to find the right package for each one.
We can only barely scratch the surface of SQL, and of interacting with
databases, here. If you are going to use databases regularly, it will definitely be
worth your while to learn more. Remember that data-management tasks such
as subsetting and merging will almost always be more efficient in the database
than in R.
6.4 Handling Large Numbers of Input Files
R provides a set of tools that allow us to interact with the computer’s operating
system. These allow us to perform tasks such as creating and removing directo-
ries and listing files that match patterns, and these tasks typically form part of
any data cleaning project. It is particularly important to be able to handle files
and directories automatically when there are hundreds or thousands of them.
For example, we sometimes receive data in the form of a set of directories, each
filled with zipped files. We want to avoid unzipping these files manually, and
instead use R to unzip them all at once.
There are at least two ways to have R interact with the operating system.
First, R has a set of built-in commands that perform file- and directory-related
tasks. These include list.files(), which allows you to list only the files
whose names match a regular expression (Section 4.4), unzip() for unzip-
ping zip files, and a set of functions whose names start with file – among
them, file.copy(), file.remove(), and file.rename().
The second way to interact with the operating system is by calling system
utilities directly through the system() function. This method may allow more
flexibility, but will normally be less general, since the set of utilities available
will differ between operating systems. Using this approach will make your
efforts more difficult to reproduce.
Besides manipulating existing files, R also provides tools for downloading,
particularly the download.file() function. We talk more about acquiring
data from the Internet in Section 6.5.3. The ability to download and then
access a file entirely within R contributes to the reproducibility of your code
and solutions.
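For example, a file might be downloaded and examined entirely from within R;
the URL here is an assumption.
download.file ("https://www.example.com/archive.zip", "archive.zip",
               mode = "wb")   # "wb" preserves binary content on Windows
unzip ("archive.zip", list = TRUE)   # list the contents without extracting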
Example
As an example, consider a data set consisting of a directory containing 12
sub-directories of our working directory. Suppose that the sub-directories
have names such as 2016.Jan and 2016.Feb. For this example, each
sub-directory contains a few hundred comma-separated values files, each file
individually zipped. Each file has the same named columns in the same order.
The total size of all the files is not enough to overwhelm R, so the goal is to read
all of the files into one R data frame, with the resulting data frame containing
all of the original columns, plus two more columns giving the month and year
associated with each row.
This is only an example. In practice, the files might be huge or in an odd for-
mat. They might have been prepared with another tool like gzip or tar. As a data
scientist, you have to expect to receive data in whatever way the supplier can
imagine.
In this example, we would use unz() to open each of the zip files. To do
this, we need to know not only the name of the zip file but also the original file’s
name. If the name or names of the file or files inside a zip file are unknown, they
can be retrieved via the unzip() function with the list = TRUE argument;
but in this example, we will presume that the zip file’s name is the same as the
original name, except that the ending .csv has been replaced by .zip.
Suppose that the first file in the 2016.Jan directory is called Alameda.zip.
We can open and read that file with code like this:
fi <- unz ("2016.Jan/Alameda.zip", "Alameda.csv", "r")
out <- read.delim (fi)
Notice that the read.delim() function and the other read.table()
offspring can read from the connection directly. If the file was too big to read all
at once, we might use scan() – interestingly, readLines() is not available
for connections from unz(). We chose read.delim() here because it sets
the separator character to the tab, the quote argument to the double-quote
only, and header to TRUE. Of course, we may have to experiment to find the
proper settings of these arguments.
Now that we can read one file, we need to establish the loop to read them
all. Finding the names of all the zip files in any subdirectory below this one
is straightforward using the list.files() command. We can restrict the
directories to search by supplying a vector of sub-directory names, which might
be the output from the list.dirs() command, but here we will assume that
the only sub-directories present are the relevant ones. We need the full names
of the files in order to open them, but we want only the file name, not the path
part, in order to reconstruct the original file name. So we call list.files()
twice. In the first call, we extract only the file names. Then a call to sub() will
produce the name of the unzipped file by replacing the ending zip of the file
name with csv. This code shows how this might be done.
zips <- list.files (pattern = "\\.zip$",
recursive = TRUE, full.names = FALSE)
csvs <- sub ("\\.zip$", "\\.csv", zips)
Here, of course, "\\.zip$" is a regular expression that restricts attention to
files whose names end with .zip.
Now we need to know the year and month associated with each directory.
We can find this in the vector of full names, which we produce by calling
list.files() a second time with the full.names = TRUE argument.
Then the year appears in characters 3–6 (because each file name starts with
the working directory, ./), and the month in characters 8–10. In a more com-
plicated situation, we might need a regular expression or another approach to
find these values. In this example, our code might look like this.
fullzips <- list.files (pattern = "\\.zip$",
recursive = TRUE, full.names = TRUE)
year <- substring (fullzips, 3, 6)
mon <- substring (fullzips, 8, 10)
At the end of this operation, we can construct a loop that sequentially reads
each file, adds a column with month and year, and appends it to a data frame.
This example shows what that code might look like.
result <- NULL # empty object
for (i in 1:length (zips)) {
fi <- unz (fullzips[i], csvs[i], "r")
out <- data.frame (Year = year[i], Month = mon[i],
read.delim (fi, stringsAsFactors = FALSE))
result <- rbind (result, out)
}
In this example, we used repeated calls to rbind() to construct the result. This
is an inefficient operation, since R needs to adjust the size of the data frame on
every iteration of the loop. However, it is easy to implement, and we only
need to run this code once. A more efficient approach might be to create the
data frame ahead of time (if we knew or could estimate the total number of
rows needed) or to read the files separately and join them at the end.
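A sketch of that second idea, re-using the objects built above, accumulates the
pieces in a list and combines them with a single call.
pieces <- vector ("list", length (zips))
for (i in seq_along (zips)) {
    fi <- unz (fullzips[i], csvs[i], "r")
    pieces[[i]] <- data.frame (Year = year[i], Month = mon[i],
                               read.delim (fi, stringsAsFactors = FALSE))
}
result <- do.call (rbind, pieces)   # one rbind() instead of many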
6.5 Other Formats
A number of newer formats for storing data are becoming increasingly com-
mon. Handling this data often requires the use of packages that are not
included in base R. In this section we describe some of these formats and the
tools used to read and write them – but, as always, keep in mind that packages
tend to evolve faster than R itself does.
6.5.1 Using the Clipboard
The “clipboard” is a mechanism for moving text data between programs.
In Windows, R sees the clipboard as a file named “clipboard” (so this is
a bad name for a regular disk file). There is also a larger version called
“clipboard-128.” As we mention below, the clipboard is particularly helpful
for bringing in data from spreadsheets like Excel: you can simply high-
light a rectangular area in Excel and copy to the clipboard; then in R,
read.table("clipboard", sep = "\t") will bring the data in.
All of the usual arguments to read.table(), particularly header and
stringsAsFactors, will need to be set in the usual way. Conversely,
we can put an R data frame into a spreadsheet by running a command
like write.table(mydf, "clipboard", sep = "\t") in R, then
switching to the spreadsheet and using that program’s “paste” command.
The procedure is somewhat different across Windows, Mac, and Linux,
however, and it does require the user to use the copy and paste com-
mands. So an approach that uses the clipboard might be difficult for other
users to reproduce. Still, the clipboard is a good way to get quick and easy
access to data from other programs. On the Mac, the two commands above
would look like read.table(pipe("pbpaste"), sep = "\t") and
write.table(mydf, pipe("pbcopy"), sep = "\t"). Linux is
more variable, but the help for the file() function gives some suggestions.
In using the clipboard to copy text in from other programs, you need to be
aware that some word-processing software may, depending on the settings,
automatically convert certain characters into others that “look nicer.” (This
doesn’t appear to be a problem with spreadsheet programs.) Specifically, if
you type into Microsoft Word "She said it wasn’t so -- and I
believed her", that program will convert the quotation marks to the
curved so-called “smart quotes” (like the ones seen surrounding “smart
quotes”). Similarly, the apostrophe will be curved and the two hyphens will
be converted into a dash. None of those characters are ASCII, and this text is
not a valid R string, because R does not recognize the quotation marks – so
if you try to paste this text directly into an R session, you will see an error.
If the text is surrounded with ordinary R-style quotation marks, then you
will have a valid R string. You can replace these non-ASCII characters using
gsub() – for example, gsub("[^[:print:]]", "", a) removes all
the so-called “non-printable” characters from a, leaving only letters, numbers,
and spaces. An alternative is to copy the text from the word processor to the
clipboard and scan() it into R from there, but the way clipboards handle
non-ASCII text is subject to change.
Be aware that many applications use the clipboard. If you copy text from a
spreadsheet to the clipboard, but then copy anything else, your clipboard con-
tents will be changed. This includes copying text in an R script window, as, for
example, when you highlight the text of a command and right-click to execute
that command.
6.5.2 Reading Data from Spreadsheets
In our work, we very often need to read data in from spreadsheets, and by a
very large margin the most commonly used spreadsheet is Microsoft Excel.
Excel uses at least two file formats, an older one identified by names that end in
.xls and a newer one identified by names ending in .xlsx. Different pack-
ages handle these formats: the gdata package has a read.xls() function,
while xlsx (Dragulescu, 2014) provides a read.xlsx() function.
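For example, a single sheet of a newer-format file might be read with a call
like this; the file name is an assumption.
library (xlsx)
budget <- read.xlsx ("budget.xlsx", sheetIndex = 1)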
But an Excel “workbook” can contain many spreadsheets, and need not be
rectangular as a data frame must be. While it is possible to specify a specific
sheet of a workbook, we have found that if there are just a few moderate-sized
data sets, it is often fastest to copy them to the clipboard and read them into
R from there. Another common approach is to use Excel to write the spread-
sheet as a tab- or comma-separated file.
Copying from the clipboard has at least two obvious disadvantages: first, it
makes it harder for another user to reproduce your work, since they need to
open the Excel workbook. This supposes that the user has Excel, or at least
one of the open-source spreadsheet programs that can read the Excel formats
(and often older formats or very complicated, macro-laden spreadsheets
can cause problems with these other tools). Then the user has to find the
proper sheet, and copy the proper cells to the clipboard – and this leads to
the second problem, because, as we mentioned earlier, the Windows and
Mac/Linux implementations of the clipboard are not the same. To make your
work reproducible you should give instructions for reading your data that
work on any machine. Alternatively, you might use the spreadsheet program
to save the worksheet in a text format, like a tab-delimited one. That way other
users will be able to read the data directly into R.
Another issue that has caused trouble in the past when reading data from
Excel spreadsheets via the clipboard is that confusion can arise if the rows have
different lengths. When reading from a file, the fill = TRUE argument to
read.table() will add blank fields on the ends of lines, but this has not
always been successful when reading from the clipboard. As a quick manual
adjustment, you can add a new column out at the right edge of the spreadsheet
containing all zeros, say, and then read the data in.
If you are confronted with a large number of spreadsheets, you will find it
necessary to use one of the packages like xlsx to automate the reading process.
The decision about whether to use a package to read an Excel file or to read it
through the clipboard depends on the number of sheets needed, their size and
complexity, and the sorts of documentation that are going to be required. We
usually read our Excel sheets in via the clipboard – but we acknowledge the
difficulties that come with that.
One aspect of Excel that can cause confusion is its convention regarding
dates. Dates can be displayed in many user-selectable formats, but internally
a date is represented by the number of days since December 30, 1899, so that
March 1, 1900 is day 61. (There is no support for dates before January 1, 1900,
and in an additional complication, day 60 is understood by Excel to have been
February 29, 1900 – even though 1900 was not a leap year.) Times are indicated
by fractional days, so 61.75 represents 6:00 p.m. on March 1, 1900. Often when
you read in an Excel column using one of the read.table() functions, your
output will contain text-format dates that can be handled in the usual way,
assuming you know the display format. (This is another example where you
want to remember to set stringsAsFactors to FALSE.) If you read in an
Excel column (say, df$indate) with numeric dates, they can be converted
to Date objects with a command such as as.Date(df$indate, origin
= "1899/12/30"). If you need times of day as well, recall from
Section 3.6.2 that Date objects have their non-integer time parts truncated
under some circumstances. So the safest plan is to use Excel to export the
dates and times in a text format. Alternatively, you might convert the numeric
days to seconds, by multiplying by 86,400, and then converting the resulting
values into a vector of one of the POSIX time classes, with a command
like as.POSIXct(86400 * df$indate, origin = "1899/12/30",
tz = "UTC"). In this example, we selected the UTC time zone explicitly
since otherwise R would have chosen the local time zone.
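This little illustration, using made-up Excel serial values, shows both conver-
sions.
indate <- c(61, 42461.75)   # hypothetical Excel date numbers
as.Date (indate, origin = "1899/12/30")   # 1900-03-01, 2016-04-01
as.POSIXct (86400 * indate, origin = "1899/12/30", tz = "UTC")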
6.5.3 Reading Data from the Web
A lot of data is now stored on the World Wide Web. Some of this is stored
statically, that is, in files on a server somewhere. This will often be HTML
data, like tables in ordinary web pages. However, a lot of data is not stored in
a page; rather, it is held in a database and returned dynamically, in response
to requests. For example, think of a web page that displays next week’s flights
between San Francisco and New York. This page is generated dynamically
in response to your input parameters, and might be different if generated a
few hours from now. Moreover, some services return responses in a more
complicated format like XML or JSON. In this section, we talk about how to
read files in common web-based formats; then we look briefly at retrieving
dynamic data from web servers.
Copying Tables from the Web via the Clipboard
HTML, the “hyper-text markup language,” is the format for displaying most
web pages. So a lot of data in the world is embedded in HTML, usually inside
tables. The HTML code for a simple 2 × 2 table might look like this example.
<html>
<table><tr><th>Name</th><th>Amount</th></tr>
<tr><td>Dylan</td><td>116.34</td></tr>
<tr><td>Garcia</td><td>953.21</td></tr>
</table></html>
HTML is made up of “tags,” like <html> and <td>, that often come in
pairs – so, for example, a row of a table starts with <tr> and ends with </tr>.
Tables in web pages are often easy to copy to the clipboard. We can simply
highlight them with the mouse and use the usual copy command (control-C
in Windows, command-C on the Mac). Then read.table("clipboard")
with the default value of the sep argument will very often produce accept-
able results. You may want to set other options, such as header and
stringsAsFactors, as well.
Reading Web Pages in HTML
If we need to acquire a large number of web pages, an automatic procedure
is necessary. One simple way to acquire a web page, if the address is known,
is through the getURI() function of the RCurl package (Temple Lang,
2015a). For example, the command groucho <- getURI("https://en
.wikipedia.org/wiki/Groucho_Marx") will retrieve the Wikipedia
page on Groucho Marx, in HTML, and save it as a character vector of length
1. (The task of converting from HTML to text is not, alas, a trivial one. We give
one approach after we discuss XML below.) More often, we want to extract
data that has been formatted in an HTML table. Our usual tool for ingesting
HTML tables is the readHTMLTable() function from the XML package
(Temple Lang, 2015b). It returns a list, with one entry for each table on the web
page. For some web pages, this function works flawlessly. However, a lot of
web pages are designed more for display than for serious data storage. Formats
can change over time and the HTML used does not always adhere strictly to
standards. Even when tables are returned cleanly, it is often true that the first
row contains the headers, and the columns have all been made into factors.
Although the output from readHTMLTable() may require additional pro-
cessing, using that function is almost always a better solution than writing a
custom function to look through the HTML tags, maybe using regular expres-
sions. If you will need to read substantial amounts of HTML as part of your
project, we recommend downloading the HTML pages to your local machine,
perhaps using download.file() to make a copy on disk, or getURI() to
read the text into R. That way, if the developers of the web page change the
format or data, your code will continue to work on your local copy.
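A sketch of the whole pipeline, assuming the page has been saved to disk under
a made-up name, might look like this.
library (XML)
tbls <- readHTMLTable ("saved_page.html", stringsAsFactors = FALSE)
length (tbls)   # one list entry per table found on the page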
XML
XML, despite standing for “eXtensible Markup Language,” is not so much a
language as a method of defining strict rules for document layouts. An XML
document will normally be expressed as a nested list in R. The XML package
handles the reading and writing of XML documents, and in fact reads XML
in two distinct ways. The xmlTreeParse() function reads in a file (or, with
the asText argument, takes in some text that is already in R) and produces a
tree-like object of class XMLDocument. This acts like an R list, so if you know
the exact names of the fields of interest, you can extract them. As an example,
consider this simple XML file:
<?xml version="1.0" encoding="UTF-8"?>
<Fields>
<Client><Name>Dylan</Name><Amt>116.34</Amt></Client>
<Client><Name>Garcia</Name><Amt>953.21</Amt></Client>
</Fields>
If we write this into a file called example1.xml, we can read that file into an
XMLDocument. The XMLDocument object is a long and complicated list. With
enough information, we can extract the value of the first Amt field directly, as
in this example.
> xx <- xmlTreeParse (file = "example1.xml")
> xmlValue(xx$doc$children[["Fields"]][[1]][["Amt"]])
[1] "116.34"
The xmlValue() function converts the list element, which is an object of class
xmlNode, into text. But clearly this approach is difficult for large or deeply
nested documents. A better approach to handling XML is to read the file into
R using the xmlParse() function, which produces an object of class XMLIn-
ternalDocument. We cannot extract elements from these documents using
the list-type notation above, but documents of this sort can be searched with
the xpathSApply() function using syntax derived from the XPath language
intended for this purpose. A description of XPath is beyond the scope of this
book, but lots of documentation and examples are available on the Internet.
Here, we show the use of the simplest XPath command, the // command that
performs a basic search. Notice that we search for Amt without needing to know
the names of its parent nodes.
> yy <- xmlParse (file = "example1.xml")
> xpathSApply (yy, "//Amt")
[[1]]
<Amt>116.34</Amt>
[[2]]
<Amt>953.21</Amt>
The output of xpathSApply() here is a list of objects of class xmlNode.
In the following example, we use the ordinary sapply() function to have
xmlValue() operate on each of the xmlNodes, finally producing a vector
of (character) amounts.
> sapply (xpathSApply (yy, "//Amt"), xmlValue)
[1] "116.34" "953.21"
The xpathSApply() function is also useful for converting HTML text
retrieved from a URL into readable text. For example, recall that in the
last section we saw how we might retrieve the Wikipedia entry for Grou-
cho Marx using the getURI() function. Recall that the entry was saved
in a character vector of length 1 named groucho. Then, analogously to
the xmlTreeParse() function, the XML package provides a function
htmlTreeParse(). When used with the useInternalNodes = TRUE
argument, this function produces an object of class HTMLInternalDocu-
ment. This class is a particular sort of XMLInternalDocument. Therefore,
the xpathSApply() function can be used as before. In this example, from
Luis (2011), we search for the HTML “new paragraph” tag, <p>, and apply the
xmlValue() command to extract the value from each paragraph found into
a list. We then use unlist() to produce a vector. The following commands
show how the text might be extracted from the HTML. The output is a
character vector with one entry for each paragraph in the original HTML
document.
> grou.tree <- htmlTreeParse (groucho, useInternal = TRUE)
> unlist (xpathSApply (grou.tree, "//p", xmlValue))
[1] "Julius Henry Marx (October 2, 1890 - August 19, ...
[2] "He made 13 feature films with his siblings the ...
[3] "His distinctive appearance, carried over from ...
:::
XML handles Unicode in a natural way, and elements of an XML document
can be expected to be encoded in UTF-8.
JSON
JSON, the “JavaScript Object Notation,” is another text-based format for con-
taining list-like data. For example, Twitter messages are often stored in JSON,
one message per JSON object. A JSON object is enclosed in braces and essen-
tially consists of a series of pairs like "name":value, separated by commas. The
“value” part of the object can be a value or another series of "name":value
pairs. JSON also supports an “array” object, analogous to an R vector. So a JSON
object can be directly represented by a (possibly nested) R list.
The rjson (Couture-Beil, 2014), RJSONIO (Temple Lang, 2014), and
jsonlite (Ooms, 2014) packages read and write JSON objects and make
the underlying JSON essentially transparent to the user. Often you will be
presented with a file containing a whole set of JSON messages, one per line. If
the file is small, it might be easiest to read the entire file into R using scan()
with sep = "\n" and what = "", and then apply one of the conversion
functions like fromJSON() to each of the messages. In this example, we show
what some of the data from the previous XML file might look like as text after
its JSON representation is read into an R object named zz.
> zz
[1] "{\"Name\":\"Dylan\",\"Amt\":\"116.34\"}"
[2] "{\"Name\":\"Garcia\",\"Amt\":\"953.21\"}"
Now we can use sapply() to have the fromJSON() function (we used the
one from the RJSONIO package) operate on each line.
> sapply (zz, fromJSON, USE.NAMES = FALSE)
[,1] [,2]
Name "Dylan" "Garcia"
Amt "116.34" "953.21"
The USE.NAMES = FALSE argument prevents the converter from construct-
ing unwieldy column names.
If the file is too big to import into R directly, it can be opened and read one
line at a time, as discussed in Section 6.2 earlier in this chapter.
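A sketch of that chunk-wise approach, with an assumed file of one-per-line
messages, might look like this.
library (RJSONIO)
con <- file ("messages.json", open = "r")
while (length (lns <- readLines (con, n = 1000)) > 0) {
    recs <- lapply (lns, fromJSON)   # process this batch of messages here
}
close (con)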
Like XML, JSON is able to handle Unicode in a native way, so you should
expect to see UTF-8 in the data you acquire.
Reading Data Via a REST API
REST, which stands for “representational state transfer,” is the most straight-
forward of the ways to send data from a web server to a client like R. (Another
popular approach, SOAP, will not be discussed here.) Data suppliers will
often permit clients to establish REST connections to retrieve data by means
of specially formatted requests. The description of those requests is often
published in an API, an “applications programmer interface.” For example,
Google Maps publishes an API that describes how to extract map data from
its database, and Bing Translate has a different API describing how to use
that tool to translate text from one language to another. If you want to use a
commercial API, be sure you understand the terms of use.
In many cases, developers have created R packages to make using the API as
straightforward as possible. For the two examples above, there are the
RgoogleMaps package (Loecher and Ropkins, 2015) for Google Maps and the
translateR package (Lucas and Tingley, 2014) for Bing Translate.
Each API will have its own rules, particularly regarding authentication. If no
package exists, but you can see the web page, you can often emulate the actions
of a user by creating your own request. This can be sent to the server as a “get” or
“post” request via the getForm() and postForm() functions in the RCurl
package, or with the GET() function in the httr package (Wickham, 2015). In
either case, you will need to know the name-value pairs that the server expects.
Often these can be deduced by reading the underlying HTML in your browser,
or by examining the requests sent to the server when you click manually (since
for “get” requests that information appears in your browser’s URL line).
As an example, at the time of this writing, the US Census Bureau
maintained an API that lets you look at certain values associated with
international trade. This API, for which documentation is available
at the URL www.census.gov/data/developers/data-sets/
international-trade.html, expects a “get” request to be sent to
the location //api.census.gov/data/timeseries/intltrade/
exports, with arguments get giving the fields to be retrieved and YEAR and
MONTH specifying the year and month of interest. This command shows the
sending of the request with the result being stored in an object called cens. In
this example, we request the year-to-date value of exports ("ALL_VAL_YR")
for each of the “end-use codes” ("E_ENDUSE") and their descriptions
("E_ENDUSE_LDESC") for April of 2016.
> url <- paste0 ("https://api.census.gov/data/timeseries/",
"intltrade/exports/enduse")
# For GET(), enclose API arguments via "query" as a list
> cens <- GET (url, query = list (
get = "E_ENDUSE,E_ENDUSE_LDESC,ALL_VAL_YR",
YEAR = "2016", MONTH = "04"))
The cens object is an object belonging to a special class called “response.”
Operating on this object with the content() function produces a list that
can be reshaped into a matrix via do.call() and rbind(). This example
shows that call and a small part of the result.
> do.call (rbind, content (cens))[1:4,]
[,1] [,2]
[1,] "E_ENDUSE" "E_ENDUSE_LDESC"
[2,] "" "TOTAL EXPORTS FOR ALL END-USE CODES"
[3,] "0" "FOODS, FEEDS, AND BEVERAGES"
[4,] "00000" "WHEAT"
[,3] [,4] [,5]
[1,] "ALL_VAL_YR" "YEAR" "MONTH"
[2,] "465400958493" "2016" "04"
[3,] "38996430386" "2016" "04"
[4,] "1606226019" "2016" "04"
The resulting matrix will often need some cleaning before it can be converted
into a data frame, particularly with regard to column headers. In this well-
behaved example, all of the components of the list produced by content()
had the same length. In general, that might not be true. Here, we could extract
every month for every year by passing suitable values of the YEAR and MONTH
parameters. As elsewhere, it might be a good idea to save the data to disk so
that its provenance is assured if the provider were to change the interface or
modify the data. REST APIs almost always deliver their data as JSON, XML,
or HTML.
Streaming Data
Sometimes, we need to capture streaming data, like feeds from sensors or from
a supplier like the Twitter platform. We have not encountered the need for this
in our own data cleaning problems, but at least in some cases the connection is
straightforward. We can open a “socket,” which is a communications endpoint,
once we are given an IP address for the server and a “port number,” an integer
that specifies the address of the socket. The socketConnection() func-
tion establishes the connection. You will probably want to specify blocking
= TRUE so that subsequent reads do not complete until something is actually
read.
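A minimal sketch, in which the host name and port number are assumptions,
might look like this.
con <- socketConnection (host = "127.0.0.1", port = 6011,
                         blocking = TRUE, open = "r")
lns <- readLines (con, n = 5)   # blocks until five lines arrive
close (con)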
6.5.4 Reading Data from Other Statistical Packages
In days gone by, a statistician would often need to read data saved in a for-
mat specific to another statistical package, such as SAS, SPSS, or Minitab. The
recommended package foreign (R Core Team, 2015) provides interfaces to
these and some other formats. Again, though, formats evolve, and very often it
will be safest and maybe even easiest to receive data as tab-delimited text or
in another text format. In any case, our experience has been that the need to
import data from other statistical packages is very much less than it used to be.
6.6 Reading and Writing R Data Directly
R’s own format for saving objects is a binary one, which is only readable by R
itself. It is often useful to save objects in R format as an archive, a backup, or
so that they can be passed to another user or machine, because the format of R
data files is the same across all computers and operating systems. The primary
functions for saving and loading R objects are saveRDS() and readRDS(),
for individual objects, and save() and load(), for groups of objects.
The saveRDS() function (the letters RDS evoke “R data serialization”)
takes an object and produces a disk file with all of that object’s data and
attributes. Saving a data frame to a spreadsheet-type text file can lose some
information, and in any case saveRDS() works on any R object. Each call
to saveRDS() applies to exactly one R object. So, for example, saveRDS
(myobj, "newfile") produces a disk file named "newfile" with a
binary representation of the R object myobj. The complementary action is
performed by readRDS(): in this case, readRDS("newfile") returns the
object just as it was saved. Note that readRDS() does not replace the existing
myobj; instead, it simply returns the object. You can save it with a command
like newobj <- readRDS("newfile"), which will create a new object
newobj that is identical to the original myobj.
You can save a whole set of objects with the save() function. The output
of save() is one big file with all the referenced objects stored in it. The
save() function lets you specify the objects as objects, so, for example,
save(myframe, myresults, myfunction, file = "output")
saves three objects into a file called "output". But, perhaps more conve-
niently, it also lets you specify objects by name using the list argument.
Alternatively, that last command could have been written save(list = c
("myframe", "myresults", "myfunction"), file = "output").
In this case, the second command requires slightly more typing, but the savings
are clear when it comes time to save every object whose name starts with
projectA, with a command like save(list = ls(pattern =
"^projectA"), file = "output"). Here, of course, the caret sign
(^) is part of a regular expression (Section 4.4) that extracts from ls() only
objects whose names start with projectA. To save all the objects in your
workspace, you can use the save.image() command. This is a useful way
to move your entire workspace over to a different machine, for example, and
this command is called when you quit R and ask to “save the workspace.” The
save.image() function creates or updates a file named .RData by default.
The load() function serves as the complement to save(), restoring all the
objects stored on disk by save(). It will over-write any existing object with
the same name as one being restored, so load() can be dangerous. A useful
alternative is attach(), which allows you to add an R data file to the search
path, that is, the set of places that R will look for objects. In this way, a set of
objects can be made read-only. This is particularly useful for large, static data
sets that need not be loaded into your workspace.
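This short round trip shows the pieces working together; the file names are
assumptions, and mtcars is one of R's built-in data frames.
saveRDS (mtcars, "cars.rds")
cars2 <- readRDS ("cars.rds")
save (cars2, file = "cars.RData")
attach ("cars.RData")   # cars2 is now read-only on the search path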
6.7 Chapter Summary and Critical Data Handling Tools
is chapter discusses getting data into R. Most often this will come in the form
of text files arranged in a tabular format, with one observation per row and one
measurement per column. Section 6.1 discusses different methods of reading
these files, primarily through the read.table() function and its offspring.
These functions will be able to handle a great many of these types of files. You
will need to know whether the file is delimited or fixed-width. In the former
case, you will need to know the delimiter, whether there are headers, whether
we expect quotation marks, and other attributes of the file. For fixed-width files,
we need to know the widths of each of the fields.
However we acquire a text file, we will probably want to set stringsAs-
Factors = FALSE so that character data are not converted into factors. We
may need to look for missing value indicators or other anomalies and re-read
the data. Indeed, reading data into R is very often an iterative process, requiring
us to refine our approach on each step until the data is exactly what we expect.
Some data cannot be read with read.table() because it is too broken,
too big, or not tabular. Data is sometimes “broken” by having extra delimiters
embedded, and we give an example of how to handle a file broken in this way.
We usually approach broken files using scan() or, if necessary, readBin().
Section 6.2 discusses approaches for very large or non-tabular data. We can
open a file and read it piece by piece, processing the pieces one at a time until the
file is exhausted. is is the usual approach for log files, JSON or XML (formats
we describe later in the chapter), or any sort of text that is non-tabular. Binary
files can be read in this way as well, though this is rarely needed. One time this
need does arise is when, for whatever reason, the source file includes embedded
null characters.
This section includes some code that we used in a real-life example where
some (purportedly) text files both lacked new-line characters and also included
embedded null and control characters. e example describes some prepro-
cessing steps we had to take in order to make the data readable.
In Section 6.3, we describe how R can connect to a relational database in
order to extract data. In order to interact with a database you will need to know
at least a little SQL. This language works with all major databases and
provides a framework for extracting and merging tables and subsets of tables.
Section 6.4 describes another common problem in data cleaning – how to
handle large numbers of directories and files. We give an example of the sort
of task we are called on to perform. Handling problems of this sort requires
more than just understanding how to read and process individual files. It is also
necessary to know how to navigate the operating system, and to understand the
R tools that make it possible to perform file-system tasks such as listing the files
in a directory and managing zipped files.
We spend some time on acquiring data in other formats in Section 6.5. Most
of the non-text data we get is in the form of spreadsheets, so it is important to
know how to access the data inside these files. Often the clipboard provides a
simple way of transferring the data from a spreadsheet, though this is not very
automatic. e clipboard is also a reasonable way to grab one or two tables from
a web page in order to read them into R. We also show how to read a table from
a web page directly, via the readHTMLTable() function in the XML package.
We can also read the HTML text of a web page – but then we are faced with the
problem of extracting the important information from among the HTML tags.
Data scientists need to be familiar with XML and JSON, two text-based for-
mats that are in increasingly common use. Add-on packages provide the ability
to read data in these formats into R objects. These are the common formats for
data returned from the Web via a REST query, and we give an example of such
a query.
Finally, we discuss how R stores its own objects. R's internal format is binary
and not human-readable, but there is no better format for passing R data
between different machines.
7
Data Handling in Practice
In our experience, a data cleaning project arises out of a modeling or data
exploration problem. We are given some data (or perhaps a description of
data that the project sponsor plans to eventually provide) and, usually, a
problem to be solved. ere is no fixed method for undertaking a data cleaning
project, but we think of the process as having four parts: acquiring and reading
the data, actually cleaning the data, combining the data (when it comes from
multiple sources), and preparing the data for analysis. (We sometimes use
"data cleaning" in both a broad sense and also as the name of a specific set of
tasks. Here, we are using “data handling” as the umbrella term for these four
tasks.) Of course, the “cleaning” part is never really finished, and often the most
important cleaning tasks are discovered as data sets are combined, or even as
the modeling proceeds. In this chapter, we describe the tasks associated with
each of the four parts of data cleaning. Then, we emphasize the importance of
reproducibility and documentation and give a detailed example at the end.
7.1 Acquiring and Reading Data
Acquiring data is, of course, the act of actually taking delivery of the data. Very
often the final data, the data that will be used for building models, will come
not in one big file but from a number of sources and in varying formats. So
it is important for the data cleaner to be prepared to read in text, spreadsheet
data, XML, JSON, and to handle non-standard formats as well. The exercise in
Chapter 8 includes examples of all of these data types.
Acquiring data turns out to be more difficult than you might expect. Data
providers are sometimes reluctant to release data, fearing the loss of propri-
etary or sensitive information. Some providers try to be helpful by providing
summarized data, or by taking it on themselves to delete or fill in records with
missing values. Sometimes, the data sets are so big that just moving them can
be a challenge. We have used e-mail attachments, DVDs, secure file transfer,
downloads from cloud-based servers, and even hard drives sent through the
mail for transfer. Our advice is to get as much data as possible, at as detailed a
level as possible. It is disruptive to your project to realize after you receive the
data from your first request that an important file is missing; the person send-
ing you the data may be less inclined to focus on your second or third request.
Of course, another issue is that often we don’t know what’s in the data until we
receive it, and only then can we evaluate whether we have what we need – and,
often, once we receive some data we understand enough about the problem to
determine what data we should have asked for in the first place.
Once we have taken possession of the data, the next step is almost always
to read the data into an R data frame using the techniques of Chapter 6. The
initial cleaning process begins here, since it is at this point that you will start
to identify missing value indicators, determine column classes and get an idea
of the size and complexity of the data. This is also a good point to ensure that
your column names are meaningful and tractable. For instance, we recently got
a data set of credit bureau data in which the nearly 200 columns had unwieldy
names like “Number Open Installment Trades with Credit Limit <5000 with bal
>0 and reported within 6 months.” Our first task was to change all the names to
ones that would be more manageable in R. This was time-consuming, tedious,
difficult to automate – and necessary.
Another important element you will usually want in any data cleaning project
is a key field, an indicator that uniquely identifies each observation. For a piece
of equipment, this might be a serial number; for a person from the United
States, it might be a Social Security Number (although these are sensitive, we
generally ask our providers to construct and provide a different, unique iden-
tifier). In many cases, one individual may have multiple records – in a medical
setting, each person may have many visits to a doctor, for example. We might
then construct our own key, combining Social Security Number with date. It is
particularly important to have a key field when combining data from different
sources; we talk more about this in Section 7.3.3.
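A minimal sketch of that kind of key construction, with a hypothetical data frame visits and hypothetical column names, might look like this:
visits$Key <- paste0 (visits$SSN.ID, ".", visits$Date)
any (duplicated (visits$Key))   # FALSE if the combined key is unique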
7.2 Cleaning Data
The actual steps in data cleaning will depend on the data, and what we expect to
find in each column. Of course, before anything is changed we want to see what
the data looks like. In general, whenever we receive a data set, we tabulate the
keys, looking for duplicates. If there are duplicate keys, we try to see if entire
records are duplicated, and, if there are, we may delete the duplicates and record
this fact. But if there are duplicate keys that go with distinct records, we have to
evaluate whether this describes real observations, errors, or another condition.
We look at every column's type (character, numeric, etc.), and count the
number of missing values in every column. Even when there are thousands of
columns, it is useful to tabulate the column classes and to draw a histogram
of, or get summary statistics on, the numbers of missing values. Often a set of
columns will have the same number of missing values, and you will then find
that they are all missing on the very same observations. Frequently, this will
arise because those observations all failed to match some specific data source
when the data frame was being constructed. You then have to decide whether
to keep those columns, remove them, or maybe create a new logical column
describing whether each observation was, or was not, missing those columns.
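A first look along these lines might be sketched as follows, for a hypothetical data frame dat:
table (sapply (dat, function (x) class (x)[1]))  # tally of column classes
na.count <- colSums (is.na (dat))                # missing values per column
table (na.count)    # columns sharing an NA count are often missing together
hist (na.count)     # or summary (na.count), for very wide data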
For numeric columns, we very often want to see the mean value (among
non-missing items) and, usually, the range. The range helps us spot anoma-
lies – negative numbers where they are unexpected, or 999 codes used as
missing values – quickly. For columns that are supposed to be integer or
character, we will often want to know the number of unique values; if this is
very different from what we expect, we would investigate further. For example,
a column measuring “number of home mortgages” will most commonly have
the values 0 or 1. If such a column had hundreds of distinct values we might
want to investigate. Conversely, a column named “Salesperson ID” might be
expected to have hundreds of values. In that case if there were only a few, we
would want to understand why. For categorical or integer variables with only a
few values, we will tabulate them, looking for unusual values. If we encounter
a column with values “A,” “B,” “Other,” and “other,” for example, we will almost
certainly consider combining those last two values. As with numeric columns,
it is helpful to look at maxima and minima of date fields. We tabulate dates
by month or quarter, looking for anomalous conditions like months with no
observations.
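Each of these checks is a one-liner; here is a sketch, again with hypothetical column names:
range (dat$Balance, na.rm = TRUE)    # negatives or 999 codes stand out
length (unique (dat$SalespersonID))  # compare with the count we expect
table (dat$Status)                   # exposes "Other" versus "other"
table (format (dat$Date, "%Y-%m"))   # months absent here had no observations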
Once we have identified missing or clearly erroneous data, we face a deci-
sion. If the data came recently from a specific provider, we might return to that
provider, point out the issues, and hope for corrected data. More often, we will
have to act on our own and take steps to keep those values from disrupting our
analysis. For example, suppose we have a data set giving information about sol-
diers. If a field giving a particular soldier's age contains the value 999, and we
expect to need the age in our analysis, we might make a note of the soldier’s
identification or other key value, then delete that record and continue. Dele-
tion should be a rare tactic. If, say, 30% of soldiers had the 999 code, we might
need to do something else. We might replace the erroneous value with a valid
one using an “imputation” method, or we might mark the 999 values as NA.
Alternatively, if we do not anticipate using the age in modeling or other efforts,
we might let the value stay as it is – but the fact that there are some soldiers
whose age is equal to 999 is still important information about the quality of the
data. You will want to report data quality issues to the data provider and project
sponsor.
Each data frame will need to be examined on its own, but the cleaning process
also needs to consider the collective set of data frames that go into the project.
For example, one data frame might indicate sex by a code like M or F and maybe
a missing or unknown code, say, Z. A second data frame might have a column
that gives TRUE for females and FALSE for males. Either of these schemes is
adequate on its own, but if the data frames need to be combined, we will need
to select a scheme from one data frame and impose it on the other.
Moreover, if two data sets provide the same information – sex, in this
example – for the same observations, it is generally a good idea to see how
often they agree, as a measure of data quality. In the case of sex, we expect
to see almost no disagreements – people do change sex, but rarely, and
disagreements are probably more likely to be due to transcription or other
errors. More often, in our experience, we will see two sources agreeing almost
all the time when both are present, but one source reporting many more
missing values than the other. This can give us information about the overall
data quality of the sources. Inevitably, experience and outside knowledge will
help you spot anomalies. As an example, we were given a data set of car loans
in which the cars were about 70% used and 30% new. When the same company
sent us an update, the new file described cars that were about 30% used and
70% new. Neither is unreasonable taken alone, but when we saw both files we
were certain an error had been made – and we were right.
7.3 Combining Data
We discussed combining data frames in Section 3.7. In that section, we focused
on the mechanics of using R to combine data frames. As we described there,
data frames can be joined three ways: by row (stacking vertically), by column
(combining horizontally), or using a key (which we call merging). is section
looks at some of the practical considerations we encounter when combining
data frames in real data cleaning problems.
7.3.1 Combining by Row
As we noted earlier, most data cleaning projects involve data from a number
of sources. ese might be different sets of similar observations that should
be combined row-wise. For example, we might get identically formatted tables
from each of several laboratories that should be combined, one table on top
of the next, into the final data set. Suppose we wanted to combine data sets
named NYC, ATL, and PHX, representing input from our labs in New York,
Atlanta, and Phoenix. First we will need to ensure that the data sets have exactly
the same column names – often names will differ in case, or in the use of
dots as separators. Second, columns with the same name should have the same
class – character, numeric, logical, or date. Third, we usually want the set of
values in categorical variables to match. If one data set’s column Success has
values Yes and No, say, and the other uses TRUE and FALSE, we will want to
modify one to match the other. Fourth, we examine the key fields to ensure that
they will be unique in the new, combined data set. If not, we might construct a
new key that adds NYC, ATL, or PHX onto the front or rear of the existing key.
Even if the keys are unique, we normally append a new column to each table,
giving its source, just to be unambiguous.
We combine data sets like these with R's rbind() command, which we
introduced in Section 3.7.1. This function will respond properly if columns with
the same names appear in different orders in the two data sets. However, it will
not complain if categorical columns do not have the same sets of values. So
when we combine data sets in this row-wise manner, we often use code like
that in the following examples to ensure that they do have the same values. We
start by ensuring that all three data frames have the same number of columns,
and that their names match. We sort the names because the columns might be
in different orders in the different data frames.
# Values at right show expected output from each command
length (unique (c (ncol (NYC), ncol (ATL), ncol (PHX))))  # 1
all (sort (names (NYC)) == sort (names (ATL)))            # TRUE
all (sort (names (NYC)) == sort (names (PHX)))            # TRUE
We now check that the column classes match one another. Here we use the
slightly different technique of calling all.equal() on the two vectors rather
than all() on the comparison. The all.equal() approach will be neces-
sary when comparing lists.
all.equal (sapply (NYC, class),
sapply (ATL, class)[names (NYC)]) # TRUE
all.equal (sapply (NYC, class),
sapply (PHX, class)[names (NYC)]) # TRUE
The class() function returns a vector of length 2 or more, rather than a
single entry, for some columns (like, e.g., columns of POSIX date objects).
In that case, we can replace sapply(NYC, class) by sapply(NYC,
function(x) class(x)[1]).
In the next step, we identify the sets of categorical or factor variables, and
extract from each one its unique values. In the following code, we explicitly
convert any factor variables to character for the purpose of comparing their
unique values.
cats <- names (NYC)[sapply (NYC, class) == "character" |
                    sapply (NYC, class) == "factor"]
levs.nyc <- lapply (NYC[,cats],
function (x) unique (sort (as.character (x))))
levs.atl <- lapply (ATL[,cats],
function (x) unique (sort (as.character (x))))
levs.phx <- lapply (PHX[,cats],
function (x) unique (sort (as.character (x))))
all.equal (levs.nyc, levs.atl) &&
all.equal (levs.nyc, levs.phx) # TRUE
The levs objects are lists of levels, which we sorted alphabetically (since
unique() does not sort). The all.equal() functions produce logical
values, which we combine with the && operator. If the result is TRUE, we know
that all three data frames' sets of character or factor variables have exactly the
same sets of values.
The final step in this operation is to create a column identifying each data
frame's source and then to combine the three data frames, as we show here.
NYC$Source <- "NYC"
ATL$Source <- "ATL"
PHX$Source <- "PHX"
big <- rbind (NYC, ATL, PHX, stringsAsFactors = FALSE)
Notice the use of the stringsAsFactors = FALSE argument in the call to
rbind(). As we said in Section 4.6.5, rbind() behaves more gracefully with
factors, or mixtures of factors and characters, than some other R functions. Still,
we recommend using the stringsAsFactors = FALSE argument during
the data cleaning process. We generally want to create factors only at the end
of the cleaning process (see Section 7.5).
7.3.2 Combining by Column
Two data frames with the same number of rows can be combined “horizontally”
using the data.frame() function. This straightforward operation produces
a result whose number of columns is simply the sum of the numbers of columns
in the components. However, the joining is done naïvely. If the components are
data frames A and B, the first row of A is joined to the first row of B, the second to
the second, and so on. The data.frame() function ensures that the column
names in the result are distinct, so some of the column names in the second data
frame may be changed. The cbind() function does not de-conflict column
names, so we recommend using data.frame() for this job. The row names
in the result come from the first data frame that has (non-default) ones.
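This small sketch, with two invented data frames A and B, shows the difference in how the names are handled:
A <- data.frame (id = 1:3, x = c (10, 20, 30))
B <- data.frame (x = c ("a", "b", "c"), y = 4:6,
                 stringsAsFactors = FALSE)
names (data.frame (A, B))  # "id" "x" "x.1" "y" -- duplicate renamed
names (cbind (A, B))       # "id" "x" "x"   "y" -- two columns share a name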
7.3.3 Merging by Key
A more sophisticated operation is needed when two data sets have information
about the same observations but possibly in different orders. In this case, the
data needs to be “merged,” that is, combined into one wide data set. We do this
by joining the tables according to their key values; in R, we use the merge()
function (see Section 3.7.2). In its simplest form, merge() takes two data sets
and returns a new one with the values from each key combined into one row.
If keys are unique, and every key appears in both data sets, this operation is
straightforward. e behavior when the keys are duplicated is slightly tricky;
we make sure that our keys are always unique. We may also need to decide
whattodoaboutrecordsthatappearinonetablebutnotboth.emerge()
function accepts arguments that control this behavior.
Two facts about merge() are worth remembering. First, the return from the
merge() function is, by default, sorted according to the key, so the order of the
rows may have changed compared to the ordering in the source tables. A sec-
ond, more important point has to do with what happens when the two source
tables have columns with the same name. If both of the original tables contain
a column named State, for example, the output will contain two columns
State.x and State.y.etermState is now not enough to name a col-
umn unambiguously, so code that worked on the State column in either of the
two original data sets will fail on the merged one. Moreover, if the merged data
is now merged again with a third input, that third input will contribute a col-
umn just called State (since that name is now not a duplicate). If you intend
to keep both of a pair of like-named columns after a merge, we recommend
changing their names in a controlled way, ahead of time.
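A tiny sketch, with invented tables, makes the renaming visible:
left  <- data.frame (Key = 1:2, State = c ("NY", "PA"))
right <- data.frame (Key = 2:1, State = c ("PA", "NY"), Sales = c (7, 9))
names (merge (left, right, by = "Key"))
# "Key" "State.x" "State.y" "Sales" -- no column is named State any more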
7.4 Transactional Data
One type of data that is worth further mention here is transactional data. In
contrast to tabular data, in which we might expect one row per key, transac-
tional data sets often have many rows for each key, with each row representing
a single transaction. ink of a log of clicks at a website, for example, identified
by user; each user may contribute dozens of clicks in a single session. Or think
of a data set of bank transactions identified by the customer's account number.
If the interest is in individual transactions, then the data set is well on its way
to being useful. However, if the interest is in account numbers, we may want
to summarize all the transactions for a particular customer so as to produce a
data set with one row per account rather than one row per transaction.
In the rest of this section, we illustrate working with transactional data by ref-
erence to a real-life problem we encountered recently. This example is lengthy,
but it serves to show some of the techniques that will be useful when handling
this and other sorts of data.
7.4.1 Example of Transactional Data
We were given a large data set regarding a particular survey taken by soldiers
that was intended to elicit the soldier's emotional state. The key field was an
identifier that was unique to each soldier. This survey was administered approx-
imately every year, so most soldiers appear multiple times in this data set. The
actual survey data set had several million rows, and each row contained the
answer to perhaps 150 questions, but for this demonstration we will show a
sample survey data set with only one question (to which the answer is an integer
on the scale of 1–5). Our sample survey data frame survey has seven rows, as
shown here. Like the original, our sample data is sorted by Date within ID.
> survey
ID Date Response
1 AA 2012-09-26 3
2 AA 2014-01-16 4
3 CC 2013-03-13 3
4 CC 2014-04-30 5
5 CC 2015-03-31 4
6 DD 2013-06-03 2
7 EE 2013-12-02 4
We would consider this to be tabular data if our interest were primarily in
each survey response. If our interest were primarily in each soldier, we would
consider this to be transactional data. We have already seen some ways to sum-
marize data like this. For example, we might compute the average Response
for each ID using the tapply() function, with code like with(survey,
tapply(Response, ID, mean)). More generally, we might be interested
in building a data set with one row per soldier, where each row has the earliest
date taken, perhaps, the average response, the number of responses, and so on.
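One sketch of such a one-row-per-soldier summary, using the survey data frame shown above (the object names are ours):
mean.resp  <- with (survey, tapply (Response, ID, mean))
first.date <- with (survey, tapply (as.character (Date), ID, min))
n.surveys  <- with (survey, tapply (Response, ID, length))
soldier <- data.frame (ID = names (mean.resp), FirstDate = first.date,
                       MeanResponse = mean.resp, N = n.surveys,
                       stringsAsFactors = FALSE)
Because the three tapply() results share the same sorted ID ordering, the columns line up correctly.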
One useful tool for transforming transactional data into a tabular form is
the rle() function from Section 2.6.5. Since the records are sorted by ID,
the rle() function, applied to the IDs, computes the “runs” of ID values. It
returns a list with two elements named lengths and values. The values
component is a vector of distinct soldier IDs (one entry for each ID for which
there is at least one survey), and the lengths component gives the number of
times that ID appeared in each run. That number is, of course, an integer giving
the number of surveys taken by each soldier. This code shows the list produced
as output of the call to rle(). Despite the special print format, the elements
of this object can be accessed in the same way as elements of an ordinary list.
> (rle.out <- rle (survey$ID))
Run Length Encoding
lengths: int [1:4] 2 3 1 1
values : chr [1:4] "AA" "CC" "DD" "EE"
This output is useful for many tasks that need to be performed on transac-
tional data. For example, since lengths gives the lengths of the runs for
each ID, we can identify the ending points of the runs for each soldier with
end <- cumsum(rle.out$lengths). Then the starting points can be
computed with start <- c(1, end[-length(end)] + 1) – here we
start the first run at 1 and start each subsequent run just after the previous endpoint.
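For the sample survey data, where rle.out$lengths is c(2, 3, 1, 1), those two commands give:
end <- cumsum (rle.out$lengths)          # 2 5 6 7
start <- c (1, end[-length (end)] + 1)   # 1 3 6 7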
With these starting and ending points in hand, we can answer even difficult
and complicated questions about the contents of the records within each run.
For example, suppose we wanted to identify soldiers who had an increase of
exactly 1 between one value of the Response and the next. Assuming that
records are sorted by Date within ID, we can construct a function diff1
that extracts the set of records associated with a particular index of start
and end and determines whether any of the differences are equal to 1. Then we
can apply that function to each pair of corresponding start and end values,
as shown in the following code.
> diff1 <- function (i) {
recs <- survey[start[i]:end[i],]
if (any (diff (recs$Response) == 1)) TRUE else FALSE
}
> sapply (1:length (start), diff1)
[1] TRUE FALSE FALSE FALSE
The output shows us that AA is the only soldier with an increase in Response
of exactly 1 from one survey to the next. In this example, we could also have
used tapply(), which has the additional advantage of not relying on global
variables (as our diff1 relies on start and end and survey). However,
tapply() cannot be applied to a data frame. If we needed all the information
for each soldier, another alternative is to combine split() with lapply(),
but that approach seems to be slower and more memory-intensive.
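For this single-question example, the tapply() version might look like this; it, too, assumes the records are sorted by Date within ID:
with (survey, tapply (Response, ID, function (r) any (diff (r) == 1)))
#   AA    CC    DD    EE
# TRUE FALSE FALSE FALSE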
7.4.2 Combining Tabular and Transactional Data
We continue this example by reference to the real-life problem that inspired
it. We were actually given two large data sets. One contained the surveys we
described earlier. e second listed soldiers who had undergone deployment
overseas, giving an ID and the starting and ending dates of the deployment.
Many soldiers had more than one deployment. When read into R, the actual
deployment data frame had some hundreds of thousands of rows, but for this
example we use five deployments. This code shows what the five sample deploy-
ments look like, as stored in an R data frame called deploy.
> deploy
ID Start End
1 AA 2014-05-05 2014-11-08
2 BB 2012-10-15 2013-07-19
3 CC 2013-08-16 2014-04-03
4 CC 2015-11-01 2015-05-17
5 EE 2013-02-20 2013-05-18
Like the surveys, this data set has been sorted to be in ascending order of ID.
Our ultimate goal was to identify soldiers who had taken the survey both before
and after deployment, to see whether deployments might be associated with
changes in emotional state. To do that, we wanted to create a data set with one
row per deployment. Each row of this final data will indicate the latest survey
for each soldier, among those that preceded the deployment start date, and the
earliest survey among those that follow the deployment end date. In this way,
we would have the surveys that most closely bracketed the deployment. Sol-
diers might appear twice in the data set, possibly bracketed by different pairs
of surveys. In fact it would, in theory, be possible for two deployments by the
same soldier to be bracketed by the same pair of surveys, but we judged this to
be unlikely and not damaging to the analysis. In this example, the deployment
data set is already tabular (since we are interested in deployments, not individ-
ual soldiers), and the survey data set is transactional. A natural first step is to
look for missing values, especially in responses and dates.
Once we were satisfied with the quality of the data, we decided to cre-
ate, as an intermediate product, a data set with one row per deployed soldier,
each row containing the dates of all the surveys for that soldier. Of course,
different soldiers take the survey different numbers of times, so we found
the maximum number of surveys taken by any soldier with a command like
max(rle.out$lengths). We might have used max(table(survey
$ID)), although we would expect that to take much longer.
In this case, the maximum number of surveys taken is three, by soldier CC. So
we created a character matrix with three columns. We used characters because
Date objects in a matrix get converted to numeric. This character matrix was
called datemat. This code shows how the datemat matrix might be con-
structed. Notice that, for the moment, this matrix is constructed without an ID.
> deploy.unique.ID <- unique (deploy$ID)
> datemat <- matrix ("", nrow = length (deploy.unique.ID),
ncol = max (rle.out$lengths))
In order to fill datemat, we use the rle() function from the example
above and the idea of a matrix subscript from Section 3.2.5. However, we need
to exclude soldiers who took the survey but are not in the set of deployers.
In this code, we remove those entries from the survey data frame, producing
surv.short, and then re-run rle().
> surv.short <- survey[is.element (survey$ID, deploy$ID),]
> rle.out <- rle (surv.short$ID)
We can now create a two-column matrix subscript, to be used to enter dates
into the datemat matrix. Each row of the matrix subscript, as you recall, gives
a row and column number of an entry to be filled. The column numbers are the
values from 1 to 3 giving the column of datemat where the date of the survey
will be recorded. For example, the first soldier, AA, took two surveys, so we will
fill columns 1 and 2 of that soldier’s row. We will want to generate a vector like
1:2 for that soldier. e second soldier took three surveys, so we will fill all
three columns, using a vector like 1:3, and so on. We can compute a list with
the relevant vectors using sapply(), as we show in the following code. Then
unlist() extracts a vector of values from the list.
> surv.lens <- sapply (rle.out$length, function (i) 1:i)
> (col.surveys <- unlist (surv.lens))
[1] 1 2 1 2 3 1
The col.surveys vector gives the columns into which the dates need to be
placed. The IDs that go along with each of these entries are precisely the entries
in surv.short$ID. However, rather than the text IDs, we want the (numeric)
row number of datemat, which we can get using the match() function to
match the IDs from the surveys to the set of unique deployment IDs. This code
shows how we construct the vector giving the row of datemat desired for each
survey.
> (row.surveys <- match (surv.short$ID, deploy.unique.ID))
[1] 1 1 3 3 3 4
> (mat.subset <- cbind (row.surveys, col.surveys))
row.surveys col.surveys
[1,] 1 1
[2,] 1 2
[3,] 3 1
[4,] 3 2
[5,] 3 3
[6,] 4 1
We can now see that this matrix can be correctly used as a subscript. (We can
also see why datemat has no extraneous columns like ID; we want to refer to
the first date column as column number 1.) The date of the first survey should
be entered into the first row and first column of datemat; the second survey
was taken by the same soldier, so its date, too, should go into the first row, but
this one should go into the second column; and so on.
The final steps are these. First, we use the matrix subscript to fill the
datemat matrix with the survey dates. We make these character explic-
itly. Then we create a data frame consisting of the unique IDs from the
deployment file, together with the dates from the datemat matrix. We
merge the original deploy data to this new data frame by ID, creating a
data set with one row per deployment (the merge() command, you will
recall, takes care of the fact that some soldiers deploy more than once). These
commands show the construction of the final combined data set, which we
call dd.
> datemat[mat.subset] <- as.character (surv.short$Date)
> date.df <- data.frame (ID = deploy.unique.ID, datemat,
stringsAsFactors = FALSE)
> (dd <- merge (deploy, date.df, by = "ID"))
ID Start End X1 X2 X3
1 AA 2014-05-05 2014-11-08 2012-09-26 2014-01-16
2 BB 2012-10-15 2013-07-19
3 CC 2013-08-16 2014-04-03 2013-03-13 2014-04-30 2015-03-31
4 CC 2015-11-01 2015-05-17 2013-03-13 2014-04-30 2015-03-31
5 EE 2013-02-20 2013-05-18 2013-12-02
With this data set in hand, we can now determine which deployments are brack-
eted by surveys. We need to be aware that the Start and End columns are
Date objects, whereas the survey dates in the three rightmost columns are
character. We also need to be aware that some entries in those last three
columns are empty. For each row, we extract the largest of the non-empty survey
dates "to the left of" (i.e., smaller than) the deployment start, using -Inf when
none is found. Then we extract the smallest date to the right of the deployment
end, using Inf if none is found. This code defines a function bracket() that
performs this computation and shows the effect of it operating on each of the
rows of dd. The result is named brack.
> bracket <- function (i) {
dat <- dd[i,] # extract ith row
dts <- dat[,4:6] # grab dates and...
dts <- dts[dts != ""] # omit empties
left <- max (-Inf, dts[dts < as.character (dat$Start)])
right <- min (Inf, dts[dts > as.character (dat$End)])
c(left, right)
}
> (brack <- sapply (1:nrow (dd), bracket))
[,1] [,2] [,3] [,4] [,5]
[1,] "2014-01-16" "-Inf" "2013-03-13" "2015-03-31" "-Inf"
[2,] "Inf" "Inf" "2014-04-30" "Inf" "2013-12-02"
As we noted in Section 5.1.2, bracket() is a function that relies on variables
in the workspace (in this case, dd). We usually avoid this reliance, but this func-
tion provides a convenient way to operate on the rows of dd, and we will only
use this function once.
The result in brack gives one column per row of dd. Only the third row
of dd – the first deployment of soldier CC – is bracketed by surveys. Now we
simplify the dd data set by combining its ID with the transpose of the brack
matrix. Having identified the “bracketing” survey dates, we now want to extract
the responses for those surveys. The natural way to do this is to identify every
survey by a unique key consisting of the soldier’s ID and the date. We construct
these keys both in the new dd data set, and also in the original survey data set.
Then match() can be used to bring in the responses from the two bracketing
surveys. This final piece of code shows how this is accomplished. Notice that
the columns from brack are now named X1 and X2.
> dd <- data.frame (ID = dd$ID, t(brack),
stringsAsFactors = TRUE)
> left.key <- paste0 (dd$ID, ".", dd$X1)
> right.key <- paste0 (dd$ID, ".", dd$X2)
# Add key, then use match
> survey$key <- paste0 (survey$ID, ".", survey$Date)
> dd$ResponseLeft <-
survey$Response[match (left.key, survey$key)]
> dd$ResponseRight <-
survey$Response[match (right.key, survey$key)]
> dd
ID X1 X2 ResponseLeft ResponseRight
1 AA 2014-01-16 Inf 4 NA
2 BB -Inf Inf NA NA
3 CC 2013-03-13 2014-04-30 3 5
4 CC 2015-03-31 Inf 4 NA
5 EE -Inf 2013-12-02 NA 4
Obviously, the bracketed deployments are the ones with both responses
present. If we were writing a script to identify these deployments, we would
take a minute here to remove the temporary variables we created – left.key,
right.key, and so on.
7.5 Preparing Data
We draw a distinction between the “cleaning” step, where we detect and adjust
anomalies, and the “preparation” step (although these two certainly overlap).
In the preparation step, we create new columns for the purpose of making
modeling easier or more revealing, without affecting the existing columns.
One such action is binning, where we create a new character vector with a
small set of levels (such as “small,” “medium,” and “large”) from a numeric
vector. As another example, it is common in regression problems to transform
a numeric variable by taking its logarithm. We might combine categorical
variables, converting a three-level Race factor and a two-level Sex factor
into a single six-level factor, and so on. All of these steps are intended for
modeling, not data cleaning, and the specific steps will depend on the models
being used.
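As a sketch only – dat and its columns are hypothetical, and the break points are arbitrary – these commands illustrate the three kinds of preparation just described:
dat$SizeGroup <- as.character (cut (dat$SqFt,
    breaks = c (0, 1000, 1500, Inf),
    labels = c ("small", "medium", "large")))       # binning
dat$LogIncome <- log (dat$Income)                   # transform for regression
dat$RaceSex <- paste (dat$Race, dat$Sex, sep = ".") # 3 x 2 = 6 levels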
Another common action is creating a new, small set of categorical levels from
an existing, much larger set. It is helpful to have a cross-reference table that
connects one set to the other. For example, given a set of US states, we might
want to create a new column of the corresponding census regions (Northeast,
Midwest, South, and West – we use this example in Chapter 8 as well). Suppose
that we have a data set dat with a column named State already in it. We can
use the cross-reference data frame state.tbl from the cleaningBook
package. is data frame has State in one column and Region in another.
en the match() function allows us to add a Region column to dat
with a command like dat$Region <- state.tbl$Region[match
(dat$State, state.tbl$State)].
In this example, we can also use merge() to join the dat and state.tbl
data frames. We are most accustomed to using match() for tasks that involve
a column or two because we often do not need all the columns from the second
data set to be added to the first. The important difference between match()
and merge() arises in the case of duplicate keys. The match() function finds only one
match (the first one) if there is one, whereas merge() returns all of the records
with the matching key.
Here, the match() function produces a vector the same length as
dat$State, each entry of which gives the number of the element of
state.tbl$State matched by that element of dat$State. Then we look
up the values of state.tbl$Region that correspond to those matches
to find the regions. (Elements in dat that do not match the cross-reference
table will produce NA in the output.) This use of a cross-reference table is
very common and makes code to create the new columns easy to read and
repeatable – but make sure that the cross-reference table does not have
duplicated keys.
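A deliberately broken cross-reference table, invented for the purpose, shows the difference:
xref <- data.frame (State = c ("NY", "NY"),
                    Region = c ("Northeast", "Oops"),
                    stringsAsFactors = FALSE)
dat2 <- data.frame (State = "NY", stringsAsFactors = FALSE)
xref$Region[match (dat2$State, xref$State)]  # "Northeast" -- first match only
merge (dat2, xref, by = "State")             # two rows -- all matches returned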
Although it is technically a modification of an existing column, we include
in the data preparation step the act of changing the class of one column to
another class. e most common of these transformations arises when con-
verting a character vector to a factor (Section 4.6) using the factor() or
as.factor() functions, since we generally avoid factors when we first read in
the data. Eventually, factors may be needed for modeling efforts. When convert-
ing a character vector into a factor, remember that you can control the ordering
of the factor levels (which otherwise defaults to alphabetical). The "baseline"
level of a factor is the first one, and it is often useful to select the baseline care-
fully, as the most common or least interesting level, perhaps.
Changing the class of one or two columns in a data frame is straightforward.
Changing the class of many things requires some care. In particular, it is gen-
erally a bad idea to use apply() or sapply() for this task on a data frame
because these functions return a matrix (in which all of the entries must be of
the same class). e lapply() functioncanbeused,thoughitproducesalist
that then needs to be manipulated. An inelegant but easy alternative is a simple
for() loop.
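Such a loop might be sketched as follows, here converting every character column of a hypothetical dat to factor:
for (nm in names (dat)) {
    if (is.character (dat[[nm]]))
        dat[[nm]] <- factor (dat[[nm]])
}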
7.6 Documentation and Reproducibility
There are at least two sorts of documentation in a data cleaning process: the
documentation provided to you with the data and the documentation you gen-
erate as you go through the steps. The first of these is often called a "data dictio-
nary." This should contain a manifest of all the files delivered, with the names of
each column and the possible values. This is a good place to record the source
of the data and the dates on which each piece of it was received. The more
detailed a data dictionary is, the easier we can expect the data cleaning tasks to
be. However, data dictionaries are often incomplete – for example, it is com-
mon to find levels or missing value indicators in the data that are not listed in
the dictionary. In some cases, the dictionary is just wrong; this might happen
when you are provided an outdated version, for example. All you can do in this
case is to try to communicate with the data provider and make some educated
guesses. We talk about the role of judgment in the next section.
The second sort of documentation in a data cleaning effort is the documenta-
tion that you generate to describe what you did. At a very minimum, you should
produce your R scripts and any custom functions that you wrote as part of the
effort – and, like all scripts and functions, these should be laid out in a clear
way, with enough comments that a new user can figure out what you did.
More often we produce a real write-up, just for the data handling effort,
laying out all of the steps that we took in text, rather than just as R comments.
It is particularly important to list actions that resulted in observations being
deleted, with reasons and counts. In fact, most of our reports include a
flowchart describing the number of observations at each step in the data
cleaning process. Figure 7.1 shows an example of such a chart. In this example,
observations came from two sources, as shown in the upper corners. A small
number of observations are deleted because they lack payment information;
[Figure 7.1 flowchart: Western Office (134,246) and Eastern Office (96,442) each
lose records with No Pay Info (401, 0.3%, and 262, 0.3%), leaving Western w/Pay
(133,845) and Eastern w/Pay (96,180); these are Merged (230,025); No archive
match removes 9,058 (3.9%), leaving Match to archive (220,967); RCODE = "XX"
removes 1,123 (0.5%) and RCODE = "??" removes 864 (0.4%), leaving Good
RCODE (218,980); Start Date >= 2016-08-31 removes 17,288 (7.8%), leaving
Ready for modeling (201,692).]
Figure 7.1 Example population flowchart.
then the two remaining sets of observations are merged into one data set of
230,025 observations. In subsequent cleaning steps, we lose observations that
fail to match back to a particular archive, that have unexpected values of a
field called RCODE, and whose start dates are too recent. The rectangle in the
bottom right gives the number of observations that have survived the different
cleaning steps.
We also recommend that you create, as one of your products, a master table
with every key and the disposition of the record for that key (in our example,
"no payment information," "illegal RCODE," etc.). That way you and the sponsor
can quickly identify the outcome associated with every record.
Perhaps the most important goal of the documentation that you produce is
to make the data handling process "reproducible." That is, any R user should be
able to take your data, and your scripts and functions, run through the data han-
dling process from beginning to end and produce exactly the same output that
you did. Sometimes, this will be impossible – if, for example, you are acquiring
data from a database that gets updated frequently. But usually you want to think
of this as a requirement. In fact, it may be worth moving your data, scripts, and
functions to a new directory (or even a new machine, perhaps with a different
operating system) and re-running your handling process there, to ensure that
you aren’t, for example, relying on variables in your R environment that won’t
be present in the new directory.
Notice that the ordering of the steps will affect the numbers in the flowchart.
Suppose that some of the rows were both missing payment information and
also had illegal RCODEs. en the counts associated with those two exclusions
will depend on the order in which they are applied. In this case, we would
expect the final count to be the same regardless of the order of the operations
(although the master table may change). However, in other cases, particularly
those involving judgment (next section), we have to expect that two different
data cleaners will end up with two slightly different data sets, depending on the
choices they make.
7.7 The Role of Judgment
The data scientist is confronted with data and with rules for preparing it, and
yet inevitably is called on to exercise his or her judgment during the cleaning
process. As an artificial example, imagine a data set called indat that included
city names and two-letter state abbreviations. As part of the data cleaning pro-
cess you, as the analyst, construct a table of the length (i.e., nchar()) of the
abbreviations, expecting to see every entry with the value 2. However, you find
that a number of state abbreviations in fact have four characters each, so you
tabulate the set of state abbreviations with four characters. So far, your code
might look like this:
table (nchar (indat$State))
   2    4
9203   84
table (indat$State[nchar (indat$State) == 4])
CITY
84
Something seems to have gone wrong here. You tabulate the values in the
City field for the 84 records whose State field has the CITY value, and you
see this:
table (indat$City[nchar (indat$State) == 4])
NEW YORK
84
At this point, the issue seems clear: these 84 records presumably should have
been recorded as “city = ‘New York’, state = ‘NY’,” but instead were recorded
as “city = ‘New York’, state = ‘CITY’.” Having identified the issue, though, some
steps remain. Are the data suppliers aware of this anomaly? They will need to
be informed, maybe right away if fresh data is required, but maybe, for com-
paratively unimportant problems, in the final project documentation. Does this
error suggest that other fields in those same records are less reliable? Is it accept-
able to change the state for these 84 records to NY? Are there other entries
correctly labeled “city = ‘New York’, state = ‘NY’” (or “city = ‘New York City’,
state = ‘NY’”), and, if so, do these differ from the mislabeled ones, perhaps in
terms of date?
Issues like this arise in almost every data cleaning problem. The analyst has
the responsibility of alerting the data provider as to problems, but she must
also forge ahead. The most important rule is that everything must be docu-
mented. Documentation will appear in the code, but it must also be in the
human-readable materials returned to the client. We are always wary of mak-
ing changes to our customers’ data, and yet we very often end up doing just
that – informing the customer of what changes were made, why, and how many
records were affected.
One more place where the analyst’s judgment can be required is in miss-
ing value handling. Some missing values appear to have arisen more or less
at random; the rest of the record and entries for that field in other records are
generally not missing. Missing values like these can often be incorporated into
the modeling process.
Other records might have almost all missing entries. Since these carry almost
no information, they will often need to be discarded. On the other hand, fields
for these observations that are present might be informative about some items
of interest. For example, if a set of records, missing most fields, contains age and
sex information, those values might be useful when assessing the population of
customers, even if they are not useful for modeling outcomes. So, at each stage,
it is important to report the number of observations used for any particular
graph, table, or statistic.
Often values marked as missing actually indicate something. For example,
one field might carry information about whether a customer has ever declared
bankruptcy. Suppose that such a field has only two unique values: Y and
the missing value NA, and suppose that only a few percent are marked Y. It
might be reasonable to judge that NA was recorded for customers with no
bankruptcy, change those to N, and proceed (documenting this change, of
course, and reporting it). Now suppose that this field had three distinct values:
Y making up, say, 2% of the values, N making up 94%, and NA accounting for
the remaining 4% of values. Here, the way forward is not as clear.
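In the two-value case, the mechanics are a one-liner (Bankrupt is a hypothetical column name; the judgment, and its documentation, are the real work):
table (dat$Bankrupt, useNA = "ifany")       # Y and NA only
dat$Bankrupt[is.na (dat$Bankrupt)] <- "N"   # document and report this change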
Judgment arises when combining the levels of categorical values. For
example, one field might record whether an applicant rents a home, owns
it, lives with parents, or has some other housing situation; imagine that the
documentation specifies that these choices will be represented with values R,
O, P, and X, respectively. Suppose that when you tabulate that field, you find
about 60% R values, 20% O values, 10% P values, and 3% X values; but you also
find 4% r, in lower-case, and 3% missing values. Even though r is not listed
as a possible choice, it seems reasonable to assign them to the R group, with
corresponding documentation to be included in the final report. On the other
hand, if instead of r the value had been J, we might have made a different
choice. And what about the missing values? Since R is the most common
group, it will under some circumstances make sense to insert an R into those
records. Alternatively, since a missing value is "other than R, O, or P," we might
move those items with missing values into the X group – or we might create a
new label called Missing. The path to take has to depend on the frequency
of anomalous entries, the frequencies of the existing entries (as, here, where R
is the most common), and the ultimate goals of the analysis, and these call for
the analyst's judgment in consultation with the data providers.
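Whichever choice is made, the code itself is simple; here is a sketch for a hypothetical Housing column:
dat$Housing[dat$Housing == "r"] <- "R"         # fold lower-case r into R
dat$Housing[is.na (dat$Housing)] <- "Missing"  # or "X" -- a judgment call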
7.8 Data Cleaning in Action
In this section, we demonstrate some of our approaches on some data inspired
by a real problem we investigated. This data concerns somewhat more than 100
homes in the eastern part of the United States. It comes in the form of three
CSV files, which you can find in the cleaningBook package. Each property
is described by its size (in number of bedrooms and number of bathrooms),
the state in which it is located, and its area (in square feet). This information
is held in two separate files, named BedBath1.csv and BedBath2.csv.
The electrical usage, measured at a meter, for each property has been recorded
over six consecutive calendar quarters. is information is held in a third file,
EnergyUsage.csv, which should include all the homes in the first two files.
Homes are identified by a serial number, which consists of one of the letters
A, H, or P, followed by three digits and then a "trailer" made up of three more
digits; trailers are one of 001 through 009.
The ultimate goal of the effort was to model the changes in electrical usage
across time, particularly in specific subsets of houses that are similar in terms
of their size and location. The goal of the data cleaning portion is simply to
combine the three data sets into one that is suitable for the modeling effort. In
the following sections, we handle each file one at a time.
7.8.1 Reading and Cleaning BedBath1.csv
Reading
Since the BedBath1 file has a name ending in .csv, it seems reasonable to
presume that it is comma-separated. We might start using scan() to extract
and examine a few rows of the file, but very often we wade right in with the
hope that the first call will produce the desired result. In this case, we start with
a call to read.csv() and then show a few rows. Recall that this function calls
read.table() with quote= "\"" and comment.char = "" as well
as sep = ",".
> bb1 <- read.csv ("BedBath1.csv", stringsAsFactors = FALSE)
> bb1[1:4,]
SerText Bed Bath SqFt State
Prop # P477005 and land 3 3 1580 PA
H644009 as identifier 3 2 1/2 1490 CT
House Num H260008 2 1 910 CT
Property A834009 house and land 2 1 1/2 980 CT
We have learned a few things here. First, it appears that we have encountered an
embedded comma problem – the fourth row has one more entry than the oth-
ers. As a result, the text from the first column has been made into row names. If
any row had had too many commas, read.csv() would have failed. Second,
the numbers for Bath must be text because they include entries like 2 1/2.
We may want to convert that number to 2.5. Third, the property identifier is
enclosed inside some text, and not in a consistent location from row to row.
We will need to extract the ID from the surrounding text.
The first thing to handle is the embedded comma. Let us double-check by
extracting the first few rows of text using scan() to ensure that that is, in fact,
the issue. This code shows how we might do that.
> bb1 <- scan ("BedBath1.csv", sep = "\n", what = "")
Read 77 items
> bb1[1:5]
[1] "SerText,Bed,Bath,SqFt,State"
[2] "Prop # P477005 and land,3,3,1580,PA"
[3] "H644009 as identifier,3,2 1/2,1490,CT"
[4] "House Num H260008,2,1,910,CT"
[5] "Property A834009, house and land,2,1 1/2,980,CT"
There really is a comma after the property ID, and before the word "house," in
the fifth line. In the next step, we use count.fields() to see how frequently
there is one extra comma and to examine what rows with extra commas look
like.
> table (count.fields ("BedBath1.csv", sep = ",",
comment.char = ""))
 5  6
66 11
> bb1[6 == count.fields ("BedBath1.csv", sep = ",",
comment.char = "")]
[1] "Property A834009, house and land,2,1 1/2,980,CT"
[2] "Property P589004, house and land,3,3,1560,NY"
[3] "Property P450001, house and land,3,2,1400,PA"
:::
[8] "Property A303001, house and land,3,2,c. 1460,PA"
:::
All 11 of the lines with 6 fields include the “house and land” notation. So we
will just remove the first comma from each of those 11 lines. Notice also that
in the eighth line the square footage value, c. 1460, is not numeric. We will
need to address this shortly. First, though, we remove the first comma in each
of the over-long lines using sub(). Then we can pass the result as the text
argument to read.csv() to produce a data frame.
> comm <- (6 == count.fields ("BedBath1.csv", sep = ",",
comment.char = ""))
> bb1[comm] <- sub (",", "", bb1[comm])
> bb1 <- read.csv (text = bb1, stringsAsFactors = FALSE)
> bb1[1:4,]
SerText Bed Bath SqFt State
1 Prop # P477005 and land 3 3 1580 PA
2 H644009 as identifier 3 2 1/2 1490 CT
3 House Num H260008 2 1 910 CT
4 Property A834009 house and land 2 1 1/2 980 CT
In the first command, we located the rows with six fields (i.e., five commas). In
the second, we removed those commas. After the call to read.csv() in the
third line, we can see that the fields appear to line up, at least in the first few
rows. We move on to the cleaning step.
Cleaning
Now we examine the classes of the columns using a call to sapply().
> sapply (bb1, class)
SerText Bed Bath SqFt State
"character" "integer" "character" "character" "character"
This final command shows that the Bed field has been recognized as
integer, which suggests that the fields are properly lined up. It also confirms
our previous observation that the Bath and SqFt columns will need to be
cleaned in order to make them numeric.
Now let us extract the ID from the SerText field. Since we know that the
ID consists of an A, H, or P followed by six digits, we can extract the ID with a
regular expression, as shown in the next code segment.
> (bb1$ID <- regmatches (bb1$SerText,
regexpr ("[AHP][[:digit:]]{6}", bb1$SerText)))
[1] "P477005" "H644009" "H260008" "A834009" ...
> table (is.na (bb1$ID))
FALSE
76
> table (nchar (bb1$ID))
7
76
The second command, of course, shows us that no IDs are missing, and the
third, that every ID has seven characters. We might look further into the ID
values by ensuring that they all really do start with A, H, or P, that they all end
with a three-digit number of the form 00x, and so on.
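Those further checks might be sketched like this:
table (substring (bb1$ID, 1, 1))    # only A, H, and P should appear
table (grepl ("00[1-9]$", bb1$ID))  # TRUE for all 76 if trailers are 001-009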
As to the Bath column, we can use sub() to replace every instance of 1/2
with .5. Notice how the text to be replaced includes the space that separates
the two parts of, for example, 21/2. After that substitution, we should be
able to convert the column to numeric. We will create a new variable with the
numeric version of Bath first, to ensure that the conversion went successfully.
is code shows how we can compare the original Bath with the newly con-
structed version.
> bath <- as.numeric (sub (" 1/2", ".5", bb1$Bath))
> table (bath, bb1$Bath, useNA = "ifany")
bath    1 1 1/2  2 2 1/2  3
  1     5     0  0     0  0
  1.5   0    18  0     0  0
  2     0     0 18     0  0
  2.5   0     0  0    20  0
  3     0     0  0     0 15
We can tell that this conversion has succeeded by examining the table, and we
can see that there are no missing values in the bb1$Bath column. As a result,
we can replace the Bath column in the data set with the bath variable just
constructed, as we do here.
> bb1$Bath <- bath
The final preparatory step for this data involves addressing the non-numeric
entries in the SqFt column. The non-numeric entries, of course, would be
turned into NA by as.numeric(). The following command shows how we
can use this fact to show the set of non-numeric SqFt entries.
> bb1[is.na (as.numeric (bb1$SqFt)),"SqFt"]
[1] "1510" "c. 970" "1460"
[4] "c. 1470" "c. 920" "approx. 1510"
[7] "1580" "c. 1460" "1600"
[10] "1560" "approx. 1440" "1460"
[13] "approx. 1450"
Warning message:
In ‘[.data.frame‘(bb1, is.na(as.numeric(bb1$SqFt)), "SqFt"):
NAs introduced by coercion
In another example, there might have been too many to view, but here we can
see that the non-numeric entries come in three types: those that start c., those
that start with another qualifier, and those that start approx. The next step
requires some judgment. We will elect to remove the qualifiers and report the
square footage using the integer given; but we would certainly report this choice.
One way to extract the numeric part is, as before, with a regular expression.
In the first command, as shown below, we search for a sequence of digits, and
then use regmatches() to extract that sequence. This produces a character
vector we name sqft. Then, we extract the values of sqft, which correspond
to non-numeric entries in the original bb1$SqFt vector. We display the first,
second, and the sixth elements of that subset to ensure that each different sort
of “approximate” notation is corrected properly.
> sqft <- regmatches (bb1$SqFt,
regexpr ("([[:digit:]]+)", bb1$SqFt))
> sqft[is.na (as.numeric (bb1$SqFt))][c(1, 2, 6)]
[1] "1510" "970" "1510" # warning message suppressed
Now that sqft appears to contain the correct value for every row, we can
convert it into a numeric vector and assign the result to the SqFt column of bb1.
This code shows that operation, together with the first few rows of the modified
data frame.
> bb1$SqFt <- as.numeric (sqft)
> bb1[1:3,]
SerText Bed Bath SqFt State ID
1 Prop # P477005 and land 3 3.0 1580 PA P477005
2 H644009 as identifier 3 2.5 1490 CT H644009
3 House Num H260008 2 1.0 910 CT H260008
Our data cleaning for this file is nearly complete. The SerText column is
redundant but not otherwise bothersome. However, a few checks remain.
First, we examine the classes of the columns to ensure they are as we
expect – character for the first and the last two, integer or numeric for the
others. Second, we look for duplicated rows and duplicated ID values. We could
use any(duplicated()) here, but we are in the habit of using table(). That
way, if there are duplicates, we know how many there are.
> sapply (bb1, class)
SerText Bed Bath SqFt State
"character" "integer" "numeric" "numeric" "character"
ID
"character"
> table (duplicated (bb1)) # duplicated rows?
FALSE
76
> table (duplicated (bb1$ID)) # duplicated IDs?
FALSE TRUE
75 1
Apparently there is a duplicated ID. Let us determine its value and then extract
the rows for that ID.
> bb1$ID[duplicated (bb1$ID)]
[1] "P888009"
> bb1[bb1$ID == "P888009",]
SerText Bed Bath SqFt State ID
23 Property ID P888009 2 1.5 920 PA P888009
49 P888009 as identifier 2 1.5 920 PA P888009
The good news is that these two rows are identical except for the original
SerText. It is almost certainly safe to delete one of the two. Still, the fact
of the duplication might reveal something about the process by which the
data has been collected, stored, or transmitted, and it should be reported. The
following commands show how we might delete one of these two rows as
well as the SerText column. We examine the dimensionality, mostly for our
awareness. Finally, we tabulate the State column, again just for our general
awareness.
> bb1 <- bb1[-23,] # using row number from above
> bb1[bb1$ID == "P888009",] # double-check!
SerText Bed Bath SqFt State ID
49 P888009 as identifier 2 1.5 920 PA P888009
> bb1$SerText <- NULL # delete columns
> dim (bb1)
[1] 75 5
> table (bb1$State, useNA = "ifany")
CT NY PA
28 17 30
Deleting row 23 is a quick operation, but it is slightly dangerous, in that if it were
inadvertently repeated a different row would be (wrongly) removed. It might be
safer to use a command like
> bb1 <- bb1[-(which (bb1$ID == "P888009")[2]),]
Here, if there is a second row with the given ID, it will be removed. If not, the
value of which(bb1$ID == "P888009")[2] will be NA, and the attempt
to compute bb1[-NA,] will produce an error.
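Another option, and one that is safe to re-run, is to de-duplicate on the key
directly. Note that this sketch keeps the first of the two rows rather than the
second, so the choice should still be documented.
> bb1 <- bb1[!duplicated (bb1$ID),]  # idempotent: a second run removes nothing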
As a final cleaning step for this data frame we might examine our summary
statistics for our numeric columns (Bed and SqFt), looking for missing and
anomalous values – but in this case, it may make sense to wait until the final
data set has been constructed.
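When we do, a single call along these lines will serve (a sketch):
> summary (bb1[, c("Bed", "SqFt")])  # quartiles, extremes, and any NA counts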
7.8.2 Reading and Cleaning BedBath2.csv
Reading
We continue the data cleaning exercise by reading the BedBath2.csv file.
As before, it is not implausible that everything will be well formatted, so we call
read.csv() and examine the first few lines.
> bb2 <- read.csv ("BedBath2.csv", stringsAsFactors = FALSE)
> bb2[1:3,]
[1] "P957001\t3\t2\tNew York\t1440"
[2] "H429005\t2\t1.5\tNew Jersey\t950"
[3] "P226003\t2\t1.5\tNew York\t930"
Apparently this file, despite having a name that ends in .csv, is in fact
tab-separated. Naturally we try again, adding sep = "\t" to read.csv()
as shown here. (We might just as well have used read.table() or
read.delim().) We also examine the dimension of the resulting data frame
and the classes of its columns.
> bb2 <- read.csv ("BedBath2.csv", sep = "\t",
stringsAsFactors = FALSE)
> bb2[1:3,]
PropId Bed Bath State Size
1 P957001 3 2.0 New York 1440
2 H429005 2 1.5 New Jersey 950
3 P226003 2 1.5 New York 930
> dim (bb2)
[1] 40 5
The data frame appears to have been constructed successfully.
Cleaning
A natural first step in the cleaning process, as with bb1, is to examine the classes
of the columns. This code shows how we can do that.
> sapply (bb2, class)
PropId Bed Bath State Size
"character" "integer" "numeric" "character" "integer"
We see a few things here. First, the IDs, called PropId here, seem to be whole
(at least, in the first few rows). We might use table(nchar(bb2$PropId))
to examine them further. The Bath and Size columns are properly numeric,
as well. The State entries are full names rather than two-letter postal
codes, but this is easily handled. The data appears to have been read in
properly. Let us look for duplicate keys, both within this data set and between
the two.
> table (duplicated (bb2))
FALSE
40
> table (duplicated (bb2$PropId))
FALSE
40
> length (intersect (bb1$ID, bb2$PropId))
[1] 0
Our 40 PropId values are distinct, and they do not overlap with any of the ID
values in bb1. In order to join the two bb data sets, we will need to convert the
two-letter state codes into state names, or vice versa. Let us tabulate the bb2
state codes.
> table (bb2$State, useNA = "ifany")
Connecticut New Jersey New York
10 15 15
It seems reasonable to convert these state names to their corresponding
two-letter codes. We could construct an ad hoc cross-reference table for this
purpose, but we can just as easily construct a table of state names and codes
from the built-in R objects state.name and state.abb. The following
code shows how we can build this cross-reference table.
> state.xref <- data.frame (Name = state.name,
Abbr = state.abb, stringsAsFactors = FALSE)
> state.xref[1:3,] # check
Name Abbr
1 Alabama AL
2 Alaska AK
3 Arizona AZ
Now we use match() to extract the row numbers of state.xref where the
entries of bb2$State are found. Extracting the Abbr entries for those row
numbers produces the desired two-letter codes. We call that vector of codes
state.2 and cross-tabulate it with the original State values as a check.
> state.2 <- state.xref$Abbr[match (bb2$State,
state.xref$Name)]
> table (state.2, bb2$State, useNA = "ifany")
state.2 Connecticut New Jersey New York
CT 10 0 0
NJ 0 15 0
NY 0 0 15
Since that has worked, we can now replace the State values with the
state.2 vector. The following command completes the data cleaning for
BedBath2.csv.
> bb2$State <- state.2
It may be worth noting, particularly for readers accustomed to SQL, that we
could have “joined” the bb2 and state.xref tables with merge(), as in this example.
> merge (bb2, state.xref, by.x = "State", by.y = "Name",
all.x = TRUE)
This command produces one row for every entry in bb2, since all.x = TRUE.
The State column of bb2 is matched with the Name column of state.xref.
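Had we gone that route, a couple of follow-up steps would be needed; this
sketch, with bb2m as a hypothetical name for the merge() result above,
replaces State with the abbreviation and drops the extra column:
> bb2m$State <- bb2m$Abbr   # Abbr came along from state.xref
> bb2m$Abbr <- NULL
Note, too, that merge() may re-order the rows, which is one reason we
preferred match() here.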
7.8.3 Combining the BedBath Data Frames
We can now stack the two bb data frames vertically; they have the same
columns, of the same type, with the values represented the same way. We need
only to ensure that the column names match. To do that, we can rename the
PropId and Size columns of bb2 to ID and SqFt, so as to match bb1 (or,
of course, vice versa). We need not re-order the columns; the rbind() function
for data frames will take care of that. This code shows the renaming, and the
call to rbind() that produces the final BedBath data set.
> names (bb2)[names (bb2) == "PropId"] <- "ID"
> names (bb2)[names (bb2) == "Size"] <- "SqFt"
> bedbath <- rbind (bb1, bb2, stringsAsFactors = FALSE)
Notice that to identify the first column to rename we use [names (bb2) ==
"PropId"] rather than [1]. The ID column is column number 1, but perhaps
in another iteration it might end up in a different position. Relying on the name,
rather than the number, is safer.
Once we have the combined data set, we should examine it. It is always a
good idea to check for missing values, to count the number of unique entries,
construct a few more tables, and compute summary statistics. The following
commands show a few of the computations that we might perform in this example.
> sapply (bedbath, function (x) sum (is.na (x)))
Bed Bath SqFt State ID
  0    0    0     0  0
> sapply (bedbath, function (x) length (unique (x)))
Bed Bath SqFt State ID
  3    5   35     4 115
> with (bedbath, table (Bed, Bath), useNA = "ifany")
Bath
Bed  1 1.5  2 2.5  3
  2  7  25  0   0  0
  3  0   0 31  21 23
  4  0   0  0   8  0
> summary (bedbath$SqFt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
860 985 1450 1334 1495 1690
The first command shows that this data set has no missing values. The second
shows, first, that the ID numbers are unique, as intended, and that there
are only 35 distinct values of SqFt. These values were rounded, so this is not
implausible. The cross-tabulation of Bed and Bath is also plausible, with bigger
houses having both more bedrooms and more bathrooms. Finally, the
summary() of the SqFt column shows no obvious anomalies.
7.8.4 Reading and Cleaning EnergyUsage.csv
Reading
Now we read in the data set with the energy usage data. We start with
read.csv() as before.
> kwh <- read.csv ("EnergyUsage.csv",
stringsAsFactors = FALSE)
> kwh[1:3,]
Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4
1 A594-005 2271.074 2245.196 2767.31 3713.24
2 P957-001 3565.268 3255.392 2880.485 3446.105
3 H625-003 2919.32 2334.8498 2821.3652 #N/A
X2017.Q1 X2017.Q2
1 3647.2811 3499.3139
2 3440.72 3485.93
3 #N/A #N/A
> sapply (kwh, class)
Serial X2016.Q1 X2016.Q2 X2016.Q3
"character" "character" "character" "character"
X2016.Q4 X2017.Q1 X2017.Q2
"character" "character" "character"
The file appears to be comma-delimited, but the “energy usage” columns are
all of mode character. For at least the last few, this must be, at least in part,
due to the #N/A missing value code. Recall that R produces valid column names
when it reads data in. In this case, it has added the letter X to the front of column
names like 2016.Q1; we will continue to use those modified names.
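Incidentally, read.csv() also accepts check.names = FALSE, which preserves
the original names; the cost, as this sketch shows, is that non-syntactic names
then need back-quotes:
> kwh.raw <- read.csv ("EnergyUsage.csv", check.names = FALSE)
> kwh.raw$`2016.Q1`[1:3]   # back-quotes required for this name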
Let us read the data again, specifying that value as the missing value string.
We also observe that the Serial field looks as if it contains the ID, but with a
hyphen.
> kwh <- read.csv ("EnergyUsage.csv", na.strings = "#N/A",
stringsAsFactors = FALSE)
> kwh[1:3,]
Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4
1 A594-005 2271.074 2245.196 2767.310 3713.240
2 P957-001 3565.268 3255.392 2880.485 3446.105
3 H625-003 2919.320 2334.850 2821.365 NA
X2017.Q1 X2017.Q2
1 3647.281 3499.314
2 3440.720 3485.930
3       NA       NA
The data has been properly read in. Some missing values remain, and we will
keep our eye on them to see if, for example, they arise more frequently in some
time periods or locations than in others.
Cleaning
We start by examining the classes just as before.
> sapply (kwh, class)
Serial X2016.Q1 X2016.Q2 X2016.Q3
"character" "numeric" "numeric" "numeric"
X2016.Q4 X2017.Q1 X2017.Q2
"numeric" "numeric" "numeric"
Since the columns all have the expected class, we turn our concern to the
Serial field. We will need to remove the hyphen to get these values to match
up with bedbath. This code uses sub() to perform that task.
> kwh$Serial <- sub ("-", "", kwh$Serial)
> kwh$Serial[1:4] # check
[1] "A594005" "P957001" "H625003" "P462004"
That conversion seems to have succeeded, although as before we might look
more deeply. Let us now look for duplicate rows, duplicate keys, and duplicate
values within columns.
> table (duplicated (kwh)) # duplicate rows?
FALSE
116
> table (duplicated (kwh$Serial)) # duplicate keys?
FALSE TRUE
115 1
> sapply (kwh,
function (x) length (unique (x))) # duplicate values?
Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
115 112 108 104 103 94 95
Let us start with the item with the duplicate value of Serial. As we did before,
we will extract both of the rows for this Serial value.
> kwh$Serial[duplicated (kwh$Serial)]
[1] "P888009"
> kwh[kwh$Serial == "P888009",]
Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4
16 P888009 NA NA 2797.175 3700.685
88 P888009 2244.566 2526.494 2797.175 3628.685
X2017.Q1 X2017.Q2
16 NA 3608.844
88 4310.686 3608.844
First, we note that the duplicated key is the same one that was duplicated in the
bed and bath data. Second, the two rows for this key are not identical, as they
were there. Our inclination is to drop the first of these two rows, reasoning that
they are equal when not missing, except in one place. Again, this is a judgment
we make pending communication with the data provider. In this next piece of
code we remove row 16, identified as the first of the duplicates.
> kwh <- kwh[-(which (kwh$Serial == "P888009")[1]),]
The issue of duplicates within columns is subtler. Each of the numeric columns
has at least a few duplicate values, and this is probably not surprising. We might
expect a few matches just by coincidence. However, there may be a trend toward
more duplication as we move to the right of the data set. In the next line, we use
the table(table()) approach to examine the frequency of common values.
If common values are just coincidence, then most likely we would see sets of
pairs, each with a common value, rather than a group of four or five entries with
the very same value.
> table (table (kwh$X2017.Q2))
 1  6
93  1
There is, in fact, a set of six observations with the very same value in the
X2017.Q2 column. This seems unexpected. In this code, we display those six
observations.
> which (6 == table (kwh$X2017.Q2))
4847.51
50
> kwh[!is.na (kwh$X2017.Q2) & kwh$X2017.Q2 == 4847.51,]
Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4
31 P308007 4059.905 3875.015 3034.91 3990.26
44 H958007 3771.095 5358.575 3034.91 3990.26
48 H419006 5360.750 1078.055 NA NA
60 H619001 3570.185 2400.890 3034.91 3990.26
112 A542003 2126.240 2951.270 3034.91 3990.26
116 A436004 1181.540 2951.270 3034.91 3990.26
X2017.Q1 X2017.Q2
31 4512.2 4847.51
44 4512.2 4847.51
48 4512.2 4847.51
60 4512.2 4847.51
112 4512.2 4847.51
116 4512.2 4847.51
The value 50 returned from the which() command just indicates that the
entry of interest happened to be the 50th entry in the table. We are more
interested in the duplicated value, which is 4847.51. We can see that a number of
properties have the same value not just for the X2017.Q2 column but for
others as well.
Once again the analyst needs to come to some judgment as to how to proceed.
In the real-life case from which this example was derived, we were able
to communicate with the data provider and learn that these readings were
estimated on occasions when meters could not be read, the estimation deriving
from an average of a set of properties. We continued constructing the data,
documenting our findings.
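One way to document the finding in the data itself is to flag the affected rows;
in this sketch the column name estimated is our own invention:
> kwh$estimated <- !is.na (kwh$X2017.Q2) & kwh$X2017.Q2 == 4847.51
> sum (kwh$estimated)   # the six flagged rows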
7.8.5 Merging the BedBath and EnergyUsage Data Frames
The two data sets are nearly ready to merge. They have keys in the same format
and issues with the data have been resolved. All that remains is to ensure that
the two sets of keys do, in fact, refer to the same properties. In this code, we
tabulate the number of keys in bedbath that appear in kwh, and vice versa.
> table (is.element (bedbath$ID, kwh$Serial))
TRUE
115
> table (is.element (kwh$Serial, bedbath$ID))
TRUE
115
Where there are no duplicates, we need not do this in both directions – but we
recommend it anyway. Since all the keys match, we can now merge the two data
sets by the Serial and ID keys. This code shows how we construct the final
data set for this example.
> properties <- merge (kwh, bedbath, by.x = "Serial",
by.y = "ID")
> properties[1:4,] # Check
Serial X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1
1 A195008 4226.522 3961.223 4117.505 4683.680 NA
2 A255003 3558.008 3512.932 3645.375 4182.305 5108.536
3 A294008 3897.872 3352.718 2073.440 3013.130 3644.720
4 A301008 3327.296 3365.249 2997.545 5168.420 4151.660
X2017.Q2 Bed Bath SqFt State
1 7022.795 2 1.0 950 NY
2 5695.186 3 2.5 1460 CT
3 3590.420 3 2.5 1510 NY
4 3080.435 3 2.0 1430 NJ
Our properties data set is complete and appears to be clean, although it still
has missing values. Moreover, we have not examined the usage numbers very
carefully. The following commands give examples of the sorts of analyses we
might perform to examine these numbers. We start by creating a vector Xcols
to identify the columns of usage – that is, the ones whose names start with X.
> Xcols <- grep ("X", names (properties), value = TRUE)
> sapply (properties[,Xcols], function (x) sum (is.na (x)))
X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
       4        6        5        7       15       16
> sapply (properties[,Xcols], range, na.rm = T)
X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
[1,] 976.010 881.09 680.1299 680.690 1138.697 2123.470
[2,] 5847.704 10267.73 5448.1001 8484.425 9307.250 8421.106
The number of NA values increases slightly with increasing quarter, ending up
at about 14% of the rows (16 of 115). In the second command, we compute the
range of each column. The maxima of the usage values increase, but none are
negative or otherwise obviously anomalous.
We might also compute other summary statistics, like, for example, typical
usage by state. For one quarter's usage, this is easy to compute using the
tapply() function, as we demonstrate in the first command as follows. The
second command shows how this might be computed simultaneously for every
quarter.
> tapply (properties$X2016.Q1, properties$State, mean,
na.rm = TRUE)
CT NJ NY PA
3123.335 3321.658 3416.334 3237.445
> sapply (properties[,Xcols], function (qtr) {
tapply (qtr, properties$State, mean, na.rm=TRUE)})
X2016.Q1 X2016.Q2 X2016.Q3 X2016.Q4 X2017.Q1 X2017.Q2
CT 3123.335 2776.943 2871.059 4005.165 4461.140 4918.352
NJ 3321.658 2953.894 2892.502 3922.922 4445.038 4556.398
NY 3416.334 3120.687 3056.265 4043.530 4602.037 5085.058
PA 3237.445 3092.234 3239.677 3996.064 4669.660 4829.615
This last command loops over the columns whose names appear in Xcols,
calling tapply() for each one. The results are assembled into a matrix
automatically. We might plot this matrix; we might plot usage by square footage for
each state or quarter; we might examine the distributions of values by quarter,
perhaps with boxplots; and so on. After some investigation like this, we can
now begin the modeling process. This would be the stage at which to convert
the State column into a factor variable.
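That conversion is a one-liner; a sketch:
> properties$State <- factor (properties$State)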
We note here that our final product consisted of one row per property. To
strictly conform with the notion of “tidy” data from Chapter 1, in which each
usage observation occupies one row, one more transformation is necessary. We
know that the final data set will have 115 × 6 = 690 rows. The usage entries
will come from the Xcols columns. We can extract these using the stack()
function, which both produces the vector of 690 usage entries, and also includes
a second, factor vector identifying the source of the usage by name. This code
shows how we might do this.
> stack.output <- stack (properties, Xcols)
> stack.output
values ind
1 4226.522 X2016.Q1
2 3558.008 X2016.Q1
3 3897.872 X2016.Q1
:::
115 3825.980 X2016.Q1
116 3961.223 X2016.Q2
:::
We can see the transition from values that originated in the X2016.Q1 column
to ones from the X2016.Q2 column after the 115th row. Now we replicate
the remaining columns six times, and attach the result to the stack.output
object.
> usages <- data.frame (
properties[rep (1:nrow(properties), 6),c(1, 8:11)],
stack.output)
> usages[1:3,] # check
Serial Bed Bath SqFt State values ind
1 A195008 2 1.0 950 NY 4226.522 X2016.Q1
2 A255003 3 2.5 1460 CT 3558.008 X2016.Q1
3 A294008 3 2.5 1510 NY 3897.872 X2016.Q1
After this operation, our new usages data set is 690 × 7. If we need to
add columns giving the year or the quarter, we can use a command like
usages$Year <- substring (usages$ind, 2, 5).
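The quarter can be pulled out the same way, since names like X2016.Q1 carry
it in the last two characters; a sketch:
> usages$Quarter <- substring (usages$ind, 7, 8)   # "Q1", "Q2", and so on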
7.9 Chapter Summary and Critical Data Handling Tools
This chapter describes some of the practical steps we need to take in a real data
cleaning problem. These include acquiring the data, cleaning the data,
combining cleaned data sets from different sources, and, finally, undertaking any
necessary data preparation steps. All four of these steps require detailed
documentation, both to describe the steps you took and to make them reproducible.
The documentation you produce should also describe steps where you
exercised judgment. This might involve deleting records with mostly missing values,
or replacing a small number of missing values with another value. You might
document anomalies on which you took no action at all – for example, if you
observe that all the loans in a data set had dates whose days of the week were
Monday or Wednesday, that might seem peculiar but not impossible, depending
on the business practices of the data provider. Of course, the more you know
about the organization, the data, and the way it is gathered and processed, the
better your judgment can be.
Some data is transactional, meaning that it can reflect multiple observations
for each item of interest. We demonstrate some approaches for handling this
sort of data. While the demonstration requires a certain amount of explanation,
the script to actually perform the tasks requires only a few dozen lines of code.
The final section of the chapter goes through an example of cleaning some
fairly small data sets. This example demonstrated a number of the challenges
that face us in our day-to-day business of handling data: embedded delimiters,
inconsistent key conventions, missing or duplicate values, and so on.
There are, inevitably, many more ways that data can need cleaning than
the example touched upon. It did not, for example, involve data from
non-delimited formats such as Excel, JSON, or XML, or importing data from
a relational database. It did not require us to write a specialized function, nor
did it require the handling of dates or times. Many of these complications are
taken up in the extended exercise in the following chapter, but you should be
aware that data you get will very often involve a combination of input types
and require a number of approaches.
8 Extended Exercise
In this chapter, we set up a guided data cleaning task, from beginning to end.
This data (including the company and personal names and addresses) is entirely
fabricated and is intended only to demonstrate some of the concepts in this
book – but every quirk you see in this data is based on actual data we
encountered doing real projects. Unlike the smaller examples in earlier chapters, these
data sets are large enough so that you cannot spot all of their anomalies by eye.
However, you will be able to open and examine these outside of R – unlike some
of the data sets we deal with in real life.
This exercise requires time and focus to complete. You will get the most benefit
from this book if you read the chapter all the way through and try to perform
all the tasks in the exercise. Part of the exercise is figuring out exactly what needs
to be done at each step and in which order. We have included some
“pseudocode” – high-level descriptions of the algorithms we used to perform the
tasks – and some hints in Appendix A. However, we recommend you only use
that when you find yourself stuck. The actual code we used to do the cleaning
is available in the cleaningBook package. Again, you will receive the most
benefit when you try to solve the problem in its entirety before looking at our
code – and you may find a different, or better, way of getting the job done.
8.1 Introduction to the Problem
This data comes from a hypothetical client company called Hardy Business
Loans, which lends money to small businesses. Sometimes, the borrower is the
business itself; other times, the borrower is one (or more) of the business's
principal owners or partners. The loans are for small- to medium-sized pieces of
equipment: for a restaurant, this might be a pizza oven or espresso maker; for
a trucking business, it might be a truck or a copying machine; for a gardening
business, it might be a backhoe. Hardy Business Loans has acquired portfolios
of loans from two different companies, “Beachside Lenders” and “Wilson and
k
k k
k
248 A Data Scientists Guide to Acquiring, Cleaning, and Managing Data in R
Sons,” and inevitably the data for these two different portfolios has different
layouts and fields.
Each portfolio consists of applications and loans, each identified by a key
(although application keys are different from loan keys). Every loan should have
a corresponding application, but some applications do not lead to loans – they
are said to be “unbooked,” whereas those that are made into loans are “booked.”
An application might end up unbooked because the lender declined the
application, or because the borrower found the money elsewhere, or perhaps for
other reasons.
At the time the application is made, the lender consults several other
companies that provide information about the creditworthiness of the borrower.
In real life, these would include various credit reporting agencies, which
supply information about the credit habits and experience of individuals,
and also corresponding repositories of similar information for businesses.
Our “agency” data is modeled on the sorts of data made available from these
reporting agencies.
In many cases, the reporting agencies supply many columns containing
“low-level” data for each customer, and also a single “agency score.” Low-level
data might include, for example, the number of times a customer was 30 days
late on a credit card bill, the number of department store charge cards he or
she has, whether he or she ever declared bankruptcy, and dozens of similar
items. The agency score is a single number intended to rate the customer's
overall creditworthiness that combines all the low-level information using an
algorithm that is typically proprietary. Since Hardy acquires data from several
agencies, each borrower might have several agency scores. Lenders like Hardy
use the scores, but they might want access to the low-level data as well. For
example, a used-car dealer might want to build a custom score that focuses
specifically on borrowers' past record with car-loan payments. In our case,
the list of variables we need in the output includes a few of these low-level
variables and also all of the agency scores.
8.1.1 The Goal
The goal of the data cleaning exercise is to produce tabular data with one row
per loan and with all the necessary columns of descriptive data. (The set of
columns to be retained will be discussed shortly.) We will also need to report
on observations that were omitted and the reason, whether it was too much
missing data, no matching records in agency data, or something else. We may
need to show the distribution of loans by region of the country and by calendar
quarter, through tables, graphs, and perhaps by other variables as well.
Our output will be in two data frames (or, if you prefer, one big data frame
with records of two types). The first data set comes from the booked loans. For
these loans, we will need to construct the “response variable” for later
modeling efforts. This is the measurement that describes the outcome of the loan.
In other applications this might be a number, but in this case it will be one of
the levels “Good” or “Bad.” (We give the definitions of those terms in the
material that follows.) So each row of this part of the output will contain one
of those values as a response, a number of measurements on different variables
(the “predictors”), and perhaps one or more keys that allow the final data to be
linked to the components from which it was built.
The modeling step – which is not part of this exercise – will use the predictor
variables to predict the response variable. If the responses in the booked loans
can be predicted accurately, Hardy Business Loans could use that model as a
screening tool by which to evaluate the risk of new loans or of its portfolio.
In addition to the predictors and the response for booked records, we must
also construct a similar data set using the records from unbooked applications.
By applying the model to the set of unbooked applications, Hardy can learn
about the rates at which it is turning away potentially successful borrowers. Of
course, these unbooked records do not have response variables. But it may be
possible to estimate how unbooked applications would have fared, based on the
statistical model from the booked ones. So we need to prepare the unbooked
applicants and the loans in a common way.
8.1.2 Modeling Considerations
The data sets that we produce at the end of this exercise are a beginning, not an
end. We can expect to make further changes as the modeling proceeds. These
are the steps we called “preparing data” in Section 7.5. For example, we might
choose to transform a numeric variable by, say, taking its logarithm.
Categorical variables may need to be re-grouped, or numeric ones binned. We might
exclude more records as we determine their data to be unreliable. The processes
of cleaning, preparing, and modeling are iterative ones.
Another consideration is that we might want to divide the data into pieces,
one piece (the “training set”) for building models and a second piece (the “test
set”) for evaluating them. This division might be performed at random, with,
say, 30% of the entire data set retained as a test set, but we might also sample
using a more complicated scheme if, say, we wanted the training set to have a
higher proportion of Wilson and Sons loans, or newer loans, than the data as a
whole. In any case, this important consideration is not part of our exercise.
8.1.3 Examples of Things to Check
To the extent possible you should check every column against the data
dictionary. We have intentionally (and, probably, unintentionally) introduced
anomalous entries into the data to mirror the sorts of problems we encounter
in practice. Follow the steps mentioned in Section 7.2 in looking for duplicates
and ensuring that the set of column classes is as expected. Look at the maxima
and minima of numeric and date columns, and tabulate the categorical ones.
Examining each column in isolation is a good start, but it is also worthwhile
to compare pairs of columns. Of course, there are a lot of pairs and often they
cannot all be checked. But it may be worthwhile to build two-way tables of
important categorical variables, keeping an eye out for unexpected results like
months with almost no observations.
In practice, you will often find that some predictor variables are available
from more than one source. When two offerings of the same variable are
available, you might consult the data provider to see if they are considered to be
equally reliable. As we said in Section 7.2, it is useful to compute the
proportion of times that the two agree – often, we find that two versions of a variable
(call them “A” and “B”) have the same values when both are present, but that A is
missing much more often than B. In that case we will use B, possibly
augmenting its own missing entries by values taken from A. When A and B disagree
on other than missingness, we exercise our judgment in deciding how to move
forward. In this exercise, we do not expect that sort of overlap in variables to
occur – but it does happen in practice.
8.2 The Data
The data for this exercise needs to be constructed by combining inputs from
seven sources. The sections following this one describe each of these sources
in detail, and give layouts or sample data. The seven sources are as follows:
the two loan and application portfolios, one from Beachside and one from
Wilson, giving information about the borrower and the product to be
purchased with the loan;
the scores database, giving scores for the borrowers from four agencies,
including a score called “KSC” or “KScore”;
the co-borrower scores, giving agency scores for co-signers of those loans
that have them;
a set of updated KScores, which may supersede the KScore in the borrowers'
data;
a set of loans to be excluded from the final data set; and
the payment matrix, showing payment information for up to 48 prior billing
periods.
Figure 8.1 shows a diagram of one process that can be used to construct
the final data. The loan and application portfolios (two left items, top row)
contain the information about borrowers and loans. The layouts of these tables
are described in Section 8.4. These tables include low-level predictors; these
will then be joined to the scores database (Section 8.5), shown in the top
center-right of the figure, which contains the agency scores for the loans and
[Figure 8.1 Schematic of the data cleaning process for the example data. Its boxes
are the Beachside and Wilson predictors (tables), Scores (SQL), Co-borrower
scores (fixed-width), Updated KScores (JSON), Exclusions (XML), and the
Payment matrix, flowing through the updated and final predictors and the
Good/Bad selection into the cleaned data and the final data for models.]
applications in the two portfolios. Some loans have co-borrowers; those scores
are supplied in the co-borrowers data (Section 8.6), shown at center left in
the figure.
There are then two modifications that need to be imposed on the data. First,
a set of updated KScores (bottom left of figure) is available; loans and
applications appearing in that data set may have to have their KScore values updated
(see Section 8.7). In addition, a small number of loans and applications should
be excluded from the modeling effort, based on some exclusion data (middle
right of picture). We apply this exclusion as the last step in generating the
predictors, because the excluded records need to be updated anyway, for a later
effort, which is not part of this exercise. The payment matrix (Section 8.9),
shown at top right, is used to generate the response variable for the modeling
process. The algorithm described in that section can be used to determine
whether the performance of a loan was “Good,” “Bad,” or “Indeterminate.” The
final data set will include only those loans whose performance was “Good” or
“Bad.” The final set of predictors is joined with the response (for the booked
records that have one) to produce the cleaned data (gray box, mid-lower right).
Finally, we exclude unbooked records or those whose response was
“Indeterminate” to produce the subset of records that will be used in the initial modeling
effort (bottom right of figure).
8.3 Five Important Fields
There are five fields that are vital to this exercise, which are as follows:
Application Number: A 10-digit number identifying the application. Every
application has a unique application number, and application numbers
should be distinct between the two portfolios.
Loan Number: A field identifying the loan for booked applications. In the
Wilson portfolio, the loan number is supposed to consist of a six-digit
customer identification, followed by a three-digit “instance” number. The
instance number starts at 001 and increases when a previous customer
receives a second or third loan. In some cases, though, the instance is
missing. In that case 001 can be used (see the sketch after this list). The
Beachside portfolio identifies loans by a six-digit number with no instance.
It is possible for the same six-digit number to be used in both portfolios,
purely by happenstance, which, combined with the fact that some Wilson
loans do not have instance numbers, means that six-digit loan numbers are
not necessarily unique.
Application Date: The date on which the application was completed. For the
purposes of evaluating unbooked loans in the modeling process, we must
restrict ourselves to using only information that was available by this date.
So if a particular unbooked application is dated November 30, 2016, we will
ignore scores, co-borrower information, and updated KScores recorded after
that date.
Loan Date: The date on which a loan is “booked” (made official). For booked
loans, scores and other updated information recorded after this date should
be ignored.
Active Date: We will use the term “active date” to identify the application date
for unbooked applications and the loan date for booked loans. This date is
important because it marks the date on which a decision is made, and agency
scores, for example, that are recorded after the active date will need to be
ignored. Unlike the other important fields, active date has to be inferred.
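A minimal sketch of the instance-number repair mentioned above, assuming
the Wilson loan numbers sit in a character vector wilson$Loan.Number (the
object and column names are our own):
> no.inst <- !is.na (wilson$Loan.Number) &
             nchar (wilson$Loan.Number) == 6   # instance missing
> wilson$Loan.Number[no.inst] <- paste0 (wilson$Loan.Number[no.inst], "001")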
8.4 Loan and Application Portfolios
The two portfolios acquired by Hardy Business Loans come from two sources:
Beachside Lenders and Wilson and Sons. Both sets of data provide low-level
reporting agency data recorded at or before the active date for both unbooked
applications and loans.
The data for Beachside Lenders appears as an Excel workbook named
Beach.xlsx in the newer .xlsx format. Data for the Wilson portfolio
comes in an Excel workbook named wilson.xls in the older .xls format.
The Wilson data can contain multiple loans for one borrower (although each
has a separate application number). We might treat these multiple loans as if
they are independent (although they are not), or we might keep only the most
recent ones (although in that case we presumably lose some information), or
we might try combining multiple loans for the same borrower. In any case, the
number of multiple borrowers is so small that it will almost certainly make
no difference. For this example, we will keep all of the loans as if they are
independent.
Most of the variables that we need to extract or, in some cases, construct,
for the modeling effort are given in Table 8.1 and are laid out in the following
three sections. The table lists the desired columns both by the Beachside names
and by their Wilson names, which are, inevitably, different. The two layouts
in the following sections, and Table 8.1, serve as the data dictionary (Section
7.6) for the loan and application portfolios. In theory, you would hope that the
data dictionary would give a complete and accurate description of the data.
In our example, though – just as in real life – you may find slight
discrepancies between what the dictionary says the data should look like, and what is
actually in the data. Finding these discrepancies and adjusting for them – and
documenting them – is part of the exercise!
8.4.1 Layout of the Beachside Lenders Data
The Beachside portfolio contains information about 23 attributes of 8621 loan
applications. Those 23 columns are described here. A number of columns, such
as the Dbl and KAK columns in the final entry on the list, contain information
that is never used. This is a common occurrence.
APP_NO, APP_DT, LOAN_NO, LOAN_DT: Application number and application
date for every entry; loan number and loan date for booked loans
STAT: The status of the application (Booked, Decline, TUD; see below)
CUST_NM, CUST_ADD1, CUST_CT, CUST_ST, CUST_ZIP: Name, address,
city, state, and ZIP code of customer
TOT_COST, LOAN_AMT: Total cost of equipment and amount of loan
GCORP, Local: Used to construct “credit risk” (see Section 8.4.3)
TIB: Time in business, in years
ASSET_NEW: T if the equipment was to be purchased new, F if purchased
used
BUS_TYPE, ASSET_TYPE, Dbl, KAK, Folds, JL, 2BC, MM ?: Unused.
The STAT column describes the status of the loan. For booked loans, this
will be “Booked.” For unbooked applications, this can be either “Decline,”
meaning that the lender declined to make the loan, or “TUD,” which indicates
that the borrower “Turned Us Down.” This happens either because a borrower
finds more favorable terms elsewhere, or changes her mind about whether
the purchase is necessary (or goes out of business or another unusual event
takes place).
8.4.2 Layout of the Wilson and Sons Data
The Wilson data has 10,587 rows and 23 columns. Most of these columns
carry the same information as in the Beachside portfolio, though with different
names. The Wilson columns are shown here.
Appl Number, Appl Date, Loan Number, Loan Date: Application
number and date for every entry; loan number with instance and loan date
for booked loans
Application Status: Application status, like the STAT column from
Beachside. Values are Approved (booked), Declined, Incomplete (equivalent
to “TUD” in Beachside)
Customer Name, City, State, Zip: Identification of customer
Total Cost, Net Loan Amount: The cost of the equipment and the size
of the loan
GC Indicator, Local Cred: Used to construct “credit risk”
Business Start Date: Date on which business began operating
New/Used Indicator: Indicates whether the equipment being
purchased is “New” or “Used”
Customer Type, Equipment Class, CVR Indicator, Reclass
Indicator, Minimum GM, 2nd Gen PP, GS Indicator: Unused.
8.4.3 Combining the Two Portfolios
Table 8.1 shows the columns required for the modeling effort that can be found
in, or derived from, the two portfolios or the additional data. In a real example,
we might keep dozens, scores, or even hundreds of columns, some formed by
transforming or combining others. For each column in the final data set, we
give the corresponding columns in the two loan portfolios: “derived” denotes
columns that need to be constructed, using details that follow the table. Just as
in real life, we will need to combine columns with different names and values
that nonetheless represent the same underlying measurement.
The Time in Business field measures the amount of time that the borrower
has been in business – in other words, its age (in years). In the Beachside case,
this is given directly, in years, by the TIB column. In the case of Wilson, we
will need to deduce this from the date that the business was reported to have
started, given by the Business Start Date column. The age is the time between
the business start date and the active date (i.e., the Loan Date for booked loans,
the Application Date for unbooked ones).
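A sketch of the Wilson derivation, assuming two Date vectors start.date and
active.date (hypothetical names):
> tib <- as.numeric (active.date - start.date) / 365.25   # age in years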
The Region column represents the census region associated with each state.
Table 8.2 gives a table showing which state (by abbreviation) is assigned to
which region. This information is also available in the cleaningBook
package in a more convenient form, as a 50-row data frame called state.tbl.
Observations from locations not in any region (such as Canada or Puerto Rico)
should be given the value “Other” for the Region field.
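One way to make the assignment, assuming state.tbl has columns named
State and Region (our guess at its layout) and st holds the customers' state
codes:
> region <- state.tbl$Region[match (st, state.tbl$State)]
> region[is.na (region)] <- "Other"   # Canada, Puerto Rico, and so on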
Table 8.1 Columns required for predictive model
Name              Beachside  Wilson              Notes
App Number        APP_NO     Appl Number         Application number
App Date          APP_DT     Appl Date           Application date
Loan Number       LOAN_NO    Loan Number         Loan number (if booked)
Loan Date         LOAN_DT    Loan Date           Loan date (if booked)
AppStat           STAT       Application Status  Booked, declined, or TUD
Time in Business  TIB        Derived             Time in business
State             CUST_ST    Customer State      Customer state/province
Region            Derived    Derived             Customer region (see notes)
Cost              TOT_COST   Total Cost          Total equipment cost
Loan Amount       LOAN_AMT   Net Loan Amount     Loan amount
Down              Derived    Derived             Down payment (numeric)
New/Used          ASSET_NEW  New/Used Indicator  T (“new”) or F in Beachside; New or Used for Wilson
CredRisk          Derived    Derived             Credit risk (see notes)
Source            Derived    Derived             Beachside or Wilson
Co-borrower       Derived    Derived             (see notes)
Table 8.2 State abbreviations by census region
Region     State
Midwest    IL, IN, IA, KS, MI, MN, MO, NE, ND, OH, SD, WI
Northeast  CT, ME, MA, NH, NJ, NY, PA, RI, VT
South      AL, AR, DC, DE, FL, GA, KY, LA, MD, MS, NC, OK, SC, TN, TX, VA, WV
West       AK, AZ, CA, CO, HI, ID, MT, NV, NM, OR, UT, WA, WY
The loan amount is usually smaller than the equipment cost, since most
customers will make a down payment or trade in an older piece of equipment.
The Down column is computed as Cost − Loan Amount; this may be useful in
a predictive model since companies that can make large down payments may
be healthier – or, alternatively, a large down payment may leave them short
on cash.
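In R this is one vectorized subtraction; a sketch, using loans as a hypothetical
name for the combined portfolio data frame:
> loans$Down <- loans$Cost - loans$Loan.Amount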
The CredRisk credit rating is derived from two other columns present in the
portfolio files. First, the GCORP or GC variable gives the borrower a letter grade
from A (the highest score) to D. Second, the Local score gives a number from
1 (the highest) to 3. We will need to create a new credit risk predictor by
combining those two indicators into a single column with values A1, A2, A3, B1,
and so on – except that C3 and all the D ratings should be combined into a
single level called X.
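A sketch of the combination, assuming vectors gc (the letter grades) and
local (the numeric scores); both names are ours:
> credrisk <- paste0 (gc, local)                 # "A1", "B3", and so on
> credrisk[credrisk == "C3" | gc == "D"] <- "X"  # collapse the bottom levels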
The co-borrower indicator shows whether an applicant has a co-borrower.
This is determined by the existence of co-borrower scores in the data of
Section 8.6. Application numbers for which co-borrowers appear in that data
should be given a co-borrower value of TRUE, and all others should be given
the value FALSE.
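For example, in this sketch co.apps (our name) holds the application numbers
found in the co-borrower file:
> loans$Co.borrower <- is.element (loans$App.Number, co.apps)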
8.5 Scores
Each customer may have been assigned scores by the different agencies.
Primary borrowers' scores are stored in an SQLite repository called
scores.sqlite, accessible via SQL (see Section 6.3). Agencies report
scores fairly frequently, and customers' scores can change from time to time.
For each customer, we want the score that was the most recent at the time
of the active date. Scores that are recorded after the active date are not to be
used in our modeling efforts. However, the active date is not part of the scores
database; it comes from the loan and application portfolios. In our example,
each customer can have scores from up to four agencies. These scores have the
names “Rayburn,” “KScore,” “J&G,” and “NorthEast.” Valid values for scores
are 200–899, so values outside this range should be ignored.
In the database, customers are identified by an eight-digit customer number,
which is not the same as an application or loan number. The database contains a
table named CScore containing this customer number, the agency scores, and
the month associated with the scores. Each customer may have many records,
one for each reporting month, so this table does not have a unique key (although
one could be created by combining the customer number and month). A second
table, CusApp, links customer numbers to application numbers. The following
section shows the layouts of these tables.
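One way to pull both tables, joined on the customer number, uses the DBI and
RSQLite packages; a sketch, with table and column names taken from the
layouts in the next section:
> library (DBI)
> con <- dbConnect (RSQLite::SQLite(), "scores.sqlite")
> scores <- dbGetQuery (con,
      "SELECT a.Appl, s.* FROM CScore s
         JOIN CusApp a ON s.Customer = a.Customer")
> dbDisconnect (con)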
8.5.1 Scores Layout
The four agency scores are supplied in the scores database as an SQLite
repository. The repository contains two tables. The layouts are shown in Tables 8.3
and 8.4. In the “format” columns, each 9 corresponds to a digit, and YYYYMM,
of course, corresponds to a four-digit year and two-digit month. The CScore
table holds the scores. Each set of four scores corresponds to a single month,
given in the ScoreMonth field.
Table 8.3 CScore table
Name        Format    Description
Customer    99999999  Customer number
RAY         999       Three-digit Rayburn score
JNG         999       Three-digit J&G score
KSC         999       Three-digit KScore
NE          999       Three-digit NorthEast score
ScoreMonth  YYYYMM    Month associated with scores
Table 8.4 CusApp table
Name      Format      Description
Customer  99999999    Eight-digit customer number
Appl      9999999999  Ten-digit application number
The CusApp table cross-references the customer number to the application
number.
8.6 Co-borrower Scores
Some personal borrowers specify co-borrowers, who are other individuals
who co-sign for the loan, thereby promising to pay the lender if the original
borrower defaults. Borrowers with co-borrowers might be expected to be
more likely to pay the loan back, on average, depending, perhaps, on the
creditworthiness of the co-borrower. Co-borrowers have their own agency
scores; that information (for both of the portfolios) is stored in the text file
called CoBorrScores.txt. This file consists of a series of transactional
records, each co-borrower being represented by between three and a few
dozen records. Records come in different types, the different types being
indicated by the three-letter sequence starting in position 9. The first record
for each co-borrower is the master record with three-letter sequence J01. This
J01 record contains the application number associated with that co-borrower,
in character positions 20–29. Any other characters in the J01 line can
be ignored.
All the records for a single co-borrower will be on separate lines following
the J01 record. These other records can be of several types; for our purposes,
we are interested only in records of type ERS, which give co-borrower scores.
Records with types other than J01 or ERS should be ignored. Each ERS record
reports a score type (with Rayburn, KScore, J&G, and NorthEast being
abbreviated by RAY, KSC, JNG, and NTH, respectively), the score value, and a score
date in format MM/DD/YYYY (and followed by other information not needed
for this exercise). For each co-borrower we want the latest value of each score
that precedes the active date, except that, as before, only scores in the range
200–899 should be used. The following section shows examples of the records
in the co-borrower file for a particular co-borrower.
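Because the positions are fixed, substring() makes a natural first pass over
this file; a sketch:
> txt <- readLines ("CoBorrScores.txt")
> rectype <- substring (txt, 9, 11)   # J01, ERS, BBR, FOE, ...
> appnum <- substring (txt, 20, 29)   # meaningful on J01 lines only
> table (rectype)                     # survey the record types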
8.6.1 Co-borrower Score Examples
Co-borrowers can have different numbers of records; the only way to tell that
we have seen all the records for a particular individual is by observing that
the next record has type J01 (except for the very last co-borrower in the
file). As with the scores database, we want to extract from the co-borrower
file the agency scores immediately before the active date, and, again, the
active date is not found in the file. Here are some example records from the
CoBorrScores.txt file; assume that these are the only records for this
co-borrower.
SuppDat J01 AppNum 3385219170 Source 2151 Seq 661 Mod1 N ...
SuppDat BBR Check 2 Center 21 Audit F ...
SuppDat ERS NTH 720 10/31/2016 Mod N Report1 YNY
SuppDat ERS JNG 692 12/12/2016 Mod Y Report1 YYY
SuppDat ERS NTH 999 12/31/2016 Mod N Report1 NNY
SuppDat FOE OfcCode 22X OfcState NY OfcReg 22 OfcHold N...
SuppDat ERS RAY 999 10/31/2016 Mod Y Report1 NNN
SuppDat ERS NTH 717 11/30/2016 Mod N Report1 YNY
In this example, we see scores for application number 3385219170. The second
and sixth lines have codes BBR and FOE; these lines should be ignored. The
JNG score has the single entry 692, and that pair of score and date should be
retained. The RAY score is given as 999, so no Rayburn score is retained for this
co-borrower. There are three reports of a score of type NTH; of these, the most
recent has value 999 and should be ignored. So two (score, date) pairs need to
be returned for the NTH score for this co-borrower. Therefore, this co-borrower
should be associated with a total of three pairs of scores and dates. The actual
NTH score entered into the final database will depend on the active date for
this application. If the active date is October 30, 2016 or earlier, none of the
three NTH rows in the example apply, and no NTH score will be used. If the
active date is between October 31, 2016 and November 29, 2016, the third row
of the example (the first of the NTH rows) applies, and the NTH score for this
co-borrower should be reported as 720. Finally, if the active date of the
application is November 30, 2016 or later, the last row of the example applies and the
NTH score should be reported as 717.
8.7 Updated KScores
The scores database includes a column called KSC, which reports KScores,
one of the agency scores to be used in the modeling. Many of these KScores
are missing in the original scores database. However, in the time since that
database was assembled, the lenders have acquired a new set of KScores
from the vendor of KScores, and they want to insert those scores into the
predictor data frame. These scores are delivered as JSON records in one file
named KScore.update.json. (Technically, a file with a set of valid JSON
records is not itself valid JSON, but there you have it.) This file is small enough
that it can be read into memory all at once, though if it were not we could
certainly read one line at a time in the style of the examples in Section 6.2.
Updated KScores refer to primary borrowers only, not to co-borrowers. Not
every customer will have an updated KScore, and some updates will refer to
customers not in our analysis. The appid field in a JSON record will introduce
an application number, but those application numbers have been treated as
numeric, with leading zeros inadvertently removed. JSON records with fields
called KScore will be relevant, as long as the score is between 200 and 899.
Each record will contain a field called record-date; scores whose values of
record-date are more recent than the active date of the application in
question should also be ignored. If there is more than one update for a particular
account, we should use the more recent among those whose record-date
values precede the active date for this account. The KScore.update.json
file may contain other fields, too, including values of other scores, but those
scores, too, should be ignored. Only the KScore and its corresponding date
and application number are of interest for this portion of the data preparation.
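Since each line holds one self-contained JSON message, the jsonlite package
offers one approach; a sketch in which fromJSON() turns each message into a
named list:
> library (jsonlite)
> msgs <- lapply (readLines ("KScore.update.json"), fromJSON)
> has.ks <- sapply (msgs, function (m) "KScore" %in% names (m))
> sum (has.ks)   # how many messages carry a KScore?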
8.7.1 Updated KScores Layout
The values of KSC retrieved from the scores database described in the
previous section need to be updated for those applications that appear in the
KScore.update.json file. This file consists of a set of entries (“messages”)
in JSON format. Messages that do not include a field named KScore are not of
interest. From those that do, we will extract the application identifier (appid,
with leading zeros truncated), the score itself (KScore, represented as a string)
and the date associated with the score (record-date, in MM-DD-YYYY
format).
Example messages might look like this:
{"appid":"442480", "Rayburn":"721", "NorthEast":"999",
"record-date":"09-20-2016"}
{"appid":"621877", "KScore":"999", "J&G":"744",
"record-date":"03-31-2016"}
{"appid":"3826", "Rayburn":"699", "KScore":"685",
"J&G":"701", "record-date":"10-03-2016"}
{"appid":"3826", "KScore":"676", "record-date":"12-19-2016"}
Here, the first record would be ignored, because no KScore is reported, and the second would be ignored because the KScore has an invalid value. The third and fourth records might cause the KScore for application number 0000003826 to be updated. If that application's active date (application date for unbooked loans, loan date for booked ones) was earlier than October 3, 2016, no update is performed; if the application's date is between October 3 and December 19, 2016, the KScore for that application would be updated to 685; if the application's date was later than December 19, 2016, the KScore for that application would be updated to 676.
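To make the decoding concrete, here is a minimal sketch of handling a single message with the rjson package (any of the JSON tools described in Chapter 6 would serve); msg holds the fourth example message above as one string.
library(rjson)
msg <- '{"appid":"3826", "KScore":"676", "record-date":"12-19-2016"}'
rec <- fromJSON(msg)                     # a named list of fields
if (!is.null(rec$KScore)) {
    score <- as.numeric(rec$KScore)
    valid <- !is.na(score) && score >= 200 && score <= 899
    rdate <- as.Date(rec[["record-date"]], format = "%m-%d-%Y")
}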
8.8 Loans to Be Excluded
The lender has specified that a small number of loans be excluded entirely from the modeling effort. The loans to be excluded are among those for which XML files are found in the Excluders directory. Each XML file corresponds to one loan, identified by that loan's application number in the <APPID> field. Any loan whose XML file contains a field called <EXCLUDE> with the value “Yes” or “Contingent” should be excluded. The <EXCLUDE> field, if it exists, will be found inside an element named DATA. Applications that do not appear in XML files, or whose XML files do not have an <EXCLUDE> field, or whose values of <EXCLUDE> are other than Yes or Contingent, should not be excluded on the basis of their XML data.
8.8.1 Sample Exclusion File
Here, we present a sample file of the sort in the exclusion directory. This file would lead to the exclusion of application number 0134292005, since the value of the EXCLUDE field is Yes. Other fields in the XML files can be ignored.
<?xml version="1.0"?>
<rpt><CLASS2 type="a">Supp</CLASS2><APPID>0134292005</APPID>
<DATA><PRELIM>Yes</PRELIM><CR2>25167</CR2>
<EXCLUDE>Yes</EXCLUDE><CATCODE>36H</CATCODE></DATA>
</rpt>
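As an illustration, here is a minimal sketch of extracting the two fields of interest from a file like this one using the XML package; the file name is hypothetical.
library(XML)
doc <- xmlParse("Excluders/ex0134292005.xml")          # hypothetical name
appid <- xpathSApply(doc, "//APPID", xmlValue)         # "0134292005"
excl  <- xpathSApply(doc, "//DATA/EXCLUDE", xmlValue)  # length 0 if absent
exclude.me <- length(excl) == 1 && excl %in% c("Yes", "Contingent")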
8.9 Response Variable
In this section, we describe how to construct the response variable for our modeling effort. This is the measurement that describes the outcome of the loan. In other applications this might be a number, but in this case it will be one of the levels “Good,” “Bad,” or “Indeterminate.” We will construct these values
from the payment matrix contained in the Payments.txt file. For each loan
(identified by a payment identifier found in the portfolio files), this fixed-width
file contains information about the last 48 months. For each month, the file
records four items: the amount paid, the amount delinquent (which can accu-
mulate from month to month), the billing date, and the number of months since
the account was last fully paid. Each payment record also carries a 10-digit
application identifier.
The file will carry zeros in all four fields for any month earlier than the loan origination date. For example, a loan that was 10 months old will have 10 sets of payment information (from 1 month ago, 2 months ago, etc.) preceded by 38 sets of zeros. The layout of this file comes to us in the following COBOL code, which we provide because on some occasions our documentation has come to us in the form of COBOL specifications like this one:
001000 01 PAY-LAYOUT.
001100 05 APPID PIC X(10).
001200 05 FILLER PIC X(01).
001300 05 PAYMENT-REC OCCURS 48 TIMES.
001400 10 AMT-PD PIC S9(9)V99.
001500 10 AMT-DELQ PIC S9(9)V99.
001600 10 BILL-DATE PIC 9(8).
001700 10 MONTHS-SINCE-PAID PIC 999.
Even without knowing COBOL, you can probably deduce that this record starts with a 10-character payer identifier called APPID. The PIC X(10) tells us to expect 10 alphanumeric characters. The 11th character, indicated by FILLER, can be ignored. Then there follow 48 instances of four fields each. PIC 9 clauses indicate numeric values, so the S9(9)V99 tells us that, for each of the first two fields, we can expect a sign (positive or negative, or a space, which also indicates a positive value) and a numeric value of 11 digits, with an “implied” decimal point before the last 2 digits. That is, no actual space is set aside for the decimal point; dollar amounts are represented in pennies, but the layout tells the program where the decimal point is supposed to be. (This sort of storage was very common in COBOL, a language that is no longer widely used.) The third element in each set is an eight-digit date, in YYYYMMDD format, and the fourth is a three-digit number giving the number of months that the account was past due. The months are in order in every set, with the least recent month being associated with the first set of four values and the most recent being the 48th. Examples of this data might look, in part, like the lines in the following example. These lines are very long; for display purposes, we have wrapped lines around the page so that two lines in the display correspond to one line of the file.
0034344639 00000000000 0000000000000000000000
00000000000 0000000000000000000000...
0048268540 00000860623 0000000000020101214000
00000000000 0000086062320110114001...
In this example, the first account did not have a loan 48 months ago, so the values for that month are all zeros. The second customer paid $8606.23 (AMT-PD with a space for the sign and the implicit decimal point) in the first month. She was not delinquent, so the AMT-DELQ field carries a space for the sign and then 11 zeros. The billing date for that payment was 2010-12-14, and the customer was up to date, so the “months since paid” field carries the value 000. In the second month, with billing date 2011-01-14, the customer paid nothing and was delinquent by $8606.23; during that month, the “months since paid” field had the value 001. Each row will have a total of 48 of these 35-character sets of payment records.
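To make the implied decimal point concrete, here is a minimal sketch that decodes a single 12-character S9(9)V99 field; the example value is the AMT-PD field of the second record above.
amt.pd <- " 00000860623"   # sign character, then 11 digits
sgn <- if (substring(amt.pd, 1, 1) == "-") -1 else 1
dollars <- sgn * as.numeric(substring(amt.pd, 2)) / 100
dollars                    # 8606.23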
We are now ready to define our response variable. Of course, this will normally be defined by the client (in this case, Hardy Business Loans) rather than the data cleaner. For this example we will implement these rules, which are handled in priority order, so that, for example, a customer who meets the criterion of rule 1 is not considered by a later rule.
1) Any customer whose loan is six or fewer months old is considered Indeterminate.
2) A customer who is ever 3 months delinquent, or who is 2 months delinquent on two different occasions, is Bad.
3) A customer who is ever more than $50,000 delinquent is Bad.
4) Customers who meet none of these criteria are Good.
Since the payment file has a fixed format, it is natural to read it into R with read.fwf() (Section 6.1.7), although if the file were huge it might be more efficient to read it line by line and decode it ourselves.
8.10 Assembling the Final Data Sets
The final product from this exercise should be two data sets, one with booked loans whose outcome is not “Indeterminate,” and one with all other applications – or, if you prefer, one combined data set containing both booked and unbooked applications.
8.10.1 Final Data Layout
At the conclusion of the exercise, each data set should consist of these pieces:
• Identifying information for each entry: application number; application portfolio (Beachside or Wilson); loan number (if any); loan and application dates, or a single column giving the active date.
• The value of the response variable (“Good” or “Bad”) for loans. For the unbooked applications we can omit this column, or we might create an empty column so that the two data sets have exactly the same set of columns.
• The remaining columns from Table 8.1.
• Four columns of agency scores, with a suitable indicator to denote missing scores, and with KScores updated as appropriate. It might be of value to keep the corresponding score dates for documentation purposes.
• Four columns of agency scores for co-borrowers, again with a suitable indicator of missingness, and possibly including corresponding dates for these scores as well. For observations without co-borrowers, of course, all of these co-borrower scores will be missing.
It is also important to keep track of your code and your assumptions. You will
also want to construct a table giving, for each application number found in any
source, its ultimate disposition – whether it appears in the final data set or was
deleted, and, if so, why. You may also want to construct a flowchart like the one
in Figure 7.1.
8.10.2 Concluding Remarks
The exercise in this chapter is intended to bring together a number of the skills and techniques described throughout the book. If you can complete the exercise in a satisfactory way, we think you are well on your way to mastering the important skills of data cleaning.
Having said that, let us assure you that there are many ways to do data cleaning in general, and this exercise in particular. After the authors devised the data for this book, we set out to do the exercise separately. We took quite different paths: one of us started with Beachside, then moved on to Wilson, and then combined the two data sets into one to create the first big data set of the exercise. The other went column by column through Table 8.1, extracting the Beachside and Wilson versions and cleaning and combining them. One of us turned dates into Date objects to ensure that they were sorted correctly; the other used YYYYMMDD-style text objects, turning them into POSIXt-type dates as needed. One of us turned “unknown” values for the New/Used column into “Used,” and the other left them as “unknown.” Of course, in real life we would have tried to refer the question of what action to take to the data provider, and possibly to whoever will be doing the modeling as well. Despite our different approaches, our final products agreed almost exactly – “almost” because of the judgments made in handling the small numbers of unexpected values. For the purposes of evaluating our exercise, we worked separately. In real data cleaning work, though, you will probably find it helpful to work in a small group, or at least to show your work to a colleague who can understand and comment on the steps.
When we originally wrote this exercise, it had more steps. The additional ones, inspired though they were by real data we analyze, were tedious and repetitive, so we took them out. But just because they are not in the exercise doesn't mean your data cleaning tasks won't be filled with tedious, detail-oriented actions. For example, you might look at the BUS_TYPE and ASSET_TYPE columns in the Beachside portfolio, which match up, more or less, to the Customer Type and Equipment Class columns in the Wilson one. In order to include those columns in the final output you need to examine the set of labels in the two portfolios and decide how to match them up. For any one column from two sources this task is easy enough, but given dozens of these the workload can be intimidating.
We said at the beginning of the book that data cleaning might take up 80%
of the time we devote to a project. It is not always fun or glamorous, but it
always needs to be done properly. In order to be good at data cleaning in
R, you will need to understand what data is available, and how it is represented.
You will need to be familiar with the different formats in which data arrives,
and how to combine data from different sources. Perhaps most importantly
you will need to know at least a little about how the data is collected and how
it is intended to be used. One of the great things about being a data scientist
is that you get to learn at least a little bit about a lot of fields, and a lot of that
learning comes from practicing with real data.
Hints and Pseudocode
Chapter 8 described a data handling task involving acquiring data from spreadsheets, a database, JSON, XML, and fixed-width text files. The formats and layouts of the data are documented in that chapter. In this appendix, we give some extra hints about how to proceed. We recommend trying the exercise first, without referring to this appendix until you need to.
Some of these hints come in the form of “pseudocode.” This is the programmer's term for instructions that describe an algorithm in ordinary English, rather than in the strict form of code in R or another computer language. The actual R code we used can be found in the cleaningBook package.
A.1 Loan Portfolios
Reading, cleaning, and combining the loan portfolios (Section 8.4) is the first task in the exercise, and perhaps the most time-consuming. However, none of the steps needed to complete it is particularly challenging from a technical standpoint.
If you have a spreadsheet program such as Excel that can open the file, the
very first step in this process might be to use that program to view the file. Look
at the values. Are there headers? Can you see any missing value codes? Are some
columns empty? Do any rows appear to be missing a lot of values? Are values
unexpectedly duplicated across rows? Are there dates, currency amounts, or
other values that might require special handling?
Then, it is time to read the two data sets into R and produce data frames, using one of the read.table() functions. We normally start with quote = "" or quote = "\"" and comment.char = "". You might select stringsAsFactors = FALSE, in which case numeric fields in the data will appear as numeric columns in the data frame. This sets empty entries in those columns of the spreadsheet to NA and also has the effect of stripping leading zeros in fields that look numeric, like the application number. As one
alternative, you might set colClasses = "character", in which case all of the columns of the data frame are defined to be of type character. In this case, empty fields appear as the empty string "". Eventually, columns that are intended to be numeric will have to be converted. On the other hand, leading zeros in application numbers are preserved. From the standpoint of column classes, perhaps the best way to read the data in is to pass colClasses as a vector to specify the classes of individual columns – but this requires extra exploration up front to determine those classes.
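As a concrete starting point, here is a minimal sketch of such a call, assuming the Beachside portfolio has been exported from the spreadsheet to a CSV file (the file name beachside.csv is hypothetical):
beach <- read.table("beachside.csv", sep = ",", header = TRUE,
                    quote = "\"", comment.char = "",
                    colClasses = "character")
str(beach)   # all columns character; leading zeros preserved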
Pseudocode for the step of reading, cleaning, and combining the portfolio
data files might look like this. In this description, we treat the files separately.
You might equally well treat them simultaneously, creating the joint data set as
you go.
for each of {Beachside, Wilson}
read in file
examine keys and missing values
discard unneeded columns
convert formatted columns (currencies, dates) to R types
ensure categorical variables have the correct levels
construct derived variables
ensure categorical variables levels match between data sets
ensure column names match between data sets
ensure keys are unique across data sets
add an identifying column to each data set
join data sets row-wise (vertically)
A.1.1 Things to Check
As you will see, the portfolio data sets are messy, just as real-life data is messy.
As you move forward, you will need to keep one eye on the data itself, one eye on
the data dictionary, and one eye on the list of columns to construct. (We never
said this would be easy.) Consider performing the sorts of checks listed here, as
well as any others you can think of. Remember to specify exclude = NULL
or useNA = "ifany" in calls to table() to help identify missing values.
• What do the application numbers look like? Do they all have the expected 10 digits? Are any missing? Are there duplicate rows or duplicate application numbers? If so, can rows be safely deleted?
• What do missing value counts look like by column? Are some columns missing a large proportion of entries? Are some rows missing most entries, and, if so, should they be deleted?
• Do the values of categorical variables match the values in the data dictionary? If not, are the differences large or small? Can some levels be converted into others? If two categorical variables measure similar things, cross-tabulate
them to ensure that they are associated. At this stage, it might be worthwhile to consider the data from the modeling standpoint. For example, if a large number of categorical values are missing, it might be worthwhile to create a new level called Missing. If there are a number of values that apply to only a few records each, it might be wise to combine them into an Other category.
• Are there columns with special formatting, like, for example, currency values that look like $1,234.56? Convert them to numeric.
• Across the set of numeric columns, what sorts of ranges and averages do you see? Are they plausible? For columns that look like counts, what are the most common values? Are there some values that look as if they might be special indicators (999 for a person's age, 99 or 1 for number of mortgages)? Consider computing the correlation matrix of numeric predictors to see if columns that should carry similar information are in fact correlated.
• What format are the date columns in? If you plan to do arithmetic on dates, like, for example, computing the number of days between two dates, you will need to ensure that dates use one of the Date or POSIXt classes. If all you need to do is to sort dates, you can also use a text format like YYYYMMDD, since these text values will sort alphabetically just as the underlying date values would. Keeping dates as characters can save you a lot of grief when you might inadvertently turn a Date or POSIXt object into numeric form. The date classes are easier to summarize and to use in plots, however. Examine the dates. Are there missing values? What do the ranges look like? Consider plotting dates by month or quarter to see if there are patterns.
• Computing derived variables often involves a table lookup. For example, in the exercise, you can determine each state's region from a lookup into a table that contains state code and region code, as sketched below. The cleaning issue is then a matter of identifying codes not in the table and determining what to do with them.
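Here is a minimal sketch of such a lookup with match(), assuming state.tbl can be read with read.table() and has columns named State and Region (hypothetical names):
st <- read.table("state.tbl", header = TRUE, stringsAsFactors = FALSE)
big$Region <- st$Region[match(big$State, st$State)]   # State is hypothetical
big$State[is.na(big$Region)]   # codes that were not found in the table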
Once the two loan portfolios have been combined, we can start in on the
remaining data sets. In the sections that follow, we will refer to the combined
portfolio data set as big. When we completed this step, our big data frame
had 19,196 rows and 17 columns; yours may be slightly different, depending on
the choices you made in this section.
A.2 Scores Database
The agency scores (Section 8.5) are stored in an SQLite database. The first order of business is to connect to the database and look at a few of its records. Do the extracted records match the descriptions in the data dictionary in Chapter 8? Count the numbers of records. If these are manageable, we might just as well read the entire tables into R, but if there are many millions or tens of millions of records we might be better off reading them bit by bit.
In this case, the tables are not particularly large. Pseudocode for the next steps
might look as follows:
read Cscore table into an R data frame called cscore
read CusApp table into an R data frame called cusapp
add application number from cusapp to cscore
add active date from big to cscore
discard records for which ScoreMonth > active date
discard records with invalid scores
order cscore by customer number and then by date, descending
for each customer i, for each agency j
find the first non-missing score and date
We did not find a particularly efficient way to perform this last step.
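Here is a minimal sketch of the first of these steps, using the RSQLite package; the table names follow the pseudocode above.
library(RSQLite)
con <- dbConnect(SQLite(), "scores.sqlite")
dbListTables(con)                                  # expect Cscore and CusApp
cscore <- dbGetQuery(con, "select * from Cscore")
cusapp <- dbGetQuery(con, "select * from CusApp")
dbDisconnect(con)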
A.2.1 Things to Check
As with almost all data sets, the agency scores have some data quality issues. Items to consider include the following:
• What do the application numbers and customer numbers from cusapp look like? Do they have the expected numbers of digits? Are any duplicated or missing? What proportion of application numbers from cusapp appears in big, and what proportion of application numbers from big appears in cusapp? Are there customer numbers in cscore that do not appear in cusapp or vice versa?
• Summarize the values in the agency score columns. What do missing values look like in each column? What proportion of scores is missing? Are any values outside the permitted range 200–899?
• The ScoreMonth column in the cscore table is in YYYYMM format, without the day of the month. However, the active date field from big has days of the month (and, depending on your strategy, might be in Date or POSIXt format). How should we compare these two types of date? One possibility is sketched below.
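One possibility is to compare the two at month granularity. This sketch assumes the pseudocode above has already added the active date to cscore as a Date column named Active (a hypothetical name), with ScoreMonth stored as YYYYMM text:
score.month  <- as.character(cscore$ScoreMonth)
active.month <- format(cscore$Active, "%Y%m")
keep <- score.month <= active.month   # YYYYMM text sorts like the dates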
In the final step, we merge the data set of scores with the big data set
created by combining the two loan portfolios. Recall that duplicate keys are
particularly troublesome in R’s merge() function – so it might be worthwhile
to double-check for duplicates in the key, which in this case is the application
number, in the two data frames being merged. Does the merged version
of big have the same number of rows as the original? If the original has,
say, 17 columns and the new version 26, you might ensure that the first 17
columns of the new version match the original data, with a command like
all.equal(orig, new[,1:17]).
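A minimal sketch of that merge and those checks, assuming the application number column is named Appl and that scores is the one-row-per-application data frame of agency scores (both names hypothetical):
stopifnot(!any(duplicated(big$Appl)), !any(duplicated(scores$Appl)))
new <- merge(big, scores, by = "Appl", all.x = TRUE)
stopifnot(nrow(new) == nrow(big))   # no rows gained or lost in the merge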
A.3 Co-borrower Scores
The co-borrower scores (Section 8.6) make up one of the trickiest of the data sets, because of their custom transactional format (Section 7.4). We start by noting that there are 73,477 records in the data set, so it can easily be handled entirely inside R. If there had been, say, a billion records, we would have needed to devise a different approach. We start off with pseudocode as follows:
read data into R using scan() with sep = "\n"
discard records with codes other than ERS or J01
discard ERS records with invalid scores
At this point, we are left with the J01 records, which name the application number, and the ERS records, which give the scores. It will now be convenient to construct a data frame with one row for each score. This will serve as an intermediate result, a step toward the final output, which will be a data frame with one row for each application number. This intermediate data frame will, of course, contain an application number field. (This is just one approach, and other equally good ones are possible.) We take the following steps:
extract the application numbers from the J01 records
use the rle() function on the part of the string
that is always either J01 or ERS
Recall that each application is represented by a J01 record followed by a series of ERS records. If the data dictionary is correct, the result of the rle() function will be a run of length 1 of J01, followed by a run of ERS, followed by another run of length 1 of J01, followed by another run of ERS, and so on. So, values 2, 4, 6, and so on of the lengths component of the output of rle() will give the number of ERS records for each account. Call that subset of length values lens. Then, we operate as follows:
construct a data frame with a column called Appl,
produced by replicating the application numbers
from the J01 records lens times each
add the score identifier (RAY, KSC) etc.
add the numeric score value
add the score date
insert the active date from big
delete records for which the score date > active date
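A minimal sketch of the run-length step, assuming txt holds the retained records, that each record starts with its three-character code, and that the application number in a J01 record occupies (hypothetically) characters 5–14:
codes <- substring(txt, 1, 3)              # "J01" or "ERS"
r <- rle(codes)
lens <- r$lengths[r$values == "ERS"]       # ERS count per application
appl <- substring(txt[codes == "J01"], 5, 14)   # hypothetical positions
co.df <- data.frame(Appl = rep(appl, lens), stringsAsFactors = FALSE)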
At this stage, we have a data set with the information we want, except that
some application numbers might still have multiple records for the same
agency. A quick way to get rid of the older, unneeded rows is to construct a
key consisting of a combination of application number and agency (like, e.g.,
3004923564.KSC). Now sort the data set by key, and by date within key, in
descending order. We need to keep only the first record for every key – that
is, we can delete records for which duplicated(), acting on that newly created key field, returns TRUE.
Following the deletion of older scores, we have a data set (call it co.df) with all the co-borrower scores, arrayed one per score. (When we did this, we had 4928 scores corresponding to 1567 application numbers.) Now we can use a matrix subscript (Section 3.2.5) to assemble these into a matrix that has one row per application number. We start by creating a data frame with the five desired columns: places for the JNG, KSC, NTH, and RAY scores and the application number. It is convenient to put the application number last. Call this new data frame co.scores. Then, we continue as follows:
construct the vector of row indices by matching the
application number from co.df to the application number
from co.scores. This produces a vector of length 4,928
whose values range from 1 to 1,567
construct the vector of column indices by matching the score
id from co.df to the vector c("JNG", "KSC", "NTH",
"RAY"). This produces a vector of length 4,928
whose values range from 1 to 4
combine those two vectors into a two-column matrix
use the matrix to insert scores from co.df into co.scores
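A minimal sketch of the matrix subscript, assuming co.df stores the agency identifier and score in columns named Agency and Score (hypothetical names):
i <- match(co.df$Appl, co.scores$Appl)                    # row numbers
j <- match(co.df$Agency, c("JNG", "KSC", "NTH", "RAY"))   # column numbers
co.scores[cbind(i, j)] <- co.df$Score   # two-column matrix as subscript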
A.3.1 Things to Check
As with the other data sets, the most important issues with the co-borrower data are key matching, missing values, and illegal values. You might examine points like the following:
• What do the application numbers look like? Do they have 10 digits? What proportion of application numbers appears in big? In our experience the proportion of loans with co-borrowers has often been on the order of 20%. Are there application numbers that do not appear in big? If there are a lot of these we might wonder if the application numbers had been corrupted somehow.
• After reading the data in, check that the fields extracted are what you expect. Do the score identifiers all match JNG, and so on? Are the score values mostly in the 200–899 range? Do the dates look valid?
• We expect some scores to be 999 or out of range. What proportions of those values do we see? Are some months associated with large numbers or proportions of illegal score values? Are there any non-numeric values, which might indicate that we extracted the wrong portion of the record?
When the co-borrowers data set is complete, we can merge it to big. Since big already has columns named JNG, KSC, NE, and RAY, it might be useful to
rename the columns of our co.scores before the merge to Co.JNG and so on.
(At this point, you have probably noticed that the NorthEast score is identified
by NE in the scores database but by NTH in the co-borrower scores. It might be
worthwhile making these names consistent.)
A final task in this step is to add a co-borrower indicator, which is TRUE
when an application number from big is found in our co.scores and FALSE
otherwise. Actually, this approach has the drawback that it will report FALSE
for applications with co-borrowers for which every score was invalid or more
recent than the active date (if there are any). If this is an issue, you will need to
go back and modify the script by creating the indicator before any ERS records
are deleted. This is a good example of why documenting your scripts as you go is necessary.
A.4 Updated KScores
The updated KScores (Section 8.7) are given in a series of JSON messages. For each message, we need to extract fields named appid, KScore, and record-date, if they exist. We start with pseudocode as follows:
read the JSON into R with scan() and sep = "\n"
remove messages in which the string "KScore" does not appear
write a function to handle one JSON message: return appid,
KScore, record-date if found, and (say) "none" if not
apply that function to each message
Using sapply() in that last command produces a three-row matrix that should be transposed. Using lapply() produces a list of character vectors of length 3 that should be combined row-wise, perhaps via do.call(). In either case, it will be convenient to produce a data frame with three columns. We continue to work on that data frame with operations as follows:
remove rows with invalid scores
add active date from big
remove rows for which KScore date is > active date
keep the most recent row for each application id
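In R, these operations might look like the following sketch, assuming the three columns are named Appl, KScore (converted to numeric), and RecDate (a Date), with the active date merged in as Active (all hypothetical names):
upd <- upd[!is.na(upd$KScore) & upd$KScore >= 200 & upd$KScore <= 899, ]
upd <- upd[upd$RecDate <= upd$Active, ]
upd <- upd[order(upd$Appl, upd$RecDate, decreasing = TRUE), ]
upd <- upd[!duplicated(upd$Appl), ]   # first row per Appl is the most recent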
This final data set contains the KScores that are eligible to update the ones in big. Presumably, it only makes sense to update scores for which the one in big is either missing, or carries an earlier date than the one in our updated KScore data. When we did this, we found 1098 applications in the updated KScore data set, and they all corresponded to scores that were missing in big. So, all 1098 of these KScore entries in big should be updated. In a real example, you would have to be prepared to update non-missing KScores as well.
A.4.1 Things to Check
As with the other data sets, the important checks are for keys, missing, duplicated, and illegal values. In this case, we might consider some of the following points:
• Are a lot of records missing application numbers? Examine a few to make sure that it is the data, not the code, at fault. Are any application numbers duplicated? What proportion of application numbers appears in big?
• Are many records missing KScores? That might be odd given the specific purpose of this data set. What do the KScores that are present look like? Are many missing or illegal?
• What proportion of update records carries dates more recent than the active date? If this proportion is very large, it can suggest an error on the part of the data supplier.
A.5 Excluder Files
The Excluder files (Section 8.8) are in XML format. We need to read each XML file and determine whether it has (i) a field called APPID and (ii) a field called EXCLUDE, found under one called DATA. In pseudocode form, the task for the excluder data might look as follows:
acquire a vector of XML file names from the Excluders dir.
use sapply to loop over the vector of names of XML files:
read next XML file in, convert to R list with xmlParse()
if there is a field named "APPID", save it as appid
otherwise set appid to None
if there is a field named "EXCLUDE" save it as exclude
otherwise set exclude to None
return a vector of appid and exclude
The result of this call to sapply() will be a matrix with two rows. It will be convenient to transpose it, then convert it into a data frame and rename the columns, perhaps to Appl and ExcCode. Call this data frame excluder.
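A minimal sketch of this loop, assuming a function get1() that implements the per-file logic of the pseudocode above and returns c(appid, exclude):
files <- list.files("Excluders", pattern = "\\.xml$", full.names = TRUE)
out <- sapply(files, get1)                 # a 2 x n matrix
excluder <- data.frame(t(out), stringsAsFactors = FALSE)
names(excluder) <- c("Appl", "ExcCode")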
A.5.1 Things to Check
• Are there duplicate rows or application numbers? This is primarily just a check on data quality. Presumably, if an application number appears twice in the excluders data set, that record will be deleted, just as if it had appeared once.
• Do the columns have missing values?
• Are the values in the ExcCode column what we expect (“Yes,” “No,” “Contingent”)? We can now drop records for which ExcCode is anything other than Yes or Contingent.
• Do the application numbers in excluder match the ones in the big (combined) data set? Do they have 10 digits? If there are records with no application number, or whose application numbers do not appear elsewhere, they can be dropped from the excluder data set.
When we are satisfied with the excluder data set, we can remove the
matching rows from big. When we did this, our data set ended up with 19,034
rows and between 30 and 40 columns, depending on exactly which columns
we chose to save.
A.6 Payment Matrix
Recall from Section 8.9 that every row of the payment matrix contains a fixed-layout record consisting of a 10-digit application number, then a space, and then 48 repetitions of numeric fields of lengths 12, 12, 8, and 3 characters. Those values will go into a vector that will define the field widths. We might also set up a vector of column names, by combining names like Appl and Filler with 48 repetitions of four names like Pay, Delq, Date, and Mo. We pasted those 48 replications together with the numbers 48:1 to create unique and easily tractable names. We also know that the first two columns should be categorical; among the repeaters, we might specify that all should be numeric, or perhaps that the two amounts should be numeric and the date and number of months, character.
With that preparation, we are ready to read in the file, using read.fwf() and the widths, names, and column classes just created. We called that data frame pay.
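A minimal sketch of that set-up; one simple choice, shown here, is to read every column as character and convert later:
widths <- c(10, 1, rep(c(12, 12, 8, 3), 48))
cnames <- c("Appl", "Filler",
            paste0(rep(c("Pay", "Delq", "Date", "Mo"), 48),
                   rep(48:1, each = 4)))
pay <- read.fwf("Payments.txt", widths = widths, col.names = cnames,
                colClasses = "character")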
Our column naming convention makes it easy to extract the names of the date columns with a command like grep("Date", names(pay)). For example, recall that we will declare as “Indeterminate” any record with six or fewer non-zero dates. We can use apply() across the rows of our pay data frame, operating just on the subset of date columns, and run a function that determines whether the number of non-zero entries is six or fewer. (Normally, we are hesitant to use apply() on the rows of a data frame. In this case, we are assured that the relevant columns are all of identical types and lengths.) Now, we create a column of status values, and wherever the result of this apply() is TRUE, we insert Indet into the corresponding spots in that vector. Our code might look something like the following:
set up pay$Good as a character vector in the pay data frame
apply to the rows of pay[,grep ("Date", names (pay))]
a function that counts the number of entries that are
not 0 or "00000000"
set pay$Good to Indet when that count is <= 6
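In R, that pseudocode might look like the following sketch, assuming pay was read with the date fields as character:
datecols <- grep("Date", names(pay))
n.nonzero <- apply(pay[, datecols], 1,
                   function(r) sum(!(r %in% c("0", "00000000"))))
pay$Good <- ""                        # to be filled in, in priority order
pay$Good[n.nonzero <= 6] <- "Indet"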
Our column naming convention also makes it easy to check our work. For example, we might pick out some number haphazardly – say, 45 – and look at the entries in pay for the 45th Indet entry. (This feels wiser than using the very first one, since the first part of the file might be different from the middle part.) In this example, our code might look as follows:
pick some number, like, say, 45
extract the row of pay corresponding to the 45th Indet
-- call that rr
pay[rr, grep ("Date", names (pay))] shows that row
Now we perform similar actions for the other possible outcomes. At this stage, it makes sense to keep the different sorts of bad outcomes separate, combining them only at the end. One bad outcome is when an account is 3 months delinquent. We apply() a function that determines whether the maximum value in any of the month fields is 3. For those rows that produce TRUE, we insert a value like bad3 into pay$Good unless it is already occupied with an Indet value. Again, it makes sense to examine a couple of these records to ensure that our logic is correct.
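A minimal sketch for this outcome, assuming the months-since-paid columns were named Mo48 through Mo1 as described above:
mocols <- grep("^Mo", names(pay))
ever3 <- apply(pay[, mocols], 1,
               function(r) max(as.numeric(r)) >= 3)
pay$Good[ever3 & pay$Good == ""] <- "bad3"   # do not overwrite Indet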
Another bad outcome occurs when there are at least two instances of month
values equal to 2, and a third is when the delinquency value passes $50,000. In
each of these cases, we update the pay$Good vector, ensuring that we only
update entries that have not already been set. It is a good idea to check a few of
the records for each indicator as we did earlier.
Records that are assigned neither Indet nor one of the bad values are assigned Good. Once we have tabulated the bad values separately, we can combine them into a single Bad value. This way, we can compare the frequencies of the different outcomes to see if what we see matches what we expect.
A.6.1 Things to Check
There are lots of ways that these payment records can be inconsistent. The extent of your exploration here might depend on the time available. But we give, as examples, some of the questions you might ask of this data.
• What do the application numbers look like? Do they have 10 digits? Are any missing? Does every booked record have payment information, and does every record with payment information appear in big? What is the right action to take for booked records with no payment information – set their response to Indet?
• What are the proportions of the different outcomes? Do they look reasonable? For example, if 90% of records are marked Indet we might be concerned.
• What do the amounts look like? Are they ever negative, missing, or absurd?
• Are the dates valid? We expect every entry to be a plausible value in the form YYYYMMDD or else a zero or set of zeros. Are they? Are adjacent dates 1
month apart, as we expect? Are there zeros in the interior of a set of non-zero dates?
• Are adjacent entries consistent? For example, if month 8 shows a delinquency of 4 months, and month 10 shows a delinquency of 6 months, then we expect month 9 to show a delinquency of 5 months. Are these expected patterns followed?
Once the response variable has been constructed, we can merge big with pay. Actually, since we only want the Good column from pay, we merge big with the two-column data set consisting of Appl and Good extracted from pay. In this way, we add the Good column into big.
A.7 Starting the Modeling Process
Our data set is now nearly ready for modeling, although we could normally expect to convert character columns to factor first. Moreover, we often find errors in our cleaning processes or anomalous data when the modeling begins – requiring us to modify our scripts. In the current example, let us see if the Good column constructed above is correlated with the average of the four scores. This R code shows one way we might examine that hypothesis using the techniques in the book.
> big$AvgScore <- apply (big[,c("RAY", "JNG", "KSC", "NE")],
1, mean, na.rm = T)
> gb <- is.element (big$Good, c("Good", "Bad")) &
!is.na (big$AvgScore)
The gb vector is TRUE for those rows of big for which the Good column has one of the values Good or Bad, and AvgScore is present, meaning that at least one agency score was present.
We can now divide the records into groups, based on average agency score.
Here we show one way to create five groups; we then tabulate the numbers of
Good and Bad responses by group.
> qq <- quantile (big$AvgScore, seq (0, 1, len=6),
na.rm = TRUE)
> (tbl <- table (big$Good[gb], cut (big$AvgScore[gb], qq,
include.lowest = TRUE), useNA = "ifany"))
       [420,598] (598,649] (649,696] (696,748] (748,879]
  Bad        641       620       514       449       384
  Good       607       721       811       926      1067
> round (100 * prop.table (tbl, 2), 1)
       [420,598] (598,649] (649,696] (696,748] (748,879]
  Bad       51.4      46.2      38.8      32.7      26.5
  Good      48.6      53.8      61.2      67.3      73.5
We can see a relationship between agency scores and outcome from the tables. The first shows the counts, and the second uses prop.table() to compute percentages within each column. In the group with the lowest average score (left column), there are about as many Bad entries as Good ones. As we move to the right, the proportion of Good entries increases; in the group with the highest average scores, almost 75% of entries are Good ones. This is consistent with what we expect, since higher scores are supposed to be an indication of higher probability of good response.
Index
! operator 23, 33
.C() 167
.GlobalEnv 13
.RData 13, 209
/ 7, 112, 127
: operator 22
<- 6
? 16
[,] 55, 71
drop argument 56
[] 31, 54, 64, 70
[[ ]] 64, 70
#, R comment character 7, 173
#!, see shell scripts
#!Rscript, see shell scripts
$ operator 64, 71
%>% 78
%% 156
%in% 47
& and && 26, 218
\ 7, 100, 119, 127
\n 100, 184
\r 184
\t 100
\u and \U for Unicode, see UTF-8
\x, see hexadecimal
| and || 26
... 145
0x, see hexadecimal
a
addmargins() 43
addNA() 137
adist() 93
agrep() 93
all() 40, 217
all.equal() 91, 94, 217
anyDuplicated() 47
anyNA() 37, 69
aperm() 62
apply() 57
problems on data frames 74–75, 89
apropos() 16
args() 16, 146
arguments, see function
as.character() 135
as.data.frame() 77
as.Date() 80, 202
format argument 81
as.factor() 132, 226
as.logical() 30
as.matrix() 77, 172
as.numeric() 30, 82, 83, 135, 176
units argument 85
as.POSIXct() 85, 202
as.POSIXlt() 84
as.raw() 188, 191
ASCII, see character data type
assign() 138, 150
assigning 35–36, 54, 65, 69, 71
attach() 13, 210
attr() 38, 86
attributes 38
b
backslash, see \
bash command interpreter 95
Beach.xlsx, see data sets
BedBath1.csv, see data sets
BedBath2.csv, see data sets
binary data, see data
browser() 157
buffered output 156
by() 77
c
c() 22, 85, 88
call by reference 149
call by value 149
carriage return, see \r
casefold() 103
cat() 149, 153, 156, 157
cbind() 54, 90, 132
stringsAsFactors argument 90
ceiling() 45
character data type 12, 22, 107
ASCII 101, 128–130
changing case 103
currency 117, 127, 128
encoding 130
latin1 128
leading and trailing spaces 126
character(0) 101
charToRaw() 188
class() 29, 132, 161, 217
clipboard 200–201, 203
close() 189, 194
clusterExport() 166
cmpfun() 165
COBOL 261
CoBorrScores.txt, see data sets
colnames() 56, 68, 179
colon operator, see : operator
colRows() 57
colSums() 57
column names 214
commands as text 139
comment character 7
comparison operators 8, 23, 26
compiling, see functions
complex data type 28
connection 185, 186, 193, 208
converting between data types, see data type
count.fields() 178
CRAN 2, 14
cross-reference table 225, 237, 254
CSV 172, 177
cumsum() 220
currency, see character data type
cut() 107
d
data
acquiring 3, 4
binary 185, 188–190
HTML 203, 207
JSON 206, 259
large files 184–187, 192
many files 197
other packages 208
provenance 171
relational databases 192–256
serverless, 196
streaming 208
tabular 4, 171
time zones 202
transactional 219–225, 257
via REST API 206
web 203–208
XML 190, 204–206, 260
data frame 8, 67–80
all-numeric 77
assigning 69
column names 68, 73, 78, 90, 109, 179
combining 216
by column, 90, 218
by key, 92, 218, 242–245
by row, 90, 137, 216, 238
comparing 94
missing values 69
operating on columns 74
operating on rows 75
row names 68, 72, 176, 179
subsetting 69–73
UTF-8 in 131
data handling 213
acquiring 213, 231–232, 236–239
cleaning 214–216, 232–238,
240–249
combining, see data frame
documentation 5, 226–229
judgment 228
preparing 225, 249, 275
data sets
Beach.xlsx 252
BedBath1.csv 231
BedBath2.csv 236
CoBorrScores.txt 257, 269
EnergyUsage.csv 239
Excluders directory 260, 272
KScore.update.json 259, 271
Payments.txt 261, 273
scores.sqlite 256, 267
state.tbl 225, 254
wilson.xls 252
data type
character, see character data type
complex 28
conversion 24, 27–31, 136, 226
determining 29
factor, see factor
integer 28
logical 23
numeric or double 21, 45
raw 28, 188
data.frame() 67, 90, 132, 218
row.names argument 68
stringsAsFactors argument 68, 76, 90
data.table() 42
date() 86
Date class 80, 174
dates 80–89
Date class 80, 174
and times 83–88
am/pm indicator, 86
differences 83, 86, 88
formatting 80–83, 111
in Excel 202
missing values 88
POSIX classes 83, 174
tabulating 111
time zones, see time zones
dbConnect() 197
dbFetch() 197
dbGetQuery() 197
dbListFields() 197
dbListTables() 197
dbSendQuery() 197
debug() 158
debugonce() 158
delimited files 172–183
deparse() 166
detectCores() 166
diff() 86
difftime() 83
dim() 54, 69
dimnames() 56, 66, 68
do.call() 91, 181
double, see numeric data type
download.file() 198, 204
DSN 193, 196
dump() 152
duplicated() 47, 79
duplicates 47, 79, 214, 235
dyn.load() 167
e
edit() 94, 150, 151
editor 151
ellipsis, see ...
embedded delimiters 175, 231
empty string 101, 175
enableJIT() 165, 167
Encoding() 130
encoding, see character data type
end of line, see \n
EnergyUsage.csv, see data sets
environment variables 15, 162
Euro currency symbol 129, 131
eval() 140, 166
Excel 81, 101, 175, 200, 201, 252
dates 202
Excluders directory, see data sets
expand.grid() 111
extracting, see subsetting
f
factor
combining 136
levels 132, 133
missing values 133, 137
factor() 132–134, 226
file() 185, 190
file names 7, 112, 127, 162
file operations 185, 186
file.copy() 198
file.info() 162
file.remove() 198
file.rename() 198
fix() 151
fixed-width files 183
floating-point error 11, 23, 30, 79
floor() 45
flush() 186
for() 138, 164
format() 82, 103, 107
formatting numbers 103–107
fromJSON() 206, 207
functions 9, 143–167
arguments 144–148
compiling 165
debugging 156–158
editing 151
errors 158, 159
parallel processing 166
profiling 95, 163
return values 149
side effects 150
speeding things up 163–167
warnings 158
g
GET() 207
get() 138
getForm() 207
getURI() 203
getwd() 112, 162
Giants 173
glob2rx() 126
global variables 138, 148
GMT, Greenwich Mean Time, see time
zones
greedy matching, see regular
expressions
gregexpr() 123, 180
grep() 113–121
grepl() 113
gsub() 124
gzfile() 185
h
head() 69, 110
help() 16
help.search() 6, 16
hexadecimal 101, 128, 187, 188
history() 7
HTML, see data
HTML tags 127
htmlTreeParse() 205
i
iconv() 130
identical() 94
if() 155, 191
Inf 25, 40
install.packages() 14
integer data type 28
intersect() 47
invisible() 150
is() 29
is.character() 29
is.element() 47, 50
is.finite() 40
is.integer() 29
is.logical() 29
is.na() 37, 59, 69, 176
is.null() 40
is.numeric() 29
isTRUE() 91, 94
j
join
in R, see merge()
in SQL 195
JSON, see data
k
key field 90, 92, 112, 125, 192, 214, 218, 226
KScore.update.json, see data sets
l
lapply() 74, 75, 89, 91, 137, 226
latin1, see character data type
lazy matching, see regular expressions
leading spaces, see character data type
leading zeros 105, 106, 175, 181
leap seconds 84
length() 25, 54, 63, 100
lengths() 63
LETTERS 6, 109, 138
letters 6, 37
levels() 134, 137
library() 14
list 8, 62–67
POSIXlt class 84
assigning 64
names of 64
operating on 74–77
subsetting 64
list() 63, 91, 147
list.dirs() 199
list.files() 13, 112, 162, 198, 199
load() 13, 153, 209
local variables 138, 148
locales 15, 81
logical data type 23
logical subscript, see subsetting
long vectors 50, 95
ls() 6, 157
m
make.names() 10, 90, 172
makeCluster() 166
margin.table() 43
match() 48, 50, 223, 225, 226
matching 48
matrix 8, 53–62
assigning 54
demoting 55
missing values 59
row and column names 56, 66
subsetting 55–56, 60
three- and higher-way 62
matrix() 53
matrix subscript, see subsetting
max() 25, 86
memory, managing 95
merge, see data frame
merge() 92, 195, 218, 223
methods() 170
missing() 148
missing values 36–39, 214, 229
identifying 37
in nchar() 100
in computation 37
in data frames 69
in dates 88
in factors 133, 136
in matrices 59
omitting 38
reading in 175
subsetting with 39
mode() 28, 132
month.abb 110
month.name 103
months() 82, 84
n
NA, see missing values
na.omit() 38, 69
names() 27, 64, 68
NaN 40
nchar() 100, 130
ncol() 54
new-line, see \n
nrow() 54, 80
NULL 27, 40, 64, 71, 176
null character 128, 187, 188, 191
numeric data type 21
converting to text 103
discretizing 107
scientific notation 106
nzchar() 102
o
object names 6, 10, 138
object-oriented programming 147, 170
object.size() 138
objects() 6
ODBC 193
odbcConnect() 193
odbcConnectAccess() 196
on.exit() 151, 191, 192
operating system, interacting with 161–163, 198
options() 14, 67, 68, 106, 151, 159
order() 46, 79
outer() 110
p
packages 14
RCurl 203, 207
RJSONIO 206
RODBC 193, 196
RSQLite 196
RStudio 15
Rcpp 167
RgoogleMaps 207
XML 203, 205
bigmemory 95
cleaningBook 19, 154, 225, 230, 247, 265
compiler 165
data.table 42, 50, 95
dplyr 77
ff 95
foreign 208
gdata 201
httr 207
jsonlite 206
magrittr 78
parallel 166
plyr 77
rjson 206
tm 95
translateR 207
xlsx 201
Matrix 61
parallel processing, see functions
parSapply() 166
parse() 139, 166
paste() 109–112
paste0(), see paste()
Payments.txt, see data sets
pi 8
pipe() 200
plot() 147
POSIX date classes 83, 94
operations on 86
POSIX regular expressions, see regular expressions
postForm() 207
print() 131, 153
profiling, see functions
prop.table() 42
provenance, see data
pseudocode 265
q
q() 3
quantile() 108
quarters() 84, 111
quotation marks 100
r
R 1
acquiring 2
assignment 6
basics 5–16
break command 5
console 3
graphical user interface 15
help 6, 16, 17
installing 2
packages, see packages
prompt 5
starting and quitting 3
workspace 6, 9, 209
range() 58, 86
raw data type 28, 188
rawToChar() 28, 188, 191
rbind() 54, 90, 136, 181, 200, 217, 218
stringsAsFactors argument 90
read.fwf() 183
read.table() 172–180, 199, 202, 203
colClasses argument 173, 174
comment.char argument 172, 173, 177
encoding argument 187
fileEncoding argument 187
fill argument 178, 202
header argument 172, 177
na.strings argument 173, 175, 176
nrows argument 174, 177, 178
quote argument 172, 173, 175, 177
row.names argument 176
sep argument 172
skip.blank.lines argument 176
skipNul argument 187
skip argument 178
stringsAsFactors argument 173, 177
text argument 181
from clipboard 200
read.xls() 201
read.xlsx() 201
readBin() 188–190
readHTMLTable() 203
readLines() 185
readRDS() 152, 209
recycling 25–26
logical subscripts 39
regexpr() 121–123
regmatches() 122
regular expressions 112–128
backreferences 124
character classes 120
escape sequences 117
greedy matching 123
lazy matching 123
leading and trailing spaces 126
multiple matching 122
repetition 117
replacement 123
special characters 113–116
table, 114
splitting strings 124
vs. wildcard matching 126
word boundaries 121
relational databases, see data
remove(), see rm()
rep() 22, 110
replacing, see assigning
reproducible research, see data handling
require() 14
REST, see data
return() 150
return values, see function
rle() 49, 220
rm() 6, 138, 179, 180
round() 45
row.names() 56
rownames() 56, 179
RProf() 163
Rscript, see shell scripts
runs 49
s
sample() 80
sapply() 74, 75, 89, 137
save() 153, 209
save.image() 209
saveRDS() 152, 209
scan() 180–182, 186, 187
scientific notation, see numeric data type
scores.sqlite, see data sets
scripts 9, 14, 153–155
search() 13
seek() 186
seq() 23, 87, 182
sequences 22
set.seed() 108
setdiff() 47
setwd() 162
shell scripts 154–155
signif() 45
socketConnection() 208
sort() 45
sorting 16, 45, 79
source() 152, 153
special characters for regular expressions 114
split() 63, 75
spreadsheets, see Excel
sprintf() 104–106, 175
SQL, see data
sqlClose() 194
sqlColumns() 194
sqlFetch() 196
sqlFetchMore() 196
sqlGetResults() 196
SQLite() 197
sqlQuery() 194, 195
sqlSave() 196
sqlTables() 193
stack 244
state.tbl, see data sets
stop() 158
stopCluster() 167
stopifnot() 158
str() 29, 64, 69
string, see character data type
strsplit() 124–126, 181
sub() 123
subsetting
by name 34, 56, 64, 70
data frame 69–73
logical subscript 32, 55, 64, 70
matrices 55–56, 60
matrix subscript 60, 222
missing values in subscript 39
negative subscripts 32
numeric subscript 31, 54, 64, 70
vectors 31–34
zeros in subscript 32
substring() 102
substrings 102, see character data type
summary() 25, 69, 76, 87, 239
summaryRprof() 163
suppressWarnings() 159
Sys.getenv() 163
Sys.setenv() 15, 163
Sys.timezone() 86
system() 198
system.time() 163, 165
t
t() 54
tab, see \t
table() 40, 56, 87, 111
exclude argument 41
useNA argument 41
table(table()) 48
tables 40–45, 56
character strings 101
date objects 87
marginal totals 43
missing values in 41–42
proportions 42
two- and higher-way 42–45, 56, 62
tabular data, see data
tail() 69
tapply() 44
INDEX argument 44
text, see character data type
time zones 84, 86
times, see dates
tolower() 103
toupper() 103
traceback() 157, 160
trailing spaces, see character data type
triocular 266
trunc() 45
try() 159
TSV 173
typeof() 28, 132
u
undebug() 158
Unicode, see UTF-8
union() 47
unique() 47, 79
unlist() 65, 84, 205
unz() 198
unzip() 198
UTC, see time zones
UTF-8 129–187, 206
in data frame 131
v
vector 8, 21
as sets 46–47
assigning 35
conversion 29–31
length zero 34
logical 23, 30
comparisons, 23
mean of, 24
sum of, 24
long 50
names of 27
sorting 45
subsetting 31–34
View() 94
w
warning() 159
warnings() 159
weekdays() 82, 84–86
which() 33
on a matrix 59
which.max() 34
which.min() 34
while() 191
wildcard matching, see regular expressions
wilson.xls, see data sets
with() 78
within() 78
working directory 10, 13
workspace, see R
write.csv(), see write.table()
write.table() 182
append argument 183
col.names argument 183
quote argument 183
row.names argument 183
to clipboard 200
writeBin() 189
writeChar() 192
writeLines() 186
useBytes argument 187
x
XML, see data
xmlParse() 204
xmlTreeParse() 204
xmlValue() 204, 205
Xpath 205
xpathSapply() 205
xtabs() 42
z
zipped files 185, 198