node that includes the introductory text.
R> mac_url    <- "http://en.wikipedia.org/wiki/Machiavelli"
R> mac_source <- readLines(mac_url, encoding = "UTF-8")
R> mac_parsed <- htmlParse(mac_source, encoding = "UTF-8")
R> mac_node   <- mac_parsed["//p"][[1]]
All of these representations of an HTML document (URL, source code, parsed document,
and a single node) can be used as input for getHTMLLinks() and the other convenience
functions introduced in this section.
R> getHTMLLinks(mac_url)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"
R> getHTMLLinks(mac_source)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"
R> getHTMLLinks(mac_parsed)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"
R> getHTMLLinks(mac_node)[1:3]
[1] "/wiki/Help:IPA_for_Italian" "/wiki/Renaissance_humanism"
[3] "/wiki/Renaissance"
We can also supply XPath expressions to restrict the returned links to specific
subsets, for example, only those links of class extiw.
R> getHTMLLinks(mac_source, xpQuery = "//a[@class='extiw']/@href")[1:3]
[1] "//en.wiktionary.org/wiki/chancery"
[2] "//en.wikisource.org/wiki/Catholic_Encyclopedia_(1913)/Niccol%
C3%B2_Machiavelli"
[3] "//commons.wikimedia.org/wiki/Niccol%C3%B2_Machiavelli"
getHTMLLinks() retrieves links from HTML as well as names of external files. We
already made use of the latter feature in Section 9.1.1. An extension of getHTMLLinks() is
getHTMLExternalFiles(), designed to extract only links that point to external files which
are part of the document. Let us use the function along with its xpQuery parameter. We
restrict the set of returned links to those mentioning Machiavelli to hopefully find a URL that
links to a picture.
R> xpath <- "//img[contains(@src, 'Machiavelli')]/@src"
R> getHTMLExternalFiles(mac_source, xpQuery = xpath)[1:3]
[1] "//upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Portrait
_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg/220px-Portrait_
of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"
[2] "//upload.wikimedia.org/wikipedia/commons/thumb/a/a4/
Machiavelli_Signature.svg/128px-Machiavelli_Signature.svg.png"
[3] "//upload.wikimedia.org/wikipedia/commons/thumb/f/f3/
Cesare_borgia-Machiavelli-Corella.jpg/220px-Cesare_borgiaMachiavelli-Corella.jpg"
The first three results look promising; they all point to image files stored on the
Wikimedia servers.
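To actually fetch one of the image files, note that the links are protocol relative (they start with //), so a scheme has to be prepended. A minimal sketch using getBinaryURL() from RCurl; the local file name portrait.jpg is chosen arbitrarily:
R> img_urls <- getHTMLExternalFiles(mac_source, xpQuery = xpath)
R> img <- getBinaryURL(str_c("http:", img_urls[1]))
R> writeBin(img, "portrait.jpg")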
The next convenience function is readHTMLList(), which, as the name suggests,
extracts list elements (see Section 2.3.7). Browsing through the article, we find that under
Discourses on Livy several citations from the work are pooled in an unordered list that we
can easily extract. Note that the function returns a list object where each element corresponds
to a list in the HTML. The citations are the tenth list within the HTML (we figured this out
by eyeballing the output of readHTMLList()), so we use the index operator [[10]].
R> readHTMLList(mac_source)[[10]][1:3]
[1] "\"In fact, when there is combined under the same constitution
a prince, a nobility, and the power of the people, then these three
powers will watch and keep each other reciprocally in check.\" Book
I, Chapter II"
[2] "\"Doubtless these means [of attaining power] are cruel and
destructive of all civilized life, and neither Christian, nor even
human, and should be avoided by every one. In fact, the life of a
private citizen would be preferable to that of a king at the expense
of the ruin of so many human beings.\" Bk I, Ch XXVI"
[3] "\"Now, in a well-ordered republic, it should never be necessary
to resort to extra-constitutional measures. ...\" Bk I, Ch XXXIV"
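Instead of eyeballing the entire output, we could also locate the citations list programmatically, for example by searching all extracted lists for a distinctive substring; a sketch:
R> mac_lists <- readHTMLList(mac_source)
R> which(sapply(mac_lists, function(x) any(grepl("Book I,", x))))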
The last function of the XML package we would like to introduce at this point is
readHTMLTable(), a function to extract HTML tables. Not only does the function locate
tables within the HTML document, but it also transforms them into data frames. As before, the
function extracts all tables and stores them in a list. Whenever the extracted HTML tables
have information that can be used as a name, they are stored as named list items. Let us first get
an overview of the tables by listing the table names.
R> names(readHTMLTable(mac_source))
[1] "Niccolò Machiavelli" "NULL"
[4] "NULL"
"NULL"
[7] "NULL"
"NULL"
[10] "persondata"
"NULL"
"NULL"
"NULL"
There are ten tables; two of them are labeled. Let us extract the last one to retrieve personal
information on Machiavelli.
R> readHTMLTable(mac_source)$persondata
                 V1                                         V2
1              Name                       Machiavelli, Niccolò
2 Alternative names                       Machiavelli, Niccolò
3 Short description Italian politician and political theorist
4     Date of birth                                May 3, 1469
5    Place of birth                                   Florence
6     Date of death                              June 21, 1527
7    Place of death                                   Florence
A powerful feature of readHTMLList() and readHTMLTable() is that we can define
individual element functions using the elFun argument. By default, the function applied to
each list item (<li>) and each cell of the table (<td>), respectively, is xmlValue(), but we
can specify other functions that take XML nodes as arguments. Let us use another HTML table
to demonstrate this feature. The first table of the article gives an overview of Machiavelli’s
personal information and, in the seventh and eighth rows, lists persons and schools of thought
that have influenced him in his thinking as well as those that were influenced by him.
R> readHTMLTable(mac_source, stringsAsFactors = F)[[1]][7:8, 1]
[1] "Influenced by\nXenophon, Plutarch, Tacitus, Polybius, Cicero,
Sallust, Livy, Thucydides"
[2] "Influenced\nPolitical Realism, Bacon, Hobbes, Harrington,
Rousseau, Vico, Edward Gibbon, David Hume, John Adams, Cuoco,
Nietzsche, Pareto, Gramsci, Althusser, T. Schelling, Negri, Waltz,
Baruch de Spinoza, Denis Diderot, Carl Schmitt"
In the HTML file, the names of philosophers and schools of thought are also linked to
the corresponding Wikipedia articles, but this information gets lost by relying on the default
element function. Let us replace the default function by one that is designed to extract links—
getHTMLLinks(). This allows us to extract all links for influential and influenced thinkers.
R> influential <- readHTMLTable(mac_source,
                                elFun = getHTMLLinks,
                                stringsAsFactors = FALSE)[[1]][7,]
R> as.character(influential)[1:3]
[1] "/wiki/Xenophon" "/wiki/Plutarch" "/wiki/Tacitus"
SCRAPING THE WEB
235
R> influenced <- readHTMLTable(mac_source,
                               elFun = getHTMLLinks,
                               stringsAsFactors = FALSE)[[1]][8,]
R> as.character(influenced)[1:3]
[1] "/wiki/Political_Realism" "/wiki/Francis_Bacon"
[3] "/wiki/Thomas_Hobbes"
Extracting links, tables, and lists from HTML documents is an ordinary task in web
scraping practice. These functions save a lot of time that we would otherwise have to spend
on constructing suitable XPath expressions, and they help keep our code tidy.
9.1.5 Dealing with HTML forms
Forms are a classical feature of user–server interaction via HTTP on static websites. They
vary in size, layout, input type, and other parameters—just think about all the search bars you
have used, the radio buttons you have clicked, the check marks you have set, the user names
and passwords you have typed in, and so on. Forms are easy to handle with a graphical user interface
like a browser, but a little more difficult when they have to be disentangled in the source code.
In this section, we will cover the general approach to master forms with R. In the end you
should be able to recognize forms, determine the method used to pass the inputs, the location
where the information is sent, and how to specify options and parameters for sending data to
the servers and capture the result.
We will consider three different examples throughout this section to learn how to prepare
your R session, approach forms in general, use the HTTP GET method to send forms to
the server, use POST with a URL-encoded or multipart body, and let R automatically generate
functions that use GET or POST with adequate options to send form data.
Filling out forms in the browser and handling them from within R differ in many respects,
because much of the work that is usually done by the browser in the background has to be
specified explicitly. Using a browser, we
1. fill out the form,
2. push the submit, ok, start, or the like button,
3. let the browser execute the action specified in the source code of the form and send
the data to the server,
4. and let the browser receive the returned resources after the server has evaluated the
inputs.
In scraping practice, things get a little more complicated. We have to
1. recognize the forms that are involved,
2. determine the method used to transfer the data,
3. determine the address to send the data to,
4. determine the inputs to be sent along,
5. build a valid request and send it out, and
6. process the returned resources.
In this section, we use functions from the RCurl, XML, stringr, and plyr packages.
Furthermore, we specify an object that captures debug information along the way so that
we can check for details if something goes awry (see Section 5.4.3 for details). Additionally, we specify a curl handle with a set of default options—cookiejar to enable cookie
management, followlocation to follow page redirections which may be triggered by the
POST command, and autoreferer to automatically set the Referer request header when
we have to follow a location redirect. Finally, we specify the From and User-Agent header
manually to stay identifiable:
R> info   <- debugGatherer()
R> handle <- getCurlHandle(cookiejar      = "",
                           followlocation = TRUE,
                           autoreferer    = TRUE,
                           debugfunc      = info$update,
                           verbose        = TRUE,
                           httpheader     = list(
                             from         = "eddie@r-datacollection.com",
                             'user-agent' = str_c(R.version$version.string,
                                                  ", ", R.version$platform)))
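The handle can now be supplied to RCurl's request functions via their curl argument, so that every request reuses the same options, cookies, and headers. A minimal sketch (the URL is just a placeholder):
R> html <- getURL("http://www.example.com", curl = handle)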
Another preparatory step is to define a function that translates lists of XML attributes
into data frames. This will come in handy when we are going to evaluate the attributes of
HTML form elements of parsed HTML documents. The function we construct is called
xmlAttrsToDF() and takes two arguments. The first argument supplies a parsed HTML
document and the second an XPath expression specifying the nodes from which we want
to collect the attributes. The function extracts the nodes’ attributes via xpathApply() and
xmlAttrs() and transforms the resulting list into a data frame while ensuring that attribute
names do not get lost and that each attribute value is stored in a separate column:
R> xmlAttrsToDF <- function(parsedHTML, xpath) {
     x <- xpathApply(parsedHTML, xpath, xmlAttrs)
     x <- lapply(x, function(x) as.data.frame(t(x)))
     do.call(rbind.fill, x)
   }
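As a quick sanity check, we can apply the function to the <form> nodes of a parsed page, for instance the WordNet page introduced below; this is merely a sketch, and the actual output depends on the page:
R> wn_parsed <- htmlParse("http://wordnetweb.princeton.edu/perl/webwn")
R> xmlAttrsToDF(wn_parsed, "//form")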
9.1.5.1 GETting to grips with forms
To demonstrate how to approach forms in general and specifically how to handle forms that
demand HTTP GET, we use WordNet. WordNet is a service provided by Princeton University
at http://wordnetweb.princeton.edu/perl/webwn. Researchers at Princeton have built up a
database of synonyms for English nouns, verbs, and adjectives. They offer their data as an
online service. The website relies on an HTML form to gather the parameters and send a
request for synonyms—see Princeton University (2010a) for further details and Princeton
University (2010b) for the license.
Let us browse to the page and type in a word, for example, data. Hitting the Search
WordNet button results in a change to the URL which now contains 13 parameters.
http://wordnetweb.princeton.edu/perl/webwn?s=data&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
We have been redirected to another page, which informs us that data is a noun and that it
has two semantic meanings.
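For reference, such a parameterized request can also be assembled directly from R with RCurl's getForm(), which encodes the name-value pairs for us. A sketch passing only the search term s; whether the server requires the remaining parameters is left to verify:
R> wn_url <- "http://wordnetweb.princeton.edu/perl/webwn"
R> wn_res <- getForm(wn_url, s = "data", curl = handle)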
From the fact that the URL is extended with a query string when submitting our search
term we can infer that the form uses the HTTP GET method to send the data to the server.
But let us verify this conclusion. To briefly recap the relevant facts from Chapter 2: HTML
forms are specified with the help of <form> elements.