node that includes the introductory text.
R> mac_url    <- "http://en.wikipedia.org/wiki/Machiavelli"
R> mac_source <- readLines(mac_url, encoding = "UTF-8")
R> mac_parsed <- htmlParse(mac_source, encoding = "UTF-8")
R> mac_node   <- mac_parsed["//p"][[1]]
All of these representations of an HTML document (URL, source code, parsed document,
and a single node) can be used as input for getHTMLLinks() and the other convenience
functions introduced in this section.
R> getHTMLLinks(mac_url)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"
R> getHTMLLinks(mac_source)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"
R> getHTMLLinks(mac_parsed)[1:3]
[1] "/w/index.php?title=Machiavelli&redirect=no"
[2] "/wiki/Machiavelli_(disambiguation)"
[3] "/wiki/File:Portrait_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"
R> getHTMLLinks(mac_node)[1:3]
[1] "/wiki/Help:IPA_for_Italian" "/wiki/Renaissance_humanism"
[3] "/wiki/Renaissance"
We can also supply XPath expressions to restrict the returned links to specific
subsets, for example, only those links of class extiw.
R> getHTMLLinks(mac_source, xpQuery = "//a[@class='extiw']/@href")[1:3]
[1] "//en.wiktionary.org/wiki/chancery"
[2] "//en.wikisource.org/wiki/Catholic_Encyclopedia_(1913)/Niccol%
C3%B2_Machiavelli"
[3] "//commons.wikimedia.org/wiki/Niccol%C3%B2_Machiavelli"
getHTMLLinks() retrieves links from HTML as well as names of external files. We
already made use of the latter feature in Section 9.1.1. An extension of getHTMLLinks() is
getHTMLExternalFiles(), designed to extract only links that point to external files which
are part of the document. Let us use the function along with its xpQuery parameter. We
restrict the set of returned links to those mentioning Machiavelli to hopefully find a URL that
links to a picture.
R> xpath <- "//img[contains(@src, 'Machiavelli')]/@src"
R> getHTMLExternalFiles(mac_source, xpQuery = xpath)[1:3]
[1] "//upload.wikimedia.org/wikipedia/commons/thumb/e/e2/Portrait
_of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg/220px-Portrait_
of_Niccol%C3%B2_Machiavelli_by_Santi_di_Tito.jpg"
[2] "//upload.wikimedia.org/wikipedia/commons/thumb/a/a4/
Machiavelli_Signature.svg/128px-Machiavelli_Signature.svg.png"
[3] "//upload.wikimedia.org/wikipedia/commons/thumb/f/f3/
Cesare_borgia-Machiavelli-Corella.jpg/220px-Cesare_borgiaMachiavelli-Corella.jpg"
The first three results look promising; they all point to image files stored on the
Wikimedia servers.
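To actually fetch one of the image files, note that the links are protocol relative (they start with //), so a scheme has to be prepended. A minimal sketch using getBinaryURL() from RCurl; the local file name portrait.jpg is chosen arbitrarily:
R> img_urls <- getHTMLExternalFiles(mac_source, xpQuery = xpath)
R> img <- getBinaryURL(str_c("http:", img_urls[1]))
R> writeBin(img, "portrait.jpg")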
The next convenience function is readHTMLList(), which, as the name suggests,
extracts list elements (see Section 2.3.7). Browsing through the article, we find that under
Discourses on Livy several citations from the work are pooled in an unordered list that we
can easily extract. Note that the function returns a list object where each element corresponds
to a list in the HTML. The citations are the tenth list within the HTML (we figured this out
by eyeballing the output of readHTMLList()), so we use the index operator [[10]].
R> readHTMLList(mac_source)[[10]][1:3]
[1] "\"In fact, when there is combined under the same constitution
a prince, a nobility, and the power of the people, then these three
powers will watch and keep each other reciprocally in check.\" Book
I, Chapter II"
[2] "\"Doubtless these means [of attaining power] are cruel and
destructive of all civilized life, and neither Christian, nor even
human, and should be avoided by every one. In fact, the life of a
private citizen would be preferable to that of a king at the expense
of the ruin of so many human beings.\" Bk I, Ch XXVI"
[3] "\"Now, in a well-ordered republic, it should never be necessary
to resort to extra-constitutional measures. ...\" Bk I, Ch XXXIV"
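Instead of eyeballing the entire output, we could also locate the citations list programmatically, for example by searching all extracted lists for a distinctive substring; a sketch:
R> mac_lists <- readHTMLList(mac_source)
R> which(sapply(mac_lists, function(x) any(grepl("Book I,", x))))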
The last function of the XML package we would like to introduce at this point is
readHTMLTable(), a function to extract HTML tables. Not only does the function locate
tables within the HTML document, but it also transforms them into data frames. As before, the
function extracts all tables and stores them in a list. Whenever the extracted HTML tables
have information that can be used as a name, they are stored as named list items. Let us first get
an overview of the tables by listing the table names.
R> names(readHTMLTable(mac_source))
[1] "Niccolò Machiavelli" "NULL"
[4] "NULL"
"NULL"
[7] "NULL"
"NULL"
[10] "persondata"
"NULL"
"NULL"
"NULL"
There are ten tables; two of them are labeled. Let us extract the last one to retrieve personal
information on Machiavelli.
R> readHTMLTable(mac_source)$persondata
                 V1                                         V2
1              Name                       Machiavelli, Niccolò
2 Alternative names                       Machiavelli, Niccolò
3 Short description Italian politician and political theorist
4     Date of birth                                May 3, 1469
5    Place of birth                                   Florence
6     Date of death                              June 21, 1527
7    Place of death                                   Florence
A powerful feature of readHTMLList() and readHTMLTable() is that we can define
individual element functions using the elFun argument. By default, the function applied to
each list item (<li>) and each cell of the table (<td>), respectively, is xmlValue(), but we
can specify other functions that take XML nodes as arguments. Let us use another HTML table
to demonstrate this feature. The first table of the article gives an overview of Machiavelli’s
personal information and, in the seventh and eighth rows, lists persons and schools of thought
that have influenced him in his thinking as well as those that were influenced by him.
R> readHTMLTable(mac_source, stringsAsFactors = F)[[1]][7:8, 1]
[1] "Influenced by\nXenophon, Plutarch, Tacitus, Polybius, Cicero,
Sallust, Livy, Thucydides"
[2] "Influenced\nPolitical Realism, Bacon, Hobbes, Harrington,
Rousseau, Vico, Edward Gibbon, David Hume, John Adams, Cuoco,
Nietzsche, Pareto, Gramsci, Althusser, T. Schelling, Negri, Waltz,
Baruch de Spinoza, Denis Diderot, Carl Schmitt"
In the HTML file, the names of philosophers and schools of thought are also linked to
the corresponding Wikipedia articles, but this information gets lost by relying on the default
element function. Let us replace the default function by one that is designed to extract links—
getHTMLLinks(). This allows us to extract all links for influential and influenced thinkers.
R> influential <- readHTMLTable(mac_source,
                                elFun = getHTMLLinks,
                                stringsAsFactors = FALSE)[[1]][7,]
R> as.character(influential)[1:3]
[1] "/wiki/Xenophon" "/wiki/Plutarch" "/wiki/Tacitus"
SCRAPING THE WEB
235
R> influenced <- readHTMLTable(mac_source,
                               elFun = getHTMLLinks,
                               stringsAsFactors = FALSE)[[1]][8,]
R> as.character(influenced)[1:3]
[1] "/wiki/Political_Realism" "/wiki/Francis_Bacon"
[3] "/wiki/Thomas_Hobbes"
Extracting links, tables, and lists from HTML documents is an ordinary task in web
scraping practice. These functions save a lot of time that we would otherwise have to spend
on constructing suitable XPath expressions, and they help keep our code tidy.
9.1.5 Dealing with HTML forms
Forms are a classical feature of user–server interaction via HTTP on static websites. They
vary in size, layout, input type, and other parameters—just think about all the search bars you
have used, the radio buttons you have clicked, the check marks you have set, the user names
and passwords you have typed in, and so on. Forms are easy to handle with a graphical user interface
like a browser, but a little more difficult when they have to be disentangled in the source code.
In this section, we will cover the general approach to master forms with R. In the end you
should be able to recognize forms, determine the method used to pass the inputs, the location
where the information is sent, and how to specify options and parameters for sending data to
the servers and capture the result.
We will consider three different examples throughout this section to learn how to prepare
your R session, approach forms in general, use the HTTP GET method to send forms to
the server, use POST with a URL-encoded or multipart body, and let R automatically generate
functions that use GET or POST with adequate options to send form data.
Filling out forms in the browser and handling them from within R differ in many respects,
because much of the work that is usually done by the browser in the background has to be
specified explicitly. Using a browser, we
1. fill out the form,
2. push the submit, ok, start, or the like button,
3. let the browser execute the action specified in the source code of the form and send
the data to the server,
4. and let the browser receive the returned resources after the server has evaluated the
inputs.
In scraping practice, things get a little more complicated. We have to
1. recognize the forms that are involved,
2. determine the method used to transfer the data,
3. determine the address to send the data to,
4. determine the inputs to be sent along,
5. build a valid request and send it out, and
6. process the returned resources.
In this section, we use functions from the RCurl, XML, stringr, and plyr packages.
Furthermore, we specify an object that captures debug information along the way so that
we can check for details if something goes awry (see Section 5.4.3 for details). Additionally, we specify a curl handle with a set of default options—cookiejar to enable cookie
management, followlocation to follow page redirections which may be triggered by the
POST command, and autoreferer to automatically set the Referer request header when
we have to follow a location redirect. Finally, we specify the From and User-Agent header
manually to stay identifiable:
R> info   <- debugGatherer()
R> handle <- getCurlHandle(cookiejar      = "",
                           followlocation = TRUE,
                           autoreferer    = TRUE,
                           debugfunc      = info$update,
                           verbose        = TRUE,
                           httpheader     = list(
                             from         = "eddie@r-datacollection.com",
                             'user-agent' = str_c(R.version$version.string,
                                                  ", ", R.version$platform)))
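The handle can now be supplied to RCurl's request functions via their curl argument, so that every request reuses the same options, cookies, and headers. A minimal sketch (the URL is just a placeholder):
R> html <- getURL("http://www.example.com", curl = handle)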
Another preparatory step is to define a function that translates lists of XML attributes
into data frames. This will come in handy when we are going to evaluate the attributes of
HTML form elements of parsed HTML documents. The function we construct is called
xmlAttrsToDF() and takes two arguments. The first argument supplies a parsed HTML
document and the second an XPath expression specifying the nodes from which we want
to collect the attributes. The function extracts the nodes’ attributes via xpathApply() and
xmlAttrs() and transforms the resulting list into a data frame while ensuring that attribute
names do not get lost and that each attribute value is stored in a separate column:
R> xmlAttrsToDF <- function(parsedHTML, xpath) {
     x <- xpathApply(parsedHTML, xpath, xmlAttrs)
     x <- lapply(x, function(x) as.data.frame(t(x)))
     do.call(rbind.fill, x)
   }
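As a quick sanity check, we can apply the function to the <form> nodes of a parsed page, for instance the WordNet page introduced below; this is merely a sketch, and the actual output depends on the page:
R> wn_parsed <- htmlParse("http://wordnetweb.princeton.edu/perl/webwn")
R> xmlAttrsToDF(wn_parsed, "//form")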
9.1.5.1 GETting to grips with forms
To demonstrate how to approach forms in general and specifically how to handle forms that
demand HTTP GET, we use WordNet. WordNet is a service provided by Princeton University
at http://wordnetweb.princeton.edu/perl/webwn. Researchers at Princeton have built up a
database of synonyms for English nouns, verbs, and adjectives. They offer their data as an
online service. The website relies on an HTML form to gather the parameters and send a
request for synonyms—see Princeton University (2010a) for further details and Princeton
University (2010b) for the license.
Let us browse to the page and type in a word, for example, data. Hitting the Search
WordNet button results in a change to the URL which now contains 13 parameters.
http://wordnetweb.princeton.edu/perl/webwn?s=data&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
We have been redirected to another page, which informs us that data is a noun and that it
has two semantic meanings.
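For reference, such a parameterized request can also be assembled directly from R with RCurl's getForm(), which encodes the name-value pairs for us. A sketch passing only the search term s; whether the server requires the remaining parameters is left to verify:
R> wn_url <- "http://wordnetweb.princeton.edu/perl/webwn"
R> wn_res <- getForm(wn_url, s = "data", curl = handle)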
From the fact that the URL is extended with a query string when submitting our search
term we can infer that the form uses the HTTP GET method to send the data to the server.
But let us verify this conclusion. To briefly recap the relevant facts from Chapter 2: HTML
forms are specified with the help of <form> elements.