elements with class="productItem". Once you have all of those wrapper
elements in a list or array, you’ll be able to iterate over each of them and pull out the specific details
you need.
If you’re on a detail page for a single product or you’re just pulling specific bits of information
that aren’t repeated, then you don’t need to find those wrapper elements; you can just look for the
specific elements that contain the information you want.
Choosing The Right HTML Parsing Library
We’re getting really close to actually writing some code and implementing the pattern matching
we’ve found for the URLs and the HTML DOM elements we’re looking for. But first, I want to talk
about the importance of selecting a good HTML parsing library.
Just like your browser turns plain text HTML into a complex nested structure known as the DOM,
you’ll also need a library or package in your favorite language that can build a structured object out
of the HTML you’ll be parsing.
Remember, when you make requests to a web server, all that comes back is a regular ol’ plaintext file
with a bunch of pointy brackets (‘< >’) and some text. You don’t want to be pulling your information
out using string splitting or a giant regular expression[20]. Having a good HTML parsing library saves
you a boatload of time and makes finding your elements fast and easy.
HTML, not XML
Whenever I recommend my favorite HTML parsing library, I often hear the complaint that it’s “too
slow” when compared against an XML parsing library. And since HTML is loosely based on XML,
it’s tempting to think that an XML parsing library will suit your needs and be a great choice.
In the world of XML, standards are very clearly defined and generally very well followed.
Unfortunately, in the world of HTML, neither of these facts hold true. Remember when we talked
about “quirks mode,” when the browser has to try and guess how a site is supposed to be structured
because of poorly-written HTML?
HTML parsing libraries often have their own version of “quirks mode” that still allows them to figure
out the nested structure of a web page, even when the markup isn’t well formed. XML libraries will
usually throw up their hands and say, “This isn’t valid markup, I refuse to parse it!” which leaves
you in a bad spot if the site you’re scraping doesn’t use perfectly-formed HTML.
Your HTML Parsing Library Can Be Simple
You’re really only going to need two functions from your HTML parsing library: find_all()
and find(). They might be called different things in different libraries, but the idea is the same:
• one function that pulls a list of elements based on some filters
• one function that pulls a single element based on some filters
That’s pretty much it. With those two functions, you can pull out a list of wrapper elements to iterate
over, or drill down to find specific elements in the structure. Often, you’ll use a combination of both:
[20]: http://stackoverflow.com/a/1732454/625840
find_all() to pull out the list of wrapper elements, and then find() on each element to extract the
specific fields of information you’re looking for. We’ll take a look at some example code for this in
the next chapter.
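To make the idea concrete, here’s a rough sketch of that combination using BeautifulSoup’s names for the two functions. The class name comes from the earlier productItem example; the div tag and the h3 lookup are just placeholders standing in for whatever elements your own page uses:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://example.com/products/")
soup = BeautifulSoup(r.text, "html.parser")

# pull a list of wrapper elements based on a filter
products = soup.find_all("div", "productItem")

for product in products:
    # drill down to a single element inside each wrapper
    name = product.find("h3")
    print name.text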
Common Traps When Parsing HTML
Now that you’ve found your elements and gotten your HTML parsing library picked out, you’re
almost ready to get started! But before you do, let’s look at some common pitfalls so that you know
what to keep an eye out for.
The DOM in the developer tools is not (always) the same as the
HTML returned by the server
We’ve talked about this already – how elements might appear in the DOM tree even if they weren’t
in the original HTML document returned by the server. On Javascript-heavy websites, entire sections
of content might be loaded asynchronously, or added to the page after the original HTML has been
downloaded.
Whenever you find the right selectors to point you towards the elements you need, you should
always double-check that those elements appear in the page’s HTML. I’ve wasted countless hours
scratching my head wondering why an element I see in the web inspector’s DOM isn’t getting picked
up by my HTML parsing library, when it turned out that the element wasn’t being returned in the
HTML response at all. We’ll talk more about scraping content that’s loaded asynchronously with
Javascript and AJAX in Chapter 8.
Even on sites that don’t use Javascript to load content, it’s possible that your browser has gone into
“quirks mode” while trying to render, and the DOM you see in your inspector is only a guess as to
how the site is supposed to be structured.
When in doubt, always double check that the elements you’re attempting to pull out of the DOM
actually exist in the HTML that’s returned in the main response body. The easiest way to do this is
to simply right-click, “view source” and then cmd-F and search for the class name or other selector
you’re hoping to use.
If you see the same markup that you found in the web inspector, you should be good to go.
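If you’d rather check from code, a quick sanity check is to fetch the page and search the raw response text for the selector you’re planning to use. A minimal sketch, with a placeholder URL and class name:

import requests

r = requests.get("http://example.com/products/")

# if this prints False, the element is probably being added by Javascript
# after the page loads, and it won't be in the HTML your scraper sees
print "productItem" in r.text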
Text Nodes Can Be Devious
In the example markup I provided earlier, you can see that some of the elements terminate with
so-called text nodes. These are things like "Item #1" and "$19.95". Usually, these text nodes are the raw
information that you’ll be pulling into your application and potentially saving to a database.
Now imagine a situation where a clever manager at this ecommerce company says to themselves,
“you know what? I want the word ‘Item’ to appear on a different line than ‘#1’. My nephew tried
teaching me HTML one time, and I remember that typing <br> makes things go down a line. I’ll
enter the name of the item as Item<br>#1 into the system and that’ll trick it to go down a line.
Perfect!”
Aside from being a silly way to mix business information with presentation logic, this also creates
a problem for your web scraping application. Now, when you encounter that node in the DOM and
go to pull out the text – surprise! – there’s actually another DOM element stuffed in there as well.
I’ve seen this happen myself a number of times. For example, in 9 out of 10 product descriptions, there
will be a simple blob of text inside a <p> element. But occasionally, the <p> element will also contain
a bulleted list with <ul> and <li> tags and other DOM nodes that my scraper wasn’t expecting.
Depending on your HTML parsing library, this can cause problems.
One solution is to simply flatten all of the content into a string and save that, but it might require
some extra parsing afterwards to strip out the <br> or other hidden tags that are now mixed in with
your text. Just something else to be aware of.
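With BeautifulSoup, for example, one way to do that flattening is the element’s get_text() method, which collapses everything inside a node – including any unexpected child tags – into a single string. A minimal sketch with made-up markup:

from bs4 import BeautifulSoup

html = "<p>A simple description<br>with a <ul><li>surprise</li><li>list</li></ul> inside.</p>"
soup = BeautifulSoup(html, "html.parser")

# flatten the whole <p> into one string, separating the pieces with spaces
description = soup.find("p").get_text(" ", strip=True)
print description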
Now that you know how to find patterns in the DOM that lead you to your data, and have picked
out a good HTML parser, and know how to avoid the most common HTML parsing pitfalls, it’s time
to start writing some code!
Hands On: Building a Simple Web Scraper with Python
Now that we’ve gotten a solid understanding of how to build the correct HTTP requests and how
to parse HTML responses, it’s time to put it all together and write some working web scraper code.
In this chapter, I’ll walk you through the process of building a simple web scraper in python, step-by-step.
The page that we’ll be scraping is a simple list of countries[21], along with their capital, population
and land mass in square kilometers. The information is hosted on a site that I built specifically to be
an easy-to-use sandbox for teaching beginner web scrapers: Scrape This Site.
There are a number of pages in the sandbox that go through more complex web scraping problems
using common website interface elements, but for this chapter, we’ll stick with the basics on this
page:
https://scrapethissite.com/pages/simple/
We’ll be using python since it has simple syntax, it is widely deployed (it comes pre-installed by
default on many operating systems) and it has some excellent libraries for sending HTTP requests
and parsing HTML responses that will make our lives much easier.
Our first step is to make sure that we have two core packages installed that we’ll be using: requests
and BeautifulSoup. If your system already has python and pip installed, you should be fine to run
this command at the command line to get the packages installed:
pip install requests beautifulsoup4
If that command fails, you should make sure that you have python installed (instructions for
Windows users[22]). You also need to ensure you have pip installed by running the following command
from the command line:
easy_install pip
Once you’ve got python, pip and those packages installed, you’re ready to get started.
[21]: https://scrapethissite.com/pages/simple/
[22]: http://www.howtogeek.com/197947/how-to-install-python-on-windows/
Finding the Patterns
Since all of the data that we’re loading is on a single page, there isn’t much to look for in terms of
URL patterns. There’s another page in the sandbox[23] that uses plenty of query parameters to support
a search box, pagination and “items per page” interface, if you’re looking for more of a challenge.
The main patterns we’ll be looking for in this example are the patterns in the HTML of the page
around each country. If I visit the page we’re scraping in the Google Chrome web browser and then
right-click on a country’s name, then choose “Inspect Element,” I start to see some patterns.
Looking for HTML Patterns in Chrome
You can “open” and “close” nested structures within the DOM by clicking on the little gray arrows
that you see. This will let you drill down into elements, and zoom back up the DOM tree to look for
patterns in higher-level elements.
Even without knowing anything about the site or its developers, I can start to read the HTML
elements and their attributes. If I right-click > “Inspect Element” on a few different countries, I start
to form an idea of some patterns in my head. It seems like the general format for the HTML that
displays each country on the page looks something like this. Note that I’ve used the double-curly
[23]: https://scrapethissite.com/pages/forms/
brackets to represent the actual data that we’d want to pull.
<div class="country">
    <h3 class="country-name">{{COUNTRY_NAME}}</h3>
    <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">{{COUNTRY_CAPITAL}}</span><br>
        <strong>Population:</strong> <span class="country-population">{{COUNTRY_POPULATION}}</span><br>
        <strong>Area (km2):</strong> <span class="country-area">{{COUNTRY_AREA}}</span><br>
    </div>
</div>
If we wanted to find all of the countries on the page, we could look for all <div> elements with a
class of country. Then we could loop over each of those elements and look for the following:

1. <h3> elements with a class of country-name to find the country’s name
2. <span> elements with a class of country-capital to find the country’s capital
3. <span> elements with a class of country-population to find the country’s population
4. <span> elements with a class of country-area to find the country’s area

Note that we don’t even really need to get into the differences between a <div> element and a <span>
or <h3> element, what they mean or how they’re displayed. We just need to look for the patterns
in how the website is using them to mark up the information we’re looking to scrape.
Implementing the Patterns
Now that we’ve found our patterns, it’s time to code them up. Let’s start with some code that makes
a simple HTTP request to the page we want to scrape:
import requests
url = "https://scrapethissite.com/pages/simple/"
r = requests.get(url)
print "We got a {} response code from {}".format(r.status_code, url)
Save this code in a file called simple.py on your Desktop, and then run it from the command line
using the following command:
python ~/Desktop/simple.py
You should see it print out the following:
We got a 200 response code from https://scrapethissite.com/pages/simple/
If so, congratulations! You’ve made your first HTTP request from a web scraping program and
verified that the server responded with a successful response status code.
Next, let’s update our program a bit to take a look at the actual HTML response that we got back
from the server with our request:
import requests
url = "https://scrapethissite.com/pages/simple/"
r = requests.get(url)
print "We got a {} response code from {}".format(r.status_code, url)
print r.text  # print out the HTML response text
If you scroll back up through your console after you run that script, you should see all of the HTML
text that came back from the server. This is the same as what you’d see in your browser if you
right-click > “View Source”.
Now that we’ve got the actual HTML text of the response, let’s pass it off to our HTML parsing
library to be turned into a nested, DOM-like structure. Once it’s formed into that structure, we’ll do
a simple search and print out the page’s title. We’ll be using the much beloved BeautifulSoup and
telling it to use the default HTML parser. We’ll also remove an earlier print statement that displayed
the status code.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")
print soup.find("title")
Now if we run this, we should see that it makes an HTTP request to the page and then prints out
the page’s <title> tag.
Next, let’s try adding in some of the HTML patterns that we found earlier while inspecting the site
in our web browser.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")
countries = soup.find_all("div", "country")
print "We found {} countries on this page".format(len(countries))
In that code, we’re using the find_all() method of BeautifulSoup to find all of the <div> elements
on the page with a class of country. The first argument that we pass to find_all() is the name of
the element that we’re looking for. The second argument we pass to the find_all() function is the
class that we’re looking for.
Taken together, this ensures that we only get back <div> elements with a class of country. There
are other arguments and functions you can use to search through the soup, which are explained in
more detail in Chapter 11.
Once we find all of the elements that match that pattern, we’re storing them in a variable called
countries. Finally, we’re printing out the length of that variable – basically a way to see how many
elements matched our pattern.
This is a good practice as you’re building your scrapers – making sure the pattern you’re using is
finding the expected number of elements on the page helps to ensure you’re using the right pattern.
If it found too many or too few, then there might be an issue that warrants more investigation.
If you run that code, you should see that it says
We found 250 countries on this page
Now that we’ve implemented one search pattern to find the outer wrapper around each country, it’s
time to iterate over each country that we found and do some further extraction to get at the values
we want to store. We’ll start by just grabbing the countries’ names.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")
countries = soup.find_all("div", "country")
for country in countries:
    print country.find("h3").text.strip()
Here we’ve added a simple for-loop that iterates over each item in the countries list. Inside the
loop, we can access the current country we’re looping over using the country variable.
All that we’re doing inside the loop is another BeautifulSoup search, but this time, we’re only
searching inside the specific country element (remember, each country is wrapped in its own
<div class="country">...</div>). We’re looking for an <h3> element, then grabbing the text inside
of it using the .text attribute, stripping off any extra whitespace around it with .strip() and finally
printing it to the console.
If you run that from the command line, you should see the full list of countries printed to the console.
You’ve just built your first web scraper!
We can update it to extract more fields for each country like so:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")
countries = soup.find_all("div", "country")

# note: you will likely get errors when you run this!
for country in countries:
    name = country.find("h3").text.strip()
    capital = country.find("span", "country-capital").text.strip()
    population = country.find("span", "country-population").text.strip()
    area = country.find("span", "country-area").text.strip()
    print "{} has a capital city of {}, a population of {} and an area of {}".format(name, capital, population, area)
If you try running that, it might start printing out information on the first few countries, but then
it’s likely to encounter an error when trying to print out some of the countries’ information. Since
some of the countries have foreign characters, they don’t play nicely when doing things like putting
them in a format string, printing them to the terminal or writing them to a file.
We’ll need to update our code to add .encode("utf-8") to the extracted values that are giving us
problems. In our case, those are simply the country name and the capital city. When we put it all
together, it should look like this.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")
countries = soup.find_all("div", "country")

for country in countries:
    name = country.find("h3").text.strip().encode("utf-8")
    capital = country.find("span", "country-capital").text.strip().encode("utf-8")
    population = country.find("span", "country-population").text.strip()
    area = country.find("span", "country-area").text.strip()
    print "{} has a capital city of {}, a population of {} and an area of {}".format(name, capital, population, area)
And there you have it! Only a dozen lines of python code, and we’ve already built our very first
web scraper that loads a page, finds all of the countries and then loops over them, extracting several
pieces of content from each country based on the patterns we discovered in our browser’s developer
tools.
One thing you’ll notice, and something I did on purpose, is that we built the scraper incrementally
and tested it often. I’d add a bit of functionality, then run it and verify it did what we expected before
moving on and adding more functionality. You should definitely follow this same process as you’re building
your scrapers.
Web scraping can be more of an art than a science sometimes, and it often takes a lot of guess-and-check
work. Use the print statement liberally to check on how your program is doing and then
make tweaks if you encounter errors or need to change things. I talk more about some strategies for
testing and debugging your web scraping code in Chapter 10, but it’s good to get into the habit of
making small changes and running your code often to test those changes.
Hands On: Storing the Scraped Data & Keeping Track of Progress
In the last chapter, we built our very first web scraper that visited a single page and extracted
information about 250 countries. However, the scraper wasn’t too useful – all we did with that
information was print it to the screen.
In reality, it’d be nice to store that information somewhere useful, where we could access it and
analyze it after the scrape completes. If the scrape will visit thousands of pages, it’d be nice to check
on the scraped data while the scraper is still running, to monitor our progress, see how quickly
we’re pulling in new data, and look for potential errors in our extraction patterns.
Store Data As You Go
Before we look at specific methods and places for storing data, I wanted to cover a very important
detail that you should be aware of for any web scraping project: You should be storing the
information that you extract from each page, immediately after you extract it.
It might be tempting to add the information to a list in memory and then run some processing or
aggregation on it after you have finished visiting all the URLs you intend to scrape, before storing it
somewhere. The problem with this approach becomes apparent anytime you’re scraping more than
a few dozen URLs, and it is twofold:
1. Because a single request to a URL can take 2-3 seconds on average, web scrapers that need to
access lots of pages end up taking quite a while to run. Some back-of-the-napkin math says that
100 requests will only take 4-5 minutes, but 1,000 requests will take about 45 minutes, 10,000
requests will take about 7 hours and 100,000 requests will take almost 3 days of running non-stop
(see the quick sketch of this math after this list).
2. The longer a web scraping program runs for, the more likely it is to encounter some sort of
fatal error that will bring the entire process grinding to a halt. We talk more about guarding
against this sort of issue in later chapters, but it’s important to be aware of even if your script
is only going to run for a few minutes.
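Here’s that back-of-the-napkin math as a quick sketch, assuming an average of about 2.5 seconds per request and one request at a time:

# rough runtime estimates for a scraper that makes one request at a time
seconds_per_request = 2.5

for num_requests in (100, 1000, 10000, 100000):
    total_seconds = num_requests * seconds_per_request
    print "{} requests: about {:.0f} minutes ({:.1f} hours)".format(
        num_requests, total_seconds / 60.0, total_seconds / 3600.0)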
Any web scraper that’s going to be running for any length of time should be storing the extracted
contents of each page somewhere permanent as soon as it can, so that if it fails and needs to be
restarted, it doesn’t have to go back to square one and revisit all of the URLs that it already extracted
content from.
If you’re going to do any filtering, cleaning, aggregation or other processing on your scraped data,
do it separately in another script, and have that processing script load the data from the spreadsheet
or database where you stored the data initially. Don’t try to do them together; instead, scrape and
save immediately, and then run your processing later.
This also gives you the benefit of being able to check on your scraped data as it is coming in. You
can examine the spreadsheet or query the database to see how many items have been pulled down
so far, refreshing periodically to see how things are going and how quickly the data is coming in.
Storing Data in a Spreadsheet
Now that we’ve gone over the importance of saving data right away, let’s look at a practical example
using the scraper we started working on in the last chapter. As a reminder, here is the code we ended
up with:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")
countries = soup.find_all("div", "country")

for country in countries:
    name = country.find("h3").text.strip().encode("utf-8")
    capital = country.find("span", "country-capital").text.strip().encode("utf-8")
    population = country.find("span", "country-population").text.strip()
    area = country.find("span", "country-area").text.strip()
    print "{} has a capital city of {}, a population of {} and an area of {}".format(name, capital, population, area)
What we’d like to do is store the information about each country in a Comma Separated Value (CSV)
spreadsheet. Granted, in this example we’re only making one request, so the “save data right away
so that you don’t have to revisit URLs” advice is a bit of a moot point, but hopefully you’ll see how this
would be important if we were scraping many more URLs.
Let’s spend a minute thinking about what we’d want our spreadsheet to look like. We’d want the
first row to contain header information – the label for each of our columns: country name, capital,
population and area. Then, each row underneath that should contain information about a new
country, with the data in the correct columns. Our final spreadsheet should have 250 rows, one
for each country, plus an extra row for the headers, for a total of 251 rows. There should be four
columns, one for each field we’re going to extract.
It’s good to spend a bit of time thinking about what our results should look like so that we can have a
way to tell if our scraper ran correctly. Even if we don’t know exactly how many items we’ll extract,
we should try to have a ballpark sense of the number (order of magnitude) so that we know how
large our finished data set will be, and how long we (roughly) expect our scraper to take to run until
completion.
Python comes with a built-in library for reading from and writing to CSV files, so you won’t have to
run any pip commands to install it. Let’s take a look at a simple use case:
import csv
with open("/path/to/output.csv", "w+") as f:
    writer = csv.writer(f)
    writer.writerow(["col #1", "col #2", "col #3"])
This creates a file called output.csv (note that you’ll likely have to change the full file path to point
to an actual folder on your computer’s hard drive). We open the file and then create a csv.writer()
object, before calling the writerow() method of that writer object.
This sample code only writes a single row to the CSV with three columns, but you can imagine that
it would write many rows to a file if the writerow() function was called inside of a for-loop, like
the one we used to iterate over the countries.
Note that the indentation of the lines matters here, just like it did inside our for-loop in the last
chapter. Once we’ve opened the file for writing, we need to indent all of the code that accesses that
file. Once our code has gone back out a tab, python will close the file for us and we won’t be able
to write to it anymore. In python, the number of spaces at the beginning of a line is significant, so
make sure you’re careful when you’re nesting file operations, for-loops and if-statements.
Let’s look at how we can combine the CSV writing that we just learned with our previous scraper.
import csv
import requests
from bs4 import BeautifulSoup

with open("/path/to/output.csv", "w+") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Capital", "Population", "Area"])

    r = requests.get("https://scrapethissite.com/pages/simple/")
    soup = BeautifulSoup(r.text, "html.parser")
    countries = soup.find_all("div", "country")

    for country in countries:
        name = country.find("h3").text.strip().encode("utf-8")
        capital = country.find("span", "country-capital").text.strip().encode("utf-8")
        population = country.find("span", "country-population").text.strip()
        area = country.find("span", "country-area").text.strip()
        writer.writerow([name, capital, population, area])
You’ll see that we added the import csv statement to the top of the file with our other imports.
Then we open our output file and create a csv.writer() for it. Before we even get to our scraping
code, we’re going to write a quick header row to the file, labeling the columns in the same order
that we’ll be saving them below inside our loop. Then we’re indenting all the rest of our code over
so that it’s “inside” the file handling code and can write our output to the csv file.
Just as before, we make the HTTP request, pass the response HTML into BeautifulSoup to parse,
look for all of the outer country wrapper elements, and then loop over them one-by-one. Inside our
loop, we’re extracting the name, capital, population and area, just as before, except this time we’re
now writing them to our CSV with the writer.writerow() function, instead of simply printing to
the screen.
If you save that into a text file on your Desktop called “countries.py”, then you can run it from the
command line with
python ~/Desktop/countries.py
You won’t see any output at the command line since we don’t have any print statements in our
scraper. However, if you open the output file that you created using a spreadsheet tool like Excel,
you should see all 251 rows that we were expecting, with the data in the correct columns.
CSV is a very simple file format and doesn’t support formatting like bolding, colored backgrounds,
formulas, or other things you might expect from a modern spreadsheet. But it’s a universal format
that can be opened by pretty much any spreadsheet tool, and can then be re-saved into a format that
does support modern spreadsheet features, like an .xls file.
CSV files are a useful place to write data to for many applications. They’re easy files to open, edit
and work with, and are easy to pass around as email attachments or share with coworkers. If you
know your way around a spreadsheet application, you could use pivot tables or filters to run some
basic queries and aggregations.
But if you really want a powerful place to store your data and run complex queries and relational
logic, you’d want to use a SQL database, like SQLite.
Storing Data in a SQLite Database
There are a number of well-known and widely used SQL databases that you may already be familiar
with, like MySQL, PostgreSQL, Oracle or Microsoft SQL Server.
When you’re working with a SQL database, it can be local – meaning that the actual database lives
on the same computer where your scraper is running – or it could be running on a server somewhere
else, and you have to connect to it over the network to write your scraped information to it.
If your database isn’t local (that is, it’s hosted on a server somewhere) you should consider that
each time you insert something into it, you’re making a round-trip across the network from your
computer to the database server, similar to making an HTTP request to a web server. I usually try
to have the database live locally on the machine that I’m running the scrape from, in order to cut
down on extra network traffic.
For those not familiar with SQL databases, an important concept is that the data layout is very
similar to a spreadsheet – each table in a database has several pre-defined columns and each new
item that’s added to the database is thought of as a “row” being added to the table, with the values
for each row stored in the appropriate columns.
When working with a database, your first step is always to open a connection to it (even if it’s a
local database). Then you create a cursor on the connection, which you use to send SQL statements
to the database. Finally, you can execute your SQL statements on that cursor to actually send them
to the database and have them be implemented.
Unlike spreadsheets, you have to define your table and all of its columns ahead of time before you
can begin to insert rows. Also, each column has a well-defined data type. If you say that a certain
column in a table is supposed to be a number or a boolean (i.e. “True” or “False”) and then try to insert
a row that has a text value in that column, the database will block you from inserting the incorrect
data type and return an error.
You can also set up unique constraints on the data to prevent duplicate values from being inserted.
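As a sketch of what that might look like in SQLite – the table and column names here are just illustrative – marking the url column UNIQUE tells the database to reject a second row with the same URL:

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# the UNIQUE constraint makes SQLite raise an error if the same url is inserted twice
c.execute("""
    CREATE TABLE IF NOT EXISTS items (
        name text,
        price text,
        url text UNIQUE
    )
""")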
We won’t go through all the features of a SQL database in this book, but we will take a look at some
basic sample code shortly.
The various SQL databases all offer slightly different features and use-cases in terms of how you
can get data out and work with it, but they all implement a very consistent SQL interface. For
our purposes, we’ll simply be inserting rows into our database using the INSERT command, so
the differences between the databases are largely irrelevant. We’re using SQLite since it’s a very
lightweight database engine, easy to use and comes built in with python.
Let’s take a look at some simple code that connects to a local sqlite database, creates a table, inserts
some rows into the table and then queries them back out.
import sqlite3
# setup the connection and cursor
conn = sqlite3.connect('example.db')
conn.row_factory = sqlite3.Row
conn.isolation_level = None
c = conn.cursor()
# create our table, only needed once
c.execute("""
CREATE TABLE IF NOT EXISTS items (
name text,
price text,
url text
)
""")
# insert some rows
insert_sql = "INSERT INTO items (name, price, url) VALUES (?, ?, ?);"
c.execute(insert_sql, ("Item #1", "$19.95", "http://example.com/item-1"))
c.execute(insert_sql, ("Item #2", "$12.95", "http://example.com/item-2"))
# get all of the rows in our table
items = c.execute("SELECT * FROM items")
for item in items.fetchall():
    print item["name"]
Just as we described, we’re connecting to our database and getting a cursor before executing some
commands on the cursor. The first execute() command is to create our table. Note that we have the
IF NOT EXISTS because otherwise SQLite would throw an error if we ran this script a second time,
when it tried to create the table again and discovered that there was already a table with the same name.
We’re creating a table with three columns (name, price and URL), all of which use the column type
of text. It’s good practice to use the text column type for scraped data since it’s the most permissive
data type and is least likely to throw errors. Plus, the data we’re storing is all coming out of an HTML
text file anyway, so there’s a good chance it’s already in the correct text data type.
After we’ve created our table with the specified columns, then we can begin inserting rows into
it. You’ll see that we have two lines that are very similar, both executing INSERT SQL statements
with some sample data. You’ll also notice that we’re passing two sets of arguments to the execute()
command in order to insert the data.
The first is a SQL string with the INSERT command, the name of the table, the order of the columns
and then finally a set of question marks where we might expect the data to go. The actual data is
passed in as a second argument, rather than inside the SQL string directly. This is to protect
against accidental SQL injection issues.
If you build the SQL INSERT statement as a string yourself using the scraped values, you’ll run
into issues if you scrape a value that has an apostrophe, quotation mark, comma, semicolon or lots
of other special characters. Our sqlite3 library has a way of automatically escaping these sorts of
values for us so that they can be inserted into the database without issue.
It’s worth repeating: never build your SQL INSERT statements yourself directly as strings. Instead,
use the ? as a placeholder where your data will go, and then pass the data in as a second argument.
The sqlite3 library will take care of escaping it and inserting it into the string properly and safely.
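To see why, here’s a small sketch that contrasts the two approaches using a made-up scraped value containing quotes and an apostrophe:

import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE items (name text)")

# scraped text will often contain apostrophes or quotation marks
name = "Tim's \"Favorite\" Item"

# don't do this -- the apostrophe ends the hand-built SQL string early and the INSERT fails:
# c.execute("INSERT INTO items (name) VALUES ('" + name + "');")

# do this -- the ? placeholder lets the sqlite3 library escape the value safely
c.execute("INSERT INTO items (name) VALUES (?);", (name,))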
Now that we’ve seen how to work with a database, let’s go back to our earlier scraper and change it
to insert our scraped data into our SQLite database, instead of a CSV.
import sqlite3
import requests
from bs4 import BeautifulSoup

# setup the connection and cursor
conn = sqlite3.connect('countries.db')
conn.row_factory = sqlite3.Row
conn.isolation_level = None
c = conn.cursor()

c.execute("""
    CREATE TABLE IF NOT EXISTS countries (
        name text,
        capital text,
        population text,
        area text
    )
""")

r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")
countries = soup.find_all("div", "country")

insert_sql = "INSERT INTO countries (name, capital, population, area) VALUES (?, ?, ?, ?);"

for country in countries:
    name = country.find("h3").text.strip()
    capital = country.find("span", "country-capital").text.strip()
    population = country.find("span", "country-population").text.strip()
    area = country.find("span", "country-area").text.strip()
    c.execute(insert_sql, (name, capital, population, area))
We’ve got our imports at the top of the file, then we connect to the database and set up our cursor,
just like we did in the previous SQL example. When we were working with CSV files, we had to
indent all of our scraping code so that it could still have access to the output CSV file, but with our
SQL database connection, we don’t have to worry about that, so our scraping code can stay at the
normal indentation level.
After we’ve opened the connection and gotten the cursor, we set up our countries table with 4
columns of data that we want to store, all set to a data type of text. Then we make the HTTP
request, create the soup, and iterate over all the countries that we found, executing the same INSERT
statements over and over again, passing in the newly extracted information. You’ll notice that we
removed the .encode("utf-8") calls at the end of the name and capital extraction lines, since sqlite
has better support for the unicode text that we get back from BeautifulSoup.
Once your script has finished running, you can connect to your SQLite database and run a SELECT
* FROM countries command to get all of the data back. Running queries to aggregate and analyze
your data is outside the scope of this book, but I will mention the ever-handy SELECT COUNT(*)
FROM countries; query, which simply returns the number of rows in the table. This is useful for
checking on the progress of long-running scrapes, or ensuring that your scraper pulled in as many
items as you were expecting.
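As a sketch, you could run that check from a separate little script while the scraper is still going, assuming the same countries.db file from the example above:

import sqlite3

conn = sqlite3.connect('countries.db')
c = conn.cursor()

# how many rows has the scraper written so far?
count = c.execute("SELECT COUNT(*) FROM countries;").fetchone()[0]
print "{} countries stored so far".format(count)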
In this chapter, we learned the two most common ways you’ll want to store your scraped data –
in a CSV spreadsheet or a SQL database. At this point, we have a complete picture of how to find
the pattern of HTTP requests and HTML elements we’ll need to fetch and extract our data, and
we know how to turn those patterns into python code that makes the requests and hunts for the
patterns. Now we know how to store our extracted data somewhere permanent.
For 99% of web scraping use-cases, this is a sufficient skill set to get the data we’re looking for. The
rest of the book covers some more advanced skills that you might need to use on especially tricky
websites, as well as some handy cheat sheets and tips for debugging your code when you run into
issues.
If you’re looking for some more practice with the basics, head over to Scrape This Site[24], an entire
web scraping portal that I built out in order to help new people learn and practice the art of web
scraping. There’s more sample data that you can scrape as well as a whole set of video courses you
can sign up for that walk you through everything step-by-step.
[24]: https://scrapethissite.com
Scraping Data that’s Not in the Response HTML
For most sites, you can simply make an HTTP request to the page’s URL that you see in the bar at
the top of your browser. Occasionally though, the information we see on a web page isn’t actually
fetched the way we might initially expect.
The reality is that nowadays, websites can be very complex. While we generally think of a web
page as having its own URL, most web pages actually require dozens of HTTP requests –
sometimes hundreds – to load all of their resources. Your browser makes all of these requests in the
background as it’s loading the page, going back to the web server to fetch things like CSS, Javascript,
fonts, images – and occasionally, extra data to display on the page.
Open your browser’s developer tools, click over to the “Network” tab and then try navigating to a
site like The New York Times[25]. You’ll see that there might be hundreds of different HTTP requests
that are sent, simply to load the homepage.
The developer tools show some information about the “Type” of each request. It also allows you to
filter the requests if you’re only interested in certain request types. For example, you could filter to
only show requests for Javascript files or CSS stylesheets.
[25]: https://www.nytimes.com/
175 HTTP Requests to Load the NYTimes.com Homepage!
It’s important to note that each one of these 175 resources needed to load the homepage required
its own HTTP request. Just like the request to the main page URL (https://www.nytimes.com/),
each of these requests has its own URL, request method and headers.
Similarly, each request returns a response code, some response headers and then the actual body of
the response. For the requests we’ve looked at so far, the body of the response has always been a large
blob of HTML text, but for these requests, often the response is formatted completely differently,
depending on what sort of response it is.
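You can see those same pieces on any request you make from python as well – for example, using the page from the earlier chapters:

import requests

r = requests.get("https://scrapethissite.com/pages/simple/")

print r.status_code                     # the response code
print r.headers.get("Content-Type")     # one of the response headers
print len(r.text)                       # size of the response body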
I mention all of this because it’s important to note that sometimes, the data that you see on a page that
you’re trying to scrape isn’t actually returned in the HTML response to the page’s URL. Instead, it’s
requested from the web server and rendered in your browser in one of these other “hidden” requests
that your browser makes behind the scenes as the page loads.
If you try to make a request to the parent page’s URL (the one you see in your browser bar) and
then look for the DOM patterns that you see in the “Inspect Element” window of your developer
tools, you might be frustrated to find that the HTML in the parent page’s response doesn’t actually
contain the information as you see it in your browser.
At a high-level, there are two ways in which a site might load data from a web server to render in
your browser, without actually returning that data in the page’s HTML. The first is that Javascript-heavy websites might make AJAX calls in the background to fetch data in a raw form from an API.
The second is that sites might sometimes return iframes that embed some other webpage inside the
current one.
Javascript and AJAX Requests
With many modern websites and web applications, the first HTTP request that’s made to the parent
page’s URL doesn’t actually return much information at all. Instead, it returns just a bit of HTML
that tells the browser to download a big Javascript application, and then the Javascript application
runs in your browser and makes its own HTTP requests to the backend to fetch data and then add
it to the DOM so that it’s rendered for the user to see.
The data is pulled down asynchronously – not at the same time as the main page request. This leads
to the name AJAX, for Asynchronous Javascript And XML (even though the server doesn’t actually
need to return XML data, and often doesn’t).
Some common indications that the site you’re viewing uses Javascript and AJAX to add content to
the page:
• so called “infinite scrolling” where more content is added to the page as you scroll
• clicking to another page loads a spinner for a few seconds before showing the new content
• the page doesn’t appear to clear and then reload when clicking around, the new content just
shows up
• the URL contains lots of information after a hash character (#)
It may sound complicated at first, and some web scraping professionals really get thrown into a
tailspin when they encounter a site that uses Javascript. They think that they need to download all
of the site’s Javascript and run the entire application as if they were the web browser. Then they
need to wait for the Javascript application to make the requests to fetch the data, and then they
need to wait for the Javascript application to render that data to the page in HTML, before they can
finally scrape it from the page. This involves installing a ton of tools and libraries that are usually a
pain to work with and add a lot of overhead that slows things down.
The reality – if we stick to our fundamentals – is that scraping Javascript-heavy websites is actually
easier than scraping an HTML page, using the simple tools we already have. All we have to do is
inspect the requests that the Javascript application is triggering in our browser, and then instruct
our web scraper to make those same requests. Usually the response to these requests is in a nicely
structured data format like JSON (javascript object notation), which is even easier to parse and read
through than an HTML document.
In your browser’s developer tools, pull up the “Network” tab, reload the page, and then click the
request type filter at the top to only show the XHR requests. These are the AJAX requests that the
Javascript application is making to fetch the data.
Once you’ve filtered down the list of requests, you’ll probably see a series of HTTP requests that
your browser has made in the background to load more information onto the page. Click through
each of these requests and look at the Response tab to see what content was returned from the server.
Inspecting an AJAX request in the “Network” tab of Chrome’s Developer Tools
Eventually, you should start to see the data that you’re looking to scrape. You’ve just discovered
your own hidden “API” that the website is using, and you can now direct your web scraper to use
it in order to pull down the content you’re looking for.
You’ll need to read through the response a bit to see what format the data is being returned in. You
might be able to tell by looking at the Content-Type Response header (under the “Headers” tab for
the request). Occasionally, AJAX requests will return HTML – in which case you would parse the
response with BeautifulSoup just as we have been.
Normally though, sites will have their AJAX requests return JSON formatted data. This is a much
more lightweight format than HTML and is usually more straightforward to parse. In fact, the
requests library that we’ve been using in python has a simple helper method to access JSON data
as a structured python object. Let’s take a look at an example.
Let’s say that we’ve found a hidden AJAX request URL that returns JSON-formatted data like this:
{
    "items": [
        {
            "name": "Item #1",
            "price": "$19.95",
            "permalink": "/item-1/"
        },
        {
            "name": "Item #2",
            "price": "$12.95",
            "permalink": "/item-2/"
        },
        ...
    ]
}
We could build a simple scraper, like so:
import requests
# make the request to the "hidden" AJAX URL
r = requests.get("http://example.com/ajax/data")
# convert the response JSON into a structured python `dict()`
response_data = r.json()
for item in response_data["items"]:
    print item["name"]
    print item["price"]
The data is automatically parsed into a nested structure that we can loop over and inspect, without
having to do any of our own pattern matching or hunting around. Pretty straightforward.
It takes a bit of work up-front to discover the “hidden” AJAX requests that the Javascript application
is triggering in the background, so that’s not as quick as simply grabbing the URL in the big bar at
the top of your browser. But once you find the correct URLs, it’s often much simpler to scrape data
from these hidden AJAX APIs.
Pages Inside Pages: Dealing with iframes
There’s another sort of issue you’ll run into where the parent page doesn’t actually have the data
that’s displayed in its own HTML response. That’s when the site you’re scraping uses iframes.
Iframes are a special HTML element that allows a parent page to embed a completely different page
(at a completely different URL) and load all of that page’s content onto the parent page, inside a box.
It’s basically a way to open one web page inside another web page.
This is commonly used for embedding content between sites, like when someone wants to embed
a youtube video on their blog. Over the years, iframes have been used and abused for all sorts of
reasons, but they’re still alive and well on the web and they’re important to know about if the site
you’re scraping is using them.
Let’s take a look at a hypothetical example.
Say I just visited a parent page with a URL http://mybaseballblog.com and I see some stats that I
want to scrape. The site owner of mybaseballblog.com didn’t come up with the stats themselves or
write the HTML to include them on their site; instead, they embedded a widget from mlb.com.
The “embed code” that mlb.com provided might look something like this:

<iframe src="http://mlb.com/..."></iframe>

All that the site owner at mybaseballblog.com did was copy that one line of HTML onto their blog,
and then the browsers of any visitors to mybaseballblog.com will know to load the content from the
mlb.com URL in the src attribute and shove it into a box on the mybaseballblog.com page.
As a web scraper, when you’re inspecting the elements in the DOM using your browser’s developer
tools, you should always check up a few elements in the DOM tree to see if the elements you’re
looking at are nested inside of an <iframe> element.
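When the data you want does live inside an iframe, one approach – sketched below with the hypothetical mybaseballblog.com example and placeholder selectors – is to find the iframe’s src in the parent page and then make a second request directly to that URL, parsing its response just like any other page:

import requests
from bs4 import BeautifulSoup

# fetch the parent page and find the embedded iframe
r = requests.get("http://mybaseballblog.com")
soup = BeautifulSoup(r.text, "html.parser")
iframe = soup.find("iframe")

# make a second request directly to the URL in the iframe's src attribute
iframe_url = iframe["src"]
r2 = requests.get(iframe_url)
iframe_soup = BeautifulSoup(r2.text, "html.parser")

# now search iframe_soup for the elements you saw in the developer tools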