The Ultimate Guide to Web Scraping
Hartley Brody
This book is for sale at http://leanpub.com/web-scraping-guide
This version was published on 2017-02-18

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
© 2013 - 2017 Hartley Brody

Contents
Introduction to Web Scraping
Web Scraping as a Legitimate Data Collection Tool
Understand Web Technologies: What Your Browser is Doing Behind the Scenes
Pattern Discovery: Finding the Right URLs that Return the Data You’re Looking For
Pattern Discovery: Finding the Structure in an HTML Document
Hands On: Building a Simple Web Scraper with Python
Hands On: Storing the Scraped Data & Keeping Track of Progress
Scraping Data that’s Not in the Response HTML
Avoiding Common Scraping Pitfalls, Good Scraping Etiquette & Other Best Practices
How to Troubleshoot and Fix Your Web Scraping Code Without Pulling Your Hair Out
A Handy, Easy-To-Reference Web Scraping Cheat Sheet
Web Scraping Resources: A Beginner-Friendly Sandbox and Online Course

Introduction to Web Scraping
Web scraping is the process of programmatically pulling information out of a web page. But if
you’re reading this, you probably already knew that.
The truth is, web scraping is more of an art form than a typical engineering challenge. Every website
you’ll encounter is different, so there is no single “right way” to do web scraping. Successfully scraping a
website requires time and patience. You must study your target site and learn its ways. Where is the
information you need? How is that information loaded onto the page? What traps have they set up,
ready to set off the alarms and block your scraper as an unwanted intruder?
In this book – The Ultimate Guide to Web Scraping – you will hone your skills and become a master
craftsman in the art of web scraping. We’ll talk about the reasons why web scraping is a valid way to
harvest information – despite common complaints. We’ll look at the various ways that information
is sent from a website to your computer, and how you can intercept and parse it. We’ll also look at
common traps and anti-scraping tactics and how you might be able to thwart them.

Loosely Structured Data
When we talk about web scraping, we’re usually talking about pulling information out of an HTML
document – a webpage. At the most fundamental level, HTML is simply a markup language (as in
Hyper Text Markup Language). A good website developer uses HTML to provide structure to the
document, marking up certain elements as “navigation” or “products” or “tables.”
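To make that idea concrete, here is a minimal sketch of how a scraper can lean on that structure. It uses only Python's standard library. The sample page, the class names "navigation" and "product", and the helper names `ProductParser` and `extract_products` are all made up for illustration; a real site will use its own markup, which you would have to discover first.

```python
from html.parser import HTMLParser

# Hypothetical page: the class names "navigation" and "product" are
# assumptions made for this example, not taken from any real site.
SAMPLE_HTML = """
<html>
  <body>
    <ul class="navigation">
      <li>Home</li>
      <li>About</li>
    </ul>
    <div class="product">Red Widget</div>
    <div class="product">Blue Widget</div>
  </body>
</html>
"""


class ProductParser(HTMLParser):
    """Collects the text inside every <div class="product"> element."""

    def __init__(self):
        super().__init__()
        self.in_product = False  # True while we are inside a product div
        self.products = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "div" and dict(attrs).get("class") == "product":
            self.in_product = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_product = False

    def handle_data(self, data):
        # Keep only non-blank text found inside a product div
        if self.in_product and data.strip():
            self.products.append(data.strip())


def extract_products(html):
    """Return the product names found in the given HTML string."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


print(extract_products(SAMPLE_HTML))  # ['Red Widget', 'Blue Widget']
```

The navigation items are ignored because they are not marked up as products. This sketch does not handle nested divs or elements with several class names, which is one reason real scrapers usually rely on a dedicated HTML parsing library.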