The Definitive Guide


Table of Contents
Preface
- Running Example: Joe’s Hardware Store
- Chapter-by-Chapter Guide
- Typographic Conventions
- Comments and Questions
- Acknowledgments
Part I. HTTP: The Web’s Foundation
Overview of HTTP
- HTTP: The Internet’s Multimedia Courier
- Web Clients and Servers
- Resources
- Transactions
- Messages
  - Simple Message Example
- Connections
- Protocol Versions
- Architectural Components of the Web
- The End of the Beginning
- For More Information
URLs and Resources
- Navigating the Internet’s Resources
  - The Dark Days Before URLs
- URL Syntax
- URL Shortcuts
  - Relative URLs
    - Base URLs
    - Resolving relative references
  - Expandomatic URLs
- Shady Characters
- A Sea of Schemes
- The Future
  - If Not Now, When?
- For More Information
HTTP Messages
- The Flow of Messages
  - Messages Commute Inbound to the Origin Server
  - Messages Flow Downstream
- The Parts of a Message
- Methods
- Status Codes
- Headers
- For More Information
Connection Management
- TCP Connections
- TCP Performance Considerations
- HTTP Connection Handling
  - The Oft-Misunderstood Connection Header
  - Serial Transaction Delays
- Parallel Connections
- Persistent Connections
- Pipelined Connections
- The Mysteries of Connection Close
- For More Information
Part II. HTTP Architecture
Web Servers
- Web Servers Come in All Shapes and Sizes
- A Minimal Perl Web Server
- What Real Web Servers Do
- Step 1: Accepting Client Connections
- Step 2: Receiving Request Messages
  - Internal Representations of Messages
  - Connection Input/Output Processing Architectures
- Step 3: Processing Requests
- Step 4: Mapping and Accessing Resources
- Step 5: Building Responses
- Step 6: Sending Responses
- Step 7: Logging
- For More Information
Proxies
- Web Intermediaries
  - Private and Shared Proxies
  - Proxies Versus Gateways
- Why Use Proxies?
- Where Do Proxies Go?
- Client Proxy Settings
- Tricky Things About Proxy Requests
- Tracing Messages
  - The Via Header
  - The TRACE Method
    - Max-Forwards
- Proxy Authentication
- Proxy Interoperation
- For More Information
Caching
- Redundant Data Transfers
- Bandwidth Bottlenecks
- Flash Crowds
- Distance Delays
- Hits and Misses
- Cache Topologies
- Cache Processing Steps
- Keeping Copies Fresh
- Controlling Cachability
- Setting Cache Controls
  - Controlling HTTP Headers with Apache
  - Controlling HTML Caching Through HTTP-EQUIV
- Detailed Algorithms
- Caches and Advertising
- For More Information
Integration Points: Gateways, Tunnels, and Relays
- Gateways
  - Client-Side and Server-Side Gateways
- Protocol Gateways
- Resource Gateways
  - Common Gateway Interface (CGI)
  - Server Extension APIs
- Application Interfaces and Web Services
- Tunnels
- Relays
- For More Information
Web Robots
- Crawlers and Crawling
- Robotic HTTP
- Misbehaving Robots
- Excluding Robots
- Robot Etiquette
- Search Engines
- For More Information
HTTP-NG
- HTTP’s Growing Pains
- HTTP-NG Activity
- Modularize and Enhance
- Distributed Objects
- Layer 1: Messaging
- Layer 2: Remote Invocation
- Layer 3: Web Application
- WebMUX
- Binary Wire Protocol
- Current Status
- For More Information
Part III. Identification, Authorization, and Security
Client Identification and Cookies
- The Personal Touch
- HTTP Headers
- Client IP Address
- User Login
- Fat URLs
- Cookies
- For More Information
Basic Authentication
- Authentication
- Basic Authentication
- The Security Flaws of Basic Authentication
- For More Information
Digest Authentication
- The Improvements of Digest Authentication
- Digest Calculations
- Quality of Protection Enhancements
  - Message Integrity Protection
  - Digest Authentication Headers
- Practical Considerations
- Security Considerations
- For More Information
Secure HTTP
- Making HTTP Safe
  - HTTPS
- Digital Cryptography
- Symmetric-Key Cryptography
  - Key Length and Enumeration Attacks
  - Establishing Shared Keys
- Public-Key Cryptography
  - RSA
  - Hybrid Cryptosystems and Session Keys
- Digital Signatures
  - Signatures Are Cryptographic Checksums
- Digital Certificates
- HTTPS: The Details
- A Real HTTPS Client
- Tunneling Secure Traffic Through Proxies
- For More Information
Part IV. Entities, Encodings, and Internationalization
Entities and Encodings
- Messages Are Crates, Entities Are Cargo
  - Entity Bodies
- Content-Length: The Entity’s Size
- Entity Digests
- Media Type and Charset
- Content Encoding
- Transfer Encoding and Chunked Encoding
- Time-Varying Instances
- Validators and Freshness
  - Freshness
  - Conditionals and Validators
- Range Requests
- Delta Encoding
  - Instance Manipulations, Delta Generators, and Delta Appliers
- For More Information
Internationalization
- HTTP Support for International Content
- Character Sets and HTTP
- Multilingual Character Encoding Primer
- Language Tags and HTTP
- Internationalized URIs
- Other Considerations
- For More Information
Content Negotiation and Transcoding
- Content-Negotiation Techniques
- Client-Driven Negotiation
- Server-Driven Negotiation
- Transparent Negotiation
  - Caching and Alternates
  - The Vary Header
- Transcoding
- Next Steps
- For More Information
Part V. Content Publishing and Distribution
Web Hosting
- Hosting Services
  - A Simple Example: Dedicated Hosting
- Virtual Hosting
- Making Web Sites Reliable
- Making Web Sites Fast
- For More Information
Publishing Systems
- FrontPage Server Extensions for Publishing Support
- WebDAV and Collaborative Authoring
- For More Information
Redirection and Load Balancing
- Why Redirect?
- Where to Redirect
- Overview of Redirection Protocols
- General Redirection Methods
- Proxy Redirection Methods
- Cache Redirection Methods
  - WCCP Redirection
- Internet Cache Protocol
- Cache Array Routing Protocol
- Hyper Text Caching Protocol
  - HTCP Authentication
  - Setting Caching Policies
- For More Information
Logging and Usage Tracking
- What to Log?
- Log Formats
- Hit Metering
  - Overview
  - The Meter Header
- A Word on Privacy
- For More Information
Part VI. Appendixes
URI Schemes
HTTP Status Codes
- Status Code Classifications
- Status Codes
HTTP Header Reference
- Accept
- Accept-Charset
- Accept-Encoding
- Accept-Language
- Accept-Ranges
- Age
- Allow
- Authorization
- Cache-Control
- Client-ip
- Connection
- Content-Base
- Content-Encoding
- Content-Language
- Content-Length
- Content-Location
- Content-MD5
- Content-Range
- Content-Type
- Cookie
- Cookie2
- Date
- ETag
- Expect
- Expires
- From
- Host
- If-Modified-Since
- If-Match
- If-None-Match
- If-Range
- If-Unmodified-Since
- Last-Modified
- Location
- Max-Forwards
- MIME-Version
- Pragma
- Proxy-Authenticate
- Proxy-Authorization
- Proxy-Connection
- Public
- Range
- Referer
- Retry-After
- Server
- Set-Cookie
- Set-Cookie2
- TE
- Trailer
- Title
- Transfer-Encoding
- UA-(CPU, Disp, OS, Color, Pixels)
- Upgrade
- User-Agent
- Vary
- Via
- Warning
- WWW-Authenticate
- X-Cache
- X-Forwarded-For
- X-Pad
- X-Serial-Number
MIME Types
- Background
- MIME Type Structure
- MIME Type IANA Registration
- MIME Type Tables
Base-64 Encoding
- Base-64 Encoding Makes Binary Data Safe
- Eight Bits to Six Bits
- Base-64 Padding
- Perl Implementation
- For More Information
Digest Authentication
- Digest WWW-Authenticate Directives
- Digest Authorization Directives
- Digest Authentication-Info Directives
- Reference Code
Language Tags
- First Subtag Rules
- Second Subtag Rules
- IANA-Registered Language Tags
- ISO 639 Language Codes
- ISO 3166 Country Codes
- Language Administrative Organizations
MIME Charset Registry
- MIME Charset Registry
- Preferred MIME Names
- Registered Charsets
Index

HTTP

The Definitive Guide

David Gourley and Brian Totty

with Marjorie Sayer, Sailu Reddy, and Anshu Aggarwal

Beijing • Cambridge • Farnham • Köln • Paris • Sebastopol • Taipei • Tokyo

HTTP: The Definitive Guide

by David Gourley and Brian Totty

with Marjorie Sayer, Sailu Reddy, and Anshu Aggarwal

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly Media, Inc. books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Linda Mui

Production Editor: Rachel Wheeler

Cover Designer: Ellie Volckhausen

Interior Designers: David Futato and Melanie Wang

Printing History:

September 2002: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. HTTP: The Definitive Guide, the image of a thirteen-lined ground squirrel, and related trade dress are trademarks of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN-10: 1-56592-509-2

ISBN-13: 978-1-56592-509-0

[C] [01/08]

Preface

The Hypertext Transfer Protocol (HTTP) is the protocol programs use to communicate over the World Wide Web. There are many applications of HTTP, but HTTP is most famous for two-way conversation between web browsers and web servers.

HTTP began as a simple protocol, so you might think there really isn’t that much to say about it. And yet here you stand, with a two-pound book in your hands. If you’re wondering how we could have written 650 pages on HTTP, take a look at the Table of Contents. This book isn’t just an HTTP header reference manual; it’s a veritable bible of web architecture.

In this book, we try to tease apart HTTP’s interrelated and often misunderstood rules, and we offer you a series of topic-based chapters that explain all the aspects of HTTP. Throughout the book, we are careful to explain the “why” of HTTP, not just the “how.” And to save you time chasing references, we explain many of the critical non-HTTP technologies that are required to make HTTP applications work. You can find the alphabetical header reference (which forms the basis of most conventional HTTP texts) in a conveniently organized appendix. We hope this conceptual design makes it easy for you to work with HTTP.

This book is written for anyone who wants to understand HTTP and the underlying architecture of the Web. Software and hardware engineers can use this book as a coherent reference for HTTP and related web technologies. Systems architects and network administrators can use this book to better understand how to design, deploy, and manage complicated web architectures. Performance engineers and analysts can benefit from the sections on caching and performance optimization. Marketing and consulting professionals will be able to use the conceptual orientation to better understand the landscape of web technologies.

This book illustrates common misconceptions, advises on “tricks of the trade,” provides convenient reference material, and serves as a readable introduction to dry and confusing standards specifications. In a single book, we detail the essential and interrelated technologies that make the Web work.

This book is the result of a tremendous amount of work by many people who share an enthusiasm for Internet technologies. We hope you find it useful.

Running Example: Joe’s Hardware Store

Many of our chapters include a running example of a hypothetical online hardware and home-improvement store called “Joe’s Hardware” to demonstrate technology concepts. We have set up a real web site for the store (http://www.joes-hardware.com) for you to test some of the examples in the book. We will maintain this web site while this book remains in print.

Chapter-by-Chapter Guide

This book contains 21 chapters, divided into 5 logical parts (each with a technology theme), and 8 useful appendixes containing reference data and surveys of related technologies:

Part I, HTTP: The Web’s Foundation

Part II, HTTP Architecture

Part III, Identification, Authorization, and Security

Part IV, Entities, Encodings, and Internationalization

Part V, Content Publishing and Distribution

Part VI, Appendixes

Part I, HTTP: The Web’s Foundation, describes the core technology of HTTP, the foundation of the Web, in four chapters:

• Chapter 1, Overview of HTTP, is a rapid-paced overview of HTTP.

• Chapter 2, URLs and Resources, details the formats of uniform resource locators (URLs) and the various types of resources that URLs name across the Internet. It also outlines the evolution to uniform resource names (URNs).

• Chapter 3, HTTP Messages, details how HTTP messages transport web content.

• Chapter 4, Connection Management, explains the commonly misunderstood and poorly documented rules and behavior for managing HTTP connections.

Part II, HTTP Architecture, highlights the HTTP server, proxy, cache, gateway, and robot applications that are the architectural building blocks of web systems. (Web browsers are another building block, of course, but browsers already were covered thoroughly in Part I of the book.) Part II contains the following six chapters:

• Chapter 5, Web Servers, gives an overview of web server architectures.

• Chapter 6, Proxies, explores HTTP proxy servers, which are intermediary servers that act as platforms for HTTP services and controls.

• Chapter 7, Caching, delves into the science of web caches, devices that improve performance and reduce traffic by making local copies of popular documents.

• Chapter 8, Integration Points: Gateways, Tunnels, and Relays, explains gateways and application servers that allow HTTP to work with software that speaks different protocols, including Secure Sockets Layer (SSL) encrypted protocols.

• Chapter 9, Web Robots, describes the various types of clients that pervade the Web, including the ubiquitous browsers, robots and spiders, and search engines.

• Chapter 10, HTTP-NG, talks about HTTP developments still in the works: the HTTP-NG protocol.

Part III, Identification, Authorization, and Security, presents a suite of techniques and technologies to track identity, enforce security, and control access to content. It contains the following four chapters:

• Chapter 11, Client Identification and Cookies, talks about techniques to identify users so that content can be personalized to the user audience.

• Chapter 12, Basic Authentication, highlights the basic mechanisms to verify user identity. The chapter also examines how HTTP authentication interfaces with databases.

• Chapter 13, Digest Authentication, explains digest authentication, a complex proposed enhancement to HTTP that provides significantly enhanced security.

• Chapter 14, Secure HTTP, is a detailed overview of Internet cryptography, digital certificates, and SSL.

Part IV, Entities, Encodings, and Internationalization, focuses on the bodies of HTTP messages (which contain the actual web content) and on the web standards that describe and manipulate content stored in the message bodies. Part IV contains three chapters:

• Chapter 15, Entities and Encodings, describes the structure of HTTP content.

• Chapter 16, Internationalization, surveys the web standards that allow users around the globe to exchange content in different languages and character sets.

• Chapter 17, Content Negotiation and Transcoding, explains mechanisms for negotiating acceptable content.

Part V, Content Publishing and Distribution, discusses the technology for publishing and disseminating web content. It contains four chapters:

• Chapter 18, Web Hosting, discusses the ways people deploy servers in modern web hosting environments and HTTP support for virtual web hosting.

• Chapter 19, Publishing Systems, discusses the technologies for creating web content and installing it onto web servers.

• Chapter 20, Redirection and Load Balancing, surveys the tools and techniques for distributing incoming web traffic among a collection of servers.

• Chapter 21, Logging and Usage Tracking, covers log formats and common questions.

Part VI, Appendixes, contains helpful reference appendixes and tutorials in related technologies:

• Appendix A, URI Schemes, summarizes the protocols supported through uniform resource identifier (URI) schemes.

• Appendix B, HTTP Status Codes, conveniently lists the HTTP response codes.

• Appendix C, HTTP Header Reference, provides a reference list of HTTP header fields.

• Appendix D, MIME Types, provides an extensive list of MIME types and explains how MIME types are registered.

• Appendix E, Base-64 Encoding, explains base-64 encoding, used by HTTP authentication.

• Appendix F, Digest Authentication, gives details on how to implement various authentication schemes in HTTP.

• Appendix G, Language Tags, defines language tag values for HTTP language headers.

• Appendix H, MIME Charset Registry, provides a detailed list of character encodings, used for HTTP internationalization support.

Each chapter contains many examples and pointers to additional reference material.

Typographic Conventions

In this book, we use the following typographic conventions:

Italic

Used for URLs, C functions, command names, MIME types, new terms where they are defined, and emphasis

Constant width

Used for computer output, code, and any literal text

Constant width bold

Used for user input

Comments and Questions

Please address comments and questions concerning this book to the publisher:

O’Reilly & Associates, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

(800) 998-9938 (in the United States or Canada)

(707) 829-0515 (international/local)

(707) 829-0104 (fax)

There is a web page for this book, which lists errata, examples, or any additional information. You can access this page at:

http://www.oreilly.com/catalog/httptdg/

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

For more information about books, conferences, Resource Centers, and the O’Reilly Network, see the O’Reilly web site at:

http://www.oreilly.com

Acknowledgments

This book is the labor of many. The five authors would like to hold up a few people in thanks for their significant contributions to this project.

To start, we’d like to thank Linda Mui, our editor at O’Reilly. Linda first met with David and Brian way back in 1996, and she refined and steered several concepts into the book you hold today. Linda also helped keep our wandering gang of first-time book authors moving in a coherent direction and on a progressing (if not rapid) timeline. Most of all, Linda gave us the chance to create this book. We’re very grateful.

We’d also like to thank several tremendously bright, knowledgeable, and kind souls who devoted noteworthy energy to reviewing, commenting on, and correcting drafts of this book. These include Tony Bourke, Sean Burke, Mike Chowla, Shernaz Daver, Fred Douglis, Paula Ferguson, Vikas Jha, Yves Lafon, Peter Mattis, Chuck Neerdaels, Luis Tavera, Duane Wessels, Dave Wu, and Marco Zagha. Their viewpoints and suggestions have improved the book tremendously.

Rob Romano from O’Reilly created most of the amazing artwork you’ll find in this book. The book contains an unusually large number of detailed illustrations that make subtle concepts very clear. Many of these illustrations were painstakingly created and revised numerous times. If a picture is worth a thousand words, Rob added hundreds of pages of value to this book.

Brian would like to personally thank all of the authors for their dedication to this project. A tremendous amount of time was invested by the authors in a challenge to make the first detailed but accessible treatment of HTTP. Weddings, childbirths, killer work projects, startup companies, and graduate schools intervened, but the authors held together to bring this project to a successful completion. We believe the result is worthy of everyone’s hard work and, most importantly, that it provides a valuable service. Brian also would like to thank the employees of Inktomi for their enthusiasm and support and for their deep insights about the use of HTTP in real-world applications. Also, thanks to the fine folks at Cajun-shop.com for allowing us to use their site for some of the examples in this book.

David would like to thank his family, particularly his mother and grandfather for their ongoing support. He’d like to thank those that have put up with his erratic schedule over the years writing the book. He’d also like to thank Slurp, Orctomi, and Norma for everything they’ve done, and his fellow authors for all their hard work. Finally, he would like to thank Brian for roping him into yet another adventure.

Marjorie would like to thank her husband, Alan Liu, for technical insight, familial support and understanding. Marjorie thanks her fellow authors for many insights and inspirations. She is grateful for the experience of working together on this book.

Sailu would like to thank David and Brian for the opportunity to work on this book, and Chuck Neerdaels for introducing him to HTTP.

Anshu would like to thank his wife, Rashi, and his parents for their patience, support, and encouragement during the long years spent writing this book.

Finally, the authors collectively thank the famous and nameless Internet pioneers, whose research, development, and evangelism over the past four decades contributed so much to our scientific, social, and economic community. Without these labors, there would be no subject for this book.

PART I

HTTP: The Web’s Foundation

This section is an introduction to the HTTP protocol. The next four chapters describe the core technology of HTTP, the foundation of the Web:

• Chapter 1, Overview of HTTP, is a rapid-paced overview of HTTP.

• Chapter 2, URLs and Resources, details the formats of URLs and the various types of resources that URLs name across the Internet. We also outline the evolution to URNs.

• Chapter 3, HTTP Messages, details the HTTP messages that transport web content.

• Chapter 4, Connection Management, discusses the commonly misunderstood and poorly documented rules and behavior for managing TCP connections by HTTP.

CHAPTER 1

Overview of HTTP

The world’s web browsers, servers, and related web applications all talk to each other through HTTP, the Hypertext Transfer Protocol. HTTP is the common language of the modern global Internet.

This chapter is a concise overview of HTTP. You’ll see how web applications use HTTP to communicate, and you’ll get a rough idea of how HTTP does its job. In particular, we talk about:

• How web clients and servers communicate

• Where resources (web content) come from

• How web transactions work

• The format of the messages used for HTTP communication

• The underlying TCP network transport

• The different variations of the HTTP protocol

• Some of the many HTTP architectural components installed around the Internet

We’ve got a lot of ground to cover, so let’s get started on our tour of HTTP.

HTTP: The Internet’s Multimedia Courier

Billions of JPEG images, HTML pages, text files, MPEG movies, WAV audio files,

Java applets, and more cruise through the Internet each and every day. HTTP moves

the bulk of this information quickly, conveniently, and reliably from web servers all

around the world to web browsers on people’s desktops.

Because HTTP uses reliable data-transmission protocols, it guarantees that your data

will not be damaged or scrambled in transit, even when it comes from the other side of

the globe. This is good for you as a user, because you can access information without

worrying about its integrity. Reliable transmission is also good for you as an Internet

application developer, because you don’t have to worry about HTTP communications

being destroyed, duplicated, or distorted in transit. You can focus on programming

the distinguishing details of your application, without worrying about the flaws and

foibles of the Internet.

Let’s look more closely at how HTTP transports the Web’s traffic.

Web Clients and Servers

Web content lives on web servers. Web servers speak the HTTP protocol, so they are

often called HTTP servers. These HTTP servers store the Internet’s data and provide

the data when it is requested by HTTP clients. The clients send HTTP requests to

servers, and servers return the requested data in HTTP responses, as sketched in

Figure 1-1. Together, HTTP clients and HTTP servers make up the basic compo-

nents of the World Wide Web.

You probably use HTTP clients every day. The most common client is a web

browser, such as Microsoft Internet Explorer or Netscape Navigator. Web browsers

request HTTP objects from servers and display the objects on your screen.

When you browse to a page, such as “http://www.oreilly.com/index.html,” your

browser sends an HTTP request to the server www.oreilly.com (see Figure 1-1). The

server tries to find the desired object (in this case, “/index.html”) and, if successful,

sends the object to the client in an HTTP response, along with the type of the object,

the length of the object, and other information.

Resources

Web servers host web resources. A web resource is the source of web content. The

simplest kind of web resource is a static file on the web server’s filesystem. These

files can contain anything: they might be text files, HTML files, Microsoft Word

files, Adobe Acrobat files, JPEG image files, AVI movie files, or any other format you

can think of.

However, resources don’t have to be static files. Resources can also be software programs that generate content on demand. These dynamic content resources can generate content based on your identity, on what information you’ve requested, or on the time of day. They can show you a live image from a camera, or let you trade stocks, search real estate databases, or buy gifts from online stores (see Figure 1-2).

Figure 1-1. Web clients and servers. The client sends an HTTP request (“Get me the document called /index.html.”) to the server www.oreilly.com, and the server returns an HTTP response (“Okay, here it is, it’s in HTML format and is 3,150 characters long.”).

In summary, a resource is any kind of content source. A file containing your com-

pany’s sales forecast spreadsheet is a resource. A web gateway to scan your local

public library’s shelves is a resource. An Internet search engine is a resource.

Media Types

Because the Internet hosts many thousands of different data types, HTTP carefully

tags each object being transported through the Web with a data format label called a

MIME type. MIME (Multipurpose Internet Mail Extensions) was originally designed

to solve problems encountered in moving messages between different electronic mail

systems. MIME worked so well for email that HTTP adopted it to describe and label

its own multimedia content.

Web servers attach a MIME type to all HTTP object data (see Figure 1-3). When a

web browser gets an object back from a server, it looks at the associated MIME type

to see if it knows how to handle the object. Most browsers can handle hundreds of

popular object types: displaying image files, parsing and formatting HTML files,

playing audio files through the computer’s speakers, or launching external plug-in

software to handle special formats.

Figure 1-2. A web resource is anything that provides web content. The client reaches the server across the Internet; the server’s resources include filesystem files (an image file, a text file) and gateways for e-commerce, real estate search, stock trading, and a web cam.

A MIME type is a textual label, represented as a primary object type and a specific

subtype, separated by a slash. For example:

• An HTML-formatted text document would be labeled with type text/html.

• A plain ASCII text document would be labeled with type text/plain.

• A JPEG version of an image would be image/jpeg.

• A GIF-format image would be image/gif.

• An Apple QuickTime movie would be video/quicktime.

• A Microsoft PowerPoint presentation would be application/vnd.ms-powerpoint.

There are hundreds of popular MIME types, and many more experimental or limited-

use types. A very thorough MIME type list is provided in Appendix D.
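As a rough illustration of the type/subtype structure, the following Python sketch splits a MIME label at the slash and guesses a label from a filename using the standard-library mimetypes module. The helper names split_mime_type and guess_mime_type are hypothetical, and the application/octet-stream fallback for unknown extensions is an assumption of this sketch, not something this chapter prescribes.

```python
import mimetypes


def split_mime_type(mime_type: str) -> tuple[str, str]:
    """Split a label such as 'text/html' into (primary type, subtype)."""
    primary, slash, subtype = mime_type.partition("/")
    if not slash or not primary or not subtype:
        raise ValueError(f"not a type/subtype label: {mime_type!r}")
    return primary.lower(), subtype.lower()


def guess_mime_type(filename: str) -> str:
    """Guess a MIME type from a file extension, as a simple server might."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    # Assumption: fall back to a generic binary label for unknown extensions.
    return mime_type or "application/octet-stream"


if __name__ == "__main__":
    for name in ("index.html", "notes.txt", "saw-blade.gif"):
        label = guess_mime_type(name)
        print(name, label, split_mime_type(label))
```

A real server usually takes the mapping from its own configuration; the extension-based guess here is only one possible approach.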

URIs

Each web server resource has a name, so clients can point out what resources they

are interested in. The server resource name is called a uniform resource identifier, or

URI. URIs are like the postal addresses of the Internet, uniquely identifying and

locating information resources around the world.

Here’s a URI for an image resource on Joe’s Hardware store’s web server:

http://www.joes-hardware.com/specials/saw-blade.gif

Figure 1-4 shows how the URI specifies the HTTP protocol to access the saw-blade

GIF resource on Joe’s store’s server. Given the URI, HTTP can retrieve the object.

URIs come in two flavors, called URLs and URNs. Let’s take a peek at each of these

types of resource identifiers now.

URLs

The uniform resource locator (URL) is the most common form of resource identifier.

URLs describe the specific location of a resource on a particular server. They tell you

exactly how to fetch a resource from a precise, fixed location. Figure 1-4 shows how

a URL tells precisely where a resource is located and how to access it. Table 1-1

shows a few examples of URLs.

Figure 1-3. MIME types are sent back with the data content. The server’s response to the client carries the MIME type in a Content-type header (Content-type: image/jpeg), alongside Content-length: 12984.

Most URLs follow a standardized format of three main parts:

• The first part of the URL is called the scheme, and it describes the protocol used

to access the resource. This is usually the HTTP protocol (http://).

• The second part gives the server Internet address (e.g., www.joes-hardware.com).

• The rest names a resource on the web server (e.g., /specials/saw-blade.gif).

Today, almost every URI is a URL.
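The three-part split described above can be sketched with Python’s standard urllib.parse module. This is a minimal illustration rather than a full URL parser; the function name url_parts is hypothetical, and it ignores ports, queries, and the other URL components covered in Chapter 2.

```python
from urllib.parse import urlsplit


def url_parts(url: str) -> tuple[str, str, str]:
    """Return (scheme, server address, resource path) for a URL."""
    parts = urlsplit(url)
    # hostname is None when the URL names no server; use "" in that case.
    return parts.scheme, parts.hostname or "", parts.path


if __name__ == "__main__":
    scheme, server, resource = url_parts(
        "http://www.joes-hardware.com/specials/saw-blade.gif"
    )
    print("scheme:  ", scheme)
    print("server:  ", server)
    print("resource:", resource)
```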

URNs

The second flavor of URI is the uniform resource name, or URN. A URN serves as a

unique name for a particular piece of content, independent of where the resource

currently resides. These location-independent URNs allow resources to move from

place to place. URNs also allow resources to be accessed by multiple network access

protocols while maintaining the same name.

For example, the following URN might be used to name the Internet standards docu-

ment “RFC 2141” regardless of where it resides (it may even be copied in several

places):

urn:ietf:rfc:2141

Figure 1-4. URLs specify protocol, server, and local resource. For http://www.joes-hardware.com/specials/saw-blade.gif: (1) use the HTTP protocol, (2) go to www.joes-hardware.com, (3) grab the resource called /specials/saw-blade.gif. The server’s response carries Content-type: image/gif and Content-length: 8572.

Table 1-1. Example URLs

URL                                                           Description
http://www.oreilly.com/index.html                             The home URL for O’Reilly & Associates, Inc.
http://www.yahoo.com/images/logo.gif                          The URL for the Yahoo! web site’s logo
http://www.joes-hardware.com/inventory-check.cgi?item=12731   The URL for a program that checks if inventory item #12731 is in stock
ftp://joe:tools4u@ftp.joes-hardware.com/locking-pliers.gif    The URL for the locking-pliers.gif image file, using password-protected FTP as the access protocol

URNs are still experimental and not yet widely adopted. To work effectively, URNs

need a supporting infrastructure to resolve resource locations; the lack of such an

infrastructure has also slowed their adoption. But URNs do hold some exciting

promise for the future. We’ll discuss URNs in a bit more detail in Chapter 2, but

most of the remainder of this book focuses almost exclusively on URLs.

Unless stated otherwise, we adopt the conventional terminology and use URI and

URL interchangeably for the remainder of this book.

Transactions

Let’s look in more detail at how clients use HTTP to transact with web servers and their resources. An HTTP transaction consists of a request command (sent from client to server), and a response result (sent from the server back to the client). This communication happens with formatted blocks of data called HTTP messages, as illustrated in Figure 1-5.

Methods

HTTP supports several different request commands, called HTTP methods. Every

HTTP request message has a method. The method tells the server what action to per-

form (fetch a web page, run a gateway program, delete a file, etc.). Table 1-2 lists five

common HTTP methods.

Figure 1-5. HTTP transactions consist of request and response messages

HTTP request message, containing the command and the URI (client to www.joes-hardware.com):

    GET /specials/saw-blade.gif HTTP/1.0
    Host: www.joes-hardware.com

HTTP response message, containing the result of the transaction (www.joes-hardware.com to client):

    HTTP/1.0 200 OK
    Content-type: image/gif
    Content-length: 8572

Table 1-2. Some common HTTP methods

HTTP method   Description
GET           Send named resource from the server to the client.
PUT           Store data from client into a named server resource.

We’ll discuss HTTP methods in detail in Chapter 3.

Status Codes

Every HTTP response message comes back with a status code. The status code is a

three-digit numeric code that tells the client if the request succeeded, or if other

actions are required. A few common status codes are shown in Table 1-3.

HTTP also sends an explanatory textual “reason phrase” with each numeric status

code (see the response message in Figure 1-5). The textual phrase is included only for

descriptive purposes; the numeric code is used for all processing.

The following status codes and reason phrases are treated identically by HTTP soft-

ware:

200 OK

200 Document attached

200 Success

200 All’s cool, dude

HTTP status codes are explained in detail in Chapter 3.
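To make the point about reason phrases concrete, here is a small Python sketch that splits a response start line into its version, numeric status code, and reason phrase. The function name parse_status_line is hypothetical, and the sketch assumes a single space between the fields. Software that follows the rule above would base its decisions on the numeric code only.

```python
def parse_status_line(line: str) -> tuple[str, int, str]:
    """Split a response start line into (version, status code, reason phrase)."""
    version, _, rest = line.strip().partition(" ")
    code_text, _, reason = rest.partition(" ")
    # The status code is a three-digit number; the reason phrase is free text.
    if len(code_text) != 3 or not (code_text.isascii() and code_text.isdigit()):
        raise ValueError(f"bad status code in line: {line!r}")
    return version, int(code_text), reason


if __name__ == "__main__":
    for phrase in ("OK", "Document attached", "Success", "All's cool, dude"):
        version, code, reason = parse_status_line(f"HTTP/1.0 200 {phrase}")
        # Only the numeric code drives processing; the phrase is descriptive.
        print(code, repr(reason))
```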

Web Pages Can Consist of Multiple Objects

An application often issues multiple HTTP transactions to accomplish a task. For

example, a web browser issues a cascade of HTTP transactions to fetch and display a

graphics-rich web page. The browser performs one transaction to fetch the HTML

“skeleton” that describes the page layout, then issues additional HTTP transactions

for each embedded image, graphics pane, Java applet, etc. These embedded

resources might even reside on different servers, as shown in Figure 1-6. Thus, a

“web page” often is a collection of resources, not a single resource.

Table 1-2. Some common HTTP methods (continued)

HTTP method   Description
DELETE        Delete the named resource from a server.
POST          Send client data into a server gateway application.
HEAD          Send just the HTTP headers from the response for the named resource.

Table 1-3. Some common HTTP status codes

HTTP status code   Description
200                OK. Document returned correctly.
302                Redirect. Go someplace else to get the resource.
404                Not Found. Can’t find this resource.

Messages

Now let’s take a quick look at the structure of HTTP request and response mes-

sages. We’ll study HTTP messages in exquisite detail in Chapter 3.

HTTP messages are simple, line-oriented sequences of characters. Because they are

plain text, not binary, they are easy for humans to read and write.* Figure 1-7 shows

the HTTP messages for a simple transaction.

HTTP messages sent from web clients to web servers are called request messages.

Messages from servers to clients are called response messages. There are no other

kinds of HTTP messages. The formats of HTTP request and response messages are

very similar.

* Some programmers complain about the difficulty of HTTP parsing, which can be tricky and error-prone, especially when designing high-speed software. A binary format or a more restricted text format might have been simpler to process, but most HTTP programmers appreciate HTTP’s extensibility and debuggability.

Figure 1-6. Composite web pages require separate HTTP transactions for each embedded resource. The client fetches resources from Server 1 and Server 2 across the Internet.

Figure 1-7. HTTP messages have a simple, line-oriented text structure

(a) Request message

    GET /test/hi-there.txt HTTP/1.0      Start line
    Accept: text/*                       Headers
    Accept-Language: en,fr

(b) Response message

    HTTP/1.0 200 OK                      Start line
    Content-type: text/plain             Headers
    Content-length: 19

    Hi! I’m a message!                   Body

HTTP messages consist of three parts:

Start line

The first line of the message is the start line, indicating what to do for a request

or what happened for a response.

Header fields

Zero or more header fields follow the start line. Each header field consists of a

name and a value, separated by a colon (:) for easy parsing. The headers end

with a blank line. Adding a header field is as easy as adding another line.

Body

After the blank line is an optional message body containing any kind of data.

Request bodies carry data to the web server; response bodies carry data back to

the client. Unlike the start lines and headers, which are textual and structured,

the body can contain arbitrary binary data (e.g., images, videos, audio tracks,

software applications). Of course, the body can also contain text.
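The three-part layout can be sketched in a few lines of Python. This is an illustrative parser only, under stated assumptions: lines end with CRLF on the wire, the header block is decoded as ISO-8859-1, and folded or repeated headers are not handled. The function name parse_message is hypothetical.

```python
def parse_message(raw: bytes) -> tuple[str, list[tuple[str, str]], bytes]:
    """Split a raw HTTP message into (start line, header fields, body)."""
    # The blank line (CRLF CRLF on the wire) ends the headers; the rest is body.
    head, _separator, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = []
    for line in lines[1:]:
        # Each header field is a name and a value separated by a colon.
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError(f"malformed header line: {line!r}")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body


if __name__ == "__main__":
    # The response message from Figure 1-7 (with an ASCII apostrophe).
    raw = (
        b"HTTP/1.0 200 OK\r\n"
        b"Content-type: text/plain\r\n"
        b"Content-length: 19\r\n"
        b"\r\n"
        b"Hi! I'm a message!\n"
    )
    start_line, headers, body = parse_message(raw)
    print(start_line)
    print(headers)
    print(body)
```

The body is kept as bytes because, unlike the start line and headers, it may hold arbitrary binary data.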

Simple Message Example

Figure 1-8 shows the HTTP messages that might be sent as part of a simple transac-

tion. The browser requests the resource http://www.joes-hardware.com/tools.html.

In Figure 1-8, the browser sends an HTTP request message. The request has a GET

method in the start line, and the local resource is /tools.html. The request indicates it

is speaking Version 1.0 of the HTTP protocol. The request message has no body,

because no request data is needed to GET a simple document from a server.

The server sends back an HTTP response message. The response contains the HTTP

version number (HTTP/1.0), a success status code (200), a descriptive reason phrase

(OK), and a block of response header fields, all followed by the response body con-

taining the requested document. The response body length is noted in the Content-

Length header, and the document’s MIME type is noted in the Content-Type

header.

Connections

Now that we’ve sketched what HTTP’s messages look like, let’s talk for a moment

about how messages move from place to place, across Transmission Control Protocol

(TCP) connections.

TCP/IP

HTTP is an application layer protocol. HTTP doesn’t worry about the nitty-gritty

details of network communication; instead, it leaves the details of networking to

TCP/IP, the popular reliable Internet transport protocol.

TCP provides:

• Error-free data transportation

• In-order delivery (data will always arrive in the order in which it was sent)

• Unsegmented data stream (can dribble out data in any size at any time)

The Internet itself is based on TCP/IP, a popular layered set of packet-switched net-

work protocols spoken by computers and network devices around the world. TCP/IP

hides the peculiarities and foibles of individual networks and hardware, letting com-

puters and networks of any type talk together reliably.

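Once a TCP connection is established, messages exchanged between the client and server computers will not be lost, damaged, or received out of order for as long as the connection stays up; if TCP cannot deliver the data, the connection fails rather than delivering corrupted data.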

In networking terms, the HTTP protocol is layered over TCP. HTTP uses TCP to

transport its message data. Likewise, TCP is layered over IP (see Figure 1-9).

Figure 1-8. Example GET transaction for http://www.joes-hardware.com/tools.html

(a) Request message (client to www.joes-hardware.com)

Request start line (command):

    GET /tools.html HTTP/1.0

Request headers:

    User-agent: Mozilla/4.75 [en] (Win98; U)
    Host: www.joes-hardware.com
    Accept: text/html, image/gif, image/jpeg
    Accept-language: en

No request body.

(b) Response message (www.joes-hardware.com to client)

Response start line (status):

    HTTP/1.0 200 OK

Response headers:

    Date: Sun, 01 Oct 2000 23:25:17 GMT
    Server: Apache/1.3.11 BSafe-SSL/1.38 (Unix)
    Last-modified: Tue, 04 Jul 2000 09:46:21 GMT
    Content-length: 403
    Content-type: text/html

Response body:

    <HTML>
    <HEAD><TITLE>Joe’s Tools</TITLE></HEAD>
    <BODY>
    <H1>Tools Page</H1>
    <H2>Hammers</H2>
    <P>Joe’s Hardware Online has the largest selection of
    hammers on the earth.</P>
    <H2><A NAME=drills></A>Drills</H2>
    <P>Joe’s Hardware has a complete line of cordless
    and corded drills, as well as the latest in
    plutonium-powered atomic drills, for those big
    around the house jobs.</P>...
    </BODY>
    </HTML>

Connections, IP Addresses, and Port Numbers

Before an HTTP client can send a message to a server, it needs to establish a TCP/IP

connection between the client and server using Internet protocol (IP) addresses and

port numbers.

Setting up a TCP connection is sort of like calling someone at a corporate office.

First, you dial the company’s phone number. This gets you to the right organization.

Then, you dial the specific extension of the person you’re trying to reach.

In TCP, you need the IP address of the server computer and the TCP port number

associated with the specific software program running on the server.

This is all well and good, but how do you get the IP address and port number of the

HTTP server in the first place? Why, the URL, of course! We mentioned before that

URLs are the addresses for resources, so naturally enough they can provide us with

the IP address for the machine that has the resource. Let’s take a look at a few URLs:

http://207.200.83.29:80/index.html

http://www.netscape.com:80/index.html

http://www.netscape.com/index.html

The first URL has the machine’s IP address, “207.200.83.29”, and port number,

“80”.

The second URL doesn’t have a numeric IP address; it has a textual domain name, or

hostname (“www.netscape.com”). The hostname is just a human-friendly alias for an

IP address. Hostnames can easily be converted into IP addresses through a facility

called the Domain Name System (DNS), so we’re all set here, too. We will talk much

more about DNS and URLs in Chapter 2.

The final URL has no port number. When the port number is missing from an HTTP

URL, you can assume the default value of port 80.
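The following Python sketch shows one way to pull the host and port out of the three URLs above, applying the default of port 80 when the URL names no port. The helper name host_and_port is hypothetical, and the sketch deliberately accepts only http URLs.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # used when an HTTP URL does not name a port


def host_and_port(url: str) -> tuple[str, int]:
    """Return the (host, port) an HTTP client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.scheme != "http" or parts.hostname is None:
        raise ValueError(f"not an HTTP URL with a host: {url!r}")
    port = parts.port if parts.port is not None else DEFAULT_HTTP_PORT
    return parts.hostname, port


if __name__ == "__main__":
    for url in (
        "http://207.200.83.29:80/index.html",
        "http://www.netscape.com:80/index.html",
        "http://www.netscape.com/index.html",
    ):
        print(url, "->", host_and_port(url))
```

The host returned for the second and third URLs is still a hostname; converting it into an IP address is the job of DNS.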

With the IP address and port number, a client can easily communicate via TCP/IP.

Figure 1-10 shows how a browser uses HTTP to display a simple HTML resource

that resides on a distant server.

Figure 1-9. HTTP network protocol stack

HTTP                              Application layer
TCP                               Transport layer
IP                                Network layer
Network-specific link interface   Data link layer
Physical network hardware         Physical layer

Here are the steps:

(a) The browser extracts the server’s hostname from the URL.

(b) The browser converts the server’s hostname into the server’s IP address.

(c) The browser extracts the port number (if any) from the URL.

(d) The browser establishes a TCP connection with the web server.

(e) The browser sends an HTTP request message to the server.

(f) The server sends an HTTP response back to the browser.

(g) The connection is closed, and the browser displays the document.
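The steps above can be sketched with Python’s standard socket module. This is a minimal illustration, not how a real browser is written: the function names build_get_request and fetch are hypothetical, the request uses HTTP/1.0 so that the server closes the connection when the response is complete, and there is no handling of redirects, errors, or non-http schemes.

```python
import socket
from urllib.parse import urlsplit


def build_get_request(host: str, target: str) -> bytes:
    """Build a minimal GET request; the blank line marks the end of the request."""
    lines = [f"GET {target} HTTP/1.0", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL over a raw TCP connection and return the whole response."""
    parts = urlsplit(url)
    host = parts.hostname                       # extract the hostname
    if host is None:
        raise ValueError(f"no hostname in URL: {url!r}")
    port = parts.port if parts.port is not None else 80  # port, default 80
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    ip_address = socket.gethostbyname(host)     # hostname to IP address (DNS)
    # Establish the TCP connection to the server's IP address and port.
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, target))  # send the request
        chunks = []
        while True:                             # read the response
            data = conn.recv(4096)
            if not data:                        # server closed the connection
                break
            chunks.append(data)
    # Leaving the with-block closes our side of the connection.
    return b"".join(chunks)


if __name__ == "__main__":
    response = fetch("http://www.joes-hardware.com:80/tools.html")
    print(response.decode("iso-8859-1"))
```

Reading until the server closes the connection is the simplest way to find the end of the response; Chapter 4 discusses connection handling in more depth.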

Figure 1-10. Basic browser connection process. The user types in the URL http://www.joes-hardware.com:80/tools.html; (a) the browser gets the hostname, www.joes-hardware.com; (b) it looks up the address through DNS; (d) it connects to 161.58.228.45 port 80; (e) it sends an HTTP GET request; (f) it reads the HTTP response from the server; (g) it closes the connection, and the browser shows the page.

A Real Example Using Telnet

Because HTTP uses TCP/IP, and is text-based, as opposed to using some obscure

binary format, it is simple to talk directly to a web server.

The Telnet utility connects your keyboard to a destination TCP port and connects

the TCP port output back to your display screen. Telnet is commonly used for

remote terminal sessions, but it can generally connect to any TCP server, including

HTTP servers.

You can use the Telnet utility to talk directly to web servers. Telnet lets you open a

TCP connection to a port on a machine and type characters directly into the port.

The web server treats you as a web client, and any data sent back on the TCP con-

nection is displayed onscreen.

Let’s use Telnet to interact with a real web server. We will use Telnet to fetch the

document pointed to by the URL http://www.joes-hardware.com:80/tools.html (you

can try this example yourself).

Let’s review what should happen:

• First, we need to look up the IP address of www.joes-hardware.com and open a

TCP connection to port 80 on that machine. Telnet does this legwork for us.

• Once the TCP connection is open, we need to type in the HTTP request.

• When the request is complete (indicated by a blank line), the server should send

back the content in an HTTP response and close the connection.

Our example HTTP request for http://www.joes-hardware.com:80/tools.html is shown

in Example 1-1. What we typed is shown in boldface.

Example 1-1. An HTTP transaction using telnet

% telnet www.joes-hardware.com 80
Trying 161.58.228.45...
Connected to joes-hardware.com.
Escape character is '^]'.
GET /tools.html HTTP/1.1
Host: www.joes-hardware.com

HTTP/1.1 200 OK
Date: Sun, 01 Oct 2000 23:25:17 GMT
Server: Apache/1.3.11 BSafe-SSL/1.38 (Unix) FrontPage/4.0.4.3
Last-Modified: Tue, 04 Jul 2000 09:46:21 GMT
ETag: "373979-193-3961b26d"
Accept-Ranges: bytes
Content-Length: 403
Connection: close
Content-Type: text/html

<HTML>
<HEAD><TITLE>Joe's Tools</TITLE></HEAD>
<BODY>
<H1>Tools Page</H1>
<H2>Hammers</H2>
<P>Joe's Hardware Online has the largest selection of hammers on the earth.</P>
<H2><A NAME=drills></A>Drills</H2>
<P>Joe's Hardware has a complete line of cordless and corded drills, as well as the latest
in plutonium-powered atomic drills, for those big around the house jobs.</P> ...
</BODY>
</HTML>
Connection closed by foreign host.

Telnet looks up the hostname and opens a connection to the www.joes-hardware.com web server, which is listening on port 80. The three lines after the command are output from Telnet, telling us it has established a connection.

We then type in our basic request command, “GET /tools.html HTTP/1.1”, and send a Host header providing the original hostname, followed by a blank line, asking the server to GET us the resource “/tools.html” from the server www.joes-hardware.com. After that, the server responds with a response line, several response headers, a blank line, and finally the body of the HTML document.

Beware that Telnet mimics HTTP clients well but doesn’t work well as a server. And automated Telnet scripting is no fun at all. For a more flexible tool, you might want to check out nc (netcat). The nc tool lets you easily manipulate and script UDP- and TCP-based traffic, including HTTP. See http://netcat.sourceforge.net for details.

Protocol Versions

There are several versions of the HTTP protocol in use today. HTTP applications need to work hard to robustly handle different variations of the HTTP protocol. The versions in use are:

HTTP/0.9

The 1991 prototype version of HTTP is known as HTTP/0.9. This protocol contains many serious design flaws and should be used only to interoperate with legacy clients. HTTP/0.9 supports only the GET method, and it does not support MIME typing of multimedia content, HTTP headers, or version numbers. HTTP/0.9 was originally defined to fetch simple HTML objects. It was soon replaced with HTTP/1.0.

HTTP/1.0

1.0 was the first version of HTTP that was widely deployed. HTTP/1.0 added version numbers, HTTP headers, additional methods, and multimedia object handling. HTTP/1.0 made it practical to support graphically appealing web pages and interactive forms, which helped promote the wide-scale adoption of the World Wide Web. This specification was never well specified. It represented a collection of best practices in a time of rapid commercial and academic evolution of the protocol.

HTTP/1.0+

Many popular web clients and servers rapidly added features to HTTP in the

mid-1990s to meet the demands of a rapidly expanding, commercially success-

ful World Wide Web. Many of these features, including long-lasting “keep-

alive” connections, virtual hosting support, and proxy connection support, were

added to HTTP and became unofficial, de facto standards. This informal,

extended version of HTTP is often referred to as HTTP/1.0+.

HTTP/1.1

HTTP/1.1 focused on correcting architectural flaws in the design of HTTP, spec-

ifying semantics, introducing significant performance optimizations, and remov-

ing mis-features. HTTP/1.1 also included support for the more sophisticated

web applications and deployments that were under way in the late 1990s.

HTTP/1.1 is the current version of HTTP.

HTTP-NG (a.k.a. HTTP/2.0)

HTTP-NG is a prototype proposal for an architectural successor to HTTP/1.1

that focuses on significant performance optimizations and a more powerful frame-

work for remote execution of server logic. The HTTP-NG research effort con-

cluded in 1998, and at the time of this writing, there are no plans to advance this

proposal as a replacement for HTTP/1.1. See Chapter 10 for more information.
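As a small illustration of what handling several versions can involve, the sketch below reads the protocol version from a request start line. It assumes that a request with no version field is HTTP/0.9 (which, as noted above, has no version numbers) and that later versions appear as HTTP/major.minor. The function name request_version is hypothetical.

```python
def request_version(request_line: str) -> tuple[int, int]:
    """Return the (major, minor) HTTP version named by a request start line."""
    fields = request_line.split()
    if len(fields) == 2:
        # Method and resource only: assume an HTTP/0.9-style request.
        return (0, 9)
    if len(fields) == 3 and fields[2].startswith("HTTP/"):
        major, dot, minor = fields[2][len("HTTP/"):].partition(".")
        if dot and major.isdigit() and minor.isdigit():
            return (int(major), int(minor))
    raise ValueError(f"unrecognized request line: {request_line!r}")


if __name__ == "__main__":
    for line in (
        "GET /tools.html",
        "GET /tools.html HTTP/1.0",
        "GET /tools.html HTTP/1.1",
    ):
        print(repr(line), "->", request_version(line))
```

Returning the version as a tuple lets callers compare versions directly, for example request_version(line) >= (1, 1).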

Architectural Components of the Web

In this overview chapter, we’ve focused on how two web applications (web browsers

and web servers) send messages back and forth to implement basic transactions.

There are many other web applications that you interact with on the Internet. In this

section, we’ll outline several other important applications, including:

Proxies

HTTP intermediaries that sit between clients and servers

Caches

HTTP storehouses that keep copies of popular web pages close to clients

Gateways

Special web servers that connect to other applications

Tunnels

Special proxies that blindly forward HTTP communications

Agents

Semi-intelligent web clients that make automated HTTP requests

Proxies

Let’s start by looking at HTTP proxy servers, important building blocks for web

security, application integration, and performance optimization.

As shown in Figure 1-11, a proxy sits between a client and a server, receiving all of

the client’s HTTP requests and relaying the requests to the server (perhaps after

modifying the requests). These applications act as a proxy for the user, accessing the

server on the user’s behalf.

Proxies are often used for security, acting as trusted intermediaries through which all

web traffic flows. Proxies can also filter requests and responses; for example, to

detect application viruses in corporate downloads or to filter adult content away

from elementary-school students. We’ll talk about proxies in detail in Chapter 6.

Caches

A web cache or caching proxy is a special type of HTTP proxy server that keeps copies of popular documents that pass through the proxy. The next client requesting the same document can be served from the cache’s personal copy (see Figure 1-12).

Figure 1-11. Proxies relay traffic between client and server. The proxy sits between the client and the server, across the Internet.

Figure 1-12. Caching proxies keep local copies of popular documents to improve performance. Clients are served by a proxy cache that sits between them and the server.

A client may be able to download a document much more quickly from a nearby

cache than from a distant web server. HTTP defines many facilities to make caching

more effective and to regulate the freshness and privacy of cached content. We cover

caching technology in Chapter 7.
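The core idea of a cache can be sketched in a few lines of Python. This toy class is an assumption-laden illustration: the names TinyCache and fetch_from_origin are hypothetical, the "origin server" is any function that maps a URL to document bytes, and the sketch ignores the freshness and privacy rules that a real HTTP cache must follow.

```python
from typing import Callable


class TinyCache:
    """Keep a copy of each document the first time it passes through."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._copies: dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._copies:
            # Serve the next request for the same URL from the local copy.
            self.hits += 1
            return self._copies[url]
        # First request: go to the origin server and remember the result.
        self.misses += 1
        document = self._fetch(url)
        self._copies[url] = document
        return document


if __name__ == "__main__":
    def slow_origin(url: str) -> bytes:
        print("fetching from origin:", url)
        return b"<HTML>...</HTML>"

    cache = TinyCache(slow_origin)
    cache.get("http://www.joes-hardware.com/tools.html")
    cache.get("http://www.joes-hardware.com/tools.html")
    print("hits:", cache.hits, "misses:", cache.misses)
```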

Gateways

Gateways are special servers that act as intermediaries for other servers. They are

often used to convert HTTP traffic to another protocol. A gateway always receives

requests as if it were the origin server for the resource. The client may not be aware it

is communicating with a gateway.

For example, an HTTP/FTP gateway receives requests for FTP URIs via HTTP

requests but fetches the documents using the FTP protocol (see Figure 1-13). The

resulting document is packed into an HTTP message and sent to the client. We dis-

cuss gateways in Chapter 8.

Tunnels

Tunnels are HTTP applications that, after setup, blindly relay raw data between two

connections. HTTP tunnels are often used to transport non-HTTP data over one or

more HTTP connections, without looking at the data.

One popular use of HTTP tunnels is to carry encrypted Secure Sockets Layer (SSL)

traffic through an HTTP connection, allowing SSL traffic through corporate fire-

walls that permit only web traffic. As sketched in Figure 1-14, an HTTP/SSL tunnel

receives an HTTP request to establish an outgoing connection to a destination

address and port, then proceeds to tunnel the encrypted SSL traffic over the HTTP

channel so that it can be blindly relayed to the destination server.

Agents

User agents (or just agents) are client programs that make HTTP requests on the

user’s behalf. Any application that issues web requests is an HTTP agent. So far,

we’ve talked about only one kind of HTTP agent: web browsers. But there are many

other kinds of user agents.

Figure 1-13. HTTP/FTP gateway. The HTTP client speaks HTTP to the HTTP/FTP gateway, which speaks FTP to the FTP server.

For example, there are machine-automated user agents that autonomously wander

the Web, issuing HTTP transactions and fetching content, without human supervi-

sion. These automated agents often have colorful names, such as “spiders” or “web

robots” (see Figure 1-15). Spiders wander the Web to build useful archives of web

content, such as a search engine’s database or a product catalog for a comparison-

shopping robot. See Chapter 9 for more information.

Figure 1-14. Tunnels forward data across non-HTTP networks (HTTP/SSL tunnel shown). The client’s SSL traffic enters at the tunnel start, crosses an HTTP connection (port 80) as SSL carried inside HTTP, and leaves the tunnel endpoint over an SSL connection (port 443) to the server.

Figure 1-15. Automated search engine “spiders” are agents, fetching web pages around the world. The search engine “spider” fetches pages from many web servers and feeds the search engine database.

The End of the Beginning

That’s it for our quick introduction to HTTP. In this chapter, we highlighted HTTP’s

role as a multimedia transport protocol. We outlined how HTTP uses URIs to name

multimedia resources on remote servers, we sketched how HTTP request and

response messages are used to manipulate multimedia resources on remote servers,

and we finished by surveying a few of the web applications that use HTTP.

The remaining chapters explain the technical machinery of the HTTP protocol,

applications, and resources in much more detail.

For More Information

Later chapters of this book will explore HTTP in much more detail, but you might

find that some of the following sources contain useful background about particular

topics we covered in this chapter.

HTTP Protocol Information

HTTP Pocket Reference

Clinton Wong, O’Reilly & Associates, Inc. This little book provides a concise

introduction to HTTP and a quick reference to each of the headers and status

codes that compose HTTP transactions.

http://www.w3.org/Protocols/

This W3C web page contains many great links about the HTTP protocol.

http://www.ietf.org/rfc/rfc2616.txt

RFC 2616, “Hypertext Transfer Protocol -- HTTP/1.1,” is the official specification for HTTP/1.1, the current version of the HTTP protocol. The specification is a well-written, well-organized, detailed reference for HTTP, but it isn’t ideal for readers who want to learn the underlying concepts and motivations of HTTP or the differences between theory and practice. We hope that this book fills in the underlying concepts, so you can make better use of the specification.

http://www.ietf.org/rfc/rfc1945.txt

RFC 1945, “Hypertext Transfer Protocol -- HTTP/1.0,” is an informational RFC that describes the modern foundation for HTTP. It details the officially sanctioned and “best-practice” behavior of web applications at the time the specification was written. It also contains some useful descriptions of behavior that is deprecated in HTTP/1.1 but still widely implemented by legacy applications.

http://www.w3.org/Protocols/HTTP/AsImplemented.html

This web page contains a description of the 1991 HTTP/0.9 protocol, which

implements only GET requests and has no content typing.


Historical Perspective

http://www.w3.org/Protocols/WhyHTTP.html

This brief web page from 1991, from the author of HTTP, highlights some of the

original, minimalist goals of HTTP.

http://www.w3.org/History.html

“A Little History of the World Wide Web” gives a short but interesting perspective on some of the early goals and foundations of the World Wide Web and HTTP.

http://www.w3.org/DesignIssues/Architecture.html

“Web Architecture from 50,000 Feet” paints a broad, ambitious view of the

World Wide Web and the design principles that affect HTTP and related web

technologies.

Other World Wide Web Information

http://www.w3.org

The World Wide Web Consortium (W3C) is the technology steering team for the Web. The W3C develops interoperable technologies (specifications, guidelines, software, and tools) for the evolving Web. The W3C site is a treasure trove of introductory and detailed documentation about web technologies.

http://www.ietf.org/rfc/rfc2396.txt

RFC 2396, “Uniform Resource Identifiers (URI): Generic Syntax,” is the detailed

reference for URIs and URLs.

http://www.ietf.org/rfc/rfc2141.txt

RFC 2141, “URN Syntax,” is a 1997 specification describing URN syntax.

http://www.ietf.org/rfc/rfc2046.txt

RFC 2046, “MIME Part 2: Media Types,” is the second in a suite of five Internet

specifications defining the Multipurpose Internet Mail Extensions standard for

multimedia content management.

http://www.wrec.org/Drafts/draft-ietf-wrec-taxonomy-06.txt

This Internet draft, “Internet Web Replication and Caching Taxonomy,” specifies standard terminology for web architectural components.

CHAPTER 2

URLs and Resources

Think of the Internet as a giant, expanding city, full of places to see and things to do. You and the other residents and tourists of this booming community would use standard naming conventions for the city’s vast attractions and services. You’d use street addresses for museums, restaurants, and people’s homes. You’d use phone numbers for the fire department, the boss’s secretary, and your mother, who says you don’t call enough.

Everything has a standardized name, to help sort out the city’s resources. Books have

ISBN numbers, buses have route numbers, bank accounts have account numbers,

and people have social security numbers. Tomorrow you will meet your business

partners at gate 31 of the airport. Every morning you take a Red-line train and exit at

Kendall Square station.

And because everyone agreed on standards for these different names, we can easily

share the city’s treasures with each other. You can tell the cab driver to take you to

246 McAllister Street, and he’ll know what you mean (even if he takes the long way).

Uniform resource locators (URLs) are the standardized names for the Internet’s

resources. URLs point to pieces of electronic information, telling you where they are

located and how to interact with them.

In this chapter, we’ll cover:

• URL syntax and what the various URL components mean and do

• URL shortcuts that many web clients support, including relative URLs and expandomatic URLs

• URL encoding and character rules

• Common URL schemes that support a variety of Internet information systems

• The future of URLs, including uniform resource names (URNs), a framework to support objects that move from place to place while retaining stable names


Navigating the Internet’s Resources

URLs are the resource locations that your browser needs to find information. They

let people and applications find, use, and share the billions of data resources on the

Internet. URLs are the usual human access point to HTTP and other protocols: a

person points a browser at a URL and, behind the scenes, the browser sends the

appropriate protocol messages to get the resource that the person wants.

URLs are actually a subset of a more general class of resource identifier called a uniform resource identifier, or URI. URIs are a general concept made up of two main subsets, URLs and URNs. URLs identify resources by describing where resources are located, whereas URNs (which we’ll cover later in this chapter) identify resources by name, regardless of where they currently reside.

The HTTP specification uses the more general concept of URIs as its resource identifiers; in practice, however, HTTP applications deal only with the URL subset of URIs. Throughout this book, we’ll sometimes refer to URIs and URLs interchangeably, but we’re almost always talking about URLs.

Say you want to fetch the URL http://www.joes-hardware.com/seasonal/index-fall.html:

• The first part of the URL (http) is the URL scheme. The scheme tells a web client

how to access the resource. In this case, the URL says to use the HTTP protocol.

• The second part of the URL (www.joes-hardware.com) is the server location.

This tells the web client where the resource is hosted.

• The third part of the URL (/seasonal/index-fall.html) is the resource path. The

path tells what particular local resource on the server is being requested.
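
As a rough illustration of the three parts listed above, Python's standard urllib.parse module can split the example URL into its scheme, server location, and path. This is only a sketch; the variable names are ours and are not part of the URL.

```python
from urllib.parse import urlsplit

url = "http://www.joes-hardware.com/seasonal/index-fall.html"
parts = urlsplit(url)

scheme = parts.scheme   # how to access the resource
host = parts.netloc     # where the resource is hosted
path = parts.path       # which local resource on that server is requested

print(scheme, host, path)
```

A web client does essentially this decomposition before it decides which protocol to speak and which server to contact.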

See Figure 2-1 for an illustration.

Figure 2-1. How URLs relate to browser, machine, server, and location on the server’s filesystem (in http://www.joes-hardware.com/seasonal/index-fall.html, the scheme says how, the host says where, and the path says what)

URLs can direct you to resources available through protocols other than HTTP. They can point you to any resource on the Internet, from a person’s email account:

mailto:president@whitehouse.gov

to files that are available through other protocols, such as the File Transfer Protocol (FTP):

ftp://ftp.lots-o-books.com/pub/complete-price-list.xls

to movies hosted on streaming video servers:

rtsp://www.joes-hardware.com:554/interview/cto_video

URLs provide a way to uniformly name resources. Most URLs have the same “scheme://server location/path” structure. So, for every resource out there and every way to get those resources, you have a single way to name each resource so that anyone can use that name to find it. However, this wasn’t always the case.

The Dark Days Before URLs

Before the Web and URLs, people relied on a rag-tag assortment of applications to

access data distributed throughout the Net. Most people were not lucky enough to

have all the right applications or were not savvy and patient enough to use them.

Before URLs came along, if you wanted to share the complete-catalog.xls file with a friend, you would have had to say something like this: “Use FTP to connect to ftp.joes-hardware.com. Log in as anonymous. Then type your username as the password. Change to the pub directory. Switch to binary mode. Now download the file named complete-catalog.xls to your local filesystem and view it there.”

Today, browsers such as Netscape Navigator and Microsoft Internet Explorer bundle much of this functionality into one convenient package. Using URLs, these applications are able to access many resources in a uniform way, through one interface.

Instead of the complicated instructions above, you could just say “Point your browser at ftp://ftp.joes-hardware.com/pub/complete-catalog.xls.”

URLs give applications a uniform way to know how to access a resource. In fact, many users are probably unaware of the protocols and access methods their browsers use to get the resources they are requesting.

With web browsers, you no longer need a news reader to read Internet news or an FTP client to access files on FTP servers. You don’t need an electronic mail program to send and receive email messages. URLs have helped to simplify the online world, by allowing the browser to be smart about how to access and handle resources.* Applications can use URLs to simplify access to information.

URLs give you and your browser all you need to find a piece of information. They define the particular resource you want, where it is located, and how to get it.

* Browsers often use other applications to handle specific resources. For example, Internet Explorer launches an email application to handle URLs that identify email resources.

URL Syntax

URLs provide a means of locating any resource on the Internet, but these resources can be accessed by different schemes (e.g., HTTP, FTP, SMTP), and URL syntax varies from scheme to scheme.

Does this mean that each different URL scheme has a radically different syntax? In

practice, no. Most URLs adhere to a general URL syntax, and there is significant

overlap in the style and syntax between different URL schemes.

Most URL schemes base their URL syntax on this nine-part general format:

<scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>

Almost no URLs contain all these components. The three most important parts of a

URL are the scheme, the host, and the path. Table 2-1 summarizes the various

components.

For example, consider the URL http://www.joes-hardware.com:80/index.html. The

scheme is “http”, the host is “www.joes-hardware.com”, the port is “80”, and the

path is “/index.html”.
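
To make the component breakdown concrete, here is a minimal sketch that uses Python's standard urllib.parse.urlparse to pull the nine general components out of a URL. The decompose helper and the second, fully loaded URL are hypothetical and exist only for illustration. Note that urlparse only separates params from the last path segment.

```python
from urllib.parse import urlparse

def decompose(url: str) -> dict:
    """Split a URL into the nine general components.

    Components that are absent come back as None or as an empty string.
    """
    p = urlparse(url)
    return {
        "scheme": p.scheme,
        "user": p.username,
        "password": p.password,
        "host": p.hostname,
        "port": p.port,
        "path": p.path,
        "params": p.params,
        "query": p.query,
        "frag": p.fragment,
    }

simple = decompose("http://www.joes-hardware.com:80/index.html")

# Hypothetical URL that exercises all nine components at once.
full = decompose(
    "http://joe:joespasswd@www.joes-hardware.com:80"
    "/tools.html;graphics=true?item=12731#drills"
)

for name, value in full.items():
    print(f"{name:8} {value!r}")
```

As the text says, almost no real URLs carry all nine components; most of the entries in simple are empty.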

Table 2-1. General URL components

Component | Description | Default value

scheme | Which protocol to use when accessing a server to get a resource. | None

user | The username some schemes require to access a resource. | anonymous

password | The password that may be included after the username, separated by a colon (:). | <Email address>

host | The hostname or dotted IP address of the server hosting the resource. | None

port | The port number on which the server hosting the resource is listening. Many schemes have default port numbers (the default port number for HTTP is 80). | Scheme-specific

path | The local name for the resource on the server, separated from the previous URL components by a slash (/). The syntax of the path component is server- and scheme-specific. (We will see later in this chapter that a URL’s path can be divided into segments, and each segment can have its own components specific to that segment.) | None

params | Used by some schemes to specify input parameters. Params are name/value pairs. A URL can contain multiple params fields, separated from themselves and the rest of the path by semicolons (;). | None

query | Used by some schemes to pass parameters to active applications (such as databases, bulletin boards, search engines, and other Internet gateways). There is no common format for the contents of the query component. It is separated from the rest of the URL by the “?” character. | None

frag | A name for a piece or part of the resource. The frag field is not passed to the server when referencing the object; it is used internally by the client. It is separated from the rest of the URL by the “#” character. | None

Schemes: What Protocol to Use

The scheme is really the main identifier of how to access a given resource; it tells the

application interpreting the URL what protocol it needs to speak. In our simple

HTTP URL, the scheme is simply “http”.

The scheme component must start with an alphabetic character, and it is separated from the rest of the URL by the first “:” character. Scheme names are case-insensitive, so the URLs “http://www.joes-hardware.com” and “HTTP://www.joes-hardware.com” are equivalent.
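
The two scheme rules above (leading alphabetic character, case-insensitive comparison) can be sketched in a few lines of Python. The scheme_of helper is hypothetical. The set of characters allowed after the first letter (letters, digits, “+”, “-”, and “.”) is taken from RFC 2396 rather than from this chapter.

```python
import re

# A scheme starts with a letter and runs up to the first ":".
# The characters allowed after the first letter follow RFC 2396.
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

def scheme_of(url: str):
    """Return the lowercased scheme of a URL, or None if it has no valid scheme."""
    match = SCHEME_RE.match(url)
    return match.group(1).lower() if match else None

print(scheme_of("http://www.joes-hardware.com"))
print(scheme_of("HTTP://www.joes-hardware.com"))
```

Lowercasing the scheme before comparing is what makes the two spellings equivalent.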

Hosts and Ports

To find a resource on the Internet, an application needs to know what machine is

hosting the resource and where on that machine it can find the server that has access

to the desired resource. The host and port components of the URL provide these two

pieces of information.

The host component identifies the host machine on the Internet that has access to the resource. The name can be provided as a hostname, as above (“www.joes-hardware.com”), or as an IP address. For example, the following two URLs point to the same resource: the first refers to the server by its hostname and the second by its IP address:

http://www.joes-hardware.com:80/index.html

http://161.58.228.45:80/index.html

The port component identifies the network port on which the server is listening. For

HTTP, which uses the underlying TCP protocol, the default port is 80.
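
A client that receives a URL without an explicit port has to fall back on the scheme's default. The sketch below shows one way to do that in Python. The host_and_port helper and the DEFAULT_PORTS table are our own assumptions; only the HTTP default of 80 comes from the text, and the FTP and HTTPS entries are the well-known defaults for those schemes.

```python
from urllib.parse import urlsplit

# Assumed table of default ports; only the "http" entry comes from this chapter.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def host_and_port(url: str):
    """Return (host, port), using the scheme's default port when none is given."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.hostname, port

print(host_and_port("http://www.joes-hardware.com/index.html"))
print(host_and_port("http://161.58.228.45:80/index.html"))
```

An explicit port always wins over the default, which is why the rtsp URL shown earlier in the chapter can name port 554 directly.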

Usernames and Passwords

More interesting components are the user and password components. Many servers

require a username and password before you can access data through them. FTP

servers are a common example of this. Here are a few examples:

ftp://ftp.prep.ai.mit.edu/pub/gnu

ftp://anonymous@ftp.prep.ai.mit.edu/pub/gnu

ftp://anonymous:my_passwd@ftp.prep.ai.mit.edu/pub/gnu

http://joe:joespasswd@www.joes-hardware.com/sales_info.txt

The first example has no user or password component, just our standard scheme, host, and path. If an application is using a URL scheme that requires a username and password, such as FTP, it generally will insert a default username and password if they aren’t supplied. For example, if you hand your browser an FTP URL without specifying a username and password, it will insert “anonymous” for your username and send a default password (Internet Explorer sends “IEUser”, while Netscape Navigator sends “mozilla”).

The second example shows a username being specified as “anonymous”. This username, combined with the host component, looks just like an email address. The “@” character separates the user and password components from the rest of the URL.

In the third example, both a username (“anonymous”) and password (“my_passwd”) are specified, separated by the “:” character.
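
The defaulting behavior described above can be sketched as follows. The credentials helper and its default password are hypothetical; real browsers each choose their own default, as noted in the text.

```python
from urllib.parse import urlsplit

def credentials(url: str, default_user: str = "anonymous",
                default_password: str = "guest@example.com"):
    """Return (user, password), filling in defaults when the URL omits them.

    The default password here is a made-up placeholder.
    """
    parts = urlsplit(url)
    user = parts.username or default_user
    password = parts.password or default_password
    return user, password

for url in (
    "ftp://ftp.prep.ai.mit.edu/pub/gnu",
    "ftp://anonymous@ftp.prep.ai.mit.edu/pub/gnu",
    "ftp://anonymous:my_passwd@ftp.prep.ai.mit.edu/pub/gnu",
    "http://joe:joespasswd@www.joes-hardware.com/sales_info.txt",
):
    print(url, credentials(url))
```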

Paths

The path component of the URL specifies where on the server machine the resource

lives. The path often resembles a hierarchical filesystem path. For example:

http://www.joes-hardware.com:80/seasonal/index-fall.html

The path in this URL is “/seasonal/index-fall.html”, which resembles a filesystem path on a Unix filesystem. The path is the information that the server needs to locate the resource.* The path component for HTTP URLs can be divided into path segments separated by “/” characters (again, as in a file path on a Unix filesystem). Each path segment can have its own params component.

Parameters

For many schemes, a simple host and path to the object just aren’t enough. Aside from what port the server is listening on and even whether or not you have access to the resource with a username and password, many protocols require more information to work.

Applications interpreting URLs need these protocol parameters to access the resource. Otherwise, the server on the other side might not service the request or, worse yet, might service it wrong. For example, take a protocol like FTP, which has two modes of transfer, binary and text. You wouldn’t want your binary image transferred in text mode, because the binary image could be scrambled.

To give applications the input parameters they need in order to talk to the server correctly, URLs have a params component. This component is just a list of name/value pairs in the URL, separated from the rest of the URL (and from each other) by “;” characters. They provide applications with any additional information that they need to access the resource. For example:

ftp://prep.ai.mit.edu/pub/gnu;type=d

In this example, there is one param, type=d, where the name of the param is “type”

and its value is “d”.

* This is a bit of a simplification. In “Virtual Hosting” in Chapter 18, we will see that the path is not always enough information to locate a resource. Sometimes a server needs additional information.

As we mentioned earlier, the path component for HTTP URLs can be broken into

path segments. Each segment can have its own params. For example:

http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true

In this example there are two path segments, hammers and index.html. The hammers

path segment has the param sale, and its value is false. The index.html segment has

the param graphics, and its value is true.
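
Python's urlparse only separates params from the last path segment, so per-segment params need a little manual work. The following is a minimal sketch under that assumption; segment_params is a hypothetical helper, and it does no percent-decoding.

```python
from urllib.parse import urlsplit

def segment_params(url: str):
    """Return a list of (segment, params) pairs for each path segment of a URL."""
    path = urlsplit(url).path
    result = []
    for segment in path.strip("/").split("/"):
        # Everything before the first ";" is the segment name; the rest are params.
        name, *raw_params = segment.split(";")
        params = dict(p.partition("=")[::2] for p in raw_params)
        result.append((name, params))
    return result

url = "http://www.joes-hardware.com/hammers;sale=false/index.html;graphics=true"
for name, params in segment_params(url):
    print(name, params)
```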

Query Strings

Some resources, such as database services, can be asked questions or queries to narrow down the type of resource being requested.

Let’s say Joe’s Hardware store maintains a list of unsold inventory in a database and allows the inventory to be queried, to see whether products are in stock. The following URL might be used to query a web database gateway to see if item number 12731 is available:

http://www.joes-hardware.com/inventory-check.cgi?item=12731

For the most part, this resembles the other URLs we have looked at. What is new is

everything to the right of the question mark (?). This is called the query component.

The query component of the URL is passed along to a gateway resource, with the

path component of the URL identifying the gateway resource. Basically, gateways

can be thought of as access points to other applications (we discuss gateways in

detail in Chapter 8).

Figure 2-2 shows an example of a query component being passed to a server that is

acting as a gateway to Joe’s Hardware’s inventory-checking application. The query is

checking whether a particular item, 12731, is in inventory in size large and color

blue.

Figure 2-2. The URL query component is sent along to the gateway application (the client requests http://www.joes-hardware.com/inventory-check.cgi?item=12731&color=blue&size=large, and the server passes item=12731&color=blue&size=large to the “inventory-check” gateway)

There is no requirement for the format of the query component, except that some characters are illegal, as we’ll see later in this chapter. By convention, many gateways expect the query string to be formatted as a series of “name=value” pairs, separated by “&” characters:

http://www.joes-hardware.com/inventory-check.cgi?item=12731&color=blue

In this example, there are two name/value pairs in the query component: item=12731

and color=blue.
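
On the receiving side, a gateway that follows the “name=value” convention has to split the query string back into pairs. Here is a minimal sketch using Python's standard urllib.parse.parse_qsl; the variable names are ours.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://www.joes-hardware.com/inventory-check.cgi?item=12731&color=blue"

parts = urlsplit(url)
gateway = parts.path               # identifies the gateway resource
pairs = parse_qsl(parts.query)     # name/value pairs, in the order they appear

print(gateway)
for name, value in pairs:
    print(name, "=", value)
```

Because the format of the query component is only a convention, a gateway is free to interpret the raw query string in some other way.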

Fragments

Some resource types, such as HTML, can be divided further than just the resource

level. For example, for a single, large text document with sections in it, the URL for

the resource would point to the entire text document, but ideally you could specify

the sections within the resource.

To allow referencing of parts or fragments of a resource, URLs support a frag component to identify pieces within a resource. For example, a URL could point to a particular image or section within an HTML document.

A fragment dangles off the right-hand side of a URL, preceded by a # character. For example:

http://www.joes-hardware.com/tools.html#drills

In this example, the fragment drills references a portion of the /tools.html web page

located on the Joe’s Hardware web server. The portion is named “drills”.

Because HTTP servers generally deal only with entire objects,* not with fragments of objects, clients don’t pass fragments along to servers (see Figure 2-3). After your browser gets the entire resource from the server, it then uses the fragment to display the part of the resource in which you are interested.
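
A client therefore has to strip the fragment before building its request and keep it for later. Python's standard urllib.parse.urldefrag does exactly this split; the sketch below only illustrates the idea and makes no network request.

```python
from urllib.parse import urldefrag

link = "http://www.joes-hardware.com/tools.html#drills"

# The first value is what the client asks the server for;
# the second stays on the client and is applied after the page arrives.
request_url, fragment = urldefrag(link)

print(request_url)
print(fragment)
```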

URL Shortcuts

Web clients understand and use a few URL shortcuts. Relative URLs are a convenient

shorthand for specifying a resource within a resource. Many browsers also support

“automatic expansion” of URLs, where the user can type in a key (memorable) part of

a URL, and the browser fills in the rest. This is explained in “Expandomatic URLs.”

Relative URLs

URLs come in two flavors: absolute and relative. So far, we have looked only at absolute URLs. With an absolute URL, you have all the information you need to access a resource.

* In “Range Requests” in Chapter 15, we will see that HTTP agents may request byte ranges of objects. However, in the context of URL fragments, the server sends the entire object and the agent applies the fragment identifier to the resource.

On the other hand, relative URLs are incomplete. To get all the information needed to access a resource from a relative URL, you must interpret it relative to another URL, called its base.

Relative URLs are a convenient shorthand notation for URLs. If you have ever written HTML by hand, you have probably found them to be a great shortcut. Example 2-1 contains an example HTML document with an embedded relative URL.

In Example 2-1, we have an HTML document for the resource:

http://www.joes-hardware.com/tools.html

In the HTML document, there is a hyperlink containing the URL ./hammers.html. This URL seems incomplete, but it is a legal relative URL. It can be interpreted relative to the URL of the document in which it is found; in this case, relative to the resource /tools.html on the Joe’s Hardware web server.

Example 2-1. HTML snippet with relative URLs

<HTML>

<HEAD><TITLE>Joe's Tools</TITLE></HEAD>

<BODY>

<H1> Tools Page </H1>

<H2> Hammers </H2>

<P> Joe's Hardware Online has the largest selection of <A HREF="./hammers.html">hammers</A> on earth.

</BODY>

</HTML>

Figure 2-3. The URL fragment is used only by the client, because the server deals with entire objects (the user selects a link to http://www.joes-hardware.com/tools.html#drills; the browser requests http://www.joes-hardware.com/tools.html, without sending the fragment to the server, and then displays the page starting at the named “drills” fragment)

The abbreviated relative URL syntax lets HTML authors omit from URLs the scheme, host, and other components. These components can be inferred from the base URL of the resource they are in. URLs for other resources also can be specified in this shorthand.

In Example 2-1, our base URL is:

http://www.joes-hardware.com/tools.html

Using this URL as a base, we can infer the missing information. We know the

resource is ./hammers.html, but we don’t know the scheme or host. Using the base

URL, we can infer that the scheme is http and the host is www.joes-hardware.com.

Figure 2-4 illustrates this.

Relative URLs are only fragments or pieces of URLs. Applications that process URLs

(such as your browser) need to be able to convert between relative and absolute

URLs.

It is also worth noting that relative URLs provide a convenient way to keep a set of resources (such as HTML pages) portable. If you use relative URLs, you can move a set of documents around and still have their links work, because they will be interpreted relative to the new base. This allows for things like mirroring content on other servers.
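
The portability claim is easy to check with Python's standard urllib.parse.urljoin. In this sketch, the mirror hostname and its /joes/ directory are hypothetical; the point is only that the same relative link resolves correctly against either base.

```python
from urllib.parse import urljoin

relative_link = "./hammers.html"

original_base = "http://www.joes-hardware.com/tools.html"
# Hypothetical mirror of the same documents on another server.
mirror_base = "http://mirror.example.com/joes/tools.html"

original_target = urljoin(original_base, relative_link)
mirror_target = urljoin(mirror_base, relative_link)

print(original_target)
print(mirror_target)
```

The HTML itself never changes; only the base it is interpreted against does.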

Base URLs

The first step in the conversion process is to find a base URL. The base URL serves as

a point of reference for the relative URL. It can come from a few places:

Explicitly provided in the resource

Some resources explicitly specify the base URL. An HTML document, for example, may include a <BASE> HTML tag defining the base URL by which to convert all relative URLs in that HTML document.

Base URL of the encapsulating resource

If a relative URL is found in a resource that does not explicitly specify a base URL, as in Example 2-1, it can use the URL of the resource in which it is embedded as a base (as we did in our example).

No base URL

In some instances, there is no base URL. This often means that you have an absolute URL; however, sometimes you may just have an incomplete or broken URL.

Figure 2-4. Using a base URL (the base URL http://www.joes-hardware.com/tools.html combined with the relative URL ./hammers.html gives the new absolute URL http://www.joes-hardware.com/hammers.html)
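
The first two sources of a base URL can be sketched with Python's standard html.parser module. The BaseFinder class, the base_url_for helper, and the /catalog/ base in the sample markup are all hypothetical; a real browser handles many more cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class BaseFinder(HTMLParser):
    """Remember the HREF of the first <BASE> tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")

def base_url_for(document_url: str, html: str) -> str:
    """Prefer an explicit <BASE>; otherwise use the document's own URL."""
    finder = BaseFinder()
    finder.feed(html)
    return finder.base_href or document_url

DOC_URL = "http://www.joes-hardware.com/tools.html"
with_base = '<HTML><HEAD><BASE HREF="http://www.joes-hardware.com/catalog/"></HEAD></HTML>'
without_base = "<HTML><HEAD><TITLE>Joe's Tools</TITLE></HEAD></HTML>"

print(urljoin(base_url_for(DOC_URL, with_base), "./hammers.html"))
print(urljoin(base_url_for(DOC_URL, without_base), "./hammers.html"))
```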

Resolving relative references

Previously, we showed the basic components and syntax of URLs. The next step in

converting a relative URL into an absolute one is to break up both the relative and

base URLs into their component pieces.

In effect, you are just parsing the URL, but this is often called decomposing the URL,

because you are breaking it up into its components. Once you have broken the base

and relative URLs into their components, you can then apply the algorithm pictured

in Figure 2-5 to finish the conversion.

Figure 2-5. Converting relative to absolute URLs (flowchart: starting from the parsed relative URL, examine in turn the scheme; the user, password, host, and port; the path; the params; and the query, inheriting components from the base URL where the relative URL leaves them empty; then combine the inherited and relative components into the new absolute URL)

This algorithm converts a relative URL to its absolute form, which can then be used

to reference the resource. This algorithm was originally specified in RFC 1808 and

later incorporated into RFC 2396.

With our ./hammers.html example from Example 2-1, we can apply the algorithm

depicted in Figure 2-5:

1. Path is ./hammers.html; base URL is http://www.joes-hardware.com/tools.html.

2. Scheme is empty; proceed down left half of chart and inherit the base URL scheme (HTTP).

3. At least one component is non-empty; proceed to bottom, inheriting host and port components.

4. Combining the components we have from the relative URL (path: ./hammers.html) with what we have inherited (scheme: http, host: www.joes-hardware.com, port: 80), we get our new absolute URL: http://www.joes-hardware.com/hammers.html.
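
The steps above can be sketched in Python. This is a simplified illustration, not the full algorithm from the RFCs: the resolve and remove_dot_segments helpers are our own, the user, password, host, and port are handled together as one unit, params are not treated separately, and several edge cases are ignored. In practice, the standard urllib.parse.urljoin function does this work.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path: str) -> str:
    """Drop "." segments and collapse "<seg>/.." pairs (simplified)."""
    kept = []
    for segment in path.split("/"):
        if segment == ".":
            continue
        if segment == "..":
            if len(kept) > 1:      # never pop the leading empty segment
                kept.pop()
            continue
        kept.append(segment)
    return "/".join(kept)

def resolve(base: str, relative: str) -> str:
    """Convert a relative URL to absolute form against a base URL (simplified)."""
    b, r = urlsplit(base), urlsplit(relative)
    if r.scheme:                   # nonempty scheme: the URL is already absolute
        return relative
    if r.netloc:                   # has its own host: inherit only the scheme
        return urlunsplit((b.scheme, r.netloc, r.path, r.query, r.fragment))
    if not r.path:                 # empty path: inherit the path, then maybe the query
        return urlunsplit((b.scheme, b.netloc, b.path, r.query or b.query, r.fragment))
    if r.path.startswith("/"):     # absolute path: use it as is
        path = r.path
    else:                          # relative path: merge with the base directory
        base_dir = b.path.rsplit("/", 1)[0] + "/"
        path = remove_dot_segments(base_dir + r.path)
    return urlunsplit((b.scheme, b.netloc, path, r.query, r.fragment))

BASE = "http://www.joes-hardware.com/tools.html"
print(resolve(BASE, "./hammers.html"))
```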

Expandomatic URLs

Some browsers try to expand URLs automatically, either after you submit the URL

or while you’re typing. This provides users with a shortcut: they don’t have to type in

the complete URL, because it automatically expands itself.

These “expandomatic” features come in two flavors:

Hostname expansion

In hostname expansion, the browser can often expand the hostname you type in into the full hostname without your help, just by using some simple heuristics.

For example, if you type “yahoo” in the address box, your browser can automatically insert “www.” and “.com” onto the hostname, creating “www.yahoo.com”. Some browsers will try this if they are unable to find a site that matches “yahoo”, trying a few expansions before giving up. Browsers apply these simple tricks to save you some time and frustration.

However, these expansion tricks on hostnames can cause problems for other HTTP applications, such as proxies. In Chapter 6, we will discuss these problems in more detail.

History expansion

Another technique that browsers use to save you time typing URLs is to store a

history of the URLs that you have visited in the past. As you type in the URL,

they can offer you completed choices to select from by matching what you type

to the prefixes of the URLs in your history. So, if you were typing in the start of a

URL that you had visited previously, such as http://www.joes-, your browser

could suggest http://www.joes-hardware.com. You could then select that instead

of typing out the complete URL.


Be aware that URL auto-expansion may behave differently when used with proxies.

We discuss this further in “URI Client Auto-Expansion and Hostname Resolution”

in Chapter 6.
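
Both kinds of expansion are simple heuristics, and a toy version fits in a few lines of Python. The function names, the order of candidates, and the sample history below are our own assumptions; real browsers use their own, more elaborate rules.

```python
def hostname_candidates(typed: str) -> list:
    """Return hostnames to try, in order, for what the user typed."""
    if "." in typed:
        return [typed]                         # already looks like a full hostname
    return [typed, "www." + typed + ".com"]    # try as typed, then expand

def history_matches(prefix: str, history: list) -> list:
    """Return previously visited URLs that start with what has been typed so far."""
    return [url for url in history if url.startswith(prefix)]

# Hypothetical browsing history.
history = [
    "http://www.joes-hardware.com",
    "http://www.w3.org",
]

print(hostname_candidates("yahoo"))
print(history_matches("http://www.joes-", history))
```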

Shady Characters

URLs were designed to be portable. They were also designed to uniformly name all

the resources on the Internet, which means that they will be transmitted through

various protocols. Because all of these protocols have different mechanisms for

transmitting their data, it was important for URLs to be designed so that they could

be transmitted safely through any Internet protocol.

Safe transmission means that URLs can be transmitted without the risk of losing information. Some protocols, such as the Simple Mail Transfer Protocol (SMTP) for electronic mail, use transmission methods that can strip off certain characters.* To get around this, URLs are permitted to contain only characters from a relatively small, universally safe alphabet.

In addition to wanting URLs to be transportable by all Internet protocols, designers wanted them to be readable by people. So invisible, nonprinting characters also are prohibited in URLs, even though these characters may pass through mailers and otherwise be portable.†

To complicate matters further, URLs also need to be complete. URL designers realized there would be times when people would want URLs to contain binary data or characters outside of the universally safe alphabet. So, an escape mechanism was added, allowing unsafe characters to be encoded into safe characters for transport.

This section summarizes the universal alphabet and encoding rules for URLs.

The URL Character Set

Default computer system character sets often have an Anglocentric bias. Historically, many computer applications have used the US-ASCII character set. US-ASCII uses 7 bits to represent most keys available on an English typewriter and a few nonprinting control characters for text formatting and hardware signalling.

US-ASCII is very portable, due to its long legacy. But while it’s convenient to citizens of

the U.S., it doesn’t support the inflected characters common in European languages or

the hundreds of non-Romanic languages read by billions of people around the world.

* This is caused by the use of a 7-bit encoding for messages; this can strip off information if the source is encoded in 8 bits or more.

† Nonprinting characters include whitespace (note that RFC 2396 recommends that applications ignore whitespace).

Furthermore, some URLs may need to contain arbitrary binary data. Recognizing the

need for completeness, the URL designers have incorporated escape sequences.

Escape sequences allow the encoding of arbitrary character values or data using a

restricted subset of the US-ASCII character set, yielding portability and completeness.

Encoding Mechanisms

To get around the limitations of a safe character set representation, an encoding

scheme was devised to represent characters in a URL that are not safe. The encoding

simply represents the unsafe character by an “escape” notation, consisting of a per-

cent sign (%) followed by two hexadecimal digits that represent the ASCII code of

the character.

Table 2-2 shows a few examples.
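The escape notation is easy to reproduce by hand. The following is a minimal Python sketch of the mechanism just described, not a production encoder; the function names escape_char and unescape are our own, chosen for illustration, and the sketch assumes 7-bit US-ASCII input.

```python
import re


def escape_char(ch):
    """Return the %XX escape for one US-ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("this sketch handles only 7-bit US-ASCII characters")
    # A percent sign followed by exactly two hexadecimal digits.
    return "%{:02X}".format(code)


def unescape(text):
    """Replace every %XX sequence in text with the character it encodes."""
    return re.sub(
        r"%([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


print(escape_char("~"))                      # %7E
print(escape_char(" "))                      # %20
print(unescape("/100%25satisfaction.html"))  # /100%satisfaction.html
```

Real applications usually rely on a library routine instead of hand-rolled code like this; the point here is only that an escape is nothing more than the character's code written as two hex digits.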

Character Restrictions

Several characters have been reserved to have special meaning inside of a URL. Oth-

ers are not in the defined US-ASCII printable set. And still others are known to con-

fuse some Internet gateways and protocols, so their use is discouraged.

Table 2-3 lists characters that should be encoded in a URL before you use them for

anything other than their reserved purposes.

Table 2-2. Some encoded character examples

Character   ASCII code   Example URL
~           126 (0x7E)   http://www.joes-hardware.com/%7Ejoe
SPACE       32 (0x20)    http://www.joes-hardware.com/more%20tools.html
%           37 (0x25)    http://www.joes-hardware.com/100%25satisfaction.html

Table 2-3. Reserved and restricted characters

Character   Reservation/Restriction
%           Reserved as escape token for encoded characters
/           Reserved for delimiting path segments in the path component
.           Reserved in the path component
..          Reserved in the path component
#           Reserved as the fragment delimiter
?           Reserved as the query-string delimiter
;           Reserved as the params delimiter
:           Reserved to delimit the scheme, user/password, and host/port components
$ , +       Reserved
@ & =       Reserved because they have special meaning in the context of some schemes

A Bit More

You might be wondering why nothing bad has happened when you have used char-

acters that are unsafe. For instance, you can visit Joe’s home page at:

http://www.joes-hardware.com/~joe

and not encode the “~” character. For some transport protocols this is not an issue,

but it is still unwise for application developers not to encode unsafe characters.

Applications need to walk a fine line. It is best for client applications to convert any

unsafe or restricted characters before sending any URL to any other application.*

Once all the unsafe characters have been encoded, the URL is in a canonical form

that can be shared between applications; there is no need to worry about the other

application getting confused by any of the characters’ special meanings.

The original application that gets the URL from the user is best suited to determine which characters need to be encoded. Because each component of the URL may have its own safe and unsafe characters, and which characters are safe or unsafe is scheme-dependent, only the application receiving the URL from the user really is in a position to determine what needs to be encoded.
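To make the per-component point concrete, here is a hedged Python sketch that encodes the path and the query string separately, using the standard library's urllib.parse. The helper name build_url and the sample query are assumptions made for this illustration, and the sketch covers only http URLs.

```python
from urllib.parse import quote, urlencode


def build_url(host, path, query):
    """Assemble an http URL, encoding each component by its own rules."""
    # "/" stays literal in the path because it delimits path segments.
    encoded_path = quote(path, safe="/")
    # In the query, "&" and "=" are delimiters, so they are escaped inside
    # names and values. quote_via=quote yields %20 for spaces, not "+".
    encoded_query = urlencode(query, quote_via=quote)
    url = "http://" + host + encoded_path
    if encoded_query:
        url += "?" + encoded_query
    return url


print(build_url("www.joes-hardware.com", "/more tools.html",
                {"item": "saw & blade"}))
```

The "&" inside the query value has to be escaped because it would otherwise be read as a separator between query parameters, while the "/" in the path is left alone because there it is doing its reserved job.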

Of course, the other extreme is for the application to encode all characters. While this

is not recommended, there is no hard and fast rule against encoding characters that are

considered safe already; however, in practice this can lead to odd and broken behav-

ior, because some applications may assume that safe characters will not be encoded.

Sometimes, malicious folks encode extra characters in an attempt to get around

applications that are doing pattern matching on URLs—for example, web filtering

applications. Encoding safe URL components can cause pattern-matching applica-

tions to fail to recognize the patterns for which they are searching. In general, appli-

cations interpreting URLs must decode the URLs before processing them.
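A short Python sketch shows why decoding has to come first. The blocked pattern and both function names are hypothetical, invented for this example; unquote is the standard library's percent-decoder.

```python
from urllib.parse import unquote

# Hypothetical pattern that a web filtering application wants to catch.
BLOCKED_FRAGMENT = "/forbidden/"


def naive_match(url):
    """Match against the raw URL, exactly as it arrived."""
    return BLOCKED_FRAGMENT in url


def decoded_match(url):
    """Decode %XX escapes first, then match."""
    return BLOCKED_FRAGMENT in unquote(url)


# "%66" is the escape for the safe character "f".
disguised = "http://www.joes-hardware.com/%66orbidden/tools.html"
print(naive_match(disguised))    # False: the pattern is hidden by the escape
print(decoded_match(disguised))  # True: decoding reveals it
```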

Table 2-3. Reserved and restricted characters (continued)

Character           Reservation/Restriction
{ } | \ ^ ~ [ ] `   Restricted because of unsafe handling by various transport agents, such as gateways
< > "               Unsafe; should be encoded because these characters often have meaning outside the scope of the URL, such as delimiting the URL itself in a document (e.g., “http://www.joes-hardware.com”)
0x00–0x1F, 0x7F     Restricted; characters within these hex ranges fall within the nonprintable section of the US-ASCII character set
> 0x7F              Restricted; characters whose hex values fall within this range do not fall within the 7-bit range of the US-ASCII character set

* Here we are specifically talking about client applications, not other HTTP intermediaries, like proxies. In “In-Flight URI Modification” in Chapter 6, we discuss some of the problems that can arise when proxies or other intermediary HTTP applications attempt to change (e.g., encode) URLs on the behalf of a client.

Some URL components, such as the scheme, need to be recognized readily and are

required to start with an alphabetic character. Refer back to “URL Syntax” for more

guidelines on the use of reserved and unsafe characters within different URL

components.*

A Sea of Schemes

In this section, we’ll take a look at the more common scheme formats on the Web.

Appendix A gives a fairly exhaustive list of schemes and references to their individ-

ual documentation.

Table 2-4 summarizes some of the most popular schemes. Reviewing “URL Syntax”

will make the syntax portion of the table a little more familiar.

* Table 2-3 lists reserved characters for the various URL components. In general, encoding should be limited

to those characters that are unsafe for transport.

Table 2-4. Common scheme formats

http
    The Hypertext Transfer Protocol scheme conforms to the general URL format, except that there is no username or password. The port defaults to 80 if omitted.
    Basic form:
        http://<host>:<port>/<path>?<query>#<frag>
    Examples:
        http://www.joes-hardware.com/index.html
        http://www.joes-hardware.com:80/index.html

https
    The https scheme is a twin to the http scheme. The only difference is that the https scheme uses Netscape’s Secure Sockets Layer (SSL), which provides end-to-end encryption of HTTP connections. Its syntax is identical to that of HTTP, with a default port of 443.
    Basic form:
        https://<host>:<port>/<path>?<query>#<frag>
    Example:
        https://www.joes-hardware.com/secure.html

mailto
    Mailto URLs refer to email addresses. Because email behaves differently from other schemes (it does not refer to objects that can be accessed directly), the format of a mailto URL differs from that of the standard URL. The syntax for Internet email addresses is documented in Internet RFC 822.
    Basic form:
        mailto:<RFC-822-addr-spec>
    Example:
        mailto:joe@joes-hardware.com

ftp
    File Transfer Protocol URLs can be used to download and upload files on an FTP server and to obtain listings of the contents of a directory structure on an FTP server.
    FTP has been around since before the advent of the Web and URLs. Web applications have assimilated FTP as a data-access scheme. The URL syntax follows the general form.
    Basic form:
        ftp://<user>:<password>@<host>:<port>/<path>;<params>
    Example:
        ftp://anonymous:joe%40joes-hardware.com@prep.ai.mit.edu:21/pub/gnu/

rtsp, rtspu
    RTSP URLs are identifiers for audio and video media resources that can be retrieved through the Real Time Streaming Protocol.
    The “u” in the rtspu scheme denotes that the UDP protocol is used to retrieve the resource.
    Basic forms:
        rtsp://<user>:<password>@<host>:<port>/<path>
        rtspu://<user>:<password>@<host>:<port>/<path>
    Example:
        rtsp://www.joes-hardware.com:554/interview/cto_video

file
    The file scheme denotes files directly accessible on a given host machine (by local disk, a network filesystem, or some other file-sharing system). The fields follow the general form. If the host is omitted, it defaults to the local host from which the URL is being used.
    Basic form:
        file://<host>/<path>
    Example:
        file://OFFICE-FS/policies/casual-fridays.doc

news
    The news scheme is used to access specific articles or newsgroups, as defined by RFC 1036. It has the unusual property that a news URL in itself does not contain sufficient information to locate the resource.
    The news URL is missing information about where to acquire the resource: no hostname or machine name is supplied. It is the interpreting application’s job to acquire this information from the user. For example, in your Netscape browser, under the Options menu, you can specify your NNTP (news) server. This tells your browser what server to use when it has a news URL.
    News resources can be accessed from multiple servers. They are said to be location-independent, as they are not dependent on any one source for access.
    The “@” character is reserved within a news URL and is used to distinguish between news URLs that refer to newsgroups and news URLs that refer to specific news articles.
    Basic forms:
        news:<newsgroup>
        news:<news-article-id>
    Example:
        news:rec.arts.startrek
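To see how the general form in Table 2-4 breaks apart in practice, here is a small Python sketch built on the standard library's urlsplit. The describe helper and the DEFAULT_PORTS table are our own names, and the table lists only the two defaults stated above (80 for http, 443 for https); treat it as an illustration, not a complete scheme registry.

```python
from urllib.parse import unquote, urlsplit

# Only the defaults stated in Table 2-4; other schemes are left out on purpose.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe(url):
    """Split a URL into the components named in the scheme's basic form."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        # Escapes in the password (such as %40 for "@") are decoded here.
        "password": unquote(parts.password) if parts.password else None,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
    }


print(describe("http://www.joes-hardware.com/index.html"))
print(describe("ftp://anonymous:joe%40joes-hardware.com@prep.ai.mit.edu:21/pub/gnu/"))
print(describe("mailto:joe@joes-hardware.com"))
```

Note how the mailto URL comes back with no host at all: everything after the colon lands in the path, which matches the table's remark that mailto does not follow the standard URL format.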

The Future

URLs are a powerful tool. Their design allows them to name all existing objects and

easily encompass new formats. They provide a uniform naming mechanism that can

be shared between Internet protocols.

However, they are not perfect. URLs are really addresses, not true names. This

means that a URL tells you where something is located, for the moment. It provides

you with the name of a specific server on a specific port, where you can find the

resource. The downfall of this scheme is that if the resource is moved, the URL is no

longer valid. And at that point, it provides no way to locate the object.

What would be ideal is if you had the real name of an object, which you could use to

look up that object regardless of its location. As with a person, given the name of the

resource and a few other facts, you could track down that resource, regardless of

where it moved.

The Internet Engineering Task Force (IETF) has been working on a new standard,

uniform resource names (URNs), for some time now, to address just this issue.

URNs provide a stable name for an object, regardless of where that object moves

(either inside a web server or across web servers).

Persistent uniform resource locators (PURLs) are an example of how URN functional-

ity can be achieved using URLs. The concept is to introduce another level of indirec-

tion in looking up a resource, using an intermediary resource locator server that

catalogues and tracks the actual URL of a resource. A client can request a persistent

URL from the locator, which can then respond with a resource that redirects the cli-

ent to the actual and current URL for the resource (see Figure 2-6). For more infor-

mation on PURLs, visit http://purl.oclc.org.
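The extra level of indirection can be sketched in a few lines of Python. This is only a toy model of the idea, not how purl.oclc.org is actually implemented: the table, the resolve function, and the choice of a 302 redirect with a Location header are all assumptions made for illustration.

```python
# Hypothetical lookup table kept by the resource locator server:
# persistent path -> the resource's actual, current URL.
PURL_TABLE = {
    "/jhardware/": "http://www.joes-hardware.com/",
}


def resolve(purl_path):
    """Return (status code, headers) for a request to the locator server."""
    target = PURL_TABLE.get(purl_path)
    if target is None:
        return 404, {}
    # Redirect the client to wherever the resource lives right now.
    return 302, {"Location": target}


print(resolve("/jhardware/"))
```

If the resource moves, only the table entry changes; the persistent URL that clients hold stays the same.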

If Not Now, When?

The ideas behind URNs have been around for some time. Indeed, if you look at the

publication dates for some of their specifications, you might ask yourself why they

have yet to be adopted.

Table 2-4. Common scheme formats (continued)

telnet
    The telnet scheme is used to access interactive services. It does not represent an object per se, but an interactive application (resource) accessible via the telnet protocol.
    Basic form:
        telnet://<user>:<password>@<host>:<port>/
    Example:
        telnet://slurp:webhound@joes-hardware.com:23/

The change from URLs to URNs is an enormous task. Standardization is a slow pro-

cess, often for good reason. Support for URNs will require many changes—consensus

from the standards bodies, modifications to various HTTP applications, etc. A tre-

mendous amount of critical mass is required to make such changes, and unfortu-

nately (or perhaps fortunately), there is so much momentum behind URLs that it will

be some time before all the stars align to make such a conversion possible.

Throughout the explosive growth of the Web, Internet users—everyone from com-

puter scientists to the average Internet user—have been taught to use URLs. While

they suffer from clumsy syntax (for the novice) and persistence problems, people have

learned how to use them and how to deal with their drawbacks. URLs have some lim-

itations, but they’re not the web development community’s most pressing problem.

Currently, and for the foreseeable future, URLs are the way to name resources on the

Internet. They are everywhere, and they have proven to be a very important part of

the Web’s success. It will be a while before any other naming scheme unseats URLs.

However, URLs do have their limitations, and it is likely that new standards (possi-

bly URNs) will emerge and be deployed to address some of these limitations.

For More Information

For more information on URLs, refer to:

http://www.w3.org/Addressing/

The W3C web page about naming and addressing URIs and URLs.

http://www.ietf.org/rfc/rfc1738

RFC 1738, “Uniform Resource Locators (URL),” by T. Berners-Lee, L. Masinter,

and M. McCahill.

Figure 2-6. PURLs use a resource locator server to name the current location of a resource
[Figure: Step 1: the client asks the resource resolver at purl.oclc.org what the Joe’s Hardware URL is (Get http://purl.oclc.org/jhardware/) and receives the current location of the resource, http://www.joes-hardware.com/. Step 2: the client gets the actual URL for the resource from www.joes-hardware.com (Get http://www.joes-hardware.com/).]

http://www.ietf.org/rfc/rfc2396.txt

RFC 2396, “Uniform Resource Identifiers (URI): Generic Syntax,” by T. Berners-

Lee, R. Fielding, and L. Masinter.

http://www.ietf.org/rfc/rfc2141.txt

RFC 2141, “URN Syntax,” by R. Moats.

http://purl.oclc.org

The persistent uniform resource locator web site.

http://www.ietf.org/rfc/rfc1808.txt

RFC 1808, “Relative Uniform Resource Locators,” by R. Fielding.

CHAPTER 3

HTTP Messages

If HTTP is the Internet’s courier, HTTP messages are the packages it uses to move

things around. In Chapter 1, we showed how HTTP programs send each other mes-

sages to get work done. This chapter tells you all about HTTP messages—how to

create them and how to understand them. After reading this chapter, you’ll know

most of what you need to know to write your own HTTP applications. In particular,

you’ll understand:

• How messages flow

• The three parts of HTTP messages (start line, headers, and entity body)

• The differences between request and response messages

• The various functions (methods) that request messages support

• The various status codes that are returned with response messages

• What the various HTTP headers do

The Flow of Messages

HTTP messages are the blocks of data sent between HTTP applications. These

blocks of data begin with some text meta-information describing the message con-

tents and meaning, followed by optional data. These messages flow between clients,

servers, and proxies. The terms “inbound,” “outbound,” “upstream,” and “down-

stream” describe message direction.

Messages Commute Inbound to the Origin Server

HTTP uses the terms inbound and outbound to describe transactional direction. Mes-

sages travel inbound to the origin server, and when their work is done, they travel

outbound back to the user agent (see Figure 3-1).


Messages Flow Downstream

HTTP messages flow like rivers. All messages flow downstream, regardless of whether

they are request messages or response messages (see Figure 3-2). The sender of any

message is upstream of the receiver. In Figure 3-2, proxy 1 is upstream of proxy 3 for

the request but downstream of proxy 3 for the response.*

The Parts of a Message

HTTP messages are simple, formatted blocks of data. Take a peek at Figure 3-3 for

an example. Each message contains either a request from a client or a response from

a server. They consist of three parts: a start line describing the message, a block of

headers containing attributes, and an optional body containing data.

The start line and headers are just ASCII text, broken up by lines. Each line ends with

a two-character end-of-line sequence, consisting of a carriage return (ASCII 13) and a

line-feed character (ASCII 10). This end-of-line sequence is written “CRLF.” It is

worth pointing out that while the HTTP specification for terminating lines is CRLF,

robust applications also should accept just a line-feed character. Some older or bro-

ken HTTP applications do not always send both the carriage return and line feed.

The entity body or message body (or just plain “body”) is simply an optional chunk

of data. Unlike the start line and headers, the body can contain text or binary data or

can be empty.

In the example in Figure 3-3, the headers give you a bit of information about the

body. The Content-Type line tells you what the body is—in this example, it is a

plain-text document. The Content-Length line tells you how big the body is; here it

is a meager 19 bytes.
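The three-part structure maps directly onto a few lines of code. The following Python sketch splits a raw message into start line, header lines, and body, and follows the advice above by accepting a bare line feed as well as CRLF. The name split_message is ours, and the sketch does no validation beyond finding the blank line.

```python
import re


def split_message(raw):
    """Split raw message bytes into (start line, header lines, body)."""
    # The head ends at the first empty line. Each line ending may be CRLF
    # or, for leniency toward older senders, a bare line feed.
    match = re.search(rb"\r?\n\r?\n", raw)
    if match:
        head, body = raw[:match.start()], raw[match.end():]
    else:
        head, body = raw, b""
    lines = re.split(rb"\r?\n", head)
    start_line = lines[0].decode("ascii")
    headers = [line.decode("ascii") for line in lines[1:] if line]
    return start_line, headers, body


raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-type: text/plain\r\n"
       b"Content-length: 19\r\n"
       b"\r\n"
       b"Hi! I'm a message!\n")
start_line, headers, body = split_message(raw)
print(start_line)
print(headers)
print(len(body))
```

The body is kept as bytes because, unlike the start line and headers, it may hold binary data.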

* The terms “upstream” and “downstream” relate only to the sender and receiver. We can’t tell whether a message is heading to the origin server or the client, because both are downstream.

Figure 3-1. Messages travel inbound to the origin server and outbound back to the client
[Figure: a client, Proxy 1, Proxy 2, Proxy 3, and a server in a chain. The request “GET /index.html HTTP/1.0” travels inbound (to the server); the response “HTTP/1.0 200 OK”, “Content-type: text/html”, ... travels outbound (to the user agent).]

Message Syntax

All HTTP messages fall into two types: request messages and response messages.

Request messages request an action from a web server. Response messages carry

results of a request back to a client. Both request and response messages have the

same basic message structure. Figure 3-4 shows request and response messages to get

a GIF image.

Here’s the format for a request message:

    <method> <request-URL> <version>
    <headers>

    <entity-body>

Figure 3-2. All messages flow downstream
[Figure: a request flows downstream from the client through Proxy 1, Proxy 2, and Proxy 3 to the server; the response flows downstream from the server through Proxy 3, Proxy 2, and Proxy 1 to the client. No messages ever go upstream.]

Figure 3-3. Three parts of an HTTP message
[Figure: a response message sent from the server to the client, with its three parts labeled.]

    HTTP/1.0 200 OK              (start line)
    Content-type: text/plain     (headers)
    Content-length: 19

    Hi! I’m a message!           (body)

Here’s the format for a response message (note that the syntax differs only in the start line):

    <version> <status-code> <reason-phrase>
    <headers>

    <entity-body>

Here’s a quick description of the various parts:

method

The action that the client wants the server to perform on the resource. It is a sin-

gle word, like “GET,” “HEAD,” or “POST”. We cover the method in detail later

in this chapter.

request-URL

A complete URL naming the requested resource, or the path component of the

URL. If you are talking directly to the server, the path component of the URL is

usually okay as long as it is the absolute path to the resource—the server can

assume itself as the host/port of the URL. Chapter 2 covers URL syntax in detail.

version

The version of HTTP that the message is using. Its format looks like:

HTTP/<major>.<minor>

where major and minor both are integers. We discuss HTTP versioning a bit

more later in this chapter.

status-code

A three-digit number describing what happened during the request. The first

digit of each code describes the general class of status (“success,” “error,” etc.).

An exhaustive list of status codes defined in the HTTP specification and their

meanings is provided later in this chapter.

Figure 3-4. An HTTP transaction has request and response messages
[Figure: the client sends www.joes-hardware.com an HTTP request message containing the command and the URL (“GET /specials/saw-blade.gif HTTP/1.0”, “Host: www.joes-hardware.com”); the server returns an HTTP response message containing the result of the transaction (“HTTP/1.0 200 OK”, “Content-Type: image/gif”, “Content-Length: 8572”).]

reason-phrase

A human-readable version of the numeric status code, consisting of all the text

until the end-of-line sequence. Example reason phrases for all the status codes

defined in the HTTP specification are provided later in this chapter. The reason

phrase is meant solely for human consumption, so, for example, response lines

containing “HTTP/1.0 200 NOT OK” and “HTTP/1.0 200 OK” should be

treated as equivalent success indications, despite the reason phrases suggesting

otherwise.

headers

Zero or more headers, each of which is a name, followed by a colon (:), fol-

lowed by optional whitespace, followed by a value, followed by a CRLF. The

headers are terminated by a blank line (CRLF), marking the end of the list of

headers and the beginning of the entity body. Some versions of HTTP, such as

HTTP/1.1, require certain headers to be present for the request or response mes-

sage to be valid. The various HTTP headers are covered later in this chapter.

entity-body

The entity body contains a block of arbitrary data. Not all messages contain

entity bodies, so sometimes a message terminates with a bare CRLF. We discuss

entities in detail in Chapter 15.

Figure 3-5 demonstrates hypothetical request and response messages.

Note that a set of HTTP headers should always end in a blank line (bare CRLF), even

if there are no headers and even if there is no entity body. Historically, however,

many clients and servers (mistakenly) omitted the final CRLF if there was no entity

body. To interoperate with these popular but noncompliant implementations, cli-

ents and servers should accept messages that end without the final CRLF.
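On the sending side, the safest habit is to always emit the blank line yourself. Here is a minimal Python sketch of a request builder that does so; build_request and its parameters are names invented for this example, and it makes no attempt to validate methods or header values.

```python
CRLF = "\r\n"


def build_request(method, url, version="HTTP/1.1", headers=None, body=b""):
    """Serialize a request: start line, headers, blank line, optional body."""
    lines = ["{} {} {}".format(method, url, version)]
    for name, value in (headers or {}).items():
        lines.append("{}: {}".format(name, value))
    # The extra CRLF is the blank line that ends the header block. It is
    # written even when there are no headers and no entity body.
    head = CRLF.join(lines) + CRLF + CRLF
    return head.encode("ascii") + body


message = build_request(
    "GET",
    "/test/hi-there.txt",
    headers={"Accept": "text/*", "Host": "www.joes-hardware.com"},
)
print(message)
```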

Start Lines

All HTTP messages begin with a start line. The start line for a request message says

what to do. The start line for a response message says what happened.

Figure 3-5. Example request and response messages

(a) Request message

    GET /test/hi-there.txt HTTP/1.1      (start line)
    Accept: text/*                       (headers)
    Host: www.joes-hardware.com

(b) Response message

    HTTP/1.0 200 OK                      (start line)
    Content-type: text/plain             (headers)
    Content-length: 19

    Hi! I’m a message!                   (body)

Request line

Request messages ask servers to do something to a resource. The start line for a

request message, or request line, contains a method describing what operation the

server should perform and a request URL describing the resource on which to per-

form the method. The request line also includes an HTTP version which tells the

server what dialect of HTTP the client is speaking.

All of these fields are separated by whitespace. In Figure 3-5a, the request method is

GET, the request URL is /test/hi-there.txt, and the version is HTTP/1.1. Prior to

HTTP/1.0, request lines were not required to contain an HTTP version.

Response line

Response messages carry status information and any resulting data from an opera-

tion back to a client. The start line for a response message, or response line, contains

the HTTP version that the response message is using, a numeric status code, and a

textual reason phrase describing the status of the operation.

All these fields are separated by whitespace. In Figure 3-5b, the HTTP version is

HTTP/1.0, the status code is 200 (indicating success), and the reason phrase is OK,

meaning the document was returned successfully. Prior to HTTP/1.0, responses were

not required to contain a response line.
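Because the fields of a start line are separated by whitespace, splitting one apart takes very little code. The following Python sketch is one way to do it; both function names are ours, and returning None for a missing version is an assumption made so that older, version-less request lines can still be represented.

```python
def parse_request_line(line):
    """Return (method, request URL, version) from a request line."""
    fields = line.split()
    if len(fields) == 2:
        # No version field, as in request lines that predate HTTP/1.0.
        method, url = fields
        return method, url, None
    method, url, version = fields
    return method, url, version


def parse_response_line(line):
    """Return (version, status code, reason phrase) from a response line."""
    # Split at most twice: the reason phrase may itself contain spaces.
    version, code, *rest = line.split(None, 2)
    reason = rest[0] if rest else ""
    return version, int(code), reason


print(parse_request_line("GET /test/hi-there.txt HTTP/1.1"))
print(parse_response_line("HTTP/1.0 200 OK"))
print(parse_response_line("HTTP/1.0 200 NOT OK"))
```

Software should act on the numeric status code only. In the last call the code is still 200, so the response counts as a success whatever the reason phrase says.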

Methods

The method begins the start line of requests, telling the server what to do. For exam-

ple, in the line “GET /specials/saw-blade.gif HTTP/1.0,” the method is GET.

The HTTP specifications have defined a set of common request methods. For exam-

ple, the GET method gets a document from a server, the POST method sends data to

a server for processing, and the OPTIONS method determines the general capabili-

ties of a web server or the capabilities of a web server for a specific resource.

Table 3-1 describes seven of these methods. Note that some methods have a body in

the request message, and other methods have bodyless requests.

Table 3-1. Common HTTP methods

Method    Description                                              Message body?
GET       Get a document from the server.                          No
HEAD      Get just the headers for a document from the server.     No
POST      Send data to the server for processing.                  Yes
PUT       Store the body of the request on the server.             Yes
TRACE     Trace the message through proxy servers to the server.   No
OPTIONS   Determine what methods can operate on a server.          No
DELETE    Remove a document from the server.                       No

Not all servers implement all seven of the methods in Table 3-1. Furthermore,

because HTTP was designed to be easily extensible, other servers may implement

their own request methods in addition to these. These additional methods are called

extension methods, because they extend the HTTP specification.

Status codes

As methods tell the server what to do, status codes tell the client what happened.

Status codes live in the start lines of responses. For example, in the line “HTTP/1.0

200 OK,” the status code is 200.

When clients send request messages to an HTTP server, many things can happen. If

you are fortunate, the request will complete successfully. You might not always be so

lucky. The server may tell you that the resource you requested could not be found,

that you don’t have permission to access the resource, or perhaps that the resource

has moved someplace else.

Status codes are returned in the start line of each response message. Both a numeric

and a human-readable status are returned. The numeric code makes error process-

ing easy for programs, while the reason phrase is easily understood by humans.

The different status codes are grouped into classes by their three-digit numeric codes.

Status codes between 200 and 299 represent success. Codes between 300 and 399

indicate that the resource has been moved. Codes between 400 and 499 mean that

the client did something wrong in the request. Codes between 500 and 599 mean

something went awry on the server.

The status code classes are shown in Table 3-2.

Current versions of HTTP define only a few codes for each status category. As the

protocol evolves, more status codes will be defined officially in the HTTP specifica-

tion. If you receive a status code that you don’t recognize, chances are someone has

defined it as an extension to the current protocol. You should treat it as a general

member of the class whose range it falls into.

For example, if you receive status code 515 (which is outside of the defined range for

5XX codes listed in Table 3-2), you should treat the response as indicating a server

error, which is the general class of 5XX messages.
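In code, the fallback rule amounts to looking only at the first digit. Here is a minimal Python sketch; the status_class function and its table are our own names, with the category labels taken from Table 3-2.

```python
# Category names from Table 3-2, keyed by the first digit of the code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def status_class(code):
    """Return the general class of a status code, recognized or not."""
    if not 100 <= code <= 599:
        raise ValueError("status code outside 100-599: {}".format(code))
    return STATUS_CLASSES[code // 100]


print(status_class(200))  # Successful
print(status_class(515))  # Server error, although 515 itself is undefined
```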

Table 3-2. Status code classes

Overall range   Defined range   Category
100-199         100-101         Informational
200-299         200-206         Successful
300-399         300-305         Redirection
400-499         400-415         Client error
500-599         500-505         Server error

Table 3-3 lists some of the most common status codes that you will see. We will

explain all the current HTTP status codes in detail later in this chapter.

Reason phrases

The reason phrase is the last component of the start line of the response. It provides

a textual explanation of the status code. For example, in the line “HTTP/1.0 200

OK,” the reason phrase is OK.

Reason phrases are paired one-to-one with status codes. The reason phrase provides

a human-readable version of the status code that application developers can pass

along to their users to indicate what happened during the request.

The HTTP specification does not provide any hard and fast rules for what reason

phrases should look like. Later in this chapter, we list the status codes and some sug-

gested reason phrases.

Version numbers

Version numbers appear in both request and response message start lines in the for-

mat HTTP/x.y. They provide a means for HTTP applications to tell each other what

version of the protocol they conform to.

Version numbers are intended to provide applications speaking HTTP with a clue

about each other’s capabilities and the format of the message. An HTTP Version 1.2

application communicating with an HTTP Version 1.1 application should know that

it should not use any new 1.2 features, as they likely are not implemented by the

application speaking the older version of the protocol.

The version number indicates the highest version of HTTP that an application supports. In some cases this leads to confusion between applications,* because HTTP/1.0 applications interpret a response with HTTP/1.1 in it to indicate that the response is a 1.1 response, when in fact that’s just the level of protocol used by the responding application.

* See http://httpd.apache.org/docs-2.0/misc/known_client_problems.html for more on cases in which Apache has run into this problem with clients.

Table 3-3. Common status codes

Status code   Reason phrase   Meaning
200           OK              Success! Any requested data is in the response body.
401           Unauthorized    You need to enter a username and password.
404           Not Found       The server cannot find a resource for the requested URL.

Note that version numbers are not treated as fractional numbers. Each number in the version (for example, the “1” and “0” in HTTP/1.0) is treated as a separate number. So, when comparing HTTP versions, each number must be compared separately in order to determine which is the higher version. For example, HTTP/2.22 is a higher version than HTTP/2.3, because 22 is a larger number than 3.
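A quick Python sketch shows the difference between the two ways of comparing. The parse_version helper is a name we made up for this illustration; it turns the version string into a pair of integers, which Python then compares element by element.

```python
def parse_version(text):
    """Turn 'HTTP/<major>.<minor>' into a (major, minor) pair of integers."""
    name, _, numbers = text.partition("/")
    if name != "HTTP":
        raise ValueError("not an HTTP version: {!r}".format(text))
    major, _, minor = numbers.partition(".")
    return int(major), int(minor)


# Tuples compare element by element: major first, then minor.
print(parse_version("HTTP/2.22") > parse_version("HTTP/2.3"))  # True

# Treating the version as a fraction gets the order backwards.
print(float("2.22") > float("2.3"))                            # False
```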

Headers

The previous section focused on the first line of request and response messages

(methods, status codes, reason phrases, and version numbers). Following the start

line comes a list of zero, one, or many HTTP header fields (see Figure 3-5).

HTTP header fields add additional information to request and response messages.

They are basically just lists of name/value pairs. For example, the following header

line assigns the value 19 to the Content-Length header field:

Content-length: 19

Header classifications

The HTTP specification defines several header fields. Applications also are free to

invent their own home-brewed headers. HTTP headers are classified into:

General headers

Can appear in both request and response messages

Request headers

Provide more information about the request

Response headers

Provide more information about the response

Entity headers

Describe body size and contents, or the resource itself

Extension headers

New headers that are not defined in the specification

Each HTTP header has a simple syntax: a name, followed by a colon (:), followed by

optional whitespace, followed by the field value, followed by a CRLF. Table 3-4 lists

some common header examples.

Table 3-4. Common header examples

Header example                              Description
Date: Tue, 3 Oct 1997 02:16:03 GMT          The date the server generated the response
Content-length: 15040                       The entity body contains 15,040 bytes of data
Content-type: image/gif                     The entity body is a GIF image
Accept: image/gif, image/jpeg, text/html    The client accepts GIF and JPEG images and HTML

Header continuation lines

Long header lines can be made more readable by breaking them into multiple lines, preceding each extra line with at least one space or tab character.

For example:

    HTTP/1.0 200 OK
    Content-Type: image/gif
    Content-Length: 8572
    Server: Test Server
        Version 1.0

In this example, the response message contains a Server header whose value is bro-

ken into continuation lines. The complete value of the header is “Test Server Ver-

sion 1.0”.
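Here is a short Python sketch of how a receiver might put continuation lines back together before splitting each header into its name and value. The parse_headers function is our own illustration; joining the pieces with a single space is an assumption that happens to reproduce the value quoted above.

```python
def parse_headers(lines):
    """Unfold continuation lines, then split each header into (name, value)."""
    unfolded = []
    for line in lines:
        if line[:1] in (" ", "\t") and unfolded:
            # Leading space or tab: this line continues the previous header.
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)
    headers = []
    for line in unfolded:
        # A name, a colon, optional whitespace, then the value.
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return headers


header_lines = [
    "Content-Type: image/gif",
    "Content-Length: 8572",
    "Server: Test Server",
    "    Version 1.0",
]
for name, value in parse_headers(header_lines):
    print(name, "=", value)
```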

We’ll briefly describe all the HTTP headers later in this chapter. We also provide a

more detailed reference summary of all the headers in Appendix C.

Entity Bodies

The third part of an HTTP message is the optional entity body. Entity bodies are the

payload of HTTP messages. They are the things that HTTP was designed to transport.

HTTP messages can carry many kinds of digital data: images, video, HTML docu-

ments, software applications, credit card transactions, electronic mail, and so on.

Version 0.9 Messages

HTTP Version 0.9 was an early version of the HTTP protocol. It was the starting

point for the request and response messages that HTTP has today, but with a far

simpler protocol (see Figure 3-6).

HTTP/0.9 messages also consisted of requests and responses, but the request contained merely the method and the request URL, and the response contained only the entity. No version information (it was the first and only version at the time), no status code or reason phrase, and no headers were included.

Figure 3-6. HTTP/0.9 transaction

Request message (client to www.joes-hardware.com), with no version number:

GET /specials/saw-blade.gif

Methods |53

However, this simplicity did not allow for much flexibility or the implementation of

most of the HTTP features and applications described in this book. We briefly

describe it here because there are still clients, servers, and other applications that use

it, and application writers should be aware of its limitations.
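One practical consequence of the missing version number is that an application can recognize an HTTP/0.9 request by counting the tokens on the request line. The Python sketch below assumes that convention; parse_request_line is a hypothetical helper of our own, not part of any library.

```python
def parse_request_line(line: str) -> dict[str, str]:
    """Split a request line, treating a missing version as HTTP/0.9."""
    parts = line.split()
    if len(parts) == 2:
        # Method and URL only: the HTTP/0.9 form shown in Figure 3-6.
        method, url = parts
        return {"method": method, "url": url, "version": "HTTP/0.9"}
    if len(parts) == 3:
        method, url, version = parts
        return {"method": method, "url": url, "version": version}
    raise ValueError(f"malformed request line: {line!r}")


print(parse_request_line("GET /specials/saw-blade.gif"))
print(parse_request_line("GET /specials/saw-blade.gif HTTP/1.1"))
```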

Methods

Let’s talk in more detail about some of the basic HTTP methods, listed earlier in

Table 3-1. Note that not all methods are implemented by every server. To be compli-

ant with HTTP Version 1.1, a server need implement only the GET and HEAD meth-

ods for its resources.

Even when servers do implement all of these methods, the methods most likely have

restricted uses. For example, servers that support DELETE or PUT (described later in

this section) would not want just anyone to be able to delete or store resources.

These restrictions generally are set up in the server’s configuration, so they vary from

site to site and from server to server.

Safe Methods

HTTP defines a set of methods that are called safe methods. The GET and HEAD

methods are said to be safe, meaning that no action should occur as a result of an

HTTP request that uses either the GET or HEAD method.

By no action, we mean that nothing will happen on the server as a result of the

HTTP request. For example, consider when you are shopping online at Joe’s Hard-

ware and you click on the “submit purchase” button. Clicking on the button sub-

mits a POST request (discussed later) with your credit card information, and an

action is performed on the server on your behalf. In this case, the action is your

credit card being charged for your purchase.

There is no guarantee that a safe method won’t cause an action to be performed (in

practice, that is up to the web developers). Safe methods are meant to allow HTTP

application developers to let users know when an unsafe method that may cause

some action to be performed is being used. In our Joe’s Hardware example, your

web browser may pop up a warning message letting you know that you are making a

request with an unsafe method and that, as a result, something might happen on the

server (e.g., your credit card being charged).

GET

GET is the most common method. It usually is used to ask a server to send a

resource. HTTP/1.1 requires servers to implement this method. Figure 3-7 shows an

example of a client making an HTTP request with the GET method.


HEAD

The HEAD method behaves exactly like the GET method, but the server returns only

the headers in the response. No entity body is ever returned. This allows a client to

inspect the headers for a resource without having to actually get the resource. Using

HEAD, you can:

• Find out about a resource (e.g., determine its type) without getting it.

• See if an object exists, by looking at the status code of the response.

• Test if the resource has been modified, by looking at the headers.

Server developers must ensure that the headers returned are exactly those that a GET

request would return. The HEAD method also is required for HTTP/1.1 compli-

ance. Figure 3-8 shows the HEAD method in action.
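To make the relationship between GET and HEAD concrete, here is a small Python sketch that derives a HEAD response from the bytes of a complete GET response by keeping the status line and headers and dropping the entity body. The helper name head_response is our own, and real servers usually avoid generating the body at all; this only illustrates that the headers are meant to be identical.

```python
def head_response(get_response: bytes) -> bytes:
    """Return the status line and headers of a response, without the body."""
    # Headers end at the first blank line (CRLF CRLF).
    head, _separator, _body = get_response.partition(b"\r\n\r\n")
    return head + b"\r\n\r\n"


get_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 617\r\n"
    b"\r\n"
    b"<HTML>..."
)
print(head_response(get_response).decode("ascii"))
```

Note that Content-Length still describes the body a GET would have returned, even though no body is sent.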

PUT

The PUT method writes documents to a server, in the inverse of the way that GET

reads documents from a server. Some publishing systems let you create web pages

and install them directly on a web server using PUT (see Figure 3-9).

Figure 3-7. GET example

Request message (client to www.joes-hardware.com):

GET /seasonal/index-fall.html HTTP/1.1
Host: www.joes-hardware.com
Accept: *

Response message:

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 617

<HTML>
<HEAD><TITLE>Joe’s Special Offers </TITLE>
...

Figure 3-8. HEAD example

Request message (client to www.joes-hardware.com):

HEAD /seasonal/index-fall.html HTTP/1.1
Host: www.joes-hardware.com
Accept: *

Response message (no entity body):

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 617


The semantics of the PUT method are for the server to take the body of the request

and either use it to create a new document named by the requested URL or, if that

URL already exists, use the body to replace it.

Because PUT allows you to change content, many web servers require you to log in

with a password before you can perform a PUT. You can read more about password

authentication in Chapter 12.
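The create-or-replace semantics of PUT can be sketched in a few lines of Python. The in-memory dictionary below stands in for the server's document store, and handle_put is a hypothetical handler of our own. Returning 201 for a newly created document follows Figure 3-9; returning 200 when an existing document is replaced is an assumption on our part.

```python
# In-memory stand-in for the server's document store (an assumption for this sketch).
documents: dict[str, bytes] = {}


def handle_put(url: str, body: bytes) -> int:
    """Store the request body under the URL and return a status code."""
    created = url not in documents
    documents[url] = body  # create the document, or replace the existing one
    # 201 Created matches Figure 3-9; 200 for a replacement is our assumption.
    return 201 if created else 200


print(handle_put("/product-list.txt", b"Updated product list coming soon!\n"))
print(handle_put("/product-list.txt", b"A newer product list\n"))
```

A real server would check the client's credentials before calling anything like handle_put.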

POST

The POST method was designed to send input data to the server.* In practice, it is

often used to support HTML forms. The data from a filled-in form typically is sent to

the server, which then marshals it off to where it needs to go (e.g., to a server gateway

program, which then processes it). Figure 3-10 shows a client making an HTTP

request—sending form data to a server—with the POST method.
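As a hedged illustration of how a client might package form data, the Python sketch below builds a POST request by hand. It uses the standard library's urllib.parse.urlencode, which produces the application/x-www-form-urlencoded format that browsers commonly use for forms; this differs from the text/plain body in Figure 3-10. The helper name build_post is our own.

```python
from urllib.parse import urlencode


def build_post(host: str, path: str, fields: dict[str, str]) -> bytes:
    """Build a POST request whose entity body carries URL-encoded form fields."""
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"  # length of the entity body in bytes
        "\r\n"
    )
    return head.encode("ascii") + body


request = build_post("www.joes-hardware.com", "/inventory-check.cgi",
                     {"item": "bandsaw 2647"})
print(request.decode("ascii"))
```

The space in the field value is encoded as a plus sign, so the body is 17 bytes long.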

* POST is used to send data to a server. PUT is used to deposit data into a resource on the server (e.g., a file).

Figure 3-9. PUT example

Request message (Joe to www.joes-hardware.com):

PUT /product-list.txt HTTP/1.1
Host: www.joes-hardware.com
Content-type: text/plain
Content-length: 34

Updated product list coming soon!

The server updates or creates the resource “/product-list.txt” and writes it to its disk.

Response message:

HTTP/1.1 201 Created
Location: http://www.joes-hardware.com/product-list.txt
Content-Type: text/plain
Content-Length: 47

http://www.joes-hardware.com/product-list.txt

TRACE

When a client makes a request, that request may have to travel through firewalls, proxies, gateways, or other applications. Each of these has the opportunity to modify the original HTTP request. The TRACE method allows a client to see how its request looks when it finally makes it to the server.

A TRACE request initiates a “loopback” diagnostic at the destination server. The server at the final leg of the trip bounces back a TRACE response, with the exact request message it received in the body of its response. A client can then see how, or if, its original message was munged or modified along the request/response chain of any intervening HTTP applications (see Figure 3-11).

The TRACE method is used primarily for diagnostics; i.e., verifying that requests are

going through the request/response chain as intended. It’s also a good tool for see-

ing the effects of proxies and other applications on your requests.

As good as TRACE is for diagnostics, it does have the drawback of assuming that

intervening applications will treat different types of requests (different methods—

GET, HEAD, POST, etc.) the same. Many HTTP applications do different things

depending on the method—for example, a proxy might pass a POST request directly

to the server but attempt to send a GET request to another HTTP application (such

as a web cache). TRACE does not provide a mechanism to distinguish methods.

Generally, intervening applications make the call as to how they process a TRACE

request.

Figure 3-10. POST example

Request message (client to www.joes-hardware.com); the browser sticks the data in the entity body of the message:

POST /inventory-check.cgi HTTP/1.1
Host: www.joes-hardware.com
Content-type: text/plain
Content-length: 18

item=bandsaw 2647

The server passes “item=bandsaw 2647” to a CGI program, which runs an inventory check against the inventory list (“YES!”).

Response message:

HTTP/1.1 200 OK
Content-type: text/plain
Content-length: 37

The bandsaw model 2647 is in stock!


No entity body can be sent with a TRACE request. The entity body of the TRACE

response contains, verbatim, the request that the responding server received.
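The loopback behavior is simple enough to sketch in Python: the server wraps the request bytes it received, unchanged, in the entity body of a 200 response. The function name trace_response is hypothetical. We use the message/http content type that RFC 2616 specifies for TRACE responses, whereas the figure in this chapter shows text/plain.

```python
def trace_response(received_request: bytes) -> bytes:
    """Echo the request exactly as received in the body of a 200 response."""
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: message/http\r\n"
        f"Content-Length: {len(received_request)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + received_request  # the body is the request, byte for byte


received = (
    b"TRACE /product-list.txt HTTP/1.1\r\n"
    b"Host: www.joes-hardware.com\r\n"
    b"Accept: *\r\n"
    b"Via: 1.1 proxy3.company.com\r\n"
    b"\r\n"
)
print(trace_response(received).decode("ascii"))
```

Because the Via header added by the proxy is part of what the server received, it shows up in the echoed body, which is how the client learns about it.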

OPTIONS

The OPTIONS method asks the server to tell us about the various supported capabil-

ities of the web server. You can ask a server about what methods it supports in gen-

eral or for particular resources. (Some servers may support particular operations only

on particular kinds of objects).

This provides a means for client applications to determine how best to access vari-

ous resources without actually having to access them. Figure 3-12 shows a request

scenario using the OPTIONS method.
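A server might answer OPTIONS from a simple lookup table, as in the Python sketch below. The per-resource table, the server-wide method list, and the function name options_response are all assumptions made for illustration; the server-wide list mirrors the Allow header shown in Figure 3-12.

```python
# Methods the server supports in general (matches the Allow header in Figure 3-12).
SERVER_METHODS = ["GET", "POST", "PUT", "OPTIONS"]

# Hypothetical per-resource restrictions.
RESOURCE_METHODS = {
    "/product-list.txt": ["GET", "PUT", "OPTIONS"],
    "/inventory-check.cgi": ["POST", "OPTIONS"],
}


def options_response(target: str) -> str:
    """Build an OPTIONS response for '*' (the whole server) or one resource."""
    if target == "*":
        methods = SERVER_METHODS
    else:
        methods = RESOURCE_METHODS.get(target, SERVER_METHODS)
    return (
        "HTTP/1.1 200 OK\r\n"
        f"Allow: {', '.join(methods)}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )


print(options_response("*"))
print(options_response("/inventory-check.cgi"))
```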

Figure 3-11. TRACE example

Request message (client to proxy):

TRACE /product-list.txt HTTP/1.1
Accept: *
Host: www.joes-hardware.com

Request message (proxy to www.joes-hardware.com):

TRACE /product-list.txt HTTP/1.1
Host: www.joes-hardware.com
Accept: *
Via: 1.1 proxy3.company.com

Response message (www.joes-hardware.com to proxy):

HTTP/1.1 200 OK
Content-type: text/plain
Content-length: 96

TRACE /product-list.txt HTTP/1.1
Host: www.joes-hardware.com
Accept: *
Via: 1.1 proxy3.company.com

Response message (proxy to client):

HTTP/1.1 200 OK
Content-type: text/plain
Content-length: 96
Via: 1.1 proxy3.company.com

TRACE /product-list.txt HTTP/1.1
Host: www.joes-hardware.com
Accept: *
Via: 1.1 proxy3.company.com

Examining the entity, the client can see that its request was upgraded to protocol Version 1.1. Along with the upgrade came a few additional request headers.

Figure 3-12. OPTIONS example

Request message (client to www.joes-hardware.com):

OPTIONS * HTTP/1.1
Host: www.joes-hardware.com
Accept: *

Response message:

HTTP/1.1 200 OK
Allow: GET, POST, PUT, OPTIONS
Content-length: 0

Since the request is for options on all resources, the server just returns the methods it supports for its resources.


DELETE

The DELETE method does just what you would think—it asks the server to delete

the resources specified by the request URL. However, the client application is not

guaranteed that the delete is carried out. This is because the HTTP specification

allows the server to override the request without telling the client. Figure 3-13 shows

an example of the DELETE method.

Extension Methods

HTTP was designed to be field-extensible, so new features wouldn’t cause older soft-

ware to fail. Extension methods are methods that are not defined in the HTTP/1.1

specification. They provide developers with a means of extending the capabilities of

the HTTP services their servers implement on the resources that the servers manage.

Some common examples of extension methods are listed in Table 3-5. These meth-

ods are all part of the WebDAV HTTP extension (see Chapter 19) that helps sup-

port publishing of web content to web servers over HTTP.

It’s important to note that not all extension methods are defined in a formal specification. If you define an extension method, it’s likely not to be understood by most HTTP applications. Likewise, your HTTP applications could run into extension methods, used by other applications, that they do not understand.

Figure 3-13. DELETE example

Request message (client to www.joes-hardware.com):

DELETE /product-list.txt HTTP/1.1
Host: www.joes-hardware.com

The file “product-list.txt” is removed from the server’s disk.

Response message (the client thinks the resource was deleted):

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 54

I have your delete request,
will take time to process.

Table 3-5. Example web publishing extension methods

Method  Description
LOCK    Allows a user to “lock” a resource; for example, you could lock a resource while you are editing it to prevent others from editing it at the same time
MKCOL   Allows a user to create a resource
COPY    Facilitates copying resources on a server
MOVE    Moves a resource on a server

Status Codes |59

In these cases, it is best to be tolerant of extension methods. Proxies should try to

relay messages with unknown methods through to downstream servers if they are

capable of doing that without breaking end-to-end behavior. Otherwise, they should

respond with a 501 Not Implemented status code. Dealing with extension methods

(and HTTP extensions in general) is best done with the old rule, “be conservative in

what you send, be liberal in what you accept.”
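The tolerance rule for proxies can be written as a small decision function. In this Python sketch, the set of known methods and the can_relay_unknown flag (standing in for “capable of doing that without breaking end-to-end behavior”) are our own simplifications, and proxy_decision is a hypothetical name.

```python
# Methods described in this chapter; anything else is treated as an extension method.
KNOWN_METHODS = {"GET", "HEAD", "POST", "PUT", "TRACE", "OPTIONS", "DELETE"}


def proxy_decision(method: str, can_relay_unknown: bool) -> str:
    """Decide whether a proxy relays a request or answers 501 itself."""
    if method in KNOWN_METHODS or can_relay_unknown:
        return "relay"  # pass the message through to the downstream server
    return "501 Not Implemented"


print(proxy_decision("LOCK", can_relay_unknown=True))
print(proxy_decision("LOCK", can_relay_unknown=False))
```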

Status Codes

HTTP status codes are classified into five broad categories, as shown earlier in

Table 3-2. This section summarizes the HTTP status codes for each of the five classes.

The status codes provide an easy way for clients to understand the results of their

transactions. In this section, we also list example reason phrases, though there is no

real guidance on the exact text for reason phrases. We include the recommended rea-

son phrases from the HTTP/1.1 specification.
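Because the class of a status code is determined by its first digit, a client can classify even codes it does not recognize. The following Python sketch shows one way to do that; the function name status_class is our own, and the class names follow the section titles used in this chapter.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Map a three-digit status code to one of the five broad classes."""
    try:
        return STATUS_CLASSES[code // 100]  # the hundreds digit selects the class
    except KeyError:
        raise ValueError(f"not a valid HTTP status code: {code}") from None


for code in (100, 200, 304, 404, 503):
    print(code, status_class(code))
```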

100–199: Informational Status Codes

HTTP/1.1 introduced the informational status codes to the protocol. They are rela-

tively new and subject to a bit of controversy about their complexity and perceived

value. Table 3-6 lists the defined informational status codes.

The 100 Continue status code, in particular, is a bit confusing. It’s intended to opti-

mize the case where an HTTP client application has an entity body to send to a

server but wants to check that the server will accept the entity before it sends it. We

discuss it here in a bit more detail (how it interacts with clients, servers, and proxies)

because it tends to confuse HTTP programmers.

Clients and 100 Continue

If a client is sending an entity to a server and is willing to wait for a 100 Continue

response before it sends the entity, the client needs to send an Expect request header

(see Appendix C) with the value 100-continue. If the client is not sending an entity, it

shouldn’t send a 100-continue Expect header, because this will only confuse the

server into thinking that the client might be sending an entity.

Table 3-6. Informational status codes and reason phrases

Status code  Reason phrase        Meaning
100          Continue             Indicates that an initial part of the request was received and the client should continue. After sending this, the server must respond after receiving the request. See the Expect header in Appendix C for more information.
101          Switching Protocols  Indicates that the server is changing protocols, as specified by the client, to one listed in the Upgrade header.


100-continue, in many ways, is an optimization. A client application should really

use 100-continue only to avoid sending a server a large entity that the server will not

be able to handle or use.

Because of the initial confusion around the 100 Continue status (and given some of

the older implementations out there), clients that send an Expect header for 100-

continue should not wait forever for the server to send a 100 Continue response.

After some timeout, the client should just send the entity.

In practice, client implementors also should be prepared to deal with unexpected 100

Continue responses (annoying, but true). Some errant HTTP applications send this

code inappropriately.
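The client-side advice (wait for 100 Continue, but not forever) might look like the following Python sketch. To keep it runnable without a network, a queue.Queue stands in for status codes arriving from the server; wait_for_continue and the returned strings are our own inventions, and treating any other status as a final response that cancels the upload is an assumption.

```python
import queue


def wait_for_continue(statuses: queue.Queue, timeout: float) -> str:
    """Decide what a client does after sending headers with Expect: 100-continue."""
    try:
        status = statuses.get(timeout=timeout)
    except queue.Empty:
        # No answer in time: stop waiting and send the entity anyway.
        return "send entity (timed out)"
    if status == 100:
        return "send entity"
    # Any other status is treated here as the server's final response.
    return f"do not send entity (final status {status})"


incoming: queue.Queue = queue.Queue()
print(wait_for_continue(incoming, timeout=0.05))  # nothing arrives
incoming.put(100)
print(wait_for_continue(incoming, timeout=0.05))  # server says continue
```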

Servers and 100 Continue

If a server receives a request with the Expect header and 100-continue value, it should

respond with either the 100 Continue response or an error code (see Table 3-9). Serv-

ers should never send a 100 Continue status code to clients that do not send the 100-

continue expectation. However, as we noted above, some errant servers do this.

If for some reason the server receives some (or all) of the entity before it has had a

chance to send a 100 Continue response, it does not need to send this status code,

because the client already has decided to continue. When the server is done reading

the request, however, it still needs to send a final status code for the request (it can

just skip the 100 Continue status).

Finally, if a server receives a request with a 100-continue expectation and it decides to

end the request before it has read the entity body (e.g., because an error has occurred),

it should not just send a response and close the connection, as this can prevent the cli-

ent from receiving the response (see “TCP close and reset errors” in Chapter 4).

Proxies and 100 Continue

A proxy that receives from a client a request that contains the 100-continue expecta-

tion needs to do a few things. If the proxy either knows that the next-hop server (dis-

cussed in Chapter 6) is HTTP/1.1-compliant or does not know what version the

next-hop server is compliant with, it should forward the request with the Expect

header in it. If it knows that the next-hop server is compliant with a version of HTTP

earlier than 1.1, it should respond with the 417 Expectation Failed error.

If a proxy decides to include an Expect header and 100-continue value in its request

on behalf of a client that is compliant with HTTP/1.0 or earlier, it should not for-

ward the 100 Continue response (if it receives one from the server) to the client,

because the client won’t know what to make of it.

It can pay for proxies to maintain some state about next-hop servers and the ver-

sions of HTTP they support (at least for servers that have received recent requests),

so they can better handle requests received with a 100-continue expectation.
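The forwarding rule for proxies reduces to one comparison on the next-hop server's HTTP version. In this Python sketch the version is a (major, minor) tuple, with None meaning the proxy does not know it; that representation and the name proxy_expect_action are assumptions of ours.

```python
from typing import Optional


def proxy_expect_action(next_hop_version: Optional[tuple[int, int]]) -> str:
    """Decide how a proxy handles a request carrying Expect: 100-continue."""
    if next_hop_version is not None and next_hop_version < (1, 1):
        # The next hop is known to predate HTTP/1.1 and cannot honor the expectation.
        return "417 Expectation Failed"
    # HTTP/1.1-compliant, or version unknown: forward with the Expect header intact.
    return "forward"


print(proxy_expect_action((1, 1)))
print(proxy_expect_action(None))
print(proxy_expect_action((1, 0)))
```

A table of recently contacted servers and their versions would supply the next_hop_version argument.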


200–299: Success Status Codes

When clients make requests, the requests usually are successful. Servers have an

array of status codes to indicate success, matched up with different types of requests.

Table 3-7 lists the defined success status codes.

Table 3-7. Success status codes and reason phrases

Status code  Reason phrase                  Meaning
200          OK                             Request is okay, entity body contains requested resource.
201          Created                        For requests that create server objects (e.g., PUT). The entity body of the response should contain the various URLs for referencing the created resource, with the Location header containing the most specific reference. See Table 3-21 for more on the Location header. The server must have created the object prior to sending this status code.
202          Accepted                       The request was accepted, but the server has not yet performed any action with it. There are no guarantees that the server will complete the request; this just means that the request looked valid when accepted. The server should include an entity body with a description indicating the status of the request and possibly an estimate for when it will be completed (or a pointer to where this information can be obtained).
203          Non-Authoritative Information  The information contained in the entity headers (see “Entity Headers” for more information on entity headers) came not from the origin server but from a copy of the resource. This could happen if an intermediary had a copy of a resource but could not or did not validate the meta-information (headers) it sent about the resource. This response code is not required to be used; it is an option for applications that have a response that would be a 200 status if the entity headers had come from the origin server.
204          No Content                     The response message contains headers and a status line, but no entity body. Primarily used to update browsers without having them move to a new document (e.g., refreshing a form page).
205          Reset Content                  Another code primarily for browsers. Tells the browser to clear any HTML form elements on the current page.
206          Partial Content                A partial or range request was successful. Later, we will see that clients can request part or a range of a document by using special headers; this status code indicates that the range request was successful. See “Range Requests” in Chapter 15 for more on the Range header. A 206 response must include a Content-Range, Date, and either ETag or Content-Location header.

300–399: Redirection Status Codes

The redirection status codes either tell clients to use alternate locations for the resources they’re interested in or provide an alternate response instead of the content. If a resource has moved, a redirection status code and an optional Location header can be sent to tell the client that the resource has moved and where it can now be found (see Figure 3-14). This allows browsers to go to the new location transparently, without bothering their human users.

Some of the redirection status codes can be used to validate an application’s local

copy of a resource with the origin server. For example, an HTTP application can

check if the local copy of its resource is still up-to-date or if the resource has been

modified on the origin server. Figure 3-15 shows an example of this. The client sends

a special If-Modified-Since header saying to get the document only if it has been

modified since October 1997. The document has not changed since this date, so the

server replies with a 304 status code instead of the contents.
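On the server side, the validation check amounts to comparing the date in If-Modified-Since with the resource's last modification time. The Python sketch below uses email.utils.parsedate_to_datetime from the standard library to parse the HTTP date; the function name conditional_status and the choice to pass the modification time in as a datetime are our own assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since: str, last_modified: datetime) -> int:
    """Return 200 if the resource changed after the given date, else 304."""
    since = parsedate_to_datetime(if_modified_since)
    return 200 if last_modified > since else 304


header = "Fri, 03 Oct 1997 02:16:00 GMT"
unchanged = datetime(1997, 9, 1, 12, 0, 0, tzinfo=timezone.utc)
changed = datetime(1997, 10, 4, 8, 30, 0, tzinfo=timezone.utc)
print(conditional_status(header, unchanged))  # resource is older than the date
print(conditional_status(header, changed))    # resource was modified afterwards
```

With a 304, the server sends no entity body and the client keeps using its local copy.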

Figure 3-14. Redirected request to new location

Request message (client to www.joes-hardware.com):

GET /pet-products.txt HTTP/1.1
Host: www.joes-hardware.com
Accept: *

Response message:

HTTP/1.1 301 Moved Permanently
Location: http://www.gentle-grooming.com/
Content-length: 56
Content-type: text/plain

Please go to our partner site,
www.gentle-grooming.com

Request message (client to www.gentle-grooming.com):

GET / HTTP/1.1
Host: www.gentle-grooming.com
Accept: *

Response message:

HTTP/1.1 200 OK
Content-type: text/html
Content-length: 3307

...


In general, it’s good practice for responses to non-HEAD requests that include a redi-

rection status code to include an entity with a description and links to the redirected

URL(s)—see the first response message in Figure 3-14. Table 3-8 lists the defined

redirection status codes.

Figure 3-15. Request redirected to use local copy

The client has previously requested a copy of http://www.joes-hardware.com/seasonal/index-fall.html.

Request message (client to www.joes-hardware.com):

GET /seasonal/index-fall.html HTTP/1.1
Host: www.joes-hardware.com
Accept: *
If-Modified-Since: Fri, 03 Oct 1997 02:16:00 GMT

Response message (the resource has not changed):

HTTP/1.1 304 Not Modified
...

The browser displays its local copy, since the original has not changed since it was last requested.

Table 3-8. Redirection status codes and reason phrases

Status code  Reason phrase       Meaning
300          Multiple Choices    Returned when a client has requested a URL that actually refers to multiple resources, such as a server hosting an English and French version of an HTML document. This code is returned along with a list of options; the user can then select which one he wants. See Chapter 17 for more on clients negotiating when there are multiple versions. The server can include the preferred URL in the Location header.
301          Moved Permanently   Used when the requested URL has been moved. The response should contain in the Location header the URL where the resource now resides.
302          Found               Like the 301 status code; however, the client should use the URL given in the Location header to locate the resource temporarily. Future requests should use the old URL.
303          See Other           Used to tell the client that the resource should be fetched using a different URL. This new URL is in the Location header of the response message. Its main purpose is to allow responses to POST requests to direct a client to a resource.
304          Not Modified        Clients can make their requests conditional by the request headers they include. See Table 3-15 for more on conditional headers. If a client makes a conditional request, such as a GET that should return the resource only if it has changed since a given date, this code is used to indicate that the resource has not changed. Responses with this status code should not contain an entity body.
305          Use Proxy           Used to indicate that the resource must be accessed through a proxy; the location of the proxy is given in the Location header. It’s important that clients interpret this response relative to a specific resource and do not assume that this proxy should be used for all requests or even all requests to the server holding the requested resource. This could lead to broken behavior if the proxy mistakenly interfered with a request, and it poses a security hole.
306          (Unused)            Not currently used.
307          Temporary Redirect  Like the 301 status code; however, the client should use the URL given in the Location header to locate the resource temporarily. Future requests should use the old URL.

From Table 3-8, you may have noticed a bit of overlap between the 302, 303, and 307 status codes. There is some nuance to how these status codes are used, most of which stems from differences in the ways that HTTP/1.0 and HTTP/1.1 applications treat these status codes.

When an HTTP/1.0 client makes a POST request and receives a 302 redirect status code in response, it will follow the redirect URL in the Location header with a GET request to that URL (instead of making a POST request, as it did in the original request).

HTTP/1.0 servers expect HTTP/1.0 clients to do this: when an HTTP/1.0 server sends a 302 status code after receiving a POST request from an HTTP/1.0 client, the server expects that client to follow the redirect with a GET request to the redirected URL.

The confusion comes in with HTTP/1.1. The HTTP/1.1 specification uses the 303 status code to get this same behavior (servers send the 303 status code to redirect a client’s POST request to be followed with a GET request).

To get around the confusion, the HTTP/1.1 specification says to use the 307 status code in place of the 302 status code for temporary redirects to HTTP/1.1 clients. Servers can then save the 302 status code for use with HTTP/1.0 clients.

What this all boils down to is that servers need to check a client’s HTTP version to properly select which redirect status code to send in a redirect response.
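The version check that this section recommends for temporary redirects could be sketched as follows in Python. Representing the client's HTTP version as a (major, minor) tuple is an assumption, and temporary_redirect_status is a hypothetical name.

```python
def temporary_redirect_status(client_version: tuple[int, int]) -> int:
    """Pick the temporary-redirect status code based on the client's HTTP version."""
    # HTTP/1.1 clients understand 307; older clients get the HTTP/1.0-era 302.
    return 307 if client_version >= (1, 1) else 302


print(temporary_redirect_status((1, 0)))
print(temporary_redirect_status((1, 1)))
```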


400–499: Client Error Status Codes

Sometimes a client sends something that a server just can’t handle, such as a badly

formed request message or, most often, a request for a URL that does not exist.

We’ve all seen the infamous 404 Not Found error code while browsing—this is just

the server telling us that we have requested a resource about which it knows nothing.

Many of the client errors are dealt with by your browser, without it ever bothering

you. A few, like 404, might still pass through. Table 3-9 shows the various client

error status codes.

Table 3-9. Client error status codes and reason phrases

Status code  Reason phrase                    Meaning
400          Bad Request                      Used to tell the client that it has sent a malformed request.
401          Unauthorized                     Returned along with appropriate headers that ask the client to authenticate itself before it can gain access to the resource. See Chapter 12 for more on authentication.
402          Payment Required                 Currently this status code is not used, but it has been set aside for future use.
403          Forbidden                        Used to indicate that the request was refused by the server. If the server wants to indicate why the request was denied, it can include an entity body describing the reason. However, this code usually is used when the server does not want to reveal the reason for the refusal.
404          Not Found                        Used to indicate that the server cannot find the requested URL. Often, an entity is included for the client application to display to the user.
405          Method Not Allowed               Used when a request is made with a method that is not supported for the requested URL. The Allow header should be included in the response to tell the client what methods are allowed on the requested resource. See “Entity Headers” for more on the Allow header.
406          Not Acceptable                   Clients can specify parameters about what types of entities they are willing to accept. This code is used when the server has no resource matching the URL that is acceptable for the client. Often, servers include headers that allow the client to figure out why the request could not be satisfied. See “Content Negotiation and Transcoding” in Chapter 17 for more information.
407          Proxy Authentication Required    Like the 401 status code, but used for proxy servers that require authentication for a resource.
408          Request Timeout                  If a client takes too long to complete its request, a server can send back this status code and close down the connection. The length of this timeout varies from server to server but generally is long enough to accommodate any legitimate request.
409          Conflict                         Used to indicate some conflict that the request may be causing on a resource. Servers might send this code when they fear that a request could cause a conflict. The response should contain a body describing the conflict.
410          Gone                             Similar to 404, except that the server once held the resource. Used mostly for web site maintenance, so a server’s administrator can notify clients when a resource has been removed.
411          Length Required                  Used when the server requires a Content-Length header in the request message. See “Content headers” for more on the Content-Length header.
412          Precondition Failed              Used if a client makes a conditional request and one of the conditions fails. Conditional requests occur when a client includes a conditional header such as If-Match or If-Unmodified-Since. See Table 3-15 for more on conditional headers.
413          Request Entity Too Large         Used when a client sends an entity body that is larger than the server can or wants to process.
414          Request URI Too Long             Used when a client sends a request with a request URL that is larger than the server can or wants to process.
415          Unsupported Media Type           Used when a client sends an entity of a content type that the server does not understand or support.
416          Requested Range Not Satisfiable  Used when the request message requested a range of a given resource and that range either was invalid or could not be met.
417          Expectation Failed               Used when the request contained an expectation in the Expect request header that the server could not satisfy. See Appendix C for more on the Expect header. A proxy or other intermediary application can send this response code if it has unambiguous evidence that the origin server will generate a failed expectation for the request.

500–599: Server Error Status Codes

Sometimes a client sends a valid request, but the server itself has an error. This could be a client running into a limitation of the server or an error in one of the server’s subcomponents, such as a gateway resource.

Proxies often run into problems when trying to talk to servers on a client’s behalf. Proxies issue 5XX server error status codes to describe the problem (Chapter 6 covers this in detail). Table 3-10 lists the defined server error status codes.

Table 3-10. Server error status codes and reason phrases

Status code  Reason phrase          Meaning
500          Internal Server Error  Used when the server encounters an error that prevents it from servicing the request.
501          Not Implemented        Used when a client makes a request that is beyond the server’s capabilities (e.g., using a request method that the server does not support).
502          Bad Gateway            Used when a server acting as a proxy or gateway encounters a bogus response from the next link in the request response chain (e.g., if it is unable to connect to its parent gateway).
503          Service Unavailable    Used to indicate that the server currently cannot service the request but will be able to in the future. If the server knows when the resource will become available, it can include a Retry-After header in the response. See “Response Headers” for more on the Retry-After header.

Headers |67

Headers

Headers and methods work together to determine what clients and servers do. This

section quickly sketches the purposes of the standard HTTP headers and some head-

ers that are not explicitly defined in the HTTP/1.1 specification (RFC 2616).

Appendix C summarizes all these headers in more detail.

There are headers that are specific for each type of message and headers that are

more general in purpose, providing information in both request and response mes-

sages. Headers fall into five main classes:

General headers

These are generic headers used by both clients and servers. They serve general

purposes that are useful for clients, servers, and other applications to supply to

one another. For example, the Date header is a general-purpose header that allows

both sides to indicate the time and date at which the message was constructed:

Date: Tue, 3 Oct 1974 02:16:00 GMT

Request headers

As the name implies, request headers are specific to request messages. They pro-

vide extra information to servers, such as what type of data the client is willing

to receive. For example, the following Accept header tells the server that the cli-

ent will accept any media type that matches its request:

Accept: */*

Response headers

Response messages have their own set of headers that provide information to the

client (e.g., what type of server the client is talking to). For example, the follow-

ing Server header tells the client that it is talking to a Version 1.0 Tiki-Hut server:

Server: Tiki-Hut/1.0

Entity headers

Entity headers refer to headers that deal with the entity body. For instance,

entity headers can tell the type of the data in the entity body. For example, the

following Content-Type header lets the application know that the data is an

HTML document in the iso-latin-1 character set:

Content-Type: text/html; charset=iso-latin-1

504 Gateway Timeout Similar to status code 408, except that the response is coming from a gateway

or proxy that has timed out waiting for a response to its request from another

server.

505 HTTP Version Not

Supported

Used when a server receives a request in a version of the protocol that it can’tor

won’t support. Some server applications elect not to support older versions of

the protocol.

Table 3-10. Server error status codes and reason phrases (continued)

Status code Reason phrase Meaning

68 |Chapter 3: HTTP Messages

Extension headers

Extension headers are nonstandard headers that have been created by applica-

tion developers but not yet added to the sanctioned HTTP specification. HTTP

programs need to tolerate and forward extension headers, even if they don’t

know what the headers mean.
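To make the five classes concrete, here is a minimal Python sketch of how a program might sort header names into classes while still forwarding headers it does not recognize. The lookup table covers only a handful of headers named in this chapter; the X-Joes-Promo header and both function names are made up for illustration and are not part of any specification.

```python
# Hypothetical lookup table: a few headers named in this chapter, by class.
HEADER_CLASSES = {
    "date": "general",
    "connection": "general",
    "accept": "request",
    "host": "request",
    "server": "response",
    "retry-after": "response",
    "content-type": "entity",
    "content-length": "entity",
}


def classify(name):
    """Return the class of a header; unknown names count as extension headers."""
    # Header names are case-insensitive, so normalize before the lookup.
    return HEADER_CLASSES.get(name.strip().lower(), "extension")


def forward_headers(headers):
    """Pass every (name, value) pair along unchanged, recognized or not."""
    return [(name, value) for name, value in headers]


incoming = [
    ("Date", "Thu, 03 Oct 1974 02:16:00 GMT"),
    ("Accept", "*/*"),
    ("X-Joes-Promo", "power-tools"),  # made-up extension header
]

for name, value in forward_headers(incoming):
    print(f"{classify(name):9s} {name}: {value}")
```

The point of the design is that classification and forwarding are separate: a program may act only on the headers it understands, but nothing is dropped just because its class is unknown.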

General Headers

Some headers provide very basic information about a message. These headers are called general headers. They are the fence straddlers, supplying useful information about a message regardless of its type.

For example, whether you are constructing a request message or a response message, the date and time the message is created means the same thing, so the header that provides this kind of information is general to both types of messages.

Table 3-11 lists the general informational headers.

Table 3-11. General informational headers

Header Description

Connection Allows clients and servers to specify options about the request/response connection

Date[a] Provides a date and time stamp telling when the message was created

MIME-Version Gives the version of MIME that the sender is using

Trailer Lists the set of headers that are in the trailer of a message encoded with the chunked transfer encoding[b]

Transfer-Encoding Tells the receiver what encoding was performed on the message in order for it to be transported safely

Upgrade Gives a new version or protocol that the sender would like to “upgrade” to using

Via Shows what intermediaries (proxies, gateways) the message has gone through

[a] Appendix C lists the acceptable date formats for the Date header.

[b] Chunked transfer codings are discussed further in “Chunking and persistent connections” in Chapter 15.

General caching headers

HTTP/1.0 introduced the first headers that allowed HTTP applications to cache local copies of objects instead of always fetching them directly from the origin server. The latest version of HTTP has a very rich set of cache parameters. In Chapter 7, we cover caching in depth. Table 3-12 lists the basic caching headers.

Table 3-12. General caching headers

Header Description

Cache-Control Used to pass caching directions along with the message

Pragma[a] Another way to pass directions along with the message, though not specific to caching

[a] Pragma technically is a request header. It was never specified for use in responses. Because of its common misuse as a response header, many clients and proxies will interpret Pragma as a response header, but the precise semantics are not well defined. In any case, Pragma is deprecated in favor of Cache-Control.

Request Headers

Request headers are headers that make sense only in a request message. They give information about who or what is sending the request, where the request originated, or what the preferences and capabilities of the client are. Servers can use the information the request headers give them about the client to try to give the client a better response. Table 3-13 lists the request informational headers.

Table 3-13. Request informational headers

Header Description

Client-IP[a] Provides the IP address of the machine on which the client is running

From Provides the email address of the client’s user[b]

Host Gives the hostname and port of the server to which the request is being sent

Referer Provides the URL of the document that contains the current request URI

UA-Color Provides information about the color capabilities of the client machine’s display

UA-CPU[c] Gives the type or manufacturer of the client’s CPU

UA-Disp Provides information about the client’s display (screen) capabilities

UA-OS Gives the name and version of the operating system running on the client machine

UA-Pixels Provides pixel information about the client machine’s display

User-Agent Tells the server the name of the application making the request

[a] Client-IP and the UA-* headers are not defined in RFC 2616 but are implemented by many HTTP client applications.

[b] An RFC 822 email address format.

[c] While implemented by some clients, the UA-* headers can be considered harmful. Content, specifically HTML, should not be targeted at specific client configurations.

Accept headers

Accept headers give clients a way to tell servers their preferences and capabilities: what they want, what they can use, and, most importantly, what they don’t want. Servers can then use this extra information to make more intelligent decisions about what to send. Accept headers benefit both sides of the connection. Clients get what they want, and servers don’t waste their time and bandwidth sending something the client can’t use. Table 3-14 lists the various accept headers.

Table 3-14. Accept headers

Header Description

Accept Tells the server what media types are okay to send

Accept-Charset Tells the server what charsets are okay to send

Accept-Encoding Tells the server what encodings are okay to send

Accept-Language Tells the server what languages are okay to send

TE[a] Tells the server what extension transfer codings are okay to use

[a] See “Transfer-Encoding Headers” in Chapter 15 for more on the TE header.
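As a rough illustration of how a server might act on the Accept header, the following Python sketch picks one of the media types the server has available. It relies on the q quality parameter that RFC 2616 defines for Accept, which this section does not cover, and it deliberately ignores the specification’s rule that more specific media ranges take precedence over wildcards. All function names are hypothetical.

```python
def parse_accept(header):
    """Split an Accept header into (media_range, q) pairs, most preferred first."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # RFC 2616: a missing q parameter means q=1
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparseable q as "not acceptable"
        ranges.append((parts[0].lower(), q))
    # sorted() is stable, so equal q values keep their order in the header.
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def matches(media_range, media_type):
    """True if a range such as text/* or */* covers a concrete media type."""
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.partition("/")
    return r_type in ("*", m_type) and r_sub in ("*", m_sub)


def choose(accept_header, available):
    """Pick the first available type covered by the most preferred range."""
    for media_range, q in parse_accept(accept_header):
        if q <= 0:
            continue
        for media_type in available:
            if matches(media_range, media_type):
                return media_type
    return None  # nothing the client accepts is available


print(choose("*/*", ["text/html"]))
print(choose("image/gif;q=0.5, text/html", ["image/gif", "text/html"]))
```

When choose returns None, the server has nothing the client says it can use, which is exactly the wasted transfer the Accept headers are meant to prevent.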

Conditional request headers

Sometimes, clients want to put some restrictions on a request. For instance, if the cli-

ent already has a copy of a document, it might want to ask a server to send the docu-

ment only if it is different from the copy the client already has. Using conditional

request headers, clients can put such restrictions on requests, requiring the server to

make sure that the conditions are true before satisfying the request. Table 3-15 lists

the various conditional request headers.
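A minimal Python sketch of how a server might evaluate two of these conditions is shown here, assuming the request headers have already been parsed into a dictionary with lowercase names. The function name and the sample entity tags are hypothetical, and the sketch skips details such as weak entity tags; it only shows the basic decision between sending the document (200) and telling the client its copy is still good (304 Not Modified).

```python
from email.utils import parsedate_to_datetime


def evaluate_conditional(request_headers, current_etag, last_modified):
    """Return 304 if the client's copy is still current, otherwise 200.

    request_headers: dict of header values keyed by lowercase header name.
    current_etag: entity tag of the current version, for example '"v2"'.
    last_modified: HTTP date string for the current version.
    """
    if_none_match = request_headers.get("if-none-match")
    if if_none_match is not None:
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in client_tags or current_etag in client_tags:
            return 304  # the client already holds this version
        return 200

    if_modified_since = request_headers.get("if-modified-since")
    if if_modified_since is not None:
        client_time = parsedate_to_datetime(if_modified_since)
        server_time = parsedate_to_datetime(last_modified)
        if server_time <= client_time:
            return 304  # not modified since the client's copy was fetched
    return 200


LAST_MODIFIED = "Sat, 29 Oct 1994 19:43:31 GMT"
print(evaluate_conditional({"if-none-match": '"v2"'}, '"v2"', LAST_MODIFIED))
print(evaluate_conditional({"if-modified-since": "Fri, 28 Oct 1994 08:00:00 GMT"},
                           '"v2"', LAST_MODIFIED))
```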

Table 3-15. Conditional request headers

Header Description

Expect Allows a client to list server behaviors that it requires for a request

If-Match Gets the document if the entity tag matches the current entity tag for the document[a]

If-Modified-Since Restricts the request unless the resource has been modified since the specified date

If-None-Match Gets the document if the entity tags supplied do not match those of the current document

If-Range Allows a conditional request for a range of a document

If-Unmodified-Since Restricts the request unless the resource has not been modified since the specified date

Range Requests a specific range of a resource, if the server supports range requests[b]

[a] See Chapter 7 for more on entity tags. The tag is basically an identifier for a version of the resource.

[b] See “Range Requests” in Chapter 15 for more on the Range header.

Request security headers

HTTP natively supports a simple challenge/response authentication scheme for requests. It attempts to make transactions slightly more secure by requiring clients to authenticate themselves before getting access to certain resources. We discuss this challenge/response scheme in Chapter 14, along with other security schemes that have been implemented on top of HTTP. Table 3-16 lists the request security headers.

Table 3-16. Request security headers

Header Description

Authorization Contains the data the client is supplying to the server to authenticate itself

Cookie Used by clients to pass a token to the server; not a true security header, but it does have security implications[a]

Cookie2 Used to note the version of cookies a requestor supports; see “Version 1 (RFC 2965) Cookies” in Chapter 11

[a] The Cookie header is not defined in RFC 2616; it is discussed in detail in Chapter 11.

Proxy request headers

As proxies become increasingly common on the Internet, a few headers have been defined to help them function better. In Chapter 6, we discuss these headers in detail. Table 3-17 lists the proxy request headers.

Table 3-17. Proxy request headers

Header Description

Max-Forwards The maximum number of times a request should be forwarded to another proxy or gateway on its way to the origin server; used with the TRACE method[a]

Proxy-Authorization Same as Authorization, but used when authenticating with a proxy

Proxy-Connection Same as Connection, but used when establishing connections with a proxy

[a] See “Max-Forwards” in Chapter 6.

Response Headers

Response messages have their own set of response headers. Response headers provide clients with extra information, such as who is sending the response, the capabilities of the responder, or even special instructions regarding the response. These headers help the client deal with the response and make better requests in the future. Table 3-18 lists the response informational headers.

Table 3-18. Response informational headers

Header Description

Age How old the response is[a]

Public[b] A list of request methods the server supports for its resources

Retry-After A date or time to try back, if a resource is unavailable

Server The name and version of the server’s application software

Title[c] For HTML documents, the title as given by the HTML document source

Warning A more detailed warning message than what is in the reason phrase

[a] Implies that the response has traveled through an intermediary, possibly from a proxy cache.

[b] The Public header is defined in RFC 2068 but does not appear in the latest HTTP definition (RFC 2616).

[c] The Title header is not defined in RFC 2616; see the original HTTP/1.0 draft definition (http://www.w3.org/Protocols/HTTP/HTTP2.html).

Negotiation headers

HTTP/1.1 provides servers and clients with the ability to negotiate for a resource if multiple representations are available, for instance, when there are both French and German translations of an HTML document on a server. Chapter 17 walks through negotiation in detail. Here are a few headers servers use to convey information about resources that are negotiable. Table 3-19 lists the negotiation headers.

Table 3-19. Negotiation headers

Header Description

Accept-Ranges The type of ranges that a server will accept for this resource

Vary A list of other headers that the server looks at and that may cause the response to vary; i.e., a list of headers the server looks at to pick which is the best version of a resource to send the client

Response security headers

You’ve already seen the request security headers, which carry the client’s answer, the “response” half of HTTP’s challenge/response authentication scheme. The response security headers carry the other half, the challenge. We talk about security in detail in Chapter 14. For now, here are the basic challenge headers. Table 3-20 lists the response security headers.

Table 3-20. Response security headers

Header Description

Proxy-Authenticate A list of challenges for the client from the proxy

Set-Cookie Not a true security header, but it has security implications; used to set a token on the client side that the server can use to identify the client[a]

Set-Cookie2 Similar to Set-Cookie, RFC 2965 Cookie definition; see “Version 1 (RFC 2965) Cookies” in Chapter 11

WWW-Authenticate A list of challenges for the client from the server

[a] Set-Cookie and Set-Cookie2 are extension headers that are also covered in Chapter 11.

Entity Headers

There are many headers to describe the payload of HTTP messages. Because both request and response messages can contain entities, these headers can appear in either type of message.

Entity headers provide a broad range of information about the entity and its content, from information about the type of the object to valid request methods that can be made on the resource. In general, entity headers tell the receiver of the message what it’s dealing with. Table 3-21 lists the entity informational headers.

Table 3-21. Entity informational headers

Header Description

Allow Lists the request methods that can be performed on this entity

Location Tells the client where the entity really is located; used in directing the receiver to a (possibly new) location (URL) for the resource

Content headers

The content headers provide specific information about the content of the entity, revealing its type, size, and other information useful for processing it. For instance, a web browser can look at the content type returned and know how to display the object. Table 3-22 lists the various content headers.

Table 3-22. Content headers

Header Description

Content-Base[a] The base URL for resolving relative URLs within the body

Content-Encoding Any encoding that was performed on the body

Content-Language The natural language that is best used to understand the body

Content-Length The length or size of the body

Content-Location Where the resource actually is located

Content-MD5 An MD5 checksum of the body

Content-Range The range of bytes that this entity represents from the entire resource

Content-Type The type of object that this body is

[a] The Content-Base header is not defined in RFC 2616.

Entity caching headers

The general caching headers provide directives about how or when to cache. The entity caching headers provide information about the entity being cached, for example, information needed to validate whether a cached copy of the resource is still valid and hints about how better to estimate when a cached resource may no longer be valid.

In Chapter 7, we dive deep into the heart of caching HTTP requests and responses. We will see these headers again there. Table 3-23 lists the entity caching headers.

Table 3-23. Entity caching headers

Header Description

ETag The entity tag associated with this entity[a]

Expires The date and time at which this entity will no longer be valid and will need to be fetched from the original source

Last-Modified The last date and time when this entity changed

[a] Entity tags are basically identifiers for a particular version of a resource.

For More Information

For more information, refer to:

http://www.w3.org/Protocols/rfc2616/rfc2616.txt

RFC 2616, “Hypertext Transfer Protocol,” by R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee.

HTTP Pocket Reference

Clinton Wong, O’Reilly & Associates, Inc.

http://www.w3.org/Protocols/

The W3C architecture page for HTTP.

CHAPTER 4

Connection Management

The HTTP specifications explain HTTP messages fairly well, but they don’t talk

much about HTTP connections, the critical plumbing that HTTP messages flow

through. If you’re a programmer writing HTTP applications, you need to under-

stand the ins and outs of HTTP connections and how to use them.

HTTP connection management has been a bit of a black art, learned as much from

experimentation and apprenticeship as from published literature. In this chapter,

you’ll learn about:

• How HTTP uses TCP connections

• Delays, bottlenecks and clogs in TCP connections

• HTTP optimizations, including parallel, keep-alive, and pipelined connections

• Dos and don’ts for managing connections

TCP Connections

Just about all of the world’s HTTP communication is carried over TCP/IP, a popular

layered set of packet-switched network protocols spoken by computers and network

devices around the globe. A client application can open a TCP/IP connection to a

server application, running just about anywhere in the world. Once the connection is

established, messages exchanged between the client’s and server’s computers will

never be lost, damaged, or received out of order.*

Say you want the latest power tools price list from Joe’s Hardware store:

http://www.joes-hardware.com:80/power-tools.html

When given this URL, your browser performs the steps shown in Figure 4-1. In Steps 1–3, the IP address and port number of the server are pulled from the URL. A TCP connection is made to the web server in Step 4, and a request message is sent across the connection in Step 5. The response is read in Step 6, and the connection is closed in Step 7.

* Though messages won’t be lost or corrupted, communication between client and server can be severed if a computer or network breaks. In this case, the client and server are notified of the communication breakdown.
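The seven steps can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names are made up, the request uses HTTP/1.0 so that the server closes the connection when the response is complete, and fetch is defined but not called because it needs a live network and a real server.

```python
import socket
from urllib.parse import urlsplit


def plan_request(url):
    """Steps 1 and 3: pull the hostname, port, and path out of the URL."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80   # fall back to the default HTTP port
    path = parts.path or "/"
    return host, port, path


def build_get(host, path):
    """The request message that is sent in Step 5."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def fetch(url):
    """Steps 1-7 end to end. Requires network access, so it is not run here."""
    host, port, path = plan_request(url)                      # Steps 1 and 3
    address = socket.gethostbyname(host)                      # Step 2: DNS lookup
    with socket.create_connection((address, port)) as conn:   # Step 4
        conn.sendall(build_get(host, path))                   # Step 5
        chunks = []
        while True:                                           # Step 6: read to end
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)                                   # Step 7: "with" closed it


print(plan_request("http://www.joes-hardware.com:80/power-tools.html"))
```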

TCP Reliable Data Pipes

HTTP connections really are nothing more than TCP connections, plus a few rules

about how to use them. TCP connections are the reliable connections of the Inter-

net. To send data accurately and quickly, you need to know the basics of TCP.*

TCP gives HTTP a reliable bit pipe. Bytes stuffed in one side of a TCP connection come out the other side correctly, and in the right order (see Figure 4-2).

* If you are trying to write sophisticated HTTP applications, and especially if you want them to be fast, you’ll want to learn a lot more about the internals and performance of TCP than we discuss in this chapter. We recommend the “TCP/IP Illustrated” books by W. Richard Stevens (Addison Wesley).

Figure 4-1. Web browsers talk to web servers over TCP connections

[Figure: the steps a browser performs for the URL http://www.joes-hardware.com:80/power-tools.html]

(1) The browser extracts the hostname (www.joes-hardware.com)

(2) The browser looks up the IP address for this hostname (DNS), getting 202.43.78.3

(3) The browser gets the port number (80)

(4) The browser makes a TCP connection to 202.43.78.3 port 80

(5) The browser sends an HTTP GET request message to the server

(6) The browser reads the HTTP response message from the server

(7) The browser closes the connection

TCP Streams Are Segmented and Shipped by IP Packets

TCP sends its data in little chunks called IP packets (or IP datagrams). In this way,

HTTP is the top layer in a “protocol stack” of “HTTP over TCP over IP,” as depicted

in Figure 4-3a. A secure variant, HTTPS, inserts a cryptographic encryption layer

(called TLS or SSL) between HTTP and TCP (Figure 4-3b).

When HTTP wants to transmit a message, it streams the contents of the message

data, in order, through an open TCP connection. TCP takes the stream of data,

chops up the data stream into chunks called segments, and transports the segments

across the Internet inside envelopes called IP packets (see Figure 4-4). This is all han-

dled by the TCP/IP software; the HTTP programmer sees none of it.

Each TCP segment is carried by an IP packet from one IP address to another IP

address. Each of these IP packets contains:

• An IP packet header (usually 20 bytes)

• A TCP segment header (usually 20 bytes)

• A chunk of TCP data (0 or more bytes)

The IP header contains the source and destination IP addresses, the size, and other

flags. The TCP segment header contains TCP port numbers, TCP control flags, and

numeric values used for data ordering and integrity checking.
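To see what segmentation costs, the following Python sketch chops a message into chunks and adds the two 20-byte headers to each one. It is a toy model, not how a TCP stack is written: the function names are invented, the 32-byte segment size is unrealistically small so that a short request splits in two, and real header sizes vary with options.

```python
IP_HEADER = 20    # bytes, the usual size quoted above
TCP_HEADER = 20   # bytes, the usual size quoted above


def segment(stream, max_segment_data):
    """Chop a byte stream into in-order chunks of at most max_segment_data bytes."""
    return [stream[i:i + max_segment_data]
            for i in range(0, len(stream), max_segment_data)]


def packet_sizes(stream, max_segment_data):
    """Size of each IP packet: both headers plus that segment's data."""
    return [IP_HEADER + TCP_HEADER + len(chunk)
            for chunk in segment(stream, max_segment_data)]


message = b"GET /index.html HTTP/1.1\r\nHost: www.joes-hardware.com\r\n\r\n"
chunks = segment(message, 32)

# Joining the chunks in order restores the original stream.
assert b"".join(chunks) == message

sizes = packet_sizes(message, 32)
overhead = (IP_HEADER + TCP_HEADER) * len(chunks)
print(f"{len(message)} bytes of data travel in {len(chunks)} packets, "
      f"sizes {sizes}, with {overhead} bytes of header overhead")
```

The smaller the chunks, the larger the share of each packet taken up by the 40 bytes of headers.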

Figure 4-2. TCP carries HTTP data in order, and without corruption

Figure 4-3. HTTP and HTTPS network protocol stacks

(a) HTTP: HTTP (application layer), TCP (transport layer), IP (network layer), network interfaces (data link layer)

(b) HTTPS: HTTP (application layer), TLS or SSL (security layer), TCP (transport layer), IP (network layer), network interfaces (data link layer)

Keeping TCP Connections Straight

A computer might have several TCP connections open at any one time. TCP keeps

all these connections straight through port numbers.

Port numbers are like employees’ phone extensions. Just as a company’s main phone

number gets you to the front desk and the extension gets you to the right employee,

the IP address gets you to the right computer and the port number gets you to the

right application. A TCP connection is distinguished by four values:

<source-IP-address, source-port, destination-IP-address, destination-port>

Together, these four values uniquely define a connection. Two different TCP connec-

tions are not allowed to have the same values for all four address components (but

different connections can have the same values for some of the components).

Figure 4-4. IP packets carry TCP segments, which carry chunks of the TCP data stream

[Figure: an IP packet wrapping a TCP segment. The IP header holds the version, header length (words), type of service (TOS), total datagram length (bytes), packet ID (16-bit number), flags, fragmentation offset, time to live (TTL), upper-level protocol, header checksum, source IP address, and destination IP address. The TCP segment header holds the source port, destination port, TCP sequence number, piggybacked acknowledgment, header length (words), reserved bits, the URG, ACK, PSH, RST, SYN, and FIN flags, window size, TCP checksum, and urgent pointer. It is followed by a chunk of the TCP data stream, here the start of the request “GET /index.html HTTP/1.1”.]

In Figure 4-5, there are four connections: A, B, C, and D. The relevant information for each connection is listed in Table 4-1.

Note that some of the connections share the same destination port number (C and D

both have destination port 80). Some of the connections have the same source IP

address (B and C). Some have the same destination IP address (A and B, and C and

D). But no two different connections share all four identical values.
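The same observation can be checked mechanically. This short Python sketch loads the four connections from Table 4-1 into tuples; the Connection type and the sharing helper are names invented for the example.

```python
from collections import namedtuple

Connection = namedtuple("Connection", "src_ip src_port dst_ip dst_port")

# Values copied from Table 4-1.
connections = {
    "A": Connection("209.1.32.34", 2034, "204.62.128.58", 4133),
    "B": Connection("209.1.32.35", 3227, "204.62.128.58", 4140),
    "C": Connection("209.1.32.35", 3105, "207.25.71.25", 80),
    "D": Connection("209.1.33.89", 5100, "207.25.71.25", 80),
}

# The full four-value tuple is different for every connection...
assert len(set(connections.values())) == len(connections)


# ...even though individual components repeat.
def sharing(field):
    """Group connection names that have the same value for one component."""
    groups = {}
    for name, conn in connections.items():
        groups.setdefault(getattr(conn, field), []).append(name)
    return sorted(names for names in groups.values() if len(names) > 1)


print("same destination port:", sharing("dst_port"))
print("same source IP:", sharing("src_ip"))
print("same destination IP:", sharing("dst_ip"))
```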

Table 4-1. TCP connection values

Connection Source IP address Source port Destination IP address Destination port

A 209.1.32.34 2034 204.62.128.58 4133

B 209.1.32.35 3227 204.62.128.58 4140

C 209.1.32.35 3105 207.25.71.25 80

D 209.1.33.89 5100 207.25.71.25 80

Figure 4-5. Four distinct TCP connections

[Figure: connections A through D drawn between the hosts and ports listed in Table 4-1.]

Programming with TCP Sockets

Operating systems provide different facilities for manipulating their TCP connections. Let’s take a quick look at one TCP programming interface, to make things concrete. Table 4-2 shows some of the primary interfaces provided by the sockets API. This sockets API hides all the details of TCP and IP from the HTTP programmer. The sockets API was first developed for the Unix operating system, but variants are now available for almost every operating system and language.

Table 4-2. Common socket interface functions for programming TCP connections

Sockets API call Description

s = socket(<parameters>) Creates a new, unnamed, unattached socket.

bind(s, <local IP:port>) Assigns a local port number and interface to the socket.

connect(s, <remote IP:port>) Establishes a TCP connection between a local socket and a remote host and port.

listen(s,...) Marks a local socket as legal to accept connections.

s2 = accept(s) Waits for someone to establish a connection to a local port.

n = read(s,buffer,n) Tries to read n bytes from the socket into the buffer.

n = write(s,buffer,n) Tries to write n bytes from the buffer into the socket.

close(s) Completely closes the TCP connection.

shutdown(s,<side>) Closes just the input or the output of the TCP connection.

getsockopt(s, ...) Reads the value of an internal socket configuration option.

setsockopt(s, ...) Changes the value of an internal socket configuration option.

The sockets API lets you create TCP endpoint data structures, connect these endpoints to remote server TCP endpoints, and read and write data streams. The TCP API hides all the details of the underlying network protocol handshaking and the segmentation and reassembly of the TCP data stream to and from IP packets.

In Figure 4-1, we showed how a web browser could download the power-tools.html web page from Joe’s Hardware store using HTTP. The pseudocode in Figure 4-6 sketches how we might use the sockets API to highlight the steps the client and server could perform to implement this HTTP transaction.

Figure 4-6. How TCP clients and servers communicate using the TCP sockets interface

Client

(C1) get IP address & port

(C2) create new socket (socket)

(C3) connect to server IP:port (connect)

(C4) connection successful

(C5) send HTTP request (write)

(C6) wait for HTTP response (read)

(C7) process HTTP response

(C8) close connection (close)

Server

(S1) create new socket (socket)

(S2) bind socket to port 80 (bind)

(S3) permit socket connections (listen)

(S4) wait for connection (accept)

(S5) application notified of connection

(S6) start reading request (read)

(S7) process HTTP request message

(S8) send back HTTP response (write)

(S9) close connection (close)

We begin with the web server waiting for a connection (Figure 4-6, S4). The client

determines the IP address and port number from the URL and proceeds to establish

a TCP connection to the server (Figure 4-6, C3). Establishing a connection can take a

while, depending on how far away the server is, the load on the server, and the con-

gestion of the Internet.

Once the connection is set up, the client sends the HTTP request (Figure 4-6, C5)

and the server reads it (Figure 4-6, S6). Once the server gets the entire request mes-

sage, it processes the request, performs the requested action (Figure 4-6, S7), and

writes the data back to the client. The client reads it (Figure 4-6, C6) and processes

the response data (Figure 4-6, C7).
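The pseudocode of Figure 4-6 maps closely onto Python’s socket module. The sketch below runs both sides in one process over the loopback interface, so it is an illustration rather than a real web server: it binds to port 0 (any free port) instead of port 80, serves exactly one connection, returns a fixed placeholder body, and uses function names that are made up for the example.

```python
import socket
import threading


def serve_once(listener):
    """Server side of Figure 4-6, steps S4 through S9, for one connection."""
    conn, _peer = listener.accept()              # S4/S5: wait for a client
    with conn:
        request = b""
        while b"\r\n\r\n" not in request:        # S6: read until the headers end
            data = conn.recv(1024)
            if not data:
                break
            request += data
        body = b"Power tools price list\n"       # S7: placeholder "processing"
        head = ("HTTP/1.0 200 OK\r\n"
                "Content-Type: text/plain\r\n"
                f"Content-Length: {len(body)}\r\n\r\n").encode("ascii")
        conn.sendall(head + body)                # S8: send back the response
    # S9: leaving the "with" block closes the connection


def run_transaction():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # S1
    listener.bind(("127.0.0.1", 0))              # S2: port 0 means any free port
    listener.listen(1)                           # S3
    host, port = listener.getsockname()          # C1: address the client will use
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)     # C2
    client.connect((host, port))                 # C3/C4
    client.sendall(b"GET /power-tools.html HTTP/1.0\r\n"
                   b"Host: www.joes-hardware.com\r\n\r\n")          # C5
    response = b""
    while True:                                  # C6: read until the server closes
        data = client.recv(1024)
        if not data:
            break
        response += data
    client.close()                               # C8
    server.join()
    listener.close()
    return response


response = run_transaction()
print(response.decode("ascii"))                  # C7: process the response
```

Note that neither side mentions SYN packets, segments, or acknowledgments; as described above, the sockets API hides all of that.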

TCP Performance Considerations

Because HTTP is layered directly on TCP, the performance of HTTP transactions

depends critically on the performance of the underlying TCP plumbing. This section

highlights some significant performance considerations of these TCP connections. By

understanding some of the basic performance characteristics of TCP, you’ll better

appreciate HTTP’s connection optimization features, and you’ll be able to design

and implement higher-performance HTTP applications.

This section requires some understanding of the internal details of the TCP proto-

col. If you are not interested in (or are comfortable with) the details of TCP perfor-

mance considerations, feel free to skip ahead to “HTTP Connection Handling.”

Because TCP is a complex topic, we can provide only a brief overview of TCP perfor-

mance here. Refer to the section “For More Information” at the end of this chapter

for a list of excellent TCP references.

HTTP Transaction Delays

Let’s start our TCP performance tour by reviewing what networking delays occur in

the course of an HTTP request. Figure 4-7 depicts the major connect, transfer, and

processing delays for an HTTP transaction.

Figure 4-7. Timeline of a serial HTTP transaction

[Figure: client and server timelines showing, in order, the DNS lookup, connect, request, process, response, and close phases.]

Notice that the transaction processing time can be quite small compared to the time

required to set up TCP connections and transfer the request and response messages.

Unless the client or server is overloaded or executing complex dynamic resources,

most HTTP delays are caused by TCP network delays.

There are several possible causes of delay in an HTTP transaction:

1. A client first needs to determine the IP address and port number of the web

server from the URI. If the hostname in the URI was not recently visited, it may

take tens of seconds to convert the hostname from a URI into an IP address

using the DNS resolution infrastructure.*

2. Next, the client sends a TCP connection request to the server and waits for the

server to send back a connection acceptance reply. Connection setup delay

occurs for every new TCP connection. This usually takes at most a second or

two, but it can add up quickly when hundreds of HTTP transactions are made.

3. Once the connection is established, the client sends the HTTP request over the

newly established TCP pipe. The web server reads the request message from the

TCP connection as the data arrives and processes the request. It takes time for

the request message to travel over the Internet and get processed by the server.

4. The web server then writes back the HTTP response, which also takes time.

The magnitude of these TCP network delays depends on hardware speed, the load of

the network and server, the size of the request and response messages, and the dis-

tance between client and server. The delays also are significantly affected by techni-

cal intricacies of the TCP protocol.

Performance Focus Areas

The remainder of this section outlines some of the most common TCP-related delays

affecting HTTP programmers, including the causes and performance impacts of:

• The TCP connection setup handshake

• TCP slow-start congestion control

• Nagle’s algorithm for data aggregation

• TCP’s delayed acknowledgment algorithm for piggybacked acknowledgments

• TIME_WAIT delays and port exhaustion

If you are writing high-performance HTTP software, you should understand each of

these factors. If you don’t need this level of performance optimization, feel free to

skip ahead.

* Luckily, most HTTP clients keep a small DNS cache of IP addresses for recently accessed sites. When the IP

address is already “cached” (recorded) locally, the lookup is instantaneous. Because most web browsing is

to a small number of popular sites, hostnames usually are resolved very quickly.


TCP Connection Handshake Delays

When you set up a new TCP connection, even before you send any data, the TCP

software exchanges a series of IP packets to negotiate the terms of the connection

(see Figure 4-8). These exchanges can significantly degrade HTTP performance if the

connections are used for small data transfers.

Here are the steps in the TCP connection handshake:

1. To request a new TCP connection, the client sends a small TCP packet (usually

40–60 bytes) to the server. The packet has a special “SYN” flag set, which means

it’s a connection request. This is shown in Figure 4-8a.

2. If the server accepts the connection, it computes some connection parameters

and sends a TCP packet back to the client, with both the “SYN” and “ACK”

flags set, indicating that the connection request is accepted (see Figure 4-8b).

3. Finally, the client sends an acknowledgment back to the server, letting it know

that the connection was established successfully (see Figure 4-8c). Modern TCP

stacks let the client send data in this acknowledgment packet.

The HTTP programmer never sees these packets—they are managed invisibly by the

TCP/IP software. All the HTTP programmer sees is a delay when creating a new TCP

connection.

The SYN/SYN+ACK handshake (Figure 4-8a and b) creates a measurable delay

when HTTP transactions do not exchange much data, as is commonly the case. The

TCP connect ACK packet (Figure 4-8c) often is large enough to carry the entire

HTTP request message,* and many HTTP server response messages fit into a single

IP packet (e.g., when the response is a small HTML file or a decorative graphic, or a

304 Not Modified response to a browser cache request).

Figure 4-8. TCP requires two packet transfers to set up the connection before it can send data

* IP packets are usually a few hundred bytes for Internet traffic and around 1,500 bytes for local traffic.

[Figure 4-8 timeline, client and server: connection handshake delay with (a) SYN and (b) SYN+ACK, followed by data transfer with GET / HTTP... and (d) HTTP/1.1 304 Not Modified]


The end result is that small HTTP transactions may spend 50% or more of their time

doing TCP setup. Later sections will discuss how HTTP allows reuse of existing con-

nections to eliminate the impact of this TCP setup delay.
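One way to see the handshake cost is to time how long the connect step takes before any HTTP bytes move. The following is a minimal Python sketch, not a benchmarking tool: the helper name time_tcp_connect and the throwaway loopback listener are assumptions made for this example, and loopback timings will be far smaller than real Internet round trips.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Return the seconds spent establishing a TCP connection (no data is sent)."""
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    elapsed = time.perf_counter() - start
    sock.close()
    return elapsed


def demo():
    # Hypothetical stand-in for a web server: a listener on an ephemeral
    # loopback port. The kernel completes the handshake from the listen queue.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    host, port = listener.getsockname()
    try:
        return time_tcp_connect(host, port)
    finally:
        listener.close()


if __name__ == "__main__":
    print(f"connect took {demo() * 1000:.3f} ms")
```

Pointing time_tcp_connect at a distant server and comparing the result with the total time of a small request gives a rough feel for the share of time spent on setup.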

Delayed Acknowledgments

Because the Internet itself does not guarantee reliable packet delivery (Internet rout-

ers are free to destroy packets at will if they are overloaded), TCP implements its

own acknowledgment scheme to guarantee successful data delivery.

Each TCP segment gets a sequence number and a data-integrity checksum. The

receiver of each segment returns small acknowledgment packets back to the sender

when segments have been received intact. If a sender does not receive an acknowl-

edgment within a specified window of time, the sender concludes the packet was

destroyed or corrupted and resends the data.

Because acknowledgments are small, TCP allows them to “piggyback” on outgoing

data packets heading in the same direction. By combining returning acknowledg-

ments with outgoing data packets, TCP can make more efficient use of the network.

To increase the chances that an acknowledgment will find a data packet headed in

the same direction, many TCP stacks implement a “delayed acknowledgment” algo-

rithm. Delayed acknowledgments hold outgoing acknowledgments in a buffer for a

certain window of time (usually 100–200 milliseconds), looking for an outgoing data

packet on which to piggyback. If no outgoing data packet arrives in that time, the

acknowledgment is sent in its own packet.

Unfortunately, the bimodal request-reply behavior of HTTP reduces the chances that

piggybacking can occur. There just aren’t many packets heading in the reverse direction

when you want them. Frequently, the delayed acknowledgment algorithms

introduce significant delays. Depending on your operating system, you may be able

to adjust or disable the delayed acknowledgment algorithm.

Before you modify any parameters of your TCP stack, be sure you know what you

are doing. Algorithms inside TCP were introduced to protect the Internet from

poorly designed applications. If you modify any TCP configurations, be absolutely

sure your application will not create the problems the algorithms were designed to

avoid.
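As one illustration of such an adjustment, Linux exposes a TCP_QUICKACK socket option that asks the stack to acknowledge immediately. The sketch below assumes Python on a platform that may or may not define the option; the helper name try_enable_quickack is hypothetical. On Linux the setting is not permanent, so treat this as an experiment under the caution above rather than a recommended default.

```python
import socket


def try_enable_quickack(sock):
    """Ask the stack for immediate ACKs where the platform supports it.

    Returns True if the option was applied, False otherwise.
    """
    option = getattr(socket, "TCP_QUICKACK", None)  # defined on Linux only
    if option is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, option, 1)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("quickack applied:", try_enable_quickack(s))
    s.close()
```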

TCP Slow Start

The performance of TCP data transfer also depends on the age of the TCP connec-

tion. TCP connections “tune” themselves over time, initially limiting the maximum

speed of the connection and increasing the speed over time as data is transmitted

successfully. This tuning is called TCP slow start, and it is used to prevent sudden

overloading and congestion of the Internet.


TCP slow start throttles the number of packets a TCP endpoint can have in flight at

any one time. Put simply, each time a packet is received successfully, the sender gets

permission to send two more packets. If an HTTP transaction has a large amount of

data to send, it cannot send all the packets at once. It must send one packet and wait

for an acknowledgment; then it can send two packets, each of which must be acknowl-

edged, which allows four packets, etc. This is called “opening the congestion window.”
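The doubling pattern described above can be sketched as a small calculation. This Python function follows the simplified model in the text (one packet, then two, then four); the function name and the initial_window parameter are assumptions for illustration, and real TCP stacks differ in their starting window and do not double forever.

```python
def round_trips_to_send(total_packets, initial_window=1):
    """Count the send/acknowledge rounds needed under the simplified
    slow-start model: the window doubles after each acknowledged round."""
    window = initial_window
    sent = 0
    trips = 0
    while sent < total_packets:
        sent += window   # send a full window of packets
        window *= 2      # acknowledgments open the congestion window
        trips += 1
    return trips


if __name__ == "__main__":
    for n in (1, 3, 7, 8):
        print(n, "packets ->", round_trips_to_send(n), "round trips")
```

Under this model, a response of 8 packets needs four rounds on a fresh connection, which is why a connection that has already opened its window moves the same data sooner.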

Because of this congestion-control feature, new connections are slower than “tuned”

connections that already have exchanged a modest amount of data. Because tuned

connections are faster, HTTP includes facilities that let you reuse existing connec-

tions. We’ll talk about these HTTP “persistent connections” later in this chapter.

Nagle’s Algorithm and TCP_NODELAY

TCP has a data stream interface that permits applications to stream data of any size

to the TCP stack—even a single byte at a time! But because each TCP segment car-

ries at least 40 bytes of flags and headers, network performance can be degraded

severely if TCP sends large numbers of packets containing small amounts of data.*

Nagle’s algorithm (named for its creator, John Nagle) attempts to bundle up a large

amount of TCP data before sending a packet, aiding network efficiency. The algo-

rithm is described in RFC 896, “Congestion Control in IP/TCP Internetworks.”

Nagle’s algorithm discourages the sending of segments that are not full-size (a

maximum-size packet is around 1,500 bytes on a LAN, or a few hundred bytes

across the Internet). Nagle’s algorithm lets you send a non-full-size packet only if all

other packets have been acknowledged. If other packets are still in flight, the partial

data is buffered. This buffered data is sent only when pending packets are acknowl-

edged or when the buffer has accumulated enough data to send a full packet.†

Nagle’s algorithm causes several HTTP performance problems. First, small HTTP

messages may not fill a packet, so they may be delayed waiting for additional data

that will never arrive. Second, Nagle’s algorithm interacts poorly with delayed

acknowledgments—Nagle’s algorithm will hold up the sending of data until an

acknowledgment arrives, but the acknowledgment itself will be delayed 100–200

milliseconds by the delayed acknowledgment algorithm.‡

HTTP applications often disable Nagle’s algorithm to improve performance, by setting

the TCP_NODELAY parameter on their stacks. If you do this, you must ensure that

you write large chunks of data to TCP so you don’t create a flurry of small packets.
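A minimal Python sketch of this advice follows. It sets TCP_NODELAY on a socket and assembles the whole request in memory so it can be handed to TCP in a single write. The helper names open_nodelay_socket and build_request are hypothetical, and the request shown is only an example.

```python
import socket


def open_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def build_request(host, path="/"):
    """Build the complete request as one buffer, so that a single
    sendall() call hands TCP one large chunk instead of many small ones."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    s = open_nodelay_socket()
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
    print(build_request("www.joes-hardware.com"))
    s.close()
```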

* Sending a storm of single-byte packets is called “sender silly window syndrome.” This is inefficient, anti-

social, and can be disruptive to other Internet traffic.

† Several variations of this algorithm exist, including timeouts and acknowledgment logic changes, but the

basic algorithm causes buffering of data smaller than a TCP segment.

‡ These problems can become worse when using pipelined connections (described later in this chapter),

because clients may have several messages to send to the same server and do not want delays.


TIME_WAIT Accumulation and Port Exhaustion

TIME_WAIT port exhaustion is a serious performance problem that affects perfor-

mance benchmarking but is relatively uncommon in real deployments. It warrants

special attention because most people involved in performance benchmarking even-

tually run into this problem and get unexpectedly poor performance.

When a TCP endpoint closes a TCP connection, it maintains in memory a small con-

trol block recording the IP addresses and port numbers of the recently closed con-

nection. This information is maintained for a short time, typically around twice the

estimated maximum segment lifetime (called “2MSL”; often two minutes*), to make

sure a new TCP connection with the same addresses and port numbers is not cre-

ated during this time. This prevents any stray duplicate packets from the previous

connection from accidentally being injected into a new connection that has the same

addresses and port numbers. In practice, this algorithm prevents two connections

with the exact same IP addresses and port numbers from being created, closed, and

recreated within two minutes.

Today’s higher-speed routers make it extremely unlikely that a duplicate packet will

show up on a server’s doorstep minutes after a connection closes. Some operating

systems set 2MSL to a smaller value, but be careful about overriding this value. Pack-

ets do get duplicated, and TCP data will be corrupted if a duplicate packet from a

past connection gets inserted into a new stream with the same connection values.

The 2MSL connection close delay normally is not a problem, but in benchmarking

situations, it can be. It’s common that only one or a few test load-generation com-

puters are connecting to a system under benchmark test, which limits the number of

client IP addresses that connect to the server. Furthermore, the server typically is lis-

tening on HTTP’s default TCP port, 80. These circumstances limit the available

combinations of connection values, at a time when port numbers are blocked from

reuse by TIME_WAIT.

In a pathological situation with one client and one web server, of the four values that

make up a TCP connection:

<source-IP-address, source-port, destination-IP-address, destination-port>

three of them are fixed—only the source port is free to change:

<client-IP, source-port, server-IP, 80>

Each time the client connects to the server, it gets a new source port in order to have

a unique connection. But because a limited number of source ports are available

(say, 60,000) and no connection can be reused for 2MSL seconds (say, 120 seconds),

this limits the connect rate to 60,000 / 120 = 500 transactions/sec. If you keep

making optimizations, and your server doesn’t get faster than about 500 transactions/sec,

make sure you are not experiencing TIME_WAIT port exhaustion. You

can fix this problem by using more client load-generator machines or making sure

the client and server rotate through several virtual IP addresses to add more connection

combinations.

* The 2MSL value of two minutes is historical. Long ago, when routers were much slower, it was estimated

that a duplicate copy of a packet might be able to remain queued in the Internet for up to a minute before

being destroyed. Today, the maximum segment lifetime is much smaller.

Even if you do not suffer port exhaustion problems, be careful about having large

numbers of open connections or large numbers of control blocks allocated for con-

nection in wait states. Some operating systems slow down dramatically when there

are numerous open connections or control blocks.
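The connect-rate ceiling estimated earlier in this section (60,000 source ports divided by a 120-second wait) can be written as a small calculation. This Python sketch is a back-of-the-envelope model only: the function name and the client_ips and server_ips parameters are assumptions, and it treats every address and port combination as equally usable.

```python
def max_connect_rate(source_ports, time_wait_seconds, client_ips=1, server_ips=1):
    """Upper bound on new connections per second when each
    <client-IP, source-port, server-IP, 80> combination is blocked
    from reuse for time_wait_seconds after it closes."""
    combinations = source_ports * client_ips * server_ips
    return combinations / time_wait_seconds


if __name__ == "__main__":
    print(max_connect_rate(60_000, 120))                 # one client, one server
    print(max_connect_rate(60_000, 120, client_ips=4))   # four load generators
```

Adding load-generator addresses raises the ceiling in proportion, which matches the suggested fix of using more client machines or virtual IP addresses.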

HTTP Connection Handling

The first two sections of this chapter provided a fire-hose tour of TCP connections

and their performance implications. If you’d like to learn more about TCP network-

ing, check out the resources listed at the end of the chapter.

We’re going to switch gears now and get squarely back to HTTP. The rest of this

chapter explains the HTTP technology for manipulating and optimizing connec-

tions. We’ll start with the HTTP Connection header, an often misunderstood but

important part of HTTP connection management. Then we’ll talk about HTTP’s

connection optimization techniques.

The Oft-Misunderstood Connection Header

HTTP allows a chain of HTTP intermediaries between the client and the ultimate

origin server (proxies, caches, etc.). HTTP messages are forwarded hop by hop from

the client, through intermediary devices, to the origin server (or the reverse).

In some cases, two adjacent HTTP applications may want to apply a set of options to

their shared connection. The HTTP Connection header field has a comma-separated

list of connection tokens that specify options for the connection that aren’t propa-

gated to other connections. For example, a connection that must be closed after

sending the next message can be indicated by Connection: close.

The Connection header sometimes is confusing, because it can carry three different

types of tokens:

• HTTP header field names, listing headers relevant for only this connection

• Arbitrary token values, describing nonstandard options for this connection

• The value close, indicating the persistent connection will be closed when done

If a connection token contains the name of an HTTP header field, that header field

contains connection-specific information and must not be forwarded. Any header

fields listed in the Connection header must be deleted before the message is for-

warded. Placing a hop-by-hop header name in a Connection header is known as


“protecting the header,” because the Connection header protects against accidental

forwarding of the local header. An example is shown in Figure 4-9.

When an HTTP application receives a message with a Connection header, the

receiver parses and applies all options requested by the sender. It then deletes the

Connection header and all headers listed in the Connection header before forward-

ing the message to the next hop. In addition, there are a few hop-by-hop headers that

might not be listed as values of a Connection header, but must not be proxied. These

include Proxy-Authenticate, Proxy-Connection, Transfer-Encoding, and Upgrade.

For more about the Connection header, see Appendix C.
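The forwarding rule described above can be sketched in a few lines of Python. This is an illustrative helper, not code from any particular proxy: the function name, the list-of-tuples header representation, and the constant name are assumptions. The fixed set contains the hop-by-hop headers named in the text.

```python
# Hop-by-hop headers that must not be proxied even when they are not
# listed in the Connection header (the set named in the text).
ALWAYS_HOP_BY_HOP = {
    "proxy-authenticate",
    "proxy-connection",
    "transfer-encoding",
    "upgrade",
}


def strip_hop_by_hop(headers):
    """Return the headers that are safe to forward to the next hop.

    headers is a list of (name, value) tuples. The Connection header,
    every header it names, and the fixed hop-by-hop set are removed.
    """
    listed = set()
    for name, value in headers:
        if name.lower() == "connection":
            for token in value.split(","):
                token = token.strip().lower()
                if token:
                    listed.add(token)
    drop = listed | ALWAYS_HOP_BY_HOP | {"connection"}
    return [(name, value) for name, value in headers if name.lower() not in drop]


if __name__ == "__main__":
    response_headers = [
        ("Cache-control", "max-age=3600"),
        ("Connection", "meter, close, bill-my-credit-card"),
        ("Meter", "max-uses=3, max-refuses=6, dont-report"),
    ]
    print(strip_hop_by_hop(response_headers))
```

Tokens such as close and bill-my-credit-card name options rather than headers, so adding them to the drop set is harmless: no header with those names exists to be removed.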

Serial Transaction Delays

TCP performance delays can add up if the connections are managed naively. For

example, suppose you have a web page with three embedded images. Your browser

needs to issue four HTTP transactions to display this page: one for the top-level

HTML and three for the embedded images. If each transaction requires a new con-

nection, the connection and slow-start delays can add up (see Figure 4-10).*

Figure 4-9. The Connection header allows the sender to specify connection-specific options

[Figure 4-9 labels: Client, Proxy, Server; example message:]

HTTP/1.1 200 OK
Cache-control: max-age=3600
Connection: meter, close, bill-my-credit-card
Meter: max-uses=3, max-refuses=6, dont-report

The Connection header says the Meter header should not be forwarded, the hypothetical “bill-my-credit-card” option applies, and the persistent connection will be closed when this transaction is done.

Figure 4-10. Four transactions (serial)

[Figure 4-10 timeline, client and server: Transactions 1 through 4 run one after another, each with its own Connect, Request, and Response]

* For the purpose of this example, assume all objects are roughly the same size and are hosted from the same

server, and that the DNS entry is cached, eliminating the DNS lookup time.


In addition to the real delay imposed by serial loading, there is also a psychological

perception of slowness when a single image is loading and nothing is happening on

the rest of the page. Users prefer multiple images to load at the same time.*

Another disadvantage of serial loading is that some browsers are unable to display

anything onscreen until enough objects are loaded, because they don’t know the

sizes of the objects until they are loaded, and they may need the size information to

decide where to position the objects on the screen. In this situation, the browser may

be making good progress loading objects serially, but the user may be faced with a

blank white screen, unaware that any progress is being made at all.†

Several current and emerging techniques are available to improve HTTP connection

performance. The next several sections discuss four such techniques:

Parallel connections
    Concurrent HTTP requests across multiple TCP connections

Persistent connections
    Reusing TCP connections to eliminate connect/close delays

Pipelined connections
    Concurrent HTTP requests across a shared TCP connection

Multiplexed connections
    Interleaving chunks of requests and responses (experimental)

Parallel Connections

As we mentioned previously, a browser could naively process each embedded object

serially by completely requesting the original HTML page, then the first embedded

object, then the second embedded object, etc. But this is too slow!

HTTP allows clients to open multiple connections and perform multiple HTTP

transactions in parallel, as sketched in Figure 4-11. In this example, four embedded

images are loaded in parallel, with each transaction getting its own TCP connection.‡

Parallel Connections May Make Pages Load Faster

Composite pages consisting of embedded objects may load faster if they take advantage

of the dead time and bandwidth limits of a single connection. The delays can be

overlapped, and if a single connection does not saturate the client’s Internet bandwidth,

the unused bandwidth can be allocated to loading additional objects.

* This is true even if loading multiple images at the same time is slower than loading images one at a time!

Users often perceive multiple-image loading as faster.

† HTML designers can help eliminate this “layout delay” by explicitly adding width and height attributes to

HTML tags for embedded objects such as images. Explicitly providing the width and height of the embedded

image allows the browser to make graphical layout decisions before it receives the objects from the server.

‡ The embedded components do not all need to be hosted on the same web server, so the parallel connections

can be established to multiple servers.

Figure 4-12 shows a timeline for parallel connections, which is significantly faster

than Figure 4-10. The enclosing HTML page is loaded first, and then the remaining

three transactions are processed concurrently, each with their own connection.*

Because the images are loaded in parallel, the connection delays are overlapped.

Figure 4-11. Each component of a page involves a separate HTTP transaction

[Figure 4-11 labels: Client, Internet, Server 1, Server 2]

Figure 4-12. Four transactions (parallel)

[Figure 4-12 timeline, client and server: Transaction 1 (Connect-1, Request-1, Response-1) completes first; Transactions 2, 3, and 4 then run over parallel connections (Connect-2 through Connect-4, Request-2 through Request-4, Response-2 through Response-4), usually with a small software delay between each connection]

* There will generally still be a small delay between each connection request due to software overheads, but

the connection requests and transfer times are mostly overlapped.

Parallel Connections Are Not Always Faster

Parallel connections may be faster, but they are not always faster.

When the client’s network bandwidth is scarce (for example, a browser connected to

the Internet through a 28.8-Kbps modem), most of the time might be spent just

transferring data. In this situation, a single HTTP transaction to a fast server could

easily consume all of the available modem bandwidth. If multiple objects are loaded

in parallel, each object will just compete for this limited bandwidth, so each object

will load proportionally slower, yielding little or no performance advantage.*

Also, a large number of open connections can consume a lot of memory and cause

performance problems of their own. Complex web pages may have tens or hundreds

of embedded objects. Clients might be able to open hundreds of connections, but

few web servers will want to do that, because they often are processing requests for

many other users at the same time. A hundred simultaneous users, each opening 100

connections, will put the burden of 10,000 connections on the server. This can cause

significant server slowdown. The same situation is true for high-load proxies.

In practice, browsers do use parallel connections, but they limit the total number of

parallel connections to a small number (often four). Servers are free to close exces-

sive connections from a particular client.
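The bandwidth-limited case can be made concrete with a toy calculation. The Python sketch below is a deliberately simple model, not a measurement: it assumes the client link is the only bottleneck and is shared equally, that every object is the same size, that the enclosing page loads first, and that the remaining connections overlap their setup delays. All function and parameter names are hypothetical.

```python
def serial_load_time(num_objects, setup_delay, object_bytes, bandwidth):
    """Every object pays its own setup delay, then transfers alone."""
    return num_objects * (setup_delay + object_bytes / bandwidth)


def parallel_load_time(num_objects, setup_delay, object_bytes, bandwidth):
    """The first object loads alone; the rest load in parallel,
    overlapping their setup delays but sharing the same bandwidth."""
    if num_objects == 0:
        return 0.0
    first = setup_delay + object_bytes / bandwidth
    rest = num_objects - 1
    if rest == 0:
        return first
    return first + setup_delay + rest * object_bytes / bandwidth


if __name__ == "__main__":
    modem = 28_800 / 8  # a 28.8-Kbps modem moves 3,600 bytes per second
    print(serial_load_time(4, 0.2, 18_000, modem))
    print(parallel_load_time(4, 0.2, 18_000, modem))
```

With four 18,000-byte objects and a 0.2-second setup delay, the model gives about 20.8 seconds serially and about 20.4 seconds in parallel: the transfer time dominates, so overlapping the setup delays saves very little.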

Parallel Connections May “Feel” Faster

Okay, so parallel connections don’t always make pages load faster. But even if they

don’t actually speed up the page transfer, as we said earlier, parallel connections

often make users feel that the page loads faster, because they can see progress being

made as multiple component objects appear onscreen in parallel.† Human beings

perceive that web pages load faster if there’s lots of action all over the screen, even if

a stopwatch actually shows the aggregate page download time to be slower!

Persistent Connections

Web clients often open connections to the same site. For example, most of the

embedded images in a web page often come from the same web site, and a signifi-

cant number of hyperlinks to other objects often point to the same site. Thus, an

application that initiates an HTTP request to a server likely will make more requests

to that server in the near future (to fetch the inline images, for example). This prop-

erty is called site locality.

For this reason, HTTP/1.1 (and enhanced versions of HTTP/1.0) allows HTTP

devices to keep TCP connections open after transactions complete and to reuse the

preexisting connections for future HTTP requests. TCP connections that are kept

open after transactions complete are called persistent connections. Nonpersistent

connections are closed after each transaction. Persistent connections stay open

across transactions, until either the client or the server decides to close them.

* In fact, because of the extra overhead from multiple connections, it’s quite possible that parallel connections

could take longer to load the entire page than serial downloads.

† This effect is amplified by the increasing use of progressive images that produce low-resolution approximations

of images first and gradually increase the resolution.

By reusing an idle, persistent connection that is already open to the target server, you

can avoid the slow connection setup. In addition, the already open connection can

avoid the slow-start congestion adaptation phase, allowing faster data transfers.
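Connection reuse can be observed with Python’s standard http.client and http.server modules. The sketch below is an illustration under stated assumptions: the tiny loopback server and its handler are hypothetical stand-ins for a real web server, and the check relies on the client’s local port staying the same when the TCP connection is reused.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections open

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_twice_on_one_connection():
    """Issue two requests on one HTTPConnection and return the client's
    local port observed after each response."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    ports = []
    try:
        for path in ("/a", "/b"):
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body before reusing the connection
            ports.append(conn.sock.getsockname()[1])
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return ports


if __name__ == "__main__":
    first, second = fetch_twice_on_one_connection()
    print("same TCP connection reused:", first == second)
```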

Persistent Versus Parallel Connections

As we’ve seen, parallel connections can speed up the transfer of composite pages.

But parallel connections have some disadvantages:

• Each transaction opens/closes a new connection, costing time and bandwidth.

• Each new connection has reduced performance because of TCP slow start.

• There is a practical limit on the number of open parallel connections.

Persistent connections offer some advantages over parallel connections. They reduce

the delay and overhead of connection establishment, keep the connections in a tuned

state, and reduce the potential number of open connections. However, persistent

connections need to be managed with care, or you may end up accumulating a large

number of idle connections, consuming local resources and resources on remote cli-

ents and servers.

Persistent connections can be most effective when used in conjunction with parallel

connections. Today, many web applications open a small number of parallel connec-

tions, each persistent. There are two types of persistent connections: the older

HTTP/1.0+ “keep-alive” connections and the modern HTTP/1.1 “persistent” con-

nections. We’ll look at both flavors in the next few sections.

HTTP/1.0+ Keep-Alive Connections

Many HTTP/1.0 browsers and servers were extended (starting around 1996) to sup-

port an early, experimental type of persistent connections called keep-alive connec-

tions. These early persistent connections suffered from some interoperability design

problems that were rectified in later revisions of HTTP/1.1, but many clients and

servers still use these earlier keep-alive connections.

Some of the performance advantages of keep-alive connections are visible in

Figure 4-13, which compares the timeline for four HTTP transactions over serial con-

nections against the same transactions over a single persistent connection. The time-

line is compressed because the connect and close overheads are removed.*

* Additionally, the request and response time might also be reduced because of elimination of the slow-start

phase. This performance benefit is not depicted in the figure.


Keep-Alive Operation

Keep-alive is deprecated and no longer documented in the current HTTP/1.1 specifi-

cation. However, keep-alive handshaking is still in relatively common use by brows-

ers and servers, so HTTP implementors should be prepared to interoperate with it.

We’ll take a quick look at keep-alive operation now. Refer to older versions of the

HTTP/1.1 specification (such as RFC 2068) for a more complete explanation of

keep-alive handshaking.

Clients implementing HTTP/1.0 keep-alive connections can request that a connec-

tion be kept open by including the Connection: Keep-Alive request header.

If the server is willing to keep the connection open for the next request, it will

respond with the same header in the response (see Figure 4-14). If there is no

Connection: Keep-Alive header in the response, the client assumes that the server does

not support keep-alive and that the server will close the connection when the

response message is sent back.
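The handshake just described can be sketched as two small Python helpers: one builds an HTTP/1.0 request that asks for keep-alive, and the other inspects the response headers to see whether the server agreed. The function names are hypothetical, and the parsing is deliberately minimal (it ignores folded header lines).

```python
def build_keepalive_request(host, path="/"):
    """Build an HTTP/1.0 request that asks the server to keep the connection open."""
    return (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Connection: Keep-Alive\r\n"
        "\r\n"
    ).encode("ascii")


def server_agreed_to_keepalive(response_head):
    """Return True if the response headers carry a Connection: Keep-Alive token.

    response_head is the status line and headers as text, lines separated by CRLF.
    """
    for line in response_head.split("\r\n")[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "connection":
            tokens = [token.strip().lower() for token in value.split(",")]
            if "keep-alive" in tokens:
                return True
    return False


if __name__ == "__main__":
    print(build_keepalive_request("www.joes-hardware.com", "/index.html"))
    head = (
        "HTTP/1.0 200 OK\r\n"
        "Content-type: text/html\r\n"
        "Content-length: 3104\r\n"
        "Connection: Keep-Alive\r\n"
    )
    print(server_agreed_to_keepalive(head))
```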

Keep-Alive Options

Note that the keep-alive headers are just requests to keep the connection alive. Clients

and servers do not need to agree to a keep-alive session if it is requested. They

can close idle keep-alive connections at any time and are free to limit the number of

transactions processed on a keep-alive connection.

Figure 4-13. Four transactions (serial versus persistent)

[Figure 4-13 timelines, client and server: (a) Serial connections, where Transactions 1 through 4 each begin with their own Connect before the Request and Response; (b) Persistent connection, where Requests 1 through 4 and Responses 1 through 4 share a single connection]

The keep-alive behavior can be tuned by comma-separated options specified in the

Keep-Alive general header:

• The timeout parameter is sent in a Keep-Alive response header. It estimates how

long the server is likely to keep the connection alive for. This is not a guarantee.

• The max parameter is sent in a Keep-Alive response header. It estimates how

many more HTTP transactions the server is likely to keep the connection alive

for. This is not a guarantee.

• The Keep-Alive header also supports arbitrary unprocessed attributes, primarily

for diagnostic and debugging purposes. The syntax is name [= value].

The Keep-Alive header is completely optional but is permitted only when Connec-

tion: Keep-Alive also is present. Here’s an example of a Keep-Alive response header

indicating that the server intends to keep the connection open for at most five more

transactions, or until it has sat idle for two minutes:

Connection: Keep-Alive

Keep-Alive: max=5, timeout=120
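A receiver might turn the Keep-Alive header value into a small dictionary before acting on it. The Python sketch below assumes the comma-separated name [= value] syntax described above; the function name and the choice to store valueless attributes as None are assumptions for illustration.

```python
def parse_keep_alive(value):
    """Parse a Keep-Alive header value such as 'max=5, timeout=120'.

    The max and timeout parameters become integers; any other attribute
    is kept as a string, or None when it has no value.
    """
    params = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, raw = item.partition("=")
        name = name.strip().lower()
        raw = raw.strip()
        if name in ("max", "timeout") and raw.isdigit():
            params[name] = int(raw)
        else:
            params[name] = raw if sep else None
    return params


if __name__ == "__main__":
    print(parse_keep_alive("max=5, timeout=120"))
```

Because both parameters are estimates rather than guarantees, a client should treat the parsed values as hints and still be ready for the connection to close earlier.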

Keep-Alive Connection Restrictions and Rules

Here are some restrictions and clarifications regarding the use of keep-alive

connections:

• Keep-alive does not happen by default in HTTP/1.0. The client must send a

Connection: Keep-Alive request header to activate keep-alive connections.

• The Connection: Keep-Alive header must be sent with all messages that want to

continue the persistence. If the client does not send a Connection: Keep-Alive

header, the server will close the connection after that request.

Figure 4-14. HTTP/1.0 keep-alive transaction header handshake

[Figure 4-14 labels: Client, Internet, Server; the client’s request is followed by the server’s response:]

GET /index.html HTTP/1.0
Host: www.joes-hardware.com
Connection: Keep-Alive

HTTP/1.0 200 OK
Content-type: text/html
Content-length: 3104
Connection: Keep-Alive
...

• Clients can tell if the server will close the connection after the response by

detecting the absence of the Connection: Keep-Alive response header.

• The connection can be kept open only if the length of the message’s entity body

can be determined without sensing a connection close—this means that the entity

body must have a correct Content-Length, have a multipart media type, or be

encoded with the chunked transfer encoding. Sending the wrong Content-Length

back on a keep-alive channel is bad, because the other end of the transaction will

not be able to accurately detect the end of one message and the start of another.

• Proxies and gateways must enforce the rules of the Connection header; the proxy

or gateway must remove any header fields named in the Connection header, and

the Connection header itself, before forwarding or caching the message.

• Formally, keep-alive connections should not be established with a proxy server

that isn’t guaranteed to support the Connection header, to prevent the problem

with dumb proxies described below. This is not always possible in practice.

• Technically, any Connection header fields (including Connection: Keep-Alive)

received from an HTTP/1.0 device should be ignored, because they may have

been forwarded mistakenly by an older proxy server. In practice, some clients

and servers bend this rule, although they run the risk of hanging on older proxies.

• Clients must be prepared to retry requests if the connection closes before they

receive the entire response, unless the request could have side effects if repeated.
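Two of the rules above (the response must carry Connection: Keep-Alive, and the body length must be determinable without a connection close) can be combined into one check. This Python sketch is a simplified illustration: the function name and the list-of-tuples header format are assumptions, and it does not attempt to validate that a Content-Length value is actually correct.

```python
def may_keep_open(headers):
    """Decide whether an HTTP/1.0 keep-alive connection can stay open
    after a response with the given (name, value) headers."""
    h = {name.lower(): value.strip().lower() for name, value in headers}

    # Rule: without Connection: Keep-Alive the server will close the connection.
    tokens = [token.strip() for token in h.get("connection", "").split(",")]
    if "keep-alive" not in tokens:
        return False

    # Rule: the end of the body must be detectable without a connection close.
    has_length = h.get("content-length", "").isdigit()
    is_multipart = h.get("content-type", "").startswith("multipart/")
    is_chunked = "chunked" in h.get("transfer-encoding", "")
    return has_length or is_multipart or is_chunked


if __name__ == "__main__":
    response = [
        ("Content-type", "text/html"),
        ("Content-length", "3104"),
        ("Connection", "Keep-Alive"),
    ]
    print(may_keep_open(response))
```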

Keep-Alive and Dumb Proxies

Let’s take a closer look at the subtle problem with keep-alive and dumb proxies. A

web client’s Connection: Keep-Alive header is intended to affect just the single TCP

link leaving the client. This is why it is named the “connection” header. If the client

is talking to a web server, the client sends a Connection: Keep-Alive header to tell the

server it wants keep-alive. The server sends a Connection: Keep-Alive header back if

it supports keep-alive and doesn’t send it if it doesn’t.

The Connection header and blind relays

The problem comes with proxies—in particular, proxies that don’t understand the

Connection header and don’t know that they need to remove the header before proxy-

ing it down the chain. Many older or simple proxies act as blind relays, tunneling bytes

from one connection to another, without specially processing the Connection header.

Imagine a web client talking to a web server through a dumb proxy that is acting as a

blind relay. This situation is depicted in Figure 4-15.

Here’s what’s going on in this figure:

1. In Figure 4-15a, a web client sends a message to the proxy, including the Connec-

tion: Keep-Alive header, requesting a keep-alive connection if possible. The client

waits for a response to learn if its request for a keep-alive channel was granted.


2. The dumb proxy gets the HTTP request, but it doesn’t understand the Connec-

tion header (it just treats it as an extension header). The proxy has no idea what

keep-alive is, so it passes the message verbatim down the chain to the server

(Figure 4-15b). But the Connection header is a hop-by-hop header; it applies to

only a single transport link and shouldn’t be passed down the chain. Bad things

are about to happen.

3. In Figure 4-15b, the relayed HTTP request arrives at the web server. When the

web server receives the proxied Connection: Keep-Alive header, it mistakenly

concludes that the proxy (which looks like any other client to the server) wants

to speak keep-alive! That’s fine with the web server—it agrees to speak keep-

alive and sends a Connection: Keep-Alive response header back in Figure 4-15c.

So, at this point, the web server thinks it is speaking keep-alive with the proxy

and will adhere to rules of keep-alive. But the proxy doesn’t know the first thing

about keep-alive. Uh-oh.

4. In Figure 4-15d, the dumb proxy relays the web server’s response message back to

the client, passing along the Connection: Keep-Alive header from the web server.

The client sees this header and assumes the proxy has agreed to speak keep-alive.

So at this point, both the client and server believe they are speaking keep-alive,

but the proxy they are talking to doesn’t know anything about keep-alive.

5. Because the proxy doesn’t know anything about keep-alive, it reflects all the

data it receives back to the client and then waits for the origin server to close the

connection. But the origin server will not close the connection, because it

believes the proxy explicitly asked the server to keep the connection open. So

the proxy will hang waiting for the connection to close.

6. When the client gets the response message back in Figure 4-15d, it moves right along to the next request, sending another request to the proxy on the keep-alive connection (see Figure 4-15e). Because the proxy never expects another request on the same connection, the request is ignored and the browser just spins, making no progress.

7. This miscommunication causes the browser to hang until the client or server times out the connection and closes it.*

Figure 4-15. Keep-alive doesn’t interoperate with proxies that don’t support Connection headers. [Diagram: client, dumb proxy, and server. In (a) through (d), Connection: Keep-Alive is relayed in both directions. The server won’t close the connection when done because it thinks it has been asked to speak keep-alive. In (e), the client’s second request on the keep-alive connection just hangs because the proxy never processes it.]

Proxies and hop-by-hop headers

To avoid this kind of proxy miscommunication, modern proxies must never proxy

the Connection header or any headers whose names appear inside the Connection

values. So if a proxy receives a Connection: Keep-Alive header, it shouldn’t proxy

either the Connection header or any headers named Keep-Alive.

In addition, there are a few hop-by-hop headers that might not be listed as values of

a Connection header, but must not be proxied or served as a cache response either.

These include Proxy-Authenticate, Proxy-Connection, Transfer-Encoding, and

Upgrade. For more information, refer back to “The Oft-Misunderstood Connection

Header.”
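The forwarding rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: headers are modeled as a list of (name, value) pairs, and the names strip_hop_by_hop and ALWAYS_HOP_BY_HOP are hypothetical, not part of any real proxy.

```python
# Hop-by-hop headers that the text says must not be proxied even when
# the Connection header does not list them.
ALWAYS_HOP_BY_HOP = {
    "proxy-authenticate",
    "proxy-connection",
    "transfer-encoding",
    "upgrade",
}


def strip_hop_by_hop(headers):
    """Return the (name, value) pairs that are safe to forward.

    Header names are compared case-insensitively.
    """
    # Collect every token named inside any Connection header value.
    named = set()
    for name, value in headers:
        if name.lower() == "connection":
            named.update(
                tok.strip().lower() for tok in value.split(",") if tok.strip()
            )

    # Drop Connection itself, everything it names, and the fixed list.
    blocked = {"connection"} | named | ALWAYS_HOP_BY_HOP
    return [(n, v) for n, v in headers if n.lower() not in blocked]
```

Given a request carrying Connection: Keep-Alive and a Keep-Alive header, the function drops both, so the next hop never sees a keep-alive request that was meant for the proxy.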

The Proxy-Connection Hack

Browser and proxy implementors at Netscape proposed a clever workaround to the

blind relay problem that didn’t require all web applications to support advanced ver-

sions of HTTP. The workaround introduced a new header called Proxy-Connection

and solved the problem of a single blind relay interposed directly after the client—

but not all other situations. Proxy-Connection is implemented by modern browsers

when proxies are explicitly configured and is understood by many proxies.

The idea is that dumb proxies get into trouble because they blindly forward hop-by-

hop headers such as Connection: Keep-Alive. Hop-by-hop headers are relevant only

for that single, particular connection and must not be forwarded. This causes trou-

ble when the forwarded headers are misinterpreted by downstream servers as

requests from the proxy itself to control its connection.

In the Netscape workaround, browsers send nonstandard Proxy-Connection exten-

sion headers to proxies, instead of officially supported and well-known Connection

headers. If the proxy is a blind relay, it relays the nonsense Proxy-Connection header

to the web server, which harmlessly ignores the header. But if the proxy is a smart

proxy (capable of understanding persistent connection handshaking), it replaces the

nonsense Proxy-Connection header with a Connection header, which is then sent to

the server, having the desired effect.
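To make the workaround concrete, here is a toy model of the two kinds of proxy. It is a sketch, not real proxy code: headers are a plain dict with canonical capitalization, the function names are hypothetical, and the smart proxy is assumed to always ask the server for keep-alive when the client asked for it (the text notes this step is optional).

```python
def relay_request(headers, smart):
    """Model how a proxy forwards request headers (a dict of name -> value).

    A blind relay (smart=False) forwards everything untouched. A smart proxy
    consumes Proxy-Connection and Connection itself and, if the client asked
    for keep-alive through Proxy-Connection, sends its own Connection header.
    """
    if not smart:
        return dict(headers)
    forwarded = {
        name: value
        for name, value in headers.items()
        if name.lower() not in ("proxy-connection", "connection")
    }
    if headers.get("Proxy-Connection", "").lower() == "keep-alive":
        forwarded["Connection"] = "Keep-Alive"
    return forwarded


def server_grants_keep_alive(headers):
    """The server reacts only to the standard Connection header."""
    return headers.get("Connection", "").lower() == "keep-alive"
```

With a blind relay, a request carrying Connection: Keep-Alive fools the server, while the same request carrying Proxy-Connection: Keep-Alive is ignored. With a smart proxy, the Proxy-Connection request results in a genuine keep-alive request to the server.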

Figure 4-16a–d shows how a blind relay harmlessly forwards Proxy-Connection headers to the web server, which ignores the header, causing no keep-alive connection to be established between the client and proxy or the proxy and server. The smart proxy in Figure 4-16e–h understands the Proxy-Connection header as a request to speak keep-alive, and it sends out its own Connection: Keep-Alive headers to establish keep-alive connections.

* There are many similar scenarios where failures occur due to blind relays and forwarded handshaking.

This scheme works in situations where there is only one proxy between the client and server. But if there is a smart proxy on either side of the dumb proxy, the problem will rear its ugly head again, as shown in Figure 4-17.

Furthermore, it is becoming quite common for “invisible” proxies to appear in networks, either as firewalls, intercepting caches, or reverse proxy server accelerators. Because these devices are invisible to the browser, the browser will not send them Proxy-Connection headers. It is critical that transparent web applications implement persistent connections correctly.

HTTP/1.1 Persistent Connections

HTTP/1.1 phased out support for keep-alive connections, replacing them with an

improved design called persistent connections. The goals of persistent connections are

the same as those of keep-alive connections, but the mechanisms behave better.

Figure 4-16. Proxy-Connection header fixes single blind relay. [Diagram, two cases. Dumb proxy, (a) through (d): the proxy forwards Proxy-Connection: Keep-Alive to the server, which does not recognize the header and ignores it; no Connection header comes back and no keep-alive connection is established. Smart proxy, (e) through (h): the proxy recognizes Proxy-Connection: Keep-Alive, agrees to talk keep-alive with the client, and may also (optionally) send its own Connection: Keep-Alive header to set up a keep-alive connection with the server.]

Unlike HTTP/1.0+ keep-alive connections, HTTP/1.1 persistent connections are

active by default. HTTP/1.1 assumes all connections are persistent unless otherwise

indicated. HTTP/1.1 applications have to explicitly add a Connection: close header

to a message to indicate that a connection should close after the transaction is com-

plete. This is a significant difference from previous versions of the HTTP protocol,

where keep-alive connections were either optional or completely unsupported.

An HTTP/1.1 client assumes an HTTP/1.1 connection will remain open after a

response, unless the response contains a Connection: close header. However, clients

and servers still can close idle connections at any time. Not sending Connection:

close does not mean that the server promises to keep the connection open forever.
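The difference in defaults between HTTP/1.0+ keep-alive and HTTP/1.1 persistent connections can be summarized as a small decision function. This is a sketch with hypothetical names; it models only what a message asks for, and either side remains free to close an idle connection later.

```python
def connection_tokens(headers):
    """Lower-cased tokens of the Connection header (empty set if absent)."""
    value = headers.get("Connection", "")
    return {tok.strip().lower() for tok in value.split(",") if tok.strip()}


def stays_open(version, headers):
    """Whether a message asks for the connection to stay open.

    version is a (major, minor) tuple such as (1, 1); headers is a dict
    that uses the canonical header name "Connection".
    """
    tokens = connection_tokens(headers)
    if "close" in tokens:
        return False
    if version >= (1, 1):
        return True  # HTTP/1.1: persistent unless told otherwise
    return "keep-alive" in tokens  # HTTP/1.0+: keep-alive is opt-in
```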

Persistent Connection Restrictions and Rules

Here are the restrictions and clarifications regarding the use of persistent connections:

• After sending a Connection: close request header, the client can’t send more

requests on that connection.

• If a client does not want to send another request on the connection, it should

send a Connection: close request header in the final request.

• The connection can be kept persistent only if all messages on the connection

have a correct, self-defined message length—i.e., the entity bodies must have

correct Content-Lengths or be encoded with the chunked transfer encoding.

Figure 4-17. Proxy-Connection still fails for deeper hierarchies of proxies. [Diagram, two cases. Client, dumb proxy, smart proxy, server, (a) through (f): a dumb proxy unwittingly advertises keep-alive to the browser and smart proxy. Client, smart proxy, dumb proxy, server, (g) through (l): a dumb proxy unwittingly advertises keep-alive to the smart proxy and server.]

• HTTP/1.1 proxies must manage persistent connections separately with clients

and servers—each persistent connection applies to a single transport hop.

• HTTP/1.1 proxy servers should not establish persistent connections with an HTTP/1.0 client (because of the problems of older proxies forwarding Connection headers) unless they know something about the capabilities of the client. This is, in practice, difficult, and many vendors bend this rule.

• Regardless of the values of Connection headers, HTTP/1.1 devices may close the

connection at any time, though servers should try not to close in the middle of

transmitting a message and should always respond to at least one request before

closing.

• HTTP/1.1 applications must be able to recover from asynchronous closes. Cli-

ents should retry the requests as long as they don’t have side effects that could

accumulate.

• Clients must be prepared to retry requests if the connection closes before they

receive the entire response, unless the request could have side effects if repeated.

• A single-user client should maintain at most two persistent connections to any server or proxy, to prevent the server from being overloaded. Because proxies may need more connections to a server to support concurrent users, a proxy should maintain at most 2N connections to any server or parent proxy, if there are N users trying to access the servers.
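The message-length rule in the list above lends itself to a quick check. The sketch below assumes headers are held in a dict with canonical names, and the function name is hypothetical. Messages that have no body at all, such as responses to HEAD requests, are outside its scope.

```python
import re


def has_self_defined_length(headers):
    """True if the body length is known without waiting for a close.

    Either the final transfer coding is "chunked", or Content-Length
    holds a plain decimal number.
    """
    codings = [
        tok.strip().lower()
        for tok in headers.get("Transfer-Encoding", "").split(",")
        if tok.strip()
    ]
    if codings and codings[-1] == "chunked":
        return True
    return re.fullmatch(r"[0-9]+", headers.get("Content-Length", "")) is not None
```

A connection on which any message fails this check cannot safely be kept persistent, because the receiver has no way to tell where that message ends except by watching for the close.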

Pipelined Connections

HTTP/1.1 permits optional request pipelining over persistent connections. This is a

further performance optimization over keep-alive connections. Multiple requests can

be enqueued before the responses arrive. While the first request is streaming across

the network to a server on the other side of the globe, the second and third requests

can get underway. This can improve performance in high-latency network condi-

tions, by reducing network round trips.

Figure 4-18a-c shows how persistent connections can eliminate TCP connection

delays and how pipelined requests (Figure 4-18c) can eliminate transfer latencies.
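A back-of-the-envelope model shows where the savings come from. The formulas below are a deliberately idealized sketch, not a measurement: they assume every transaction costs the same connection setup time, round-trip wait, and transfer time, and they ignore slow start, server processing, and bandwidth limits. All names and numbers are illustrative.

```python
def serial_time(n, connect, rtt, transfer):
    """Each transaction opens its own connection, then waits a round trip."""
    return n * (connect + rtt + transfer)


def persistent_time(n, connect, rtt, transfer):
    """One connection setup; requests still go out one at a time."""
    return connect + n * (rtt + transfer)


def pipelined_time(n, connect, rtt, transfer):
    """One setup and, ideally, one round-trip wait for the whole batch."""
    return connect + rtt + n * transfer


# Four transactions, times in milliseconds (made-up values).
n, connect, rtt, transfer = 4, 100, 100, 50
print(serial_time(n, connect, rtt, transfer))
print(persistent_time(n, connect, rtt, transfer))
print(pipelined_time(n, connect, rtt, transfer))
```

Under these assumptions the persistent connection saves the repeated setup cost, and pipelining additionally saves all but one of the round-trip waits.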

There are several restrictions for pipelining:

• HTTP clients should not pipeline until they are sure the connection is persistent.

• HTTP responses must be returned in the same order as the requests. HTTP mes-

sages are not tagged with sequence numbers, so there is no way to match

responses with requests if the responses are received out of order.

• HTTP clients must be prepared for the connection to close at any time and be prepared to redo any pipelined requests that did not finish. If the client opens a persistent connection and immediately issues 10 requests, the server is free to close the connection after processing only, say, 5 requests. The remaining 5 requests will fail, and the client must be willing to handle these premature closes and reissue the requests.

• HTTP clients should not pipeline requests that have side effects (such as POSTs). In general, on error, pipelining prevents clients from knowing which of a series of pipelined requests were executed by the server. Because nonidempotent requests such as POSTs cannot safely be retried, you run the risk of some methods never being executed in error conditions.
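Because responses carry no sequence numbers, a pipelining client has to pair them with requests purely by order. One way to do the bookkeeping is a first-in, first-out queue, as in this sketch. The Pipeline class and its method names are hypothetical, and requests and responses are stand-in strings rather than real messages.

```python
from collections import deque


class Pipeline:
    """Tracks requests sent on one connection and pairs responses by order."""

    def __init__(self):
        self._outstanding = deque()

    def send(self, request):
        """Record a request that has been written to the connection."""
        self._outstanding.append(request)

    def receive(self, response):
        """Pair the next response with the oldest unanswered request."""
        request = self._outstanding.popleft()
        return request, response

    def connection_closed(self):
        """Return the requests that never got a response, oldest first."""
        unanswered = list(self._outstanding)
        self._outstanding.clear()
        return unanswered
```

If the client sends 10 requests and the connection closes after 5 responses, connection_closed() returns the remaining 5 requests, which the client must then decide whether to reissue.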

Figure 4-18. Four transactions (pipelined connections). [Timing diagrams of client and server over time, four transactions each: (a) serial connections, with Connect-1 through Connect-4 each preceding its request and response; (b) persistent connection, with Request-1 through Request-4 and Response-1 through Response-4 on a single connection; (c) pipelined requests.]

The Mysteries of Connection Close

Connection management—particularly knowing when and how to close connec-

tions—is one of the practical black arts of HTTP. This issue is more subtle than

many developers first realize, and little has been written on the subject.

“At Will” Disconnection

Any HTTP client, server, or proxy can close a TCP transport connection at any time. The connections normally are closed at the end of a message,* but during error conditions, the connection may be closed in the middle of a header line or in other strange places.

This situation is common with pipelined persistent connections. HTTP applications

are free to close persistent connections after any period of time. For example, after a

persistent connection has been idle for a while, a server may decide to shut it down.

However, the server can never know for sure that the client on the other end of the

line wasn’t about to send data at the same time that the “idle” connection was being

shut down by the server. If this happens, the client sees a connection error in the

middle of writing its request message.

Content-Length and Truncation

Each HTTP response should have an accurate Content-Length header to describe the

size of the response body. Some older HTTP servers omit the Content-Length header

or include an erroneous length, depending on a server connection close to signify the

actual end of data.

When a client or proxy receives an HTTP response terminating in connection close,

and the actual transferred entity length doesn’t match the Content-Length (or there

is no Content-Length), the receiver should question the correctness of the length.

If the receiver is a caching proxy, the receiver should not cache the response (to mini-

mize future compounding of a potential error). The proxy should forward the ques-

tionable message intact, without attempting to “correct” the Content-Length, to

maintain semantic transparency.
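For a caching proxy, the guidance above reduces to a simple comparison made when a response ends with a connection close. The function below is a minimal sketch with a hypothetical name. It takes the declared Content-Length as an integer, or None if the header was absent, along with the number of body bytes that actually arrived.

```python
def should_cache_after_close(declared_length, bytes_received):
    """Decide whether a response that ended in a close is safe to cache.

    A missing or mismatched Content-Length makes the length questionable,
    so the response is forwarded unmodified but not cached.
    """
    if declared_length is None:
        return False
    return declared_length == bytes_received
```

Whatever the outcome, the proxy still forwards the message as received; only the decision to store it changes.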

Connection Close Tolerance, Retries, and Idempotency

Connections can close at any time, even in non-error conditions. HTTP applications have to be ready to properly handle unexpected closes. If a transport connection closes while the client is performing a transaction, the client should reopen the connection and retry one time, unless the transaction has side effects. The situation is worse for pipelined connections. The client can enqueue a large number of requests, but the origin server can close the connection, leaving numerous requests unprocessed and in need of rescheduling.

* Servers shouldn’t close a connection in the middle of a response unless client or network failure is suspected.

Side effects are important. When a connection closes after some request data was

sent but before the response is returned, the client cannot be 100% sure how much

of the transaction actually was invoked by the server. Some transactions, such as

GETting a static HTML page, can be repeated again and again without changing

anything. Other transactions, such as POSTing an order to an online book store,

shouldn’t be repeated, or you may risk multiple orders.

A transaction is idempotent if it yields the same result regardless of whether it is executed once or many times. Implementors can assume the GET, HEAD, PUT, DELETE, TRACE, and OPTIONS methods share this property.* Clients shouldn’t pipeline nonidempotent requests (such as POSTs). Otherwise, a premature termination of the transport connection could lead to indeterminate results. If you want to send a nonidempotent request, you should wait for the response status for the previous request.

Nonidempotent methods or sequences must not be retried automatically, although

user agents may offer a human operator the choice of retrying the request. For exam-

ple, most browsers will offer a dialog box when reloading a cached POST response,

asking if you want to post the transaction again.
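The retry rules from this section can be collected into one small policy function. This is a sketch under stated assumptions: the function name and the three result strings are invented for illustration, and the limit of one automatic retry follows the "retry one time" advice above.

```python
# Methods the text says implementors can assume are idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "TRACE", "OPTIONS"}


def retry_decision(method, retries_done, max_retries=1):
    """What a client might do after a connection closes mid-transaction.

    Returns "retry", "ask-user", or "give-up".
    """
    if method.upper() not in IDEMPOTENT_METHODS:
        # Never retry automatically; a human may choose to resend.
        return "ask-user"
    if retries_done < max_retries:
        return "retry"
    return "give-up"
```

A GET that fails once is retried, while a POST is handed back to the user, which mirrors the dialog box browsers show before resubmitting a form.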

Graceful Connection Close

TCP connections are bidirectional, as shown in Figure 4-19. Each side of a TCP con-

nection has an input queue and an output queue, for data being read or written.

Data placed in the output of one side will eventually show up on the input of the

other side.

Full and half closes

An application can close either or both of the TCP input and output channels. A close() sockets call closes both the input and output channels of a TCP connection. This is called a “full close” and is depicted in Figure 4-20a. You can use the shutdown() sockets call to close either the input or output channel individually. This is called a “half close” and is depicted in Figure 4-20b.

* Administrators who use GET-based dynamic forms should make sure the forms are idempotent.

Figure 4-19. TCP connections are bidirectional. [Diagram: the client and the server each have an input channel and an output channel; data written to one side’s output arrives on the other side’s input.]

TCP close and reset errors

Simple HTTP applications can use only full closes. But when applications start talk-

ing to many other types of HTTP clients, servers, and proxies, and when they start

using pipelined persistent connections, it becomes important for them to use half

closes to prevent peers from getting unexpected write errors.

In general, closing the output channel of your connection is always safe. The peer on

the other side of the connection will be notified that you closed the connection by

getting an end-of-stream notification once all the data has been read from its buffer.

Closing the input channel of your connection is riskier, unless you know the other

side doesn’t plan to send any more data. If the other side sends data to your closed

input channel, the operating system will issue a TCP “connection reset by peer” mes-

sage back to the other side’s machine, as shown in Figure 4-21. Most operating sys-

tems treat this as a serious error and erase any buffered data the other side has not

read yet. This is very bad for pipelined connections.

Say you have sent 10 pipelined requests on a persistent connection, and the

responses already have arrived and are sitting in your operating system’s buffer (but

the application hasn’t read them yet). Now say you send request #11, but the server

decides you’ve used this connection long enough, and closes it. Your request #11

will arrive at a closed connection and will reflect a reset back to you. This reset will

erase your input buffers.

Figure 4-20. Full and half close. [Diagram of client and server input and output channels: (a) server full close; (b) server output half close (graceful close).]

When you finally get to reading data, you will get a connection reset by peer error,

and the buffered, unread response data will be lost, even though much of it success-

fully arrived at your machine.

Graceful close

The HTTP specification counsels that when clients or servers want to close a connec-

tion unexpectedly, they should “issue a graceful close on the transport connection,”

but it doesn’t describe how to do that.

In general, applications implementing graceful closes will first close their output

channels and then wait for the peer on the other side of the connection to close its

output channels. When both sides are done telling each other they won’t be sending

any more data (i.e., closing output channels), the connection can be closed fully,

with no risk of reset.

Unfortunately, there is no guarantee that the peer implements or checks for half

closes. For this reason, applications wanting to close gracefully should half close

their output channels and periodically check the status of their input channels (look-

ing for data or for the end of the stream). If the input channel isn’t closed by the peer

within some timeout period, the application may force connection close to save

resources.
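The half-close-then-wait approach described above might look like this with Python’s standard socket module. It is a minimal sketch, not a definitive implementation: the function name, the timeout value, and the choice to return whatever data was drained are all assumptions made for illustration.

```python
import socket


def graceful_close(sock, timeout=2.0, bufsize=4096):
    """Half close our output, drain input until the peer closes, then close.

    Returns any bytes that arrived after our half close. If the peer has
    not closed its side within `timeout` seconds of silence, the
    connection is closed anyway to save resources.
    """
    sock.shutdown(socket.SHUT_WR)  # tell the peer we will send no more data
    sock.settimeout(timeout)
    drained = bytearray()
    try:
        while True:
            chunk = sock.recv(bufsize)
            if not chunk:  # end of stream: the peer closed its output
                break
            drained.extend(chunk)
    except OSError:
        pass  # timed out or reset: stop waiting for the peer
    finally:
        sock.close()  # now a full close is safe (or forced)
    return bytes(drained)
```

Calling shutdown() with SHUT_WR closes only the output channel, so data the peer has already sent can still be read instead of being discarded by a reset.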

Figure 4-21. Data arriving at closed connection generates “connection reset by peer” error. [Diagram: data sent to a closed input channel causes a RESET to be returned to the sender.]

For More Information

This completes our overview of the HTTP plumbing trade. Please refer to the following reference sources for more information about TCP performance and HTTP connection-management facilities.

HTTP Connections

http://www.ietf.org/rfc/rfc2616.txt

RFC 2616, “Hypertext Transfer Protocol -- HTTP/1.1,” is the official specification for HTTP/1.1; it explains the usage of HTTP header fields for implementing parallel, persistent, and pipelined HTTP connections. This document does not cover the proper use of the underlying TCP connections.

http://www.ietf.org/rfc/rfc2068.txt

RFC 2068 is the 1997 version of the HTTP/1.1 protocol. It contains explanation

of the HTTP/1.0+ Keep-Alive connections that is missing from RFC 2616.

http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt

This expired Internet draft, “HTTP Connection Management,” has some good

discussion of issues facing HTTP connection management.

HTTP Performance Issues

http://www.w3.org/Protocols/HTTP/Performance/

This W3C web page, entitled “HTTP Performance Overview,” contains a few

papers and tools related to HTTP performance and connection management.

http://www.w3.org/Protocols/HTTP/1.0/HTTPPerformance.html

This short memo by Simon Spero, “Analysis of HTTP Performance Problems,” is

one of the earliest (1994) assessments of HTTP connection performance. The

memo gives some early performance measurements of the effect of connection

setup, slow start, and lack of connection sharing.

ftp://gatekeeper.dec.com/pub/DEC/WRL/research-reports/WRL-TR-95.4.pdf

“The Case for Persistent-Connection HTTP.”

http://www.isi.edu/lsam/publications/phttp_tcp_interactions/paper.html

“Performance Interactions Between P-HTTP and TCP Implementations.”

http://www.sun.com/sun-on-net/performance/tcp.slowstart.html

“TCP Slow Start Tuning for Solaris” is a web page from Sun Microsystems that

talks about some of the practical implications of TCP slow start. It’s a useful

read, even if you are working with different operating systems.

TCP/IP

The following three books by W. Richard Stevens are excellent, detailed engineering

texts on TCP/IP. These are extremely useful for anyone using TCP:

TCP Illustrated, Volume I: The Protocols

W. Richard Stevens, Addison Wesley

UNIX Network Programming, Volume 1: Networking APIs

W. Richard Stevens, Prentice-Hall

UNIX Network Programming, Volume 2: The Implementation

W. Richard Stevens, Prentice-Hall

The following papers and specifications describe TCP/IP and features that affect its performance. Some of these specifications are over 20 years old and, given the worldwide success of TCP/IP, probably can be classified as historical treasures:

http://www.acm.org/sigcomm/ccr/archive/2001/jan01/ccr-200101-mogul.pdf

In “Rethinking the TCP Nagle Algorithm,” Jeff Mogul and Greg Minshall

present a modern perspective on Nagle’s algorithm, outline what applications

should and should not use the algorithm, and propose several modifications.

http://www.ietf.org/rfc/rfc2001.txt

RFC 2001, “TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast

Recovery Algorithms,” defines the TCP slow-start algorithm.

http://www.ietf.org/rfc/rfc1122.txt

RFC 1122, “Requirements for Internet Hosts—Communication Layers,” dis-

cusses TCP acknowledgment and delayed acknowledgments.

http://www.ietf.org/rfc/rfc896.txt

RFC 896, “Congestion Control in IP/TCP Internetworks,” was released by John

Nagle in 1984. It describes the need for TCP congestion control and introduces

what is now called “Nagle’s algorithm.”

http://www.ietf.org/rfc/rfc0813.txt

RFC 813, “Window and Acknowledgement Strategy in TCP,” is a historical

(1982) specification that describes TCP window and acknowledgment imple-

mentation strategies and provides an early description of the delayed acknowl-

edgment technique.

http://www.ietf.org/rfc/rfc0793.txt

RFC 793, “Transmission Control Protocol,” is Jon Postel’s classic 1981 defini-

tion of the TCP protocol.

PART II

HTTP Architecture

The six chapters of Part II highlight the HTTP server, proxy, cache, gateway, and

robot applications, which are the building blocks of web systems architecture:

• Chapter 5, Web Servers, gives an overview of web server architectures.

• Chapter 6, Proxies, describes HTTP proxy servers, which are intermediary servers

that connect HTTP clients and act as platforms for HTTP services and controls.

• Chapter 7, Caching, delves into the science of web caches—devices that improve

performance and reduce traffic by making local copies of popular documents.

• Chapter 8, Integration Points: Gateways, Tunnels, and Relays, explains applica-

tions that allow HTTP to interoperate with software that speaks different proto-

cols, including SSL encrypted protocols.

• Chapter 9, Web Robots, wraps up our tour of HTTP architecture with web clients.

• Chapter 10, HTTP-NG, covers future topics for HTTP—in particular, HTTP-NG.

CHAPTER 5

Web Servers

Web servers dish out billions of web pages a day. They tell you the weather, load up

your online shopping carts, and let you find long-lost high-school buddies. Web

servers are the workhorses of the World Wide Web. In this chapter, we:

• Survey the many different types of software and hardware web servers.

• Describe how to write a simple diagnostic web server in Perl.

• Explain how web servers process HTTP transactions, step by step.

Where it helps to make things concrete, our examples use the Apache web server and

its configuration options.

Web Servers Come in All Shapes and Sizes

A web server processes HTTP requests and serves responses. The term “web server”

can refer either to web server software or to the particular device or computer dedi-

cated to serving the web pages.

Web servers come in all flavors, shapes, and sizes. There are trivial 10-line Perl script web servers, 50-MB secure commerce engines, and tiny servers-on-a-card. But whatever the functional differences, all web servers receive HTTP requests for resources and serve content back to the clients (look back to Figure 1-5).

Web Server Implementations

Web servers implement HTTP and the related TCP connection handling. They also

manage the resources served by the web server and provide administrative features to

configure, control, and enhance the web server.

The web server logic implements the HTTP protocol, manages web resources, and provides web server administrative capabilities. The web server logic shares responsibilities for managing TCP connections with the operating system. The underlying operating system manages the hardware details of the underlying computer system and provides TCP/IP network support, filesystems to hold web resources, and process management to control current computing activities.

Web servers are available in many forms:

• You can install and run general-purpose software web servers on standard com-

puter systems.

• If you don’t want the hassle of installing software, you can purchase a web server

appliance, in which the software comes preinstalled and preconfigured on a

computer, often in a snazzy-looking chassis.

• Given the miracles of microprocessors, some companies even offer embedded

web servers implemented in a small number of computer chips, making them

perfect administration consoles for consumer devices.

Let’s look at each of those types of implementations.

General-Purpose Software Web Servers

General-purpose software web servers run on standard, network-enabled computer

systems. You can choose open source software (such as Apache or W3C’s Jigsaw) or

commercial software (such as Microsoft’s and iPlanet’s web servers). Web server

software is available for just about every computer and operating system.

While there are tens of thousands of different kinds of web server programs (includ-

ing custom-crafted, special-purpose web servers), most web server software comes

from a small number of organizations.

In February 2002, the Netcraft survey (http://www.netcraft.com/survey/) showed three

vendors dominating the public Internet web server market (see Figure 5-1):

• The free Apache software powers nearly 60% of all Internet web servers.

• Microsoft web server makes up another 30%.

• Sun iPlanet servers comprise another 3%.

Figure 5-1. Web server market share as estimated by Netcraft’s automated survey

Take these numbers with a few grains of salt, however, as the Netcraft survey is commonly believed to exaggerate the dominance of Apache software. First, the survey counts servers independent of server popularity. Proxy server access studies from large ISPs suggest that the number of pages served from Apache servers is much less than 60% but still exceeds Microsoft and Sun iPlanet. Additionally, it is anecdotally believed that Microsoft and iPlanet servers are more popular than Apache inside corporate enterprises.

Web Server Appliances

Web server appliances are prepackaged software/hardware solutions. The vendor pre-

installs a software server onto a vendor-chosen computer platform and preconfig-

ures the software. Some examples of web server appliances include:

• Sun/Cobalt RaQ web appliances (http://www.cobalt.com)

• Toshiba Magnia SG10 (http://www.toshiba.com)

• IBM Whistle web server appliance (http://www.whistle.com)

Appliance solutions remove the need to install and configure software and often

greatly simplify administration. However, the web server often is less flexible and

feature-rich, and the server hardware is not easily repurposeable or upgradable.

Embedded Web Servers

Embedded servers are tiny web servers intended to be embedded into consumer prod-

ucts (e.g., printers or home appliances). Embedded web servers allow users to

administer their consumer devices using a convenient web browser interface.

Some embedded web servers can even be implemented in less than one square inch,

but they usually offer a minimal feature set. Two examples of very small embedded

web servers are:

• IPic match-head sized web server (http://www-ccs.cs.umass.edu/~shri/iPic.html)

• NetMedia SitePlayer SP1 Ethernet Web Server (http://www.siteplayer.com)

A Minimal Perl Web Server

If you want to build a full-featured HTTP server, you have some work to do. The

core of the Apache web server has over 50,000 lines of code, and optional processing

modules make that number much bigger.

All this software is needed to support HTTP/1.1 features: rich resource support, vir-

tual hosting, access control, logging, configuration, monitoring, and performance

features. That said, you can create a minimally functional HTTP server in under 30

lines of Perl. Let’s take a look.

Example 5-1 shows a tiny Perl program called type-o-serve. This program is a useful

diagnostic tool for testing interactions with clients and proxies. Like any web server,

type-o-serve waits for an HTTP connection. As soon as type-o-serve gets the request

message, it prints the message on the screen; then it waits for you to type (or paste)

in a response message, which is sent back to the client. This way, type-o-serve pre-

tends to be a web server, records the exact HTTP request messages, and allows you

to send back any HTTP response message.

This simple type-o-serve utility doesn’t implement most HTTP functionality, but it is

a useful tool to generate server response messages the same way you can use Telnet

to generate client request messages (refer back to Example 5-1). You can download

the type-o-serve program from http://www.http-guide.com/tools/type-o-serve.pl.

Example 5-1. type-o-serve—a minimal Perl web server used for HTTP debugging

    #!/usr/bin/perl
    use Socket;
    use Carp;
    use FileHandle;

    # (1) use port 8080 by default, unless overridden on command line
    $port = (@ARGV ? $ARGV[0] : 8080);

    # (2) create local TCP socket and set it to listen for connections
    $proto = getprotobyname('tcp');
    socket(S, PF_INET, SOCK_STREAM, $proto) || die;
    setsockopt(S, SOL_SOCKET, SO_REUSEADDR, pack("l", 1)) || die;
    bind(S, sockaddr_in($port, INADDR_ANY)) || die;
    listen(S, SOMAXCONN) || die;

    # (3) print a startup message
    printf(" <<<Type-O-Serve Accepting on Port %d>>>\n\n",$port);

    while (1)
    {
        # (4) wait for a connection C
        $cport_caddr = accept(C, S);
        ($cport,$caddr) = sockaddr_in($cport_caddr);
        C->autoflush(1);

        # (5) print who the connection is from
        $cname = gethostbyaddr($caddr,AF_INET);
        printf(" <<<Request From '%s'>>>\n",$cname);

        # (6) read request msg until blank line, and print on screen
        while ($line = <C>)
        {
            print $line;
            if ($line =~ /^\r/) { last; }
        }

        # (7) prompt for response message, and input response lines,
        #     sending response lines to client, until solitary "."
        printf(" <<<Type Response Followed by '.'>>>\n");
        while ($line = <STDIN>)
        {
            $line =~ s/\r//;
            $line =~ s/\n//;
            if ($line =~ /^\./) { last; }
            print C $line . "\r\n";
        }
        close(C);
    }

Figure 5-2 shows how the administrator of Joe’s Hardware store might use type-o-serve to test HTTP communication:

• First, the administrator starts the type-o-serve diagnostic server, listening on a particular port. Because Joe’s Hardware store already has a production web server listening on port 80, the administrator starts the type-o-serve server on port 8080 (you can pick any unused port) with this command line:

    % type-o-serve.pl 8080

• Once type-o-serve is running, you can point a browser to this web server. In Figure 5-2, we browse to http://www.joes-hardware.com:8080/foo/bar/blah.txt.

• The type-o-serve program receives the HTTP request message from the browser and prints the contents of the HTTP request message on screen. The type-o-serve diagnostic tool then waits for the user to type in a simple response message, followed by a period on a blank line.

• type-o-serve sends the HTTP response message back to the browser, and the browser displays the body of the response message.

What Real Web Servers Do

The Perl server we showed in Example 5-1 is a trivial example web server. State-of-the-art commercial web servers are much more complicated, but they do perform several common tasks, as shown in Figure 5-3:

1. Set up connection: accept a client connection, or close if the client is unwanted.

2. Receive request: read an HTTP request message from the network.

3. Process request: interpret the request message and take action.

4. Access resource: access the resource specified in the message.

5. Construct response: create the HTTP response message with the right headers.

6. Send response: send the response back to the client.

7. Log transaction: place notes about the completed transaction in a log file.

The next seven sections highlight how web servers perform these basic tasks.

Figure 5-2. The type-o-serve utility lets you type in server responses to send back to clients

[Figure: the browser sends an HTTP request message to type-o-serve, which prints it on screen; the administrator then types an HTTP response message, which type-o-serve sends back to the browser. The type-o-serve dialog shown in the figure is:]

    % ./type-o-serve.pl 8080
    <<<Type-O-Serve Accepting on Port 8080>>>
    <<<Request From 'home-44-027.extranet.inktomi.com'>>>
    GET /foo/bar/blah.txt HTTP/1.1
    Accept: */*
    Accept-language: en-us
    Accept-encoding: gzip, deflate
    User-agent: Mozilla/4.0
    Host: www.joes-hardware.com:8080
    Connection: Keep-alive

    <<<Type Response Followed by '.'>>>
    HTTP/1.0 200 OK
    Connection: close
    Content-type: text/plain

    Hi there!

Figure 5-3. Steps of a basic web server request

[Figure: a client connects through the network interface and the operating system's TCP/IP network stack to the HTTP server software process in user space. The server process performs (1) set up connection, (2) receive request, (3) process request, (4) access resource (from object storage), (5) create response, (6) send response, and (7) log transaction.]

Step 1: Accepting Client Connections

If a client already has a persistent connection open to the server, it can use that connec-

tion to send its request. Otherwise, the client needs to open a new connection to the

server (refer back to Chapter 4 to review HTTP connection-management technology).

Handling New Connections

When a client requests a TCP connection to the web server, the web server establishes the connection and determines which client is on the other side of the connection, extracting the IP address from the TCP connection.* Once a new connection is established and accepted, the server adds the new connection to its list of existing web server connections and prepares to watch for data on the connection.

The web server is free to reject and immediately close any connection. Some web

servers close connections because the client IP address or hostname is unauthorized

or is a known malicious client. Other identification techniques can also be used.
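The accept-then-filter step described above can be sketched in Python. This is only an illustration under stated assumptions: the blocklist `BLOCKED_NETWORKS` and the helper names `is_allowed` and `accept_one` are hypothetical, and a real server would load its access rules from configuration.

```python
import ipaddress
import socket

# Hypothetical blocklist (a documentation-only address range).
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

def is_allowed(client_ip, blocked=BLOCKED_NETWORKS):
    """Return False if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return not any(addr in net for net in blocked)

def accept_one(listener, connections):
    """Accept one IPv4 connection; keep it if allowed, otherwise close it."""
    # accept() returns the new socket plus the peer's (ip, port) pair.
    conn, (client_ip, _client_port) = listener.accept()
    if not is_allowed(client_ip):
        conn.close()              # reject and immediately close
        return None
    connections.append(conn)      # add to the list of open connections
    return conn
```

The address check is kept separate from the socket call so the policy can be exercised without a network.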

Client Hostname Identification

Most web servers can be configured to convert client IP addresses into client host-

names, using “reverse DNS.” Web servers can use the client hostname for detailed

access control and logging. Be warned that hostname lookups can take a very long

time, slowing down web transactions. Many high-capacity web servers either disable

hostname resolution or enable it only for particular content.
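As a rough sketch of a reverse DNS lookup, the Python function below converts an IP address to a hostname and falls back to the address itself when the lookup fails. The function name `client_hostname` and the injectable `resolver` parameter are assumptions made for illustration; the default resolver is the standard library's `socket.gethostbyaddr`.

```python
import socket

def client_hostname(ip, resolver=socket.gethostbyaddr):
    """Return the hostname for ip, or ip itself if the reverse lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:               # covers socket.herror and socket.gaierror
        return ip
```

Because the call blocks until DNS answers or times out, a server that uses something like this on every connection pays the delay the text warns about.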

You can enable hostname lookups in Apache with the HostnameLookups configuration directive. For example, the Apache configuration directives in Example 5-2 turn on hostname resolution for only HTML and CGI resources.

Example 5-2. Configuring Apache to look up hostnames for HTML and CGI resources

    HostnameLookups off
    <Files ~ "\.(html|htm|cgi)$">
        HostnameLookups on
    </Files>

* Different operating systems have different interfaces and data structures for manipulating TCP connections. In Unix environments, the TCP connection is represented by a socket, and the IP address of the client can be found from the socket using the getpeername call.

Determining the Client User Through ident

Some web servers also support the IETF ident protocol. The ident protocol lets servers find out what username initiated an HTTP connection. This information is particularly useful for web server logging: the second field of the popular Common Log Format contains the ident username of each HTTP request.*

If a client supports the ident protocol, the client listens on TCP port 113 for ident

requests. Figure 5-4 shows how the ident protocol works. In Figure 5-4a, the client

opens an HTTP connection. The server then opens its own connection back to the

client’s identd server port (113), sends a simple request asking for the username cor-

responding to the new connection (specified by client and server port numbers), and

retrieves from the client the response containing the username.
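The request and response exchanged on the ident connection are single lines of text. The Python sketch below builds a query and extracts the username from a reply, following the line format defined in RFC 1413; the function names are hypothetical, and opening the connection to port 113 is left out.

```python
def build_ident_query(client_port, server_port):
    """Build the query line sent to port 113 on the client machine.

    The port on the machine being queried (the HTTP client) comes first.
    """
    return f"{client_port}, {server_port}\r\n".encode("ascii")

def parse_ident_reply(reply):
    """Return the username from a USERID reply, or None for any other reply."""
    text = reply.decode("ascii", "replace").strip()
    fields = [field.strip() for field in text.split(":", 3)]
    if len(fields) == 4 and fields[1].upper() == "USERID":
        return fields[3]
    return None                   # ERROR replies and malformed lines
```

For the connection in Figure 5-4, the query would be `4236, 80` and the reply `4236, 80 : USERID : UNIX : mary`.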

ident can work inside organizations, but it does not work well across the public Inter-

net for many reasons, including:

• Many client PCs don’t run the identd Identification Protocol daemon software.

• The ident protocol significantly delays HTTP transactions.

• Many firewalls won’t permit incoming ident traffic.

• The ident protocol is insecure and easy to fabricate.

• The ident protocol doesn’t support virtual IP addresses well.

• There are privacy concerns about exposing client usernames.

You can tell Apache web servers to use ident lookups with Apache’s IdentityCheck on

directive. If no ident information is available, Apache will fill ident log fields with

hyphens (-). Common Log Format log files typically contain hyphens in the second

field because no ident information is available.

Step 2: Receiving Request Messages

As the data arrives on connections, the web server reads out the data from the net-

work connection and parses out the pieces of the request message (Figure 5-5).

* This Common Log Format ident field is called “rfc931,” after an outdated version of the RFC defining the

ident protocol (the updated ident specification is documented by RFC 1413).

Figure 5-4. Using the ident protocol to determine HTTP client username

[Figure: (a) Mary establishes a new HTTP connection from her port 4236 to the web server's port 80. (b) The server establishes an ident connection back to port 113 on Mary's machine. The server then sends the request "4236,80" over that connection, and (d) the client returns the ident response "4236,80: USERID: UNIX: mary".]

When parsing the request message, the web server:

• Parses the request line looking for the request method, the specified resource identifier (URI), and the version number,* each separated by a single space, and ending with a carriage-return line-feed (CRLF) sequence†

• Reads the message headers, each ending in CRLF

• Detects the end-of-headers blank line, ending in CRLF (if present)

• Reads the request body, if any (length specified by the Content-Length header)

When parsing request messages, web servers receive input data erratically from the

network. The network connection can stall at any point. The web server needs to

read data from the network and temporarily store the partial message data in mem-

ory until it receives enough data to parse it and make sense of it.
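The parsing steps and the need to buffer partial data can be sketched together. The hypothetical Python function below is called each time more bytes arrive and returns None until a whole message is available. It assumes strict CRLF line endings and a three-part request line, so it does not handle the LF-only or HTTP/0.9 cases mentioned in the footnotes.

```python
def parse_request(buffer):
    """Parse a complete request from bytes; return None if more data is needed."""
    head, separator, rest = buffer.partition(b"\r\n\r\n")
    if not separator:
        return None                       # end-of-headers blank line not seen yet
    lines = head.decode("iso-8859-1").split("\r\n")
    method, uri, version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers.get("content-length", 0))
    if len(rest) < length:
        return None                       # body is still arriving
    return {"method": method, "uri": uri, "version": version,
            "headers": headers, "body": rest[:length]}
```

Header names are lowercased on the way in so later lookups do not depend on how the client capitalized them.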

Internal Representations of Messages

Some web servers also store the request messages in internal data structures that

make the message easy to manipulate. For example, the data structure might con-

tain pointers and lengths of each piece of the request message, and the headers might

be stored in a fast lookup table so the specific values of particular headers can be

accessed quickly (Figure 5-6).

Connection Input/Output Processing Architectures

High-performance web servers support thousands of simultaneous connections. These connections let the web server communicate with clients around the world, each with one or more connections open to the server. Some of these connections may be sending requests rapidly to the web server, while other connections trickle requests slowly or infrequently, and still others are idle, waiting quietly for some future activity.

Figure 5-5. Reading a request message from a connection

[Figure: the server has so far read "GET /specials/saw-blade.gif HTTP/1.0CRLF", "Accept: image/gifCRLF", and "Host: www.j" from the connection; the rest of the message ("oes-hardware.com" followed by CRLF CRLF) is still in transit from the client across the Internet.]

* The initial version of HTTP, called HTTP/0.9, does not support version numbers. Some web servers support missing version numbers, interpreting the message as an HTTP/0.9 request.

† Many web servers support LF or CRLF as end-of-line sequences, because some clients mistakenly send LF as the end-of-line terminator.

Web servers constantly watch for new web requests, because requests can arrive at

any time. Different web server architectures service requests in different ways, as

Figure 5-7 illustrates:

Single-threaded web servers (Figure 5-7a)

Single-threaded web servers process one request at a time until completion.

When the transaction is complete, the next connection is processed. This archi-

tecture is simple to implement, but during processing, all the other connections

are ignored. This creates serious performance problems and is appropriate only

for low-load servers and diagnostic tools like type-o-serve.

Multiprocess and multithreaded web servers (Figure 5-7b)

Multiprocess and multithreaded web servers dedicate multiple processes or higher-efficiency threads to process requests simultaneously.* The threads/processes may be created on demand or in advance.† Some servers dedicate a thread/process for every connection, but when a server processes hundreds, thousands, or even tens of thousands of simultaneous connections, the resulting number of processes or threads may consume too much memory or system resources. Thus, many multithreaded web servers put a limit on the maximum number of threads/processes.

* A process is an individual program flow of control, with its own set of variables. A thread is a faster, more efficient version of a process. Both threads and processes let a single program do multiple things at the same time. For simplicity of explanation, we treat processes and threads interchangeably. But, because of the performance differences, many high-performance servers are both multiprocess and multithreaded.

† Systems where threads are created in advance are called “worker pool” systems, because a set of threads waits in a pool for work to do.

Figure 5-6. Parsing a request message into a convenient internal representation

[Figure: the request message "GET /specials/saw-blade.gif HTTP/1.0", "Accept: image/gif", "Host: www.joes-hardware.com" is parsed into a structure with the fields method: 1; version: 1.0; uri: specials/saw-blade.gif; header count: 2; headers: (name: Host, value: www.joes-hardware.com) and (name: Accept, value: image/gif); body: -.]

Multiplexed I/O servers (Figure 5-7c)

To support large numbers of connections, many web servers adopt multiplexed

architectures. In a multiplexed architecture, all the connections are simulta-

neously watched for activity. When a connection changes state (e.g., when data

becomes available or an error condition occurs), a small amount of processing is

performed on the connection; when that processing is complete, the connection

is returned to the open connection list for the next change in state. Work is done

on a connection only when there is something to be done; threads and processes

are not tied up waiting on idle connections.

Multiplexed multithreaded web servers (Figure 5-7d)

Some systems combine multithreading and multiplexing to take advantage of

multiple CPUs in the computer platform. Multiple threads (often one per physi-

cal processor) each watch the open connections (or a subset of the open connec-

tions) and perform a small amount of work on each connection.
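To make the multiplexed approach concrete, here is a small Python sketch built on the standard `selectors` module. It shows one pass of an event loop that does a small amount of work only on connections that have data waiting. The function name is hypothetical, and uppercasing the bytes is a stand-in for real request processing.

```python
import selectors
import socket

def service_ready_connections(selector, timeout=0.1):
    """One pass of the event loop: do a little work on each ready connection."""
    serviced = 0
    for key, _events in selector.select(timeout):
        conn = key.fileobj
        data = conn.recv(4096)
        if data:
            conn.sendall(data.upper())     # stand-in for real request processing
            serviced += 1
        else:
            selector.unregister(conn)      # peer closed the connection
            conn.close()
    return serviced
```

A server would call this in a loop after registering each accepted socket with `selector.register(conn, selectors.EVENT_READ)`. Idle connections cost nothing per pass, which is the property the multiplexed architecture relies on.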

Figure 5-7. Web server input/output architectures

[Figure: four panels showing how threads/processes service connections: (a) single-threaded I/O architecture; (b) multithreaded I/O architecture; (c) multiplexed I/O architecture, in which a connection multiplexer watches a connection list; (d) multiplexed, multithreaded I/O architecture.]

Step 3: Processing Requests

Once the web server has received a request, it can process the request using the

method, resource, headers, and optional body.

Some methods (e.g., POST) require entity body data in the request message. Other methods (e.g., OPTIONS) allow a request body but don’t require one. For a few methods (e.g., GET), a request body has no defined meaning, and servers commonly ignore or reject it.

We won’t talk about request processing here, because it’s the subject of most of the

chapters in the rest of this book!

Step 4: Mapping and Accessing Resources

Web servers are resource servers. They deliver precreated content, such as HTML

pages or JPEG images, as well as dynamic content from resource-generating applica-

tions running on the servers.

Before the web server can deliver content to the client, it needs to identify the source

of the content, by mapping the URI from the request message to the proper content

or content generator on the web server.

Docroots

Web servers support different kinds of resource mapping, but the simplest form of

resource mapping uses the request URI to name a file in the web server’s filesystem.

Typically, a special folder in the web server filesystem is reserved for web content.

This folder is called the document root, or docroot. The web server takes the URI

from the request message and appends it to the document root.

In Figure 5-8, a request arrives for /specials/saw-blade.gif. The web server in this

example has document root /usr/local/httpd/files. The web server returns the file /usr/

local/httpd/files/specials/saw-blade.gif.

Figure 5-8. Mapping request URI to local web server resource

[Figure: a client sends "GET /specials/saw-blade.gif HTTP/1.0" with "Host: www.joes-hardware.com" across the Internet. The web server maps the request URI /specials/saw-blade.gif into its object storage under /usr/local/httpd/files, giving the server resource /usr/local/httpd/files/specials/saw-blade.gif.]

To set the document root for an Apache web server, add a DocumentRoot line to the

httpd.conf configuration file:

DocumentRoot /usr/local/httpd/files

Servers are careful not to let relative URLs back up out of a docroot and expose other

parts of the filesystem. For example, most mature web servers will not permit this

URI to see files above the Joe’s Hardware document root:

http://www.joes-hardware.com/../
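One way to implement both the docroot mapping and the protection against backing out of it is to normalize the joined path and then confirm it still lies under the docroot. The Python sketch below is a minimal illustration with a hypothetical function name; it works on path strings only, so it does not account for percent-encoded characters or symbolic links.

```python
import posixpath

def map_uri_to_file(docroot, uri_path):
    """Return the file path for uri_path, or None if it escapes the docroot."""
    # Join, then collapse "." and ".." segments.
    candidate = posixpath.normpath(posixpath.join(docroot, uri_path.lstrip("/")))
    root = posixpath.normpath(docroot)
    if candidate != root and not candidate.startswith(root + "/"):
        return None                   # the path backed up out of the docroot
    return candidate
```

With the docroot /usr/local/httpd/files, the path /specials/saw-blade.gif maps to the file shown in Figure 5-8, while /../ is refused.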

Virtually hosted docroots

Virtually hosted web servers host multiple web sites on the same web server, giving

each site its own distinct document root on the server. A virtually hosted web server

identifies the correct document root to use from the IP address or hostname in the

URI or the Host header. This way, two web sites hosted on the same web server can

have completely distinct content, even if the request URIs are identical.

In Figure 5-9, the server hosts two sites: www.joes-hardware.com and www.marys-

antiques.com. The server can distinguish the web sites using the HTTP Host header,

or from distinct IP addresses.

• When request A arrives, the server fetches the file for /docs/joe/index.html.

• When request B arrives, the server fetches the file for /docs/mary/index.html.
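The Host-header lookup for the two sites above could look like the following Python sketch. The table `DOCROOTS` and the function `docroot_for` are hypothetical names, and the port handling is simplified (it would mis-handle bracketed IPv6 literals).

```python
# Hypothetical table for the two sites in this example.
DOCROOTS = {
    "www.joes-hardware.com": "/docs/joe",
    "www.marys-antiques.com": "/docs/mary",
}

def docroot_for(host_header, table=DOCROOTS, default=None):
    """Pick a docroot from the Host header, ignoring case and any :port suffix."""
    hostname = host_header.strip().lower().partition(":")[0]
    return table.get(hostname, default)
```

Identical request URIs such as /index.html then resolve under different docroots depending only on the Host value.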

Configuring virtually hosted docroots is simple for most web servers. For the popular Apache web server, you need to configure a VirtualHost block for each virtual web site, and include the DocumentRoot for each virtual server (Example 5-3).

Example 5-3. Apache web server virtual host docroot configuration

    <VirtualHost www.joes-hardware.com>
        ServerName www.joes-hardware.com
        DocumentRoot /docs/joe
        TransferLog /logs/joe.access_log
        ErrorLog /logs/joe.error_log
    </VirtualHost>

    <VirtualHost www.marys-antiques.com>
        ServerName www.marys-antiques.com
        DocumentRoot /docs/mary
        TransferLog /logs/mary.access_log
        ErrorLog /logs/mary.error_log
    </VirtualHost>
    ...

Figure 5-9. Different docroots for virtually hosted requests

[Figure: one server hosts www.joes-hardware.com and www.marys-antiques.com. Request message A ("GET /index.html HTTP/1.0", "Host: www.joes-hardware.com") is served from /docs/joe; request message B ("GET /index.html HTTP/1.0", "Host: www.marys-antiques.com") is served from /docs/mary.]

Look forward to “Virtual Hosting” in Chapter 18 for much more detail about virtual hosting.

User home directory docroots

Another common use of docroots gives people private web sites on a web server. A typical convention maps URIs whose paths begin with a slash and tilde (/~) followed by a username to a private document root for that user. The private docroot is often the folder called public_html inside that user’s home directory, but it can be configured differently (Figure 5-10).

Figure 5-10. Different docroots for different users

[Figure: request message A ("GET /~bob/index.html HTTP/1.0") is served from /home/bob/public_html; request message B ("GET /~betty/index.html HTTP/1.0") is served from /home/betty/public_html.]

Directory Listings

A web server can receive requests for directory URLs, where the path resolves to a directory, not a file. Most web servers can be configured to take a few different actions when a client requests a directory URL:

• Return an error.

• Return a special, default, “index file” instead of the directory.

• Scan the directory, and return an HTML page containing the contents.

Most web servers look for a file named index.html or index.htm inside a directory to represent that directory. If a user requests a URL for a directory and the directory contains a file named index.html (or index.htm), the server will return the contents of that file.

In the Apache web server, you can configure the set of filenames that will be inter-

preted as default directory files using the DirectoryIndex configuration directive. The

DirectoryIndex directive lists all filenames that serve as directory index files, in pre-

ferred order. The following configuration line causes Apache to search a directory for

any of the listed files in response to a directory URL request:

DirectoryIndex index.html index.htm home.html home.htm index.cgi
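The search order implied by that directive can be expressed as a short loop. This Python sketch is only an approximation of the behavior described here, with hypothetical names; it returns the first listed filename that exists in the directory.

```python
import os

# Same filenames, in the same preferred order, as the directive above.
INDEX_NAMES = ["index.html", "index.htm", "home.html", "home.htm", "index.cgi"]

def find_index_file(directory, names=INDEX_NAMES):
    """Return the path of the first index file present, or None if there is none."""
    for name in names:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None
```

A None result is the point at which a server would fall back to an error or a generated listing.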

If no default index file is present when a user requests a directory URI, and if direc-

tory indexes are not disabled, many web servers automatically return an HTML file

listing the files in that directory, and the sizes and modification dates of each file,

including URI links to each file. This file listing can be convenient, but it also allows

nosy people to find files on a web server that they might not normally find.

You can disable the automatic generation of directory index files with the Apache

directive:

Options -Indexes

Dynamic Content Resource Mapping

Web servers also can map URIs to dynamic resources—that is, to programs that gen-

erate content on demand (Figure 5-11). In fact, a whole class of web servers called

application servers connect web servers to sophisticated backend applications. The

web server needs to be able to tell when a resource is a dynamic resource, where the

dynamic content generator program is located, and how to run the program. Most

web servers provide basic mechanisms to identify and map dynamic resources.

Apache lets you map URI pathname components into executable program directo-

ries. When a server receives a request for a URI with an executable path compo-

nent, it attempts to execute a program in a corresponding server directory. For

example, the following Apache configuration directive specifies that all URIs whose

paths begin with /cgi-bin/ should execute corresponding programs found in the

directory /usr/local/etc/httpd/cgi-programs/:

ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-programs/

Apache also lets you mark executable files with a special file extension. This way,

executable scripts can be placed in any directory. The following Apache configura-

tion directive specifies that all web resources ending in .cgi should be executed:

AddHandler cgi-script .cgi
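The two mapping rules just shown (a path prefix and a file extension) can be combined in a small classifier. The Python sketch below is illustrative only: the names are hypothetical, the docroot default is borrowed from the earlier docroot example, and it performs no path-traversal checks.

```python
import posixpath

# Mirrors the ScriptAlias and AddHandler lines above.
SCRIPT_ALIASES = {"/cgi-bin/": "/usr/local/etc/httpd/cgi-programs/"}
CGI_EXTENSIONS = (".cgi",)

def resolve_resource(uri_path, docroot="/usr/local/httpd/files"):
    """Return ("cgi", program_path) or ("static", file_path) for a URI path."""
    for prefix, directory in SCRIPT_ALIASES.items():
        if uri_path.startswith(prefix):
            return "cgi", posixpath.join(directory, uri_path[len(prefix):])
    path = posixpath.join(docroot, uri_path.lstrip("/"))
    if uri_path.endswith(CGI_EXTENSIONS):
        return "cgi", path
    return "static", path
```

The server would execute anything classified as "cgi" and send the program's output, and would read and return anything classified as "static".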

CGI is an early, simple, and popular interface for executing server-side applications.

Modern application servers have more powerful and efficient server-side dynamic

content support, including Microsoft’s Active Server Pages and Java servlets.


Server-Side Includes (SSI)

Many web servers also provide support for server-side includes. If a resource is

flagged as containing server-side includes, the server processes the resource contents

before sending them to the client.

The contents are scanned for certain special patterns (often contained inside special

HTML comments), which can be variable names or embedded scripts. The special

patterns are replaced with the values of variables or the output of executable scripts.

This is an easy way to create dynamic content.
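As a simplified illustration of the scanning step, the Python sketch below replaces directives written in the Apache-style form `<!--#echo var="NAME" -->` with variable values. It handles only this one directive, the function name is hypothetical, and the "(none)" text used for unset variables is just a placeholder choice.

```python
import re

# Matches directives of the form <!--#echo var="NAME" -->
ECHO_DIRECTIVE = re.compile(r'<!--#echo\s+var="([^"]+)"\s*-->')

def expand_includes(html, variables):
    """Replace each echo directive with its variable's value."""
    def substitute(match):
        return variables.get(match.group(1), "(none)")
    return ECHO_DIRECTIVE.sub(substitute, html)
```

Because the directives sit inside HTML comments, a page that is served without this processing still renders, just without the substituted values.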

Access Controls

Web servers also can assign access controls to particular resources. When a request

arrives for an access-controlled resource, the web server can control access based on

the IP address of the client, or it can issue a password challenge to get access to the

resource.

Refer to Chapter 12 for more information about HTTP authentication.

Figure 5-11. A web server can serve static resources as well as dynamic resources

[Figure: a client reaches the server across the Internet. The server returns static filesystem resources (an image file, a text file) and dynamic resources produced by gateways: an e-commerce gateway, a real estate search gateway, a stock trading gateway, and a web cam gateway.]

Step 5: Building Responses

Once the web server has identified the resource, it performs the action described in

the request method and returns the response message. The response message con-

tains a response status code, response headers, and a response body if one was gener-

ated. HTTP response codes were detailed in “Status Codes” in Chapter 3.

Response Entities

If the transaction generated a response body, the content is sent back with the

response message. If there was a body, the response message usually contains:

• A Content-Type header, describing the MIME type of the response body

• A Content-Length header, describing the size of the response body

• The actual message body content
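A response carrying those three parts can be assembled as in the Python sketch below. The function name is hypothetical and the example emits only these two headers; real servers add others.

```python
def build_response(status, reason, body, content_type):
    """Serialize a response with Content-Type and Content-Length headers."""
    head = (
        f"HTTP/1.0 {status} {reason}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"   # size of the body in bytes
        "\r\n"                               # blank line ends the headers
    )
    return head.encode("ascii") + body
```

Taking the body as bytes means `len(body)` is the byte count the Content-Length header requires, not a character count.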

MIME Typing

The web server is responsible for determining the MIME type of the response body.

There are many ways to configure servers to associate MIME types with resources:

mime.types

The web server can use the extension of the filename to indicate MIME type.

The web server scans a file containing MIME types for each extension to com-

pute the MIME type for each resource. This extension-based type association is

the most common; it is illustrated in Figure 5-12.

Figure 5-12. A web server uses MIME types file to set outgoing Content-Type of resources

[Figure: the client sends an HTTP request message containing the command and the URI ("GET /specials/saw-blade.gif HTTP/1.0", "Host: www.joes-hardware.com") to www.joes-hardware.com. The server finds the saw-blade.gif file, looks up the gif extension in its MIME type table, and responds "HTTP/1.0 200 OK" with "Content-type: image/gif" and "Content-length: 8572".]

Server MIME type table:

    MIME type                 Extensions
    application/msword        doc
    application/postscript    ai eps ps
    application/powerpoint    ppt
    audio/mpeg                mpga mp2
    image/gif                 gif
    image/jpeg                jpeg jpg jpe
    image/tiff                tiff tif
    text/html                 html htm
    text/plain                txt
    video/mpeg                mpeg mpg mpe
    video/quicktime           qt mov
    video/x-msvideo           avi
    x-world/x-vrml            wrl vrml

Magic typing

The Apache web server can scan the contents of each resource and pattern-

match the content against a table of known patterns (called the magic file) to

determine the MIME type for each file. This can be slow, but it is convenient,

especially if the files are named without standard extensions.

Explicit typing

Web servers can be configured to force particular files or directory contents to

have a MIME type, regardless of the file extension or contents.

Type negotiation

Some web servers can be configured to store a resource in multiple document

formats. In this case, the web server can be configured to determine the “best”

format to use (and the associated MIME type) by a negotiation process with the

user. We’ll discuss this in Chapter 17.

Web servers also can be configured to associate particular files with MIME types.
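The first two approaches, extension lookup and magic typing, can be combined in a few lines. The Python sketch below is a toy version under stated assumptions: the tables are tiny hand-written samples rather than a real mime.types or magic file, and the function name is hypothetical.

```python
import posixpath

# Small sample tables; real servers load far larger ones from files.
MIME_TYPES = {"gif": "image/gif", "jpg": "image/jpeg",
              "html": "text/html", "txt": "text/plain"}
MAGIC = [(b"GIF87a", "image/gif"), (b"GIF89a", "image/gif"),
         (b"\xff\xd8\xff", "image/jpeg")]

def guess_mime_type(filename, content=b"", default="application/octet-stream"):
    """Try the file extension first, then the leading bytes of the content."""
    extension = posixpath.splitext(filename)[1].lstrip(".").lower()
    if extension in MIME_TYPES:
        return MIME_TYPES[extension]
    for prefix, mime_type in MAGIC:
        if content.startswith(prefix):
            return mime_type
    return default
```

Checking the extension first keeps the common case cheap, since content only has to be inspected for files without a recognized extension.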

Redirection

Web servers sometimes return redirection responses instead of success messages. A

web server can redirect the browser to go elsewhere to perform the request. A redirec-

tion response is indicated by a 3XX return code. The Location response header con-

tains a URI for the new or preferred location of the content. Redirects are useful for:

Permanently moved resources

A resource might have been moved to a new location, or otherwise renamed, giv-

ing it a new URL. The web server can tell the client that the resource has been

renamed, and the client can update any bookmarks, etc. before fetching the

resource from its new location. The status code 301 Moved Permanently is used

for this kind of redirect.

Temporarily moved resources

If a resource is temporarily moved or renamed, the server may want to redirect

the client to the new location. But, because the renaming is temporary, the server

wants the client to come back with the old URL in the future and not to update

any bookmarks. The status codes 303 See Other and 307 Temporary Redirect

are used for this kind of redirect.

URL augmentation

Servers often use redirects to rewrite URLs, often to embed context. When the request arrives, the server generates a new URL containing embedded state information and redirects the user to this new URL.* The client follows the redirect, reissuing the request, but now including the full, state-augmented URL. This is a useful way of maintaining state across transactions. The status codes 303 See Other and 307 Temporary Redirect are used for this kind of redirect.

* These extended, state-augmented URLs are sometimes called “fat URLs.”

Load balancing

If an overloaded server gets a request, the server can redirect the client to a less

heavily loaded server. The status codes 303 See Other and 307 Temporary Redi-

rect are used for this kind of redirect.

Server affinity

Web servers may have local information for certain users; a server can redirect

the client to a server that contains information about the client. The status codes

303 See Other and 307 Temporary Redirect are used for this kind of redirect.

Canonicalizing directory names

When a client requests a URI for a directory name without a trailing slash, most

web servers redirect the client to a URI with the slash added, so that relative

links work correctly.
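The trailing-slash redirect is the simplest of these cases to sketch. The Python function below assumes the caller has already determined that the URI names a directory, and it assumes a 301 status, which the text above does not specify. The function name is hypothetical.

```python
def directory_redirect(uri_path, host):
    """Return a redirect response that adds the trailing slash, or None."""
    if uri_path.endswith("/"):
        return None                       # already canonical
    location = f"http://{host}{uri_path}/"
    return (
        "HTTP/1.0 301 Moved Permanently\r\n"   # assumed status code
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    ).encode("ascii")
```

Without the slash, a browser would resolve a relative link such as saw.gif against the parent of the directory rather than the directory itself.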

Step 6: Sending Responses

Web servers face similar issues sending data across connections as they do receiving.

The server may have many connections to many clients, some idle, some sending

data to the server, and some carrying response data back to the clients.

The server needs to keep track of the connection state and handle persistent connec-

tions with special care. For nonpersistent connections, the server is expected to close

its side of the connection when the entire message is sent.

For persistent connections, the connection may stay open, in which case the server

needs to be extra cautious to compute the Content-Length header correctly, or the

client will have no way of knowing when a response ends (see Chapter 4).

Step 7: Logging

Finally, when a transaction is complete, the web server notes an entry into a log file,

describing the transaction performed. Most web servers provide several configurable

forms of logging. Refer to Chapter 21 for more details.
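As one concrete form of log entry, the Python sketch below formats a line in Common Log Format, the format mentioned earlier in connection with ident. The function name and parameters are hypothetical. The ident and authenticated-user fields default to hyphens, as they do when no such information is available.

```python
from datetime import datetime, timezone

def common_log_line(host, request_line, status, size, when, ident="-", user="-"):
    """Format one Common Log Format entry.

    Field order: host, ident, user, [timestamp], "request", status, bytes.
    """
    timestamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    return f'{host} {ident} {user} [{timestamp}] "{request_line}" {status} {size}'
```

The `when` argument should be a timezone-aware datetime so that `%z` produces the numeric offset the format expects.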

For More Information

For more information on Apache, Jigsaw, and ident, check out:

Apache: The Definitive Guide

Ben Laurie and Peter Laurie, O’Reilly & Associates, Inc.

Professional Apache

Peter Wainwright, Wrox Press.


http://www.w3c.org/Jigsaw/

Jigsaw: W3C’s Server, W3C Consortium Web Site.

http://www.ietf.org/rfc/rfc1413.txt

RFC 1413, “Identification Protocol,” by M. St. Johns.


CHAPTER 6

Proxies

Web proxy servers are intermediaries. Proxies sit between clients and servers and act

as “middlemen,” shuffling HTTP messages back and forth between the parties. This

chapter talks all about HTTP proxy servers, the special support for proxy features,

and some of the tricky behaviors you’ll see when you use proxy servers.

In this chapter, we:

• Explain HTTP proxies, contrasting them to web gateways and illustrating how

proxies are deployed.

• Show some of the ways proxies are helpful.

• Describe how proxies are deployed in real networks and how traffic is directed

to proxy servers.

• Show how to configure your browser to use a proxy.

• Demonstrate HTTP proxy requests, how they differ from server requests, and

how proxies can subtly change the behavior of browsers.

• Explain how you can record the path of your messages through chains of proxy

servers, using Via headers and the TRACE method.

• Describe proxy-based HTTP access control.

• Explain how proxies can interoperate between clients and servers, each of which

may support different features and versions.

Web Intermediaries

Web proxy servers are middlemen that fulfill transactions on the client’s behalf.

Without a web proxy, HTTP clients talk directly to HTTP servers. With a web

proxy, the client instead talks to the proxy, which itself communicates with the

server on the client’s behalf. The client still completes the transaction, but through

the good services of the proxy server.

130 |Chapter 6: Proxies

HTTP proxy servers are both web servers and web clients. Because HTTP clients

send request messages to proxies, the proxy server must properly handle the requests

and the connections and return responses, just like a web server. At the same time,

the proxy itself sends requests to servers, so it must also behave like a correct HTTP

client, sending requests and receiving responses (see Figure 6-1). If you are creating

your own HTTP proxy, you’ll need to carefully follow the rules for both HTTP clients

and HTTP servers.

Private and Shared Proxies

A proxy server can be dedicated to a single client or shared among many clients.

Proxies dedicated to a single client are called private proxies. Proxies shared among

numerous clients are called public proxies.

Public proxies

Most proxies are public, shared proxies. It’s more cost effective and easier to

administer a centralized proxy. And some proxy applications, such as caching

proxy servers, become more useful as more users are funneled into the same proxy

server, because they can take advantage of common requests between users.

Private proxies

Dedicated private proxies are not as common, but they do have a place, especially

when run directly on the client computer. Some browser assistant products,

as well as some ISP services, run small proxies directly on the user’s PC in

order to extend browser features, improve performance, or host advertising for

free ISP services.

Proxies Versus Gateways

Strictly speaking, proxies connect two or more applications that speak the same protocol,

while gateways hook up two or more parties that speak different protocols. A

gateway acts as a “protocol converter,” allowing a client to complete a transaction

with a server, even when the client and server speak different protocols.

Figure 6-1. A proxy must be both a server and a client (diagram: a proxy sits between a client and a server. Proxies act like servers to web clients, receiving request messages and returning response messages; they act like clients to web servers, sending web request messages and receiving web response messages.)

Why Use Proxies? |131

Figure 6-2 illustrates the difference between proxies and gateways:

• The intermediary device in Figure 6-2a is an HTTP proxy, because the proxy

speaks HTTP to both the client and server.

• The intermediary device in Figure 6-2b is an HTTP/POP gateway, because it ties

an HTTP frontend to a POP email backend. The gateway converts web transactions

into the appropriate POP transactions, to allow the user to read email

through HTTP. Web-based email programs such as Yahoo! Mail and MSN Hotmail

are HTTP email gateways.

In practice, the difference between proxies and gateways is blurry. Because browsers

and servers implement different versions of HTTP, proxies often do some amount of

protocol conversion. And commercial proxy servers implement gateway functionality

to support SSL security protocols, SOCKS firewalls, FTP access, and web-based

applications. We’ll talk more about gateways in Chapter 8.

Why Use Proxies?

Proxy servers can do all kinds of nifty and useful things. They can improve security,

enhance performance, and save money. And because proxy servers can see and touch

all the passing HTTP traffic, proxies can monitor and modify the traffic to implement

many useful value-added web services. Here are examples of just a few of the

ways proxies can be used:

Child filter (Figure 6-3)

Elementary schools use filtering proxies to block access to adult content, while

providing unhindered access to educational sites. As shown in Figure 6-3, the

proxy might permit unrestricted access to educational content but forcibly deny

access to sites that are inappropriate for children.*

Figure 6-2. Proxies speak the same protocol; gateways tie together different protocols (diagram panels: (a) HTTP/HTTP proxy, a web proxy speaking HTTP to both the browser and the web server; (b) HTTP/POP gateway, a web/email gateway speaking HTTP to the browser and POP to the email server)

Document access controller (Figure 6-4)

Proxy servers can be used to implement a uniform access-control strategy across

a large set of web servers and web resources and to create an audit trail. This is

useful in large corporate settings or other distributed bureaucracies.

All the access controls can be configured on the centralized proxy server, without

requiring the access controls to be updated frequently on numerous web

servers, of different makes and models, administered by different organizations.†

In Figure 6-4, the centralized access-control proxy:

• Permits client 1 to access news pages from server A without restriction

• Gives client 2 unrestricted access to Internet content

• Requires a password from client 3 before allowing access to server B
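The three rules above can be sketched as a small policy table consulted by the proxy. This is only an illustration: the client and server names are placeholders, the rule format is invented for this example, and denying anything not listed is an assumption rather than something the figure states.

```javascript
// Hypothetical policy table mirroring the three rules listed for Figure 6-4.
// "*" matches any server.
const policy = [
  { client: "client1", server: "serverA", action: "allow" },
  { client: "client2", server: "*", action: "allow" },
  { client: "client3", server: "serverB", action: "require-password" },
];

const auditTrail = []; // every decision is recorded here

// Return the action for a request; anything not listed is denied (an assumption).
function checkAccess(client, server) {
  const rule = policy.find(
    (r) => r.client === client && (r.server === "*" || r.server === server)
  );
  const action = rule ? rule.action : "deny";
  auditTrail.push({ client, server, action });
  return action;
}

console.log(checkAccess("client3", "serverB")); // "require-password"
```

Because every request passes through checkAccess, the audit trail comes for free, which is the main appeal of centralizing the policy in one proxy.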

Security firewall (Figure 6-5)

Network security engineers often use proxy servers to enhance security. Proxy

servers restrict which application-level protocols flow in and out of an organization,

at a single secure point in the network. They also can provide hooks to

scrutinize that traffic (Figure 6-5), as used by virus-eliminating web and email

proxies.

Figure 6-3. Proxy application example: child-safe Internet filter (diagram: child users reach servers on the Internet through the school’s filtering proxy, which denies some requests)

* Several companies and nonprofit organizations provide filtering software and maintain “blacklists” in order to identify and restrict access to objectionable content.

† To prevent sophisticated users from willfully bypassing the control proxy, the web servers can be statically configured to accept requests only from the proxy servers.

Web cache (Figure 6-6)

Proxy caches maintain local copies of popular documents and serve them on

demand, reducing slow and costly Internet communication.

In Figure 6-6, clients 1 and 2 access object A from a nearby web cache, while clients

3 and 4 access the document from the origin server.
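The core idea of a proxy cache can be shown in a few lines. The sketch below is a toy: the class name CachingProxy and the stand-in origin function are invented for illustration, and it never expires or revalidates entries, which any real cache must do.

```javascript
// A toy caching proxy that keeps every fetched document forever.
class CachingProxy {
  constructor(fetchFromOrigin) {
    this.fetchFromOrigin = fetchFromOrigin; // function (url) -> body
    this.store = new Map();                 // url -> cached body
  }

  get(url) {
    if (this.store.has(url)) {
      return { body: this.store.get(url), servedFrom: "cache" };
    }
    const body = this.fetchFromOrigin(url); // slow, costly trip to the origin server
    this.store.set(url, body);
    return { body, servedFrom: "origin" };
  }
}

// Stand-in for the origin server that counts how often it is contacted.
let originRequests = 0;
const proxy = new CachingProxy((url) => {
  originRequests += 1;
  return `contents of ${url}`;
});

console.log(proxy.get("http://www.joes-hardware.com/tools.html").servedFrom); // "origin"
console.log(proxy.get("http://www.joes-hardware.com/tools.html").servedFrom); // "cache"
console.log(originRequests); // 1
```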

Figure 6-4. Proxy application example: centralized document access control (diagram: an access control proxy on the local area network sits between clients 1, 2, and 3 and server A (general news), server B (secret financial data), and the Internet. Client 3’s intended request to server B is blocked and answered with “What is the password for the financial data?”)

Figure 6-5. Proxy application example: security firewall (diagram labels: client, filtering router, firewall proxy, filtering router, Internet, server, virus)

Surrogate (Figure 6-7)

Proxies can masquerade as web servers. These so-called surrogates or reverse

proxies receive real web server requests, but, unlike web servers, they may initiate

communication with other servers to locate the requested content on demand.

Surrogates may be used to improve the performance of slow web servers for common

content. In this configuration, the surrogates often are called server accelerators

(Figure 6-7). Surrogates also can be used in conjunction with content-routing

functionality to create distributed networks of on-demand replicated content.

Content router (Figure 6-8)

Proxy servers can act as “content routers,” vectoring requests to particular web

servers based on Internet traffic conditions and type of content.

Content routers also can be used to implement various service-level offerings.

For example, content routers can forward requests to nearby replica caches if the

user or content provider has paid for higher performance (Figure 6-8), or route

HTTP requests through filtering proxies if the user has signed up for a filtering

service. Many interesting services can be constructed using adaptive content-routing

proxies.

Figure 6-6. Proxy application example: web cache (diagram: clients 1 through 4, a web caching proxy, and the origin server)

Figure 6-7. Proxy application example: surrogate (in a server accelerator deployment) (diagram: a client reaches the server across the Internet through a surrogate, also known as a reverse proxy or a server accelerator)

Transcoder (Figure 6-9)

Proxy servers can modify the body format of content before delivering it to clients.

This transparent translation between data representations is called transcoding.*

Transcoding proxies can convert GIF images into JPEG images as they fly by, to

reduce size. Images also can be shrunk and reduced in color intensity to be viewable

on television sets. Likewise, text files can be compressed, and small text

summaries of web pages can be generated for Internet-enabled pagers and smart

phones. It’s even possible for proxies to convert documents into foreign languages

on the fly!

Figure 6-9 shows a transcoding proxy that converts English text into Spanish

text and also reformats HTML pages into simpler text that can be displayed on the

small screen of a mobile phone.

Figure 6-8. Proxy application example: content routing (diagram: content routers steer Sharon, Rob, and Luis either to a web caching proxy or to origin servers A and B. Server A paid to have content distributed to replica caches, but server B did not. The content router steers Luis to a replica cache for A’s pages but to the origin server for B’s pages. Sharon paid for the performance, so the content router sends her to the nearby cache. Rob didn’t, so the content router sends him to the origin server.)

* Some people distinguish “transcoding” and “translation,” defining transcoding as relatively simple conversions of the encoding of the data (e.g., lossless compression) and translation as more significant reformatting or semantic changes of the data. We use the term transcoding to mean any intermediary-based modification of the content.

Anonymizer (Figure 6-10)

Anonymizer proxies provide heightened privacy and anonymity, by actively

removing identifying characteristics from HTTP messages (e.g., client IP address,

From header, Referer header, cookies, URI session IDs).*

In Figure 6-10, the anonymizing proxy makes the following changes to the user’s

messages to increase privacy:

• The user’s computer and OS type are removed from the User-Agent header.

• The From header is removed to protect the user’s email address.

• The Referer header is removed to obscure other sites the user has visited.

• The Cookie headers are removed to eliminate profiling and identity data.
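The four changes listed above can be sketched as a header filter. The function name anonymizeHeaders and the representation of headers as name/value pairs are assumptions made for this example; the sample request uses the header values shown in Figure 6-10. Hiding the client IP address is a separate matter, since it happens at the connection level and not in the headers.

```javascript
// Headers dropped entirely (names are compared case-insensitively).
const REMOVED_HEADERS = new Set(["from", "referer", "cookie"]);

// Hypothetical helper: takes an array of [name, value] pairs and returns a
// new array with the identifying information removed.
function anonymizeHeaders(headers) {
  const result = [];
  for (const [name, value] of headers) {
    const lower = name.toLowerCase();
    if (REMOVED_HEADERS.has(lower)) continue;
    if (lower === "user-agent") {
      // Strip the parenthesized comment, which carries the computer and OS type.
      result.push([name, value.replace(/\s*\([^)]*\)/g, "").trim()]);
    } else {
      result.push([name, value]);
    }
  }
  return result;
}

const request = [
  ["Date", "Sun, 01 Oct 2000 23:25:17 GMT"],
  ["User-agent", "Mozilla/4.75 (Win98; U)"],
  ["From", "joe@joes-hardware.com"],
  ["Referer", "http://www.irs.gov/tax-audits.html"],
  ["Cookie", 'profile="football,lite beer"'],
  ["Cookie", 'income-bracket="30K-45K"'],
];

console.log(anonymizeHeaders(request));
// Only the Date header and "User-agent: Mozilla/4.75" remain.
```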

Figure 6-9. Proxy application example: content transcoder (diagram: a transcoding proxy takes the origin server’s English “Summer Beach Shirts” page and delivers Spanish text to a Spanish-speaking client and a simplified, numbered text version to a web-enabled mobile phone)

* However, because identifying information is removed, the quality of the user’s browsing experience may be diminished, and some web sites may not function properly.

Figure 6-10. Proxy application example: anonymizer (diagram: the client’s request for /something/file.html carries Date, User-agent: Mozilla/4.75 (Win98; U), From, Referer, and two Cookie headers. The anonymizing proxy forwards only the request line, the Date header, and User-agent: Mozilla/4.75. The anonymized message doesn’t contain the common identifying information headers.)

Where Do Proxies Go? |137

Where Do Proxies Go?

The previous section explained what proxies do. Now let’s talk about where proxies

sit when they are deployed into a network architecture. We’ll cover:

• How proxies can be deployed into networks

• How proxies can chain together into hierarchies

• How traffic gets directed to a proxy server in the first place

Proxy Server Deployment

You can place proxies in all kinds of places, depending on their intended uses.

Figure 6-11 sketches a few ways proxy servers can be deployed.

Egress proxy (Figure 6-11a)

You can stick proxies at the exit points of local networks to control the traffic

flow between the local network and the greater Internet. You might use egress

proxies in a corporation to offer firewall protection against malicious hackers

outside the enterprise or to reduce bandwidth charges and improve performance

of Internet traffic. An elementary school might use a filtering egress proxy

to prevent precocious students from browsing inappropriate content.

Access (ingress) proxy (Figure 6-11b)

Proxies are often placed at ISP access points, processing the aggregate requests

from the customers. ISPs use caching proxies to store copies of popular documents,

to improve the download speed for their users (especially those with

high-speed connections) and reduce Internet bandwidth costs.

Surrogates (Figure 6-11c)

Proxies frequently are deployed as surrogates (also commonly called reverse

proxies) at the edge of the network, in front of web servers, where they can field

all of the requests directed at the web server and ask the web server for resources

only when necessary. Surrogates can add security features to web servers or

improve performance by placing fast web server caches in front of slower web

servers. Surrogates typically assume the name and IP address of the web server

directly, so all requests go to the proxy instead of the server.

Network exchange proxy (Figure 6-11d)

With sufficient horsepower, proxies can be placed in the Internet peering

exchange points between networks, to alleviate congestion at Internet junctions

through caching and to monitor traffic flows.*

* Core proxies often are deployed where Internet bandwidth is very expensive (especially in Europe). Some

countries (such as the UK) also are evaluating controversial proxy deployments to monitor Internet traffic

for national security concerns.

Proxy Hierarchies

Proxies can be cascaded in chains called proxy hierarchies. In a proxy hierarchy, messages

are passed from proxy to proxy until they eventually reach the origin server

(and then are passed back through the proxies to the client), as shown in Figure 6-12.

Proxy servers in a proxy hierarchy are assigned parent and child relationships. The

next inbound proxy (closer to the server) is called the parent, and the next outbound

proxy (closer to the client) is called the child. In Figure 6-12, proxy 1 is the child

proxy of proxy 2. Likewise, proxy 2 is the child proxy of proxy 3, and proxy 3 is the

parent proxy of proxy 2.

Figure 6-11. Proxies can be deployed many ways, depending on their intended use (diagram panels: (a) private LAN egress proxy, (b) ISP access proxy, (c) surrogate, (d) network exchange proxy)

Proxy hierarchy content routing

The proxy hierarchy in Figure 6-12 is static—proxy 1 always forwards messages to

proxy 2, and proxy 2 always forwards messages to proxy 3. However, hierarchies do

not have to be static. A proxy server can forward messages to a varied and changing

set of proxy servers and origin servers, based on many factors.

For example, in Figure 6-13, the access proxy routes to parent proxies or origin servers

in different circumstances:

• If the requested object belongs to a web server that has paid for content distribution,

the proxy could route the request to a nearby cache server that would

either return the cached object or fetch it if it wasn’t available.

• If the request was for a particular type of image, the access proxy might route the

request to a dedicated compression proxy that would fetch the image and then

compress it, so it would download faster across a slow modem to the client.

Figure 6-12. Three-level proxy hierarchy (diagram: client, proxy 1 (child of proxy 2), proxy 2 (child of proxy 3, parent of proxy 1), proxy 3 (parent of proxy 2), origin server)

Figure 6-13. Proxy hierarchies can be dynamic, changing for each request (diagram: an access proxy routes a client’s requests to a caching proxy, a compressor proxy, a dedicated cache server for specially subscribed objects, or across the Internet to web servers around the globe)

Here are a few other examples of dynamic parent selection:

Load balancing

A child proxy might pick a parent proxy based on the current level of workload

on the parents, to spread the load around.

Geographic proximity routing

A child proxy might select a parent responsible for the origin server’s geographic

region.

Protocol/type routing

A child proxy might route to different parents and origin servers based on the

URI. Certain types of URIs might cause the requests to be transported through

special proxy servers, for special protocol handling.

Subscription-based routing

If publishers have paid extra money for high-performance service, their URIs

might be routed to large caches or compression engines to improve performance.

Dynamic parenting routing logic is implemented differently in different products,

including configuration files, scripting languages, and dynamic executable plug-ins.
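As one illustration of such routing logic, the sketch below combines protocol/type routing with load balancing. The hostnames, load figures, and function names are all made up for the example; real products express this logic in their own configuration or scripting formats.

```javascript
// Hypothetical parent proxies with their current load (0 = idle, 1 = saturated).
const parents = [
  { host: "cache-1.example.com", load: 0.72 },
  { host: "cache-2.example.com", load: 0.31 },
  { host: "cache-3.example.com", load: 0.55 },
];

// Load balancing: pick the parent with the lowest current load.
// Expects a non-empty array.
function pickLeastLoaded(candidates) {
  return candidates.reduce((best, p) => (p.load < best.load ? p : best));
}

// Protocol/type routing first, then fall back to load balancing.
function selectParent(uri, candidates) {
  const url = new URL(uri);
  if (url.protocol === "ftp:") {
    return "ftp-gateway.example.com"; // special protocol handling
  }
  if (/\.(gif|jpe?g|png)$/i.test(url.pathname)) {
    return "compressor.example.com"; // dedicated image compression proxy
  }
  return pickLeastLoaded(candidates).host;
}

console.log(selectParent("http://www.joes-hardware.com/index.html", parents));
// "cache-2.example.com"
```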

How Proxies Get Traffic

Because clients normally talk directly to web servers, we need to explain how HTTP

traffic finds its way to a proxy in the first place. There are four common ways to

cause client traffic to get to a proxy:

Modify the client

Many web clients, including Netscape and Microsoft browsers, support both

manual and automated proxy configuration. If a client is configured to use a

proxy server, the client sends HTTP requests directly and intentionally to the

proxy, instead of to the origin server (Figure 6-14a).

Modify the network

There are several techniques where the network infrastructure intercepts and

steers web traffic into a proxy, without the client’s knowledge or participation.

This interception typically relies on switching and routing devices that watch for

HTTP traffic, intercept it, and shunt the traffic into a proxy, without the client’s

knowledge (Figure 6-14b). This is called an intercepting proxy.*

Modify the DNS namespace

Surrogates, which are proxy servers placed in front of web servers, assume the

name and IP address of the web server directly, so all requests go to them instead

of to the server (Figure 6-14c). This can be arranged by manually editing the

DNS naming tables or by using special dynamic DNS servers that compute the

appropriate proxy or server to use on-demand. In some installations, the IP

address and name of the real server are changed and the surrogate is given the

former address and name.

* Intercepting proxies commonly are called “transparent proxies,” because you connect to them without being aware of their presence. Because the term “transparency” already is used in the HTTP specifications to indicate functions that don’t change semantic behavior, the standards community suggests using the term “interception” for traffic capture. We adopt this nomenclature here.

Modify the web server

Some web servers also can be configured to redirect client requests to a proxy by

sending an HTTP redirection command (response code 305) back to the client.

Upon receiving the redirect, the client transacts with the proxy (Figure 6-14d).

The next section explains how to configure clients to send traffic to proxies.

Chapter 20 will explain how to configure the network, DNS, and servers to redirect

traffic to proxy servers.

Client Proxy Settings

All modern web browsers let you configure the use of proxies. In fact, many browsers

provide multiple ways of configuring proxies, including:

Manual configuration

You explicitly set a proxy to use.

Browser preconfiguration

The browser vendor or distributor manually preconfigures the proxy setting of

the browser (or any other web client) before delivering it to customers.

Figure 6-14. There are many techniques to direct web requests to proxies (diagram panels: (a) client configured to use proxy, (b) network intercepts and redirects traffic to proxy, (c) proxy assuming the web server’s name, (d) server redirects HTTP requests to proxy)

Proxy auto-configuration (PAC)

You provide a URI to a JavaScript proxy auto-configuration (PAC) file; the client

fetches the JavaScript file and runs it to decide if it should use a proxy and, if so,

which proxy server to use.

WPAD proxy discovery

Some browsers support the Web Proxy Autodiscovery Protocol (WPAD), which

automatically detects a “configuration server” from which the browser can

download an auto-configuration file.*

Client Proxy Configuration: Manual

Many web clients allow you to configure proxies manually. Both Netscape Navigator

and Microsoft Internet Explorer have convenient support for proxy configuration.

In Netscape Navigator 6, you specify proxies by choosing Edit ➝ Preferences ➝

Advanced ➝ Proxies and then selecting the “Manual proxy configuration”

radio button.

In Microsoft Internet Explorer 5, you can manually specify proxies from the Tools ➝

Internet Options menu, by selecting a connection, pressing “Settings,” checking the

“Use a proxy server” box, and clicking “Advanced.”

Other browsers have different ways of making manual configuration changes, but

the idea is the same: specifying the host and port for the proxy. Several ISPs ship customers

preconfigured browsers, or customized operating systems, that redirect web

traffic to proxy servers.

Client Proxy Configuration: PAC Files

Manual proxy configuration is simple but inflexible. You can specify only one proxy

server for all content, and there is no support for failover. Manual proxy configuration

also leads to administrative problems for large organizations. With a large base

of configured browsers, it’s difficult or impossible to reconfigure every browser if you

need to make changes.

Proxy auto-configuration (PAC) files are a more dynamic solution for proxy configuration,

because they are small JavaScript programs that compute proxy settings on

the fly. Each time a document is accessed, a JavaScript function selects the proper

proxy server.

To use PAC files, configure your browser with the URI of the JavaScript PAC file

(configuration is similar to manual configuration, but you provide a URI in an “automatic

configuration” box). The browser will fetch the PAC file from this URI and use

the JavaScript logic to compute the proper proxy server for each access. PAC files

typically have a .pac suffix and the MIME type “application/x-ns-proxy-autoconfig”.

* Currently supported only by Internet Explorer.

Each PAC file must define a function called FindProxyForURL(url,host) that computes

the proper proxy server to use for accessing the URI. The return value of the

function can be any of the values in Table 6-1.

The PAC file in Example 6-1 mandates one proxy for HTTP transactions, another

proxy for FTP transactions, and direct connections for all other kinds of transactions.

For more details about PAC files, refer to Chapter 20.
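Because a PAC file is ordinary JavaScript, it can also address the failover limitation of manual configuration. The sketch below builds on Example 6-1 and relies on the fact that common browser PAC implementations accept a semicolon-separated list of return values and try them in order. The proxy hostnames are placeholders, and treating dotless hostnames as local is an assumption made for this example.

```javascript
// A PAC sketch with failover. Proxy hostnames are placeholders.
function FindProxyForURL(url, host) {
  // Hosts without a dot are assumed to be on the local network: go direct.
  if (host.indexOf(".") === -1) {
    return "DIRECT";
  }
  if (url.substring(0, 5) === "http:") {
    // Try the primary proxy, then the backup, then a direct connection.
    return "PROXY proxy1.mydomain.com:8080; PROXY proxy2.mydomain.com:8080; DIRECT";
  }
  return "DIRECT";
}

console.log(FindProxyForURL("http://www.joes-hardware.com/", "www.joes-hardware.com"));
```

The function can be exercised outside a browser, as the last line does, which makes it easy to check the routing decisions before distributing the file.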

Client Proxy Configuration: WPAD

Another mechanism for browser configuration is the Web Proxy Autodiscovery Protocol

(WPAD). WPAD is an algorithm that uses an escalating strategy of discovery

mechanisms to find the appropriate PAC file for the browser automatically. A client

that implements the WPAD protocol will:

• Use WPAD to find the PAC URI.

• Fetch the PAC file given the URI.

• Execute the PAC file to determine the proxy server.

• Use the proxy server for requests.

WPAD uses a series of resource-discovery techniques to determine the proper PAC

file. Multiple discovery techniques are used, because not all organizations can use all

techniques. WPAD attempts each technique, one by one, until it succeeds.

Table 6-1. Proxy auto-configuration script return values

FindProxyForURL return value    Description
DIRECT                          Connections should be made directly, without any proxies.
PROXY host:port                 The specified proxy should be used.
SOCKS host:port                 The specified SOCKS server should be used.

Example 6-1. Example proxy auto-configuration file

function FindProxyForURL(url, host) {
    if (url.substring(0,5) == "http:") {
        // HTTP transactions use the HTTP proxy
        return "PROXY http-proxy.mydomain.com:8080";
    } else if (url.substring(0,4) == "ftp:") {
        // FTP transactions use the FTP proxy
        return "PROXY ftp-proxy.mydomain.com:8080";
    } else {
        // all other transactions connect directly
        return "DIRECT";
    }
}

The current WPAD specification defines the following techniques, in order:

• Dynamic Host Configuration Protocol (DHCP)

• Service Location Protocol (SLP)

• DNS well-known hostnames

• DNS SRV records

• DNS service URIs in TXT records

For more information, consult Chapter 20.
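The escalating strategy amounts to trying each discovery technique in order and stopping at the first one that yields a PAC URI. The sketch below shows only that control flow: the lookup functions are stubs standing in for real DHCP, SLP, and DNS queries, and the function name and returned URI are placeholders.

```javascript
// Try each technique in order; return the first PAC URI found, or null.
function discoverPacUri(techniques) {
  for (const { name, lookup } of techniques) {
    const uri = lookup(); // returns a PAC URI string, or null if nothing is found
    if (uri) {
      return { technique: name, uri };
    }
  }
  return null; // every technique failed
}

// Stub lookups in the order listed by the WPAD specification.
const techniques = [
  { name: "DHCP", lookup: () => null },
  { name: "SLP", lookup: () => null },
  { name: "DNS well-known hostnames", lookup: () => "http://wpad.example.com/wpad.dat" },
  { name: "DNS SRV records", lookup: () => null },
  { name: "DNS service URIs in TXT records", lookup: () => null },
];

console.log(discoverPacUri(techniques));
```

In this run the first two stubs find nothing, so the DNS well-known hostname lookup supplies the URI and the remaining techniques are never attempted.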

Tricky Things About Proxy Requests

This section explains some of the tricky and much misunderstood aspects of proxy

server requests, including:

• How the URIs in proxy requests differ from server requests

• How intercepting and reverse proxies can obscure server host information

• The rules for URI modification

• How proxies impact a browser’s clever URI auto-completion or hostname-expansion

features

Proxy URIs Differ from Server URIs

Web server and web proxy messages have the same syntax, with one exception. The

URI in an HTTP request message differs when a client sends the request to a server

instead of a proxy.

When a client sends a request to a web server, the request line contains only a partial

URI (without a scheme, host, or port), as shown in the following example:

GET /index.html HTTP/1.0

User-Agent: SuperBrowser v1.3

When a client sends a request to a proxy, however, the request line contains the full

URI. For example:

GET http://www.marys-antiques.com/index.html HTTP/1.0

User-Agent: SuperBrowser v1.3

Why have two different request formats, one for proxies and one for servers? In the

original HTTP design, clients talked directly to a single server. Virtual hosting did

not exist, and no provision was made for proxies. Because a single server knows its

own hostname and port, to avoid sending redundant information, clients sent just

the partial URI, without the scheme and host (and port).

When proxies emerged, the partial URIs became a problem. Proxies needed to know

the name of the destination server, so they could establish their own connections to

the server. And proxy-based gateways needed the scheme of the URI to connect to

FTP resources and other schemes. HTTP/1.0 solved the problem by requiring the full

URI for proxy requests, but it retained partial URIs for server requests (there were

too many servers already deployed to change all of them to support full URIs).*

So we need to send partial URIs to servers, and full URIs to proxies. In the case of

explicitly configured client proxy settings, the client knows what type of request to

issue:

• When the client is not set to use a proxy, it sends the partial URI (Figure 6-15a).

• When the client is set to use a proxy, it sends the full URI (Figure 6-15b).
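A client’s choice between the two forms can be sketched as follows. The function name buildRequestLine is hypothetical, and the sketch handles only GET requests. Note that the standard URL class normalizes what it parses, for example turning an empty path into “/”.

```javascript
// Hypothetical helper: build a GET request line for a server or a proxy.
function buildRequestLine(uri, useProxy, version = "HTTP/1.0") {
  const url = new URL(uri);
  // Proxies get the full URI; servers get only the path and query string.
  const target = useProxy ? url.href : url.pathname + url.search;
  return `GET ${target} ${version}`;
}

const uri = "http://www.marys-antiques.com/index.html";
console.log(buildRequestLine(uri, false)); // GET /index.html HTTP/1.0
console.log(buildRequestLine(uri, true));  // GET http://www.marys-antiques.com/index.html HTTP/1.0
```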

* HTTP/1.1 now requires servers to handle full URIs for both proxy and server requests, but in practice, many deployed servers still accept only partial URIs.

Figure 6-15. Intercepting proxies will get server requests (diagram panels: (a) server request, the client sends “GET /index.html HTTP/1.0” directly to the origin server; (b) explicit proxy request, the client sends “GET http://www.marys-antiques.com/index.html HTTP/1.0” to the explicitly configured proxy server; (c) the server hostname points to the surrogate proxy, which sits in front of the origin server; (d) intercepting proxy request, the intercepting proxy receives “GET /index.html HTTP/1.0”)

The Same Problem with Virtual Hosting

The proxy “missing scheme/host/port” problem is the same problem faced by

virtually hosted web servers. Virtually hosted web servers share the same physical

web server among many web sites. When a request comes in for the partial

URI /index.html, the virtually hosted web server needs to know the hostname of

the intended web site (see “Virtually hosted docroots” in Chapter 5 and “Virtual

Hosting” in Chapter 18 for more information).

In spite of the problems being similar, they were solved in different ways:

• Explicit proxies solve the problem by requiring a full URI in the request message.

• Virtually hosted web servers require a Host header to carry the host and port

information.

Intercepting Proxies Get Partial URIs

As long as the clients properly implement HTTP, they will send full URIs in requests

to explicitly configured proxies. That solves part of the problem, but there’s a catch:

a client will not always know it’s talking to a proxy, because some proxies may be

invisible to the client. Even if the client is not configured to use a proxy, the client’s

traffic still may go through a surrogate or intercepting proxy. In both of these cases,

the client will think it’s talking to a web server and won’t send the full URI:

• A surrogate, as described earlier, is a proxy server taking the place of the origin

server, usually by assuming its hostname or IP address. It receives the web server

request and may serve cached responses or proxy requests to the real server. A

client cannot distinguish a surrogate from a web server, so it sends partial URIs

(Figure 6-15c).

• An intercepting proxy is a proxy server in the network flow that hijacks traffic

from the client to the server and either serves a cached response or proxies it.

Because the intercepting proxy hijacks client-to-server traffic, it will receive partial

URIs that are sent to web servers (Figure 6-15d).*

Proxies Can Handle Both Proxy and Server Requests

Because of the different ways that traffic can be redirected into proxy servers,

general-purpose proxy servers should support both full URIs and partial URIs in

request messages. The proxy should use the full URI if it is an explicit proxy request

or use the partial URI and the virtual Host header if it is a web server request.

* Intercepting proxies also might intercept client-to-proxy traffic in some circumstances, in which case the

intercepting proxy might get full URIs and need to handle them. This doesn’t happen often, because explicit

proxies normally communicate on a port different from that used by HTTP (usually 8080 instead of 80), and

intercepting proxies usually intercept only port 80.

The rules for using full and partial URIs are:

• If a full URI is provided, the proxy should use it.

• If a partial URI is provided, and a Host header is present, the Host header

should be used to determine the origin server name and port number.

• If a partial URI is provided, and there is no Host header, the origin server needs

to be determined in some other way:

— If the proxy is a surrogate, standing in for an origin server, the proxy can be

configured with the real server’s address and port number.

— If the traffic was intercepted, and the interceptor makes the original IP

address and port available, the proxy can use the IP address and port num-

ber from the interception technology (see Chapter 20).

— If all else fails, the proxy doesn’t have enough information to determine the

origin server and must return an error message (often suggesting that the

user upgrade to a modern browser that supports Host headers).*
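
The rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name `resolve_origin`, its parameters, and the `CannotDetermineOrigin` exception are hypothetical, port 80 is assumed as the default for partial URIs, and IPv6 literals in the Host header are not handled.

```python
from urllib.parse import urlsplit


class CannotDetermineOrigin(Exception):
    """Hypothetical error: the request gives no usable hint about the origin."""


def resolve_origin(request_target, headers, surrogate_origin=None,
                   intercepted_dest=None):
    """Return (host, port) for the origin server.

    request_target   -- the URI taken from the request line
    headers          -- dict mapping lowercased header names to values
    surrogate_origin -- (host, port) configured for a surrogate, or None
    intercepted_dest -- (ip, port) supplied by the interception layer, or None
    """
    parts = urlsplit(request_target)

    # Rule 1: a full URI names the origin server itself.
    if parts.scheme and parts.hostname:
        default_port = 443 if parts.scheme == "https" else 80
        return parts.hostname, parts.port or default_port

    # Rule 2: a partial URI plus a Host header.
    host_header = headers.get("host")
    if host_header:
        host, _, port = host_header.partition(":")
        return host, int(port) if port else 80

    # Rule 3: no Host header, so fall back to configuration or interception data.
    if surrogate_origin is not None:
        return surrogate_origin
    if intercepted_dest is not None:
        return intercepted_dest

    # If all else fails, the proxy has to answer with an error message.
    raise CannotDetermineOrigin(request_target)
```

For example, `resolve_origin("/tools.html", {"host": "www.joes-hardware.com:8080"})` yields the host and port from the Host header, while the same call with an empty header dictionary and no fallback data raises the error.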

In-Flight URI Modiﬁcation

Proxy servers need to be very careful about changing the request URI as they for-

ward messages. Slight changes in the URI, even if they seem benign, may create

interoperability problems with downstream servers.

In particular, some proxies have been known to “canonicalize” URIs into a standard

form before forwarding them to the next hop. Seemingly benign transformations,

such as replacing default HTTP ports with an explicit “:80”, or correcting URIs by

replacing illegal reserved characters with their properly escaped substitutions, can

cause interoperation problems.

In general, proxy servers should strive to be as tolerant as possible. They should not

aim to be “protocol policemen” looking to enforce strict protocol compliance,

because this could involve significant disruption of previously functional services.

In particular, the HTTP specifications forbid general intercepting proxies from

rewriting the absolute path parts of URIs when forwarding them. The only excep-

tion is that they can replace an empty path with “/”.

URI Client Auto-Expansion and Hostname Resolution

Browsers resolve request URIs differently, depending on whether or not a proxy is

present. Without a proxy, the browser takes the URI you type in and tries to find a

corresponding IP address. If the hostname is found, the browser tries the corre-

sponding IP addresses until it gets a successful connection.

* This shouldn’t be done casually. Users will receive cryptic error pages they never got before.

148 |Chapter 6: Proxies

But if the host isn’t found, many browsers attempt to provide some automatic

“expansion” of hostnames, in case you typed in a “shorthand” abbreviation of the

host (refer back to “Expandomatic URLs” in Chapter 2):*

• Many browsers attempt adding a “www.” prefix and a “.com” suffix, in case you

just entered the middle piece of a common web site name (e.g., to let people

enter “yahoo” instead of “www.yahoo.com”).

• Some browsers even pass your unresolvable URI to a third-party site, which

attempts to correct spelling mistakes and suggest URIs you may have intended.

• In addition, the DNS configuration on most systems allows you to enter just the

prefix of the hostname, and the DNS automatically searches the domain. If you are

in the domain “oreilly.com” and type in the hostname “host7,” the DNS automatically

attempts to match “host7.oreilly.com”, because “host7” alone is not a complete, valid hostname.

URI Resolution Without a Proxy

Figure 6-16 shows an example of browser hostname auto-expansion without a

proxy. In steps 2a–3c, the browser looks up variations of the hostname until a valid

hostname is found.

Here’s what’s going on in this figure:

• In Step 1, the user types “oreilly” into the browser’s URI window. The browser

uses “oreilly” as the hostname and assumes a default scheme of “http://”, a

default port of “80”, and a default path of “/”.

• In Step 2a, the browser looks up host “oreilly.” This fails.

* Most browsers let you type in “yahoo” and auto-expand that into “www.yahoo.com.” Similarly, browsers

let you omit the “http://” prefix and insert it if it’s missing.

Figure 6-16. Browser auto-expands partial hostnames when no explicit proxy is present

[Figure: (1) User types “oreilly” into the browser’s URI location window. (2a) Browser looks up host “oreilly” via DNS. (2b) Failed, host unknown. (3a) The browser does auto-expansion, converting “oreilly” into “www.oreilly.com”. (3b) Browser looks up host “www.oreilly.com” via DNS. (3c) Success; IP addresses are returned. (4a) Browser tries to connect to the IP addresses, one by one, until a connection succeeds. (4b) Success; connection established. (5a) Browser sends HTTP request. (5b) Browser gets HTTP response.]

• In Step 3a, the browser auto-expands the hostname and asks the DNS to resolve

“www.oreilly.com.” This is successful.

• The browser then successfully connects to www.oreilly.com.
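
The lookup sequence in Figure 6-16 can be sketched as a loop over candidate hostnames. This is a rough illustration, not how any particular browser is implemented: the function names, the order of candidates, and the injected `lookup` callable (standing in for a real DNS query) are all assumptions.

```python
def expansion_candidates(typed, search_domains=()):
    """List the hostnames to try, in order, for what the user typed."""
    candidates = [typed]
    # Local DNS search domains, e.g. "host7" -> "host7.oreilly.com".
    for domain in search_domains:
        candidates.append(f"{typed}.{domain}")
    # "www." prefix and ".com" suffix for a bare middle piece such as "oreilly".
    if "." not in typed:
        candidates.append(f"www.{typed}.com")
    return candidates


def resolve_with_expansion(typed, lookup, search_domains=()):
    """Try each candidate until lookup() returns a non-empty address list.

    lookup -- callable taking a hostname and returning a list of IP addresses
              (an empty list means "host unknown")
    """
    for name in expansion_candidates(typed, search_domains):
        addresses = lookup(name)
        if addresses:
            return name, addresses
    return None, []


# Usage with a fake DNS table (192.0.2.10 is a documentation-only address).
fake_dns = {"www.oreilly.com": ["192.0.2.10"]}
name, addresses = resolve_with_expansion("oreilly", lambda h: fake_dns.get(h, []))
```

Passing the resolver in as a parameter keeps the sketch independent of the network; a real client would perform actual DNS queries at that point.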

URI Resolution with an Explicit Proxy

When you use an explicit proxy the browser no longer performs any of these conve-

nience expansions, because the user’s URI is passed directly to the proxy.

As shown in Figure 6-17, the browser does not auto-expand the partial hostname

when there is an explicit proxy. As a result, when the user types “oreilly” into the

browser’s location window, the proxy is sent “http://oreilly/” (the browser adds the

default scheme and path but leaves the hostname as entered).

For this reason, some proxies attempt to mimic as many of the browser’s

convenience services as they can, including “www...com” auto-expansion and

addition of local domain suffixes.*

URI Resolution with an Intercepting Proxy

Hostname resolution is a little different with an invisible intercepting proxy, because

as far as the client is concerned, there is no proxy! The behavior proceeds much like

the server case, with the browser auto-expanding hostnames until DNS success. But

a significant difference occurs when the connection to the server is made, as

Figure 6-18 illustrates.

Figure 6-17. Browser does not auto-expand partial hostnames when there is an explicit proxy

[Figure: (1) User types “oreilly” into the browser’s URI location window. (2a) Proxy is explicitly configured, so the browser looks up the address of the proxy server using DNS. (2b) Success; proxy server IP addresses are returned. (3a) Browser tries to connect to the proxy. (3b) Success; connection established. (4a) Browser sends the HTTP request to the proxy. (4b) The proxy gets a partial hostname in the request, because the client did not auto-expand it.]

Request message, as sent in (4a):

    GET http://oreilly/ HTTP/1.0
    Proxy-connection: Keep-Alive
    User-agent: Mozilla/4.72[en] (Win98:I)
    Host: oreilly
    Accept: */*
    Accept-encoding: gzip
    Accept-language: en
    Accept-charset: iso-8859-1,*,utf-8

* But, for widely shared proxies, it may be impossible to know the proper domain suffix for individual users.

Figure 6-18 demonstrates the following transaction:

• In Step 1, the user types “oreilly” into the browser’s URI location window.

• In Step 2a, the browser looks up the host “oreilly” via DNS, but the DNS server

fails and responds that the host is unknown, as shown in Step 2b.

• In Step 3a, the browser does auto-expansion, converting “oreilly” into

“www.oreilly.com.” In Step 3b, the browser looks up the host “www.oreilly.com” via

DNS. This time, as shown in Step 3c, the DNS server is successful and returns IP

addresses back to the browser.

• In Step 4a, the client already has successfully resolved the hostname and has a

list of IP addresses. Normally, the client tries to connect to each IP address until

it succeeds, because some of the IP addresses may be dead. But with an inter-

cepting proxy, the first connection attempt is terminated by the proxy server, not

the origin server. The client believes it is successfully talking to the web server,

but the web server might not even be alive.

• When the proxy finally is ready to interact with the real origin server (Step 5b),

the proxy may find that the IP address actually points to a down server. To pro-

vide the same level of fault tolerance provided by the browser, the proxy needs

to try other IP addresses, either by reresolving the hostname in the Host header

or by doing a reverse DNS lookup on the IP address. It is important that both

intercepting and explicit proxy implementations support fault tolerance on DNS

resolution to dead servers, because when browsers are configured to use an

explicit proxy, they rely on the proxy for fault tolerance.
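
The fault tolerance described in the last step can be sketched as a loop that tries each resolved address in turn. This is a minimal illustration: the function name and the injectable `connect` parameter are assumptions made so the logic can be exercised without a network; by default it uses the standard library's `socket.create_connection`.

```python
import socket


def connect_to_first_live(addresses, port=80, timeout=3.0,
                          connect=socket.create_connection):
    """Return (ip, connection) for the first address that accepts a connection.

    addresses -- IP addresses from DNS resolution, in the order to try them
    connect   -- callable taking ((ip, port), timeout); replaceable for testing
    """
    last_error = None
    for ip in addresses:
        try:
            return ip, connect((ip, port), timeout)
        except OSError as exc:
            # Dead or unreachable server: remember the error, try the next one.
            last_error = exc
    raise ConnectionError(f"no reachable address among {addresses!r}") from last_error
```

A proxy that only tried the single address the client happened to pick would lose the fallback behavior the browser provides on its own.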

Figure 6-18. Browser doesn’t detect dead server IP addresses when using intercepting proxies

[Figure: the client, DNS server, interceptor, proxy, and www.oreilly.com exchange the messages of steps 1 through 5b.]

Tracing Messages

Today, it’s not uncommon for web requests to go through a chain of two or more

proxies on their way from the client to the server (Figure 6-19). For example, many

corporations use caching proxy servers to access the Internet, for security and cost

savings, and many large ISPs use proxy caches to improve performance and imple-

ment features. A significant percentage of web requests today go through proxies. At

the same time, it’s becoming increasingly popular to replicate content on banks of

surrogate caches scattered around the globe, for performance reasons.

Proxies are developed by different vendors. They have different features and bugs

and are administrated by various organizations.

As proxies become more prevalent, you need to be able to trace the flow of messages

across proxies and to detect any problems, just as it is important to trace the flow of

IP packets across different switches and routers.

The Via Header

The Via header field lists information about each intermediate node (proxy or gate-

way) through which a message passes. Each time a message goes through another

node, the intermediate node must be added to the end of the Via list.

The following Via string tells us that the message traveled through two proxies. It

indicates that the first proxy implemented the HTTP/1.1 protocol and was called

proxy-62.irenes-isp.net, and that the second proxy implemented HTTP/1.0 and was

called cache.joes-hardware.com:

Via: 1.1 proxy-62.irenes-isp.net, 1.0 cache.joes-hardware.com

The Via header field is used to track the forwarding of messages, diagnose message

loops, and identify the protocol capabilities of all senders along the request/response

chain (Figure 6-20).

Proxies also can use Via headers to detect routing loops in the network. A proxy

should insert a unique string associated with itself in the Via header before sending

out a request and should check for the presence of this string in incoming requests to

detect routing loops in the network.
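
As a rough sketch of that loop check, a proxy could look for its own name among the Via waypoints before appending itself. The function names and the `LoopDetected` exception are hypothetical, and the simple comma split assumes no waypoint comment contains a comma.

```python
class LoopDetected(Exception):
    """Hypothetical error raised when a proxy sees its own name in Via."""


def via_node_names(via_value):
    """Return the lowercased node name of each waypoint in a Via value."""
    names = []
    for waypoint in via_value.split(","):
        fields = waypoint.split()
        if len(fields) >= 2:  # received-protocol, then received-by
            names.append(fields[1].lower())
    return names


def forward_via(via_value, my_name, my_version="1.1"):
    """Return the Via value to send onward, or raise if we already appear in it."""
    if via_value and my_name.lower() in via_node_names(via_value):
        raise LoopDetected(my_name)
    entry = f"{my_version} {my_name}"
    return f"{via_value}, {entry}" if via_value else entry
```

Called with `"1.1 proxy-62.irenes-isp.net"` and the name `cache.joes-hardware.com` at version `1.0`, this produces the two-waypoint header shown earlier in this section.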

Figure 6-19. Access proxies and CDN proxies create two-level proxy hierarchies

[Figure: Client, ISP proxy, Internet, surrogate cache bank, web server.]

Via syntax

The Via header field contains a comma-separated list of waypoints. Each waypoint

represents an individual proxy server or gateway hop and contains information about

the protocol and address of that intermediate node. Here is an example of a Via

header with two waypoints:

Via: 1.1 cache.joes-hardware.com, 1.1 proxy.irenes-isp.net

The formal syntax for a Via header is shown here:

Via = "Via" ":" 1#( waypoint )

waypoint = ( received-protocol received-by [ comment ] )

received-protocol = [ protocol-name "/" ] protocol-version

received-by = ( host [ ":" port ] ) | pseudonym

Note that each Via waypoint contains up to four components: an optional protocol

name (defaults to HTTP), a required protocol version, a required node name, and an

optional descriptive comment:

Protocol name

The protocol received by an intermediary. The protocol name is optional if the

protocol is HTTP. Otherwise, the protocol name is prepended to the version,

separated by a “/”. Non-HTTP protocols can occur when gateways connect

HTTP requests for other protocols (HTTPS, FTP, etc.).

Protocol version

The version of the message received. The format of the version depends on the

protocol. For HTTP, the standard version numbers are used (“1.0”, “1.1”, etc.).

The version is included in the Via field, so later applications will know the proto-

col capabilities of all previous intermediaries.

Node name

The host and optional port number of the intermediary (if the port isn’t

included, you can assume the default port for the protocol). In some cases an

organization might not want to give out the real hostname, for privacy reasons,

in which case it may be replaced by a pseudonym.

Figure 6-20. Via header example

[Figure: Client, proxy-62.irenes-isp.net (HTTP/1.1), cache.joes-hardware.com (HTTP/1.0), www.joes-hardware.com.]

Request message (as received by server):

    GET /index.html HTTP/1.0
    Accept: text/html
    Host: www.joes-hardware.com
    Via: 1.1 proxy-62.irenes-isp.net, 1.0 cache.joes-hardware.com

Node comment

An optional comment that further describes the intermediary node. It’s com-

mon to include vendor and version information here, and some proxy servers

also use the comment field to include diagnostic information about the events

that occurred on that device.*
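
The four waypoint components can be pulled apart with a small parser. The following is only a sketch: the `Waypoint` type and `parse_via` function are illustrative names, and the regular expression is a simplification of the formal grammar that does not handle nested comments.

```python
import re
from typing import NamedTuple, Optional


class Waypoint(NamedTuple):
    protocol_name: str        # defaults to "HTTP" when omitted
    protocol_version: str
    received_by: str          # host[:port] or a pseudonym
    comment: Optional[str]    # text inside the parentheses, if any


_WAYPOINT = re.compile(
    r"\s*(?:(?P<name>[^\s/,]+)/)?(?P<version>[^\s/,]+)"   # [name "/"] version
    r"\s+(?P<by>[^\s,()]+)"                               # received-by
    r"(?:\s+\((?P<comment>[^()]*)\))?\s*(?:,|$)"          # optional (comment)
)


def parse_via(value):
    """Split a Via header value into a list of Waypoint tuples."""
    waypoints, pos = [], 0
    while pos < len(value):
        match = _WAYPOINT.match(value, pos)
        if match is None:
            raise ValueError(f"malformed Via value at offset {pos}: {value!r}")
        waypoints.append(Waypoint(match["name"] or "HTTP", match["version"],
                                  match["by"], match["comment"]))
        pos = match.end()
    return waypoints
```

Treating a missing protocol name as "HTTP" mirrors the default described above.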

Via request and response paths

Both request and response messages pass through proxies, so both request and

response messages have Via headers.

Because requests and responses usually travel over the same TCP connection,

response messages travel backward across the same path as the requests. If a request

message goes through proxies A, B, and C, the corresponding response message travels

through proxies C, B, then A. So, the Via header for responses is almost always

the reverse of the Via header for requests (Figure 6-21).

Via and gateways

Some proxies provide gateway functionality to servers that speak non-HTTP proto-

cols. The Via header records these protocol conversions, so HTTP applications can

be aware of protocol capabilities and conversions along the proxy chain. Figure 6-22

shows an HTTP client requesting an FTP URI through an HTTP/FTP gateway.

The client sends an HTTP request for ftp://http-guide.com/pub/welcome.txt to the

gateway proxy.irenes-isp.net. The proxy, acting as a protocol gateway, retrieves the

desired object from the FTP server, using the FTP protocol. The proxy then sends

the object back to the client in an HTTP response, with this Via header field:

Via: FTP/1.0 proxy.irenes-isp.net (Traffic-Server/5.0.1-17882 [cMsf])

* For example, caching proxy servers may include hit/miss information.

Figure 6-21. The response Via is usually the reverse of the request Via

[Figure: a request travels from the client through proxies A, B, and C to the server; the response returns through C, B, and A.]

Request Via header: Via: 1.1 A, 1.1 B, 1.1 C

Response Via header: Via: 1.1 C, 1.1 B, 1.1 A

Notice the received protocol is FTP. The optional comment contains the brand and

version number of the proxy server and some vendor diagnostic information. You

can read all about gateways in Chapter 8.

The Server and Via headers

The Server response header field describes the software used by the origin server.

Here are a few examples:

Server: Apache/1.3.14 (Unix) PHP/4.0.4

Server: Netscape-Enterprise/4.1

Server: Microsoft-IIS/5.0

If a response message is being forwarded through a proxy, make sure the proxy does

not modify the Server header. The Server header is meant for the origin server.

Instead, the proxy should add a Via entry.

Privacy and security implications of Via

There are some cases when we don’t want exact hostnames in the Via string. In

general, unless this behavior is explicitly enabled, when a proxy server is part of a net-

work firewall it should not forward the names and ports of hosts behind the firewall,

because knowledge of network architecture behind a firewall might be of use to a

malicious party.*

Figure 6-22. HTTP/FTP gateway generates Via headers, logging the received protocol (FTP)

[Figure: the client sends an HTTP request to proxy.irenes-isp.net (HTTP/1.0), which exchanges an FTP request and FTP response with the FTP server http-guide.com.]

HTTP request message sent to proxy:

    GET ftp://http-guide.com/pub/welcome.txt HTTP/1.0

HTTP response message:

    HTTP/1.0 200 OK
    Date: Sun, 11 Nov 2001 21:01:59 GMT
    Via: FTP/1.0 proxy.irenes-isp.net (Traffic-Server/5.0.1-17882 [cMsf])
    Last-modified: Sun, 11 Nov 2001 21:05:24 GMT
    Content-type: text/plain

    Hi there. This is an FTP server.

* Malicious people can use the names of computers and version numbers to learn about the network architecture behind a security perimeter. This information might be helpful in security attacks. In addition, the names of computers might be clues to private projects within an organization.

If Via node-name forwarding is not enabled, proxies that are part of a security perim-

eter should replace the hostname with an appropriate pseudonym for that host. Gen-

erally, though, proxies should try to retain a Via waypoint entry for each proxy

server, even if the real name is obscured.

For organizations that have very strong privacy requirements for obscuring the

design and topology of internal network architectures, a proxy may combine an

ordered sequence of Via waypoint entries (with identical received-protocol values)

into a single, joined entry. For example:

Via: 1.0 foo, 1.1 devirus.company.com, 1.1 access-logger.company.com

could be collapsed to:

Via: 1.0 foo, 1.1 concealed-stuff

Don’t combine multiple entries unless they all are under the same organizational

control and the hosts already have been replaced by pseudonyms. Also, don’t com-

bine entries that have different received-protocol values.
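
The collapsing rule can be sketched as follows. This is an illustration only: the function names and the `is_internal` predicate (deciding which hosts sit under the organization's control) are assumptions, and each waypoint is represented as a simple (received-protocol, node-name) pair.

```python
def collapse_via(waypoints, is_internal, pseudonym="concealed-stuff"):
    """Hide internal node names and join adjacent hidden entries.

    waypoints   -- ordered list of (received_protocol, node_name) pairs
    is_internal -- callable returning True for hosts behind the perimeter
    """
    collapsed = []
    for protocol, name in waypoints:
        if not is_internal(name):
            collapsed.append((protocol, name))
            continue
        # Join only with an immediately preceding hidden entry that has the
        # same received-protocol value; otherwise start a new hidden entry.
        if collapsed and collapsed[-1] == (protocol, pseudonym):
            continue
        collapsed.append((protocol, pseudonym))
    return collapsed


def format_via(waypoints):
    """Render (protocol, name) pairs as a Via header value."""
    return ", ".join(f"{protocol} {name}" for protocol, name in waypoints)


entries = [("1.0", "foo"),
           ("1.1", "devirus.company.com"),
           ("1.1", "access-logger.company.com")]
result = format_via(collapse_via(entries, lambda h: h.endswith(".company.com")))
```

With the entries from the example above, `result` is the collapsed header value `1.0 foo, 1.1 concealed-stuff`.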

The TRACE Method

Proxy servers can change messages as the messages are forwarded. Headers are

added, modified, and removed, and bodies can be converted to different formats. As

proxies become more sophisticated, and more vendors deploy proxy products,

interoperability problems increase. To easily diagnose proxy networks, we need a

way to conveniently watch how messages change as they are forwarded, hop by hop,

through the HTTP proxy network.

HTTP/1.1’s TRACE method lets you trace a request message through a chain of

proxies, observing what proxies the message passes through and how each proxy

modifies the request message. TRACE is very useful for debugging proxy flows.*

When the TRACE request reaches the destination server,† the entire request message

is reflected back to the sender, bundled up in the body of an HTTP response

(see Figure 6-23). When the TRACE response arrives, the client can examine the

exact message the server received and the list of proxies through which it passed (in

the Via header). The TRACE response has Content-Type message/http and a 200

OK status.

Max-Forwards

Normally, TRACE messages travel all the way to the destination server, regardless of

the number of intervening proxies. You can use the Max-Forwards header to limit

* Unfortunately, it isn’t widely implemented yet.

† The final recipient is either the origin server or the first proxy or gateway to receive a Max-Forwards value of

zero (0) in the request.


the number of proxy hops for TRACE and OPTIONS requests, which is useful for

testing a chain of proxies forwarding messages in an infinite loop or for checking the

effects of particular proxy servers in the middle of a chain. Max-Forwards also limits

the forwarding of OPTIONS messages (see “Proxy Interoperation”).

The Max-Forwards request header field contains a single integer indicating the

remaining number of times this request message may be forwarded (Figure 6-24). If

the Max-Forwards value is zero (Max-Forwards: 0), the receiver must reflect the

TRACE message back toward the client without forwarding it further, even if the

receiver is not the origin server.

If the received Max-Forwards value is greater than zero, the forwarded message must

contain an updated Max-Forwards field with a value decremented by one. All prox-

ies and gateways should support Max-Forwards. You can use Max-Forwards to view

the request at any hop in a proxy chain.
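
The Max-Forwards rules can be sketched as a per-hop decision. The function name, the header dictionary shape, and the "reflect"/"forward" return labels are assumptions made for illustration.

```python
def handle_max_forwards(headers, is_origin_server=False):
    """Decide what a node does with a TRACE (or OPTIONS) request.

    Returns ("reflect", headers) when this node must answer itself, or
    ("forward", updated_headers) with Max-Forwards decremented by one.
    """
    if is_origin_server:
        return "reflect", dict(headers)

    value = headers.get("Max-Forwards")
    if value is None:
        # No limit given: keep forwarding toward the destination server.
        return "forward", dict(headers)

    remaining = int(value)
    if remaining == 0:
        # Must not forward further, even though this is not the origin server.
        return "reflect", dict(headers)

    updated = dict(headers)
    updated["Max-Forwards"] = str(remaining - 1)
    return "forward", updated


# Walk a request with Max-Forwards: 2 through a chain of three proxies.
request = {"Host": "www.joes-hardware.com", "Max-Forwards": "2"}
actions = []
for _ in range(3):
    action, request = handle_max_forwards(request)
    actions.append(action)
    if action == "reflect":
        break
```

In this walk the first two proxies forward the request and the third reflects it, which is one way to inspect the request as it looks partway along a chain.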

Figure 6-23. TRACE response reflects back the received request message

[Figure: Client, Proxy 1 (proxy.irenes-isp.net), Proxy 2 (p1127.att.net), Proxy 3 (cache.joes-hardware.com), Server (www.joes-hardware.com).]

TRACE request:

    TRACE /index.html HTTP/1.1
    Host: www.joes-hardware.com
    Accept: text/html

TRACE response, with the received request in the body:

    HTTP/1.1 200 OK
    Content-Type: message/http
    Content-Length: 269
    Via: 1.1 cache.joes-hardware.com, 1.1 p1127.att.net, 1.1 proxy.irenes-isp.net

    TRACE /index.html HTTP/1.1
    Host: www.joes-hardware.com
    Accept: text/html
    Via: 1.1 proxy.irenes-isp.net, 1.1 p1127.att.net, 1.1 cache.joes-hardware.com
    X-Magic-CDN-Thingy: 134-AF-0003
    Cookie: access-isp="Irene’s ISP, California"
    Client-ip: 209.134.49.32

Proxy Authentication

Proxies can serve as access-control devices. HTTP defines a mechanism called proxy

authentication that blocks requests for content until the user provides valid access-

permission credentials to the proxy:

• When a request for restricted content arrives at a proxy server, the proxy server

can return a 407 Proxy Authentication Required status code demanding access

credentials, accompanied by a Proxy-Authenticate header field that describes

how to provide those credentials (Figure 6-25b).

• When the client receives the 407 response, it attempts to gather the required cre-

dentials, either from a local database or by prompting the user.

• Once the credentials are obtained, the client resends the request, providing the

required credentials in a Proxy-Authorization header field.

• If the credentials are valid, the proxy passes the original request along the chain

(Figure 6-25c); otherwise, another 407 reply is sent.
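
For the Basic scheme used in Figure 6-25, the client's side of this exchange can be sketched as below. The function names are illustrative, and the username and password are simply what the figure's `YnJpOmZvbw==` value decodes to ("bri" and "foo"); other authentication schemes would need different handling.

```python
import base64


def basic_proxy_authorization(username, password):
    """Build a Proxy-Authorization value for the Basic scheme."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def retry_with_credentials(status, response_headers, request_headers, credentials):
    """Return new request headers after a 407 Basic challenge, else None.

    credentials -- (username, password), e.g. from a local store or a prompt
    """
    if status != 407:
        return None
    challenge = response_headers.get("Proxy-Authenticate", "")
    if not challenge.lower().startswith("basic"):
        return None  # this sketch only understands the Basic scheme
    retry = dict(request_headers)
    retry["Proxy-Authorization"] = basic_proxy_authorization(*credentials)
    return retry
```

Note that Basic credentials are only encoded, not encrypted, so anyone who can read the header can recover them.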

Proxy authentication generally does not work well when there are multiple proxies in

a chain, each participating in authentication. People have proposed enhancements to

HTTP to associate authentication credentials with particular waypoints in a proxy

chain, but those enhancements have not been widely implemented.

Be sure to read Chapter 12 for a detailed explanation of HTTP’s authentication

mechanisms.

Proxy Interoperation

Clients, servers, and proxies are built by multiple vendors, to different versions of the

HTTP specification. They support various features and have different bugs. Proxy

servers need to intermediate between client-side and server-side devices, which may

implement different protocols and have troublesome quirks.

Figure 6-24. You can limit the forwarding hop count with the Max-Forwards header field

[Figure: Client, Proxy 1 (proxy.irenes-isp.net), Proxy 2 (p1127.att.net), Proxy 3 (cache.joes-hardware.com), Server (www.joes-hardware.com); the hops along the chain are labeled Max-Forwards=1 and Max-Forwards=0.]

TRACE request:

    TRACE /index.html HTTP/1.1
    Host: www.joes-hardware.com
    Max-Forwards: 2
    Accept: text/html

TRACE response, with the received request in the body:

    HTTP/1.1 200 OK
    Content-Type: message/http
    Content-Length: 269
    Via: 1.1 p1127.att.net, 1.1 proxy.irenes-isp.net

    TRACE /index.html HTTP/1.1
    Host: www.joes-hardware.com
    Accept: text/html
    Via: 1.1 proxy.irenes-isp.net, 1.1 p1127.att.net
    X-Magic-CDN-Thingy: 134-AF-0003
    Cookie: access-isp="Irene’s ISP, California"
    Client-ip: 209.134.49.32

Handling Unsupported Headers and Methods

The proxy server may not understand all the header fields that pass through it. Some

headers may be newer than the proxy itself; others may be customized header fields

unique to a particular application. Proxies must forward unrecognized header fields

and must maintain the relative order of header fields with the same name.* Similarly,

if a proxy is unfamiliar with a method, it should try to forward the message to the

next hop, if possible.
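
To see why the relative order of same-named header fields matters, consider how a recipient might combine them into a single comma-separated value. This is a minimal sketch with a hypothetical function name; it treats field names as case-insensitive and does nothing else to the values.

```python
def combine_fields(header_lines):
    """Combine same-named header fields into comma-separated values.

    header_lines -- list of (name, value) pairs in the order received
    Returns a dict keyed by lowercased field name.
    """
    combined = {}
    for name, value in header_lines:
        key = name.lower()
        if key in combined:
            # Later fields are appended, so received order shapes the result.
            combined[key] = f"{combined[key]}, {value}"
        else:
            combined[key] = value
    return combined


received = [("Via", "1.1 A"), ("Accept", "text/html"), ("Via", "1.1 B")]
merged = combine_fields(received)
```

If a proxy swapped the two Via fields while forwarding, the combined value would read "1.1 B, 1.1 A" and describe a different path.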

Proxies that cannot tunnel unsupported methods may not be viable in most net-

works today, because Hotmail access through Microsoft Outlook makes extensive

use of HTTP extension methods.

Figure 6-25. Proxies can implement authentication to control access to content

[Figure: a client requests a “super secret image” from a server through an access control proxy.]

(a) Client request, stopped at the proxy:

    GET http://server.com/secret.jpg HTTP/1.0

(b) Proxy challenge:

    HTTP/1.0 407 Proxy Authentication Required
    Proxy-Authenticate: Basic realm="Secure Stuff"

(c) Client resends the request with credentials:

    Proxy-Authorization: Basic YnJpOmZvbw==

(d) Server response, relayed by the proxy:

    HTTP/1.0 200 OK
    Content-type: image/jpeg

    ...<image data included>...

* Multiple message header fields with the same field name may be present in a message, but if they are, they must be able to be equivalently combined into a comma-separated list. The order in which header fields with the same field name are received is therefore significant to the interpretation of the combined field value, so a proxy can’t change the relative order of these same-named field values when it forwards a message.

OPTIONS: Discovering Optional Feature Support

The HTTP OPTIONS method lets a client (or proxy) discover the supported func-

tionality (for example, supported methods) of a web server or of a particular resource

on a web server (Figure 6-26). Clients can use OPTIONS to determine a server’s

capabilities before interacting with the server, making it easier to interoperate with

proxies and servers of different feature levels.

If the URI of the OPTIONS request is an asterisk (*), the request pertains to the

entire server’s supported functionality. For example:

OPTIONS * HTTP/1.1

If the URI is a real resource, the OPTIONS request inquires about the features avail-

able to that particular resource:

OPTIONS http://www.joes-hardware.com/index.html HTTP/1.1

On success, the OPTIONS method returns a 200 OK response that includes various

header fields that describe optional features that are supported on the server or avail-

able to the resource. The only header field that HTTP/1.1 specifies in the response is

the Allow header, which describes what methods are supported by the server (or

particular resource on the server).* OPTIONS allows an optional response body with

more information, but this is undefined.

The Allow Header

The Allow entity header field lists the set of methods supported by the resource iden-

tified by the request URI, or the entire server if the request URI is *. For example:

Allow: GET, HEAD, PUT

The Allow header can be used as a request header to recommend the methods to be

supported by the new resource. The server is not required to support these methods

and should include an Allow header in the matching response, listing the actual

supported methods.

Figure 6-26. Using OPTIONS to find a server’s supported methods

[Figure: the client sends an OPTIONS request through the proxy to the server and receives the list of allowed methods.]

    OPTIONS * HTTP/1.1

    HTTP/1.1 200 OK
    Allow: GET,PUT,POST,HEAD,TRACE,OPTIONS

* Not all resources support every method. For example, a CGI script query may not support a file PUT, and a static HTML file wouldn’t accept a POST method.

A proxy can’t modify the Allow header field even if it does not understand all the

methods specified, because the client might have other paths to talk to the origin

server.
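
A client that has received an OPTIONS response might check the Allow header before choosing a method. The helper names below are hypothetical; the sketch only splits the comma-separated list and leaves method names exactly as received.

```python
def parse_allow(value):
    """Return the set of method names listed in an Allow header value."""
    return {method.strip() for method in value.split(",") if method.strip()}


def is_allowed(allow_value, method):
    """True if the given method appears in the Allow header value."""
    return method in parse_allow(allow_value)


allowed = parse_allow("GET,PUT,POST,HEAD,TRACE,OPTIONS")
```

Because a proxy passes the field through unmodified, a client can rely on it reflecting what the origin server reported, including methods the proxy itself does not understand.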

For More Information

For more information, refer to:

http://www.w3.org/Protocols/rfc2616/rfc2616.txt

RFC 2616, “Hypertext Transfer Protocol,” by R. Fielding, J. Gettys, J. Mogul, H.

Frystyk, L. Masinter, P. Leach, and T. Berners-Lee.

http://search.ietf.org/rfc/rfc3040.txt

RFC 3040, “Internet Web Replication and Caching Taxonomy.”

Web Proxy Servers

Ari Luotonen, Prentice Hall Computer Books.

http://search.ietf.org/rfc/rfc3143.txt

RFC 3143, “Known HTTP Proxy/Caching Problems.”

Web Caching

Duane Wessels, O’Reilly & Associates, Inc.


CHAPTER 7

Caching

Web caches are HTTP devices that automatically keep copies of popular docu-

ments. When a web request arrives at a cache, if a local “cached” copy is available,

the document is served from the local storage instead of from the origin server.

Caches have the following benefits:

• Caches reduce redundant data transfers, saving you money in network charges.

• Caches reduce network bottlenecks. Pages load faster without more bandwidth.

• Caches reduce demand on origin servers. Servers reply faster and avoid overload.

• Caches reduce distance delays, because pages load slower from farther away.

In this chapter, we explain how caches improve performance and reduce cost, how

to measure their effectiveness, and where to place caches to maximize impact. We

also explain how HTTP keeps cached copies fresh and how caches interact with

other caches and servers.

Redundant Data Transfers

When multiple clients access a popular origin server page, the server transmits the

same document multiple times, once to each client. The same bytes travel across the

network over and over again. These redundant data transfers eat up expensive net-

work bandwidth, slow down transfers, and overload web servers. With caches, the

cache keeps a copy of the first server response. Subsequent requests can be fulfilled

from the cached copy, reducing wasteful, duplicate traffic to and from origin servers.

Bandwidth Bottlenecks

Caches also can reduce network bottlenecks. Many networks provide more band-

width to local network clients than to remote servers (Figure 7-1). Clients access serv-

ers at the speed of the slowest network on the way. If a client gets a copy from a cache

on a fast LAN, caching can boost performance—especially for larger documents.


In Figure 7-1, it might take 30 seconds for a user in the San Francisco branch of Joe’s

Hardware, Inc. to download a 5-MB inventory file from the Atlanta headquarters,

across the 1.4-Mbps T1 Internet connection. If the document was cached in the San

Francisco office, a local user might be able to get the same document in less than a

second across the Ethernet connection.

Table 7-1 shows how bandwidth affects transfer time for a few different network

speeds and a few different sizes of documents. Bandwidth causes noticeable delays

for larger documents, and the speed difference between different network types is

dramatic.* A 56-Kbps modem would take 749 seconds (over 12 minutes) to transfer a

5-MB file that could be transported in under a second across a fast Ethernet LAN.
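
The idealized figures quoted here follow from dividing the document size in bits by the link speed. The sketch below reproduces that arithmetic; it assumes, as the quoted numbers suggest, that 1 KB is 1,024 bytes and 1 Kbit/sec is 1,000 bits per second, and it ignores latency and protocol overhead.

```python
KB = 1024          # bytes per kilobyte (assumed)
MB = 1024 * KB     # bytes per megabyte (assumed)


def transfer_seconds(size_bytes, bits_per_second):
    """Idealized transfer time: total bits divided by link bandwidth."""
    return size_bytes * 8 / bits_per_second


modem_time = transfer_seconds(5 * MB, 56_000)              # 56 Kbit/sec dialup
fast_ethernet_time = transfer_seconds(5 * MB, 100_000_000) # 100 Mbit/sec LAN
```

Here `modem_time` comes to roughly 749 seconds and `fast_ethernet_time` to under half a second, matching the comparison in the text.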

Figure 7-1. Limited wide area bandwidth creates a bottleneck that caches can improve

[Figure: a client in the San Francisco branch office has a fast connection to a local cache (100 Mbit/sec Ethernet) and a slow WAN connection to the server at the Atlanta corporate headquarters (1.4 Mbit/sec T1).]

* This table shows just the effect of network bandwidth on transfer time. It assumes 100% network efficiency and no network or application processing latencies. In this way, the delay is a lower bound. Real delays will be larger, and the delays for small objects will be dominated by non-bandwidth overheads.

Table 7-1. Bandwidth-imposed transfer time delays, idealized (time in seconds)

                                Large HTML   JPEG      Large JPEG   Large file
                                (15 KB)      (40 KB)   (150 KB)     (5 MB)
    Dialup modem (56 Kbit/sec)    2.19         5.85      21.94        748.98
    DSL (256 Kbit/sec)             .48         1.28       4.80        163.84
    T1 (1.4 Mbit/sec)              .09          .23        .85         29.13
    Slow Ethernet (10 Mbit/sec)    .01          .03        .12          4.19
    DS3 (45 Mbit/sec)              .00          .01        .03           .93
    Fast Ethernet (100 Mbit/sec)   .00          .00        .01           .42

Flash Crowds

Caching is especially important to break up flash crowds. Flash crowds occur when a

sudden event (such as breaking news, a bulk email announcement, or a celebrity

event) causes many people to access a web document at nearly the same time

(Figure 7-2). The resulting redundant traffic spike can cause a catastrophic collapse

of networks and web servers.

When the “Starr Report” detailing Kenneth Starr’s investigation of U.S. President

Clinton was released to the Internet on September 11, 1998, the U.S. House of Rep-

resentatives web servers received over 3 million requests per hour, 50 times the aver-

age server load. One news web site, CNN.com, reported an average of over 50,000

requests every second to its servers.

Distance Delays

Even if bandwidth isn’t a problem, distance might be. Every network router adds

delays to Internet traffic. And even if there are not many routers between client and

server, the speed of light alone can cause a significant delay.

The direct distance from Boston to San Francisco is about 2,700 miles. In the very best

case, at the speed of light (186,000 miles/sec), a signal could travel from Boston to San

Francisco in about 15 milliseconds and complete a round trip in 30 milliseconds.*

Figure 7-2. Flash crowds can overload web servers

* In reality, signals travel at somewhat less than the speed of light, so distance delays are even worse.

[Figure 7-2 diagram labels: Atlanta, San Francisco, Los Angeles, Boston, Chicago; flash crowd]

164 |Chapter 7: Caching

Say a web page contains 20 small images, all located on a server in San Francisco. If a

client in Boston opens four parallel connections to the server, and keeps the connec-

tions alive, the speed of light alone contributes almost 1/4 second (240 msec) to the

download time (Figure 7-3). If the server is in Tokyo (6,700 miles from Boston), the

delay grows to 600 msec. Moderately complicated web pages can incur several sec-

onds of speed-of-light delays.
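The arithmetic behind these numbers can be sketched as follows. The sketch assumes, as Figure 7-3 does, that the busiest connection performs eight round trips in series (one connect, one page fetch, and six image fetches). The function names are illustrative only.

```python
SPEED_OF_LIGHT_MILES_PER_SEC = 186_000


def round_trip_ms(distance_miles: float) -> float:
    """Best-case round-trip time at the speed of light, in milliseconds."""
    return 2 * distance_miles / SPEED_OF_LIGHT_MILES_PER_SEC * 1000


def speed_of_light_delay_ms(distance_miles: float, serial_round_trips: int) -> float:
    """Total delay when round trips must happen one after another."""
    return serial_round_trips * round_trip_ms(distance_miles)


if __name__ == "__main__":
    for city, miles in [("San Francisco", 2_700), ("Tokyo", 6_700)]:
        delay = speed_of_light_delay_ms(miles, serial_round_trips=8)
        print(f"Boston to {city}: about {delay:.0f} msec of speed-of-light delay")
```

The exact results (roughly 232 msec and 576 msec) are slightly below the rounded 240 msec and 600 msec quoted in the text, which round each round trip up to the nearest 10 msec or so.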

Placing caches in nearby machine rooms can shrink document travel distance from

thousands of miles to tens of yards.

Hits and Misses

So caches can help. But a cache doesn’t store a copy of every document in the

world.*

Figure 7-3. Speed of light can cause significant delays, even with parallel, keep-alive connections

* Few folks can afford to buy a cache big enough to hold all the Web’s documents. And even if you could afford

gigantic “whole-Web caches,” some documents change so frequently that they won’t be fresh in many caches.

[Figure 7-3 timeline: client in Boston, server in San Francisco, 30 msec speed-of-light round trip. Connection 1 spends one round trip on the connect request, one on GET web page, and six on images (1, 2, 6, 10, 14, 18). Connections 2, 3, and 4 each spend one round trip on a connect request and five on images. Total speed-of-light delay: 240 msec.]

Some requests that arrive at a cache can be served from an available copy. This is

called a cache hit (Figure 7-4a). Other requests arrive at a cache only to be forwarded

to the origin server, because no copy is available. This is called a cache miss

(Figure 7-4b).

Revalidations

Because the origin server content can change, caches have to check every now and

then that their copies are still up-to-date with the server. These “freshness checks”

are called HTTP revalidations (Figure 7-4c). To make revalidations efficient, HTTP

defines special requests that can quickly check if content is still fresh, without fetch-

ing the entire object from the server.

A cache can revalidate a copy any time it wants, and as often as it wants. But because

caches often contain millions of documents, and because network bandwidth is

scarce, most caches revalidate a copy only when it is requested by a client and when

the copy is old enough to warrant a check. We’ll explain the HTTP rules for fresh-

ness checking later in the chapter.

When a cache needs to revalidate a cached copy, it sends a small revalidation request

to the origin server. If the content hasn’t changed, the server responds with a tiny

304 Not Modified response. As soon as the cache learns the copy is still valid, it

marks the copy temporarily fresh again and serves the copy to the client

(Figure 7-5a). This is called a revalidate hit or a slow hit. It’s slower than a pure cache

hit, because it does need to check with the origin server, but it’s faster than a cache

miss, because no object data is retrieved from the server.

Figure 7-4. Cache hits, misses, and revalidations

[Figure 7-4 panels: (a) Cache hit: the cache returns its stored object to the client; (b) Cache miss: the request goes to the server, which returns the server object; (c) Cache revalidate hit: the cache sends a freshness check to the server, receives “Still fresh,” and returns its stored object]

HTTP gives us a few tools to revalidate cached objects, but the most popular is the

If-Modified-Since header. When added to a GET request, this header tells the server

to send the object only if it has been modified since the time the copy was cached.

Here is what happens when a GET If-Modified-Since request arrives at the server in

three circumstances—when the server content is not modified, when the server con-

tent has been changed, and when the object on the server is deleted:

Revalidate hit

If the server object isn’t modified, the server sends the client a small HTTP 304 Not Modified response. This is depicted in Figure 7-6.

Revalidate miss

If the server object is different from the cached copy, the server sends the client a normal HTTP 200 OK response, with the full content.

Object deleted

If the server object has been deleted, the server sends back a 404 Not Found response, and the cache deletes its copy.

Figure 7-5. Successful revalidations are faster than cache misses; failed revalidations are nearly identical to misses

[Figure 7-5 panels: (a) Revalidate hit (slow hit): server object same as cached copy; the freshness check is answered with “Still fresh” and the cache object is served; (b) Revalidate miss: cached copy is out of date; the freshness check is answered with the server object]

Figure 7-6. HTTP uses If-Modified-Since header for revalidation

Revalidate request with If-Modified-Since, sent by the cache (browser cache or proxy cache) to the server:

    GET /announce.html HTTP/1.0
    If-Modified-Since: Sat, 29 Jun 2002 14:30:00 GMT

“Still fresh” response:

    HTTP/1.0 304 Not Modified
    Date: Wed, 03 Jul 2002 19:18:23 GMT
    Content-type: text/plain
    Content-length: 67
    Expires: Fri, 05 Jul 2002 05:00:00 GMT
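The three outcomes (revalidate hit, revalidate miss, object deleted) can be sketched from the origin server's point of view. This is an illustrative sketch, not how any particular server is implemented: the function name is hypothetical, a missing object is represented by `None`, and dates are assumed to be in the standard HTTP date format (no comma after the year).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def revalidation_status(if_modified_since: str,
                        last_modified: Optional[datetime]) -> int:
    """Status code for a GET carrying an If-Modified-Since header."""
    if last_modified is None:
        return 404                      # object deleted: cache should drop its copy
    since = parsedate_to_datetime(if_modified_since)
    if last_modified > since:
        return 200                      # revalidate miss: full content follows
    return 304                          # revalidate hit: no body is sent


if __name__ == "__main__":
    unchanged = datetime(2002, 6, 29, 14, 30, tzinfo=timezone.utc)
    print(revalidation_status("Sat, 29 Jun 2002 14:30:00 GMT", unchanged))
```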

Hit Rate

The fraction of requests that are served from cache is called the cache hit rate (or

cache hit ratio),* or sometimes the document hit rate (or document hit ratio). The hit

rate ranges from 0 to 1 but is often described as a percentage, where 0% means that

every request was a miss (had to get the document across the network), and 100%

means every request was a hit (had a copy in the cache).†

Cache administrators would like the cache hit rate to approach 100%. The actual hit

rate you get depends on how big your cache is, how similar the interests of the cache

users are, how frequently the cached data is changing or personalized, and how the

caches are configured. Hit rate is notoriously difficult to predict, but a hit rate of

40% is decent for a modest web cache today. The nice thing about caches is that

even a modest-sized cache may contain enough popular documents to significantly

improve performance and reduce traffic. Caches work hard to ensure that useful con-

tent stays in the cache.

Byte Hit Rate

Document hit rate doesn’t tell the whole story, though, because documents are not

all the same size. Some large objects might be accessed less often but contribute

more to overall data traffic, because of their size. For this reason, some people pre-

fer the byte hit rate metric (especially those folks who are billed for each byte of

traffic!).

The byte hit rate represents the fraction of all bytes transferred that were served from

cache. This metric captures the degree of traffic savings. A byte hit rate of 100%

means every byte came from the cache, and no traffic went out across the Internet.

Document hit rate and byte hit rate are both useful gauges of cache performance.

Document hit rate describes how many web transactions are kept off the outgoing

network. Because transactions have a fixed time component that can often be large

(setting up a TCP connection to a server, for example), improving the document hit

rate will optimize for overall latency (delay) reduction. Byte hit rate describes how

many bytes are kept off the Internet. Improving the byte hit rate will optimize for

bandwidth savings.
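To make the difference between the two metrics concrete, here is a small sketch that computes both from a list of transactions. The input format, a list of (was_hit, size_in_bytes) pairs, is an assumption made for illustration; real cache logs have richer formats.

```python
from typing import Iterable, Tuple


def hit_rates(transactions: Iterable[Tuple[bool, int]]) -> Tuple[float, float]:
    """Return (document hit rate, byte hit rate), each between 0 and 1."""
    records = list(transactions)
    total_docs = len(records)
    total_bytes = sum(size for _, size in records)
    hit_docs = sum(1 for was_hit, _ in records if was_hit)
    hit_bytes = sum(size for was_hit, size in records if was_hit)
    doc_rate = hit_docs / total_docs if total_docs else 0.0
    byte_rate = hit_bytes / total_bytes if total_bytes else 0.0
    return doc_rate, byte_rate


if __name__ == "__main__":
    # Three small hits and one large miss.
    log = [(True, 15_000), (True, 40_000), (True, 15_000), (False, 5_000_000)]
    doc_rate, byte_rate = hit_rates(log)
    print(f"document hit rate: {doc_rate:.0%}, byte hit rate: {byte_rate:.1%}")
```

In this made-up log, three of four requests are hits (a 75% document hit rate), yet the single large miss carries almost all of the bytes, so the byte hit rate is under 2%.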

* The term “hit ratio” probably is better than “hit rate,” because “hit rate” mistakenly suggests a time factor.

However, “hit rate” is in common use, so we use it here.

† Sometimes people include revalidate hits in the hit rate, but other times hit rate and revalidate hit rate are

measured separately. When you are examining hit rates, be sure you know what counts as a “hit.”


Distinguishing Hits and Misses

Unfortunately, HTTP provides no way for a client to tell if a response was a cache hit

or an origin server access. In both cases, the response code will be 200 OK, indicat-

ing that the response has a body. Some commercial proxy caches attach additional

information to Via headers to describe what happened in the cache.

One way that a client can usually detect if the response came from a cache is to use

the Date header. By comparing the value of the Date header in the response to the

current time, a client can often detect a cached response by its older date value.

Another way a client can detect a cached response is the Age header, which tells how

old the response is (see “Age” in Appendix C).
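A client-side heuristic based on these two headers might look like the sketch below. It assumes the response headers are available as a dictionary keyed by canonical header names, and the `slack_seconds` tolerance for clock differences is an arbitrary illustrative value. As the text notes, this detects cached responses often, not always.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def probably_cached(headers: dict, now: datetime, slack_seconds: int = 5) -> bool:
    """Guess whether a response came from a cache, using Age and Date."""
    age = headers.get("Age", "").strip()
    if age.isdigit() and int(age) > 0:
        return True                     # a cache reported how old the response is
    date = headers.get("Date")
    if date is None:
        return False                    # nothing to compare against
    generated = parsedate_to_datetime(date)
    return now - generated > timedelta(seconds=slack_seconds)


if __name__ == "__main__":
    now = datetime(2002, 7, 3, 20, 18, 23, tzinfo=timezone.utc)
    print(probably_cached({"Date": "Wed, 03 Jul 2002 19:18:23 GMT"}, now))
```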

Cache Topologies

Caches can be dedicated to a single user or shared between thousands of users. Dedi-

cated caches are called private caches. Private caches are personal caches, containing

popular pages for a single user (Figure 7-7a). Shared caches are called public caches.

Public caches contain the pages popular in the user community (Figure 7-7b).

Private Caches

Private caches don’t need much horsepower or storage space, so they can be made

small and cheap. Web browsers have private caches built right in—most browsers

cache popular documents in the disk and memory of your personal computer and

allow you to configure the cache size and settings. You also can peek inside the

browser caches to see what they contain. For example, with Microsoft Internet Explorer, you can get the cache contents from the Tools → Internet Options... dialog box. MSIE calls the cached documents “Temporary Files” and lists them in a file display, along with the associated URLs and document expiration times. You can view Netscape Navigator’s cache contents through the special URL about:cache, which gives you a “Disk Cache statistics” page showing the cache contents.

Figure 7-7. Public and private caches

[Figure 7-7 panels: (a) Accessing private cache: client, private cache, Internet, web server; (b) Accessing shared public cache: clients, public cache, Internet, web server]

Public Proxy Caches

Public caches are special, shared proxy servers called caching proxy servers or, more

commonly, proxy caches (proxies were discussed in Chapter 6). Proxy caches serve

documents from the local cache or contact the server on the user’s behalf. Because a

public cache receives accesses from multiple users, it has more opportunity to elimi-

nate redundant traffic.*

In Figure 7-8a, each client redundantly accesses a new, “hot” document (not yet in

the private cache). Each private cache fetches the same document, crossing the net-

work multiple times. With a shared, public cache, as in Figure 7-8b, the cache needs

to fetch the popular object only once, and it uses the shared copy to service all

requests, reducing network traffic.

Proxy caches follow the rules for proxies described in Chapter 6. You can configure

your browser to use a proxy cache by specifying a manual proxy or by configuring a

proxy auto-configuration file (see “Client Proxy Configuration: Manual” in Chapter 6).

You also can force HTTP requests through caches without configuring your browser

by using intercepting proxies (see Chapter 20).

Proxy Cache Hierarchies

In practice, it often makes sense to deploy hierarchies of caches, where cache misses in

smaller caches are funneled to larger parent caches that service the leftover “distilled”

traffic. Figure 7-9 shows a two-level cache hierarchy.† The idea is to use small, inexpensive caches near the clients and progressively larger, more powerful caches up the

hierarchy to hold documents shared by many users.‡

* Because a public cache caches the diverse interests of the user community, it needs to be large enough to hold a set of popular documents, without being swept clean by individual user interests.

† If the clients are browsers with browser caches, Figure 7-9 technically shows a three-level cache hierarchy.

‡ Parent caches may need to be larger, to hold the documents popular across more users, and higher-performance, because they receive the aggregate traffic of many children, whose interests may be diverse.

Hopefully, most users will get cache hits on the nearby, level-1 caches (as shown in Figure 7-9a). If not, larger parent caches may be able to handle their requests (Figure 7-9b). For deep cache hierarchies it’s possible to go through long chains of caches, but each intervening proxy does impose some performance penalty that can become noticeable as the proxy chain becomes long.*
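A cache hierarchy can be pictured as a list of caches searched from the nearest to the farthest. The sketch below models each cache as a plain dictionary and assumes that every cache on the path keeps a copy of what passes through it. Both choices are simplifications made for illustration, and the function name is hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def fetch_through_hierarchy(url: str,
                            levels: List[Dict[str, bytes]],
                            fetch_from_origin: Callable[[str], bytes]) -> Tuple[bytes, str]:
    """Search level-1, level-2, ... caches in order, then fall back to the origin."""
    for depth, cache in enumerate(levels, start=1):
        if url in cache:
            body = cache[url]
            for nearer in levels[:depth - 1]:
                nearer[url] = body      # nearer caches keep a copy on the way back
            return body, f"level-{depth} hit"
    body = fetch_from_origin(url)       # missed every cache in the chain
    for cache in levels:
        cache[url] = body
    return body, "miss"


if __name__ == "__main__":
    level1: Dict[str, bytes] = {}
    level2 = {"/index.html": b"<html>...</html>"}
    print(fetch_through_hierarchy("/index.html", [level1, level2], lambda u: b"")[1])
    print(fetch_through_hierarchy("/index.html", [level1, level2], lambda u: b"")[1])
```

The first request is served by the level-2 cache; the second is served by the level-1 cache, which kept a copy.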

Cache Meshes, Content Routing, and Peering

Some network architects build complex cache meshes instead of simple cache hierar-

chies. Proxy caches in cache meshes talk to each other in more sophisticated ways,

and make dynamic cache communication decisions, deciding which parent caches to

talk to, or deciding to bypass caches entirely and direct themselves to the origin

server. Such proxy caches can be described as content routers, because they make

routing decisions about how to access, manage, and deliver content.

Caches designed for content routing within cache meshes may do all of the follow-

ing (among other things):

• Select between a parent cache or origin server dynamically, based on the URL.
• Select a particular parent cache dynamically, based on the URL.
• Search caches in the local area for a cached copy before going to a parent cache.
• Allow other caches to access portions of their cached content, but do not permit Internet transit through their cache.

These more complex relationships between caches allow different organizations to peer with each other, connecting their caches for mutual benefit. Caches that provide selective peering support are called sibling caches (Figure 7-10). Because HTTP doesn’t provide sibling cache support, people have extended HTTP with protocols, such as the Internet Cache Protocol (ICP) and the HyperText Caching Protocol (HTCP). We’ll talk about these protocols in Chapter 20.

Figure 7-8. Shared, public caches can decrease network traffic

[Figure 7-8 panels: (a) Redundant accesses from private caches: clients, Internet, server; (b) Shared caches can reduce traffic: clients, cache, Internet, server]

* In practice, network architects try to limit themselves to two or three proxies in a row. However, a new generation of high-performance proxy servers may make proxy-chain length less of an issue.

Cache Processing Steps

Modern commercial proxy caches are quite complicated. They are built to be very

high-performance and to support advanced features of HTTP and other technologies.

But, despite some subtle details, the basic workings of a web cache are mostly simple.

A basic cache-processing sequence for an HTTP GET message consists of seven steps

(illustrated in Figure 7-11):

1. Receiving: Cache reads the arriving request message from the network.
2. Parsing: Cache parses the message, extracting the URL and headers.
3. Lookup: Cache checks if a local copy is available and, if not, fetches a copy (and stores it locally).
4. Freshness check: Cache checks if cached copy is fresh enough and, if not, asks server for any updates.
5. Response creation: Cache makes a response message with the new headers and cached body.
6. Sending: Cache sends the response back to the client over the network.
7. Logging: Optionally, cache creates a log file entry describing the transaction.

Figure 7-9. Accessing documents in a two-level cache hierarchy

[Figure 7-9 diagram labels: origin server, level-2 cache, wide area network, regional network, level-1 cache; panel titles: (a) Level-1 cache hit, (b) Level-2 cache hit]

Step 1: Receiving

In Step 1, the cache detects activity on a network connection and reads the incoming

data. High-performance caches read data simultaneously from multiple incoming con-

nections and begin processing the transaction before the entire message has arrived.

Step 2: Parsing

Next, the cache parses the request message into pieces and places the header parts in

easy-to-manipulate data structures. This makes it easier for the caching software to

process the header fields and fiddle with them.*

Figure 7-10. Sibling caches

* The parser also is responsible for normalizing the parts of the header so that unimportant differences, like

capitalization or alternate date formats, all are viewed equivalently. Also, because some request messages

contain a full absolute URL and other request messages contain a relative URL and Host header, the parser

typically hides these details (see “Relative URLs” in Chapter 2).

[Figure 7-10 diagram labels: origin server, wide area network, Organization A, B’s access point, Organization B, A’s access point, sibling]

Step 3: Lookup

In Step 3, the cache takes the URL and checks for a local copy. The local copy

might be stored in memory, on a local disk, or even in another nearby computer.

Professional-grade caches use fast algorithms to determine whether an object is

available in the local cache. If the document is not available locally, the cache can fetch it from the origin server or a parent proxy, or return a failure, based on the situation and configuration.

The cached object contains the server response body and the original server response

headers, so the correct server headers can be returned during a cache hit. The cached

object also includes some metadata, used for bookkeeping how long the object has

been sitting in the cache, how many times it was used, etc.*
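One way to picture a cached object is as a record holding the body, the original server headers, and some bookkeeping metadata. The sketch below is an in-memory illustration only; the class and field names are hypothetical, and production caches use far more elaborate storage.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CachedObject:
    body: bytes                                   # server response body
    server_headers: Dict[str, str]                # original server response headers
    stored_at: float = field(default_factory=time.time)  # metadata: when it was cached
    use_count: int = 0                            # metadata: how many times it was served


class CacheStore:
    """Minimal in-memory lookup table keyed by URL."""

    def __init__(self) -> None:
        self._objects: Dict[str, CachedObject] = {}

    def store(self, url: str, body: bytes, headers: Dict[str, str]) -> None:
        self._objects[url] = CachedObject(body, dict(headers))

    def lookup(self, url: str) -> Optional[CachedObject]:
        obj = self._objects.get(url)
        if obj is not None:
            obj.use_count += 1
        return obj


if __name__ == "__main__":
    store = CacheStore()
    store.store("/index.html", b"<html>...</html>", {"Content-type": "text/html"})
    print(store.lookup("/index.html").server_headers)
```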

Step 4: Freshness Check

HTTP lets caches keep copies of server documents for a period of time. During this time, the document is considered “fresh” and the cache can serve the document without contacting the server. But once the cached copy has sat around for too long, past the document’s freshness limit, the object is considered “stale,” and the cache needs to revalidate with the server to check for any document changes before serving it. Complicating things further are any request headers that a client sends to a cache, which themselves can force the cache to either revalidate or avoid validation altogether.

Figure 7-11. Processing a fresh cache hit

[Figure 7-11 steps: (1) Receive HTTP request message; (2) Parse message; (3) In cache? YES; (4) Is fresh? YES; (5) Create response headers (new headers plus the cached body); (6) Send response]

Request message from the client:

    GET /www.joes-hardware.com/index.html HTTP/1.1
    User-agent: Superbrowser 2.0
    Host: www.joes-hardware.com
    Accept: *.*

Response message sent by the cache:

    HTTP/1.1 200 OK
    Content-length: 2140
    Content-type: text/html
    Cache-control: max-age=86400
    Age: 21562
    Via: ...

    <HEAD><TITLE>Joe’s Hardware Home Page</TITLE></HEAD>
    <BODY><H1>Welcome to Joe’s Hardware</H1>...

* Sophisticated caches also keep a copy of the original client request headers that yielded the server response, for use in HTTP/1.1 content negotiation (see Chapter 17).

HTTP has a set of very complicated rules for freshness checking, made worse by the

large number of configuration options cache products support and by the need to

interoperate with non-HTTP freshness standards. We’ll devote most of the rest of

this chapter to explaining freshness calculations.

Step 5: Response Creation

Because we want the cached response to look like it came from the origin server, the

cache uses the cached server response headers as the starting point for the response

headers. These base headers are then modified and augmented by the cache.

The cache is responsible for adapting the headers to match the client. For example,

the server may return an HTTP/1.0 response (or even an HTTP/0.9 response), while

the client expects an HTTP/1.1 response, in which case the cache must translate the

headers accordingly. Caches also insert cache freshness information (Cache-Control,

Age, and Expires headers) and often include a Via header to note that a proxy cache

served the request.

Note that the cache should not adjust the Date header. The Date header represents

the date of the object when it was originally generated at the origin server.
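The header handling described in this step can be sketched as a small function that starts from the cached server headers, leaves Date alone, and adds Age and Via. The function name and the Via value are hypothetical, and a real cache would also handle version translation and other details not shown here.

```python
from typing import Dict


def build_response_headers(cached_server_headers: Dict[str, str],
                           age_seconds: int,
                           via: str) -> Dict[str, str]:
    """Create response headers for a cache hit from the stored server headers."""
    headers = dict(cached_server_headers)   # start from the server's headers; Date is untouched
    headers["Age"] = str(age_seconds)       # how long the copy has been held
    existing_via = headers.get("Via")
    headers["Via"] = f"{existing_via}, {via}" if existing_via else via
    return headers


if __name__ == "__main__":
    stored = {
        "Date": "Sat, 29 Jun 2002 14:30:00 GMT",
        "Content-type": "text/html",
        "Cache-control": "max-age=86400",
    }
    # "1.1 cache.example.com" is a made-up proxy identifier.
    for name, value in build_response_headers(stored, 21562, "1.1 cache.example.com").items():
        print(f"{name}: {value}")
```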

Step 6: Sending

Once the response headers are ready, the cache sends the response back to the cli-

ent. Like all proxy servers, a proxy cache needs to manage the connection with the

client. High-performance caches work hard to send the data efficiently, often avoid-

ing copying the document content between the local storage and the network I/O

buffers.

Step 7: Logging

Most caches keep log files and statistics about cache usage. After each cache transac-

tion is complete, the cache updates statistics counting the number of cache hits and

misses (and other relevant metrics) and inserts an entry into a log file showing the

request type, URL, and what happened.

The most popular cache log formats are the Squid log format and the Netscape

extended common log format, but many cache products allow you to create custom

log files. We discuss log file formats in detail in Chapter 21.


Cache Processing Flowchart

Figure 7-12 shows, in simplified form, how a cache processes a request to GET a

URL.*
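The decision flow for a GET request can also be written as a short function. This is a sketch under simplifying assumptions: the cache is a dictionary, and the `is_fresh`, `revalidate`, and `fetch` callables are supplied by the caller (all names are hypothetical). Here `revalidate` returns the entry with updated freshness on success, or `None` if the copy is out of date.

```python
from typing import Any, Callable, Dict, Optional


def handle_get(url: str,
               cache: Dict[str, Any],
               is_fresh: Callable[[Any], bool],
               revalidate: Callable[[str, Any], Optional[Any]],
               fetch: Callable[[str], Any]) -> Any:
    """Serve a GET from cache when possible, otherwise from the server."""
    entry = cache.get(url)
    if entry is not None:                       # Cached?
        if is_fresh(entry):                     # Fresh enough?
            return entry                        # Serve to client
        refreshed = revalidate(url, entry)      # Revalidate with server
        if refreshed is not None:               # Revalidated?
            cache[url] = refreshed              # Update freshness of cached document
            return refreshed
    entry = fetch(url)                          # Fetch from server
    cache[url] = entry                          # Store into cache
    return entry                                # Serve to client


if __name__ == "__main__":
    cache: Dict[str, Any] = {}
    doc = handle_get("/index.html", cache,
                     is_fresh=lambda e: e["fresh"],
                     revalidate=lambda u, e: {**e, "fresh": True},
                     fetch=lambda u: {"body": "hello", "fresh": True})
    print(doc["body"])
```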

Keeping Copies Fresh

Cached copies might not all be consistent with the documents on the server. After

all, documents do change over time. Reports might change monthly. Online newspa-

pers change daily. Financial data may change every few seconds. Caches would be

useless if they always served old data. Cached data needs to maintain some consis-

tency with the server data.

HTTP includes simple mechanisms to keep cached data sufficiently consistent with

servers, without requiring servers to remember which caches have copies of their

documents. HTTP calls these simple mechanisms document expiration and server

revalidation.

Document Expiration

HTTP lets an origin server attach an “expiration date” to each document, using spe-

cial HTTP Cache-Control and Expires headers (Figure 7-13). Like an expiration date

on a quart of milk, these headers dictate how long content should be viewed as fresh.

Figure 7-12. Cache GET request flowchart

* The revalidation and fetching of a resource as outlined in Figure 7-12 can be done in one step with a condi-

tional request (see “Revalidation with Conditional Methods”).

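[Figure 7-12 flowchart: Request arrives → Cached? If not, fetch from server, store into cache, and serve to client. If cached → Fresh enough? If yes, serve to client. If not, revalidate with server → Revalidated? If yes, update freshness of cached document and serve to client; if not, fetch from server, store into cache, and serve to client.]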

Until a cache document expires, the cache can serve the copy as often as it wants,

without ever contacting the server—unless, of course, a client request includes

headers that prevent serving a cached or unvalidated resource. But, once the cached

document expires, the cache must check with the server to ask if the document has

changed and, if so, get a fresh copy (with a new expiration date).

Expiration Dates and Ages

Servers specify expiration dates using the HTTP/1.0+ Expires or the HTTP/1.1

Cache-Control: max-age response headers, which accompany a response body. The

Expires and Cache-Control: max-age headers do basically the same thing, but the

newer Cache-Control header is preferred, because it uses a relative time instead of an

absolute date. Absolute dates depend on computer clocks being set correctly.

Table 7-2 lists the expiration response headers.
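A simplified freshness test based on these two headers might look like the following sketch. It assumes the cache already knows the document's age in seconds, prefers Cache-Control: max-age over Expires when both are present, and treats a response with neither header as not fresh. That last choice is an assumption made here for brevity, and the function name is ours.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Dict


def is_fresh(headers: Dict[str, str], age_seconds: float, now: datetime) -> bool:
    """Decide freshness from Cache-Control: max-age, falling back to Expires."""
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if match:
        return age_seconds < int(match.group(1))       # relative limit, no clocks compared
    expires = headers.get("Expires")
    if expires:
        return now < parsedate_to_datetime(expires)    # absolute date, depends on clocks
    return False                                       # no expiration information (assumption)


if __name__ == "__main__":
    now = datetime(2002, 7, 3, 19, 18, 23, tzinfo=timezone.utc)
    print(is_fresh({"Cache-Control": "max-age=484200"}, age_seconds=21562, now=now))
    print(is_fresh({"Expires": "Fri, 05 Jul 2002 05:00:00 GMT"}, age_seconds=0, now=now))
```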

Let’s say today is June 29, 2002 at 9:30 am Eastern Standard Time (EST), and Joe’s

Hardware store is getting ready for a Fourth of July sale (only five days away). Joe

wants to put a special web page on his web server and set it to expire at midnight

EST on the night of July 5, 2002. If Joe’s server uses the older-style Expires headers,

the server response message (Figure 7-13a) might include this header:*

Expires: Fri, 05 Jul 2002 05:00:00 GMT

Table 7-2. Expiration response headers

Header                   Description
Cache-Control: max-age   The max-age value defines the maximum age of the document: the
                         maximum legal elapsed time (in seconds) from when a document is
                         first generated to when it can no longer be considered fresh
                         enough to serve.
                             Cache-Control: max-age=484200
Expires                  Specifies an absolute expiration date. If the expiration date is
                         in the past, the document is no longer fresh.
                             Expires: Fri, 05 Jul 2002 05:00:00 GMT

* Note that all HTTP dates and times are expressed in Greenwich Mean Time (GMT). GMT is the time at the prime meridian (0˚ longitude) that passes through Greenwich, UK. GMT is five hours ahead of U.S. Eastern Standard Time, so midnight EST is 05:00 GMT.

Figure 7-13. Expires and Cache-Control headers

(a) Expires header

    HTTP/1.0 200 OK
    Date: Sat, 29 Jun 2002 14:30:00 GMT
    Content-type: text/plain
    Content-length: 67
    Expires: Fri, 05 Jul 2002 05:00:00 GMT

    Independence Day sale at Joe's Hardware
    Come shop with us today!

(b) Cache-Control: max-age header

    HTTP/1.0 200 OK
    Date: Sat, 29 Jun 2002 14:30:00 GMT
    Content-type: text/plain
    Content-length: 67
    Cache-Control: max-age=484200

    Independence Day sale at Joe's Hardware
    Come shop with us today!

If Joe’s server uses the newer Cache-Control: max-age headers, the server response

message (Figure 7-13b) might contain this header:

Cache-Control: max-age=484200

In case that wasn’t immediately obvious, 484,200 is the number of seconds between

the current date, June 29, 2002 at 9:30 am EST, and the sale end date, July 5, 2002 at

midnight. There are 134.5 hours (about 5 days) until the sale ends. With 3,600 sec-

onds in each hour, that leaves 484,200 seconds until the sale ends.
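The same arithmetic can be checked with a few lines of code. The sketch expresses both moments in GMT, as HTTP does: 9:30 am EST on June 29 is 14:30 GMT, and midnight EST on July 5 is 05:00 GMT. The function name is illustrative.

```python
from datetime import datetime, timezone


def max_age_seconds(generated: datetime, expires: datetime) -> int:
    """Seconds between when a document is generated and when it expires."""
    return int((expires - generated).total_seconds())


if __name__ == "__main__":
    generated = datetime(2002, 6, 29, 14, 30, tzinfo=timezone.utc)  # 9:30 am EST
    expires = datetime(2002, 7, 5, 5, 0, tzinfo=timezone.utc)       # midnight EST
    print(f"Cache-Control: max-age={max_age_seconds(generated, expires)}")
```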

Server Revalidation

Just because a cached document has expired doesn’t mean it is actually different

from what’s living on the origin server; it just means that it’s time to check. This is

called “server revalidation,” meaning the cache needs to ask the origin server

whether the document has changed:

• If revalidation shows the content has changed, the cache gets a new copy of the

document, stores it in place of the old data, and sends the document to the client.

• If revalidation shows the content has not changed, the cache only gets new head-

ers, including a new expiration date, and updates the headers in the cache.

This is a nice system. The cache doesn’t have to verify a document’s freshness for

every request—it has to revalidate with the server only once the document has

expired. This saves server traffic and provides better user response time, without

serving stale content.

The HTTP protocol requires a correctly behaving cache to return one of the following:

• A cached copy that is “fresh enough”

• A cached copy that has been revalidated with the server to ensure it’s still fresh

• An error message, if the origin server to revalidate with is down*

• A cached copy, with an attached warning that it might be incorrect

Revalidation with Conditional Methods

HTTP’s conditional methods make revalidation efficient. HTTP allows a cache to

send a “conditional GET” to the origin server, asking the server to send back an

object body only if the document is different from the copy currently in the cache. In

this manner, the freshness check and the object fetch are combined into a single con-

ditional GET. Conditional GETs are initiated by adding special conditional headers to

GET request messages. The web server returns the object only if the condition is true.

* If the origin server is not accessible, but the cache needs to revalidate, the cache must return an error or a

warning describing the communication failure. Otherwise, pages from a removed server may live in network

caches for an arbitrary time into the future.


HTTP defines five conditional request headers. The two that are most useful for cache revalidation are If-Modified-Since and If-None-Match.* All conditional headers begin with the prefix “If-”. Table 7-3 lists the conditional request headers used in cache revalidation.

If-Modified-Since: Date Revalidation

The most common cache revalidation header is If-Modified-Since. If-Modified-Since

revalidation requests often are called “IMS” requests. IMS requests instruct a server

to perform the request only if the resource has changed since a certain date:

• If the document was modified since the specified date, the If-Modified-Since

condition is true, and the GET succeeds normally. The new document is

returned to the cache, along with new headers containing, among other informa-

tion, a new expiration date.

• If the document was not modified since the specified date, the condition is false,

and a small 304 Not Modified response message is returned to the client, without a document body, for efficiency.† Headers are returned in the response;

however, only the headers that need updating from the original need to be

returned. For example, the Content-Type header does not usually need to be

sent, since it usually has not changed. A new expiration date typically is sent.

The If-Modified-Since header works in conjunction with the Last-Modified server

response header. The origin server attaches the last modification date to served docu-

ments. When a cache wants to revalidate a cached document, it includes an If-Modi-

fied-Since header with the date the cached copy was last modified:

If-Modified-Since: <cached last-modified date>

* Other conditional headers include If-Unmodified-Since (useful for partial document transfers, when you

need to ensure the document is unchanged before you fetch another piece of it), If-Range (to support caching

of incomplete documents), and If-Match (useful for concurrency control when dealing with web servers).

Table 7-3. Two conditional headers used in cache revalidation

Header                      Description
If-Modified-Since: <date>   Perform the requested method if the document has been modified
                            since the specified date. This is used in conjunction with the
                            Last-Modified server response header, to fetch content only if
                            the content has been modified from the cached version.
If-None-Match: <tags>       Instead of matching on last-modified date, the server may
                            provide special tags (see “ETag” in Appendix C) on the document
                            that act like serial numbers. The If-None-Match header performs
                            the requested method if the cached tags differ from the tags in
                            the server’s document.

† If an old server that doesn’t recognize the If-Modified-Since header gets the conditional request, it interprets

it as a normal GET. In this case, the system will still work, but it will be less efficient due to unnecessary

transmittal of unchanged document data.


If the content has changed in the meantime, the last modification date will be differ-

ent, and the origin server will send back the new document. Otherwise, the server

will note that the cache’s last-modified date matches the server document’s current

last-modified date, and it will return a 304 Not Modified response.

For example, as shown in Figure 7-14, if your cache revalidates Joe’s Hardware’s

Fourth of July sale announcement on July 3, you will receive back a Not Modified

response (Figure 7-14a). But if your cache revalidates the document after the sale

ends at midnight on July 5, the cache will receive a new document, because the

server content has changed (Figure 7-14b).

Note that some web servers don’t implement If-Modified-Since as a true date comparison. Instead, they do a string match between the IMS date and the last-modified date. As such, the semantics behave as “if not last modified on this exact date” instead of “if modified since this date.” This alternative semantic works fine for cache expiration, when you are using the last-modified date as a kind of serial number, but it prevents clients from using the If-Modified-Since header for true time-based purposes.

Figure 7-14. If-Modified-Since revalidations return 304 if unchanged or 200 with new body if changed

(a) If-Modified-Since successful revalidation

Conditional request (client to server):

    GET /announce.html HTTP/1.0
    If-Modified-Since: Sat, 29 Jun 2002 14:30:00 GMT

Response (server to client):

    HTTP/1.0 304 Not Modified
    Date: Wed, 03 Jul 2002 19:18:23 GMT
    Expires: Fri, 05 Jul 2002 14:30:00 GMT

(b) If-Modified-Since failed revalidation

Conditional request (client to server):

    GET /announce.html HTTP/1.0
    If-Modified-Since: Sat, 29 Jun 2002 14:30:00 GMT

Response (server to client):

    HTTP/1.0 200 OK
    Date: Fri, 05 Jul 2002 17:54:40 GMT
    Content-type: text/plain
    Content-length: 124
    Expires: Mon, 09 Sep 2002 05:00:00 GMT

    All exterior house paint on sale through
    Labor Day. Just another reason for you
    to shop this summer at Joe's Hardware!
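The difference between the two server behaviors can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any real server: the function names revalidate_by_date and revalidate_by_string are hypothetical, and the last-modified time is assumed to be available as a timezone-aware datetime.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

NOT_MODIFIED, OK = 304, 200

def revalidate_by_date(ims_header, last_modified):
    """True date comparison: 304 unless the document changed after the IMS date."""
    try:
        ims_date = parsedate_to_datetime(ims_header)
    except (TypeError, ValueError):
        return OK  # unparseable date: handle it as a normal GET
    if ims_date.tzinfo is None:
        ims_date = ims_date.replace(tzinfo=timezone.utc)
    return NOT_MODIFIED if last_modified <= ims_date else OK

def revalidate_by_string(ims_header, last_modified_header):
    """String match: 304 only if the IMS date equals the last-modified date exactly."""
    return NOT_MODIFIED if ims_header == last_modified_header else OK

last_modified_header = "Sat, 29 Jun 2002 14:30:00 GMT"
last_modified = parsedate_to_datetime(last_modified_header)

# A cache that echoes the exact Last-Modified value gets 304 either way.
print(revalidate_by_date(last_modified_header, last_modified))           # 304
print(revalidate_by_string(last_modified_header, last_modified_header))  # 304

# A true time-based question ("changed since July 1?") is answered
# correctly only by the date comparison.
later = "Mon, 01 Jul 2002 00:00:00 GMT"
print(revalidate_by_date(later, last_modified))           # 304
print(revalidate_by_string(later, last_modified_header))  # 200
```

The last two calls show why string matching is adequate for caches, which always send back the exact date they were given, but misleading for any client that picks its own date.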

If-None-Match: Entity Tag Revalidation

There are some situations when the last-modified date revalidation isn’t adequate:

• Some documents may be rewritten periodically (e.g., from a background pro-

cess) but actually often contain the same data. The modification dates will

change, even though the content hasn’t.

• Some documents may have changed, but only in ways that aren’t important

enough to warrant caches worldwide to reload the data (e.g., spelling or com-

ment changes).

• Some servers cannot accurately determine the last modification dates of their

pages.

• For servers that serve documents that change in sub-second intervals (e.g. real-

time monitors), the one-second granularity of modification dates might not be

adequate.

To get around these problems, HTTP allows you to compare document “version

identifiers” called entity tags (ETags). Entity tags are arbitrary labels (quoted strings)

attached to the document. They might contain a serial number or version name for

the document, or a checksum or other fingerprint of the document content.

When the publisher makes a document change, he can change the document’s entity

tag to represent this new version. Caches can then use the If-None-Match condi-

tional header to GET a new copy of the document if the entity tags have changed.

In Figure 7-15, the cache has a document with entity tag “v2.6”. It revalidates with

the origin server asking for a new object only if the tag “v2.6” no longer matches. In

Figure 7-15, the tag still matches, so a 304 Not Modified response is returned.

Figure 7-15. If-None-Match revalidates because entity tag still matches

Both the cache’s copy and the server’s copy of the document carry ETag: "v2.6".

Conditional request (cache to server):

    GET /announce.html HTTP/1.0
    If-None-Match: "v2.6"

Response (server to cache):

    HTTP/1.0 304 Not Modified
    Date: Wed, 03 Jul 2002 19:18:23 GMT
    ETag: "v2.6"
    Expires: Fri, 05 Jul 2002 05:00:00 GMT

If the entity tag on the server had changed (perhaps to “v3.0”), the server would return the new content in a 200 OK response, along with the new ETag.

Several entity tags can be included in an If-None-Match header, to tell the server that

the cache already has copies of objects with those entity tags:

If-None-Match: "v2.6"

If-None-Match: "v2.4","v2.5","v2.6"

If-None-Match: "foobar","A34FAC0095","Profiles in Courage"
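As a rough sketch of how a server might evaluate such a header, the following Python pulls the quoted tags out of an If-None-Match value and checks whether the document's current tag is among them. The function names are hypothetical, and the regular expression is a simplification that looks only at the quoted strings.

```python
import re

# One entity tag value: the text between a pair of double quotes.
QUOTED_TAG = re.compile(r'"([^"]*)"')

def parse_etag_list(header_value):
    """Return the tag values listed in an If-None-Match header."""
    return QUOTED_TAG.findall(header_value)

def if_none_match_status(header_value, current_tag):
    """304 if the cache already holds the current tag, otherwise 200."""
    return 304 if current_tag in parse_etag_list(header_value) else 200

print(parse_etag_list('"foobar","A34FAC0095","Profiles in Courage"'))
# ['foobar', 'A34FAC0095', 'Profiles in Courage']
print(if_none_match_status('"v2.4","v2.5","v2.6"', "v2.6"))  # 304
print(if_none_match_status('"v2.4","v2.5","v2.6"', "v3.0"))  # 200
```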

Weak and Strong Validators

Caches use entity tags to determine whether the cached version is up-to-date with

respect to the server (much like they use last-modified dates). In this way, entity tags

and last-modified dates both are cache validators.

Servers may sometimes want to allow cosmetic or insignificant changes to docu-

ments without invalidating all cached copies. HTTP/1.1 supports “weak validators,”

which allow the server to claim “good enough” equivalence even if the contents have

changed slightly.

Strong validators change any time the content changes. Weak validators allow some

content change but generally change when the significant meaning of the content

changes. Some operations cannot be performed using weak validators (such as condi-

tional partial-range fetches), so servers identify validators that are weak with a “W/”

prefix:

ETag: W/"v2.6"

If-None-Match: W/"v2.6"

A strong entity tag must change whenever the associated entity value changes in any

way. A weak entity tag should change whenever the associated entity changes in a

semantically significant way.
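The distinction shows up when two entity tags are compared. The sketch below, with hypothetical function names, follows the two comparison rules in the HTTP/1.1 specification: a strong comparison requires both tags to be strong and identical, while a weak comparison ignores the W/ prefix.

```python
def split_etag(etag):
    """Split an entity tag into (is_weak, quoted_value)."""
    if etag.startswith("W/"):
        return True, etag[2:]
    return False, etag

def strong_match(a, b):
    """Both tags must be strong, and their quoted values identical."""
    weak_a, value_a = split_etag(a)
    weak_b, value_b = split_etag(b)
    return not weak_a and not weak_b and value_a == value_b

def weak_match(a, b):
    """Quoted values must be identical; the W/ prefix is ignored."""
    return split_etag(a)[1] == split_etag(b)[1]

print(strong_match('"v2.6"', '"v2.6"'))    # True
print(strong_match('W/"v2.6"', '"v2.6"'))  # False: weak tags never match strongly
print(weak_match('W/"v2.6"', '"v2.6"'))    # True
```

A conditional partial-range fetch would need strong_match to succeed, which is why a weak tag cannot be used for it.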

Note that an origin server must avoid reusing a specific strong entity tag value for two

different entities, or reusing a specific weak entity tag value for two semantically differ-

ent entities. Cache entries might persist for arbitrarily long periods, regardless of expi-

ration times, so it might be inappropriate to expect that a cache will never again

attempt to validate an entry using a validator that it obtained at some point in the past.

When to Use Entity Tags and Last-Modified Dates

HTTP/1.1 clients must use an entity tag validator if a server sends back an entity tag. If

the server sends back only a Last-Modified value, the client can use If-Modified-Since

validation. If both an entity tag and a last-modified date are available, the client should

use both revalidation schemes, allowing both HTTP/1.0 and HTTP/1.1 caches to

respond appropriately.

HTTP/1.1 origin servers should send an entity tag validator unless it is not feasible to generate one. The tag may be weak rather than strong if weak validators offer benefits. Servers should preferably send a last-modified value as well.

If an HTTP/1.1 cache or server receives a request with both If-Modified-Since and

entity tag conditional headers, it must not return a 304 Not Modified response

unless doing so is consistent with all of the conditional header fields in the request.

Controlling Cachability

HTTP defines several ways for a server to specify how long a document can be

cached before it expires. In decreasing order of priority, the server can:

• Attach a Cache-Control: no-store header to the response.

• Attach a Cache-Control: no-cache header to the response.

• Attach a Cache-Control: must-revalidate header to the response.

• Attach a Cache-Control: max-age header to the response.

• Attach an Expires date header to the response.

• Attach no expiration information, letting the cache determine its own heuristic

expiration date.

This section describes the cache controlling headers. The next section, “Setting Cache

Controls,” describes how to assign different cache information to different content.
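The priority order above can be expressed as a small lookup. The following Python is a sketch under simplifying assumptions: the function names are hypothetical, the headers arrive as a plain dictionary with exactly these key spellings, and the Cache-Control value is split on commas without handling quoted arguments that contain commas.

```python
def parse_cache_control(value):
    """Parse a Cache-Control value into a {directive: argument-or-None} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives

def caching_rule(headers):
    """Name the highest-priority caching rule present in the response headers."""
    directives = parse_cache_control(headers.get("Cache-Control", ""))
    for name in ("no-store", "no-cache", "must-revalidate", "max-age"):
        if name in directives:
            return name
    if "Expires" in headers:
        return "expires"
    return "heuristic"  # no expiration information: the cache decides

print(caching_rule({"Cache-Control": "max-age=3600, no-store"}))  # no-store
print(caching_rule({"Expires": "Fri, 05 Jul 2002 05:00:00 GMT"}))  # expires
print(caching_rule({}))                                           # heuristic
```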

No-Cache and No-Store Response Headers

HTTP/1.1 offers several ways to limit the caching of objects, or the serving of cached

objects, to maintain freshness. The no-store and no-cache headers prevent caches

from serving unverified cached objects:

Cache-Control: no-store

Cache-Control: no-cache

Pragma: no-cache

A response that is marked "no-store" forbids a cache from making a copy of the

response. A cache would typically forward a no-store response to the client, and

then delete the object, as would a non-caching proxy server.

A response that is marked "no-cache" can actually be stored in the local cache stor-

age. It just cannot be served from the cache to the client without first revalidating

the freshness with the origin server. A better name for this header might be "do-not-

serve-from-cache-without-revalidation."

The Pragma: no-cache header is included in HTTP/1.1 for backward compatibility with HTTP/1.0+. HTTP/1.1 applications should use Cache-Control: no-cache, except when dealing with HTTP/1.0 applications, which understand only Pragma: no-cache.*

Max-Age Response Headers

The Cache-Control: max-age header indicates the number of seconds, counted from when the document came from the server, for which the document can be considered fresh. There is also an s-maxage header (note the absence of a hyphen in “maxage”) that acts like max-age but applies only to shared (public) caches:

Cache-Control: max-age=3600

Cache-Control: s-maxage=3600

Servers can request that caches either not cache a document or refresh on every

access by setting the maximum aging to zero:

Cache-Control: max-age=0

Cache-Control: s-maxage=0
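A cache might pick between the two directives roughly as follows. This is a sketch with a hypothetical function name; it assumes that a shared cache prefers s-maxage when both directives are present, and it uses a simple regular expression rather than a full header parser.

```python
import re

def max_age_seconds(cache_control, shared_cache):
    """Freshness lifetime in seconds from max-age or s-maxage, or None if absent."""
    found = dict(re.findall(r"(s-maxage|max-age)\s*=\s*(\d+)", cache_control.lower()))
    if shared_cache and "s-maxage" in found:
        return int(found["s-maxage"])  # applies to shared caches only
    if "max-age" in found:
        return int(found["max-age"])
    return None

header = "max-age=3600, s-maxage=60"
print(max_age_seconds(header, shared_cache=True))    # 60
print(max_age_seconds(header, shared_cache=False))   # 3600
print(max_age_seconds("max-age=0", shared_cache=False))  # 0: revalidate on every access
```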

Expires Response Headers

The deprecated Expires header specifies an actual expiration date instead of a time in sec-

onds. The HTTP designers later decided that, because many servers have unsynchro-

nized or incorrect clocks, it would be better to represent expiration in elapsed seconds,

rather than absolute time. An analogous freshness lifetime can be calculated by comput-

ing the number of seconds difference between the expires value and the date value:

Expires: Fri, 05 Jul 2002 05:00:00 GMT

Some servers also send back an Expires: 0 response header to try to make docu-

ments always expire, but this syntax is illegal and can cause problems with some

software. You should try to support this construct as input, but shouldn’t generate it.
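The Expires-based lifetime, including tolerant handling of the illegal Expires: 0 form, might look like this in Python. The function name is hypothetical, and treating any unparseable Expires value as "already expired" is an assumption of this sketch.

```python
from email.utils import parsedate_to_datetime

def expires_freshness_lifetime(expires_header, date_header):
    """Seconds between the Date and Expires values; 0 if Expires is unusable."""
    response_date = parsedate_to_datetime(date_header)
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return 0  # e.g. "Expires: 0": accept it as input, treat as expired
    return max(0, int((expires - response_date).total_seconds()))

print(expires_freshness_lifetime("Fri, 05 Jul 2002 05:00:00 GMT",
                                 "Wed, 03 Jul 2002 19:18:23 GMT"))  # 121297
print(expires_freshness_lifetime("0", "Wed, 03 Jul 2002 19:18:23 GMT"))  # 0
```

The first result is 33 hours, 41 minutes, and 37 seconds expressed in seconds.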

Must-Revalidate Response Headers

Caches may be configured to serve stale (expired) objects, in order to improve performance. If an origin server wishes caches to strictly adhere to expiration information, it can attach a Cache-Control: must-revalidate header:

Cache-Control: must-revalidate

The Cache-Control: must-revalidate response header tells caches they cannot serve a

stale copy of this object without first revalidating with the origin server. Caches are

still free to serve fresh copies. If the origin server is unavailable when a cache

attempts a must-revalidate freshness check, the cache must return a 504 Gateway

Timeout error.
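The decision a cache faces can be summarized in a few lines. In this sketch the function name, the allow_stale flag (standing in for a local "serve stale objects" configuration), and the returned labels are all hypothetical.

```python
def serve_decision(is_fresh, must_revalidate, allow_stale, origin_reachable):
    """Return, as a short label, what a cache does with a stored copy."""
    if is_fresh:
        return "serve cached copy"       # fresh copies may always be served
    if allow_stale and not must_revalidate:
        return "serve stale copy"        # local policy permits staleness
    if origin_reachable:
        return "revalidate with origin"
    # A stale copy that must be revalidated, with no way to reach the origin.
    return "504 Gateway Timeout"

print(serve_decision(is_fresh=False, must_revalidate=False,
                     allow_stale=True, origin_reachable=True))   # serve stale copy
print(serve_decision(is_fresh=False, must_revalidate=True,
                     allow_stale=True, origin_reachable=False))  # 504 Gateway Timeout
```

Note that must-revalidate changes nothing for fresh copies; it only removes the option of serving a stale one.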

* Pragma: no-cache is technically valid only for HTTP requests, yet it is widely used as an extension header for both HTTP requests and responses.

Heuristic Expiration

If the response doesn’t contain either a Cache-Control: max-age header or an Expires header, the cache may compute a heuristic maximum age. Any algorithm may be used, but if the resulting maximum age is greater than 24 hours, a Heuristic Expiration Warning (Warning 13; numbered 113 in RFC 2616) header should be added to the response headers. As far as we know, few browsers make this warning information available to users.

One popular heuristic expiration algorithm, the LM-Factor algorithm, can be used if

the document contains a last-modified date. The LM-Factor algorithm uses the last-

modified date as an estimate of how volatile a document is. Here’s the logic:

• If a cached document was last changed in the distant past, it may be a stable

document and less likely to change suddenly, so it is safer to keep it in the cache

longer.

• If the cached document was modified just recently, it probably changes fre-

quently, so we should cache it only a short while before revalidating with the

server.

The actual LM-Factor algorithm computes the time between when the cache talked

to the server and when the server said the document was last modified, takes some

fraction of this intervening time, and uses this fraction as the freshness duration in

the cache. Here is some Perl pseudocode for the LM-factor algorithm:

$time_since_modify = max(0, $server_Date - $server_Last_Modified);

$server_freshness_limit = int($time_since_modify * $lm_factor);

Figure 7-16 depicts the LM-factor freshness period graphically. The cross-hatched

line indicates the freshness period, using an LM-factor of 0.2.

Typically, people place upper bounds on heuristic freshness periods so they can’t

grow excessively large. A week is typical, though more conservative sites use a day.

Finally, if you don’t have a last-modified date either, the cache doesn’t have much

information to go on. Caches typically assign a default freshness period (an hour or a

day is typical) for documents without any freshness clues. More conservative caches

sometimes choose freshness lifetimes of 0 for these heuristic documents, forcing the

cache to validate that the data is still fresh before each time it is served to a client.
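Putting the LM-factor calculation, the upper bound, and the default period together gives something like the following Python sketch. The function name and the particular defaults (a one-week cap and a one-hour default) are assumptions chosen from the typical values mentioned above; times are in seconds since the epoch.

```python
ONE_HOUR = 3600
ONE_WEEK = 7 * 24 * 3600

def heuristic_freshness(date, last_modified=None, lm_factor=0.2,
                        upper_bound=ONE_WEEK, default=ONE_HOUR):
    """Heuristic freshness period in seconds."""
    if last_modified is None:
        return default  # no freshness clues at all
    time_since_modify = max(0, date - last_modified)
    # Take a fraction of the document's observed stability, but cap it.
    return min(int(time_since_modify * lm_factor), upper_bound)

# Document fetched 1,000 seconds after it was last modified, LM-factor 0.5
print(heuristic_freshness(date=1_000_000, last_modified=999_000, lm_factor=0.5))  # 500
# No Last-Modified header: fall back to the default period
print(heuristic_freshness(date=1_000_000))  # 3600
```

A conservative cache could pass default=0 to force revalidation of documents that carry no freshness clues.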

Figure 7-16. Computing a freshness period using the LM-Factor algorithm

(Timeline, LM-factor = 0.2. Marked points: last modified, when the cache talked to the server, and the new expiration time. The cached copy is fresh for 20% of the time between the fetch and the last modification.)

One last note about heuristic freshness calculations—they are more common than

you might think. Many origin servers still don’t generate Expires and max-age head-

ers. Pick your cache’s expiration defaults carefully!

Client Freshness Constraints

Web browsers have a Refresh or Reload button to forcibly refresh content, which might be stale in the browser or proxy caches. The Refresh button issues a GET request with additional Cache-Control request headers that force a revalidation or unconditional fetch from the server. The precise Refresh behavior depends on the particular browser, document, and intervening cache configurations.

Clients use Cache-Control request headers to tighten or loosen expiration constraints. Clients can use Cache-Control headers to make the expiration more strict, for applications that need the very freshest documents (such as the manual Refresh button). On the other hand, clients might also want to relax the freshness requirements as a compromise to improve performance, reliability, or cost. Table 7-4 summarizes the Cache-Control request directives.

Table 7-4. Cache-Control request directives

Directive | Purpose

Cache-Control: max-stale, Cache-Control: max-stale=<s> | The cache is free to serve a stale document. If the <s> parameter is specified, the document must not be stale by more than this amount of time. This relaxes the caching rules.

Cache-Control: min-fresh=<s> | The document must still be fresh for at least <s> seconds in the future. This makes the caching rules more strict.

Cache-Control: max-age=<s> | The cache cannot return a document that has been cached for longer than <s> seconds. This directive makes the caching rules more strict, unless the max-stale directive also is set, in which case the age can exceed its expiration time.

Cache-Control: no-cache, Pragma: no-cache | This client won’t accept a cached resource unless it has been revalidated.

Cache-Control: no-store | The cache should delete every trace of the document from storage as soon as possible, because it might contain sensitive information.

Cache-Control: only-if-cached | The client wants a copy only if it is in the cache.

Cautions

Document expiration isn’t a perfect system. If a publisher accidentally assigns an expiration date too far in the future, any document changes she needs to make won’t necessarily show up in all caches until the document has expired.* For this reason, many publishers don’t use distant expiration dates. Also, many publishers don’t even use expiration dates, making it tough for caches to know how long the document will be fresh.

* Document expiration is a form of “time to live” technique used in many Internet protocols, such as DNS. DNS, like HTTP, has trouble if you publish an expiration date far in the future and then find that you need to make a change. However, HTTP provides mechanisms for a client to override and force a reloading, unlike DNS.

Setting Cache Controls

Different web servers provide different mechanisms for setting HTTP cache-control

and expiration headers. In this section, we’ll talk briefly about how the popular

Apache web server supports cache controls. Refer to your web server documentation

for specific details.

Controlling HTTP Headers with Apache

The Apache web server provides several mechanisms for setting HTTP cache-

controlling headers. Many of these mechanisms are not enabled by default—you

have to enable them (in some cases first obtaining Apache extension modules). Here

is a brief description of some of the Apache features:

mod_headers

The mod_headers module lets you set individual headers. Once this module is

loaded, you can augment the Apache configuration files with directives to set

individual HTTP headers. You also can use these settings in combination with

Apache’s regular expressions and filters to associate headers with individual con-

tent. Here is an example of a configuration that could mark all HTML files in a

directory as uncachable:

<Files *.html>
Header set Cache-control no-cache
</Files>

mod_expires

The mod_expires module provides program logic to automatically generate

Expires headers with the correct expiration dates. This module allows you to set

expiration dates for some time period after a document was last accessed or after

its last-modified date. The module also lets you assign different expiration dates

to different file types and use convenient verbose descriptions, like “access plus 1

month,” to describe cachability. Here are a few examples:

ExpiresDefault A3600

ExpiresDefault M86400

ExpiresDefault "access plus 1 week"

ExpiresByType text/html "modification plus 2 days 6 hours 12 minutes"

mod_cern_meta

The mod_cern_meta module allows you to associate a file of HTTP headers with

particular objects. When you enable this module, you create a set of “metafiles,”

one for each document you want to control, and add the desired headers to each

metafile.

Controlling HTML Caching Through HTTP-EQUIV

HTTP server response headers are used to carry back document expiration and

cache-control information. Web servers interact with configuration files to assign the

correct cache-control headers to served documents.

To make it easier for authors to assign HTTP header information to served HTML documents without interacting with web server configuration files, HTML 2.0 defined the <META HTTP-EQUIV> tag. This optional tag sits at the top of an HTML document and defines HTTP headers that should be associated with the document. Here is an example of a <META HTTP-EQUIV> tag set to mark the HTML document uncachable:

<HTML>
<HEAD>
<TITLE>My Document</TITLE>
<META HTTP-EQUIV="Cache-control" CONTENT="no-cache">
</HEAD>

...

This HTTP-EQUIV tag was originally intended to be used by web servers. Web serv-

ers were supposed to parse HTML for <META HTTP-EQUIV> tags and insert the

prescribed headers into the HTTP response, as documented in HTML RFC 1866:

An HTTP server may use this information to process the document. In particular, it

may include a header field in the responses to requests for this document: the header

name is taken from the HTTP-EQUIV attribute value, and the header value is taken

from the value of the CONTENT attribute.

Unfortunately, few web servers and proxies support this optional feature because of

the extra server load, the values being static, and the fact that it supports only HTML

and not the many other file types.

However, some browsers do parse and adhere to HTTP-EQUIV tags in the HTML

content, treating the embedded headers like real HTTP headers (Figure 7-17). This is

unfortunate, because HTML browsers that do support HTTP-EQUIV may apply dif-

ferent cache-control rules than intervening proxy caches. This causes confusing

cache expiration behavior.

In general, <META HTTP-EQUIV> tags are a poor way of controlling document

cachability. The only sure-fire way to communicate cache-control requests for docu-

ments is through HTTP headers sent by a properly configured server.
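To make the server-side idea concrete, here is a sketch of the parsing step a server would have to perform on every HTML response in order to honor these tags. It uses Python's standard html.parser module; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class HttpEquivCollector(HTMLParser):
    """Collect (header, value) pairs from <META HTTP-EQUIV=... CONTENT=...> tags."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # HTMLParser lowercases tag and attribute names
        name = attributes.get("http-equiv")
        value = attributes.get("content")
        if name and value is not None:
            self.headers.append((name, value))

def http_equiv_headers(html_text):
    """Return the HTTP headers an HTML document asks for via HTTP-EQUIV."""
    collector = HttpEquivCollector()
    collector.feed(html_text)
    return collector.headers

document = """<HTML><HEAD><TITLE>My Document</TITLE>
<META HTTP-EQUIV="Cache-control" CONTENT="no-cache">
</HEAD><BODY>...</BODY></HTML>"""
print(http_equiv_headers(document))  # [('Cache-control', 'no-cache')]
```

Even this small example hints at the cost: the server must scan document content before it can finish writing the response headers, and it can do so only for HTML.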

Detailed Algorithms

The HTTP specification provides a detailed, but slightly obscure and often confus-

ing, algorithm for computing document aging and cache freshness. In this section,

we’ll discuss the HTTP freshness computation algorithms in detail (the “Fresh

enough?” diamond in Figure 7-12) and explain the motivation behind them.

This section will be most useful to readers working with cache internals. To help

illustrate the wording in the HTTP specification, we will make use of Perl

pseudocode. If you aren’t interested in the gory details of cache expiration formulas,

feel free to skip this section.

Age and Freshness Lifetime

To tell whether a cached document is fresh enough to serve, a cache needs to com-

pute only two values: the cached copy’s age and the cached copy’s freshness lifetime.

If the age of a cached copy is less than the freshness lifetime, the copy is fresh enough

to serve. In Perl:

$is_fresh_enough = ($age < $freshness_lifetime);

Figure 7-17. HTTP-EQUIV tags cause problems, because most software ignores them

Some HTTP servers can be configured to parse HTML files for special <META HTTP-EQUIV> tags. These metadata tags (in the HTML document) describe HTTP headers that the author would like to be received by the client. Unfortunately, most web servers don’t process HTTP-EQUIV tags, and even fewer proxies do. This causes client caches to receive cache-control commands that proxy caches do not always see.

HTML file (on the server):

    <HTML>
    <HEAD>
    <META HTTP-EQUIV="Cache-control"
    CONTENT="max-age=3600">
    <META HTTP-EQUIV="Content-type"
    CONTENT="text/html; charset=utf-8">
    </HEAD>
    <BODY>
    Welcome to XYZ Industries, a
    <B>leader</B> in mechanical drilling
    machines for 30 years. Our new line of
    100% automated manufacturing tools sets
    the standard for CAM, at a surprisingly
    low price.
    </BODY>

HTTP request (client to server):

    GET /xyz.html HTTP/1.0

HTTP response (server to client):

    HTTP/1.0 200 OK
    Date: Sun, 07 Apr 2002 19:21:13 GMT
    Content-length: 124
    Cache-control: max-age=3600
    Content-type: text/html; charset=utf-8

    (body: the HTML file shown above)

Some servers will insert HTTP-EQUIV specified headers into the response header for proxies to see. Other servers will not.

The age of the document is the total time the document has “aged” since it was sent

from the server (or was last revalidated by the server).* Because a cache might not

know if a document response is coming from an upstream cache or a server, it can’t

assume that the document is brand new. It must determine the document’s age,

either from an explicit Age header (preferred) or by processing the server-generated

Date header.

The freshness lifetime of a document tells how old a cached copy can get before it is

no longer fresh enough to serve to clients. The freshness lifetime takes into account the

expiration date of the document and any freshness overrides the client might request.

Some clients may be willing to accept slightly stale documents (using the Cache-Con-

trol: max-stale header). Other clients may not accept documents that will become

stale in the near future (using the Cache-Control: min-fresh header). The cache com-

bines the server expiration information with the client freshness requirements to

determine the maximum freshness lifetime.
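One way to combine the server's limit with the client's request directives is sketched below. The function name and the exact order in which the directives are applied are assumptions made for illustration; all values are in seconds, and None means the client did not send that directive.

```python
def is_fresh_enough(age, server_freshness_limit,
                    max_stale=None, min_fresh=None, max_age=None):
    """Decide whether a cached copy satisfies both server and client constraints."""
    age_limit = server_freshness_limit
    if max_stale is not None:
        age_limit += max_stale  # client tolerates this much staleness
    if min_fresh is not None:
        # copy must remain fresh for at least min_fresh more seconds
        age_limit = min(age_limit, server_freshness_limit - min_fresh)
    if max_age is not None:
        age_limit = min(age_limit, max_age)  # client's own ceiling on age
    return age < age_limit

# A copy 4,000 seconds old with a 3,600-second server limit is stale...
print(is_fresh_enough(age=4000, server_freshness_limit=3600))                 # False
# ...unless the client is willing to accept 600 seconds of staleness.
print(is_fresh_enough(age=4000, server_freshness_limit=3600, max_stale=600))  # True
# A copy that expires in 100 seconds fails a min-fresh=300 request.
print(is_fresh_enough(age=3500, server_freshness_limit=3600, min_fresh=300))  # False
```

A bare max-stale directive with no value (any staleness is acceptable) could be modeled by passing float("inf").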

Age Computation

The age of the response is the total time since the response was issued from the

server (or revalidated from the server). The age includes the time the response has

floated around in the routers and gateways of the Internet, the time stored in inter-

mediate caches, and the time the response has been resident in your cache.

Example 7-1 provides pseudocode for the age calculation.

The particulars of HTTP age calculation are a bit tricky, but the basic concept is sim-

ple. Caches can tell how old the response was when it arrived at the cache by exam-

ining the Date or Age headers. Caches also can note how long the document has

been sitting in the local cache. Summed together, these values are the entire age of

the response. HTTP throws in some magic to attempt to compensate for clock skew

and network delays, but the basic computation is simple enough:

$age = $age_when_document_arrived_at_our_cache +

$how_long_copy_has_been_in_our_cache;

* Remember that the server always has the most up-to-date version of any document.

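Example 7-1. HTTP/1.1 age-calculation algorithm calculates the overall age of a cached document

use List::Util qw(max);   # max() is not a Perl built-in

# Age implied by the Date header, clamped at zero to absorb clock skew
$apparent_age = max(0, $time_got_response - $Date_header_value);
# Take the larger (more conservative) of the Date-based and Age-header estimates
$corrected_apparent_age = max($apparent_age, $Age_header_value);
# Add the full round-trip time as a conservative network-delay estimate
$response_delay_estimate = ($time_got_response - $time_issued_request);
$age_when_document_arrived_at_our_cache =
    $corrected_apparent_age + $response_delay_estimate;
# Time the copy has been resident in our cache
$how_long_copy_has_been_in_our_cache = $current_time - $time_got_response;
$age = $age_when_document_arrived_at_our_cache +
    $how_long_copy_has_been_in_our_cache;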

A cache can pretty easily determine how long a cached copy has been cached locally

(a matter of simple bookkeeping), but it is harder to determine the age of a response

when it arrives at the cache, because not all servers have synchronized clocks and

because we don’t know where the response has been. The complete age-calculation

algorithm tries to remedy this.

Apparent age is based on the Date header

If all computers shared the same, exactly correct clock, the age of a cached document

would simply be the “apparent age” of the document—the current time minus the

time when the server sent the document. The server send time is simply the value of

the Date header. The simplest initial age calculation would just use the apparent age:

$apparent_age = $time_got_response - $Date_header_value;

$age_when_document_arrived_at_our_cache = $apparent_age;

Unfortunately, not all clocks are well synchronized. The client and server clocks may

differ by many minutes, or even by hours or days when clocks are set improperly.*

Web applications, especially caching proxies, have to be prepared to interact with

servers with wildly differing clock values. The problem is called clock skew—the dif-

ference between two computers’ clock settings. Because of clock skew, the apparent

age sometimes is inaccurate and occasionally is negative.

If the age is ever negative, we just set it to zero. We also could sanity check that the

apparent age isn’t ridiculously large, but large apparent ages might actually be cor-

rect. We might be talking to a parent cache that has cached the document for a long

time (the cache also stores the original Date header):

$apparent_age = max(0, $time_got_response - $Date_header_value);

$age_when_document_arrived_at_our_cache = $apparent_age;

Be aware that the Date header describes the original origin server date. Proxies and

caches must not change this date!

Hop-by-hop age calculations

So, we can eliminate negative ages caused by clock skew, but we can’t do much

about overall loss of accuracy due to clock skew. HTTP/1.1 attempts to work around

the lack of universal synchronized clocks by asking each device to accumulate rela-

tive aging into an Age header, as a document passes through proxies and caches.

This way, no cross-server, end-to-end clock comparisons are needed.

The Age header value increases as the document passes through proxies. HTTP/1.1-aware applications should augment the Age header value by the time the document sat in each application and in network transit. Each intermediate application can easily compute the document’s resident time by using its local clock.

* The HTTP specification recommends that clients, servers, and proxies use a time synchronization protocol such as NTP to enforce a consistent time base.

However, any non-HTTP/1.1 device in the response chain will not recognize the Age

header and will either proxy the header unchanged or remove it. So, until HTTP/1.1

is universally adopted, the Age header will be an underestimate of the relative age.

The relative age values are used in addition to the Date-based age calculation, and

the most conservative of the two age estimates is chosen, because either the cross-

server Date value or the Age-computed value may be an underestimate (the most

conservative is the oldest age). This way, HTTP tolerates errors in Age headers as

well, while erring on the side of fresher content:

$apparent_age = max(0, $time_got_response - $Date_header_value);

$corrected_apparent_age = max($apparent_age, $Age_header_value);

$age_when_document_arrived_at_our_cache = $corrected_apparent_age;

Compensating for network delays

Transactions can be slow. This is the major motivation for caching. But for very slow

networks, or overloaded servers, the relative age calculation may significantly under-

estimate the age of documents if the documents spend a long time stuck in network

or server traffic jams.

The Date header indicates when the document left the origin server,* but it doesn’t

say how long the document spent in transit on the way to the cache. If the docu-

ment came through a long chain of proxies and parent caches, the network delay

might be significant.†

There is no easy way to measure one-way network delay from server to cache, but it

is easier to measure the round-trip delay. A cache knows when it requested the docu-

ment and when it arrived. HTTP/1.1 conservatively corrects for these network delays

by adding the entire round-trip delay. This cache-to-server-to-cache delay is an over-

estimate of the server-to-cache delay, but it is conservative. If it is in error, it will only

make the documents appear older than they really are and cause unnecessary revali-

dations. Here’s how the calculation is made:

$apparent_age = max(0, $time_got_response - $Date_header_value);

$corrected_apparent_age = max($apparent_age, $Age_header_value);

$response_delay_estimate = ($time_got_response - $time_issued_request);

$age_when_document_arrived_at_our_cache =

$corrected_apparent_age + $response_delay_estimate;

* Note that if the document came from a parent cache and not from an origin server, the Date header will

reflect the date of the origin server, not of the parent cache.

† In practice, this shouldn’t be more than a few tens of seconds (or users will abort), but the HTTP designers wanted to try to support accurate expiration even of short-lifetime objects.

Complete Age-Calculation Algorithm

The last section showed how to compute the age of an HTTP-carried document when

it arrives at a cache. Once this response is stored in the cache, it ages further. When a

request arrives for the document in the cache, we need to know how long the docu-

ment has been resident in the cache, so we can compute the current document age:

$age = $age_when_document_arrived_at_our_cache +

$how_long_copy_has_been_in_our_cache;

Ta-da! This gives us the complete HTTP/1.1 age-calculation algorithm we presented

in Example 7-1. This is a matter of simple bookkeeping—we know when the docu-

ment arrived at the cache ($time_got_response) and we know when the current

request arrived (right now), so the resident time is just the difference. This is all

shown graphically in Figure 7-18.
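To see the numbers work, here is the same calculation transcribed into Python with a worked example. The variable names follow the pseudocode; the function name current_age and the sample timestamps are invented for illustration.

```python
def current_age(date_header_value, age_header_value,
                time_issued_request, time_got_response, current_time):
    """HTTP/1.1 age calculation; every argument is a time in seconds."""
    apparent_age = max(0, time_got_response - date_header_value)
    corrected_apparent_age = max(apparent_age, age_header_value)
    response_delay_estimate = time_got_response - time_issued_request
    age_when_document_arrived = corrected_apparent_age + response_delay_estimate
    how_long_copy_has_been_cached = current_time - time_got_response
    return age_when_document_arrived + how_long_copy_has_been_cached

# The cache sent its request at t=1000 and got the response at t=1004.
# The Date header says 990, an upstream cache reported Age: 20,
# and a client asks for the document at t=1604.
print(current_age(date_header_value=990, age_header_value=20,
                  time_issued_request=1000, time_got_response=1004,
                  current_time=1604))  # 624
```

Here the Date header alone suggests an age of 14 seconds, but the Age header's 20 seconds wins as the more conservative estimate. Adding the 4-second round trip and 600 seconds of cache residence gives 624.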

Freshness Lifetime Computation

Recall that we’re trying to figure out whether a cached document is fresh enough to

serve to a client. To answer this question, we must determine the age of the cached

document and compute the freshness lifetime based on server and client constraints.

We just explained how to compute the age; now let’s move on to freshness lifetimes.

The freshness lifetime of a document tells how old a document is allowed to get before it is no longer fresh enough to serve to a particular client. The freshness lifetime depends on server and client constraints. The server may have information about the publication change rate of the document. Very stable, filed reports may stay fresh for years. Periodicals may be up-to-date only for the time remaining until the next scheduled publication: next week, or 6:00 am tomorrow.

Figure 7-18. The age of a cached document includes resident time in the network and cache
(Timeline across server, cache, and client, marking time_issued_request, date_value, time_got_response, time_client_issued_request, and current_time. The age of the cached document spans the server processing time, the response’s network delay, and the cache resident time.)

Clients may have certain other guidelines. They may be willing to accept slightly

stale content, if it is faster, or they might need the most up-to-date content possible.

Caches serve the users. We must adhere to their requests.

Complete Server-Freshness Algorithm

Example 7-2 shows a Perl algorithm to compute server freshness limits. It returns the maximum age that a document can reach and still be considered fresh under the server’s constraints.

Example 7-2. Server freshness constraint calculation

use List::Util qw(max);   ## max( ) is not a Perl built-in

sub server_freshness_limit
{
    my ($heuristic, $server_freshness_limit, $time_since_last_modify);

    $heuristic = 0;

    ## Explicit limits from the server win, in this order: max-age, then Expires.
    if ($Max_Age_value_set)
    {
        $server_freshness_limit = $Max_Age_value;
    }
    elsif ($Expires_value_set)
    {
        $server_freshness_limit = $Expires_value - $Date_value;
    }
    ## Otherwise guess: a fraction of the time since the last modification.
    elsif ($Last_Modified_value_set)
    {
        $time_since_last_modify = max(0, $Date_value - $Last_Modified_value);
        $server_freshness_limit = int($time_since_last_modify * $lm_factor);
        $heuristic = 1;
    }
    else
    {
        $server_freshness_limit = $default_cache_min_lifetime;
        $heuristic = 1;
    }

    ## Heuristic guesses are clamped to the cache's configured bounds.
    if ($heuristic)
    {
        if ($server_freshness_limit > $default_cache_max_lifetime)
            { $server_freshness_limit = $default_cache_max_lifetime; }
        if ($server_freshness_limit < $default_cache_min_lifetime)
            { $server_freshness_limit = $default_cache_min_lifetime; }
    }

    return($server_freshness_limit);

}
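To make the branches of Example 7-2 concrete, here is a hedged Python sketch of the same decision order with a few worked inputs. The parameter names and the default values (an LM-factor of 0.1, a minimum lifetime of 0, and a maximum lifetime of one day) are assumptions chosen for illustration, not values taken from the HTTP specification.

```python
def server_freshness_limit(date, max_age=None, expires=None, last_modified=None,
                           lm_factor=0.1, min_lifetime=0, max_lifetime=86400):
    """Return the server's freshness limit in seconds (all times are in seconds)."""
    heuristic = False
    if max_age is not None:            # Cache-Control: max-age wins
        limit = max_age
    elif expires is not None:          # then Expires, measured from Date
        limit = expires - date
    elif last_modified is not None:    # then a guess based on Last-Modified
        limit = int(max(0, date - last_modified) * lm_factor)
        heuristic = True
    else:                              # nothing to go on at all
        limit = min_lifetime
        heuristic = True
    if heuristic:
        # Only heuristic guesses are clamped to the cache's own bounds.
        limit = max(min_lifetime, min(limit, max_lifetime))
    return limit


print(server_freshness_limit(date=1000, max_age=3600))            # 3600
print(server_freshness_limit(date=1000, expires=1600))            # 600
print(server_freshness_limit(date=1_000_000, last_modified=0))    # 86400 (clamped)
print(server_freshness_limit(date=1000))                          # 0
```

Note that explicit server limits are returned untouched; only the two heuristic branches pass through the clamp.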


Now let’s look at how the client can override the document’s server-specified age

limit. Example 7-3 shows a Perl algorithm to take a server freshness limit and mod-

ify it by the client constraints. It returns the maximum age that a document can

reach and still be served by the cache without revalidation.
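A small Python sketch may help show how the client constraints slide the limit around. It is an illustration only: the function names are ours, the server limit is passed in as a number rather than computed, and INT_MAX is an assumed stand-in for a max-stale directive that carries no value.

```python
import sys

INT_MAX = sys.maxsize  # assumed marker for "max-stale" with no value: any staleness is fine


def client_modified_freshness_limit(server_limit, max_stale=None,
                                    min_fresh=None, max_age=None):
    """Adjust the server's freshness limit (seconds) by the client's constraints."""
    age_limit = server_limit
    if max_stale is not None:
        # The client tolerates a copy that is up to max_stale seconds past its limit.
        age_limit = INT_MAX if max_stale == INT_MAX else server_limit + max_stale
    if min_fresh is not None:
        # The client wants the copy to stay fresh for min_fresh more seconds.
        age_limit = min(age_limit, server_limit - min_fresh)
    if max_age is not None:
        # The client refuses anything older than max_age seconds.
        age_limit = min(age_limit, max_age)
    return age_limit


def is_fresh_enough(age, age_limit):
    """A document can be served without revalidation while its age is below the limit."""
    return age < age_limit


print(client_modified_freshness_limit(600, max_stale=60))   # 660
print(client_modified_freshness_limit(600, min_fresh=100))  # 500
print(client_modified_freshness_limit(600, max_age=300))    # 300
print(is_fresh_enough(72, 500))                             # True
```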

Example 7-3. Client freshness constraint calculation

use List::Util qw(min);   ## min( ) is not a Perl built-in

sub client_modified_freshness_limit
{
    my $age_limit = server_freshness_limit( ); ## From Example 7-2

    if ($Max_Stale_value_set)
    {
        ## $INT_MAX stands for a max-stale with no limit: any staleness is acceptable
        if ($Max_Stale_value == $INT_MAX)
            { $age_limit = $INT_MAX; }
        else
            { $age_limit = server_freshness_limit( ) + $Max_Stale_value; }
    }

    if ($Min_Fresh_value_set)
    {
        ## The copy must remain fresh for at least min-fresh more seconds
        $age_limit = min($age_limit, server_freshness_limit( ) - $Min_Fresh_value);
    }

    if ($Max_Age_value_set)
    {
        $age_limit = min($age_limit, $Max_Age_value);
    }

    return($age_limit);
}

The whole process involves two variables: the document’s age and its freshness limit. The document is “fresh enough” if the age is less than the freshness limit. The algorithm in Example 7-3 just takes the server freshness limit and slides it around based on additional client constraints. We hope this section made the subtle expiration algorithms described in the HTTP specifications a bit clearer.

Caches and Advertising

If you’ve made it this far, you’ve realized that caches improve performance and reduce traffic. You know caches can help users and give them a better experience, and you know caches can help network operators reduce their traffic.

The Advertiser’s Dilemma

You might also expect content providers to like caches. After all, if caches were everywhere, content providers wouldn’t have to buy big multiprocessor web servers to keep up with demand, and they wouldn’t have to pay steep network service charges to feed the same data to their viewers over and over again. And better yet, caches make the flashy articles and advertisements show up even faster and look even better on the viewer’s screens, encouraging them to consume more content and see more advertisements. And that’s just what content providers want! More eyeballs and more advertisements!

But that’s the rub. Many content providers are paid through advertising—in particu-

lar, they get paid every time an advertisement is shown to a user (maybe just a frac-

tion of a penny or two, but they add up if you show a million ads a day!). And that’s

the problem with caches—they can hide the real access counts from the origin

server. If caching were perfect, an origin server might not receive any HTTP accesses

at all, because they would be absorbed by Internet caches. But, if you are paid on

access counts, you won’t be celebrating.

The Publisher’s Response

Today, advertisers use all sorts of “cache-busting” techniques to ensure that caches

don’t steal their hit stream. They slap no-cache headers on their content. They serve

advertisements through CGI gateways. They rewrite advertisement URLs on each

access.

And these cache-busting techniques aren’t just for proxy caches. In fact, today they

are targeted primarily at the cache that’s enabled in every web browser. Unfortu-

nately, while over-aggressively trying to maintain their hit stream, some content pro-

viders are reducing the positive effects of caching to their site.

In the ideal world, content providers would let caches absorb their traffic, and the

caches would tell them how many hits they got. Today, there are a few ways caches

can do this.

One solution is to configure caches to revalidate with the origin server on every

access. This pushes a hit to the origin server for each access but usually does not

transfer any body data. Of course, this slows down the transaction.*

Log Migration

One ideal solution wouldn’t require sending hits through to the server. After all, the

cache can keep a log of all the hits. Caches could just distribute the hit logs to servers. In fact, some large cache providers have been known to manually process and hand-deliver cache logs to influential content providers to keep the content providers happy.

* Some caches support a variant of this revalidation, where they do a conditional GET or a HEAD request in

the background. The user does not perceive the delay, but the request triggers an offline access to the origin

server. This is an improvement, but it places more load on the caches and significantly increases traffic across

the network.


Unfortunately, hit logs are large, which makes them tough to move. And cache logs

are not standardized or organized to separate logs out to individual content provid-

ers. Also, there are authentication and privacy issues.

Proposals have been made for efficient (and less efficient) log-redistribution schemes.

None are far enough developed to be adopted by web software vendors. Many are

extremely complex and require joint business partnerships to succeed.* Several corporate ventures have been launched to develop supporting infrastructure for advertising revenue reclamation.

Hit Metering and Usage Limiting

RFC 2227, “Simple Hit-Metering and Usage-Limiting for HTTP,” defines a much sim-

pler scheme. This protocol adds one new header to HTTP, called Meter, that periodi-

cally carries hit counts for particular URLs back to the servers. This way, servers get

periodic updates from caches about the number of times cached documents were hit.

In addition, the server can control how many times a document can be served from cache, or how much wall-clock time can elapse, before the cache must report back to the server. This is called usage limiting; it lets servers bound how much a cached resource is used between reports to the origin server.

We’ll describe RFC 2227 in detail in Chapter 21.

For More Information

For more information on caching, refer to:

http://www.w3.org/Protocols/rfc2616/rfc2616.txt

RFC 2616, “Hypertext Transfer Protocol,” by R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee.

Web Caching

Duane Wessels, O’Reilly & Associates, Inc.

http://search.ietf.org/rfc/rfc3040.txt

RFC 3040, “Internet Web Replication and Caching Taxonomy.”

Web Proxy Servers

Ari Luotonen, Prentice Hall Computer Books.

http://search.ietf.org/rfc/rfc3143.txt

RFC 3143, “Known HTTP Proxy/Caching Problems.”

http://www.squid-cache.org

Squid Web Proxy Cache.

* Several businesses have launched trying to develop global solutions for integrated caching and logging.

CHAPTER 8

Integration Points: Gateways, Tunnels, and Relays

The Web has proven to be an incredible tool for disseminating content. Over time,

people have moved from just wanting to put static documents online to wanting to

share ever more complex resources, such as database content or dynamically gener-

ated HTML pages. HTTP applications, like web browsers, have provided users with

a unified means of accessing content over the Internet.

HTTP also has come to be a fundamental building block for application developers,

who piggyback other protocols on top of HTTP (for example, using HTTP to tunnel

or relay other protocol traffic through corporate firewalls, by wrapping that traffic in

HTTP). HTTP is used as a protocol for all of the Web’s resources, and it’s also a pro-

tocol that other applications and application protocols make use of to get their jobs

done.

This chapter takes a general look at some of the methods that developers have

come up with for using HTTP to access different resources and examines how

developers use HTTP as a framework for enabling other protocols and application

communication.

In this chapter, we discuss:

• Gateways, which interface HTTP with other protocols and applications

• Application interfaces, which allow different types of web applications to com-

municate with one another

• Tunnels, which let you send non-HTTP traffic over HTTP connections

• Relays, which are a type of simplified HTTP proxy used to forward data one hop

at a time

Gateways

The history behind HTTP extensions and interfaces was driven by people’s needs.

When the desire to put more complicated resources on the Web emerged, it rapidly

became clear that no single application could handle all imaginable resources.


To get around this problem, developers came up with the notion of a gateway that

could serve as a sort of interpreter, abstracting a way to get at the resource. A gate-

way is the glue between resources and applications. An application can ask (through

HTTP or some other defined interface) a gateway to handle the request, and the

gateway can provide a response. The gateway can speak the query language to the

database or generate the dynamic content, acting like a portal: a request goes in, and

a response comes out.

Figure 8-1 depicts a kind of resource gateway. Here, the Joe’s Hardware server is act-

ing as a gateway to database content—note that the client is simply asking for a

resource through HTTP, and the Joe’s Hardware server is interfacing with a gateway

to get at the resource.

Some gateways automatically translate HTTP traffic to other protocols, so HTTP cli-

ents can interface with other applications without the clients needing to know other

protocols (Figure 8-2).

Figure 8-2 shows three examples of gateways:

• In Figure 8-2a, the gateway receives HTTP requests for FTP URLs. The gateway

then opens FTP connections and issues the appropriate commands to the FTP

server. The document is sent back through HTTP, along with the correct HTTP

headers.

• In Figure 8-2b, the gateway receives an encrypted web request through SSL,

decrypts the request,* and forwards a normal HTTP request to the destination

server. These security accelerators can be placed directly in front of web servers

(usually in the same premises) to provide high-performance encryption for ori-

gin servers.

• In Figure 8-2c, the gateway connects HTTP clients to server-side application programs, through an application server gateway API. When you purchase from e-commerce stores on the Web, check the weather forecast, or get stock quotes, you are visiting application server gateways.

* The gateway would need to have the proper server certificates installed.

Figure 8-1. Gateway magic
(Diagram: the client requests http://www.joes-hardware.com/query-db.cgi?newproducts from www.joes-hardware.com, which reaches the database through a gateway and returns the result.)

Request message:

GET /query-db.cgi?newproducts HTTP/1.1
Host: www.joes-hardware.com
Accept: *

Response message:

HTTP/1.0 200 OK

New product list:
...

Client-Side and Server-Side Gateways

Web gateways speak HTTP on one side and a different protocol on the other side.*

Gateways are described by their client- and server-side protocols, separated by a slash:

<client-protocol>/<server-protocol>

So a gateway joining HTTP clients to NNTP news servers is an HTTP/NNTP gate-

way. We use the terms “server-side gateway” and “client-side gateway” to describe

what side of the gateway the conversion is done for:

• Server-side gateways speak HTTP with clients and a foreign protocol with servers (HTTP/*).

• Client-side gateways speak foreign protocols with clients and HTTP with servers (*/HTTP).

* Web proxies that convert between different versions of HTTP are like gateways, because they perform sophisticated logic to negotiate between the parties. But because they speak HTTP on both sides, they are technically proxies.

Figure 8-2. Three web gateway examples
(a) HTTP/FTP server-side FTP gateway: the HTTP client speaks HTTP to the gateway, which speaks FTP to the FTP server.
(b) HTTPS/HTTP client-side security gateway: the HTTPS client speaks SSL to the gateway, which speaks HTTP to the web server.
(c) Application server gateway: the HTTP client speaks HTTP to the application server, which reaches the program through CGI (or another API).

Protocol Gateways

You can direct HTTP traffic to gateways the same way you direct traffic to proxies.

Most commonly, you explicitly configure browsers to use gateways, intercept traffic

transparently, or configure gateways as surrogates (reverse proxies).

Figure 8-3 shows the dialog boxes used to configure a browser to use server-side FTP

gateways. In the configuration shown, the browser is configured to use gw1.joes-

hardware.com as an HTTP/FTP gateway for all FTP URLs. Instead of sending FTP

commands to an FTP server, the browser will send HTTP commands to the HTTP/

FTP gateway gw1.joes-hardware.com on port 8080.

The result of this gateway configuration is shown in Figure 8-4. Normal HTTP traf-

fic is unaffected; it continues to flow directly to origin servers. But requests for FTP

URLs are sent to the gateway gw1.joes-hardware.com within HTTP requests. The

gateway performs the FTP transactions on the client’s behalf and carries results back

to the client by HTTP.

The following sections describe common kinds of gateways: server protocol con-

verters, server-side security gateways, client-side security gateways, and application

servers.

HTTP/*: Server-Side Web Gateways

Server-side web gateways convert client-side HTTP requests into a foreign protocol,

as the requests travel inbound to the origin server (see Figure 8-5).

In Figure 8-5, the gateway receives an HTTP request for an FTP resource:

ftp://ftp.irs.gov/pub/00-index.txt

Figure 8-3. Configuring an HTTP/FTP gateway
(Screenshots: (a) MSIE manual proxy settings; (b) Navigator manual proxy settings.)

The gateway proceeds to open an FTP connection to the FTP port on the origin

server (port 21) and speak the FTP protocol to fetch the object. The gateway does

the following:

• Sends the USER and PASS commands to log in to the server

• Issues the CWD command to change to the proper directory on the server

• Sets the download type to ASCII

• Fetches the document’s last modification time with MDTM

• Tells the server to expect a passive data retrieval request using PASV

• Requests the object retrieval using RETR

• Opens a data connection to the FTP server on a port returned on the control

channel; as soon as the data channel is opened, the object content flows back to

the gateway

Figure 8-4. Browsers can configure particular protocols to use particular gateways
(Diagram: the HTTP client sends “GET http://www.cnn.com/ HTTP/1.0” over HTTP straight to the web server www.cnn.com, but sends “GET ftp://ftp.irs.gov/pub/00-index.txt HTTP/1.0” over HTTP to port 8080 of the HTTP/FTP gateway gw1.joes-hardware.com, which speaks FTP to the FTP server ftp.irs.gov.)

Figure 8-5. The HTTP/FTP gateway translates HTTP requests into FTP requests
(Diagram: the HTTP client sends the following request, inbound, to the HTTP/FTP inbound conversion gateway.)

GET ftp://ftp.irs.gov/pub/00-index.txt HTTP/1.0
Host: ftp.irs.gov
User-agent: SuperBrowser 4.2

(The gateway then issues these commands on the FTP control connection, port 21, and receives the data on the FTP data connection.)

USER anonymous
PASS joe
CWD /pub
TYPE A
MDTM 00-index.txt
PASV
RETR 00-index.txt

When the retrieval is complete, the object will be sent to the client in an HTTP

response.
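The translation step can be sketched as a function that turns an FTP URL into the control-channel commands listed above. This is a minimal sketch, not a working gateway: the function name and the default password are our own choices, and a real gateway would also open the data connection, pick the transfer type to suit the object, and map FTP errors to HTTP status codes.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def ftp_commands_for(url, default_password="joe"):
    """Build the FTP control-channel commands a gateway might send for an FTP URL."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError("not an FTP URL: " + url)
    # Split "/pub/00-index.txt" into the directory "/pub" and the file "00-index.txt".
    directory, filename = posixpath.split(unquote(parts.path))
    return [
        "USER " + (parts.username or "anonymous"),   # log in
        "PASS " + (parts.password or default_password),
        "CWD " + (directory or "/"),                 # change to the proper directory
        "TYPE A",                                    # ASCII download type
        "MDTM " + filename,                          # last modification time
        "PASV",                                      # passive data retrieval
        "RETR " + filename,                          # fetch the object
    ]


for command in ftp_commands_for("ftp://ftp.irs.gov/pub/00-index.txt"):
    print(command)
```

The MDTM reply is what lets the gateway fill in a Last-Modified header on the HTTP response it sends back.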

HTTP/HTTPS: Server-Side Security Gateways

Gateways can be used to provide extra privacy and security for an organization, by

encrypting all inbound web requests. Clients can browse the Web using normal

HTTP, but the gateway will automatically encrypt the user’s sessions (Figure 8-6).

HTTPS/HTTP: Client-Side Security Accelerator Gateways

Recently, HTTPS/HTTP gateways have become popular as security accelerators.

These HTTPS/HTTP gateways sit in front of the web server, usually as an invisible

intercepting gateway or a reverse proxy. They receive secure HTTPS traffic, decrypt

the secure traffic, and make normal HTTP requests to the web server (Figure 8-7).

These gateways often include special decryption hardware to decrypt secure traffic

much more efficiently than the origin server, removing load from the origin server.

Because these gateways send unencrypted traffic between the gateway and origin

server, you need to use caution to make sure the network between the gateway and

origin server is secure.

Figure 8-6. Inbound HTTP/HTTPS security gateway
(Diagram: an HTTP client sends “GET http://www.cnn.com/ HTTP/1.0” over HTTP to the HTTP/HTTPS inbound security gateway, which forwards it encrypted, as HTTP over SSL (HTTPS), to port 443 of the secure web server.)

Figure 8-7. HTTPS/HTTP security accelerator gateway
(Diagram: a browser sends its request encrypted, as HTTP over SSL (HTTPS), to the HTTPS/HTTP security accelerator gateway, which decrypts it and sends “GET http://www.cnn.com/ HTTP/1.0” over plain HTTP to www.cnn.com across a protected internal LAN.)

Resource Gateways

So far, we’ve been talking about gateways that connect clients and servers across a

network. However, the most common form of gateway, the application server, com-

bines the destination server and gateway into a single server. Application servers are

server-side gateways that speak HTTP with the client and connect to an application

program on the server side (see Figure 8-8).

In Figure 8-8, two clients are connecting to an application server using HTTP. But,

instead of sending back files from the server, the application server passes the

requests through a gateway application programming interface (API) to applications

running on the server:

• Client A’s request is received and, based on the URI, is sent through an API to a

digital camera application. The resulting camera image is bundled up into an

HTTP response message and sent back to the client, for display in the client’s

browser.

• Client B’s URI is for an e-commerce application. Client B’s requests are sent

through the server gateway API to the e-commerce software, and the results are

sent back to the browser. The e-commerce software interacts with the client,

walking the user through a sequence of HTML pages to complete a purchase.

The first popular API for application gateways was the Common Gateway Interface

(CGI). CGI is a standardized set of interfaces that web servers use to launch pro-

grams in response to HTTP requests for special URLs, collect the program output,

and send it back in HTTP responses. Over the past several years, commercial web

servers have provided more sophisticated interfaces for connecting web servers to

applications.

Figure 8-8. An application server connects HTTP clients to arbitrary backend applications
(Diagram: Client A and Client B connect over HTTP to the application server, which reaches the camera device and software through a web camera API and the e-commerce application through an e-commerce API.)

Early web servers were fairly simple creations, and the simple approach that was

taken for implementing an interface for gateways has stuck to this day.

When a request comes in for a resource that needs a gateway, the server spawns the

helper application to handle the request. The helper application is passed the data it

needs. Often this is just the entire request or something like the query the user wants

to run on the database (from the query string of the URL; see Chapter 2).

It then returns a response or response data to the server, which vectors it off to the

client. The server and gateway are separate applications, so the lines of responsibil-

ity are kept clear. Figure 8-9 shows the basic mechanics behind server and gateway

application interactions.

This simple protocol (request in, hand off, and respond) is the essence behind the

oldest and one of the most common server extension interfaces, CGI.

Common Gateway Interface (CGI)

The Common Gateway Interface was the first and probably still is the most widely

used server extension. It is used throughout the Web for things like dynamic HTML,

credit card processing, and querying databases.

Since CGI applications are separate from the server, they can be implemented in

almost any language, including Perl, Tcl, C, and various shell languages. And

because CGI is simple, almost all HTTP servers support it. The basic mechanics of

the CGI model are shown in Figure 8-9.

CGI processing is invisible to users. From the perspective of the client, it’s just mak-

ing a normal request. It is completely unaware of the hand-off procedure going on

between the server and the CGI application. The client’s only hint that a CGI appli-

cation might be involved would be the presence of the letters “cgi” and maybe “?” in

the URL.
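The hand-off can be sketched from the server’s side in Python. This is an illustration under stated assumptions: the tiny inline program stands in for a CGI script on disk, and the helper name run_cgi is ours. The environment variable names (GATEWAY_INTERFACE, REQUEST_METHOD, QUERY_STRING) are the standard CGI ones, and the output format is the usual header block, blank line, and body.

```python
import os
import subprocess
import sys

# Stand-in for a CGI program on disk. It reads the query string from its
# environment and writes headers, a blank line, and then the body to stdout.
CGI_PROGRAM = r'''
import os
print("Content-Type: text/plain")
print()
print("You asked for: " + os.environ["QUERY_STRING"])
'''


def run_cgi(method, query_string):
    """Spawn one gateway process for one request and return (headers, body)."""
    env = dict(os.environ,
               GATEWAY_INTERFACE="CGI/1.1",
               REQUEST_METHOD=method,
               QUERY_STRING=query_string)
    result = subprocess.run([sys.executable, "-c", CGI_PROGRAM],
                            env=env, capture_output=True, text=True, check=True)
    # Everything before the first blank line is header data; the rest is the body.
    head, _, body = result.stdout.partition("\n\n")
    headers = dict(line.split(": ", 1) for line in head.splitlines())
    return headers, body


headers, body = run_cgi("GET", "newproducts")
print(headers)
print(body)
```

Every call to run_cgi starts a fresh process, which is exactly the per-request cost that makes plain CGI expensive under load.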

Figure 8-9. Server gateway application mechanics
(Diagram, server internal view: inside the server system, the server process receives requests 1 through N, spawns one gateway process per request, passes each its request data, and returns the response data as responses 1 through N.)

So CGI is wonderful, right? Well, yes and no. It provides a simple, functional form of

glue between servers and pretty much any type of resource, handling any translation

that needs to occur. The interface also is elegant in protecting the server from buggy

extensions (if the extension were glommed onto the server itself, it could cause an

error that might end up crashing the server).

However, this separation incurs a cost in performance. The overhead to spawn a new

process for every CGI request is quite high, limiting the performance of servers that

use CGI and taxing the server machine’s resources. To try to get around this problem, a new form of CGI, aptly dubbed FastCGI, has been developed. This interface mimics CGI, but it runs as a persistent daemon, eliminating the performance penalty of setting up and tearing down a new process for each request.

Server Extension APIs

The CGI protocol provides a clean way to interface external interpreters with stock

HTTP servers, but what if you want to alter the behavior of the server itself, or you just

want to eke every last drop of performance you can get out of your server? For these

two needs, server developers have provided server extension APIs, which provide a

powerful interface for web developers to interface their own modules with an HTTP

server directly. Extension APIs allow programmers to graft their own code onto the

server or completely swap out a component of the server and replace it with their own.

Most popular servers provide one or more extension APIs for developers. Since these

extensions often are tied to the architecture of the server itself, most of them are specific to one server type. Microsoft, Netscape, Apache, and other server flavors all have APIs that allow developers to alter the behavior of the server or provide custom interfaces to different resources. These custom interfaces give developers a great deal of power.

One example of a server extension is Microsoft’s FrontPage Server Extension (FPSE),

which supports web publishing services for FrontPage authors. FPSE is able to inter-

pret remote procedure call (RPC) commands sent by FrontPage clients. These com-

mands are piggybacked on HTTP (specifically, overlaid on the HTTP POST method).

For details, see “FrontPage Server Extensions for Publishing Support” in Chapter 19.

Application Interfaces and Web Services

We’ve discussed resource gateways as ways for web servers to communicate with

applications. More generally, with web applications providing ever more types of ser-

vices, it becomes clear that HTTP can be part of a foundation for linking together

applications. One of the trickier issues in wiring up applications is negotiating the

protocol interface between the two applications so that they can exchange data—

often this is done on an application-by-application basis.


To work together, applications usually need to exchange more complex information

with one another than is expressible in HTTP headers. A couple of examples of

extending HTTP or layering protocols on top of HTTP in order to exchange custom-

ized information are described in Chapter 19. “FrontPage Server Extensions for Pub-

lishing Support” in Chapter 19 talks about layering RPCs over HTTP POST

messages, and “WebDAV and Collaborative Authoring” talks about adding XML to

HTTP headers.

The Internet community has developed a set of standards and protocols that allow

web applications to talk to each other. These standards are loosely referred to as web

services, although the term can mean standalone web applications (building blocks)

themselves. The premise of web services is not new, but they are a new mechanism

for applications to share information. Web services are built on standard web tech-

nologies, such as HTTP.

Web services exchange information using XML over SOAP. The Extensible Markup

Language (XML) provides a way to create and interpret customized information

about a data object. The Simple Object Access Protocol (SOAP) is a standard for

adding XML information to HTTP messages.*

Tunnels

We’ve discussed different ways that HTTP can be used to enable access to various

kinds of resources (through gateways) and to enable application-to-application com-

munication. In this section, we’ll take a look at another use of HTTP, web tunnels,

which enable access to applications that speak non-HTTP protocols through HTTP

applications.

Web tunnels let you send non-HTTP traffic through HTTP connections, allowing

other protocols to piggyback on top of HTTP. The most common reason to use web

tunnels is to embed non-HTTP traffic inside an HTTP connection, so it can be sent

through firewalls that allow only web traffic.

Establishing HTTP Tunnels with CONNECT

Web tunnels are established using HTTP’s CONNECT method. The CONNECT protocol is not part of the core HTTP/1.1 specification,† but it is a widely implemented

extension. Technical specifications can be found in Ari Luotonen’s expired Internet

draft specification, “Tunneling TCP based protocols through Web proxy servers,” or

in his book Web Proxy Servers, both of which are cited at the end of this chapter.

* For more information, see http://www.w3.org/TR/2001/WD-soap12-part0-20011217/. Programming Web Services with SOAP, by Doug Tidwell, James Snell, and Pavel Kulchenko (O’Reilly) is also an excellent source of information on the SOAP protocol.

† The HTTP/1.1 specification reserves the CONNECT method but does not describe its function.


The CONNECT method asks a tunnel gateway to create a TCP connection to an

arbitrary destination server and port and to blindly relay subsequent data between

client and server.

Figure 8-10 shows how the CONNECT method works to establish a tunnel to a

gateway:

• In Figure 8-10a, the client sends a CONNECT request to the tunnel gateway.

The client’s CONNECT method asks the tunnel gateway to open a TCP connec-

tion (here, to the host named orders.joes-hardware.com on port 443, the normal

SSL port).

• The TCP connection is created in Figure 8-10b and Figure 8-10c.

• Once the TCP connection is established, the gateway notifies the client

(Figure 8-10d) by sending an HTTP 200 Connection Established response.

• At this point, the tunnel is set up. Any data sent by the client over the HTTP

tunnel will be relayed directly to the outgoing TCP connection, and any data

sent by the server will be relayed to the client over the HTTP tunnel.

Figure 8-10. Using CONNECT to establish an SSL tunnel
(Diagram: (a) the client sends “CONNECT orders.joes-hardware.com:443 HTTP/1.0” to the gateway, the tunnel endpoint; (b) the gateway opens a TCP connection to port 443 of orders.joes-hardware.com; (d) the gateway returns “HTTP/1.0 200 Connection established”; (e) at this point, arbitrary, bidirectional communication of raw data occurs, until connection close. The tunnel goes between client and gateway; a normal SSL connection runs between the gateway and orders.joes-hardware.com.)

The example in Figure 8-10 describes an SSL tunnel, where SSL traffic is sent over an

HTTP connection, but the CONNECT method can be used to establish a TCP con-

nection to any server using any protocol.

CONNECT requests

The CONNECT syntax is identical in form to other HTTP methods, with the excep-

tion of the start line. The request URI is replaced by a hostname, followed by a

colon, followed by a port number. Both the host and the port must be specified:

CONNECT home.netscape.com:443 HTTP/1.0

User-agent: Mozilla/4.0

After the start line, there are zero or more HTTP request header fields, as in other

HTTP messages. As usual, the lines end in CRLFs, and the list of headers ends with a

bare CRLF.

CONNECT responses

After the request is sent, the client waits for a response from the gateway. As with

normal HTTP messages, a 200 response code indicates success. By convention, the

reason phrase in the response is normally set to “Connection Established”:

HTTP/1.0 200 Connection Established

Proxy-agent: Netscape-Proxy/1.1

Unlike normal HTTP responses, the response does not need to include a Content-Type header. No content type is required* because the connection becomes a raw

byte relay, instead of a message carrier.
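A client-side check for tunnel success might look like the following sketch. It assumes the client has already read the response headers into a byte string; tunnel_established is a hypothetical helper name, and it accepts only status 200, following the description above.

```python
def tunnel_established(response_head: bytes) -> bool:
    """Return True if the gateway's response carries status code 200."""
    status_line = response_head.split(b"\r\n", 1)[0].decode("ascii", "replace")
    # A status line has the form: version SP status-code SP reason-phrase
    parts = status_line.split(" ", 2)
    return (len(parts) >= 2
            and parts[0].startswith("HTTP/")
            and parts[1] == "200")
```

The reason phrase is deliberately ignored, because "Connection Established" is only a convention.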

Data Tunneling, Timing, and Connection Management

Because the tunneled data is opaque to the gateway, the gateway cannot make any

assumptions about the order and flow of packets. Once the tunnel is established,

data is free to flow in any direction at any time.†

* Future specifications may define a media type for tunnels (e.g., application/tunnel), for uniformity.

† The two endpoints of the tunnel (the client and the gateway) must be prepared to accept packets from either
of the connections at any time and must forward that data immediately. Because the tunneled protocol may
include data dependencies, neither end of the tunnel can ignore input data. Lack of data consumption on
one end of the tunnel may hang the producer on the other end of the tunnel, leading to deadlock.

As a performance optimization, clients are allowed to send tunnel data after sending
the CONNECT request but before receiving the response. This gets data to the server
faster, but it means that the gateway must be able to handle data following the
request properly. In particular, the gateway cannot assume that a network I/O request
will return only header data, and the gateway must be sure to forward any data read
with the header to the server, when the connection is ready. Clients that pipeline data
after the request must be prepared to resend the request data if the response comes
back as an authentication challenge or other non-200, nonfatal status.*

If at any point either one of the tunnel endpoints gets disconnected, any outstanding
data that came from that endpoint will be passed to the other one, after which the
proxy will terminate the other connection as well. If there is undelivered data for
the closing endpoint, that data will be discarded.
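The requirement that a gateway forward any data read along with the header can be sketched as follows. This is an illustration under simple assumptions: the gateway accumulates what it reads into one byte buffer, and split_connect_read is a hypothetical helper name.

```python
def split_connect_read(buffer: bytes):
    """Split bytes read from the client into (request head, early tunnel data).

    Returns None if the blank line that ends the headers has not arrived yet.
    """
    end = buffer.find(b"\r\n\r\n")
    if end == -1:
        return None
    head = buffer[:end + 4]
    # Anything after the blank line is tunnel data the client sent early;
    # it must be forwarded once the connection to the server is ready.
    early_data = buffer[end + 4:]
    return head, early_data
```

A gateway that discarded early_data would silently drop the first bytes of the tunneled protocol.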

SSL Tunneling

Web tunnels were first developed to carry encrypted SSL traffic through firewalls.

Many organizations funnel all traffic through packet-filtering routers and proxy serv-

ers to enhance security. But some protocols, such as encrypted SSL, cannot be prox-

ied by traditional proxy servers, because the information is encrypted. Tunnels let

the SSL traffic be carried through the port 80 HTTP firewall by transporting it

through an HTTP connection (Figure 8-11).

To allow SSL traffic to flow through existing proxy firewalls, a tunneling feature was

added to HTTP, in which raw, encrypted data is placed inside HTTP messages and

sent through normal HTTP channels (Figure 8-12).

* Try not to pipeline more data than can fit into the remainder of the request’s TCP packet. Pipelining more

data can cause a client TCP reset if the gateway subsequently closes the connection before all pipelined TCP

packets are received. A TCP reset can cause the client to lose the received gateway response, so the client won’t

be able to tell whether the failure was due to a network error, access control, or authentication challenge.

Figure 8-11. Tunnels let non-HTTP traffic flow through HTTP connections

[Figure: (a) SSL rejected by firewall: SSL traffic sent by the client to port 443 is rejected on its way through the filtering routers and firewall proxy. (b) HTTP-carried SSL accepted by firewall: SSL tunneled inside HTTP on port 80 passes through the filtering routers and firewall proxy to the SSL server.]

In Figure 8-12a, SSL traffic is sent directly to a secure web server (on SSL port 443).

In Figure 8-12b, SSL traffic is encapsulated into HTTP messages and sent over HTTP

port 80 connections, until it is decapsulated back into normal SSL connections.

Tunnels often are used to let non-HTTP traffic pass through port-filtering firewalls.

This can be put to good use, for example, to allow secure SSL traffic to flow through

firewalls. However, this feature can be abused, allowing malicious protocols to flow

into an organization through the HTTP tunnel.

SSL Tunneling Versus HTTP/HTTPS Gateways

The HTTPS protocol (HTTP over SSL) can alternatively be gatewayed in the same
way as other protocols: the gateway (instead of the client) initiates the SSL session
with the remote HTTPS server and then performs the HTTPS transaction on the
client’s behalf. The response is received and decrypted by the proxy and sent to
the client over (insecure) HTTP. This is the way gateways handle FTP. However, this
approach has several disadvantages:

• The client-to-gateway connection is normal, insecure HTTP.

• The client is not able to perform SSL client authentication (authentication based
on X.509 certificates) to the remote server, as the proxy is the authenticated party.

• The gateway needs to support a full SSL implementation.

Figure 8-12. Direct SSL connection vs. tunneled SSL connection

[Figure: (a) Direct SSL connection: the client opens an SSL connection to the server on port 443. (b) SSL through HTTP tunnel: the tunnel carries SSL traffic, intended for port 443, over a plain old HTTP connection on port 80 from the tunnel start to the tunnel endpoint, where it continues as an SSL connection.]

Note that the tunneling mechanism, if used for SSL tunneling, does not require an
implementation of SSL in the proxy. The SSL session is established between the client
generating the request and the destination (secure) web server; the proxy server in between
merely tunnels the encrypted data and does not take any other part in the secure
transaction.

Tunnel Authentication

Other features of HTTP can be used with tunnels where appropriate. In particular,

the proxy authentication support can be used with tunnels to authenticate a client’s

right to use a tunnel (Figure 8-13).
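As an illustration, the Proxy-authorization header a client sends in response to a Basic challenge can be built as shown below. The sketch assumes the Basic scheme used in the Figure 8-13 exchange; the helper name is hypothetical. The credential string in that figure, YnJpYW4tdG90dHk6T3ch, is the base64 encoding of "brian-totty:Ow!".

```python
import base64


def basic_proxy_authorization(username: str, password: str) -> str:
    """Build a Proxy-authorization header line for the Basic scheme."""
    # Basic credentials are base64("username:password").
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Proxy-authorization: Basic {token}"


header_line = basic_proxy_authorization("brian-totty", "Ow!")
```

Base64 is an encoding, not encryption, so these credentials travel to the gateway effectively in the clear.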

Figure 8-13. Gateways can proxy-authenticate a client before it’s allowed to use a tunnel

[Figure: message sequence between the client, the gateway (tunnel endpoint), and orders.joes-hardware.com. The tunnel goes between client and gateway; a normal SSL connection runs from the gateway to the server.]

(a) CONNECT request sent: "CONNECT orders.joes-hardware.com:443 HTTP/1.0" with "User-agent: SuperBrowser 4.2"

(b) Authentication challenge returned: "HTTP/1.0 407 Proxy authentication required" with "Proxy-authenticate: Basic realm="wormhole""

The client then resends the CONNECT request, adding "Proxy-authorization: Basic YnJpYW4tdG90dHk6T3ch"

(d) Open TCP connection to port 443

(e) Connection established

(f) HTTP connection ready message returned: "HTTP/1.0 200 Connection established"

Tunnel Security Considerations

In general, the tunnel gateway cannot verify that the protocol being spoken is really
what it is supposed to tunnel. Thus, for example, mischievous users might use tunnels
intended for SSL to tunnel Internet gaming traffic through a corporate firewall,
or malicious users might use tunnels to open Telnet sessions or to send email that
bypasses corporate email scanners.

To minimize abuse of tunnels, the gateway should open tunnels only for particular

well-known ports, such as 443 for HTTPS.
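A gateway might enforce this with a simple port allowlist, sketched below. The set contents and the function name may_open_tunnel are assumptions for illustration; a real gateway would load its policy from configuration.

```python
ALLOWED_TUNNEL_PORTS = {443}  # assumption: only HTTPS tunnels are permitted


def may_open_tunnel(authority: str) -> bool:
    """Check the host:port from a CONNECT start line against the allowlist."""
    host, sep, port_text = authority.rpartition(":")
    if not sep or not host:
        return False  # CONNECT requires both a host and a port
    if not (port_text.isascii() and port_text.isdigit()):
        return False
    return int(port_text) in ALLOWED_TUNNEL_PORTS
```

A port check only narrows the opening. It cannot confirm that the traffic sent to port 443 is really SSL.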

Relays

HTTP relays are simple HTTP proxies that do not fully adhere to the HTTP specifi-

cations. Relays process enough HTTP to establish connections, then blindly forward

bytes.

Because HTTP is complicated, it’s sometimes useful to implement bare-bones prox-

ies that just blindly forward traffic, without performing all of the header and method

logic. Because blind relays are easy to implement, they sometimes are used to provide

simple filtering, diagnostics, or content transformation. But they should be deployed

with great caution, because of the serious potential for interoperability problems.

One of the more common (and infamous) problems with some implementations of

simple blind relays relates to their potential to cause keep-alive connections to hang,

because they don’t properly process the Connection header. This situation is depicted

in Figure 8-14.

Figure 8-14. Simple blind relays can hang if they are single-tasking and don’t support the Connection header

[Figure: a client, a blind relay, and a server. The Connection: Keep-Alive header is relayed from the client to the server and back again. The server won’t close the connection when done because it thinks it has been asked to speak keep-alive, and the client’s second request on the keep-alive connection just hangs because the relay never processes it.]

Here’s what’s going on in this figure:

• In Figure 8-14a, a web client sends a message to the relay, including the
Connection: Keep-Alive header, requesting a keep-alive connection if possible. The client
waits for a response to learn if its request for a keep-alive channel was granted.

• The relay gets the HTTP request, but it doesn’t understand the Connection
header, so it passes the message verbatim down the chain to the server
(Figure 8-14b). However, the Connection header is a hop-by-hop header; it
applies only to a single transport link and shouldn’t be passed down the chain.
Bad things are about to start happening!

• In Figure 8-14b, the relayed HTTP request arrives at the web server. When the

web server receives the proxied Connection: Keep-Alive header, it mistakenly

concludes that the relay (which looks like any other client to the server) wants to

speak keep-alive! That’s fine with the web server—it agrees to speak keep-alive

and sends a Connection: Keep-Alive response header back in Figure 8-14c. So, at

this point, the web server thinks it is speaking keep-alive with the relay, and it

will adhere to rules of keep-alive. But the relay doesn’t know anything about

keep-alive.

• In Figure 8-14d, the relay forwards the web server’s response message back to

the client, passing along the Connection: Keep-Alive header from the web server.

The client sees this header and assumes the relay has agreed to speak keep-alive.

At this point, both the client and server believe they are speaking keep-alive, but

the relay to which they are talking doesn’t know the first thing about keep-alive.

• Because the relay doesn’t know anything about keep-alive, it forwards all the data
it receives back to the client, waiting for the origin server to close the connection.
But the origin server will not close the connection, because it believes the
relay asked the server to keep the connection open! So, the relay will hang waiting
for the connection to close.

• When the client gets the response message back in Figure 8-14d, it moves right

along to the next request, sending another request to the relay on the keep-alive

connection (Figure 8-14e). Simple relays usually never expect another request on

the same connection. The browser just spins, making no progress.

There are ways to make relays slightly smarter, to remove these risks, but any
simplification of proxies runs the risk of interoperation problems. If you are building
simple HTTP relays for a particular purpose, be cautious about how you use them. For any
wide-scale deployment, you should strongly consider using a real, HTTP-compliant
proxy server instead.
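One way a relay could be made slightly smarter is to drop the Connection header, and any headers it names, before forwarding a message. The sketch below assumes headers have already been parsed into (name, value) pairs; strip_hop_by_hop is a hypothetical helper, and this is far from everything a compliant proxy must do.

```python
def strip_hop_by_hop(headers):
    """Drop the Connection header and every header it names.

    `headers` is a list of (name, value) pairs for one message.
    """
    named = set()
    for name, value in headers:
        if name.lower() == "connection":
            # Connection lists tokens that apply to this hop only.
            named.update(token.strip().lower() for token in value.split(","))
    named.discard("")
    return [
        (name, value)
        for name, value in headers
        if name.lower() != "connection" and name.lower() not in named
    ]
```

With the header removed, the server never sees a keep-alive request from the relay, so it closes the connection when the response is done and the hang described above does not occur.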

For more information about relays and connection management, see “Keep-Alive

and Dumb Proxies” in Chapter 4.

For More Information

For more information, refer to:

http://www.w3.org/Protocols/rfc2616/rfc2616.txt

RFC 2616, “Hypertext Transfer Protocol,” by R. Fielding, J. Gettys, J. Mogul, H.
Frystyk, L. Masinter, P. Leach, and T. Berners-Lee.

Web Proxy Servers

Ari Luotonen, Prentice Hall Computer Books.

http://www.alternic.org/drafts/drafts-l-m/draft-luotonen-web-proxy-tunneling-01.txt

“Tunneling TCP based protocols through Web proxy servers,” by Ari Luotonen.

http://cgi-spec.golux.com

The Common Gateway Interface—RFC Project Page.

http://www.w3.org/TR/2001/WD-soap12-part0-20011217/

W3C—SOAP Version 1.2 Working Draft.

Programming Web Services with SOAP

James Snell, Doug Tidwell, and Pavel Kulchenko, O’Reilly & Associates, Inc.

http://www.w3.org/TR/2002/WD-wsa-reqs-20020429

W3C—Web Services Architecture Requirements.

Web Services Essentials

Ethan Cerami, O’Reilly & Associates, Inc.

CHAPTER 9

Web Robots

We continue our tour of HTTP architecture with a close look at the self-animating

user agents called web robots.

Web robots are software programs that automate a series of web transactions with-

out human interaction. Many robots wander from web site to web site, fetching con-

tent, following hyperlinks, and processing the data they find. These kinds of robots

are given colorful names such as “crawlers,” “spiders,” “worms,” and “bots” because

of the way they automatically explore web sites, seemingly with minds of their own.

Here are a few examples of web robots:

• Stock-graphing robots issue HTTP GETs to stock market servers every few min-

utes and use the data to build stock price trend graphs.

• Web-census robots gather “census” information about the scale and evolution of

the World Wide Web. They wander the Web counting the number of pages and

recording the size, language, and media type of each page.*

• Search-engine robots collect all the documents they find to create search

databases.

• Comparison-shopping robots gather web pages from online store catalogs to

build databases of products and their prices.

Crawlers and Crawling

Web crawlers are robots that recursively traverse information webs, fetching first one

web page, then all the web pages to which that page points, then all the web pages to

which those pages point, and so on. When a robot recursively follows web links, it is

called a crawler or a spider because it “crawls” along the web created by HTML

hyperlinks.

* http://www.netcraft.com collects great census metrics on what flavors of servers are being used by sites
around the Web.

Internet search engines use crawlers to wander about the Web and pull back all the

documents they encounter. These documents are then processed to create a search-

able database, allowing users to find documents that contain particular words. With

billions of web pages out there to find and bring back, these search-engine spiders

necessarily are some of the most sophisticated robots. Let’s look in more detail at

how crawlers work.

Where to Start: The “Root Set”

Before you can unleash your hungry crawler, you need to give it a starting point. The

initial set of URLs that a crawler starts visiting is referred to as the root set. When

picking a root set, you should choose URLs from enough different places that crawl-

ing all the links will eventually get you to most of the web pages that interest you.

What’s a good root set to use for crawling the web in Figure 9-1? As in the real Web,

there is no single document that eventually links to every document. If you start with

document A in Figure 9-1, you can get to B, C, and D, then to E and F, then to J, and

then to K. But there’s no chain of links from A to G or from A to N.

Some web pages in this web, such as S, T, and U, are nearly stranded—isolated,

without any links pointing at them. Perhaps these lonely pages are new, and no one

has found them yet. Or perhaps they are really old or obscure.

In general, you don’t need too many pages in the root set to cover a large portion of

the web. In Figure 9-1, you need only A, G, and S in the root set to reach all pages.

Typically, a good root set consists of the big, popular web sites (for example, http://

www.yahoo.com), a list of newly created pages, and a list of obscure pages that aren’t

often linked to. Many large-scale production crawlers, such as those used by Internet

search engines, have a way for users to submit new or obscure pages into the root set.

This root set grows over time and is the seed list for any fresh crawls.

Figure 9-1. A root set is needed to reach all pages

[Figure: a web of hyperlinked pages, each labeled with a letter; the link structure is not recoverable from the extracted text.]

Extracting Links and Normalizing Relative Links

As a crawler moves through the Web, it is constantly retrieving HTML pages. It needs

to parse out the URL links in each page it retrieves and add them to the list of pages

that need to be crawled. While a crawl is progressing, this list often expands rapidly,

as the crawler discovers new links that need to be explored.* Crawlers need to do some

simple HTML parsing to extract these links and to convert relative URLs into their

absolute form. “Relative URLs” in Chapter 2 discusses how to do this conversion.
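Link extraction and normalization can be sketched with Python’s standard library. This is a minimal example, not a robust HTML parser: it looks only at the HREF attribute of A tags, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the HREF attributes of A tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative URLs against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Here the base URL is simply the URL of the fetched page; a page can also declare its own base URL, which this sketch ignores.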

Cycle Avoidance

When a robot crawls a web, it must be very careful not to get stuck in a loop, or

cycle. Look at the crawler in Figure 9-2:

• In Figure 9-2a, the robot fetches page A, sees that A links to B, and fetches page B.

• In Figure 9-2b, the robot fetches page B, sees that B links to C, and fetches page C.

• In Figure 9-2c, the robot fetches page C and sees that C links to A. If the robot

fetches page A again, it will end up in a cycle, fetching A, B, C, A, B, C, A...

Robots must know where they’ve been to avoid cycles. Cycles can lead to robot traps

that can either halt or slow down a robot’s progress.
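The idea of remembering where the robot has been can be sketched with a visited set. To keep the example self-contained, the web is modeled as a dictionary that maps each page to the pages it links to; a real crawler would fetch and parse each page instead. The three-page cycle mirrors the A, B, C example above.

```python
from collections import deque


def crawl(start, links):
    """Visit every page reachable from `start` exactly once."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            # Skip pages we have already seen; this is what breaks the cycle.
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order


web = {"A": ["B"], "B": ["C"], "C": ["A"]}
pages = crawl("A", web)
```

Without the visited check, the link from C back to A would send the loop around forever.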

* In “Cycle Avoidance,” we begin to discuss the need for crawlers to remember where they have been. During
a crawl, this list of discovered URLs grows until the web space has been explored thoroughly and the crawler
reaches a point at which it is no longer discovering new links.

Figure 9-2. Crawling over a web of hyperlinks

[Figure: (a) Robot fetches page A, follows link, fetches B. (b) Robot follows link and fetches page C. The remaining panel is not recoverable from the extracted text.]

Loops and Dups

Cycles are bad for crawlers for at least three reasons:

• They get the crawler into a loop where it can get stuck. A loop can cause a
poorly designed crawler to spin round and round, spending all its time fetching
the same pages over and over again. The crawler can burn up lots of network
bandwidth and may be completely unable to fetch any other pages.

• While the crawler is fetching the same pages repeatedly, the web server on the

other side is getting pounded. If the crawler is well connected, it can overwhelm

the web site and prevent any real users from accessing the site. Such denial of

service can be grounds for legal claims.

• Even if the looping isn’t a problem itself, the crawler is fetching a large number

of duplicate pages (often called “dups,” which rhymes with “loops”). The

crawler’s application will be flooded with duplicate content, which may make

the application useless. An example of this is an Internet search engine that

returns hundreds of matches of the exact same page.

Trails of Breadcrumbs

Unfortunately, keeping track of where you’ve been isn’t always so easy. At the time

of this writing, there are billions of distinct web pages on the Internet, not counting

content generated from dynamic gateways.

If you are going to crawl a big chunk of the world’s web content, you need to be pre-

pared to visit billions of URLs. Keeping track of which URLs have been visited can

be quite challenging. Because of the huge number of URLs, you need to use sophisti-

cated data structures to quickly determine which URLs you’ve visited. The data

structures need to be efficient in speed and memory use.

Speed is important because hundreds of millions of URLs require fast search struc-

tures. Exhaustive searching of URL lists is out of the question. At the very least, a

robot will need to use a search tree or hash table to be able to quickly determine

whether a URL has been visited.

Hundreds of millions of URLs take up a lot of space, too. If the average URL is 40

characters long, and a web robot crawls 500 million URLs (just a small portion of the

Web), a search data structure could require 20 GB or more of memory just to hold

the URLs (40 bytes per URL × 500 million URLs = 20 GB)!

Here are some useful techniques that large-scale web crawlers use to manage where

they visit:

Trees and hash tables

Sophisticated robots might use a search tree or a hash table to keep track of vis-

ited URLs. These are software data structures that make URL lookup much faster.

Lossy presence bit maps

To minimize space, some large-scale crawlers use lossy data structures such as

presence bit arrays. Each URL is converted into a fixed size number by a hash

function, and this number has an associated “presence bit” in an array. When a

URL is crawled, the corresponding presence bit is set. If the presence bit is

already set, the crawler assumes the URL has already been crawled.*

Checkpoints

Be sure to save the list of visited URLs to disk, in case the robot program crashes.

Partitioning

As the Web grows, it may become impractical to complete a crawl with a single

robot on a single computer. That computer may not have enough memory, disk

space, computing power, or network bandwidth to complete a crawl.

Some large-scale web robots use “farms” of robots, each a separate computer,

working in tandem. Each robot is assigned a particular “slice” of URLs, for

which it is responsible. Together, the robots work to crawl the Web. The indi-

vidual robots may need to communicate to pass URLs back and forth, to cover

for malfunctioning peers, or to otherwise coordinate their efforts.
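As an illustration of the lossy presence bit array described above, here is a minimal sketch. The class name, the array size, and the choice of SHA-256 as the hash function are all assumptions made for the example.

```python
import hashlib


class PresenceBits:
    """Lossy visited-URL set: one presence bit per hash bucket."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def _index(self, url):
        # Convert the URL into a fixed-size number, then into a bit position.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def test_and_set(self, url):
        """Return True if the URL's bit was already set, then set it."""
        i = self._index(url)
        byte, mask = i // 8, 1 << (i % 8)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen


seen_urls = PresenceBits(1 << 20)  # about a million bits, 128 KB of memory
```

The array’s size is fixed no matter how long the URLs are. The trade-off is the collision risk noted above: two URLs that hash to the same bit look identical, so the second one is skipped.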

A good reference book for implementing huge data structures is Managing Gigabytes:
Compressing and Indexing Documents and Images, by Witten et al. (Morgan Kaufmann).
This book is full of tricks and techniques for managing large amounts of
data.

Aliases and Robot Cycles

Even with the right data structures, it is sometimes difficult to tell if you have visited

a page before, because of URL “aliasing.” Two URLs are aliases if the URLs look dif-

ferent but really refer to the same resource.

Table 9-1 illustrates a few simple ways that different URLs can point to the same

resource.

* Because there are a potentially infinite number of URLs and only a finite number of bits in the presence bit

array, there is potential for collision—two URLs can map to the same presence bit. When this happens, the

crawler mistakenly concludes that a page has been crawled when it hasn’t. In practice, this situation can be

made very unlikely by using a large number of presence bits. The penalty for collision is that a page will be

omitted from a crawl.

Table 9-1. Different URLs that alias to the same documents

     First URL                          Second URL                          When aliased
a    http://www.foo.com/bar.html        http://www.foo.com:80/bar.html      Port is 80 by default
b    http://www.foo.com/~fred           http://www.foo.com/%7Efred          %7E is same as ~
c    http://www.foo.com/x.html#early    http://www.foo.com/x.html#middle    Tags don’t change the page
d    http://www.foo.com/readme.htm      http://www.foo.com/README.HTM       Case-insensitive server
e    http://www.foo.com/                http://www.foo.com/index.html       Default page is index.html
f    http://www.foo.com/index.html      http://209.231.87.45/index.html     www.foo.com has this IP address

Canonicalizing URLs

Most web robots try to eliminate the obvious aliases up front by “canonicalizing”

URLs into a standard form. A robot might first convert every URL into a canonical

form, by:

1. Adding “:80” to the hostname, if the port isn’t specified

2. Converting all %xx escaped characters into their character equivalents

3. Removing # tags

These steps can eliminate the aliasing problems shown in Table 9-1a–c. But, with-

out knowing information about the particular web server, the robot doesn’t have any

good way of avoiding the duplicates from Table 9-1d–f:

• The robot would need to know whether the web server was case-insensitive to

avoid the alias in Table 9-1d.

• The robot would need to know the web server’s index-page configuration for

this directory to know whether the URLs in Table 9-1e were aliases.

• The robot would need to know if the web server was configured to do virtual host-

ing (covered in Chapter 5) to know if the URLs in Table 9-1f were aliases, even if

it knew the hostname and IP address referred to the same physical computer.

URL canonicalization can eliminate the basic syntactic aliases, but robots will

encounter other URL aliases that can’t be eliminated through converting URLs to

standard forms.
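The three canonicalization steps listed above can be sketched with urllib.parse. This is a simplified illustration that assumes plain http URLs without user information. Decoding every escape, as step 2 describes, can change the meaning of a URL whose path contains escaped reserved characters such as %2F, so treat this as a sketch rather than a production rule.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def canonicalize(url):
    """Apply the three canonicalization steps to an http URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port or 80        # step 1: make the default port explicit
    path = unquote(parts.path)     # step 2: decode %xx escapes
    # Step 3: drop the # tag by passing an empty fragment.
    return urlunsplit((parts.scheme, f"{host}:{port}", path, parts.query, ""))
```

For example, a URL with no port, one with an explicit :80, and one whose tilde is escaped as %7E all reduce to the same string as their plain counterparts. Aliases that depend on server configuration, such as index pages or case-insensitive paths, are untouched.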

Filesystem Link Cycles

Symbolic links on a filesystem can cause a particularly insidious kind of cycle,

because they can create an illusion of an infinitely deep directory hierarchy where

none exists. Symbolic link cycles usually are the result of an unintentional error by

the server administrator, but they also can be created by “evil webmasters” as a mali-

cious trap for robots.

Figure 9-3 shows two filesystems. In Figure 9-3a, subdir is a normal directory. In

Figure 9-3b, subdir is a symbolic link pointing back to /. In both figures, assume the

file /index.html contains a hyperlink to the file subdir/index.html.

Using Figure 9-3a’s filesystem, a web crawler may take the following actions:

1. GET http://www.foo.com/index.html

Get /index.html, find link to subdir/index.html.

2. GET http://www.foo.com/subdir/index.html

Get subdir/index.html, find link to subdir/logo.gif.

3. GET http://www.foo.com/subdir/logo.gif

Get subdir/logo.gif, no more links, all done.

But in Figure 9-3b’s filesystem, the following might happen:

1. GET http://www.foo.com/index.html

Get /index.html, find link to subdir/index.html.

2. GET http://www.foo.com/subdir/index.html

Get subdir/index.html, but get back same index.html.

3. GET http://www.foo.com/subdir/subdir/index.html

Get subdir/subdir/index.html.

4. GET http://www.foo.com/subdir/subdir/subdir/index.html

Get subdir/subdir/subdir/index.html.

The problem with Figure 9-3b is that subdir/ is a cycle back to /, but because the

URLs look different, the robot doesn’t know from the URL alone that the docu-

ments are identical. The unsuspecting robot runs the risk of getting into a loop.

Without some kind of loop detection, this cycle will continue, often until the length

of the URL exceeds the robot’s or the server’s limits.

Dynamic Virtual Web Spaces

It’s possible for malicious webmasters to intentionally create sophisticated crawler

loops to trap innocent, unsuspecting robots. In particular, it’s easy to publish a URL

that looks like a normal file but really is a gateway application. This application can

whip up HTML on the fly that contains links to imaginary URLs on the same server.

When these imaginary URLs are requested, the nasty server fabricates a new HTML

page with new imaginary URLs.

The malicious web server can take the poor robot on an Alice-in-Wonderland jour-

ney through an infinite virtual web space, even if the web server doesn’t really con-

tain any files. Even worse, it can make it very difficult for the robot to detect the

cycle, because the URLs and HTML can look very different each time. Figure 9-4

shows an example of a malicious web server generating bogus content.

Figure 9-3. Symbolic link cycles

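[Figure: two filesystem trees. (a) subdir is a directory: / contains index.html and subdir, and subdir contains index.html and logo.gif. (b) subdir is an upward symbolic link: / contains index.html and subdir, and subdir points back to /.]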

More commonly, well-intentioned webmasters may unwittingly create a crawler trap

through symbolic links or dynamic content. For example, consider a CGI-based cal-

endaring program that generates a monthly calendar and a link to the next month. A

real user would not keep requesting the next-month link forever, but a robot that is

unaware of the dynamic nature of the content might keep requesting these resources

indefinitely.*

Avoiding Loops and Dups

There is no foolproof way to avoid all cycles. In practice, well-designed robots need

to include a set of heuristics to try to avoid cycles.

Generally, the more autonomous a crawler is (less human oversight), the more likely

it is to get into trouble. There is a bit of a trade-off that robot implementors need to

make—these heuristics can help avoid problems, but they also are somewhat

“lossy,” because you can end up skipping valid content that looks suspect.

* This is a real example mentioned on http://www.searchtools.com/robots/robot-checklist.html for the
calendaring site at http://cgi.umbc.edu/cgi-bin/WebEvent/webevent.cgi. As a result of dynamic content like this, many
robots refuse to crawl pages that have the substring “cgi” anywhere in the URL.

Figure 9-4. Malicious dynamic web space example

[Figure: a web robot client (User-agent: ShopBot) exchanging messages with www.evil-joes-hardware.com. The robot sends "GET /index-fall.html HTTP/1.1" and receives a 200 OK text/html response containing <A HREF=/index-fall2.html>trick</A>. It then sends "GET /index-fall2.html HTTP/1.1" and receives a 200 OK response containing <A HREF=/index-fall3.html>trick</A>, and so on.]

A few sites exist that are just malicious gateway applications, whose sole purpose is to trap
unsuspecting robots with bogus content. In this example, the gateway dynamically
generates an infinite number of fake web pages, each pointing to the next.

Some techniques that robots use to behave better in a web full of robot dangers are:

Canonicalizing URLs

Avoid syntactic aliases by converting URLs into standard form.

Breadth-first crawling

Crawlers have a large set of potential URLs to crawl at any one time. By schedul-

ing the URLs to visit in a breadth-first manner, across web sites, you can mini-

mize the impact of cycles. Even if you hit a robot trap, you still can fetch

hundreds of thousands of pages from other web sites before returning to fetch a

page from the cycle. If you operate depth-first, diving head-first into a single site,

you may hit a cycle and never escape to other sites.*

Throttling†

Limit the number of pages the robot can fetch from a web site in a period of

time. If the robot hits a cycle and continually tries to access aliases from a site,

you can cap the total number of duplicates generated and the total number of

accesses to the server by throttling.

Limit URL size

The robot may refuse to crawl URLs beyond a certain length (1 KB is common).
If a cycle causes the URL to grow in size, a length limit will eventually stop the
cycle. Some web servers fail when given long URLs, and robots caught in a
URL-increasing cycle can cause some web servers to crash. This may make webmasters
misinterpret the robot as a denial-of-service attacker.

As a caution, this technique can certainly lead to missed content. Many sites

today use URLs to help manage user state (for example, storing user IDs in the

URLs referenced in a page). URL size can be a tricky way to limit a crawl; how-

ever, it can provide a great flag for a user to inspect what is happening on a par-

ticular site, by logging an error whenever requested URLs reach a certain size.

URL/site blacklist

Maintain a list of known sites and URLs that correspond to robot cycles and

traps, and avoid them like the plague. As new problems are found, add them to

the blacklist.

This requires human action. However, most large-scale crawlers in production

today have some form of a blacklist, used to avoid certain sites because of inher-

ent problems or something malicious in the sites. The blacklist also can be used

to avoid certain sites that have made a fuss about being crawled.‡

* Breadth-first crawling is a good idea in general, so as to more evenly disperse requests and not overwhelm

any one server. This can help keep the resources that a robot uses on a server to a minimum.

† Throttling of request rate is also discussed in “Robot Etiquette.”

‡ “Excluding Robots” discusses how sites can avoid being crawled, but some users refuse to use this simple

control mechanism and become quite irate when their sites are crawled.

224 |Chapter 9: Web Robots

Pattern detection

Cycles caused by filesystem symlinks and similar misconfigurations tend to fol-

low patterns; for example, the URL may grow with components duplicated.

Some robots view URLs with repeating components as potential cycles and

refuse to crawl URLs with more than two or three repeated components.

Not all repetition is immediate (e.g., “/subdir/subdir/subdir...”). It’s possible to

have cycles of period 2 or other intervals, such as “/subdir/images/subdir/

images/subdir/images/...”. Some robots look for repeating patterns of a few dif-

ferent periods.
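One possible way to look for repeating components of several periods is sketched below. The thresholds (`max_repeats=3`, `max_period=3`) and the function name are assumptions; a real robot would tune them.

```python
from urllib.parse import urlsplit


def looks_like_cycle(url, max_repeats=3, max_period=3):
    """Return True if a run of path components repeats back to back too often.

    A period of 1 catches /subdir/subdir/subdir; a period of 2 catches
    /subdir/images/subdir/images/subdir/images.
    """
    parts = [p for p in urlsplit(url).path.split("/") if p]
    for period in range(1, max_period + 1):
        for start in range(len(parts) - period + 1):
            unit = parts[start:start + period]
            repeats = 1
            pos = start + period
            # Count how many times the same unit follows itself immediately.
            while parts[pos:pos + period] == unit:
                repeats += 1
                pos += period
            if repeats >= max_repeats:
                return True
    return False
```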

Content fingerprinting

Fingerprinting is a more direct way of detecting duplicates that is used by some

of the more sophisticated web crawlers. Robots using content fingerprinting take

the bytes in the content of the page and compute a checksum. This checksum is a

compact representation of the content of the page. If a robot ever fetches a page

whose checksum it has seen before, the page’s links are not crawled—if the

robot has seen the page’s content before, it has already initiated the crawling of

the page’s links.

The checksum function must be chosen so that the odds of two different pages

having the same checksum are small. Message digest functions such as MD5 are

popular for fingerprinting.

Because some web servers dynamically modify pages on the fly, robots some-

times omit certain parts of the web page content, such as embedded links, from

the checksum calculation. Still, dynamic server-side includes that customize

arbitrary page content (adding dates, access counters, etc.) may prevent dupli-

cate detection.
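A minimal sketch of content fingerprinting, assuming MD5 as the digest (as suggested above) and an in-memory set of seen checksums; the class name is hypothetical, and a large crawler would need persistent storage instead.

```python
import hashlib


class FingerprintIndex:
    """Remember checksums of page bodies so duplicate content is expanded only once."""

    def __init__(self):
        self.seen = set()

    def is_new(self, body: bytes) -> bool:
        digest = hashlib.md5(body).hexdigest()
        if digest in self.seen:
            return False   # content seen before: do not crawl its links again
        self.seen.add(digest)
        return True
```

A robot that wants to ignore dynamic parts of a page would strip them from `body` before calling `is_new`.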

Human monitoring

The Web is a wild place. Your brave robot eventually will stumble into a prob-

lem that none of your techniques will catch. All production-quality robots must

be designed with diagnostics and logging, so human beings can easily monitor

the robot’s progress and be warned quickly if something unusual is happening.

In some cases, angry net citizens will highlight the problem for you by sending

you nasty email.

Good spider heuristics for crawling datasets as vast as the Web are always works in

progress. Rules are built over time and adapted as new types of resources are added

to the Web. Good rules are always evolving.

Many smaller, more customized crawlers skirt some of these issues, as the resources

(servers, network bandwidth, etc.) that are impacted by an errant crawler are man-

ageable, or possibly even are under the control of the person performing the crawl

(such as on an intranet site). These crawlers rely on more human monitoring to pre-

vent problems.


Robotic HTTP

Robots are no different from any other HTTP client program. They too need to abide

by the rules of the HTTP specification. A robot that is making HTTP requests and

advertising itself as an HTTP/1.1 client needs to use the appropriate HTTP request

headers.

Many robots try to implement the minimum amount of HTTP needed to request the

content they seek. This can lead to problems; however, it’s unlikely that this behav-

ior will change anytime soon. As a result, many robots make HTTP/1.0 requests,

because that protocol has few requirements.

Identifying Request Headers

Despite the minimum amount of HTTP that robots tend to support, most do imple-

ment and send some identification headers—most notably, the User-Agent HTTP

header. It’s recommended that robot implementors send some basic header informa-

tion to notify the site of the capabilities of the robot, the robot’s identity, and where

it originated.

This is useful information both for tracking down the owner of an errant crawler and

for giving the server some information about what types of content the robot can

handle. Some of the basic identifying headers that robot implementors are encour-

aged to implement are:

User-Agent

Tells the server the name of the robot making the request.

From

Provides the email address of the robot’s user/administrator.*

Accept

Tells the server what media types are okay to send.† This can help ensure that the robot receives only content in which it’s interested (text, images, etc.).

Referer

Provides the URL of the document that contains the current request-URL.‡
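The identifying headers can be attached to a request as in the sketch below, which uses Python’s standard `urllib.request.Request` and only builds the request without sending it. The robot name and contact address are hypothetical placeholders.

```python
from urllib.request import Request


def build_robot_request(url, referer=None):
    """Build (but do not send) a GET request carrying the identifying headers."""
    headers = {
        "User-Agent": "ExampleBot/1.0",      # hypothetical robot name
        "From": "robot-admin@example.com",   # hypothetical contact address
        "Accept": "text/html, text/plain",   # media types this robot handles
    }
    if referer:
        headers["Referer"] = referer         # page on which the link was found
    return Request(url, headers=headers, method="GET")
```

When such a request is eventually opened, urllib adds the Host header from the URL on its own, so the robot does not have to set it by hand.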

Virtual Hosting

Robot implementors need to support the Host header. Given the prevalence of virtual hosting (Chapter 5 discusses virtually hosted servers in more detail), not including the Host HTTP header in requests can lead to robots identifying the wrong content with a particular URL. HTTP/1.1 requires the use of the Host header for this reason.

* An RFC 822 email address format.

† “Accept headers” in Chapter 3 lists all of the accept headers; robots may find it useful to send headers such as Accept-Charset if they are interested in particular versions.

‡ This can be very useful to site administrators that are trying to track down how a robot found links to their sites’ content.

Most servers are configured to serve a particular site by default. Consider a crawler that omits the Host header when making a request to a server that hosts two sites, like those in Figure 9-5 (www.joes-hardware.com and www.foo.com). If the server is configured to serve www.joes-hardware.com by default (and does not require the Host header), a request for a page on www.foo.com can return content from the Joe’s Hardware site. Worse yet, the crawler will believe that the content came from www.foo.com. The consequences could be more unfortunate still if the two sites sharing the server held opposing political or other views.

Conditional Requests

Given the scale of some robotic endeavors, it often makes sense to minimize the amount of content a robot retrieves. For Internet search-engine robots, which may have billions of web pages to download, it makes sense to re-retrieve content only if it has changed.

Some of these robots implement conditional HTTP requests,* comparing timestamps or entity tags to see if the last version that they retrieved has been updated. This is very similar to the way that an HTTP cache checks the validity of the local copy of a previously fetched resource. See Chapter 7 for more on how caches validate local copies of resources.
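In code, a conditional refetch comes down to sending back the validators saved from the previous response and treating 304 Not Modified as "nothing to do." The helper names below are assumptions for illustration.

```python
def conditional_headers(last_modified=None, etag=None):
    """Return request headers that ask for the body only if it has changed."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified  # stored Last-Modified value
    if etag:
        headers["If-None-Match"] = etag               # stored ETag value
    return headers


def needs_reprocessing(status):
    """A 304 Not Modified response means the stored copy is still current."""
    return status != 304
```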

Figure 9-5. Example of virtual docroots causing trouble if no Host header is sent with the request

* “Conditional request headers” in Chapter 3 gives a complete listing of the conditional headers that a robot

can implement.

[Figure 9-5: The web robot client tries to request /index.html from www.foo.com but does not include a Host header. The server is configured to serve both www.joes-hardware.com and www.foo.com, but serves Joe’s Hardware by default.]

Request message:

GET /index.html HTTP/1.0
User-agent: ShopBot 1.0

Response message:

HTTP/1.0 200 OK
[...]
<HTML>
<TITLE>Welcome to Joe's Hardware!</TITLE>
[...]


Response Handling

Because many robots are interested primarily in getting the content requested

through simple GET methods, often they don’t do much in the way of response han-

dling. However, robots that use some features of HTTP (such as conditional

requests), as well as those that want to better explore and interoperate with servers,

need to be able to handle different types of HTTP responses.

Status codes

In general, robots should be able to handle at least the common or expected status

codes. All robots should understand HTTP status codes such as 200 OK and 404

Not Found. They also should be able to deal with status codes that they don’t explic-

itly understand based on the general category of response. Table 3-2 in Chapter 3

gives a breakdown of the different status-code categories and their meanings.
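Handling an unfamiliar status code by its general category can be as simple as looking at the first digit. The table and function below are a sketch with assumed names.

```python
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_category(status):
    """Classify a status code by its hundreds digit, even if the code is unfamiliar."""
    return CATEGORIES.get(status // 100, "unknown")
```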

It is important to note that some servers don’t always return the appropriate error

codes. Some servers even return 200 OK HTTP status codes with the text body of the

message describing an error! It’s hard to do much about this—it’s just something for

implementors to be aware of.

Entities

Along with information embedded in the HTTP headers, robots can look for information in the entity itself. Meta HTML tags,* such as the meta http-equiv tag, are a means for content authors to embed additional information about resources. The http-equiv tag itself is a way for content authors to override certain headers that the server handling their content may serve:

<meta http-equiv="Refresh" content="1, URL=index.html">

This tag instructs the receiver to treat the document as if its HTTP response header contained a Refresh HTTP header with the value “1, URL=index.html”.†

Some servers actually parse the contents of HTML pages prior to sending them and

include http-equiv directives as headers; however, some do not. Robot implemen-

tors may want to scan the HEAD elements of HTML documents to look for http-

equiv information. ‡
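A scan for http-equiv information could be sketched with Python’s standard `html.parser` module as follows. The class and function names are assumptions. The scanner deliberately does not restrict itself to the HEAD element, since some documents place these tags elsewhere.

```python
from html.parser import HTMLParser


class HttpEquivScanner(HTMLParser):
    """Collect (header-name, value) pairs from <meta http-equiv=...> tags."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)   # the parser delivers attribute names in lowercase
        name, value = attrs.get("http-equiv"), attrs.get("content")
        if name and value is not None:
            self.headers.append((name, value))


def scan_http_equiv(html_text):
    scanner = HttpEquivScanner()
    scanner.feed(html_text)
    return scanner.headers
```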

* “Robot META directives” lists additional meta directives that site administrators and content authors can

use to control the behavior of robots and what they do with documents that have been retrieved.

† The Refresh HTTP header sometimes is used as a means to redirect users (or in this case, a robot) from one

page to another.

‡ Meta tags must occur in the HEAD section of HTML documents, according to the HTML specification.

However, they sometimes occur in other HTML document sections, as not all HTML documents adhere to

the specification.


User-Agent Targeting

Web administrators should keep in mind that many robots will visit their sites and

therefore should expect requests from them. Many sites optimize content for various

user agents, attempting to detect browser types to ensure that various site features

are supported. By doing this, the sites serve error pages instead of content to robots.

Performing a text search for the phrase “your browser does not support frames” on

some search engines will yield a list of results for error pages that contain that

phrase, when in fact the HTTP client was not a browser at all, but a robot.

Site administrators should plan a strategy for handling robot requests. For example,

instead of limiting their content development to specific browser support, they can

develop catch-all pages for non–feature rich browsers and robots. At a minimum,

they should expect robots to visit their sites and not be caught off guard when they

do.*

Misbehaving Robots

There are many ways that wayward robots can cause mayhem. Here are a few mis-

takes robots can make, and the impact of their misdeeds:

Runaway robots

Robots issue HTTP requests much faster than human web surfers, and they

commonly run on fast computers with fast network links. If a robot contains a

programming logic error, or gets caught in a cycle, it can throw intense load

against a web server—quite possibly enough to overload the server and deny ser-

vice to anyone else. All robot authors must take extreme care to design in safe-

guards to protect against runaway robots.

Stale URLs

Some robots visit lists of URLs. These lists can be old. If a web site makes a big

change in its content, robots may request large numbers of nonexistent URLs.

This annoys some web site administrators, who don’t like their error logs filling

with access requests for nonexistent documents and don’t like having their web

server capacity reduced by the overhead of serving error pages.

Long, wrong URLs

As a result of cycles and programming errors, robots may request large, non-

sense URLs from web sites. If the URL is long enough, it may reduce the perfor-

mance of the web server, clutter the web server access logs, and even cause

fragile web servers to crash.

* “Excluding Robots” provides information for how site administrators can control the behavior of robots on

their sites if there is content that should not be accessed by robots.


Nosy robots*

Some robots may get URLs that point to private data and make that data easily

accessible through Internet search engines and other applications. If the owner

of the data didn’t actively advertise the web pages, she may view the robotic

publishing as a nuisance at best and an invasion of privacy at worst.

Usually this happens because a hyperlink to the “private” content that the robot

followed already exists (i.e., the content isn’t as secret as the owner thought it

was, or the owner forgot to remove a preexisting hyperlink). Occasionally it hap-

pens when a robot is very zealous in trying to scavenge the documents on a site,

perhaps by fetching the contents of a directory, even if no explicit hyperlink exists.

Robot implementors retrieving large amounts of data from the Web should be

aware that their robots are likely to retrieve sensitive data at some point—data

that the site implementor never intended to be accessible over the Internet. This

sensitive data can include password files or even credit card information.

Clearly, a mechanism to discard such content once it is pointed out (and remove it from any search index or archive) is important. Malicious search engine and archive users have been known to exploit the abilities of large-scale web crawlers to find such content. Some search engines, such as Google,† actually archive representations of the pages they have crawled, so even if content is removed, it can still be found and accessed for some time.

Dynamic gateway access

Robots don’t always know what they are accessing. A robot may fetch a URL

whose content comes from a gateway application. In this case, the data obtained

may be special-purpose and may be expensive to compute. Many web site admin-

istrators don’t like naïve robots requesting documents that come from gateways.

Excluding Robots

The robot community understood the problems that robotic web site access could

cause. In 1994, a simple, voluntary technique was proposed to keep robots out of

where they don’t belong and provide webmasters with a mechanism to better control

their behavior. The standard was named the “Robots Exclusion Standard” but is often

just called robots.txt, after the file where the access-control information is stored.

The idea of robots.txt is simple. Any web server can provide an optional file named robots.txt in the document root of the server. This file contains information about which robots can access which parts of the server. If a robot follows this voluntary standard, it will request the robots.txt file from the web site before accessing any other resource from that site. For example, the robot in Figure 9-6 wants to download http://www.joes-hardware.com/specials/acetylene-torches.html from Joe’s Hardware. Before the robot can request the page, however, it needs to check the robots.txt file to see if it has permission to fetch this page. In this example, the robots.txt file does not block the robot, so the robot fetches the page.

* Generally, if a resource is available over the public Internet, it is likely referenced somewhere. Few resources are truly private, with the web of links that exists on the Internet.

† See search results at http://www.google.com. A cached link, which is a copy of the page that the Google crawler retrieved and indexed, is available on most results.

The Robots Exclusion Standard

The Robots Exclusion Standard is an ad hoc standard. At the time of this writing, no

official standards body owns this standard, and vendors implement different subsets

of the standard. Still, some ability to manage robots’ access to web sites, even if

imperfect, is better than none at all, and most major vendors and search-engine

crawlers implement support for the exclusion standard.

There are three revisions of the Robots Exclusion Standard, though the naming of the

versions is not well defined. We adopt the version numbering shown in Table 9-2.

Figure 9-6. Fetching robots.txt and verifying accessibility before crawling the target file

[Figure 9-6: The web robot client first sends GET /robots.txt to www.joes-hardware.com. The robot parses the robots.txt file and determines if it is allowed to access the acetylene-torches.html file. It is, so it proceeds with the request GET /specials/acetylene-torches.html.]

Table 9-2. Robots Exclusion Standard versions

Version | Title and description | Date
0.0 | A Standard for Robot Exclusion: Martijn Koster’s original robots.txt mechanism with Disallow directive | June 1994
1.0 | A Method for Web Robots Control: Martijn Koster’s IETF draft with additional support for Allow | Nov. 1996
2.0 | An Extended Standard for Robot Exclusion: Sean Conner’s extension including regex and timing information; not widely supported | Nov. 1996


Most robots today adopt the v0.0 or v1.0 standards. The v2.0 standard is much more

complicated and hasn’t been widely adopted. It may never be. We’ll focus on the v1.0

standard here, because it is in wide use and is fully compatible with v0.0.

Web Sites and robots.txt Files

Before visiting any URLs on a web site, a robot must retrieve and process the robots.txt file on the web site, if it is present.* There is a single robots.txt resource for the entire web site defined by the hostname and port number. If the site is virtually hosted, there can be a different robots.txt file for each virtual docroot, as with any other file.

Currently, there is no way to install “local” robots.txt files in individual subdirecto-

ries of a web site. The webmaster is responsible for creating an aggregate robots.txt

file that describes the exclusion rules for all content on the web site.
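Because the robots.txt resource is identified by hostname and port alone, a robot can derive its location from any URL on the site. The helper below is a sketch with an assumed name, built on Python’s standard `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(url):
    """Return the robots.txt URL that governs the site hosting `url`."""
    parts = urlsplit(url)
    # netloc keeps the hostname and any explicit port, which together identify the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```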

Fetching robots.txt

Robots fetch the robots.txt resource using the HTTP GET method, like any other file

on the web server. The server returns the robots.txt file, if present, in a text/plain

body. If the server responds with a 404 Not Found HTTP status code, the robot can

assume that there are no robotic access restrictions and that it can request any file.

Robots should pass along identifying information in the From and User-Agent head-

ers to help site administrators track robotic accesses and to provide contact informa-

tion in the event that the site administrator needs to inquire or complain about the

robot. Here’s an example HTTP crawler request from a commercial web robot:

GET /robots.txt HTTP/1.0

Host: www.joes-hardware.com

User-Agent: Slurp/2.0

Date: Wed Oct 3 20:22:48 EST 2001

Response codes

Many web sites do not have a robots.txt resource, but the robot doesn’t know that. It

must attempt to get the robots.txt resource from every site. The robot takes different

actions depending on the result of the robots.txt retrieval:

• If the server responds with a success status (HTTP status code 2XX), the robot

must parse the content and apply the exclusion rules to fetches from that site.

• If the server response indicates the resource does not exist (HTTP status code

404), the robot can assume that no exclusion rules are active and that access to

the site is not restricted by robots.txt.

* Even though we say “robots.txt file,” there is no reason that the robots.txt resource must strictly reside in a

filesystem. For example, the robots.txt resource could be dynamically generated by a gateway application.


• If the server response indicates access restrictions (HTTP status code 401 or 403), the robot should regard access to the site as completely restricted.

• If the request attempt results in temporary failure (HTTP status code 503), the

robot should defer visits to the site until the resource can be retrieved.

• If the server response indicates redirection (HTTP status code 3XX), the robot

should follow the redirects until the resource is found.
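The rules above can be sketched as a small decision function. The returned labels are names invented for this example; the last branch covers status codes for which the list gives no rule.

```python
def robots_txt_policy(status):
    """Map the status of a robots.txt fetch to the action described above."""
    if 200 <= status < 300:
        return "parse"        # apply the exclusion rules found in the body
    if 300 <= status < 400:
        return "follow"       # follow the redirect, then classify that response
    if status == 404:
        return "allow_all"    # no robots.txt: access is not restricted
    if status in (401, 403):
        return "deny_all"     # treat the whole site as off limits
    if status == 503:
        return "retry_later"  # defer visits until robots.txt can be retrieved
    return "unspecified"      # the rules above do not cover this status
```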

robots.txt File Format

The robots.txt file has a very simple, line-oriented syntax. There are three types of

lines in a robots.txt file: blank lines, comment lines, and rule lines. Rule lines look like

HTTP headers (<Field>: <value>) and are used for pattern matching. For example:

# this robots.txt file allows Slurp & Webcrawler to crawl

# the public parts of our site, but no other robots...

User-Agent: slurp

User-Agent: webcrawler

Disallow: /private

User-Agent: *

Disallow: /

The lines in a robots.txt file are logically separated into “records.” Each record

describes a set of exclusion rules for a particular set of robots. This way, different

exclusion rules can be applied to different robots.

Each record consists of a set of rule lines, terminated by a blank line or end-of-file

character. A record starts with one or more User-Agent lines, specifying which robots

are affected by this record, followed by Disallow and Allow lines that say what URLs

these robots can access.*

The previous example shows a robots.txt file that allows the Slurp and Webcrawler

robots to access any file except those files in the private subdirectory. The same file

also prevents any other robots from accessing anything on the site.

Let’s look at the User-Agent, Disallow, and Allow lines.

The User-Agent line

Each robot’s record starts with one or more User-Agent lines, of the form:

User-Agent: <robot-name>

or:

User-Agent: *

* For practical reasons, robot software should be robust and flexible with the end-of-line character. CR, LF,

and CRLF should all be supported.


The robot name (chosen by the robot implementor) is sent in the User-Agent header

of the robot’s HTTP GET request.

When a robot processes a robots.txt file, it must obey the record with either:

• The first robot name that is a case-insensitive substring of the robot’s name

• The first robot name that is “*”

If the robot can’t find a User-Agent line that matches its name, and can’t find a wild-

carded “User-Agent: *” line, no record matches, and access is unlimited.

Because the robot name matches case-insensitive substrings, be careful about false matches. For example, “User-Agent: bot” matches all the robots named Bot, Robot, Bottom-Feeder, Spambot, and Dont-Bother-Me.
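Record selection might be sketched as follows. The representation of records as `(agent_names, rules)` pairs is an assumption, as is the choice to prefer a named match over a wildcard record that appears earlier in the file.

```python
def select_record(records, robot_name):
    """Pick the rules for robot_name from (agent_names, rules) pairs in file order."""
    robot = robot_name.lower()
    wildcard = None
    for agents, rules in records:
        for agent in agents:
            if agent == "*":
                if wildcard is None:
                    wildcard = rules      # remember the first wildcard record
            elif agent.lower() in robot:  # case-insensitive substring match
                return rules
    return wildcard  # None means no record applies, so access is unlimited
```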

The Disallow and Allow lines

The Disallow and Allow lines immediately follow the User-Agent lines of a robot

exclusion record. They describe which URL paths are explicitly forbidden or explic-

itly allowed for the specified robots.

The robot must match the desired URL against all of the Disallow and Allow rules

for the exclusion record, in order. The first match found is used. If no match is

found, the URL is allowed.*

For an Allow/Disallow line to match a URL, the rule path must be a case-sensitive

prefix of the URL path. For example, “Disallow: /tmp” matches all of these URLs:

http://www.joes-hardware.com/tmp

http://www.joes-hardware.com/tmp/

http://www.joes-hardware.com/tmp/pliers.html

http://www.joes-hardware.com/tmpspc/stuff.txt

Disallow/Allow prefix matching

Here are a few more details about Disallow/Allow prefix matching:

• Disallow and Allow rules require case-sensitive prefix matches. The asterisk has

no special meaning (unlike in User-Agent lines), but the universal wildcarding

effect can be obtained from the empty string.

• Any “escaped” characters (%XX) in the rule path or the URL path are unes-

caped back into bytes before comparison (with the exception of %2F, the for-

ward slash, which must match exactly).

• If the rule path is the empty string, it matches everything.

Table 9-3 lists several examples of matching between rule paths and URL paths.
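The matching rules just described can be sketched as below. The function names are assumptions, and the code follows the rules as stated in this section: escapes are decoded to bytes before comparison, except %2F, which is kept in escaped form so that it never matches a literal slash.

```python
import re
from urllib.parse import unquote_to_bytes


def _normalize(path):
    """Decode %XX escapes to bytes, but keep %2F distinct from a literal slash."""
    out = []
    for piece in re.split(r"(%2[fF])", path):
        if piece.lower() == "%2f":
            out.append(b"%2F")   # escaped slash must match exactly
        else:
            # Re-escape any literal percent so it cannot be confused with %2F.
            out.append(unquote_to_bytes(piece).replace(b"%", b"%25"))
    return b"".join(out)


def rule_matches(rule_path, url_path):
    """Return True if rule_path is a case-sensitive prefix of url_path."""
    return _normalize(url_path).startswith(_normalize(rule_path))
```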

* The robots.txt URL always is allowed and must not appear in the Allow/Disallow rules.


Prefix matching usually works pretty well, but there are a few places where it is not

expressive enough. If there are particular subdirectories for which you also want to

disallow crawling, regardless of what the prefix of the path is, robots.txt provides no

means for this. For example, you might want to avoid crawling of RCS version con-

trol subdirectories. Version 1.0 of the robots.txt scheme provides no way to support

this, other than separately enumerating every path to every RCS subdirectory.

Other robots.txt Wisdom

Here are some other rules with respect to parsing the robots.txt file:

• The robots.txt file may contain fields other than User-Agent, Disallow, and

Allow, as the specification evolves. A robot should ignore any field it doesn’t

understand.

• For backward compatibility, breaking of lines is not allowed.

• Comments are allowed anywhere in the file; they consist of optional whitespace,

followed by a comment character (#) followed by the comment, until the end-of-

line character.

• Version 0.0 of the Robots Exclusion Standard didn’t support the Allow line.

Some robots implement only the Version 0.0 specification and ignore Allow

lines. In this situation, a robot will behave conservatively, not retrieving URLs

that are permitted.

Caching and Expiration of robots.txt

If a robot had to refetch a robots.txt file before every file access, it would double the load on web servers, as well as making the robot less efficient. Instead, robots are expected to fetch the robots.txt file periodically and cache the results. The cached copy of robots.txt should be used by the robot until the robots.txt file expires. Standard HTTP cache-control mechanisms are used by both the origin server and robots to control the caching of the robots.txt file. Robots should take note of Cache-Control and Expires headers in the HTTP response.*

Table 9-3. Robots.txt path matching examples

Rule path | URL path | Match? | Comments
/tmp | /tmp | ✓ | Rule path == URL path
/tmp | /tmpfile.html | ✓ | Rule path is a prefix of URL path
/tmp | /tmp/a.html | ✓ | Rule path is a prefix of URL path
/tmp/ | /tmp | ✗ | /tmp/ is not a prefix of /tmp
(empty) | /README.TXT | ✓ | Empty rule path matches everything
/~fred/hi.html | /%7Efred/hi.html | ✓ | %7E is treated the same as ~
/%7Efred/hi.html | /~fred/hi.html | ✓ | %7E is treated the same as ~
/%7efred/hi.html | /%7Efred/hi.html | ✓ | Case isn’t significant in escapes
/~fred/hi.html | /~fred%2Fhi.html | ✗ | %2F is slash, but slash is a special case that must match exactly

Many production crawlers today are not HTTP/1.1 clients; webmasters should note

that those crawlers will not necessarily understand the caching directives provided

for the robots.txt resource.

If no Cache-Control directives are present, the draft specification allows caching for

seven days. But, in practice, this often is too long. Web server administrators who

did not know about robots.txt often create one in response to a robotic visit, but if

the lack of a robots.txt file is cached for a week, the newly created robots.txt file will

appear to have no effect, and the site administrator will accuse the robot administra-

tor of not adhering to the Robots Exclusion Standard.†
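A cache along these lines is sketched below. The class name, the one-day default lifetime (chosen to mirror the daily refetch practice rather than the seven-day allowance), and the injectable clock are all assumptions. A missing robots.txt is cached as `None` so that it, too, expires and is rechecked.

```python
import time

DEFAULT_LIFETIME = 24 * 60 * 60  # assumed: refetch daily when no expiry is known


class RobotsCache:
    """Cache robots.txt bodies per site until their freshness lifetime runs out."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.entries = {}  # site -> (content or None, fresh_until)

    def store(self, site, content, max_age=None):
        lifetime = DEFAULT_LIFETIME if max_age is None else max_age
        self.entries[site] = (content, self.clock() + lifetime)

    def lookup(self, site):
        """Return (hit, content); a hit with content None records a missing file."""
        entry = self.entries.get(site)
        if entry is None or self.clock() >= entry[1]:
            return (False, None)  # unknown or expired: fetch robots.txt again
        return (True, entry[0])
```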

Robot Exclusion Perl Code

A few publicly available Perl libraries exist to interact with robots.txt files. One example is the WWW::RobotRules module available from the CPAN public Perl archive.

The parsed robots.txt file is kept in the WWW::RobotRules object, which provides methods to check if access to a given URL is prohibited. The same WWW::RobotRules object can parse multiple robots.txt files.

Here are the primary methods in the WWW::RobotRules API:

Create RobotRules object

$rules = WWW::RobotRules->new($robot_name);

Load the robots.txt file

$rules->parse($url, $content, $fresh_until);

Check if a site URL is fetchable

$can_fetch = $rules->allowed($url);

Here’s a short Perl program that demonstrates the use of WWW::RobotRules:

use WWW::RobotRules;
use LWP::Simple qw(get);

# Create the RobotRules object, naming the robot "SuperRobot"
my $robotsrules = WWW::RobotRules->new('SuperRobot/1.0');

# Get and parse the robots.txt file for Joe's Hardware, accumulating the rules
my $url = "http://www.joes-hardware.com/robots.txt";
my $robots_txt = get $url;
# get returns undef on failure, so parse only when something was fetched
$robotsrules->parse($url, $robots_txt) if defined $robots_txt;

* See “Keeping Copies Fresh” in Chapter 7 for more on handling caching directives.

† Several large-scale web crawlers use the rule of refetching robots.txt daily when actively crawling the Web.


# Get and parse the robots.txt file for Mary's Antiques, accumulating the rules
$url = "http://www.marys-antiques.com/robots.txt";
$robots_txt = get $url;
$robotsrules->parse($url, $robots_txt) if defined $robots_txt;

# Now RobotRules contains the set of robot exclusion rules for several
# different sites. It keeps them all separate. Now we can use RobotRules
# to test if a robot is allowed to access various URLs.
if ($robotsrules->allowed($some_target_url))
{
    # Fetch the target URL itself, not the robots.txt URL still held in $url
    $c = get $some_target_url;
    ...
}

The following is a hypothetical robots.txt file for www.marys-antiques.com:

#####################################################################

# This is the robots.txt file for Mary's Antiques web site

#####################################################################

# Keep Suzy's robot out of all the dynamic URLs because it doesn't

# understand them, and out of all the private data, except for the

# small section Mary has reserved on the site for Suzy.

User-Agent: Suzy-Spider

Disallow: /dynamic

Allow: /private/suzy-stuff

Disallow: /private

# The Furniture-Finder robot was specially designed to understand

# Mary's antique store's furniture inventory program, so let it

# crawl that resource, but keep it out of all the other dynamic

# resources and out of all the private data.

User-Agent: Furniture-Finder

Allow: /dynamic/check-inventory

Disallow: /dynamic

Disallow: /private

# Keep everyone else out of the dynamic gateways and private data.

User-Agent: *

Disallow: /dynamic

Disallow: /private

This robots.txt file contains a record for the robot called Suzy-Spider, a record for the robot called Furniture-Finder, and a default record for all other robots. Each record applies a different set of access policies to the different robots:

• The exclusion record for Suzy-Spider keeps the robot out of the dynamic gateway URLs that start with /dynamic and out of the private user data, except for the section reserved for Suzy.

• The record for the Furniture-Finder robot permits the robot to crawl the furniture inventory gateway URL. Perhaps this robot understands the format and rules of Mary’s gateway.

• All other robots are kept out of all the dynamic and private web pages, though they can crawl the remainder of the URLs.

Table 9-4 lists some examples for different robot accessibility to the Mary’s Antiques

web site.
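For readers who prefer to check these access decisions mechanically, Python’s standard `urllib.robotparser` module plays a role similar to the Perl library shown earlier. The sketch below feeds it the rule lines of the Mary’s Antiques file (comments omitted); the `can_fetch` wrapper and the robot name NosyBot used in the checks are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

MARYS_ROBOTS_TXT = """\
User-Agent: Suzy-Spider
Disallow: /dynamic
Allow: /private/suzy-stuff
Disallow: /private

User-Agent: Furniture-Finder
Allow: /dynamic/check-inventory
Disallow: /dynamic
Disallow: /private

User-Agent: *
Disallow: /dynamic
Disallow: /private
"""

SITE = "http://www.marys-antiques.com"

rules = RobotFileParser()
rules.parse(MARYS_ROBOTS_TXT.splitlines())  # parse from memory, no network access


def can_fetch(robot, path):
    """Return True if `robot` may fetch `path` on the Mary's Antiques site."""
    return rules.can_fetch(robot, SITE + path)
```

Calling `can_fetch("Suzy-Spider", "/private/suzy-stuff/taxes.txt")` and similar checks for the other robots gives a quick way to confirm how a given parser reads the file.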

HTML Robot-Control META Tags

The robots.txt file allows a site administrator to exclude robots from some or all of a

web site. One of the disadvantages of the robots.txt file is that it is owned by the web

site administrator, not the author of the individual content.

HTML page authors have a more direct way of restricting robots from individual

pages. They can add robot-control tags to the HTML documents directly. Robots

that adhere to the robot-control HTML tags will still be able to fetch the documents,

but if a robot exclusion tag is present, they will disregard the documents. For exam-

ple, an Internet search-engine robot would not include the document in its search

index. As with the robots.txt standard, participation is encouraged but not enforced.

Robot exclusion tags are implemented using HTML META tags, using the form:

<META NAME="ROBOTS" CONTENT=directive-list>

Robot META directives

There are several types of robot META directives, and new directives are likely to

be added over time and as search engines and their robots expand their activities

and feature sets. The two most-often-used robot META directives are:

NOINDEX

Tells a robot not to process the page’s content and to disregard the document

(i.e., not include the content in any index or database).

Table 9-4. Robot accessibility to the Mary’s Antiques web site

URL | Suzy-Spider | Furniture-Finder | NosyBot
http://www.marys-antiques.com/ | ✓ | ✓ | ✓
http://www.marys-antiques.com/index.html | ✓ | ✓ | ✓
http://www.marys-antiques.com/private/payroll.xls | ✗ | ✗ | ✗
http://www.marys-antiques.com/private/suzy-stuff/taxes.txt | ✓ | ✗ | ✗
http://www.marys-antiques.com/dynamic/buy-stuff?id=3546 | ✗ | ✗ | ✗
http://www.marys-antiques.com/dynamic/check-inventory?kitchen | ✗ | ✓ | ✗

NOFOLLOW

Tells a robot not to crawl any outgoing links from the page.

In addition to NOINDEX and NOFOLLOW, there are the opposite INDEX and

FOLLOW directives, the NOARCHIVE directive, and the ALL and NONE direc-

tives. These robot META tag directives are summarized as follows:

INDEX

Tells a robot that it may index the contents of the page.

FOLLOW

Tells a robot that it may crawl any outgoing links in the page.

NOARCHIVE

Tells a robot that it should not cache a local copy of the page.*

ALL

Equivalent to INDEX, FOLLOW.

NONE

Equivalent to NOINDEX, NOFOLLOW.

The robot META tags, like all HTML META tags, must appear in the HEAD section

of an HTML page:

<html>
<head>
<meta name="robots" content="noindex,nofollow">
</head>
<body>
...
</body>
</html>

Note that the “robots” name of the tag and the content are case-insensitive.

You obviously should not specify conflicting or repeating directives, such as:

<meta name="robots" content="INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW">

the behavior of which likely is undefined and certainly will vary from robot
implementation to robot implementation.
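To make the directives concrete, here is a minimal sketch of how a robot might read them, using Python's standard html.parser module. The class and function names are hypothetical, and the rule that a restrictive directive wins over a conflicting permissive one is an assumption of this sketch, not behavior the text guarantees for any real robot.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().upper()
            if token:
                self.directives.add(token)


def robot_policy(html):
    """Return (may_index, may_follow) for a page.

    Assumed policy: with no directives the page may be indexed and its
    links followed, and a restrictive directive beats a permissive one.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"NOINDEX", "NONE"} & found)
    may_follow = not ({"NOFOLLOW", "NONE"} & found)
    return may_index, may_follow


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print(robot_policy(page))
```

Because both the tag name and the directive values are normalized before comparison, the sketch treats "ROBOTS" and "robots", or "NoIndex" and "NOINDEX", as equivalent.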

Search engine META tags

We just discussed robots META tags, used to control the crawling and indexing

activity of web robots. All robots META tags contain the name="robots" attribute.

* This META tag was introduced by the folks who run the Google search engine as a way for webmasters to

opt out of allowing Google to serve cached pages of their content. It also can be used with META

NAME="googlebot".

Many other types of META tags are available, including those shown in Table 9-5.

The DESCRIPTION and KEYWORDS META tags are useful for content-indexing

search-engine robots.

Robot Etiquette

In 1993, Martijn Koster, a pioneer in the web robot community, wrote up a list of

guidelines for authors of web robots. While some of the advice is dated, much of it

still is quite useful. Martijn’s original treatise, “Guidelines for Robot Writers,” can be

found at http://www.robotstxt.org/wc/guidelines.html.

Table 9-6 provides a modern update for robot designers and operators, based

heavily on the spirit and content of the original list. Most of these guidelines are tar-

geted at World Wide Web robots; however, they are applicable to smaller-scale

crawlers too.

Table 9-5. Additional META tag directives

name=              content=       Description

DESCRIPTION        <text>         Allows an author to define a short text summary of the web
                                  page. Many search engines look at META DESCRIPTION tags,
                                  allowing page authors to specify appropriate short abstracts
                                  to describe their web pages.
                                  <meta name="description" content="Welcome to Mary's Antiques web site">

KEYWORDS           <comma list>   Associates a comma-separated list of words that describe the
                                  web page, to assist in keyword searches.
                                  <meta name="keywords" content="antiques,mary,furniture,restoration">

REVISIT-AFTER [a]  <no. days>     Instructs the robot or search engine that the page should be
                                  revisited, presumably because it is subject to change, after
                                  the specified number of days.

[a] This directive is not likely to have wide support.

Table 9-6. Guidelines for web robot operators

Guideline                   Description

(1) Identification

Identify Your Robot         Use the HTTP User-Agent field to tell web servers the name of
                            your robot. This will help administrators understand what your
                            robot is doing. Some robots also include a URL describing the
                            purpose and policies of the robot in the User-Agent header.

Identify Your Machine       Make sure your robot runs from a machine with a DNS entry, so
                            web sites can reverse-DNS the robot IP address into a hostname.
                            This will help the administrator identify the organization
                            responsible for the robot.

Identify a Contact          Use the HTTP From field to provide a contact email address.

(2) Operations

Be Alert                    Your robot will generate questions and complaints. Some of this
                            is caused by robots that run astray. You must be cautious and
                            watchful that your robot is behaving correctly. If your robot
                            runs around the clock, you need to be extra careful. You may
                            need to have operations people monitoring the robot 24×7 until
                            your robot is well seasoned.

Be Prepared                 When you begin a major robotic journey, be sure to notify people
                            at your organization. Your organization will want to watch for
                            network bandwidth consumption and be ready for any public
                            inquiries.

Monitor and Log             Your robot should be richly equipped with diagnostics and
                            logging, so you can track progress, identify any robot traps,
                            and sanity check that everything is working right. We cannot
                            stress enough the importance of monitoring and logging a robot’s
                            behavior. Problems and complaints will arise, and having
                            detailed logs of a crawler’s behavior can help a robot operator
                            backtrack to what has happened. This is important not only for
                            debugging your errant web crawler but also for defending its
                            behavior against unjustified complaints.

Learn and Adapt             Each crawl, you will learn new things. Adapt your robot so it
                            improves each time and avoids the common pitfalls.

(3) Limit Yourself

Filter on URL               If a URL looks like it refers to data that you don’t understand
                            or are not interested in, you might want to skip it. For
                            example, URLs ending in “.Z”, “.gz”, “.tar”, or “.zip” are
                            likely to be compressed files or archives. URLs ending in “.exe”
                            are likely to be programs. URLs ending in “.gif”, “.tif”, or
                            “.jpg” are likely to be images. Make sure you get what you are
                            after.

Filter Dynamic URLs         Usually, robots don’t want to crawl content from dynamic
                            gateways. The robot won’t know how to properly format and post
                            queries to gateways, and the results are likely to be erratic or
                            transient. If a URL contains “cgi” or has a “?”, the robot may
                            want to avoid crawling the URL.

Filter with Accept Headers  Your robot should use HTTP Accept headers to tell servers what
                            kind of content it understands.

Adhere to robots.txt        Your robot should adhere to the robots.txt controls on the site.

Throttle Yourself           Your robot should count the number of accesses to each site and
                            when they occurred, and use this information to ensure that it
                            doesn’t visit any site too frequently. When a robot accesses a
                            site more frequently than every few minutes, administrators get
                            suspicious. When a robot accesses a site every few seconds, some
                            administrators get angry. When a robot hammers a site as fast as
                            it can, shutting out all other traffic, administrators will be
                            furious.

                            In general, you should limit your robot to a few requests per
                            minute maximum, and ensure a few seconds between each request.
                            You also should limit the total number of accesses to a site, to
                            prevent loops.

(4) Tolerate Loops and Dups and Other Problems

Handle All Return Codes     You must be prepared to handle all HTTP status codes, including
                            redirects and errors. You should also log and monitor these
                            codes. A large number of non-success results on a site should
                            cause investigation. It may be that many URLs are stale, or the
                            server refuses to serve documents to robots.

Canonicalize URLs           Try to remove common aliases by normalizing all URLs into a
                            standard form.

Aggressively Avoid Cycles   Work very hard to detect and avoid cycles. Treat the process of
                            operating a crawl as a feedback loop. The results of problems
                            and their resolutions should be fed back into the next crawl,
                            making your crawler better with each iteration.

Monitor for Traps           Some types of cycles are intentional and malicious. These may be
                            intentionally hard to detect. Monitor for large numbers of
                            accesses to a site with strange URLs. These may be traps.

Maintain a Blacklist        When you find traps, cycles, broken sites, and sites that want
                            your robot to stay away, add them to a blacklist, and don’t
                            visit them again.

(5) Scalability

Understand Space            Work out the math in advance for how large a problem you are
                            solving. You may be surprised how much memory your application
                            will require to complete a robotic task, because of the huge
                            scale of the Web.

Understand Bandwidth        Understand how much network bandwidth you have available and how
                            much you will need to complete your robotic task in the required
                            time. Monitor the actual usage of network bandwidth. You
                            probably will find that the outgoing bandwidth (requests) is
                            much smaller than the incoming bandwidth (responses). By
                            monitoring network usage, you also may find the potential to
                            better optimize your robot, allowing it to take better advantage
                            of the network bandwidth by better usage of its TCP
                            connections.[a]

Understand Time             Understand how long it should take for your robot to complete
                            its task, and sanity check that the progress matches your
                            estimate. If your robot is way off your estimate, there probably
                            is a problem worth investigating.

Divide and Conquer          For large-scale crawls, you will likely need to apply more
                            hardware to get the job done, either using big multiprocessor
                            servers with multiple network cards, or using multiple smaller
                            computers working in unison.

(6) Reliability

Test Thoroughly             Test your robot thoroughly internally before unleashing it on
                            the world. When you are ready to test off-site, run a few,
                            small, maiden voyages first. Collect lots of results and analyze
                            your performance and memory use, estimating how they will scale
                            up to the larger problem.

Checkpoint                  Any serious robot will need to save a snapshot of its progress,
                            from which it can restart on failure. There will be failures:
                            you will find software bugs, and hardware will fail. Large-scale
                            robots can’t start from scratch each time this happens. Design
                            in a checkpoint/restart feature from the beginning.

Fault Resiliency            Anticipate failures, and design your robot to be able to keep
                            making progress when they occur.

(7) Public Relations

Be Prepared                 Your robot probably will upset a number of people. Be prepared
                            to respond quickly to their enquiries. Make a web page policy
                            statement describing your robot, and include detailed
                            instructions on how to create a robots.txt file.

Be Understanding            Some of the people who contact you about your robot will be well
                            informed and supportive; others will be naïve. A few will be
                            unusually angry. Some may well seem insane. It’s generally
                            unproductive to argue the importance of your robotic endeavor.
                            Explain the Robots Exclusion Standard, and if they are still
                            unhappy, remove the complainant URLs immediately from your crawl
                            and add them to the blacklist.

Be Responsive               Most unhappy webmasters are just unclear about robots. If you
                            respond immediately and professionally, 90% of the complaints
                            will disappear quickly. On the other hand, if you wait several
                            days before responding, while your robot continues to visit a
                            site, expect to find a very vocal, angry opponent.

[a] See Chapter 4 for more on optimizing TCP performance.
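A few of the guidelines in Table 9-6 translate directly into code. The sketch below illustrates the identification headers, the two URL filters, and per-site throttling. Everything named here is hypothetical: the robot name, the contact address, the policy URL, the five-second delay, and the 1,000-request cap are illustrative assumptions, not recommended values.

```python
import time
from urllib.parse import urlsplit

# Hypothetical identification headers ("Identify Your Robot", "Identify a Contact").
ROBOT_HEADERS = {
    "User-Agent": "ExampleBot/0.1 (+http://www.example.com/bot.html)",
    "From": "bot-admin@example.com",
    "Accept": "text/html",
}

# Suffixes the table lists as archives, programs, and images.
SKIP_SUFFIXES = (".z", ".gz", ".tar", ".zip", ".exe", ".gif", ".tif", ".jpg")


def worth_crawling(url):
    """Apply the 'Filter on URL' and 'Filter Dynamic URLs' heuristics."""
    if "?" in url or "cgi" in url.lower():
        return False
    return not urlsplit(url).path.lower().endswith(SKIP_SUFFIXES)


class SiteThrottle:
    """Tracks per-host accesses to enforce a minimum delay and a total cap."""

    def __init__(self, min_delay=5.0, max_requests=1000, clock=time.monotonic):
        self.min_delay = min_delay
        self.max_requests = max_requests
        self.clock = clock
        self.last_access = {}
        self.count = {}

    def seconds_to_wait(self, url):
        """Seconds to wait before fetching url, or None if the site cap is reached."""
        host = urlsplit(url).hostname
        if self.count.get(host, 0) >= self.max_requests:
            return None
        last = self.last_access.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Note that url was just fetched."""
        host = urlsplit(url).hostname
        self.last_access[host] = self.clock()
        self.count[host] = self.count.get(host, 0) + 1
```

A crawl loop would call worth_crawling() before queueing a URL, sleep for seconds_to_wait() before each fetch, and call record() afterward. The clock is injectable so the throttle can be exercised without real delays.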

Search Engines

The most widespread web robots are used by Internet search engines. Internet search

engines allow users to find documents about any subject all around the world.

Many of the most popular sites on the Web today are search engines. They serve as a

starting point for many web users and provide the invaluable service of helping users

find the information in which they are interested.

Web crawlers feed Internet search engines, by retrieving the documents that exist on

the Web and allowing the search engines to create indexes of what words appear in

what documents, much like the index at the back of this book. Search engines are

the leading source of web robots—let’s take a quick look at how they work.

Think Big

When the Web was in its infancy, search engines were relatively simple databases

that helped users locate documents on the Web. Today, with the billions of pages

accessible on the Web, search engines have become essential in helping Internet

users find information. They also have become quite complex, as they have had to

evolve to handle the sheer scale of the Web.

With billions of web pages and many millions of users looking for information,

search engines have to deploy sophisticated crawlers to retrieve these billions of web

pages, as well as sophisticated query engines to handle the query load that millions

of users generate.

Think about the task of a production web crawler, having to issue billions of HTTP

queries in order to retrieve the pages needed by the search index. If each request took

half a second to complete (which is probably slow for some servers and fast for oth-

ers*), that still takes (for 1 billion documents):

0.5 seconds × (1,000,000,000) / ((60 sec/min) × (60 min/hour) × (24 hour/day))

which works out to roughly 5,800 days if the requests are made sequentially! Clearly,

large-scale crawlers need to be more clever, parallelizing requests and using banks of

machines to complete the task. However, because of its scale, trying to crawl the

entire Web still is a daunting challenge.
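The arithmetic is easy to reproduce, and doing so shows how much parallelism helps. This is a back-of-the-envelope sketch only: the function name is hypothetical, the 1,000 simultaneous requests are an assumed figure for illustration, and the model ignores DNS lookups, retries, politeness delays, and bandwidth limits.

```python
SECONDS_PER_DAY = 60 * 60 * 24  # 60 sec/min x 60 min/hour x 24 hour/day


def crawl_days(documents, seconds_per_request=0.5, parallel_requests=1):
    """Days needed to fetch `documents` pages, ignoring all other overhead."""
    total_seconds = documents * seconds_per_request / parallel_requests
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Sequential crawl of one billion documents.
    print(round(crawl_days(1_000_000_000)))  # 5787
    # The same crawl with an assumed 1,000 requests in flight at once.
    print(round(crawl_days(1_000_000_000, parallel_requests=1000), 1))  # 5.8
```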

Modern Search Engine Architecture

Today’s search engines build complicated local databases, called “full-text indexes,”

about the web pages around the world and what they contain. These indexes act as a

sort of card catalog for all the documents on the Web.

* This depends on the resources of the server, the client robot, and the network between the two.

Search-engine crawlers gather up web pages and bring them home, adding them to

the full-text index. At the same time, search-engine users issue queries against the

full-text index through web search gateways such as HotBot (http://www.hotbot.com)

or Google (http://www.google.com). Because the web pages are changing all the time,

and because of the amount of time it can take to crawl a large chunk of the Web, the

full-text index is at best a snapshot of the Web.

The high-level architecture of a modern search engine is shown in Figure 9-7.

Full-Text Index

A full-text index is a database that takes a word and immediately tells you all the

documents that contain that word. The documents themselves do not need to be

scanned after the index is created.

Figure 9-8 shows three documents and the corresponding full-text index. The full-

text index lists the documents containing each word.

For example:

• The word “a” is in document A.

• The word “best” is in documents A and C.

• The word “drill” is in documents A and B.

• The word “routine” is in documents B and C.

• The word “the” is in all three documents, A, B, and C.
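A full-text index of this kind is often called an inverted index. As a minimal sketch, the three documents of Figure 9-8 can be indexed with a dictionary that maps each word to the set of documents containing it. The tokenizing rule (lowercase runs of letters and digits) is an assumption of this sketch, and document B is indexed only as far as its visible text goes.

```python
import re
from collections import defaultdict

DOCUMENTS = {
    "A": "We have the best tools, like the WorkMaster 5000 electric drill. "
         "Buy a drill from us today!",
    "B": "The routine fire drill turned into tragedy today",
    "C": "We know the best routine to lose fat.",
}


def build_index(documents):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def lookup(index, word):
    """Return the sorted IDs of documents containing word, without rescanning them."""
    return sorted(index.get(word.lower(), ()))


if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    for word in ("best", "drill", "routine", "the"):
        print(word, lookup(index, word))
```

Once the index is built, each lookup is a single dictionary access, which is why the documents themselves never need to be scanned again.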

Figure 9-7. A production search engine contains cooperating crawlers and query gateways

[Figure components: web search users (User); the query engine (web search gateway and full-text index database); crawling and indexing (search engine crawler/indexer and web server).]

Posting the Query

When a user issues a query to a web search-engine gateway, she fills out an HTML

form and her browser sends the form to the gateway, using an HTTP GET or POST

request. The gateway program extracts the search query and converts the web UI

query into the expression used to search the full-text index.*

Figure 9-9 shows a simple user query to the www.joes-hardware.com site. The user

types “drills” into the search box form, and the browser translates this into a GET

request with the query parameter as part of the URL.† The Joe’s Hardware web

server receives the query and hands it off to its search gateway application, which

returns the resulting list of documents to the web server, which in turn formats those

results into an HTML page for the user.
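The two ends of this exchange can be sketched with Python's standard urllib.parse module. The function names are hypothetical; the /search.html path and the query parameter name are taken from the example request in Figure 9-9.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(terms, base="http://www.joes-hardware.com/search.html"):
    """Browser side: encode the form field into the URL of a GET request."""
    return base + "?" + urlencode({"query": terms})


def extract_query(request_target):
    """Gateway side: pull the search terms back out of the request URL."""
    params = parse_qs(urlsplit(request_target).query)
    return params.get("query", [""])[0]


if __name__ == "__main__":
    url = build_search_url("drills")
    print(url)                 # http://www.joes-hardware.com/search.html?query=drills
    print(extract_query(url))  # drills
```

Encoding on the way out and decoding on the way in keeps spaces and other reserved characters in the search terms from corrupting the URL.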

Sorting and Presenting the Results

Once a search engine has used its index to determine the results of a query, the gate-

way application takes the results and cooks up a results page for the end user.

* The method for passing this query is dependent on the search solution being used.

† “Query Strings” in Chapter 2 discusses the common use of the query parameter in URLs.

Figure 9-8. Three documents and a full-text index

Document A: “We have the best tools, like the WorkMaster 5000 electric drill. Buy a drill from us today!”

Document B: “The routine fire drill turned into tragedy today . . .”

Document C: “We know the best routine to lose fat.”

Indexed words (Word column, each marked against documents A, B, and C): best, buy, drill, electric, fat, fire, from, have, into, know, lose, routine, the, today, tools, tragedy, turned, workmaster

Since many web pages can contain any given word, search engines deploy clever

algorithms to try to rank the results. For example, in Figure 9-8, the word “best”

appears in multiple documents; search engines need to know the order in which they

should present the list of result documents in order to present users with the most

relevant results. This is called relevancy ranking—the process of scoring and order-

ing a list of search results.

To better aid this process, many of the larger search engines actually use census data

collected during the crawl of the Web. For example, counting how many links point

to a given page can help determine its popularity, and this information can be used

to weight the order in which results are presented. The algorithms, tips from crawl-

ing, and other tricks used by search engines are some of their most guarded secrets.
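The link-counting idea can be illustrated with a toy example. This is only a sketch of the general notion of weighting by popularity: the function names are hypothetical, and because real search engines keep their ranking algorithms secret, nothing here should be read as how any particular engine works.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count links pointing at each page.

    links: iterable of (source_page, target_page) pairs seen during a crawl.
    Links from a page to itself are ignored.
    """
    return Counter(target for source, target in links if source != target)


def rank(matches, link_counts):
    """Order matching pages by inbound-link count, most linked first.

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(matches, key=lambda page: (-link_counts[page], page))


if __name__ == "__main__":
    crawl_links = [("x", "A"), ("y", "A"), ("z", "C"), ("A", "A")]
    counts = inbound_link_counts(crawl_links)
    print(rank(["C", "A", "B"], counts))  # ['A', 'C', 'B']
```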

Spoofing

Since users often get frustrated when they do not see what they are looking for in the

first few results of a search query, the order of search results can be important in

finding a site. There is a lot of incentive for webmasters to attempt to get their sites

listed near the top of the results sections for the words that they think best describe
their sites, particularly if the sites are commercial and are relying on users to find
them and use their services.

Figure 9-9. Example search query request

Client: The user fills out the HTML search form (with a GET action HTTP method) on the www.joes-hardware.com site in the browser (“Welcome to Joe’s Hardware. Search for: drills”) and hits Submit.

Request message:

GET /search.html?query=drills HTTP/1.1
Host: www.joes-hardware.com
Accept: *
User-agent: ShopBot

Search gateway: Query: “drills”. Results: File “BD.html”.

Response message:

HTTP/1.1 200 OK
Content-type: text/html
Content-length: 1037

<HTML>
<HEAD><TITLE>Search Results</TITLE>
<A HREF=/BD.html>Black and Decker Drills</A>
[...]

This desire for better listing has led to a lot of gaming of the search system and has

created a constant tug-of-war between search-engine implementors and those seek-

ing to get their sites listed prominently. Many webmasters list tons of keywords

(some irrelevant) and deploy fake pages, or spoofs—even gateway applications that

generate fake pages that may better trick the search engines’ relevancy algorithms for

particular words.

As a result of all this, search engine and robot implementors constantly have to

tweak their relevancy algorithms to better catch these spoofs.

For More Information

For more information on web robots, refer to:

http://www.robotstxt.org/wc/robots.html

The Web Robots Pages—resources for robot developers, including the registry

of Internet Robots.

http://www.searchengineworld.com

Search Engine World—resources for search engines and robots.

http://www.searchtools.com

Search Tools for Web Sites and Intranets—resources for search tools and robots.

http://search.cpan.org/doc/ILYAZ/perl_ste/WWW/RobotRules.pm

RobotRules Perl source.

http://www.conman.org/people/spc/robots2.html

An Extended Standard for Robot Exclusion.

Managing Gigabytes: Compressing and Indexing Documents and Images

Witten, I., Moffat, A., and Bell, T., Morgan Kaufmann.

CHAPTER 10

HTTP-NG

As this book nears completion, HTTP is celebrating its tenth birthday. And it has

been quite an accomplished decade for this Internet protocol. Today, HTTP moves

the absolute majority of digital traffic around the world.

But as HTTP grows into its teenage years it faces a few challenges. In some ways, the

pace of HTTP adoption has gotten ahead of its design. Today, people are using

HTTP as a foundation for many diverse applications, over many different network-

ing technologies.

This chapter outlines some of the trends and challenges for the future of HTTP, and

a proposal for a next-generation architecture called HTTP-NG. While the working

group for HTTP-NG has disbanded and its rapid adoption now appears unlikely, it

nonetheless outlines some potential future directions of HTTP.

HTTP’s Growing Pains

HTTP originally was conceived as a simple technique for accessing linked multime-

dia content from distributed information servers. But, over the past decade, HTTP

and its derivatives have taken on a much broader role.

HTTP/1.1 now provides tagging and fingerprinting to track document versions,

methods to support document uploading and interactions with programmatic gate-

ways, support for multilingual content, security and authentication, caching to

reduce traffic, pipelining to reduce latency, persistent connections to reduce startup

time and improve bandwidth, and range accesses to implement partial updates.

Extensions and derivatives of HTTP have gone even further, supporting document

publishing, application serving, arbitrary messaging, video streaming, and founda-

tions for wireless multimedia access. HTTP is becoming a kind of “operating sys-

tem” for distributed media applications.

The design of HTTP/1.1, while well considered, is beginning to show some strains as

HTTP is used more and more as a unified substrate for complex remote operations.

There are at least four areas where HTTP shows some growing pains:

Complexity

HTTP is quite complex, and its features are interdependent. It is decidedly pain-

ful and error-prone to correctly implement HTTP software, because of the com-

plex, interwoven requirements and the intermixing of connection management,

message handling, and functional logic.

Extensibility

HTTP is difficult to extend incrementally. There are many legacy HTTP applica-

tions that create incompatibilities for protocol extensions, because they contain

no technology for autonomous functionality extensions.

Performance

HTTP has performance inefficiencies. Many of these inefficiencies will become

more serious with widespread adoption of high-latency, low-throughput wire-

less access technologies.

Transport dependence

HTTP is designed around a TCP/IP network stack. While there are no restric-

tions against alternative substacks, there has been little work in this area. HTTP

needs to provide better support for alternative substacks for it to be useful as a

broader messaging platform in embedded and wireless applications.

HTTP-NG Activity

In the summer of 1997, the World Wide Web Consortium launched a special project

to investigate and propose a major new version of HTTP that would fix the prob-

lems related to complexity, extensibility, performance, and transport dependence.

This new HTTP was called HTTP: The Next Generation (HTTP-NG).

A set of HTTP-NG proposals was presented at an IETF meeting in December 1998.

These proposals outlined one possible major evolution of HTTP. This technology

has not been widely implemented (and may never be), but HTTP-NG does represent

the most serious effort toward extending the lineage of HTTP. Let’s look at HTTP-

NG in more detail.

Modularize and Enhance

The theme of HTTP-NG can be captured in three words: “modularize and enhance.”

Instead of having connection management, message handling, server processing

logic, and protocol methods all intermixed, the HTTP-NG working group proposed

modularizing the protocol into three layers, illustrated in Figure 10-1:

• Layer 1, the message transport layer, focuses on delivering opaque messages

between endpoints, independent of the function of the messages. The message

transport layer supports various substacks (for example, stacks for wireless envi-

ronments) and focuses on the problems of efficient message delivery and han-

dling. The HTTP-NG project team proposed a protocol called WebMUX for this

layer.

• Layer 2, the remote invocation layer, defines request/response functionality

where clients can invoke operations on server resources. This layer is indepen-

dent of message transport and of the precise semantics of the operations. It just

provides a standard way of invoking any server operation. This layer attempts to

provide an extensible, object-oriented framework more like CORBA, DCOM,

and Java RMI than like the static, server-defined methods of HTTP/1.1. The

HTTP-NG project team proposed the Binary Wire Protocol for this layer.

• Layer 3, the web application layer, provides most of the content-management

logic. All of the HTTP/1.1 methods (GET, POST, PUT, etc.), as well as the

HTTP/1.1 header parameters, are defined here. This layer also supports other

services built on top of remote invocation, such as WebDAV.

Once the HTTP components are modularized, they can be enhanced to provide bet-

ter performance and richer functionality.

Distributed Objects

Much of the philosophy and functionality goals of HTTP-NG borrow heavily from

structured, object-oriented, distributed-objects systems such as CORBA and DCOM.

Distributed-objects systems can help with extensibility and feature functionality.

A community of researchers has been arguing for a convergence between HTTP and

more sophisticated distributed-objects systems since 1996. For more information

about the merits of a distributed-objects paradigm for the Web, check out the early

paper from Xerox PARC entitled “Migrating the Web Toward Distributed Objects”

(ftp://ftp.parc.xerox.com/pub/ilu/misc/webilu.html).

Figure 10-1. HTTP-NG separates functions into layers

Layer                         Function                       Protocol
Layer 3 (HTTP-NG)             Web application functions
Layer 2 (HTTP-NG)             Remote operation invocation    Binary Wire Protocol
Layer 1 (HTTP-NG)             Message transport              WebMUX
Underlying network transport                                 TCP/IP

The ambitious philosophy of unifying the Web and distributed objects created

resistance to HTTP-NG’s adoption in some communities. Some past distributed-

objects systems suffered from heavyweight implementation and formal complexity.

The HTTP-NG project team attempted to address some of these concerns in the

requirements.

Layer 1: Messaging

Let’s take a closer look at the three layers of HTTP-NG, starting with the lowest layer.

The message transport layer is concerned with the efficient delivery of messages, inde-

pendent of the meaning and purpose of the messages. The message transport layer

provides an API for messaging, regardless of the actual underlying network stack.

This layer focuses on improving the performance of messaging, including:

• Pipelining and batching messages to reduce round-trip latency

• Reusing connections to reduce latency and improve delivered bandwidth

• Multiplexing multiple message streams in parallel, over the same connection, to

optimize shared connections while preventing starvation of message streams

• Efficient message segmentation to make it easier to determine message boundaries

The HTTP-NG team invested much of its energy into the development of the Web-

MUX protocol for layer 1 message transport. WebMUX is a high-performance mes-

sage protocol that fragments and interleaves messages across a multiplexed TCP

connection. We discuss WebMUX in a bit more detail later in this chapter.
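The idea of fragmenting and interleaving can be shown with a small model. This is a conceptual sketch only: the tuple-based fragments, the round-robin schedule, and the function names are assumptions made for illustration and do not reflect the actual WebMUX wire format.

```python
from itertools import zip_longest


def fragment(stream_id, payload, size):
    """Split one message into (stream_id, is_last, chunk) fragments."""
    chunks = [payload[i:i + size] for i in range(0, len(payload), size)] or [b""]
    return [(stream_id, i == len(chunks) - 1, chunk) for i, chunk in enumerate(chunks)]


def multiplex(messages, size=4):
    """Interleave the fragments of several messages onto one 'connection'.

    Streams take turns round-robin, so a long message cannot starve a short one.
    """
    per_stream = [fragment(sid, data, size) for sid, data in messages.items()]
    wire = []
    for turn in zip_longest(*per_stream):
        wire.extend(frag for frag in turn if frag is not None)
    return wire


def demultiplex(wire):
    """Reassemble the original messages from the interleaved fragments."""
    buffers, complete = {}, {}
    for stream_id, is_last, chunk in wire:
        buffers[stream_id] = buffers.get(stream_id, b"") + chunk
        if is_last:
            complete[stream_id] = buffers.pop(stream_id)
    return complete


if __name__ == "__main__":
    messages = {1: b"GET /index.html", 2: b"hi"}
    wire = multiplex(messages)
    print([stream_id for stream_id, _, _ in wire])
    print(demultiplex(wire) == messages)
```

Tagging every fragment with its stream and an end marker is what lets the receiver find message boundaries without parsing the payload.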

Layer 2: Remote Invocation

The middle layer of the HTTP-NG architecture supports remote method invocation.

This layer provides a generic request/response framework where clients invoke opera-

tions on server resources. This layer does not concern itself with the implementation

and semantics of the particular operations (caching, security, method logic, etc.); it is

concerned only with the interface to allow clients to remotely invoke server operations.

Many remote method invocation standards already are available (CORBA, DCOM,

and Java RMI, to name a few), and this layer is not intended to support every nifty

feature of these systems. However, there is an explicit goal to extend the richness of

HTTP RMI support from that provided by HTTP/1.1. In particular, there is a goal to

provide more general remote procedure call support, in an extensible, object-oriented

manner.

The HTTP-NG team proposed the Binary Wire Protocol for this layer. This protocol

supports a high-performance, extensible technology for invoking well-described

operations on a server and carrying back the results. We discuss the Binary Wire Pro-

tocol in a bit more detail later in this chapter.

Layer 3: Web Application

The web application layer is where the semantics and application-specific logic are

performed. The HTTP-NG working group shied away from the temptation to extend

the HTTP application features, focusing instead on formal infrastructure.

The web application layer describes a system for providing application-specific ser-

vices. These services are not monolithic; different APIs may be available for different

applications. For example, the web application for HTTP/1.1 would constitute a dif-

ferent application from WebDAV, though they may share some common parts. The

HTTP-NG architecture allows multiple applications to coexist at this level, sharing

underlying facilities, and provides a mechanism for adding new applications.

The philosophy of the web application layer is to provide equivalent functionality for

HTTP/1.1 and extension interfaces, while recasting them into a framework of extensi-

ble distributed objects. You can read more about the web application layer interfaces

at http://www.w3.org/Protocols/HTTP-NG/1998/08/draft-larner-nginterfaces-00.txt.

WebMUX

The HTTP-NG working group has invested much of its energy in the development of

the WebMUX standard for message transport. WebMUX is a sophisticated, high-

performance message system, where messages can be transported in parallel across a

multiplexed TCP connection. Individual message streams, produced and consumed

at different rates, can efficiently be packetized and multiplexed over a single or small

number of TCP connections (see Figure 10-2).

Here are some of the significant goals of the WebMUX protocol:

• Simple design.

• High performance.

• Multiplexing: Multiple data streams (of arbitrary higher-level protocols) can be
interleaved dynamically and efficiently over a single connection, without stalling
data waiting for slow producers.

• Credit-based flow control: Data is produced and consumed at different rates,
and senders and receivers have different amounts of memory and CPU resources
available. WebMUX uses a “credit-based” flow-control scheme, where receivers
preannounce interest in receiving data to prevent resource-scarcity deadlocks.

• Alignment preserving: Data alignment is preserved in the multiplexed stream so
that binary data can be sent and processed efficiently.

• Rich functionality: The interface is rich enough to support a sockets API.

Figure 10-2. WebMUX can multiplex multiple messages over a single connection

[Figure labels: Message A, Message B, Message C, and Message D at each end of the connection.]

You can read more about the WebMUX Protocol at http://www.w3.org/Protocols/

MUX/WD-mux-980722.html.
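Credit-based flow control is easier to see in miniature. In the sketch below, a sender may put data on the wire only up to the number of bytes the receiver has announced it can accept. The class and its method names are hypothetical and model only the general technique, not the credit messages defined by the WebMUX draft.

```python
class CreditedSender:
    """Sender side of one stream under a simplified credit scheme."""

    def __init__(self):
        self.credit = 0      # bytes the receiver has agreed to accept
        self.pending = b""   # data produced but not yet allowed onto the wire

    def grant(self, nbytes):
        """Called when the receiver announces room for nbytes more."""
        self.credit += nbytes

    def produce(self, data):
        """Queue data from the application for later transmission."""
        self.pending += data

    def transmit(self):
        """Send as much pending data as the current credit allows."""
        n = min(self.credit, len(self.pending))
        chunk, self.pending = self.pending[:n], self.pending[n:]
        self.credit -= n
        return chunk


if __name__ == "__main__":
    sender = CreditedSender()
    sender.produce(b"0123456789")
    print(sender.transmit())  # b'' because no credit has been granted yet
    sender.grant(4)
    print(sender.transmit())  # b'0123'
```

Because the sender never transmits more than the receiver asked for, a slow consumer cannot be forced to buffer unbounded data for one stream while other streams share the connection.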

Binary Wire Protocol

The HTTP-NG team proposed the Binary Wire Protocol to enhance how the next-

generation HTTP protocol supports remote operations.

HTTP-NG defines “object types” and assigns each object type a list of methods.

Each object type is assigned a URI, so its description and methods can be advertised.

In this way, HTTP-NG is proposing a more extensible and object-oriented execution

model than that provided with HTTP/1.1, where all methods were statically defined

in the servers.

The Binary Wire Protocol carries operation-invocation requests from the client to the

server and operation-result replies from the server to the client across a stateful con-

nection. The stateful connection provides extra efficiency.

Request messages contain the operation, the target object, and optional data values.

Reply messages carry back the termination status of the operation, the serial number

of the matching request (allowing arbitrary ordering of parallel requests and

responses), and optional return values. In addition to request and reply messages,

this protocol defines several internal control messages used to improve the efficiency

and robustness of the connection.

You can read more about the Binary Wire Protocol at http://www.w3.org/Protocols/HTTP-NG/1998/08/draft-janssen-httpng-wire-00.txt.
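
The role of the serial number is easiest to see in code. The following Python sketch is only an illustration of the idea, not the Binary Wire Protocol's actual message layout; the class names, field names, and the example operation and target are all hypothetical. It shows how a client can pair each reply with its request even when replies come back in a different order.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    serial: int
    operation: str
    target: str
    args: tuple = ()


@dataclass
class Reply:
    serial: int          # serial number of the matching request
    status: str          # termination status of the operation
    values: tuple = ()   # optional return values


@dataclass
class PendingCalls:
    """Client-side table that pairs replies with outstanding requests."""
    next_serial: int = 1
    outstanding: dict = field(default_factory=dict)

    def issue(self, operation, target, *args):
        """Create a request with a fresh serial number and remember it."""
        request = Request(self.next_serial, operation, target, args)
        self.outstanding[request.serial] = request
        self.next_serial += 1
        return request

    def complete(self, reply):
        """Return the request this reply answers, in whatever order it arrives."""
        return self.outstanding.pop(reply.serial)
```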

Current Status

At the end of 1998, the HTTP-NG team concluded that it was too early to bring the

HTTP-NG proposals to the IETF for standardization. There was concern that the

industry and community had not yet fully adjusted to HTTP/1.1 and that the signifi-

cant HTTP-NG rearchitecture to a distributed-objects paradigm would have been

extremely disruptive without a clear transition plan.

Two proposals were made:

• Instead of attempting to promote the entire HTTP-NG rearchitecture in one

step, it was proposed to focus on the WebMUX transport technology. But at the

time of this writing, there hasn’t been sufficient interest to establish a WebMUX

working group.

• An effort was launched to investigate whether formal protocol types can be

made flexible enough for use on the Web, perhaps using XML. This is especially

important for a distributed-objects system that is extensible. This work is still in

progress.

At the time of this writing, no major driving HTTP-NG effort is underway. But, with

the ever-increasing use of HTTP, its growing use as a platform for diverse applica-

tions, and the growing adoption of wireless and consumer Internet technology, some

of the techniques proposed in the HTTP-NG effort may prove significant in HTTP’s

teenage years.

For More Information

For more information about HTTP-NG, please refer to the following detailed specifi-

cations and activity reports:

http://www.w3.org/Protocols/HTTP-NG/

HTTP-NG Working Group (Proposed), W3C Consortium Web Site.

http://www.w3.org/Protocols/MUX/WD-mux-980722.html

“The WebMUX Protocol,” by J. Gettys and H. Nielsen.

http://www.w3.org/Protocols/HTTP-NG/1998/08/draft-janssen-httpng-wire-00.txt

“Binary Wire Protocol for HTTP-NG,” by B. Janssen.

http://www.w3.org/Protocols/HTTP-NG/1998/08/draft-larner-nginterfaces-00.txt

“HTTP-NG Web Interfaces,” by D. Larner.

ftp://ftp.parc.xerox.com/pub/ilu/misc/webilu.html

“Migrating the Web Toward Distributed Objects,” by D. Larner.

PART III

Identification, Authorization, and Security

The four chapters in Part III present a suite of techniques and technologies to track

identity, enforce security, and control access to content:

• Chapter 11, Client Identification and Cookies, talks about techniques to identify

users, so content can be personalized to the user audience.

• Chapter 12, Basic Authentication, highlights the basic mechanisms to verify user

identity. This chapter also examines how HTTP authentication interfaces with

databases.

• Chapter 13, Digest Authentication, explains digest authentication, a complex

proposed enhancement to HTTP that provides significantly enhanced security.

• Chapter 14, Secure HTTP, is a detailed overview of Internet cryptography, digi-

tal certificates, and the Secure Sockets Layer (SSL).

CHAPTER 11

Client Identification and Cookies

Web servers may talk to thousands of different clients simultaneously. These servers

often need to keep track of who they are talking to, rather than treating all requests

as coming from anonymous clients. This chapter discusses some of the technologies

that servers can use to identify who they are talking to.

The Personal Touch

HTTP began its life as an anonymous, stateless, request/response protocol. A request

came from a client, was processed by the server, and a response was sent back to the

client. Little information was available to the web server to determine what user sent

the request or to keep track of a sequence of requests from the visiting user.

Modern web sites want to provide a personal touch. They want to know more about

users on the other ends of the connections and be able to keep track of those users as

they browse. Popular online shopping sites like Amazon.com personalize their sites

for you in several ways:

Personal greetings

Welcome messages and page contents are generated specially for the user, to

make the shopping experience feel more personal.

Targeted recommendations

By learning about the interests of the customer, stores can suggest products that

they believe the customer will appreciate. Stores can also run birthday specials

near customers’ birthdays and other significant days.

Administrative information on file

Online shoppers hate having to fill in cumbersome address and credit card forms

over and over again. Some sites store these administrative details in a database.

Once they identify you, they can use the administrative information on file, mak-

ing the shopping experience much more convenient.

Session tracking

HTTP transactions are stateless. Each request/response happens in isolation.

Many web sites want to build up incremental state as you interact with the site

(for example, filling an online shopping cart). To do this, web sites need a way to

distinguish HTTP transactions from different users.

This chapter summarizes a few of the techniques used to identify users in HTTP.

HTTP itself was not born with a rich set of identification features. The early web-site

designers (practical folks that they were) built their own technologies to identify

users. Each technique has its strengths and weaknesses. In this chapter, we’ll discuss

the following mechanisms to identify users:

• HTTP headers that carry information about user identity

• Client IP address tracking, to identify users by their IP addresses

• User login, using authentication to identify users

• Fat URLs, a technique for embedding identity in URLs

• Cookies, a powerful and efficient technique for maintaining persistent identity

HTTP Headers

Table 11-1 shows the seven HTTP request headers that most commonly carry infor-

mation about the user. We’ll discuss the first three now; the last four headers are

used for more advanced identification techniques that we’ll discuss later.

The From header contains the user’s email address. Ideally, this would be a viable

source of user identification, because each user would have a different email address.

However, few browsers send From headers, due to worries of unscrupulous servers

collecting email addresses and using them for junk mail distribution. In practice,

From headers are sent by automated robots or spiders so that if something goes

astray, a webmaster has someplace to send angry email complaints.

Table 11-1. HTTP headers carry clues about users

Header name       Header type           Description
From              Request               User’s email address
User-Agent        Request               User’s browser software
Referer           Request               Page user came from by following link
Authorization     Request               Username and password (discussed later)
Client-ip         Extension (Request)   Client’s IP address (discussed later)
X-Forwarded-For   Extension (Request)   Client’s IP address (discussed later)
Cookie            Extension (Request)   Server-generated ID label (discussed later)

The User-Agent header tells the server information about the browser the user is

using, including the name and version of the program, and often information about

the operating system. This sometimes is useful for customizing content to interoper-

ate well with particular browsers and their attributes, but that doesn’t do much to

help identify the particular user in any meaningful way. Here are two User-Agent

headers, one sent by Netscape Navigator and the other by Microsoft Internet Explorer:

Navigator 6.2

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4) Gecko/20011128 Netscape6/6.2.1

Internet Explorer 6.01

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

The Referer header provides the URL of the page the user is coming from. The Referer header alone does not directly identify the user, but it does tell what page the user previously visited. You can use this to better understand user browsing behavior and user interests. For example, if you arrive at a web server coming from a baseball site, the server may infer you are a baseball fan.

The From, User-Agent, and Referer headers are insufficient for dependable identifi-

cation purposes. The remaining sections discuss more precise schemes to identify

particular users.
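
As a rough illustration of how a server might collect these clues, here is a small Python sketch. The function name identity_clues and the sample request used to exercise it are our own inventions; the sketch assumes a well-formed request with CRLF line endings and does no validation.

```python
IDENTITY_HEADERS = ("from", "user-agent", "referer")


def identity_clues(raw_request: str) -> dict:
    """Return the From, User-Agent, and Referer values found in a request."""
    head = raw_request.split("\r\n\r\n", 1)[0]   # drop any message body
    lines = head.split("\r\n")[1:]               # skip the request line
    clues = {}
    for line in lines:
        name, sep, value = line.partition(":")   # split at the first colon only
        if sep and name.strip().lower() in IDENTITY_HEADERS:
            clues[name.strip().lower()] = value.strip()
    return clues
```

Header names are compared case-insensitively, because HTTP header names are not case-sensitive. Note that none of the values returned is a dependable identifier, for the reasons given above.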

Client IP Address

Early web pioneers tried using the IP address of the client as a form of identification.

This scheme works if each user has a distinct IP address, if the IP address seldom (if

ever) changes, and if the web server can determine the client IP address for each

request. While the client IP address typically is not present in the HTTP headers,*

web servers can find the IP address of the other side of the TCP connection carrying

the HTTP request. For example, on Unix systems, the getpeername function call

returns the client IP address of the sending machine:

status = getpeername(tcp_connection_socket,...);
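
The same call is available from higher-level languages. As a sketch, Python's socket objects expose a getpeername method; the helper names peer_address and demo below are ours, and the demo simply connects to itself over the loopback interface so it can run on one machine.

```python
import socket


def peer_address(connection: socket.socket) -> str:
    """Return the IP address at the far end of an accepted TCP connection."""
    host, _port = connection.getpeername()[:2]
    return host


def demo() -> str:
    """Connect to ourselves over loopback and report what the server side sees."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            server_side, _addr = listener.accept()
            with server_side:
                return peer_address(server_side)
```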

Unfortunately, using the client IP address to identify the user has numerous weak-

nesses that limit its effectiveness as a user-identification technology:

• Client IP addresses describe only the computer being used, not the user. If multiple users share the same computer, they will be indistinguishable.

• Many Internet service providers dynamically assign IP addresses to users when they log in. Each time they log in, they get a different address, so web servers can’t assume that IP addresses will identify a user across login sessions.

• To enhance security and manage scarce addresses, many users browse the Internet through Network Address Translation (NAT) firewalls. These NAT devices obscure the IP addresses of the real clients behind the firewall, converting the actual client IP address into a single, shared firewall IP address (and different port numbers).

• HTTP proxies and gateways typically open new TCP connections to the origin server. The web server will see the IP address of the proxy server instead of that of the client. Some proxies attempt to work around this problem by adding special Client-ip or X-Forwarded-For HTTP extension headers to preserve the original IP address (Figure 11-1). But not all proxies support this behavior.

* As we’ll see later, some proxies do add a Client-ip header, but this is not part of the HTTP standard.
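
A server that wants to use these extension headers has to decide when to believe them. The Python sketch below shows one cautious approach; the function name and the notion of a configured set of trusted proxy addresses are assumptions of ours, not part of any standard. It relies on the common convention that, when several proxies append to X-Forwarded-For, the first entry is the address reported for the original client.

```python
def apparent_client_ip(peer_ip: str, headers: dict, trusted_proxies: set) -> str:
    """Pick the client address, believing proxy headers only from known proxies."""
    if peer_ip not in trusted_proxies:
        return peer_ip                      # direct client, or a proxy we do not trust
    lowered = {name.lower(): value for name, value in headers.items()}
    forwarded = lowered.get("x-forwarded-for")
    if forwarded:
        # Comma-separated list; take the first (original client) entry.
        return forwarded.split(",")[0].strip()
    # Fall back to Client-ip, then to the TCP peer address itself.
    return lowered.get("client-ip", peer_ip).strip()
```

Because any client can send these headers itself, the sketch ignores them unless the TCP peer is a proxy the server already knows about.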

Some web sites still use client IP addresses to keep track of the users between ses-

sions, but not many. There are too many places where IP address targeting doesn’t

work well.

A few sites even use client IP addresses as a security feature, serving documents only

to users from a particular IP address. While this may be adequate within the con-

fines of an intranet, it breaks down in the Internet, primarily because of the ease with

which IP addresses are spoofed (forged). The presence of intercepting proxies in the

path also breaks this scheme. Chapter 14 discusses much stronger schemes for con-

trolling access to privileged documents.

User Login

Rather than passively trying to guess the identity of a user from his IP address, a web

server can explicitly ask the user who he is by requiring him to authenticate (log in)

with a username and password.

To help make web site logins easier, HTTP includes a built-in mechanism to pass username information to web sites, using the WWW-Authenticate and Authorization headers. Once logged in, the browsers continually send this login information with each request to the site, so the information is always available. We’ll discuss this HTTP authentication in much more detail in Chapter 12, but let’s take a quick look at it now.

Figure 11-1. Proxies can add extension headers to pass along the original client IP address

(Figure labels: client at 209.172.34.56; proxy server; server; forwarded headers “Client-ip: 209.172.34.56” and “X-Forwarded-For: 209.172.34.56”; a second address, 56.41.11.4.)

If a server wants a user to register before providing access to the site, it can send back an HTTP 401 Login Required response code to the browser. The browser will then display a login dialog box and supply the information in the next request to the server, using the Authorization header.* This is depicted in Figure 11-2.

Figure 11-2. Registering username using HTTP authentication headers

(a) Request from client to server:

GET /index.html HTTP/1.0
Host: www.joes-hardware.com

(b) Response from server to client:

HTTP/1.0 401 Login Required
WWW-authenticate: Basic realm="Plumbing and Fixtures"

(c) Repeated request from client to server:

GET /index.html HTTP/1.0
Host: www.joes-hardware.com
Authorization: Basic am910jRmdW4=

(d) Response from server to client:

HTTP/1.0 200 OK
Content-length: 4342
Content-type: text/html
...

Here’s what’s happening in this figure:

• In Figure 11-2a, a browser makes a request from the www.joes-hardware.com site.

• The site doesn’t know the identity of the user, so in Figure 11-2b, the server requests a login by returning the 401 Login Required HTTP response code and adds the WWW-Authenticate header. This causes the browser to pop up a login dialog box.

• Once the user enters a username and a password (to sanity check his identity), the browser repeats the original request. This time it adds an Authorization header, specifying the username and password. The username and password are scrambled, to hide them from casual or accidental network observers.*

• Now, the server is aware of the user’s identity.

• For future requests, the browser will automatically issue the stored username and password when asked and will often even send it to the site when not asked. This makes it possible to log in once to a site and have your identity maintained through the session, by having the browser send the Authorization header as a token of your identity on each request to the server.

* To save users from having to log in for each request, most browsers will remember login information for a site and pass in the login information for each request to the site.
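
The “scrambling” used by basic authentication is base-64 encoding of the string username:password. The following Python sketch shows both directions, which also makes plain how little protection the encoding offers. The function names are ours, and any credentials you feed it are placeholders.

```python
import base64


def make_basic_authorization(username: str, password: str) -> str:
    """Build the value of an Authorization header for basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def parse_basic_authorization(header_value: str) -> tuple:
    """Recover (username, password) from a basic Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    # Split at the first colon: the password may itself contain colons.
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password
```

Anyone who can see the header can run the second function, which is why the encoding hides credentials only from casual observers.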

However, logging in to web sites is tedious. As Fred browses from site to site, he’ll

need to log in for each site. To make matters worse, it is likely that poor Fred will

need to remember different usernames and passwords for different sites. His favorite

username, “fred,” will already have been chosen by someone else by the time he vis-

its many sites, and some sites will have different rules about the length and composi-

tion of usernames and passwords. Pretty soon, Fred will give up on the Internet and

go back to watching Oprah. The next section discusses a solution to this problem.

Fat URLs

Some web sites keep track of user identity by generating special versions of each URL

for each user. Typically, a real URL is extended by adding some state information to

the start or end of the URL path. As the user browses the site, the web server dynam-

ically generates hyperlinks that continue to maintain the state information in the

URLs.

URLs modified to include user state information are called fat URLs. The following

are some example fat URLs used in the Amazon.com e-commerce web site. Each

URL is suffixed by a user-unique identification number (002-1145265-8016838, in

this case) that helps track a user as she browses the store.

...
<a href="/exec/obidos/tg/browse/-/229220/ref=gr_gifts/002-1145265-8016838">All Gifts</a><br>
...
<a href="http://s1.amazon.com/exec/varzea/tg/armed-forces/-//ref=gr_af_/002-1145265-8016838">Salute Our Troops</a><br>
<a href="/exec/obidos/tg/browse/-/749188/ref=gr_p4_/002-1145265-8016838">Free Shipping</a><br>
<a href="/exec/obidos/tg/browse/-/468532/ref=gr_returns/002-1145265-8016838">Easy Returns</a>
...

* As we will see in Chapter 14, the HTTP basic authentication username and password can easily be unscrambled by anyone who wants to go through a minimal effort. More secure techniques will be discussed later.

You can use fat URLs to tie the independent HTTP transactions with a web server

into a single “session” or “visit.” The first time a user visits the web site, a unique ID is

generated, it is added to the URL in a server-recognizable way, and the server redi-

rects the client to this fat URL. Whenever the server gets a request for a fat URL, it can

look up any incremental state associated with that user ID (shopping carts, profiles,

etc.), and it rewrites all outgoing hyperlinks to make them fat, to maintain the user ID.
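
The cycle just described (generate an ID, fatten the URL, recover the ID, rewrite outgoing links) can be sketched in a few lines of Python. This is an illustration only: the function names are ours, the ID shape merely imitates the examples above, and real sites rewrite pages with far more care than a single regular expression.

```python
import re
import secrets

# Matches a trailing ID shaped like 002-1145265-8016838 (shape is an assumption).
SESSION_ID = re.compile(r"/(\d{3}-\d{7}-\d{7})$")


def new_session_id() -> str:
    """Generate a random ID shaped like the examples above."""
    digits = "".join(secrets.choice("0123456789") for _ in range(17))
    return f"{digits[:3]}-{digits[3:10]}-{digits[10:]}"


def fatten(path: str, session_id: str) -> str:
    """Append the session ID to a URL path."""
    return f"{path.rstrip('/')}/{session_id}"


def slim(fat_path: str):
    """Split a fat path into (real path, session ID); the ID is None if absent."""
    match = SESSION_ID.search(fat_path)
    if not match:
        return fat_path, None
    return fat_path[: match.start()], match.group(1)


def rewrite_links(html: str, session_id: str) -> str:
    """Fatten every site-relative href in a page."""
    return re.sub(r'href="(/[^"]*)"',
                  lambda m: f'href="{fatten(m.group(1), session_id)}"', html)
```

A request for a path with no ID would get a fresh one from new_session_id and a redirect to the fattened path; a request that already carries an ID goes through slim so the server can look up that user's state.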

Fat URLs can be used to identify users as they browse a site. But this technology does

have several serious problems. Some of these problems include:

Ugly URLs

The fat URLs displayed in the browser are confusing for new users.

Can’t share URLs

The fat URLs contain state information about a particular user and session. If

you mail that URL to someone else, you may inadvertently be sharing your accu-

mulated personal information.

Breaks caching

Generating user-specific versions of each URL means that there are no longer

commonly accessed URLs to cache.

Extra server load

The server needs to rewrite HTML pages to fatten the URLs.

Escape hatches

It is too easy for a user to accidentally “escape” from the fat URL session by

jumping to another site or by requesting a particular URL. Fat URLs work only if

the user strictly follows the premodified links. If the user escapes, he may lose

his progress (perhaps a filled shopping cart) and will have to start again.

Not persistent across sessions

All information is lost when the user logs out, unless he bookmarks the particu-

lar fat URL.

Cookies

Cookies are the best current way to identify users and allow persistent sessions. They

don’t suffer many of the problems of the previous techniques, but they often are used

in conjunction with those techniques for extra value. Cookies were first developed by

Netscape but now are supported by all major browsers.

Because cookies are important, and they define new HTTP headers, we’re going to

explore them in more detail than we did the previous techniques. The presence of

cookies also impacts caching, and most caches and browsers disallow caching of any

cookied content. The following sections present more details.

Types of Cookies

You can classify cookies broadly into two types: session cookies and persistent cook-

ies. A session cookie is a temporary cookie that keeps track of settings and prefer-

ences as a user navigates a site. A session cookie is deleted when the user exits the

browser. Persistent cookies can live longer; they are stored on disk and survive

browser exits and computer restarts. Persistent cookies often are used to retain a

configuration profile or login name for a site that a user visits periodically.

The only difference between session cookies and persistent cookies is when they

expire. As we will see later, a cookie is a session cookie if its Discard parameter is set,

or if there is no Expires or Max-Age parameter indicating an extended expiration time.
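
That rule is short enough to state as code. In this Python sketch, the function name is ours and the cookie's attributes are assumed to have been parsed already into a dictionary keyed by attribute name.

```python
def is_session_cookie(attributes: dict) -> bool:
    """Apply the rule above: Discard set, or no Expires and no Max-Age."""
    names = {name.lower() for name in attributes}   # attribute names are case-insensitive
    if "discard" in names:
        return True
    return "expires" not in names and "max-age" not in names
```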

How Cookies Work

Cookies are like “Hello, My Name Is” stickers stuck onto users by servers. When a

user visits a web site, the web site can read all the stickers attached to the user by

that server.

The first time the user visits a web site, the web server doesn’t know anything about

the user (Figure 11-3a). The web server expects that this same user will return again,

so it wants to “slap” a unique cookie onto the user so it can identify this user in the

future. The cookie contains an arbitrary list of name=value information, and it is

attached to the user using the Set-Cookie or Set-Cookie2 HTTP response (exten-

sion) headers.

Cookies can contain any information, but they often contain just a unique identification number, generated by the server for tracking purposes. For example, in Figure 11-3b, the server slaps onto the user a cookie that says id="34294". The server can use this number to look up database information that the server accumulates for its visitors (purchase history, address information, etc.).

However, cookies are not restricted to just ID numbers. Many web servers choose to

keep information directly in the cookies. For example:

Cookie: name="Brian Totty"; phone="555-1212"

The browser remembers the cookie contents sent back from the server in Set-Cookie

or Set-Cookie2 headers, storing the set of cookies in a browser cookie database (think

of it like a suitcase with stickers from various countries on it). When the user returns

to the same site in the future (Figure 11-3c), the browser will select those cookies

slapped onto the user by that server and pass them back in a Cookie request header.
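
Both halves of this exchange can be tried out with Python's standard http.cookies module. The sketch below is one way to do it, not the only one; the two helper names are ours. The first helper produces the header a server would send, and the second reads back what a browser returns.

```python
from http.cookies import SimpleCookie


def make_set_cookie(name: str, value: str, domain: str) -> str:
    """Build the Set-Cookie response header a server would send."""
    jar = SimpleCookie()
    jar[name] = value
    jar[name]["domain"] = domain
    return jar.output()        # one "Set-Cookie: ..." line per cookie


def read_cookie_header(cookie_header: str) -> dict:
    """Parse the value of a Cookie request header into name/value pairs."""
    jar = SimpleCookie()
    jar.load(cookie_header)    # quoted values are unquoted while loading
    return {name: morsel.value for name, morsel in jar.items()}
```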

Cookie Jar: Client-Side State

The basic idea of cookies is to let the browser accumulate a set of server-specific

information, and provide this information back to the server each time you visit.

Because the browser is responsible for storing the cookie information, this system is

called client-side state. The official name for the cookie specification is the HTTP

State Management Mechanism.

Netscape Navigator cookies

Different browsers store cookies in different ways. Netscape Navigator stores cook-

ies in a single text file called cookies.txt. For example:

# Netscape HTTP Cookie File

# http://www.netscape.com/newsref/std/cookie_spec.html

# This is a generated file! Do not edit.

# domain allh path secure expires name value

www.fedex.com FALSE / FALSE 1136109676 cc /us/

.bankofamericaonline.com TRUE / FALSE 1009789256 state CA

.cnn.com TRUE / FALSE 1035069235 SelEdition www

secure.eepulse.net FALSE /eePulse FALSE 1007162968 cid %FE%FF%002

www.reformamt.org TRUE /forum FALSE 1033761379 LastVisit 1003520952

www.reformamt.org TRUE /forum FALSE 1033761379 UserName Guest

...

Figure 11-3. Slapping a cookie onto a user

(a) First request from client to server:

GET /index.html HTTP/1.0
Host: www.joes-hardware.com

(b) Response from server to client, setting the cookie id=34294:

HTTP/1.0 200 OK
Set-cookie: id="34294"; domain="joes-hardware.com"
Content-type: text/html
Content-length: 1903
...

(c) Later request from client to server, returning the cookie id=34294:

GET /index.html HTTP/1.0
Host: www.joes-hardware.com
Cookie: id="34294"

Each line of the text file represents a cookie. There are seven tab-separated fields:

domain

The domain of the cookie

allh

Whether all hosts in a domain get the cookie, or only the specific host named

path

The path prefix in the domain associated with the cookie

secure

Whether we should send this cookie only if we have an SSL connection

expires

The cookie expiration date in seconds since Jan 1, 1970 00:00:00 GMT

name

The name of the cookie variable

value

The value of the cookie variable
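
Given that layout, reading a cookies.txt file takes only a few lines. The Python sketch below is a minimal reader under the assumptions stated above (seven tab-separated fields, comment lines starting with #); the class and function names are ours, and malformed lines are silently skipped rather than reported.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class NetscapeCookie:
    domain: str
    all_hosts: bool      # the "allh" column
    path: str
    secure: bool
    expires: datetime    # converted from seconds since Jan 1, 1970 GMT
    name: str
    value: str


def parse_cookies_txt(text: str) -> list:
    """Parse the contents of a Netscape-style cookies.txt file."""
    cookies = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                         # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 7:
            continue                         # not a cookie record
        domain, allh, path, secure, expires, name, value = fields
        cookies.append(NetscapeCookie(
            domain=domain,
            all_hosts=(allh == "TRUE"),
            path=path,
            secure=(secure == "TRUE"),
            expires=datetime.fromtimestamp(int(expires), tz=timezone.utc),
            name=name,
            value=value,
        ))
    return cookies
```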

Microsoft Internet Explorer cookies

Microsoft Internet Explorer stores cookies in individual text files in the cache direc-

tory. You can browse this directory to view the cookies, as shown in Figure 11-4.

The format of the Internet Explorer cookie files is proprietary, but many of the fields

are easily understood. Each cookie is stored one after the other in the file, and each

cookie consists of multiple lines.

The first line of each cookie in the file contains the cookie variable name. The next

line is the variable value. The third line contains the domain and path. The remain-

ing lines are proprietary data, presumably including dates and other flags.

Different Cookies for Different Sites

A browser can have hundreds or thousands of cookies in its internal cookie jar, but

browsers don’t send every cookie to every site. In fact, they typically send only two

or three cookies to each site. Here’s why:

• Moving all those cookie bytes would dramatically slow performance. Browsers

would actually be moving more cookie bytes than real content bytes!

• Most of these cookies would just be unrecognizable gibberish for most sites,

because they contain server-specific name/value pairs.

• Sending all cookies to all sites would create a potential privacy concern, with

sites you don’t trust getting information you intended only for another site.

In general, a browser sends to a server only those cookies that the server generated.

Cookies generated by joes-hardware.com are sent to joes-hardware.com and not to

bobs-books.com or marys-movies.com.

Many web sites contract with third-party vendors to manage advertisements. These advertisements are made to look like they are integral parts of the web site, and they set persistent cookies. When the user goes to a different web site serviced by the same advertisement company, the persistent cookie set earlier is sent back again by the browser (because the domains match). A marketing company could use this technique, combined with the Referer header, to potentially build an exhaustive data set of user profiles and browsing habits. Modern browsers allow you to configure privacy settings to restrict third-party cookies.

Cookie Domain attribute

A server generating a cookie can control which sites get to see that cookie by adding a Domain attribute to the Set-Cookie response header. For example, the following HTTP response header tells the browser to send the cookie user="mary17" to any site in the domain .airtravelbargains.com:

Set-cookie: user="mary17"; domain="airtravelbargains.com"

Figure 11-4. Internet Explorer cookies are stored in individual text files in the cache directory

(Figure annotations: MSIE stores cookies in the same location as other cached objects. Each cookie file has cookies for a particular site; the cookies are stored in text lines, one after the other. You can open MSIE cookies in a text viewer program. The example shown has Name = “session-id”, Value = “002-9351993-5692007”, and Domain/path = “amazon.com”, followed by other attributes in a proprietary format.)

If the user visits www.airtravelbargains.com, specials.airtravelbargains.com, or any site ending in .airtravelbargains.com, the following Cookie header will be issued:

Cookie: user="mary17"

Cookie Path attribute

The cookie specification even lets you associate cookies with portions of web sites.

This is done using the Path attribute, which indicates the URL path prefix where

each cookie is valid.

For example, one web server might be shared between two organizations, each hav-

ing separate cookies. The site www.airtravelbargains.com might devote part of its

web site to auto rentals—say, http://www.airtravelbargains.com/autos/—using a sep-

arate cookie to keep track of a user’s preferred car size. A special auto-rental cookie

might be generated like this:

Set-cookie: pref=compact; domain="airtravelbargains.com"; path=/autos/

If the user goes to http://www.airtravelbargains.com/specials.html, she will get only

this cookie:

Cookie: user="mary17"

But if she goes to http://www.airtravelbargains.com/autos/cheapo/index.html, she will

get both of these cookies:

Cookie: user="mary17"

Cookie: pref=compact

So, cookies are pieces of state, slapped onto the client by the servers, maintained by

the clients, and sent back to only those sites that are appropriate. Let’s look in more

detail at the cookie technology and standards.
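
Before we do, here is a simplified Python sketch of the selection a browser performs, using the Domain and Path rules just described. The function names and the tuple layout are ours, the matching is deliberately naive (plain suffix and prefix tests, no expiration or Secure handling), and the sketch joins the selected cookies into a single Cookie header.

```python
def domain_matches(cookie_domain: str, host: str) -> bool:
    """True if host is the cookie's domain or a host within it."""
    domain = cookie_domain.lstrip(".").lower()
    host = host.lower()
    return host == domain or host.endswith("." + domain)


def path_matches(cookie_path: str, url_path: str) -> bool:
    """True if the cookie's path is a prefix of the requested URL path."""
    return url_path.startswith(cookie_path)


def cookie_header(cookies, host: str, url_path: str) -> str:
    """Build a Cookie header from (name, value, domain, path) tuples."""
    pairs = [f"{name}={value}" for name, value, domain, path in cookies
             if domain_matches(domain, host) and path_matches(path, url_path)]
    return ("Cookie: " + "; ".join(pairs)) if pairs else ""
```

With the two airtravelbargains.com cookies from the examples above, a request for /specials.html selects only the user cookie, while a request under /autos/ selects both.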

Cookie Ingredients

There are two different versions of cookie specifications in use: Version 0 cookies

(sometimes called “Netscape cookies”), and Version 1 (“RFC 2965”) cookies. Ver-

sion 1 cookies are a less widely used extension of Version 0 cookies.

Neither the Version 0 nor the Version 1 cookie specification is documented as part of the HTTP/1.1 specification. There are two primary adjunct documents that best describe the use of cookies, summarized in Table 11-2.

Table 11-2. Cookie specifications

Title: Persistent Client State: HTTP Cookies
Description: Original Netscape cookie standard
Location: http://home.netscape.com/newsref/std/cookie_spec.html

Title: RFC 2965: HTTP State Management Mechanism
Description: October 2000 cookie standard, obsoletes RFC 2109
Location: http://www.ietf.org/rfc/rfc2965.txt

Version 0 (Netscape) Cookies

The initial cookie specification was defined by Netscape. These “Version 0” cookies

defined the Set-Cookie response header, the Cookie request header, and the fields

available for controlling cookies. Version 0 cookies look like this:

Set-Cookie: name=value [; expires=date] [; path=path] [; domain=domain] [; secure]

Cookie: name1=value1 [; name2=value2] ...

Version 0 Set-Cookie header

The Set-Cookie header has a mandatory cookie name and cookie value. It can be fol-

lowed by optional cookie attributes, separated by semicolons. The Set-Cookie fields

are described in Table 11-3.

Table 11-3. Version 0 (Netscape) Set-Cookie attributes

Set-Cookie attribute Description and examples

NAME=VALUE Mandatory. Both NAME and VALUE are sequences of characters, excluding the semicolon, comma,

equals sign, and whitespace, unless quoted in double quotes. The web server can create any

NAME=VALUE association, which will be sent back to the web server on subsequent visits to the site.

Set-Cookie: customer=Mary

Expires Optional. This attribute specifies a date string that defines the valid lifetime of that cookie. Once the

expiration date has been reached, the cookie will no longer be stored or given out. The date is for-

matted as:

Weekday, DD-Mon-YY HH:MM:SS GMT

The only legal time zone is GMT, and the separators between the elements of the date must be

dashes. If Expires is not specified, the cookie will expire when the user’s session ends.

Set-Cookie: foo=bar; expires=Wednesday, 09-Nov-99 23:12:40 GMT

Domain Optional. A browser sends the cookie only to server hostnames in the specified domain. This lets serv-

ers restrict cookies to only certain domains. A domain of “acme.com”would match hostnames “anvil.

acme.com” and “shipping.crate.acme.com”, but not “www.cnn.com”.

Only hosts within the specified domain can set a cookie for a domain, and domains must have at least

two or three periods in them to prevent domains of the form “.com”, “.edu”, and “va.us”. Any

domain that falls within the fixed set of special top-level domains listed here requires only two peri-

ods. Any other domain requires at least three. The special top-level domains are: .com, .edu, .net,

.org, .gov, .mil, .int, .biz, .info, .name, .museum, .coop, .aero, and .pro.

If the domain is not specified, it defaults to the hostname of the server that generated the Set-Cookie

response.

Set-Cookie: SHIPPING=FEDEX; domain="joes-hardware.com"

Path Optional. This attribute lets you assign cookies to particular documents on a server. If the Path

attribute is a prefix of a URL path, a cookie can be attached. The path “/foo” matches “/foobar” and

“/foo/bar.html”. The path “/” matches everything in the domain.

If the path is not specified, it is set to the path of the URL that generated the Set-Cookie response.

Set-Cookie: lastorder=00183; path=/orders

Secure Optional. If this attribute is included, a cookie will be sent only if HTTP is using an SSL secure connection.

Set-Cookie: private_id=519; secure
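The attribute layout described above is simple enough to pull apart by hand. Here is a minimal sketch in Python, assuming a deliberately naive parser (the function name parse_set_cookie is ours, and real browsers are far more forgiving about malformed input). It splits one Version 0 Set-Cookie value into the NAME=VALUE pair and a dictionary of attributes:

```python
def parse_set_cookie(header_value):
    """Split one Set-Cookie header value into (name, value, attributes).

    Hypothetical helper for illustration; it does no validation.
    """
    parts = [part.strip() for part in header_value.split(";")]
    name, _, value = parts[0].partition("=")
    attributes = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing semicolon
        key, has_value, attr_value = part.partition("=")
        # Attribute names are compared case-insensitively; a bare keyword
        # such as "secure" is recorded as True.
        attributes[key.strip().lower()] = (
            attr_value.strip().strip('"') if has_value else True
        )
    return name.strip(), value.strip(), attributes


if __name__ == "__main__":
    for example in (
        "foo=bar; expires=Wednesday, 09-Nov-99 23:12:40 GMT",
        'SHIPPING=FEDEX; domain="joes-hardware.com"',
        "lastorder=00183; path=/orders",
        "private_id=519; secure",
    ):
        print(parse_set_cookie(example))
```

Note that the Expires date contains a comma but no semicolon, which is why splitting on semicolons is safe for this header.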

Version 0 Cookie header

When a client sends requests, it includes all the unexpired cookies that match the

domain, path, and secure filters to the site. All the cookies are combined into a

Cookie header:

Cookie: session-id=002-1145265-8016838; session-id-time=1007884800
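To make the filtering concrete, here is a rough sketch of how a client might select cookies and assemble the Cookie header. The StoredCookie type and the function names are assumptions made for this illustration. The sketch also assumes expired cookies have already been purged, and it uses the simple tail-match and prefix-match rules described above rather than the stricter rules of later specifications:

```python
from typing import NamedTuple


class StoredCookie(NamedTuple):
    """Hypothetical record for one cookie held by the client."""
    name: str
    value: str
    domain: str          # e.g. ".amazon.com" or an exact hostname
    path: str = "/"
    secure: bool = False


def domain_matches(host, domain):
    """Tail-match the request host against the cookie's domain."""
    host = host.lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def cookie_header(cookies, host, path, is_secure):
    """Return the Cookie header line for a request, or None if nothing matches."""
    matching = [
        c for c in cookies
        if domain_matches(host, c.domain)
        and path.startswith(c.path)          # Path is a simple prefix filter
        and (is_secure or not c.secure)      # secure cookies need SSL
    ]
    if not matching:
        return None
    return "Cookie: " + "; ".join(f"{c.name}={c.value}" for c in matching)
```

Keeping the filters as three independent tests mirrors the way the attributes are defined: each one can only narrow the set of requests that carry the cookie.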

Version 1 (RFC 2965) Cookies

An extended version of cookies is defined in RFC 2965 (previously RFC 2109). This

Version 1 standard introduces the Set-Cookie2 and Cookie2 headers, but it also

interoperates with the Version 0 system.

The RFC 2965 cookie standard is a bit more complicated than the original Netscape

standard and is not yet completely supported. The major changes of RFC 2965 cook-

ies are:

• Associate descriptive text with each cookie to explain its purpose

• Support forced destruction of cookies on browser exit, regardless of expiration

• Max-Age aging of cookies in relative seconds, instead of absolute dates

• Ability to control cookies by the URL port number, not just domain and path

• The Cookie header carries back the domain, port, and path filters (if any)

• Version number for interoperability

• $ prefix in Cookie header to distinguish additional keywords from cookie names

The Version 1 cookie syntax is as follows:

set-cookie = "Set-Cookie2:" cookies

cookies = 1#cookie

cookie = NAME "=" VALUE *(";" set-cookie-av)

NAME = attr

VALUE = value

set-cookie-av = "Comment" "=" value

| "CommentURL" "=" <"> http_URL <">

| "Discard"

| "Domain" "=" value

| "Max-Age" "=" value

| "Path" "=" value

| "Port" [ "=" <"> portlist <"> ]

| "Secure"

| "Version" "=" 1*DIGIT

portlist = 1#portnum

portnum = 1*DIGIT

cookie = "Cookie:" cookie-version 1*((";" | ",") cookie-value)

cookie-value = NAME "=" VALUE [";" path] [";" domain] [";" port]

cookie-version = "$Version" "=" value

NAME = attr

VALUE = value

path = "$Path" "=" value

domain = "$Domain" "=" value

port = "$Port" [ "=" <"> value <"> ]

cookie2 = "Cookie2:" cookie-version

Version 1 Set-Cookie2 header

More attributes are available in the Version 1 cookie standard than in the Netscape

standard. Table 11-4 provides a quick summary of the attributes. Refer to RFC 2965

for more detailed explanation.

Table 11-4. Version 1 (RFC 2965) Set-Cookie2 attributes

Set-Cookie2 attribute Description and examples

NAME=VALUE Mandatory. The web server can create any NAME=VALUE association, which will be sent back to

the web server on subsequent visits to the site. The name must not begin with “$”, because that

character is reserved.

Version Mandatory. The value of this attribute is an integer, corresponding to the version of the cookie

specification. RFC 2965 is Version 1.

Set-Cookie2: Part="Rocket_Launcher_0001"; Version="1"

Comment Optional. This attribute documents how a server intends to use the cookie. The user can inspect this

policy to decide whether to permit a session with this cookie. The value must be in UTF-8 encoding.

CommentURL Optional. This attribute provides a URL pointer to detailed documentation about the purpose and

policy for a cookie. The user can inspect this policy to decide whether to permit a session with this

cookie.

Discard Optional. If this attribute is present, it instructs the client to discard the cookie when the client

program terminates.

Domain Optional. A browser sends the cookie only to server hostnames in the specified domain. This lets

servers restrict cookies to only certain domains. A domain of “acme.com” would match host-

names “anvil.acme.com” and “shipping.crate.acme.com”, but not “www.cnn.com”. The rules for

domain matching are basically the same as in Netscape cookies, but there are a few additional

rules. Refer to RFC 2965 for details.

Max-Age Optional. The value of this attribute is an integer that sets the lifetime of the cookie in seconds.

Clients should calculate the age of the cookie according to the HTTP/1.1 age-calculation rules.

When a cookie’s age becomes greater than the Max-Age, the client should discard the cookie. A

value of zero means the cookie with that name should be discarded immediately.

Path Optional. This attribute lets you assign cookies to particular documents on a server. If the Path

attribute is a prefix of a URL path, a cookie can be attached. The path “/foo” would match

“/foobar” and “/foo/bar.html”. The path “/” matches everything in the domain. If the path is not

specified, it is set to the path of the URL that generated the Set-Cookie2 response.

Port Optional. This attribute can stand alone as a keyword, or it can include a comma-separated list of

ports to which a cookie may be applied. If there is a port list, the cookie can be served only to serv-

ers whose ports match a port in the list. If the Port keyword is provided in isolation, the cookie can

be served only to the port number of the current responding server.

Set-Cookie2: foo="bar"; Version="1"; Port="80,81,8080"

Set-Cookie2: foo="bar"; Version="1"; Port

Secure Optional. If this attribute is included, a cookie will be sent only if HTTP is using an SSL secure

connection.
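Two of the new attributes, Port and Max-Age, reduce to small checks on the client. The sketch below is one possible reading of the table, not code from RFC 2965. The function names are ours, the Port value is assumed to arrive with its quotes already stripped, and the age test is simplified to wall-clock seconds rather than the full HTTP/1.1 age calculation:

```python
import time


def port_allows(port_attr, request_port, origin_port):
    """Decide whether a cookie may be sent to request_port.

    port_attr is None when no Port attribute was given, "" when the bare
    Port keyword was given, or a comma-separated list such as "80,81,8080".
    origin_port is the port of the server that set the cookie.
    """
    if port_attr is None:
        return True                          # no port restriction at all
    if port_attr == "":
        return request_port == origin_port   # bare keyword: same port only
    allowed = {int(p) for p in port_attr.split(",")}
    return request_port in allowed


def should_discard(received_at, max_age, now=None):
    """Return True once the cookie has outlived its Max-Age (in seconds)."""
    if max_age == 0:
        return True                          # zero means discard immediately
    now = time.time() if now is None else now
    return (now - received_at) > max_age
```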

Version 1 Cookie header

Version 1 cookies carry back additional information about each delivered cookie,

describing the filters each cookie passed. Each matching cookie must include any

Domain, Port, or Path attributes from the corresponding Set-Cookie2 headers.

For example, assume the client has received these five Set-Cookie2 responses in the

past from the www.joes-hardware.com web site:

Set-Cookie2: ID="29046"; Domain=".joes-hardware.com"

Set-Cookie2: color=blue

Set-Cookie2: support-pref="L2"; Domain="customer-care.joes-hardware.com"

Set-Cookie2: Coupon="hammer027"; Version="1"; Path="/tools"

Set-Cookie2: Coupon="handvac103"; Version="1"; Path="/tools/cordless"

If the client makes another request for path /tools/cordless/specials.html, it will pass along a long Cookie header like this:

Cookie: $Version="1";

ID="29046"; $Domain=".joes-hardware.com";

color="blue";

Coupon="hammer027"; $Path="/tools";

Coupon="handvac103"; $Path="/tools/cordless"

Notice that all the matching cookies are delivered with their Set-Cookie2 filters, and

the reserved keywords begin with a dollar sign ($).
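The joes-hardware example can be reproduced with a short sketch. The dictionary layout and function names here are assumptions for illustration. Cookies are emitted in the order they were stored, which matches the example above, and Port is left out because none of the five cookies uses it:

```python
def domain_matches(host, domain):
    """Tail-match the request host against a cookie's Domain attribute."""
    host = host.lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def v1_cookie_header(cookies, host, path):
    """Build a Version 1 Cookie header for a request to host and path.

    Each cookie is a dict with "name" and "value", plus optional "domain"
    and "path" keys copied from its Set-Cookie2 header.
    """
    parts = ['$Version="1"']
    for cookie in cookies:
        domain = cookie.get("domain")
        cookie_path = cookie.get("path")
        if domain is not None and not domain_matches(host, domain):
            continue
        if cookie_path is not None and not path.startswith(cookie_path):
            continue
        item = f'{cookie["name"]}="{cookie["value"]}"'
        # Echo back only the filters the server actually supplied.
        if cookie_path is not None:
            item += f'; $Path="{cookie_path}"'
        if domain is not None:
            item += f'; $Domain="{domain}"'
        parts.append(item)
    return "Cookie: " + "; ".join(parts)


STORED = [
    {"name": "ID", "value": "29046", "domain": ".joes-hardware.com"},
    {"name": "color", "value": "blue"},
    {"name": "support-pref", "value": "L2",
     "domain": "customer-care.joes-hardware.com"},
    {"name": "Coupon", "value": "hammer027", "path": "/tools"},
    {"name": "Coupon", "value": "handvac103", "path": "/tools/cordless"},
]

if __name__ == "__main__":
    print(v1_cookie_header(STORED, "www.joes-hardware.com",
                           "/tools/cordless/specials.html"))
```

The support-pref cookie is dropped because www.joes-hardware.com does not fall inside customer-care.joes-hardware.com; the other four pass their filters.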

Version 1 Cookie2 header and version negotiation

The Cookie2 request header is used to negotiate interoperability between clients and

servers that understand different versions of the cookie specification. The Cookie2

header advises the server that the user agent understands new-style cookies and pro-

vides the version of the cookie standard supported (it would have made more sense

to call it Cookie-Version):

Cookie2: $Version="1"

If the server understands new-style cookies, it recognizes the Cookie2 header and

should send Set-Cookie2 (rather than Set-Cookie) response headers. If a client gets

both a Set-Cookie and a Set-Cookie2 header for the same cookie, it ignores the old

Set-Cookie header.

If a client supports both Version 0 and Version 1 cookies but gets a Version 0 Set-

Cookie header from the server, it should send cookies with the Version 0 Cookie

header. However, the client also should send Cookie2: $Version="1" to give the

server indication that it can upgrade.
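The two negotiation rules above can be sketched from the client's point of view. This is only an illustration under simplifying assumptions: headers are modeled as lists of (name, value) pairs, cookies are identified by name alone, and both function names are invented for the example:

```python
def cookie_name(header_value):
    """Return the NAME part of a Set-Cookie or Set-Cookie2 value."""
    return header_value.split("=", 1)[0].strip()


def set_cookies_to_apply(response_headers):
    """Pick the cookie-setting header values a dual-version client should honor.

    If the same cookie arrives in both Set-Cookie and Set-Cookie2, the old
    Set-Cookie copy is ignored.
    """
    new_style = [v for n, v in response_headers if n.lower() == "set-cookie2"]
    new_names = {cookie_name(v) for v in new_style}
    old_style = [
        v for n, v in response_headers
        if n.lower() == "set-cookie" and cookie_name(v) not in new_names
    ]
    return new_style + old_style


def request_headers_for_v0_server(cookies):
    """Headers to send when the server has only spoken Version 0 so far.

    cookies is a list of (name, value) pairs. The extra Cookie2 header
    advertises that the client could accept Version 1 cookies.
    """
    pairs = "; ".join(f"{name}={value}" for name, value in cookies)
    return [("Cookie", pairs), ("Cookie2", '$Version="1"')]
```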

Cookies and Session Tracking

Cookies can be used to track users as they make multiple transactions to a web site.

E-commerce web sites use session cookies to keep track of users’ shopping carts as

they browse. Let’s take the example of the popular shopping site Amazon.com.

When you type http://www.amazon.com into your browser, you start a chain of

transactions where the web server attaches identification information through a

series of redirects, URL rewrites, and cookie setting.

Figure 11-5 shows a transaction sequence captured from an Amazon.com visit:

• Figure 11-5a—Browser requests Amazon.com root page for the first time.

• Figure 11-5b—Server redirects the client to a URL for the e-commerce software.

• Figure 11-5c—Client makes a request to the redirected URL.

• Figure 11-5d—Server slaps two session cookies on the response and redirects the

user to another URL, so the client will request again with these cookies attached.

This new URL is a fat URL, meaning that some state is embedded into the URL.

If the client has cookies disabled, some basic identification can still be done as

long as the user follows the Amazon.com-generated fat URL links and doesn’t

leave the site.

• Figure 11-5e—Client requests the new URL, but now passes the two attached

cookies.

• Figure 11-5f—Server redirects to the home.html page and attaches two more

cookies.

• Figure 11-5g—Client fetches the home.html page and passes all four cookies.

• Figure 11-5h—Server serves back the content.
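The way cookies pile up across this redirect chain can be imitated with a toy cookie jar. The sketch below is an assumption-laden illustration: it ignores the path, domain, and expires attributes entirely and keeps only each cookie's name and value, using the values from the captured transaction:

```python
def update_jar(jar, set_cookie_values):
    """Record the NAME=VALUE pair from each Set-Cookie value in the jar."""
    for header_value in set_cookie_values:
        pair = header_value.split(";", 1)[0]   # drop path, domain, expires
        name, _, value = pair.partition("=")
        jar[name.strip()] = value.strip()
    return jar


def cookie_header(jar):
    """Combine everything in the jar into one Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())


if __name__ == "__main__":
    jar = {}
    # Response (d) attaches two session cookies.
    update_jar(jar, [
        "session-id=002-1145265-8016838; path=/; domain=.amazon.com",
        "session-id-time=1007884800; path=/; domain=.amazon.com",
    ])
    print("request (e):", cookie_header(jar))
    # Response (f) attaches two more.
    update_jar(jar, [
        "ubid-main=430-8248051-6231206; path=/; domain=.amazon.com",
        "x-main=hQ...Bf; path=/; domain=.amazon.com",
    ])
    print("request (g):", cookie_header(jar))
```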

Cookies and Caching

You have to be careful when caching documents that are involved with cookie trans-

actions. You don’t want to assign one user some past user’s cookie or, worse, show

one user the contents of someone else’s personalized document.

The rules for cookies and caching are not well established. Here are some guiding

principles for dealing with caches:

Mark documents uncacheable if they are

The document owner knows best if a document is uncacheable. Explicitly mark documents uncacheable if they are. Specifically, use Cache-Control: no-cache="Set-Cookie" if the document is cacheable except for the Set-Cookie header. The other, more general practice of using Cache-Control: public for documents that are cacheable promotes bandwidth savings in the Web.

Be cautious about caching Set-Cookie headers

If a response has a Set-Cookie header, you can cache the body (unless told other-

wise), but you should be extra cautious about caching the Set-Cookie header. If

you send the same Set-Cookie header to multiple users, you may be defeating

user targeting.

Some caches delete the Set-Cookie header before storing a response in the cache,

but that also can cause problems, because clients served from the cache will no

longer get cookies slapped on them that they normally would without the cache.

This situation can be improved by forcing the cache to revalidate every request

with the origin server and merging any returned Set-Cookie headers with the cli-

ent response. The origin server can dictate such revalidations by adding this

header to the cached copy:

Cache-Control: must-revalidate, max-age=0

Figure 11-5. The Amazon.com web site uses session cookies to track users

(a) Client to www.amazon.com:

GET / HTTP/1.0
Host: www.amazon.com

(b) www.amazon.com to client:

HTTP/1.1 302 Found
Location: http://www.amazon.com:80/exec/obidos/subst/home/redirect.html

(c) Client to www.amazon.com:

GET /exec/obidos/subst/home/redirect.html HTTP/1.0
Host: www.amazon.com

(d) www.amazon.com to client:

HTTP/1.1 302 Found
Date: Sun, 02 Dec 2001 03:20:47 GMT
Set-cookie: session-id=002-1145265-8016838; path=/; domain=.amazon.com;
    expires=Sunday, 09-Dec-2001 08:00:00 GMT
Set-cookie: session-id-time=1007884800; path=/; domain=.amazon.com;
    expires=Sunday, 09-Dec-2001 08:00:00 GMT

(e) Client to www.amazon.com:

GET /exec/obidos/subst/home/redirect.html/002-1145265-8016838 HTTP/1.0
Host: www.amazon.com
Cookie: session-id=002-1145265-8016838; session-id-time=1007884800

(f) www.amazon.com to client:

HTTP/1.1 302 Found
Date: Sun, 02 Dec 2001 03:45:40 GMT
Set-cookie: ubid-main=430-8248051-6231206; path=/; domain=.amazon.com;
    expires=Tuesday, 01-Jan-2036 08:00:01 GMT
Location: http://www.amazon.com/exec/obidos/subst/home/home.html/002-1145265-8016838
Set-cookie: x-main=hQ...Bf; path=/; domain=.amazon.com;
    expires=Tuesday, 01-Jan-2036 08:00:01 GMT

(g) Client to www.amazon.com:

GET /exec/obidos/subst/home/home.html/002-1145265-8016838 HTTP/1.0
Host: www.amazon.com
Cookie: session-id=002-1145265-8016838; session-id-time=1007884800;
    ubid-main=430-8248051-6231206; x-main=hQ...Bf

(h) www.amazon.com to client: response carrying the page content.

More conservative caches may refuse to cache any response that has a Set-

Cookie header, even though the content may actually be cacheable. Some caches

allow modes when Set-Cookied images are cached, but not text.

Be cautious about requests with Cookie headers

When a request arrives with a Cookie header, it provides a hint that the result-

ing content might be personalized. Personalized content must be flagged

uncacheable, but some servers may erroneously not mark this content as

uncacheable.

Conservative caches may choose not to cache any document that comes in

response to a request with a Cookie header. And again, some caches allow

modes when Cookied images are cached, but not text. The more accepted pol-

icy is to cache images with Cookie headers, with the expiration time set to zero,

thus forcing a revalidate every time.
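These guidelines leave a lot of room for judgment, so the following sketch shows only one way a cache might encode them. Everything here is an assumption made for illustration: the function names, the three action labels, and the choice to strip Set-Cookie from every stored copy. Headers are modeled as lists of (name, value) pairs:

```python
import re


def no_cache_fields(cache_control):
    """Return the lowercase header names listed in no-cache="..." (may be empty)."""
    match = re.search(r'no-cache\s*=\s*"([^"]*)"', cache_control, re.IGNORECASE)
    if not match:
        return set()
    return {f.strip().lower() for f in match.group(1).split(",") if f.strip()}


def plan_for_cache(request_headers, response_headers, conservative=False):
    """Return (action, headers_to_store) for one request/response pair.

    action is "store", "store-revalidate-always", or "do-not-store".
    """
    request_names = {n.lower() for n, _ in request_headers}
    response = [(n.lower(), v) for n, v in response_headers]
    response_names = {n for n, _ in response}
    cache_control = ", ".join(v for n, v in response if n == "cache-control")
    content_type = next((v for n, v in response if n == "content-type"), "")

    # Conservative caches refuse anything that touches cookies.
    if conservative and ("set-cookie" in response_names
                         or "cookie" in request_names):
        return "do-not-store", []

    # Never replay one user's Set-Cookie header to another user.
    blocked = no_cache_fields(cache_control) | {"set-cookie"}
    storable = [(n, v) for n, v in response_headers if n.lower() not in blocked]

    if "cookie" in request_names:
        # The response may be personalized: keep images only, and revalidate.
        if content_type.startswith("image/"):
            return "store-revalidate-always", storable
        return "do-not-store", []
    if "set-cookie" in response_names:
        # Revalidating lets the origin server attach fresh cookies each time.
        return "store-revalidate-always", storable
    return "store", storable
```

A real cache would also have to honor the rest of Cache-Control and the other HTTP caching rules; this sketch isolates only the cookie-related decisions.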

Cookies, Security, and Privacy

Cookies themselves are not believed to be a tremendous security risk, because they

can be disabled and because much of the tracking can be done through log analysis

or other means. In fact, by providing a standardized, scrutinized method for retain-

ing personal information in remote databases and using anonymous cookies as keys,

the frequency of communication of sensitive data from client to server can be

reduced.

Still, it is good to be cautious when dealing with privacy and user tracking, because

there is always potential for abuse. The biggest misuse comes from third-party web

sites using persistent cookies to track users. This practice, combined with IP

addresses and information from the Referer header, has enabled these marketing

companies to build fairly accurate user profiles and browsing patterns.

In spite of all the negative publicity, the conventional wisdom is that the session han-

dling and transactional convenience of cookies outweighs most risks, if you use cau-

tion about who you provide personal information to and review sites’ privacy

policies.

The Computer Incident Advisory Capability (part of the U.S. Department of Energy)

wrote an assessment of the overrepresented dangers of cookies in 1998. Here’s an

excerpt from that report:

CIAC I-034: Internet Cookies

(http://www.ciac.org/ciac/bulletins/i-034.shtml)

PROBLEM:

Cookies are short pieces of data used by web servers to help identify web users. The

popular concepts and rumors about what a cookie can do has reached almost mystical

proportions, frightening users and worrying their managers.

VULNERABILITY ASSESSMENT:

The vulnerability of systems to damage or snooping by using web browser cookies is

essentially nonexistent. Cookies can only tell a web server if you have been there

before and can pass short bits of information (such as a user number) from the web

server back to itself the next time you visit. Most cookies last only until you quit

your browser and then are destroyed. A second type of cookie known as a persistent

cookie has an expiration date and is stored on your disk until that date. A

persistent cookie can be used to track a user's browsing habits by identifying him

whenever he returns to a site. Information about where you come from and what web

pages you visit already exists in a web server's log files and could also be used to

track users browsing habits, cookies just make it easier.

For More Information

Here are a few more useful sources for additional information about cookies:

Simon St.Laurent, McGraw-Hill.

http://www.ietf.org/rfc/rfc2965.txt

RFC 2965, “HTTP State Management Mechanism” (obsoletes RFC 2109).

http://www.ietf.org/rfc/rfc2964.txt

RFC 2964, “Use of HTTP State Management.”

http://home.netscape.com/newsref/std/cookie_spec.html

This classic Netscape document, “Persistent Client State: HTTP Cookies,”

describes the original form of HTTP cookies that are still in common use today.

CHAPTER 12

Basic Authentication

Millions of people use the Web to perform private transactions and access private

data. The Web makes it very easy to access this information, but easy isn’t good

enough. We need assurances about who can look at our sensitive data and who can

perform our privileged transactions. Not all information is intended for the general

public.

We need to feel comfortable that unauthorized users can’t view our online travel

profiles or publish documents onto our web sites without our consent. We need to

make sure our most sensitive corporate-planning documents aren’t available to

unauthorized and potentially unscrupulous members of our organization. And we

need to feel at ease that our personal web communications with our children, our

spouses, and our secret loves all occur with a modicum of privacy.

Servers need a way to know who a user is. Once a server knows who the user is, it

can decide which transactions and resources the user can access. Authentication

means proving who you are; usually, you authenticate by providing a username and

a secret password. HTTP provides a native facility for HTTP authentication. While

it’s certainly possible to “roll your own” authentication facility on top of HTTP

forms and cookies, for many situations, HTTP’s native authentication fits the bill

nicely.

This chapter explains HTTP authentication and delves into the most common form

of HTTP authentication, basic authentication. The next chapter explains a more

powerful technique called digest authentication.

Authentication

Authentication means showing some proof of your identity. When you show a photo

ID, like a passport or a driver’s license, you are showing some proof that you are who

you claim to be. When you type a PIN number into an automatic teller machine, or

type a secret password into a computer’s dialog box, you also are proving that you

are who you claim to be.

Now, none of these schemes are foolproof. Passwords can be guessed or overheard,

and ID cards can be stolen or forged. But each piece of supporting evidence helps to

build a reasonable trust that you are who you say you are.

HTTP’s Challenge/Response Authentication Framework

HTTP provides a native challenge/response framework to make it easy to authenti-

cate users. HTTP’s authentication model is sketched in Figure 12-1.

Whenever a web application receives an HTTP request message, instead of acting on

the request, the server can respond with an “authentication challenge,” challenging

the user to demonstrate who she is by providing some secret information.

The user needs to attach the secret credentials (username and password) when she

repeats the request. If the credentials don’t match, the server can challenge the client

again or generate an error. If the credentials do match, the request completes normally.

Authentication Protocols and Headers

HTTP provides an extensible framework for different authentication protocols,

through a set of customizable control headers. The format and content of the headers

listed in Table 12-1 vary depending on the authentication protocol. The authentica-

tion protocol also is specified in the HTTP authentication headers.

Figure 12-1. Simplified challenge/response authentication

Request (client to server): “Please give me the internal sales forecast.”

Challenge (server to client): “You requested a secret financial document. Please tell me your username and password.”

(The client asks the user for the password.)

Authorization (client to server): “Please give me the internal sales forecast. Here is my username and password: ‘•••••’.”

Success (server to client): “OK. You have access rights. Here is the document.”

HTTP defines two official authentication protocols: basic authentication and digest

authentication. In the future, people are free to devise new protocols that use

HTTP’s challenge/response framework. The rest of this chapter explains basic

authentication. See Chapter 13 for details on digest authentication.

To make this concrete, let’s take a look at Figure 12-2.

When a server challenges a user, it returns a 401 Unauthorized response and

describes how and where to authenticate in the WWW-Authenticate header

(Figure 12-2b).

Table 12-1. Four phases of authentication

Phase | Headers | Description | Method/Status

Request | (none) | The first request has no authentication. | GET

Challenge | WWW-Authenticate | The server rejects the request with a 401 status, indicating that the user needs to provide his username and password. Because the server might have different areas, each with its own password, the server describes the protection area in the WWW-Authenticate header. Also, the authentication algorithm is specified in the WWW-Authenticate header. | 401 Unauthorized

Authorization | Authorization | The client retries the request, but this time attaching an Authorization header specifying the authentication algorithm, username, and password. | GET

Success | Authentication-Info | If the authorization credentials are correct, the server returns the document. Some authorization algorithms return some additional information about the authorization session in the optional Authentication-Info header. | 200 OK

Figure 12-2. Basic authentication example

(a) Client to server:

GET /family/jeff.jpg HTTP/1.0

(b) Server to client:

HTTP/1.0 401 Authorization Required
WWW-Authenticate: Basic realm="Family"

(c) Client to server:

GET /family/jeff.jpg HTTP/1.0
Authorization: Basic YnJpYW4tdG90dHk6T3ch

(d) Server to client:

HTTP/1.0 200 OK
Content-type: image/jpeg

...<image data included>

When the client answers the challenge, it resends the request, this time attaching an encoded username and password and any other authentication parameters in an Authorization header (Figure 12-2c).

When an authorized request is completed successfully, the server returns a normal

status code (e.g., 200 OK) and, for advanced authentication algorithms, might attach

additional information in an Authentication-Info header (Figure 12-2d).
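The server's side of this exchange can be sketched in a few lines of Python. This is a toy under stated assumptions: the USERS table, the handle function, and the use of a plain dictionary for request headers are all invented for the example, and the scheme shown is basic authentication (explained later in this chapter). A real server would not keep plaintext passwords in a dictionary:

```python
import base64

# Hypothetical credential store, for illustration only.
USERS = {"brian-totty": "Ow!"}


def handle(request_headers, realm="Family"):
    """Return (status, response_headers) for a request to a protected resource.

    request_headers is a dict; the key "Authorization" is assumed to be
    spelled exactly that way.
    """
    authorization = request_headers.get("Authorization", "")
    scheme, _, encoded = authorization.partition(" ")
    if scheme.lower() == "basic" and encoded:
        try:
            decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            decoded = ""                     # garbage credentials: challenge again
        username, has_colon, password = decoded.partition(":")
        if has_colon and USERS.get(username) == password:
            # Success phase: serve the document normally.
            return 200, {"Content-type": "image/jpeg"}
    # Challenge phase: no credentials, or credentials that do not match.
    return 401, {"WWW-Authenticate": f'Basic realm="{realm}"'}
```

Calling handle({}) yields the 401 challenge of Figure 12-2b, and repeating the call with the Authorization header from Figure 12-2c yields the 200 response.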

Security Realms

Before we discuss the details of basic authentication, we need to explain how HTTP

allows servers to associate different access rights to different resources. You might

have noticed that the WWW-Authenticate challenge in Figure 12-2b included a

realm directive. Web servers group protected documents into security realms. Each

security realm can have different sets of authorized users.

For example, suppose a web server has two security realms established: one for cor-

porate financial information and another for personal family documents (see

Figure 12-3). Different users will have different access to the realms. The CEO of

your company probably should have access to the sales forecast, but you might not

give her access to your family vacation photos!

Here’s a hypothetical basic authentication challenge, with a realm specified:

HTTP/1.0 401 Unauthorized

WWW-Authenticate: Basic realm="Corporate Financials"

A realm should have a descriptive string name, like “Corporate Financials,” to help

the user understand which username and password to use. It may also be useful to list

the server hostname in the realm name—for example, “executive-committee@big-

company.com”.
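One simple way to picture realms is as a table from URL prefixes to realm names. The sketch below assumes that layout, with the prefixes /corporate/financials/ and /family/ chosen by us to match the example server; real servers usually configure realms per directory in their own configuration formats:

```python
# Hypothetical mapping from URL path prefixes to security realms.
REALMS = [
    ("/corporate/financials/", "Corporate Financials"),
    ("/family/", "Family"),
]


def realm_for(path):
    """Return the realm protecting this path, or None if it is public."""
    for prefix, realm in REALMS:
        if path.startswith(prefix):
            return realm
    return None


def challenge_for(path):
    """Return the lines of a basic authentication challenge, or None."""
    realm = realm_for(path)
    if realm is None:
        return None
    return [
        "HTTP/1.0 401 Unauthorized",
        f'WWW-Authenticate: Basic realm="{realm}"',
    ]
```

Because each realm has its own set of authorized users, credentials that open the Family realm say nothing about access to Corporate Financials.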

Figure 12-3. Security realms in a web server

Server document tree:

index.html
corporate/
    press/          pr1.html, pr2.html
    financials/     sales-forecast.xls     (Corporate financials realm)
family/             jeff.jpg, brian.jpg    (Family realm)

Basic Authentication

Basic authentication is the most prevalent HTTP authentication protocol. Almost

every major client and server implements basic authentication. Basic authentication

was originally described in the HTTP/1.0 specification, but it has since been relo-

cated into RFC 2617, which details HTTP authentication.

In basic authentication, a web server can refuse a transaction, challenging the client

for a valid username and password. The server initiates the authentication challenge

by returning a 401 status code instead of 200 and specifies the security realm being

accessed with the WWW-Authenticate response header. When the browser receives

the challenge, it opens a dialog box requesting the username and password for this

realm. The username and password are sent back to the server in a slightly scram-

bled format inside an Authorization request header.

Basic Authentication Example

Figure 12-2, earlier in this chapter, showed a detailed example of basic authentication:

• In Figure 12-2a, a user requests the personal family photo /family/jeff.jpg.

• In Figure 12-2b, the server sends back a 401 Authorization Required password

challenge for the personal family photo, along with the WWW-Authenticate

header. The header requests basic authentication for the realm named Family.

• In Figure 12-2c, the browser receives the 401 challenge and pops open a dialog

box asking for the username and password for the Family realm. When the user

enters the username and password, the browser joins them with a colon,

encodes them into a “scrambled” base-64 representation (discussed in the next

section), and sends them back in the Authorization header.

• In Figure 12-2d, the server decodes the username and password, verifies that they

are correct, and returns the requested document in an HTTP 200 OK message.

The HTTP basic authentication WWW-Authenticate and Authorization headers are

summarized in Table 12-2.

Table 12-2. Basic authentication headers

Challenge/Response | Header syntax and description

Challenge (server to client) | There may be different passwords for different parts of the site. The realm is a quoted string that names the set of documents being requested, so the user knows which password to use. Syntax: WWW-Authenticate: Basic realm=quoted-realm

Response (client to server) | The username and password are joined together by a colon (:) and then converted to base-64 encoding, making it a bit easier to include international characters in usernames and passwords and making it less likely that a cursory examination will yield usernames and passwords while watching network traffic. Syntax: Authorization: Basic base64-username-and-password

Note that the basic authentication protocol does not make use of the Authentication-

Info header we showed in Table 12-1.

Base-64 Username/Password Encoding

HTTP basic authentication packs the username and password together (separated by

a colon), and encodes them using the base-64 encoding method. If you don’t know

what base-64 encoding is, don’t worry. You don’t need to know much about it, and

if you are curious, you can read all about it in Appendix E. In a nutshell, base-64

encoding takes a sequence of 8-bit bytes and breaks the sequence of bits into 6-bit

chunks. Each 6-bit piece is used to pick a character in a special 64-character alpha-

bet, consisting mostly of letters and numbers.

Figure 12-4 shows an example of using base-64 encoding for basic authentication.

Here, the username is “brian-totty” and the password is “Ow!”. The browser joins the

username and password with a colon, yielding the packed string “brian-totty:Ow!”.

This string is then base 64–encoded into this mouthful: “YnJpYW4tdG90dHk6T3ch”.
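The same packing and encoding takes only a few lines with Python's standard base64 module. The function name is ours, and encoding the packed string as UTF-8 is an assumption, since the text does not say which character set a client should use:

```python
import base64


def basic_authorization(username, password):
    """Build the Authorization header line for basic authentication."""
    packed = f"{username}:{password}"                  # join with a colon
    token = base64.b64encode(packed.encode("utf-8"))   # assumed UTF-8
    return "Authorization: Basic " + token.decode("ascii")


if __name__ == "__main__":
    print(basic_authorization("brian-totty", "Ow!"))
    # prints: Authorization: Basic YnJpYW4tdG90dHk6T3ch
```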

Base-64 encoding was invented to take strings of binary, text, and international char-

acter data (which caused problems on some systems) and convert them temporarily

into a portable alphabet for transmission. The original strings could then be decoded

on the remote end without fear of transmission corruption.

Base-64 encoding can be useful for usernames and passwords that contain international characters or other characters that are illegal in HTTP headers (such as quotation marks, colons, and carriage returns). Also, because base-64 encoding trivially scrambles the username and password, it can help prevent administrators from accidentally viewing usernames and passwords while administering servers and networks.

Figure 12-4. Generating a basic Authorization header from username and password

(a) Prompt for username and password: username “brian-totty”, password “Ow!”

(b) Pack username and password with colon: brian-totty:Ow!

(c) Base-64 encode the packed string: YnJpYW4tdG90dHk6T3ch

(d) Send authorization (client to server):

GET /family/jeff.jpg HTTP/1.0
Authorization: Basic YnJpYW4tdG90dHk6T3ch

Proxy Authentication

Authentication also can be done by intermediary proxy servers. Some organizations

use proxy servers to authenticate users before letting them access servers, LANs, or

wireless networks. Proxy servers can be a convenient way to provide unified access

control across an organization’s resources, because access policies can be centrally

administered on the proxy server. The first step in this process is to establish the

identity via proxy authentication.

The steps involved in proxy authentication are identical to those of web server authentication. However, the headers and status codes are different. Table 12-3 contrasts the

status codes and headers used in web server and proxy authentication.
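Because only the status code and header names change, one piece of code can serve both roles. The sketch below is an illustration, with the HEADERS table and function names invented for the example, that parameterizes a basic challenge by whether a web server or a proxy is doing the asking:

```python
# Status code, challenge header, and credentials header for each role.
HEADERS = {
    "server": (401, "WWW-Authenticate", "Authorization"),
    "proxy": (407, "Proxy-Authenticate", "Proxy-Authorization"),
}


def challenge(kind, realm):
    """Return (status, headers) for a basic challenge from a server or proxy."""
    status, challenge_header, _ = HEADERS[kind]
    return status, {challenge_header: f'Basic realm="{realm}"'}


def credentials_header(kind, token):
    """Return the request header carrying base-64 credentials for that role."""
    _, _, credentials = HEADERS[kind]
    return {credentials: f"Basic {token}"}
```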

The Security Flaws of Basic Authentication

Basic authentication is simple and convenient, but it is not secure. It should only be

used to prevent unintentional access from nonmalicious parties or used in combina-

tion with an encryption technology such as SSL.

Consider the following security flaws:

1. Basic authentication sends the username and password across the network in a

form that can trivially be decoded. In effect, the secret password is sent in the

clear, for anyone to read and capture. Base-64 encoding obscures the username

and password, making it less likely that friendly parties will glean passwords by

accidental network observation. However, given a base 64–encoded username

and password, the decoding can be performed trivially by reversing the encod-

ing process. Decoding can even be done in seconds, by hand, with pencil and

paper! Base 64–encoded passwords are effectively sent “in the clear.” Assume

that motivated third parties will intercept usernames and passwords sent by

basic authentication. If this is a concern, send all your HTTP transactions over

SSL encrypted channels, or use a more secure authentication protocol, such as

digest authentication.

Table 12-3. Web server versus proxy authentication

Web server | Proxy server

Unauthorized status code: 401 | Unauthorized status code: 407

WWW-Authenticate | Proxy-Authenticate

Authorization | Proxy-Authorization

Authentication-Info | Proxy-Authentication-Info

2. Even if the secret password were encoded in a scheme that was more compli-

cated to decode, a third party could still capture the garbled username and pass-

word and replay the garbled information to origin servers over and over again to

gain access. No effort is made to prevent these replay attacks.

3. Even if basic authentication is used for noncritical applications, such as corpo-

rate intranet access control or personalized content, social behavior makes this

dangerous. Many users, overwhelmed by a multitude of password-protected ser-

vices, share usernames and passwords. A clever, malicious party may capture a

username and password in the clear from a free Internet email site, for example,

and find that the same username and password allow access to critical online

banking sites!

4. Basic authentication offers no protection against proxies or intermediaries that

act as middlemen, leaving authentication headers intact but modifying the rest of

the message to dramatically change the nature of the transaction.

5. Basic authentication is vulnerable to spoofing by counterfeit servers. If a user can

be led to believe that he is connecting to a valid host protected by basic authenti-

cation when, in fact, he is connecting to a hostile server or gateway, the attacker

can request a password, store it for later use, and feign an error.
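The first flaw is easy to demonstrate. Anyone who sees a basic Authorization header value can recover the credentials with a few lines like these (a sketch; the function name is ours, and UTF-8 is assumed as the character set):

```python
import base64


def decode_basic(header_value):
    """Recover (username, password) from a value like 'Basic YnJp...'."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a basic authorization value")
    packed = base64.b64decode(token).decode("utf-8")
    # Split on the first colon only; the password may itself contain colons.
    username, _, password = packed.partition(":")
    return username, password


if __name__ == "__main__":
    print(decode_basic("Basic YnJpYW4tdG90dHk6T3ch"))
    # prints: ('brian-totty', 'Ow!')
```

No key or secret is involved at any point, which is why base-64 should be thought of as an encoding, not as encryption.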

This all said, basic authentication still is useful for providing convenient personaliza-

tion or access control to documents in a friendly environment, or where privacy is

desired but not absolutely necessary. In this way, basic authentication is used to pre-

vent accidental or casual access by curious users.*

For example, inside a corporation, product management may password-protect

future product plans to limit premature distribution. Basic authentication makes it

sufficiently inconvenient for friendly parties to access this data.† Likewise, you might

password-protect personal photos or private web sites that aren’t top-secret or don’t

contain valuable information, but really aren’t anyone else’s business either.

Basic authentication can be made secure by combining it with encrypted data trans-

mission (such as SSL) to conceal the username and password from malicious individ-

uals. This is a common technique.

We discuss secure encryption in Chapter 14. The next chapter explains a more

sophisticated HTTP authentication protocol, digest authentication, that has stron-

ger security properties than basic authentication.

* Be careful that the username/password in basic authentication is not the same as the password on your more

secure systems, or malicious users can use them to break into your secure accounts!

† While not very secure, internal employees of the company usually are unmotivated to maliciously capture

passwords. That said, corporate espionage does occur, and vengeful, disgruntled employees do exist, so it is

wise to place any data that would be very harmful if maliciously acquired under a stronger security scheme.


For More Information

For more information on basic authentication and LDAP, see:

http://www.ietf.org/rfc/rfc2617.txt

RFC 2617, “HTTP Authentication: Basic and Digest Access Authentication.”

http://www.ietf.org/rfc/rfc2616.txt

RFC 2616 “Hypertext Transfer Protocol—HTTP/1.1.”


CHAPTER 13

Digest Authentication

Basic authentication is convenient and flexible but completely insecure. Usernames

and passwords are sent in the clear,* and there is no attempt to protect messages

from tampering. The only way to use basic authentication securely is to use it in con-

junction with SSL.
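To make "sent in the clear" concrete, here is a minimal Python sketch that recovers the credentials from the basic Authorization value shown in Figure 13-3c. The function name decode_basic_credentials is our own, chosen only for illustration; the point is that the base-64 step is an encoding, not encryption, so anyone who captures the header can reverse it.

```python
import base64

def decode_basic_credentials(header_value: str) -> tuple[str, str]:
    """Recover (username, password) from a 'Basic <base64>' header value.

    Hypothetical helper for illustration only.
    """
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization value")
    decoded = base64.b64decode(encoded).decode("utf-8")
    # The username ends at the first colon; the rest is the password.
    username, _, password = decoded.partition(":")
    return username, password

# The value below is the one carried in Figure 13-3c.
print(decode_basic_credentials("Basic YnJpYW4tdG90dHk6T3ch"))
```

No key or secret is needed to run the decoding, which is why basic authentication is safe only over an encrypted channel.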

Digest authentication was developed as a compatible, more secure alternative to

basic authentication. We devote this chapter to the theory and practice of digest

authentication. Even though digest authentication is not yet in wide use, the con-

cepts still are important for anyone implementing secure transactions.

The Improvements of Digest Authentication

Digest authentication is an alternate HTTP authentication protocol that tries to fix

the most serious flaws of basic authentication. In particular, digest authentication:

• Never sends secret passwords across the network in the clear

• Prevents unscrupulous individuals from capturing and replaying authentication

handshakes

• Optionally can guard against tampering with message contents

• Guards against several other common forms of attacks

Digest authentication is not the most secure protocol possible.† Many needs for

secure HTTP transactions cannot be met by digest authentication. For those needs,

Transport Layer Security (TLS) and Secure HTTP (HTTPS) are more appropriate

protocols.

* Usernames and passwords are scrambled using a trivial base-64 encoding, which can be decoded easily. This

protects against unintentional accidental viewing but offers no protection against malicious parties.

† For example, compared to public key–based mechanisms, digest authentication does not provide a strong

authentication mechanism. Also, digest authentication offers no confidentiality protection beyond protect-

ing the actual password—the rest of the request and response are available to eavesdroppers.


However, digest authentication is significantly stronger than basic authentication,

which it was designed to replace. Digest authentication also is stronger than many

popular schemes proposed for other Internet services, such as CRAM-MD5, which

has been proposed for use with LDAP, POP, and IMAP.

To date, digest authentication has not been widely deployed. However, because of the

security risks inherent to basic authentication, the HTTP architects counsel in RFC

2617 that “any service in present use that uses Basic should be switched to Digest as

soon as practical.”* It is not yet clear how successful this standard will become.

Using Digests to Keep Passwords Secret

The motto of digest authentication is “never send the password across the network.”

Instead of sending the password, the client sends a “fingerprint” or “digest” of the

password, which is an irreversible scrambling of the password. The client and the

server both know the secret password, so the server can verify that the digest pro-

vided a correct match for the password. Given only the digest, a bad guy has no easy

way to find what password it came from, other than going through every password

in the universe, trying each one!†

Let’s see how this works (this is a simplified version):

• In Figure 13-1a, the client requests a protected document.

• In Figure 13-1b, the server refuses to serve the document until the client authen-

ticates its identity by proving it knows the password. The server issues a chal-

lenge to the client, asking for the username and a digested form of the password.

• In Figure 13-1c, the client proves that it knows the password by passing along

the digest of the password. The server knows the passwords for all the users,‡ so

it can verify that the user knows the password by comparing the client-supplied

digest with the server’s own internally computed digest. Another party would

not easily be able to make up the right digest if it didn’t know the password.

• In Figure 13-1d, the server compares the client-provided digest with the server’s

internally computed digest. If they match, it shows that the client knows the

password (or made a really lucky guess!). The digest function can be set to gen-

erate so many digits that lucky guesses effectively are impossible. When the

server verifies the match, the document is served to the client—all without ever

sending the password over the network.

* There has been significant debate about the relevance of digest authentication, given the popularity and

widespread adoption of SSL-encrypted HTTP. Time will tell if digest authentication gains the critical mass

required.

† There are techniques, such as dictionary attacks, where common passwords are tried first. These cryptanal-

ysis techniques can dramatically ease the process of cracking passwords.

‡ In fact, the server really needs to know only the digests of the passwords.


We’ll discuss the particular headers used in digest authentication in more detail in

Table 13-8.

One-Way Digests

A digest is a “condensation of a body of information.”* Digests act as one-way
functions, typically converting an infinite number of possible input values into a
finite range of condensations.† One popular digest function, MD5,‡ converts any
arbitrary sequence of bytes, of any length, into a 128-bit digest.

128 bits = 2^128, or about 3.4 × 10^38 (roughly
340,000,000,000,000,000,000,000,000,000,000,000,000) possible distinct condensations.

Figure 13-1. Using digests for password-obscured authentication

* Merriam-Webster dictionary, 1998.

† In theory, because we are converting an infinite number of input values into a finite number of output values,

it is possible to have two distinct inputs map to the same digest. This is called a collision. In practice, the

number of potential outputs is so large that the chance of a collision in real life is vanishingly small and, for

the purpose of password matching, unimportant.

‡ MD5 stands for “Message Digest #5,” one in a series of digest algorithms. The Secure Hash Algorithm (SHA)

is another popular digest function.

[Figure 13-1 panels, each showing a client and a server connected across the Internet:
(a) Request. Client: “Please give me the internal sales forecast.”
(b) Challenge. Server: “You requested a secret financial document. Please tell me your
username and password digest.”
(c) The client asks the user for a username and password, computes digest(“Ow!”) = A3F5,
and sends: “My username is ‘bri.’ My digested password is ‘A3F5.’”
(d) Success. The server finds a match: “OK. The digest you sent me matches the digest of
my internal password, so here is the document.”]


What is important about these digests is that if you don’t know the secret password,

you’ll have an awfully hard time guessing the correct digest to send to the server.

And likewise, if you have the digest, you’ll have an awfully hard time figuring out

which of the effectively infinite number of input values generated it.

The 128 bits of MD5 output often are written as 32 hexadecimal characters, each

character representing 4 bits. Table 13-1 shows a few examples of MD5 digests of

sample inputs. Notice how MD5 takes arbitrary inputs and yields a fixed-length

digest output.
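The fixed-length property is easy to check for yourself. The following minimal sketch uses Python's standard hashlib module; the helper name md5_hex is ours, and we assume the inputs are encoded as UTF-8 before hashing.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the MD5 digest of text as 32 uppercase hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()

# Inputs of very different lengths all yield 32 hex characters (128 bits).
for sample in ["Hi", "bri:Ow!", "3.1415926535897", "x" * 10_000]:
    digest = md5_hex(sample)
    print(f"{len(sample):>6} input chars -> {digest} ({len(digest) * 4} bits)")
```

Each hexadecimal character carries 4 bits, so 32 characters account for the full 128-bit digest regardless of how long the input was.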

Digest functions sometimes are called cryptographic checksums, one-way hash func-

tions, or fingerprint functions.

Using Nonces to Prevent Replays

One-way digests save us from having to send passwords in the clear. We can just

send a digest of the password instead, and rest assured that no malicious party can

easily decode the original password from the digest.

Unfortunately, obscured passwords alone do not save us from danger, because a bad

guy can capture the digest and replay it over and over again to the server, even

though the bad guy doesn’t know the password. The digest is just as good as the

password.

Table 13-1. MD5 digest examples

Input                                    MD5 digest
“Hi”                                     C1A5298F939E87E8F962A5EDFC206918
“bri:Ow!”                                BEAAA0E34EBDB072F8627C038AB211F8
“3.1415926535897”                        475B977E19ECEE70835BC6DF46F4F6DE
“http://www.http-guide.com/index.htm”    C617C0C7D1D05F66F595E22A4B0EAAA5

Input:
“WE hold these Truths to be self-evident, that all Men are created equal,
that they are endowed by their Creator with certain unalienable Rights,
that among these are Life, Liberty and the Pursuit of Happiness—That to
secure these Rights, Governments are instituted among Men, deriving their
just Powers from the Consent of the Governed, that whenever any Form of
Government becomes destructive of these Ends, it is the Right of the People
to alter or to abolish it, and to institute new Government, laying its
Foundation on such Principles, and organizing its Powers in such Form, as to
them shall seem most likely to effect their Safety and Happiness.”
MD5 digest:
66C4EF58DA7CB956BD04233FBB64E0A4

To prevent such replay attacks, the server can pass along to the client a special token
called a nonce,* which changes frequently (perhaps every millisecond, or for every
authentication). The client appends this nonce token to the password before
computing the digest.

* The word nonce means “the present occasion” or “the time being.” In a computer-security sense, the nonce
captures a particular point in time and figures that into the security calculations.

Mixing the nonce in with the password causes the digest to change each time the

nonce changes. This prevents replay attacks, because the recorded password digest is

valid only for a particular nonce value, and without the secret password, the attacker

cannot compute the correct digest.
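The effect of the nonce can be sketched in a few lines of Python. This is the simplified scheme described in this section (the nonce is appended to the password before hashing), not the full RFC 2617 calculation covered later in the chapter. The function names are ours, and the use of secrets.token_hex to generate nonces is an assumption made only for the example.

```python
import hashlib
import secrets

def nonced_digest(password: str, nonce: str) -> str:
    """Digest of the password with the server's nonce appended (simplified)."""
    return hashlib.md5((password + nonce).encode("utf-8")).hexdigest()

def server_accepts(stored_password: str, current_nonce: str, client_digest: str) -> bool:
    """The server recomputes the digest for the nonce it currently expects."""
    expected = nonced_digest(stored_password, current_nonce)
    return secrets.compare_digest(expected, client_digest)

password = "Ow!"
first_nonce = secrets.token_hex(16)
captured = nonced_digest(password, first_nonce)  # what an eavesdropper records
print(server_accepts(password, first_nonce, captured))   # accepted the first time

second_nonce = secrets.token_hex(16)                      # the server has moved on
print(server_accepts(password, second_nonce, captured))   # the replay is rejected
```

The captured digest is useless once the nonce changes, because producing the digest for the new nonce requires the password itself.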

Digest authentication requires the use of nonces, because a trivial replay weakness

would make un-nonced digest authentication effectively as weak as basic authentica-

tion. Nonces are passed from server to client in the WWW-Authenticate challenge.

The Digest Authentication Handshake

The HTTP digest authentication protocol is an enhanced version of authentication
that uses headers similar to those used in basic authentication. Some new options are
added to the traditional headers, and one new optional header, Authentication-Info, is
added.

The simplified three-phase handshake of digest authentication is depicted in

Figure 13-2.

Here’s what’s happening in Figure 13-2:

• In Step 1, the server computes a nonce value. In Step 2, the server sends the

nonce to the client in a WWW-Authenticate challenge message, along with a list

of algorithms that the server supports.

Figure 13-2. Digest authentication handshake

[Figure: message flow between client and server.
(1) Server generates nonce.
(2) Server sends realm, nonce, algorithms in a WWW-Authenticate (challenge) message.
(3) Client chooses algorithm from set [generates response digest] [generates client-nonce].
(4) Client sends response digest [sends algorithm] [sends client nonce] in an
Authorization (response) message.
(5) Server verifies digest [generates rspauth digest] [generates next nonce].
(6) Server sends next nonce [sends client rspauth digest] in an Authentication-Info
(info) message.
(7) Client verifies rspauth digest.]

• In Step 3, the client selects an algorithm and computes the digest of the secret

password and the other data. In Step 4, it sends the digest back to the server in

an Authorization message. If the client wants to authenticate the server, it can

send a client nonce.

• In Step 5, the server receives the digest, chosen algorithm, and supporting data

and computes the same digest that the client did. The server then compares the

locally generated digest with the network-transmitted digest and validates that

they match. If the client symmetrically challenged the server with a client nonce,

a client digest is created. Additionally, the next nonce can be precomputed and

handed to the client in advance, so the client can preemptively issue the right

digest the next time.

Many of these pieces of information are optional and have defaults. To clarify things,

Figure 13-3 compares the messages sent for basic authentication (Figure 13-3a–d)

with a simple example of digest authentication (Figure 13-3e–h).
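Before a client can respond to a digest challenge, it has to pull the directives out of the WWW-Authenticate value. The sketch below shows one way to do that in Python, assuming the directives are comma-separated name=value pairs with optional double quotes. The function name parse_digest_challenge and the regular expression are ours, and the sketch does not handle escaped quotes inside values.

```python
import re

# name=value pairs; the value is either a quoted string or a bare token.
_DIRECTIVE = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|([^\s,]+))')

def parse_digest_challenge(header_value: str) -> dict[str, str]:
    """Split a 'Digest ...' challenge value into a dict of directives."""
    scheme, _, rest = header_value.strip().partition(" ")
    if scheme.lower() != "digest":
        raise ValueError("not a Digest challenge")
    params = {}
    for name, quoted, bare in _DIRECTIVE.findall(rest):
        params[name.lower()] = quoted if quoted else bare
    return params

challenge = ('Digest realm="Shopping Cart", qop="auth,auth-int", '
             'nonce="66C4EF58DA7CB956BD04233FBB64E0A4"')
params = parse_digest_challenge(challenge)
print(params["realm"], params["qop"].split(","), params["nonce"])
```

Note that the qop value is itself a comma-separated list inside quotes, which is why a naive split on commas is not enough.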

Now let’s look a bit more closely at the internal workings of digest authentication.

Digest Calculations

The heart of digest authentication is the one-way digest of the mix of public informa-

tion, secret information, and a time-limited nonce value. Let’s look now at how the

digests are computed. The digest calculations generally are straightforward.* Sample

source code is provided in Appendix F.

Digest Algorithm Input Data

Digests are computed from three components:

• A pair of functions consisting of a one-way hash function H(d) and digest

KD(s,d), where s stands for secret and d stands for data

• A chunk of data containing security information, including the secret password,

called A1

• A chunk of data containing nonsecret attributes of the request message, called A2

The two pieces of data, A1 and A2, are processed by H and KD to yield a digest.

The Algorithms H(d) and KD(s,d)

Digest authentication supports the selection of a variety of digest algorithms. The

two algorithms suggested in RFC 2617 are MD5 and MD5-sess (where “sess” stands

for session), and the algorithm defaults to MD5 if no other algorithm is specified.

* However, they are made a little more complicated for beginners by the optional compatibility modes of RFC

2617 and by the lack of background material in the specifications. We’ll try to help...

Figure 13-3. Basic versus digest authentication syntax

Basic authentication

(a) Query (client to server)
    GET /cgi-bin/checkout?cart=17854 HTTP/1.1

(b) Challenge (server to client)
    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: Basic realm="Shopping Cart"

(c) Response (client to server, after the user fills in the “Shopping Cart”
    username/password dialog)
    GET /cgi-bin/checkout?cart=17854 HTTP/1.1
    Authorization: Basic YnJpYW4tdG90dHk6T3ch

(d) Success (server to client)
    HTTP/1.1 200 OK
    ...

Digest authentication

(e) Query (client to server)
    GET /cgi-bin/checkout?cart=17854 HTTP/1.1

(f) Challenge (server to client)
    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: Digest
        realm="Shopping Cart",
        qop="auth,auth-int",
        nonce="66C4EF58DA7CB956BD04233FBB64E0A4"

(g) Response (client to server)
    GET /cgi-bin/checkout?cart=17854 HTTP/1.1
    Authorization: Digest
        username="bri",
        realm="Shopping Cart",
        nonce="66C4EF58DA7CB956BD04233FBB64E0A4",
        uri="/cgi-bin/checkout?cart=17854",
        qop="auth",
        nc=00000001,
        cnonce="CFA9207102EA210EA210FFC1120F6001110D073",
        response="E483C94F0B3CA29109A7BA83D10FE519"

(h) Success (server to client)
    HTTP/1.1 200 OK
    Authentication-Info:
        nextnonce="29FE72D109C7EF23841AB914F0C3B831",
        qop="auth",
        rspauth="89F5A4CE6FA932F6C4DA120CEB754290",
        cnonce="CFA9207102EA210EA210FFC1120F6001110D073"
    ...

If either MD5 or MD5-sess is used, the H function computes the MD5 of the data,

and the KD digest function computes the MD5 of the colon-joined secret and nonse-

cret data. In other words:

H(<data>) = MD5(<data>)

KD(<secret>,<data>) = H(concatenate(<secret>:<data>))
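These two definitions translate almost directly into code. Here is a minimal Python sketch, assuming the MD5 output is rendered as 32 lowercase hexadecimal characters and the inputs are UTF-8 text; the names H and KD simply mirror the notation above.

```python
import hashlib

def H(data: str) -> str:
    """H(<data>) = MD5(<data>), as a 32-character hex string."""
    return hashlib.md5(data.encode("utf-8")).hexdigest()

def KD(secret: str, data: str) -> str:
    """KD(<secret>,<data>) = H(<secret>:<data>)."""
    return H(f"{secret}:{data}")

print(KD("secret", "data") == H("secret:data"))  # the two forms are identical
```

KD adds nothing cryptographically beyond H; it just fixes the convention that the secret and nonsecret parts are joined by a colon.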

The Security-Related Data (A1)

The chunk of data called A1 is a product of secret and protection information, such

as the username, password, protection realm, and nonces. A1 pertains only to secu-

rity information, not to the underlying message itself. A1 is used along with H, KD,

and A2 to compute digests.

RFC 2617 defines two ways of computing A1, depending on the algorithm chosen:

MD5

One-way hashes are run for every request; A1 is the colon-joined triple of user-

name, realm, and secret password.

MD5-sess

The hash function is run only once, on the first WWW-Authenticate hand-

shake; the CPU-intensive hash of username, realm, and secret password is done

once and prepended to the current nonce and client nonce (cnonce) values.

The definitions of A1 are shown in Table 13-2.

The Message-Related Data (A2)

The chunk of data called A2 represents information about the message itself, such as

the URL, request method, and message entity body. A2 is used to help protect

against method, resource, or message tampering. A2 is used along with H, KD, and

A1 to compute digests.

RFC 2617 defines two schemes for A2, depending on the quality of protection (qop)

chosen:

• The first scheme involves only the HTTP request method and URL. This is used

when qop=“auth”, which is the default case.

• The second scheme adds in the message entity body to provide a degree of mes-

sage integrity checking. This is used when qop=“auth-int”.

Table 13-2. Definitions for A1 by algorithm

Algorithm    A1
MD5          A1 = <user>:<realm>:<password>
MD5-sess     A1 = MD5(<user>:<realm>:<password>):<nonce>:<cnonce>
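The two A1 definitions in Table 13-2 can be sketched as a single Python function. The function name compute_a1 and its parameters are ours, and we assume the inner MD5 for MD5-sess is rendered as a hex string before being joined with the nonces.

```python
import hashlib

def md5_hex(data: str) -> str:
    return hashlib.md5(data.encode("utf-8")).hexdigest()

def compute_a1(algorithm, user, realm, password, nonce="", cnonce=""):
    """Build A1 for the MD5 and MD5-sess algorithms (see Table 13-2)."""
    base = f"{user}:{realm}:{password}"
    if algorithm in (None, "MD5"):      # MD5 is the default algorithm
        return base
    if algorithm == "MD5-sess":
        # Assumption: the session form uses the hex digest of the base triple.
        return f"{md5_hex(base)}:{nonce}:{cnonce}"
    raise ValueError(f"unsupported algorithm: {algorithm}")

print(compute_a1("MD5", "bri", "Shopping Cart", "Ow!"))
print(compute_a1("MD5-sess", "bri", "Shopping Cart", "Ow!",
                 nonce="66C4EF58DA7CB956BD04233FBB64E0A4", cnonce="0a4f113b"))
```

With MD5-sess, the hash of the username, realm, and password can be computed once and cached for the session, which is the point of the variant.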


The definitions of A2 are shown in Table 13-3.

The request-method is the HTTP request method. The uri-directive-value is the

request URI from the request line. This may be “*,” an “absoluteURL,” or an “abs_

path,” but it must agree with the request URI. In particular, it must be an absolute

URL if the request URI is an absoluteURL.
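The request form of A2 can be sketched as follows. The function name compute_a2 is ours, and we assume the entity body is available as bytes and that its hash is rendered in hex, as with the other digests in this chapter.

```python
import hashlib

def compute_a2(method: str, uri: str, qop=None, entity_body: bytes = b"") -> str:
    """Build A2 for a request digest (see Table 13-3)."""
    if qop in (None, "auth"):
        return f"{method}:{uri}"
    if qop == "auth-int":
        # Integrity protection mixes in the hash of the entity body.
        return f"{method}:{uri}:{hashlib.md5(entity_body).hexdigest()}"
    raise ValueError(f"unsupported qop: {qop}")

print(compute_a2("GET", "/cgi-bin/checkout?cart=17854", "auth"))
print(compute_a2("POST", "/cgi-bin/checkout", "auth-int", b"cart=17854"))
```

Because the method and URI are part of A2, a digest captured for a GET of one resource cannot be reused for a POST or for a different resource.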

Overall Digest Algorithm

RFC 2617 defines two ways of computing digests, given H, KD, A1, and A2:

• The first way is intended to be compatible with the older specification RFC

2069, used when the qop option is missing. It computes the digest using the

hash of the secret information and the nonced message data.

• The second way is the modern, preferred approach—it includes support for nonce

counting and symmetric authentication. This approach is used whenever qop is

“auth” or “auth-int”. It adds nonce count, qop, and cnonce data to the digest.

The definitions for the resulting digest function are shown in Table 13-4. Notice the

resulting digests use H, KD, A1, and A2.

It’s easy to get lost in all the layers of derivational encapsulation. This is one of
the reasons that some readers have difficulty with RFC 2617. To make it a bit
easier, Table 13-5 expands away the H and KD definitions and leaves digests in
terms of A1 and A2.

Table 13-3. Definitions for A2 by algorithm (request digests)

qop          A2
undefined    <request-method>:<uri-directive-value>
auth         <request-method>:<uri-directive-value>
auth-int     <request-method>:<uri-directive-value>:H(<request-entity-body>)

Table 13-4. Old and new digest algorithms

qop                 Digest algorithm                                Notes
undefined           KD(H(A1), <nonce>:H(A2))                        Deprecated
auth or auth-int    KD(H(A1), <nonce>:<nc>:<cnonce>:<qop>:H(A2))    Preferred

Table 13-5. Unfolded digest algorithm cheat sheet

qop          Algorithm                        Unfolded algorithm
undefined    <undefined>, MD5, or MD5-sess    MD5(MD5(A1):<nonce>:MD5(A2))

Digest Authentication Session

The client response to a WWW-Authenticate challenge for a protection space starts

an authentication session with that protection space (the realm combined with the

canonical root of the server being accessed defines a “protection space”).

The authentication session lasts until the client receives another WWW-Authenti-

cate challenge from any server in the protection space. A client should remember the

username, password, nonce, nonce count, and opaque values associated with an

authentication session to use to construct the Authorization header in future

requests within that protection space.

When the nonce expires, the server can choose to accept the old Authorization

header information, even though the nonce value included may not be fresh. Alterna-

tively, the server may return a 401 response with a new nonce value, causing the cli-

ent to retry the request; by specifying “stale=true” with this response, the server tells

the client to retry with the new nonce without prompting for a new username and

password.

Preemptive Authorization

In normal authentication, each request requires a request/challenge cycle before the

transaction can be completed. This is depicted in Figure 13-4a.

This request/challenge cycle can be eliminated if the client knows in advance what

the next nonce will be, so it can generate the correct Authorization header before the

server asks for it. If the client can compute the Authorization header before it is

requested, the client can preemptively issue the Authorization header to the server,

without first going through a request/challenge. The performance impact is depicted

in Figure 13-4b.

Preemptive authorization is trivial (and common) for basic authentication. Browsers

commonly maintain client-side databases of usernames and passwords. Once a user

authenticates with a site, the browser commonly sends the correct Authorization

header for subsequent requests to that URL (see Chapter 12).

Table 13-5. Unfolded digest algorithm cheat sheet (continued)

qop          Algorithm                        Unfolded algorithm
auth         <undefined>, MD5, or MD5-sess    MD5(MD5(A1):<nonce>:<nc>:<cnonce>:<qop>:MD5(A2))
auth-int     <undefined>, MD5, or MD5-sess    MD5(MD5(A1):<nonce>:<nc>:<cnonce>:<qop>:MD5(A2))

Preemptive authorization is a bit more complicated for digest authentication,

because of the nonce technology intended to foil replay attacks. Because the server

generates arbitrary nonces, there isn’t always a way for the client to determine what

Authorization header to send until it receives a challenge.

Digest authentication offers a few means for preemptive authorization while retain-

ing many of the safety features. Here are three potential ways a client can obtain the

correct nonce without waiting for a new WWW-Authenticate challenge:

• Server pre-sends the next nonce in the Authentication-Info success header.

• Server allows the same nonce to be reused for a small window of time.

• Both the client and server use a synchronized, predictable nonce-generation

algorithm.

Figure 13-4. Preemptive authorization reduces message count

Server

Request

Client

Challenge

Request+authorization

Success

Request

Challenge

Request+authorization

Success

Request

Challenge

Request+authorization

Success

(a) Normal request/challenge

Server

Request

Client

Challenge

Request+authorization

Success+nonceinfo

Request+authorization

Success+nonceinfo

Request+authorization

Success

(b) Preemptive authorization


Next nonce pregeneration

The next nonce value can be provided in advance to the client by the Authentication-

Info success header. This header is sent along with the 200 OK response from a pre-

vious successful authentication.

Authentication-Info: nextnonce="<nonce-value>"

Given the next nonce, the client can preemptively issue an Authorization header.

While this preemptive authorization avoids a request/challenge cycle (speeding up

the transaction), it also effectively nullifies the ability to pipeline multiple requests to

the same server, because the next nonce value must be received before the next

request can be issued. Because pipelining is expected to be a fundamental technol-

ogy for latency avoidance, the performance penalty may be large.
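On the client side, next-nonce handling amounts to a small piece of per-session state. The following Python sketch is one hypothetical way to keep it; the class name NonceState and its methods are ours, and we assume the Authentication-Info directives have already been parsed into a dictionary.

```python
from dataclasses import dataclass

@dataclass
class NonceState:
    """Client-side record of the nonce currently in use (hypothetical)."""
    nonce: str
    count: int = 0

    def next_nc(self) -> str:
        """Increment the use counter and return it as 8 hex digits."""
        self.count += 1
        return f"{self.count:08x}"

    def update_from_info(self, info_params: dict) -> None:
        """Adopt a server-supplied nextnonce and restart the counter."""
        nextnonce = info_params.get("nextnonce")
        if nextnonce and nextnonce != self.nonce:
            self.nonce = nextnonce
            self.count = 0

state = NonceState("66C4EF58DA7CB956BD04233FBB64E0A4")
print(state.nonce, state.next_nc())
state.update_from_info({"nextnonce": "29FE72D109C7EF23841AB914F0C3B831"})
print(state.nonce, state.next_nc())
```

The pipelining problem described above is visible here: the client cannot build the next Authorization header until update_from_info has seen the previous response.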

Limited nonce reuse

Instead of pregenerating a sequence of nonces, another approach is to allow limited

reuse of nonces. For example, a server may allow a nonce to be reused 5 times, or for

10 seconds.

In this case, the client can freely issue requests with the Authorization header, and it

can pipeline them, because the nonce is known in advance. When the nonce finally

expires, the server is expected to send the client a 401 Unauthorized challenge, with

the WWW-Authenticate: stale=true directive set:

WWW-Authenticate: Digest

realm="<realm-value>"

nonce="<nonce-value>"

stale=true

Reusing nonces does reduce security, because it makes it easier for an attacker to

succeed at replay attacks. Because the lifetime of nonce reuse is controllable, from

strictly no reuse to potentially long reuse, trade-offs can be made between windows

of vulnerability and performance.

Additionally, other features can be employed to make replay attacks more difficult,

including incrementing counters and IP address tests. However, while making

attacks more inconvenient, these techniques do not eliminate the vulnerability.
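A server-side reuse policy can be sketched as a counter plus a timer. The class below is a hypothetical illustration, not part of any specification: the name ReusableNonce, the default limits (taken from the “5 times, or for 10 seconds” example above), and the choice to enforce both limits at once are all our assumptions.

```python
import time

class ReusableNonce:
    """A nonce that may be reused a limited number of times within a time window."""

    def __init__(self, value, max_uses=5, max_age_seconds=10.0, now=None):
        self.value = value
        self.max_uses = max_uses
        self.max_age_seconds = max_age_seconds
        self.issued_at = time.monotonic() if now is None else now
        self.uses = 0

    def accept(self, now=None) -> bool:
        """Return True if the nonce may be used once more; False means stale."""
        current = time.monotonic() if now is None else now
        if current - self.issued_at > self.max_age_seconds:
            return False
        if self.uses >= self.max_uses:
            return False
        self.uses += 1
        return True

nonce = ReusableNonce("66C4EF58DA7CB956BD04233FBB64E0A4")
results = [nonce.accept() for _ in range(6)]
print(results)  # the sixth use exceeds the limit
```

When accept returns False, the server would answer with a 401 challenge carrying a fresh nonce and stale=true, so the client retries without prompting the user again.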

Synchronized nonce generation

It is possible to employ time-synchronized nonce-generation algorithms, where both

the client and the server can generate a sequence of identical nonces, based on a

shared secret key, that a third party cannot easily predict (such as secure ID cards).

These algorithms are beyond the scope of the digest authentication specification.


Nonce Selection

The contents of the nonce are opaque and implementation-dependent. However, the

quality of performance, security, and convenience depends on a smart choice.

RFC 2617 suggests this hypothetical nonce formulation:

BASE64(time-stamp H(time-stamp ":" ETag ":" private-key))

where time-stamp is a server-generated time or other nonrepeating value, ETag is the

value of the HTTP ETag header associated with the requested entity, and private-key

is data known only to the server.

With a nonce of this form, a server will recalculate the hash portion after receiving

the client authentication header and reject the request if it does not match the nonce

from that header or if the time-stamp value is not recent enough. In this way, the

server can limit the duration of the nonce’s validity.
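Here is a minimal Python sketch of a nonce built along these lines. Several details are our assumptions rather than anything the RFC prescribes: the timestamp is whole seconds since the epoch, H is MD5 rendered in hex, a single space separates the timestamp from the hash, and the function names and the 60-second default lifetime are invented for the example.

```python
import base64
import hashlib
import hmac
import time

def _hash_part(timestamp: str, etag: str, private_key: str) -> str:
    return hashlib.md5(f"{timestamp}:{etag}:{private_key}".encode("utf-8")).hexdigest()

def make_nonce(etag: str, private_key: str, now=None) -> str:
    """BASE64(time-stamp H(time-stamp ":" ETag ":" private-key))."""
    timestamp = str(int(time.time() if now is None else now))
    raw = f"{timestamp} {_hash_part(timestamp, etag, private_key)}"
    return base64.b64encode(raw.encode("ascii")).decode("ascii")

def nonce_is_valid(nonce, etag, private_key, max_age_seconds=60, now=None) -> bool:
    """Recompute the hash portion and check that the timestamp is recent."""
    try:
        raw = base64.b64decode(nonce, validate=True).decode("ascii")
        timestamp, digest = raw.split(" ")
        issued = int(timestamp)
    except ValueError:  # bad base64, non-ASCII bytes, wrong shape, bad number
        return False
    if not hmac.compare_digest(_hash_part(timestamp, etag, private_key), digest):
        return False
    current = time.time() if now is None else now
    return 0 <= current - issued <= max_age_seconds

nonce = make_nonce('"v1"', "server-only-secret")
print(nonce_is_valid(nonce, '"v1"', "server-only-secret"))  # fresh, same entity
print(nonce_is_valid(nonce, '"v2"', "server-only-secret"))  # entity has changed
```

A useful property of this design is that the server does not have to store issued nonces: everything needed to validate one is carried in the nonce itself plus the server's private key.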

The inclusion of the ETag prevents a replay request for an updated version of the

resource. (Note that including the IP address of the client in the nonce would appear

to offer the server the ability to limit the reuse of the nonce to the same client that orig-

inally got it. However, that would break proxy farms, in which requests from a single

user often go through different proxies. Also, IP address spoofing is not that hard.)

An implementation might choose not to accept a previously used nonce or digest, to

protect against replay attacks. Or, an implementation might choose to use one-time

nonces or digests for POST or PUT requests and time-stamps for GET requests.

Refer to “Security Considerations” for practical security considerations that affect

nonce selection.

Symmetric Authentication

RFC 2617 extends digest authentication to allow the client to authenticate the server.
The client does this by providing a client nonce value, from which the server generates
a response digest based on its knowledge of the shared secret information. The
server then returns this digest to the client in the Authentication-Info header.

This symmetric authentication is standard as of RFC 2617. It is optional for back-

ward compatibility with the older RFC 2069 standard, but, because it provides

important security enhancements, all modern clients and servers are strongly recom-

mended to implement all of RFC 2617’s features. In particular, symmetric authenti-

cation is required to be performed whenever a qop directive is present and required

not to be performed when the qop directive is missing.

The response digest is calculated like the request digest, except that the message

body information (A2) is different, because there is no method in a response, and the

message entity data is different. The methods of computation of A2 for request and

response digests are compared in Tables 13-6 and 13-7.


The cnonce value and nc value must be the ones for the client request to which this

message is the response. The response auth, cnonce, and nonce count directives

must be present if qop=“auth” or qop=“auth-int” is specified.

Quality of Protection Enhancements

The qop field may be present in all three digest headers: WWW-Authenticate,

Authorization, and Authentication-Info.

The qop field lets clients and servers negotiate for different types and qualities of pro-

tection. For example, some transactions may want to sanity check the integrity of

message bodies, even if that slows down transmission significantly.

The server first exports a comma-separated list of qop options in the WWW-Authen-

ticate header. The client then selects one of the options that it supports and that

meets its needs and passes it back to the server in its Authorization qop field.

Use of qop is optional, but only for backward compatibility with the older RFC

2069 specification. The qop option should be supported by all modern digest

implementations.

RFC 2617 defines two initial quality of protection values: “auth,” indicating authen-

tication, and “auth-int,” indicating authentication with message integrity protection.

Other qop options are expected in the future.

Message Integrity Protection

If integrity protection is applied (qop=“auth-int”), H(entity-body) is the hash of
the entity body, not the message body. It is computed before any transfer encoding is
applied by the sender and after it has been removed by the recipient. Note that this
includes multipart boundaries and embedded headers in each part of any multipart
content type.

Table 13-6. Definitions for A2 by algorithm (request digests)

qop          A2
undefined    <request-method>:<uri-directive-value>
auth         <request-method>:<uri-directive-value>
auth-int     <request-method>:<uri-directive-value>:H(<request-entity-body>)

Table 13-7. Definitions for A2 by algorithm (response digests)

qop          A2
undefined    :<uri-directive-value>
auth         :<uri-directive-value>
auth-int     :<uri-directive-value>:H(<response-entity-body>)
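The response digest (rspauth) that the server returns can be sketched the same way as the request digest, swapping in the response form of A2 from Table 13-7. The function names response_a2 and rspauth are ours, and the sketch assumes algorithm=MD5 with hex-rendered hashes.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def response_a2(uri: str, qop: str = "auth", entity_body: bytes = b"") -> str:
    """A2 for a response digest: the request form with an empty method."""
    if qop == "auth-int":
        return f":{uri}:{md5_hex(entity_body)}"
    return f":{uri}"

def rspauth(user, realm, password, uri, nonce, nc, cnonce,
            qop="auth", entity_body=b"") -> str:
    """Digest the server sends back so the client can authenticate it."""
    a1 = f"{user}:{realm}:{password}"  # algorithm=MD5 assumed
    a2 = response_a2(uri, qop, entity_body)
    joined = ":".join([md5_hex(a1.encode("utf-8")), nonce, nc, cnonce, qop,
                       md5_hex(a2.encode("utf-8"))])
    return md5_hex(joined.encode("utf-8"))

print(response_a2("/cgi-bin/checkout?cart=17854"))
print(rspauth("bri", "Shopping Cart", "Ow!", "/cgi-bin/checkout?cart=17854",
              "66C4EF58DA7CB956BD04233FBB64E0A4", "00000001", "0a4f113b"))
```

The client runs the same computation with the cnonce and nc it sent; a match shows that the server also knows the shared secret.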


Digest Authentication Headers

Both the basic and digest authentication protocols contain an authorization
challenge, carried by the WWW-Authenticate header, and an authorization response,
carried by the Authorization header. Digest authentication adds an optional
Authentication-Info header, which is sent after successful authentication, to complete
a three-phase handshake and pass along the next nonce to use. The basic and digest
authentication headers are shown in Table 13-8.

The digest authentication headers are quite a bit more complicated. They are

described in detail in Appendix F.

Practical Considerations

There are several things you need to consider when working with digest authentica-

tion. This section discusses some of these issues.

Table 13-8. HTTP authentication headers

Phase      Basic                       Digest
---------  --------------------------  ------------------------------------
Challenge  WWW-Authenticate: Basic     WWW-Authenticate: Digest
             realm="<realm-value>"       realm="<realm-value>"
                                         nonce="<nonce-value>"
                                         [domain="<list-of-URIs>"]
                                         [opaque="<opaque-token-value>"]
                                         [stale=<true-or-false>]
                                         [algorithm=<digest-algorithm>]
                                         [qop="<list-of-qop-values>"]
                                         [<extension-directive>]

Response   Authorization: Basic        Authorization: Digest
             <base64(user:pass)>         username="<username>"
                                         realm="<realm-value>"
                                         nonce="<nonce-value>"
                                         uri=<request-uri>
                                         response="<32-hex-digit-digest>"
                                         [algorithm=<digest-algorithm>]
                                         [opaque="<opaque-token-value>"]
                                         [cnonce="<nonce-value>"]
                                         [qop=<qop-value>]
                                         [nc=<8-hex-digit-nonce-count>]
                                         [<extension-directive>]

Info       n/a                         Authentication-Info:
                                         nextnonce="<nonce-value>"
                                         [qop="<list-of-qop-values>"]
                                         [rspauth="<hex-digest>"]
                                         [cnonce="<nonce-value>"]
                                         [nc=<8-hex-digit-nonce-count>]


Multiple Challenges

A server can issue multiple challenges for a resource. For example, if a server does

not know the capabilities of a client, it may provide both basic and digest authentica-

tion challenges. When faced with multiple challenges, the client must choose to

answer with the strongest authentication mechanism that it supports.

User agents must take special care in parsing the WWW-Authenticate or Proxy-

Authenticate header field value if it contains more than one challenge or if more than

one WWW-Authenticate header field is provided, as a challenge may itself contain a

comma-separated list of authentication parameters. Note that many browsers recog-

nize only basic authentication and require that it be the first authentication mecha-

nism presented.
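To illustrate why parsing needs care, here is a minimal Python sketch that splits one header value into separate challenges and then picks the strongest scheme. The function names and the preference order are assumptions made for this example. The tokenizer handles only the auth-param form (name=value pairs), does not unescape quoted strings, and is not a complete HTTP header parser.

```python
import re

# A token, optionally followed by "=value", where the value is either a
# quoted string (which may contain commas) or a bare token.
_TOKEN = re.compile(r'([A-Za-z0-9_-]+)(?:\s*=\s*("(?:[^"\\]|\\.)*"|[^\s,]+))?')


def parse_challenges(header_value):
    """Split a WWW-Authenticate value into [(scheme, {param: value}), ...]."""
    challenges = []
    for match in _TOKEN.finditer(header_value):
        name, value = match.group(1), match.group(2)
        if value is None:
            # A bare token with no "=" starts a new challenge (the scheme).
            challenges.append((name, {}))
        elif challenges:
            if value.startswith('"'):
                value = value[1:-1]  # strip the surrounding quotes
            challenges[-1][1][name.lower()] = value
    return challenges


def choose_strongest(challenges, preference=("digest", "basic")):
    """Return the first challenge matching the strongest supported scheme.

    The preference tuple (strongest first) is a hypothetical client setting.
    """
    for scheme in preference:
        for challenge in challenges:
            if challenge[0].lower() == scheme:
                return challenge
    return None
```

Note that the comma inside qop="auth,auth-int" stays within one parameter, while the comma before a bare scheme name separates two challenges. A naive split on commas would get this wrong.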

There are obvious “weakest link” security concerns when providing a spectrum of

authentication options. Servers should include basic authentication only if it is mini-

mally acceptable, and administrators should caution users about the dangers of shar-

ing the same password across systems when different levels of security are being

employed.

Error Handling

In digest authentication, if a directive or its value is improper, or if a required direc-

tive is missing, the proper response is 400 Bad Request.

If a request’s digest does not match, a login failure should be logged. Repeated fail-

ures from a client may indicate an attacker attempting to guess passwords.

The authenticating server must assure that the resource designated by the “uri” direc-

tive is the same as the resource specified in the request line; if they are different, the

server should return a 400 Bad Request error. (As this may be a symptom of an attack,

server designers may want to consider logging such errors.) Duplicating information

from the request URL in this field deals with the possibility that an intermediate

proxy may alter the client’s request line. This altered (but, presumably, semantically

equivalent) request would not result in the same digest as that calculated by the client.

Protection Spaces

The realm value, in combination with the canonical root URL of the server being

accessed, defines the protection space.

Realms allow the protected resources on a server to be partitioned into a set of pro-

tection spaces, each with its own authentication scheme and/or authorization data-

base. The realm value is a string, generally assigned by the origin server, which may

have additional semantics specific to the authentication scheme. Note that there may

be multiple challenges with the same authorization scheme but different realms.


The protection space determines the domain over which credentials can be automati-

cally applied. If a prior request has been authorized, the same credentials may be

reused for all other requests within that protection space for a period of time deter-

mined by the authentication scheme, parameters, and/or user preference. Unless oth-

erwise defined by the authentication scheme, a single protection space cannot extend

outside the scope of its server.

The specific calculation of protection space depends on the authentication mechanism:

• In basic authentication, clients assume that all paths at or below the request URI

are within the same protection space as the current challenge. A client can pre-

emptively authorize for resources in this space without waiting for another chal-

lenge from the server.

• In digest authentication, the challenge’s WWW-Authenticate: domain field more

precisely defines the protection space. The domain field is a quoted, space-sepa-

rated list of URIs. All the URIs in the domain list, and all URIs logically beneath

these prefixes, are assumed to be in the same protection space. If the domain field

is missing or empty, all URIs on the challenging server are in the protection space.

Rewriting URIs

Proxies may rewrite URIs in ways that change the URI syntax but not the actual

resource being described. For example:

• Hostnames may be normalized or replaced with IP addresses.

• Embedded characters may be replaced with “%” escape forms.

• Additional attributes of a type that doesn’t affect the resource fetched from the

particular origin server may be appended or inserted into the URI.

Because URIs can be changed by proxies, and because digest authentication sanity

checks the integrity of the URI value, the digest authentication will break if any of

these changes are made. See “The Message-Related Data (A2)” for more information.

Caches

When a shared cache receives a request containing an Authorization header and a

response from relaying that request, it must not return that response as a reply to any

other request, unless one of two Cache-Control directives was present in the response:

• If the original response included the “must-revalidate” Cache-Control directive,

the cache may use the entity of that response in replying to a subsequent request.

However, it must first revalidate it with the origin server, using the request head-

ers from the new request, so the origin server can authenticate the new request.

• If the original response included the “public” Cache-Control directive, the

response entity may be returned in reply to any subsequent request.
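The rule above can be sketched as a small decision function. This is a simplified Python illustration, not a full cache implementation: the function names and return labels are invented for the example, only the two directives named above are considered, and when both are present the sketch conservatively revalidates.

```python
def cache_control_directives(value):
    """Return the lower-cased directive names in a Cache-Control value.

    Naive split on commas; quoted directive arguments containing commas
    are not handled in this sketch.
    """
    names = set()
    for part in value.split(","):
        name = part.split("=", 1)[0].strip().lower()
        if name:
            names.add(name)
    return names


def shared_cache_reuse(request_headers, response_cache_control=""):
    """Decide whether a shared cache may reuse a stored response.

    request_headers: headers of the ORIGINAL request (dict of name -> value).
    response_cache_control: Cache-Control value of the stored response.
    """
    has_auth = any(name.lower() == "authorization" for name in request_headers)
    if not has_auth:
        return "not-restricted"      # this rule does not apply
    directives = cache_control_directives(response_cache_control)
    if "must-revalidate" in directives:
        return "revalidate-first"    # reuse only after checking with the origin
    if "public" in directives:
        return "reuse"
    return "do-not-reuse"
```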


Security Considerations

RFC 2617 does an admirable job of summarizing some of the security risks inherent

in HTTP authentication schemes. This section describes some of these risks.

Header Tampering

To provide a foolproof system against header tampering, you need either end-to-end

encryption or a digital signature of the headers—preferably a combination of both!

Digest authentication is focused on providing a tamper-proof authentication scheme,

but it does not necessarily extend that protection to the data. The only headers that

have some level of protection are WWW-Authenticate and Authorization.

Replay Attacks

A replay attack, in the current context, is when someone uses a set of snooped

authentication credentials from a given transaction for another transaction. While

this problem is an issue with GET requests, it is vital that a foolproof method for

avoiding replay attacks be available for POST and PUT requests. The ability to suc-

cessfully replay previously used credentials while transporting form data could cause

security nightmares.

Thus, in order for a server to accept “replayed” credentials, the nonce values must be

repeated. One of the ways to mitigate this problem is to have the server generate a

nonce containing a digest of the client’s IP address, a time-stamp, the resource ETag,

and a private server key (as recommended earlier). In such a scenario, the combina-

tion of an IP address and a short timeout value may provide a huge hurdle for the

attacker.

However, this solution has a major drawback. As we discussed earlier, using the cli-

ent’s IP address in creating a nonce breaks transmission through proxy farms, in

which requests from a single user may go through different proxies. Also, IP spoof-

ing is not too difficult.

One way to completely avoid replay attacks is to use a unique nonce value for every

transaction. In this implementation, for each transaction, the server issues a unique

nonce along with a timeout value. The issued nonce value is valid only for the given

transaction, and only for the duration of the timeout value. This accounting may

increase the load on servers; however, the increase should be minuscule.
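The per-transaction nonce bookkeeping might look like the following Python sketch. The class name, the 30-second default timeout, and the in-memory dictionary are all assumptions made for illustration; a real server would also need to purge expired entries and share state across processes.

```python
import secrets
import time


class OneTimeNonces:
    """Issue nonces that are valid for one transaction and a limited time."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout = timeout_seconds
        self._issued = {}  # nonce -> expiry time

    def issue(self, now=None):
        """Create a fresh random nonce and remember when it expires."""
        now = time.monotonic() if now is None else now
        nonce = secrets.token_hex(16)  # 32 hex digits of randomness
        self._issued[nonce] = now + self.timeout
        return nonce

    def consume(self, nonce, now=None):
        """Accept a nonce at most once, and only before its timeout."""
        now = time.monotonic() if now is None else now
        # pop() removes the nonce, so a replayed request finds nothing.
        expiry = self._issued.pop(nonce, None)
        return expiry is not None and now <= expiry
```

The optional now argument exists only so the timeout behavior can be exercised without waiting.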

Multiple Authentication Mechanisms

When a server supports multiple authentication schemes (such as basic and digest),

it usually provides the choice in WWW-Authenticate headers. Because the client is


not required to opt for the strongest authentication mechanism, the strength of the

resulting authentication is only as good as that of the weakest of the authentication

schemes.

The obvious way to avoid this problem is to have the clients always choose the

strongest authentication scheme available. If this is not practical (as most of us do

use commercially available clients), the only other option is to use a proxy server to

retain only the strongest authentication scheme. However, such an approach is feasi-

ble only in a domain in which all of the clients are known to be able to support the

chosen authentication scheme—e.g., a corporate network.

Dictionary Attacks

Dictionary attacks are typical password-guessing attacks. A malicious user can eaves-

drop on a transaction and use a standard password-guessing program against nonce/

response pairs. If the users are using relatively simple passwords and the servers are

using simplistic nonces, it is quite possible to find a match. If there is no password

aging policy, given enough time and the one-time cost of cracking the passwords, it

is easy to collect enough passwords to do some real damage.

There really is no good way to solve this problem, other than using relatively com-

plex passwords that are hard to crack and a good password aging policy.

Hostile Proxies and Man-in-the-Middle Attacks

Much Internet traffic today goes through a proxy at one point or another. With the

advent of redirection techniques and intercepting proxies, a user may not even real-

ize that his request is going through a proxy. If one of those proxies is hostile or com-

promised, it could leave the client vulnerable to a man-in-the-middle attack.

Such an attack could be in the form of eavesdropping, or altering available authenti-

cation schemes by removing all of the offered choices and replacing them with the

weakest authentication scheme (such as basic authentication).

One of the ways to compromise a trusted proxy is through its extension interfaces.

Proxies sometimes provide sophisticated programming interfaces, and with such

proxies it may be feasible to write an extension (i.e., plug-in) to intercept and modify

the traffic. However, the data-center security and security offered by proxies them-

selves make the possibility of man-in-the-middle attacks via rogue plug-ins quite

remote.

There is no good way to fix this problem. Possible solutions include clients provid-

ing visual cues regarding the authentication strength, configuring clients to always

use the strongest possible authentication, etc., but even when using the strongest

possible authentication scheme, clients still are vulnerable to eavesdropping. The

only foolproof way to guard against these attacks is by using SSL.


Chosen Plaintext Attacks

Clients using digest authentication use a nonce supplied by the server to generate the

response. However, if there is a compromised or malicious proxy in the middle

intercepting the traffic (or a malicious origin server), it can easily supply a nonce for

response computation by the client. Using the known key for computing the

response may make the cryptanalysis of the response easier. This is called a chosen

plaintext attack. There are a few variants of chosen plaintext attacks:

Precomputed dictionary attacks

This is a combination of a dictionary attack and a chosen plaintext attack. First,

the attacking server generates a set of responses, using a predetermined nonce

and common password variations, and creates a dictionary. Once a sizeable dic-

tionary is available, the attacking server/proxy can complete the interdiction of

the traffic and start sending predetermined nonces to the clients. When it gets a

response from a client, the attacker searches the generated dictionary for matches.

If there is a match, the attacker has the password for that particular user.

Batched brute-force attacks

The difference in a batched brute-force attack is in the computation of the pass-

word. Instead of trying to match a precomputed digest, a set of machines goes to

work on enumerating all of the possible passwords for a given space. As the

machines get faster, the brute-force attack becomes more and more viable.

In general, the threat posed by these attacks is easily countered. One way to prevent

them is to configure clients to use the optional cnonce directive, so that the response

is generated at the client’s discretion, not using the nonce supplied by the server

(which could be compromised by the attacker). This, combined with policies enforc-

ing reasonably strong passwords and a good password aging mechanism, can miti-

gate the threat of chosen plaintext attacks completely.
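To show how the cnonce enters the computation, here is a short Python sketch of the request digest for qop="auth" as specified in RFC 2617, assuming the MD5 algorithm. The function names are invented for this example.

```python
import hashlib


def md5_hex(text):
    """MD5 digest of a string, as 32 lowercase hex digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the 'response' directive for qop=auth (MD5 assumed)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")   # H(A1)
    ha2 = md5_hex(f"{method}:{uri}")                  # H(A2) for qop=auth
    # The client-chosen cnonce is mixed in alongside the server's nonce,
    # so the server alone does not control the input to the digest.
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")
```

Because the client picks a fresh cnonce, a dictionary precomputed for one fixed server nonce no longer matches the digests that clients actually send.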

Storing Passwords

The digest authentication mechanism compares the user response to what is stored

internally by the server—usually, usernames and H(A1) tuples, where H(A1) is

derived from the digest of username, realm, and password.

Unlike with a traditional password file on a Unix box, if a digest authentication pass-

word file is compromised, all of the documents in the realm immediately are avail-

able to the attacker; there is no need for a decrypting step.

Some of the ways to mitigate this problem are to:

• Protect the password file as though it contained clear-text passwords.

• Make sure the realm name is unique among all the realms, so that if a password

file is compromised, the damage is localized to a particular realm. A fully quali-

fied realm name with host and domain included should satisfy this requirement.


While digest authentication provides a much more robust and secure solution than

basic authentication, it still does not provide any protection for security of the con-

tent—a truly secure transaction is feasible only through SSL, which we describe in

the next chapter.

For More Information

For more information on authentication, see:

http://www.ietf.org/rfc/rfc2617.txt

RFC 2617, “HTTP Authentication: Basic and Digest Access Authentication.”


CHAPTER 14

Secure HTTP

The previous three chapters reviewed features of HTTP that help identify and

authenticate users. These techniques work well in a friendly community, but they

aren’t strong enough to protect important transactions from a community of moti-

vated and hostile adversaries.

This chapter presents a more complicated and aggressive technology to secure HTTP

transactions from eavesdropping and tampering, using digital cryptography.

Making HTTP Safe

People use web transactions for serious things. Without strong security, people

wouldn’t feel comfortable doing online shopping and banking. Without being able

to restrict access, companies couldn’t place important documents on web servers.

The Web requires a secure form of HTTP.

The previous chapters talked about some lightweight ways of providing authentica-

tion (basic and digest authentication) and message integrity (digest qop=“auth-int”).

These schemes are good for many purposes, but they may not be strong enough for

large purchases, bank transactions, or access to confidential data. For these more

serious transactions, we combine HTTP with digital encryption technology.

A secure version of HTTP needs to be efficient, portable, easy to administer, and

adaptable to the changing world. It also has to meet societal and governmental

requirements. We need a technology for HTTP security that provides:

• Server authentication (clients know they’re talking to the real server, not a phony)

• Client authentication (servers know they’re talking to the real user, not a phony)

• Integrity (clients and servers are safe from their data being changed)

• Encryption (clients and servers talk privately without fear of eavesdropping)

• Efficiency (an algorithm fast enough for inexpensive clients and servers to use)

• Ubiquity (protocols are supported by virtually all clients and servers)


• Administrative scalability (instant secure communication for anyone, anywhere)

• Adaptability (supports the best known security methods of the day)

• Social viability (meets the cultural and political needs of the society)

HTTPS

HTTPS is the most popular secure form of HTTP. It was pioneered by Netscape

Communications Corporation and is supported by all major browsers and servers.

You can tell if a web page was accessed through HTTPS instead of HTTP, because

the URL will start with the scheme https:// instead of http:// (some browsers also dis-

play iconic security cues, as shown in Figure 14-1).

When using HTTPS, all the HTTP request and response data is encrypted before

being sent across the network. HTTPS works by providing a transport-level crypto-

graphic security layer—using either the Secure Sockets Layer (SSL) or its successor,

Transport Layer Security (TLS)—underneath HTTP (Figure 14-2). Because SSL and

TLS are so similar, in this book we use the term “SSL” loosely to represent both SSL

and TLS.

Because most of the hard encoding and decoding work happens in the SSL libraries, web clients and servers don’t need to change much of their protocol processing logic to use secure HTTP. For the most part, they simply need to replace TCP input/output calls with SSL calls and add a few other calls to configure and manage the security information.

Figure 14-1. Browsing secure web sites (callouts: https scheme, security icon)
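As a rough illustration of how small that change can be, the following Python sketch uses the standard library's ssl module to wrap an ordinary TCP socket. The function names and the HEAD request are choices made for this example, and Python's ssl module stands in for whatever SSL library a real client would use.

```python
import socket
import ssl


def make_client_context():
    """Default client settings: verify the server certificate and hostname."""
    return ssl.create_default_context()


def fetch_head(host, port=443, timeout=10.0):
    """Send a HEAD request over a secured connection; return the first bytes."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as tcp_sock:
        # The main difference from plain HTTP: wrap the TCP socket.
        with context.wrap_socket(tcp_sock, server_hostname=host) as tls_sock:
            request = (
                f"HEAD / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            )
            # From here on, the HTTP logic is the same as over plain TCP.
            tls_sock.sendall(request.encode("ascii"))
            return tls_sock.recv(4096)
```

Calling fetch_head requires network access, so it is not invoked here.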

Digital Cryptography

Before we talk in detail about HTTPS, we need to provide a little background about

the cryptographic encoding techniques used by SSL and HTTPS. In the next few sec-

tions, we’ll give a speedy primer of the essentials of digital cryptography. If you

already are familiar with the technology and terminology of digital cryptography, feel

free to jump ahead to “HTTPS: The Details.”

In this digital cryptography primer, we’ll talk about:

Ciphers

Algorithms for encoding text to make it unreadable to voyeurs

Keys

Numeric parameters that change the behavior of ciphers

Symmetric-key cryptosystems

Algorithms that use the same key for encoding and decoding

Asymmetric-key cryptosystems

Algorithms that use different keys for encoding and decoding

Public-key cryptography

A system making it easy for millions of computers to send secret messages

Digital signatures

Checksums that verify that a message has not been forged or tampered with

Digital certificates

Identifying information, verified and signed by a trusted organization

Figure 14-2. HTTPS is HTTP layered over a security layer, layered over TCP

(a) HTTP                                 (b) HTTPS
HTTP                Application layer    HTTP                Application layer
                                         SSL or TLS          Security layer
TCP                 Transport layer      TCP                 Transport layer
IP                  Network layer        IP                  Network layer
Network interfaces  Data link layer      Network interfaces  Data link layer


The Art and Science of Secret Coding

Cryptography is the art and science of encoding and decoding messages. People have

used cryptographic methods to send secret messages for thousands of years. How-

ever, cryptography can do more than just encrypt messages to prevent reading by

nosy folks; it also can be used to prevent tampering with messages. Cryptography

even can be used to prove that you indeed authored a message or transaction, just

like your handwritten signature on a check or an embossed wax seal on an envelope.

Ciphers

Cryptography is based on secret codes called ciphers. A cipher is a coding scheme—a

particular way to encode a message and an accompanying way to decode the secret

later. The original message, before it is encoded, often is called plaintext or cleartext.

The coded message, after the cipher is applied, often is called ciphertext. Figure 14-3

shows a simple example.

Ciphers have been used to generate secret messages for thousands of years. Legend has

it that Julius Caesar used a three-character rotation cipher, where each character in the

message is replaced with a character three alphabetic positions forward. In our mod-

ern alphabet, “A” would be replaced by “D,” “B” would be replaced by “E,” and so on.

For example, in Figure 14-4, the message “meet me at the pier at midnight” encodes

into the ciphertext “phhw ph dw wkh slhu dw plgqljkw” using the rot3 (rotate by 3

characters) cipher.* The ciphertext can be decrypted back to the original plaintext

message by applying the inverse coding, rotating –3 characters in the alphabet.
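The rotation cipher is simple enough to write in a few lines. This Python sketch uses a hypothetical function named rotate. Like the example in the text, it leaves whitespace and punctuation unchanged.

```python
import string


def rotate(text, n):
    """Rotate each ASCII letter n positions; leave other characters alone."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + n) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + n) % 26 + ord("A")))
        else:
            result.append(ch)  # spaces and punctuation are not rotated
    return "".join(result)


ciphertext = rotate("meet me at the pier at midnight", 3)   # encode with rot3
plaintext = rotate(ciphertext, -3)                          # decode by rotating back
```

Rotating by a negative amount works because the modulo operation wraps positions around the 26-letter alphabet in either direction.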

* For simplicity of example, we aren’t rotating punctuation or whitespace, but you could.

Figure 14-3. Plaintext and ciphertext
Plaintext “Meet me at the pier at midnight” -> Encoder -> Ciphertext “Phhw ph dw wkh slhu dw plgqljkw” -> Decoder -> Plaintext “Meet me at the pier at midnight”

Figure 14-4. Rotate-by-3 cipher example
Plaintext alphabet:   ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher alphabet:      DEFGHIJKLMNOPQRSTUVWXYZABC
Plaintext:            MEET ME AT THE PIER AT MIDNIGHT
Ciphertext:           PHHW PH DW WKH SLHU DW PLGQLJKW


Cipher Machines

Ciphers began as relatively simple algorithms, because human beings needed to do

the encoding and decoding themselves. Because the ciphers were simple, people

could work the codes using pencil and paper and code books. However, it also was

possible for clever people to “crack” the codes fairly easily.

As technology advanced, people started making machines that could quickly and

accurately encode and decode messages using much more complicated ciphers.

Instead of just doing simple rotations, these cipher machines could substitute charac-

ters, transpose the order of characters, and slice and dice messages to make codes

much harder to crack.*

Keyed Ciphers

Because code algorithms and machines could fall into enemy hands, most machines

had dials that could be set to a large number of different values that changed how the

cipher worked. Even if the machine was stolen, without the right dial settings (key

values) the decoder wouldn’t work.†

These cipher parameters were called keys. You needed to enter the right key into the

cipher machine to get the decoding process to work correctly. Cipher keys make a

single cipher machine act like a set of many virtual cipher machines, each of which

behaves differently because they have different key values.

Figure 14-5 illustrates an example of keyed ciphers. The cipher algorithm is the triv-

ial “rotate-by-N” cipher. The value of N is controlled by the key. The same input

message, “meet me at the pier at midnight,” passed through the same encoding

machine, generates different outputs depending on the value of the key. Today, vir-

tually all cipher algorithms use keys.

Digital Ciphers

With the advent of digital computation, two major advances occurred:

• Complicated encoding and decoding algorithms became possible, freed from the speed and function limitations of mechanical machinery.

• It became possible to support very large keys, so that a single cipher algorithm could yield trillions of virtual cipher algorithms, each differing by the value of the key. The longer the key, the more combinations of encodings are possible, and the harder it is to crack the code by randomly guessing keys.

* Perhaps the most famous mechanical code machine was the World War II German Enigma code machine. Despite the complexity of the Enigma cipher, Alan Turing and colleagues were able to crack the Enigma codes in the early 1940s, using the earliest digital computers.

† In reality, having the logic of the machine in your possession can sometimes help you to crack the code, because the machine logic may point to patterns that you can exploit. Modern cryptographic algorithms usually are designed so that even if the algorithm is publicly known, it’s difficult to come up with any patterns that will help evildoers crack the code. In fact, many of the strongest ciphers in common use have their source code available in the public domain, for all to see and study!

Unlike physical metal keys or dial settings in mechanical devices, digital keys are just

numbers. These digital key values are inputs to the encoding and decoding algo-

rithms. The coding algorithms are functions that take a chunk of data and encode/

decode it based on the algorithm and the value of the key.

Given a plaintext message called P, an encoding function called E, and a digital

encoding key called e, you can generate a coded ciphertext message C (Figure 14-6).

You can decode the ciphertext C back into the original plaintext P by using the

decoder function D and the decoding key d. Of course, the decoding and encoding

functions are inverses of each other; the decoding of the encoding of P gives back the

original message P.

Figure 14-5. The rotate-by-N cipher, using different keys
Plaintext (all three cases): Meet me at the pier at midnight
(a) Rotate(n) encoder, Key = 1: nffu nf bu uif qjfs bu njeojhiu
(b) Rotate(n) encoder, Key = 2: oggv og cv vjg rkgt cv okfpkijv
(c) Rotate(n) encoder, Key = 3: phhw ph dw wkh slhu dw plgqljkw


Symmetric-Key Cryptography

Let’s talk in more detail about how keys and ciphers work together. Many digital

cipher algorithms are called symmetric-key ciphers, because they use the same key

value for encoding as they do for decoding (e = d). Let’s just call the key k.

In a symmetric key cipher, both a sender and a receiver need to have the same shared

secret key, k, to communicate. The sender uses the shared secret key to encrypt the

message and sends the resulting ciphertext to the receiver. The receiver takes the

ciphertext and applies the decrypting function, along with the same shared secret

key, to recover the original plaintext (Figure 14-7).

Some popular symmetric-key cipher algorithms are DES, Triple-DES, RC2, and RC4.

Key Length and Enumeration Attacks

It’s very important that secret keys stay secret. In most cases, the encoding and

decoding algorithms are public knowledge, so the key is the only thing that’s secret!

A good cipher algorithm forces the enemy to try every single possible key value in the

universe to crack the code. Trying all key values by brute force is called an enumera-

tion attack. If there are only a few possible key values, a bad guy can go through all of

them by brute force and eventually crack the code. But if there are a lot of possible

key values, it might take the bad guy days, years, or even the lifetime of the universe

to go through all the keys, looking for one that breaks the cipher.
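For a cipher with a tiny key space, an enumeration attack is trivial. The Python sketch below tries all 26 keys of the rotate-by-N cipher and keeps the candidates that contain a word the attacker expects to see. The function names and the idea of a known word are assumptions for this illustration, and only lowercase letters are handled.

```python
import string

LOWER = string.ascii_lowercase


def rotate(text, n):
    """Rotate lowercase letters by n positions using a translation table."""
    n %= 26
    table = str.maketrans(LOWER, LOWER[n:] + LOWER[:n])
    return text.translate(table)


def enumerate_keys(ciphertext, known_word):
    """Try every possible key; return (key, plaintext) pairs that look right."""
    hits = []
    for key in range(26):
        candidate = rotate(ciphertext, -key)   # undo a rotation by `key`
        if known_word in candidate.split():
            hits.append((key, candidate))
    return hits


hits = enumerate_keys("phhw ph dw wkh slhu dw plgqljkw", "midnight")
```

With only 26 keys the loop finishes instantly. The same loop over a 128-bit key space would never finish in practice.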

Figure 14-6. Plaintext is encoded with encoding key e, and decoded using decoding key d
Plaintext P -> Encoder E (key = e) -> Ciphertext C, where C = E(P, e)
Ciphertext C -> Decoder D (key = d) -> Plaintext P, where P = D(C, d)

Figure 14-7. Symmetric-key cryptography algorithms use the same key for encoding and decoding


The number of possible key values depends on the number of bits in the key and how many of the possible keys are valid. For symmetric-key ciphers, usually all of the key values are valid.* An 8-bit key would have only 256 possible keys, a 40-bit key would have 2^40 possible keys (around one trillion keys), and a 128-bit key would generate around 340,000,000,000,000,000,000,000,000,000,000,000,000 possible keys.
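These figures follow directly from the key length: an n-bit key has 2^n possible values. The short Python sketch below computes the key space and a worst-case search time. The function names are invented here, and the guessing rate passed to worst_case_seconds is whatever the reader assumes, not a measured value.

```python
def keyspace(bits):
    """Number of distinct keys for a key of the given length in bits."""
    return 2 ** bits


def worst_case_seconds(bits, guesses_per_second):
    """Time to try every key at an assumed, constant guessing rate."""
    return keyspace(bits) / guesses_per_second


for bits in (8, 40, 128):
    print(f"{bits:>3}-bit key: {keyspace(bits):,} possible keys")
```

Each additional bit doubles the work for an enumeration attack, which is why the gap between 40-bit and 128-bit keys is so large.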

For conventional symmetric-key ciphers, 40-bit keys are considered safe enough for

small, noncritical transactions. However, they are breakable by today’s high-speed

workstations, which can now do billions of calculations per second.

In contrast, 128-bit keys are considered very strong for symmetric-key cryptography. In fact, long keys have such an impact on cryptographic security that the U.S. government has put export controls on cryptographic software that uses long keys, to prevent potentially antagonistic organizations from creating secret codes that the U.S. National Security Agency (NSA) would itself be unable to crack.

Bruce Schneier’s excellent book, Applied Cryptography (John Wiley & Sons),

includes a table describing the time it would take to crack a DES cipher by guessing

all keys, using 1995 technology and economics.† Excerpts of this table are shown in

Table 14-1.

Given the speed of 1995 microprocessors, an attacker willing to spend $100,000 in

1995 could break a 40-bit DES code in about 2 seconds. And computers in 2002

already are 20 times faster than they were in 1995. Unless the users change keys fre-

quently, 40-bit keys are not safe against motivated opponents.

The DES standard key size of 56 bits is more secure. In 1995 economics, a $1 million assault still would take several hours to crack the code. But a person with access to supercomputers could crack the code by brute force in a matter of seconds. In contrast, 128-bit DES keys, similar in size to Triple-DES keys, are believed to be effectively unbreakable by anyone, at any cost, using a brute-force attack.*

Table 14-1. Longer keys take more effort to crack (1995 data, from “Applied Cryptography”)

Attack cost      40-bit key  56-bit key  64-bit key  80-bit key    128-bit key
---------------  ----------  ----------  ----------  ------------  -----------
$100,000         2 secs      35 hours    1 year      70,000 years  10^19 years
$1,000,000       200 msecs   3.5 hours   37 days     7,000 years   10^18 years
$10,000,000      20 msecs    21 mins     4 days      700 years     10^17 years
$100,000,000     2 msecs     2 mins      9 hours     70 years      10^16 years
$1,000,000,000   200 usecs   13 secs     1 hour      7 years       10^15 years

* There are ciphers where only some of the key values are valid. For example, in RSA, the best-known asymmetric-key cryptosystem, valid keys must be related to prime numbers in a certain way. Only a small number of the possible key values have this property.

† Computation speed has increased dramatically since 1995, and cost has been reduced. And the longer it takes you to read this book, the faster they’ll become! However, the table still is relatively useful, even if the times are off by a factor of 5, 10, or more.

Establishing Shared Keys

One disadvantage of symmetric-key ciphers is that both the sender and receiver have

to have a shared secret key before they can talk to each other.

If you wanted to talk securely with Joe’s Hardware store, perhaps to order some wood-

working tools after watching a home-improvement program on public television,

you’d have to establish a private secret key between you and www.joes-hardware.com

before you could order anything securely. You’d need a way to generate the secret key

and to remember it. Both you and Joe’s Hardware, and every other Internet user,

would have thousands of keys to generate and remember.

Say that Alice (A), Bob (B), and Chris (C) all wanted to talk to Joe’s Hardware (J). A, B, and C each would need to establish their own secret keys with J. A would need key k_AJ, B would need key k_BJ, and C would need key k_CJ. Every pair of communicating parties needs its own private key. If there are N nodes, and each node has to talk securely with all the other N–1 nodes, there are roughly N^2 total secret keys: an administrative nightmare.
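The growth is easy to check with a few lines of Python. If every pair of nodes shares one key, the exact count is N(N–1)/2, which grows on the order of N^2. The function names below are invented for this illustration.

```python
from itertools import combinations


def pairwise_secret_keys(n):
    """Shared secret keys needed when every pair of n nodes shares one key."""
    return n * (n - 1) // 2


def key_names(nodes):
    """List one key name per communicating pair, e.g. 'k_AJ'."""
    return [f"k_{a}{b}" for a, b in combinations(nodes, 2)]


print(key_names(["A", "B", "C", "J"]))   # every pair among the four parties
print(pairwise_secret_keys(1000))        # keys for a thousand nodes
```

In the Joe’s Hardware example only three of these keys (those involving J) are needed, but a network where everyone talks to everyone needs all of them.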

Public-Key Cryptography

Instead of a single encoding/decoding key for every pair of hosts, public-key cryptography uses two asymmetric keys: one for encoding messages for a host, and another for decoding the host’s messages. The encoding key is publicly known to the world (thus the name public-key cryptography), but only the host knows the private decoding key (see Figure 14-8). This makes key establishment much easier, because everyone can find the public key for a particular host. But the decoding key is kept secret, so only the recipient can decode messages sent to it.

Node X can take its encoding key e_X and publish it publicly.† Now anyone wanting to send a message to node X can use the same, well-known public key. Because each host is assigned an encoding key, which everyone uses, public-key cryptography avoids the N² explosion of pairwise symmetric keys (see Figure 14-9).

* A large key does not mean that the cipher is foolproof, though! There may be an unnoticed flaw in the cipher

algorithm or implementation that provides a weakness for an attacker to exploit. It’s also possible that the

attacker may have some information about how the keys are generated, so that he knows some keys are more

likely than others, helping to focus a brute-force attack. Or a user might leave the secret key someplace where

an attacker might be able to steal it.

† As we’ll see later, most public-key lookup actually is done through digital certificates, but the details of how

you find public keys don’t matter much now—just know that they are publicly available somewhere.

316 |Chapter 14: Secure HTTP

Even though everyone can encode messages to X with the same key, no one other than X can decode the messages, because only X has the decoding private key d_X. Splitting the keys lets anyone encode a message but restricts the ability to decode messages to only the owner. This makes it easier for nodes to securely send messages to servers, because they can just look up the server’s public key.

Public-key encryption technology makes it possible to deploy security protocols to every computer user around the world. Because of the great importance of making a standardized public-key technology suite, a massive Public-Key Infrastructure (PKI) standards initiative has been under way for well over a decade.

Figure 14-8. Public-key cryptography is asymmetric, using different keys for encoding and decoding

[Figure: the client encodes plaintext with the server’s public key e_S, the encrypted ciphertext crosses the Internet, and the server decodes it back to plaintext with its private key d_S.]

Figure 14-9. Public-key cryptography assigns a single, public encoding key to each host

[Figure: (a) Symmetric-key cryptography: separate keys k_AX, k_BX, k_CX, and k_DX are shared with node X. (b) Public-key cryptography: every node uses X’s public key e_X.]

RSA

The challenge of any public-key asymmetric cryptosystem is to make sure no bad guy

can compute the secret, private key—even if he has all of the following clues:

• The public key (which anyone can get, because it’s public)

• A piece of intercepted ciphertext (obtained by snooping the network)

• A message and its associated ciphertext (obtained by running the encoder on any

text)

One popular public-key cryptosystem that meets all these needs is the RSA algorithm, invented at MIT and subsequently commercialized by RSA Data Security. Given a public key, an arbitrary piece of plaintext, the associated ciphertext from encoding the plaintext with the public key, the RSA algorithm itself, and even the source code of the RSA implementation, cracking the code to find the corresponding private key is believed to be as hard as factoring a huge number into its prime factors, a problem widely regarded as one of the hardest in all of computer science. So, if you can find a fast way of factoring large numbers into primes, not only can you break into Swiss bank accounts, but you can also win a Turing Award.

The details of RSA cryptography involve some tricky mathematics, so we won’t go into them here. There are plenty of libraries available to let you perform the RSA algorithms without needing a Ph.D. in number theory.
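To make the roles of the encoding and decoding keys concrete, here is a toy sketch in C. It is not how a real RSA library works: the key (n = 3233 = 61 × 53, e = 17, d = 2753) is a classroom-sized example chosen for this illustration, there is no padding, and names such as toy_encode and toy_decode are invented here. Real keys are thousands of bits long, which is one reason to rely on an established library.

```c
#include <assert.h>
#include <stdint.h>

/* Toy key pair (an assumption for illustration only): n = 61 * 53. */
enum { TOY_N = 3233, TOY_E = 17, TOY_D = 2753 };

/* Modular exponentiation by repeated squaring: (base ^ exp) mod m.
 * m must fit in 32 bits so the intermediate products fit in 64 bits. */
uint64_t modpow(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result;
    assert(m > 0 && m < ((uint64_t)1 << 32));
    result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

/* E: anyone can encode with the public key (TOY_E, TOY_N). */
uint64_t toy_encode(uint64_t message)
{
    return modpow(message, TOY_E, TOY_N);
}

/* D: only the holder of the private key (TOY_D, TOY_N) can decode. */
uint64_t toy_decode(uint64_t ciphertext)
{
    return modpow(ciphertext, TOY_D, TOY_N);
}
```

With this key, toy_encode(65) is 2790 and toy_decode(2790) is 65 again. Breaking the toy key only requires factoring 3233, which is trivial. The security of real RSA rests on the same factoring step being infeasible for very large n.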

Hybrid Cryptosystems and Session Keys

Asymmetric, public-key cryptography is nifty, because anyone can send secure mes-

sages to a public server, just by knowing its public key. Two nodes don’t first have to

negotiate a private key in order to communicate securely.

But public-key cryptography algorithms tend to be computationally slow. In prac-

tice, mixtures of both symmetric and asymmetric schemes are used. For example, it

is common to use public-key cryptography to conveniently set up secure communi-

cation between nodes but then to use that secure channel to generate and communi-

cate a temporary, random symmetric key to encrypt the rest of the data through

faster, symmetric cryptography.
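The hybrid idea can be sketched with toy building blocks in C. Everything here is an assumption made for illustration: the tiny RSA-style key (n = 3233, e = 17, d = 2753), the function names, and above all the XOR “cipher,” which merely stands in for a real symmetric algorithm and offers no security. The point is only the division of labor: the slow public-key step moves one small session key, and the fast symmetric step handles the bulk data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_N = 3233, TOY_E = 17, TOY_D = 2753 };   /* toy key pair */

/* (base ^ exp) mod m by repeated squaring; m must fit in 32 bits. */
static uint64_t modpow(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

/* Client: protect the random session key with the server's public key. */
uint64_t wrap_session_key(uint64_t session_key)
{
    assert(session_key < TOY_N);
    return modpow(session_key, TOY_E, TOY_N);
}

/* Server: recover the session key with its private key. */
uint64_t unwrap_session_key(uint64_t wrapped)
{
    return modpow(wrapped, TOY_D, TOY_N);
}

/* Stand-in symmetric cipher: XOR each byte with one of the two low bytes
 * of the session key. Applying it twice restores the original data. */
void toy_symmetric_crypt(unsigned char *buf, size_t len, uint64_t session_key)
{
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] ^= (unsigned char)((session_key >> (8 * (i % 2))) & 0xFF);
}
```

A client would call wrap_session_key once per connection and then use toy_symmetric_crypt for every message. The server calls unwrap_session_key once and uses the same symmetric routine to decrypt.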

Digital Signatures

So far, we’ve been talking about various kinds of keyed ciphers, using symmetric and

asymmetric keys, to allow us to encrypt and decrypt secret messages.


In addition to encrypting and decrypting messages, cryptosystems can be used to

sign messages, proving who wrote the message and proving the message hasn’t been

tampered with. This technique, called digital signing, is important for Internet secu-

rity certificates, which we discuss in the next section.

Signatures Are Cryptographic Checksums

Digital signatures are special cryptographic checksums attached to a message. They

have two benefits:

• Signatures prove the author wrote the message. Because only the author has the author’s top-secret private key,* only the author can compute these checksums. The checksum acts as a personal “signature” from the author.

• Signatures prevent message tampering. If a malicious assailant modified the message in-flight, the checksum would no longer match. And because the checksum involves the author’s secret, private key, the intruder will not be able to fabricate a correct checksum for the tampered-with message.

Digital signatures often are generated using asymmetric, public-key technology. The

author’s private key is used as a kind of “thumbprint,” because the private key is

known only by the owner.

Figure 14-10 shows an example of how node A can send a message to node B and

sign it:

• Node A distills the variable-length message into a fixed-sized digest.

• Node A applies a “signature” function to the digest that uses node A’s private key as a parameter. Because only node A knows the private key, a correct signature shows that the signer is the key’s owner. In Figure 14-10, we use the decoder function D as the signature function, because it involves the user’s private key.†

• Once the signature is computed, node A appends it to the end of the message and sends both the message and the signature to node B.

• On receipt, if node B wants to make sure that node A really wrote the message, and that the message hasn’t been tampered with, node B can check the signature. Node B takes the private-key scrambled signature and applies the inverse function using node A’s public key. If the unpacked digest doesn’t match node B’s own version of the digest, either the message was tampered with in-flight, or the sender did not have node A’s private key (and therefore was not node A).

* This assumes the private key has not been stolen. Most private keys expire after a while. There also are “revo-

cation lists” that keep track of stolen or compromised keys.

† With the RSA cryptosystem, the decoder function D is used as the signature function, because D already

takes the private key as input. Note that the decoder function is just a function, so it can be used on any

input. Also, in the RSA cryptosystem, the D and E functions work when applied in either order and cancel

each other out. So, E(D(stuff)) = stuff, just as D(E(stuff)) = stuff.


Digital Certificates

In this section, we talk about digital certificates, the “ID cards” of the Internet. Digi-

tal certificates (often called “certs,” like the breath mints) contain information about

a user or firm that has been vouched for by a trusted organization.

We all carry many forms of identification. Some IDs, such as passports and drivers’

licenses, are trusted enough to prove one’s identity in many situations. For example,

a U.S. driver’s license is sufficient proof of identity to let you board an airplane to

New York for New Year’s Eve, and it’s sufficient proof of your age to let you drink

intoxicating beverages with your friends when you get there.

More trusted forms of identification, such as passports, are signed and stamped by a

government on special paper. They are harder to forge, so they inherently carry a

higher level of trust. Some corporate badges and smart cards include electronics to

help strengthen the identity of the carrier. Some top-secret government organiza-

tions even need to match up your fingerprints or retinal capillary patterns with your

ID before trusting it!

Other forms of ID, such as business cards, are relatively easy to forge, so people trust

this information less. They may be fine for professional interactions but probably are

not enough proof of employment when you apply for a home loan.

The Guts of a Certificate

Digital certificates also contain a set of information, all of which is digitally signed by

an official “certificate authority.” Basic digital certificates commonly contain basic

things common to printed IDs, such as:

• Subject’s name (person, server, organization, etc.)

• Expiration date

• Certificate issuer (who is vouching for the certificate)

• Digital signature from the certificate issuer

Figure 14-10. Unencrypted digital signature

[Figure: node A computes a message digest of the plaintext message and applies D with its private key d_A to produce the signature. The receiver applies the public key e_A to the signature and checks whether the result is the same as its own message digest of the plaintext message.]

Additionally, digital certificates often contain the public key of the subject, as well as

descriptive information about the subject and about the signature algorithm used.

Anyone can create a digital certificate, but not everyone can get a well-respected sign-

ing authority to vouch for the certificate’s information and sign the certificate with

its private key. A typical certificate structure is shown in Figure 14-11.

X.509 v3 Certificates

Unfortunately, there is no single, universal standard for digital certificates. There are

many, subtly different styles of digital certificates, just as not all printed ID cards con-

tain the same information in the same place. The good news is that most certificates

in use today store their information in a standard form, called X.509 v3. X.509 v3 cer-

tificates provide a standard way of structuring certificate information into parseable

fields. Different kinds of certificates have different field values, but most follow the

X.509 v3 structure. The fields of an X.509 certificate are described in Table 14-2.

Figure 14-11. Typical digital signature format

[Figure: a certificate lists its format version number, serial number, signature algorithm, issuer, validity period, subject’s name, subject’s public key, and other extension information. A digital signature function applied to these fields produces the certificate’s digital signature.]

Table 14-2. X.509 certificate fields

Field                              Description
Version                            The X.509 certificate version number for this certificate. Usually
                                   version 3 today.
Serial Number                      A unique integer generated by the certification authority. Each
                                   certificate from a CA must have a unique serial number.
Signature Algorithm ID             The cryptographic algorithm used for the signature. For example,
                                   “MD2 digest with RSA encryption”.
Certificate Issuer                 The name for the organization that issued and signed this
                                   certificate, in X.500 format.
Validity Period                    When this certificate is valid, defined by a start date and an end
                                   date.
Subject’s Name                     The entity described in the certificate, such as a person or an
                                   organization. The subject name is in X.500 format.
Subject’s Public Key Information   The public key for the certificate’s subject, the algorithm used for
                                   the public key, and any additional parameters.
Issuer Unique ID (optional)        An optional unique identifier for the certificate issuer, to allow
                                   the potential reuse of the same issuer name.
Subject Unique ID (optional)       An optional unique identifier for the certificate subject, to allow
                                   the potential reuse of the same subject name.
Extensions                         An optional set of extension fields (in version 3 and higher). Each
                                   extension field is flagged as critical or noncritical. Critical
                                   extensions are important and must be understood by the certificate
                                   user. If a certificate user doesn’t recognize a critical extension
                                   field, it must reject the certificate. Common extension fields in
                                   use include:
                                     Basic Constraints: Subject’s relationship to certification authority
                                     Certificate Policy: The policy under which the certificate is granted
                                     Key Usage: Restricts how the public key can be used
Certification Authority Signature  The certification authority’s digital signature of all of the above
                                   fields, using the specified signing algorithm.

There are several flavors of X.509-based certificates, including (among others) web server certificates, client email certificates, software code-signing certificates, and certificate authority certificates.

Using Certificates to Authenticate Servers

When you establish a secure web transaction through HTTPS, modern browsers automatically fetch the digital certificate for the server being connected to. If the server does not have a certificate, the secure connection fails. The server certificate contains many fields, including:

• Name and hostname of the web site

• Public key of the web site

• Name of the signing authority

• Signature from the signing authority

When the browser receives the certificate, it checks the signing authority.* If it is a public, well-respected signing authority, the browser will already know its public key (browsers ship with certificates of many signing authorities preinstalled), so it can verify the signature as we discussed in the previous section, “Digital Signatures.” Figure 14-12 shows how a certificate’s integrity is verified using its digital signature.

* Browsers and other Internet applications try hard to hide the details of most certificate management, to make browsing easier. But, when you are browsing through secure connections, all the major browsers allow you to personally examine the certificates of the sites to which you are talking, to be sure all is on the up-and-up.

If the signing authority is unknown, the browser isn’t sure if it should trust the sign-

ing authority and usually displays a dialog box for the user to read and see if he trusts

the signer. The signer might be the local IT department, or a software vendor.

HTTPS: The Details

HTTPS is the most popular secure version of HTTP. It is widely implemented and

available in all major commercial browsers and servers. HTTPS combines the HTTP

protocol with a powerful set of symmetric, asymmetric, and certificate-based crypto-

graphic techniques, making HTTPS very secure but also very flexible and easy to

administer across the anarchy of the decentralized, global Internet.

HTTPS has accelerated the growth of Internet applications and has been a major

force in the rapid growth of web-based electronic commerce. HTTPS also has been

critical in the wide-area, secure administration of distributed web applications.

HTTPS Overview

HTTPS is just HTTP sent over a secure transport layer. Instead of sending HTTP

messages unencrypted to TCP and across the world-wide Internet (Figure 14-13a),

HTTPS sends the HTTP messages first to a security layer that encrypts them before

sending them to TCP (Figure 14-13b).

Figure 14-12. Verifying that a signature is real

[Figure: a message digest is computed over the certificate fields (format version number, serial number, signature algorithm, issuer or signing authority, validity period, subject’s name, subject’s public key, and other extension information). The signing authority’s public key is applied to the certificate’s digital signature to recover the signed message digest, and the two digests are checked to see whether they are the same.]

Today, the HTTP security layer is implemented by SSL and its modern replacement,

TLS. We follow the common practice of using the term “SSL” to mean either SSL or

TLS.

HTTPS Schemes

Today, secure HTTP is optional. Thus, when making a request to a web server, we

need a way to tell the web server to perform the secure protocol version of HTTP.

This is done in the scheme of the URL.

In normal, nonsecure HTTP, the scheme prefix of the URL is http, as in:

http://www.joes-hardware.com/index.html

In the secure HTTPS protocol, the scheme prefix of the URL is https, as in:

https://cajun-shop.securesites.com/Merchant2/merchant.mv?Store_Code=AGCGS

When a client (such as a web browser) is asked to perform a transaction on a web

resource, it examines the scheme of the URL:

• If the URL has an http scheme, the client opens a connection to the server on

port 80 (by default) and sends it plain-old HTTP commands (Figure 14-14a).

• If the URL has an https scheme, the client opens a connection to the server on

port 443 (by default) and then “handshakes” with the server, exchanging some

SSL security parameters with the server in a binary format, followed by the

encrypted HTTP commands (Figure 14-14b).
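The scheme check described above comes down to a small lookup. The C sketch below shows one way to write it. The function name default_port_for_url is invented for this example, and for brevity it assumes a lowercase scheme and ignores any explicit port given in the URL.

```c
#include <assert.h>
#include <string.h>

/* Return the default TCP port implied by a URL's scheme:
 * 80 for http, 443 for https, and -1 for any other scheme. */
int default_port_for_url(const char *url)
{
    assert(url != NULL);
    if (strncmp(url, "https://", 8) == 0)
        return 443;     /* SSL handshake first, then encrypted HTTP */
    if (strncmp(url, "http://", 7) == 0)
        return 80;      /* plain-old HTTP commands */
    return -1;
}
```

A client would use the returned port when opening the TCP connection and would start the SSL handshake only in the 443 case.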

Because SSL traffic is a binary protocol, completely different from HTTP, the traffic

is carried on different ports (SSL usually is carried over port 443). If both SSL and

HTTP traffic arrived on port 80, most web servers would interpret binary SSL traffic

as erroneous HTTP and close the connection. A more integrated layering of security

services into HTTP would have eliminated the need for multiple destination ports,

but this does not cause severe problems in practice.

Let’s look a bit more closely at how SSL sets up connections with secure servers.

Figure 14-13. HTTP transport-level security

[Figure: (a) HTTP: HTTP (application layer) over TCP (transport layer) over IP (network layer) over network interfaces (data link layer). (b) HTTPS: the same stack with an SSL or TLS security layer between HTTP and TCP.]

Secure Transport Setup

In unencrypted HTTP, a client opens a TCP connection to port 80 on a web server,

sends a request message, receives a response message, and closes the connection.

This sequence is sketched in Figure 14-15a.

The procedure is slightly more complicated in HTTPS, because of the SSL security

layer. In HTTPS, the client first opens a connection to port 443 (the default port for

secure HTTP) on the web server. Once the TCP connection is established, the client

and server initialize the SSL layer, negotiating cryptography parameters and exchang-

ing keys. When the handshake completes, the SSL initialization is done, and the cli-

ent can send request messages to the security layer. These messages are encrypted

before being sent to TCP. This procedure is depicted in Figure 14-15b.

SSL Handshake

Before you can send encrypted HTTP messages, the client and server need to do an

SSL handshake, where they:

• Exchange protocol version numbers

• Select a cipher that each side knows

• Authenticate the identity of each side

• Generate temporary session keys to encrypt the channel
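As one illustration of the “select a cipher that each side knows” step, the C sketch below picks the first cipher in the client’s list that the server also supports. This is a simplification made for this example: in a real SSL handshake the server makes the choice and may apply its own preference order. The function name select_cipher is invented here, and the cipher names are only sample strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the first cipher in the client's preference list that the server
 * also supports, or NULL if the two sides have no cipher in common. */
const char *select_cipher(const char *const client[], size_t n_client,
                          const char *const server[], size_t n_server)
{
    size_t i, j;
    assert(client != NULL && server != NULL);
    for (i = 0; i < n_client; i++)
        for (j = 0; j < n_server; j++)
            if (strcmp(client[i], server[j]) == 0)
                return client[i];
    return NULL;
}
```

If the two lists share no entry, the function returns NULL, which corresponds to a handshake that cannot proceed.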

Figure 14-14. HTTP and HTTPS port numbers

[Figure: (a) HTTP request: the client talks HTTP to the server. (b) HTTPS request: the client talks HTTPS to port 443 on the secure server. Also shown: a client reaching port 443 on a secure server through an HTTPS tunnel via a proxy on port 8080.]

Before any encrypted HTTP data flies across the network, SSL already has sent a

bunch of handshake data to establish the communication. The essence of the SSL

handshake is shown in Figure 14-16.

This is a simplified version of the SSL handshake. Depending on how SSL is being

used, the handshake can be more complicated, but this is the general idea.

Figure 14-15. HTTP and HTTPS transactions

[Figure: (a) Unencrypted HTTP transaction: establish TCP connection to server port 80; HTTP request sent over TCP; HTTP response sent over TCP; TCP connection close. (b) Encrypted HTTPS transaction: establish TCP connection to server port 443; SSL security parameters handshake; HTTP request sent over SSL/encrypted request sent over TCP; HTTP response sent over SSL/encrypted response sent over TCP; SSL close notification; TCP connection close.]

Server Certificates

SSL supports mutual authentication, carrying server certificates to clients and carrying client certificates back to servers. But today, client certificates are not commonly used for browsing. Most users don’t even possess personal client certificates.* A web server can demand a client certificate, but that seldom occurs in practice.†

On the other hand, secure HTTPS transactions always require server certificates.

When you perform a secure transaction on a web server, such as posting your credit

card information, you want to know that you are talking to the organization you

think you are talking to. Server certificates, signed by a well-known authority, help

you assess how much you trust the server before sending your credit card or per-

sonal information.

The server certificate is an X.509 v3–derived certificate showing the organization’s

name, address, server DNS domain name, and other information (see Figure 14-17).

You and your client software can examine the certificate to make sure everything

seems to be on the up-and-up.

* Client certificates are used for web browsing in some corporate settings, and client certificates are used for secure email. In the future, client certificates may become more common for web browsing, but today they’ve caught on very slowly.

† Some organizational intranets use client certificates to control employee access to information.

Figure 14-16. SSL handshake (simplified)

[Figure: SSL security parameters handshake between client and server: (1) Client sends cipher choices and requests certification. (2) Server sends chosen cipher and certificate. (3) Client sends secret; client and server make keys. (4) Client and server tell each other to start encryption.]

Site Certificate Validation

SSL itself doesn’t require you to examine the web server certificate, but most modern browsers do some simple sanity checks on certificates and provide you with the means to do more thorough checks. One algorithm for web server certificate validation, proposed by Netscape, forms the basis of most browsers’ validation techniques. The steps are:

Date check

First, the browser checks the certificate’s start and end dates to ensure the certifi-

cate is still valid. If the certificate has expired or has not yet become active, the

certificate validation fails and the browser displays an error.

Signer trust check

Every certificate is signed by some certificate authority (CA), who vouches for

the server. There are different levels of certificate, each requiring different levels

of background verification. For example, if you apply for an e-commerce server

certificate, you usually need to provide legal proof of incorporation as a business.

Anyone can generate certificates, but some CAs are well-known organizations

with well-understood procedures for verifying the identity and good business

behavior of certificate applicants. For this reason, browsers ship with a list of

signing authorities that are trusted. If a browser receives a certificate signed by

some unknown (and possibly malicious) authority, the browser usually displays

a warning. Browsers also may choose to accept any certificates with a valid sign-

ing path to a trusted CA. In other words, if a trusted CA signs a certificate for

“Sam’s Signing Shop” and Sam’s Signing Shop signs a site certificate, the

browser may accept the certificate as deriving from a valid CA path.

Figure 14-17. HTTPS certificates are X.509 certificates with site information

[Figure: the server sends the client a server certificate with these fields:]

Certificate serial number     35:DE:F4:CF
Certificate expiration date   Wed, Sep 17, 2003
Site’s organization name      Joe’s Hardware Online
Site’s DNS hostname           www.joes-hardware.com
Site’s public key
Certificate issuer name       RSA Data Security
Certificate issuer signature

Signature check

Once the signing authority is judged as trustworthy, the browser checks the cer-

tificate’s integrity by applying the signing authority’s public key to the signature

and comparing it to the checksum.

Site identity check

To prevent a server from copying someone else’s certificate or intercepting their

traffic, most browsers try to verify that the domain name in the certificate matches

the domain name of the server they talked to. Server certificates usually contain a

single domain name, but some CAs create certificates that contain lists of server

names or wildcarded domain names, for clusters or farms of servers. If the host-

name does not match the identity in the certificate, user-oriented clients must

either notify the user or terminate the connection with a bad certificate error.
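Two of these checks, the date check and the site identity check, are simple enough to sketch in C. The sketch assumes that the certificate’s validity dates and name have already been extracted into plain values. The function names are invented for this example, and the wildcard rule shown (a leading “*.” stands for exactly one hostname label) is one common interpretation rather than a complete implementation of any browser’s rules.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */
#include <time.h>

/* Date check: the certificate must already be active and not yet expired. */
int cert_date_ok(time_t not_before, time_t not_after, time_t now)
{
    return now >= not_before && now <= not_after;
}

/* Site identity check: compare the name in the certificate with the
 * hostname the client connected to. Returns 1 on a match, 0 otherwise. */
int cert_name_matches(const char *cert_name, const char *hostname)
{
    assert(cert_name != NULL && hostname != NULL);
    if (strncmp(cert_name, "*.", 2) == 0) {
        /* The wildcard covers one non-empty leading label of the hostname. */
        const char *first_dot = strchr(hostname, '.');
        if (first_dot == NULL || first_dot == hostname)
            return 0;
        return strcasecmp(first_dot + 1, cert_name + 2) == 0;
    }
    return strcasecmp(cert_name, hostname) == 0;
}
```

With these rules, a certificate made out to *.securesites.com matches cajun-shop.securesites.com but not www.cajun-shop.com, which is the mismatch discussed in the next section.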

Virtual Hosting and Certificates

It’s sometimes tricky to deal with secure traffic on sites that are virtually hosted (mul-

tiple hostnames on a single server). Some popular web server programs support only

a single certificate. If a user arrives for a virtual hostname that does not strictly match

the certificate name, a warning box is displayed.

For example, consider the Louisiana-themed e-commerce site Cajun-Shop.com. The

site’s hosting provider provided the official name cajun-shop.securesites.com. When

users go to https://www.cajun-shop.com, the official hostname listed in the server cer-

tificate (*.securesites.com) does not match the virtual hostname the user browsed to

(www.cajun-shop.com), and the warning in Figure 14-18 appears.

To prevent this problem, the owners of Cajun-Shop.com redirect all users to cajun-

shop.securesites.com when they begin secure transactions. Cert management for vir-

tually hosted sites can be a little tricky.

A Real HTTPS Client

SSL is a complicated binary protocol. Unless you are a crypto expert, you shouldn’t

send raw SSL traffic directly. Thankfully, several commercial and open source librar-

ies exist to make it easier to program SSL clients and servers.

OpenSSL

OpenSSL is the most popular open source implementation of SSL and TLS. The

OpenSSL Project is a collaborative volunteer effort to develop a robust, commercial-

grade, full-featured toolkit implementing the SSL and TLS protocols, as well as a full-

strength, general-purpose cryptography library. You can get information about

OpenSSL, and download the software, from http://www.openssl.org.


You might also hear of SSLeay (pronounced S-S-L-e-a-y). OpenSSL is the successor

to the SSLeay library, and it has a very similar interface. SSLeay was originally devel-

oped by Eric A. Young (the “eay” of SSLeay).

A Simple HTTPS Client

In this section, we’ll use the OpenSSL package to write an extremely primitive HTTPS client. This client establishes an SSL connection with a server, prints out some identification information from the site server, sends an HTTP GET request across the secure channel, receives an HTTP response, and prints the response.

Figure 14-18. Certificate name mismatches bring up certificate error dialog boxes

[Figure: (a) The hostname in this URL (www.cajun-shop.com) does not match the name in the certificate, because the site is virtually hosted, and the certificate is made out to *.securesites.com. (b) A dialog box warns the user that the site’s certificate has a valid date and is from a valid certificate authority, but the name listed in the certificate does not match the site requested in the URL. (c) ... button, and sees that the certificate is a wildcard certificate made out to “*.securesites.com”. With this information, the user can decide whether to accept or decline the certificate. (d) Accepting the certificate loads the page through the secure HTTPS protocol. To avoid this kind of user error, this particular site directs all HTTPS traffic to the hostname alias cajun-shop.securesites.com. This virtual hostname matches the name on the certificate provided by the ISP as part of their commerce package.]

The C program shown below is an OpenSSL implementation of the trivial HTTPS

client. To keep the program simple, error-handling and certificate-processing logic

has not been included.

Because error handling has been removed from this example program, you should

use it only for explanatory value. The software will crash or otherwise misbehave in

normal error conditions.

/**********************************************************************

* https_client.c --- very simple HTTPS client with no error checking

* usage: https_client servername

**********************************************************************/

#include <stdio.h>

#include <memory.h>

#include <errno.h>

#include <sys/types.h>

#include <sys/socket.h>

#include <netinet/in.h>

#include <arpa/inet.h>

#include <netdb.h>

#include <openssl/crypto.h>

#include <openssl/x509.h>

#include <openssl/pem.h>

#include <openssl/ssl.h>

#include <openssl/err.h>

void main(int argc, char **argv)

{

SSL *ssl;

SSL_CTX *ctx;

SSL_METHOD *client_method;

X509 *server_cert;

int sd,err;

char *str,*hostname,outbuf[4096],inbuf[4096],host_header[512];

struct hostent *host_entry;

struct sockaddr_in server_socket_address;

struct in_addr ip;

/*========================================*/

/* (1) initialize SSL library */

/*========================================*/

SSLeay_add_ssl_algorithms( );

client_method = SSLv2_client_method( );

SSL_load_error_strings( );

ctx = SSL_CTX_new(client_method);

A Real HTTPS Client |331

printf("(1) SSL context initialized\n\n");

/*=============================================*/

/* (2) convert server hostname into IP address */

/*=============================================*/

hostname = argv[1];

host_entry = gethostbyname(hostname);

bcopy(host_entry->h_addr, &(ip.s_addr), host_entry->h_length);

printf("(2) '%s' has IP address '%s'\n\n", hostname, inet_ntoa(ip));

/*=================================================*/

/* (3) open a TCP connection to port 443 on server */

/*=================================================*/

sd = socket (AF_INET, SOCK_STREAM, 0);

memset(&server_socket_address, '\0', sizeof(server_socket_address));

server_socket_address.sin_family = AF_INET;

server_socket_address.sin_port = htons(443);

memcpy(&(server_socket_address.sin_addr.s_addr),

host_entry->h_addr, host_entry->h_length);

err = connect(sd, (struct sockaddr*) &server_socket_address,

sizeof(server_socket_address));

if (err < 0) { perror("can't connect to server port"); exit(1); }

printf("(3) TCP connection open to host '%s', port %d\n\n",

hostname, server_socket_address.sin_port);

/*========================================================*/

/* (4) initiate the SSL handshake over the TCP connection */

/*========================================================*/

ssl = SSL_new(ctx); /* create SSL stack endpoint */

SSL_set_fd(ssl, sd); /* attach SSL stack to socket */

err = SSL_connect(ssl); /* initiate SSL handshake */

printf("(4) SSL endpoint created & handshake completed\n\n");

/*============================================*/

/* (5) print out the negotiated cipher chosen */

/*============================================*/

printf("(5) SSL connected with cipher: %s\n\n", SSL_get_cipher(ssl));

/*========================================*/

/* (6) print out the server's certificate */

/*========================================*/

server_cert = SSL_get_peer_certificate(ssl);

if (server_cert == NULL) { fprintf(stderr, "server sent no certificate\n"); exit(1); }

printf("(6) server's certificate was received:\n\n");

str = X509_NAME_oneline(X509_get_subject_name(server_cert), 0, 0);

printf(" subject: %s\n", str);

OPENSSL_free(str); /* X509_NAME_oneline allocated this string */

str = X509_NAME_oneline(X509_get_issuer_name(server_cert), 0, 0);

printf(" issuer: %s\n\n", str);

OPENSSL_free(str);

/* certificate verification would happen here */

X509_free(server_cert);

/*********************************************************/

/* (7) handshake complete --- send HTTP request over SSL */

/*********************************************************/

snprintf(host_header,sizeof(host_header),"Host: %s:443\r\n",hostname); /* bounded copy */

strcpy(outbuf,"GET / HTTP/1.0\r\n");

strcat(outbuf,host_header);

strcat(outbuf,"Connection: close\r\n");

strcat(outbuf,"\r\n");

err = SSL_write(ssl, outbuf, strlen(outbuf));

shutdown (sd, 1); /* send EOF to server */

printf("(7) sent HTTP request over encrypted channel:\n\n%s\n",outbuf);

/**************************************************/

/* (8) read back HTTP response from the SSL stack */

/**************************************************/

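err = SSL_read(ssl, inbuf, sizeof(inbuf) - 1);

if (err < 0) { fprintf(stderr, "SSL_read failed\n"); exit(1); } /* never index with a negative count */

inbuf[err] = '\0';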

printf ("(8) got back %d bytes of HTTP response:\n\n%s\n",err,inbuf);

/************************************************/

/* (9) all done, so close connection & clean up */

/************************************************/

SSL_shutdown(ssl);

close (sd);

SSL_free (ssl);

SSL_CTX_free (ctx);

printf("(9) all done, cleaned up and closed connection\n\n");

}

This example compiles and runs on Sun Solaris, but it is illustrative of how SSL programs work on many OS platforms. This entire program, including all the encryption and key and certificate management, fits in a three-page C program, thanks to the powerful features provided by OpenSSL.

Let’s walk through the program section by section:

• The top of the program includes the header files needed for TCP networking and SSL.

• Section 1 creates the local context that keeps track of the handshake parameters

and other state about the SSL connection, by calling SSL_CTX_new.

• Section 2 converts the input hostname (provided as a command-line argument)

to an IP address, using the Unix gethostbyname function. Other platforms may

have other ways to provide this facility.

• Section 3 opens a TCP connection to port 443 on the server by creating a local

socket, setting up the remote address information, and connecting to the remote

server.

• Once the TCP connection is established, we attach the SSL layer to the TCP con-

nection using SSL_new and SSL_set_fd and perform the SSL handshake with the

server by calling SSL_connect. When section 4 is done, we have a functioning

SSL channel established, with ciphers chosen and certificates exchanged.

• Section 5 prints out the value of the chosen bulk-encryption cipher.

• Section 6 prints out some of the information contained in the X.509 certificate

sent back from the server, including information about the certificate holder and

the organization that issued the certificate. The OpenSSL library doesn’t do any-

thing special with the information in the server certificate. A real SSL applica-

tion, such as a web browser, would do some sanity checks on the certificate to

make sure it is signed properly and came from the right host. We discussed what

browsers do with server certificates in “Site Certificate Validation.”

• At this point, our SSL connection is ready to use for secure data transfer. In sec-

tion 7, we send the simple HTTP request “GET / HTTP/1.0” over the SSL chan-

nel using SSL_write, then close the outbound half of the connection.

• In section 8, we read the response back from the connection using SSL_read, and

print it on the screen. Because the SSL layer takes care of all the encryption and

decryption, we can just write and read normal HTTP commands.

• Finally, we clean up in section 9.

Refer to http://www.openssl.org for more information about the OpenSSL libraries.

Executing Our Simple OpenSSL Client

The following shows the output of our simple HTTP client when pointed at a secure

server. In this case, we pointed the client at the home page of the Morgan Stanley

Online brokerage. Online trading companies make extensive use of HTTPS.

%https_client clients1.online.msdw.com

(1) SSL context initialized

(2) 'clients1.online.msdw.com' has IP address '63.151.15.11'

(3) TCP connection open to host 'clients1.online.msdw.com', port 443

(4) SSL endpoint created & handshake completed

(5) SSL connected with cipher: DES-CBC3-MD5

(6) server's certificate was received:

subject: /C=US/ST=Utah/L=Salt Lake City/O=Morgan Stanley/OU=Online/CN=clients1.online.msdw.com

issuer: /C=US/O=RSA Data Security, Inc./OU=Secure Server Certification Authority

(7) sent HTTP request over encrypted channel:

GET / HTTP/1.0

Host: clients1.online.msdw.com:443

Connection: close

(8) got back 615 bytes of HTTP response:

HTTP/1.1 302 Found

Date: Sat, 09 Mar 2002 09:43:42 GMT

Server: Stronghold/3.0 Apache/1.3.14 RedHat/3013c (Unix) mod_ssl/2.7.1 OpenSSL/0.9.6

Location: https://clients.online.msdw.com/cgi-bin/ICenter/home

Connection: close

Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<TITLE>302 Found</TITLE>

</HEAD><BODY>

<H1>Found</H1>

The document has moved <A HREF="https://clients.online.msdw.com/cgi-bin/ICenter/

home">here</A>.<P>

<HR>

<ADDRESS>Stronghold/3.0 Apache/1.3.14 RedHat/3013c Server at clients1.online.msdw.com

Port 443</ADDRESS>

</BODY></HTML>

(9) all done, cleaned up and closed connection

As soon as the first four sections are completed, the client has an open SSL connec-

tion. It can then inquire about the state of the connection and chosen parameters

and can examine server certificates.

In this example, the client and server negotiated the DES-CBC3-MD5 bulk-encryption

cipher. You also can see that the server site certificate belongs to the organization

“Morgan Stanley” in “Salt Lake City, Utah, USA”. The certificate was granted by RSA

Data Security, and the hostname is “clients1.online.msdw.com,” which matches our

request.

Once the SSL channel is established and the client feels comfortable about the site certificate, it sends its HTTP request over the secure channel. In our example, the client sends a simple “GET / HTTP/1.0” HTTP request and receives back a 302 Found response, redirecting the user to a different URL.

Tunneling Secure Traffic Through Proxies

Clients often use web proxy servers to access web servers on their behalf (proxies are

discussed in Chapter 6). For example, many corporations place a proxy at the secu-

rity perimeter of the corporate network and the public Internet (Figure 14-19). The

proxy is the only device permitted by the firewall routers to exchange HTTP traffic,

and it may employ virus checking or other content controls.

But once the client starts encrypting the data to the server, using keys negotiated with the server’s public key, the proxy no longer has the ability to read the HTTP header! And if the proxy cannot read the HTTP header, it won’t know where to forward the request (Figure 14-20).

To make HTTPS work with proxies, a few modifications are needed to tell the proxy

where to connect. One popular technique is the HTTPS SSL tunneling protocol.

Figure 14-19. Corporate firewall proxy [diagram: client, firewall proxy, security perimeter, public Internet]

Figure 14-20. Proxy can’t proxy an encrypted request [diagram: client17.mycompany.com, proxy.mycompany.com, www.cajun-gifts.com; the request appears as unreadable encrypted bytes]

Using the HTTPS tunneling protocol, the client first tells the proxy the secure host

and port to which it wants to connect. It does this in plaintext, before encryption

starts, so the proxy can read this information.

HTTP is used to send the plaintext endpoint information, using a new extension

method called CONNECT. The CONNECT method tells the proxy to open a con-

nection to the desired host and port number and, when that’s done, to tunnel data

directly between the client and server. The CONNECT method is a one-line text

command that provides the hostname and port of the secure origin server, separated

by a colon. The host:port is followed by a space and an HTTP version string fol-

lowed by a CRLF. After that there is a series of zero or more HTTP request header

lines, followed by an empty line. After the empty line, if the handshake to establish

the connection was successful, SSL data transfer can begin. Here is an example:

CONNECT home.netscape.com:443 HTTP/1.0

User-agent: Mozilla/1.1N

After the empty line in the request, the client will wait for a response from the proxy.

The proxy will evaluate the request and make sure that it is valid and that the user is

authorized to request such a connection. If everything is in order, the proxy will

make a connection to the destination server and, if successful, send a 200 Connec-

tion Established response to the client.

HTTP/1.0 200 Connection established

Proxy-agent: Netscape-Proxy/1.1
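As a rough sketch of the client side of this exchange, the C fragment below formats a CONNECT request and checks the proxy's status line. The helper names build_connect_request and tunnel_established are hypothetical and belong to no library. The sketch sends no request headers, and treating any 2xx status as success is our assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a minimal CONNECT request for host:port into
 * buf (request line followed by the empty line, no headers).
 * Returns the request length, or -1 if buf is too small. */
int build_connect_request(char *buf, size_t size, const char *host, int port)
{
    int n = snprintf(buf, size, "CONNECT %s:%d HTTP/1.0\r\n\r\n", host, port);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Hypothetical helper: return 1 if the proxy's status line reports success
 * (assumed here to mean any 2xx code), 0 otherwise. */
int tunnel_established(const char *response)
{
    int major, minor, status;

    if (sscanf(response, "HTTP/%d.%d %d", &major, &minor, &status) != 3)
        return 0;
    return status >= 200 && status < 300;
}
```

A client would write the request to the proxy socket in plaintext, read the proxy's reply, and start the SSL handshake on the same socket only if tunnel_established reports success.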

For more information about secure tunnels and security proxies, refer back to “Tun-

nels” in Chapter 8.

For More Information

Security and cryptography are hugely important and hugely complicated topics. If

you’d like to learn more about HTTP security, digital cryptography, digital certifi-

cates, and the Public-Key Infrastructure, here are a few starting points.

HTTP Security

Web Security, Privacy & Commerce

Simson Garfinkel, O’Reilly & Associates, Inc. This is one of the best, most read-

able introductions to web security and the use of SSL/TLS and digital certificates.

http://www.ietf.org/rfc/rfc2818.txt

RFC 2818, “HTTP Over TLS,” specifies how to implement secure HTTP over

Transport Layer Security (TLS), the modern successor to SSL.

http://www.ietf.org/rfc/rfc2817.txt

RFC 2817, “Upgrading to TLS Within HTTP/1.1,” explains how to use the

Upgrade mechanism in HTTP/1.1 to initiate TLS over an existing TCP connec-

tion. This allows unsecured and secured HTTP traffic to share the same well-

known port (in this case, http: at 80 rather than https: at 443). It also enables

virtual hosting, so a single HTTP+TLS server can disambiguate traffic intended

for several hostnames at a single IP address.

SSL and TLS

http://www.ietf.org/rfc/rfc2246.txt

RFC 2246, “The TLS Protocol Version 1.0,” specifies Version 1.0 of the TLS pro-

tocol (the successor to SSL). TLS provides communications privacy over the

Internet. The protocol allows client/server applications to communicate in a way

that is designed to prevent eavesdropping, tampering, and message forgery.

http://developer.netscape.com/docs/manuals/security/sslin/contents.htm

“Introduction to SSL” introduces the Secure Sockets Layer (SSL) protocol. Origi-

nally developed by Netscape, SSL has been universally accepted on the World

Wide Web for authenticated and encrypted communication between clients and

servers.

http://www.netscape.com/eng/ssl3/draft302.txt

“The SSL Protocol Version 3.0” is Netscape’s 1996 specification for SSL.

http://developer.netscape.com/tech/security/ssl/howitworks.html

“How SSL Works” is Netscape’s introduction to key cryptography.

http://www.openssl.org

The OpenSSL Project is a collaborative effort to develop a robust, commercial-

grade, full-featured, and open source toolkit implementing the Secure Sockets

Layer (SSL v2/v3) and Transport Layer Security (TLS v1) protocols, as well as a

full-strength, general-purpose cryptography library. The project is managed by a

worldwide community of volunteers that use the Internet to communicate, plan,

and develop the OpenSSL toolkit and its related documentation. OpenSSL is

based on the excellent SSLeay library developed by Eric A. Young and Tim J.

Hudson. The OpenSSL toolkit is licensed under an Apache-style license, which

basically means that you are free to get and use it for commercial and noncom-

mercial purposes, subject to some simple license conditions.

Public-Key Infrastructure

http://www.ietf.org/html.charters/pkix-charter.html

The IETF PKIX Working Group was established in 1995 with the intent of

developing Internet standards needed to support an X.509-based Public-Key

Infrastructure. This is a nice summary of that group’s activities.

http://www.ietf.org/rfc/rfc2459.txt

RFC 2459, “Internet X.509 Public Key Infrastructure Certificate and CRL Pro-

file,” contains details about X.509 v3 digital certificates.

Digital Cryptography

Applied Cryptography

Bruce Schneier, John Wiley & Sons. This is a classic book on cryptography for

implementors.

The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography

Simon Singh, Anchor Books. This entertaining book is a cryptography primer.

While it’s not intended for technology experts, it is a lively historical tour of

secret coding.

PART IV

Entities, Encodings, and

Internationalization

Part IV is all about the entity bodies of HTTP messages and the content that the

entity bodies ship around as cargo:

• Chapter 15, Entities and Encodings, describes the formats and syntax of HTTP

content.

• Chapter 16, Internationalization, surveys the web standards that allow people to

exchange content in different languages and different character sets, around the

globe.

• Chapter 17, Content Negotiation and Transcoding, explains mechanisms for

negotiating acceptable content.

CHAPTER 15

Entities and Encodings

HTTP ships billions of media objects of all kinds every day. Images, text, movies,

software programs... you name it, HTTP ships it. HTTP also makes sure that its

messages can be properly transported, identified, extracted, and processed. In partic-

ular, HTTP ensures that its cargo:

• Can be identified correctly (using Content-Type media formats and Content-

Language headers) so browsers and other clients can process the content properly

• Can be unpacked properly (using Content-Length and Content-Encoding headers)

• Is fresh (using entity validators and cache-expiration controls)

• Meets the user’s needs (based on content-negotiation Accept headers)

• Moves quickly and efficiently through the network (using range requests, delta

encoding, and other data compression)

• Arrives complete and untampered with (using transfer encoding headers and

Content-MD5 checksums)

To make all this happen, HTTP uses well-labeled entities to carry content.

This chapter discusses entities, their associated entity headers, and how they work to

transport web cargo. We’ll show how HTTP provides the essentials of content size,

type, and encodings. We’ll also explain some of the more complicated and powerful

features of HTTP entities, including range requests, delta encoding, digests, and

chunked encodings.

This chapter covers:

• The format and behavior of HTTP message entities as HTTP data containers

• How HTTP describes the size of entity bodies, and what HTTP requires in the

way of sizing

• The entity headers used to describe the format, alphabet, and language of con-

tent, so clients can process it properly

• Reversible content encodings, used by senders to transform the content data for-

mat before sending to make it take up less space or be more secure

• Transfer encoding, which modifies how HTTP ships data to enhance the commu-

nication of some kinds of content, and chunked encoding, a transfer encoding

that chops data into multiple pieces to deliver content of unknown length safely

• The assortment of tags, labels, times, and checksums that help clients get the lat-

est version of requested content

• The validators that act like version numbers on content, so web applications can

ensure they have fresh content, and the HTTP header fields designed to control

object freshness

• Ranges, which are useful for continuing aborted downloads where they left off

• HTTP delta encoding extensions, which allow clients to request just those parts

of a web page that actually have changed since a previously viewed revision

• Checksums of entity bodies, which are used to detect changes in entity content

as it passes through proxies

Messages Are Crates, Entities Are Cargo

If you think of HTTP messages as the crates of the Internet shipping system, then HTTP entities are the actual cargo of the messages. Figure 15-1 shows a simple entity, carried inside an HTTP response message.

Figure 15-1. Message entity is made up of entity headers and entity body

HTTP/1.0 200 OK
Server: Netscape-Enterprise/3.6
Date: Sun, 17 Sep 2000 00:01:05 GMT
Content-type: text/plain
Content-length: 18

Hi! I'm a message!

(The Content-type and Content-length lines are the entity headers, “Hi! I'm a message!” is the entity body, and together they make up the entity.)

The entity headers indicate a plaintext document (Content-Type: text/plain) that is a mere 18 characters long (Content-Length: 18). As always, a blank line (CRLF) separates the header fields from the start of the body.

HTTP entity headers (covered in Chapter 3) describe the contents of an HTTP message. HTTP/1.1 defines 10 primary entity header fields:

Content-Type

The kind of object carried by the entity.

Content-Length

The length, in bytes, of the entity body being sent.

Content-Language

The human language that best matches the object being sent.

Content-Encoding

Any transformation (compression, etc.) performed on the object data.

Content-Location

An alternate location for the object at the time of the request.

Content-Range

If this is a partial entity, this header defines which pieces of the whole are included.

Content-MD5

A checksum of the contents of the entity body.

Last-Modified

The date on which this particular content was created or modified at the server.

Expires

The date and time at which this entity data will become stale.

Allow

What request methods are legal on this resource; e.g., GET and HEAD.

ETag

A unique validator for this particular instance* of the document. The ETag

header is not defined formally as an entity header, but it is an important header

for many operations involving entities.

Cache-Control

Directives on how this document can be cached. The Cache-Control header, like

the ETag header, is not defined formally as an entity header.

Entity Bodies

The entity body just contains the raw cargo.† Any other descriptive information is contained in the headers. Because the entity body cargo is just raw data, the entity headers are needed to describe the meaning of that data. For example, the Content-Type entity header tells us how to interpret the data (image, text, etc.), and the Content-Encoding entity header tells us if the data was compressed or otherwise recoded. We talk about all of this and more in upcoming sections.

The raw content begins immediately after the blank CRLF line that marks the end of

the header fields. Whatever the content is—text or binary, document or image, com-

pressed or uncompressed, English or French or Japanese—it is placed right after the

CRLF.

* Instances are described later in this chapter, in the section “Time-Varying Instances.”

† If there is a Content-Encoding header, the content already has been encoded by the content-encoding algo-

rithm, and the first byte of the entity is the first byte of the encoded (e.g., compressed) cargo.

Figure 15-2 shows two examples of real HTTP messages, one carrying a text entity,

the other carrying an image entity. The hexadecimal values show the exact contents

of the message:

• In Figure 15-2a, the entity body begins at byte number 65, right after the end-of-

headers CRLF. The entity body contains the ASCII characters for “Hi! I’m a

message!”

• In Figure 15-2b, the entity body begins at byte number 67. The entity body contains the binary contents of the GIF image. GIF files begin with a 6-byte version signature, a 16-bit width, and a 16-bit height. You can see all three of these directly in the entity body.
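The byte offsets quoted above can be checked with a few lines of C. This is a minimal sketch, assuming the whole header block is already in memory. The function name entity_body_offset is ours and comes from no HTTP library.

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the first entity-body byte in msg, that is, the byte
 * right after the CRLF CRLF that ends the headers. Returns -1 if the end of
 * the headers has not been received yet. memcmp is used rather than a string
 * search because the body may be binary. */
long entity_body_offset(const char *msg, size_t len)
{
    size_t i;

    for (i = 0; i + 4 <= len; i++) {
        if (memcmp(msg + i, "\r\n\r\n", 4) == 0)
            return (long)(i + 4);
    }
    return -1;
}
```

Counting from zero, the headers of the text/plain message occupy bytes 0 through 64, so the function returns 65 for that message and 67 for the image/gif message, matching the figure.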

Content-Length: The Entity’s Size

The Content-Length header indicates the size of the entity body in the message, in

bytes. The size includes any content encodings (the Content-Length of a gzip-

compressed text file will be the compressed size, not the original size).

The Content-Length header is mandatory for messages with entity bodies, unless the

message is transported using chunked encoding. Content-Length is needed to detect

premature message truncation when servers crash and to properly segment messages

that share a persistent connection.

Figure 15-2. Hex dumps of real message content (raw message content follows blank CRLF)

(a) Text/plain entity in HTTP response message

HTTP/1.0 200 OK
Content-type: text/plain
Content-length: 18

Hi! I’m a message!

Annotations: final LF (0x0A = <LF>), start-of-content (0x48 = “H”)

(b) Image/gif entity in HTTP response message

HTTP/1.0 200 OK
Content-Type: image/gif
Content-Length: 34867

Annotations: final LF (0x0A = <LF>), start-of-content (“GIF87a”), width (0x0227 = 551), height (0x0206 = 518)

Detecting Truncation

Older versions of HTTP used connection close to delimit the end of a message. But, without Content-Length, clients cannot distinguish between successful connection close at the end of a message and connection close due to a server crash in the middle of a message. Clients need Content-Length to detect message truncation.

Message truncation is especially severe for caching proxy servers. If a cache receives a

truncated message and doesn’t recognize the truncation, it may store the defective

content and serve it many times. Caching proxy servers generally do not cache HTTP

bodies that don’t have an explicit Content-Length header, to reduce the risk of cach-

ing truncated messages.

Incorrect Content-Length

An incorrect Content-Length can cause even more damage than a missing Content-

Length. Because some early clients and servers had well-known bugs with respect to

Content-Length calculations, some clients, servers, and proxies contain algorithms to

try to detect and correct interactions with broken servers. HTTP/1.1 user agents offi-

cially are supposed to notify the user when an invalid length is received and detected.

Content-Length and Persistent Connections

Content-Length is essential for persistent connections. If the response comes across a

persistent connection, another HTTP response can immediately follow the current

response. The Content-Length header lets the client know where one message ends

and the next begins. Because the connection is persistent, the client cannot use con-

nection close to identify the message’s end. Without a Content-Length header, HTTP

applications won’t know where one entity body ends and the next message begins.
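To make the framing concrete, here is a small sketch that finds where the next message starts in a buffer read from a persistent connection. The function name next_message_offset is hypothetical. The sketch assumes the headers are complete and the buffer is NUL-terminated, relies on the POSIX strncasecmp function, and ignores Transfer-Encoding and bodiless responses.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Return the offset at which the next message starts in buf, or -1 if the
 * headers are incomplete or carry no Content-Length header. */
long next_message_offset(const char *buf, size_t len)
{
    size_t hdr_end = 0, pos;
    int found = 0;

    /* Locate the blank line (CRLF CRLF) that ends the header block. */
    for (pos = 0; pos + 4 <= len; pos++) {
        if (memcmp(buf + pos, "\r\n\r\n", 4) == 0) {
            hdr_end = pos + 4;
            found = 1;
            break;
        }
    }
    if (!found)
        return -1;

    /* Check the start of each header line; header names are case-insensitive. */
    pos = 0;
    while (pos < hdr_end) {
        if (hdr_end - pos > 15 &&
            strncasecmp(buf + pos, "Content-Length:", 15) == 0) {
            long body_len = strtol(buf + pos + 15, NULL, 10);
            if (body_len < 0)
                return -1;
            return (long)hdr_end + body_len;   /* headers plus body */
        }
        while (pos < hdr_end && buf[pos] != '\n')   /* skip to next line */
            pos++;
        pos++;
    }
    return -1;
}
```

Calling the function again on the remainder of the buffer locates the message after that one, which is how a client can walk through several responses on a single connection.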

As we will see in “Transfer Encoding and Chunked Encoding,” there is one situation

where you can use persistent connections without having a Content-Length header:

when you use chunked encoding. Chunked encoding sends the data in a series of

chunks, each with a specified size. Even if the server does not know the size of the

entire entity at the time the headers are generated (often because the entity is being

generated dynamically), the server can use chunked encoding to transmit pieces of

well-defined size.
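The chunk format itself is simple enough to sketch here. The following C fragment, with the hypothetical name append_chunk, writes each chunk as a hexadecimal size, a CRLF, the data, and another CRLF, as HTTP/1.1 chunked encoding specifies. It assumes no chunk extensions and no trailers.

```c
#include <stdio.h>
#include <string.h>

/* Append one chunk (hex size, CRLF, data, CRLF) to out at offset *used.
 * Calling it with len == 0 writes the terminating zero-size chunk followed
 * by the final CRLF. Returns 0 on success, -1 if out is too small. */
int append_chunk(char *out, size_t cap, size_t *used,
                 const char *data, size_t len)
{
    char head[32];
    int head_len = snprintf(head, sizeof(head), "%lx\r\n", (unsigned long)len);

    if (head_len < 0 || *used + (size_t)head_len + len + 2 > cap)
        return -1;
    memcpy(out + *used, head, (size_t)head_len);
    *used += (size_t)head_len;
    if (len > 0)
        memcpy(out + *used, data, len);
    *used += len;
    memcpy(out + *used, "\r\n", 2);
    *used += 2;
    return 0;
}
```

A server generating content dynamically could call append_chunk once per piece of output as it becomes available, then once more with a zero length to mark the end of the body.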

Content Encoding

HTTP lets you encode the contents of an entity body, perhaps to make it more

secure or to compress it to take up less space (we explain compression in detail later

in this chapter). If the body has been content-encoded, the Content-Length header

specifies the length, in bytes, of the encoded body, not the length of the original,

unencoded body.

Some HTTP applications have been known to get this wrong and to send the size of the data before the encoding, which causes serious errors, especially with persistent connections. Unfortunately, none of the headers described in the HTTP/1.1 specification can be used to send the length of the original, unencoded body, which makes it difficult for clients to verify the integrity of their unencoding processes.*

Rules for Determining Entity Body Length

The following rules describe how to correctly determine the length and end of an

entity body in several different circumstances. The rules should be applied in order;

the first match applies.

1. If a particular HTTP message type is not allowed to have a body, ignore the

Content-Length header for body calculations. The Content-Length headers are

informational in this case and do not describe the actual body length. (Naïve

HTTP applications can get in trouble if they assume Content-Length always

means there is a body).

The most important example is the HEAD response. The HEAD method

requests that a server send the headers that would have been returned by an

equivalent GET request, but no body. Because a GET response would send back

a Content-Length header, so will the HEAD response—but unlike the GET

response, the HEAD response will not have a body. 1XX, 204, and 304

responses also can have informational Content-Length headers but no entity

body. Messages that forbid entity bodies must terminate at the first empty line

after the headers, regardless of which entity header fields are present.

2. If a message contains a Transfer-Encoding header (other than the default HTTP

“identity” encoding), the entity will be terminated by a special pattern called a

“zero-byte chunk,” unless the message is terminated first by closing the connec-

tion. We’ll discuss transfer encodings and chunked encodings later in this chapter.

3. If a message has a Content-Length header (and the message type allows entity

bodies), the Content-Length value contains the body length, unless there is a

non-identity Transfer-Encoding header. If a message is received with both a

Content-Length header field and a non-identity Transfer-Encoding header field,

you must ignore the Content-Length, because the transfer encoding will change

the way entity bodies are represented and transferred (and probably the number

of bytes transmitted).

4. If the message uses the “multipart/byteranges” media type and the entity length

is not otherwise specified (in the Content-Length header), each part of the multi-

part message will specify its own size. This multipart type is the only entity body

type that self-delimits its own size, so this media type must not be sent unless the

sender knows the recipient can parse it.†

* Even the Content-MD5 header, which can be used to send the 128-bit MD5 of the document, contains the

MD5 of the encoded document. The Content-MD5 header is described later in this chapter.

† Because a Range header might be forwarded by a more primitive proxy that does not understand multipart/byteranges, the sender must delimit the message using methods 1, 3, or 5 in this section if it isn’t sure the receiver understands the self-delimiting format.

5. If none of the above rules match, the entity ends when the connection closes.

In practice, only servers can use connection close to indicate the end of a

message. Clients can’t close the connection to signal the end of client mes-

sages, because that would leave no way for the server to send back a

response.*

6. To be compatible with HTTP/1.0 applications, any HTTP/1.1 request that has

an entity body also must include a valid Content-Length header field (unless the

server is known to be HTTP/1.1-compliant). The HTTP/1.1 specification coun-

sels that if a request contains a body and no Content-Length, the server should

send a 400 Bad Request response if it cannot determine the length of the mes-

sage, or a 411 Length Required response if it wants to insist on receiving a valid

Content-Length.
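Rules 1 through 5 amount to an ordered series of checks, which can be sketched as a small C function. All names here (body_framing, message_info, classify_body) are illustrative, the caller is assumed to have parsed the headers already, and rule 6 (rejecting requests that lack a length) is deliberately not modeled.

```c
/* How the end of an entity body is determined. */
enum body_framing {
    BODY_NONE,            /* rule 1: message type cannot carry a body     */
    BODY_CHUNKED,         /* rule 2: non-identity Transfer-Encoding       */
    BODY_CONTENT_LENGTH,  /* rule 3: Content-Length gives the size        */
    BODY_MULTIPART,       /* rule 4: multipart/byteranges self-delimits   */
    BODY_UNTIL_CLOSE      /* rule 5: body ends when the connection closes */
};

/* Facts about a message, filled in by a (hypothetical) header parser. */
struct message_info {
    int is_head_response;         /* response to a HEAD request            */
    int status;                   /* status code, 0 for requests           */
    int has_transfer_encoding;    /* non-identity Transfer-Encoding header */
    int has_content_length;       /* Content-Length header present         */
    int is_multipart_byteranges;  /* media type is multipart/byteranges    */
};

/* Apply the rules in order; the first match wins. */
enum body_framing classify_body(const struct message_info *m)
{
    if (m->is_head_response ||
        (m->status >= 100 && m->status < 200) ||
        m->status == 204 || m->status == 304)
        return BODY_NONE;
    if (m->has_transfer_encoding)
        return BODY_CHUNKED;
    if (m->has_content_length)
        return BODY_CONTENT_LENGTH;
    if (m->is_multipart_byteranges)
        return BODY_MULTIPART;
    return BODY_UNTIL_CLOSE;
}
```

Because the checks run in order, a message carrying both Content-Length and a non-identity Transfer-Encoding is classified by the transfer encoding, and a HEAD response is classified as bodiless whatever headers it carries.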

Entity Digests

Although HTTP typically is implemented over a reliable transport protocol such

as TCP/IP, parts of messages may get modified in transit for a variety of reasons,

such as noncompliant transcoding proxies or buggy intermediary proxies. To

detect unintended (or undesired) modification of entity body data, the sender

can generate a checksum of the data when the initial entity is generated, and the

receiver can sanity check the checksum to catch any unintended entity modifica-

tion.†

The Content-MD5 header is used by servers to send the result of running the MD5 algorithm on the entity body. Only the server where the response originates may compute and send the Content-MD5 header. Intermediate proxies and caches may not modify or add the header, because that would violate the whole purpose of verifying end-to-end integrity. The Content-MD5 header contains the MD5 of the content after all content encodings have been applied to the entity body and before any transfer encodings have been applied to it. Clients seeking to verify the integrity of the message must first decode the transfer encodings, then compute the MD5 of the resulting entity body, which is still content-encoded. As an example, if a document is compressed using the gzip algorithm, then sent with chunked encoding, the MD5 algorithm is run on the full gzipped body.

In addition to checking message integrity, the MD5 can be used as a key into a

hash table to quickly locate documents and reduce duplicate storage of content.

Despite these possible uses, the Content-MD5 header is not sent often.

* The client could do a half close of just its output connection, but many server applications aren’t designed

to handle this situation and will interpret a half close as the client disconnecting from the server. Connection

management was never well specified in HTTP. See Chapter 4 for more details.

† This method, of course, is not immune to a malicious attack that replaces both the message body and digest

header. It is intended only to detect unintentional modification. Other facilities, such as digest authentica-

tion, are needed to provide safeguards against malicious tampering.

Extensions to HTTP have proposed other digest algorithms in IETF drafts. These

extensions have proposed a new header, Want-Digest, that allows clients to specify

the type of digest they expect with the response. Quality values can be used to sug-

gest multiple digest algorithms and indicate preference.

Media Type and Charset

The Content-Type header field describes the MIME type of the entity body.* The MIME type is a standardized name that describes the underlying type of media carried as cargo (HTML file, Microsoft Word document, MPEG video, etc.). Client applications use the MIME type to properly decipher and process the content.

The Content-Type values are standardized MIME types, registered with the Internet

Assigned Numbers Authority (IANA). MIME types consist of a primary media type

(e.g., text, image, audio), followed by a slash, followed by a subtype that further

specifies the media type. Table 15-1 lists a few common MIME types for the Content-

Type header. More MIME types are listed in Appendix D.

It is important to note that the Content-Type header specifies the media type of the

original entity body. If the entity has gone through content encoding, for example,

the Content-Type header will still specify the entity body type before the encoding.

* In the case of the HEAD request, Content-Type shows the type that would have been sent if it was a GET

request.

Table 15-1. Common media types

Media type                       Description
text/html                        Entity body is an HTML document
text/plain                       Entity body is a document in plain text
image/gif                        Entity body is an image of type GIF
image/jpeg                       Entity body is an image of type JPEG
audio/x-wav                      Entity body contains WAV sound data
model/vrml                       Entity body is a three-dimensional VRML model
application/vnd.ms-powerpoint    Entity body is a Microsoft PowerPoint presentation
multipart/byteranges             Entity body has multiple parts, each containing a different range (in bytes) of the full document
message/http                     Entity body contains a complete HTTP message (see TRACE)

Character Encodings for Text Media

The Content-Type header also supports optional parameters to further specify the

content type. The “charset” parameter is the primary example, specifying the mecha-

nism to convert bits from the entity into characters in a text file:

Content-Type: text/html; charset=iso-8859-4

We talk about character sets in detail in Chapter 16.
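As a rough illustration of how a receiver might use the charset parameter, the following Python sketch splits a Content-Type value into its media type and charset and then decodes a body with it. The function names and the iso-8859-1 fallback are our own assumptions for the example; only the standard library's email.message.Message class is used.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (media_type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() returns the lowercased "type/subtype" without parameters;
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_type(), msg.get_content_charset()


def decode_text_body(body, header_value, default_charset="iso-8859-1"):
    """Decode an entity body (bytes) using the charset named in Content-Type.

    The default charset is an assumption made for this sketch.
    """
    _, charset = parse_content_type(header_value)
    return body.decode(charset or default_charset)


if __name__ == "__main__":
    print(parse_content_type("text/html; charset=iso-8859-4"))
    print(decode_text_body(b"<p>Hello</p>", "text/html; charset=iso-8859-4"))
```

A real client would also need to cope with unknown charset names, which this sketch does not attempt.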

Multipart Media Types

MIME “multipart” email messages contain multiple messages stuck together and

sent as a single, complex message. Each component is self-contained, with its own

set of headers describing its content; the different components are concatenated

together and delimited by a string.

HTTP also supports multipart bodies; however, they typically are sent in only one of

two situations: in fill-in form submissions and in range responses carrying pieces of a

document.

Multipart Form Submissions

When an HTTP fill-in form is submitted, variable-length text fields and uploaded

objects are sent as separate parts of a multipart body, allowing forms to be filled out

with values of different types and lengths. For example, you may choose to fill out a

form that asks for your name and a description with your nickname and a small

photo, while your friend may put down her full name and a long essay describing her

passion for fixing Volkswagen buses.

HTTP sends such requests with a Content-Type: multipart/form-data header or a

Content-Type: multipart/mixed header and a multipart body, like this:

Content-Type: multipart/form-data; boundary=[abcdefghijklmnopqrstuvwxyz]

where the boundary specifies the delimiter string between the different parts of the

body.

The following example illustrates multipart/form-data encoding. Suppose we have

this form:

<FORM action="http://server.com/cgi/handle"

enctype="multipart/form-data"

method="post">

<P>

What is your name? <INPUT type="text" name="submit-name"><BR>

What files are you sending? <INPUT type="file" name="files"><BR>

</FORM>

If the user enters “Sally” in the text-input field and selects the text file “essayfile.txt,”

the user agent might send back the following data:

Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Sally
--AaB03x
Content-Disposition: form-data; name="files"; filename="essayfile.txt"
Content-Type: text/plain

...contents of essayfile.txt...
--AaB03x--

If the user selected a second (image) file, “imagefile.gif,” the user agent might con-

struct the parts as follows:

Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Sally
--AaB03x
Content-Disposition: form-data; name="files"
Content-Type: multipart/mixed; boundary=BbC04y

--BbC04y
Content-Disposition: file; filename="essayfile.txt"
Content-Type: text/plain

...contents of essayfile.txt...
--BbC04y
Content-Disposition: file; filename="imagefile.gif"
Content-Type: image/gif
Content-Transfer-Encoding: binary

...contents of imagefile.gif...
--BbC04y--
--AaB03x--
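To make the layout of a multipart/form-data body concrete, here is a minimal Python sketch that assembles one. The function name, its argument shapes, and the randomly generated default boundary are assumptions of the example, and the sketch does not escape quotes in field names or filenames.

```python
import uuid

CRLF = b"\r\n"


def encode_multipart_formdata(fields, files, boundary=None):
    """Build a multipart/form-data body (hypothetical helper).

    fields: dict mapping field name -> text value
    files:  list of (field_name, filename, content_type, data_bytes)
    Returns (content_type_header_value, body_bytes).
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    body = bytearray()

    for name, value in fields.items():
        body += delimiter + CRLF
        body += f'Content-Disposition: form-data; name="{name}"'.encode("utf-8") + CRLF
        body += CRLF  # a blank line ends the part's headers
        body += value.encode("utf-8") + CRLF

    for name, filename, content_type, data in files:
        body += delimiter + CRLF
        body += (f'Content-Disposition: form-data; name="{name}"; '
                 f'filename="{filename}"').encode("utf-8") + CRLF
        body += f"Content-Type: {content_type}".encode("ascii") + CRLF
        body += CRLF
        body += data + CRLF

    body += delimiter + b"--" + CRLF  # the closing delimiter ends with "--"
    return f"multipart/form-data; boundary={boundary}", bytes(body)


if __name__ == "__main__":
    header, payload = encode_multipart_formdata(
        {"submit-name": "Sally"},
        [("files", "essayfile.txt", "text/plain", b"...contents of essayfile.txt...")],
        boundary="AaB03x",
    )
    print("Content-Type:", header)
    print(payload.decode("ascii"))
```

With the boundary fixed to AaB03x, the output has the same shape as the single-file example above.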

Multipart Range Responses

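HTTP responses to range requests also can be multipart. Such responses come with a Content-Type: multipart/byteranges header and a multipart body with the different ranges. (Some older implementations, like the server in the example below, send the nonstandard multipart/x-byteranges type, which dates from an early draft of the byte-ranges specification.) Here is an example of a multipart response to a request for different ranges of a document: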

HTTP/1.0 206 Partial content
Server: Microsoft-IIS/5.0
Date: Sun, 10 Dec 2000 19:11:20 GMT
Content-Location: http://www.joes-hardware.com/gettysburg.txt
Content-Type: multipart/x-byteranges; boundary=--[abcdefghijklmnopqrstuvwxyz]--
Last-Modified: Sat, 09 Dec 2000 00:38:47 GMT

--[abcdefghijklmnopqrstuvwxyz]--
Content-Type: text/plain
Content-Range: bytes 0-174/1441

Fourscore and seven years ago our fathers brought forth on this continent
a new nation, conceived in liberty and dedicated to the proposition that
all men are created equal.
--[abcdefghijklmnopqrstuvwxyz]--
Content-Type: text/plain
Content-Range: bytes 552-761/1441

But in a larger sense, we can not dedicate, we can not consecrate,
we can not hallow this ground. The brave men, living and dead who
struggled here have consecrated it far above our poor power to add
or detract.
--[abcdefghijklmnopqrstuvwxyz]--
Content-Type: text/plain
Content-Range: bytes 1344-1440/1441

and that government of the people, by the people, for the people shall
not perish from the earth.
--[abcdefghijklmnopqrstuvwxyz]--

Range requests are discussed in more detail later in this chapter.

Content Encoding

HTTP applications sometimes want to encode content before sending it. For exam-

ple, a server might compress a large HTML document before sending it to a client

that is connected over a slow connection, to help lessen the time it takes to transmit

the entity. A server might scramble or encrypt the contents in a way that prevents

unauthorized third parties from viewing the contents of the document.

These types of encodings are applied to the content at the sender. Once the content

is content-encoded, the encoded data is sent to the receiver in the entity body as

usual.

The Content-Encoding Process

The content-encoding process is:

1. A web server generates an original response message, with original Content-

Type and Content-Length headers.

2. A content-encoding server (perhaps the origin server or a downstream proxy)

creates an encoded message. The encoded message has the same Content-Type

but (if, for example, the body is compressed) a different Content-Length. The

content-encoding server adds a Content-Encoding header to the encoded mes-

sage, so that a receiving application can decode it.

3. A receiving program gets the encoded message, decodes it, and obtains the

original.
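The three steps above can be sketched in a few lines of Python using the standard gzip module. The helper names and the use of a plain dictionary for headers are assumptions made for illustration; a real server would operate on its own message objects.

```python
import gzip


def content_encode(headers, body):
    """Step 2: produce an encoded message from an original one."""
    encoded_body = gzip.compress(body)
    encoded_headers = dict(headers)  # Content-Type is carried over unchanged
    encoded_headers["Content-Encoding"] = "gzip"
    encoded_headers["Content-Length"] = str(len(encoded_body))
    return encoded_headers, encoded_body


def content_decode(headers, body):
    """Step 3: recover the original entity at the receiver."""
    decoded_headers = dict(headers)
    if decoded_headers.pop("Content-Encoding", "identity") == "gzip":
        body = gzip.decompress(body)
    decoded_headers["Content-Length"] = str(len(body))
    return decoded_headers, body


if __name__ == "__main__":
    # Step 1: the original response, with original Content-Type and Content-Length.
    original_body = b"<html>" + b"<p>Joe's Hardware</p>" * 200 + b"</html>"
    original_headers = {
        "Content-Type": "text/html",
        "Content-Length": str(len(original_body)),
    }
    enc_headers, enc_body = content_encode(original_headers, original_body)
    print(enc_headers)
    dec_headers, dec_body = content_decode(enc_headers, enc_body)
    print(dec_body == original_body)
```

Note that only Content-Length and Content-Encoding change between the original and encoded headers; Content-Type still names the format of the original entity.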

Figure 15-3 sketches a content-encoding example.

Here, an HTML page is encoded by a gzip content-encoding function, to produce a

smaller, compressed body. The compressed body is sent across the network, flagged

with the gzip encoding. The receiving client decompresses the entity using the gzip

decoder.

This response snippet shows another example of an encoded response (a com-

pressed image):

HTTP/1.1 200 OK
Date: Fri, 05 Nov 1999 22:35:15 GMT
Server: Apache/1.2.4
Content-Length: 6096
Content-Type: image/gif
Content-Encoding: gzip
[...]

Note that the Content-Type header can and should still be present in the message. It

describes the original format of the entity—information that may be necessary for

displaying the entity once it has been decoded. Remember that the Content-Length

header now represents the length of the encoded body.

Content-Encoding Types

HTTP defines a few standard content-encoding types and allows for additional

encodings to be added as extension encodings. Encodings are standardized through

the IANA, which assigns a unique token to each content-encoding algorithm. The

Content-Encoding header uses these standardized token values to describe the algo-

rithm used in the encoding.

Some of the common content-encoding tokens are listed in Table 15-2.

Figure 15-3. Content-encoding example

[Figure: original content (Content-type: text/html, Content-length: 12480) passes through a gzip content encoder, producing content-encoded content (Content-type: text/html, Content-length: 3907, Content-encoding: gzip); a gzip content decoder at the receiver restores the original content (Content-type: text/html, Content-length: 12480).]

The gzip, compress, and deflate encodings are lossless compression algorithms used

to reduce the size of transmitted messages without loss of information. Of these, gzip

typically is the most effective compression algorithm and is the most widely used.

Accept-Encoding Headers

Of course, we don’t want servers encoding content in ways that the client can’t deci-

pher. To prevent servers from using encodings that the client doesn’t support, the

client passes along a list of supported content encodings in the Accept-Encoding

request header. If the HTTP request does not contain an Accept-Encoding header, a

server can assume that the client will accept any encoding (equivalent to passing

Accept-Encoding: *).

Figure 15-4 shows an example of Accept-Encoding in an HTTP transaction.

Table 15-2. Content-encoding tokens

Content-encoding value   Description
gzip                     Indicates that the GNU zip encoding was applied to the entity.(a)
compress                 Indicates that the Unix file compression program has been run on the entity.
deflate                  Indicates that the entity has been compressed into the zlib format.(b)
identity                 Indicates that no encoding has been performed on the entity. When a Content-Encoding header is not present, this can be assumed.

(a) RFC 1952 describes the gzip encoding.
(b) RFCs 1950 and 1951 describe the zlib format and deflate compression.

Figure 15-4. Content encoding

Request message:
    GET /logo.gif HTTP/1.1
    Accept-encoding: gzip
    [...]

Response message:
    HTTP/1.1 200 OK
    Content-type: image/gif
    Content-encoding: gzip
    [...]

The server compresses the image with gzip to send a smaller file over the thin network connection between itself and the client. This saves network bandwidth and reduces the time that the client waits for the transfer, although the client must then spend time decompressing the image with gunzip.

The Accept-Encoding field contains a comma-separated list of supported encodings.

Here are a few examples:

Accept-Encoding: compress, gzip
Accept-Encoding:
Accept-Encoding: *
Accept-Encoding: compress;q=0.5, gzip;q=1.0
Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0

Clients can indicate preferred encodings by attaching Q (quality) values as parame-

ters to each encoding. Q values can range from 0.0, indicating that the client does

not want the associated encoding, to 1.0, indicating the preferred encoding. The

token “*” means “anything else.” The process of selecting which content encoding to

apply is part of a more general process of deciding which content to send back to a

client in a response. This process and the Content-Encoding and Accept-Encoding

headers are discussed in more detail in Chapter 17.

The identity encoding token can be present only in the Accept-Encoding header and is

used by clients to specify relative preference over other content-encoding algorithms.
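The following Python sketch shows one way a server might read Q values and pick an encoding. The function names, the supported tuple, and the small default weight given to an unlisted identity encoding are assumptions of this sketch, not rules taken from the specification; we assume only that identity stays acceptable unless the client excludes it.

```python
def parse_accept_encoding(value):
    """Return a dict mapping each encoding token to its Q value."""
    prefs = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        token, _, params = item.partition(";")
        q = 1.0  # a token without a q parameter is fully preferred
        for param in params.split(";"):
            name, _, val = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(val.strip())
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not wanted"
        prefs[token.strip().lower()] = q
    return prefs


def choose_encoding(accept_encoding, supported=("gzip", "deflate", "compress", "identity")):
    """Pick the supported encoding the client prefers, or None if none is acceptable."""
    if accept_encoding is None:
        return supported[0]  # no header at all: any encoding is acceptable
    prefs = parse_accept_encoding(accept_encoding)

    def quality(encoding):
        if encoding in prefs:
            return prefs[encoding]
        if "*" in prefs:
            return prefs["*"]  # "*" covers anything not listed explicitly
        # Assumption: an unlisted identity is acceptable but least preferred.
        return 0.001 if encoding == "identity" else 0.0

    best = max(supported, key=quality)  # ties go to the earlier entry in supported
    return best if quality(best) > 0 else None


if __name__ == "__main__":
    for header in ("compress, gzip", "", "*", "compress;q=0.5, gzip;q=1.0",
                   "gzip;q=1.0, identity; q=0.5, *;q=0"):
        print(repr(header), "->", choose_encoding(header))
```

When choose_encoding returns None, a server would typically have no acceptable representation to send.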

Transfer Encoding and Chunked Encoding

The previous section discussed content encodings—reversible transformations applied

to the body of the message. Content encodings are tightly associated with the details

of the particular content format. For example, you might compress a text file with

gzip, but not a JPEG file, because JPEGs don’t compress well with gzip.

This section discusses transfer encodings. Transfer encodings also are reversible

transformations performed on the entity body, but they are applied for architectural

reasons and are independent of the format of the content. You apply a transfer

encoding to a message to change the way message data is transferred across the net-

work (Figure 15-5).

Safe Transport

Historically, transfer encodings exist in other protocols to provide “safe transport” of

messages across a network. The concept of safe transport has a different focus for

HTTP, where the transport infrastructure is standardized and more forgiving. In

HTTP, there are only a few reasons why transporting message bodies can cause trou-

ble. Two of these are:

Unknown size

Some gateway applications and content encoders are unable to determine the

final size of a message body without generating the content first. Often, these

servers would like to start sending the data before the size is known. Because

HTTP requires the Content-Length header to precede the data, some servers

apply a transfer encoding to send the data with a special terminating footer that

indicates the end of data.*

Security

You might use a transfer encoding to scramble the message content before send-

ing it across a shared transport network. However, because of the popularity of

transport layer security schemes like SSL, transfer-encoding security isn’t very

common.

Transfer-Encoding Headers

There are just two defined headers to describe and control transfer encoding:

Transfer-Encoding

Tells the receiver what encoding has been performed on the message in order for it to be safely transported

TE

Used in the request header to tell the server what extension transfer encodings are okay to use†

* You could close the connection as a “poor man’s” end-of-message signal, but this breaks persistent connections.

† The meaning of the TE header would be more intuitive if it were called the Accept-Transfer-Encoding header.

Figure 15-5. Content encodings versus transfer encodings

Content-encoded response (normal header block, followed by a normal entity that is just encoded):
    HTTP/1.0 200 OK
    Content-encoding: gzip
    Content-type: text/html
    [...]
    [encoded message]

Transfer-encoded response (basic header, followed by encoded blocks):
    HTTP/1.1 200 OK
    Transfer-encoding: chunked
    abcdefghijk    (sent as encoded blocks)

A content-encoded message just encodes the entity section of the message. With transfer-encoded messages, the encoding is a function of the entire message, changing the structure of the message itself.

In the following example, the request uses the TE header to tell the server that it accepts the chunked encoding (which it must if it’s an HTTP/1.1 application) and is willing to accept trailers on the end of chunk-encoded messages:

GET /new_products.html HTTP/1.1
Host: www.joes-hardware.com
User-Agent: Mozilla/4.61 [en] (WinNT; I)
TE: trailers, chunked
...

The response includes a Transfer-Encoding header to tell the receiver that the mes-

sage has been transfer-encoded with the chunked encoding:

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Server: Apache/3.0
...

After this initial header, the structure of the message will change.

All transfer-encoding values are case-insensitive. HTTP/1.1 uses transfer-encoding

values in the TE header field and in the Transfer-Encoding header field. The latest

HTTP specification defines only one transfer encoding, chunked encoding.

The TE header, like the Accept-Encoding header, can have Q values to describe pre-

ferred forms of transfer encoding. The HTTP/1.1 specification, however, forbids the

association of a Q value of 0.0 to chunked encoding.

Future extensions to HTTP may drive the need for additional transfer encodings. If

and when this happens, the chunked transfer encoding should always be applied on

top of the extension transfer encodings. This guarantees that the data will get “tun-

neled” through HTTP/1.1 applications that understand chunked encoding but not

other transfer encodings.

Chunked Encoding

Chunked encoding breaks messages into chunks of known size. Each chunk is sent

one after another, eliminating the need for the size of the full message to be known

before it is sent.

Note that chunked encoding is a form of transfer encoding and therefore is an

attribute of the message, not the body. Multipart encoding, described earlier in this

chapter, is an attribute of the body and is completely separate from chunked encoding.

Chunking and persistent connections

When the connection between the client and server is not persistent, clients do not

need to know the size of the body they are reading—they expect to read the body

until the server closes the connection.

With persistent connections, the size of the body must be known and sent in the

Content-Length header before the body can be written. When content is dynami-

cally created at a server, it may not be possible to know the length of the body before

sending it.

Chunked encoding provides a solution for this dilemma, by allowing servers to send

the body in chunks, specifying only the size of each chunk. As the body is dynami-

cally generated, a server can buffer up a portion of it, send its size and the chunk,

and then repeat the process until the full body has been sent. The server can signal

the end of the body with a chunk of size 0 and still keep the connection open and

ready for the next response.

Chunked encoding is fairly simple. Figure 15-6 shows the basic anatomy of a chunked

message. It begins with an initial HTTP response header block, followed by a stream

of chunks. Each chunk contains a length value and the data for that chunk. The length

value is in hexadecimal form and is separated from the chunk data with a CRLF. The

size of the chunk data is measured in bytes and includes neither the CRLF sequence

between the length value and the data nor the CRLF sequence at the end of the chunk.

The last chunk is special—it has a length of zero, which signifies “end of body.”
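The chunk format just described is simple enough to sketch in Python. The two function names below are hypothetical, the decoder ignores any chunk extensions and any trailer, and neither function deals with the HTTP header block; they handle only the chunked body.

```python
def encode_chunked(pieces):
    """Encode an iterable of byte strings as a chunked message body."""
    out = bytearray()
    for piece in pieces:
        if not piece:
            continue  # a zero-length chunk would signal "end of body" too early
        out += b"%x\r\n" % len(piece)  # chunk size in hexadecimal, then CRLF
        out += piece + b"\r\n"         # chunk data, then CRLF (not counted in the size)
    out += b"0\r\n\r\n"                # last chunk, followed by an empty trailer
    return bytes(out)


def decode_chunked(data):
    """Decode a chunked message body back into the original bytes."""
    body = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size_field = data[pos:eol].split(b";", 1)[0]  # drop any chunk extension
        size = int(size_field.decode("ascii").strip(), 16)
        pos = eol + 2
        if size == 0:
            break  # last chunk; anything after it is the trailer
        body += data[pos:pos + size]
        if data[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk data is not followed by CRLF")
        pos += size + 2
    return bytes(body)


if __name__ == "__main__":
    chunks = [b"We hold these truths to be self-evident",
              b", that all men are created equal, that"]
    wire = encode_chunked(chunks)
    print(wire)
    print(decode_chunked(wire))
```

The first chunk is 39 bytes long, so its size line reads 27 in hexadecimal, matching Figure 15-6.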

Figure 15-6. Anatomy of a chunked message

HTTP response header block:
    HTTP/1.1 200 OK<CR><LF>
    Content-type: text/plain<CR><LF>
    Transfer-encoding: chunked<CR><LF>
    Trailer: Content-MD5<CR><LF>
    <CR><LF>

Chunk #1 (hexadecimal chunk size; 27 hex => 39 characters):
    27<CR><LF>
    We hold these truths to be self-evident<CR><LF>

Chunk #2:
    26<CR><LF>
    , that all men are created equal, that<CR><LF>

Chunk #3:
    84<CR><LF>
    they are endowed by their Creator with certain
    unalienable Rights, that among these are Life,
    Liberty and the pursuit of Happiness.<CR><LF>

Last chunk:
    0<CR><LF>

Trailer*:
    Content-MD5:gjqei54p26tjisgj3p4utjgrj53<CR><LF>
    <CR><LF>

* Optional; present only if there is a Trailer header in the message headers.

A client also may send chunked data to a server. Because the client does not know

beforehand whether the server accepts chunked encoding (servers do not send TE

headers in responses to clients), it must be prepared for the server to reject the

chunked request with a 411 Length Required response.

Trailers in chunked messages

A trailer can be added to a chunked message if the client’s TE header indicates that it accepts trailers. A trailer also can be added by the server that created the original response, provided that the contents of the trailer are optional metadata that the client does not need to understand or use (that is, it is okay for the client to ignore and discard the contents of the trailer).*

The trailer can contain additional header fields whose values might not have been

known at the start of the message (e.g., because the contents of the body had to be

generated first). An example of a header that can be sent in the trailer is the Content-

MD5 header—it would be difficult to calculate the MD5 of a document before the

document has been generated. Figure 15-6 illustrates the use of trailers. The message

headers contain a Trailer header listing the headers that will follow the chunked mes-

sage. The last chunk is followed by the headers listed in the Trailer header.

Any of the HTTP headers can be sent as trailers, except for the Transfer-Encoding,

Trailer, and Content-Length headers.
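As a sketch of why trailers are useful, the Python function below streams a body in chunks while accumulating an MD5 digest, and only then emits the Content-MD5 value as a trailer. The function name is hypothetical, and we assume the header block sent earlier already carried "Trailer: Content-MD5" and "Transfer-Encoding: chunked". The digest is base64-encoded, as Content-MD5 values are.

```python
import base64
import hashlib


def chunked_body_with_md5_trailer(pieces):
    """Return a chunked body whose trailer carries the Content-MD5 of the data."""
    md5 = hashlib.md5()
    out = bytearray()
    for piece in pieces:
        if not piece:
            continue
        md5.update(piece)              # the digest grows as the body is generated
        out += b"%x\r\n" % len(piece)
        out += piece + b"\r\n"
    out += b"0\r\n"                    # last chunk: size zero
    out += b"Content-MD5: " + base64.b64encode(md5.digest()) + b"\r\n"
    out += b"\r\n"                     # an empty line ends the trailer
    return bytes(out)


if __name__ == "__main__":
    parts = [b"We hold these truths to be self-evident",
             b", that all men are created equal"]
    print(chunked_body_with_md5_trailer(parts).decode("ascii"))
```

The sender never needs to hold the whole document in memory, which is the situation the Trailer header was designed for.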

Combining Content and Transfer Encodings

Content encoding and transfer encoding can be used simultaneously. For example,

Figure 15-7 illustrates how a sender can compress an HTML file using a content

encoding and send the data chunked using a transfer encoding. The process to

“reconstruct” the body is reversed on the receiver.

Transfer-Encoding Rules

When a transfer encoding is applied to a message body, a few rules must be followed:

• The set of transfer encodings must include “chunked.” The only exception is if

the message is terminated by closing the connection.

• When the chunked transfer encoding is used, it is required to be the last transfer

encoding applied to the message body.

• The chunked transfer encoding must not be applied to a message body more

than once.

* The Trailer header was added after the initial chunked encoding was added to drafts of the HTTP/1.1 spec-

ification, so some applications may not understand it (or understand trailers) even if they claim to be

HTTP/1.1-compliant.

These rules allow the recipient to determine the transfer length of the message.

Transfer encodings are a relatively new feature of HTTP, introduced in Version 1.1. Servers that implement transfer encodings need to take special care not to send transfer-encoded messages to non-HTTP/1.1 applications. Likewise, if a server receives a transfer-encoded message that it cannot understand, it should respond with the 501 Not Implemented status code. However, all HTTP/1.1 applications must at least support chunked encoding.

Time-Varying Instances

Web objects are not static. The same URL can, over time, point to different versions of

an object. Take the CNN home page as an example—going to “http://www.cnn.com”

several times in a day is likely to result in a slightly different page being returned each

time.

Think of the CNN home page as being an object and its different versions as being

different instances of the object (see Figure 15-8). The client in the figure requests the

same resource (URL) multiple times, but it gets different instances of the resource as

it changes over time. At time (a) and (b) it has the same instance; at time (c) it has a

different instance.

The HTTP protocol specifies operations for a class of requests and responses, called

instance manipulations, that operate on instances of an object. The two main

instance-manipulation methods are range requests and delta encoding. Both of these

methods require clients to be able to identify the exact copy of the resource that they

have (if any) and request new instances conditionally. These mechanisms are dis-

cussed later in this chapter.

Figure 15-7. Combining content encoding with transfer encoding

[Figure: the sender applies a content encoding to an entity with Content-type: text/html, producing a message with Content-type: text/html and Content-encoding: gzip. It then applies a transfer encoding (chunking), which adds Transfer-encoding: chunked and sends the encoded body in chunks. The receiver reassembles the chunks and then decodes the content.]

Validators and Freshness

Look back at Figure 15-8. The client does not initially have a copy of the resource, so

it sends a request to the server asking for it. The server responds with Version 1 of

the resource. The client can now cache this copy, but for how long?

Once the document has “expired” at the client (i.e., once the client can no longer

consider its copy a valid copy), it must request a fresh copy from the server. If the

document has not changed at the server, however, the client does not need to receive

it again—it can just continue to use its cached copy.

This special request, called a conditional request, requires that the client tell the server

which version it currently has, using a validator, and ask for a copy to be sent only if

its current copy is no longer valid. Let’s look at the three key concepts—freshness,

validators, and conditionals—in more detail.

Freshness

Servers are expected to give clients information about how long clients can cache

their content and consider it fresh. Servers can provide this information using one of

two headers: Expires and Cache-Control.

The Expires header specifies the exact date and time when the document “expires,” that is, when it can no longer be considered fresh. The Expires header carries an HTTP date, for example:

Expires: Sun, 18 Mar 2001 23:59:59 GMT

For a client and server to use the Expires header correctly, their clocks must be syn-

chronized. This is not always easy, because neither may run a clock synchronization

protocol such as the Network Time Protocol (NTP). A mechanism that defines expi-

ration using relative time is more useful. The Cache-Control header can be used to

specify the maximum age for a document in seconds—the total amount of time since

the document left the server. Age is not dependent on clock synchronization and

therefore is likely to yield more accurate results.
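The difference between the two mechanisms can be sketched in Python. The function names are hypothetical, the headers are passed as a plain dictionary, and we assume the caller has already worked out the document's current age in seconds. In this sketch a max-age directive takes precedence over Expires when both are present.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age" and value.strip().isdigit():
            return int(value.strip())
    return None


def is_fresh(headers, current_age, now):
    """Decide freshness from Cache-Control: max-age, falling back to Expires."""
    max_age = max_age_seconds(headers.get("Cache-Control", ""))
    if max_age is not None:
        return current_age < max_age   # relative time: no clock comparison needed
    expires = headers.get("Expires")
    if expires:
        try:
            expiry = parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            return False               # assumption: an unparseable date means expired
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        return now < expiry            # absolute time: depends on synchronized clocks
    return False                       # no explicit freshness information


if __name__ == "__main__":
    now = datetime(2001, 3, 18, 12, 0, 0, tzinfo=timezone.utc)
    print(is_fresh({"Cache-Control": "max-age=3600"}, 120, now))
    print(is_fresh({"Expires": "Sun, 18 Mar 2001 23:59:59 GMT"}, 0, now))
```

Only the Expires branch compares against the local clock, which is why clock skew affects it and not the max-age branch.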

Figure 15-8. Instances are “snapshots” of a resource in time

[Figure: a timeline of www.cnn.com. Version 1 appears Feb 17, 4:30 p.m.; Version 2, Mar 3, 11:21 a.m.; Version 3, Apr 2, 9:07 a.m.; Version 4, Apr 12, 1:48 p.m. The client’s successive requests return V1, V1, V2, V2, and V4.]

The Cache-Control header actually is very powerful. It can be used by both servers

and clients to describe freshness using more directives than just specifying an age or

expiration time. Table 15-3 lists some of the directives that can accompany the

Cache-Control header.

Caching and freshness were discussed in more detail in Chapter 7.

Conditionals and Validators

When a cache’s copy is requested, and it is no longer fresh, the cache needs to make

sure it has a fresh copy. The cache can fetch the current copy from the origin server,

but in many cases, the document on the server is still the same as the stale copy in

the cache. We saw this in Figure 15-8b; the cached copy may have expired, but the

Table 15-3. Cache-Control header directives

Directive          Message type   Description
no-cache           Request        Do not return a cached copy of the document without first revalidating it with the server.
no-store           Request        Do not return a cached copy of the document. Do not store the response from the server.
max-age            Request        The document in the cache must not be older than the specified age.
max-stale          Request        The document may be stale based on the server-specified expiration information, but it must not have been expired for longer than the value in this directive.
min-fresh          Request        The document’s freshness lifetime must be no less than its current age plus the specified amount. In other words, the response must remain fresh for at least the specified amount of time.
no-transform       Request        The document must not be transformed before being sent.
only-if-cached     Request        Send the document only if it is in the cache, without contacting the origin server.
public             Response       Response may be cached by any cache.
private            Response       Response may be cached such that it can be accessed only by a single client.
no-cache           Response       If the directive is accompanied by a list of header fields, the content may be cached and served to clients, but the listed header fields must first be removed. If no header fields are specified, the cached copy must not be served without revalidation with the server.
no-store           Response       Response must not be cached.
no-transform       Response       Response must not be modified in any way before being served.
must-revalidate    Response       Once the response becomes stale, it must be revalidated with the server before being served.
proxy-revalidate   Response       Shared caches must revalidate the response with the origin server before serving it once it is stale. This directive can be ignored by private caches.
max-age            Response       Specifies the maximum length of time the document can be cached and still considered fresh.
s-maxage           Response       Specifies the maximum age of the document as it applies to shared caches (overriding the max-age directive, if one is present). This directive can be ignored by private caches.

server content still is the same as the cache content. If a cache always fetches a

server’s document, even if it’s the same as the expired cache copy, the cache wastes

network bandwidth, places unnecessary load on the cache and server, and slows

everything down.

To fix this, HTTP provides a way for clients to request a copy only if the resource has

changed, using special requests called conditional requests. Conditional requests are

normal HTTP request messages, but they are performed only if a particular condi-

tion is true. For example, a cache might send the following conditional GET message

to a server, asking it to send the file /announce.html only if the file has been modified

since June 29, 2002 (the date the cached document was last changed by the author):

GET /announce.html HTTP/1.0
If-Modified-Since: Sat, 29 Jun 2002 14:30:00 GMT

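Conditional requests are implemented by conditional headers that start with “If-”. In the example above, the conditional header is If-Modified-Since. A conditional header allows a method to execute only if the condition is true. If the condition is not true, the server does not perform the method and instead returns a status code that says so: 304 Not Modified for a GET whose If-Modified-Since or If-None-Match test fails, or 412 Precondition Failed for a failed If-Match or If-Unmodified-Since test.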
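Here is a minimal Python sketch of how a server might evaluate If-Modified-Since for a GET. The function name and the dictionary of request headers are assumptions of the example, and the choice to ignore an unparseable date (serving the full response) is ours.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def status_for_if_modified_since(request_headers, last_modified):
    """Return 304 if the resource is unchanged since the client's date, else 200.

    last_modified must be a timezone-aware datetime.
    """
    value = request_headers.get("If-Modified-Since")
    if value is None:
        return 200                     # unconditional request
    try:
        since = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return 200                     # assumption: ignore a date we cannot parse
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates carry whole seconds, so drop sub-second precision before comparing.
    if last_modified.replace(microsecond=0) <= since:
        return 304                     # Not Modified: the client's copy is still good
    return 200                         # send the current document


if __name__ == "__main__":
    changed = datetime(2002, 6, 29, 14, 30, 0, tzinfo=timezone.utc)
    request = {"If-Modified-Since": "Sat, 29 Jun 2002 14:30:00 GMT"}
    print(status_for_if_modified_since(request, changed))
```

A 304 response carries no body, so the cache keeps serving the copy it already holds.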

Each conditional works on a particular validator. A validator is a particular attribute

of the document instance that is tested. Conceptually, you can think of the validator

like the serial number, version number, or last change date of a document. A wise cli-

ent in Figure 15-8b would send a conditional validation request to the server saying,

“send me the resource only if it is no longer Version 1; I have Version 1.” We dis-

cussed conditional cache revalidation in Chapter 7, but we’ll study the details of

entity validators more carefully in this chapter.

The If-Modified-Since conditional header tests the last-modified date of a document

instance, so we say that the last-modified date is the validator. The If-None-Match

conditional header tests the ETag value of a document, which is a special keyword or

version-identifying tag associated with the entity. Last-Modified and ETag are the

two primary validators used by HTTP. Table 15-4 lists four of the HTTP headers

used for conditional requests. Next to each conditional header is the type of valida-

tor used with the header.

Table 15-4. Conditional request types

Request type          Validator       Description
If-Modified-Since     Last-Modified   Send a copy of the resource if the version that was last modified at the time in your previous Last-Modified response header is no longer the latest one.
If-Unmodified-Since   Last-Modified   Send a copy of the resource only if it is the same as the version that was last modified at the time in your previous Last-Modified response header.
If-Match              ETag            Send a copy of the resource if its entity tag is the same as that of the one in your previous ETag response header.
If-None-Match         ETag            Send a copy of the resource if its entity tag is different from that of the one in your previous ETag response header.

HTTP groups validators into two classes: weak validators and strong validators. Weak validators may not always uniquely identify an instance of a resource; strong validators must. An example of a weak validator is the size of the object in bytes. The resource content might change even though the size remains the same, so a hypothetical byte-count validator only weakly indicates a change. A cryptographic checksum of the contents of the resource (such as MD5), however, is a strong validator; it changes when the document changes.

The last-modified time is considered a weak validator because, although it specifies the time at which the resource was last modified, it specifies that time with a resolution of only one second. Because a resource can change multiple times in a second, and because servers can serve thousands of requests per second, the last-modified date might not always reflect changes. The ETag header is considered a strong validator, because the server can place a distinct value in the ETag header every time the entity changes. Version numbers and digest checksums are good candidates for the ETag value, but ETag headers are flexible: they take arbitrary text values (“tags”) and can be used to devise a variety of client and server validation strategies.

Clients and servers may sometimes want to adopt a looser version of entity-tag vali-

dation. For example, a server may want to make cosmetic changes to a large, popu-

lar cached document without triggering a mass transfer when caches revalidate. In

this case, the server might advertise a “weak” entity tag by prefixing the tag with

“W/”. A weak entity tag should change only when the associated entity changes in a

semantically significant way. A strong entity tag must change whenever the associ-

ated entity value changes in any way.

The following example shows how a client might revalidate with a server using a

weak entity tag. The server would return a body only if the content changed in a

meaningful way from Version 4.0 of the document:

GET /announce.html HTTP/1.1

If-None-Match: W/"v4.0"
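The difference between weak and strong entity tags shows up when two tags are compared. The sketch below is one way to express the two comparison rules in Python; the helper names are hypothetical, and the code assumes each value is a single, well-formed entity tag.

```python
def parse_entity_tag(value):
    """Split an entity tag into (is_weak, opaque_tag)."""
    value = value.strip()
    if value.startswith("W/"):
        return True, value[2:]
    return False, value


def strong_match(a, b):
    """Strong comparison: both tags must be strong and identical."""
    weak_a, tag_a = parse_entity_tag(a)
    weak_b, tag_b = parse_entity_tag(b)
    return not weak_a and not weak_b and tag_a == tag_b


def weak_match(a, b):
    """Weak comparison: the opaque tags must be identical; W/ is ignored."""
    return parse_entity_tag(a)[1] == parse_entity_tag(b)[1]
```

With these rules, W/"v4.0" weakly matches "v4.0", so a cache revalidating with the weak tag would not trigger a new transfer, but the same pair fails the strong comparison.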

In summary, when clients access the same resource more than once, they first need

to determine whether their current copy still is fresh. If it is not, they must get the lat-

est version from the server. To avoid receiving an identical copy in the event that the

resource has not changed, clients can send conditional requests to the server, specify-

ing validators that uniquely identify their current copies. Servers will then send a

copy of the resource only if it is different from the client’s copy. For more details on

cache revalidation, please refer back to “Cache Processing Steps” in Chapter 7.

Range Requests

We now understand how a client can ask a server to send it a resource only if the cli-

ent’s copy of the resource is no longer valid. HTTP goes further: it allows clients to

actually request just part or a range of a document.

364 |Chapter 15: Entities and Encodings

Imagine if you were three-fourths of the way through downloading the latest hot soft-

ware across a slow modem link, and a network glitch interrupted your connection.

You would have been waiting for a while for the download to complete, and now you

would have to start all over again, hoping the same thing does not happen again.

With range requests, an HTTP client can resume downloading an entity by asking

for the range or part of the entity it failed to get (provided that the object did not

change at the origin server between the time the client first requested it and its subse-

quent range request). For example:

GET /bigfile.html HTTP/1.1

Host: www.joes-hardware.com

Range: bytes=4000-

User-Agent: Mozilla/4.61 [en] (WinNT; I)

...

In this example, the client is requesting the remainder of the document after the first

4,000 bytes (the end bytes do not have to be specified, because the size of the docu-

ment may not be known to the requestor). Range requests of this form can be used for

a failed request where the client received the first 4,000 bytes before the failure. The

Range header also can be used to request multiple ranges (the ranges can be specified

in any order and may overlap)—for example, imagine a client connecting to multiple

servers simultaneously, requesting different ranges of the same document from differ-

ent servers in order to speed up overall download time for the document. In the case

where clients request multiple ranges in a single request, responses come back as a

single entity, with a multipart body and a Content-Type: multipart/byteranges header.
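To make the single-range case concrete, here is a minimal Python sketch of how a server might interpret a Range header and build a partial response. The function names are hypothetical. The 206 Partial Content status and the Content-Range response header come from the HTTP/1.1 specification rather than from the passage above, and multiple ranges are deliberately not handled.

```python
def parse_byte_range(range_header, total_length):
    """Return (first, last) byte positions, inclusive, for a single-range header."""
    unit, _, spec = range_header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError("only a single bytes range is handled in this sketch")
    first_text, _, last_text = spec.strip().partition("-")
    if first_text == "":
        # Suffix form, e.g. "bytes=-500": the final 500 bytes.
        first = max(total_length - int(last_text), 0)
        last = total_length - 1
    else:
        # "bytes=4000-" (open ended) or "bytes=0-99" (explicit end).
        first = int(first_text)
        last = int(last_text) if last_text else total_length - 1
        last = min(last, total_length - 1)
    if first > last:
        raise ValueError("range not satisfiable")
    return first, last


def partial_response(body, range_header):
    """Build (status, headers, partial body) for a satisfiable range request."""
    first, last = parse_byte_range(range_header, len(body))
    headers = {
        "Content-Range": f"bytes {first}-{last}/{len(body)}",
        "Content-Length": str(last - first + 1),
    }
    return 206, headers, body[first:last + 1]
```

For a 65,537-byte document and the header Range: bytes=20224-, this sketch returns the last 45,313 bytes along with a Content-Range of bytes 20224-65536/65537.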

Not all servers accept range requests, but many do. Servers can advertise to clients

that they accept ranges by including the header Accept-Ranges in their responses.

The value of this header is the unit of measure, usually bytes.* For example:

HTTP/1.1 200 OK

Date: Fri, 05 Nov 1999 22:35:15 GMT

Server: Apache/1.2.4

Accept-Ranges: bytes

...

Figure 15-9 shows an example of a set of HTTP transactions involving ranges.

Range headers are used extensively by popular peer-to-peer file-sharing client software

to download different parts of multimedia files simultaneously, from different peers.

Note that range requests are a class of instance manipulations, because they are

exchanges between a client and a server for a particular instance of an object. That is,

a client’s range request makes sense only if the client and server have the same ver-

sion of a document.

* The HTTP/1.1 specification defines only the bytes token, but server and client implementors could come up

with their own units to measure or chop up an entity.


Delta Encoding

We have described different versions of a web page as different instances of a page. If

a client has an expired copy of a page, it requests the latest instance of the page. If

the server has a newer instance of the page, it will send it to the client, and it will

send the full new instance of the page even if only a small portion of the page actu-

ally has changed.

Rather than receiving the entire new page, the client would get the page faster if the server sent just the changes to the client’s copy of the page (provided that the number of changes is small). Delta encoding is an extension to the HTTP protocol that

optimizes transfers by communicating changes instead of entire objects. Delta encod-

ing is a type of instance manipulation, because it relies on clients and servers

exchanging information about particular instances of an object. RFC 3229 describes

delta encoding.

Figure 15-10 illustrates the mechanism of requesting, generating, receiving, and applying a delta-encoded document. The client has to tell the server which version of the page it has, that it is willing to accept a delta from the latest version of the page, and which algorithms it knows for applying those deltas to its current version.

Figure 15-9. Entity range request example

[Figure: the client sends GET /bigfile.html to www.joes-hardware.com, and the response advertises Content-length: 65537 and Accept-ranges: bytes. The transfer is interrupted after the client has received only the first 20224 bytes of the resource, so the client sends a second request with Range: bytes=20224-. The server's range response supplies the part that was not received, allowing the client to resume from the point of the interruption.]


The server has to check if it has the client’s version of the page and how to compute

deltas between the latest version and the client’s version (there are several algorithms for

computing the difference between two objects). It then has to compute the delta,

send it to the client, let the client know that it’s sending a delta, and specify the new

identifier for the latest version of the page (because this is the version that the client

will end up with after it applies the delta to its old version).

The client uses the unique identifier for its version of the page (sent by the server in its previous response to the client in the ETag header) in an If-None-Match header. This is the client’s way of telling the server, “if the latest version of the page you have does not have this same ETag, send me the latest version of the page.” Just the If-None-Match header, then, would cause the server to send the client the full latest version of the page (if it was different from the client’s version).

Figure 15-10. Mechanics of delta-encoding

[Figure: on Monday the client requests /bigfile.html and caches the response, which carries an Expires header and Etag: abcdefghi09876AF. The next day the client sees that its cached copy has expired, so it sends a delta request with If-None-Match: abcdefghi09876AF and A-IM: diffe, telling the server which copy it has and that it is willing to accept a delta. The server's delta generator compares Monday's page (“Today’s special is on hammers.”) with Tuesday's page (“Today’s special is on chisels.”) and returns HTTP/1.1 226 IM Used with IM: diffe, Etag: zywxtuv123456BG, and Delta-base: abcdefghi09876AF. The client's delta applier applies the delta to its cached version to generate the latest version of the page, and the client updates its ETag to that of the new version.]

The client can tell the server, however, that it is willing to accept a delta of the page

by also sending an A-IM header. A-IM is short for Accept-Instance-Manipulation

(“Oh, by the way, I do accept some forms of instance manipulation, so if you apply

one of those you will not have to send me the full document.”). In the A-IM header,

the client specifies the algorithms it knows how to apply in order to generate the lat-

est version of a page given an old version and a delta. The server sends back the fol-

lowing: a special response code (226 IM Used) telling the client that it is sending it

an instance manipulation of the requested object, not the full object itself; an IM

(short for Instance-Manipulation) header, which specifies the algorithm used to com-

pute the delta; the new ETag header; and a Delta-Base header, which specifies the

ETag of the document used as the base for computing the delta (ideally, the same as

the ETag in the client’s If-None-Match request!). The headers used in delta encoding

are summarized in Table 15-5.
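The exchange described above can be sketched from the client's point of view. In the Python below, the function names, the returned action strings, and the use of plain dictionaries with exactly spelled header names are all assumptions made for illustration; a real client would also need case-insensitive header lookup and the delta appliers themselves.

```python
def build_delta_request_headers(cached_etag, algorithms):
    """Headers a client might add to ask for a delta against its cached copy."""
    return {
        "If-None-Match": cached_etag,      # which instance the client holds
        "A-IM": ", ".join(algorithms),     # instance manipulations it can undo
    }


def classify_response(status, headers, cached_etag):
    """Decide what the client should do with the server's reply."""
    if status == 304:
        return "use-cache"        # cached instance is still the latest
    if status == 226:
        # A delta is only usable if it was computed against the cached instance.
        if headers.get("Delta-Base") != cached_etag:
            return "refetch"
        return "apply-delta"
    return "replace-cache"        # a full new instance was sent
```

The Delta-Base check matters because a delta computed against any other instance would produce a corrupt page when applied to the client's copy.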

Instance Manipulations, Delta Generators, and Delta Appliers

Clients can specify the types of instance manipulation they accept using the A-IM

header. Servers specify the type of instance manipulation used in the IM header. Just

what are the types of instance manipulation that are accepted, and what do they do?

Table 15-6 lists some of the IANA registered types of instance manipulations.

Table 15-5. Delta-encoding headers

Header | Description
ETag | Unique identifier for each instance of a document. Sent by the server in the response; used by clients in subsequent requests in If-Match and If-None-Match headers.
If-None-Match | Request header sent by the client, asking the server for a document if and only if the client’s version of the document is different from the server’s.
A-IM | Client request header indicating types of instance manipulations accepted.
IM | Server response header specifying the type of instance manipulation applied to the response. This header is sent when the response code is 226 IM Used.
Delta-Base | Server response header that specifies the ETag of the base document used for generating the delta (should be the same as the ETag in the client request’s If-None-Match header).

Table 15-6. IANA registered types of instance manipulations

Type | Description
vcdiff | Delta using the vcdiff algorithm [a]
diffe | Delta using the Unix diff -e command
gdiff | Delta using the gdiff algorithm [b]


A “delta generator” at the server, as in Figure 15-10, takes the base document and

the latest instance of the document and computes the delta between the two using

the algorithm specified by the client in the A-IM header. At the client side, a “delta

applier” takes the delta and applies it to the base document to generate the latest

instance of the document. For example, if the algorithm used to generate the delta is

the Unix diff -e command, the client can apply the delta using the functionality of the

Unix ed text editor, because diff -e <file1> <file2> generates the set of ed commands

that will convert <file1> into <file2>. ed is a very simple editor with a few supported

commands. In the example in Figure 15-10, 5c says delete line 5 in the base docu-

ment, and chisels.<cr>. says add “chisels.”. That’s it. More complicated instructions

can be generated for bigger changes. The Unix diff -e algorithm does a line-by-line

comparison of files. This obviously is okay for text files but breaks down for binary

files. The vcdiff algorithm is more powerful, working even for non-text files and gen-

erally producing smaller deltas than diff -e.
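A delta applier for the diffe manipulation can be quite small. The Python sketch below handles only the append (a), change (c), and delete (d) commands that diff -e emits, and it relies on the commands arriving in the bottom-up order that diff -e produces so that line numbers stay valid. The function name and the five-line sample page are hypothetical, and special cases such as text lines consisting of a single period are not handled.

```python
import re

ED_COMMAND = re.compile(r"(\d+)(?:,(\d+))?([acd])")


def apply_ed_script(base_lines, script):
    """Apply a diff -e style script to a list of lines and return the new list."""
    lines = list(base_lines)
    commands = script.splitlines()
    i = 0
    while i < len(commands):
        match = ED_COMMAND.fullmatch(commands[i])
        if match is None:
            raise ValueError(f"unsupported ed command: {commands[i]!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        operation = match.group(3)
        i += 1

        # "a" and "c" are followed by text lines, terminated by a lone ".".
        new_text = []
        if operation in "ac":
            while i < len(commands) and commands[i] != ".":
                new_text.append(commands[i])
                i += 1
            i += 1  # skip the terminating "."

        if operation == "a":
            lines[start:start] = new_text        # insert after line `start`
        elif operation == "c":
            lines[start - 1:end] = new_text      # replace lines start..end
        else:
            del lines[start - 1:end]             # delete lines start..end
    return lines
```

Applied to a five-line page whose last line is "hammers.", the script made of the lines 5c, chisels., and a lone period yields the same page ending in "chisels.".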

The delta encoding specification defines the format of the A-IM and IM headers in

detail. Suffice it to say that multiple instance manipulations can be specified in these

headers (along with corresponding quality values). Documents can go through multi-

ple instance manipulations before being returned to clients, in order to maximize

compression. For example, deltas generated by the vcdiff algorithm may in turn be

compressed using the gzip algorithm. The server response would then contain the

header IM: vcdiff, gzip. The client would first gunzip the content, then apply the

results of the delta to its base page in order to generate the final document.

Delta encoding can reduce transfer times, but it can be tricky to implement. Imagine

a page that changes frequently and is accessed by many different people. A server

supporting delta encoding must keep all the different copies of that page as it

changes over time, in order to figure out what’s changed between any requesting cli-

ent’s copy and the latest copy. (If the document changes frequently, as different cli-

ents request the document, they will get different instances of the document. When

they make subsequent requests to the server, they will be requesting changes

between their instance of the document and the latest instance of the document. To

be able to send them just the changes, the server must keep copies of all the previous instances that the clients have.) In exchange for reduced latency in serving documents, servers need to increase disk space to keep old instances of documents around. The extra disk space necessary to do so may quickly negate the benefits from the smaller transfer amounts.

Table 15-6. IANA registered types of instance manipulations (continued)

Type | Description
gzip | Compression using the gzip algorithm
deflate | Compression using the deflate algorithm
range | Used in a server response to indicate that the response is partial content as the result of a range selection
identity | Used in a client request’s A-IM header to indicate that the client is willing to accept an identity instance manipulation

[a] Internet draft draft-korn-vcdiff-01 describes the vcdiff algorithm. This specification was approved by the IESG in early 2002 and should be released in RFC form shortly.
[b] http://www.w3.org/TR/NOTE-gdiff-19970901.html describes the GDIFF algorithm.

For More Information

For more information on entities and encodings, see:

http://www.ietf.org/rfc/rfc2616.txt

The HTTP/1.1 specification, RFC 2616, is the primary reference for entity body

management and encodings.

http://www.ietf.org/rfc/rfc3229.txt

RFC 3229, “Delta Encoding in HTTP,” describes how delta encoding can be

supported as an extension to HTTP/1.1.

Introduction to Data Compression

Khalid Sayood, Morgan Kaufmann Publishers. This book explains some of the

compression algorithms supported by HTTP content encodings.

http://www.ietf.org/rfc/rfc1521.txt

RFC 1521, “Multipurpose Internet Mail Extensions, Part One: Mechanisms for

Specifying and Describing the Format of Internet Message Bodies,” describes the

format of MIME bodies. This reference material is useful because HTTP bor-

rows heavily from MIME. In particular, this document is designed to provide

facilities to include multiple objects in a single message, to represent body text in

character sets other than US-ASCII, to represent formatted multi-font text mes-

sages, and to represent nontextual material such as images and audio fragments.

http://www.ietf.org/rfc/rfc2045.txt

RFC 2045, “Multipurpose Internet Mail Extensions, Part One: Format of Inter-

net Message Bodies,” specifies the various headers used to describe the structure

of MIME messages, many of which are similar or identical to HTTP.

http://www.ietf.org/rfc/rfc1864.txt

RFC 1864, “The Content-MD5 Header Field,” provides some historical detail

about the behavior and intended use of the Content-MD5 header field in MIME

content as a message integrity check.

http://www.ietf.org/rfc/rfc3230.txt

RFC 3230, “Instance Digests in HTTP,” describes improvements to HTTP entity-

digest handling that fix weaknesses present in the Content-MD5 formulation.


CHAPTER 16

Internationalization

Every day, billions of people write documents in hundreds of languages. To live up

to the vision of a truly world-wide Web, HTTP needs to support the transport and

processing of international documents, in many languages and alphabets.

This chapter covers two primary internationalization issues for the Web: character

set encodings and language tags. HTTP applications use character set encodings to

request and display text in different alphabets, and they use language tags to describe

and restrict content to languages the user understands. We finish with a brief chat

about multilingual URIs and dates.

This chapter:

• Explains how HTTP interacts with schemes and standards for multilingual

alphabets

• Gives a rapid overview of the terminology, technology, and standards to help

HTTP programmers do things right (readers familiar with character encodings

can skip this section)

• Explains the standard naming system for languages, and how standardized lan-

guage tags describe and select content

• Outlines rules and cautions for international URIs

• Briefly discusses rules for dates and other internationalization issues

HTTP Support for International Content

HTTP messages can carry content in any language, just as they can carry images, movies, or any other kind of media. To HTTP, the entity body is just a box of bits.

To support international content, servers need to tell clients about the alphabet and

languages of each document, so the client can properly unpack the document bits

into characters and properly process and present the content to the user.


Servers tell clients about a document’s alphabet and language with the HTTP

Content-Type charset parameter and Content-Language headers. These headers

describe what’s in the entity body’s “box of bits,” how to convert the contents into

the proper characters that can be displayed onscreen, and what spoken language the

words represent.

At the same time, the client needs to tell the server which languages the user under-

stands and which alphabetic coding algorithms the browser has installed. The client

sends Accept-Charset and Accept-Language headers to tell the server which charac-

ter set encoding algorithms and languages the client understands, and which of them

are preferred.

The following HTTP Accept headers might be sent by a French speaker who prefers

his native language (but speaks some English in a pinch) and who uses a browser

that supports the iso-8859-1 West European charset encoding and the UTF-8 Uni-

code charset encoding:

Accept-Language: fr, en;q=0.8

Accept-Charset: iso-8859-1, utf-8

The parameter “q=0.8” is a quality factor, giving lower priority to English (0.8) than

to French (1.0 by default).
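A server that wants to honor these preferences first has to parse the header and rank the entries. The following Python sketch shows one way to do that; the function name is hypothetical, and the parser assumes a well-formed header and ignores parameters other than q.

```python
def parse_accept_header(value):
    """Return (token, quality) pairs, highest quality first."""
    preferences = []
    for item in value.split(","):
        token, *params = [part.strip() for part in item.split(";")]
        if not token:
            continue
        quality = 1.0  # default when no q parameter is given
        for param in params:
            name, _, number = param.partition("=")
            if name.strip() == "q":
                quality = float(number)
        preferences.append((token, quality))
    # sorted() is stable, so entries with equal quality keep their header order.
    return sorted(preferences, key=lambda pref: pref[1], reverse=True)
```

For the header value fr, en;q=0.8 this returns French first with quality 1.0 and English second with quality 0.8. The same function works for Accept-Charset values.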

Character Sets and HTTP

So, let’s jump right into the most important (and confusing) aspects of web interna-

tionalization—international alphabetic scripts and their character set encodings.

Web character set standards can be pretty confusing. Lots of people get frustrated

when they first try to write international web software, because of complex and

inconsistent terminology, standards documents that you have to pay to read, and

unfamiliarity with foreign languages. This section and the next section should make

it easier for you to use character sets with HTTP.

Charset Is a Character-to-Bits Encoding

The HTTP charset values tell you how to convert from entity content bits into char-

acters in a particular alphabet. Each charset tag names an algorithm to translate bits

to characters (and vice versa). The charset tags are standardized in the MIME character set registry, maintained by the IANA (see http://www.iana.org/assignments/character-sets). Appendix H summarizes many of them.

The following Content-Type header tells the receiver that the content is an HTML

file, and the charset parameter tells the receiver to use the iso-8859-6 Arabic charac-

ter set decoding scheme to decode the content bits into characters:

Content-Type: text/html; charset=iso-8859-6


The iso-8859-6 encoding scheme maps 8-bit values into both the Latin and Arabic

alphabets, including numerals, punctuation and other symbols.* For example, in

Figure 16-1, the highlighted bit pattern has code value 225, which (under iso-8859-6)

maps into the Arabic letter “FEH” (a sound like the English letter “F”).
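In practice, a receiver pulls the charset parameter out of the Content-Type header and hands it to a decoder. The sketch below uses Python's standard email.message.Message class to parse the header; the two function names are hypothetical, and the iso-8859-1 default is an assumption that mirrors the fallback this chapter describes for clients.

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset() or default


def decode_body(content_type, body):
    """Decode entity-body bytes into text using the declared charset."""
    return body.decode(charset_of(content_type))
```

Decoding the single byte 225 with the header value text/html; charset=iso-8859-6 produces U+0641, the Arabic letter FEH.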

Some character encodings (e.g., UTF-8 and iso-2022-jp) are more complicated, vari-

able-length codes, where the number of bits per character varies. This type of coding

lets you use extra bits to support alphabets with large numbers of characters (such as

Chinese and Japanese), while using fewer bits to support standard Latin characters.

How Character Sets and Encodings Work

Let’s see what character sets and encodings really do.

We want to convert from bits in a document into characters that we can display

onscreen. But because there are many different alphabets, and many different ways

of encoding characters into bits (each with advantages and disadvantages), we need a

standard way to describe and apply the bits-to-character decoding algorithm.

Bits-to-character conversions happen in two steps, as shown in Figure 16-2:

• In Figure 16-2a, bits from a document are converted into a character code that

identifies a particular numbered character in a particular coded character set. In

the example, the decoded character code is numbered 225.

• In Figure 16-2b, the character code is used to select a particular element of the

coded character set. In iso-8859-6, the value 225 corresponds to “ARABIC LET-

TER FEH.” The algorithms used in Steps a and b are determined from the

MIME charset tag.
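Steps a and b can be written out directly for a fixed-width 8-bit charset. In this Python sketch, the table is only the small excerpt of the iso-8859-6 coded character set shown in Figure 16-2, and the function names are hypothetical.

```python
# Excerpt of the iso-8859-6 coded character set, copied from Figure 16-2.
ISO_8859_6_EXCERPT = {
    65: "LATIN CAPITAL LETTER A",
    66: "LATIN CAPITAL LETTER B",
    224: "ARABIC TATWEEL",
    225: "ARABIC LETTER FEH",
    226: "ARABIC LETTER QAF",
    227: "ARABIC LETTER KAF",
}


def decode_bits(bits):
    """Step a: read each group of 8 bits as one numeric character code."""
    if len(bits) % 8:
        raise ValueError("expected a multiple of 8 bits")
    return [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]


def lookup_characters(codes, coded_character_set):
    """Step b: map each character code to a character in the coded set."""
    return [coded_character_set[code] for code in codes]
```

Feeding the bits 11100001 through both steps gives code 225 and then ARABIC LETTER FEH. A variable-length scheme such as UTF-8 would need a more involved step a, but step b would look the same.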

A key goal of internationalized character systems is the isolation of the semantics (letters) from the presentation (graphical presentation forms). HTTP concerns itself only with transporting the character data and the associated language and charset labels. The presentation of the character shapes is handled by the user’s graphics display software (browser, operating system, fonts), as shown in Figure 16-2c.

* Unlike Chinese and Japanese, Arabic has only 28 characters. Eight bits provide 256 unique values, which gives plenty of room for Latin characters, Arabic characters, and other useful symbols.

Figure 16-1. The charset parameter tells the client how to go from bits to characters

[Figure: an HTTP/1.1 200 OK response with Content-type: text/html; charset=iso-8859-6, Content-length: 18572, and Content-language: ar carries an entity body of code bits. Under iso-8859-6 decoding, the highlighted bit pattern 11100001 (decimal 225) becomes the character Arabic letter Feh.]

The Wrong Charset Gives the Wrong Characters

If the client uses the wrong charset parameter, the client will display strange, bogus characters. Let’s say a browser got the value 225 (binary 11100001) from the body:

• If the browser thinks the body is encoded with iso-8859-1 Western European character codes, it will show a lowercase Latin “a” with acute accent: á
• If the browser is using iso-8859-6 Arabic codes, it will show “FEH”: ف
• If the browser is using iso-8859-7 Greek, it will show a small “Alpha”: α
• If the browser is using iso-8859-8 Hebrew codes, it will show “BET”: ב

Figure 16-2. HTTP “charset” combines a character encoding scheme and a coded character set

[Figure: (a) the data bits 11100001 are decoded using iso-8859-6’s encoding scheme into character code 225; (b) the code selects the unique character “ARABIC LETTER FEH” from the iso-8859-6 coded character set, whose entries include 65 LATIN CAPITAL LETTER A, 66 LATIN CAPITAL LETTER B, 224 ARABIC TATWEEL, 225 ARABIC LETTER FEH, 226 ARABIC LETTER QAF, and 227 ARABIC LETTER KAF; (c) fonts and formatting software find the display shape (glyph). The MIME charset tag describes the combination of character encoding scheme and coded character set mapping.]
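The four interpretations of byte value 225 are easy to reproduce with Python's built-in codecs. The sketch below reports the Unicode name of the character each charset yields, which avoids depending on a terminal that can display all four scripts; the function name is hypothetical.

```python
import unicodedata


def interpretations(byte_value, charsets):
    """Map each charset name to the Unicode name of the character it yields."""
    raw = bytes([byte_value])
    return {charset: unicodedata.name(raw.decode(charset)) for charset in charsets}


names = interpretations(225, ["iso-8859-1", "iso-8859-6", "iso-8859-7", "iso-8859-8"])
```

The same byte comes back as four unrelated characters, which is exactly the garbling a user sees when a document is labeled with the wrong charset.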

Standardized MIME Charset Values

The combination of a particular character encoding and a particular coded character set is called a MIME charset. HTTP uses standardized MIME charset tags in the Content-Type and Accept-Charset headers. MIME charset values are registered with the IANA.* Table 16-1 lists a few MIME charset encoding schemes used by documents and browsers. A more complete list is provided in Appendix H.

* See http://www.iana.org/numbers.htm for the list of registered charset values.

Table 16-1. MIME charset encoding tags

MIME charset value | Description
us-ascii | The famous character encoding standardized in 1968 as ANSI_X3.4-1968. It is also named ASCII, but the “US” prefix is preferred because of several international variants in ISO 646 that modify selected characters. US-ASCII maps 7-bit values into 128 characters. The high bit is unused.
iso-8859-1 | iso-8859-1 is an 8-bit extension to ASCII to support Western European languages. It uses the high bit to include many West European characters, while leaving the ASCII codes (0–127) intact. Also called iso-latin-1, or nicknamed “Latin1.”
iso-8859-2 | Extends ASCII to include characters for Central and Eastern European languages, including Czech, Polish, and Romanian. Also called iso-latin-2.
iso-8859-5 | Extends ASCII to include Cyrillic characters, for languages including Russian, Serbian, and Bulgarian.
iso-8859-6 | Extends ASCII to include Arabic characters. Because the shapes of Arabic characters change depending on their position in a word, Arabic requires a display engine that analyzes the context and generates the correct shape for each character.
iso-8859-7 | Extends ASCII to include modern Greek characters. Formerly known as ELOT-928 or ECMA-118:1986.
iso-8859-8 | Extends ASCII to include Hebrew and Yiddish characters.
iso-8859-15 | Updates iso-8859-1, replacing some less-needed punctuation and fraction symbols with forgotten French and Finnish letters and replacing the international currency sign with the symbol for the new Euro currency. This character set is nicknamed “Latin0” and may one day replace iso-8859-1 as the preferred default character set in Europe.
iso-2022-jp | iso-2022-jp is a widely used encoding for Japanese email and web content. It is a variable-length encoding scheme that supports ASCII characters with single bytes but uses three-character modal escape sequences to shift into three different Japanese character sets.
euc-jp | euc-jp is an ISO 2022–compliant variable-length encoding that uses explicit bit patterns to identify each character, without requiring modes and escape sequences. It uses 1-byte, 2-byte, and 3-byte sequences of characters to identify characters in multiple Japanese character sets.
Shift_JIS | This encoding was originally developed by Microsoft and sometimes is called SJIS or MS Kanji. It is a bit complicated, for reasons of historic compatibility, and it cannot map all characters, but it still is common.


Content-Type Charset Header and META Tags

Web servers send the client the MIME charset tag in the Content-Type header, using

the charset parameter:

Content-Type: text/html; charset=iso-2022-jp

If no charset is explicitly listed, the receiver may try to infer the character set from

the document contents. For HTML content, character sets might be found in

<META HTTP-EQUIV="Content-Type"> tags that describe the charset.

Example 16-1 shows how HTML META tags set the charset to the Japanese encod-

ing iso-2022-jp. If the document is not HTML, or there is no META Content-Type

tag, software may attempt to infer the character encoding by scanning the actual text

for common patterns indicative of languages and encodings.

If a client cannot infer a character encoding, it assumes iso-8859-1.
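The fallback order described in this section (Content-Type header first, then an HTML META tag, then the iso-8859-1 default) might be sketched as follows. The regular expression, the 1,024-byte scan limit, and the function name are assumptions chosen for this example; real browsers use more elaborate detection.

```python
import re

# Matches a META HTTP-EQUIV="Content-Type" tag and captures its charset value.
META_CHARSET = re.compile(
    rb"""<meta[^>]+http-equiv\s*=\s*["']?content-type["']?[^>]*"""
    rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(html_bytes, header_charset=None):
    """Pick a charset: HTTP header first, then a META tag, then iso-8859-1."""
    if header_charset:
        return header_charset.lower()
    match = META_CHARSET.search(html_bytes[:1024])  # only scan the start
    if match:
        return match.group(1).decode("ascii").lower()
    return "iso-8859-1"
```

Scanning the raw bytes works here because the META tag itself is written in ASCII-compatible characters even when the rest of the document is not.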

Example 16-1. Character encoding can be specified in HTML META tags

<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">
<TITLE>A Japanese Document</TITLE>
</HEAD>
<BODY>
...

Table 16-1. MIME charset encoding tags (continued)

MIME charset value | Description
koi8-r | KOI8-R is a popular 8-bit Internet character set encoding for Russian, defined in IETF RFC 1489. The initials are transliterations of the acronym for “Code for Information Exchange, 8 bit, Russian.”
utf-8 | UTF-8 is a common variable-length character encoding scheme for representing UCS (Unicode), which is the Universal Character Set of the world’s characters. UTF-8 uses a variable-length encoding for character code values, representing each character by from one to six bytes (the later definition in RFC 3629 limits this to four bytes). One of the primary features of UTF-8 is backward compatibility with ordinary 7-bit ASCII text.
windows-1252 | Microsoft calls its coded character sets “code pages.” Windows code page 1252 (a.k.a. “CP1252” or “WinLatin1”) is an extension of iso-8859-1.

The Accept-Charset Header

There are thousands of defined character encoding and decoding methods, developed over the past several decades. Most clients do not support all the various character coding and mapping systems.

HTTP clients can tell servers precisely which character systems they support, using the Accept-Charset request header. The Accept-Charset header value provides a list of character encoding schemes that the client supports. For example, the following HTTP request header indicates that a client accepts the Western European iso-8859-1 character system as well as the UTF-8 variable-length Unicode compatibility system. A server is free to return content in either of these character encoding schemes.

Accept-Charset: iso-8859-1, utf-8

Note that there is no Content-Charset response header to match the Accept-Charset

request header. The response character set is carried back from the server by the

charset parameter of the Content-Type response header, to be compatible with

MIME. It’s too bad this isn’t symmetric, but all the information still is there.
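To show how the two headers fit together, here is a small Python sketch of a server choosing a response charset from the client's Accept-Charset list and reporting it in Content-Type. The function name, the text/html media type, and the utf-8 fallback are assumptions for illustration, and quality values are ignored to keep the example short.

```python
def choose_charset(text, accept_charset, fallback="utf-8"):
    """Pick the first client-accepted charset that can represent the text.

    Returns (Content-Type header value, encoded body bytes).
    """
    for item in accept_charset.split(","):
        name = item.split(";")[0].strip().lower()
        if not name or name == "*":
            continue
        try:
            body = text.encode(name)
        except (LookupError, UnicodeEncodeError):
            continue  # unknown to this Python, or cannot represent the text
        return f"text/html; charset={name}", body
    return f"text/html; charset={fallback}", text.encode(fallback)
```

With Accept-Charset: iso-8859-1, utf-8, a French word is sent as iso-8859-1, while Arabic text falls through to utf-8 because iso-8859-1 cannot represent it.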

Multilingual Character Encoding Primer

The previous section described how the HTTP Accept-Charset header and the

Content-Type charset parameter carry character-encoding information from the cli-

ent and server. HTTP programmers who do a lot of work with international applica-

tions and content need to have a deeper understanding of multilingual character sys-

tems to understand technical specifications and properly implement software.

It isn’t easy to learn multilingual character systems—the terminology is complex and

inconsistent, you often have to pay to read the standards documents, and you may

be unfamiliar with the other languages with which you’re working. This section is an

overview of character systems and standards. If you are already comfortable with

character encodings, or are not interested in this detail, feel free to jump ahead to

“Language Tags and HTTP.”

Character Set Terminology

Here are eight terms about electronic character systems that you should know:

Character

An alphabetic letter, numeral, punctuation mark, ideogram (as in Chinese), symbol, or other textual “atom” of writing. The Universal Character Set (UCS) initiative, known informally as Unicode,* has developed a standardized set of textual names for many characters in many languages, which often are used to conveniently and uniquely name characters.†

Glyph

A stroke pattern or unique graphical shape that describes a character. A character may have multiple glyphs if it can be written different ways (see Figure 16-3).

Coded character

A unique number assigned to a character so that we can work with it.

Coding space

A range of integers that we plan to use as character code values.

Code width

The number of bits in each (fixed-size) character code.

Character repertoire

A particular working set of characters (a subset of all the characters in the world).

Coded character set

A set of coded characters that takes a character repertoire (a selection of characters from around the world) and assigns each character a code from a coding space. In other words, it maps numeric character codes to real characters.

Character encoding scheme

An algorithm to encode numeric character codes into a sequence of content bits (and to decode them back). Character encoding schemes can be used to reduce the amount of data required to identify characters (compression), work around transmission restrictions, and unify overlapping coded character sets.

* Unicode is a commercial consortium based on UCS that drives commercial products.

† The names look like “LATIN CAPITAL LETTER S” and “ARABIC LETTER QAF.”

Charset Is Poorly Named

Technically, the MIME charset tag (used in the Content-Type charset parameter and

the Accept-Charset header) doesn’t specify a character set at all. The MIME charset

value names a total algorithm for mapping data bits to codes to unique characters. It

combines the two separate concepts of character encoding scheme and coded charac-

ter set (see Figure 16-2).

This terminology is sloppy and confusing, because there already are published stan-

dards for character encoding schemes and for coded character sets.*Here’s what the

HTTP/1.1 authors say about their use of terminology (in RFC 2616):

The term “character set” is used in this document to refer to a method ... to convert a

sequence of octets into a sequence of characters... Note: This use of the term “charac-

ter set” is more commonly referred to as a “character encoding.” However, since

HTTP and MIME share the same registry, it’s important that the terminology also be

shared.

The IETF also adopts nonstandard terminology in RFC 2277:

This document uses the term “charset” to mean a set of rules for mapping from a

sequence of octets to a sequence of characters, such as the combination of a coded

character set and a character encoding scheme; this is also what is used as an identifier

in MIME “charset=” parameters, and registered in the IANA charset registry. (Note

that this is NOT a term used by other standards bodies, such as ISO).

So, be careful when reading standards documents, so you know exactly what’s being

defined. Now that we’ve got the terminology sorted out, let’s look a bit more closely

at characters, glyphs, character sets, and character encodings.

* Worse, the MIME charset tag often co-opts the name of a particular coded character set or encoding scheme.

For example, iso-8859-1 is a coded character set (it assigns numeric codes to a set of 256 European characters),

but MIME uses the charset value “iso-8859-1” to mean an 8-bit identity encoding of the coded character set.

This imprecise terminology isn’t fatal, but when reading standards documents, be clear on the assumptions.
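To make the point concrete, here is a small sketch in which the same two octets become different characters depending on the charset value used to interpret them. The byte values are chosen for illustration only. Python's codec names happen to match the MIME charset values used here.

```python
data = b"\xc3\x96"  # two octets as they might arrive in an entity body

# "utf-8" bundles the UTF-8 encoding scheme with the UCS coded character set.
as_utf8 = data.decode("utf-8")

# "iso-8859-1" bundles an 8-bit identity encoding with the iso-8859-1
# coded character set, so every octet becomes exactly one character.
as_latin1 = data.decode("iso-8859-1")

print(len(as_utf8), [hex(ord(c)) for c in as_utf8])
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])
```

Under utf-8 the two octets decode to one character; under iso-8859-1 they decode to two. Neither result is wrong in isolation, which is why the charset label has to travel with the data.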


Characters

Characters are the most basic building blocks of writing. A character represents an

alphabetic letter, numeral, punctuation mark, ideogram (as in Chinese), mathemati-

cal symbol, or other basic unit of writing.

Characters are independent of font and style. Figure 16-3 shows several variants of

the same character, called “LATIN SMALL LETTER A.” A native reader of Western

European languages would immediately recognize all five of these shapes as the same

character, even though the stroke patterns and styles are quite different.

Many writing systems also have different stroke shapes for a single character,

depending on the position of the character in the word. For example, the four

strokes in Figure 16-4 all represent the character “ARABIC LETTER AIN.”*

Figure 16-4a shows how “AIN” is written as a standalone character. Figure 16-4d

shows “AIN” at the beginning of a word, Figure 16-4c shows “AIN” in the middle of

a word, and Figure 16-4b shows “AIN” at the end of a word.†

Glyphs, Ligatures, and Presentation Forms

Don’t confuse characters with glyphs. Characters are the unique, abstract “atoms” of

language. Glyphs are the particular ways you draw each character. Each character

has many different glyphs, depending on the artistic style and script.‡

Also, don’t confuse characters with presentation forms. To make writing look nicer, many handwritten scripts and typefaces let you join adjacent characters into pretty ligatures, in which the two characters smoothly connect. English-speaking typesetters often join “F” and “I” into an “FI ligature” (see Figure 16-5a–b), and Arabic writers often join the “LAM” and “ALIF” characters into an attractive ligature (Figure 16-5c–d).

Figure 16-3. One character can have many different written forms

Figure 16-4. Four positional forms of the single character “ARABIC LETTER AIN”: (a) Standalone, (b) Final position, (c) Medial position, (d) Initial position. (These different glyphs represent the same character, “ARABIC LETTER AIN.”)

* The sound “AIN” is pronounced something like “ayine,” but toward the back of the throat.

† Note that Arabic words are written from right to left.

‡ Many people use the term “glyph” to mean the final rendered bitmap image, but technically a glyph is the inherent shape of a character, independent of font and minor artistic style. This distinction isn’t very easy to apply, or useful for our purposes.

Here’s the general rule: if the meaning of the text changes when you replace one

glyph with another, the glyphs are different characters. Otherwise, they are the same

characters, with a different stylistic presentation.*

Coded Character Sets

Coded character sets, defined in RFCs 2277 and 2130, map integers to characters.

Coded character sets often are implemented as arrays,†indexed by code number (see

Figure 16-6). The array elements are characters.‡
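The array view can be sketched in a few lines. The list below stands in for the US-ASCII coded character set, with Python strings playing the role of abstract characters. The variable names are invented for this example, and the standard unicodedata module is used only to print the character's standardized name.

```python
import unicodedata

# A coded character set modeled as an array indexed by code number.
# Here the 128 US-ASCII codes map to their characters.
us_ascii = [chr(code) for code in range(128)]

ch = us_ascii[68]  # code 68 (0x44)
print(ch, unicodedata.name(ch))
```

Looking up code 68 yields the character named “LATIN CAPITAL LETTER D,” independent of how any font might draw it.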

Let’s look at a few important coded character set standards, including the historic

US-ASCII character set, the iso-8859 extensions to ASCII, the Japanese JIS X 0201

character set, and the Universal Character Set (Unicode).

US-ASCII: The mother of all character sets

ASCII is the most famous coded character set, standardized back in 1968 as ANSI standard X3.4 “American Standard Code for Information Interchange.” ASCII uses only the code values 0–127, so only 7 bits are required to cover the code space. The preferred name for ASCII is “US-ASCII,” to distinguish it from international variants of the 7-bit character set.

Figure 16-5. Ligatures are stylistic presentation forms of adjacent characters, not new characters: (a) Without FI ligature, (b) With FI ligature, (c) Without LA ligature (ALIF, LAM), (d) With LA ligature (LAM and ALIF)

Figure 16-6. Coded character sets can be thought of as arrays that map numeric codes to characters. (In the US-ASCII coded character set, code 68 (0x44) maps to “LATIN CAPITAL LETTER D.”)

* The division between semantics and presentation isn’t always clear. For ease of implementation, some presentation variants of the same characters have been assigned distinct characters, but the goal is to avoid this.

† The arrays can be multidimensional, so different bits of the code number index different axes of the array.

‡ Figure 16-6 uses a grid to represent a coded character set. Each element of the grid contains a character image. These images are symbolic. The presence of an image “D” is shorthand for the character “LATIN CAPITAL LETTER D,” not for any particular graphical glyph.

HTTP messages (headers, URIs, etc.) use US-ASCII.

iso-8859

The iso-8859 character set standards are 8-bit supersets of US-ASCII that use the high bit to add characters for international writing. The additional space provided by the extra bit (128 extra codes) isn’t large enough to hold even all of the European characters (not to mention Asian characters), so iso-8859 provides customized character sets for different regions:

iso-8859-1     Western European languages (e.g., English, French)
iso-8859-2     Central and Eastern European languages (e.g., Czech, Polish)
iso-8859-3     Southern European languages
iso-8859-4     Northern European languages (e.g., Latvian, Lithuanian, Greenlandic)
iso-8859-5     Cyrillic (e.g., Bulgarian, Russian, Serbian)
iso-8859-6     Arabic
iso-8859-7     Greek
iso-8859-8     Hebrew
iso-8859-9     Turkish
iso-8859-10    Nordic languages (e.g., Icelandic, Inuit)
iso-8859-15    Modification to iso-8859-1 that includes the new Euro currency character

iso-8859-1, also known as Latin1, is the default character set for HTML. It can be used to represent text in most Western European languages. There has been some discussion of replacing iso-8859-1 with iso-8859-15 as the default HTTP coded character set, because it includes the new Euro currency symbol. However, because of the widespread adoption of iso-8859-1, it’s unlikely that a widespread change to iso-8859-15 will be adopted for quite some time.

JIS X 0201

JIS X 0201 is an extremely minimal character set that extends ASCII with Japanese half-width katakana characters. The half-width katakana characters were originally used in the Japanese telegraph system. JIS X 0201 is often called “JIS Roman.” JIS is an acronym for “Japanese Industrial Standard.”

JIS X 0208 and JIS X 0212

Japanese includes thousands of characters from several writing systems. While it is possible to limp by (painfully) using the 63 basic phonetic katakana characters in JIS X 0201, a much more complete character set is required for practical use.

The JIS X 0208 character set was the first multi-byte Japanese character set; it defined 6,879 coded characters, most of which are Chinese-based kanji. The JIS X 0212 character set adds an additional 6,067 characters.

UCS

The Universal Character Set (UCS) is a worldwide standards effort to combine all of

the world’s characters into a single coded character set. UCS is defined by ISO

10646. Unicode is a commercial consortium that tracks the UCS standards. UCS has

coding space for millions of characters, although the basic set consists of only about

50,000 characters.

Character Encoding Schemes

Character encoding schemes pack character code numbers into content bits and

unpack them back into character codes at the other end (Figure 16-7). There are

three broad classes of character encoding schemes:

Fixed width

Fixed-width encodings represent each coded character with a fixed number of

bits. They are fast to process but can waste space.

Variable width (nonmodal)

Variable-width encodings use different numbers of bits for different character

code numbers. They can reduce the number of bits required for common charac-

ters, and they retain compatibility with legacy 8-bit character sets while allowing

the use of multiple bytes for international characters.

Variable width (modal)

Modal encodings use special “escape” patterns to shift between different modes.

For example, a modal encoding can be used to switch between multiple, over-

lapping character sets in the middle of text. Modal encodings are complicated to

process, but they can efficiently support complicated writing systems.

Let’s look at a few common encoding schemes.

Figure 16-7. Character encoding scheme encodes character codes into bits and back again

[Figure: a character encoder turns character codes into the bits of the entity body, and a character decoder turns the bits back into character codes. The response in the figure carries these headers:]

HTTP/1.1 200 OK
Content-type: text/html; charset=iso-2022-jp
Content-length: 4198
Content-language: ja

8-bit

The 8-bit fixed-width identity encoding simply encodes each character code with its

corresponding 8-bit value. It supports only character sets with a code range of 256

characters. The iso-8859 family of character sets uses the 8-bit identity encoding.

UTF-8

UTF-8 is a popular character encoding scheme designed for UCS (UTF stands for

“UCS Transformation Format”). UTF-8 uses a nonmodal, variable-length encoding

for the character code values, where the leading bits of the first byte tell the length of

the encoded character in bytes, and any subsequent byte contains six bits of code

value (see Table 16-2).

If the first encoded byte has a high bit of 0, the length is just 1 byte, and the remain-

ing 7 bits contain the character code. This has the nice result of ASCII compatibility

(but not iso-8859 compatibility, because iso-8859 uses the high bit).

For example, character code 90 (ASCII “Z”) would be encoded as 1 byte (01011010),

while code 5073 (13-bit binary value 1001111010001) would be encoded into 3 bytes:

11100001 10001111 10010001
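The worked example can be checked with a short encoder that follows the bit patterns of Table 16-2. This is a sketch written for this section, not a production codec: the function name is invented, and it accepts codes of up to 31 bits as the table does. Python's built-in utf-8 codec follows the later RFC 3629 restriction to at most four bytes, so the two agree only for codes that both accept.

```python
def utf8_encode(code):
    """Encode one character code using the bit patterns of Table 16-2."""
    if code < 0x80:
        # Up to 7 bits: a single byte of the form 0ccccccc.
        return bytes([code])
    # Each pair is (payload bits in the first byte, continuation bytes).
    for lead_bits, extra in ((5, 1), (4, 2), (3, 3), (2, 4), (1, 5)):
        if code < 1 << (lead_bits + 6 * extra):
            break
    else:
        raise ValueError("code needs more than 31 bits")
    # First byte: (extra + 1) one-bits, a zero bit, then the top payload bits.
    prefix = (0xFF << (7 - extra)) & 0xFF
    out = [prefix | (code >> (6 * extra))]
    # Continuation bytes: 10cccccc, six payload bits each, high bits first.
    for shift in range(6 * (extra - 1), -1, -6):
        out.append(0x80 | ((code >> shift) & 0x3F))
    return bytes(out)


for code in (90, 5073):
    encoded = utf8_encode(code)
    print(code, " ".join(format(b, "08b") for b in encoded))
```

For code 90 the loop prints a single byte, and for code 5073 it prints the three bytes shown above.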

iso-2022-jp

iso-2022-jp is a widely used encoding for Japanese Internet documents. iso-2022-jp is

a variable-length, modal encoding, with all values less than 128 to prevent problems

with non–8-bit-clean software.

The encoding context always is set to one of four predefined character sets.*Special

“escape sequences” shift from one set to another. iso-2022-jp initially uses the US-

ASCII character set, but it can switch to the JIS X 0201 (JIS-Roman) character set or

the much larger JIS X 0208-1978 and JIS X 0208-1983 character sets using 3-byte

escape sequences.

Table 16-2. UTF-8 variable-width, nonmodal encoding

Character code bits  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
0–7                  0ccccccc  -         -         -         -         -
8–11                 110ccccc  10cccccc  -         -         -         -
12–16                1110cccc  10cccccc  10cccccc  -         -         -
17–21                11110ccc  10cccccc  10cccccc  10cccccc  -         -
22–26                111110cc  10cccccc  10cccccc  10cccccc  10cccccc  -
27–31                1111110c  10cccccc  10cccccc  10cccccc  10cccccc  10cccccc

* The iso-2022-jp encoding is tightly bound to these four character sets, whereas some other encodings are independent of the particular character set.

The escape sequences are shown in Table 16-3. In practice, Japanese text begins with

“ESC $ @” or “ESC $ B” and ends with “ESC ( B” or “ESC ( J”.

When in the US-ASCII or JIS-Roman modes, a single byte is used per character.

When using the larger JIS X 0208 character set, two bytes are used per character

code. The encoding restricts the bytes sent to be between 33 and 126.*
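The mode switching is easy to observe with Python's built-in iso-2022-jp codec. The sample string is an assumption chosen for this sketch: five ASCII characters followed by three kanji, written with Unicode escapes so the source stays ASCII.

```python
ESC = b"\x1b"

# "HTTP " followed by three kanji (U+65E5, U+672C, U+8A9E).
text = "HTTP \u65e5\u672c\u8a9e"
encoded = text.encode("iso-2022-jp")

print(encoded)
# Every byte stays below 128, so the data survives non-8-bit-clean software.
print(all(b < 128 for b in encoded))
# The output begins in US-ASCII, shifts into a two-byte kanji set,
# and shifts back to US-ASCII at the end.
print(encoded.count(ESC))
```

The printed bytes show the ASCII prefix untouched, an “ESC $ B” sequence before the kanji, two bytes per kanji, and a closing “ESC ( B”.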

euc-jp

euc-jp is another popular Japanese encoding. EUC stands for “Extended Unix

Code,” first developed to support Asian characters on Unix operating systems.

Like iso-2022-jp, the euc-jp encoding is a variable-length encoding that allows the

use of several standard Japanese character sets. But unlike iso-2022-jp, the euc-jp

encoding is not modal. There are no escape sequences to shift between modes.

euc-jp supports four coded character sets: JIS X 0201 (JIS-Roman, ASCII with a few

Japanese substitutions), JIS X 0208, half-width katakana (63 characters used in the

original Japanese telegraph system), and JIS X 0212.

One byte is used to encode JIS Roman (ASCII compatible), two bytes are used for JIS X

0208 and half-width katakana, and three bytes are used for JIS X 0212. The coding is a

bit wasteful but is simple to process.
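A quick way to see the byte counts is to encode one character from each group with Python's built-in euc-jp codec. The sample characters are assumptions picked for this sketch. A JIS X 0212 character is left out because the three-byte case depends on which characters a given codec maps to that set.

```python
samples = {
    "A": "JIS-Roman / ASCII range",
    "\u65e5": "JIS X 0208 kanji",
    "\uff71": "half-width katakana",
}

for ch, kind in samples.items():
    encoded = ch.encode("euc-jp")
    # No escape sequences appear: the first byte alone tells the decoder
    # how many bytes belong to the character.
    print(kind, len(encoded), list(encoded))
```

The ASCII-range character takes one byte, the kanji takes two high-bit bytes, and the half-width katakana takes two bytes starting with 142.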

The encoding patterns are outlined in Table 16-4.

Table 16-3. iso-2022-jp character set switching escape sequences

Escape sequence  Resulting coded character set  Bytes per code
ESC ( B          US-ASCII                       1
ESC ( J          JIS X 0201-1976 (JIS Roman)    1
ESC $ @          JIS X 0208-1978                2
ESC $ B          JIS X 0208-1983                2

* Though the bytes can have only 94 values (between 33 and 126), this is sufficient to cover all the characters in the JIS X 0208 character sets, because the character sets are organized into a 94 × 94 grid of code values.

Table 16-4. euc-jp encoding values

Which byte                                   Encoding values
JIS X 0201 (94 coded characters)
  1st byte                                   33–126
JIS X 0208 (6879 coded characters)
  1st byte                                   161–254
  2nd byte                                   161–254
Half-width katakana (63 coded characters)
  1st byte                                   142
  2nd byte                                   161–223
JIS X 0212 (6067 coded characters)
  1st byte                                   143
  2nd byte                                   161–254
  3rd byte                                   161–254

This wraps up our survey of character sets and encodings. The next section explains language tags and how HTTP uses language tags to target content to audiences. Please refer to Appendix H for a detailed listing of standardized character sets.

Language Tags and HTTP

Language tags are short, standardized strings that name spoken languages.

We need standardized names, or some people will tag French documents as “French,” others will use “Français,” others still might use “France,” and lazy people might just use “Fra” or “F.” Standardized language tags avoid this confusion.

There are language tags for English (en), German (de), Korean (ko), and many other languages. Language tags can describe regional variants and dialects of languages, such as Brazilian Portuguese (pt-BR), U.S. English (en-US), and Hunan Chinese (zh-xiang). There is even a standard language tag for Klingon (i-klingon)!

The Content-Language Header

The Content-Language entity header field describes the target audience languages for the entity. If the content is intended primarily for a French audience, the Content-Language header field would contain:

Content-Language: fr

The Content-Language header isn’t limited to text documents. Audio clips, movies, and applications might all be intended for a particular language audience. Any media type that is targeted to particular language audiences can have a Content-Language header. In Figure 16-8, the audio file is tagged for a Navajo audience.

If the content is intended for multiple audiences, you can list multiple languages. As suggested in the HTTP specification, a rendition of the “Treaty of Waitangi,” presented simultaneously in the original Maori and English versions, would call for:

Content-Language: mi, en

However, just because multiple languages are present within an entity does not mean

that it is intended for multiple linguistic audiences. A beginner’s language primer,

such as “A First Lesson in Latin,” which clearly is intended to be used by an English-

literate audience, would properly include only “en”.

The Accept-Language Header

Most of us know at least one language. HTTP lets us pass our language restrictions

and preferences along to web servers. If the web server has multiple versions of a

resource, in different languages, it can give us content in our preferred language.*

Here, a client requests Spanish content:

Accept-Language: es

You can place multiple language tags in the Accept-Language header to enumerate all

supported languages and the order of preference (left to right). Here, the client pre-

fers English but will accept Swiss German (de-CH) or other variants of German (de):

Accept-Language: en, de-CH, de

Clients use Accept-Language and Accept-Charset to request content they can under-

stand. We’ll see how this works in more detail in Chapter 17.
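Before the fuller treatment in Chapter 17, here is a minimal sketch of how a server might honor the header. The function names and the list of available languages are hypothetical. The sketch takes the tags in the order listed, strips any parameters (HTTP/1.1 also allows quality values such as “;q=0.8”, which are ignored here), and requires an exact tag match.

```python
def parse_accept_language(value):
    """Split an Accept-Language value into lowercased tags, in listed order."""
    tags = []
    for item in value.split(","):
        # Drop any parameters (such as quality values) after a semicolon.
        tag = item.split(";", 1)[0].strip().lower()
        if tag:
            tags.append(tag)
    return tags


def choose_language(accept_language, available):
    """Return the first available language that matches a listed tag."""
    offered = {lang.lower(): lang for lang in available}
    for tag in parse_accept_language(accept_language):
        if tag in offered:
            return offered[tag]
    return None  # no acceptable variant; the server must decide what to do


print(choose_language("en, de-CH, de", ["fr", "de"]))
```

With only French and German variants on hand, the sketch skips “en” and “de-CH” and settles on “de”. A real server would likely add quality-value ordering and prefix matching.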

Types of Language Tags

Language tags have a standardized syntax, documented in RFC 3066, “Tags for the

Identification of Languages.” Language tags can be used to represent:

• General language classes (as in “es” for Spanish)

• Country-specific languages (as in “en-GB” for English in Great Britain)

• Dialects of languages (as in “no-bok” for Norwegian “Book Language”)

• Regional languages (as in “sgn-US-MA” for Martha’s Vineyard sign language)

• Standardized nonvariant languages (e.g., “i-navajo”)

• Nonstandard languages (e.g., “x-snowboarder-slang”*)

Figure 16-8. Content-Language header marks a “Rain Song” audio clip for Navajo speakers

[Figure: the response for http://www.canyonrecords.com/wav/534.wav carries these headers:]

HTTP/1.1 200 OK
Content-type: audio/x-wav
Content-length: 289772
Content-language: i-navajo

* Servers also can use the Accept-Language header to generate dynamic content in the language of the user or to select images or target language-appropriate merchandising promotions.

Subtags

Language tags have one or more parts, separated by hyphens, called subtags:

• The first subtag is called the primary subtag. The values are standardized.

• The second subtag is optional and follows its own naming standard.

• Any trailing subtags are unregistered.

The primary subtag contains only letters (A–Z). Subsequent subtags can contain let-

ters or numbers, up to eight characters in length. An example is shown in Figure 16-9.
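The syntax rules above translate into a short validator. This is a sketch under stated assumptions: the function name is invented, and the eight-character limit on the primary subtag is taken from the RFC 3066 grammar rather than from the sentence above, which states the limit only for subsequent subtags.

```python
import re

# Primary subtag: letters only. Later subtags: letters or digits.
# Both are limited to eight characters in this sketch.
PRIMARY = re.compile(r"[A-Za-z]{1,8}")
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def split_language_tag(tag):
    """Split a language tag into subtags, raising ValueError if malformed."""
    subtags = tag.split("-")
    if not PRIMARY.fullmatch(subtags[0]):
        raise ValueError("bad primary subtag: %r" % subtags[0])
    for subtag in subtags[1:]:
        if not SUBTAG.fullmatch(subtag):
            raise ValueError("bad subtag: %r" % subtag)
    return subtags


print(split_language_tag("sgn-US-MA"))
```

The check is purely syntactic. It says nothing about whether a subtag is actually registered or standardized.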

Capitalization

All tags are case-insensitive—the tags “en” and “eN” are equivalent. However, low-

ercasing conventionally is used to represent general languages, while uppercasing is

used to signify particular countries. For example, “fr” means all languages classified

as French, while “FR” signifies the country France.†

IANA Language Tag Registrations

The values of the first and second language subtags are defined by various standards

documents and their maintaining organizations. The IANA‡administers the list of

standard language tags, using the rules outlined in RFC 3066.

If a language tag is composed of standard country and language values, the tag doesn’t have to be specially registered. Only those language tags that can’t be composed out of the standard country and language values need to be registered specially with the IANA.* The following sections outline the RFC 3066 standards for the first and second subtags.

Figure 16-9. Language tags are separated into subtags

[Figure: the tag “sgn-US-MA” (Martha’s Vineyard sign language) split into its first subtag “sgn” (sign language), second subtag “US” (America), and third subtag “MA” (Massachusetts regional variant).]

* Describes the unique dialect spoken by “shredders.”

† This convention is recommended by ISO standard 3166.

‡ See http://www.iana.org and RFC 2860.

First Subtag: Namespace

The first subtag usually is a standardized language token, chosen from the ISO 639

set of language standards. But it also can be the letter “i” to identify IANA-registered

names, or “x” for private, extension names. Here are the rules:

If the first subtag has:

• Two characters, it is a language code from the ISO 639† and 639-1 standards

• Three characters, it is a language code listed in the ISO 639-2‡standard and

extensions

• The letter “i,” the language tag is explicitly IANA-registered

• The letter “x,” the language tag is a private, nonstandard, extension subtag

The ISO 639 and 639-2 names are summarized in Appendix G. A few examples are

shown here in Table 16-5.

* At the time of writing, only 21 language tags have been explicitly registered with the IANA, including Can-

tonese (“zh-yue”), New Norwegian (“no-nyn”), Luxembourgish (“i-lux”), and Klingon (“i-klingon”). The

hundreds of remaining spoken languages in use on the Internet have been composed from standard compo-

nents.

† See ISO standard 639, “Codes for the representation of names of languages.”

‡ See ISO 639-2, “Codes for the representation of names of languages—Part 2: Alpha-3 code.”

Table 16-5. Sample ISO 639 and 639-2 language codes

Language          ISO 639   ISO 639-2
Arabic            ar        ara
Chinese           zh        chi/zho
Dutch             nl        dut/nla
English           en        eng
French            fr        fra/fre
German            de        deu/ger
Greek (Modern)    el        ell/gre
Hebrew            he        heb
Italian           it        ita
Japanese          ja        jpn
Korean            ko        kor
Norwegian         no        nor
Russian           ru        rus
Spanish           es        esl/spa
Swedish           sv        sve/swe
Turkish           tr        tur

Second Subtag: Namespace

The second subtag usually is a standardized country token, chosen from the ISO 3166 set of country code and region standards. But it may also be another string, which you may register with the IANA. Here are the rules:

If the second subtag has:

• Two characters, it’s a country/region defined by ISO 3166*

• Three to eight characters, it may be registered with the IANA

• One character, it is illegal

Some of the ISO 3166 country codes are shown in Table 16-6. The complete list of country codes can be found in Appendix G.

* The country codes AA, QM–QZ, XA–XZ and ZZ are reserved by ISO 3166 as user-assigned codes. These must not be used to form language tags.

Table 16-6. Sample ISO 3166 country codes

Country                          Code
Brazil                           BR
Canada                           CA
China                            CN
France                           FR
Germany                          DE
Holy See (Vatican City State)    VA
Hong Kong                        HK
India                            IN
Italy                            IT
Japan                            JP
Lebanon                          LB
Mexico                           MX
Pakistan                         PK
Russian Federation               RU
United Kingdom                   GB
United States                    US

Remaining Subtags: Namespace

There are no rules for the third and following subtags, apart from being up to eight

characters (letters and digits).
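The namespace rules for the first and second subtags can be summarized as two small classification functions. This is a sketch only: the function names and the returned labels are invented for illustration, and nothing here consults the actual ISO 639, ISO 3166, or IANA lists, so a syntactically plausible but unassigned code would still be classified.

```python
def describe_first_subtag(subtag):
    """Classify a first subtag by the RFC 3066 namespace rules."""
    s = subtag.lower()
    if s == "i":
        return "IANA-registered tag"
    if s == "x":
        return "private extension tag"
    if len(s) == 2:
        return "ISO 639 language code"
    if len(s) == 3:
        return "ISO 639-2 language code"
    return "unrecognized"


def describe_second_subtag(subtag):
    """Classify a second subtag by its length."""
    if len(subtag) == 1:
        raise ValueError("a one-character second subtag is illegal")
    if len(subtag) == 2:
        return "ISO 3166 country code"
    if 3 <= len(subtag) <= 8:
        return "possibly IANA-registered"
    raise ValueError("subtags are limited to eight characters")


first, second = "sgn-US-MA".split("-")[:2]
print(describe_first_subtag(first), "/", describe_second_subtag(second))
```

Third and later subtags need no classification, since the only rule for them is the length and character limit.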

Conﬁguring Language Preferences

You can configure language preferences in your browser profile.

Netscape Navigator lets you set language preferences through Edit ➝Preferences...

➝Languages..., and Microsoft Internet Explorer lets you set languages through

Tools ➝ Internet Options... ➝ Languages.

Language Tag Reference Tables

Appendix G contains convenient reference tables for language tags:

• IANA-registered language tags are shown in Table G-1.

• ISO 639 language codes are shown in Table G-2.

• ISO 3166 country codes are shown in Table G-3.

Internationalized URIs

Today, URIs don’t provide much support for internationalization. With a few (poorly defined) exceptions, today’s URIs consist of a subset of US-ASCII characters. There are efforts underway that might let us include a richer set of characters in the hostnames and paths of URLs, but right now, these standards have not been widely accepted or deployed. Let’s review today’s practice.

Global Transcribability Versus Meaningful Characters

The URI designers wanted everyone around the world to be able to share URIs with

each other—by email, by phone, by billboard, even over the radio. And they wanted

URIs to be easy to use and remember. These two goals are in conflict.

To make it easy for folks around the globe to enter, manipulate, and share URIs, the

designers chose a very limited set of common characters for URIs (basic Latin alpha-

bet letters, digits, and a few special characters). This small repertoire of characters is

supported by most software and keyboards around the world.

Unfortunately, by restricting the character set, the URI designers made it much

harder for people around the globe to create URIs that are easy to use and remem-

ber. The majority of world citizens don’t even recognize the Latin alphabet, making

it nearly impossible to remember URIs as abstract patterns.


The URI authors felt it was more important to ensure transcribability and sharability

of resource identifiers than to have them consist of the most meaningful characters. So

we have URIs that (today) essentially consist of a restricted subset of ASCII characters.

URI Character Repertoire

The subset of US-ASCII characters permitted in URIs can be divided into reserved,

unreserved, and escape character classes. The unreserved character classes can be

used generally within any component of URIs that allow them. The reserved charac-

ters have special meanings in many URIs, so they shouldn’t be used in general. See

Table 16-7 for a list of the unreserved, reserved, and escape characters.

Escaping and Unescaping

URI “escapes” provide a way to safely insert reserved characters and other unsup-

ported characters (such as spaces) inside URIs. An escape is a three-character

sequence, consisting of a percent character (%) followed by two hexadecimal digit

characters. The two hex digits represent the code for a US-ASCII character.

For example, to insert a space (ASCII 32) in a URL, you could use the escape “%20”,

because 20 is the hexadecimal representation of 32. Similarly, if you wanted to

include a percent sign and have it not be treated as an escape, you could enter

“%25”, where 25 is the hexadecimal value of the ASCII code for percent.
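Both escapes can be reproduced with Python's standard urllib.parse module. The filename is borrowed from the Joe's Hardware running example, and the “100%” string is made up for this sketch.

```python
from urllib.parse import quote, unquote

escaped = quote("big sale.txt")
print(escaped)            # the space becomes a three-character escape
print(unquote(escaped))   # and is restored when the escape is undone

percent = quote("100%")
print(percent)            # the percent sign itself must be escaped
```

quote() leaves unreserved characters alone and, by default, also leaves “/” unescaped so that it can be applied to whole paths.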

Figure 16-10 shows how the conceptual characters for a URI are turned into code

bytes for the characters, in the current character set. When the URI is needed for

processing, the escapes are undone, yielding the underlying ASCII code bytes.

Internally, HTTP applications should transport and forward URIs with the escapes

in place. HTTP applications should unescape the URIs only when the data is needed.

And, more importantly, the applications should ensure that no URI ever is unes-

caped twice, because percent signs that might have been encoded in an escape will

themselves be unescaped, leading to loss of data.
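The data loss from unescaping twice is easy to demonstrate. In this sketch, the original data is the literal three-character string “%41” (an assumption chosen for the example), handled with Python's standard urllib.parse functions.

```python
from urllib.parse import quote, unquote

original = "%41"            # literal text: a percent sign, then "4" and "1"
escaped = quote(original)   # the percent sign is escaped for transport

once = unquote(escaped)     # correct: recovers the original text
twice = unquote(once)       # wrong: the recovered "%41" is read as an escape

print(escaped, once, twice)
```

After the second pass, the original three characters have collapsed into a single different character, and there is no way to tell from the result that anything went wrong.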

Escaping International Characters

Note that escape values should be in the range of US-ASCII codes (0–127). Some applications attempt to use escape values to represent iso-8859-1 extended characters (128–255); for example, web servers might erroneously use escapes to code filenames that contain international characters. This is incorrect and may cause problems with some applications.

Table 16-7. URI character syntax

Character class   Character repertoire
Unreserved        [A-Za-z0-9] | “-” | “_” | “.” | “!” | “~” | “*” | “'” | “(” | “)”
Reserved          “;” | “/” | “?” | “:” | “@” | “&” | “=” | “+” | “$” | “,”
Escape            “%” <HEX> <HEX>

For example, the filename Sven Ölssen.html (containing an umlaut) might be

encoded by a web server as Sven%20%D6lssen.html. It’s fine to encode the space

with %20, but is technically illegal to encode the Ö with %D6, because the code D6

(decimal 214) falls outside the range of ASCII. ASCII defines only codes up to 0x7F

(decimal 127).
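The practical problem is that the escaped bytes don't say which character set they came from. The sketch below uses Python's standard urllib.parse functions to escape the same filename under two different encodings. The encoding choices are assumptions made to illustrate the ambiguity, not a recommendation.

```python
from urllib.parse import quote, unquote

name = "Sven \u00d6lssen.html"  # U+00D6 is the capital O with an umlaut

as_latin1 = quote(name, encoding="iso-8859-1")
as_utf8 = quote(name, encoding="utf-8")
print(as_latin1)
print(as_utf8)

# A receiver that guesses the wrong encoding does not get the name back.
print(unquote(as_latin1, encoding="iso-8859-1") == name)
print(unquote(as_latin1, encoding="utf-8") == name)
```

The two escaped forms differ, and nothing in either URL tells the receiver which one it has been given.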

Modal Switches in URIs

Some URIs also use sequences of ASCII characters to represent characters in other

character sets. For example, iso-2022-jp encoding might be used to insert “ESC ( J”

to shift into JIS-Roman and “ESC ( B” to shift back to ASCII. This works in some

local circumstances, but the behavior is not well defined, and there is no standard-

ized scheme to identify the particular encoding used for the URL. As the authors of

RFC 2396 say:

For original character sequences that contain non-ASCII characters, however, the situ-

ation is more difficult. Internet protocols that transmit octet sequences intended to

represent character sequences are expected to provide some way of identifying the

charset used, if there might be more than one [RFC2277].

However, there is currently no provision within the generic URI syntax to accomplish

this identification. An individual URI scheme may require a single charset, define a

default charset, or provide a way to indicate the charset used. It is expected that a sys-

tematic treatment of character encoding within URI will be developed as a future mod-

ification of this specification.

Currently, URIs are not very international-friendly. The goal of URI portability out-

weighed the goal of language flexibility. There are efforts currently underway to

internationalize URIs, but in the near term, HTTP applications should stick with

ASCII. It’s been around since 1968, so it can’t be all that bad.

Figure 16-10. URI characters are transported as escaped code bytes but processed unescaped

[Figure: three columns labeled “Conceptual characters,” “URI code bytes,” and “Unescaped ASCII code byte.” The external form (email, web, billboard, radio) “Big Sale at Joe’s” corresponds to the URI http://www.joes-hardware.com/big%20sale.txt, which is what you enter and send in the current character set (..., o=111, m=109, /=47, b=98, i=105, g=103, %=37, 2=50, 0=48, s=115, ...). What you process is the unescaped form, in the US-ASCII character set.]

Other Considerations

This section discusses a few other things you should keep in mind when writing

international HTTP applications.

Headers and Out-of-Spec Data

HTTP headers must consist of characters from the US-ASCII character set. How-

ever, not all clients and servers implement this correctly, so you may on occasion

receive illegal characters with code values larger than 127.

Many HTTP applications use operating-system and library routines for processing

characters (for example, the Unix ctype character classification library). Not all of

these libraries support character codes outside of the ASCII range (0–127).

In some circumstances (generally, with older implementations), these libraries may

return improper results or crash the application when given non-ASCII characters.

Carefully read the documentation for your character classification libraries before

using them to process HTTP messages, in case the messages contain illegal data.
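One defensive approach, sketched below in Python, is to scan the raw header bytes for values above 127 before handing them to any character-classification routine. The function name find_non_ascii and the X-Owner header are hypothetical and exist only for this example. What to do with offending data (reject, strip, or log it) is left to the application.

```python
def find_non_ascii(header_line: bytes):
    """Return (offset, byte value) pairs for bytes above 127 in a raw header line."""
    return [(i, b) for i, b in enumerate(header_line) if b > 127]

good = b"Content-Language: fr"
# Hypothetical out-of-spec header carrying an iso-8859-1 character.
bad = "X-Owner: Sven Ölssen".encode("latin-1")

print(find_non_ascii(good))  # []
print(find_non_ascii(bad))   # [(14, 214)]
```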

Dates

The HTTP specification clearly defines the legal GMT date formats, but be aware

that not all web servers and clients follow the rules. For example, we have seen web

servers send invalid HTTP Date headers with months expressed in local languages.

HTTP applications should attempt to be tolerant of out-of-spec dates, and not crash

on receipt, but they may not always be able to interpret all dates sent. If the date is

not parseable, servers should treat it conservatively.
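A tolerant parser might look something like the Python sketch below. It relies on parsedate_to_datetime from the standard email.utils module and simply reports failure instead of raising. The wrapper name parse_http_date is an assumption of this example, as is the choice to signal an unparseable date with None. The second sample date, with a non-English month name, is made up to mimic the kind of out-of-spec header described above.

```python
from email.utils import parsedate_to_datetime

def parse_http_date(value):
    """Return a datetime for a Date header value, or None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Depending on the Python version, a bad date surfaces as either exception.
        return None

ok = parse_http_date("Sun, 10 Dec 2000 22:13:40 GMT")
bad = parse_http_date("Sun, 10 Dez 2000 22:13:40 GMT")  # month in a local language

print(ok is not None)  # True
print(bad)             # None
```

A caller that receives None can then fall back to a conservative choice, such as treating the message as if it carried no usable date.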

Domain Names

DNS doesn’t currently support international characters in domain names. There are

standards efforts under way to support multilingual domain names, but they have

not yet been widely deployed.

For More Information

The very success of the World Wide Web means that HTTP applications will con-

tinue to exchange more and more content in different languages and character sets.

For more information on the important but slightly complex topic of multilingual

multimedia, please refer to the following sources.

Appendixes

• IANA-registered charset tags are listed in Table H-1.

• IANA-registered language tags are shown in Table G-1.

• ISO 639 language codes are shown in Table G-2.

• ISO 3166 country codes are shown in Table G-3.

Internet Internationalization

http://www.w3.org/International/

“Making the WWW Truly World Wide”—the W3C Internationalization and

Localization web site.

http://www.ietf.org/rfc/rfc2396.txt

RFC 2396, “Uniform Resource Identifiers (URI): Generic Syntax,” is the defin-

ing document of URIs. This document includes sections describing character set

restrictions for international URIs.

CJKV Information Processing

Ken Lunde, O’Reilly & Associates, Inc. CJKV is the bible of Asian electronic

character processing. Asian character sets are varied and complex, but this book

provides an excellent introduction to the standards technologies for large charac-

ter sets.

http://www.ietf.org/rfc/rfc2277.txt

RFC 2277, “IETF Policy on Character Sets and Languages,” documents the cur-

rent policies being applied by the Internet Engineering Steering Group (IESG)

toward the standardization efforts in the Internet Engineering Task Force (IETF)

in order to help Internet protocols interchange data in multiple languages and

characters.

International Standards

http://www.iana.org/numbers.htm

The Internet Assigned Numbers Authority (IANA) contains repositories of regis-

tered names and numbers. The “Protocol Numbers and Assignments Directory”

contains records of registered character sets for use on the Internet. Because

much work on international communications falls under the domain of the ISO,

and not the Internet community, the IANA listings are not exhaustive.

http://www.ietf.org/rfc/rfc3066.txt

RFC 3066, “Tags for the Identification of Languages,” describes language tags,

their values, and how to construct them.

“Codes for the representation of names of languages”

ISO 639:1988 (E/F), The International Organization for Standardization, first

edition.

“Codes for the representation of names of languages—Part 2: Alpha-3 code”

ISO 639-2:1998, Joint Working Group of ISO TC46/SC4 and ISO TC37/SC2,

first edition.

“Codes for the representation of names of countries”

ISO 3166:1988 (E/F), The International Organization for Standardization, third

edition.

CHAPTER 17

Content Negotiation and Transcoding

Often, a single URL may need to correspond to different resources. Take the case of

a web site that wants to offer its content in multiple languages. If a site such as Joe’s

Hardware has both French- and English-speaking users, it might want to offer its

web site in both languages. However, if this is the case, when one of Joe’s customers

requests “http://www.joes-hardware.com,” which version should the server send?

French or English?

Ideally, the server will send the English version to an English speaker and the French

version to a French speaker—a user could go to Joe’s Hardware’s home page and get

content in the language he speaks. Fortunately, HTTP provides content-negotiation

methods that allow clients and servers to make just such determinations. Using these

methods, a single URL can correspond to different resources (e.g., a French and

English version of the same web page). These different versions are called variants.

Servers also can make other types of decisions about what content is best to send to a

client for a particular URL. In some cases, servers even can automatically generate

customized pages—for instance, a server can convert an HTML page into a WML

page for your handheld device. These kinds of dynamic content transformations are

called transcodings. They are done in response to content negotiation between HTTP

clients and servers.

In this chapter, we will discuss content negotiation and how web applications go

about their content-negotiation duties.

Content-Negotiation Techniques

There are three distinct methods for deciding which page at a server is the right one

for a client: present the choice to the client, decide automatically at the server, or ask

an intermediary to select. These three techniques are called client-driven negotiation,

server-driven negotiation, and transparent negotiation, respectively (see Table 17-1).

In this chapter, we will look at the mechanics of each technique as well as their

advantages and disadvantages.

Client-Driven Negotiation

The easiest thing for a server to do when it receives a client request is to send back a

response listing the available pages and let the client decide which one it wants to

see. This, of course, is the easiest to implement at the server and is likely to result in

the best copy being selected (provided that the list has enough information to allow

the client to pick the right copy). The disadvantage is that two requests are needed

for each page—one to get the list and a second to get the selected copy. This is a

slow and tedious process, and it’s likely to become annoying to the client.

Mechanically, there are actually two ways for servers to present the choices to the cli-

ent for selection: by sending back an HTML document with links to the different ver-

sions of the page and descriptions of each of the versions, or by sending back an

HTTP/1.1 response with the 300 Multiple Choices response code. The client

browser may receive this response and display a page with the links, as in the first

method, or it may pop up a dialog window asking the user to make a selection. In

any case, the decision is made manually at the client side by the browser user.
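To make the second method concrete, here is a minimal Python sketch that assembles a 300 Multiple Choices response whose body lists the available versions. The function name and the link descriptions are hypothetical, and the sketch assumes that URLs and descriptions are plain ASCII. It illustrates the shape of such a response rather than any particular server's output.

```python
from html import escape

def multiple_choices_response(variants):
    """Build a 300 Multiple Choices response listing (url, description) pairs."""
    items = "\n".join(
        f'<li><a href="{escape(url)}">{escape(description)}</a></li>'
        for url, description in variants
    )
    body = f"<html><body><ul>\n{items}\n</ul></body></html>\n"
    headers = [
        "HTTP/1.1 300 Multiple Choices",
        "Content-Type: text/html",
        f"Content-Length: {len(body.encode('ascii'))}",  # assumes an ASCII body
    ]
    # Header lines end with CRLF; a blank line separates headers from the body.
    return "\r\n".join(headers) + "\r\n\r\n" + body

response = multiple_choices_response([
    ("http://www.joes-hardware.com/english", "English version"),
    ("http://www.joes-hardware.com/french", "French version"),
])
print(response)
```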

In addition to the increased latency and annoyance of multiple requests per page,

this method has another drawback: it requires multiple URLs—one for the main

page and one for each specific page. So, if the original request was for www.joes-

hardware.com, Joe’s server may respond with a page that has links to www.joes-

hardware.com/english and www.joes-hardware.com/french. Should clients now book-

mark the original main page or the selected ones? Should they tell their friends

about the great web site at www.joes-hardware.com or tell only their English-speak-

ing friends about the web site at www.joes-hardware.com/english?

Table 17-1. Summary of content-negotiation techniques

Client-driven
    How it works: Client makes a request, server sends list of choices to client, client chooses.
    Advantages: Easiest to implement at server side. Client can make best choice.
    Drawbacks: Adds latency: at least two requests are needed to get the correct content.

Server-driven
    How it works: Server examines client’s request headers and decides what version to serve.
    Advantages: Quicker than client-driven negotiation. HTTP provides a q-value mechanism to allow servers to make approximate matches and a Vary header for servers to tell downstream devices how to evaluate requests.
    Drawbacks: If the decision is not obvious (headers don’t match up), the server must guess.

Transparent
    How it works: An intermediate device (usually a proxy cache) does the request negotiation on the client’s behalf.
    Advantages: Offloads the negotiation from the web server. Quicker than client-driven negotiation.
    Drawbacks: No formal specifications for how to do transparent negotiation.

Server-Driven Negotiation

Client-driven negotiation has several drawbacks, as discussed in the previous sec-

tion. Most of these drawbacks center around the increased communication between

the client and server to decide on the best page in response to a request. One way to

reduce this extra communication is to let the server decide which page to send

back—but to do this, the client must send enough information about its preferences

to allow the server to make an informed decision. The server gets this information

from the client’s request headers.

There are two mechanisms that HTTP servers use to evaluate the proper response to

send to a client:

• Examining the set of content-negotiation headers. The server looks at the client’s

Accept headers and tries to match them with corresponding response headers.

• Varying on other (non–content-negotiation) headers. For example, the server

could send responses based on the client’s User-Agent header.

These two mechanisms are explained in more detail in the following sections.

Content-Negotiation Headers

Clients may send their preference information using the set of HTTP headers listed

in Table 17-2.

Notice how similar these headers are to the entity headers discussed in Chapter 15.

However, there is a clear distinction between the purposes of the two types of head-

ers. As mentioned in Chapter 15, entity headers are like shipping labels—they spec-

ify attributes of the message body that are necessary during the transfer of messages

from the server to the client. Content-negotiation headers, on the other hand, are

used by clients and servers to exchange preference information and to help choose

between different versions of a document, so that the one most closely matching the

client’s preferences is served.

Servers match clients’ Accept headers with the corresponding entity headers, listed in

Table 17-3.

Table 17-2. Accept headers

Header             Description
Accept             Used to tell the server what media types are okay to send
Accept-Language    Used to tell the server what languages are okay to send
Accept-Charset     Used to tell the server what charsets are okay to send
Accept-Encoding    Used to tell the server what encodings are okay to send

Note that because HTTP is a stateless protocol (meaning that servers do not keep

track of client preferences across requests), clients must send their preference infor-

mation with every request.

If Joe’s English-speaking and French-speaking clients each sent an Accept-Language header specifying the language in which they were interested, the server could decide which copy of www.joes-hardware.com to send back to each client. Letting the server automatically pick which document to send back reduces the latency associated with the back-and-forth communication required by the client-driven model.

However, say that one of the clients prefers Spanish. Which version of the page

should the server send back? English or French? The server has just two choices:

either guess, or fall back on the client-driven model and ask the client to choose.

However, if the Spaniard happens to understand some English, he might choose the

English page—it wouldn’t be ideal, but it would do. In this case, the Spaniard needs

the ability to pass on more information about his preferences, conveying that he does

have minimal knowledge of English and that, in a pinch, English will suffice.

Fortunately, HTTP does provide a mechanism for letting clients like our Spaniard

give richer descriptions of their preferences, using quality values (“q values” for short).

Content-Negotiation Header Quality Values

The HTTP protocol defines quality values to allow clients to list multiple choices for

each category of preference and associate an order of preference with each choice.

For example, clients can send an Accept-Language header of the form:

Accept-Language: en;q=0.5, fr;q=0.0, nl;q=1.0, tr;q=0.0

The q values can range from 0.0 to 1.0 (with 0.0 being the lowest preference

and 1.0 being the highest). The header above, then, says that the client prefers to

receive a Dutch (nl) version of the document, but an English (en) version will do.

Under no circumstances does the client want a French (fr) or Turkish (tr) version,

though. Note that the order in which the preferences are listed is not important; only

the q values associated with them are.
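The following Python sketch shows one way a server might read such a header and choose among the languages it has on hand. It is deliberately simplified: the function names are our own, tags are compared by exact match only, and wildcard (*) entries and prefix matching of language ranges are ignored. Treating a malformed q value as 0.0 is also an assumption of this example.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into {language tag: q value}."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        q = 1.0  # a tag listed without a q value defaults to the highest preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: a malformed q value means "not acceptable"
        prefs[tag] = q
    return prefs

def pick_language(header, available):
    """Return the available language with the highest nonzero q value, or None."""
    prefs = parse_accept_language(header)
    acceptable = [lang for lang in available if prefs.get(lang, 0.0) > 0.0]
    return max(acceptable, key=lambda lang: prefs[lang], default=None)

header = "en;q=0.5, fr;q=0.0, nl;q=1.0, tr;q=0.0"
print(pick_language(header, ["en", "fr"]))  # en
print(pick_language(header, ["en", "nl"]))  # nl
print(pick_language(header, ["fr", "tr"]))  # None
```

When pick_language returns None, the server is in the situation described next: nothing it has matches the client's preferences.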

Occasionally, the server may not have any documents that match any of the client’s

preferences. In this case, the server may change or transcode the document to match

the client’s preferences. This mechanism is discussed later in this chapter.

Table 17-3. Accept and matching document headers

Accept header      Entity header
Accept             Content-Type
Accept-Language    Content-Language
Accept-Charset     Content-Type
Accept-Encoding    Content-Encoding

Varying on Other Headers

Servers also can attempt to match up responses with other client request headers,

such as User-Agent. Servers may know that old versions of a browser do not support

JavaScript, for example, and may therefore send back a version of the page that does

not contain JavaScript.

In this case, there is no q-value mechanism to look for approximate “best” matches.

The server either looks for an exact match or simply serves whatever it has (depend-

ing on the implementation of the server).

Because caches must attempt to serve correct “best” versions of cached documents,

the HTTP protocol defines a Vary header that the server sends in responses; the Vary

header tells caches (and clients, and any downstream proxies) which headers the

server is using to determine the best version of the response to send. The Vary header

is discussed in more detail later in this chapter.

Content Negotiation on Apache

Here is an overview of how the Apache web server supports content negotiation. It is

up to the web site content provider—Joe, for example—to provide different versions

of Joe’s index page. Joe must put all his index page files in the appropriate directory

on the Apache server corresponding to his web site. There are two ways to enable

content negotiation:

• In the web site directory, create a type-map file for each URI in the web site that

has variants. The type-map file lists all the variants and the content-negotiation

headers to which they correspond.

• Enable the MultiViews option, which causes Apache to create type-map files

for the directory automatically.

Using type-map files

The Apache server needs to know what type-map files look like. To configure this,

set a handler in the server configuration file that specifies the file suffix for type-map

files. For example:

AddHandler type-map .var

This line indicates that files with the extension .var are type-map files.

Here is a sample type-map file:

URI: joes-hardware.html

URI: joes-hardware.en.html
Content-type: text/html
Content-language: en

URI: joes-hardware.fr.de.html
Content-type: text/html;charset=iso-8859-2
Content-language: fr, de

From this type-map file, the Apache server knows to send joes-hardware.en.html to

clients requesting English and joes-hardware.fr.de.html to clients requesting French or German.

Quality values also are supported; see the Apache server documentation.
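To show how the records in a type-map file relate to a client's language preference, here is a small Python sketch that parses the sample above and looks up a variant by language. It is not how Apache itself parses or ranks variants. The function names are hypothetical, the first matching record wins, and quality values are ignored.

```python
def parse_type_map(text):
    """Split type-map text into records, each a dict keyed by lowercase header name."""
    records = []
    for block in text.strip().split("\n\n"):  # blank lines separate records
        record = {}
        for line in block.splitlines():
            name, _, value = line.partition(":")
            record[name.strip().lower()] = value.strip()
        if record:
            records.append(record)
    return records

def variant_for_language(records, language):
    """Return the URI of the first record whose Content-language lists language."""
    for record in records:
        listed = [tag.strip() for tag in record.get("content-language", "").split(",")]
        if language in listed:
            return record["uri"]
    return None

TYPE_MAP = """\
URI: joes-hardware.html

URI: joes-hardware.en.html
Content-type: text/html
Content-language: en

URI: joes-hardware.fr.de.html
Content-type: text/html;charset=iso-8859-2
Content-language: fr, de
"""

records = parse_type_map(TYPE_MAP)
print(variant_for_language(records, "en"))  # joes-hardware.en.html
print(variant_for_language(records, "fr"))  # joes-hardware.fr.de.html
print(variant_for_language(records, "es"))  # None
```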

Using MultiViews

To use MultiViews, you must enable it for the directory containing the web site, using

an Options directive in the appropriate section of the access.conf file (<Directory>,

<Location>, or <Files>).

If MultiViews is enabled and a browser requests a resource named joes-hardware, the

server looks for all files with “joes-hardware” in the name and creates a type-map file

for them. Based on the names, the server guesses the appropriate content-negotiation

headers to which the files correspond. For example, a French-language version of

joes-hardware should contain .fr.
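A rough Python sketch of this kind of filename-based guessing follows. It is not Apache's actual algorithm. The KNOWN_LANGUAGES set and the function name guess_languages are assumptions made purely for illustration.

```python
# Assumption: a small, hard-coded set of language extensions for illustration.
KNOWN_LANGUAGES = {"en", "fr", "de", "es"}

def guess_languages(filename, resource):
    """Guess content languages from the extensions that follow the resource name."""
    prefix = resource + "."
    if not filename.startswith(prefix):
        return None  # not a variant of this resource
    extensions = filename[len(prefix):].split(".")
    return [ext for ext in extensions if ext in KNOWN_LANGUAGES]

print(guess_languages("joes-hardware.fr.html", "joes-hardware"))     # ['fr']
print(guess_languages("joes-hardware.fr.de.html", "joes-hardware"))  # ['fr', 'de']
print(guess_languages("index.html", "joes-hardware"))                # None
```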

Server-Side Extensions

Another way to implement content negotiation at the server is by server-side exten-

sions, such as Microsoft’s Active Server Pages (ASP). See Chapter 8 for an overview

of server-side extensions.

Transparent Negotiation

Transparent negotiation seeks to move the load of server-driven negotiation away

from the server, while minimizing message exchanges with the client by having an

intermediary proxy negotiate on behalf of the client. The proxy is assumed to have

knowledge of the client’s expectations and be capable of performing the negotia-

tions on its behalf (the proxy has received the client’s expectations in the request for

content). To support transparent content negotiation, the server must be able to tell

proxies what request headers the server examines to determine the best match for the

client’s request. The HTTP/1.1 specification does not define any mechanisms for

transparent negotiation, but it does define the Vary header. Servers send Vary head-

ers in their responses to tell intermediaries what request headers they use for content

negotiation.

Caching proxies can store different copies of documents accessed via a single URL. If

servers communicate their decision-making processes to caches, the caches can nego-

tiate with clients on behalf of the servers. Caches also are great places to transcode

content, because a general-purpose transcoder deployed in a cache can transcode

content from any server, not just one. Transcoding of content at a cache is illus-

trated in Figure 17-3 and discussed in more detail later in the chapter.

Caching and Alternates

Caching of content assumes that the content can be reused later. However, caches

must employ much of the decision-making logic that servers do when sending back a

response, to ensure that they send back the correct cached response to a client request.

The previous section described the Accept headers sent by clients and the correspond-

ing entity headers that servers match them up against in order to choose the best

response to each request. Caches must use these same headers to decide which cached

response to send back.

Figure 17-1 illustrates both a correct and incorrect sequence of operations involving

a cache. The first request results in the cache forwarding the request to the server

and storing the response. The second request is looked up by the cache, and a document

matching the URL is found. This document, however, is in French, and the

requestor wants a Spanish document. If the cache just sends back the French docu-

ment to the requestor, it will be behaving incorrectly.

The cache must therefore forward the second request to the server as well, and store both the response and an “alternate” response for that URL. The cache now has two different documents for the same URL, just as the server does. These different versions are called variants or alternates. Content negotiation can be thought of as the process of selecting, from the variants, the best match for a client request.

Figure 17-1. Caches use content-negotiation headers to send back correct responses to clients

[Figure: a French-speaking user and a Spanish-speaking user each send “GET / HTTP/1.1” for www.joes-hardware.com through the same cache, with “Accept-language: fr;q=1.0” and “Accept-language: es;q=1.0,” respectively. The cache holds the French page (“Bonjour!”) from the first request. It must not return that page to the Spanish-speaking user, who should receive the Spanish page (“Hola!”) from the web server.]

The Vary Header

Here’s a typical set of request and response headers from a browser and server:

GET http://www.joes-hardware.com/ HTTP/1.0
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.73 [en] (WinNT; U)
Host: www.joes-hardware.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en, pdf
Accept-Charset: iso-8859-1, *, utf-8

HTTP/1.1 200 OK
Date: Sun, 10 Dec 2000 22:13:40 GMT
Server: Apache/1.3.12 OpenSSL/0.9.5a (Unix) FrontPage/4.0.4.3
Last-Modified: Fri, 05 May 2000 04:42:52 GMT
Etag: "1b7ddf-48-3912514c"
Accept-Ranges: Bytes
Content-Length: 72
Connection: close
Content-Type: text/html

What happens, however, if the server’s decision was based on headers other than the

Accept headers, such as the User-Agent header? This is not as radical as it may

sound. Servers may know that old versions of a browser do not support JavaScript,

for example, and may therefore send back a version of the page that does not have

JavaScript in it. If servers are using other headers to make their decisions about

which pages to send back, caches must know what those headers are, so that they

can perform parallel logic in choosing which cached page to send back.

The HTTP Vary response header lists all of the client request headers that the server

considers to select the document or generate custom content (in addition to the regu-

lar content-negotiation headers). For example, if the served document depends on the

User-Agent header, the Vary header must include “User-Agent”.

When a new request arrives, the cache finds the best match using the content-negoti-

ation headers. Before it can serve this document to the client, however, it must see

whether the server sent a Vary header in the cached response. If a Vary header is

present, the header values for the headers in the new request must match the header

values in the old, cached request. Because servers may vary their responses based on

client request headers, caches must store both the client request headers and the corresponding

server response headers with each cached variant, in order to implement

transparent negotiation. This is illustrated in Figure 17-2.

If a server’s Vary header looked like this, the huge number of different User-Agent

and Cookie values could generate many variants:

Vary: User-Agent, Cookie

A cache would have to store each document version corresponding to each variant.

When the cache does a lookup, it first does content matching with the content-nego-

tiation headers, then matches the request’s variant with cached variants. If there is no

match, the cache fetches the document from the origin server.
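The Vary comparison can be sketched in Python as follows. This toy cache covers only the Vary step: it does not perform the initial content-negotiation matching, expiration, or validation, and the class and method names are hypothetical. The sample headers echo those shown in Figure 17-2.

```python
def _lower(headers):
    """Normalize header names to lowercase for case-insensitive comparison."""
    return {name.lower(): value for name, value in headers.items()}

class VaryingCache:
    """Toy cache storing each variant with the request headers that produced it."""

    def __init__(self):
        self._variants = {}  # url -> list of (request headers, response headers, body)

    def store(self, url, request_headers, response_headers, body):
        self._variants.setdefault(url, []).append(
            (_lower(request_headers), _lower(response_headers), body)
        )

    def lookup(self, url, request_headers):
        request = _lower(request_headers)
        for cached_request, response, body in self._variants.get(url, []):
            names = [n.strip().lower() for n in response.get("vary", "").split(",")
                     if n.strip()]
            if "*" in names:
                continue  # "Vary: *" cannot be matched from request headers alone
            # Every header named in Vary must have the same value as in the cached request.
            if all(request.get(n) == cached_request.get(n) for n in names):
                return body
        return None  # miss: the caller must fetch from the origin server

cache = VaryingCache()
url = "http://www.joes-hardware.com/"
cache.store(
    url,
    {"User-Agent": "spiffy multimedia browser", "Accept-Language": "fr;q=1.0"},
    {"Content-Language": "fr", "Vary": "User-agent"},
    "Bonjour [...media-rich content]",
)

wimpy = {"User-Agent": "wimpy wireless device", "Accept-Language": "fr;q=1.0"}
spiffy = {"User-Agent": "spiffy multimedia browser", "Accept-Language": "fr;q=1.0"}
print(cache.lookup(url, wimpy))   # None
print(cache.lookup(url, spiffy))  # Bonjour [...media-rich content]
```

Storing the original request headers alongside each response is what makes the comparison possible. Without them, the cache would have no way to tell which User-Agent value a cached variant was generated for.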

Figure 17-2. If servers vary on specific request headers, caches must match those request headers in addition to the regular content-negotiation headers before sending back cached responses

[Figure: French-speaking user 1 sends “GET / HTTP/1.1” for www.joes-hardware.com with “User-agent: spiffy multimedia browser” and “Accept-language: fr;q=1.0.” The web server returns a media-rich French page with “Content-language: fr” and “Vary: User-agent,” which the cache stores. French-speaking user 2 sends the same request with “User-agent: wimpy wireless device.” The cache has a French copy, but it was generated for the spiffy browser, so the cache asks the web server for a French version for the wireless browser and receives a simple text page.]

Transcoding

We have discussed in some detail the mechanism by which clients and servers can choose between a set of documents for a URL and send the one that best matches the client’s needs. These mechanisms rely on the presence of documents that match the client’s needs, whether they match the needs perfectly or not so well.

What happens, however, when a server does not have a document that matches the

client’s needs at all? The server may have to respond with an error, but theoretically,

the server may be able to transform one of its existing documents into something

that the client can use. This option is called transcoding.

Table 17-4 lists some hypothetical transcodings.

There are three categories of transcoding: format conversion, information synthesis,

and content injection.

Format Conversion

Format conversion is the transformation of data from one format to another to make it

viewable by a client. A wireless device seeking to access a document typically viewed

by a desktop client may be able to do so with an HTML-to-WML conversion. A client

accessing a web page over a slow link that is not very interested in high-resolution

images may be able to view an image-rich page more easily if the images are reduced

in size and resolution by converting them from color to black and white and shrink-

ing them.

Format conversion is driven by the content-negotiation headers listed in Table 17-2,

although it may also be driven by the User-Agent header. Note that content transfor-

mation or transcoding is different from content encoding or transfer encoding, in

that the latter two typically are used for more efficient or safe transport of content,

whereas the former is used to make content viewable on the access device.

Information Synthesis

The extraction of key pieces of information from a document—known as informa-

tion synthesis—can be a useful transcoding process. A simple example of this is the

generation of an outline of a document based on section headings, or the removal of

advertisements and logos from a page.
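As a small illustration of the outline example, the Python sketch below collects the section headings of an HTML page using the standard html.parser module. The class and function names are hypothetical and the sample page is made up. A real transcoder would need to cope with far messier markup than this handles.

```python
from html.parser import HTMLParser

class OutlineBuilder(HTMLParser):
    """Collect (level, text) pairs for the h1 through h6 headings in a page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None

def outline(html):
    """Return the heading outline of an HTML document as a list of (level, text)."""
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.outline

page = "<h1>Joe's Hardware</h1><p>Welcome!</p><h2>Big Sale</h2><p>Tools.</p>"
print(outline(page))  # [(1, "Joe's Hardware"), (2, 'Big Sale')]
```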

Table 17-4. Hypothetical transcodings

Before                         After
HTML document                  WML document
High-resolution image          Low-resolution image
Image in 64K colors            Black-and-white image
Complex page with frames       Simple text page without frames or images
HTML page with Java applets    HTML page without Java applets
Page with ads                  Page with ads removed

More sophisticated technologies that categorize pages based on keywords in content

also are useful in summarizing the essence of a document. This technology often is

used by automatic web page–classification systems, such as web-page directories at

portal sites.

Content Injection

The two categories of transcodings described so far typically reduce the amount of

content in web documents, but there is another category of transformations that

increases the amount of content: content-injection transcodings. Examples of content-

injection transcodings are automatic ad generators and user-tracking systems.

Imagine the appeal (and offence) of an ad-insertion transcoder that automatically

adds advertisements to each HTML page as it goes by. Transcoding of this type has to

be dynamic—it must be done on the fly in order to be effective in adding ads that cur-

rently are relevant or somehow have been targeted for a particular user. User-tracking

systems also can be built to add content to pages dynamically, for the purpose of col-

lecting statistics about how the page is viewed and how clients surf the Web.

Transcoding Versus Static Pregeneration

An alternative to transcodings is to build different copies of web pages at the web

server—for example, one with HTML, one with WML, one with high-resolution

images, one with low-resolution images, one with multimedia content, and one with-

out. This, however, is not a very practical technique, for many reasons: any small

change in a page requires multiple pages to be modified, more space is necessary to

store all the different versions of each page, and it’s harder to catalog pages and pro-

gram web servers to serve the right ones. Some transcodings, such as ad insertion

(especially targeted ad insertion), cannot be done statically—the ad inserted will

depend upon the user requesting the page.

An on-the-fly transformation of a single root page can be an easier solution than static

pregeneration. It can come, however, at the cost of increased latency in serving the

content. Some of this computation can, however, be done by a third party, thereby off-

loading the computation from the web server—the transformation can be done by an

external agent at a proxy or cache. Figure 17-3 illustrates transcoding at a proxy cache.

Next Steps

The story of content negotiation does not end with the Accept and Content headers,

for a couple of reasons:

• Content negotiation in HTTP incurs some performance limits. Searching through

many variants for appropriate content, or trying to “guess” the best match, can

be costly. Are there ways to streamline and focus the content-negotiation proto-

col? RFCs 2295 and 2296 attempt to address this question for transparent HTTP

content negotiation.

• HTTP is not the only protocol that needs to do content negotiation. Streaming

media and fax are two other examples where client and server need to discuss

the best answer to the client’s request. Can a general content-negotiation proto-

col be developed on top of TCP/IP application protocols? The Content Negotia-

tion Working Group was formed to tackle this question. The group is now

closed, but it contributed several RFCs. See the next section for a link to the

group’s web site.

For More Information

The following Internet drafts and online documentation can give you more details

about content negotiation:

http://www.ietf.org/rfc/rfc2616.txt

RFC 2616, “Hypertext Transfer Protocol -- HTTP/1.1,” is the official specification for HTTP/1.1, the current version of the HTTP protocol. The specification is a well-written, well-organized, detailed reference for HTTP, but it isn’t ideal for readers who want to learn the underlying concepts and motivations of HTTP or the differences between theory and practice. We hope that this book fills in the underlying concepts, so you can make better use of the specification.

Figure 17-3. Content transformation or transcoding at a proxy cache

[Figure: a French-speaking user on a “wimpy wireless device” sends “GET / HTTP/1.1” for www.joes-hardware.com with “Accept-language: fr;q=1.0”. The cache holds a media-rich French copy of the document, strips out the multimedia content, and returns simple text content with “Content-language: fr” and “Vary: User-agent”. It then stores the transformed copy as an alternate in case someone else wants it as well.]

http://search.ietf.org/rfc/rfc2295.txt

RFC 2295, “Transparent Content Negotiation in HTTP,” is a memo describing a

transparent content-negotiation protocol on top of HTTP. The status of this

memo remains experimental.

http://search.ietf.org/rfc/rfc2296.txt

RFC 2296, “HTTP Remote Variant Selection Algorithm—RVSA 1.0,” is a memo

describing an algorithm for the transparent selection of the “best” content for a

particular HTTP request. The status of this memo remains experimental.

http://search.ietf.org/rfc/rfc2936.txt

RFC 2936, “HTTP MIME Type Handler Detection,” is a memo describing an approach for determining the actual MIME type handlers that a browser supports. This approach can help if the Accept header is not specific enough.

http://www.imc.org/ietf-medfree/index.htm

This is a link to the Content Negotiation (CONNEG) Working Group, which

looked into transparent content negotiation for HTTP, fax, and print. This

group is now closed.

PART V

Content Publishing and Distribution

Part V covers the technology for publishing and disseminating web content:

• Chapter 18, Web Hosting, discusses the ways people deploy servers in modern web hosting environments, HTTP support for virtual web hosting, and how to replicate content across geographically distant servers.

• Chapter 19, Publishing Systems, discusses the technologies for creating web content and installing it onto web servers.

• Chapter 20, Redirection and Load Balancing, surveys the tools and techniques for distributing incoming web traffic among a collection of servers.

• Chapter 21, Logging and Usage Tracking, covers log formats and common questions.

CHAPTER 18

Web Hosting

When you place resources on a public web server, you make them available to the Internet community. These resources can be as simple as text files or images, or as complicated as real-time driving maps or e-commerce shopping gateways. It’s critical that this rich variety of resources, owned by different organizations, can be conveniently published to web sites and placed on web servers that offer good performance at a fair price.

The collective duties of storing, brokering, and administering content resources are called web hosting. Hosting is one of the primary functions of a web server. You need a server to hold, serve, log access to, and administer your content. If you don’t want to manage the required hardware and software yourself, you need a hosting service, or hoster. Hosters rent you serving and web-site administration services and provide various degrees of security, reporting, and ease of use. Hosters typically pool web sites on heavy-duty web servers for cost-efficiency, reliability, and performance.

This chapter explains some of the most important features of web hosting services

and how they interact with HTTP applications. In particular, this chapter covers:

• How different web sites can be “virtually hosted” on the same server, and how

this affects HTTP

• How to make web sites more reliable under heavy traffic

• How to make web sites load faster

Hosting Services

In the early days of the World Wide Web, individual organizations purchased their own computer hardware, built their own computer rooms, acquired their own network connections, and managed their own web server software.

As the Web quickly became mainstream, everyone wanted a web site, but few people had the skills or time to build air-conditioned server rooms, register domain names, or purchase network bandwidth. To save the day, many new businesses emerged, offering professionally managed web hosting services. Many levels of service are available, from physical facilities management (providing space, air conditioning, and wiring) to full-service web hosting, where all the customer does is provide the content.

This chapter focuses on what the hosting web server provides. Much of what makes a web site work, including its ability to support different languages and to carry out secure e-commerce transactions, depends on what capabilities the hosting web server supports.

A Simple Example: Dedicated Hosting

Suppose that Joe’s Hardware Online and Mary’s Antique Auction both want fairly high-volume web sites. Irene’s ISP has racks and racks full of identical, high-performance web servers that it can lease to Joe and Mary, instead of having Joe and Mary purchase their own servers and maintain the server software.

In Figure 18-1, both Joe and Mary sign up for the dedicated web hosting service

offered by Irene’s ISP. Joe leases a dedicated web server that is purchased and

maintained by Irene’s ISP. Mary gets a different dedicated server from Irene’s ISP.

Irene’s ISP gets to buy server hardware in volume and can select hardware that is

reliable, time-tested, and low-cost. If either Joe’s Hardware Online or Mary’s

Antique Auction grows in popularity, Irene’s ISP can offer Joe or Mary additional

servers immediately.

In this example, browsers send HTTP requests for www.joes-hardware.com to the IP

address of Joe’s server and requests for www.marys-antiques.com to the (different) IP

address of Mary’s server.

Figure 18-1. Outsourced dedicated hosting

[Figure: a client reaches Irene’s ISP over the Internet; Joe and Mary each supply the content for their own sites. Hostnames shown: www.joes-hardware.com, www.cajun-gifts.com, www.marys-antiques.com, www.irenes-isp.com.]

Virtual Hosting

Many folks want to have a web presence but don’t have high-traffic web sites. For these people, providing a dedicated web server may be a waste, because they’re paying many hundreds of dollars a month to lease a server that is mostly idle!

Many web hosters offer lower-cost web hosting services by sharing one computer

between several customers. This is called shared hosting or virtual hosting. Each web

site appears to be hosted by a different server, but they really are hosted on the same

physical server. From the end user’s perspective, virtually hosted web sites should be

indistinguishable from sites hosted on separate dedicated servers.

For cost efficiency, space, and management reasons, a virtual hosting company

wants to host tens, hundreds, or thousands of web sites on the same server—but this

does not necessarily mean that 1,000 web sites are served from only one PC. Hosters

can create banks of replicated servers (called server farms) and spread the load across

the farm of servers. Because each server in the farm is a clone of the others, and hosts

many virtual web sites, administration is much easier. (We’ll talk more about server

farms in Chapter 20.)

When Joe and Mary started their businesses, they might have chosen virtual hosting

to save money until their traffic levels made a dedicated server worthwhile (see

Figure 18-2).

Virtual Server Request Lacks Host Information

Unfortunately, there is a design flaw in HTTP/1.0 that makes virtual hosters pull

their hair out. The HTTP/1.0 specification didn’t give any means for shared web

servers to identify which of the virtual web sites they’re hosting is being accessed.

Figure 18-2. Outsourced virtual hosting

[Figure: a client reaches Irene’s ISP over the Internet; Joe and Mary each supply the content for their own sites. Hostnames shown: www.joes-hardware.com, www.cajun-gifts.com, www.marys-antiques.com, www.irenes-isp.com.]

Recall that HTTP/1.0 requests send only the path component of the URL in the request message. If you try to get http://www.joes-hardware.com/index.html, the browser connects to the server www.joes-hardware.com, but the HTTP/1.0 request says “GET /index.html”, with no further mention of the hostname. If the server is virtually hosting multiple sites, this isn’t enough information to figure out what virtual web site is being accessed. For example, in Figure 18-3:

• If client A tries to access http://www.joes-hardware.com/index.html, the request

“GET /index.html” will be sent to the shared web server.

• If client B tries to access http://www.marys-antiques.com/index.html, the identical request “GET /index.html” will be sent to the shared web server.

As far as the web server is concerned, there is not enough information to determine

which web site is being accessed! The two requests look the same, even though they

are for totally different documents (from different web sites). The problem is that the

web site host information has been stripped from the request.
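To make the ambiguity concrete, here is a minimal Python sketch. The helper build_request is hypothetical (it is not part of any library); it composes the request a client might send once it has connected to the host named in the URL. Without a Host header, the hostname is used only to open the connection, so the request bytes for the two sites come out the same. The optional Host header shows how HTTP/1.1 clients make the two requests distinguishable.

```python
from urllib.parse import urlsplit


def build_request(url: str, send_host: bool = False) -> bytes:
    """Compose a bare GET request for url (illustrative helper only).

    send_host=False mimics a plain HTTP/1.0 client: only the path appears
    on the request line. send_host=True adds a Host header, as HTTP/1.1
    clients do.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    version = "HTTP/1.1" if send_host else "HTTP/1.0"
    lines = [f"GET {path} {version}"]
    if send_host:
        lines.append(f"Host: {parts.netloc}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


joe = build_request("http://www.joes-hardware.com/index.html")
mary = build_request("http://www.marys-antiques.com/index.html")

# The shared server receives byte-for-byte identical HTTP/1.0 requests.
print(joe == mary)        # True

joe_11 = build_request("http://www.joes-hardware.com/index.html", send_host=True)
mary_11 = build_request("http://www.marys-antiques.com/index.html", send_host=True)

# With a Host header, the server can tell the two sites apart.
print(joe_11 != mary_11)  # True
```

A real client would send more headers (such as User-agent), but none of them names the site being requested.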

As we saw in Chapter 6, HTTP surrogates (reverse proxies) and intercepting proxies

also need site-specifying information.

Making Virtual Hosting Work

The missing host information was an oversight in the original HTTP specification,

which mistakenly assumed that each web server would host exactly one web site.

HTTP’s designers didn’t provide support for virtually hosted, shared servers. For this

reason, the hostname information in the URL was viewed as redundant and stripped

away; only the path component was required to be sent.

Figure 18-3. HTTP/1.0 server requests don’t contain hostname information

[Figure: client A, getting http://www.joes-hardware.com/index.html, and client B, getting http://www.marys-antiques.com/index.html, each send “GET /index.html HTTP/1.0” across the Internet to a single server; the two requests differ only in their User-agent headers (SuperBrowser v1.3 and WebSurfer 2000). The server hosts www.voting-info.gov, www.joes-hardware.com, and www.marys-antiques.com, with document trees labeled /voting, /joe, and /mary. HTTP/1.0 requests do not contain hostname information, so they do not support web servers that host multiple web sites. (HTTP/1.1 supports a Host header to fix this problem.)]

Because the early specifications did not make provisions for virtual hosting, web hosters needed to develop workarounds and conventions to support shared virtual hosting. The problem could have been solved simply by requiring all HTTP request messages to send the full URL instead of just the path component. HTTP/1.1 does require servers to handle full URLs in the request lines of HTTP messages, but it will be a long time before all legacy applications are upgraded to this specification. In the meantime, four techniques have emerged:

Virtual hosting by URL path

Adding a special path component to the URL so the server can determine the site.

Virtual hosting by port number

Assigning a different port number to each site, so requests are handled by separate instances of the web server.

Virtual hosting by IP address

Dedicating different IP addresses for different virtual sites and binding all the IP

addresses to a single machine. This allows the web server to identify the site

name by IP address.

Virtual hosting by Host header

Many web hosters pressured the HTTP designers to solve this problem.

Enhanced versions of HTTP/1.0 and the official version of HTTP/1.1 define a

Host request header that carries the site name. The web server can identify the

virtual site from the Host header.

Let’s take a closer look at each technique.

Virtual hosting by URL path

You can use brute force to isolate virtual sites on a shared server by assigning them

different URL paths. For example, you could give each logical web site a special path

prefix:

• Joe’s Hardware store could be http://www.joes-hardware.com/joe/index.html.

• Mary’s Antiques store could be http://www.marys-antiques.com/mary/index.html.

When the requests arrive at the server, the hostname information is not present in

the request, but the server can tell them apart based on the path:

• The request for Joe’s hardware is “GET /joe/index.html”.

• The request for Mary’s antiques is “GET /mary/index.html”.

This is not a good solution. The “/joe” and “/mary” prefixes are redundant and confusing (we already mentioned “joe” in the hostname). Worse, the common convention of specifying http://www.joes-hardware.com or http://www.joes-hardware.com/index.html for the home page won’t work.

In general, URL-based virtual hosting is a poor solution and seldom is used.
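As a rough illustration of path-based dispatch, the sketch below maps the first path segment to a site. The table SITES_BY_PREFIX and the function site_for_path are hypothetical names introduced for this example; they are not taken from any real server. The sketch also shows why the usual home-page URLs fail: a request path without a prefix matches no site.

```python
from typing import Optional, Tuple

# Hypothetical table: first URL path segment -> virtual site.
SITES_BY_PREFIX = {
    "joe": "www.joes-hardware.com",
    "mary": "www.marys-antiques.com",
}


def site_for_path(request_path: str) -> Optional[Tuple[str, str]]:
    """Return (site, path within that site), or None if no prefix matches."""
    segments = request_path.lstrip("/").split("/", 1)
    prefix = segments[0]
    if prefix not in SITES_BY_PREFIX:
        return None
    remainder = "/" + (segments[1] if len(segments) > 1 else "")
    return SITES_BY_PREFIX[prefix], remainder


print(site_for_path("/joe/index.html"))   # ('www.joes-hardware.com', '/index.html')
print(site_for_path("/mary/index.html"))  # ('www.marys-antiques.com', '/index.html')
print(site_for_path("/index.html"))       # None: the server cannot pick a site
```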

Virtual hosting by port number

Instead of changing the pathname, Joe and Mary could each be assigned a different port number on the web server. Instead of port 80, for example, Joe could get 82 and Mary could have 83. But this solution has the same problem: an end user would expect to find the resources without having to specify a nonstandard port in the URL.
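The drawback can be seen in a small Python sketch. The port numbers are the ones from the example above; the table SITES_BY_PORT and the function site_for_url are hypothetical names used only for illustration. A client connects to the port named in the URL, or to port 80 when none is given, so a URL typed without a port never reaches Joe’s or Mary’s server instance.

```python
from typing import Optional
from urllib.parse import urlsplit

# Port assignments from the example: each port is served by its own
# web server instance, which knows the one site it is responsible for.
SITES_BY_PORT = {82: "joe", 83: "mary"}


def site_for_url(url: str) -> Optional[str]:
    """Return the site reached by url, based only on its port number."""
    port = urlsplit(url).port or 80  # HTTP clients default to port 80
    return SITES_BY_PORT.get(port)


print(site_for_url("http://www.joes-hardware.com:82/index.html"))   # joe
print(site_for_url("http://www.marys-antiques.com:83/index.html"))  # mary
print(site_for_url("http://www.joes-hardware.com/index.html"))      # None
```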

Virtual hosting by IP address

A much better approach (in common use) is virtual IP addressing. Here, each virtual web site gets one or more unique IP addresses. The IP addresses for all of the virtual web sites are attached to the same shared server. The server can look up the destination IP address of the HTTP connection and use that to determine what web site the client thinks it is connected to.

Say a hoster assigned the IP address 209.172.34.3 to www.joes-hardware.com,

assigned 209.172.34.4 to www.marys-antiques.com, and tied both IP addresses to the

same physical server machine. The web server could then use the destination IP

address to identify which virtual site is being requested, as shown in Figure 18-4:

• Client A fetches http://www.joes-hardware.com/index.html.

• Client A finds the IP address for www.joes-hardware.com, getting 209.172.34.3.

• Client A opens a TCP connection to the shared web server at 209.172.34.3.

• Client A sends the request “GET /index.html HTTP/1.0”.

• Before the web server serves a response, it notes the actual destination IP address

(209.172.34.3), determines that this is a virtual IP address for Joe’s web site, and

fulfills the request from the /joe subdirectory. The page /joe/index.html is returned.

Similarly, if client B asks for http://www.marys-antiques.com/index.html:

• Client B finds the IP address for www.marys-antiques.com, getting 209.172.34.4.

• Client B opens a TCP connection to the web server at 209.172.34.4.

• Client B sends the request “GET /index.html HTTP/1.0”.

• The web server determines that 209.172.34.4 is Mary’s web site and fulfills the

request from the /mary subdirectory, returning the document /mary/index.html.
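The two walkthroughs above can be sketched in Python as follows. The IP addresses and subdirectories come from the example; the names DOCROOT_BY_IP, destination_ip, and resolve are hypothetical and chosen only for this illustration. The sketch assumes IPv4 and a server whose listening socket accepts connections on every virtual IP address bound to the machine.

```python
import socket

# Virtual IP table from the example: destination address -> site subdirectory.
DOCROOT_BY_IP = {
    "209.172.34.3": "/joe",
    "209.172.34.4": "/mary",
}


def destination_ip(conn: socket.socket) -> str:
    """Return the local address of an accepted IPv4 connection, which is
    the destination IP address the client connected to."""
    return conn.getsockname()[0]


def resolve(dest_ip: str, request_path: str) -> str:
    """Map a request path into the subdirectory of the site that owns
    dest_ip. Raises KeyError for an address that is not in the table."""
    return DOCROOT_BY_IP[dest_ip] + request_path


# Both clients send "GET /index.html HTTP/1.0"; only the destination differs.
print(resolve("209.172.34.3", "/index.html"))  # /joe/index.html
print(resolve("209.172.34.4", "/index.html"))  # /mary/index.html
```

In a running server, the accepted socket would supply the address, along the lines of resolve(destination_ip(conn), path) after conn, _ = listener.accept().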

Figure 18-4. Virtual IP hosting

[Figure: destination IP addresses 209.172.34.2, 209.172.34.3, and 209.172.34.4.]
