Python 2.6 Text Processing Beginner's Guide (2010)


Python 2.6 Text Processing
Beginner's Guide
The easiest way to learn how to manipulate text with Python
Jeff McNeil
BIRMINGHAM - MUMBAI
Python 2.6 Text Processing
Beginner's Guide
Copyright © 2010 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers
and distributors will be held liable for any damages caused or alleged to be caused directly or
indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2010
Production Reference: 1081210
Published by Packt Publishing Ltd.
32 Lincoln Road
Olton
Birmingham, B27 6PA, UK.
ISBN 978-1-849512-12-1
www.packtpub.com
Cover Image by John Quick (john@johnmquick.com)

Credits
Author
Jeff McNeil
Reviewer
Maurice HT Ling
Acquisition Editor
Steven Wilding
Development Editor
Reshma Sundaresan
Technical Editor
Gauri Iyer
Indexer
Tejal Daruwale
Editorial Team Leader
Mithun Sehgal
Project Team Leader
Priya Mukherji
Project Coordinator
Shubhanjan Chatterjee
Proofreader
Jonathan Todd
Graphics
Nilesh R. Mohite
Production Coordinator
Kruthika Bangera
Cover Work
Kruthika Bangera

About the Author
Jeff McNeil has been working in the Internet Services industry for over 10 years. He cut
his teeth during the late 90's Internet boom and has been developing software for Unix and
Unix-flavored systems ever since. Jeff has been a full-time Python developer for the better
half of that time and has professional experience with a collection of other languages,
including C, Java, and Perl. He takes an interest in systems administration and server
automation problems. Jeff recently joined Google and has had the pleasure of working with
some very talented individuals.
I'd like to above all thank Julie, Savannah, Phoebe, Maya, and Trixie for
allowing me to lock myself in the office every night for months. The
Web.com gang and those in the Python community willing to share their
authoring experiences. Finally, Steven Wilding, Reshma Sundaresan,
Shubhanjan Chatterjee, and the rest of the Packt Publishing team for all of
the hard work and guidance.
About the Reviewer
Maurice HT Ling completed his Ph.D. in Bioinformatics and B.Sc(Hons) in Molecular and
Cell Biology from the University of Melbourne where he worked on microarray analysis
and text mining for protein-protein interactions. He is currently an honorary fellow in the
University of Melbourne, Australia. Maurice holds several Chief Editorships, including The
Python Papers, Computational and Mathematical Biology, and Methods and Cases in
Computational, Mathematical and Statistical Biology. In Singapore, he co-founded the Python
User Group (Singapore) and is the co-chair of PyCon Asia-Pacific 2010. In his free time,
Maurice likes to train in the gym, read, and enjoy a good cup of coffee. He is also a senior
fellow of the International Fitness Association, USA.


Table of Contents
Preface
Chapter 1: Getting Started
Categorizing types of text data
Providing information through markup
Meaning through structured formats
Understanding freeform content
Ensuring you have Python installed
Providing support for Python 3
Implementing a simple cipher
Time for action – implementing a ROT13 encoder
Processing structured markup with a filter
Time for action – processing as a filter
Time for action – skipping over markup tags
State machines
Supporting third-party modules
Packaging in a nutshell
Time for action – installing SetupTools
Running a virtual environment
Configuring virtualenv
Time for action – configuring a virtual environment
Where to get help?
Summary
Chapter 2: Working with the IO System
Parsing web server logs
Time for action – generating transfer statistics
Using objects interchangeably
Time for action – introducing a new log format
Accessing files directly
Time for action – accessing files directly
Context managers
Handling other file types
Time for action – handling compressed files
Implementing file-like objects
File object methods
Enabling universal newlines
Accessing multiple files
Time for action – spell-checking HTML content
Simplifying multiple file access
Inplace filtering
Accessing remote files
Time for action – spell-checking live HTML pages
Error handling
Time for action – handling urllib 2 errors
Handling string IO instances
Understanding IO in Python 3
Summary
Chapter 3: Python String Services
Understanding the basics of string object
Defining strings
Time for action – employee management
Building non-literal strings
String formatting
Time for action – customizing log processor output
Percent (modulo) formatting
Mapping key
Conversion flags
Minimum width
Precision
Width
Conversion type
Using the format method approach
Time for action – adding status code data
Making use of conversion specifiers
Creating templates
Time for action – displaying warnings on malformed lines
Template syntax
Rendering a template
Calling string object methods
Time for action – simple manipulation with string methods
Aligning text
Detecting character classes
Casing
Searching strings
Dealing with lists of strings
Treating strings as sequences
Summary
Chapter 4: Text Processing Using the Standard Library
Reading CSV data
Time for action – processing Excel formats
Time for action – CSV and formulas
Reading non-Excel data
Time for action – processing custom CSV formats
Writing CSV data
Time for action – creating a spreadsheet of UNIX users
Modifying application configuration files
Time for action – adding basic configuration read support
Using value interpolation
Time for action – relying on configuration value interpolation
Handling default options
Time for action – configuration defaults
Writing configuration data
Time for action – generating a configuration file
Reconfiguring our source
A note on Python 3
Time for action – creating an egg-based package
Understanding the setup.py file
Working with JSON
Time for action – writing JSON data
Encoding data
Decoding data
Summary
Chapter 5: Regular Expressions
Simple string matching
Time for action – testing an HTTP URL
Understanding the match function
Learning basic syntax
Detecting repetition
Specifying character sets and classes
Applying anchors to restrict matches
Wrapping it up
Advanced pattern matching
Grouping
Time for action – regular expression grouping
Using greedy versus non-greedy operators
Assertions
Performing an 'or' operation
Implementing Python-specific elements
Other search functions
search
findall and finditer
split
sub
Compiled expression objects
Dealing with performance issues
Parser flags
Unicode regular expressions
The match object
Processing bind zone files
Time for action – reading DNS records
Summary
Chapter 6: Structured Markup
XML data
SAX processing
Time for action – event-driven processing
Incremental processing
Time for action – driving incremental processing
Building an application
Time for action – creating a dungeon adventure game
The Document Object Model
xml.dom.minidom
Time for action – updating our game to use DOM processing
Creating and modifying documents programmatically
XPath
Accessing XML data using ElementTree
Time for action – using XPath in our adventure
Reading HTML
Time for action – displaying links in an HTML page
BeautifulSoup
Summary
Chapter 7: Creating Templates
Time for action – installing Mako
Basic Mako usage
Time for action – loading a simple Mako template
Generating a template context
Managing execution with control structures
Including Python code
Time for action – reformatting the date with Python code
Adding functionality with tags
Rendering files with %include
Generating multiline comments with %doc
Documenting Mako with %text
Defining functions with %def
Time for action – defining Mako def tags
Importing %def sections using %namespace
Time for action – converting mail message to use namespaces
Filtering output
Expression filters
Filtering the output of %def blocks
Setting default filters
Inheriting from base templates
Time for action – updating base template
Growing the inheritance chain
Time for action – adding another inheritance layer
Inheriting attributes
Customizing
Custom tags
Time for action – creating custom Mako tags
Customizing filters
Overviewing alternative approaches
Summary
Chapter 8: Understanding Encodings and i18n
Understanding basic character encodings
ASCII
Limitations of ASCII
KOI8-R
Unicode
Using Unicode with Python 3
Understanding Unicode
Design goals
Organizational structure
Backwards compatibility
Encoding
UTF-32
UTF-8
Encodings in Python
Time for action – manually decoding
Reading Unicode
Writing Unicode strings
Time for action – copying Unicode data
Time for action – fixing our copy application
The codecs module
Time for action – changing encodings
Adopting good practices
Internationalization and Localization
Preparing an application for translation
Time for action – preparing for multiple languages
Time for action – providing translations
Looking for more information on internationalization
Summary
Chapter 9: Advanced Output Formats
Dealing with PDF files using PLATYPUS
Time for action – installing ReportLab
Generating PDF documents
Time for action – writing PDF with basic layout and style
Writing native Excel data
Time for action – installing xlwt
Building XLS documents
Time for action – generating XLS data
Working with OpenDocument files
Time for action – installing ODFPy
Building an ODT generator
Time for action – generating ODT data
Summary
Chapter 10: Advanced Parsing and Grammars
Defining a language syntax
Specifying grammar with Backus-Naur Form
Grammar-driven parsing
PyParsing
Time for action – installing PyParsing
Time for action – implementing a calculator
Parse actions
Time for action – handling type translations
Suppressing parts of a match
Time for action – suppressing portions of a match
Processing data using the Natural Language Toolkit
Time for action – installing NLTK
NLTK processing examples
Removing stems
Discovering collocations
Summary
Chapter 11: Searching and Indexing
Understanding search complexity
Time for action – implementing a linear search
Text indexing
Time for action – installing Nucular
An introduction to Nucular
Time for action – full text indexing
Time for action – measuring index benefit
Scripts provided by Nucular
Using XML files
Advanced Nucular features
Time for action – field-qualified indexes
Performing an enhanced search
Time for action – performing advanced Nucular queries
Indexing and searching other data
Time for action – indexing Open Office documents
Other index systems
Apache Lucene
ZODB and zc.catalog
SQL text indexing
Summary
Appendix A: Looking for Additional Resources
Python resources
Unofficial documentation
Python enhancement proposals
Self-documenting
Using other documentation tools
Community resources
Following groups and mailing lists
Finding a users' group
Attending a local Python conference
Honorable mention
Lucene and Solr
Generating C-based parsers with GNU Bison
Apache Tika
Getting started with Python 3
Major language changes
Print is now a function
Catching exceptions
Using metaclasses
New reserved words
Major library changes
Changes to list comprehensions
Migrating to Python 3
Time for action – using 2to3 to move to Python 3
Summary
Appendix B: Pop Quiz Answers
Chapter 1: Getting Started
ROT 13 Processing Answers
Chapter 2: Working with the IO System
File-like objects
Chapter 3: Python String Services
String literals
String formatting
Chapter 4: Text Processing Using the Standard Library
CSV handling
JSON formatting
Chapter 5: Regular Expressions
Regular expressions
Understanding the Pythonisms
Chapter 6: Structured Markup
SAX processing
Chapter 7: Creating Templates
Template inheritance
Chapter 8: Understanding Encoding and i18n
Character encodings
Python encodings
Internationalization
Chapter 9: Advanced Output Formats
Creating XLS documents
Chapter 11: Searching and Indexing
Introduction to Nucular
Index
Preface
The Python Text Processing Beginner's Guide is intended to provide a gentle, hands-on
introduction to processing, understanding, and generating textual data using the Python
programming language. Care is taken to ensure the content is example-driven, while still
providing enough background information to allow for a solid understanding of the topics
covered.
Throughout the book, we use real world examples such as logfile processing and PDF
creation to help you further understand different aspects of text handling. By the time you've
finished, you'll have a solid working knowledge of both structured and unstructured text
data management. We'll also look at practical indexing and character encodings.
A good deal of supporting information is included. We'll touch on packaging, Python IO,
third-party utilities, and some details on working with the Python 3 series releases. We'll
even spend a bit of time porting a small example application to the latest version.
Finally, we do our best to provide a number of high quality external references. While this
book will cover a broad range of topics, we also want to help you dig deeper when necessary.
What this book covers
Chapter 1, Getting Started: This chapter provides an introduction to character and string
data types and how strings are represented using underlying integers. We'll implement a
simple encoding script to illustrate how text can be manipulated at the character level. We
also set up our systems to allow safe third-party library installation.
Chapter 2, Working with the IO System: Here, you'll learn how to access your data. We cover
Python's IO capabilities in this chapter. We'll learn how to access files locally and remotely.
Finally, we cover how Python's IO layers change in Python 3.
Chapter 3, Python String Services: Covers Python's core string functionality. We look at the
methods of string objects, the core template classes, and Python's various string formatting
methods. We introduce the differences between Unicode and string objects here.
Chapter 4, Text Processing Using the Standard Library: The standard Python distribution
includes a powerful set of built-in libraries designed to manage textual content. We look
at configuration file reading and manipulation, CSV files, and JSON data. We take a bit of a
detour at the end of this chapter to learn how to create your own redistributable Python egg
files.
Chapter 5, Regular Expressions: Looks at Python's regular expression implementation and
teaches you how to implement them. We look at standardized concepts as well as Python's
extensions. We'll break down a few graphically so that the component parts are easy to piece
together. You'll also learn how to safely use regular expressions with international alphabets.
Chapter 6, Structured Markup: Introduces you to XML and HTML processing. We create an
adventure game using both SAX and DOM approaches. We also look briefly at lxml and
ElementTree. HTML parsing is also covered.
Chapter 7, Creating Templates: Using the Mako template language, we'll generate e-mail
and HTML text templates much like the ones that you'll encounter within common web
frameworks. We visit template creation, inheritance, filters, and custom tag creation.
Chapter 8, Understanding Encodings and i18n: We provide a look into character encoding
schemes and how they work. For reference, we'll examine ASCII as well as KOI8-R. We also
look into Unicode and its various encoding mechanisms. Finally, we finish up with a quick
look at application internationalization.
Chapter 9, Advanced Output Formats: Provides information on how to generate PDF, Excel,
and OpenDocument data. We'll build these document types from scratch using direct Python
API calls relying on third-party libraries.
Chapter 10, Advanced Parsing and Grammars: A look at more advanced text manipulation
techniques such as those used by programming language designers. We'll use the PyParsing
library to handle some configuration file management and look into the Python Natural
Language Toolkit.
Chapter 11, Searching and Indexing: A practical look at full text searching and the benefit an
index can provide. We'll use the Nucular system to index a collection of small text files and
make them quickly searchable.
Appendix A, Looking for Additional Resources: It introduces you to places of interest on the
Internet and some community resources. In this appendix, you will learn to create your own
documentation and to use Java Lucene based engines. You will also learn about differences
between Python 2 and Python 3 and how to port code to Python 3.

What you need for this book
This book assumes you have an elementary knowledge of the Python programming language,
so we don't provide a tutorial introduction. From a software angle, you'll simply need a
version of Python (2.6 or later) installed. Each time we require a third-party library, we'll
detail the installation in text.
Who this book is for
If you are a novice Python developer who is interested in processing text then this book is for
you. You need no experience with text processing, though basic knowledge of Python would
help you to better understand some of the topics covered by this book. As the content of this
book develops gradually, you will be able to pick up Python while reading.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions of how to complete a procedure or task, we use:
Time for action – heading
1. Action 1
2. Action 2
3. Action 3
Instructions often need some extra explanation so that they make sense, so they are
followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop Quiz – heading
These are short multiple choice questions intended to help you test your own understanding.

Have a go hero – heading
These set practical challenges and give you ideas for experimenting with what you have
learned.
You will also find a number of styles of text that distinguish between different kinds of
information. Here are some examples of these styles, and explanations of their meanings.
Code words in text are shown as follows: "First of all, we imported the re module".
A block of code is set as follows:
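from optparse import OptionParser

parser = OptionParser()
parser.add_option('-f', '--file', help="CSV Data File")
opts, args = parser.parse_args()
if not opts.file:
    # Illustrative body only: the excerpt stops at the if statement.
    parser.error("a CSV data file is required")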
When we wish to draw your attention to a particular part of a code block, the relevant lines
or items are set in bold:
def init_game(self):
    """
    Process World XML.
    """
    self.location = parse(open(self.world)).documentElement
Any command-line input or output is written as follows:
(text_processing)$ python render_mail.py thank_you-e.txt
New terms and important words are shown in bold. Words that you see on the screen, in
menus or dialog boxes for example, appear in the text like this: "Any X found in the source
data would simply become an A in the output data."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or may have disliked. Reader feedback is important for us to
develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to feedback@packtpub.com, and
mention the book title via the subject of your message.
If there is a book that you need and would like to see us publish, please send us a note in the
SUGGEST A TITLE form on www.packtpub.com or e-mail suggest@packtpub.com.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code for this book
You can download the example code files for all Packt books you have purchased
from your account at http://www.PacktPub.com. If you purchased this
book elsewhere, you can visit http://www.PacktPub.com/support and
register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the
code), we would be grateful if you would report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you
find any errata, please report them by visiting http://www.packtpub.com/support,
selecting your book, clicking on the errata submission form link, and entering the details
of your errata. Once your errata are verified, your submission will be accepted and the
errata will be uploaded on our website, or added to any list of existing errata, under the
Errata section of that title. Any existing errata can be viewed by selecting your title from
http://www.packtpub.com/support.

Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt,
we take the protection of our copyright and licenses very seriously. If you come across any
illegal copies of our works, in any form, on the Internet, please provide us with the location
address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable
content.
Questions
You can contact us at questions@packtpub.com if you are having a problem with any
aspect of the book, and we will do our best to address it.

1
Getting Started
As computer professionals, we deal with text data every day. Developers and
programmers interact with XML and source code. System administrators
have to process and understand logfiles. Managers need to understand and
format financial data and reports. Web designers put in time, hand tuning and
polishing up HTML content. Managing this broad range of formats can seem
like a daunting task, but it's really not that difficult.
This book aims to introduce you, the programmer, to a variety of methods used
to process these data formats. We'll look at approaches ranging from standard
language functions through more complex third-party modules. Somewhere in
there, we'll cover a utility that's just the right tool for your specific job. In the
process, we hope to also cover some Python development best practices.
Where appropriate, we'll look into implementation details enough to help you
understand the techniques used. Most of the time, though, we'll work as hard
as we can to get you up on your feet and crunching those text files.
You'll find that Python makes tasks like this quite painless through its clean and
easy-to-understand syntax, vast community, and the available collection of
additional utilities and modules.
In this chapter, we shall:
Briefly introduce the data formats handled in this book
Implement a simple ROT13 translator
Introduce you to basic processing via filter programs
Learn state machine basics
Learn how to install supporting libraries and components safely and without
administrative access
Look at where to find more information on introductory topics
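As a preview of the ROT13 goal listed above, here is a minimal sketch of what such a translator might look like. It is not the chapter's own implementation: the function name rot13 is a placeholder chosen for illustration, and the code assumes a current Python 3 interpreter rather than Python 2.6. It relies only on the built-in ord() and chr() functions, which expose the integer behind each character.

```python
import string


def rot13(text):
    """Rotate each ASCII letter 13 places; leave everything else untouched."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            base = ord('a')
        elif ch in string.ascii_uppercase:
            base = ord('A')
        else:
            # Digits, punctuation, and whitespace pass through unchanged.
            result.append(ch)
            continue
        # ord() yields the integer behind the character; chr() maps it back.
        result.append(chr((ord(ch) - base + 13) % 26 + base))
    return ''.join(result)


print(rot13("Hello, World!"))
```

Because the alphabet has 26 letters, applying the function twice returns the original text, which is why the same routine can serve as both encoder and decoder.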
Categorizing types of text data
Textual data comes in a variety of formats. For our purposes, we'll categorize text into three
very broad groups. Breaking the data down into these groups helps us to understand the
problem a bit better, and subsequently choose a parsing approach. Each one of these
sweeping groups can be further broken down into more detailed chunks.
One thing to remember when working your way through the book is that text content isn't
limited to the Latin alphabet. This is especially true when dealing with data acquired via the
Internet. We'll cover some of the techniques and tricks to handling internationalized data in
Chapter 8, Understanding Encoding and i18n.
Providing information through markup
Structured text includes formats such as XML and HTML. These formats generally consist of
text content surrounded by special symbols or markers that give extra meaning to a file's
contents. These additional tags are usually meant to convey information to the processing
application and to arrange information in a tree-like structure. Markup allows a developer to
define his or her own data structure, yet rely on standardized parsers to extract elements.
For example, consider the following contrived HTML document:
<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<p>
Hi there, all of you earthlings.
</p>
<p>
Take us to your leader.
</p>
</body>
</html>
In this example, our document's title is clearly identified because it is surrounded by opening
and closing <title> and </title> tags.

Note that although the document's tags give each element
a meaning, it's still up to the application developer to
understand what to do with a title object or a p element.
Notice that while the document still has meaning to us humans, it is also laid out in such a way as to make
it computer friendly. We'll take a deeper look into these formats in Chapter 6, Structured
Markup. Python provides some rich libraries for dealing with these popular formats.
One interesting aspect of these formats is that it's possible to embed references to validation
rules as well as the actual document structure. This is a nice benefit in that we're able to rely
on the parser to perform markup validation for us. This makes our job much easier as it's
possible to trust that the input structure is valid.
Meaning through structured formats
Text data that falls into this category includes things such as configuration files, marker
delimited data, e-mail message text, and JavaScript Object Notation web data. Content
within this second category does not contain explicit markup as XML and HTML do,
but the structure and formatting are required as they convey meaning and information about
the text to the parsing application. For example, consider the format of a Windows INI file
or a Linux system's /etc/hosts file. There are no tags, but the column on the left clearly
means something other than the column on the right.
Python provides a collection of modules and libraries intended to help us handle popular
formats from this category. We'll look at Python's built-in text services in detail when we get
to Chapter 4, Text Processing Using the Standard Library.
Understanding freeform content
This category contains data that does not fall into the previous two groupings. This describes
e-mail message content, letters, book copy, and other unstructured character-based content.
This is where we'll largely have to look at building our own processing components.
There are external packages available to us if we wish to perform common functions. Some
examples include full text searching and more advanced natural language processing.
Ensuring you have Python installed
Our first order of business is to ensure that you have Python installed. You'll need it in order
to complete most of the examples in this book. We'll be working with Python 2.6 and we
assume that you're using that same version. If there are any drastic differences in earlier
releases, we'll make a note of them as we go along. All of the examples should still function
properly with Python 2.4 and later versions.

If you don't have Python installed, you can download the latest 2.X version from
http://www.python.org. Most Linux distributions, as well as Mac OS, usually have a version of
Python preinstalled.
At the time of this writing, Python 2.6 was the latest version available, while 2.7 was in an
alpha state.
Providing support for Python 3
The examples in this book are written for Python 2. However, wherever possible, we will
provide code that has already been ported to Python 3. You can find the Python 3 code in
the Python3 directories in the code bundle available on the Packt Publishing FTP site.
Unfortunately, we can't promise that all of the third-party libraries that we'll use will support
Python 3. The Python community is working hard to port popular modules to version 3.0.
However, as the versions are incompatible, there is a lot of work remaining. In situations
where we cannot provide example code, we'll note this.
Implementing a simple cipher
Let's get going early here and implement our first script to get a feel for what's in store.
A Caesar Cipher is a simple form of cryptography in which each letter of the alphabet is shifted
down by a fixed number of letters. Such ciphers are generally of no cryptographic use when applied alone,
but they do have some valid applications when paired with more advanced techniques.
The preceding diagram depicts a cipher with an offset of three. Any X found in the source
data would simply become an A in the output data. Likewise, any A found in the input data
would become a D.
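The offset-of-three mapping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name caesar_shift is hypothetical, it handles ASCII letters only, and it is written to run on Python 3 as well as Python 2.6.

```python
import string


def caesar_shift(text, offset):
    """Shift each ASCII letter in text by offset places, wrapping at Z."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    offset = offset % 26
    # Pair every letter with the letter found offset places further along.
    table = dict(zip(lower, lower[offset:] + lower[:offset]))
    table.update(zip(upper, upper[offset:] + upper[:offset]))
    # Characters that are not ASCII letters pass through unchanged.
    return ''.join(table.get(char, char) for char in text)


print(caesar_shift('XYZ ABC', 3))   # ABC DEF
print(caesar_shift('ABC DEF', -3))  # XYZ ABC
```

Passing a negative offset reverses the shift, which is how a message encoded this way would be decoded.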

Time for action – implementing a ROT13 encoder
The most popular implementation of this system is ROT13. As its name suggests, ROT13
shifts – or rotates – each letter by 13 places to produce an encrypted result. As the English
alphabet has 26 letters, we simply run it a second time on the encrypted text in order to get
back to our original result.
Let's implement a simple version of that algorithm.
1. Start your favorite text editor and create a new Python source file. Save it
as rot13.py.
2. Enter the following code exactly as you see it below and save the file.
import sys
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
    )
)

def rotate13_letter(letter):
    """
    Return the 13-char rotation of a letter.
    """
    do_upper = False
    if letter.isupper():
        do_upper = True

    letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    else:
        letter = CHAR_MAP[letter]
        if do_upper:
            letter = letter.upper()

    return letter

if __name__ == '__main__':
    for char in sys.argv[1]:
        sys.stdout.write(rotate13_letter(char))
    sys.stdout.write('\n')

3. Now, from a command line, execute the script as follows. If you've entered all of the
code correctly, you should see the same output.
$ python rot13.py 'We are the knights who say, nee!'
Jr ner gur xavtugf jub fnl, arr!
4. Run the script a second time, using the output of the first run as the new input
string. If everything was entered correctly, the original text should be printed to
the console.
$ python rot13.py 'Jr ner gur xavtugf jub fnl, arr!'
We are the knights who say, nee!
What just happened?
We implemented a simple text-oriented cipher using a collection of Python's string handling
features. We were able to see it put to use for both encoding and decoding source text.
We saw a lot of stuff in this little example, so you should have a good feel for what can be
accomplished using the standard Python string object.
Following our initial module imports, we defined a dictionary named CHAR_MAP, which
gives us a nice and simple way to shift our letters by the required 13 places. Each
dictionary key is a source letter, and its value is the target letter! We also took
advantage of string slicing here. We'll look at
slicing a bit more in later chapters, but it's a convenient way for us to extract a substring from
an existing string object.
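To see how the slices and the zip call fit together, the construction of the lookup table can be broken into steps. This is a small sketch; the intermediate names (letters, first_half, second_half, char_map) are introduced here for illustration and are not part of the original listing.

```python
import string

letters = string.ascii_lowercase

# Slicing: characters 13 up to (not including) 26, then 0 up to 13.
second_half = letters[13:26]
first_half = letters[0:13]
print(second_half)  # nopqrstuvwxyz
print(first_half)   # abcdefghijklm

# zip pairs each source letter with its rotated counterpart;
# dict turns those pairs into a lookup table.
char_map = dict(zip(letters, second_half + first_half))
print(char_map['a'])  # n
print(char_map['n'])  # a
```

Because 13 is exactly half of 26, looking a letter up twice returns the original letter, which is why the same table serves for both encoding and decoding.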

In our translation function rotate13_letter, we checked whether our input character
was uppercase or lowercase and then saved that in a Boolean variable. We then forced our
input to lowercase for the translation work. As ROT13 operates on letters alone, we only
performed a rotation if our input character was a letter of the Latin alphabet. We allowed
other values to simply pass through. We could have just as easily forced our string to a pure
uppercased value.
The last thing we do in our function is restore the letter to its proper case, if necessary. This
should familiarize you with upper- and lowercasing of Python ASCII strings.
We're able to change the case of an entire string using this same method; it's not limited to
single characters.
>>> name = 'Ryan Miller'
>>> name.upper()
'RYAN MILLER'
>>> "PLEASE DO NOT SHOUT".lower()
'please do not shout'
>>>
It's worth pointing out here that a single character string is still a string.
There is not a char type, which you may be familiar with if you're coming
from a different language such as C or C++. However, it is possible to
translate between characters and their ASCII codes using the ord and chr
built-in functions and a string with a length of one.
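As a brief illustration of ord and chr, the following sketch converts a character to its code and back. The helper rotate_upper is a hypothetical name used only here, and it assumes its argument is a single uppercase ASCII letter.

```python
# ord maps a one-character string to its integer code;
# chr maps an integer code back to a one-character string.
code = ord('A')
print(code)            # 65
print(chr(code + 13))  # N


def rotate_upper(letter):
    """Rotate one uppercase ASCII letter by 13 places using arithmetic."""
    return chr((ord(letter) - ord('A') + 13) % 26 + ord('A'))


print(rotate_upper('N'))  # A
```

This arithmetic approach is an alternative to the dictionary lookup used in rot13.py; the dictionary version is arguably easier to read.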
Notice how we were able to loop through a string directly using the Python for syntax.
A string object is a standard Python iterable, and we can walk through it as
follows. In practice, however, this isn't something you'll normally do. In most cases, it makes
sense to rely on existing libraries.
$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> for char in "Foo":
...     print char
...
F
o
o
>>>

Finally, you should note that we ended our script with an if statement such as the following:
if __name__ == '__main__':
Python modules all contain an internal __name__ variable that corresponds to the name of
the module. If a module is executed directly from the command line, as this script is, its
__name__ value is set to __main__. The guarded code therefore only runs if we've executed
this script directly. It will not run if we import this code from a different script. You can
import the code from the interactive interpreter and see for yourself.
$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rot13
>>> dir(rot13)
['CHAR_MAP', '__builtins__', '__doc__', '__file__', '__name__',
'__package__', 'rotate13_letter', 'string', 'sys']
>>>
Notice how we were able to import our module and see all of the functions and attributes
inside of it, but the driver code did not execute. This is a convention we'll use throughout the
book in order to help achieve maximum reusability.
Have a go hero – more translation work
Each Python string instance contains a collection of methods that operate on one or more
characters. You can easily display all of the available methods and attributes by using the dir
built-in function. For example, enter the following command into a Python window. Python
responds by printing a list of all methods on a string object.
>>> dir("content")
['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
'__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__',
'__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__',
'__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index',
'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace',
'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split',
'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate',
'upper', 'zfill']
>>>

Much like the isupper and islower methods discussed previously, we also have an
isspace method. Using this method, in combination with your newfound knowledge of
Python strings, update the function we defined previously to translate spaces to underscores
and underscores to spaces.
Processing structured markup with a filter
Our ROT13 application works great for simple one-line strings that we can fit on the
command line. However, it wouldn't work very well if we wanted to encode an entire
file, such as the HTML document we took a look at earlier. In order to support larger text
documents, we'll need to change the way we accept input. We'll redesign our application to
work as a filter.
A filter is an application that reads data from its standard input file descriptor and writes to
its standard output file descriptor. This allows users to create command pipelines in which
multiple utilities are strung together. If you've ever typed a command such as
cat /etc/hosts | grep mydomain.com, you've set up a pipeline.
In many circumstances, data is fed into the pipeline via the keyboard and completes its
journey when a processed result is displayed on the screen.
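The shape of a filter, as described above, can be sketched independently of ROT13. In this illustration the function name run_filter and the in-memory stand-ins for standard input and output are assumptions made so that the example runs without a shell pipeline; the import fallback keeps it working on both Python 2 and Python 3.

```python
try:
    from StringIO import StringIO   # Python 2
except ImportError:
    from io import StringIO         # Python 3


def run_filter(transform, source, sink):
    """Read lines from source and write transform(line) to sink."""
    for line in source:
        sink.write(transform(line))


# In-memory stand-ins for sys.stdin and sys.stdout.
fake_stdin = StringIO('first line\nsecond line\n')
fake_stdout = StringIO()
run_filter(str.upper, fake_stdin, fake_stdout)
print(fake_stdout.getvalue())
```

In a real command-line filter, the same call would be made with sys.stdin and sys.stdout in place of the two StringIO objects.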
Time for action – processing as a filter
Let's make the changes required to allow our simple ROT13 processor to work as a
command-line filter. This will allow us to process larger files.
1. Create a new source file and enter the following code. When complete, save the file
as rot13-b.py.
import sys
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
    )
)

def rotate13_letter(letter):
    """
    Return the 13-char rotation of a letter.
    """
    do_upper = False
    if letter.isupper():
        do_upper = True

    letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    else:
        letter = CHAR_MAP[letter]
        if do_upper:
            letter = letter.upper()

    return letter

if __name__ == '__main__':
    for line in sys.stdin:
        for char in line:
            sys.stdout.write(rotate13_letter(char))
2. Enter the following HTML data into a new text file and save it as
sample_page.html. We'll use this as example input to our updated rot13-b.py.
<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<p>
Hi there, all of you earthlings.
</p>
<p>
Take us to your leader.
</p>
</body>
</html>
3. Now, run our rot13-b.py example and provide our HTML document as standard
input data. The exact method used will vary with your operating system. If you've
entered the code successfully, you should simply see a new prompt.
$ cat sample_page.html | python rot13-b.py > rot13.html
$

4. The contents of rot13.html should be as follows. If that's not the case, double
back and make sure everything is correct.
<ugzy>
<urnq>
<gvgyr>Uryyb, Jbeyq!</gvgyr>
</urnq>
<obql>
<c>
Uv gurer, nyy bs lbh rneguyvatf.
</c>
<c>
Gnxr hf gb lbhe yrnqre.
</c>
</obql>
</ugzy>
5. Open the translated HTML file using your web browser.
What just happened?
We updated our rot13.py script, saving it as rot13-b.py, to read standard input data rather
than rely on a command-line option. Doing this provides optimal configurability going forward and lets us
feed input of varying length from a collection of different sources. We did this by looping on
each line available on the sys.stdin file stream and calling our translation function. We
wrote each character returned by that function to the sys.stdout stream.
Next, we ran our updated script via the command line, using sample_page.html as input.
As expected, the encoded version was written to rot13.html.
As you can see, there is a major problem with our output. We should have a proper page
title and our content should be broken down into different paragraphs.

Remember, structured markup text is sprinkled with
tag elements that define its structure and organization.
In this example, we not only translated the text content, we also translated the markup
tags, rendering them meaningless. A web browser would not be able to display this data
properly. We'll need to update our processor code to ignore the tags. We'll do just that
in the next section.
Time for action – skipping over markup tags
In order to preserve the proper, structured HTML that tags provide, we need to ensure we
don't include them in our rotation. To do this, we'll keep track of whether or not our input
stream is currently within a tag. If it is, we won't translate our letters.
1. Once again, create a new Python source file and enter the following code. When
you're finished, save the file as rot13-c.py.
import sys
from optparse import OptionParser
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
    )
)

class RotateStream(object):
    """
    General purpose ROT13 Translator

    A ROT13 translator smart enough to skip
    Markup tags if that's what we want.
    """
    MARKUP_START = '<'
    MARKUP_END = '>'

    def __init__(self, skip_tags):
        self.skip_tags = skip_tags

    def rotate13_letter(self, letter):
        """
        Return the 13-char rotation of a letter.
        """
        do_upper = False
        if letter.isupper():
            do_upper = True

        letter = letter.lower()
        if letter not in CHAR_MAP:
            return letter
        else:
            letter = CHAR_MAP[letter]
            if do_upper:
                letter = letter.upper()

        return letter

    def rotate_from_file(self, handle):
        """
        Rotate from a file handle.

        Takes a file-like object and translates
        text from it into ROT13 text.
        """
        state_markup = False
        for line in handle:
            for char in line:
                if self.skip_tags:
                    if state_markup:
                        # here we're looking for a closing
                        # '>'
                        if char == self.MARKUP_END:
                            state_markup = False
                    else:
                        # Not in a markup state, rotate
                        # unless we're starting a new
                        # tag
                        if char == self.MARKUP_START:
                            state_markup = True
                        else:
                            char = self.rotate13_letter(char)
                else:
                    char = self.rotate13_letter(char)
                # Make this a generator
                yield char

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-t', '--tags', dest="tags",
        help="Ignore Markup Tags", default=False,
        action="store_true")
    options, args = parser.parse_args()
    rotator = RotateStream(options.tags)
    for letter in rotator.rotate_from_file(sys.stdin):
        sys.stdout.write(letter)
2. Run the same sample_page.html file that we created for the last example through the
new processor. This time, be sure to pass a -t command-line option.
$ cat sample_page.html | python rot13-c.py -t > rot13.html
$
3. If everything was entered correctly, the contents of rot13.html should be exactly
as follows.
<html>
<head>
<title>Uryyb, Jbeyq!</title>
</head>
<body>
<p>
Uv gurer, nyy bs lbh rneguyvatf.
</p>
<p>
Gnxr hf gb lbhe yrnqre.
</p>
</body>
</html>
4. Open the translated file in your web browser.

What just happened?
That was a pretty complex example, so let's step through it. We did quite a bit. First, we
moved away from a simple rotate13_letter function and wrapped almost all of our
functionality in a Python class named RotateStream. Doing this helps us ensure that our
code will be reusable down the road.
We define an __init__ method within the class that accepts a single parameter named
skip_tags. The value of this parameter is stored as an attribute on self so we can access
it later from within other methods. If this is a True value, then our parser class will know
that it's not supposed to translate markup tags.
Next, you'll see our familiar rotate13_letter method (it's a method now as it's defined
within a class). The only real difference here is that in addition to the letter parameter,
we're also requiring the standard self parameter.
Finally, we have our rotate_from_file method. This is where the bulk of our new
functionality was added. Like before, we're iterating through all of the characters available
on a file stream. This time, however, the file stream is passed in as a handle parameter.
This means that we could have just as easily passed in an open file handle rather than the
standard input file handle.
Inside the method, we implement a simple state machine, with two possible states. Our
current state is saved in the state_markup Boolean variable. We only rely on it if the value
of self.skip_tags set in the __init__ method is True.
1. If state_markup is True, then we're currently within the context of a markup tag
and we're looking for the > character. When it's found, we'll change state_markup
to False. As we're inside a tag, we'll never ask our class to perform a ROT13
operation.
2. If state_markup is False, then we're parsing standard text. If we come across
the < character, then we're entering a new markup tag. We set the value of
state_markup to True. Otherwise, as we're not in a tag, we'll call rotate13_letter to perform
our ROT13 operation.
You should also notice some unfamiliar code at the end of the source listing. We've taken
advantage of the OptionParser class, which is part of the standard library. We've added
a single option that will allow us to selectively enable our markup bypass functionality. The
value of this option is passed into RotateStream's __init__ method.
The final two lines of the listing show how we pass the sys.stdin file handle to
rotate_from_file and iterate over the results. The rotate_from_file method has been defined
as a generator function. A generator function returns values as it processes rather than
waiting until completion. This avoids storing all of the results in memory and lowers
overall application memory consumption.
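The "returns values as it processes" behavior of a generator can be observed directly. The following sketch is not taken from the listing; the function shout and its log argument are hypothetical names used to record when work actually happens.

```python
def shout(chars, log):
    """Yield each character upper-cased, recording when work happens."""
    for char in chars:
        log.append('produced ' + char)
        yield char.upper()


log = []
stream = shout('ab', log)
print(log)           # []  (no work is done until a value is requested)
print(next(stream))  # A
print(log)           # ['produced a']
print(list(stream))  # ['B']
```

Calling shout only creates the generator object. Each value is computed when the consumer asks for it, so only one character needs to be held in memory at a time.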

State machines
A state machine is an algorithm that keeps track of an application's internal state. Each
state has a set of available transitions and functionality associated with it. In this example,
we were either inside or outside of a tag. Application behavior changed depending on
our current state. For example, if we were inside then we could transition to outside. The
opposite also holds true.
The state machine concept is advanced and won't be covered in detail. However, it is a
major method used when implementing text-processing machinery. For example, regular
expression engines are generally built on variations of this model. For more information
on state machine implementation, see the Wikipedia article available at
http://en.wikipedia.org/wiki/Finite-state_machine.
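The two-state machine used in this chapter can also be written as a standalone function with the states named explicitly. This is a simplified sketch rather than the RotateStream class itself: the names TEXT, TAG, ROT13, and rot13_outside_tags are introduced here for illustration, and the function works on a string instead of a file handle.

```python
import string

LOWER = string.ascii_lowercase
UPPER = string.ascii_uppercase
ROT13 = dict(zip(LOWER + UPPER,
                 LOWER[13:] + LOWER[:13] + UPPER[13:] + UPPER[:13]))

# The two states our machine can be in.
TEXT, TAG = 'text', 'tag'


def rot13_outside_tags(text):
    """Apply ROT13 to text, leaving anything between < and > untouched."""
    state = TEXT
    out = []
    for char in text:
        if state == TAG:
            # Inside a tag: never rotate, just watch for the closing '>'.
            if char == '>':
                state = TEXT
        elif char == '<':
            # Outside a tag and a new tag is starting.
            state = TAG
        else:
            # Outside a tag: rotate letters, pass everything else through.
            char = ROT13.get(char, char)
        out.append(char)
    return ''.join(out)


print(rot13_outside_tags('<title>Hello, World!</title>'))
# <title>Uryyb, Jbeyq!</title>
```

Each branch corresponds to one state and the transitions available from it, which is the essence of the pattern.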
Pop Quiz – ROT 13 processing
1. We define MARKUP_START and MARKUP_END class constants within our
RotateStream class. How might our state machine be affected if these
values were swapped?
2. Is it possible to use ROT13 on a string containing characters found outside of the
English alphabet?
3. What would happen if we embedded > or < signs within our text content or tag
values?
4. In our example, we read our input a line at a time. Can you think of a way to make
this more efficient?
Have a go hero – support multiple input channels
We've briefly covered reading data via standard input as well as processing simple
command-line options. Your job is to integrate the two so that your application will
simply translate a command-line value if one is present before defaulting to standard input.
If you're able to implement this, try extending the option handling code so that your input
string can be passed in to the rotation application using a command-line option.
$ python rot13-c.py -s 'myinputstring'
zlvachgfgevat
$
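One possible starting point for this exercise is sketched below. It is not the book's solution: the long option name --string, the destination name text, and the helper names build_parser and choose_input are all assumptions made for illustration. Only the -s flag comes from the example above.

```python
import sys
from optparse import OptionParser


def build_parser():
    parser = OptionParser()
    # -s comes from the exercise; --string is a hypothetical long name.
    parser.add_option('-s', '--string', dest='text', default=None,
                      help='Translate this string instead of standard input')
    return parser


def choose_input(argv, stdin):
    """Return an iterable of lines: the -s value if given, else stdin."""
    options, args = build_parser().parse_args(argv)
    if options.text is not None:
        return [options.text + '\n']
    return stdin


print(choose_input(['-s', 'myinputstring'], sys.stdin))
# ['myinputstring\n']
```

Because a list of strings can be iterated in the same way as a file handle, the result could be passed straight to a method such as rotate_from_file without further changes.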

Supporting third-party modules
Now that we've got our first example out of the way, we're going to take a little bit of a
detour and learn how to obtain and install third-party modules. This is important, as we'll
install a few throughout the remainder of the book.
The Python community maintains a centralized package repository, termed the Python
Package Index (or PyPI). It is available on the web at http://pypi.python.org. From
there, it is possible to download packages as compressed source distributions, or in some
cases, pre-packaged Python components. PyPI is also a rich source of information. It's a
great place to learn about available third-party applications. Links are provided to individual
package documentation if it's not included directly in the package's PyPI page.
Packaging in a nutshell
There are at least two different popular methods of packaging and deploying Python
packages. The distutils package is part of the standard distribution and provides a
mechanism for building and installing Python software. Packages that take advantage of the
distutils system are downloaded as a source distribution and built and installed by a local
user. They are installed by simply creating an additional directory structure within the system
Python directory that matches the package name.
In an effort to make packages more accessible and self-contained, the concept of the
Python Egg was introduced. An egg file is simply a ZIP archive of a package. When an egg is
installed, the ZIP file itself is placed on the Python path, rather than a subdirectory.
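The idea of placing a ZIP file on the Python path can be tried out with the standard library alone. The sketch below builds a throwaway archive and imports a module from it; the module name eggdemo and its contents are invented for the demonstration. A real egg also carries packaging metadata, which this example leaves out.

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny archive containing one module (hypothetical name: eggdemo).
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, 'eggdemo.zip')
bundle = zipfile.ZipFile(archive, 'w')
bundle.writestr('eggdemo.py', 'GREETING = "hello from inside a ZIP"\n')
bundle.close()

# Placing the archive itself on the path makes its modules importable.
sys.path.insert(0, archive)
import eggdemo

print(eggdemo.GREETING)
```

The import works because Python treats a ZIP archive on sys.path much like a directory, which is the mechanism eggs rely on.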
Time for action – installing SetupTools
Egg files have largely become the de facto standard in Python packaging. In order to install,
develop, and build egg files, it is necessary to install a third-party tool kit. The most popular
is SetupTools, and this is what we'll be working with throughout this book. The installation
process is fairly easy to complete and is rather self-contained. Installing SetupTools gives us
access to the easy_install command, which automates the download and installation of
packages that have been registered with PyPI.
1. Download the installation script, which is available at
http://peak.telecommunity.com/dist/ez_setup.py. This same script will be
used for all versions of Python.

2. As an administrative user, run the ez_setup.py script from the command line. The
SetupTools installation process will complete. If you've executed the script with the
proper rights, you should see output similar to the following:
# python ez_setup.py
Downloading http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11-py2.6.egg
Processing setuptools-0.6c11-py2.6.egg
creating /usr/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg
Extracting setuptools-0.6c11-py2.6.egg to /usr/lib/python2.6/site-packages
Adding setuptools 0.6c11 to easy-install.pth file
Installing easy_install script to /usr/bin
Installing easy_install-2.6 script to /usr/bin
Installed /usr/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg
Processing dependencies for setuptools==0.6c11
Finished processing dependencies for setuptools==0.6c11
#
What just happened?
We downloaded the SetupTools installation script and executed it as an administrative
user. By doing so, our system Python environment was configured so that we can install egg
files in the future via the SetupTools easy_install system.
SetupTools does not currently work with Python 3.0. There is, however, an
alternative available via the Distribute project. Distribute is intended to be a
drop-in replacement for SetupTools and will work with either major Python
version. For more information, or to download the installer, visit
http://pypi.python.org/pypi/distribute.

Running a virtual environment
Now that we have SetupTools installed, we can install third-party packages by simply
running the easy_install command. This is nice because package dependencies will
automatically be downloaded and installed so we no longer have to do this manually.
However, there's still one piece missing. Even though we can install these packages easily,
we still need to retain administrative privileges to do so. Additionally, all of the packages
that we choose to install will be placed in the system's Python library directory, which has
the potential to cause inconsistencies and problems down the road. As you've probably
guessed, there's a utility to address that.
Python 2.6 introduces the concept of a local user package directory. This is
simply an additional location found within your user home directory that Python
searches for installed packages. It is possible to install eggs into this location via
easy_install with a --user command-line switch. For more information,
see http://www.python.org/dev/peps/pep-0370/.
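To see where the local user package directory lives on a given machine, the site module can be consulted. This is a minimal sketch that assumes the interpreter was started normally; if user site directories are disabled, the value may be None or may not appear on sys.path.

```python
import site
import sys

# USER_SITE holds the per-user package directory described in PEP 370.
user_site = site.USER_SITE
print(user_site)

# The directory is only added to sys.path if it exists and is enabled.
print(user_site in sys.path)
```

The exact location varies by operating system, so no particular path is shown here.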
Configuring virtualenv
The virtualenv package, distributed as a Python egg, allows us to create an isolated
Python environment anywhere we wish. The environment comes complete with a bin
directory containing a Python binary, its own installation of SetupTools, and an
instance-specific library directory. In short, it creates a location for us to install and
configure Python without interfering with the system installation.
Time for action – configuring a virtual environment
Here, we'll install the virtualenv package, which will illustrate how to install packages
from the PyPI site. We'll also configure our first environment, which we'll use throughout the
book for the rest of our examples and code illustrations.
1. As a user with administrative privileges, install virtualenv from the system
command line by running easy_install virtualenv. If you have the correct
permissions, your output should be similar to the following.
Searching for virtualenv
Reading http://pypi.python.org/simple/virtualenv/
Reading http://virtualenv.openplans.org
Best match: virtualenv 1.4.5
Downloading http://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.4.5.tar.gz#md5=d3c621dd9797789fef78442e336df63e
Processing virtualenv-1.4.5.tar.gz
Running virtualenv-1.4.5/setup.py -q bdist_egg --dist-dir /tmp/easy_install-rJXhVC/virtualenv-1.4.5/egg-dist-tmp-AvWcd1
warning: no previously-included files matching '*.*' found under directory 'docs/_templates'
Adding virtualenv 1.4.5 to easy-install.pth file
Installing virtualenv script to /usr/bin
Installed /usr/lib/python2.6/site-packages/virtualenv-1.4.5-py2.6.egg
Processing dependencies for virtualenv
Finished processing dependencies for virtualenv
2. Drop administrative privileges as we won't need them any longer. Ensure that you're
within your home directory and create a new virtual instance by running:
$ virtualenv --no-site-packages text_processing
3. Step into the newly created text_processing directory and activate the
virtual environment. Windows users will do this by simply running the
Scripts\activate application, while Linux users must instead source the script using the
shell's dot operator.
$ . bin/activate
4. If you've done this correctly, you should now see your command-line prompt change
to include the string (text_processing). This serves as a visual cue to remind you
that you're operating within a specific virtual environment.
(text_processing)$ pwd
/home/jmcneil/text_processing
(text_processing)$ which python
/home/jmcneil/text_processing/bin/python
(text_processing)$
5. Finally, deactivate the environment by running the deactivate command. This will
return your shell environment to its default. Note that once you've done this, you're
once again working with the system's Python install.
(text_processing)$ deactivate
$ which python
/usr/bin/python
$

If you're running Windows, by default python.exe and easy_install.exe
are not placed on your system %PATH%. You'll need to manually configure
your %PATH% variable to include C:\Python26\ and C:\Python26\Scripts.
Additional scripts added by easy_install will also be placed in
this directory, so it's worth setting up your %PATH% variable.
What just happened?
We installed the virtualenv package using the easy_install command directly off of
the Python Package Index. This is the method we'll use for installing any third-party packages
going forward. You should now be familiar with the easy_install process. Also, note that
for the remainder of the book, we'll operate from within this text_processing virtual
environment. Additional packages are installed using this same technique from within the
confines of our environment.
After the install process was completed, we configured and activated our first virtual
environment. You saw how to create a new instance via the virtualenv command and
you also learned how to subsequently activate it using the bin/activate script. Finally, we
showed you how to deactivate your environment and return to your system's default state.
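Besides the which python check shown earlier, the interpreter itself can report whether it is running inside an isolated environment. The following sketch rests on two assumptions about tooling: classic virtualenv releases set sys.real_prefix, while newer Python versions expose sys.base_prefix instead. The function name describe_environment is hypothetical.

```python
import sys


def describe_environment():
    """Report the active prefix and whether it differs from the base install."""
    # sys.real_prefix: set by classic virtualenv releases (assumption).
    # sys.base_prefix: available on newer Python 3 interpreters.
    base = (getattr(sys, 'real_prefix', None)
            or getattr(sys, 'base_prefix', sys.prefix))
    return {
        'prefix': sys.prefix,
        'base': base,
        'isolated': base != sys.prefix,
    }


print(describe_environment())
```

Running this before and after activating text_processing should show the prefix change from the system location to the environment directory.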
Have a go hero – install your own environment
Now that you know how to set up your own isolated Python environment, you're encouraged
to create a second one and install a collection of third-party utilities in order to get the hang of
the installation process.
1. Create a new environment and give it a name of your own choice.
2. Point your browser to http://pypi.python.org and select one or more
packages that you find interesting. Install them via the easy_install command
within your new virtual environment.
Note that you should not require administrative privileges to do this. If you receive an error
about permissions, make certain you've remembered to activate your new environment.
Deactivate when complete. Some of the packages available for install may require a correctly
configured C-language compiler.

Where to get help?
The Python community is a friendly bunch of people. There is a wide range of online
resources you can take advantage of if you find yourself stuck. Let's take a quick look at
what's out there.
Home site: The Python website, available at http://www.python.org.
Specifically, the documentation section. The standard library reference is a
wonderful asset and should be something you keep at your fingertips. This site also
contains a wonderful tutorial as well as a complete language specification.
Member groups: The comp.lang.python newsgroup. Available via Google
Groups as well as an e-mail gateway, this provides a general-purpose location to
ask Python-related questions. A very smart bunch of developers patrol this group;
you're certain to get a quality answer.
Forums: Stack Overflow, available at http://www.stackoverflow.com.
Stack Overflow is a website dedicated to developers. You're welcome to ask your
questions, as well as answer others' inquiries, if you're up to it!
Mailing list: If you have a beginner-level question, there is a Python tutor mailing
list available off of the Python.org site. This is a great place to ask your beginner
questions no matter how basic they might be!
Centralized package repository: The Python Package Index at
http://pypi.python.org. Chances are someone has already had to do exactly what it is
you're doing.
If all else fails, you're more than welcome to contact the author via e-mail to
questions@packtpub.com. Every effort will be made to answer your question, or point you to a freely
available resource where you can find your resolution.
Summary
This chapter introduced you to the different categories of text that we'll cover in greater
detail throughout the book and provided you with a little bit of information as to how we'll
manage our packaging going forward.
We performed a few low-level text translations by implementing a ROT13 encoder and
highlighted the differences between freeform and structured markup. We'll examine these
categories in much greater detail as we move on. The goal of that exercise was to learn some
byte-level transformation techniques.
Finally, we touched on a couple of different ways to read data into our applications. In our
next chapter, we'll spend a great deal of time getting to know the IO system and learning
how you can extract text from a collection of sources.

2
Working with the IO System
Now that we've covered some basic text-processing methods and introduced
you to some core Python best practices, it's time we take a look at how to
actually get to your data. Reading some example text from the command line is
an easy process, but getting to real world data can be more difficult. However,
it's important to understand how to do so.
Python provides all of the standard file IO mechanisms you would expect from
any full-featured programming language. Additionally, there is a wide range of
standard library modules included that enable you to access data via various
network services such as HTTP, HTTPS, and FTP.
In this chapter, we'll focus on those methods and systems. We'll look at standard file
functionality, the extended abilities within the standard library, and how these components
can be used interchangeably in many situations.
As part of our introduction to file input and output, we'll also cover some common
exception-handling techniques that are especially helpful when dealing with external data.
In this chapter, we shall:
Look at Python's file IO and examine the objects created by the open factory function
Understand text-based and raw IO, and how they differ
Examine the urllib and urllib2 modules and detail file access via HTTP and FTP
streams
Handle file IO using Context Managers
Learn about file-like objects and methods to use objects interchangeably for
maximum reuse

Introduce exceptions with a specific focus on idioms specific to file IO and how to
deal with certain error conditions
Introduce a web server logfile processor, which we'll expand upon throughout
future chapters
Examine ways to deal with multiple files
We'll also spend some time looking at changes to the IO subsystem in future
versions of Python
Parsing web server logs
We're going to introduce a web server log parser in this section that we'll build upon
throughout the remainder of the book. We're going to start by assuming the logfile is in the
standard Apache combined format.
For example, the following line represents an HTTP request for the root directory of a
website. The request is successful, as indicated by the 200 series response code.
In order, the above line contains the remote IP address of the client, the remote identd
name, the authenticated username, the server's timestamp, the first line of the request, the
HTTP response code, the size of the file as returned by the server, the referring page, and
finally the User Agent, or the browser software running on the end user's computer.
The dashes in the log line indicate a missing value. This doesn't necessarily
correspond to an error condition. For example, if the page is not password-protected then
there will be no remote user. The dash is a common condition we'll need to handle.
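The field order described above can be sketched with a plain whitespace split. This is only an illustration under a few assumptions: the sample line is borrowed from the mock data used later in this chapter (with a shortened User Agent), the helper name describe_fields is hypothetical, and the code is written so that it should run on both Python 2.6 and Python 3.

```python
# Sample combined-format line (User Agent shortened for readability).
SAMPLE = ('127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /a HTTP/1.1" '
          '200 65383 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"')


def describe_fields(line):
    """Hypothetical helper: pick a few fields out of one log line."""
    # Splitting on whitespace spreads fields that contain spaces (the
    # timestamp, the request line, the User Agent) over several parts.
    parts = line.split()
    return {
        'remote_ip': parts[0],
        # A dash means the value is missing, so map it to None.
        'identd_name': None if parts[1] == '-' else parts[1],
        'remote_user': None if parts[2] == '-' else parts[2],
        'file_requested': parts[6],
        'status': int(parts[8]),
        # A missing size is treated as zero bytes.
        'size': 0 if parts[9] == '-' else int(parts[9]),
    }


fields = describe_fields(SAMPLE)
```

Mapping a dash to None (or to 0 for the size) keeps later code from having to compare against the literal '-' string.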

For more information on web server log formats and available data points,
please see your web server documentation. Apache logs were used to write
this book; documentation for the Apache web server is available at
http://httpd.apache.org/docs/2.2/mod/mod_log_config.html
Time for action – generating transfer statistics
Now, let's start our processor. Initially, we'll build enough functionality to scan our logfile
as read via standard input and report files served over a given size. System administrators
may find utilities such as this useful when attempting to track down abusive users. It's also
generally a good idea to iteratively add functionality to an application in development.
1. First, step into the virtual environment created in Chapter 1, Getting Started and
activate it so that all of our work is isolated locally. Only the UNIX method is shown
here.
$ cd text_processing/
$ . bin/activate
2. Create an empty Python file and name it logscan.py. Enter the following code:
#!/usr/bin/python
import sys
from optparse import OptionParser

class LogProcessor(object):
    """
    Process a combined log format.

    This processor handles logfiles in a combined format,
    objects that act on the results are passed in to
    the init method as a series of methods.
    """
    def __init__(self, call_chain=None):
        """
        Setup parser.

        Save the call chain. Each time we process a log,
        we'll run the list of callbacks with the processed
        log results.
        """
        if call_chain is None:
            call_chain = []
        self._call_chain = call_chain

    def split(self, line):
        """
        Split a logfile.

        Initially, we just want size and requested file name, so
        we'll split on spaces and pull the data out.
        """
        parts = line.split()
        return {
            'size': 0 if parts[9] == '-' else int(parts[9]),
            'file_requested': parts[6]
        }

    def parse(self, handle):
        """
        Parses the logfile.

        Passes a dictionary composed of log entry values
        to each callback for easy data summation.
        """
        for line in handle:
            fields = self.split(line)
            for func in self._call_chain:
                func(fields)

class MaxSizeHandler(object):
    """
    Check a file's size.
    """
    def __init__(self, size):
        self.size = size

    def process(self, fields):
        """
        Looks at each line individually.

        Looks at each parsed log line individually and
        performs a size calculation. If it's bigger than
        our self.size, we just print a warning.
        """
        if fields['size'] > self.size:
            print >>sys.stderr, \
                'Warning: %s exceeds %d bytes (%d)!' % \
                (fields['file_requested'], self.size,
                 fields['size'])

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-s', '--size', dest="size",
        help="Maximum File Size Allowed",
        default=0, type="int")
    opts,args = parser.parse_args()
    call_chain = []
    size_check = MaxSizeHandler(opts.size)
    call_chain.append(size_check.process)
    processor = LogProcessor(call_chain)
    processor.parse(sys.stdin)
3. Now, create a new file and name it example.log. Enter the following mock
log data. Note that each line begins with 127.0.0.1 and should be entered as such.
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /a HTTP/1.1" 200
65383 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /b HTTP/1.1" 200
22912 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /c HTTP/1.1" 200
1818212 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /d HTTP/1.1" 200
888 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /e HTTP/1.1" 200
38182121 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
SV1; .NET CLR 1.1.4322)"
4. Now run the logscan.py script by entering the following command. If all code and
data has been entered correctly, you should see the following output.
(text_processing)$ cat example.log | python logscan.py -s 1000
What just happened?
Let's go through the code and look at what's going on. We expanded on concepts from
the first chapter and introduced quite a few new elements here. It's important that you
understand this example as we'll use it as the foundation for many of our future exercises.

First, recognize what should be familiar to you. We've parsed our arguments, ensured that
our main code is only executed when our script is started directly, and we created a couple
of classes that make up our application. We also passed the open file stream to our parse
method, much like we did with our ROT13 example. Simple!
This application is largely composed of two main classes: LogProcessor and
MaxSizeHandler. We split it off like this to ensure we can expand in the future. Perhaps
we'll want to add more checks or handle logfiles in a different format. This approach ensures
that is possible.
The __init__ method of LogProcessor takes a call_chain argument, which defaults to
None. This will contain a list of functions that we'll call for each line in the logfile, passing in
the values parsed out of each line as a dictionary.
If you look further into the __init__ method, you'll see the following code:
        if call_chain is None:
            call_chain = []
        self._call_chain = call_chain
This may look peculiar to you. Why wouldn't we simply default call_chain to an empty list
object? The answer is actually rather complex. For now, simply understand that if we do that,
we may accidentally share a single call_chain list among all instances of our class!
If you're curious as to why using an empty list is a bad idea, have a look
at http://www.ferg.org/projects/python_gotchas.html#contents_item_6.2.
Most of the time, what you actually get is not
what you would expect and subtle bugs slip into your code.
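The sharing problem can be shown in a few lines. The two class names below are hypothetical and exist only for this comparison; the second one follows the same None check used in LogProcessor.

```python
class SharedDefault(object):
    # The default list is created once, when the def statement runs,
    # and every call that omits the argument receives that same list.
    def __init__(self, call_chain=[]):
        self.call_chain = call_chain


class SafeDefault(object):
    # A fresh list is created on each call that omits the argument.
    def __init__(self, call_chain=None):
        if call_chain is None:
            call_chain = []
        self.call_chain = call_chain


a, b = SharedDefault(), SharedDefault()
a.call_chain.append('handler')
shared = b.call_chain        # b sees the item appended through a

c, d = SafeDefault(), SafeDefault()
c.call_chain.append('handler')
separate = d.call_chain      # d still has its own empty list
```

Here shared ends up containing 'handler' even though nothing was appended through b, while separate stays empty.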
In our split method, we break our logfile line up at the space boundary. Obviously, this
doesn't work if we needed some of the fields that contain spaces, but we're not that far yet.
For now, this is an acceptable approach. Note the check for the dash here. It's possible that
the web server may not report a size on each request. Consider the effect of a browser cache
where new data is not transferred over the network if it hasn't changed on the server.
The split method utilizes Python's conditional expressions, which first
appeared in version 2.5. If you're using an earlier version of Python, you'll need
to expand into a traditional if-else block.
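For comparison, here is the size check written both ways. The two function names are hypothetical; only the conditional expression itself comes from the split method.

```python
def size_with_expression(raw):
    # Conditional expression, available from Python 2.5 onward.
    return 0 if raw == '-' else int(raw)


def size_with_block(raw):
    # Equivalent if-else block for older interpreters.
    if raw == '-':
        size = 0
    else:
        size = int(raw)
    return size
```

Both functions return 0 for a dash and the integer value otherwise.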
Finally, we have our parse method. This method is responsible for translating each line
of the logfile into a useable dictionary and passing it into each method in our stored
call_chain.

Next, we have our MaxSizeHandler class. This class ought to be rather straightforward. At
initialization time, we store a maximum file size. When our process method is called as part
of the call_chain run, we simply print a warning if the current file exceeds the threshold.
The script proper should look largely familiar to you. We parse our command-line options via
the OptionParser class, but this time we introduce type translation. We create an instance
of MaxSizeHandler and add its process method to our call_chain list. Finally, that list
is used to create a new LogProcessor instance and we call its parse method.
Python methods and functions are considered to be first class objects. What
does this mean? Simply put, you can pass them around to methods, assign them
to collections, and bind them as other attributes just as if they were simple data
types such as integers, strings, and class instances. No wrapper classes required!
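A small sketch of what first class means in practice. The names shout and Greeter are made up for this illustration.

```python
def shout(text):
    return text.upper()


class Greeter(object):
    def greet(self, name):
        return 'hello ' + name


greeter = Greeter()

# A plain function and a bound method stored side by side in a list.
callbacks = [shout, greeter.greet]

# A function bound to a second name, just like any other value.
alias = shout

# Each stored callable is invoked with the same argument.
results = [func('world') for func in callbacks]
```

This is the same mechanism that lets size_check.process be appended to call_chain and called later by the processor.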
Using objects interchangeably
The big take-away from this example is that objects can be designed such that they're
interchangeable. The technical term for this is Polymorphism. This comes into play
throughout the chapter as we look at different methods of accessing datafiles.
Time for action – introducing a new log format
Let's take a closer look at this concept. Let's assume for a second that a colleague heard about
your nifty log-processing program and wanted to use it to parse his data. The trouble is that
he's already tried his hand at solving the problem with standard shell utilities and his input
format is slightly different. It's simply a list of file names followed by the file size in bytes.
1. Using logscan.py as a template, create a new file named logscan-b.py. The
two files should be exactly the same.
2. Add an additional class directly below LogProcessor as follows.
class ColumnLogProcessor(LogProcessor):
    def split(self, line):
        parts = line.split()
        return {
            'size': int(parts[1]),
            'file_requested': parts[0]
        }

3. Now, change the line that creates a LogProcessor object. Instead, we want it to
create a ColumnLogProcessor object.
    call_chain.append(size_check.process)
    processor = ColumnLogProcessor(call_chain)
    processor.parse(sys.stdin)
4. Create a new input file and name it example-b.log. Enter test data exactly as follows.
/1 1000
/2 96316
/3 84722
/4 81712
/5 19231
5. Finally, run the updated source code. If you entered everything correctly, your
output should be as follows.
(text_processing)$ cat example-b.log | python logscan-b.py -s 1000
What just happened?
We added support for a new log input format simply by replacing the split method of
our log processor. We did this by inheriting from LogProcessor and creating a new class,
overriding split.
There are no additional changes required to support an entirely new format. As long as your
new LogProcessor class implements the required methods and returns the proper values,
it's a piece of cake. Your LogProcessor subclass could have done something much more
elaborate, such as process each line via regular expressions or handle missing elements
gracefully.
Conversely, adding new call_chain methods is just as easy. As long as the function in the
list takes a dictionary as input, you can add new processing methods as well.
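As a rough sketch of that last point, a call chain entry does not have to be a method on a handler class. Any callable that accepts the parsed dictionary will do. The function add_to_total and the totals dictionary below are hypothetical, and the parsed lines are written out by hand rather than produced by LogProcessor.

```python
# Running total kept in a dictionary so the function can update it.
totals = {'bytes': 0}


def add_to_total(fields):
    # Same contract as MaxSizeHandler.process: one parsed-line dictionary in.
    totals['bytes'] += fields['size']


call_chain = [add_to_total]

# Hand-written stand-ins for what a split method would return.
parsed_lines = [
    {'size': 65383, 'file_requested': '/a'},
    {'size': 888, 'file_requested': '/d'},
]

# The same dispatch loop that parse runs for every line.
for fields in parsed_lines:
    for func in call_chain:
        func(fields)
```

After the loop, totals['bytes'] holds the sum of both sizes.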

Have a go hero – creating a new processing class
In these examples, we've printed a warning if a file exceeds a threshold. Instead, what if we
wanted to warn if a file was below a given threshold? This might be useful if we thought our
web server was truncating results or returning invalid data. Your job is to add a new handler
class to the call_chain that warns if a file is below a specific size. It should be able to run
side-by-side along with the existing MaxSizeHandler handler.
Accessing files directly
Up until now, we've read all of our data via a standard input pipe. This is a perfectly
acceptable and extensible way of handling input. However, Python provides a very simple
mechanism for accessing files directly. There are situations where direct file access is
preferable. For example, perhaps you're accessing data from within a web application and
using standard IO just isn't possible.
Time for action – accessing files directly
Let's update our LogProcessor so that we can pass a file on the command line rather than
read all of our data via sys.stdin.
1. Create a new file named logscan-c.py, using logscan.py as your template.
We'll be adding file access support to this original "combined format" processor.
2. Update the code in the __name__ == '__main__' section as follows.
if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-s', '--size', dest="size",
        help="Maximum File Size Allowed",
        default=0, type="int")
    parser.add_option('-f', '--file', dest="file",
        help="Path to Web Log File", default="-")
    opts,args = parser.parse_args()
    call_chain = []
    if opts.file == '-':
        file_stream = sys.stdin
    else:
        try:
            file_stream = open(opts.file, 'r')
        except IOError, e:
            print >>sys.stderr, str(e)
            sys.exit(-1)
    size_check = MaxSizeHandler(opts.size)
    call_chain.append(size_check.process)
    processor = LogProcessor(call_chain)
    processor.parse(file_stream)
3. Run the updated application from the command line as follows:
(text_processing)$ python logscan-c.py -s 1000 -f example.log
What just happened?
There are a couple of things here that are new. First, we added a second option to our
command-line parser. Using a -f or a --file switch, you can now pass in the name of a
logfile you wish to parse. We set the default value to a single dash, which signifies we should
use sys.stdin as we did in our earlier examples. Using a dash in this manner is common
with command-line-based utilities such as tar and cat.
Next, if an actual file name was passed via our new switch, we're going to open it here via
Python's built-in open function. open returns a file object, which we bind to the name
file_stream. The first argument to open is the file name; the second is the mode we wish to use.
>>> open('/etc/hosts', 'r')
<open file '/etc/hosts', mode 'r' at 0x10047d250>
>>>
Notice that if a file name wasn't passed in, we simply assign sys.stdin to file_stream.
Both of these objects are considered to be file-like objects. They implement the same set
of core functionality, though the input sources are different. This is another example of
polymorphism.
Finally, we've wrapped our open call in a try/except block in order to catch any
exceptions that may bubble up from the open function. In this example, we are catching
IOErrors only. Any other programming error triggered inside the try block will simply
trigger a stack trace.

The Python exception hierarchy is described in detail at
http://docs.python.org/library/exceptions.html#exception-hierarchy.
Errors generated during Input/Output operations generally raise IOError
exceptions. You should take some time to familiarize yourself with the layout of
Python's exception classes.
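The try/except pattern around open can be tried in isolation. This is a minimal sketch: the helper name open_or_none is hypothetical, and the exception is not bound to a name so that the same text should run on Python 2.6 and on Python 3 (where IOError is an alias of OSError).

```python
import os
import tempfile


def open_or_none(path):
    """Return an open file object, or None if the file cannot be opened."""
    try:
        return open(path, 'r')
    except IOError:
        # Only I/O failures are handled here; any other error still
        # propagates and produces a stack trace.
        return None


# A path inside a fresh temporary directory, so it cannot exist yet.
missing = os.path.join(tempfile.mkdtemp(), 'no-such-file.log')
result = open_or_none(missing)
```

Returning None is just one possible policy; the script in this chapter prints the error and exits instead.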
The open function is a built-in factory for Python file objects. It is possible to call the file
type directly, but that is discouraged. In later versions of Python, a call to open actually
returns a layered IO object and not just a simple file class.
It's possible to open a file in either text or binary mode. By default, a file is opened using text
mode. To tell Python that you're working with binary data, you simply need to pass a b in
as an additional mode flag. So, if you wanted to open a file for appending binary data, you
would use a flag of ab. Binary mode is only significant on DOS/Windows systems. When text
data is written on a Windows machine, each newline is converted to a carriage
return-newline combination. The file object needs to take that into account.
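The mode flags can be combined as described. The sketch below writes to a temporary file (the file name data.bin is arbitrary) and reads it back in binary mode, where no newline translation takes place on any platform.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.bin')

# 'wb': create or truncate the file and write raw bytes.
with open(path, 'wb') as handle:
    handle.write(b'first\r\n')

# 'ab': append raw bytes to the end of the existing file.
with open(path, 'ab') as handle:
    handle.write(b'second\r\n')

# 'rb': read the bytes back exactly as they were stored.
with open(path, 'rb') as handle:
    raw = handle.read()
```

Because every step uses binary mode, raw contains both carriage return and newline bytes exactly as written.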
Astute readers should have noticed that we never actually closed the file. We simply left it open
and allowed the operating system to reclaim resources when we were finished. While this is
alright for small applications like this, we need to be careful to close all files in real applications.
Context managers
The with statement has been a Python fixture since 2.5 (where it must be enabled with
from __future__ import with_statement; from 2.6 on it is always available). The statement
allows the developer to create a new code block while holding a resource. When the code block
exits, the resource is automatically closed. This is true even if the code block exits in error.
It's also possible to use context managers for other resources as the context
manager protocol is quite extensible.
The following example illustrates the use of a context manager.
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more
information.
>>> with open('/etc/passwd') as f:
...     for line in f:
...         if line.startswith('root:'):
...             print line
...
root:*:0:0:System Administrator:/var/root:/bin/sh
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file
>>>
In this example, we opened our system password database and assigned the value returned
by the open function to f. While we were in the subsequent block, we were able to perform
file IO as we normally would.
When we exited the block by decreasing the indent, the context manager associated
with the file object ensured the file was automatically closed for us. This is evident by the
exception raised when we tried to simply read the object outside of the with statement.
Note that while the name f still refers to a valid object, the underlying file descriptor has already
been closed.
To achieve the same closed-file guarantee without the with statement, we would need to
do something such as the following.
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more
information.
>>> try:
...     f = open('/etc/hosts')
...     print len(f.read())
... finally:
...     try:
...         f.close()
...     except NameError:
...         pass
...
345
>>>
Here, the code within the finally block is executed whether or not the preceding try
block completes successfully. Within our finally block, we've nested yet another try. This
is because if the original open had failed, then f was never bound. Attempting to close it
would result in a NameError exception when f is looked up!
You're encouraged to take advantage of the with statement as it's a wonderful way to avoid
file descriptor leaks within long-running applications.
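Context managers are not limited to files. One way to build your own, assuming the standard library's contextlib module, is the contextmanager decorator. The tracked function and the events list below are hypothetical names used only to record the order in which things happen.

```python
from contextlib import contextmanager

events = []


@contextmanager
def tracked(name):
    # Code before the yield runs when the with block is entered.
    events.append('enter ' + name)
    try:
        yield name
    finally:
        # Code in the finally clause runs when the block exits,
        # even if the block raised an exception.
        events.append('exit ' + name)


try:
    with tracked('resource') as value:
        events.append('using ' + value)
        raise ValueError('something went wrong')
except ValueError:
    events.append('error handled')
```

The exit entry is recorded before the error is handled, which mirrors how a file is closed even when the block fails.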

Handling other file types
As we've seen, the Python file-like object is a powerful thing. But, there's more. Let's imagine
for a second that your server logfiles are compressed in order to save on storage space. We
can make one more simple change to our script so that we have native support for common
compression formats.
Time for action – handling compressed files
In this example, we'll add support for common compression formats using Python's
standard library.
1. Using the code in logscan-c.py as your starting point, create logscan-d.py.
Add a new function just below the MaxSizeHandler class.
def get_stream(path):
    """
    Detect compression.

    If the file name ends in a compression
    suffix, we'll open it using the correct
    algorithm. If not, we just return a standard
    file object.
    """
    _open = open
    if path.endswith('.gz'):
        _open = gzip.open
    elif path.endswith('.bz2'):
        # Python 2.6's bz2 module has no open function; BZ2File is its opener.
        _open = bz2.BZ2File
    return _open(path)
2. Within our main section, update the line that reads open(opts.file, 'r') to read
get_stream(opts.file).
3. At the top of the listing, ensure that you're importing the two new compression
modules referenced in get_stream.
import gzip
import bz2

4. Finally, we can compress our example log using GZIP and run our log scanner as we
have in earlier examples.
(text_processing)$ gzip example.log
(text_processing)$ python logscan-d.py -f example.log.gz -s 1000
What just happened?
In this example, we added support for both GZIP and BZ2 compressed files as supported by
Python's standard library.
The bulk of the new functionality resides in the get_stream function we've added. We
look at the file extension provided by the user and make a determination as to which open
function we want to use. If the file appears to be compressed, we'll use a
compression-specific approach. If the file appears to be plain text, we'll default to the built-in open
function we used in our earlier examples.
In order to add our new functionality into the mix, we've replaced our call to open within the
main code to reference our new get_stream function.
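The suffix-based dispatch can be exercised without an existing log file. The sketch below is a variation on get_stream rather than the chapter's exact code: the helper name open_by_suffix is hypothetical, every stream is opened in binary mode so all three return bytes, and the sample data is written to a temporary directory first.

```python
import bz2
import gzip
import os
import tempfile


def open_by_suffix(path):
    """Pick an opener based on the file name suffix (hypothetical helper)."""
    if path.endswith('.gz'):
        return gzip.open(path, 'rb')
    if path.endswith('.bz2'):
        return bz2.BZ2File(path, 'rb')
    return open(path, 'rb')


workdir = tempfile.mkdtemp()
payload = b'/a 65383\n/d 888\n'
paths = {
    'plain': os.path.join(workdir, 'example.log'),
    'gzip': os.path.join(workdir, 'example.log.gz'),
    'bz2': os.path.join(workdir, 'example.log.bz2'),
}

# Write the same payload in three formats.
writers = {
    'plain': open(paths['plain'], 'wb'),
    'gzip': gzip.open(paths['gzip'], 'wb'),
    'bz2': bz2.BZ2File(paths['bz2'], 'wb'),
}
for out in writers.values():
    out.write(payload)
    out.close()

# Read each file back through the same function.
contents = {}
for name, path in paths.items():
    stream = open_by_suffix(path)
    try:
        contents[name] = stream.read()
    finally:
        stream.close()
```

All three entries in contents hold the same bytes, which is the point: the caller never needs to know which format was on disk.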
Implementing file-like objects
As mentioned earlier, objects can be used interchangeably as long as they provide the same
set of externally facing methods. This is referred to as implementing a protocol, or more
commonly, an interface. Languages such as Java, C#, and Objective-C utilize strict interfaces
that require a developer to implement a minimum set of functionality within a class.
Python, on the other hand, does not enforce such restrictions. Python's type system is referred
to as Duck Typing. If it looks like a duck and quacks like a duck, then it must be a duck.
While Python itself does not support strict interfaces, there are third-party
libraries available designed to fill that perceived gap. The Zope project is heavily
based on a library-based interface system. For more information, see
http://www.zope.org/Products/ZopeInterface.
Probably the most common protocol you'll see within Python code is the file-like object. Not
surprisingly, a file-like object is a Python object designed to "stand in" for a real file object.
The compression streams, as well as the sys.stdin pipe that we looked at earlier, are all
examples of a file-like object.

These objects do not necessarily need to implement all of the methods associated with a
real file object. For example, a read-only object needs to only implement the proper read
methods, and a socket stream doesn't need to implement a seek method.
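To make this concrete, here is a deliberately small read-only stand-in. The class ListStream and the function count_lines are hypothetical; the class implements only iteration, read, and close, and has no seek or write at all.

```python
class ListStream(object):
    """Hypothetical read-only stand-in for a file, backed by a list."""

    def __init__(self, lines):
        self._lines = list(lines)
        self.closed = False

    def __iter__(self):
        # Supports "for line in handle", the way parse consumes its input.
        return iter(self._lines)

    def read(self):
        return ''.join(self._lines)

    def close(self):
        self.closed = True


def count_lines(handle):
    # Works with any object that can be iterated line by line.
    return sum(1 for line in handle)


stream = ListStream(['/1 1000\n', '/2 96316\n'])
line_count = count_lines(stream)
everything = stream.read()
stream.close()
```

An object like this can be useful in tests, where creating real files for every case would be tedious.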
File object methods
Let's take a closer look at some of the methods found on a standard file object. It's important
to understand file objects as proper IO and data access can dramatically affect the speed
and performance of a data-bound application. This is not an all-inclusive list. To see a
detailed breakdown, visit http://docs.python.org/library/stdtypes.html#file-objects.
Objects are free to implement as many of these as they wish, so be prepared to deal with
exceptions if you're not certain where your file object is coming from.
close
The close method is responsible for flushing data and closing the underlying file descriptor.
Any attempt to access a file after it has been closed will result in a ValueError exception.
This also sets the .closed attribute to True. Note that it is possible to call the close
method more than once without triggering an error.
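The behavior described for close can be checked with a throwaway file. The file name and variable names below are arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'close-demo.txt')

handle = open(path, 'w')
handle.write('data\n')
handle.close()
handle.close()              # a second close is harmless
was_closed = handle.closed  # set to True by the first close

try:
    handle.write('more\n')  # any further I/O is rejected
    error = None
except ValueError as exc:
    error = exc
```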
fileno
The fileno method returns the underlying integer file descriptor. Many lower-level IO
functions (especially those found in the os module) require a standard system-level file
descriptor.
flush
The flush method causes Python to clear the internal I/O buffer and hand the data to the
operating system. This doesn't perform a disk sync, however, as data may still simply reside in OS memory.
read
The read method will read data from the file object and return it as a string. If a size
argument is passed in then this method will read that much data from the file object, in
bytes. If the size argument is not passed in then read will go until EOF is reached.
readline
The readline method will read a single line from a file, retaining the trailing newline
character. A size argument may be passed in, which limits the amount of data that will be
read. If the maximum size is smaller than line length, an incomplete line may be returned.
Each call returns a successive line in a file.
(text_processing)$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin

Type "help", "copyright", "credits", or "license" for more information.
>>> f = open('/etc/passwd')
>>> f.readline()
'##\n'
>>> f.readline()
'# User Database\n'
>>> f.readline()
'# \n'
>>> f.readline()
'# Note that this file is consulted directly only when the system is running\n'
>>>
This is a convenient method to extract the first line of a file; however, there are better
methods if you wish to simply loop through the contents of a text file.
readlines
This method reads each line of a file into a list, until it reaches EOF. Each element of the list
is one line within a file. As with the readline method, each line retains its trailing newline.
This method is acceptable for smaller files, but can trigger heavy memory use if used on
larger files.
The idiomatic way to loop through a text file is to loop on the file object directly, as we've
done in previous examples.
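The two approaches can be compared side by side on a small temporary file. The file name and contents are made up for this sketch.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(path, 'w') as out:
    out.write('one\ntwo\nthree\n')

# readlines builds the complete list in memory at once.
with open(path) as handle:
    all_lines = handle.readlines()

# Looping on the file object only holds one line at a time.
longest = 0
with open(path) as handle:
    for line in handle:
        longest = max(longest, len(line.rstrip('\n')))
```

For a three-line file the difference is irrelevant, but on a multi-gigabyte log only the second form keeps memory use flat.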
seek
As IO is performed, an offset within the instance is changed accordingly. Subsequent reads (or
writes) will take place at that current location. The seek method allows us to manually set that
offset value. To expand upon the readline example from above, let's introduce a seek.
(text_processing)$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> f = open('/etc/passwd')
>>> f.readline()
'##\n'
>>> f.readline()
'# User Database\n'
>>> f.seek(0)
>>> f.readline()
'##\n'
>>> f.readline()
'# User Database\n'
>>>

Notice how the call to seek moves us back to the beginning of the file and we begin reading
the same data a second time. This method is frequently left out of non file-based file-like
objects, or is coded as a null operation.
tell
This is the counterpart to seek. Calling tell returns the current location of the file pointer
as an integer offset.
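seek and tell can be watched working together. This sketch uses a temporary file opened in binary mode, an assumption made so that the reported offsets are plain byte counts on every platform; the two lines written mimic the start of the password file shown earlier.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'offsets.txt')
with open(path, 'wb') as out:
    out.write(b'##\n# User Database\n')

with open(path, 'rb') as handle:
    start = handle.tell()          # offset before any read
    first = handle.readline()
    after_first = handle.tell()    # offset just past the first line
    handle.seek(0)                 # jump back to the beginning
    again = handle.readline()      # the first line is returned again
```

The first line is three bytes long ('#', '#', and the newline), so after_first is 3.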
write
The write method simply takes a source argument and writes it to the open file. It is
not possible to pass in a desired size; the entire string is pushed to disk. If you wish to only
write a portion then you should limit the size via string slicing. A flush or a close may be
required before the data written appears on disk. String slicing is covered in our chapter on
Python String Services.
writelines
The writelines method is the counterpart to the readlines method. Given a list or a
sequence of strings, they will be written to the file. Newlines are not automatically added
(just as they are not automatically stripped from readlines). This is generally equivalent to
calling write for each element in a list.
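Because writelines adds nothing of its own, the caller supplies the newlines. A minimal sketch with made-up data:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'columns.log')
lines = ['/1 1000', '/2 96316']

with open(path, 'w') as out:
    # Each string gets its own newline appended before it is written.
    out.writelines(line + '\n' for line in lines)

with open(path) as handle:
    written = handle.read()
```

Passing lines directly, without the added '\n', would produce a single run-together line.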
Remember that not all of these methods need to be implemented on all file-like
objects. It's up to you to implement what is needed and convey that via proper
documentation.
Enabling universal newlines
Python utilizes a universal newlines system. Remember that the end-of-line marker varies
by operating system. On Unix and Unix derivatives, a line is marked with a \n terminator. On
Windows systems, a line ends with a \r\n combination.
Universal newlines support abstracts that out and presents each end-of-line marker as a \n
to the programmer. To enable this support, append a U to the mode string when calling the
built-in open function.
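The effect of universal newlines can be observed by writing mixed line endings in binary mode and reading them back as text. Note the substitution in this sketch: it uses io.open with newline=None instead of the U flag, since recent Python 3 releases no longer accept U, while io.open is available from Python 2.6 onward and behaves the same way for this purpose.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'mixed.txt')

# Three different end-of-line conventions in one file.
with open(path, 'wb') as out:
    out.write(b'unix\nwindows\r\nold mac\r')

# newline=None enables universal newlines on input.
with io.open(path, 'r', newline=None) as handle:
    lines = handle.readlines()
```

Every line comes back terminated by a plain \n, regardless of how it was stored.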
Accessing multiple files
Let's take a little break from our LogProcessing application and look at Python's
fileinput module. In situations where you need to open more than one file and iterate
through the contents of each sequentially, this module can be a great help.
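Before the larger example, here is fileinput on its own. The two files are created in a temporary directory purely for this sketch; in a script you would usually let fileinput.input() pick the names up from the command line.

```python
import fileinput
import os
import tempfile

workdir = tempfile.mkdtemp()
first = os.path.join(workdir, 'first.txt')
second = os.path.join(workdir, 'second.txt')

with open(first, 'w') as out:
    out.write('a\nb\n')
with open(second, 'w') as out:
    out.write('c\n')

seen = []
# One loop walks through both files, one after the other.
for line in fileinput.input(files=[first, second]):
    seen.append((os.path.basename(fileinput.filename()),
                 fileinput.filelineno(),
                 line.rstrip('\n')))
fileinput.close()
```

filelineno restarts at 1 for each file, which is why the single line of second.txt is reported as line 1.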

Note that as of the time of writing, the PyEnchant modules were not compatible
with Python 3. Therefore, these examples will only work with Python 2.
Time for action – spell-checking HTML content
In this example, we'll build a small application that can be used to check spelling in a
collection of HTML documents. We'll utilize the PyEnchant library here, which is based
upon the Enchant spell-check system.
1. Step into the virtual environment that we've created for our examples and run the
activate script for your platform.
2. Next, we'll install the pyenchant libraries using the easy_install utility.
The spell-check system is available on PyPI. Note that you must already have
the Enchant system installed on your workstation. Ubuntu users can install the
libenchant1c2a library. Windows users should follow the instructions at
http://www.abisource.com/projects/enchant/. There are binary packages available.
You may also need to install the en_US dictionary, which is also covered at the
previous URL.
3. Using easy_install, we'll add the PyEnchant libraries to our virtual
environment.
(text_processing)$ easy_install pyenchant
Searching for pyenchant
Reading http://pypi.python.org/simple/pyenchant/
Reading http://pyenchant.sourceforge.net/
Best match: pyenchant 1.6.1
Downloading http://pypi.python.org/packages/2.6/p/pyenchant/
pyenchant-1.6.1-py2.6.egg#md5=21d991be432cc92781575b42225a6d3e
Processing pyenchant-1.6.1-py2.6.egg
creating /home/jmcneil/text_processing/lib/python2.6/site-
packages/pyenchant-1.6.1-py2.6.egg
Extracting pyenchant-1.6.1-py2.6.egg to /home/jmcneil/text_
processing/lib/python2.6/site-packages
Adding pyenchant 1.6.1 to easy-install.pth file
Installed /home/jmcneil/text_processing/lib/python2.6/site-
packages/pyenchant-1.6.1-py2.6.egg
Processing dependencies for pyenchant
Finished processing dependencies for pyenchant
(text_processing)$

4. Create this first HTML file and name it index.html. This will be the main page of
our very basic website.
<html>
<head>
<title>Welcome to our home page</title>
</head>
<body>
<h1>Unladen Swallow Spped</h1>
There is an ongoing debate in the Python community regarding
the speed of an unladen swallw. This site aims to settle
that debate.
<ul>
<li><a href="air_speed.html">Air Speed</a>
</ul>
</body>
</html>
Now create this second HTML file and name it air_speed.html, as
referenced in the anchor tag above.
<html>
<head>
<title>Air speed</title>
</head>
<body>
In order to maintain speed, a swallow must flap its wings 32
times per second?
</body>
</html>
5. Finally, we'll create our code. Create the following file and name it
html_spelling.py. Save it and exit your editor.
import fileinput
import enchant
from enchant.tokenize import get_tokenizer
from enchant.tokenize import HTMLChunker

__metaclass__ = type

class HTMLSpellChecker:
    def __init__(self, lang='en_US'):
        """
        Setup tokenizer.

        Create a new tokenizer based on lang.
        This lets us skip the HTML and only
        care about our contents.
        """
        self.lang = lang
        self._dict = enchant.Dict(self.lang)
        self._tk = get_tokenizer(self.lang,
            chunkers=(HTMLChunker,))

    def __call__(self, line):
        # Yield each misspelled word along with its suggestions.
        for word, off in self._tk(line):
            if not self._dict.check(word):
                yield word, self._dict.suggest(word)

if __name__ == '__main__':
    check = HTMLSpellChecker()
    for line in fileinput.input():
        for word, suggestions in check(line):
            print ("error on line %d (%s) in file %s. "
                   "Did you mean one of %s?" %
                   (fileinput.filelineno(), word,
                    fileinput.filename(),
                    ', '.join(suggestions)))
6. Run the last script using the HTML files we created as input on the command line.
If you've entered everything correctly, you should see the following output. Note
we've reformatted here to avoid potentially confusing line-wrapping.
(text_processing)$ python html_spelling.py *.html
What just happened?
We took a look at a few new things in this example, in addition to Python's fileinput
module. Let's step through this example slowly as there's quite a bit going on.
First of all, we imported all of our necessary modules. Following the standards, we first
imported the modules that are part of the Python standard library, and then the required
third-party packages. In this case, we're using the third-party PyEnchant toolkit.

Next, we bump into something that's probably unfamiliar to you: __metaclass__ = type.
The core Python developers changed the class implementation (for the better) with the
release of Python 2.2. We have both new style and old style classes. New style classes must
inherit from object in some manner, or be explicitly assigned a metaclass of type. This
is a neat little trick that tells Python to create only new style classes in this module.
Our HTMLSpellChecker class is responsible for performing the spell-check. In the
__init__ method, we create both a dictionary (which has no relation to the built-in
dict type) and a tokenizer. We'll use the dictionary both to check spelling and to ask for
suggestions if we've found a misspelled word. The tokenizer object will be used to split
each line into its component parts. The chunkers=(HTMLChunker,) argument tells
Enchant that we're working with HTML, and that it should automatically strip markup. The
provided HTMLChunker class saves us some extra work, though we'll cover how to do that
via regular expressions later in the book.
Next, we define a __call__ method. This method is special as it is executed each time a
Python object is called directly, as if it were a function.
(text_processing)$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> class A(object):
...     def __call__(self):
...         print "A is for Apple"
...
>>> a = A()
>>> a()
A is for Apple
>>> a.__call__()
A is for Apple
>>>
This example illustrates the usage of a __call__ method in detail. Notice how we can
simply treat our object as if it were a function. Of course, it's also possible to call the
__call__ method directly.
Within the body of the __call__ method, we tokenize each line, using the tokenizer we
created within __init__. PyEnchant strips out the HTML for us. Each word is then validated
via the dictionary. If it is not found, the application will provide a list of suggestions. The
yield keyword marks this method as a generator, so we yield each spelling error and its
suggestions back to our caller.
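The callable-generator pattern can be tried without installing PyEnchant. The following is only a sketch: SimpleSpellChecker and its KNOWN_WORDS set are hypothetical stand-ins for the Enchant dictionary, and the "suggestions" are simply known words sharing the same first letter.

```python
import re

class SimpleSpellChecker(object):
    """Callable checker; KNOWN_WORDS is a made-up stand-in for enchant.Dict."""

    KNOWN_WORDS = frozenset(['the', 'speed', 'of', 'an', 'unladen', 'swallow'])

    def __call__(self, line):
        # Because of yield, calling the instance returns a generator
        # that produces one (word, suggestions) pair per unknown word.
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() not in self.KNOWN_WORDS:
                close = sorted(w for w in self.KNOWN_WORDS
                               if w[0] == word[0].lower())
                yield word, close

check = SimpleSpellChecker()
errors = list(check('the speed of an unladen swallw'))
print(errors)
```

Nothing is computed until the caller iterates, which is why the main loop in the listing can consume errors one at a time.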
Now, we get to our main content. The first line is familiar. We're simply creating an instance
of our HTMLSpellChecker class. The next section is where we put fileinput to use.

The call to fileinput.input creates an iterator that transparently chains together all of
the files we passed in on the command line. The helper functions fileinput.filelineno
and fileinput.filename give us the current file's line number and the current file's
name, respectively.
In Python, an iterator is a type of object that implements an interface that
allows the developer to easily iterate through its contents. For more information
on iteration, see http://docs.python.org/library/stdtypes.html#iterator-types.
You may have noticed that we don't actually pass any file names to the fileinput.input
function. The module actually defaults to the values on the command line, and assumes they
are valid paths. If nothing is passed on the command line then the module will fall back to
standard input. It is possible to bypass this behavior and pass in our own list of files.
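Here is a small sketch of passing an explicit list of files rather than relying on the command line. The two temporary files and their contents are invented for the example; fileinput.lineno is the module's cumulative counter, shown next to the per-file filelineno.

```python
import fileinput
import os
import tempfile

# Create two small throwaway files to stand in for real input.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [('a.txt', 'one\ntwo\n'), ('b.txt', 'three\n')]:
    path = os.path.join(tmpdir, name)
    with open(path, 'w') as handle:
        handle.write(text)
    paths.append(path)

seen = []
try:
    for line in fileinput.input(paths):
        seen.append((os.path.basename(fileinput.filename()),
                     fileinput.filelineno(),  # restarts for each file
                     fileinput.lineno(),      # counts across all files
                     line.rstrip('\n')))
finally:
    fileinput.close()  # release the module-level state

print(seen)
```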
Simplifying multiple file access
The fileinput module takes a lot of the complexity out of opening and managing multiple
files. In addition to the current file and line number, it's possible to look at things such as
the absolute line number among all files and to access file object-specific items such as a
file's integer descriptor.
Using a classic approach, one would need to open each file manually and iterate through,
retaining overall position information.
As we said previously, it's possible to use fileinput without relying on the value of the
command-line arguments in sys.argv. The fileinput.input function takes an optional
list of files to read rather than working with the default.
A drawback in using the module-level functions is that we'll be creating a single instance
of fileinput.FileInput under the covers, which holds global state. This means
that we cannot have more than one iterator active at any one time and that it's not a
thread-safe operation.
Thankfully, we can easily overcome these limitations by building our own instance of
fileinput.FileInput rather than relying on the module-level convenience functions.
>>> import fileinput
>>> input = fileinput.FileInput(['/etc/hosts'])
>>> for line in input:
...     if line.startswith('127'):
...         print line
...
127.0.0.1 localhost
>>>

Each fileinput.FileInput instance contains the same methods available to us at the
module-level, though they all operate on their own separate context and do not interfere
with each other.
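To illustrate that separate instances keep separate state, the sketch below interleaves reads from two FileInput objects. The file names and contents are assumptions made for the example.

```python
import fileinput
import os
import tempfile

# Two throwaway files; names and contents are made up for the sketch.
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, 'a.txt')
path_b = os.path.join(tmpdir, 'b.txt')
with open(path_a, 'w') as handle:
    handle.write('a1\na2\n')
with open(path_b, 'w') as handle:
    handle.write('b1\nb2\n')

first = fileinput.FileInput([path_a])
second = fileinput.FileInput([path_b])
try:
    # Interleave reads; each instance tracks its own position.
    read = [next(first), next(second), next(first)]
    positions = (first.lineno(), second.lineno())
finally:
    first.close()
    second.close()

print(read, positions)
```

With the module-level functions this interleaving would not be possible, since only one input sequence can be active at a time.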
Inplace filtering
Finally, the fileinput module contains an inplace filter feature that isn't very widely
utilized. If the fileinput.input function is called with an inplace=1 keyword argument,
or if inplace=1 is passed to the fileinput.FileInput constructor, the opened files
are renamed to backup files and standard output is redirected to the original file. Inplace
filtering is disabled when reading from standard input.
For example, take a look at the following snippet of code.
import sys
import fileinput
# Iterate through all lines and
# convert everything to uppercase.
for line in fileinput.input(inplace=1, backup='.bak'):
    sys.stdout.write(line.upper())
Running this script with a text file on the command line will first generate a backup of the
text file, ending in a .bak extension. Next, the original file will be overwritten with whatever
is printed to standard output. Specifically, we're simply translating all of the text to
uppercase here.
If you accidentally divide by zero and don't handle the exception, your destination file can be
left in a corrupted state as your application may exit unexpectedly before you write any data
to your file.
When using this approach, ensure you're properly handling exceptions as your
file will be opened in write mode and truncated accordingly.
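One cautious way to experiment with inplace filtering is on a temporary file, closing the FileInput instance in a finally clause so that standard output is restored even if the loop fails. This is a sketch only; the file name notes.txt and its contents are invented.

```python
import fileinput
import os
import sys
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'notes.txt')  # hypothetical sample file
with open(path, 'w') as handle:
    handle.write('spam\neggs\n')

reader = fileinput.FileInput([path], inplace=True, backup='.bak')
try:
    for line in reader:
        # While iterating, sys.stdout points at the rewritten file.
        sys.stdout.write(line.upper())
finally:
    reader.close()  # puts the real sys.stdout back

with open(path) as handle:
    result = handle.read()
with open(path + '.bak') as handle:
    original = handle.read()
```

Because a backup extension was supplied, the untouched original is still available in notes.txt.bak if something goes wrong.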
Pop Quiz – file-like objects
1. As we've seen, file-like objects do not necessarily need to implement all of the
standard file object's methods. If an attempt is made to run a method and that
method does not exist, what happens?
2. In what situation might you be better off using the readlines method of a file
versus iterating over the file object itself?
3. What happens if you attempt to open a text file and you specify binary mode?
4. What is the difference between a file object and a file-like object?

Accessing remote files
We've now had a somewhat complete crash-course in Python I/O. We've covered files,
file-like objects, handling multiple files, writing filter programs, and even modifying files
"inplace" using some slightly esoteric features of the fileinput module.
Python's standard library contains a whole series of modules, which allow you to access data
on remote systems almost as easily as you would access local files. Through the file-like object
protocol, most I/O is transparent once the protocol-level session has been configured and
established.
Time for action – spell-checking live HTML pages
In this example, we'll update our HTML spell-checker so that we can check pages that are
already being served, without requiring local access to the file system. To do this, we'll make
use of the Python urllib2 module.
1. We'll be using the html_spelling.py file as our base here, so create a copy of it and
name the file html_spelling-b.py.
2. At the top of the file, update your import statements to include urllib2, and
remove the fileinput module as we'll not take advantage of it in this example.
import urllib2
import enchant
import optparse
3. Now, we'll update our module-level main code and add an option to accept a URL on
the command line.
if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', help="URL to Check")
    opts, args = parser.parse_args()
    if not opts.url:
        parser.error("URL is required")
4. Finally, change the fileinput.input call to reference urllib2.urlopen, add a
line number counter, and polish up the output content.
    # Count lines of the page, starting from 1.
    lineno = 0
    for line in urllib2.urlopen(opts.url):
        lineno += 1
        for word, suggestions in check(line):
            print ("error on line %d (%s) on page %s. "
                   "Did you mean:\n\t%s" %
                   (lineno, word, opts.url, ', '.join(suggestions)))

5. That should be it. The final listing should look like the following code. Notice how
little we had to change.
import urllib2
import enchant
import optparse
from enchant.tokenize import get_tokenizer
from enchant.tokenize import HTMLChunker

__metaclass__ = type

class HTMLSpellChecker:
    def __init__(self, lang='en_US'):
        """
        Setup tokenizer.

        Create a new tokenizer based on lang.
        This lets us skip the HTML and only
        care about our contents.
        """
        self.lang = lang
        self._dict = enchant.Dict(self.lang)
        self._tk = get_tokenizer(self.lang,
            chunkers=(HTMLChunker,))

    def __call__(self, line):
        for word, off in self._tk(line):
            if not self._dict.check(word):
                yield word, self._dict.suggest(word)

if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', help="URL to Check")
    opts, args = parser.parse_args()
    if not opts.url:
        parser.error("URL is required")
    check = HTMLSpellChecker()
    # Count lines of the page, starting from 1.
    lineno = 0
    for line in urllib2.urlopen(opts.url):
        lineno += 1
        for word, suggestions in check(line):
            print ("error on line %d (%s) on page %s. "
                   "Did you mean:\n\t%s" %
                   (lineno, word, opts.url, ', '.join(suggestions)))

6. Now, run the application with a URL passed in on the command line. If it was coded
correctly, your output should resemble the following.
(text_processing)$ python html_spelling-b.py --url=http://www.jmcneil.net
What just happened?
By simply changing a few lines of code, we were able to access a web page and scan for
spelling errors almost exactly as we did when we checked our local files. Of course, you're
seeing a limitation of our dictionary here. Our spell-checker sees words such as DOCTYPE,
DTD, and HTML as misspelled as they do not fall under the en_US category.
We could fix this by adding a custom dictionary to the spell-checker that includes technical
lingo, but the goal in this example is to introduce I/O with the urllib2 module.
One important thing to note is that the urllib2.urlopen function supports more than just
the HTTP protocol. You can also access files using the secure-sockets layer by simply passing
in an HTTPS URL. It's even possible to access local files by passing a file:// URL into the
urllib2.urlopen function.
Yes, there is a urllib module. It is simply named urllib. The newer urllib2 module
is far more extensible and is recommended. However, it can be a bit tricky to
understand in detail. There is a great reference available out there that describes
some of the intricacies in a simple manner. The document is titled "urllib2: The
Missing Manual" and is available at
http://www.voidspace.org.uk/python/articles/urllib2.shtml.
The urllib2.urlopen function can also directly access files via the FTP protocol. It's quite
simple; the URL you pass into urlopen simply needs to begin with ftp://.
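The file:// case is easy to try without a network connection. The sketch below assumes Python 3, where the urlopen function lives in urllib.request rather than urllib2; the temporary page and its contents are invented for the example.

```python
import os
import pathlib
import tempfile
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'page.html')  # hypothetical local page
with open(path, 'wb') as handle:
    handle.write(b'<h1>Unladen Swallow</h1>\n<p>Air speed</p>\n')

url = pathlib.Path(path).as_uri()  # builds a file:// URL from the path
response = urlopen(url)
try:
    # The response is a file-like object, so we can iterate over lines.
    lines = [line.decode('ascii') for line in response]
finally:
    response.close()

print(url.startswith('file://'), lines)
```

The loop body is the same as it would be for an HTTP or FTP URL, which is the point of the file-like object protocol.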
Have a go hero – access web logs remotely
As we've covered both web LogProcessing and the urllib2 module superficially, you should
be able to update our earlier LogProcessing application to access files remotely. You don't
need an external account to try this. Remember, URLs beginning with file:// are valid
urllib2.urlopen URLs. You can make this change and test it locally.

Error handling
By now, you may have noticed that while we're able to access a range of protocols using this
same mechanism, they all potentially return different errors and raise varying exceptions.
There are two obvious solutions to this problem: we could catch each individual exception
explicitly, or simply catch an exception located at the top of the exception hierarchy.
Fortunately, we don't need to take either of those sub-optimal approaches. When an internal
error occurs within the urllib2.urlopen function, a urllib2.URLError exception is
raised. This gives us a convenient way to catch relevant exceptions while letting unrelated
problems bubble up. Let's take a quick look at an example to solidify the point.
Python's exception hierarchy is worth getting to know. You can read up
on exceptions in detail at http://docs.python.org/library/exceptions.html.
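As a rough Python 3 counterpart (where URLError lives in urllib.error), the sketch below catches only URLError and lets anything unrelated propagate. The fetch_or_explain helper and the missing file URL are hypothetical names chosen for illustration.

```python
from urllib.error import URLError
from urllib.request import urlopen

def fetch_or_explain(url):
    """Return (data, None) on success or (None, message) on a URLError."""
    try:
        response = urlopen(url)
    except URLError as error:
        # Only URL-related failures are handled here; other
        # exceptions still bubble up to the caller.
        return None, 'File Download Error: %s' % error.reason
    try:
        return response.read(), None
    finally:
        response.close()

# A path that is assumed not to exist on the local machine.
data, message = fetch_or_explain('file:///no/such/dir/missing.html')
print(message)
```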
Time for action – handling urllib 2 errors
In this example, we'll update our HTML spell-checker in order to handle network errors
slightly more gracefully. Whenever you provide utilities and interfaces to your users, you
should present errors in a clean manner (while logging any valid stack traces).
1. We're going to build off html_spelling-b.py, so copy it over and rename it to
html_spelling-c.py.
2. At the top of the file, add import sys. We'll need access to the methods within
the sys module.
3. Update the __name__ == '__main__' section to include some additional
exception-handling logic.
if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', help="URL to Check")
    opts, args = parser.parse_args()
    if not opts.url:
        parser.error("URL is required")
    check = HTMLSpellChecker()
    try:
        source = urllib2.urlopen(opts.url)
    except urllib2.URLError, e:
        reason = str(e)
        try:
            reason = str(e.reason)
        except AttributeError:
            pass
        print >>sys.stderr, "File Download Error: %s" % reason
        sys.exit(-1)
    # Read from the handle we already opened rather than
    # fetching the URL a second time.
    lineno = 0
    for line in source:
        lineno += 1
        for word, suggestions in check(line):
            print ("error on line %d (%s) on page %s. "
                   "Did you mean:\n\t%s" %
                   (lineno, word, opts.url, ', '.join(suggestions)))
4. You should now be able to execute this code and pass in a pair of invalid URL values,
using different protocols. Your output should be similar to the following.
(text_processing)$ python html_spelling-c.py --url=ftp://localhost
(text_processing)$ python html_spelling-c.py --url=http://www.jmcneil.net/notfound.html
What just happened?
We made a small update to our main code so that we can better handle exceptions bubbling
up from the urllib2 module.
In our exception handler's except clause, we do something that might seem slightly
peculiar. First, we bind the value of str(e) to a variable named reason. Next, we set up
another try/except block and attempt to bind the value of str(e.reason) to that same
reason variable. Why would we do that?
The explanation is simple. Some of the exceptions bubbling up have a reason attribute,
which provides more information. Specifically, the FTP errors contain it. We always try to pull
the more specific error. If it doesn't exist, that will raise an AttributeError exception. We
just ignore it and go with the first value of reason.

Our method of accessing the reason attribute highlights Python's Duck Typing design again.
It would have been possible for us to check whether a reason attribute existed on our
URLError object before attempting to access it. In other words, we could have ensured our
object adhered to a strict interface. This approach is usually dubbed Look Before You Leap.
Instead, we took the other (and more Python standard) way. We just did it and handled the
fallout in the event of an error. This is sometimes referred to as Easier to Ask Forgiveness
than Permission.
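The two styles can be compared side by side. In this sketch, PlainError and DetailedError are hypothetical exception classes standing in for errors with and without a reason attribute; they are not part of urllib2.

```python
class PlainError(Exception):
    """Hypothetical error with no extra attributes."""

class DetailedError(Exception):
    """Hypothetical error that carries a reason attribute."""
    def __init__(self, message, reason):
        Exception.__init__(self, message)
        self.reason = reason

def describe_lbyl(error):
    # Look Before You Leap: test for the attribute first.
    if hasattr(error, 'reason'):
        return str(error.reason)
    return str(error)

def describe_eafp(error):
    # Easier to Ask Forgiveness than Permission: just try it.
    try:
        return str(error.reason)
    except AttributeError:
        return str(error)

for error in (PlainError('boom'), DetailedError('boom', 'timed out')):
    print(describe_lbyl(error), describe_eafp(error))
```

Both functions give the same answers here; the difference is only in where the check happens.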
Finally, we simply printed out a meaningful error and exited our application. If you
look back over the examples in this chapter, you'll notice that it does not matter which
protocol type we use.
Handling string IO instances
There's one more IO library that we'll take a look at in this chapter: Python's StringIO
module. In many of your applications, you're likely to run into a situation where it would be
convenient to write to a location in memory rather than using string operations or direct IO
to a temporary file.
StringIO handles just this. A StringIO instance is a file-like object that simply appends
written data to a location in memory.
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import StringIO
>>> handle = StringIO.StringIO()
>>> handle.write('A')
>>> handle.write('B')
>>> handle.getvalue()
'AB'
>>> handle.seek(0)
>>> handle.write("a")
>>> handle.getvalue()
'aB'
>>>
Looking at the example, you can see that the StringIO instance supports file methods such
as seek and write. By calling getvalue, we're able to retrieve the entire in-memory string
representation.
There's also a cStringIO module, which implements nearly the same interface and is quite
a bit faster, though there are limitations on Unicode values and subclassing that should be
understood before using it. For more information, see the StringIO library documentation
available at http://docs.python.org/library/stringio.html.

The StringIO modules changed a bit between Python 2 and Python 3. Both
the StringIO and the cStringIO modules are gone. Instead, developers
should use io.StringIO for textual data and io.BytesIO for binary data.
There is no longer a differentiation between a pure Python implementation and
the C-level implementation.
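The earlier StringIO session translates fairly directly to the io classes. This is a minimal sketch; the variable names are arbitrary.

```python
import io

# Text data goes through io.StringIO, which expects Unicode strings.
text_handle = io.StringIO()
text_handle.write(u'A')
text_handle.write(u'B')
text_handle.seek(0)
text_handle.write(u'a')  # overwrites the first character
text_value = text_handle.getvalue()

# Binary data goes through io.BytesIO, which expects byte strings.
byte_handle = io.BytesIO()
byte_handle.write(b'\x00\x01')
byte_value = byte_handle.getvalue()

print(repr(text_value), repr(byte_value))
```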
Understanding IO in Python 3
The last thing we'll look at in this chapter is the IO system in Python 3.0. In order to ease
transition, the new IO code has been back-ported to Python 2.6 and is available via the io
module.
The new IO system introduces a layered approach, almost comparable to Java's IO system.
At the bottom lies the IOBase class, which provides commonalities among the IO stream
classes. From there, objects are stacked according to IO type, buffering capability, and
read/write support.
While the details look complex, the actual interface to system IO really doesn't change too
much. For example, the io.open call can generally be used the same way. However, there
are some differences.
Most importantly, binary mode matters. Text will be decoded automatically into Unicode
using the system's locale, or a codec passed in. If a file isn't truly text, it shouldn't be opened
as text. Files opened in binary mode now return a different object type than files opened in
text mode.
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import io
>>> io.open('/etc/hosts')
<io.TextIOWrapper object at 0x10049d250>
>>> io.open('/bin/ls', 'rb')
<io.BufferedReader object at 0x10049d210>
>>>

Notice that opening a file in text mode, which is the default mode, returns a
TextIOWrapper object whereas opening a file in binary mode returns a BufferedReader
object. Although it doesn't appear as a subclass of BufferedIOBase, TextIOWrapper
does actually implement buffered IO.
The new io.open function is intended to replace the built-in open function as of 3.0. As with
the existing function, its result can also be used as a context manager.
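The following sketch shows io.open used with the with statement in both text and binary modes. The file name and the UTF-8 sample text are assumptions; passing an explicit encoding avoids depending on the system locale.

```python
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'greeting.txt')  # hypothetical sample file

# newline='\n' keeps the bytes on disk the same on every platform.
with io.open(path, 'w', encoding='utf-8', newline='\n') as handle:
    handle.write(u'caf\u00e9\n')

with io.open(path, 'r', encoding='utf-8') as handle:
    text = handle.read()  # decoded to Unicode text
    text_type = type(handle).__name__

with io.open(path, 'rb') as handle:
    raw = handle.read()  # raw, undecoded bytes
    binary_type = type(handle).__name__

print(text_type, binary_type)
```

The text read comes back as four characters plus a newline, while the binary read exposes the two-byte UTF-8 encoding of the accented letter.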
For more details on the new Python IO system, see the documentation available at
http://docs.python.org/release/3.0.1/library/io.html. This covers the new IO system
in detail and underscores some of the changes between major Python releases.
Summary
This chapter served as a crash course on Python IO. The goal here is to ensure that you know
how to actually access your data in order to process it.
We covered quite a bit here and really focused on understanding Python's IO system. Most
textual data you'll process will likely come from local disk files, so understanding this material
is important.
You also learned how to build your own file-like objects and take advantage of
polymorphism, a powerful object-oriented development attribute. We covered HTTP and
compressed data, but as you've seen, the underlying access methods do not matter when
the exposed interface follows the file-like object protocol.
In the next chapter, we'll examine text handling using Python's built-in string functions.

Python String Services
Python's built-in string services provide all of the text-processing functionality
you would expect from any full-featured programming language. This includes
methods to search, test, and create new string objects from existing ones.
String objects also provide a C-like format mechanism that allows us to build
new string objects and interpolate them with standard Python values and
user-defined objects. Later versions of Python build on this concept.
Additionally, the actual string objects provide a rich set of methods and
functions that may be used to further manipulate textual string data.
In this chapter, we will:
Cover the basics of Python string and Unicode objects so that you'll understand the
similarities and differences.
Take a detailed look at Python string formatting so that you'll understand how to
easily build new strings. We'll look at the older and more common syntax as well as
the newer formats as defined in PEP-3101.
Familiarize yourself with the methods found on the standard Python string objects
as well as the Unicode components.
Dive into built-in string templating. We'll see more examples on templating in more
detail in Chapter 7, Creating Templates.
Understanding the basics of string object
Python supports both Unicode and ASCII-encoded text data. However, in versions of Python
earlier than 3.0, there are two built-in objects to manage text data. The str type holds standard
byte-width characters, while the unicode type exists to deal with wider Unicode data.
All Python string objects are immutable, regardless of encoding type. This generally means
that methods that operate on strings all return new objects and not modified text. The big
exception to this rule is the StringIO module as covered in Chapter 2, Working with the
IO System. Editing StringIO data via its file-like interface results in manipulation of the
underlying string content.
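A quick sketch makes the immutability concrete: string methods hand back new objects, and an attempt to change a string in place is rejected. The sample text is arbitrary.

```python
original = 'unladen swallow'
shouted = original.upper()  # returns a new string object

try:
    original[0] = 'U'  # strings do not support item assignment
except TypeError as error:
    assignment_error = str(error)
else:
    assignment_error = None

print(original, shouted)
print(assignment_error)
```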
Python's built-in string services do not operate on any type of structured data. They deal
with text data at the character-level.
In Python 2.7, a new memoryview object has been introduced. These objects
allow certain C-based data types to expose their contents via a byte-oriented
interface. Strings support this functionality. Generally speaking, however, a
memoryview object shouldn't be used for standard text operations.
Defining strings
Strings can be defined in a variety of ways, using a variety of different quoting methods. The
Python interpreter treats string values differently based on the choice of quotes used. Let's
look at an example that includes a variety of different definition approaches.
Time for action – employee management
In this short and rather contrived example, we'll handle some simple employee records and
just print them to the screen. Along the way, however, we'll cover the various different ways
a developer can quote and define string literals. A literal is a value that is explicitly entered,
and not computed.
1. From within our text processing virtual environment, create a new file and name it
string_definitions.py.
2. Enter the following code:
import sys
import re

class BadEmployeeFormat(Exception):
    """Badly formatted employee name"""

def get_employee():
    """
    Retrieve user information.

    This method simply prompts the user for
    an employee's name and his current job
    title.
    """
    employee = raw_input('Employee Name: ')
    role = raw_input("Employee's Role: ")
    if not re.match(r'^.+\s.+', employee):
        raise BadEmployeeFormat('Full Name Required '
            'for records database.' )
    return {'name': employee, 'role': role }

if __name__ == '__main__':
    employees = []
    print 'Enter your employees, EOF to Exit...'
    while True:
        try:
            employees.append(get_employee())
        except EOFError:
            print
            print "Employee Dump"
            for number, employee in enumerate(employees):
                print 'Emp #%d: %s, %s' % (number+1,
                    employee['name'], employee['role'])
            print u'\N{Copyright Sign}2010, SuperCompany, Inc.'
            sys.exit(0)
        except BadEmployeeFormat, e:
            print >>sys.stderr, 'Error: ' + str(e)
3. Assuming that you've entered the content correctly, run it on the command line.
Your output should be similar to the following:
(text_processing)$ python string_definitions.py

What just happened?
Let us go through this example. There are quite a few things to point out.
The very first thing we do, other than import our required modules, is define a custom
exception class named BadEmployeeFormat. We simply subclass Exception
and define a new docstring. Note that no pass keyword is required; the docstring is
essentially the body of our class. We do this because later on in this example, we'll raise this
error if an employee name doesn't match our simple validation.
Now, note that our docstring is enclosed by triple quotes. As you've probably guessed,
that holds a special meaning. Python strings enclosed in triple quotes preserve line endings
so that multiline strings are represented correctly. Consider the following example.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> s = """This is a multiline string.
...
... There are many like it, but this one
... is mine.
... """
>>>
>>> print s
This is a multiline string.
There are many like it, but this one
is mine.
>>>
As you can see, the new line values are included. Note that escape sequences are still
interpreted, so all other values still require additional escaping. For example, including
a \t will still translate to a tab character.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> t = """This still creates a \tab"""
>>> print t
This still creates a ab
>>>
After our exception class, we create a module-level function named get_employee that is
responsible for collecting, testing, and returning employee data. The first thing you should
notice is another triple-quoted docstring. You should note that docstrings do not have
to be triple-quoted, but they do need to be string literals.
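The docstring rule can be checked through the __doc__ attribute. In this sketch the three function names are made up; the last one begins with a function call rather than a string literal, so it is not treated as a docstring.

```python
def single_quoted():
    'A one-line docstring in single quotes.'

def double_quoted():
    "A one-line docstring in double quotes."

def computed():
    # A call expression is evaluated at run time; it is not a literal.
    str('Built at run time, so not a docstring.')

print(single_quoted.__doc__)
print(double_quoted.__doc__)
print(computed.__doc__)
```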

The very first line of code within get_employee calls raw_input, which simply receives a
single line of text via standard input, trimming the trailing newline. The single-quoted string
passed to it serves as the text prompt that the caller will see on the command line.
The very next line includes another call to raw_input, asking for the employee's role.
Notice that this invocation includes the prompt text in double quotes. Why is that? The
answer is simple. We used an apostrophe in the word "employee's" in order to indicate
ownership. Both double and single quotes serve the same functional purpose. There is
no difference between them, unlike in some other languages. They're both allowed in order
to let you include one set of quotes within the other without resorting to long sequences
of escapes. As you can see, the following string variables are all the same.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> single = '"Yes, I\'m a programmer", she said.'
>>> double = "\"Yes, I'm a programmer\", she said."
>>> triple = '''"Yes, I'm a programmer", she said.'''
>>> print single
"Yes, I'm a programmer", she said.
>>> print double
"Yes, I'm a programmer", she said.
>>> print triple
"Yes, I'm a programmer", she said.
>>>
The Python convention is to use single quotes for strings unless there is a need
to use a different format, so you should also adhere to this whenever possible.
On the next line, we call re.match. This is a very simple regular expression that is used to
validate the employee's name. We're checking to ensure that the input value contained a space
because we want the end-user to supply both the first and last name. We'd do a much better
job in a real application (where we would probably ask for both values independently).
The call to re.match includes a single-quoted string, but it's prefixed with a single r. That
leading r indicates that we're defining a raw string. A raw string is interpreted as-is, and
escape sequences hold no special meaning. The most common use of raw strings is probably
within regular expressions like this. The following brief example details the difference
between manual escapes and raw strings.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> standard = '\n\nOur Data\n\n'
>>> raw = r'\n\nOur Data\n\n'

>>> print standard
Our Data
>>> print raw
\n\nOur Data\n\n
>>>
Using the standard string syntax, we would have had to include backslashes if we wished
to mute the escape interpretaon, and our string value would have been '\\n\\nOur
Data\\n\\n'. Of course, this is a much more dicult string to read.
Users of the popular Django framework may recognize this syntax. Django uses
regular expressions to express HTTP request-roung rules. By default, these
regular expressions are all contained within raw string denions.
If the regular expression test fails, we'll raise our BadEmployeeFormat exception that we
defined at the top of this example. Look carefully at the raise statement. Notice that the
string passed into BadEmployeeFormat's __init__ method is actually composed of two
strings. When the Python interpreter encounters string literals separated by white space,
it automatically concatenates them together. This provides a nice way for the developer
to wrap his or her strings neatly without creating long and hard to manage lines. As these
strings were defined within the parentheses following BadEmployeeFormat, we were able
to include a newline.
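The adjacent-literal behavior described above can be tried on its own. This is a minimal sketch; the message text and the pattern are made up for illustration and are not taken from the employee example.

```python
# Adjacent string literals are joined when the code is compiled, so a
# long message can be wrapped across lines inside parentheses.
message = ('Employee name must include a first '
           'and a last name')

# The same text built with explicit plus-syntax concatenation.
joined = 'Employee name must include a first ' + 'and a last name'

# Raw and standard literals may be mixed; each keeps its own escape rules.
pattern = r'\w+' '\\s' r'\w+'
```

Note that the space at the end of the first literal is significant; nothing is inserted between the pieces.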
Now, within our main section, we create an infinite loop and begin calling get_employee.
We append the result of each successful call onto our employees list. If an exception is raised
from within get_employee, we might have to take some additional action.
If EOFError bubbles up then a user has pressed Ctrl + D (Ctrl + Z on Windows), indicating that
they have no more data to supply. The raw_input function actually raises the exception;
we just let it percolate up the call stack. The first thing we do within this handler is print out
some status text to notify the user that we're dumping our employee list.
Next, we have a for loop that iterates on the results of enumerate(employee).
Enumerate is a convenient function that, when given a sequence as an argument, returns
the zero-based loop number as well as the actual value in a tuple, like in this example
snippet:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> for c,i in enumerate(xrange(2)):
...     print "Loop %d, xrange value %d" % (c,i)
...
Loop 0, xrange value 0
Loop 1, xrange value 1
>>>
Each employee's name and role is printed out this way. This continues until we reach the end
of the list, at which point we're going to print a simple copyright statement.
When our employee application becomes wildly popular, we want to be certain that we're
protected after all! The copyright line introduces yet another string variant: a Unicode
literal. Unicode strings contain all of the functionality of standard string objects, plus some
encoding specifics.
A Unicode literal can be created by prepending any standard string with a single u, much
like we did with the r for raw strings. Additionally, Unicode strings introduce the \N escape
sequence, which allows us to insert a Unicode character by standardized name rather than
literally or by character code.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> russian_pm = u'\N{CYRILLIC CAPITAL LETTER PE}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER TE}\N{CYRILLIC SMALL LETTER SHORT I}\N{CYRILLIC SMALL LETTER EN}'
>>> print russian_pm
Путйн
>>> russian_pm = u'\u041f\u0443\u0442\u0439\u043d'
>>> print russian_pm
Путйн
>>>
As of now, you should understand that Unicode allows us to represent characters outside of
the ASCII range. This includes symbols such as the one we added above as well as alphabets
such as Cyrillic, which at one point would have required its own encoding standard (in fact,
KOI8 is just that). We'll cover Unicode and additional text encodings in much more detail
when we get to Chapter 8, Understanding Encodings and i18n.
Finally, we'll catch our BadEmployeeFormat exception. This indicates that our test regular
expression didn't match. Here, you'll see that we're concatenating a string literal with a
calculated value, so we can't simply place them adjacent within our source listing. We use
plus-syntax to create a new string, which is a concatenation of the two.
One important thing to remember is that, although there are three different variants of
quotes and raw string modifiers, there are only two string types: unicode and str.

Building non-literal strings
The majority of the strings you'll create in a manual fashion will be done using literals. In
most other scenarios, text data is generated as the result of a function or a method call.
Consider the value returned by sys.stdin.readline. We'll cover some of the common
methods for building strings programmatically as we progress through this chapter.
Python 3.0 eliminates the split between a default byte string and a separate Unicode
string. All strings in Python 3.0 are Unicode. Defining a string using the u'content'
approach while running under Python 3.0 will simply result in a SyntaxError
exception (the u prefix was accepted again starting with Python 3.3). As there is
only one string type, the previously mentioned basestring is no longer valid
within Python 3.0, either. A bytes type replaces the standard string object and is
used to represent raw byte data, such as binary information.
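To make the Python 3 split between text and bytes concrete, here is a small sketch. It assumes a Python 3 interpreter (it is not meant for 2.6), and the choice of UTF-8 is an assumption made for the example.

```python
# Python 3 only: str holds Unicode text, bytes holds raw byte data.
text = '\N{CYRILLIC CAPITAL LETTER PE}\N{CYRILLIC SMALL LETTER U}'

# Encoding turns text into bytes; decoding reverses the step.
data = text.encode('utf-8')
round_trip = data.decode('utf-8')

# Two characters, but four bytes once encoded as UTF-8.
char_count = len(text)
byte_count = len(data)
```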
Pop Quiz – string literals
1. We've seen where we would use raw strings and we've seen where we would use
Unicode strings. Where might you wish to combine the two? Is it even possible?
2. What do you suppose would happen if you tried to concatenate a Unicode object
and a standard Python string? Here's a hint: what happens when you divide a whole
integer by a float?
3. Suppose a ZeroDivisionError or an AttributeError is triggered from within
get_employee. What do you suppose would happen?
String formatting
In addition to simply creating plain old strings as we've just covered, Python also lets you
format them using a C sprintf style syntax. Strings in later versions of Python also support
a more advanced format method.
Time for action – customizing log processor output
Let's revisit and extend our web server log processor now. Our first versions simply printed
text to sys.stdout when information was encountered. Let's expand upon that a bit. Using
Python's built-in string formatters, we'll do a better job at reporting what we find. In fact, we'll
delegate that responsibility to the classes responsible for evaluating the parsed log data.
We'll also add some additional processing meta-output, such as how many
lines we've processed and how long it takes to execute the entire report. This is helpful
information as we further extend our log processor.

1. We're going to use logscan-c.py from Chapter 2, Working with the IO System as
our base here, so copy it over and rename it as logscan-e.py.
2. Update the code in logscan-e.py to resemble the following.
import time
import sys
from optparse import OptionParser

class LogProcessor(object):
    """
    Process a combined log format.

    This processor handles logfiles in a combined format;
    objects that act on the results are passed in to
    the init method as a series of methods.
    """
    def __init__(self, call_chain=None):
        """
        Setup parser.

        Save the call chain. Each time we process a log,
        we'll run the list of callbacks with the processed
        log results.
        """
        if call_chain is None:
            call_chain = []
        self._call_chain = call_chain

    def split(self, line):
        """
        Split a logfile.

        Initially we just want size and requested filename, so
        we'll split on spaces and pull the data out.
        """
        parts = line.split()
        return {
            'size': 0 if parts[9] == '-' else int(parts[9]),
            'file_requested': parts[6]
        }

    def report(self):
        """
        Run report chain.
        """
        for c in self._call_chain:
            print c.title
            print '=' * len(c.title)
            c.report()
            print

    def parse(self, handle):
        """
        Parses the logfile.

        Passes each parsed log entry to every handler and
        returns the number of lines processed.
        """
        line_count = 0
        for line in handle:
            line_count += 1
            fields = self.split(line)
            for handler in self._call_chain:
                getattr(handler, 'process')(fields)
        return line_count
class MaxSizeHandler(object):
    """
    Check a file's size.
    """
    def __init__(self, size):
        self.size = size
        self.name_size = 0
        self.warning_files = set()

    @property
    def title(self):
        return 'Files over %d bytes' % self.size

    def process(self, fields):
        """
        Looks at each line individually.

        Looks at each parsed log line individually and
        performs a size calculation. If it's bigger than
        our self.size, we record it for the report.
        """
        if fields['size'] > self.size:
            self.warning_files.add(
                (fields['file_requested'], fields['size']))
            # We want to keep track of the longest file
            # name, for formatting later.
            fs = len(fields['file_requested'])
            if fs > self.name_size:
                self.name_size = fs

    def report(self):
        """
        Format the Max Size Report.

        This method formats the report and prints
        it to the console.
        """
        for f,s in self.warning_files:
            print '%-*s :%d' % (self.name_size, f, s)
if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-s', '--size', dest="size",
        help="Maximum File Size Allowed",
        default=0, type="int")
    parser.add_option('-f', '--file', dest="file",
        help="Path to Web Log File", default="-")
    opts,args = parser.parse_args()
    call_chain = []
    if opts.file == '-':
        file_stream = sys.stdin
    else:
        try:
            file_stream = open(opts.file)
        except IOError, e:
            print >>sys.stderr, str(e)
            sys.exit(-1)
    size_check = MaxSizeHandler(opts.size)
    call_chain.append(size_check)
    processor = LogProcessor(call_chain)
    initial = time.time()
    line_count = processor.parse(file_stream)
    duration = time.time() - initial
    # Ask the processor to display the
    # individual reports.
    processor.report()
    # Print our internal statistics
    print "Report Complete!"
    print "Elapsed Time: %#.8f seconds" % duration
    print "Lines Processed: %d" % line_count
    # The parentheses keep the zero-line fallback inside the
    # formatted value rather than replacing the whole message.
    print "Avg. Duration per line: %#.16f seconds" % \
        ((duration / line_count) if line_count else 0)

3. Now, in order to illustrate what's going on here, create a new file named
example2.log, and enter the following data. Note that each line begins with
127.0.0.1.
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /a HTTP/1.1" 200 65383 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /short HTTP/1.1" 200 22912 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /bit_long HTTP/1.1" 200 1818212 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /extra_long HTTP/1.1" 200 873923465 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /e HTTP/1.1" 200 8221 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /e HTTP/1.1" 200 4 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /d HTTP/1.1" 200 22 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
4. Now, from within our virtual environment, run this code on the command line. Your
output should be similar to the following:
(text_processing)$ cat example2.log | python logscan-e.py -s 30

What just happened?
We introduced some extended string formatting mechanisms and extended our code to be a
little bit more extensible, which is generally a good practice.
First of all, we're importing the time module. We use this to calculate runtime and other
things as we move forward. As we introduce new methods of extracting and parsing these
files, it's nice to have a means to measure the performance hit or gain associated with the
change.
We updated the LogProcessor class in a few places. First, we've added a report method.
This method will pull the title off of each log handler defined and display it, followed by
a separator bar. Next, the report method will call each handler class directly and ask it to
print its own report segment.
The parse function has been updated to return the number of lines processed for statistics
purposes. We've also replaced our direct call to handle with a dynamic lookup of a
process function. This is a great example of Python's dynamic nature and duck-typing at
work. We did this so that we can get at more of the class fields directly in other areas. Simply
passing the parsing function around limits what we have access to.
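The dynamic lookup can be seen in isolation with a pair of stand-in handlers. This is only a sketch: LineCounter and SizeTotal are hypothetical classes invented for the illustration, not part of the log processor.

```python
class LineCounter(object):
    """Hypothetical handler: counts the records it is given."""
    def __init__(self):
        self.count = 0

    def process(self, fields):
        self.count += 1


class SizeTotal(object):
    """Hypothetical handler: sums the 'size' field."""
    def __init__(self):
        self.total = 0

    def process(self, fields):
        self.total += fields['size']


handlers = [LineCounter(), SizeTotal()]
records = [{'size': 10}, {'size': 32}]

for fields in records:
    for handler in handlers:
        # Look the method up by name at run time; any object with a
        # callable 'process' attribute will do.
        getattr(handler, 'process')(fields)
```

Neither class shares a base class with the other. All that matters is that each one exposes a process method, which is the essence of duck-typing.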
The MaxSizeHandler got an even bigger facelift this time through. We've added instance
level variables name_size and warning_files. The name_size variable keeps track of
the longest filename we've found while warning_files is a set object.
The following three lines define a Python property:
@property
def title(self):
    return 'Files over %d bytes' % self.size
A property is a special object that appears to be an attribute when accessed directly, but
is actually handled by a method behind the scenes. When we access c.title from within
LogProcessor, we're actually triggering a call to the MaxSizeHandler instance's title method.
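A short sketch shows that the method behind a property runs on every access. The Threshold class is a hypothetical stand-in that mirrors only the title property of MaxSizeHandler.

```python
class Threshold(object):
    """Hypothetical class used only to illustrate @property."""
    def __init__(self, size):
        self.size = size

    @property
    def title(self):
        # Computed on every access, so it tracks changes to self.size.
        return 'Files over %d bytes' % self.size


t = Threshold(30)
first = t.title      # no parentheses: it reads like an attribute
t.size = 1024
second = t.title     # the method runs again with the new size
```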
We've made changes to our process method, too. It now adds a tuple for each file
name/size pair that exceeds our maximum allowed size. Why did we use a set? Simple. If the
same file is accessed multiple times, we only want to display it once for each size. Python lets
us use tuples as unique values within a set object as they're immutable. As is the nature
of sets, adding the same value multiple times is a null operation. A value only exists once
within a set.
Note that sets were available only as an external module up until 2.4. Prior
to that, it was necessary to 'from sets import Set as set' at the top of your
module. If you're running an earlier version, you'll have to take this precaution.
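The de-duplicating behavior is easy to confirm with a few sample pairs. The values below are invented for the sketch, loosely modeled on the example log.

```python
warning_files = set()

# Two of these four hits are identical (name, size) pairs.
hits = [('/a', 65383), ('/e', 8221), ('/e', 4), ('/a', 65383)]
for name, size in hits:
    # Adding a tuple that is already present changes nothing.
    warning_files.add((name, size))

unique_count = len(warning_files)
```

The file /e still appears twice because its two hits have different sizes, which matches the "once for each size" behavior described above.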

We finish up this revision of the MaxSizeHandler class by updating the longest filename, if
applicable, and defining our report function.
If you take a closer look at report, you'll see a line containing a string format that reads
'%-*s :%d' % (self.name_size, f, s). There is a bit of formatter magic included here.
We'll take a closer look at this syntax below, but understand that this line prints a file's name
and corresponding size. It also ensures that each size value lines up in a columnar format, to
the right of the longest filename we've found. We're allotting for variations in filename size
and spacing our sizes accordingly to avoid a jagged-edge look.
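The column alignment can be reproduced outside the class. In this sketch the rows are sample values and the lines are collected in a list rather than printed, so the snippet runs unchanged on Python 2.6 and Python 3.

```python
rows = [('/a', 65383), ('/extra_long', 873923465), ('/short', 22912)]

# The widest name decides the column width, as name_size does above.
width = max(len(name) for name, _ in rows)

# '*' takes the width from the tuple; '-' left-justifies the name.
lines = ['%-*s :%d' % (width, name, size) for name, size in rows]
```

Every colon lands in the same column because each name is padded to the same width before the separator is added.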
Finally, we hit our main section. Not a whole lot has changed here. We've added code to
track how long we run and how many lines we've processed as returned by
processor.parse. We've also switched to passing instances of our handler classes to
LogProcessor's __init__ method rather than specific functions.
At the bottom of the main section, we've introduced another variation of the formatting
expression. Here, we're shoring up some of our decimal formatting and using some alternate
formatting methods available to us. The '#' in this line alters the way the string is rendered.
Percent (modulo) formatting
This is the oldest method of string formatting available within Python, and as such, it's the
most popular one. We've been using it throughout the book so far, though this example
introduced some of the more esoteric features.
A percent formatter expression consists of two main parts: a format string and a tuple or a
dictionary of formatting values. Format strings consist of plain text with format specifications
mixed in. Format specifications begin with a percent sign and instruct Python on how to
translate a data value into printed text.
These two main components are then separated via a percent sign, or modulus operator. If
you're formatting a string with a single % specifier then the use of a tuple is not necessary.
For example, simple string formatting expressions usually look like the following:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> "%d + %d = %d" % (1,2,3)
'1 + 2 = 3'
>>> '%d %% %d = %d' % (5,2,1)
'5 % 2 = 1'
>>> 'I am a %s programmer' % 'python'
'I am a python programmer'
>>>

It is also possible to use a dictionary instead of a tuple, if the corresponding key is specified
in parentheses after the % sign of each specification, like in this example.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> user = {'uid': 0, 'gid': 0, 'login': 'root'}
>>> 'Logged in as %(login)s with uid %(uid)d and gid %(gid)d' % user
'Logged in as root with uid 0 and gid 0'
>>>
Each formatting specification consists of a variety of different elements, most of which are
usually left out. Here is a diagram detailing all of the available modifiers.
This example uses a dictionary to provide the mapped values. Let's review each possible
component. Remember that some of the possible values change depending on whether
we've used a dictionary or a tuple.
Mapping key
If the mapping key is present then the format conversion expects a dictionary after the dividing
percent sign. The mapping key is quite simply a key into the dictionary you'll provide.

Conversion flags
These are optional values that change the way the provided value is displayed. There are a
series of different flags available.
Flag      Usage
#         Dictates that an alternate format should be used. Alternate formats vary by
          formatting type. For example, using this flag with a floating point ensures that
          the decimal point is present, even if not required.
0         If the minimal display width is greater than the width of the value, pad with
          zero for numeric values.
-         The printed value is left-justified in relation to the padding. The default is
          to right-justify.
<space>   Signifies that a space should be left before a positive number.
+         Add a sign character. Has a higher precedence than <space>.
In the above example, we specified an alternate format in order to ensure that the decimal is
always present.
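Each flag in the table can be checked with a one-line expression. The values below are arbitrary samples chosen for this sketch.

```python
alt_float = '%#.0f' % 3.0      # '#' keeps the decimal point
plain_float = '%.0f' % 3.0     # without '#', the point is dropped
alt_hex = '%#x' % 255          # '#' adds the 0x prefix for hex
zero_pad = '%05d' % 42         # '0' pads numeric values with zeros
left = '%-5d|' % 42            # '-' left-justifies within the width
space = '% d' % 42             # <space> reserves room for a sign
plus = '%+d' % 42              # '+' always shows the sign
both = '%+ d' % 42             # '+' takes precedence over <space>
```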
Minimum width
If the value to be translated does not meet this minimum length, it will be padded
accordingly. If a * (asterisk) is passed in as opposed to a number, the value will be taken from
the tuple of values.
This is the approach taken in our last example. We programmatically determined the padding
we wanted to use and inserted it into our values tuple while forcing left-justification.
Precision
This is valid for floating-point numbers. The precision indicates how many places after the
decimal to display. In the preceding diagram, we specified four places in the value, but
only requested three in the formatting. The following small example details the use of the
precision option. Note the value printed versus the value provided.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> '%.3f' % 3.1415
'3.142'
>>>
As you can see, the value we've supplied is rounded up correctly and printed.

Width
These length modifiers have no use in Python and do not change the formatting at all. They
are largely carried over from C's sprintf functionality. Accepted values are l, L, or h. If they
are supplied, they are simply ignored.
Conversion type
This identifies the kind of conversion to apply to the value. These are generally the same as
found in C. However, the r and the s types are slightly special and we'll cover them below.
Here is a list of the valid conversion formats.
Conversion   Description
d, i         Signed decimal
o            Signed octal
x            Signed hexadecimal in lowercase
X            Signed hexadecimal in uppercase
u            Obsolete; identical to d
e            Floating point exponential in lowercase
E            Floating point exponential in uppercase
F, f         Floating point decimal
g            Lowercase exponential if exponent is less than -4 or not less than the
             precision, otherwise use decimal format.
G            Uppercase exponential if exponent is less than -4 or not less than the
             precision, otherwise use decimal format.
c            Single character. Can be an integer value or a string of one.
r            Object repr value, see below.
s            Object str value, see below.
%            Literal percent sign.
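A few of the conversion types from the table, applied to sample values. This sketch stores the results in variables so that it runs on both Python 2.6 and Python 3.

```python
as_decimal = '%d' % 255
as_octal = '%o' % 255
as_hex = '%x / %X' % (255, 255)
as_exp = '%e' % 12345.678
as_fixed = '%f' % 12345.678

# %g switches to exponential form only for very small (or very
# large) exponents.
as_general = '%g and %g' % (0.00001234, 1234.5)

# %c accepts either an integer code or a one-character string.
as_char = '%c%c' % (80, 'y')

# %r inserts repr(), %s inserts str().
as_repr = '%r vs %s' % ('text', 'text')

# A doubled percent sign yields a literal one.
as_percent = '%d%%' % 100
```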
Using string special methods
If an object has a __str__ method then it is implicitly called whenever an instance of that
object is passed to the str built-in function. Accepted practice is to return a human-friendly
string representation of that object.
Likewise, if an object contains a __repr__ method, passing that object to the repr
built-in should return a Python-friendly representation of that object. Historically, that means
enough text to recreate the object via eval, but that's not a strict requirement.
Using %s or %r results in the values of __str__ or __repr__ replacing the formatting
specification. For example, consider the following code.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> class MicroController(object):
...     def __init__(self, brand, bits):
...         self.brand = brand
...         self.bits = bits
...     def __str__(self):
...         return '%s %s-bit CPU' % (self.brand, self.bits)
...
>>> m = MicroController('WhizBang', 8)
>>> 'my box runs a %s' % m
'my box runs a WhizBang 8-bit CPU'
>>>
This is very convenient while formatting strings containing representations of objects,
though in some cases it can be somewhat misleading.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'I have %s bits' % 8
'I have 8 bits'
>>>
In many languages, an approach like that would simply result in either a syntax error or a
memory-related crash. Python treats it differently, however, as the result of str(8) is the
string representation of the number eight.
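To see %s and %r side by side, the MicroController class can be given a __repr__ as well. The __repr__ method here is an addition made for this sketch; the class shown earlier defines only __str__.

```python
class MicroController(object):
    def __init__(self, brand, bits):
        self.brand = brand
        self.bits = bits

    def __str__(self):
        # Human-friendly text, used by %s and str().
        return '%s %s-bit CPU' % (self.brand, self.bits)

    def __repr__(self):
        # Assumed addition: enough text to rebuild the object.
        return 'MicroController(%r, %r)' % (self.brand, self.bits)


m = MicroController('WhizBang', 8)
friendly = 'my box runs a %s' % m
debug = 'created %r' % m
```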
Have a go hero – make log processing more readable
So, now you should have a pretty good grasp of percent string formatting. All of the file sizes
outputted in our example above are in pure bytes. That's great for accuracy's sake, but it can
be quite difficult on the eyes.
Update all of the preceding output to display as kilobytes in a decimal form. We don't want
to display decimals beyond two places as that could get just as difficult to read.
Using the format method approach
As of Python 2.6 (and all versions of Python 3), the format method has been available to all
string and Unicode objects. This method was introduced to combat flexibility restrictions in
the percent approach. While this is a much more powerful and flexible method of string
formatting, it's only available in newer versions of Python. If your code must run on older
distributions, you're stuck with the classic percent-formatting approach.

Instead of marking our format specifications with percent signs, the format method expects
formatting information to be enclosed in curly braces.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> platforms = {'linux': 121, 'windows': 120, 'solaris': 12}
>>> 'We have {0} platforms, Linux: {linux}, Windows: {windows}, and Solaris: {solaris}'.format(
... 3, **platforms)
'We have 3 platforms, Linux: 121, Windows: 120, and Solaris: 12'
>>>
In the simplest cases, numeric values in curly braces represent positional arguments while
text names represent keyword arguments.
In addition to the new format method found on string objects, Python 2.6 and
above also have a new built-in function, format. This essentially provides a
means to access the features of the string object's format, without requiring
a temporary string. Under the hood, it triggers a call to a formatted object's
__format__ method. For more information on the __format__ method, see
http://python.org/dev/peps/pep-3101/.
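The relationship between the format built-in, the string method, and __format__ can be sketched as follows. The Celsius class is hypothetical and exists only to show a custom __format__ being triggered.

```python
# The built-in takes a value and a bare format specification.
as_hex = format(255, 'x')
padded = format(3.14159, '8.2f')

# The string method wraps the same specification in braces.
via_method = '{0:8.2f}'.format(3.14159)


class Celsius(object):
    """Hypothetical type that customizes its own formatting."""
    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        # Delegate the spec to the underlying float, then add a unit.
        return format(self.degrees, spec) + ' C'


reading = format(Celsius(21.456), '.1f')
embedded = 'Now: {0:.1f}'.format(Celsius(21.456))
```

Both the built-in and the string method end up calling Celsius.__format__, which is why the two results carry the same unit suffix.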
Time for action – adding status code data
First of all, note that this example won't work if you're using a version of Python older than
2.6. If you fall into that category, you'll have to either upgrade your version, or simply pass
over this section.
We're going to update our LogProcessor script to report on the collection of HTTP
response codes found within the logfile. We'll simply add an additional handler to process
the parsed data.
1. Using logscan-e.py as a base, create logscan-f.py and add the following
additional import statement:
from collections import defaultdict
2. Now, we're going to change the split method of LogProcessor to also include
HTTP status code information.
def split(self, line):
    """
    Split a logfile.

    Initially, we just want size and requested filename, so
    we'll split on spaces and pull the data out.
    """
    parts = line.split()
    return {
        'size': 0 if parts[9] == '-' else int(parts[9]),
        'file_requested': parts[6],
        'status': parts[8]
    }
3. Now, directly below the LogProcessor class, add the following new handler class.
class ErrorCodeHandler(object):
    """
    Collect Error Code Information.
    """
    title = 'Error Code Breakdown'

    def __init__(self):
        self.error_codes = defaultdict(int)
        self.errors = 0
        self.lines = 0

    def process(self, fields):
        """
        Scan each line's data.

        Reading each line in, we'll save out the
        number of response codes we run into so we
        can get a picture of our success rate.
        """
        code = fields['status']
        self.error_codes[code] += 1
        # Assume anything >= 400 is
        # an HTTP error
        self.lines += 1
        if int(code) >= 400:
            self.errors += 1

    def report(self):
        """
        Print out Status Summary.

        Create the status segment of the
        report.
        """
        longest_num = sorted(self.error_codes.values())[-1]
        longest = len(str(longest_num))
        for k,v in self.error_codes.items():
            print '{0}: {1:>{2}}'.format(k, v, longest)
        # Print summary information
        print 'Errors: {0}; Failure Rate: {1:%}; Codes: {2}'.format(
            self.errors, float(self.errors)/self.lines,
            len(self.error_codes.keys()))
4. Finally, add the following line to the main section, right below
call_chain.append(size_check):
call_chain.append(ErrorCodeHandler())
5. Now, run the updated application. Your output should resemble the following:
(text_processing)$ cat example2.log | python logscan-f.py -s 30
What just happened?
Let's take a quick survey of the changes we made to this application. First of all, we imported
defaultdict. This is a rather useful object. It also acts as a dictionary. However, if a
referenced key doesn't exist, it calls the function supplied and uses its value to seed the
dictionary before returning. A standard dictionary would simply raise a KeyError, as in the
following example:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin

Type "help", "copyright", "credits", or "license" for more information.
>>> d = {}
>>> d['200'] += 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: '200'
>>> from collections import defaultdict
>>> d_dict = defaultdict(int)
>>> d_dict['200'] += 1
>>> d_dict
defaultdict(<type 'int'>, {'200': 1})
>>>
Next, we're just updating the split method to return the ninth field in each line (index 8),
which happens to be the HTTP status code as returned to the client.
In the new handler class, ErrorCodeHandler, we set up three instance-level variables: the
defaultdict object detailed previously, and two counters that represent the number of
errors we've run into as well as the number of lines we've processed.
The process method adds to the defaultdict each time a status code is encountered. If a
specific value hasn't been added yet, the dictionary defaults (hence its name) to the value of
int(), which will be zero.
The defaultdict type is a useful helper when tallying or extracting information from
logfiles or other unknown sources of data when you're not certain whether a specific
key will exist and want to add it dynamically.
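The tallying pattern can be condensed into a few lines. This sketch uses a made-up list of status strings in place of a real logfile.

```python
from collections import defaultdict

# Status codes arrive as strings, just as they do after line.split().
statuses = ['200', '404', '200', '500', '200']

tally = defaultdict(int)
for code in statuses:
    # Missing keys start at int(), which is 0.
    tally[code] += 1

# Convert to int before comparing against the numeric threshold.
errors = sum(count for code, count in tally.items() if int(code) >= 400)
```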
Next, we increase our line number counter. If the status code is 400 or greater then we
also increment our error counter. You should note that we're actually passing the value of
code to the int function before doing the comparison. Why is this?
Python is a dynamically-typed language; however, it is still strongly-typed. For example, an
HTTP code value of '200' is a textual representation of a number; it is still a string type.
The value was assigned its type when we extracted it as a substring from a line in a logfile,
which itself was read in as a collection of strings. So, without the explicit conversion, we're
comparing an integer (400) against a string representation of a number. The result probably
isn't what you would expect.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> snum = '1000'
>>> snum == 1000
False
>>> int(snum) == 1000
True

This is a common gotcha and has actually been rectified in Python 3.0. An equality test
between a string and an integer still returns False there, but attempting an ordering
comparison such as the following will result in a TypeError when using Python 3.
>>> '1000' > 1000
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: str() > int()
>>>
Within the report method, we next sort the self.error_codes dictionary values via
the built-in sorted function. We take the highest number in that list, via a subscript of -1,
and convert it into a string. We then take the length of that string. We'll use this value for a
formatting modifier later in this method.
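The width calculation and the nested width field can be tried with a plain dictionary. The counts are invented, and max is used here as an alternative to sorted(...)[-1]; it is not what the handler itself does.

```python
error_codes = {'200': 1204, '404': 37, '500': 3}

# max() yields the same value as sorted(...)[-1] without sorting.
longest = len(str(max(error_codes.values())))

# {2} inside the specification is replaced by the third argument,
# so the column width is decided at run time.
lines = ['{0}: {1:>{2}}'.format(code, count, longest)
         for code, count in sorted(error_codes.items())]
```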
The next section loops through all of the response codes we've run into thus far and prints
them out to the screen, though it does that via the string format method.
The last thing we do within the report method is display a summary of the error code data
we've collected while processing a logfile. Here, we're also using the format method rather
than traditional percent-sign formatting.
Finally, within our main section, we added an instance of ErrorCodeHandler to the
call_chain list that is passed into LogProcessor's __init__ method. This ensures that it will
be included during logfile processing.
Making use of conversion specifiers
As we mentioned earlier, conversion markup is enclosed in curly braces as opposed to the
percent prefix as used in standard string formatting. In addition to the replacement value,
though, the curly braces also contain all of the same formatting information (with some new
options) that the standard methods support.
Let's take another look at that graphical breakdown of a format string, but this time we'll use
the newer format syntax.

Notice how the replacement value name or position is separated from the formatting
arguments by a colon. The colon itself holds no other special meaning. This example does
not include all possible combinations. When using the format method, the # option is only
valid for integers. Likewise, the precision argument is only valid for floating point values.
Fill
The fill argument allows us to specify which character we should use to pad our string if
the actual width of the replacement value is less than the minimum width. Any character can
be used other than a closing brace, which would signify the end of the format specification.
Align
This signifies how text should be aligned in relation to the fill characters if actual width is less
than minimum width.
Flag   Usage
<      The field is left-aligned. This is the default alignment for strings.
>      The field is right-aligned. This is the default alignment for numbers.
=      This forces the padding to be placed between a sign character and the
       value. This is only valid for numeric types.
^      Forces the value to be centered within the available spacing.
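The fill and align options combine as shown in this short sketch. The sample values and the choice of '*' and '0' as fill characters are arbitrary.

```python
left = '{0:<8}|'.format('abc')       # pad on the right
right = '{0:>8}|'.format('abc')      # pad on the left
center = '{0:*^9}'.format('abc')     # '*' is the fill character

# '=' places the padding between the sign and the digits.
signed_pad = '{0:0=+6}'.format(42)
```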
Sign
This field is valid only for numeric types and is used to determine how the sign information is
displayed, if at all.
Flag      Usage
+         Sign data is always displayed.
-         Python should only display the sign for negative numbers. This is the
          default behavior.
<space>   Leading space should be used on positive, while a sign should be used when
          the value is negative.
Width
This specifies the minimum width of the field. If the actual value is shorter, the result will be
padded according to the alignment rules using the fill character.
Precision
This specifies the floating-point precision. As mentioned previously, this is only valid for
floating-point values. Floating-point numbers are rounded and not simply truncated.
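Sign, width, and precision can be exercised together. This is a small sketch with sample numbers; explicit positional indexes are used so that it also runs on Python 2.6.

```python
plus = '{0:+d} {1:+d}'.format(5, -5)     # sign always shown
space = '{0: d} {1: d}'.format(5, -5)    # space stands in for '+'
width = '{0:6d}|'.format(42)             # numbers right-align by default
precision = '{0:.3f}'.format(3.14159)    # rounded, not truncated

# Sign, zero padding, width, and precision in one specification.
combined = '{0:+08.2f}'.format(3.14159)
```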

Type
The type field is the last argument in the format specification and details how the value
should be displayed. Unlike standard percent-formatting, this is no longer a required field. If
not specified, a default is used based on the value's type.
There are quite a few new type flags introduced with the format method and some of
the implementation details are rather complex. For a complete introduction to type fields
for use with the format function, see
http://docs.python.org/library/string.html#format-string-syntax.
The following table contains a survey of the available values.
Flag  Usage
s     String output. This is the default for strings and class instances.
b     Binary output.
d     Decimal output. This is the default for integers.
o     Octal format.
x     Hexadecimal format using lowercase letters.
X     Hexadecimal format using uppercase letters.
n     Same as the d flag, though it uses locale information to display correctly based on your preferences.
e     Exponent (scientific) notation using lowercase letters.
E     Exponent (scientific) notation using uppercase letters.
f     Fixed point.
F     Same as the 'f' type.
g     General format. There is a collection of rules regarding display for this type. See the Python documentation for details. This is the default for floating-point values.
G     Uppercase version of 'g'.
%     Percentage. Multiplies a number by 100 and displays in 'f' format, followed by a percent sign.
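To make the table more concrete, here is a small sketch that runs one integer and a couple of floats through several of the type flags. The sample numbers and variable names are our own and are not part of the log processor.

```python
value = 255
as_binary = '{0:b}'.format(value)     # base 2
as_octal = '{0:o}'.format(value)      # base 8
as_hex = '{0:x}'.format(value)        # base 16, lowercase digits
as_upper_hex = '{0:X}'.format(value)  # base 16, uppercase digits

reading = 12345.678
scientific = '{0:e}'.format(reading)  # exponent notation, six decimal places
fixed = '{0:.1f}'.format(reading)     # fixed point, one decimal place
percent = '{0:.1%}'.format(0.256)     # multiplied by 100, '%' appended
```

The integer flags produce '11111111', '377', 'ff', and 'FF'. The float flags produce '1.234568e+04', '12345.7', and '25.6%'.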
Have a go hero – updating the file size check to use the format method
Now that you've got a crash course in Python string-formatting methods, you should
be able to work with both approaches. Take a few minutes and go back to update the
MaxSizeHandler class to use format methods rather than percent syntax. However, you'll
probably want to create a temporary copy.
You may find the Python documentation helpful in addition to the tables included in this
chapter. Formatting markup seems to be one area that many developers never really seem to
fully grasp. Take a moment and stand out from the crowd!

Creating templates
It's often said within the Python community that every programmer, at some point,
implements his or her own Python-based template language. The good news, then, is that
we don't have to as so many of them already exist!
There's a large collection of very powerful third-party templating libraries available for Python.
We'll cover them in more detail (and even write our own) in Chapter 7, Creating Templates.
Python includes an elementary templating class within the string module. The Template
class doesn't provide any advanced features such as code execution or inherited blocks. In
general, it's a simple way to replace tokens within a text file with Python values.
Time for action – displaying warnings on malformed lines
Up until now, we've assumed that all of the lines we process are well-formed and will
never generate exceptions. In order to illustrate the use of the Template class, we'll fix that
here. Under normal circumstances, it would probably be preferred to simply print an error
or just quietly pass by incorrectly formatted lines.
1. Using logscan-f.py as a starting place, create logscan-g.py.
2. At the top of the file, add import string to the list of modules imported.
3. Immediately after the docstring for LogProcessor, add the following code:
    tmpl = string.Template(
        'line $line is malformed, raised $exc error: $error')
4. Replace the parse method in LogProcessor with the following new method:
    def parse(self, handle):
        """
        Parses the logfile.
        Returns a dictionary composed of log entry values,
        for easy data summation.
        """
        line_count = 0
        for line in handle:
            line_count += 1
            try:
                fields = self.split(line)
            except Exception, e:
                # Report the malformed line on standard error and move on.
                print >>sys.stderr, self.tmpl.substitute(
                    line=line_count,
                    exc=e.__class__.__name__,
                    error=e)
                continue
            for handler in self._call_chain:
                getattr(handler, 'process')(fields)
        return line_count
5. Finally, copy example2.log over and create example3.log. Insert a :q! on line
eight, followed by a newline. This should be the only text on that line.
6. Running the example should produce the following output:
(text_processing)$ cat example3.log | python logscan-g.py -s 30
What just happened?
After importing the required string module, we created a Template object within the
LogProcessor class definition. By adding it where we did, we ensured that it's only created
once. If we had placed it within a method, it would be created each time that specific
method was called.
Next, we updated our parse method to catch any exceptions that rise up from within
split. If we happen to catch an error, we populate our template with values describing the
exception and print the rendered result to the screen via standard error.

Template syntax
When we create an instance of Template, we pass in the template string we'll use. The
syntax is fairly straightforward. If we want a value to be replaced, we simply precede its name
with a dollar sign. Two $ characters adjacent to each other act as an escape; they are replaced
with a single $ character in the rendered text.
If the identifier we intend to replace is embedded in a longer string, we can surround it with
braces. A small example may clarify this concept.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> from string import Template
>>> template = Template('${name} has $$${amount} in his ${loc}et')
>>>
Rendering a template
Once we've created a template object, we use it to render a new string by calling either its
substitute or safe_substitute methods.
>>> template.substitute(name='Bill Gates', amount=35000000000,
loc='wall')
'Bill Gates has $35000000000 in his wallet'
>>> template.substitute(name='Joe', amount=10, loc='blank')
'Joe has $10 in his blanket'
>>>
If a template variable is left off, or if a standalone dollar sign is encountered, the
substitute method raises an error. If the safe_substitute alternative is used, errors
are simply ignored and the placeholder is left in the text unchanged. Notice the difference
between the two approaches below:
>>> template.substitute(name='Joe', amount=10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/string.py", line 170, in substitute
return self.pattern.sub(convert, self.template)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/string.py", line 160, in convert
val = mapping[named]
KeyError: 'loc'
>>> template.safe_substitute(name='Joe', amount=10)
'Joe has $10 in his ${loc}et'
>>>
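Both methods also accept a dictionary in place of (or alongside) keyword arguments. The following is a minimal sketch of that usage. The receipt template and its values are hypothetical and only illustrate the escaping and error-handling rules described above.

```python
from string import Template

# '$$' renders a literal dollar sign; '${item}s' needs braces because the
# placeholder is immediately followed by another letter.
receipt = Template('$name owes $$${amount} for ${item}s')
values = {'name': 'Joe', 'amount': 10}

# safe_substitute leaves unknown placeholders in place instead of raising.
partial = receipt.safe_substitute(values)

# substitute accepts a mapping, keyword arguments, or both.
complete = receipt.substitute(values, item='book')

# substitute raises KeyError naming the first placeholder it cannot fill.
try:
    receipt.substitute(values)
    missing = None
except KeyError as exc:
    missing = exc.args[0]
```

After this runs, partial is 'Joe owes $10 for ${item}s', complete is 'Joe owes $10 for books', and missing is 'item'.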

Pop Quiz – string formatting
1. In what situation might you elect to use the string.Template class versus
traditional string formatting?
2. What method might you use to pass a dictionary of values into the format method?
3. We know that expressions such as "1" + 2 are invalid. What do you think would be
the result of "1" + "2"?
Calling string object methods
In addition to providing powerful creation and formatting mechanisms, Python string objects
also provide a collection of useful methods. We've already seen a few of them in our earlier
examples. For example, we called line.split() within our LogProcessor class in order
to separate a text line into pieces, delimited by space characters.
All of these methods are present on both standard byte strings and Unicode
objects. As a general rule, Unicode objects return Unicode while byte string
methods return byte strings.
Time for action – simple manipulation with string methods
In this example, we'll extend our little employee data-gathering script presented earlier in the
chapter. The goal is to illustrate the use of some of the string object methods.
1. Create a new file and name it string_definitions-b.py.
2. Enter the following code:
import sys

class BadEmployeeFormat(Exception):
    """Badly formatted employee name"""
    def __init__(self, reason, name):
        Exception.__init__(self, reason)
        self.name = name

def get_employee():
    """
    Retrieve user information.
    This method simply prompts the user for
    an employee's name and his current job
    title.
    """
    employee = raw_input('Employee Name: ')
    role = raw_input("Employee's Role: ")
    employee, role = employee.strip(), role.strip()
    # Make sure we have a full name
    if not employee.count(' '):
        raise BadEmployeeFormat('Full Name Required '
            'for records database.', employee)
    return {'name': employee, 'role': role}

if __name__ == '__main__':
    employees = []
    failed_entries = []
    print 'Enter your employees, EOF to Exit...'
    while True:
        try:
            employees.append(get_employee())
        except EOFError:
            print
            print "Employee Dump"
            for number, employee in enumerate(employees):
                print 'Emp #%d: %s, %s' % (number+1,
                    employee['name'], employee['role'].title())
            # Parentheses let the expression continue onto the next line.
            print ('The following entries failed: ' +
                ', '.join(failed_entries))
            print u'\N{Copyright Sign}2010, SuperCompany, Inc.'
            sys.exit(0)
        except BadEmployeeFormat, e:
            failed_entries.append(e.name)
            err_msg = 'Error: ' + str(e)
            print >>sys.stderr, err_msg.center(len(err_msg)+20,
                '*')
3. Run this example from the command line. If you entered it correctly then you should
see output similar to the following:
(text_processing)$ python string_definitions-b.py

What just happened?
There's not a whole lot extra going on in this new example. We've simply cleaned up our
data a little bit more and took the liberty of notifying the user which employees were not
successfully entered.
The first thing you'll notice is that we updated our BadEmployeeFormat exception to
take an additional argument, the employee name. We do this so we can append the failed
employee's information to a list within our main section.
The next update you'll run into is the employee, role = employee.strip(), role.strip()
line. Each string (employee, role) might have white space on either end. Calling
the strip method trims the string down and removes that spacing. If we wanted to, we could
have passed additional characters into strip and it would have removed those as well:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'ABC123DEF'.strip('ABCDEF')
'123'
>>>
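The strip method removes any of the characters that appear in the argument string from
both ends of the source string. It stops at the first character on each end that is not in the
argument, so matching characters in the middle of the string are left alone.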
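The following sketch shows that behavior, along with the one-sided lstrip and rstrip variants. The sample strings are invented for illustration.

```python
code = 'ABC123DEF456ABC'

# Characters are removed from both ends until one falls outside the set;
# the 'DEF' in the middle is untouched.
trimmed = code.strip('ABCDEF')

# lstrip and rstrip work the same way, but only on one end.
left_only = code.lstrip('ABC')
right_only = code.rstrip('ABC')

# With no argument, leading and trailing white space is removed.
cleaned = '  Alexander Pushkin \n'.strip()
```

Here trimmed is '123DEF456', left_only is '123DEF456ABC', right_only is 'ABC123DEF456', and cleaned is 'Alexander Pushkin'.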
We've updated our check for a space to simply scan for a single space character rather than
using our regular expression. The downside here, though, is that the check is crude: any input
containing an interior space passes, even if it was entered incorrectly, while a name typed
without one, such as 'AlexanderPushkin', is rejected outright.

In the main section, we've added a failed_entries list. Whenever we catch a
BadEmployeeFormat exception, we append the name of the employee to this list. When
we receive our EOFError, we join this list via ', '.join(failed_entries). Note that
in Python, join is a method of a string object and not a method of a list or an array data
structure.
Now that we've seen some of them put to use, let's take a closer look at some of the
methods available on string and Unicode objects. However, this isn't a complete survey.
For a detailed description of all methods available on Python string objects, see the Python
documentation.
Aligning text
There are four methods available on string objects that allow you to manage alignment
and justification. Those methods are center, ljust, rjust, and zfill. We've seen
the center method used previously. The ljust and rjust methods simply change the
side of the string on which the supplied padding character is placed.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'abc'.rjust(5, '*')
'**abc'
>>> 'abc'.ljust(5, '*')
'abc**'
>>> 'abc'.center(5, '*')
'*abc*'
>>>
The zfill method adds zeros to the left of the string object, up to the passed-in minimum
width argument.
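A brief sketch of zfill follows. The values are arbitrary. Note that zfill keeps a leading sign in front of the padding, which a plain rjust with '0' would not do.

```python
plain = '42'.zfill(5)                # zeros are added on the left
negative = '-42'.zfill(5)            # the sign stays in front of the zeros
unchanged = '123456'.zfill(3)        # already wide enough, returned as-is
via_rjust = '-42'.rjust(5, '0')      # rjust pads blindly, before the sign
```

The results are '00042', '-0042', '123456', and '00-42' respectively.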
Detecting character classes
These methods correspond to a set of standard C character identification functions. However,
unlike their C equivalents, it is possible to test all values of a specific string and not just a
single character.
These methods include isalnum, isalpha, isdigit, isspace, istitle, isupper, and
islower. These methods all test the entire string value; if any one character doesn't fit the
bill, these methods simply return False.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.

>>> '1'.isdigit()
True
>>> '1f'.isdigit()
False
>>> 'Back to the Future'.istitle()
False
>>> 'Back To The Future'.istitle()
True
>>> 'abc123'.isalnum()
True
>>>
The one method here that might not be clear up front is the istitle method. This returns
True if all words within a string have their first letter capitalized.
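Because these methods test the whole string, they are handy as quick filters. The sketch below uses a hypothetical list of log fields to pick out the ones that are purely numeric. Note that an empty string never passes.

```python
# Hypothetical field values, as might come from a split log line.
fields = ['200', '1024', '-', '30x', '']

# isdigit is False for '-', for '30x' (one bad character), and for ''.
numeric = [field for field in fields if field.isdigit()]
rejected = [field for field in fields if not field.isdigit()]
```

Here numeric is ['200', '1024'] and rejected is ['-', '30x', ''].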
Casing
String objects contain four methods for updating capitalization: title, capitalize,
upper, and lower. Both the upper and lower methods change casing for an entire string.
The capitalize and title methods are slightly different. Have a look at them in action:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> position = 'VP of marketing mumbo jumbo'
>>> position.title()
'Vp Of Marketing Mumbo Jumbo'
>>> city = 'buffalo'
>>> city.capitalize()
'Buffalo'
>>>
Notice how the title method returns the string in title case while the capitalize
method simply capitalizes the first character of the string.
Searching strings
There are a number of methods associated with string objects that help with searching
and comparison. To check for general equality, simply use the double equal sign comparison
operator.

The count, find, index, replace, rfind, rindex, startswith, and endswith
methods all scan a string for the occurrence of a substring. Additionally, it's possible to use
the in keyword to test for a substring's occurrence within a larger string.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'one' in 'Bone Dry'
True
>>> 'one' == 'one'
True
>>>
We've already introduced you to the count method, so we'll skip over that here. find
and index are both similar. When called, both return the offset into a string at which the
substring is found. The difference, however, is how they'll respond in the event that the test
string isn't present. The find method will simply return a -1. The index method will raise a
ValueError.
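The difference is easiest to see side by side. The request line below is a made-up example.

```python
line = 'GET /index.html HTTP/1.1'

offset = line.find('/index')   # offset of the first match
absent = line.find('POST')     # no match: find quietly returns -1

# index behaves like find, but signals a miss with an exception.
try:
    line.index('POST')
    raised = False
except ValueError:
    raised = True
```

Here offset is 4, absent is -1, and raised is True. Since -1 is also a valid index (the last character), it's worth checking the result of find before using it to slice.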
Both startswith and endswith test to see whether their respective end is made up of the
test string passed in.
The replace method allows you to replace a given substring within a larger string with an
optional upper bound on the number of times the operation takes place. In the following
example, notice how only one of the 'string-a' values is replaced:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'trout salmon turkey perch flounder'.replace('turkey', 'shark')
'trout salmon shark perch flounder'
>>> 'string-a string-b string-a'.replace('string-a', 'string', 1)
'string string-b string-a'
>>>
Finally, rfind and rindex are identical to find and index, except that they'll work from
the end of the string rather than the beginning.
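As a sketch of where searching from the end is useful, the following pulls the file name out of a path. The path itself is hypothetical.

```python
path = '/var/log/apache2/access.log'

first_slash = path.find('/')      # search from the beginning
last_slash = path.rfind('/')      # search from the end
filename = path[last_slash + 1:]  # everything after the final slash

is_log = path.endswith('.log')
is_absolute = path.startswith('/')
```

Here first_slash is 0, last_slash is 16, and filename is 'access.log'. Both is_log and is_absolute are True.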
Dealing with lists of strings
There are four methods for dealing with string parts – join, split, partition, and
rpartition. We've already seen them to some extent, but let's take a closer look as they're
commonly-used string methods.

The split method takes a delimiter and an optional maximum number of splits. It will return
a list of strings as broken up by the delimiter. If the delimiter is not found, then a
single-element list is returned that contains the original string text. The optional maximum
limits how many times the split takes place. An example might help solidify its usage:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> string = 'cheese,mouse,cat,dog'
>>> string.split(',')
['cheese', 'mouse', 'cat', 'dog']
>>> string.split('banana')
['cheese,mouse,cat,dog']
>>> string.split(',', 2)
['cheese', 'mouse', 'cat,dog']
>>>
We've already covered the join method; it builds a single string from a list of elements,
placing the string it is called on between each of them. It is common to join around an empty
string in order to simply concatenate a larger list of values.
Finally, we have partition and rpartition. These methods act much like the split
method, except that they split only once and always return three values: the part before the
separator, the separator itself, and finally the part after the separator.
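Here is a short sketch of these methods at work on a single password-file style entry. The variable names are ours.

```python
entry = 'root:x:0:0:root:/root:/bin/bash'

# partition splits at the first separator; rpartition at the last.
login, sep, rest = entry.partition(':')
front, sep_again, shell = entry.rpartition(':')

# If the separator is missing, the whole string comes back first,
# followed by two empty strings.
missing = 'no-delimiter'.partition(':')

# join is called on the separator string, not on the list.
concatenated = ''.join(['ab', 'cd', 'ef'])
csv_line = ','.join(['cheese', 'mouse', 'cat'])
```

Here login is 'root', shell is '/bin/bash', and missing is ('no-delimiter', '', ''). The joined values are 'abcdef' and 'cheese,mouse,cat'.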
Treating strings as sequences
Remember that Python strings can be interpreted as sequences of characters as well. This
means that all common sequence operations will also work on a string. It's possible to iterate
through a string or break it into pieces using standard slicing syntax.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'abcdefg'[2-5]
'e'
>>> 'abcdefg'[2:5]
'cde'
>>> for i in 'abcdefg'[2:5]:
... print 'Letter %c' % i
...
Letter c
Letter d
Letter e
>>>

This works for both byte strings as well as Unicode strings as Python deals with the
underlying method calls at a character-level, and not a byte-level.
Have a go hero – dive into the string object
We've covered the majority of the string methods here as well as the most common usage
scenarios, but we've not touched on all of them. Additionally, there are options we've not
touched on.
Open a Python prompt and have a look at all of the methods and attributes available on a
standard string object.
>>> dir('')
['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
'__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__',
'__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__',
'__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__',
'__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'_formatter_field_name_split', '_formatter_parser', 'capitalize',
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs',
'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower',
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip',
'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition',
'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip',
'swapcase', 'title', 'translate', 'upper', 'zfill']
Using the output of dir, as well as the Python documentation (either online or via pydoc),
spend some time and familiarize yourself with the available functions. You'll be glad you did!
Summary
We covered a lot of detail in this chapter. Python's string services provide a clean mechanism
for dealing with text data at the character-level. You should now be familiar with built-in
templating, formatting, and core string manipulation. These techniques are valid and should
be considered before many more advanced approaches are evaluated.
Next, we'll leave the string basics behind and step into the standard library for a look at how
to handle some of the more commonly encountered text formats. Python makes processing
standard formats easy!

Chapter 4: Text Processing Using the Standard Library
In addition to its powerful built-in string manipulation abilities, Python
also ships with an array of standard library modules designed to parse and
manipulate common standardized text formats.
Using the standard library, it's possible to parse INI files, read CSV and related
files, and access common data formats used on the web, such as JSON. In this
chapter, we'll take a look at some of these modules and look at how they can
help us process text data a layer above the string management foundation.
We'll take a closer look at the following:
CSV, or Comma Separated Values. Python provides a rich mechanism for accessing
and extracting data from this common format, often used as a spreadsheet
stand-in.
Parsing and relying on INI files. We'll look at the standard Configuration File parsing
classes for our own purposes and as a means to read Microsoft Windows
configurations.
Parsing JSON data, as it's often used as a data delivery mechanism on the
Internet.
Learning how to better organize our log processing application via modules and
packages in order to make it more extensible going forward.

Reading CSV data
Comma separated values, or CSV, is a generic term that refers to columnar data, which is
simply separated by commas. In fact, in spite of its name, the delimiter may actually be a
different character. Other common delimiters include a tab, a space, or a semi-colon.
The major drawback to CSV data is that there is no standardization. In some circumstances,
data elements will be quoted. In other circumstances, the writing application may include
column or row headers along with the CSV data. Furthermore, consider the effects of the
various line-endings used by different operating systems.
Clearly, it's not just a matter of splitting a comma-delimited line. Python's CSV support aims
to work around the formatting variations and provide a standardized interface.
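To see why a plain split falls short, consider a field that contains the delimiter and is therefore quoted. The line below is a hypothetical example, and the sketch assumes a newer Python where io.StringIO can stand in for an open text file. On Python 2.6, the StringIO module plays the same role.

```python
import csv
import io

# The revenue value contains a comma, so the writing application quoted it.
line = '3-May-10,"1,289.41",899.54'

# A naive split cuts the quoted value in two and keeps the quote marks.
naive = line.split(',')

# csv.reader understands the quoting and returns three clean fields.
parsed = next(csv.reader(io.StringIO(line)))
```

The naive split yields four pieces, ['3-May-10', '"1', '289.41"', '899.54'], while the csv module returns ['3-May-10', '1,289.41', '899.54'].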
Time for action – processing Excel formats
The csv module provides support for formatting differences by allowing the use of different
dialects. Dialects provide details such as which delimiter to use and how to address data
element quoting.
In this example, we'll create an Excel spreadsheet and save it as a CSV document. We can
then open that via Python and access all of the fields directly.
1. First, we'll need to create an Excel spreadsheet and build an initial dataset. We'll use
some mock financial data. Build up a spreadsheet that includes the following data:
2. Now, from the File menu, select Save As. The Save As dialog contains a Format
drop-down. From this drop-down, select CSV (Comma Delimited). Name the
file Workbook1.csv. Note that if you do not have Excel, these sample files are
downloadable from the Packt Publishing FTP Site.

3. Create a new Python file and name it csv_reader.py. Enter the following code:
import csv
import sys
from optparse import OptionParser

def calculate_profit(day):
    return float(day['Revenue']) - float(day['Cost'])

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-f', '--file', help="CSV Data File")
    opts, args = parser.parse_args()
    if not opts.file:
        parser.error('File name is required')
    # Create a dict reader from an open file
    # handle and iterate through rows.
    reader = csv.DictReader(open(opts.file, 'rU'))
    for day in reader:
        print '%10s: %10.2f' % \
            (day['Date'], calculate_profit(day))
4. Running the preceding code should produce the following output, if you've copied it
correctly.
(text_processing)$ python csv_reader.py --file=./Workbook1.csv
What just happened?
Let's walk through the code here. By now, you should be familiar with both the __name__
== '__main__' section as well as the option parser. We won't cover that boilerplate stuff
any longer.
The first interesting line is reader = csv.DictReader(open(opts.file, 'rU')).
There are two things worth pointing out on this line alone. First, we're opening the file
using Universal Newline support. This is because Excel will save the CSV file according to our
platform's convention. We want Python just to hide all of that for us here.

Secondly, we're creating an instance of csv.DictReader. The basic approach to accessing
CSV data is via the csv.reader function. However, this requires us to access each row via a
list index. The csv.DictReader class uses the first row in the CSV file (by default) as the
dictionary keys. This makes it much easier to access data by name.
If we had used the standard reader, we would have had to parse our data as in the following
small example snippet:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import csv
>>> reader = csv.reader(open('Workbook1.csv', 'rU'))
>>> for row in reader:
... print 'Revenue on ' + row[0] + ': ' + row[1]
...
Revenue on Date: Revenue
Revenue on 3-May-10: 1289.41
Revenue on 4-May-10: 951.89
Revenue on 6-May-10: 2812.23
Revenue on 7-May-10: 554.34
Revenue on 8-May-10: 2419.62
Revenue on 9-May-10: 999.44
Revenue on 10-May-10: 514.78
>>>
As you can see, the dictionary approach makes it much easier to handle the processed data.
Next, we iterate through each row in the dataset and print out a profit summary. If you
take a look at the calculate_profit function, you'll see how we do this. As mentioned
before, Python is not only dynamically-typed, but also strongly-typed once a value has been
created. We have to explicitly create new floating-point types based on the text value in
order to perform our subtraction operation.
Finally, our print statement uses classic percent-formatting and adds a little bit of padding
in order to keep everything easy to read.
If you were paying attention, you'll remember we mentioned that we need a dialect in order
to process a CSV file. What gives? We didn't specify one, did we? Well, no. Python defaults
to the Excel dialect, which is exactly what we're using in our example.
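Both points can be checked with a small, self-contained sketch. The two data rows are taken from the example workbook. Using io.StringIO in place of an open file is our own shortcut and assumes a newer Python, whereas the chapter's script opens a real file.

```python
import csv
import io

data = ('Date,Revenue,Cost\n'
        '3-May-10,1289.41,899.54\n'
        '4-May-10,951.89,772.12\n')

# Leaving the dialect out is the same as asking for 'excel' by name.
rows_default = list(csv.DictReader(io.StringIO(data)))
rows_explicit = list(csv.DictReader(io.StringIO(data), dialect='excel'))

# Every value comes back as text, so subtracting them directly fails.
first = rows_default[0]
try:
    first['Revenue'] - first['Cost']
    subtraction_failed = False
except TypeError:
    subtraction_failed = True

# Converting explicitly gives us numbers we can work with.
profit = float(first['Revenue']) - float(first['Cost'])
line = '%10s: %10.2f' % (first['Date'], profit)
```

Both readers return identical rows, subtraction_failed ends up True, and the formatted line ends in 389.87.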
If you're familiar with Excel, you're probably wondering why we used Python to calculate our
profit rather than letting Excel do it for us. After all, that's what a spreadsheet application is for!

Time for action – CSV and formulas
Let's run through an example and illustrate why we chose to calculate the values ourselves
rather than letting Excel do it.
1. First, open Excel again and add a new column. We're going to name it Profit. The
value of this column should be a simple formula, =(BX-CX), where 'X' is the row
number you're at. Repeat until your spreadsheet looks like this:
2. Now, like we did with our first example, save this as Workbook2.csv. You'll need
to accept any warnings that Excel gives you. This document is also available on the
Packt Publishing FTP site.
3. Using csv_reader.py as a starting point, create csv_reader-b.py and modify
the calculate_profit function to read as follows.
def calculate_profit(day):
    return float(day['Profit'])
4. Running the example using the new CSV input should produce the following results,
if you've entered the code correctly.
(text_processing)$ python csv_reader-b.py --file=Workbook2.csv

5. Now, open the Workbook2.csv file in a text editor and prepend a 1 to every value in the
Revenue column to increase net revenue by a visible amount. Save it as Workbook2a.csv.
The updated text file should look like this:
Date,Revenue,Cost,Profit,,
3-May-10,11289.41,899.54,389.87,,
4-May-10,1951.89,772.12,179.77,,
6-May-10,12812.23,749.9,2062.33,,
7-May-10,1554.34,442.91,111.43,,
8-May-10,12419.62,1754.23,665.39,,
9-May-10,1999.44,801.12,198.32,,
10-May-10,1514.78,332.21,182.57,,
6. Finally, let's run the applicaon again, using this new source of input.
(text_processing)$ python csv_reader-b.py --file=Workbook2a.csv
What just happened?
There's not much new code here. We simply updated our calculate_profit function to
return the value stored under the Profit dictionary key rather than perform the calculation.
Pretty simple.
But, what happened? Why was the output the same for both runs? CSV data generated with
Excel (and probably all spreadsheet tools) does not contain formula information. Formula
results are calculated before the data is saved and the target cells receive that value.
The important thing to remember here is that if you're dealing with spreadsheet data,
you cannot rely on formula contents. If an input value to a formula changes outside of the
application, you'll need to perform that calculation yourself, within Python.
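The following sketch makes the stale values visible by comparing the stored Profit column with a freshly computed one. It uses the first two rows of the edited file. As before, io.StringIO is our stand-in for the real file and assumes a newer Python.

```python
import csv
import io

edited = io.StringIO(
    'Date,Revenue,Cost,Profit,,\n'
    '3-May-10,11289.41,899.54,389.87,,\n'
    '4-May-10,1951.89,772.12,179.77,,\n')

results = []
for day in csv.DictReader(edited):
    stored = float(day['Profit'])                          # saved by Excel
    computed = float(day['Revenue']) - float(day['Cost'])  # recalculated
    results.append((day['Date'], stored, round(computed, 2)))
```

For 3-May-10 the stored profit is still 389.87, while the recalculated value is 10389.87. Only the second figure reflects the edited revenue.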
If you have a desire to read and manipulate native Excel files, the xlrd module
provides that functionality. It is available on the Python Package Index at
http://pypi.python.org/pypi/xlrd/0.7.1.

Reading non-Excel data
Not all CSV data is generated and written by Microsoft Excel. In fact, it's a fairly open and
flexible format and is used in a lot of other arenas as well. For example, many shopping-cart
applications and online-banking utilities allow end users to export data using this format as
almost all spreadsheet applications can read it.
In order to read a non-Excel format, we'll need to define our own CSV dialect, which tells
the parser what to expect as a delimiter, whether values are quoted, and a few other details
as well.
Time for action – processing custom CSV formats
In this example, we'll build a Dialect class that is responsible for interpreting our own
format. We'll use some alternate delimiters and some different processing settings. This is
the general approach you'll use when parsing your own format files.
We're going to process a UNIX style /etc/passwd file in this example. If you're not familiar
with the format, here's a small sample:
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
lp:x:7:7:lp:/var/spool/lpd:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
uucp:x:10:10:uucp:/var/spool/uucp:/bin/sh
proxy:x:13:13:proxy:/bin:/bin/sh
www-data:x:33:33:www-data:/var/www:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/sh
Each line is a colon-separated list of values. We're only going to concern ourselves with the
first and the last values: the user's login name and the shell application that is executed
when a login occurs.
If you're following along using a Windows machine, you obviously do not have
an /etc/passwd file. An example file is available on the Packt Publishing FTP
site. These examples will use that file so they match up for all users.

1. Create a new file named csv_reader-c.py and enter the following code. Note
that this file is based on the csv_reader.py source we created earlier in the
chapter.
import csv
import sys
from optparse import OptionParser

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-f', '--file', help="CSV Data File")
    opts, args = parser.parse_args()
    if not opts.file:
        parser.error('File name is required')
    csv.register_dialect('passwd', delimiter=':',
        quoting=csv.QUOTE_NONE)
    dict_keys = ('login', 'pwd', 'uid', 'gid',
        'comment', 'home', 'shell')
    # Create a dict reader from an open file
    # handle and iterate through rows.
    reader = csv.DictReader(
        open(opts.file, 'rU'), fieldnames=dict_keys,
        dialect='passwd')
    for user in reader:
        print '%s logs in with %s' % \
            (user['login'], user['shell'])
2. Run the preceding example using an /etc/passwd file as input. We'll use the
example provided, but feel free to use your own if you wish.
(text_processing)$ python csv_reader-c.py --file=passwd

What just happened?
We made a few changes to our csv_reader.py code in order to manage UNIX /etc/passwd
files to illustrate how you would go about processing non-Excel compatible formats.
The first line we'll look at is the call to csv.register_dialect. In this call, we're adding
an entirely new CSV dialect, named passwd. We're setting the delimiter to a single colon and
configuring the system not to expect quotes. This is a convenient way to introduce a new
dialect, but it's not the only way.
If we had a reason to, we could have extended the Dialect class and passed that in instead
of a series of keyword arguments to csv.register_dialect. In most cases, though, you
will use keyword arguments as a Dialect is simply a collection of processing options.
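For comparison, the class-based route might look like the following sketch. The PasswdDialect class and the passwd_class dialect name are hypothetical. We subclass csv.excel here rather than csv.Dialect directly so that the options we don't care about inherit working defaults. The two sample lines come from the passwd excerpt shown earlier, and io.StringIO assumes a newer Python.

```python
import csv
import io

class PasswdDialect(csv.excel):
    """Hypothetical dialect: excel defaults, adjusted for passwd files."""
    delimiter = ':'
    quoting = csv.QUOTE_NONE

# A Dialect subclass can be registered in place of keyword arguments.
csv.register_dialect('passwd_class', PasswdDialect)

dict_keys = ('login', 'pwd', 'uid', 'gid', 'comment', 'home', 'shell')
sample = io.StringIO(
    'root:x:0:0:root:/root:/bin/bash\n'
    'sync:x:4:65534:sync:/bin:/bin/sync\n')

reader = csv.DictReader(sample, fieldnames=dict_keys,
                        dialect='passwd_class')
logins = [(user['login'], user['shell']) for user in reader]
```

The resulting logins list is [('root', '/bin/bash'), ('sync', '/bin/sync')], the same pairs the keyword-argument version would report.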
Next, we create a tuple of dictionary keys. The DictReader uses the first line of a CSV file
as its set of dictionary keys by default. As a password file does not contain a header as our
Excel sheets did, we need to explicitly pass in the list of dictionary keys to use. They should
be in the order in which they'll be split.

Finally, we call csv.DictReader again, but this time, we specify the dialect name to use as
well as the dictionary keys in the tuple we just created. The remainder of this example simply
prints out a user and her corresponding login shell.
Writing CSV data
We've looked at methods for parsing two different dialects of CSV: Excel formats and our
own custom format. Let's wrap up our discussion on CSV by looking at how we would write
out a new file.
Time for action – creating a spreadsheet of UNIX users
We're going to read our UNIX password database using the code we've already developed,
and transform it into an Excel-friendly CSV dialect. We should then be able to open our list of
users in spreadsheet format if we choose.
1. Create a new file and name it csv_translate.py.
2. Enter the following code:
import csv
import sys
from optparse import OptionParser

parser = OptionParser()
parser.add_option('-f', '--file', help="CSV Data File")
opts, args = parser.parse_args()
if not opts.file:
    parser.error('File name is required')

csv.register_dialect('passwd', delimiter=':',
    quoting=csv.QUOTE_NONE)

dict_keys = ('login', 'pwd', 'uid', 'gid',
    'comment', 'home', 'shell')
print ','.join([i.title() for i in dict_keys])
writer = csv.DictWriter(sys.stdout, dict_keys)

# Create a dict reader from an open file
# handle and iterate through rows.
reader = csv.DictReader(
    open(opts.file, 'rU'), fieldnames=dict_keys, dialect='passwd')
writer.writerows(reader)

3. Now, run the example using the supplied passwd file as your input. Redirect the
output to a file named passwd.csv.
(text_processing)$ python csv_translate.py --file=passwd > passwd.csv
4. The contents of the newly created CSV file should be exactly as follows.

5. Finally, open the new CSV file in Microsoft Excel or OpenOffice. The rendered
spreadsheet should resemble the following screenshot:
What just happened?
Using two different dialects, we read from our password file and wrote Excel-friendly CSV to
our standard output channel.
Let's skip over the boilerplate code again and look at what makes this example actually
work. First, look at the two lines that appear directly under the dict_keys assignment. We're
doing two important things here. First, we translate the keys we've been using into title
case via a list comprehension and join them with a comma. Both of these steps use string
object methods covered in the previous chapter. In the same line, we then print this newly
generated value. This serves as the top line of the new CSV.

The next line creates a writer object, which simply takes a file-like object and a list of
dictionary keys. Note that the list of keys is required here as Python's dictionaries are
unordered. This tells the writer in which order to print the dictionary values. The actual write
logic executes much like the following small example:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> dicts = [{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value1',
... 'key2': 'value2'}]
>>> key_order = ('key2', 'key1')
>>> for d in dicts:
...     print ','.join([d[key] for key in key_order])
...
value2,value1
value2,value1
>>>
Finally, we call writer.writerows(reader) to read all of the data from the source
CSV and print it to the new destination. The writerows method of a DictWriter object
expects a sequence of dictionaries with the appropriate keys.
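The writer side can be tried without a password file at all. The sketch below is an assumption-laden miniature of csv_translate.py: the two user records are invented, only three of the seven keys are used, and an in-memory buffer replaces sys.stdout. It is written for a Python 3 interpreter, where io.StringIO accepts the text that the csv module writes.

```python
import csv
import io

dict_keys = ('login', 'uid', 'shell')

# Invented records; a DictReader would normally supply these.
users = [
    {'login': 'root', 'uid': '0', 'shell': '/bin/sh'},
    {'login': 'jeff', 'uid': '501', 'shell': '/bin/bash'},
]

# An in-memory buffer stands in for sys.stdout.
buf = io.StringIO()

# Header row: title-cased keys joined with commas, as in the listing.
buf.write(','.join([k.title() for k in dict_keys]) + '\n')

# Extra keyword arguments such as lineterminator reach the underlying writer.
writer = csv.DictWriter(buf, dict_keys, lineterminator='\n')
writer.writerows(users)

print(buf.getvalue())
```

The columns come out in dict_keys order regardless of how each dictionary was built.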
Pop Quiz – CSV handling
1. We've described two methods of creating new CSV dialects. What are they? In what
situations might you choose one over the other?
2. What's the drawback to simply using the split method of the string object for
parsing CSV data? Why isn't this approach reliable?
3. How are formulas executed once a spreadsheet document has been saved in a
text-only CSV format?
Have a go hero – detecting CSV dialects
One aspect of the CSV module we didn't cover here is the csv.Sniffer class. This class
attempts to build a new dialect based on a sample segment of a CSV file. You can read more
about the Sniffer class at http://docs.python.org/library/csv.html.
Given your knowledge of CSV files and how to process them, update the previous code to
automatically detect the CSV dialect in use given our example passwd file. If you're using a
UNIX system, try it on your own passwd file. Does it work? In which situations do you run
into issues?

Modifying application configuration files
As you develop applications, you're going to want to allow your end users to make
runtime changes without updating and editing source code. This is where the need for a
configuration file comes in.
You've surely dealt with them before as you've set up and managed different computer
systems and applications. Perhaps you've had to edit one while defining a web server virtual
host, or while configuring drivers or boot preferences.
For the most part, applications choose their own configuration formats and implement their
own parsers, to some degree. Some files contain simple name-value pairs while others build
programming-language-like structures. Still others implement sections and segment values
even further.
Luckily, Python provides a full-featured configuration file management module for us, so we
don't have to worry about writing our own error-prone processing logic. As an added benefit,
Python's ConfigParser module also supports the generation of new configuration files
using Python data structures. This means we can easily write new files as well.
Time for action – adding basic configuration read support
In this example, we'll add some basic configuration file support into our ever-growing
log-processing application. There are a few values that we've been passing on the
command line that have become somewhat repetitive. Let's fix that.
1. First, create logscan-h.py, using logscan-g.py as your starting place.
2. Update the import statements at the top of the file to look like this:
import time
import string
import sys
from optparse import OptionParser
from collections import defaultdict
from ConfigParser import SafeConfigParser
from ConfigParser import ParsingError
3. Now, directly below the MaxSizeHandler class, add the following
configuration parser function. Note that this is not a part of the
MaxSizeHandler class and should not have a base indent.
def load_config():
    """
    Load configuration.

    Reads the name of the configuration
    file from sys.argv and loads our
    config from disk.
    """
    parser = OptionParser()
    parser.add_option('-c', '--config', dest='config',
        help="Configuration File Path")
    opts, args = parser.parse_args()
    if not opts.config:
        parser.error('Configuration File Required')
    config_parser = SafeConfigParser()
    if not config_parser.read(opts.config):
        parser.error('Could not parse configuration')
    return config_parser
4. We need to update our __main__ section to take advantage of our new
configuration file support. Update your main section to read as follows:
if __name__ == '__main__':
    config = load_config()
    input_source = config.get('main', 'input_source')
    if input_source == '-':
        file_stream = sys.stdin
    else:
        try:
            file_stream = open(input_source)
        except IOError, e:
            print >>sys.stderr, str(e)
            sys.exit(-1)
    size_check = MaxSizeHandler(
        int(config.get(
            'maxsize', 'threshold')
        )
    )
    call_chain = []
    call_chain.append(size_check)
    call_chain.append(ErrorCodeHandler())
    processor = LogProcessor(call_chain)
    initial = time.time()
    line_count = processor.parse(file_stream)
    duration = time.time() - initial
    # Ask the processor to display the
    # individual reports.
    processor.report()
    if config.getboolean('display', 'show_footer'):
        # Print our internal statistics
        print "Report Complete!"
        print "Elapsed Time: %#.8f seconds" % duration
        print "Lines Processed: %d" % line_count
        # Parenthesize the conditional so the label is
        # still printed when no lines were processed.
        print "Avg. Duration per line: %#.16f seconds" % \
            ((duration / line_count) if line_count else 0)
5. The next thing to do is create a basic configuration file. Enter the following text into
a file named logscan.cfg:
[main]
# Input filename. This must be either a pathname or a simple
# dash (-), which signifies we'll use standard in.
input_source = example3.log
[maxsize]
# When we hit this threshold, we'll alert for maximum
# file size.
threshold = 100
[display]
# Whether we want to see the final footer calculations or
# not. Sometimes things like this just get in the way.
show_footer = no
6. Now, let's run the example using this configuration. If you entered everything
correctly, then your output should resemble the following:
(text_processing)$ python logscan-h.py --config=logscan.cfg

7. Finally, open up the configuration file and comment out the very last line. It should
begin with show_footer. Run the application again. You should see the following
output:
(text_processing)$ python logscan-h.py --config=logscan.cfg
What just happened?
We opened, scanned, processed, converted, and used elements of an ini-style configuration
file without having to deal with a single split or white space trim! Let's have a closer look at
how we set everything up.
First off, we updated our import statements to include the needed classes within the
ConfigParser module. In many cases, it's simpler to just import the ConfigParser
module itself rather than individual classes. We did it this way in order to save a bit of space
in the example text.
Next, we added a load_config function that is responsible for handling most of the actual
work. The first thing we do here is parse our command line for a single -c (or --config)
option, which is the location of our file. This option is required and we'll exit if it's not found
(more on that later).
Next, we instantiate a SafeConfigParser class and attempt to make it read the file whose
name we passed in via the command-line option. If the read doesn't succeed then we exit with a
rather generic error. We return the config_parser object after we have read our file.

Skip now to our __main__ section. The very first thing we do here is process our
configuration file via the new function. The very next line shows the canonical way of
accessing data, via the get method. The get method takes a configuration file section as
well as a value name. This first access retrieves the input_source value, which is the name
of our logfile.
Next, we access the configuration object again when we create our MaxSizeHandler class.
We pull the threshold size out and pass it to the constructor.
Notice that we have to explicitly convert our data to an integer type. Values read
via configuration files are typed as strings.
The final time we access our configuration object is near the bottom when we check the
display section for the show_footer value. If it's not True, we won't print our familiar footer
text. Here, we use a convenience method available to us, called getboolean. There are a
series of these methods available that automatically handle the data transformation for us.
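The accessors described above can be tried on their own, without the rest of the log processor. The following is a minimal sketch under a few assumptions: it targets a Python 3 interpreter, where the module is spelled configparser and ConfigParser takes the place of SafeConfigParser, and it feeds the parser an inline copy of logscan.cfg through read_string rather than reading from disk.

```python
import configparser  # spelled ConfigParser in the book's Python 2.6

# Inline stand-in for logscan.cfg.
SAMPLE = """
[main]
input_source = example3.log

[maxsize]
threshold = 100

[display]
show_footer = no
"""

config = configparser.ConfigParser()  # SafeConfigParser in the listing
config.read_string(SAMPLE)            # Python 3 helper; read() takes a path

source = config.get('main', 'input_source')

# get() always hands back a string...
raw_threshold = config.get('maxsize', 'threshold')
# ...while getint() and getboolean() convert for us.
threshold = config.getint('maxsize', 'threshold')
show_footer = config.getboolean('display', 'show_footer')

print(repr(raw_threshold))
print(threshold)
print(show_footer)
```

Using getint here would replace the explicit int(config.get(...)) call in the listing; both arrive at the same integer.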
The last thing we did was to comment out a configuration line and run our application. In
doing so, you'll notice that it results in a fatal error! This probably isn't what we want most
of the time. It's possible to avoid this situation and set reasonable default values.
One nice thing about the SafeConfigParser classes is that they're also able
to read Microsoft Windows configuration files directly. However, none of the
ConfigParser classes support value-type prefixes found in extended version
INI syntax.
Using value interpolation
One really interesting feature of the ConfigParser module is that it supports configuration
value interpolation, or substitution, directly within the configuration file itself. This is a very
useful feature.
Time for action – relying on configuration value interpolation
For this example, we'll simply update our configuration file to take advantage of this feature.
There are no Python code changes necessary.
1. First, add a new configuration value to the [main] section of logscan.cfg.
The name of the value should be dir and the value should be the full path to the
directory that you're executing examples from.
[main]
# The main directory where we're running from (or, rather, where
# we store logfiles and write output to)
dir = /Users/jeff/Desktop/ptpbg/Chapters/Ch4
2. Next, you're going to update the input_source configuration option to reference
this full path.
# Input filename. This must be either a pathname or a simple
# dash (-), which signifies we'll use standard in.
input_source = %(dir)s/www.log
3. Finally, running this updated example should produce the same output as the
previous execution did.
(text_processing)$ python logscan-h.py --config=logscan.cfg
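The substitution made in the steps above can also be checked in isolation. This sketch assumes Python 3's configparser module and uses a made-up directory, /tmp/logs, in place of a real path; the raw=True argument asks get for the value exactly as written in the file.

```python
import configparser

# /tmp/logs is a placeholder directory for illustration only.
SAMPLE = """
[main]
dir = /tmp/logs
input_source = %(dir)s/www.log
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# Interpolation happens when the value is fetched.
expanded = config.get('main', 'input_source')

# raw=True skips interpolation and returns the stored text.
literal = config.get('main', 'input_source', raw=True)

print(expanded)
print(literal)
```

Seeing both forms side by side makes it clear that the file itself is never rewritten; only the value handed back to the program changes.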
What just happened?
We included the value of a configuration option within a second one by using the
familiar percent syntax. This allows us to