Python 2.6 Text Processing Beginner's Guide (2010)

Python 2.6 Text Processing
Beginner's Guide
The easiest way to learn how to manipulate text with Python
Je McNeil
BIRMINGHAM - MUMBAI
Copyright © 2010 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers
and distributors will be held liable for any damages caused or alleged to be caused directly or
indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2010
Production Reference: 1081210
Published by Packt Publishing Ltd.
32 Lincoln Road
Olton
Birmingham, B27 6PA, UK.
ISBN 978-1-849512-12-1
www.packtpub.com
Cover Image by John Quick (john@johnmquick.com)
Credits
Author
Je McNeil
Reviewer
Maurice HT Ling
Acquision Editor
Steven Wilding
Development Editor
Reshma Sundaresan
Technical Editor
Gauri Iyer
Indexer
Tejal Daruwale
Editorial Team Leader
Mithun Sehgal
Project Team Leader
Priya Mukherji
Project Coordinator
Shubhanjan Chaerjee
Proofreader
Jonathan Todd
Graphics
Nilesh R. Mohite
Producon Coordinator
Kruthika Bangera
Cover Work
Kruthika Bangera
About the Author
Je McNeil has been working in the Internet Services industry for over 10 years. He cut
his teeth during the late 90's Internet boom and has been developing soware for Unix and
Unix-avored systems ever since. Je has been a full-me Python developer for the beer
half of that me and has professional experience with a collecon of other languages,
including C, Java, and Perl. He takes an interest in systems administraon and server
automaon problems. Je recently joined Google and has had the pleasure of working with
some very talented individuals.
I'd like to above all thank Julie, Savannah, Phoebe, Maya, and Trixie for
allowing me to lock myself in the oce every night for months. The
Web.com gang and those in the Python community willing to share their
authoring experiences. Finally, Steven Wilding, Reshma Sundaresan,
Shubhanjan Chaerjee, and the rest of the Packt Publishing team for all of
the hard work and guidance.
About the Reviewer
Maurice HT Ling completed his Ph.D. in Bioinformatics and B.Sc(Hons) in Molecular and
Cell Biology from the University of Melbourne where he worked on microarray analysis
and text mining for protein-protein interactions. He is currently an honorary fellow in the
University of Melbourne, Australia. Maurice holds several Chief Editorships, including the
Python papers, Computational, and Mathematical Biology, and Methods and Cases in
Computational, Mathematical and Statistical Biology. In Singapore, he co-founded the Python
User Group (Singapore) and is the co-chair of PyCon Asia-Pacific 2010. In his free time,
Maurice likes to train in the gym, read, and enjoy a good cup of coffee. He is also a senior
fellow of the International Fitness Association, USA.
www.PacktPub.com
Support les, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support les and downloads related
to your book.
Did you know that Packt oers eBook versions of every book published, with PDF and ePub
les available? You can upgrade to the eBook version at www.PacktPub.com, and as a print
book customer, you are entled to a discount on the eBook copy. Get in touch with us at
service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collecon of free technical arcles, sign up for a
range of free newsleers, and receive exclusive discounts and oers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant soluons to your IT quesons? PacktLib is Packt's online digital book
library. Here, you can access, read, and search across Packt's enre library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view nine enrely free books. Simply use your login credenals for
immediate access.
Table of Contents
Preface
Chapter 1: Getting Started
Categorizing types of text data
Providing information through markup
Meaning through structured formats
Understanding freeform content
Ensuring you have Python installed
Providing support for Python 3
Implementing a simple cipher
Time for action – implementing a ROT13 encoder
Processing structured markup with a filter
Time for action – processing as a filter
Time for action – skipping over markup tags
State machines
Supporting third-party modules
Packaging in a nutshell
Time for action – installing SetupTools
Running a virtual environment
Configuring virtualenv
Time for action – configuring a virtual environment
Where to get help?
Summary
Chapter 2: Working with the IO System
Parsing web server logs
Time for action – generating transfer statistics
Using objects interchangeably
Time for action – introducing a new log format
Accessing files directly
Time for action – accessing files directly
Context managers
Handling other file types
Time for action – handling compressed files
Implementing file-like objects
File object methods
Enabling universal newlines
Accessing multiple files
Time for action – spell-checking HTML content
Simplifying multiple file access
Inplace filtering
Accessing remote files
Time for action – spell-checking live HTML pages
Error handling
Time for action – handling urllib2 errors
Handling string IO instances
Understanding IO in Python 3
Summary
Chapter 3: Python String Services
Understanding the basics of string object
Defining strings
Time for action – employee management
Building non-literal strings
String formatting
Time for action – customizing log processor output
Percent (modulo) formatting
Mapping key
Conversion flags
Minimum width
Precision
Width
Conversion type
Using the format method approach
Time for action – adding status code data
Making use of conversion specifiers
Creating templates
Time for action – displaying warnings on malformed lines
Template syntax
Rendering a template
Calling string object methods
Time for action – simple manipulation with string methods
Aligning text
Detecting character classes
Casing
Searching strings
Dealing with lists of strings
Treating strings as sequences
Summary
Chapter 4: Text Processing Using the Standard Library
Reading CSV data
Time for action – processing Excel formats
Time for action – CSV and formulas
Reading non-Excel data
Time for action – processing custom CSV formats
Writing CSV data
Time for action – creating a spreadsheet of UNIX users
Modifying application configuration files
Time for action – adding basic configuration read support
Using value interpolation
Time for action – relying on configuration value interpolation
Handling default options
Time for action – configuration defaults
Writing configuration data
Time for action – generating a configuration file
Reconfiguring our source
A note on Python 3
Time for action – creating an egg-based package
Understanding the setup.py file
Working with JSON
Time for action – writing JSON data
Encoding data
Decoding data
Summary
Chapter 5: Regular Expressions
Simple string matching
Time for action – testing an HTTP URL
Understanding the match function
Learning basic syntax
Detecting repetition
Specifying character sets and classes
Applying anchors to restrict matches
Wrapping it up
Advanced pattern matching
Grouping
Time for action – regular expression grouping
Using greedy versus non-greedy operators
Assertions
Performing an 'or' operation
Implementing Python-specific elements
Other search functions
search
findall and finditer
split
sub
Compiled expression objects
Dealing with performance issues
Parser flags
Unicode regular expressions
The match object
Processing bind zone files
Time for action – reading DNS records
Summary
Chapter 6: Structured Markup
XML data
SAX processing
Time for action – event-driven processing
Incremental processing
Time for action – driving incremental processing
Building an application
Time for action – creating a dungeon adventure game
The Document Object Model
xml.dom.minidom
Time for action – updating our game to use DOM processing
Creating and modifying documents programmatically
XPath
Accessing XML data using ElementTree
Time for action – using XPath in our adventure
Reading HTML
Time for action – displaying links in an HTML page
BeautifulSoup
Summary
Chapter 7: Creating Templates
Time for action – installing Mako
Basic Mako usage
Time for action – loading a simple Mako template
Generating a template context
Managing execution with control structures
Including Python code
Time for action – reformatting the date with Python code
Adding functionality with tags
Rendering files with %include
Generating multiline comments with %doc
Documenting Mako with %text
Defining functions with %def
Time for action – defining Mako def tags
Importing %def sections using %namespace
Time for action – converting mail message to use namespaces
Filtering output
Expression filters
Filtering the output of %def blocks
Setting default filters
Inheriting from base templates
Time for action – updating base template
Growing the inheritance chain
Time for action – adding another inheritance layer
Inheriting attributes
Customizing
Custom tags
Time for action – creating custom Mako tags
Customizing filters
Overviewing alternative approaches
Summary
Chapter 8: Understanding Encodings and i18n
Understanding basic character encodings
ASCII
Limitations of ASCII
KOI8-R
Unicode
Using Unicode with Python 3
Understanding Unicode
Design goals
Organizational structure
Backwards compatibility
Encoding
UTF-32
UTF-8
Encodings in Python
Time for action – manually decoding
Reading Unicode
Writing Unicode strings
Time for action – copying Unicode data
Time for action – fixing our copy application
The codecs module
Time for action – changing encodings
Adopting good practices
Internationalization and Localization
Preparing an application for translation
Time for action – preparing for multiple languages
Time for action – providing translations
Looking for more information on internationalization
Summary
Chapter 9: Advanced Output Formats
Dealing with PDF files using PLATYPUS
Time for action – installing ReportLab
Generating PDF documents
Time for action – writing PDF with basic layout and style
Writing native Excel data
Time for action – installing xlwt
Building XLS documents
Time for action – generating XLS data
Working with OpenDocument files
Time for action – installing ODFPy
Building an ODT generator
Time for action – generating ODT data
Summary
Chapter 10: Advanced Parsing and Grammars
Defining a language syntax
Specifying grammar with Backus-Naur Form
Grammar-driven parsing
PyParsing
Time for action – installing PyParsing
Time for action – implementing a calculator
Parse actions
Time for action – handling type translations
Suppressing parts of a match
Time for action – suppressing portions of a match
Processing data using the Natural Language Toolkit
Time for action – installing NLTK
NLTK processing examples
Removing stems
Discovering collocations
Summary
Chapter 11: Searching and Indexing
Understanding search complexity
Time for action – implementing a linear search
Text indexing
Time for action – installing Nucular
An introduction to Nucular
Time for action – full text indexing
Time for action – measuring index benefit
Scripts provided by Nucular
Using XML files
Advanced Nucular features
Time for action – field-qualified indexes
Performing an enhanced search
Time for action – performing advanced Nucular queries
Indexing and searching other data
Time for action – indexing Open Office documents
Other index systems
Apache Lucene
ZODB and zc.catalog
SQL text indexing
Summary
Appendix A: Looking for Additional Resources
Python resources
Unofficial documentation
Python enhancement proposals
Self-documenting
Using other documentation tools
Community resources
Following groups and mailing lists
Finding a users' group
Attending a local Python conference
Honorable mention
Lucene and Solr
Generating C-based parsers with GNU Bison
Apache Tika
Getting started with Python 3
Major language changes
Print is now a function
Catching exceptions
Using metaclasses
New reserved words
Major library changes
Changes to list comprehensions
Migrating to Python 3
Time for action – using 2to3 to move to Python 3
Summary
Appendix B: Pop Quiz Answers
Chapter 1: Getting Started
ROT13 Processing Answers
Chapter 2: Working with the IO System
File-like objects
Chapter 3: Python String Services
String literals
String formatting
Chapter 4: Text Processing Using the Standard Library
CSV handling
JSON formatting
Chapter 5: Regular Expressions
Regular expressions
Understanding the Pythonisms
Chapter 6: Structured Markup
SAX processing
Chapter 7: Creating Templates
Template inheritance
Chapter 8: Understanding Encodings and i18n
Character encodings
Python encodings
Internationalization
Chapter 9: Advanced Output Formats
Creating XLS documents
Chapter 11: Searching and Indexing
Introduction to Nucular
Index
Preface
The Python Text Processing Beginner's Guide is intended to provide a gentle, hands-on
introduction to processing, understanding, and generating textual data using the Python
programming language. Care is taken to ensure the content is example-driven, while still
providing enough background information to allow for a solid understanding of the topics
covered.
Throughout the book, we use real world examples such as logfile processing and PDF
creation to help you further understand different aspects of text handling. By the time you've
finished, you'll have a solid working knowledge of both structured and unstructured text
data management. We'll also look at practical indexing and character encodings.
A good deal of supporting information is included. We'll touch on packaging, Python IO,
third-party utilities, and some details on working with the Python 3 series releases. We'll
even spend a bit of time porting a small example application to the latest version.
Finally, we do our best to provide a number of high quality external references. While this
book will cover a broad range of topics, we also want to help you dig deeper when necessary.
What this book covers
Chapter 1, Geng Started: This chapter provides an introducon into character and string
data types and how strings are represented using underlying integers. We'll implement a
simple encoding script to illustrate how text can be manipulated at the character level. We
also set up our systems to allow safe third-party library installaon.
Chapter 2, Working with the IO System: Here, you'll learn how to access your data. We cover
Python's IO capabilies in this chapter. We'll learn how to access les locally and remotely.
Finally, we cover how Python's IO layers change in Python 3.
Chapter 3, Python String Services: Covers Python's core string funconality. We look at the
methods of string objects, the core template classes, and Python's various string formang
methods. We introduce the dierences between Unicode and string objects here.
Preface
[ 2 ]
Chapter 4, Test Processing Using the Standard Library: The standard Python distribuon
includes a powerful set of built-in libraries designed to manage textual content. We look
at conguraon le reading and manipulaon, CSV les, and JSON data. We take a bit of a
detour at the end of this chapter to learn how to create your own redistributable Python egg
les.
Chapter 5, Regular Expressions: Looks at Python's regular expression implementaon and
teaches you how to implement them. We look at standardized concepts as well as Python's
extensions. We'll break down a few graphically so that the component parts are easy to piece
together. You'll also learn how to safely use regular expressions with internaonal alphabets.
Chapter 6, Structured Markup: Introduces you to XML and HTML processing. We create an
adventure game using both SAX and DOM approaches. We also look briey at lxml and
ElementTree. HTML parsing is also covered.
Chapter 7, Creang Templates: Using the Mako template language, we'll generate e-mail
and HTML text templates much like the ones that you'll encounter within common web
frameworks. We visit template creaon, inheritance, lters, and custom tag creaon.
Chapter 8, Understanding Encodings and i18n: We provide a look into character encoding
schemes and how they work. For reference, we'll examine ASCII as well as KOI8-R. We also
look into Unicode and its various encoding mechanisms. Finally, we nish up with a quick
look at applicaon internaonalizaon.
Chapter 9, Advanced Output Formats: Provides informaon on how to generate PDF, Excel,
and OpenDocument data. We'll build these document types from scratch using direct Python
API calls relying on third-party libraries.
Chapter 10, Advanced Parsing and Grammars: A look at more advanced text manipulaon
techniques such as those used by programming language designers. We'll use the PyParsing
library to handle some conguraon le management and look into the Python Natural
Language Toolkit.
Chapter 11, Searching and Indexing: A praccal look at full text searching and the benet an
index can provide. We'll use the Nucular system to index a collecon of small text les and
make them quickly searchable.
Appendix A, Looking for Addional Resources: It introduces you to places of interest on the
Internet and some community resources. In this appendix, you will learn to create your own
documentaon and to use Java Lucene based engines. You will also learn about dierences
between Python 2 & Python 3 and to port code to Python 3.
Preface
[ 3 ]
What you need for this book
This book assumes you have an elementary knowledge of the Python programming language,
so we don't provide a tutorial introduction. From a software angle, you'll simply need a
version of Python (2.6 or later) installed. Each time we require a third-party library, we'll
detail the installation in text.
Who this book is for
If you are a novice Python developer who is interested in processing text then this book is for
you. You need no experience with text processing, though basic knowledge of Python would
help you to better understand some of the topics covered by this book. As the content of this
book develops gradually, you will be able to pick up Python while reading.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions of how to complete a procedure or task, we use:
Time for action – heading
1. Action 1
2. Action 2
3. Action 3
Instructions often need some extra explanation so that they make sense, so they are
followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop Quiz – heading
These are short multiple choice questions intended to help you test your own understanding.
Have a go hero – heading
These set practical challenges and give you ideas for experimenting with what you have
learned.
You will also find a number of styles of text that distinguish between different kinds of
information. Here are some examples of these styles, and explanations of their meanings.
Code words in text are shown as follows: "First of all, we imported the re module"
A block of code is set as follows:
parser = OptionParser()
parser.add_option('-f', '--file', help="CSV Data File")
opts, args = parser.parse_args()
if not opts.file:
When we wish to draw your attention to a particular part of a code block, the relevant lines
or items are set in bold:
def init_game(self):
    """
    Process World XML.
    """
    self.location = parse(open(self.world)).documentElement
Any command-line input or output is written as follows:
(text_processing)$ python render_mail.py thank_you-e.txt
New terms and important words are shown in bold. Words that you see on the screen, in
menus or dialog boxes for example, appear in the text like this: "Any X found in the source
data would simply become an A in the output data."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or may have disliked. Reader feedback is important for us to
develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to feedback@packtpub.com, and
mention the book title via the subject of your message.
If there is a book that you need and would like to see us publish, please send us a note in the
SUGGEST A TITLE form on www.packtpub.com or e-mail suggest@packtpub.com.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code for this book
You can download the example code files for all Packt books you have purchased
from your account at http://www.PacktPub.com. If you purchased this
book elsewhere, you can visit http://www.PacktPub.com/support and
register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the
code), we would be grateful if you would report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you
find any errata, please report them by visiting http://www.packtpub.com/support,
selecting your book, clicking on the errata submission form link, and entering the details
of your errata. Once your errata are verified, your submission will be accepted and the
errata will be uploaded on our website, or added to any list of existing errata, under the
Errata section of that title. Any existing errata can be viewed by selecting your title from
http://www.packtpub.com/support.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt,
we take the protection of our copyright and licenses very seriously. If you come across any
illegal copies of our works, in any form, on the Internet, please provide us with the location
address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable
content.
Questions
You can contact us at questions@packtpub.com if you are having a problem with any
aspect of the book, and we will do our best to address it.
1
Getting Started
As computer professionals, we deal with text data every day. Developers and
programmers interact with XML and source code. System administrators
have to process and understand logfiles. Managers need to understand and
format financial data and reports. Web designers put in time, hand tuning and
polishing up HTML content. Managing this broad range of formats can seem
like a daunting task, but it's really not that difficult.
This book aims to introduce you, the programmer, to a variety of methods used
to process these data formats. We'll look at approaches ranging from standard
language functions through more complex third-party modules. Somewhere in
there, we'll cover a utility that's just the right tool for your specific job. In the
process, we hope to also cover some Python development best practices.
Where appropriate, we'll look into implementation details enough to help you
understand the techniques used. Most of the time, though, we'll work as hard
as we can to get you up on your feet and crunching those text files.
You'll find that Python makes tasks like this quite painless through its clean and
easy-to-understand syntax, vast community, and the available collection of
additional utilities and modules.
In this chapter, we shall:
Briey introduce the data formats handled in this book
Implement a simple ROT13 translator
Introduce you to basic processing via lter programs
Learn state machine basics
Geng Started
[ 8 ]
Learn how to install supporng libraries and components safely and without
administrave access
Look at where to nd more informaon on introductory topics
Categorizing types of text data
Textual data comes in a variety of formats. For our purposes, we'll categorize text into three
very broad groups. Breaking the data down into segments helps us to understand the problem
a bit better, and subsequently choose a parsing approach. Each one of these sweeping groups
can be further broken down into more detailed chunks.
One thing to remember when working your way through the book is that text content isn't
limited to the Latin alphabet. This is especially true when dealing with data acquired via the
Internet. We'll cover some of the techniques and tricks to handling internationalized data in
Chapter 8, Understanding Encodings and i18n.
Providing information through markup
Structured text includes formats such as XML and HTML. These formats generally consist of
text content surrounded by special symbols or markers that give extra meaning to a file's
contents. These additional tags are usually meant to convey information to the processing
application and to arrange information in a tree-like structure. Markup allows a developer to
define his or her own data structure, yet rely on standardized parsers to extract elements.
For example, consider the following contrived HTML document.
<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<p>
Hi there, all of you earthlings.
</p>
<p>
Take us to your leader.
</p>
</body>
</html>
In this example, our document's title is clearly identified because it is surrounded by opening
and closing <title> and </title> tags.
Note that although the document's tags give each element
a meaning, it's still up to the application developer to
understand what to do with a title object or a p element.
Noce that while it sll has meaning to us humans, it is also laid out in such a way as to make
it computer friendly. We'll take a deeper look into these formats in Chapter 6, Structured
Markup. Python provides some rich libraries for dealing with these popular formats.
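As a rough illustration of leaning on a standard parser, the following sketch pulls the title out of a trimmed copy of the HTML document shown earlier. It is not taken from the book's code bundle: the TitleParser class and its attributes are names chosen here for illustration, and the import uses the Python 3 module name (under Python 2 the same class lives in the HTMLParser module).

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False   # True while we are inside the title element
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Text outside the title element is ignored.
        if self.in_title:
            self.title += data


DOCUMENT = """<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<p>Hi there, all of you earthlings.</p>
</body>
</html>"""

parser = TitleParser()
parser.feed(DOCUMENT)
parser.close()
print(parser.title)
```

The parser takes care of recognizing the tags; deciding what a title means to the application is still left to our own handler methods.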
One interesng aspect to these formats is that it's possible to embed references to validaon
rules as well as the actual document structure. This is a nice benet in that we're able to rely
on the parser to perform markup validaon for us. This makes our job much easier as it's
possible to trust that the input structure is valid.
Meaning through structured formats
Text data that falls into this category includes things such as configuration files, marker
delimited data, e-mail message text, and JavaScript Object Notation web data. Content
within this second category does not contain explicit markup the way XML and HTML do,
but the structure and formatting is required as it conveys meaning and information about
the text to the parsing application. For example, consider the format of a Windows INI file
or a Linux system's /etc/hosts file. There are no tags, but the column on the left clearly
means something other than the column on the right.
Python provides a collection of modules and libraries intended to help us handle popular
formats from this category. We'll look at Python's built-in text services in detail when we get
to Chapter 4, Text Processing Using the Standard Library.
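To make the "columns carry meaning" idea concrete, here is a minimal sketch that reads hosts-style text, treating the left column as an address and everything to its right as names for that address. The parse_hosts function and the sample data are hypothetical, invented for this illustration; a real /etc/hosts file may contain cases this sketch does not consider.

```python
def parse_hosts(text):
    """Map each hostname or alias to the address in the left column."""
    table = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            table[name] = address
    return table


# Hypothetical sample data in the style of /etc/hosts.
SAMPLE = """\
# address     names
127.0.0.1     localhost
192.0.2.10    build.example.com build   # alias on the same line
"""

hosts = parse_hosts(SAMPLE)
print(hosts['build'])
```

No tags are involved; position alone tells the program which field is which.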
Understanding freeform content
This category contains data that does not fall into the previous two groupings. This describes
e-mail message content, letters, book copy, and other unstructured character-based content.
However, this is where we'll largely have to look at building our own processing components.
There are external packages available to us if we wish to perform common functions. Some
examples include full text searching and more advanced natural language processing.
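As one small example of a home-grown processing component for freeform text, the sketch below counts how often each word appears. The word_counts function and its simple definition of a word (runs of letters and apostrophes) are assumptions made for this illustration, not an approach prescribed by the book.

```python
import re
from collections import defaultdict


def word_counts(text):
    """Return a dictionary mapping each lowercased word to its frequency."""
    counts = defaultdict(int)
    # A "word" here is any run of ASCII letters or apostrophes.
    for word in re.findall(r"[a-z']+", text.lower()):
        counts[word] += 1
    return dict(counts)


SAMPLE = "Take us to your leader. Your leader, please."
counts = word_counts(SAMPLE)
print(counts['leader'])
```

Even this tiny component embeds decisions (case folding, what counts as a word) that a structured format would have made for us.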
Ensuring you have Python installed
Our rst order of business is to ensure that you have Python installed. You'll need it in order
to complete most of the examples in this book. We'll be working with Python 2.6 and we
assume that you're using that same version. If there are any drasc dierences in earlier
releases, we'll make a note of them as we go along. All of the examples should sll funcon
properly with Python 2.4 and later versions.
Geng Started
[ 10 ]
If you don't have Python installed, you can download the latest 2.X version from
http://www.python.org. Most Linux distributions, as well as Mac OS, usually have a version of
Python preinstalled.
At the time of this writing, Python 2.6 was the latest version available, while 2.7 was in an
alpha state.
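If you would like a script to confirm the interpreter version for itself, something along these lines may help. It is only a sketch: the meets_minimum helper is a name invented here, and the (2, 6) default simply mirrors the version this book targets.

```python
import sys


def meets_minimum(version_info, minimum=(2, 6)):
    """Return True when the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_minimum(sys.version_info):
    sys.stderr.write('Python 2.6 or later is required\n')
    sys.exit(1)

print('Running Python %d.%d' % tuple(sys.version_info[:2]))
```

Comparing tuples keeps the check correct for versions such as 2.10 or 3.x, where comparing version strings would not be.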
Providing support for Python 3
The examples in this book are written for Python 2. However, wherever possible, we will
provide code that has already been ported to Python 3. You can find the Python 3 code in
the Python3 directories in the code bundle available on the Packt Publishing FTP site.
Unfortunately, we can't promise that all of the third-party libraries that we'll use will support
Python 3. The Python community is working hard to port popular modules to version 3.0.
However, as the versions are incompatible, there is a lot of work remaining. In situations
where we cannot provide example code, we'll note this.
Implementing a simple cipher
Let's get going early here and implement our first script to get a feel for what's in store.
A Caesar Cipher is a simple form of cryptography in which each letter of the alphabet is shifted
down by a number of letters. Such ciphers are generally of no cryptographic use when applied alone,
but they do have some valid applications when paired with more advanced techniques.
The preceding diagram depicts a cipher with an offset of three. Any X found in the source
data would simply become an A in the output data. Likewise, any A found in the input data
would become a D.
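The shift described above can be sketched in a few lines. This is only a minimal illustration, not the script we build in this chapter: the function name caesar_shift and its offset parameter are our own, and it handles uppercase A to Z only.

```python
import string

def caesar_shift(text, offset=3):
    """Shift each uppercase letter `offset` places, wrapping past Z."""
    alphabet = string.ascii_uppercase
    shifted = []
    for char in text:
        if char in alphabet:
            # Modulo 26 wraps X, Y, and Z back around to A, B, and C.
            position = (alphabet.index(char) + offset) % 26
            shifted.append(alphabet[position])
        else:
            # Anything that is not an uppercase letter passes through.
            shifted.append(char)
    return ''.join(shifted)
```

With the default offset of three, an X becomes an A and an A becomes a D. Passing the negated offset undoes the shift.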
Time for action – implementing a ROT13 encoder
The most popular implementation of this system is ROT13. As its name suggests, ROT13
shifts (or rotates) each letter by 13 spaces to produce an encrypted result. As the English
alphabet has 26 letters, we simply run it a second time on the encrypted text in order to get
back to our original result.
Let's implement a simple version of that algorithm.
1. Start your favorite text editor and create a new Python source file. Save it
as rot13.py.
2. Enter the following code exactly as you see it below and save the file.
import sys
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
    )
)

def rotate13_letter(letter):
    """
    Return the 13-char rotation of a letter.
    """
    do_upper = False
    if letter.isupper():
        do_upper = True

    letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    else:
        letter = CHAR_MAP[letter]
        if do_upper:
            letter = letter.upper()

    return letter

if __name__ == '__main__':
    for char in sys.argv[1]:
        sys.stdout.write(rotate13_letter(char))
    sys.stdout.write('\n')
Geng Started
[ 12 ]
3. Now, from a command line, execute the script as follows. If you've entered all of the
code correctly, you should see the same output.
$ python rot13.py 'We are the knights who say, nee!'
Jr ner gur xavtugf jub fnl, arr!
4. Run the script a second time, using the output of the first run as the new input
string. If everything was entered correctly, the original text should be printed to
the console.
$ python rot13.py 'Jr ner gur xavtugf jub fnl, arr!'
We are the knights who say, nee!
What just happened?
We implemented a simple text-oriented cipher using a collection of Python's string handling
features. We were able to see it put to use for both encoding and decoding source text.
We saw a lot of stuff in this little example, so you should have a good feel for what can be
accomplished using the standard Python string object.
Following our initial module imports, we defined a dictionary named CHAR_MAP, which
gives us a nice and simple way to shift our letters by the required 13 places. Each dictionary
key is a source letter, and its value is the target letter. We also took advantage of string
slicing here. We'll look at slicing a bit more in later chapters, but it's a convenient way for
us to extract a substring from an existing string object.
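To make the slicing and the dictionary construction concrete, here is the CHAR_MAP expression taken apart step by step. The intermediate names lower, rotated, and char_map are ours and exist only for this illustration.

```python
import string

lower = string.ascii_lowercase

# Slicing: [13:26] is 'n' through 'z', and [0:13] is 'a' through 'm'.
rotated = lower[13:26] + lower[0:13]

# zip pairs each source letter with its target letter, and dict turns
# those pairs into a lookup table keyed by the source letter.
char_map = dict(zip(lower, rotated))
```

Looking up char_map['a'] gives 'n', and looking up char_map['n'] gives 'a' again, which is why a second pass decodes the text.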
In our translation function rotate13_letter, we checked whether our input character
was uppercase and saved the result as a Boolean flag. We then forced our
input to lowercase for the translation work. As ROT13 operates on letters alone, we only
performed a rotation if our input character was a letter of the Latin alphabet. We allowed
other values to simply pass through. We could have just as easily forced our string to a pure
uppercased value.
The last thing we do in our function is restore the letter to its proper case, if necessary. This
should familiarize you with upper- and lowercasing of Python ASCII strings.
We're able to change the case of an entire string using this same method; it's not limited to
single characters.
>>> name = 'Ryan Miller'
>>> name.upper()
'RYAN MILLER'
>>> "PLEASE DO NOT SHOUT".lower()
'please do not shout'
>>>
It's worth pointing out here that a single character string is still a string.
There is not a char type, which you may be familiar with if you're coming
from a different language such as C or C++. However, it is possible to
translate between characters and their ASCII codes using the ord and chr
built-in functions and a string with a length of one.
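As a small sketch of ord and chr at work, the following function rotates a lowercase letter using arithmetic on its ASCII code instead of a lookup table. The name rotate_with_codes is hypothetical, and the function assumes it is given a string of length one.

```python
def rotate_with_codes(letter, offset=13):
    """Rotate one lowercase ASCII letter using ord and chr."""
    if not ('a' <= letter <= 'z'):
        return letter
    # ord maps the character to its integer code; chr maps a code back.
    zero_based = ord(letter) - ord('a')
    return chr((zero_based + offset) % 26 + ord('a'))
```

This is an alternative to the CHAR_MAP dictionary used in rot13.py, not a replacement for it. The dictionary approach is easier to read and makes the pass-through case explicit.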
Noce how we were able to loop through a string directly using the Python for syntax.
A string object is a standard Python iterable, and we can walk through them detailed as
follows. In pracce, however, this isn't something you'll normally do. In most cases, it makes
sense to rely on exisng libraries.
$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> for char in "Foo":
...     print char
...
F
o
o
>>>
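As one illustration of relying on an existing library rather than looping by hand, the standard library ships a rot_13 codec that can be reached through codecs.encode. The wrapper name rot13 is ours. The sketch avoids the print statement so that it runs unchanged on Python 2.6 and Python 3.

```python
import codecs

def rot13(text):
    """Apply ROT13 using the standard library's rot_13 codec."""
    return codecs.encode(text, 'rot_13')

message = 'We are the knights who say, nee!'
encoded = rot13(message)
decoded = rot13(encoded)  # applying ROT13 twice restores the input
```

Writing the loop ourselves is still worthwhile here, because it exercises the string methods this chapter is about.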
Geng Started
[ 14 ]
Finally, you should note that we ended our script with an if statement such as the following:
if __name__ == '__main__':
Python modules all contain an internal __name__ variable that corresponds to the name of
the module. If a module is executed directly from the command line, as this script is, its
__name__ value is set to __main__. The code under this if statement therefore runs only if
we've executed the script directly. It will not run if we import this code from a different
script. You can import the module from the interactive interpreter and see for yourself.
$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rot13
>>> dir(rot13)
['CHAR_MAP', '__builtins__', '__doc__', '__file__', '__name__',
'__package__', 'rotate13_letter', 'string', 'sys']
>>>
Noce how we were able to import our module and see all of the methods and aributes
inside of it, but the driver code did not execute. This is a convenon we'll use throughout the
book in order to help achieve maximum reusability.
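The convention can be reduced to a skeleton like the one below. The function shout is a made-up placeholder for whatever reusable code a module offers. Only the guarded block at the bottom is tied to command-line use.

```python
import sys

def shout(text):
    """Reusable code: available to anything that imports this module."""
    return text.upper() + '!'

if __name__ == '__main__':
    # Driver code: runs only when the file is executed directly,
    # never when the module is imported from elsewhere.
    sys.stdout.write(shout(' '.join(sys.argv[1:])) + '\n')
```

Importing this file gives the caller shout without triggering any output, which is the behavior we saw with rot13 in the interpreter session.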
Have a go hero – more translation work
Each Python string instance contains a collection of methods that operate on one or more
characters. You can easily display all of the available methods and attributes by using the dir
function. For example, enter the following command into a Python window. Python responds
by printing a list of all methods on a string object.
>>> dir("content")
['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
'__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__',
'__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__',
'__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'_formatter_field_name_split', '_formatter_parser', 'capitalize',
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find',
'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace',
'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition',
'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']
>>>
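The full listing is dominated by double-underscore names. A short helper can trim it down to the methods you are likely to call directly. The name public_names is our own, and the exact list it returns depends on your Python version.

```python
def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]

string_methods = public_names("content")
```

The resulting list still contains upper, lower, and isspace, but none of the special methods such as __add__.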
Much like the isupper and islower methods discussed previously, we also have an
isspace method. Using this method, in combination with your newfound knowledge of
Python strings, update the function we defined previously to translate spaces to underscores
and underscores to spaces.
Processing structured markup with a filter
Our ROT13 application works great for simple one-line strings that we can fit on the
command line. However, it wouldn't work very well if we wanted to encode an entire
file, such as the HTML document we took a look at earlier. In order to support larger text
documents, we'll need to change the way we accept input. We'll redesign our application to
work as a filter.
A filter is an application that reads data from its standard input file descriptor and writes to
its standard output file descriptor. This allows users to create command pipelines in which
multiple utilities are strung together. If you've ever typed a command such as
cat /etc/hosts | grep mydomain.com, you've set up a pipeline.
In many circumstances, data is fed into the pipeline via the keyboard and completes its
journey when a processed result is displayed on the screen.
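The shape of a filter can be sketched independently of ROT13. In the following example, run_filter and its parameters are hypothetical names, and io.StringIO objects stand in for the real pipes so that the sketch can be tried without a shell.

```python
import io
import sys

def run_filter(transform, source=None, sink=None):
    """Read lines from source and write transform(line) to sink."""
    # Resolve the defaults at call time so redirected streams are honored.
    source = sys.stdin if source is None else source
    sink = sys.stdout if sink is None else sink
    for line in source:
        sink.write(transform(line))

# Stand-ins for standard input and standard output.
fake_input = io.StringIO(u'first line\nsecond line\n')
fake_output = io.StringIO()
run_filter(lambda line: line.upper(), fake_input, fake_output)
```

Called with only a transform, run_filter falls back to sys.stdin and sys.stdout, which is what lets a script built this way sit in the middle of a pipeline.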
Time for action – processing as a filter
Let's make the changes required to allow our simple ROT13 processor to work as a
command-line filter. This will allow us to process larger files.
1. Create a new source file and enter the following code. When complete, save the file
as rot13-b.py.
import sys
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
    )
)

def rotate13_letter(letter):
    """
    Return the 13-char rotation of a letter.
    """
    do_upper = False
    if letter.isupper():
        do_upper = True

    letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    else:
        letter = CHAR_MAP[letter]
        if do_upper:
            letter = letter.upper()

    return letter

if __name__ == '__main__':
    for line in sys.stdin:
        for char in line:
            sys.stdout.write(rotate13_letter(char))
2. Enter the following HTML data into a new text file and save it as
sample_page.html. We'll use this as example input to our updated script, rot13-b.py.
<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<p>
Hi there, all of you earthlings.
</p>
<p>
Take us to your leader.
</p>
</body>
</html>
3. Now, run our rot13-b.py example and provide our HTML document as standard
input data. The exact method used will vary with your operating system. If you've
entered the code successfully, you should simply see a new prompt.
$ cat sample_page.html | python rot13-b.py > rot13.html
$
4. The contents of rot13.html should be as follows. If that's not the case, double
back and make sure everything is correct.
<ugzy>
<urnq>
<gvgyr>Uryyb, Jbeyq!</gvgyr>
</urnq>
<obql>
<c>
Uv gurer, nyy bs lbh rneguyvatf.
</c>
<c>
Gnxr hf gb lbhe yrnqre.
</c>
</obql>
</ugzy>
5. Open the translated HTML file using your web browser.
What just happened?
We updated our rot13.py script to read standard input data rather than rely on a
command-line option. Doing this provides optimal configurability going forward and lets us
feed input of varying length from a collection of different sources. We did this by looping on
each line available on the sys.stdin file stream and calling our translation function on each
character. We wrote each character returned by that function to the sys.stdout stream.
Next, we ran our updated script via the command line, using sample_page.html as input.
As expected, the encoded version was written to rot13.html.
As you can see, there is a major problem with our output. We should have a proper page
title and our content should be broken down into different paragraphs.
Geng Started
[ 18 ]
Remember, structured markup text is sprinkled with
tag elements that define its structure and organization.
In this example, we not only translated the text content, we also translated the markup
tags, rendering them meaningless. A web browser would not be able to display this data
properly. We'll need to update our processor code to ignore the tags. We'll do just that
in the next section.
Time for action – skipping over markup tags
In order to preserve the proper, structured HTML that tags provide, we need to ensure we
don't include them in our rotation. To do this, we'll keep track of whether or not our input
stream is currently within a tag. If it is, we won't translate our letters.
1. Once again, create a new Python source file and enter the following code. When
you're finished, save the file as rot13-c.py.
import sys
from optparse import OptionParser
import string

CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
    )
)

class RotateStream(object):
    """
    General purpose ROT13 Translator

    A ROT13 translator smart enough to skip
    Markup tags if that's what we want.
    """
    MARKUP_START = '<'
    MARKUP_END = '>'

    def __init__(self, skip_tags):
        self.skip_tags = skip_tags

    def rotate13_letter(self, letter):
        """
        Return the 13-char rotation of a letter.
        """
        do_upper = False
        if letter.isupper():
            do_upper = True

        letter = letter.lower()
        if letter not in CHAR_MAP:
            return letter
        else:
            letter = CHAR_MAP[letter]
            if do_upper:
                letter = letter.upper()

        return letter

    def rotate_from_file(self, handle):
        """
        Rotate from a file handle.

        Takes a file-like object and translates
        text from it into ROT13 text.
        """
        state_markup = False
        for line in handle:
            for char in line:
                if self.skip_tags:
                    if state_markup:
                        # here we're looking for a closing
                        # '>'
                        if char == self.MARKUP_END:
                            state_markup = False
                    else:
                        # Not in a markup state, rotate
                        # unless we're starting a new
                        # tag
                        if char == self.MARKUP_START:
                            state_markup = True
                        else:
                            char = self.rotate13_letter(char)
                else:
                    char = self.rotate13_letter(char)
                # Make this a generator
                yield char

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-t', '--tags', dest="tags",
        help="Ignore Markup Tags", default=False,
        action="store_true")
    options, args = parser.parse_args()
    rotator = RotateStream(options.tags)
    for letter in rotator.rotate_from_file(sys.stdin):
        sys.stdout.write(letter)
2. Run the same sample_page.html file that we created for the last example through the
new processor. This time, be sure to pass a -t command-line option.
$ cat sample_page.html | python rot13-c.py -t > rot13.html
$
3. If everything was entered correctly, the contents of rot13.html should be exactly
as follows.
<html>
<head>
<title>Uryyb, Jbeyq!</title>
</head>
<body>
<p>
Uv gurer, nyy bs lbh rneguyvatf.
</p>
<p>
Gnxr hf gb lbhe yrnqre.
</p>
</body>
</html>
4. Open the translated file in your web browser.
What just happened?
That was a pretty complex example, so let's step through it. We did quite a bit. First, we
moved away from a simple rotate13_letter function and wrapped almost all of our
functionality in a Python class named RotateStream. Doing this helps us ensure that our
code will be reusable down the road.
We define an __init__ method within the class that accepts a single parameter named
skip_tags. The value of this parameter is stored as an attribute on self so we can access
it later from within other methods. If this is a True value, then our parser class will know
that it's not supposed to translate markup tags.
Next, you'll see our familiar rotate13_letter method (it's a method now as it's defined
within a class). The only real difference here is that in addition to the letter parameter,
we're also requiring the standard self parameter.
Finally, we have our rotate_from_file method. This is where the bulk of our new
functionality was added. Like before, we're iterating through all of the characters available
on a file stream. This time, however, the file stream is passed in as a handle parameter.
This means that we could have just as easily passed in an open file handle rather than the
standard input file handle.
Inside the method, we implement a simple state machine, with two possible states. Our
current state is saved in the state_markup Boolean variable. We only rely on it if the value
of self.skip_tags set in the __init__ method is True.
1. If state_markup is True, then we're currently within the context of a markup tag
and we're looking for the > character. When it's found, we'll change state_markup
to False. As we're inside a tag, we'll never ask our class to perform a ROT13
operation.
2. If state_markup is False, then we're parsing standard text. If we come across
the < character, then we're entering a new markup tag. We set the value of
state_markup to True. Otherwise, as we're not in a tag, we'll call rotate13_letter to
perform our ROT13 operation.
You should also notice some unfamiliar code at the end of the source listing. We've taken
advantage of the OptionParser class, which is part of the standard library. We've added
a single option that will allow us to selectively enable our markup bypass functionality. The
value of this option is passed into RotateStream's __init__ method.
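To see what OptionParser hands back, it helps to parse an explicit argument list instead of the real command line. The wrapper function parse_cli below is our own addition. The add_option call inside it mirrors the one in rot13-c.py.

```python
from optparse import OptionParser

def parse_cli(argv):
    """Parse a list of arguments the way the script's driver code does."""
    parser = OptionParser()
    parser.add_option('-t', '--tags', dest='tags',
                      help='Ignore Markup Tags', default=False,
                      action='store_true')
    # parse_args returns the option values and any leftover positional
    # arguments. Given no list, it reads sys.argv[1:] instead.
    return parser.parse_args(argv)

options, args = parse_cli(['-t', 'leftover'])
defaults, _ = parse_cli([])
```

Here options.tags is True because -t was supplied, args holds the leftover value, and defaults.tags falls back to False.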
The final two lines of the listing show how we pass the sys.stdin file handle to
rotate_from_file and iterate over the results. The rotate_from_file method has been defined
as a generator function. A generator function yields values as it processes rather than
waiting until completion. This approach avoids storing all of the results in memory and lowers
overall application memory consumption.
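The generator behavior can be observed with a stripped-down stand-in for rotate_from_file. In this sketch, rotate_chars is a hypothetical function that leans on the standard library's rot_13 codec rather than our CHAR_MAP, and io.StringIO plays the part of sys.stdin or an open file.

```python
import codecs
import io

def rotate_chars(handle):
    """Yield ROT13-translated characters from any iterable of lines."""
    for line in handle:
        for char in line:
            # Each yield hands one character to the caller and pauses here.
            yield codecs.encode(char, 'rot_13')

stream = io.StringIO(u'Hello\nWorld\n')
chars = rotate_chars(stream)
first = next(chars)      # only the first character has been translated
rest = ''.join(chars)    # draining the generator translates the remainder
```

Because the function accepts any iterable of lines, the same code works for standard input, a file on disk, or an in-memory buffer as shown here.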
Geng Started
[ 22 ]
State machines
A state machine is an algorithm that keeps track of an application's internal state. Each
state has a set of available transitions and functionality associated with it. In this example,
we were either inside or outside of a tag. Application behavior changed depending on
our current state. For example, if we were inside then we could transition to outside. The
opposite also holds true.
The state machine concept is advanced and won't be covered in detail. However, it is a
major method used when implementing text-processing machinery. For example, regular
expression engines are generally built on variations of this model. For more information
on state machine implementation, see the Wikipedia article available at
http://en.wikipedia.org/wiki/Finite-state_machine.
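The two states and their transitions can also be written down as data. The sketch below is a table-driven variant of the same idea, not the code from rot13-c.py. The names OUTSIDE, INSIDE, TRANSITIONS, and text_outside_tags are ours, and instead of rotating letters it simply keeps the text found outside of tags.

```python
OUTSIDE, INSIDE = 'outside', 'inside'

# (current state, input character) -> next state. Any pair that is not
# listed leaves the state unchanged.
TRANSITIONS = {
    (OUTSIDE, '<'): INSIDE,
    (INSIDE, '>'): OUTSIDE,
}

def text_outside_tags(markup):
    """Return only the characters that fall outside of <...> tags."""
    state = OUTSIDE
    kept = []
    for char in markup:
        next_state = TRANSITIONS.get((state, char), state)
        # Keep a character only if we are outside a tag both before and
        # after reading it, which excludes the '<' and '>' delimiters.
        if state == OUTSIDE and next_state == OUTSIDE:
            kept.append(char)
        state = next_state
    return ''.join(kept)
```

Keeping the transitions in a dictionary makes it easy to see every state change at a glance, at the cost of being slightly less direct than the nested if statements used earlier.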
Pop Quiz – ROT 13 processing
1. We define MARKUP_START and MARKUP_END class constants within our
RotateStream class. How might our state machine be affected if these
values were swapped?
2. Is it possible to use ROT13 on a string containing characters found outside of the
English alphabet?
3. What would happen if we embedded > or < signs within our text content or tag
values?
4. In our example, we read our input a line at a time. Can you think of a way to make
this more efficient?
Have a go hero – support multiple input channels
We've briefly covered reading data via standard input as well as processing simple
command-line options. Your job is to integrate the two so that your application will
simply translate a command-line value if one is present before defaulting to standard input.
If you're able to implement this, try extending the option handling code so that your input
string can be passed in to the rotation application using a command-line option.
$ python rot13-c.py -s 'myinputstring'
zlvachgfgevat
$
Supporting third-party modules
Now that we've got our first example out of the way, we're going to take a little bit of a
detour and learn how to obtain and install third-party modules. This is important, as we'll
install a few throughout the remainder of the book.
The Python community maintains a centralized package repository, termed the Python
Package Index (or PyPI). It is available on the web at http://pypi.python.org. From
there, it is possible to download packages as compressed source distributions, or in some
cases, pre-packaged Python components. PyPI is also a rich source of information. It's a
great place to learn about available third-party applications. Links are provided to individual
package documentation if it's not included directly in the package's PyPI page.
Packaging in a nutshell
There are at least two different popular methods of packaging and deploying Python
packages. The distutils package is part of the standard distribution and provides a
mechanism for building and installing Python software. Packages that take advantage of the
distutils system are downloaded as a source distribution and built and installed by a local
user. They are installed by simply creating an additional directory structure within the system
Python directory that matches the package name.
In an effort to make packages more accessible and self-contained, the concept of the
Python Egg was introduced. An egg file is simply a ZIP archive of a package. When an egg is
installed, the ZIP file itself is placed on the Python path, rather than a subdirectory.
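The idea of placing a ZIP file itself on the Python path can be tried out with nothing but the standard library, because Python is able to import modules straight from ZIP archives. The sketch below builds a throwaway archive. The names demo_archive.zip and zipped_demo are made up for this illustration, and a real egg also carries metadata that is omitted here.

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway archive containing one tiny module.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, 'demo_archive.zip')
bundle = zipfile.ZipFile(archive, 'w')
bundle.writestr('zipped_demo.py', 'GREETING = "hello from a zip"\n')
bundle.close()

# Placing the archive itself on the path makes its contents importable.
sys.path.insert(0, archive)
import zipped_demo
```

After the import, zipped_demo.GREETING is available even though no zipped_demo.py file exists on disk outside of the archive.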
Time for action – installing SetupTools
Egg files have largely become the de facto standard in Python packaging. In order to install,
develop, and build egg files, it is necessary to install a third-party tool kit. The most popular
is SetupTools, and this is what we'll be working with throughout this book. The installation
process is fairly easy to complete and is rather self-contained. Installing SetupTools gives us
access to the easy_install command, which automates the download and installation of
packages that have been registered with PyPI.
1. Download the installation script, which is available at
http://peak.telecommunity.com/dist/ez_setup.py. This same script will be
used for all versions of Python.
Geng Started
[ 24 ]
2. As an administrative user, run the ez_setup.py script from the command line. The
SetupTools installation process will complete. If you've executed the script with the
proper rights, you should see output similar to the following:
# python ez_setup.py
Downloading http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11-py2.6.egg
Processing setuptools-0.6c11-py2.6.egg
creating /usr/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg
Extracting setuptools-0.6c11-py2.6.egg to /usr/lib/python2.6/site-packages
Adding setuptools 0.6c11 to easy-install.pth file
Installing easy_install script to /usr/bin
Installing easy_install-2.6 script to /usr/bin
Installed /usr/lib/python2.6/site-packages/setuptools-0.6c11-py2.6.egg
Processing dependencies for setuptools==0.6c11
Finished processing dependencies for setuptools==0.6c11
#
What just happened?
We downloaded the SetupTools installation script and executed it as an administrative
user. By doing so, our system Python environment was configured so that we can install egg
files in the future via the SetupTools easy_install system.
SetupTools does not currently work with Python 3.0. There is, however, an
alternative available via the Distribute project. Distribute is intended to be a
drop-in replacement for SetupTools and will work with either major Python
version. For more information, or to download the installer, visit
http://pypi.python.org/pypi/distribute.
Running a virtual environment
Now that we have SetupTools installed, we can install third-party packages by simply
running the easy_install command. This is nice because package dependencies will
automatically be downloaded and installed so we no longer have to do this manually.
However, there's still one piece missing. Even though we can install these packages easily,
we still need to retain administrative privileges to do so. Additionally, all of the packages
that we chose to install will be placed in the system's Python library directory, which has
the potential to cause inconsistencies and problems down the road. As you've probably
guessed, there's a utility to address that.
Python 2.6 introduces the concept of a local user package directory. This is
simply an additional location found within your user home directory that Python
searches for installed packages. It is possible to install eggs into this location via
easy_install with a --user command-line switch. For more information,
see http://www.python.org/dev/peps/pep-0370/.
Configuring virtualenv
The virtualenv package, distributed as a Python egg, allows us to create an isolated
Python environment anywhere we wish. The environment comes complete with a bin
directory containing a Python binary, its own installation of SetupTools, and an
instance-specific library directory. In short, it creates a location for us to install and configure Python
without interfering with the system installation.
Time for action – configuring a virtual environment
Here, we'll enable the virtualenv package, which will illustrate how to install packages
from the PyPI site. We'll also configure our first environment, which we'll use throughout the
book for the rest of our examples and code illustrations.
1. As a user with administrative privileges, install virtualenv from the system
command line by running easy_install virtualenv. If you have the correct
permissions, your output should be similar to the following.
Searching for virtualenv
Reading http://pypi.python.org/simple/virtualenv/
Reading http://virtualenv.openplans.org
Best match: virtualenv 1.4.5
Downloading http://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.4.5.tar.gz#md5=d3c621dd9797789fef78442e336df63e
Processing virtualenv-1.4.5.tar.gz
Running virtualenv-1.4.5/setup.py -q bdist_egg --dist-dir /tmp/easy_install-rJXhVC/virtualenv-1.4.5/egg-dist-tmp-AvWcd1
warning: no previously-included files matching '*.*' found under directory 'docs/_templates'
Adding virtualenv 1.4.5 to easy-install.pth file
Installing virtualenv script to /usr/bin
Installed /usr/lib/python2.6/site-packages/virtualenv-1.4.5-py2.6.egg
Processing dependencies for virtualenv
Finished processing dependencies for virtualenv
2. Drop administrative privileges as we won't need them any longer. Ensure that you're
within your home directory and create a new virtual instance by running:
$ virtualenv --no-site-packages text_processing
3. Step into the newly created text_processing directory and activate the
virtual environment. Windows users will do this by simply running the
Scripts\activate application, while Linux users must instead source the script using the
shell's dot operator.
$ . bin/activate
4. If you've done this correctly, you should now see your command-line prompt change
to include the string (text_processing). This serves as a visual cue to remind you
that you're operating within a specific virtual environment.
(text_processing)$ pwd
/home/jmcneil/text_processing
(text_processing)$ which python
/home/jmcneil/text_processing/bin/python
(text_processing)$
5. Finally, deactivate the environment by running the deactivate command. This will
return your shell environment to default. Note that once you've done this, you're
once again working with the system's Python install.
(text_processing)$ deactivate
$ which python
/usr/bin/python
$
If you're running Windows, by default python.exe and easy_install.exe
are not placed on your system %PATH%. You'll need to manually configure
your %PATH% variable to include C:\Python26\ and C:\Python26\Scripts.
Additional scripts added by easy_install will also be placed in
this directory, so it's worth setting up your %PATH% variable.
What just happened?
We installed the virtualenv package using the easy_install command directly from
the Python Package Index. This is the method we'll use for installing any third-party packages
going forward. You should now be familiar with the easy_install process. Also, note that
for the remainder of the book, we'll operate from within this text_processing virtual
environment. Additional packages are installed using this same technique from within the
confines of our environment.
After the install process was completed, we configured and activated our first virtual
environment. You saw how to create a new instance via the virtualenv command and
you also learned how to subsequently activate it using the bin/activate script. Finally, we
showed you how to deactivate your environment and return to your system's default state.
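Besides the changed prompt, a script can check for itself whether it is running inside an isolated environment. The following sketch rests on two assumptions: that classic virtualenv releases set a sys.real_prefix attribute, and that the venv module in later Python 3 releases sets sys.base_prefix. The function name in_virtual_environment is our own.

```python
import sys

def in_virtual_environment():
    """Return True if the interpreter appears to run in an isolated env."""
    # Assumption: classic virtualenv stores the system prefix here.
    if hasattr(sys, 'real_prefix'):
        return True
    # Assumption: newer interpreters expose base_prefix, which differs
    # from prefix only inside an environment.
    return getattr(sys, 'base_prefix', sys.prefix) != sys.prefix
```

Running this before and after activating text_processing should give different answers, in the same way that which python did in the steps above.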
Have a go hero – install your own environment
Now that you know how to set up your own isolated Python environment, you're encouraged
to create a second one and install a collection of third-party utilities in order to get the hang of
the installation process.
1. Create a new environment and give it a name of your own choice.
2. Point your browser to http://pypi.python.org and select one or more
packages that you find interesting. Install them via the easy_install command
within your new virtual environment.
Note that you should not require administrative privileges to do this. If you receive an error
about permissions, make certain you've remembered to activate your new environment.
Deactivate when complete. Some of the packages available for install may require a correctly
configured C-language compiler.
Geng Started
[ 28 ]
Where to get help?
The Python community is a friendly bunch of people. There is a wide range of online
resources you can take advantage of if you find yourself stuck. Let's take a quick look at
what's out there.
Home site: The Python website, available at http://www.python.org.
Specifically, the documentation section. The standard library reference is a
wonderful asset and should be something you keep at your fingertips. This site also
contains a wonderful tutorial as well as a complete language specification.
Member groups: The comp.lang.python newsgroup. Available via Google
groups as well as an e-mail gateway, this provides a general-purpose location to
ask Python-related questions. A very smart bunch of developers patrol this group;
you're certain to get a quality answer.
Forums: Stack Overflow, available at http://www.stackoverflow.com.
Stack Overflow is a website dedicated to developers. You're welcome to ask your
questions, as well as answer others' inquiries, if you're up to it!
Mailing list: If you have a beginner-level question, there is a Python tutor mailing
list available from the Python.org site. This is a great place to ask your beginner
questions no matter how basic they might be!
Centralized package repository: The Python Package Index at
http://pypi.python.org. Chances are someone has already had to do exactly what it is
you're doing.
If all else fails, you're more than welcome to contact the author via e-mail to
questions@packtpub.com. Every effort will be made to answer your question, or point you to a freely
available resource where you can find your resolution.
Summary
This chapter introduced you to the different categories of text that we'll cover in greater
detail throughout the book and provided you with a little bit of information as to how we'll
manage our packaging going forward.
We performed a few low-level text translations by implementing a ROT13 encoder and
highlighted the differences between freeform and structured markup. We'll examine these
categories in much greater detail as we move on. The goal of that exercise was to learn some
byte-level transformation techniques.
Finally, we touched on a couple of different ways to read data into our applications. In our
next chapter, we'll spend a great deal of time getting to know the IO system and learning
how you can extract text from a collection of sources.
2
Working with the IO System
Now that we've covered some basic text-processing methods and introduced
you to some core Python best practices, it's time we take a look at how to
actually get to your data. Reading some example text from the command line is
an easy process, but getting to real world data can be more difficult. However,
it's important to understand how to do so.
Python provides all of the standard file IO mechanisms you would expect from
any full-featured programming language. Additionally, there is a wide range of
standard library modules included that enable you to access data via various
network services such as HTTP, HTTPS, and FTP.
In this chapter, we'll focus on those methods and systems. We'll look at standard file
functionality, the extended abilities within the standard library, and how these components
can be used interchangeably in many situations.
As part of our introduction to file input and output, we'll also cover some common
exception-handling techniques that are especially helpful when dealing with external data.
In this chapter, we shall:
Look at Python's file IO and examine the objects created by the open factory function
Understand text-based and raw IO, and how they differ
Examine the urllib and urllib2 modules and detail file access via HTTP and FTP streams
Handle file IO using Context Managers
Learn about file-like objects and methods to use objects interchangeably for maximum reuse
Introduce exceptions with a specific focus on idioms specific to file IO and how to
deal with certain error conditions
Introduce a web server logfile processor, which we'll expand upon throughout
future chapters
Examine ways to deal with multiple files
We'll also spend some time looking at changes to the IO subsystem in future
versions of Python
Parsing web server logs
We're going to introduce a web server log parser in this section that we'll build upon
throughout the remainder of the book. We're going to start by assuming the logfile is in the
standard Apache combined format.
For example, the following line represents an HTTP request for the root directory of a
website. The request is successful, as indicated by the 200 series response code.
In order, the above line contains the remote IP address of the client, the remote identd
name, the authenticated username, the server's timestamp, the first line of the request, the
HTTP response code, the size of the file as returned by the server, the referring page, and
finally the User Agent, or the browser software running on the end user's computer.
The dashes in the previous screenshot indicate a missing value. This doesn't necessarily
correspond to an error condition. For example, if the page is not password-protected then
there will be no remote user. The dash is a common condition we'll need to handle.
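As a rough sketch of the field order described above, the following snippet splits a combined-format line on whitespace. The line is adapted from the mock data used later in this chapter (the User Agent is shortened), and the variable names are illustrative only. It avoids the print statement so that it can also run under Python 3.

```python
# A sample line in Apache combined format, adapted from the mock
# data used later in this chapter.
line = ('127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /a HTTP/1.1" '
        '200 65383 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"')

parts = line.split()

remote_ip = parts[0]       # client address
identd_name = parts[1]     # '-' means no value was recorded
remote_user = parts[2]     # '-' because the page is not protected
requested = parts[6]       # path from the first line of the request
status = int(parts[8])     # HTTP response code
size = parts[9]            # bytes returned, or '-' if missing

print('%s requested %s (status %d, size %s)'
      % (remote_ip, requested, status, size))
```

Splitting on whitespace breaks the timestamp and the request into several pieces, which is why the indexes skip from 2 to 6. That is acceptable as long as only the space-free fields are needed.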
Chapter 2
[ 31 ]
For more information on web server log formats and available data points,
please see your web server documentation. Apache logs were used to write
this book; documentation for the Apache web server is available at
http://httpd.apache.org/docs/2.2/mod/mod_log_config.html
Time for action – generating transfer statistics
Now, let's start our processor. Initially, we'll build enough functionality to scan our logfile
as read via standard input and report files served over a given size. System administrators
may find utilities such as this useful when attempting to track down abusive users. It's also
generally a good idea to iteratively add functionality to an application in development.
1. First, step into the virtual environment created in Chapter 1, Getting Started and
activate it so that all of our work is isolated locally. Only the UNIX method is shown
here.
$ cd text_processing/
$ . bin/activate
2. Create an empty Python file and name it logscan.py. Enter the following code:
#!/usr/bin/python
import sys
from optparse import OptionParser

class LogProcessor(object):
    """
    Process a combined log format.

    This processor handles logfiles in a combined format,
    objects that act on the results are passed in to
    the init method as a series of methods.
    """
    def __init__(self, call_chain=None):
        """
        Setup parser.

        Save the call chain. Each time we process a log,
        we'll run the list of callbacks with the processed
        log results.
        """
        if call_chain is None:
            call_chain = []
        self._call_chain = call_chain

    def split(self, line):
        """
        Split a logfile.

        Initially, we just want size and requested file name, so
        we'll split on spaces and pull the data out.
        """
        parts = line.split()
        return {
            'size': 0 if parts[9] == '-' else int(parts[9]),
            'file_requested': parts[6]
        }

    def parse(self, handle):
        """
        Parses the logfile.

        Builds a dictionary composed of log entry values for
        each line and passes it to every callback in the chain.
        """
        for line in handle:
            fields = self.split(line)
            for func in self._call_chain:
                func(fields)

class MaxSizeHandler(object):
    """
    Check a file's size.
    """
    def __init__(self, size):
        self.size = size

    def process(self, fields):
        """
        Looks at each line individually.

        Looks at each parsed log line individually and
        performs a size calculation. If it's bigger than
        our self.size, we just print a warning.
        """
        if fields['size'] > self.size:
            print >>sys.stderr, \
                'Warning: %s exceeds %d bytes (%d)!' % \
                (fields['file_requested'], self.size,
                 fields['size'])

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-s', '--size', dest="size",
        help="Maximum File Size Allowed",
        default=0, type="int")
    opts,args = parser.parse_args()
    call_chain = []
    size_check = MaxSizeHandler(opts.size)
    call_chain.append(size_check.process)
    processor = LogProcessor(call_chain)
    processor.parse(sys.stdin)
3. Now, create a new file and name it example.log. Enter the following mock
log data. Note that each line begins with 127.0.0.1 and should be entered as such.
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /a HTTP/1.1" 200
65383 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /b HTTP/1.1" 200
22912 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /c HTTP/1.1" 200
1818212 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /d HTTP/1.1" 200
888 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /e HTTP/1.1" 200
38182121 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
SV1; .NET CLR 1.1.4322)"
4. Now run the logscan.py script by entering the following command. If all code and
data has been entered correctly, you should see the following output.
(text_processing)$ cat example.log | python logscan.py -s 1000
What just happened?
Let's go through the code and look at what's going on. We expanded on concepts from
the first chapter and introduced quite a few new elements here. It's important that you
understand this example as we'll use it as the foundation for many of our future exercises.
First, recognize what should be familiar to you. We've parsed our arguments, ensured that
our main code is only executed when our script is started directly, and we created a couple
of classes that make up our application. We also passed the open file stream to our parse
method, much like we did with our ROT13 example. Simple!
This application is largely composed of two main classes: LogProcessor and
MaxSizeHandler. We split it off like this to ensure we can expand in the future. Perhaps
we'll want to add more checks or handle logfiles in a different format. This approach ensures
that is possible.
The __init__ method of LogProcessor takes a call_chain argument, which defaults to
None. This will contain a list of functions that we'll call for each line in the logfile, passing in
the values parsed out of each line as a dictionary.
If you look further into the __init__ method, you'll see the following code:
if call_chain is None:
    call_chain = []
self._call_chain = call_chain
This may look peculiar to you. Why wouldn't we simply default call_chain to an empty list
object? The answer is actually rather complex. For now, simply understand that if we do that,
we may accidentally share the same call_chain list among all instances of our class!
If you're curious as to why using an empty list is a bad idea, have a look
at http://www.ferg.org/projects/python_gotchas.html#contents_item_6.2.
Most of the time, what you actually get is not
what you would expect and subtle bugs slip into your code.
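The pitfall can be seen in a small sketch. Both classes below are hypothetical and exist only to contrast a list default with the None default used by LogProcessor.

```python
class SharedDefault(object):
    # The default list is created once, when the function is
    # defined, and is reused on every call that omits the argument.
    def __init__(self, call_chain=[]):
        self.call_chain = call_chain


class SafeDefault(object):
    # Same idea as LogProcessor: build a fresh list per instance.
    def __init__(self, call_chain=None):
        if call_chain is None:
            call_chain = []
        self.call_chain = call_chain


a, b = SharedDefault(), SharedDefault()
a.call_chain.append('handler')
shared = b.call_chain        # b sees the item appended through a

c, d = SafeDefault(), SafeDefault()
c.call_chain.append('handler')
separate = d.call_chain      # d still has its own empty list
```

After this runs, a and b hold the very same list object, while c and d each hold their own.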
In our split method, we break our logfile line up at the space boundary. Obviously, this
doesn't work if we need some of the fields that contain spaces, but we're not that far yet.
For now, this is an acceptable approach. Note the check for the dash here. It's possible that
the web server may not report a size on each request. Consider the effect of a browser cache
where new data is not transferred over the network if it hasn't changed on the server.
The split method utilizes Python's conditional expressions, which first
appeared in version 2.5. If you're using an earlier version of Python, you'll need
to expand into a traditional if-else block.
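For comparison, here is a minimal sketch of the two forms side by side. The function names are made up for this illustration; only the size check itself comes from the split method.

```python
def parse_size(field):
    # Conditional expression form (Python 2.5 and later).
    return 0 if field == '-' else int(field)


def parse_size_expanded(field):
    # Equivalent if/else block for older interpreters.
    if field == '-':
        size = 0
    else:
        size = int(field)
    return size
```

Both functions treat a dash as a size of zero and convert anything else to an integer.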
Finally, we have our parse method. This method is responsible for translating each line
of the logfile into a useable dictionary and passing it into each method in our stored
call_chain.
Next, we have our MaxSizeHandler class. This class ought to be rather straightforward. At
initialization time, we store a maximum file size. When our process method is called as part
of the call_chain run, we simply print a warning if the current file exceeds the threshold.
The script proper should look largely familiar to you. We parse our command-line options via
the OptionParser class, but this time we introduce type translation. We create an instance
of MaxSizeHandler and add its process method to our call_chain list. Finally, that list
is used to create a new LogProcessor instance and we call its parse method.
Python methods and functions are considered to be first class objects. What
does this mean? Simply put, you can pass them around to methods, assign them
to collections, and bind them as other attributes just as if they were simple data
types such as integers, strings, and class instances. No wrapper classes required!
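A short sketch of what that means in practice follows. The Collector class and the double_size function are hypothetical stand-ins, not part of logscan.py.

```python
class Collector(object):
    """Hypothetical handler that remembers every size it is given."""
    def __init__(self):
        self.seen = []

    def process(self, fields):
        self.seen.append(fields['size'])


def double_size(fields):
    # A plain function can sit in the same list as a bound method.
    return fields['size'] * 2


collector = Collector()
call_chain = [collector.process, double_size]   # stored like any value
handler = call_chain[0]                         # bound to a new name

results = [func({'size': 10}) for func in call_chain]
handler({'size': 20})
```

The bound method keeps its reference to collector, so calling it through the list or through the handler name updates the same object.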
Using objects interchangeably
The big take-away from this example is that objects can be designed such that they're
interchangeable. The technical term for this is Polymorphism. This comes into play
throughout the chapter as we look at different methods of accessing datafiles.
Time for action – introducing a new log format
Let's take a closer look at this concept. Let's assume for a second that a colleague heard about
your nifty log-processing program and wanted to use it to parse his data. The trouble is that
he's already tried his hand at solving the problem with standard shell utilities and his import
format is slightly different. It's simply a list of file names followed by the file size in bytes.
1. Using logscan.py as a template, create a new file named logscan-b.py. The
two files should be exactly the same.
2. Add an additional class directly below LogProcessor as follows.
class ColumnLogProcessor(LogProcessor):
    def split(self, line):
        parts = line.split()
        return {
            'size': int(parts[1]),
            'file_requested': parts[0]
        }
3. Now, change the line that creates a LogProcessor object. Instead, we want it to
create a ColumnLogProcessor object.
    call_chain.append(size_check.process)
    processor = ColumnLogProcessor(call_chain)
    processor.parse(sys.stdin)
4. Create a new input file and name it example-b.log. Enter test data exactly as follows.
/1 1000
/2 96316
/3 84722
/4 81712
/5 19231
5. Finally, run the updated source code. If you entered everything correctly, your
output should be as follows.
(text_processing)$ cat example-b.log | python logscan-b.py -s 1000
What just happened?
We added support for a new log input format simply by replacing the split method of
our log processor. We did this by inheriting from LogProcessor and creating a new class,
overriding split.
There are no additional changes required to support an entirely new format. As long as your
new LogProcessor class implements the required methods and returns the proper values,
it's a piece of cake. Your LogProcessor subclass could have done something much more
elaborate, such as process each line via regular expressions or handle missing elements
gracefully.
Conversely, adding new call_chain methods is just as easy. As long as the function in the
list takes a dictionary as input, you can add new processing methods as well.
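As a sketch of that second kind of extension, the hypothetical ByteCounter below totals the size of every request. The small parse function is only a stand-in for ColumnLogProcessor so that the example is self-contained.

```python
class ByteCounter(object):
    """Hypothetical handler: adds up the size of every request."""
    def __init__(self):
        self.total = 0

    def process(self, fields):
        self.total += fields['size']


def parse(lines, call_chain):
    # Stand-in for ColumnLogProcessor: "name size" on each line.
    for line in lines:
        name, size = line.split()
        fields = {'file_requested': name, 'size': int(size)}
        for func in call_chain:
            func(fields)


counter = ByteCounter()
parse(['/1 1000', '/2 96316', '/3 84722'], [counter.process])
```

In logscan-b.py the same handler would be added with call_chain.append(counter.process), next to the existing size check.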
Have a go hero – creating a new processing class
In these examples, we've printed a warning if a file exceeds a threshold. Instead, what if we
wanted to warn if a file was below a given threshold? This might be useful if we thought our
web server was truncating results or returning invalid data. Your job is to add a new handler
class to the call_chain that warns if a file is below a specific size. It should be able to run
side-by-side along with the existing MaxSizeHandler handler.
Accessing files directly
Up until now, we've read all of our data via a standard input pipe. This is a perfectly
acceptable and extensible way of handling input. However, Python provides a very simple
mechanism for accessing files directly. There are situations where direct file access is
preferable. For example, perhaps you're accessing data from within a web application and
using standard IO just isn't possible.
Time for action – accessing files directly
Let's update our LogProcessor so that we can pass a file on the command line rather than
read all of our data via sys.stdin.
1. Create a new file named logscan-c.py, using logscan.py as your template.
We'll be adding file access support to this original "combined format" processor.
2. Update the code in the __name__ == '__main__' section as follows.
if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-s', '--size', dest="size",
        help="Maximum File Size Allowed",
        default=0, type="int")
    parser.add_option('-f', '--file', dest="file",
        help="Path to Web Log File", default="-")
    opts,args = parser.parse_args()
    call_chain = []
    if opts.file == '-':
        file_stream = sys.stdin
    else:
        try:
            file_stream = open(opts.file, 'r')
        except IOError, e:
            print >>sys.stderr, str(e)
            sys.exit(-1)
    size_check = MaxSizeHandler(opts.size)
    call_chain.append(size_check.process)
    processor = LogProcessor(call_chain)
    processor.parse(file_stream)
3. Run the updated application from the command line as follows:
(text_processing)$ python logscan-c.py -s 1000 -f example.log
What just happened?
There are a couple of things here that are new. First, we added a second option to our
command-line parser. Using a -f or a --file switch, you can now pass in the name of a
logfile you wish to parse. We set the default value to a single dash, which signifies we should
use sys.stdin as we did in our earlier examples. Using a dash in this manner is common
with command-line-based utilities such as tar and cat.
Next, if an actual file name was passed via our new switch, we're going to open it here via
Python's built-in open function. open returns a file object, which we bind to the file_stream
name. The first argument to open is the file name; the second is the mode we wish to use.
>>> open('/etc/hosts', 'r')
<open file '/etc/hosts', mode 'r' at 0x10047d250>
>>>
Notice that if a file name wasn't passed in, we simply assign sys.stdin to file_stream.
Both of these objects are considered to be file-like objects. They implement the same set
of core functionality, though the input sources are different. This is another example of
polymorphism.
Finally, we've wrapped our open call in a try/except block in order to catch any
exceptions that may bubble up from the open function. In this example, we are catching
IOErrors only. Any other programming error triggered inside the try block will simply
trigger a stack trace.
The Python exception hierarchy is described in detail at
http://docs.python.org/library/exceptions.html#exception-hierarchy.
Errors generated during Input/Output operations generally raise IOError
exceptions. You should take some time to familiarize yourself with the layout of
Python's exception classes.
The open function is a built-in factory for Python file objects. It is possible to call the file
type directly, but that is discouraged. In later versions of Python, a call to open actually
returns a layered IO object and not just a simple file class.
It's possible to open a file in either text or binary mode. By default, a file is opened using text
mode. To tell Python that you're working with binary data, you simply need to pass a b in
as an additional mode flag. So, if you wanted to open a file for appending binary data, you
would use a flag of ab. Binary mode is only significant on DOS/Windows systems. When text
data is written on a Windows machine, each newline is converted to a carriage
return-newline combination. The file object needs to take that into account.
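The mode flags can be tried out with a short sketch such as the one below. It writes to a scratch file created with the tempfile module, which is our choice for the illustration and not something the log scanner uses. Note that under Python 3 the b flag matters on every platform, because binary mode deals in bytes rather than text.

```python
import os
import tempfile

# Create a scratch file so the example does not touch real data.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, 'wb')         # write binary
f.write(b'\x00\x01')
f.close()

f = open(path, 'ab')         # append binary, the "ab" flag from the text
f.write(b'\r\n\xff')
f.close()

f = open(path, 'rb')         # read the bytes back untranslated
data = f.read()
f.close()

os.remove(path)
```

Because the file is opened in binary mode throughout, the carriage return and newline bytes come back exactly as they were written.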
Astute readers should have noticed that we never actually closed the file. We simply left it open
and allowed the operating system to reclaim resources when we were finished. While this is
alright for small applications like this, we need to be careful to close all files in real applications.
Context managers
The with statement has been a Python fixture since 2.5, where it had to be enabled with
from __future__ import with_statement; as of 2.6 it is always available. The statement allows
the developer to create a new code block while holding a resource. When the code block exits, the
resource is automatically closed. This is true even if the code block exits in error.
It's also possible to use context managers for other resources as the context
manager protocol is quite extensible.
The following example illustrates the use of a context manager.
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more
information.
>>> with open('/etc/passwd') as f:
...     for line in f:
...         if line.startswith('root:'):
...             print line
...
root:*:0:0:System Administrator:/var/root:/bin/sh
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file
>>>
In this example, we opened our system password database and assigned the value returned
by the open function to f. While we were in the subsequent block, we were able to perform
file IO as we normally would.
When we exited the block by decreasing the indent, the context manager associated
with the file object ensured the file was automatically closed for us. This is evident by the
exception raised when we tried to simply read the object outside of the with statement.
Note that while the attribute f is still a valid object, the underlying file descriptor has already
been closed.
To achieve the same closed-file guarantee without the with statement, we would need to
do something such as the following.
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more
information.
>>> try:
...     f = open('/etc/hosts')
...     print len(f.read())
... finally:
...     try:
...         f.close()
...     except NameError:
...         pass
...
345
>>>
Here, the code within the finally block is executed whether or not the preceding try
block completes successfully. Within our finally block, we've nested yet another try. This
is because if the original open had failed, then f was never bound. Attempting to close it
would result in a NameError exception as soon as f is referenced!
You're encouraged to take advantage of the with statement as it's a wonderful way to avoid
file descriptor leaks within long-running applications.
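To illustrate how the context manager protocol extends to other resources, here is a minimal sketch of a class that implements it. The Resource class is hypothetical; the __enter__ and __exit__ method names are the ones the with statement looks for.

```python
class Resource(object):
    """Hypothetical resource that records what happens to it."""
    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('acquired')
        return self                 # bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append('released')
        return False                # do not swallow exceptions


res = Resource()
try:
    with res as r:
        r.events.append('working')
        raise ValueError('something went wrong')
except ValueError:
    pass
```

Even though the block exits in error, __exit__ still runs, so the recorded events end with the release step.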
Handling other file types
As we've seen, the Python file-like object is a powerful thing. But, there's more. Let's imagine
for a second that your server logfiles are compressed in order to save on storage space. We
can make one more simple change to our script so that we have native support for common
compression formats.
Time for action – handling compressed files
In this example, we'll add support for common compression formats using Python's
standard library.
1. Using the code in logscan-c.py as your starting point, create logscan-d.py.
Add a new function just below the MaxSizeHandler class.
def get_stream(path):
    """
    Detect compression.

    If the file name ends in a compression
    suffix, we'll open it using the correct
    algorithm. If not, we just return a standard
    file object.
    """
    _open = open
    if path.endswith('.gz'):
        _open = gzip.open
    elif path.endswith('.bz2'):
        # Python 2.6's bz2 module has no open function;
        # BZ2File is its file-like class.
        _open = bz2.BZ2File
    return _open(path)
2. Within our main section, update the line that reads open(opts.file, 'r') to read
get_stream(opts.file).
3. At the top of the listing, ensure that you're importing the two new compression
modules referenced in get_stream.
import gzip
import bz2
4. Finally, we can compress our example log using GZIP and run our log scanner as we
have in earlier examples.
(text_processing)$ gzip example.log
(text_processing)$ python logscan-d.py -f example.log.gz -s 1000
What just happened?
In this example, we added support for both GZIP and BZ2 compressed files as supported by
Python's standard library.
The bulk of the new functionality resides in the get_stream function we've added. We
look at the file extension provided by the user and make a determination as to which open
function we want to use. If the file appears to be compressed, we'll use a compression-specific
approach. If the file appears to be plain text, we'll default to the built-in open
function we used in our earlier examples.
In order to add our new functionality into the mix, we've replaced our call to open within the
main code to reference our new get_stream function.
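The same suffix-based dispatch can be exercised without a real log. The sketch below is a simplified variant of get_stream, named open_by_suffix here to keep it distinct. It opens everything in binary mode and round-trips a small payload through scratch files; the file names and payload are invented for the illustration.

```python
import bz2
import gzip
import os
import tempfile


def open_by_suffix(path):
    # Same idea as get_stream: choose an opener from the file name.
    if path.endswith('.gz'):
        return gzip.open(path, 'rb')
    elif path.endswith('.bz2'):
        return bz2.BZ2File(path, 'rb')
    return open(path, 'rb')


workdir = tempfile.mkdtemp()
payload = b'/a 65383\n/b 22912\n'
writers = {'log.gz': gzip.open, 'log.bz2': bz2.BZ2File, 'log.txt': open}

results = {}
for name in sorted(writers):
    path = os.path.join(workdir, name)
    out = writers[name](path, 'wb')     # write in the matching format
    out.write(payload)
    out.close()
    stream = open_by_suffix(path)       # read back through the dispatcher
    results[name] = stream.read()
    stream.close()
    os.remove(path)
os.rmdir(workdir)
```

Whichever opener is selected, the caller only ever calls read and close, which is what lets the rest of the log scanner stay unchanged.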
Implementing file-like objects
As mentioned earlier, objects can be used interchangeably as long as they provide the same
set of externally facing methods. This is referred to as implementing a protocol, or more
commonly, an interface. Languages such as Java, C#, and Objective-C utilize strict interfaces
that require a developer to implement a minimum set of functionality within a class.
Python, on the other hand, does not enforce such restrictions. Python's type system is referred
to as Duck Typing. If it looks like a duck and quacks like a duck, then it must be a duck.
While Python itself does not support strict interfaces, there are third-party
libraries available designed to fill that perceived gap. The Zope project is heavily
based on a library-based interface system. For more information, see
http://www.zope.org/Products/ZopeInterface.
Probably the most common protocol you'll see within Python code is the file-like object. Not
surprisingly, a file-like object is a Python object designed to "stand in" for a real file object.
The compression streams, as well as the sys.stdin pipe that we looked at earlier, are all
examples of a file-like object.
These objects do not necessarily need to implement all of the methods associated with a
real file object. For example, a read-only object needs to only implement the proper read
methods, and a socket stream doesn't need to implement a seek method.
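A minimal sketch of such an object follows. LineSource and count_lines are hypothetical names; the point is only that a consumer which loops over its input works with anything that supports iteration.

```python
class LineSource(object):
    """Hypothetical read-only file-like object backed by a list."""
    def __init__(self, lines):
        self._lines = list(lines)
        self.closed = False

    def __iter__(self):
        # Iteration is all that a "for line in handle" loop needs.
        return iter(self._lines)

    def read(self):
        return ''.join(self._lines)

    def close(self):
        self.closed = True


def count_lines(handle):
    # Works with any object that can be iterated line by line.
    return sum(1 for _ in handle)


source = LineSource(['/1 1000\n', '/2 96316\n'])
total = count_lines(source)
source.close()
```

There is no seek, tell, or write here. An object like this could be handed to a parse method that only iterates, but not to code that needs to rewind its input.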
File object methods
Let's take a closer look at some of the methods found on a standard file object. It's important
to understand file objects as proper IO and data access can dramatically affect the speed
and performance of a data-bound application. This is not an all-inclusive list. To see a
detailed breakdown, visit http://docs.python.org/library/stdtypes.html#file-objects.
Objects are free to implement as many of these as they wish, so be prepared to deal with
exceptions if you're not certain where your file object is coming from.
close
The close method is responsible for flushing data and closing the underlying file descriptor.
Any attempt to access a file after it has been closed will result in a ValueError exception.
This also sets the .closed attribute to True. Note that it is possible to call the close
method more than once without triggering an error.
fileno
The fileno method returns the underlying integer file descriptor. Many lower-level IO
functions (especially those found in the os module) require a standard system-level file
descriptor.
flush
The flush method causes Python to clear the internal I/O buffer and hand the data to the
operating system. This doesn't perform a disk sync, however, as data may still simply reside
in OS memory.
read
The read method will read data from the file object and return it as a string. If a size
argument is passed in then this method will read at most that much data from the file object, in
bytes. If the size argument is not passed in then read will go until EOF is reached.
readline
The readline method will read a single line from a file, retaining the trailing newline
character. A size argument may be passed in, which limits the amount of data that will be
read. If the maximum size is smaller than line length, an incomplete line may be returned.
Each call returns a successive line in a file.
(text_processing)$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> f = open('/etc/passwd')
>>> f.readline()
'##\n'
>>> f.readline()
'# User Database\n'
>>> f.readline()
'# \n'
>>> f.readline()
'# Note that this file is consulted directly only when the system is
running\n'
>>>
This is a convenient method to extract the first line of a file; however, there are better
methods if you wish to simply loop through the contents of a text file.
readlines
This method reads each line of a file into a list, until it reaches EOF. Each element of the list
is one line within a file. As with the readline method, each line retains its trailing newline.
This method is acceptable for smaller files, but can trigger heavy memory use if used on
larger files.
The idiomatic way to loop through a text file is to loop on the file object directly, as we've
done in previous examples.
seek
As IO is performed, an offset within the instance is changed accordingly. Subsequent reads (or
writes) will take place at that current location. The seek method allows us to manually set that
offset value. To expand upon the readline example from above, let's introduce a seek.
(text_processing)$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> f = open('/etc/passwd')
>>> f.readline()
'##\n'
>>> f.readline()
'# User Database\n'
>>> f.seek(0)
>>> f.readline()
'##\n'
>>> f.readline()
'# User Database\n'
>>>
Notice how the call to seek moves us back to the beginning of the file and we begin reading
the same data a second time. This method is frequently left out of non file-based file-like
objects, or is coded as a null operation.
tell
This is the counterpart to seek. Calling tell returns the current location of the file pointer
as an integer offset.
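The two methods work well together: an offset reported by tell can be handed back to seek later. The sketch below uses an in-memory io.BytesIO stream as a stand-in for a file on disk, with contents modelled on the first two lines of the password file shown earlier.

```python
import io

# An in-memory stream stands in for a file on disk.
f = io.BytesIO(b'##\n# User Database\n')

first = f.readline()
after_first = f.tell()     # offset just past the first line

f.seek(0)                  # back to the start
again = f.readline()

f.seek(after_first)        # jump straight to the second line
second = f.readline()
```

Saving an offset this way is a common technique for resuming work in a large file without rereading everything before it.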
write
The write method simply takes a source argument and writes it to the open file. It is
not possible to pass in a desired size; the entire string is written. If you wish to only
write a portion then you should limit the size via string slicing. A flush or a close may be
required before the data written appears on disk. String slicing is covered in our chapter on
Python String Services.
writelines
The writelines method is the counterpart to the readlines method. Given a list or a
sequence of strings, they will be written to the file. Newlines are not automatically added
(just as they are not automatically stripped from readlines). This is generally equivalent to
calling write for each element in a list.
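Both behaviors can be seen in a short sketch. As before, an in-memory io.BytesIO object stands in for an open file, and the strings are invented for the example.

```python
import io

out = io.BytesIO()             # in-memory stand-in for an open file

message = b'hello world\n'
out.write(message[:5])         # slicing limits what is written
out.write(b'\n')

# writelines adds no newlines of its own.
out.writelines([b'one\n', b'two', b'three\n'])

contents = out.getvalue()
```

Because the second element has no trailing newline, it runs straight into the third in the resulting data.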
Remember that not all of these methods need to be implemented on all file-like
objects. It's up to you to implement what is needed and convey that via proper
documentation.
Enabling universal newlines
Python utilizes a universal newlines system. Remember that the end-of-line marker varies
by operating system. On Unix and Unix derivatives, a line is marked with a \n terminator. On
Windows systems, a line ends with a \r\n combination.
Universal newlines support abstracts that out and presents each end-of-line marker as a \n
to the programmer. To enable this support, append a U to the mode string when calling the
built-in open function.
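The U flag belongs to the Python 2 open function, and newer Python 3 releases no longer accept it. The sketch below therefore shows the same translation through a different route, which is an assumption on our part rather than the method described above: the io.open function (available since Python 2.6) with newline=None. The scratch file and its contents are invented for the example.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Write a mix of Windows-style and Unix-style endings as raw bytes.
raw = open(path, 'wb')
raw.write(b'first\r\nsecond\nthird\r\n')
raw.close()

# newline=None asks io.open to translate every ending to '\n'.
f = io.open(path, 'r', encoding='ascii', newline=None)
lines = f.readlines()
f.close()
os.remove(path)
```

Every line comes back ending in a plain \n, regardless of how it was terminated on disk.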
Accessing multiple files
Let's take a little break from our LogProcessing application and look at Python's
fileinput module. In situations where you need to open more than one file and iterate
through the contents of each sequentially, this module can be a great help.
Note that as of the time of writing, the PyEnchant modules were not compatible
with Python 3. Therefore, these examples will only work with Python 2.
Time for action – spell-checking HTML content
In this example, we'll build a small application that can be used to check spelling in a
collection of HTML documents. We'll utilize the PyEnchant library here, which is based
upon the Enchant spell-check system.
1. Step into the virtual environment that we've created for our examples and run the
activate script for your platform.
2. Next, we'll install the pyenchant libraries using the easy_install utility.
The spell-check system is available on PyPI. Note that you must already have
the Enchant system installed on your workstation. Ubuntu users can install the
libenchant1c2a library. Windows users should follow the instructions at
http://www.abisource.com/projects/enchant/. There are binary packages available.
You may also need to install the en_US dictionary, which is also covered at the
previous URL.
3. Using easy_install, we'll add the PyEnchant libraries to our virtual
environment.
(text_processing)$ easy_install pyenchant
Searching for pyenchant
Reading http://pypi.python.org/simple/pyenchant/
Reading http://pyenchant.sourceforge.net/
Best match: pyenchant 1.6.1
Downloading http://pypi.python.org/packages/2.6/p/pyenchant/
pyenchant-1.6.1-py2.6.egg#md5=21d991be432cc92781575b42225a6d3e
Processing pyenchant-1.6.1-py2.6.egg
creating /home/jmcneil/text_processing/lib/python2.6/site-
packages/pyenchant-1.6.1-py2.6.egg
Extracting pyenchant-1.6.1-py2.6.egg to /home/jmcneil/text_
processing/lib/python2.6/site-packages
Adding pyenchant 1.6.1 to easy-install.pth file
Installed /home/jmcneil/text_processing/lib/python2.6/site-
packages/pyenchant-1.6.1-py2.6.egg
Processing dependencies for pyenchant
Finished processing dependencies for pyenchant
(text_processing)$
4. Create this first HTML file and name it index.html. This will be the main page of
our very basic website.
<html>
  <head>
    <title>Welcome to our home page</title>
  </head>
  <body>
    <h1>Unladen Swallow Spped</h1>
    There is an ongoing debate in the Python community regarding
    the speed of an unladen swallw. This site aims to settle
    that debate.
    <ul>
      <li><a href="air_speed.html">Air Speed</a>
    </ul>
  </body>
</html>
Now create this second HTML file and name it air_speed.html, as
referenced in the anchor tag above.
<html>
  <head>
    <title>Air speed</title>
  </head>
  <body>
    In order to maintain speed, a swallow must flap its wings 32
    times per second?
  </body>
</html>
5. Finally, we'll create our code. Create the following file and name it html_spelling.py.
Save it and exit your editor.
import fileinput
import enchant
from enchant.tokenize import get_tokenizer
from enchant.tokenize import HTMLChunker

__metaclass__ = type

class HTMLSpellChecker:
    def __init__(self, lang='en_US'):
        """
        Setup tokenizer.

        Create a new tokenizer based on lang.
        This lets us skip the HTML and only
        care about our contents.
        """
        self.lang = lang
        self._dict = enchant.Dict(self.lang)
        self._tk = get_tokenizer(self.lang,
            chunkers=(HTMLChunker,))

    def __call__(self, line):
        for word,off in self._tk(line):
            if not self._dict.check(word):
                yield word, self._dict.suggest(word)

if __name__ == '__main__':
    check = HTMLSpellChecker()
    for line in fileinput.input():
        for word,suggestions in check(line):
            # Adjacent string literals are joined into one message.
            print ("error on line %d (%s) in file %s. "
                   "Did you mean one of %s?" %
                   (fileinput.filelineno(), word,
                    fileinput.filename(),
                    ', '.join(suggestions)))
6. Run the last script using the HTML files we created as input on the command line.
If you've entered everything correctly, you should see the following output. Note
we've reformatted here to avoid potentially confusing line-wrapping.
(text_processing)$ python html_spelling.py *.html
What just happened?
We took a look at a few new things in this example, in addition to Python's fileinput
module. Let's step through this example slowly as there's quite a bit going on.
First of all, we imported all of our necessary modules. Following the standards, we first
imported the modules that are part of the Python standard library, and then the required
third-party packages. In this case, we're using the third-party PyEnchant toolkit.
Next, we bump into something that's probably unfamiliar to you: __metaclass__ = type.
The core Python developers changed the class implementation (for the better) with the
release of Python 2.2. We have both new style and old style classes. New style classes must
inherit from object in some manner, or be explicitly assigned a metaclass of type. This
is a neat little trick that tells Python to create only new style classes in this module.
Our HTMLSpellChecker class is responsible for performing the spell-check. In the
__init__ method, we create both a dictionary (which has no relation to the built-in
dict type) and a tokenizer. We'll use the dictionary for both spell-check and to ask for
suggestions if we've found a misspelled word. The tokenizer object will be used to split
each line into its component parts. The chunkers=(HTMLChunker,) argument tells
Enchant that we're working with HTML, and that it should automatically strip markup. The
provided HTMLChunker class saves us some extra work, though we'll cover how to do that
via regular expressions later in the book.
Next, we define a __call__ method. This method is special as it is executed each time a
Python object is called directly, as if it were a function.
(text_processing)$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> class A(object):
...     def __call__(self):
...         print "A is for Apple"
...
>>> a = A()
>>> a()
A is for Apple
>>> a.__call__()
A is for Apple
>>>
This example illustrates the usage of a __call__ method in detail. Notice how we can
simply treat our object as if it were a function. Of course, it's also possible to call the
__call__ method directly.
Within the body of the __call__ method, we tokenize each line, using the tokenizer we
created within __init__. PyEnchant strips out the HTML for us. Each word is then validated
via the dictionary. If it is not found, the application will provide a list of suggestions. The
yield keyword marks this method as a generator, so we yield each spelling error and its
suggestions back to our caller.
Now, we get to our main content. The first line is familiar. We're simply creating an instance
of our HTMLSpellChecker class. The next section is where we put fileinput to use.
The call to fileinput.input creates an iterator that transparently chains together all of
the files we passed in on the command line. The helper functions fileinput.filelineno
and fileinput.filename give us the current file's line number and the current file's
name, respectively.
In Python, an iterator is a type of object that implements an interface that
allows the developer to easily iterate through its contents. For more information
on iteration, see http://docs.python.org/library/stdtypes.html#iterator-types.
You may have noticed that we don't actually pass any file names to the fileinput.input
function. The module defaults to the values on the command line, and assumes they
are valid paths. If nothing is passed on the command line then the module will fall back to
standard input. It is possible to bypass this behavior and pass in our own list of files.
Simplifying multiple le access
The fileinput module takes a lot of complexity in opening and managing mulple les.
In addion to current le and line number, it's possible to look at things such as absolute
line number among all les and access le object-specic items such as a le's specic
integer descriptor.
Using a classic approach, one would need to open each le manually and iterate through,
retaining overall posion informaon.
As we said previously, it's possible to use fileinput without relying on the value of the
command-line arguments on sys.argv. The fileinput.input funcon takes an oponal
list of les to use read rather than working with the default.
A drawback in using the module-level functions is that we'll be creating a single instance
of fileinput.FileInput under the covers, which holds global state. This means
that we cannot have more than one iterator active at any one time and that it's not a
thread-safe operation.
Thankfully, we can easily overcome these limitations by building our own instance of
fileinput.FileInput rather than relying on the module-level convenience functions.
>>> import fileinput
>>> input = fileinput.FileInput(['/etc/hosts'])
>>> for line in input:
...     if line.startswith('127'):
...         print line
...
127.0.0.1 localhost
>>>
Each fileinput.FileInput instance contains the same methods available to us at the
module-level, though they all operate on their own separate context and do not interfere
with each other.
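To make the independence of separate instances concrete, here is a small sketch that walks two FileInput objects in lockstep. The make_file helper and the file contents are hypothetical and exist only for this illustration.

```python
import fileinput
import os
import tempfile

tmpdir = tempfile.mkdtemp()


def make_file(name, text):
    """Hypothetical helper: write text to a throwaway file."""
    path = os.path.join(tmpdir, name)
    handle = open(path, 'w')
    handle.write(text)
    handle.close()
    return path


left = fileinput.FileInput([make_file('left.txt', 'a\nb\n')])
right = fileinput.FileInput([make_file('right.txt', 'x\ny\n')])

pairs = []
for left_line in left:
    # Advancing one instance does not disturb the other.
    right_line = next(right)
    pairs.append((left_line.strip(), left.lineno(),
                  right_line.strip(), right.lineno()))

left.close()
right.close()
print(pairs)
```

Each instance keeps its own line counter, which is exactly what the shared module-level functions cannot offer.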
Inplace ltering
Finally, the fileinput module contains an inplace lter feature that isn't very widely
ulized. If the fileinput.input funcon is called with an inplace=1 keyword argument,
or if inplace=1 is passed to the fileinput.FileInput constructor, the opened les
are renamed to backup les and standard output is redirected to the original le. Inplace
ltering is disabled when reading from standard input.
For example, take a look at the following snippet of code.
import sys
import fileinput
# Iterate through all lines and
# convert everything to uppercase.
for line in fileinput.input(inplace=1, backup='.bak'):
    sys.stdout.write(line.upper())
Running this script with a text le on the command line will rst generate a backup of the
text le, ending in a .bak extension. Next, the original le will be overwrien with whatever
is printed as the standard output. Specically, we're simply translang all of the text to
uppercase here.
If you accidentally divide by zero and don't handle the excepon, your desnaon le can be
le in a corrupted state as your applicaon may exit unexpectedly before you write any data
to your le.
When using this approach, ensure you're properly handling excepons as your
le will be opened in write mode and truncated accordingly.
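One way to act on that advice is to wrap the loop in try/finally so that standard output is always restored. This is a sketch under stated assumptions: the file name notes.txt and its contents are invented for the example, and the file list is passed explicitly instead of coming from the command line.

```python
import fileinput
import os
import sys
import tempfile

# Assumption: a throwaway file stands in for the file named on the
# command line.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'notes.txt')
with open(path, 'w') as handle:
    handle.write('first line\nsecond line\n')

try:
    for line in fileinput.input([path], inplace=True, backup='.bak'):
        # While the loop runs, sys.stdout points at the rewritten file.
        sys.stdout.write(line.upper())
finally:
    # Always restore sys.stdout and close the files, even on error.
    fileinput.close()

with open(path) as handle:
    rewritten = handle.read()
with open(path + '.bak') as handle:
    original = handle.read()
```

Because a backup extension was supplied, the untouched original remains available next to the rewritten file, which gives you something to recover from if the loop fails partway through.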
Pop Quiz – file-like objects
1. As we've seen, file-like objects do not necessarily need to implement the entire
set of standard file object methods. If an attempt is made to run a method and that
method does not exist, what happens?
2. In what situation might you be better off using the readlines method of a file
versus iterating over the file object itself?
3. What happens if you attempt to open a text file and you specify binary mode?
4. What is the difference between a file object and a file-like object?
Accessing remote les
We've now had a somewhat complete crash-course in Python I/O. We've covered les,
le-like objects, handling mulple les, wring lter programs, and even modifying les
"inplace" using some slightly esoteric features of the fileinput module.
Python's standard library contains a whole series of modules, which allow you to access data
on remote systems almost as easily as you would access local le. Through the le-like object
protocol, most I/O is transparent once the protocol-level session has been congured and
established.
Time for action – spell-checking live HTML pages
In this example, we'll update our HTML spell-checker so that we can check pages that are
already being served, without requiring local access to the file system. To do this, we'll make
use of the Python urllib2 module.
1. We'll be using the html_spelling.py file as our base here, so create a copy of it and
name the file html_spelling-b.py.
2. At the top of the file, update your import statements to include urllib2, and
remove the fileinput module as we'll not take advantage of it in this example.
import urllib2
import enchant
import optparse
3. Now, we'll update our module-level main code and add an option to accept a URL on
the command-line.
if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', help="URL to Check")
    opts, args = parser.parse_args()
    if not opts.url:
        parser.error("URL is required")
4. Finally, change the fileinput.input call to reference urllib2.urlopen, add a
line number counter, and polish up the output content.
    # Count lines across the whole page, not per line.
    lineno = 0
    for line in urllib2.urlopen(opts.url):
        lineno += 1
        for word,suggestions in check(line):
            print "error on line %d (%s) on page %s. Did you mean:\n\t%s" % \
                (lineno, word, opts.url, ', '.join(suggestions))
5. That should be it. The final listing should look like the following code. Notice how
little we had to change.
import urllib2
import enchant
import optparse

from enchant.tokenize import get_tokenizer
from enchant.tokenize import HTMLChunker

__metaclass__ = type

class HTMLSpellChecker:
    def __init__(self, lang='en_US'):
        """
        Setup tokenizer.

        Create a new tokenizer based on lang.
        This lets us skip the HTML and only
        care about our contents.
        """
        self.lang = lang
        self._dict = enchant.Dict(self.lang)
        self._tk = get_tokenizer(self.lang,
            chunkers=(HTMLChunker,))

    def __call__(self, line):
        for word,off in self._tk(line):
            if not self._dict.check(word):
                yield word, self._dict.suggest(word)

if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', help="URL to Check")
    opts, args = parser.parse_args()
    if not opts.url:
        parser.error("URL is required")

    check = HTMLSpellChecker()
    # Count lines across the whole page, not per line.
    lineno = 0
    for line in urllib2.urlopen(opts.url):
        lineno += 1
        for word,suggestions in check(line):
            print "error on line %d (%s) on page %s. Did you mean:\n\t%s" % \
                (lineno, word, opts.url, ', '.join(suggestions))
6. Now, run the applicaon with a URL passed in on the command line. If it was coded
correctly, your output should resemble the following.
(text_processing)$ python html_spelling-b.py --url=http://www.
jmcneil.net
What just happened?
By simply changing a few lines of code, we were able to access a web page and scan for
spelling errors almost exactly as we did when we checked our local files. Of course, you're
seeing a limitation of our dictionary here. Our spell-checker sees words such as DOCTYPE,
DTD, and HTML as misspelled as they do not appear in the en_US dictionary.
We could fix this by adding a custom dictionary to the spell-checker that includes technical
lingo, but the goal in this example is to introduce I/O with the urllib2 module.
One important thing to note is that the urllib2.urlopen function supports more than just
the HTTP protocol. You can also access files using the secure-sockets layer by simply passing
in an HTTPS URL. It's even possible to access local files by passing a file:// URL into the
urllib2.urlopen function.
Yes, there is also an older module that is simply named urllib. The newer
urllib2 module is far more extensible and is recommended. However, it can be a bit tricky to
understand in detail. There is a great reference available out there that describes
some of the intricacies in a simple manner. The document is titled "urllib2: The
Missing Manual" and is available at http://www.voidspace.org.uk/python/articles/urllib2.shtml.
The urllib2.urlopen function can also directly access files via the FTP protocol. It's quite simple;
the URL you pass into urlopen simply needs to begin with ftp://.
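The claim that urlopen is not tied to HTTP can be checked without a network connection by pointing it at a local file. The sketch below makes a few assumptions that are not part of the book's listing: the import fallback lets it run on Python 3, where urlopen lives in urllib.request, and the URL is built by simple prefixing, which is only valid for a POSIX-style absolute path with no characters that need quoting.

```python
import os
import tempfile

try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3

# Assumption: a POSIX-style absolute path with no characters that
# need URL quoting, so prefixing file:// yields a valid URL.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'page.html')
with open(path, 'w') as handle:
    handle.write('<p>hello</p>\n<p>world</p>\n')

url = 'file://' + path
response = urlopen(url)
# The response is file-like, so it can be iterated line by line.
lines = [line for line in response]
response.close()
print(len(lines))
```

The loop over the response is the same loop the spell-checker runs over an HTTP page; only the URL scheme differs.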
Have a go hero – access web logs remotely
As we've covered both web log processing and the urllib2 module superficially, you should
be able to update our earlier log processing application to access files remotely. You don't
need an external account to try this. Remember, URLs beginning with file:// are valid
urllib2.urlopen URLs. You can make this change and test it locally.
Error handling
By now, you may have noced that while we're able to access a range of protocols using this
same mechanism, they all potenally return dierent errors and raise varying excepons.
There are two obvious soluons to this problem: we could catch each individual excepon
explicitly, or simply catch an excepon located at the top of the excepon hierarchy.
Fortunately, we don't need to take either of those sub-opmal approaches. When an internal
error occurs within the urllib2.urlopen funcon, a urllib2.URLError excepon is
raised. This gives us a convenient way to catch relevant excepons while leng unrelated
problems bubble up. Let's take a quick look at an example to solidify the point.
Python's excepon hierarchy is worth geng to know. You can read up
on excepons in detail at http://docs.python.org/library/
exceptions.html.
Time for action – handling urllib2 errors
In this example, we'll update our HTML spell-checker in order to handle network errors
slightly more gracefully. Whenever you provide utilities and interfaces to your users, you
should present errors in a clean manner (while logging any valid stack traces).
1. We're going to build off html_spelling-b.py, so copy it over and rename it to
html_spelling-c.py.
2. At the top of the file, add import sys. We'll need access to the methods within
the sys module.
3. Update the __name__ == '__main__' section to include some additional
exception-handling logic.
if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-u', '--url', help="URL to Check")
    opts, args = parser.parse_args()
    if not opts.url:
        parser.error("URL is required")

    check = HTMLSpellChecker()
    try:
        source = urllib2.urlopen(opts.url)
    except urllib2.URLError, e:
        reason = str(e)
        try:
            reason = str(e.reason)
        except AttributeError:
            pass
        print >>sys.stderr, "File Download Error: %s" % reason
        sys.exit(-1)

    # Read from the response opened inside the try block.
    lineno = 0
    for line in source:
        lineno += 1
        for word,suggestions in check(line):
            print "error on line %d (%s) on page %s. Did you mean:\n\t%s" % \
                (lineno, word, opts.url, ', '.join(suggestions))
4. You should now be able to execute this code and pass in a pair of invalid URL values,
using different protocols. Your output should be similar to the following.
(text_processing)$ python html_spelling-c.py --url=ftp://localhost
(text_processing)$ python html_spelling-c.py --url=http://www.jmcneil.net/notfound.html
What just happened?
We made a small update to our main code so that we can better handle exceptions bubbling
up from the urllib2 module.
In our exception handler's except statement, we do something that might seem slightly
peculiar. First, we bind the value of str(e) to a variable named reason. Next, we set up
another try/except block and attempt to bind the value of str(e.reason) to that same
reason variable. Why would we do that?
The explanation is simple. Some of the exceptions bubbling up have a reason attribute,
which provides more information. Specifically, the FTP errors contain it. We always try to pull
the more specific error. If it doesn't exist, that will raise an AttributeError exception. We
just ignore it and go with the first value of reason.
Our method of accessing the reason attribute highlights Python's Duck Typing design again.
It would have been possible for us to check whether a reason attribute existed on our
URLError object before attempting to access it. In other words, we could have ensured our
object adhered to a strict interface. This approach is usually dubbed Look Before You Leap.
Instead, we took the other (and more standard Python) way. We just did it and handled the
fallout in the event of an error. This is sometimes referred to as Easier to Ask Forgiveness
than Permission.
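The two styles can be compared side by side. In this sketch, PlainError and DetailedError are hypothetical exception classes invented for illustration; they are not part of urllib2, and they only mimic the idea that some errors carry a reason attribute while others do not.

```python
class PlainError(Exception):
    """Hypothetical error type without a reason attribute."""


class DetailedError(Exception):
    """Hypothetical error type that carries a reason attribute."""

    def __init__(self, message, reason):
        Exception.__init__(self, message)
        self.reason = reason


def describe_lbyl(error):
    # Look Before You Leap: test for the attribute first.
    if hasattr(error, 'reason'):
        return str(error.reason)
    return str(error)


def describe_eafp(error):
    # Easier to Ask Forgiveness than Permission: just try it.
    try:
        return str(error.reason)
    except AttributeError:
        return str(error)


errors = [PlainError('timed out'), DetailedError('failed', 'no route')]
results = [(describe_lbyl(e), describe_eafp(e)) for e in errors]
print(results)
```

Both functions return the same text for every input here; the difference lies in where the check happens, not in the result.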
Finally, we simply printed out a meaningful error and exited our application. If you look
back over the examples in this chapter, you'll notice that it does not matter which protocol
type we use.
Handling string IO instances
There's one more IO library that we'll take a look at in this chapter – Python's StringIO
module. In many of your applications, you're likely to run into a situation where it would be
convenient to write to a location in memory rather than using string operations or direct IO
to a temporary file.
StringIO handles just this. A StringIO instance is a file-like object that simply appends
written data to a location in memory.
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import StringIO
>>> handle = StringIO.StringIO()
>>> handle.write('A')
>>> handle.write('B')
>>> handle.getvalue()
'AB'
>>> handle.seek(0)
>>> handle.write("a")
>>> handle.getvalue()
'aB'
>>>
Looking at the example, you can see that the StringIO instance supports file methods such
as seek and write. By calling getvalue, we're able to retrieve the entire in-memory string
representation.
There's also a cStringIO module, which implements nearly the same interface and is quite
a bit faster, though there are limitations on Unicode values and subclassing that should be
understood before using it. For more information, see the StringIO library documentation
available at http://docs.python.org/library/stringio.html.
The StringIO modules changed a bit between Python 2 and Python 3. Both
the StringIO and the cStringIO modules are gone. Instead, developers
should use io.StringIO for textual data and io.BytesIO for binary data.
There is no longer a differentiation between a pure Python implementation and
the C-level implementation.
Understanding IO in Python 3
The last thing we'll look at in this chapter is the IO system in Python 3.0. In order to ease
transition, the new IO code has been back-ported to Python 2.6 and is available via the io
module.
The new IO system introduces a layered approach, almost comparable to Java's IO system.
At the bottom lies the IOBase class, which provides commonalities among the IO stream
classes. From there, objects are stacked according to IO type, buffering capability, and
read/write support.
While the details look complex, the actual interface to system IO really doesn't change too
much. For example, the io.open call can generally be used the same way. However, there
are some differences.
Most importantly, binary mode matters. The text will be decoded automatically into Unicode
using the system's locale, or a codec you pass in. If a file isn't truly text, it shouldn't be opened as
text. Files opened in binary mode now return a different object type than files opened in text
mode.
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import io
>>> io.open('/etc/hosts')
<io.TextIOWrapper object at 0x10049d250>
>>> io.open('/bin/ls', 'rb')
<io.BufferedReader object at 0x10049d210>
>>>
Noce that opening a le in text mode, which is the default mode, returns a
TextIOWrapper object whereas opening a le in binary mode returns a BufferdReader
object. Although it doesn't appear as a subclass of BufferedIOBase, TextIOWrapper
does actually implement buered IO.
The new io.open method is intended to replace the built-in open method as of 3.0. As with
the exisng funcon, it can also be used in a context manager.
For more details on the new Python IO system, see the documentation available at
http://docs.python.org/release/3.0.1/library/io.html. This covers the new IO system
in detail and underscores some of the changes between major Python releases.
Summary
This chapter served as a crash course on Python IO. The goal here is to ensure that you know
how to actually access your data in order to process it.
We covered quite a bit here and really focused on understanding Python's IO system. Most
textual data you'll process will likely come from local disk files, so understanding this material
is important.
You also learned how to build your own file-like objects and take advantage of
polymorphism, a powerful object-oriented development attribute. We covered HTTP and
compressed data, but as you've seen, the underlying access methods do not matter when
the exposed interface follows the file-like object protocol.
In the next chapter, we'll examine text handling using Python's built-in string functions.
Python String Services
Python's built-in string services provide all of the text-processing functionality
you would expect from any full-featured programming language. This includes
methods to search, test, and create new string objects from existing ones.
String objects also provide a C-like format mechanism that allows us to build
new string objects and interpolate them with standard Python values and
user-defined objects. Later versions of Python build on this concept.
Additionally, the actual string objects provide a rich set of methods and
functions that may be used to further manipulate textual string data.
In this chapter, we will:
Cover the basics of Python string and Unicode objects so that you'll understand the
similarities and differences.
Take a detailed look at Python string formatting so that you'll understand how to
easily build new strings. We'll look at the older and more common syntax as well as
the newer formats as defined in PEP-3101.
Familiarize yourself with the methods found on the standard Python string objects
as well as the Unicode components.
Dive into built-in string templating. We'll see more examples on templating in more
detail in Chapter 7, Creating Templates.
Understanding the basics of string object
Python supports both Unicode and ASCII-encoded text data. However, in versions of Python
earlier than 3.0, there are two built-in objects to manage text data. The str type holds standard
byte-width characters, while the unicode type exists to deal with wider unicode data.
All Python string objects are immutable, regardless of encoding type. This generally means
that methods that operate on strings all return new objects and not modified text. The big
exception to this rule is the StringIO module as covered in Chapter 2, Working with the
IO System. Editing StringIO data via its file-like interface results in manipulation of the
underlying string content.
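Immutability is easy to observe directly. The short sketch below uses made-up sample text; it shows that a string method hands back a new object and that item assignment on a string is rejected.

```python
original = 'text processing'
shouted = original.upper()

# upper() built a new string; the original is untouched.
print(original)
print(shouted)

# Assigning to a position fails because strings cannot be changed
# in place.
try:
    original[0] = 'T'
    mutated = True
except TypeError:
    mutated = False
print(mutated)
```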
Python's built-in string services do not operate on any type of structured data. They deal
with text data at the character-level.
In Python 2.7, a new memoryview object has been introduced. These objects
allow certain C-based data types to expose their contents via a byte-oriented
interface. Strings support this functionality. Generally speaking, however, a
memoryview object shouldn't be used for standard text operations.
Dening strings
Strings can be dened in a variety of ways, using a variety of dierent quong methods. The
Python interpreter treats string values dierently based on the choice of quotes used. Let's
look at an example that includes a variety of dierent denion approaches.
Time for action – employee management
In this short and rather contrived example, we'll handle some simple employee records and
just print them to the screen. Along the way, however, we'll cover the various different ways
a developer can quote and define string literals. A literal is a value that is explicitly entered,
and not computed.
1. From within our text processing virtual environment, create a new file and name it
string_definitions.py.
2. Enter the following code:
import sys
import re

class BadEmployeeFormat(Exception):
    """Badly formatted employee name"""

def get_employee():
    """
    Retrieve user information.

    This method simply prompts the user for
    an employee's name and his current job
    title.
    """
    employee = raw_input('Employee Name: ')
    role = raw_input("Employee's Role: ")
    if not re.match(r'^.+\s.+', employee):
        raise BadEmployeeFormat('Full Name Required '
            'for records database.' )
    return {'name': employee, 'role': role }

if __name__ == '__main__':
    employees = []
    print 'Enter your employees, EOF to Exit...'
    while True:
        try:
            employees.append(get_employee())
        except EOFError:
            print
            print "Employee Dump"
            for number, employee in enumerate(employees):
                print 'Emp #%d: %s, %s' % (number+1,
                    employee['name'], employee['role'])
            print u'\N{Copyright Sign}2010, SuperCompany, Inc.'
            sys.exit(0)
        except BadEmployeeFormat, e:
            print >>sys.stderr, 'Error: ' + str(e)
3. Assuming that you've entered the content correctly, run it on the command line.
Your output should be similar to the following:
(text_processing)$ python string_definitions.py
What just happened?
Let us go through this example. There are quite a few things to point out.
The very rst thing we do, other than import our required modules, is dene a custom
excepon class named BadEmployeeFormat. We simply have a subclass Exception
and dene a new docstring. Note that no pass keyword is required; the docstring is
essenally the body of our class. We do this because later on in this example, we'll raise this
error if an employee name doesn't match our simple validaon.
Now, note that our docstring is enclosed by triple quotes. As you've probably guessed,
that holds a special meaning. Python strings enclosed in triple quotes preserve line endings
so that mulline strings are represented correctly. Consider the following example.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> s = """This is a multiline string.
...
... There are many like it, but this one
... is mine.
... """
>>>
>>> print s
This is a multiline string.
There are many like it, but this one
is mine.
>>>
As you can see, the newline values are included. Note that all other values still require
additional escaping. For example, including a \t will still translate to a tab character.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> t = """This still creates a \tab"""
>>> print t
This still creates a ab
>>>
Aer our excepon class, we create a module-level funcon named get_employee that is
responsible for collecng, tesng, and returning employee data. The rst thing you should
noce is another triple quoted docstring. You should note that docstrings do not have
to be triple-quoted, but they do need to be string literals.
The very rst line of code within get_employee calls raw_input, which simply receives a
single line of text via standard input, trimming the trailing newline. The single-quoted string
passed to it serves as the text prompt that the caller will see on the command line.
The very next line includes another call to raw_input, asking for the employee's role.
Noce that this invocaon includes the prompt text in double quotes. Why is that? The
answer is simple. We used an apostrophe in the word "employee's" in order to indicate
ownership. Both double and single quotes serve the same funconal purpose. There is
nothing dierent about them, as in other languages. They're both allowed in order to let you
include one set of quotes within the other without resorng to long sequences of escapes.
As you can see, the following string variables are all the same.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> single = '"Yes, I\'m a programmer", she said.'
>>> double = "\"Yes, I'm a programmer\", she said."
>>> triple = """"Yes, I'm a programmer", she said."""
>>> print single
"Yes, I'm a programmer", she said.
>>> print double
"Yes, I'm a programmer", she said.
>>> print triple
"Yes, I'm a programmer", she said.
>>>
The Python convenon is to use single quotes for strings unless there is an override needed
to use a dierent format, so you should also adhere to this whenever possible.
On the next line, we call re.match. This is a very simple regular expression that is used to
validate the employee's name. We're checking to ensure that the input value contained a space
because we want the end-user to supply both the rst and last name. We'd do a much beer
job in a real applicaon (where we would probably ask for both values independently).
The call to re.match includes a single-quoted string, but it's prefixed with a single r. That
leading r indicates that we're defining a raw string. A raw string is interpreted as-is, and
escape sequences hold no special meaning. The most common use of raw strings is probably
within regular expressions like this. The following brief example details the difference
between manual escapes and raw strings.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> standard = '\n\nOur Data\n\n'
>>> raw = r'\n\nOur Data\n\n'
>>> print standard
Our Data
>>> print raw
\n\nOur Data\n\n
>>>
Using the standard string syntax, we would have had to double the backslashes if we wished
to mute the escape interpretation, and our string value would have been
'\\n\\nOur Data\\n\\n'. Of course, this is a much more difficult string to read.
Users of the popular Django framework may recognize this syntax. Django uses
regular expressions to express HTTP request-routing rules. By convention, these
regular expressions are all contained within raw string definitions.
If the regular expression test fails, we'll raise our BadEmployeeFormat exception that we
defined at the top of this example. Look carefully at the raise statement. Notice that the
string passed into BadEmployeeFormat's __init__ method is actually composed of two
strings. When the Python interpreter encounters string literals separated by white space,
it automatically concatenates them together. This provides a nice way for the developer
to wrap his or her strings neatly without creating long and hard to manage lines. As these
strings were defined within the parentheses following BadEmployeeFormat, we were able
to include a newline.
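Adjacent-literal concatenation can be tried on its own. In this sketch, the first string repeats the message from the raise statement, while the second is an invented example showing that the pieces may use different quote styles and prefixes.

```python
message = ('Full Name Required '
           'for records database.')

# The compiler joins adjacent literals; no + operator is involved.
print(message)

# Adjacent literals may also mix quote styles and raw strings.
pattern = r'^\d+' "\t" 'end'
print(repr(pattern))
```

Note that only literals are joined this way; placing a variable next to a literal is a syntax error.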
Now, within our main secon, we create an innite loop and begin calling get_employee.
We append the result of each successful call onto our employees list. If an excepon is raised
from within get_employee, we might have to take some addional acon.
If EOFError bubbles up then a user has clicked Ctrl + d (Ctrl + z on Windows), indicang that
they have no more data to supply. The raw_input funcon actually raises the excepon;
we just let it percolate up the call stack. The rst thing we do within this handler is print out
some status text we nofy the user that we're dumping our employee list.
Next, we have a for loop that iterates on the results of enumerate(employee).
Enumerate is a convenient funcon that, when given a sequence as an argument, returns
the zero-based loop number as well as the actual value in a tuple, like in this example
snippet:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> for c,i in enumerate(xrange(2)):
...     print "Loop %d, xrange value %d" % (c,i)
...
Loop 0, xrange value 0
Loop 1, xrange value 1
>>>
Each employee's name and role is printed out this way. This continues until we reach the end
of the list, at which point we're going to print a simple copyright statement.
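The listing adds one to the zero-based counter by hand. As an alternative sketch, enumerate accepts a starting value as its second argument from Python 2.6 onward, which removes the manual adjustment. The employee records below are made-up sample data.

```python
# Assumption: sample records standing in for user input.
employees = [
    {'name': 'Ada Lovelace', 'role': 'Analyst'},
    {'name': 'Grace Hopper', 'role': 'Admiral'},
]

lines = []
# The second argument sets the first counter value (Python 2.6+).
for number, employee in enumerate(employees, 1):
    lines.append('Emp #%d: %s, %s' % (number, employee['name'],
                                      employee['role']))

for line in lines:
    print(line)
```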
When our employee applicaon becomes wildly popular, we want to be certain that we're
protected aer all! The copyright line introduces yet another string variant – a Unicode
literal. Unicode strings contain all of the funconality of standard string objects, plus some
encoding specics.
A Unicode literal can be created by prepending any standard string with a single u, much
like we did with the r for raw strings. Addionally, Unicode strings introduce the \N escape
sequence, which allows us to insert a Unicode character by standardized name rather than
literally or by character code.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> russian_pm = (u'\N{CYRILLIC CAPITAL LETTER PE}'
...     u'\N{CYRILLIC SMALL LETTER U}'
...     u'\N{CYRILLIC SMALL LETTER TE}'
...     u'\N{CYRILLIC SMALL LETTER SHORT I}'
...     u'\N{CYRILLIC SMALL LETTER EN}')
>>> print russian_pm
Путйн
>>> russian_pm = u'\u041f\u0443\u0442\u0439\u043d'
>>> print russian_pm
Путйн
>>>
As of now, you should understand that Unicode allows us to represent characters outside of
the ASCII range. This includes symbols such as the one we added above as well as alphabets
such as Cyrillic, which at one point would have required its own encoding standard (in fact,
KOI8 is just that). We'll cover Unicode and additional text encodings in much more detail
when we get to Chapter 8, Understanding Encodings and i18n.
Finally, we'll catch our BadEmployeeFormat exception. This indicates that our test regular
expression didn't match. Here, you'll see that we're concatenating a string literal with a
calculated value, so we can't simply place them adjacent within our source listing. We use
plus-syntax to create a new string, which is a concatenation of the two.
One important thing to remember is that, although there are three different variants of
quotes and raw string modifiers, there are only two string types: unicode and str.
Building non-literal strings
The majority of the strings you'll create in a manual fashion will be done using literals. In
most other scenarios, text data is generated as the result of a function or a method call.
Consider the value returned by sys.stdin.readline. We'll cover some of the common
methods for building strings programmatically as we progress through this chapter.
Python 3.0 changes this model. All strings (str) in Python 3.0 are Unicode, and a
bytes type replaces the standard string object for representing raw byte data, such
as binary information. Under Python 3.0 through 3.2, defining a string using the
u'content' approach simply results in a SyntaxError exception (the prefix is
accepted again starting with Python 3.3). As there is only one text string type,
the previously mentioned basestring is no longer valid within Python 3.0, either.
Pop Quiz – string literals
1. We've seen where we would use raw strings and we've seen where we would use
Unicode strings. Where might you wish to combine the two? Is it even possible?
2. What do you suppose would happen if you tried to concatenate a Unicode object
and a standard Python string? Here's a hint: what happens when you divide a whole
integer by a float?
3. Suppose a ZeroDivisionError or an AttributeError is triggered from within
get_employee. What do you suppose would happen?
String formatting
In addion to simply creang plain old strings as we've just covered, Python also lets you
format them using a C sprintf style syntax. Strings in later versions of Python also support
a more advanced format method.
Time for action – customizing log processor output
Let's revisit and extend our web server log processor now. Our first versions simply printed
text to sys.stdout when information was encountered. Let's expand upon that a bit. Using
Python's built-in string formatters, we'll do a better job at reporting what we find. In fact, we'll
delegate that responsibility to the classes responsible for evaluating the parsed log data.
We'll also add some processing meta-output, such as how many lines we've processed
and how long it takes to execute the entire report. This is helpful information as we
further extend our log processor.
1. We're going to use logscan-c.py from Chapter 2, Working with the IO System as
our base here, so copy it over and rename it as logscan-e.py.
2. Update the code in logscan-e.py to resemble the following.
import time
import sys
from optparse import OptionParser

class LogProcessor(object):
    """
    Process a combined log format.

    This processor handles logfiles in a combined format;
    objects that act on the results are passed in to
    the init method as a series of methods.
    """
    def __init__(self, call_chain=None):
        """
        Setup parser.

        Save the call chain. Each time we process a log,
        we'll run the list of callbacks with the processed
        log results.
        """
        if call_chain is None:
            call_chain = []
        self._call_chain = call_chain

    def split(self, line):
        """
        Split a logfile.

        Initially we just want size and requested filename, so
        we'll split on spaces and pull the data out.
        """
        parts = line.split()
        return {
            'size': 0 if parts[9] == '-' else int(parts[9]),
            'file_requested': parts[6]
        }

    def report(self):
        """
        Run report chain.
        """
        for c in self._call_chain:
            print c.title
            print '=' * len(c.title)
            c.report()
            print

    def parse(self, handle):
        """
        Parses the logfile.

        Returns a dictionary composed of log entry values,
        for easy data summation.
        """
        line_count = 0
        for line in handle:
            line_count += 1
            fields = self.split(line)
            for handler in self._call_chain:
                getattr(handler, 'process')(fields)
        return line_count

class MaxSizeHandler(object):
    """
    Check a file's size.
    """
    def __init__(self, size):
        self.size = size
        self.name_size = 0
        self.warning_files = set()

    @property
    def title(self):
        return 'Files over %d bytes' % self.size

    def process(self, fields):
        """
        Looks at each line individually.

        Looks at each parsed log line individually and
        performs a size calculation. If it's bigger than
        our self.size, we just print a warning.
        """
        if fields['size'] > self.size:
            self.warning_files.add(
                (fields['file_requested'], fields['size']))
            # We want to keep track of the longest file
            # name, for formatting later.
            fs = len(fields['file_requested'])
            if fs > self.name_size:
                self.name_size = fs
    def report(self):
        """
        Format the Max Size Report.

        This method formats the report and prints
        it to the console.
        """
        for f,s in self.warning_files:
            print '%-*s :%d' % (self.name_size, f, s)

if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-s', '--size', dest="size",
        help="Maximum File Size Allowed",
        default=0, type="int")
    parser.add_option('-f', '--file', dest="file",
        help="Path to Web Log File", default="-")
    opts,args = parser.parse_args()
    call_chain = []
    if opts.file == '-':
        file_stream = sys.stdin
    else:
        try:
            file_stream = open(opts.file)
        except IOError, e:
            print >>sys.stderr, str(e)
            sys.exit(-1)
    size_check = MaxSizeHandler(opts.size)
    call_chain.append(size_check)
    processor = LogProcessor(call_chain)
    initial = time.time()
    line_count = processor.parse(file_stream)
    duration = time.time() - initial
    # Ask the processor to display the
    # individual reports.
    processor.report()
    # Print our internal statistics
    print "Report Complete!"
    print "Elapsed Time: %#.8f seconds" % duration
    print "Lines Processed: %d" % line_count
    # The parentheses keep the zero-lines guard inside the
    # formatted value rather than replacing the whole string.
    print "Avg. Duration per line: %#.16f seconds" % \
        (duration / line_count if line_count else 0)
3. Now, in order to illustrate what's going on here, create a new file named
example2.log, and enter the following data. Note that each line begins with
127.0.0.1.
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /a HTTP/1.1" 200
65383 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /short HTTP/1.1"
200 22912 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /bit_long
HTTP/1.1" 200 1818212 "-" "Mozilla/4.0 (compatible; MSIE 6.0;
Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /extra_long
HTTP/1.1" 200 873923465 "-" "Mozilla/4.0 (compatible; MSIE 6.0;
Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /e HTTP/1.1" 200
8221 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /e HTTP/1.1" 200 4
"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET
CLR 1.1.4322)"
127.0.0.1 - - [29/Mar/2010:00:48:05 +0000] "GET /d HTTP/1.1" 200
22 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 1.1.4322)"
4. Now, from within our virtual environment, run this code on the command line. Your
output should be similar to the following:
(text_processing)$ cat example2.log | python logscan-e.py -s 30
What just happened?
We introduced some extended string formatting mechanisms and restructured our code to be a
little bit more extensible, which is generally a good practice.
First of all, we're importing the time module. We use this to calculate runtime and other
things as we move forward. As we introduce new methods of extracting and parsing these
files, it's nice to have a means to measure the performance hit or gain associated with the
change.
We updated the LogProcessor class in a few places. First, we've added a report method.
This method will pull the title off of each log handler defined and display it, followed by
a separator bar. Next, the report method will call each handler class directly and ask it to
print its own report segment.
The parse function has been updated to return the number of lines processed for statistics
purposes. We've also replaced our direct call to each handler function with a dynamic lookup
of a process function. This is a great example of Python's dynamic nature and duck-typing at
work. We did this so that we can get at more of the class fields directly in other areas. Simply
passing the parsing function around limits what we have access to.
The MaxSizeHandler got an even bigger facelift this time through. We've added instance
level variables name_size and warning_files. The name_size variable keeps track of
the longest filename we've found while warning_files is a set object.
The following three lines define a Python property:
    @property
    def title(self):
        return 'Files over %d bytes' % self.size
A property is a special object that appears to be an attribute when accessed directly, but
is actually handled by a method behind the scenes. When we access c.title from within
LogProcessor, we're actually triggering a call to the title method of a MaxSizeHandler
instance.
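Here is a minimal sketch of that behavior. The SizeReport class is a hypothetical stand-in for MaxSizeHandler, reduced to just the property.

```python
class SizeReport(object):
    """Hypothetical class used only to demonstrate @property."""
    def __init__(self, size):
        self.size = size

    @property
    def title(self):
        # Computed on every access, so it tracks changes to self.size.
        return 'Files over %d bytes' % self.size


report = SizeReport(30)
first = report.title      # no parentheses: it reads like an attribute
report.size = 1024
second = report.title     # the method runs again with the new size

print(first)
print(second)
```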
We've made changes to our process method, too. It now adds a tuple for each file
name/size pair that exceeds our maximum allowed size. Why did we use a set? Simple. If the
same file is accessed multiple times, we only want to display it once for each size. Python lets
us use tuples as unique values within a set object as they're immutable. As is the nature
of sets, adding the same value multiple times has no further effect. A value only exists once
within a set.
Note that the built-in set type was added in Python 2.4. Prior to that, sets were
available only through an external module, and it was necessary to
'from sets import Set as set' at the top of your module. If you're running an
earlier version, you'll have to take this precaution.
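The deduplicating behavior is easy to confirm with a short sketch. The hits list below is made-up sample data, not output from our log processor.

```python
warning_files = set()

# The same (name, size) pair seen three times, plus one new pair.
hits = [('/a', 65383), ('/a', 65383), ('/e', 8221), ('/a', 65383)]
for name, size in hits:
    warning_files.add((name, size))

# A list is mutable and therefore unhashable, so it cannot be a
# member of a set; a tuple of hashable values can.
try:
    warning_files.add(['/a', 65383])
    list_allowed = True
except TypeError:
    list_allowed = False

print(len(warning_files))
print(list_allowed)
```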
We nish up this revision of the MaxSizeHandler class by updang the longest lename, if
applicable, and dening our report funcon.
If you take a closer look at report, you'll see a line containing a string format that reads '%-
*s :%d' % (self.name_size, f, s). There is a bit of formaer magic included here.
We'll take a closer look at this syntax below, but understand that this line prints a le's name
and corresponding size. It also ensures that each size value lines up in a columnar format, to
the right of the longest lename we've found. We're allong for variaons in lename size
and spacing our sizes accordingly to void a jagged –edge look.
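To see the alignment on its own, here is a small sketch using a few name/size pairs taken from example2.log. The files list and the rows variable are only illustrative.

```python
files = [('/a', 65383), ('/bit_long', 1818212), ('/extra_long', 873923465)]

# Width of the longest name, computed at run time.
name_size = max(len(name) for name, size in files)

# '-' left-justifies the name; '*' takes the field width from the tuple,
# so the first tuple element is consumed as the width, not printed.
rows = ['%-*s :%d' % (name_size, name, size) for name, size in files]

for row in rows:
    print(row)
```

Every colon ends up in the same column, no matter how short the filename is.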
Finally, we hit our main secon. Not a whole lot has changed here. We've added code to
track how long we run and how many lines we've processed as returned by processor.
parse. We've also switched to passing instances of our handler classes to LogProcessor's
__init__ method rather than specic funcons.
At the boom of the main secon, we've introduced another variaon of the formang
expression. Here, we're shoring up some of our decimal formang and using some alternate
formang methods available to us. The '#' in this line alters the way the string is rendered.
Percent (modulo) formatting
This is the oldest method of string format available within Python, and as such, it's the
most popular one. We've been using it throughout the book so far, though this example
introduced some of the more esoteric features.
A percent formaer expression consists of two main parts a format string and a tuple or a
diconary of formang values. Format strings consist of plain text with format specicaons
mixed in it. Format specicaons begin with a percent sign and instruct Python on how to
translate a data value into printed text.
These two main components are then separated via a percent sign, or modulus operator. If
you're formang a string with a single % specier then the use of a tuple is not necessary.
For example, simple string formang expressions usually look like the following:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> "%d + %d = %d" % (1,2,3)
'1 + 2 = 3'
>>> '%d %% %d = %d' % (5,2,1)
'5 % 2 = 1'
>>> 'I am a %s programmer' % 'python'
'I am a python programmer'
>>>
It is also possible to use a diconary instead of a tuple, if the corresponding key is specied
in parenthesis aer the % operator, like in this example.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> user = {'uid': 0, 'gid': 0, 'login': 'root'}
>>> 'Logged in as %(login)s with uid %(uid)d and gid %(gid)d' % user
'Logged in as root with uid 0 and gid 0'
>>>
Each formang specicaon consists of a variety of dierent elements, most of which are
usually le out. Here is a diagram detailing all of the available modiers.
This example uses a diconary to provide the mapped values. Let's review each possible
component. Remember that some of the possible values change depending on whether
we've used a diconary or a tuple.
Mapping key
If the mapping key is present then the format conversion expects a dictionary after the dividing
percent sign. The mapping key is quite simply a key into the dictionary you'll provide.
Conversion ags
These are oponal values that change the way the provided value is displayed. There are a
series of dierent ags available.
Flag      Usage
#         Dictates that an alternate format should be used. Alternate formats vary by
          conversion type. For example, using this flag with a floating point ensures
          that the decimal point is present, even if not required.
0         If the minimal display width is greater than the value, pad with zero for
          numeric values.
-         The printed value is left-justified in relation to the padding. The default is
          to right-justify.
<space>   Signifies that a space should be left before a positive number.
+         Add a sign character. Has a higher precedence than <space>.
In the above example, we specied an alternate format in order to ensure that the decimal is
always present.
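The following sketch exercises each flag once. The dictionary keys are arbitrary labels chosen for this illustration.

```python
examples = {
    'zero_pad':  '%05d' % 42,     # pad a numeric value with zeros
    'left':      '%-5d|' % 42,    # left-justify within the width
    'space':     '% d' % 42,      # blank where a sign would go
    'plus':      '%+d' % 42,      # always show the sign
    'plain':     '%.0f' % 3.0,    # no decimal point is required here
    'alternate': '%#.0f' % 3.0,   # '#' forces the decimal point anyway
}

for label in sorted(examples):
    print('%-10s%r' % (label, examples[label]))
```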
Minimum width
If the value to be translated does not meet this minimum length, it will be padded
accordingly. If a * (asterisk) is passed in as opposed to a number, the value will be taken from
the tuple of values.
This is the approach taken in our last example. We programmatically determined the padding
we wanted to use and inserted it into our values tuple while forcing left-justification.
Precision
This is valid for oang-point numbers. The precision indicates how many places aer the
decimal to display. In the preceding diagram, we specied four places in the value, but
only requested three in the formang. The following small example details the use of the
precision opon. Note that the value printed versions the value provided.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> '%.3f' % 3.1415
'3.142'
>>>
As you can see, the value we've supplied is rounded up correctly and printed.
Width
These length modifiers have no use in Python and do not change the formatting at all. They
are largely carried over from C's sprintf functionality. Accepted values are l, L, or h. If
they are supplied, they are simply ignored.
Conversion type
The data type we're converng from. These are generally the same as found in C. However,
the r and the s types are slightly special and we'll cover them below. Here is a list of the
valid conversion formats.
Conversion Descripon
d, i Signed Decimal
o Signed Octal
x Signed hexadecimal in lowercase
X Signed hexadecimal in uppercase
uObsolete – idencal to d
eFloang point exponenal in lowercase
EFloang point exponenal in uppercase
F,f Floang point decimal
gLowercase exponenal if exponent is less than -4, otherwise use decimal format.
GUppercase exponenal if exponent is less than -4, otherwise use decimal format.
c Single character. Can be an integer value or a string of one.
r Object repr value, see below.
s Object str value, see below.
%Literal percent sign.
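A quick sketch of several of these conversion types side by side follows. The results list is just a convenient container for the example.

```python
results = [
    '%d' % 255,          # signed decimal
    '%x' % 255,          # lowercase hexadecimal
    '%X' % 255,          # uppercase hexadecimal
    '%e' % 12345.678,    # lowercase exponential
    '%f' % 2.5,          # floating point decimal, six places by default
    '%g' % 0.00001234,   # exponent below -4, so exponential format
    '%g' % 1234.5,       # otherwise, decimal format
    '%c' % 65,           # an integer character code...
    '%c' % 'A',          # ...or a string of length one
    '%d%%' % 50,         # a literal percent sign
]

for value in results:
    print(value)
```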
Using string special methods
If an object has a __str__ method then it is implicitly called whenever an instance of that
object is passed to the str built-in function. Accepted practice is to return a human-friendly
string representation of that object.
Likewise, if an object contains a __repr__ method, passing that object to the repr built-in
should return a Python-friendly representation of that object. Historically, that means
enough text to recreate the object via eval, but that's not a strict requirement.
Using %s or %r results in the values of __str__ or __repr__ replacing the formatting
specification. For example, consider the following code.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> class MicroController(object):
...     def __init__(self, brand, bits):
...         self.brand = brand
...         self.bits = bits
...     def __str__(self):
...         return '%s %s-bit CPU' % (self.brand, self.bits)
...
>>> m = MicroController('WhizBang', 8)
>>> 'my box runs a %s' % m
'my box runs a WhizBang 8-bit CPU'
>>>
This is very convenient while formang strings containing representaons of objects.
Though, in some cases, it can be somewhat misleading.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'I have %s bits' % 8
'I have 8 bits'
>>>
In many languages, an approach like that would simply result in either a syntax error or a
memory-related crash. Python treats it differently, however, as the result of str(8) is the
string representation of the number eight.
Have a go hero – make log processing more readable
So, now you should have a prey good grasp of percent string formang. All of the le sizes
outpued in our example above are in pure bytes. That's great for accuracy's sake, but it can
be quite dicult on the eyes.
Update all of the preceding output to display as kilobytes in a decimal form. We don't want
to display decimals beyond two places as that could get just as dicult to read.
Using the format method approach
As of Python 2.6 (and all versions of Python 3), the format method has been available to all
string and Unicode objects. This method was introduced to combat flexibility restrictions in
the percent approach. While this is a much more powerful and flexible method of string
formatting, it's only available in newer versions of Python. If your code must run on older
distributions, you're stuck with the classic percent-formatting approach.
Instead of marking our format specicaons with percent signs, the format method expects
formang informaon to be enclosed in curly braces.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> platforms = {'linux': 121, 'windows': 120, 'solaris': 12}
>>> 'We have {0} platforms, Linux: {linux}, Windows: {windows}, and Solaris: {solaris}'.format(
...     3, **platforms)
'We have 3 platforms, Linux: 121, Windows: 120, and Solaris: 12'
>>>
In the simplest cases, numeric values in curly braces represent positional arguments while
text names represent keyword arguments.
In addition to the new format method found on string objects, Python 2.6 and
above also have a new built-in function, format. This essentially provides a
means to access the features of the string object's format, without requiring
a temporary string. Under the hood, it triggers a call to a formatted object's
__format__ method. For more information on the __format__ method, see
http://python.org/dev/peps/pep-3101/.
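As a rough sketch of how the built-in and the __format__ hook fit together, consider the following. The Temperature class is hypothetical and simply delegates the format specification to the float it wraps.

```python
class Temperature(object):
    """Hypothetical class that customizes its own formatting."""
    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # Hand the format specification on to the wrapped float.
        return format(self.celsius, spec) + ' C'


t = Temperature(21.456)
direct = format(3.14159, '.2f')        # no template string required
custom = format(t, '.1f')              # calls t.__format__('.1f')
embedded = 'It is {0:.1f}'.format(t)   # str.format uses the same hook

print(direct)
print(custom)
print(embedded)
```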
Time for action – adding status code data
First of all, note that this example won't work if you're using a version of Python less than
2.6. If you fall into that category, you'll have to either upgrade your version, or simply pass
over this section.
We're going to update our LogProcessor script to report on the collection of HTTP
response codes found within the logfile. We'll simply add an additional handler to process
the parsed data.
1. Using logscan-e.py as a base, create logscan-f.py and add the following
additional import statement:
from collections import defaultdict
2. Now, we're going to change the split method of LogProcessor to also include
HTTP status code information.
    def split(self, line):
        """
        Split a logfile.

        Initially, we just want size and requested filename, so
        we'll split on spaces and pull the data out.
        """
        parts = line.split()
        return {
            'size': 0 if parts[9] == '-' else int(parts[9]),
            'file_requested': parts[6],
            'status': parts[8]
        }
3. Now, directly below the LogProcessor class, add the following new handler class.
class ErrorCodeHandler(object):
    """
    Collect Error Code Information.
    """
    title = 'Error Code Breakdown'

    def __init__(self):
        self.error_codes = defaultdict(int)
        self.errors = 0
        self.lines = 0

    def process(self, fields):
        """
        Scan each line's data.

        Reading each line in, we'll save out the
        number of response codes we run into so we
        can get a picture of our success rate.
        """
        code = fields['status']
        self.error_codes[code] += 1
        # Assume anything >= 400 is
        # an HTTP error
        self.lines += 1
        if int(code) >= 400:
            self.errors += 1

    def report(self):
        """
        Print out Status Summary.

        Create the status segment of the
        report.
        """
        longest_num = sorted(self.error_codes.values())[-1]
        longest = len(str(longest_num))
        for k,v in self.error_codes.items():
            print '{0}: {1:>{2}}'.format(k, v, longest)
        # Print summary information
        print 'Errors: {0}; Failure Rate: {1:%}; Codes: {2}'.format(
            self.errors, float(self.errors)/self.lines,
            len(self.error_codes.keys()))
4. Finally, add the following line to the main section, right below
call_chain.append(size_check):
    call_chain.append(ErrorCodeHandler())
5. Now, run the updated application. Your output should resemble the following:
(text_processing)$ cat example2.log | python logscan-f.py -s 30
What just happened?
Let's take a quick survey of the changes we made to this application. First of all, we imported
defaultdict. This is a rather useful object. It also acts as a dictionary. However, if a
referenced key doesn't exist, it calls the function supplied and uses its value to seed the
dictionary before returning. A standard dictionary would simply raise a KeyError, as in the
following example:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> d = {}
>>> d['200'] += 1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: '200'
>>> from collections import defaultdict
>>> d_dict = defaultdict(int)
>>> d_dict['200'] += 1
>>> d_dict
defaultdict(<type 'int'>, {'200': 1})
>>>
Next, we're just updang the parse method to return the eighth eld in each line, which
happens to be the HTTP status code as returned to the client.
In the new handler class, ErrorCodeHandler, we set up three instance-level variables. The
defaultdict object detailed previously, and two counters that represent the number of
errors we've run into as well as the number of lines we've processed.
The process method adds to the defaultdict each me an error is encountered. If a
specic value hasn't been added yet, the diconary defaults (hence its name) to the value of
int(), which will be zero.
The defaultdict type is a useful helper when tallying or extracting information from
logfiles or other unknown sources of data when you're not certain whether a specific
key will exist and want to add it dynamically.
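A compact sketch of that tallying pattern follows. The statuses list is made-up sample data standing in for the status field of each parsed line.

```python
from collections import defaultdict

statuses = ['200', '200', '404', '200', '500', '404']

tally = defaultdict(int)
for code in statuses:
    # A missing key is first seeded with int(), which is 0,
    # so the increment never raises a KeyError.
    tally[code] += 1

# The codes are strings, so convert before comparing to a number.
errors = sum(count for code, count in tally.items() if int(code) >= 400)

print(dict(tally))
print(errors)
```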
Next, we increase our line number counter. If the status code is 400 or greater then we
also increment our error counter. You should note that we're actually passing the value of
code to the int function before doing the comparison. Why is this?
Python is a dynamically-typed language; however, it is still strongly typed. For example, an
HTTP code value of '200' is a textual representation of a number; it is still a string type.
The value was assigned its type when we extracted it as a substring from a line in a logfile,
which itself was read in as a collection of strings. So, without the explicit conversion, we're
comparing an integer (400) against a string representation of a number. The result probably
isn't what you would expect.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> snum = '1000'
>>> snum == 1000
False
>>> int(snum) == 1000
True
This is a common gotcha and has actually been rectified in Python 3.0. Attempting an
ordering comparison between a string and an integer, as in the following, will result in a
TypeError when using Python 3.
>>> '1000' > 1000
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: str() > int()
>>>
Within the report method, we next sort the self.error_codes dictionary values via
the built-in sorted function. We take the highest number in that list, via a subscript of -1,
and convert it into a string. We then take the length of that string. We'll use this value for a
formatting modifier later in this method.
The next section loops through all of the response codes we've run into thus far and prints
them out to the screen, though it does that via the string format method.
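The width calculation and the nested replacement field can be tried on their own. The counts in this sketch are invented for the illustration.

```python
error_codes = {'200': 1523, '404': 17, '500': 3}

# Number of characters needed to print the largest count.
longest = len(str(sorted(error_codes.values())[-1]))

# The inner {2} is filled in first, so the specification for the
# second argument becomes '>4' here: right-align in four columns.
lines = ['{0}: {1:>{2}}'.format(code, count, longest)
         for code, count in sorted(error_codes.items())]

for line in lines:
    print(line)
```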
The last thing we do within the report method is display a summary of the error code data
we've collected while processing a logfile. Here, we're also using the format method rather
than traditional percent-sign formatting.
Finally, within our main section, we added an instance of ErrorCodeHandler to the
call_chain list that is passed into LogProcessor's __init__ method. This ensures that it will
be included during logfile processing.
Making use of conversion specifiers
As we mentioned earlier, conversion markup is enclosed in curly braces as opposed to the
percent prefix as used in standard string formatting. In addition to the replacement value,
though, the curly braces also contain all of the same formatting information (with some new
options) that the standard methods support.
Let's take another look at that graphical breakdown of a format string, but this time we'll use
the newer format syntax.
Noce how the replacement value name or posion is separated from the formang
arguments by a colon. The colon itself holds no other special meaning. This example does
not include all possible combinaons. When using the format method, the # opon is only
valid for integers. Likewise, the precision argument is only valid for oang point values.
Fill
The fill argument allows us to specify which character we should use to pad our string if
the actual width of the replacement value is less than the minimum width. Any character can
be used other than a closing brace, which would signify the end of the format specification.
Align
This signies how text should be aligned in relaon to the ll characters if actual width is less
than minimum width.
Flag   Usage
<      The field is left-aligned. This is the default alignment for most values, such
       as strings.
>      The field is right-aligned. This is the default alignment for numbers.
=      This forces the padding to be placed between a sign character and the
       value. This is only valid for numeric types.
^      Forces the value to be centered within the available spacing.
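Here is a short sketch of the fill and align options together. Note that a fill character is only recognized when an alignment flag follows it; the variable names are illustrative.

```python
left = '{0:<8}|'.format('ab')       # explicit left alignment
right = '{0:>8}|'.format('ab')      # right alignment
center = '{0:^8}|'.format('ab')     # centered
filled = '{0:*^8}'.format('ab')     # '*' is the fill character
signed = '{0:0=+8d}'.format(42)     # zeros go between sign and digits

# With no alignment flag, strings go left and numbers go right.
default_text = '{0:8}|'.format('ab')
default_number = '{0:8}|'.format(42)

for value in (left, right, center, filled, signed):
    print(value)
```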
Sign
This eld is valid only for numeric types and is used to determine how the sign informaon is
displayed, if at all.
Flag      Usage
+         Sign data is always displayed.
-         Python should only display the sign for negative numbers. This is the
          default behavior.
<space>   Leading space should be used on positive, while a sign should be used when
          the value is negative.
Width
This species the minimum width of the eld. If the actual value is shorter, the result will be
padded according to the alignment rules using the ll character.
Precision
This species the oang-point precision. As menoned previously, this is only valid for
oang-point values. Floang-point numbers are rounded and not simply truncated.
Type
The type eld is the last argument in the format specicaon and details how the value
should be displayed. Unlike standard percent-formang, this is no longer a required eld. If
not specied, a default is used based on the value's type.
There are quite a few new type ags introduced with the format method and some of
the implementaon details are rather complex. For a complete introducon to type elds
for use with the format funcon, see http://docs.python.org/library/string.
html#format-string-syntax.
The following table contains a survey of the available values.
Flag   Usage
s      String output. This is the default for strings and class instances.
b      Binary output.
d      Decimal output. This is the default for integers.
o      Octal format.
x      Hexadecimal format using lowercase letters.
X      Hexadecimal format using uppercase letters.
n      Same as the d flag, though it uses locale information to display correctly
       based on your preferences.
e      Exponent (Scientific) Notation using lowercase letters.
E      Exponent (Scientific) Notation using uppercase letters.
f      Fixed point.
F      Same as the 'f' type.
g      General format. There is a collection of rules regarding display for this type.
       See the Python documentation for details. This is the default for
       floating-point values.
G      Uppercase version of 'g'.
%      Percentage. Multiplies a number by 100 and displays in 'f' format, followed
       by a percent sign.
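A handful of these type flags are shown in the sketch below; the dictionary keys simply name the flag being demonstrated.

```python
samples = {
    'b': '{0:b}'.format(10),         # binary
    'o': '{0:o}'.format(8),          # octal
    'x': '{0:x}'.format(255),        # lowercase hexadecimal
    'X': '{0:X}'.format(255),        # uppercase hexadecimal
    'e': '{0:e}'.format(1234.5),     # exponent notation
    'f': '{0:.2f}'.format(1234.5),   # fixed point, two places
    '%': '{0:.1%}'.format(0.256),    # multiplied by 100, then 'f' format
    'none': '{0}'.format(42),        # no type given: decimal for integers
}

for flag in sorted(samples):
    print('{0:>4}: {1}'.format(flag, samples[flag]))
```

Without an explicit precision, the % type uses six decimal places, which is why the failure rate in our report prints with so many digits.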
Have a go hero – updating the file size check to use the format method
Now that you've got a crash course in Python string-formatting methods, you should
be able to work with both approaches. Take a few minutes and back up to update the
MaxSizeHandler class to use format methods rather than percent syntax. However, you'll
probably want to create a temporary copy.
You may find the Python documentation helpful in addition to the tables included in this
chapter. Formatting markup seems to be one area that many developers never really seem to
fully grasp. Take a moment and stand out from the crowd!
Creating templates
It's oen said within the Python community that every programmer, at some point,
implements his or her own Python-based template language. The good news, then, is that
we don't have to as so many of them already exist!
There's a large collecon of very powerful third-party templang libraries available for Python.
We'll cover them in more detail (and even write our own) in Chapter 7, Creang Templates.
Python includes an elementary templang class within the string module. The Template
class doesn't provide any advanced features such as code execuon or inherited blocks. In
general, it's a simple way to replace tokens within a text le with Python values.
Time for action – displaying warnings on malformed lines
Up unl now, we've assumed that all of our lines processed are very well-formed and will
never generate excepons. In order to illustrate the use of the Template class, we'll x that
here. Under normal circumstances, it would probably be preferred to simply print an error
just quietly pass by incorrectly formaed lines.
1. Using logscan-f.py as a starng place, create logscan-g.py. We'll use this as
our starng point.
2. At the top of the le, add import string to the list of modules imported.
3. Immediately aer the docstring for LogProcessor, add the following code:
    tmpl = string.Template(
        'line $line is malformed, raised $exc error: $error')
4. Replace the parse method in LogProcessor with the following new method:
    def parse(self, handle):
        """
        Parses the logfile.

        Returns a dictionary composed of log entry values,
        for easy data summation.
        """
        line_count = 0
        for line in handle:
            line_count += 1
            try:
                fields = self.split(line)
            except Exception, e:
                print >>sys.stderr, self.tmpl.substitute(
                    line=line_count,
                    exc=e.__class__.__name__,
                    error=e)
                continue
            for handler in self._call_chain:
                getattr(handler, 'process')(fields)
        return line_count
5. Finally, copy example2.log over and create example3.log. Insert a :q! on line
eight, followed by a newline. This should be the only text on that line.
6. Running the example should produce the following output:
(text_processing)$ cat example3.log | python logscan-g.py -s 30
What just happened?
Aer imporng the required string module, we created a Template object within the
LogProcessor class denion. By adding it where we did, we ensured that it's only created
once. If we had placed it within a method, it would be created each me that specic
method was called.
Next, we updated our parse method to catch any excepons that rise up from within
split. If we happen to catch an error, we populate our template with values describing the
excepon and print the rendered result to the screen via standard error.
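The class-level template described above can be sketched in isolation. This is only an illustration: LineReporter and its describe method are hypothetical stand-ins for LogProcessor, and the code avoids the print statement so that it also runs on a current Python 3 interpreter.

```python
import string


class LineReporter(object):
    """Hypothetical stand-in for LogProcessor; only the template matters."""

    # Built once, when the class body is executed, and shared by
    # every instance of the class.
    tmpl = string.Template(
        'line $line is malformed, raised $exc error: $error')

    def describe(self, line_number, exc):
        # substitute() converts each value to a string while rendering.
        return self.tmpl.substitute(
            line=line_number,
            exc=exc.__class__.__name__,
            error=exc)


first, second = LineReporter(), LineReporter()
message = first.describe(8, ValueError('not enough fields'))
# Both instances see the very same Template object.
shared = first.tmpl is second.tmpl
```

Because tmpl lives on the class, creating more LineReporter objects never builds another Template.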
Template syntax
When we create an instance of Template, we pass in the template string we'll use. The
syntax is fairly straightforward. If we want a value to be replaced, we simply precede its name
with a dollar sign. Two $ characters adjacent to each other act as an escape; they are replaced
with a single $ character in the rendered text.
If the identifier we intend to replace is embedded in a longer string, we can surround it with
braces. A small example may clarify this concept.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> from string import Template
>>> template = Template('${name} has $$${amount} in his ${loc}et')
>>>
Rendering a template
Once we've created a template object, we use it to render a new string by calling either its
substitute or safe_substitute methods.
>>> template.substitute(name='Bill Gates', amount=35000000000,
...     loc='wall')
'Bill Gates has $35000000000 in his wallet'
>>> template.substitute(name='Joe', amount=10, loc='blank')
'Joe has $10 in his blanket'
>>>
If a template variable is left off, or if a standalone dollar sign is encountered, the
substitute method raises an error (a KeyError for a missing variable, a ValueError for a
stray dollar sign). If the safe_substitute alternative is used, these errors are ignored and the
offending placeholder is left in the output as it was written. Notice the difference between
the two approaches below:
>>> template.substitute(name='Joe', amount=10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/string.py", line 170, in substitute
return self.pattern.sub(convert, self.template)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/
python2.6/string.py", line 160, in convert
val = mapping[named]
KeyError: 'loc'
>>> template.safe_substitute(name='Joe', amount=10)
'Joe has $10 in his ${loc}et'
>>>
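Both methods also accept a dictionary as their first argument, which is handy when the values already live in a mapping. The following is a small sketch of that calling style, reusing the template from above; the variable names are chosen only for illustration.

```python
from string import Template

template = Template('${name} has $$${amount} in his ${loc}et')

# A dictionary can be passed positionally instead of keyword arguments.
values = {'name': 'Joe', 'amount': 10}

try:
    template.substitute(values)
    missing_key = None
except KeyError as err:
    # substitute() insists on every placeholder being supplied.
    missing_key = err.args[0]

# safe_substitute() leaves unknown placeholders untouched.
partial = template.safe_substitute(values)

# Keyword arguments are combined with the mapping and win on conflicts.
full = template.substitute(values, loc='wall')
```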
Pop Quiz – string formatting
1. In what situation might you elect to use the string.Template class versus
traditional string formatting?
2. What method might you use to pass a dictionary of values into the format method?
3. We know that expressions such as "1" + 2 are invalid. What do you think would be
the result of "1" + "2"?
Calling string object methods
In addition to providing powerful creation and formatting mechanisms, Python string objects
also provide a collection of useful methods. We've already seen a few of them in our earlier
examples. For example, we called line.split() within our LogProcessor class in order
to separate a text line into pieces, delimited by whitespace.
All of these methods are present on both standard byte strings and Unicode
objects. As a general rule, Unicode objects return Unicode while byte string
methods return byte strings.
Time for action – simple manipulation with string methods
In this example, we'll extend the little employee data-gathering script presented earlier in the
chapter. The goal is to illustrate the use of some of the string object methods.
1. Create a new file and name it string_definitions-b.py.
2. Enter the following code:
import sys
class BadEmployeeFormat(Exception):
    """Badly formatted employee name"""
    def __init__(self, reason, name):
        Exception.__init__(self, reason)
        self.name = name
def get_employee():
    """
    Retrieve user information.
    This method simply prompts the user for
    an employee's name and his current job
    title.
    """
    employee = raw_input('Employee Name: ')
    role = raw_input("Employee's Role: ")
    employee, role = employee.strip(), role.strip()
    # Make sure we have a full name
    if not employee.count(' '):
        raise BadEmployeeFormat('Full Name Required '
            'for records database.', employee)
    return {'name': employee, 'role': role}
if __name__ == '__main__':
    employees = []
    failed_entries = []
    print 'Enter your employees, EOF to Exit...'
    while True:
        try:
            employees.append(get_employee())
        except EOFError:
            print
            print "Employee Dump"
            for number, employee in enumerate(employees):
                print 'Emp #%d: %s, %s' % (number+1,
                    employee['name'], employee['role'].title())
            # The backslash continues the statement onto the next line.
            print 'The following entries failed: ' + \
                ', '.join(failed_entries)
            print u'\N{Copyright Sign}2010, SuperCompany, Inc.'
            sys.exit(0)
        except BadEmployeeFormat, e:
            failed_entries.append(e.name)
            err_msg = 'Error: ' + str(e)
            print >>sys.stderr, err_msg.center(len(err_msg)+20,
                '*')
3. Run this example from the command line. If you entered it correctly then you should
see output similar to the following:
(text_processing)$ python string_definitions-b.py
What just happened?
There's not a whole lot extra going on in this new example. We've simply cleaned up our
data a little bit more and taken the liberty of notifying the user which employees were not
successfully entered.
The first thing you'll notice is that we updated our BadEmployeeFormat exception to
take an additional argument, the employee name. We do this so we can append the failed
employee's information to a list within our main section.
The next update you'll run into is the employee, role = employee.strip(), role.strip()
line. Each string (employee, role) might have white space on either end. Calling
the strip method trims the string down and removes that spacing. If we wanted to, we could
have passed additional characters into strip and it would have removed those as well:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'ABC123DEF'.strip('ABCDEF')
'123'
>>>
The strip method removes characters from both ends of the source string for as long as they
appear in the argument string. Characters in the middle of the string are left alone.
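A short sketch may make the edge-only behavior clearer. It also shows lstrip and rstrip, the one-sided relatives of strip that appear in the dir('') listing later in this chapter; the sample strings are made up for illustration.

```python
raw = '  Alexander Pushkin \n'

trimmed = raw.strip()       # whitespace removed from both ends
left_only = raw.lstrip()    # only leading whitespace removed
right_only = raw.rstrip()   # only trailing whitespace removed

# The argument is a set of characters, not a prefix or suffix to match.
# Stripping stops at the first character that is not in the set, so the
# 'A' and 'B' in the middle survive.
inner_kept = 'ABC1A2B3DEF'.strip('ABCDEF')   # '1A2B3'
```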
We've updated our check for a space to simply scan for a single space character rather than
using our regular expression. The downside here, though, is that this check will pass even if
data was entered incorrectly. Consider 'AlexanderPushkin', for example.
In the main section, we've added a failed_entries list. Whenever we catch a
BadEmployeeFormat exception, we append the name of the employee to this list. When
we receive our EOFError, we join this list via ', '.join(failed_entries). Note that
in Python, join is a method of a string object and not a method of a list or an array data
structure.
Now that we've seen some of them put to use, let's take a closer look at some of the
methods available on string and Unicode objects. However, this isn't a complete survey.
For a detailed description of all methods available on Python string objects, see the Python
documentation.
Aligning text
There are four methods available on string objects that allow you to manage alignment
and justification. Those methods are center, ljust, rjust, and zfill. We've seen
the center method used previously. The ljust and rjust methods work the same way, but
pad only on the right or only on the left with the supplied padding character.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'abc'.rjust(5, '*')
'**abc'
>>> 'abc'.ljust(5, '*')
'abc**'
>>> 'abc'.center(5, '*')
'*abc*'
>>>
The zfill method adds zeros to the left of the string object, up to the passed-in minimum
width argument.
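No zfill example appears above, so here is a brief sketch. The values are arbitrary; the point is that the width is a minimum, and that a leading sign stays in front of the padding.

```python
padded = '42'.zfill(5)              # '00042'
unchanged = '12345'.zfill(3)        # already wider than 3, so '12345'
signed = '-7'.zfill(4)              # zeros go after the sign: '-007'

# For unsigned text, rjust with a '0' pad character gives the same result.
same_as_rjust = '42'.rjust(5, '0')  # '00042'
```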
Detecting character classes
These methods correspond to a set of standard C character identification methods. However,
unlike their C equivalents, it is possible to test all values of a specific string and not just a
single character.
These methods include isalnum, isalpha, isdigit, isspace, istitle, isupper, and
islower. These methods all test the entire string value; if any one character doesn't fit the
bill, these methods simply return False.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> '1'.isdigit()
True
>>> '1f'.isdigit()
False
>>> 'Back to the Future'.istitle()
False
>>> 'Back To The Future'.istitle()
True
>>> 'abc123'.isalnum()
True
>>>
The one method here that might not be clear up front is the istitle method. This returns
True if all words within a string have their first letter capitalized and their remaining
letters in lowercase.
Casing
String objects contain four methods for updating capitalization: title, capitalize,
upper, and lower. Both the upper and lower methods change casing for an entire string.
The capitalize and title methods are slightly different. Have a look at them in action:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> position = 'VP of marketing mumbo jumbo'
>>> position.title()
'Vp Of Marketing Mumbo Jumbo'
>>> city = 'buffalo'
>>> city.capitalize()
'Buffalo'
>>>
Noce how the title method returns the string in tle case while the capitalize
method simply capitalizes the rst character of the string.
Searching strings
There are a number of methods associated with string objects that help with searching
and comparison. To check for general equality, simply use the double equal sign comparison
operator.
The count, find, index, replace, rfind, rindex, startswith, and endswith
methods all scan a string for the occurrence of a substring. Additionally, it's possible to use
the in keyword to test for a substring's occurrence within a larger string.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'one' in 'Bone Dry'
True
>>> 'one' == 'one'
True
>>>
We've already introduced you to the count method, so we'll skip over that here. find
and index are both similar. When called, both return the offset into a string in which the
substring is found. The difference, however, is how they'll respond in the event that the test
string isn't present. The find method will simply return a -1. The index method will raise a
ValueError.
Both startswith and endswith test to see whether their respective end is made up of the
test string passed in.
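The difference between find and index, along with the two end tests, can be sketched as follows. The request line is an invented sample in the spirit of the log data used earlier.

```python
line = 'GET /index.html HTTP/1.1'

found_at = line.find('/index')   # offset of the first match: 4
not_found = line.find('POST')    # no match, so -1

try:
    line.index('POST')
    index_raised = False
except ValueError:
    # index() signals a missing substring with an exception instead.
    index_raised = True

is_get = line.startswith('GET')
is_http11 = line.endswith('HTTP/1.1')
```

Which of the two to use is mostly a matter of taste: find suits a quick conditional, while index suits code that treats a missing substring as an error.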
The replace method allows you to replace a given substring within a larger string with an
optional upper bound on the number of times the operation takes place. In the following
example, notice how only one of the 'string-a' values is replaced:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'trout salmon turkey perch flounder'.replace('turkey', 'shark')
'trout salmon shark perch flounder'
>>> 'string-a string-b string-a'.replace('string-a', 'string', 1)
'string string-b string-a'
>>>
Finally, rfind and rindex are identical to find and index, except that they'll search from
the end of the string rather than the beginning. The offset they return is still counted
from the start of the string.
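A common use of rfind is locating the last separator in a path. The sketch below assumes a made-up log path.

```python
path = '/var/log/apache2/access.log'

first_slash = path.find('/')      # leftmost match: 0
last_slash = path.rfind('/')      # rightmost match: 16

# Everything after the last slash is the file name.
filename = path[last_slash + 1:]

try:
    path.rindex('?')
    rindex_raised = False
except ValueError:
    # Like index(), rindex() raises when nothing matches.
    rindex_raised = True
```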
Dealing with lists of strings
There are four methods for dealing with string parts – join, split, partition, and
rpartition. We've already seen them to some extent, but let's take a closer look as they're
commonly-used string methods.
The split method takes a delimiter and an optional maximum number of splits. It will return a list
of strings as broken up by the delimiter. If the separator is not found then a single element
list is returned that contains the original string text. The optional maximum limits
how many times the split takes place. An example might help solidify its usage:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> string = 'cheese,mouse,cat,dog'
>>> string.split(',')
['cheese', 'mouse', 'cat', 'dog']
>>> string.split('banana')
['cheese,mouse,cat,dog']
>>> string.split(',', 2)
['cheese', 'mouse', 'cat,dog']
>>>
We've already covered the join method; it builds a single string from a list of elements, placing
the string it is called on between them. It is common to join around an empty string in order
to simply concatenate a larger list of values.
Finally, we have partition and rpartition. These methods act much like the split
method, except that they'll return three values - the part before a separator, the separator
itself, and finally the part after the separator.
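Neither join nor partition is demonstrated above, so here is a brief sketch using the same word list as the split example.

```python
words = ['cheese', 'mouse', 'cat', 'dog']

csv_line = ','.join(words)       # 'cheese,mouse,cat,dog'
run_together = ''.join(words)    # joined around an empty string

# partition() splits at the first separator, rpartition() at the last.
head = csv_line.partition(',')
tail = csv_line.rpartition(',')

# When the separator is absent, partition() still returns three values.
missing = csv_line.partition(';')
```

Because partition always returns exactly three values, the result can be unpacked directly without checking its length first.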
Treating strings as sequences
Remember that Python strings can be interpreted as sequences of characters as well. This
means that all common sequence operaons will also work on a string. It's possible to iterate
through a string or break it into pieces using standard slicing syntax.
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> 'abcdefg'[2-5]
'e'
>>> 'abcdefg'[2:5]
'cde'
>>> for i in 'abcdefg'[2:5]:
...     print 'Letter %c' % i
...
Letter c
Letter d
Letter e
>>>
This works for both byte strings as well as Unicode strings as Python deals with the
underlying method calls at a character-level, and not a byte-level.
Have a go hero – dive into the string object
We've covered the majority of the string methods here as well as the most common usage
scenarios, but we've not touched on all of them. Additionally, there are options we've not
touched on.
Open a Python prompt and have a look at all of the methods and attributes available on a
standard string object.
>>> dir('')
['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
'__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__',
'__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__',
'__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__',
'__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'_formatter_field_name_split', '_formatter_parser', 'capitalize',
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs',
'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower',
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip',
'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition',
'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip',
'swapcase', 'title', 'translate', 'upper', 'zfill']
Using the output of dir, as well as the Python documentation (either online or via pydoc),
spend some time and familiarize yourself with the available functions. You'll be glad you did!
Summary
We covered a lot of detail in this chapter. Python's string services provide a clean mechanism
for dealing with text data at the character-level. You should now be familiar with built-in
templating, formatting, and core string manipulation. These techniques are valid and should
be considered before many more advanced approaches are evaluated.
Next, we'll leave the string basics behind and step into the standard library for a look at how
to handle some of the more commonly encountered text formats. Python makes processing
standard formats easy!
4
Text Processing Using the Standard Library
In addition to its powerful built-in string manipulation abilities, Python
also ships with an array of standard library modules designed to parse and
manipulate common standardized text formats.
Using the standard library, it's possible to parse INI files, read CSV and related
files, and access common data formats used on the web, such as JSON. In this
chapter, we'll take a look at some of these modules and look at how they can
help us process text data a layer above the string management foundation.
We'll take a closer look at the following:
CSV, or Comma Separated Values. Python provides a rich mechanism for accessing
and extracting data from this format, which is commonly used as a spreadsheet
stand-in.
Parse and rely on INI files. We'll look at the standard Configuration File parsing
classes for our own purposes and as a means to read Microsoft Windows
configurations.
We'll parse JSON data as it's often used as a data delivery mechanism on the
Internet.
Learn how to better organize our log processing application via modules and
packages in order to make it more extensible going forward.
Reading CSV data
Comma separated values, or CSV, is a generic term that refers to columnar data, which is
simply separated by commas. In fact, in spite of its name, the delimiter may actually be a
different character. Other common delimiters include a tab, a space, or a semi-colon.
The major drawback to CSV data is that there is no standardization. In some circumstances,
data elements will be quoted. In other circumstances, the writing application may include
column or row headers along with the CSV data. Furthermore, consider the effects of the
various line-endings used by different operating systems.
Clearly, it's not just a matter of splitting a comma-delimited line. Python's CSV support aims
to work around the formatting variations and provide a standardized interface.
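To see why a plain split is not enough, consider a quoted field that itself contains a comma. The line below is an invented sample; csv.reader accepts any iterable of lines, so a one-element list is enough for this sketch.

```python
import csv

# The revenue figure is quoted because it contains a comma.
line = '3-May-10,"1,289.41",899.54'

# A plain split breaks the quoted value into two pieces.
naive = line.split(',')

# The csv module honors the quoting and returns three fields.
parsed = next(csv.reader([line]))
```

Here naive holds four items, two of them with stray quote characters, while parsed holds the three intended fields.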
Time for action – processing Excel formats
The csv module provides support for formatting differences by allowing the use of different
dialects. Dialects provide details such as which delimiter to use and how to address data
element quoting.
In this example, we'll create an Excel spreadsheet and save it as a CSV document. We can
then open that via Python and access all of the fields directly.
1. First, we'll need to create an Excel spreadsheet and build an initial dataset. We'll use
some mock financial data. Build up a spreadsheet that includes the following data:
2. Now, from the File menu, select Save As. The Save As dialog contains a Format
drop-down. From this dropdown, select CSV (Comma Delimited). Name the
file Workbook1.csv. Note that if you do not have Excel, these sample files are
downloadable from the Packt Publishing FTP Site.
3. Create a new Python file and name it csv_reader.py. Enter the following code:
import csv
import sys
from optparse import OptionParser
def calculate_profit(day):
    return float(day['Revenue']) - float(day['Cost'])
if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-f', '--file', help="CSV Data File")
    opts, args = parser.parse_args()
    if not opts.file:
        parser.error('File name is required')
    # Create a dict reader from an open file
    # handle and iterate through rows.
    reader = csv.DictReader(open(opts.file, 'rU'))
    for day in reader:
        print '%10s: %10.2f' % \
            (day['Date'], calculate_profit(day))
4. Running the preceding code should produce the following output, if you've copied it
correctly.
(text_processing)$ python csv_reader.py --file=./Workbook1.csv
What just happened?
Let's walk through the code here. By now, you should be familiar with both the __name__
== '__main__' section as well as the option parser. We won't cover that boilerplate stuff
any longer.
The first interesting line is reader = csv.DictReader(open(opts.file, 'rU')).
There are two things worth pointing out on this line alone. First, we're opening the file
using Universal Newline support. This is because Excel will save the CSV file according to our
platform's convention. We want Python just to hide all of that for us here.
Secondly, we're creating an instance of csv.DictReader. The basic approach to accessing
CSV data is via the csv.reader function. However, this requires us to access each row via an
array index. The csv.DictReader class uses the first row in the CSV file (by default) as the
dictionary keys. This makes it much easier to access data by name.
If we had used the standard reader, we would have had to parse our data as in the following
small example snippet:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> import csv
>>> reader = csv.reader(open('Workbook1.csv', 'rU'))
>>> for row in reader:
...     print 'Revenue on ' + row[0] + ': ' + row[1]
...
Revenue on Date: Revenue
Revenue on 3-May-10: 1289.41
Revenue on 4-May-10: 951.89
Revenue on 6-May-10: 2812.23
Revenue on 7-May-10: 554.34
Revenue on 8-May-10: 2419.62
Revenue on 9-May-10: 999.44
Revenue on 10-May-10: 514.78
>>>
As you can see, the dictionary approach makes it much easier to handle the processed data.
Next, we iterate through each row in the dataset and print out a profit summary. If you
take a look at the calculate_profit function, you'll see how we do this. As mentioned
before, Python is not only dynamically-typed, but also strongly-typed once a value has been
created. We have to explicitly create new floating-point types based on the text value in
order to perform our subtraction operation.
Finally, our print statement uses classic percent-formatting and adds a little bit of padding
in order to keep everything easy to read.
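The conversion and the padded formatting can be sketched without a CSV file. The dictionary below mimics one row as DictReader would deliver it, with every value arriving as text; the figures are the first row of the sample data.

```python
# One row as DictReader would hand it over: all values are strings.
day = {'Date': '3-May-10', 'Revenue': '1289.41', 'Cost': '899.54'}

try:
    day['Revenue'] - day['Cost']
    subtraction_allowed = True
except TypeError:
    # Strings do not support subtraction; no implicit conversion occurs.
    subtraction_allowed = False

# Explicit conversion makes the arithmetic possible.
profit = float(day['Revenue']) - float(day['Cost'])

# Both columns are right-aligned in ten characters; the number is
# rounded to two decimal places.
summary = '%10s: %10.2f' % (day['Date'], profit)
```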
If you were paying attention, you'll remember we mentioned that we need a dialect in order
to process a CSV file. What gives? We didn't specify one, did we? Well, no. Python defaults
to the Excel dialect, which is exactly what we're using in our example.
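The registered dialects, and the settings behind the default one, can be inspected from the csv module itself. This is a small sketch; the exact list of names may vary between Python versions.

```python
import csv

# Names of every dialect currently registered with the module.
names = csv.list_dialects()

# Look up the default dialect and examine a few of its settings.
excel = csv.get_dialect('excel')
delimiter = excel.delimiter    # ','
quotechar = excel.quotechar    # '"'
quotes_when_needed = excel.quoting == csv.QUOTE_MINIMAL
```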
If you're familiar with Excel, you're probably wondering why we used Python to calculate our
profit rather than letting Excel do it for us. After all, that's what a spreadsheet application is for!
Time for action – CSV and formulas
Let's run through an example and illustrate why we chose to calculate the values ourselves
rather than letting Excel do it.
1. First, open Excel again and add a new column. We're going to name it Profit. The
value of this column should be a simple formula, =(BX-CX), where 'X' is the row
number you're at. Repeat until your spreadsheet looks like this:
2. Now, like we did with our first example, save this as Workbook2.csv. You'll need
to accept any warnings that Excel gives you. This document is also available on the
Packt Publishing FTP site.
3. Using csv_reader.py as a starting point, create csv_reader-b.py and modify
the calculate_profit function to read as follows.
def calculate_profit(day):
    return float(day['Profit'])
4. Running the example using the new CSV input should produce the following results,
if you've entered the code correctly.
(text_processing)$ python csv_reader-b.py --file=Workbook2.csv
5. Now, open the Workbook2.csv file in a text editor and prepend a 1 to every value in the
Revenue column to increase net revenue by a visible amount. Save it as Workbook2a.csv.
The updated text file should look like this:
Date,Revenue,Cost,Profit,,
3-May-10,11289.41,899.54,389.87,,
4-May-10,1951.89,772.12,179.77,,
6-May-10,12812.23,749.9,2062.33,,
7-May-10,1554.34,442.91,111.43,,
8-May-10,12419.62,1754.23,665.39,,
9-May-10,1999.44,801.12,198.32,,
10-May-10,1514.78,332.21,182.57,,
6. Finally, let's run the application again, using this new source of input.
(text_processing)$ python csv_reader-b.py --file=Workbook2a.csv
What just happened?
There's not much new code here. We simply updated our calculate_profit function to
return the Profit dictionary key rather than perform the calculation. Pretty simple.
But, what happened? Why was the output the same for both runs? CSV data generated with
Excel (and probably all spreadsheet tools) does not contain formula information. Formula
results are calculated before the data is saved and the target cells receive that value.
The important thing to remember here is that if you're dealing with spreadsheet data,
you cannot rely on formula contents. If an input value to a formula changes outside of the
application, you'll need to perform that calculation yourself, within Python.
If you have a desire to read and manipulate native Excel files, the xlrd module
provides that functionality. It is available on the Python Package Index at
http://pypi.python.org/pypi/xlrd/0.7.1.
Reading non-Excel data
Not all CSV data is generated and written by Microsoft Excel. In fact, it's a fairly open and
flexible format and is used in a lot of other arenas as well. For example, many shopping-cart
applications and online-banking utilities allow end users to export data using this format as
almost all spreadsheet applications can read it.
In order to read a non-Excel format, we'll need to define our own CSV dialect, which tells
the parser what to expect as a delimiter, whether values are quoted, and a few other details
as well.
Time for action – processing custom CSV formats
In this example, we'll build a Dialect class that is responsible for interpreting our own
format. We'll use some alternate delimiters and some different processing settings. This is
the general approach you'll use when parsing your own format files.
We're going to process a UNIX style /etc/passwd file in this example. If you're not familiar
with the format, here's a small sample:
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
lp:x:7:7:lp:/var/spool/lpd:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
uucp:x:10:10:uucp:/var/spool/uucp:/bin/sh
proxy:x:13:13:proxy:/bin:/bin/sh
www-data:x:33:33:www-data:/var/www:/bin/sh
backup:x:34:34:backup:/var/backups:/bin/sh
Each line is a colon-separated list of values. We're only going to concern ourselves with the
first and the last values - the user's login name and the shell application that is executed
when a login occurs.
If you're following along using a Windows machine, you obviously do not have
an /etc/passwd file. An example file is available on the Packt Publishing FTP
site. These examples will use that file so they match up for all users.
1. Create a new file named csv_reader-c.py and enter the following code. Note
that this file is based on the csv_reader.py source we created earlier in the
chapter.
import csv
import sys
from optparse import OptionParser
if __name__ == '__main__':
    parser = OptionParser()
    parser.add_option('-f', '--file', help="CSV Data File")
    opts, args = parser.parse_args()
    if not opts.file:
        parser.error('File name is required')
    csv.register_dialect('passwd', delimiter=':',
        quoting=csv.QUOTE_NONE)
    dict_keys = ('login', 'pwd', 'uid', 'gid',
        'comment', 'home', 'shell')
    # Create a dict reader from an open file
    # handle and iterate through rows.
    reader = csv.DictReader(
        open(opts.file, 'rU'), fieldnames=dict_keys,
        dialect='passwd')
    for user in reader:
        print '%s logs in with %s' % \
            (user['login'], user['shell'])
2. Run the preceding example using an /etc/passwd file as input. We'll use the
example provided, but feel free to use your own if you wish.
(text_processing)$ python csv_reader-c.py --file=passwd
What just happened?
We made a few changes to our csv_reader.py code in order to manage UNIX /etc/passwd
files to illustrate how you would go about processing non-Excel compatible formats.
The first line we'll look at is the call to csv.register_dialect. In this call, we're adding
an entirely new CSV dialect, named passwd. We're setting the delimiter to a single colon and
configuring the system not to expect quotes. This is a convenient way to introduce a new
dialect, but it's not the only way.
If we had a reason to, we could have extended the Dialect class and passed that in instead
of a series of keyword arguments to csv.register_dialect. In most cases, though, you
will do it this way as a Dialect is simply a collection of processing options.
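The class-based alternative might look like the following sketch. PasswdDialect and the registered name passwd-class are hypothetical names chosen for this illustration; the attribute values mirror the keyword arguments used in csv_reader-c.py, plus a line terminator, which a Dialect subclass has to supply itself.

```python
import csv


class PasswdDialect(csv.Dialect):
    """Hypothetical dialect for colon-separated /etc/passwd lines."""
    delimiter = ':'
    quoting = csv.QUOTE_NONE
    # The base class leaves these unset, so a subclass defines them.
    lineterminator = '\n'
    doublequote = False
    skipinitialspace = False


# A Dialect subclass can be registered in place of keyword arguments.
csv.register_dialect('passwd-class', PasswdDialect)

sample = ['root:x:0:0:root:/root:/bin/bash']
rows = list(csv.reader(sample, dialect='passwd-class'))
```

A subclass earns its keep mainly when the same settings are shared across several modules; for a one-off script the keyword form is shorter.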
Next, we create a tuple of dictionary keys. By default, the DictReader uses the first line of a
CSV file as its set of dictionary keys. As a password file does not contain a header as our
Excel sheets did, we need to explicitly pass in the list of dictionary keys to use. They should
be in the order in which they'll be split.
Finally, we call csv.DictReader again, but this time, we specify the dialect name to use as
well as the dictionary keys in the tuple we just created. The remainder of this example simply
prints out a user and her corresponding login shell.
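The effect of passing fieldnames can be sketched with two in-memory lines instead of a file. In this illustration the formatting options are passed straight to DictReader rather than through a registered dialect.

```python
import csv

sample = ['root:x:0:0:root:/root:/bin/bash',
          'sync:x:4:65534:sync:/bin:/bin/sync']
dict_keys = ('login', 'pwd', 'uid', 'gid',
             'comment', 'home', 'shell')

# With explicit field names, every line is treated as data.
reader = csv.DictReader(sample, fieldnames=dict_keys,
                        delimiter=':', quoting=csv.QUOTE_NONE)
shells = dict((user['login'], user['shell']) for user in reader)

# Without them, the first line is consumed as the header row,
# so only one data row remains.
headerless = list(csv.DictReader(sample, delimiter=':',
                                 quoting=csv.QUOTE_NONE))
```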
Writing CSV data
We've looked at methods for parsing two different dialects of CSV: Excel formats and our
own custom format. Let's wrap up our discussion on CSV by looking at how we would write
out a new file.
Time for action – creating a spreadsheet of UNIX users
We're going to read our UNIX password database using the code we've already developed,
and transform it into an Excel-friendly CSV dialect. We should then be able to open our list of
users in spreadsheet format if we choose.
1. Create a new file and name it csv_translate.py.
2. Enter the following code:
import csv
import sys
from optparse import OptionParser
parser = OptionParser()
parser.add_option('-f', '--file', help="CSV Data File")
opts, args = parser.parse_args()
if not opts.file:
    parser.error('File name is required')
csv.register_dialect('passwd', delimiter=':',
    quoting=csv.QUOTE_NONE)
dict_keys = ('login', 'pwd', 'uid', 'gid',
    'comment', 'home', 'shell')
print ','.join([i.title() for i in dict_keys])
writer = csv.DictWriter(sys.stdout, dict_keys)
# Create a dict reader from an open file
# handle and iterate through rows.
reader = csv.DictReader(
    open(opts.file, 'rU'), fieldnames=dict_keys, dialect='passwd')
writer.writerows(reader)
3. Now, run the example using the supplied passwd file as your input. Redirect the
output to a file named passwd.csv.
(text_processing)$ python csv_translate.py --file=passwd > passwd.csv
4. The contents of the newly created CSV file should be exactly as follows.
5. Finally, open the new CSV file in Microsoft Excel or OpenOffice. The rendered
spreadsheet should resemble the following screenshot:
What just happened?
Using two different dialects, we read from our password file and wrote Excel-friendly CSV to
our standard output channel.
Let's skip over the boilerplate code again and look at what makes this example actually
work. First, the two lines that appear directly under the dict_keys assignment line. We're
doing two important things here. First, we translate the keys we've been using into title
case via a list comprehension and join them with a comma. Both of these steps use string
object methods covered in the previous chapter. In the same line, we then print this newly
generated value. This serves as the top line of the new CSV.
The next line creates a writer object, which simply takes a file-like object and a list of
dictionary keys. Note that the list of keys is required here as Python's dictionaries are
unordered. This tells the writer in which order to print the dictionary values. The actual write
logic executes much like the following small example:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits", or "license" for more information.
>>> dicts = [{'key1': 'value1', 'key2': 'value2'},
...          {'key1': 'value1', 'key2': 'value2'}]
>>> key_order = ('key2', 'key1')
>>> for d in dicts:
...     print ','.join([d[key] for key in key_order])
...
value2,value1
value2,value1
>>>
Finally, we call writer.writerows(reader) to read all of the data from the source
CSV and print it to the new destination. The writerows method of a DictWriter object
expects a sequence of dictionaries with the appropriate keys.
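The writer can be exercised without touching standard output by pointing it at an in-memory buffer. This is a sketch only: the two-column layout and sample rows are invented, the import fallback is there so that it runs on both Python 2 and Python 3, and the header row is sent through the writer rather than printed separately.

```python
import csv
try:
    from StringIO import StringIO   # Python 2
except ImportError:
    from io import StringIO         # Python 3

dict_keys = ('login', 'shell')
rows = [{'login': 'root', 'shell': '/bin/bash'},
        {'login': 'sync', 'shell': '/bin/sync'}]

buffer = StringIO()
writer = csv.DictWriter(buffer, dict_keys)

# Writing the header through the writer gives it the same quoting
# and line terminator as the data rows.
writer.writerow(dict((key, key.title()) for key in dict_keys))

# writerows() accepts any sequence of dictionaries with these keys.
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
```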
Pop Quiz – CSV handling
1. We've described two methods of creating new CSV dialects. What are they? In what
situations might you choose one over the other?
2. What's the drawback to simply using the split method of the string object for
parsing CSV data? Why isn't this approach reliable?
3. How are formulas executed once a spreadsheet document has been saved in a
text-only CSV format?
Have a go hero – detecting CSV dialects
One aspect of the CSV module we didn't cover here is the csv.Sniffer class. This class
attempts to build a new dialect based on a sample segment of a CSV file. You can read more
about the Sniffer class at http://docs.python.org/library/csv.html.
Given your knowledge of CSV files and how to process them, update the previous code to
automatically detect the CSV dialect in use given our example passwd file. If you're using a
UNIX system, try it on your own passwd file. Does it work? In which situations do you run
into issues?
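As a possible starting point, the sketch below runs the sniffer over a made-up, passwd-style sample. The sample lines are invented for illustration, and restricting the candidates with delimiters=':' is a simplification. Try removing it to see how the sniffer behaves without a hint.

```python
import csv

# Invented passwd-style sample; a real file will differ.
sample = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n"
    "nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin\n"
)

# sniff() returns a Dialect subclass built from the sample text.
dialect = csv.Sniffer().sniff(sample, delimiters=':')

# The detected dialect can be handed straight to a reader.
rows = list(csv.reader(sample.splitlines(), dialect))
```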
Modifying application configuration files
As you develop applications, you're going to want to allow your end users to make
runtime changes without updating and editing source code. This is where the need for a
configuration file comes in.
You've surely dealt with them before as you've set up and managed different computer
systems and applications. Perhaps you've had to edit one while defining a web server virtual
host, or while configuring drivers or boot preferences.
For the most part, applications choose their own configuration formats and implement their
own parsers, to some degree. Some files contain simple name-value pairs while others build
programming-language-like structures. Still others implement sections and segment values
even further.
Luckily, Python provides a full-featured configuration file management module for us, so we
don't have to worry about writing our own error-prone processing logic. As an added benefit,
Python's ConfigParser module also supports the generation of new configuration files
using Python data structures. This means we can easily write new files as well.
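To give a feel for the writing side, here is a minimal sketch that builds a configuration from a nested dictionary and serializes it. The settings dictionary and the in-memory buffer are assumptions for illustration. The import fallback is there because the module is named configparser on Python 3 and ConfigParser on Python 2.6.

```python
try:
    from configparser import ConfigParser  # Python 3 module name
except ImportError:
    # Python 2.6, the interpreter this chapter targets
    from ConfigParser import SafeConfigParser as ConfigParser

try:
    from StringIO import StringIO  # Python 2.6
except ImportError:
    from io import StringIO  # Python 3

# Illustrative settings; values must already be strings.
settings = {
    'main': {'input_source': 'example3.log'},
    'maxsize': {'threshold': '100'},
}

parser = ConfigParser()
for section in sorted(settings):
    parser.add_section(section)
    for name, value in sorted(settings[section].items()):
        parser.set(section, name, value)

# write() accepts any file-like object, including a real file opened for writing.
buf = StringIO()
parser.write(buf)
text = buf.getvalue()
```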
Time for action – adding basic configuration read support
In this example, we'll add some basic configuration file support into our ever-growing
log-processing application. There are a few values that we've been passing on the
command line that have become somewhat repetitive. Let's fix that.
1. First, create logscan-h.py, using logscan-g.py as your starting place.
2. Update the import statements at the top of the file to look like this:
import time
import string
import sys
from optparse import OptionParser
from collections import defaultdict
from ConfigParser import SafeConfigParser
from ConfigParser import ParsingError
3. Now, directly below the MaxSizeHandler class, add the following
configuration parser function. Note that this is not a part of the
MaxSizeHandler class and should not have a base indent.
def load_config():
    """
    Load configuration.

    Reads the name of the configuration
    file from sys.argv and loads our
    config from disk.
    """
    parser = OptionParser()
    parser.add_option('-c', '--config', dest='config',
                      help="Configuration File Path")
    opts, args = parser.parse_args()
    if not opts.config:
        parser.error('Configuration File Required')
    config_parser = SafeConfigParser()
    if not config_parser.read(opts.config):
        parser.error('Could not parse configuration')
    return config_parser
4. We need to update our __main__ section to take advantage of our new
configuration file support. Update your main section to read as follows:
if __name__ == '__main__':
    config = load_config()
    input_source = config.get('main', 'input_source')
    if input_source == '-':
        file_stream = sys.stdin
    else:
        try:
            file_stream = open(input_source)
        except IOError, e:
            print >>sys.stderr, str(e)
            sys.exit(-1)
    size_check = MaxSizeHandler(
        int(config.get(
            'maxsize', 'threshold')
        )
    )
    call_chain = []
    call_chain.append(size_check)
    call_chain.append(ErrorCodeHandler())
    processor = LogProcessor(call_chain)
    initial = time.time()
    line_count = processor.parse(file_stream)
    duration = time.time() - initial
    # Ask the processor to display the
    # individual reports.
    processor.report()
    if config.getboolean('display', 'show_footer'):
        # Print our internal statistics
        print "Report Complete!"
        print "Elapsed Time: %#.8f seconds" % duration
        print "Lines Processed: %d" % line_count
        # Parenthesize the conditional so the label is always printed.
        print "Avg. Duration per line: %#.16f seconds" % \
            ((duration / line_count) if line_count else 0)
5. The next thing to do is create a basic configuration file. Enter the following text into
a file named logscan.cfg:
[main]
# Input filename. This must be either a pathname or a simple
# dash (-), which signifies we'll use standard in.
input_source = example3.log
[maxsize]
# When we hit this threshold, we'll alert for maximum
# file size.
threshold = 100
[display]
# Whether we want to see the final footer calculations or
# not. Sometimes things like this just get in the way.
show_footer = no
6. Now, let's run the example using this configuration. If you entered everything
correctly, then your output should resemble the following:
(text_processing)$ python logscan-h.py --config=logscan.cfg
7. Finally, open up the configuration file and comment out the very last line. It should
begin with show_footer. Run the application again. You should see the following
output:
(text_processing)$ python logscan-h.py --config=logscan.cfg
What just happened?
We opened, scanned, processed, converted, and used elements of an ini-style configuration
file without having to deal with a single split or white space trim! Let's have a closer look at
how we set everything up.
First off, we updated our import statements to include the needed classes within the
ConfigParser module. In many cases, it's simpler to just import the ConfigParser
module itself rather than individual classes. We did it this way in order to save a bit of space
in the example text.
Next, we added a load_config function that is responsible for handling most of the actual
work. The first thing we do here is parse our command line for a single -c (or --config)
option, which is the location of our file. This option is required and we'll exit if it's not found
(more on that later).
Next, we instantiate a SafeConfigParser object and attempt to make it read the
file named by the command-line option. If the read doesn't succeed then we exit with a
rather generic error. We return the config_parser object after we have read our file.
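The check on the result of read relies on that method returning a list of the files it loaded. The following sketch, using scratch files in a temporary directory (all file names here are assumptions), shows the three outcomes worth knowing about:

```python
import os
import tempfile

try:
    from configparser import ConfigParser, ParsingError  # Python 3 names
except ImportError:
    # Python 2.6, the interpreter this chapter targets
    from ConfigParser import SafeConfigParser as ConfigParser
    from ConfigParser import ParsingError

workdir = tempfile.mkdtemp()  # scratch directory for the demo files

good_path = os.path.join(workdir, 'logscan.cfg')
with open(good_path, 'w') as handle:
    handle.write('[main]\ninput_source = example3.log\n')

# No section header at all, so the parser cannot accept this file.
bad_path = os.path.join(workdir, 'broken.cfg')
with open(bad_path, 'w') as handle:
    handle.write('input_source = example3.log\n')

# read() returns the list of files it managed to load.
loaded = ConfigParser().read(good_path)

# A path that does not exist is skipped silently: the list comes back empty.
missing = ConfigParser().read(os.path.join(workdir, 'missing.cfg'))

# A file that exists but is malformed raises instead of returning.
try:
    ConfigParser().read(bad_path)
    malformed_raised = False
except ParsingError:
    malformed_raised = True
```

An empty list therefore signals a missing or unreadable file, which is the case load_config reports. A malformed file surfaces as an exception such as the ParsingError imported at the top of the script.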
Skip now to our __main__ section. The very first thing we do here is process our
configuration file via the new function. The very next line shows the canonical way of
accessing data, via the get method. The get method takes a configuration file section as
well as a value name. This first access retrieves the input_source value, which is the name
of our logfile.
Next, we access the configuration object again when we create our MaxSizeHandler instance.
We pull the threshold size out and pass it to the constructor.
Notice that we have to explicitly convert our data to an integer type. Values read
via configuration files are typed as strings.
The final time we access our configuration object is near the bottom when we check the
display section for the show_footer value. If it's not True, we won't print our familiar footer
text. Here, we use a convenience method available to us, called getboolean. There are a
series of these methods available that automatically handle the data transformation for us.
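A short sketch of the difference between get and the typed helpers follows. It reuses the option names from logscan.cfg, while the temporary file is only an assumption made to keep the snippet self-contained.

```python
import os
import tempfile

try:
    from configparser import ConfigParser  # Python 3 module name
except ImportError:
    # Python 2.6, the interpreter this chapter targets
    from ConfigParser import SafeConfigParser as ConfigParser

config_text = """\
[maxsize]
threshold = 100
[display]
show_footer = no
"""

path = os.path.join(tempfile.mkdtemp(), 'logscan.cfg')
with open(path, 'w') as handle:
    handle.write(config_text)

parser = ConfigParser()
parser.read(path)

raw = parser.get('maxsize', 'threshold')                   # always a string
threshold = parser.getint('maxsize', 'threshold')          # converted to int
show_footer = parser.getboolean('display', 'show_footer')  # 'no' becomes False
```

Using getint here would replace the explicit int() call in the main section. There is a getfloat method as well.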
The last thing we did was to comment out a configuration line and run our application. In
doing so, you'll notice that it results in a fatal error! This probably isn't what we want most
of the time. It's possible to avoid this situation and set reasonable default values.
One nice thing about the SafeConfigParser classes is that they're also able
to read Microsoft Windows configuration files directly. However, none of the
ConfigParser classes support value-type prefixes found in extended version
INI syntax.
Using value interpolation
One really interesting feature of the ConfigParser module is that it supports configuration
value interpolation, or substitution, directly within the configuration file itself. This is a very
useful feature.
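In rough terms, a value may refer to another option in the same section using %(name)s, and the parser expands it when the value is read. The sketch below uses a placeholder directory (/tmp/logs is an assumption, not a path from the examples):

```python
import os
import tempfile

try:
    from configparser import ConfigParser  # Python 3 module name
except ImportError:
    # Python 2.6, the interpreter this chapter targets
    from ConfigParser import SafeConfigParser as ConfigParser

config_text = """\
[main]
dir = /tmp/logs
input_source = %(dir)s/www.log
"""

path = os.path.join(tempfile.mkdtemp(), 'logscan.cfg')
with open(path, 'w') as handle:
    handle.write(config_text)

parser = ConfigParser()
parser.read(path)

# get() expands %(dir)s using the dir option from the same section.
expanded = parser.get('main', 'input_source')

# raw=True returns the text exactly as written in the file.
literal = parser.get('main', 'input_source', raw=True)
```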
Time for action – relying on configuration value interpolation
For this example, we'll simply update our configuration file to take advantage of this feature.
There are no Python code changes necessary.
1. First, add a new configuration value to the [main] section of logscan.cfg.
The name of the value should be dir and the value should be the full path to the
directory that you're executing examples from.
[main]
# The main directory where we're running from (or, rather, where
# we store logfiles and write output to)
dir = /Users/jeff/Desktop/ptpbg/Chapters/Ch4
2. Next, you're going to update the input_source configuration option to reference
this full path.
# Input filename. This must be either a pathname or a simple
# dash (-), which signifies we'll use standard in.
input_source = %(dir)s/www.log
3. Finally, running this updated example should produce the same output as the
previous execution did.
(text_processing)$ python logscan-h.py --config=logscan.cfg
What just happened?
We included the value of a configuration option within a second one by using the
familiar percent syntax. This allows us to build complex congur