Learning Geospatial Analysis
with Python
Second Edition
An effective guide to geographic information system
and remote sensing analysis using Python 3
Joel Lawhead
Learning Geospatial Analysis with Python
Second Edition
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author nor Packt
Publishing, nor its dealers and distributors, will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition: December 2015
Production reference: 1211215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-242-9
Author
Joel Lawhead
Reviewers
Mark Cederholm
Truc Viet Le
John Maurer
Julia Wood
Commissioning Editor
Kartikey Pandey
Acquisition Editors
Kevin Colaco
Usha Iyer
Kartikey Pandey
Content Development Editor
Anish Sukumaran
Technical Editor
Manthan Raja
Copy Editor
Tasneem Fatehi
Project Coordinator
Izzat Contractor
Sas Editing
Mariammal Chettiyar
Jason Monteiro
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
About the Author
Joel Lawhead is a Project Management Institute-certified Project Management
Professional (PMP), a certified GIS Professional (GISP), and the Chief Information
Officer (CIO) of NVision Solutions Inc., an award-winning firm that specializes in
geospatial technology integration and sensor engineering.
Joel began using Python in 1997 and started combining it with geospatial software
development in 2000. He is the author of the first edition of Learning Geospatial
Analysis with Python and QGIS Python Programming Cookbook, both by Packt
Publishing. His Python cookbook recipes were featured in two editions of Python
Cookbook, O'Reilly Media. He is also the developer of the widely used, open source
Python Shapefile Library (PyShp). He maintains a geospatial technical blog and the Twitter feed @SpatialPython, which
discusses the use of the Python programming language in the geospatial industry.
In 2011, Joel reverse-engineered and published the undocumented shapefile spatial
indexing format and assisted fellow geospatial Python developer Marc Pfister
in reversing the algorithm used, allowing developers around the world to create
better-integrated and more robust geospatial applications.
Joel serves as the lead architect, project manager, and co-developer for geospatial
applications used by U.S. government agencies, including NASA, FEMA, NOAA,
and the U.S. Navy, as well as many commercial and non-profit organizations. In 2002,
he received the international Esri Special Achievement in GIS award for his work
on the Real-Time Emergency Action Coordination Tool (REACT) for emergency
management using geospatial analysis.
About the Reviewers
Mark Cederholm, GISP, has over 20 years of experience in developing GIS
applications using various Esri technologies, from ARC/INFO AML to ArcObjects
to ArcGIS Runtime and Web SDKs. He lives in Flagstaff, Arizona.
He was also a technical reviewer for Developing Mobile Web ArcGIS
Applications, Packt Publishing.
Truc Viet Le is currently a PhD candidate in information systems at
Singapore Management University. His research interests primarily involve
novel methods for modeling and predicting human mobility patterns and
trajectories, learning smart strategies for urban transportation, and traffic flow
prediction from fine-grained GPS and sensor network data. He uses R and Python
every day for his work, where he finds R superb for data manipulation and
visualization and Python an ideal environment for machine learning tasks. He is
also interested in applying data science to social work and international development.
When not behind the computer screen, he is an avid traveler, adventurer, and an
aspiring travel writer and photographer. His work portfolio and some of his writings
can be found on his personal website.
He spent a wonderful year at Carnegie Mellon University in Pittsburgh,
Pennsylvania, while pursuing his PhD. Previously, he obtained his bachelor's and
master's degrees in computer engineering and mathematical sciences from
Nanyang Technological University.
John Maurer is a programmer and data manager at the Pacific Islands Ocean
Observing System (PacIOOS) in Honolulu, Hawaii. He creates and configures
web interfaces and data services to provide access, visualization, and mapping of
oceanographic data from a variety of sources, including satellite remote sensing,
forecast models, GIS layers, and in situ observations (buoys, sensors, shark tracking,
and so on) throughout the insular Pacific. He obtained a graduate certificate in
remote sensing as well as a master's degree in geography from the University of
Colorado, Boulder, where he developed software to analyze ground-penetrating
radar (GPR) for snow accumulation measurements on the Greenland ice sheet. While
in Boulder, he worked with the National Snow and Ice Data Center (NSIDC) for
eight years, sparking his initial interest in Earth science and all things geospatial: an
unexpected but comfortable detour from his undergraduate degree in music, science,
and technology at Stanford University.
Julia Wood is currently a Geospatial Information Sciences (GIS) analyst who
spends her professional time completing projects as a contractor in the Washington
D.C. area. She graduated magna cum laude from the University of Mary Washington
in Fredericksburg, Virginia, in the spring of 2014 with a bachelor's degree in both
history and geography as well as a minor in GIS. Though her career is still in its early
stages, Julia has aspirations to keep growing her skill set, and working on this review
has certainly helped expand her professional experience; she hopes to continue
learning and eventually work toward a master's degree while still working full time.
In her non-work life, she enjoys reading, crafting, cooking, and exploring the big city,
one local restaurant at a time.
Reviewing this book for Packt Publishing was Julia's first professional reviewing
experience and she hopes that she can pursue similar endeavors in the future.
I'd like to thank my parents, John and Diana, for always encouraging
me to do well in my educational and professional endeavors; my
sister, Sarrina, and my brother, Jonathan, for offering support
and advice when I needed it; and my boyfriend, Max, and my cat,
Coco, for keeping me company while I conducted the reviews
for this book. A thank you to Packt for letting me be a part of this project.
Support les, eBooks, discount offers, and more
For support les and downloads related to your book, please visit
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub les available? You can upgrade to the eBook version at
and as a print book customer, you are entitled to a discount on the eBook copy. Get in
touch with us at for more details.
At, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt, you can use this to access
PacktLib today and view 9 entirely free books. Simply use your login credentials for
immediate access.
Table of Contents
Preface
Chapter 1: Learning Geospatial Analysis with Python
Geospatial analysis and our world
Beyond disasters
History of geospatial analysis
Geographic information systems
Remote sensing
Elevation data
Computer-aided drafting
Geospatial analysis and computer programming
Object-oriented programming for geospatial analysis
Importance of geospatial analysis
Geographic information system concepts
Thematic maps
Spatial databases
Spatial indexing
Metadata
Map projections
Rendering
Remote sensing concepts
Images as data
Remote sensing and color
Common vector GIS concepts
Data structures
Buffer
Dissolve
Generalize
Intersection
Merge
Point in polygon
Union
Join
Geospatial rules about polygons
Common raster data concepts
Band math
Change detection
Histogram
Feature extraction
Supervised classification
Unsupervised classification
Creating the simplest possible Python GIS
Getting started with Python
Building SimpleGIS
Step by step
Summary
Chapter 2: Geospatial Data
An overview of common data formats
Data structures
Common traits
Geolocation
Subject information
Spatial indexing
Indexing algorithms
Quadtree index
R-tree index
Grids
Overviews
Metadata
File structure
Vector data
Shapefiles
CAD files
Tag-based and markup-based formats
GeoJSON
Raster data
TIFF files
JPEG, GIF, BMP, and PNG
Compressed formats
ASCII Grids
World files
Point cloud data
Web services
Summary
Chapter 3: The Geospatial Technology Landscape
Data access
OGR
Computational geometry
The PROJ.4 projection library
JTS
PostGIS
Other spatially-enabled databases
Oracle Spatial and Graph
ArcSDE
Microsoft SQL Server
MySQL
SpatiaLite
Routing
Esri Network Analyst and Spatial Analyst
pgRouting
Desktop tools (including visualization)
Quantum GIS
OpenEV
uDig
gvSIG
OpenJUMP
Google Earth
NASA World Wind
ArcGIS
Metadata management
GeoNetwork
CatMDEdit
Summary
Chapter 4: Geospatial Python Toolbox
Installing third-party Python modules
Installing GDAL
Windows
Linux
Mac OS X
Python networking libraries for acquiring data
The Python urllib module
FTP
ZIP and TAR files
Python markup and tag-based parsers
The minidom module
ElementTree
Building XML
Well-known text (WKT)
Python JSON libraries
The json module
The geojson module
OGR
PyShp
dbfpy
Shapely
Fiona
GDAL
NumPy
PIL
PNGCanvas
GeoPandas
PyMySQL
PyFPDF
Spectral Python
Summary
Chapter 5: Python and Geographic Information Systems
Measuring distance
Pythagorean theorem
Haversine formula
Vincenty's formula
Calculating line direction
Coordinate conversion
Reprojection
Editing shapefiles
Accessing the shapefile
Reading shapefile attributes
Reading shapefile geometry
Changing a shapefile
Adding fields
Merging shapefiles
Merging shapefiles with dbfpy
Splitting shapefiles
Subsetting spatially
Performing selections
Point in polygon formula
Bounding box selections
Attribute selections
Creating images for visualization
Dot density calculations
Choropleth maps
Using spreadsheets
Using GPS data
Geocoding
Summary
Chapter 6: Python and Remote Sensing
Swapping image bands
Creating histograms
Performing a histogram stretch
Clipping images
Classifying images
Extracting features from images
Change detection
Summary
Chapter 7: Python and Elevation Data
ASCII Grid files
Reading grids
Writing grids
Creating a shaded relief
Creating elevation contours
Working with LIDAR
Creating a grid from LIDAR
Using PIL to visualize LIDAR
Creating a triangulated irregular network
Summary
Chapter 8: Advanced Geospatial Python Modeling
Creating a Normalized Difference Vegetative Index
Setting up the framework
Loading the data
Rasterizing the shapefile
Clipping the bands
Using the NDVI formula
Classifying the NDVI
Additional functions
Loading the NDVI
Preparing the NDVI
Creating classes
Creating a flood inundation model
The flood fill function
Making a flood
Creating a color hillshade
Least cost path analysis
Setting up the test grid
The simple A* algorithm
Generating the test path
Viewing the test output
The real-world example
Loading the grid
Defining the helper functions
The real-world A* algorithm
Generating a real-world path
Routing along streets
Geolocating photos
Summary
Chapter 9: Real-Time Data
Tracking vehicles
The NextBus agency list
The NextBus route list
NextBus vehicle locations
Mapping NextBus locations
Storm chasing
Reports from the field
Summary
Chapter 10: Putting It All Together
A typical GPS report
Working with
Stepping through the program
The initial setup
Working with utility functions
Parsing the GPX
Getting the bounding box
Downloading map and elevation images
Creating the hillshade
Creating maps
Measuring the elevation
Measuring the distance
Retrieving weather data
Summary
Index
Preface
The book starts with a background on geospatial analysis, offers a flow of the
techniques and technology used, and splits the field into its component specialty
areas, such as Geographic Information Systems (GIS), remote sensing, elevation
data, advanced modeling, and real-time data. The focus of the book is to lay a strong
foundation in using the powerful Python language and framework in order to
approach geospatial analysis effectively. In doing so, we'll focus on using pure
Python as well as certain Python tools and APIs, and on using generic algorithms.
Readers will be able to analyze the various forms in which geospatial data arrives,
learn real-time data tracking, and apply these techniques to interesting scenarios.
While many third-party geospatial libraries are used throughout the examples,
we have made a special effort to use pure Python, with no dependencies, whenever
possible. This focus on pure Python 3 examples is what sets this book apart from
nearly all other information in this field. This book may be the only geospatial book
on the market currently using Python 3. We will also go through some popular
libraries that weren't in the previous version of the book.
What this book covers
Chapter 1, Learning Geospatial Analysis with Python, introduces geospatial analysis
as a way of answering questions about our world. The differences between GIS
and remote sensing are explained. Common geospatial analysis processes are
demonstrated using illustrations, basic formulas, pseudo code, and Python.
Chapter 2, Geospatial Data, explains the major categories of data as well as several
newer formats that are becoming more and more common. Geospatial data comes
in many forms. The most challenging part of geospatial analysis is acquiring the
data that you need and preparing it for analysis. Familiarity with these data types is
essential to understand geospatial analysis.
Chapter 3, The Geospatial Technology Landscape, tells you about the geospatial
technology ecosystem that consists of thousands of software libraries and packages.
This vast array of choices is overwhelming for newcomers to geospatial analysis.
The secret to learning geospatial analysis quickly is understanding the handful of
libraries and packages that really matter. Most other software is derived from these
critical packages. Understanding the hierarchy of geospatial software and how it's
used allows you to quickly comprehend and evaluate any geospatial tool.
Chapter 4, Geospatial Python Toolbox, introduces the software and libraries that form
the basis of the book and are used throughout. Python's role in the geospatial industry
is elaborated: as a GIS scripting language, a mash-up glue language, and a full-blown
programming language. Code examples are used to teach data editing concepts, and
many of the basic geospatial concepts in Chapter 1, Learning Geospatial Analysis with
Python, are also demonstrated in this chapter.
Chapter 5, Python and Geographic Information Systems, teaches you how to build simple
yet practical Python GIS geospatial products, using processes that can be applied to a
variety of problems.
Chapter 6, Python and Remote Sensing, shows you how to work with remote sensing
geospatial data. Remote sensing includes some of the most complex and least-
documented geospatial operations. This chapter will build a solid core for you and
demystify remote sensing using Python.
Chapter 7, Python and Elevation Data, demonstrates the most common uses of
elevation data and how to work with its unique properties. Elevation data deserves
a chapter all on its own. Elevation data can be contained in almost any geospatial
format but is used quite differently from other types of geospatial data.
Chapter 8, Advanced Geospatial Python Modeling, uses Python to teach you the true
power of geospatial technology. Geospatial data editing and processing help us
understand the world as it is. The true power of geospatial analysis is modeling.
Geospatial models help us predict the future, narrow a vast field of choices down
to the best options, and visualize concepts that cannot be directly observed in the
natural world.
Chapter 9, Real-Time Data, examines a modern phenomenon in geospatial analysis.
A wise geospatial analyst once said, "As soon as a map is created it is obsolete." Until
recently, by the time you had collected data about the earth, processed it, and created
a geospatial product, the world it represented had already changed. But modern
geospatial data shatters this notion. Data sets that are up to the minute or even the
second are available over the Internet. This data fundamentally changes the way we
perform geospatial analysis.
Chapter 10, Putting It All Together, combines the skills from the previous chapters step
by step to build a generic corporate system to manage customer support requests
and field support personnel, one that could be applied to virtually any organization.
What you need for this book
You will require Python (3.4 or higher), a minimum hardware requirement of a
300-MHz processor, 128 MB of RAM, 1.5 GB of available hard disk, and Windows,
Linux, or OS X operating systems.
Who this book is for
This book is for anyone who wants to understand digital mapping and analysis and
who uses Python or any other scripting language for the automation or crunching
of data manually. This book primarily targets Python developers, researchers, and
analysts who want to perform geospatial, modeling, and GIS analysis with Python.
Conventions
In this book, you will find a number of text styles that distinguish between different
kinds of information. Here are some examples of these styles and an explanation of
their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"The pycsw Python library implements the CSW standard."
A block of code is set as follows:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Mockingbird Cafe</name>
    <description>Coffee Shop</description>
  </Placemark>
</kml>
Any command-line input or output is written as follows:
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about
this book—what you liked or disliked. Reader feedback is important for us as it helps
us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail, and mention
the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide at
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to
help you to get the most from your purchase.
Downloading the example code
You can download the example code les from your account at http://www. for all the Packt Publishing books you have purchased. If you
purchased this book elsewhere, you can visit
and register to have the les e-mailed directly to you.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams
used in this book. The color images will help you better understand the changes in
the output. You can download this file from the Packt website.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or
the code—we would be grateful if you could report this to us. By doing so, you can
save other readers from frustration and help us improve subsequent versions of this
book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the Errata Submission Form
link, and entering the details of your errata. Once your errata are verified, your
submission will be accepted and the errata will be uploaded to our website or added
to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to the support section of the Packt
website and enter the name of the book in the search field. The required
information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all
media. At Packt, we take the protection of our copyright and licenses very seriously.
If you come across any illegal copies of our works in any form on the Internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy.
Please contact us with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you
valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us, and we will do our best to address the problem.
Learning Geospatial Analysis
with Python
This chapter is an overview of geospatial analysis. We will see how geospatial
technology is currently impacting our world with a case study of one of the worst
disease epidemics that the world has ever seen and how geospatial analysis helped
stop the deadly virus in its tracks. Next, we'll step through the history of geospatial
analysis, which predates computers and even paper maps! Then, we'll examine why
you might want to learn a programming language as a geospatial analyst as opposed
to just using geographic information system (GIS) applications. We'll realize the
importance of making geospatial analysis as accessible as possible to the broadest
number of people. Then, we'll step through basic GIS and remote sensing concepts
and terminology that will stay with you through the rest of the book. Finally, we'll
use Python for geospatial analysis right in the first chapter by building the simplest
possible GIS from scratch!
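To give a flavor of the kind of pure-Python approach this chapter builds toward, here is a minimal, hypothetical sketch of a "GIS": a layer is just a list of coordinate pairs, and a spatial question is just a function over them. All names and coordinates below are invented for illustration and are not the book's SimpleGIS example.

```python
def bbox(points):
    """Return the bounding box (min_x, min_y, max_x, max_y) of a point list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def contains(box, point):
    """Check whether a point falls inside a bounding box."""
    min_x, min_y, max_x, max_y = box
    return min_x <= point[0] <= max_x and min_y <= point[1] <= max_y

# A polygon "layer" stored as (x, y) tuples, plus a point "layer" of cities
state = [(-89.0, 30.0), (-89.0, 35.0), (-85.0, 35.0), (-85.0, 30.0)]
cities = {"City A": (-87.3, 32.3), "City B": (-95.0, 35.1)}

box = bbox(state)
for name, location in cities.items():
    print(name, "inside" if contains(box, location) else "outside")
```

Even this toy version captures the essential GIS idea: geographic features reduced to coordinate data structures that a program can query.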
This book assumes some basic knowledge of Python, IT literacy, and at least an
awareness of geospatial analysis. This chapter provides a foundation in geospatial
analysis, which is needed to attack any subject in the areas of remote sensing and
GIS, including the material in all the other chapters of the book.
Geospatial analysis and our world
On March 25, 2014, the world awoke to news from the United Nations World Health
Organization (WHO) announcing the early stages of a deadly virus outbreak in West
Africa. The fast-moving Ebola virus would spread rapidly over the coming summer
months resulting in cases in six countries on three continents, including the United
States and Europe.
Government and humanitarian agencies faced a vicious race against time to contain
the outbreak. Patients without treatment died in as little as six days after symptoms
appeared. The most critical piece of information was the location of new cases relative
to the existing cases. The challenge they faced required reporting these new cases in
mostly rural areas with limited infrastructure. Knowing where the virus existed in
humans provided the foundation for all of the decisions that response agencies needed
for containment. The location of cases defined the extent of the outbreak. It allowed
governments to prioritize the distribution of containment resources and medical
supplies. It allowed them to trace the disease to the rst victim. It ultimately allowed
them to see if they were making progress in slowing the disease.
This map is a relative heat map of the affected countries based on the number of
cases documented and their location:
Unfortunately, the rural conditions and limited number of response personnel at
the beginning of the outbreak resulted in a five-day reporting cycle to the Liberian
Ministry of Health, which initially tracked the virus. Authorities needed better
information to bring the outbreak under control as new cases grew exponentially.
The solution came from a Liberian student using open source software from a
non-profit Kenyan technology start-up called Ushahidi. Ushahidi is the Swahili
word for testimony or witness. A team of developers in Kenya originally developed
the system in 2008 to track reports of violence after the disputed presidential election
there. Kpetermeni Siakor set up the system in Liberia in 2011 following a similarly
controversial election. When the epidemic hit Liberia, Siakor turned Ushahidi into a
disease-monitoring platform.
Siakor created a dispatch team of volunteers who received phone calls from the
public reporting possible Ebola cases. The details were entered into the Ushahidi
database, which was available on a web-based map almost instantly. The Liberian
Ministry of Health and other humanitarian organizations could access the website,
track the spread of the disease, and properly distribute supplies at health centers.
This effort, amplied by the international response, would ultimately contain the
epidemic globally. In 2015, cases are receding as the world monitors West African
cases in anticipation of the last patient recovering. The following screenshot shows
the latest Liberian public Ushahidi map as of April, 2015:
Relief workers also used the Ushahidi disaster mapping system to respond to the
2010 Haiti earthquake. Maps have always been used in disaster relief; however, the
modern evolution of GPS-enabled mobile phones, web technology, and open source
geospatial software has created a revolution in humanitarian efforts globally.
The Ushahidi API has a Python library that you can find online.
Beyond disasters
The application of geospatial modeling to disaster relief is one of the most recent
and visible case studies. However, the use of geospatial analysis has been increasing
steadily over the last 15 years. In 2004, the U.S. Department of Labor declared the
geospatial industry as one of 13 high-growth industries in the United States expected
to create millions of jobs in the coming decades.
Geospatial analysis can be found in almost every industry, including real estate, oil
and gas, agriculture, defense, politics, health, transportation, and oceanography, to
name a few. For a good overview of how geospatial analysis is used in dozens of
different industries, many resources are available online.
History of geospatial analysis
Geospatial analysis can be traced as far back as 15,000 years ago to the Lascaux
cave in southwestern France. In this cave, Paleolithic artists painted commonly
hunted animals and what many experts believe are astronomical star maps for
either religious ceremonies or potentially even migration patterns of prey. Though
crude, these paintings demonstrate an ancient example of humans creating abstract
models of the world around them and correlating spatial-temporal features to find
relationships. The following image shows one of the paintings with an overlay
illustrating the star maps:
Over the centuries, the art of cartography and the science of land surveying developed,
but it wasn't until the 1800s that significant advances in geographic analysis emerged.
Deadly cholera outbreaks in Europe between 1830 and 1860 led geographers in Paris
and London to use geographic analysis for epidemiological studies.
In 1832, Charles Picquet used different half-toned shades of gray to represent deaths
per thousand citizens in the 48 districts of Paris as part of a report on the cholera
outbreak. In 1854, Dr. John Snow expanded on this method by tracking a cholera
outbreak in London as it occurred. By placing a point on a map of the city each time
a fatality was diagnosed, he was able to analyze the clustering of cholera cases. Snow
traced the disease to a single water pump and prevented further cases. The map has
three layers with streets, an X for each pump, and dots for each cholera death:
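Snow's method translates naturally into a few lines of modern Python. The sketch below is purely illustrative (the coordinates and the second pump are invented; only the Broad Street pump is historical): assign each death to its nearest pump and count the cluster sizes.

```python
import math

# Hypothetical map coordinates for two pumps and four recorded deaths
pumps = {"Broad Street": (0.0, 0.0), "Other Pump": (5.0, 5.0)}
deaths = [(0.2, 0.1), (0.4, -0.3), (4.8, 5.2), (0.1, 0.5)]

def nearest_pump(point):
    """Return the name of the pump closest to a death location."""
    return min(pumps, key=lambda name: math.hypot(
        point[0] - pumps[name][0], point[1] - pumps[name][1]))

# Count deaths clustered around each pump
counts = {}
for death in deaths:
    name = nearest_pump(death)
    counts[name] = counts.get(name, 0) + 1

print(counts)  # the largest cluster points to the suspect pump
```

The lopsided count is the computational analogue of the visual clustering Snow saw on his paper map.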
Geospatial analysis wasn't just used for the war on diseases. For centuries, generals
and historians have used maps to understand human warfare. A retired French
engineer named Charles Minard produced some of the most sophisticated infographics
ever drawn between 1850 and 1870. The term infographics is too generic to describe
these drawings because they have strong geographic components. The quality and
detail of these maps make them fantastic examples of geographic information analysis
even by today's standards. Minard released his masterpiece, Carte figurative des pertes
successives en hommes de l'Armée Française dans la campagne de Russie 1812-1813, in 1869,
which depicted the decimation of Napoleon's army in the Russian campaign of 1812.
The map shows the size and location of the army over time along with prevailing
weather conditions. The following graphic contains four different series of information
on a single theme. It is a fantastic example of geographic analysis using pen and paper.
The size of the army is represented by the widths of the brown and black swaths at a
ratio of one millimeter for every 10,000 men. The numbers are also written along the
swaths. The brown-colored path shows soldiers who entered Russia, while the black
represents the ones who made it out. The map scale is shown to the right in the center
as one French league (2.75 miles or 4.4 kilometers). The chart at the bottom runs from
right to left and depicts the brutal freezing temperatures experienced by the soldiers on
the return march home from Russia:
While far more mundane than a war campaign, Minard released another compelling
map cataloguing the number of cattle sent to Paris from around France. Minard used
pie charts of varying sizes in the regions of France to show each area's variety and
volume of cattle that was shipped:
In the early 1900s, mass printing drove the development of the concept of map
layers—a key feature of geospatial analysis. Cartographers drew different map
elements (vegetation, roads, and elevation contours) on plates of glass that
could then be stacked and photographed to be printed as a single image. If the
cartographer made a mistake, only one plate of glass had to be changed instead of
the entire map. Later, the development of plastic sheets made it even easier to create,
edit, and store maps in this manner. However, the layering concept for maps as a
benefit to analysis would not come into play until the modern computer age.
Geographic information systems
Computer mapping evolved with the computer itself in the 1960s. However, the
origin of the term GIS began with the Canadian Department of Forestry and Rural
Development. Dr. Roger Tomlinson headed a team of 40 developers in an agreement
with IBM to build the Canada Geographic Information System (CGIS). The CGIS
tracked the natural resources of Canada and allowed the profiling of these features
for further analysis. The CGIS stored each type of land cover as a different layer.
It also stored data in a Canadian-specific coordinate system suitable for the entire
country that was devised for optimal area calculations. While the technology used
was primitive by today's standards, the system had phenomenal capability at that
time. The CGIS included software features that seem quite modern: map projection
switching, rubber sheeting of scanned images, map scale change, line smoothing and
generalization to reduce the number of points in a feature, automatic gap closing
for polygons, area measurement, dissolving and merging of polygons, geometric
buffering, creation of new polygons, scanning, and digitizing of new features from
the reference data.
The National Film Board of Canada produced a documentary in 1967
on the CGIS, which can be seen at the following URL:
Tomlinson is often called the father of GIS. After launching the CGIS, he earned his
doctorate from the University of London with his 1974 dissertation, entitled The
application of electronic computing methods and techniques to the storage, compilation, and
assessment of mapped data, which describes GIS and geospatial analysis. Tomlinson
now runs his own global consulting firm, Tomlinson Associates Ltd., and remains an
active participant in the industry. He is often found delivering the keynote addresses
at geospatial conferences.
CGIS is the starting point of geospatial analysis as defined by this book. However,
this book would not have been written if not for the work of Howard Fisher and the
Harvard Laboratory for Computer Graphics and Spatial Analysis at the Harvard
Graduate School of Design. His work on SYMAP, a GIS software package that output
maps to a line printer, started an era of development at the laboratory, which produced
two other important packages and, as a whole, permanently defined the geospatial
industry. SYMAP led to other packages including GRID and the Odyssey project
from the same lab. GRID was a raster-based GIS system that used cells to represent
geographic features instead of geometry. GRID was written by Carl Steinitz and
David Sinton. The system later became IMGRID. Next came Odyssey. Odyssey was a
team effort led by Nick Chrisman and Denis White. It was a system of programs that
included many advanced geospatial data management features that are typical of
modern geodatabase systems. Harvard attempted to commercialize these packages
with limited success. However, their impact is still seen today. Virtually every
existing commercial and open source package owes something to these code bases.
Howard Fisher produced a 1967 film using the output from
SYMAP to show the urban expansion of Lansing, Michigan, from
1850 to 1965 by hand coding decades of property information into
the system. The analysis took months but would take only a few
minutes to create now using modern tools and data. You can see the
film at
There are now dozens of graphical user interface (GUI) geospatial desktop
applications available today from companies including Esri, ERDAS, Intergraph,
and ENVI, to name a few. Esri is the oldest continuously operating GIS software
company, having started in the late 1960s. In the open source realm, packages
including Quantum GIS (QGIS) and Geographic Resources Analysis Support
System (GRASS) are widely used. Beyond comprehensive desktop software
packages, software libraries for the building of new software exist in the thousands.
Remote sensing
Remote sensing is the collection of information about an object without making
physical contact with that object. In the context of geospatial analysis, the object
is usually the Earth. Remote sensing also includes the processing of the collected
information. The potential of geographic information systems is limited only by the
available geographic data. The cost of land surveying, even using a modern GPS, to
populate a GIS has always been resource-intensive. The advent of remote sensing
not only dramatically reduced the cost of geospatial analysis, but also took the field
in entirely new directions. In addition to providing powerful reference data for GIS systems,
remote sensing has made possible the automated and semi-automated generation
of GIS data by extracting features from images and geographic data. The eccentric
French photographer, Gaspard-Félix Tournachon, also known as Nadar, took the first
aerial photograph in 1858 from a hot-air balloon over Paris:
The value of a true bird's-eye view of the world was immediately apparent. As early
as 1920, books on aerial photo interpretation began to appear.
When the United States entered the Cold War with the Soviet Union after World
War II, aerial photography to monitor military capability became prolific with the
invention of the American U-2 spy plane. The U-2 spy plane could fly at 70,000 feet,
putting it out of the range of existing anti-aircraft weapons designed to reach only
50,000 feet. The American U-2 flights over Russia ended when the Soviets finally shot
down a U-2 and captured the pilot.
However, aerial photography had little impact on modern geospatial analysis. Planes
could only capture small footprints of an area. Photographs were tacked to walls
or examined on light tables but not in the context of other information. Though
extremely useful, aerial photo interpretation was simply another visual perspective.
The game changer came on October 4, 1957, when the Soviet Union launched the
Sputnik 1 satellite. The Soviets had scrapped a much more complex and sophisticated
satellite prototype because of manufacturing difficulties. Once corrected, this prototype
would later become Sputnik 3. Instead, they opted for a simple metal sphere with four
antennae and a simple radio transmitter. Other countries including the United States
were also working on satellites. The satellite initiatives were not entirely a secret.
They were driven by scientific motives as part of the International Geophysical Year
(IGY). Advancement in rocket technology made artificial satellites a natural evolution
for Earth science. However, in nearly every case, each country's defense agency was
also heavily involved. Similar to the Soviets, other countries were struggling with
complex satellite designs packed with scientific instruments. The Soviets' decision to
switch to the simplest possible device for the sole reason of launching a satellite before
the Americans was effective. Sputnik was visible in the sky as it passed over and its
radio pulse could be heard by amateur radio operators. Despite Sputnik's simplicity,
it provided valuable scientific information that could be derived from its orbital
mechanics and radio frequency physics.
The Sputnik program's biggest impact was on the American space program.
America's chief adversary had gained a tremendous advantage in the race to space.
The United States ultimately responded with the Apollo moon landings. However,
before this, the U.S. launched a program that would remain a national secret until
1995. The classified CORONA program resulted in the first pictures from space. The
U.S. and Soviet Union had signed an agreement to end spy plane flights, but satellites
were conspicuously absent from the negotiations. The following map shows the
CORONA process. Dashed lines are the satellite flight paths, the longer white tubes
are the satellites, the smaller white cones are the film canisters, and the black blobs
are the control stations that triggered the ejection of the film so that a plane could
catch it in the sky:
The rst CORONA satellite was a four-year effort with many setbacks. However,
the program ultimately succeeded. The difculty of satellite imaging, even today,
is retrieving the images from space. The CORONA satellites used canisters of black
and white lm that were ejected from the vehicle once exposed. As the lm canister
parachuted to Earth, a U.S. military plane would catch the package in midair. If
the plane missed the canister, it would oat for a brief duration in the water before
sinking into the ocean to protect the sensitive information. The U.S. continued to
develop the CORONA satellites until they matched the resolution and photographic
quality of the U-2 spy plane photos. The primary disadvantages of the CORONA
instruments were reusability and timeliness. Once out of lm, a satellite could no
longer be of service. Additionally, the lm recovery was on a set schedule making
the system unsuitable to monitor real-time situations. The overall success of the
CORONA program, however, paved the way for the next wave of satellites, which
ushered in the modern era of remote sensing.
Due to the CORONA program's secret status, its impact on remote sensing was
indirect. Photographs of the Earth taken on manned U.S. space missions inspired the
idea of a civilian-operated remote sensing satellite. The benefits of such a satellite were
clear, but the idea was still controversial. Government officials questioned whether
a satellite was as cost-efficient as aerial photography. The military worried
that a public satellite could endanger the secrecy of the CORONA program. Other
officials worried about the political consequences of imaging other countries without
permission. However, the Department of the Interior (DOI) finally won permission
for NASA to create a satellite to monitor Earth's surface resources.
On July 23, 1972, NASA launched the Earth Resources Technology Satellite (ERTS).
The ERTS was quickly renamed Landsat 1. The platform contained two sensors.
The first was the Return Beam Vidicon (RBV) sensor, which was essentially a
video camera. It was built by the radio and television giant, Radio Corporation of
America (RCA). The RBV immediately had problems, which included disabling the
satellite's attitude guidance system. The second sensor was the highly
experimental Multispectral Scanner (MSS). The MSS performed flawlessly and
produced superior results to the RBV. The MSS captured four separate images at
four different wavelengths of the light reflected from the Earth's surface.
This sensor had several revolutionary capabilities. The first and most important
was global imaging: the satellite scanned every spot on the Earth every 16 days.
The following image from the U.S. National Aeronautics and Space
Administration (NASA) illustrates this flight and collection pattern, a series
of overlapping swaths as the sensor orbits the Earth, capturing tiles of data each time
the sensor images a location on the Earth:
It also recorded light beyond the visible spectrum. While it did capture green and
red light visible to the human eye, it also scanned near-infrared light at two different
wavelengths not visible to the human eye. The images were stored and transmitted
digitally to three different ground stations in Maryland, California, and Alaska. The
multispectral capability and digital format meant that the aerial view provided by
Landsat wasn't just another photograph from the sky. It was beaming down data.
This data could be processed by computers to output derivative information about
the Earth in the same way a GIS provided derivative information about the Earth
by analyzing one geographic feature in the context of another. NASA promoted the
use of Landsat worldwide and made the data available at very affordable prices to
anyone who asked.
This global imaging capability led to many scientific breakthroughs including the
discovery of previously unknown geography as late as 1976. For example, using
Landsat imagery, the Government of Canada located a tiny uncharted island
inhabited by polar bears. They named the new landmass Landsat Island.
Landsat 1 was followed by six other missions that were turned over to the National
Oceanic and Atmospheric Administration (NOAA) as the responsible agency.
Landsat 6 failed to achieve orbit due to a ruptured manifold, which disabled its
maneuvering engines. During some of these missions, the satellites were managed
by the Earth Observation Satellite (EOSAT) company, now called Space Imaging,
but they returned to government management with the Landsat 7 mission. The following
image from NASA is a sample of a Landsat 7 product:
The Landsat Data Continuity Mission (LDCM) was launched on February 13,
2013, and began collecting images on April 27, 2013, as part of its calibration cycle
to become Landsat 8. The LDCM is a joint mission between NASA and the U.S.
Geological Survey (USGS).
Elevation data
A Digital Elevation Model (DEM) is a three-dimensional representation of a planet's
terrain. In the context of this book, this planet is the Earth. The history of digital elevation
models is far less complicated than that of remotely sensed imagery, but no less significant.
Before computers, representations of elevation data were limited to topographic maps
created through traditional land surveys. Technology existed to create three-dimensional
models from stereoscopic images or physical models from materials such as clay or
wood, but these approaches were not widely used for geography.
The concept of digital elevation models began in 1986 when the French space agency,
Centre national d'études spatiales (CNES), launched its SPOT-1 satellite, which
included stereoscopic imaging capability. This system created the first usable DEM. Several
other U.S. and European satellites followed this model with similar missions.
In February 2000, the Space Shuttle Endeavour conducted the Shuttle Radar
Topography Mission (SRTM), which collected elevation data over 80% of the
Earth's surface using a special radar antenna configuration that allowed a single
pass. This model was surpassed in 2009 by a joint U.S. and Japanese mission using
the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER)
sensor aboard NASA's Terra satellite. This system captured 99% of the Earth's
surface but has proven to have minor data issues. As the Space Shuttle's orbit did not
cross the Earth's poles, it did not capture the entire surface. SRTM remains the gold
standard. The following image from the USGS shows a colorized DEM known as a
hillshade. Greener areas are lower elevations while yellow and brown areas are
mid-range to high elevations:
Recently, more ambitious attempts at a worldwide elevation dataset have been made
in the form of the TerraSAR-X and TanDEM-X satellites, launched by Germany in 2007 and
2010, respectively. These two radar elevation satellites worked together to produce a
global DEM called WorldDEM that was released on April 15, 2014. This dataset has
a relative accuracy of two meters and an absolute accuracy of four meters.
Computer-aided drafting
Computer-aided drafting (CAD) is worth mentioning, though it does not directly
relate to geospatial analysis. The history of CAD system development parallels
and intertwines with the history of geospatial analysis. CAD is an engineering
tool used to model two- and three-dimensional objects usually for engineering
and manufacturing. The primary difference between a geospatial model and CAD
model is that a geospatial model is referenced to the Earth, whereas a CAD model
can possibly exist in abstract space. For example, a three-dimensional blueprint of a
building in a CAD system would not have a latitude or longitude, but in a GIS, the
same building model would have a location on the Earth. However, over the years,
CAD systems have taken on many features of GIS systems and are commonly used
for smaller GIS projects. Likewise, many GIS programs can import CAD data that
has been georeferenced. Traditionally, CAD tools were designed primarily for
engineering data that was not geospatial.
However, engineers who became involved with geospatial engineering projects,
such as designing a city utility electric system, would use the CAD tools that they
were familiar with in order to create maps. Over time, both the GIS software evolved
to import the geospatial-oriented CAD data produced by engineers, and CAD
tools evolved to support geospatial data creation and better compatibility with GIS
software. AutoCAD by Autodesk and ArcGIS by Esri were the leading commercial
packages to develop this capability, and the developers of the Geospatial Data
Abstraction Library (GDAL) OGR library added CAD support as well.
Geospatial analysis and computer programming
Modern geospatial analysis can be conducted with the click of a button in any of the
easy-to-use commercial or open source geospatial packages. So then, why would you
want to use a programming language to learn this field? The most important reasons
are as follows:
You want complete control of the underlying algorithms, data, and execution
You want to automate specific, repetitive analysis tasks with minimal
overhead from a large, multipurpose geospatial framework
You want to create a program that's easy to share
You want to learn geospatial analysis beyond pushing buttons in software
The geospatial industry is gradually moving away from the traditional workflow
in which teams of analysts use expensive desktop software to produce geospatial
products. Geospatial analysis is being pushed towards automated processes that
reside in the cloud. End user software is moving towards task-specific tools, many of
which are accessed from mobile devices. Knowledge of geospatial concepts and data,
as well as the ability to build custom geospatial processes, is where the geospatial
work of the near future lies.
Object-oriented programming for geospatial analysis
Object-oriented programming is a software development paradigm in which
concepts are modeled as objects that have properties and behaviors represented as
attributes and methods, respectively. The goal of this paradigm is more modular
software in which one object can inherit from one or more other objects to encourage
software reuse.
The Python programming language is known for its ability to serve multiple
roles as a well-designed, object-oriented language, a procedural scripting language,
or even a functional programming language. However, you never completely
abandon object-oriented programming in Python because even its native data types
are objects and all Python libraries, known as modules, adhere to a basic object
structure and behavior.
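This object nature is easy to demonstrate; the short snippet below (with arbitrarily chosen values) shows that even built-in values and imported modules respond to methods and attributes like any other object:

```python
# Even Python's "primitive" built-in types are objects with methods.
n = 255
print(n.bit_length())       # the int type has methods: prints 8

print("gis".upper())        # so does str: prints GIS
print((3.5).is_integer())   # and float: prints False

# Modules are objects too, with their own type and attributes.
import math
print(type(math).__name__)  # prints module
```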
Geospatial analysis is the perfect activity for object-oriented programming. In
most object-oriented programming projects, the objects are abstract concepts such
as database connections that have no real-world analogy. However, in geospatial
analysis, the concepts modeled are, well, real-world objects! The domain of
geospatial analysis is the Earth and everything on it. Trees, buildings, rivers, and
people are all examples of objects within a geospatial system.
A common example in literature for newcomers to object-oriented programming is
the concrete analogy of a cat. Books on object-oriented programming frequently use
some form of the following example.
Imagine that you are looking at a cat. We know some information about the cat,
such as its name, age, color, and size. These features are the properties of the cat.
The cat also exhibits behaviors such as eating, sleeping, jumping, and purring. In
object-oriented programming, objects have properties and behaviors too. You can
model a real-world object such as the cat in our example or something more abstract
such as a bank account.
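The cat analogy translates directly into a minimal Python class; the name and property values below are invented for illustration:

```python
# A minimal sketch of the cat analogy: properties become attributes,
# and behaviors become methods. All values here are invented.
class Cat:
    def __init__(self, name, age, color, size):
        self.name = name      # properties
        self.age = age
        self.color = color
        self.size = size

    def eat(self):            # behaviors
        return "{} is eating".format(self.name)

    def purr(self):
        return "{} purrs".format(self.name)

duchess = Cat("Duchess", 3, "gray", "small")
print(duchess.eat())   # Duchess is eating
print(duchess.purr())  # Duchess purrs
```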
Most concepts in object-oriented programming are far more abstract than the simple
cat paradigm or even the bank account. However, in geospatial analysis, the objects
that are modeled remain concrete, such as the simple cat analogy, and in many
cases are living, breathing cats. Geospatial analysis allows you to continue with the
simple cat analogy and even visualize it. The following map represents the feral cat
population of Australia using data provided by the Atlas of Living Australia (ALA):
Importance of geospatial analysis
Geospatial analysis helps people make better decisions. It doesn't make the decision
for you, but it can answer critical questions that are at the heart of the choice to
be made and often cannot be answered any other way. Until recently, geospatial
technology and data were tools available only to governments and well-funded
researchers. However, in the last decade, data has become much more widely
available and software much more accessible to anyone.
In addition to freely available government satellite imagery, many local governments
now conduct aerial photo surveys and make the data available online. The
ubiquitous Google Earth provides a cross-platform spinning globe view of the Earth
with satellite and aerial data, streets, points of interest, photographs, and much
more. Google Earth users can create custom Keyhole Markup Language (KML) files,
which are XML files used to load and style data on the globe. This program and similar
tools are often called geographic exploration tools because they are excellent data
viewers but provide very limited data analysis capability.
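A minimal KML file can be generated with nothing but Python's standard library; the placemark name and coordinates below are invented for illustration:

```python
# Build a minimal KML document with one placemark using only the
# standard library. The placemark name and coordinates are fictional.
import xml.etree.ElementTree as ET

kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
doc = ET.SubElement(kml, "Document")
placemark = ET.SubElement(doc, "Placemark")
ET.SubElement(placemark, "name").text = "Sample Point"
point = ET.SubElement(placemark, "Point")
# KML stores coordinates as "longitude,latitude[,elevation]"
ET.SubElement(point, "coordinates").text = "-89.9,32.3"

print(ET.tostring(kml, encoding="unicode"))
```

Saving this output with a `.kml` extension produces a file that globe viewers can load directly.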
The ambitious OpenStreetMap project is a
crowd-sourced, worldwide, geographic basemap containing most layers commonly
found in a GIS. Nearly every mobile phone now contains a GPS along with mobile
apps to collect GPS tracks as points, lines, or polygons. Most phones will also tag
photos taken with the phone's camera with GPS coordinates. In short, anyone can be
a geospatial analyst.
The global population has reached seven billion people. The world is changing
faster than ever before. The planet is undergoing environmental changes never seen
before in recorded history. Faster communication and transportation increase the
interaction between us and the environment in which we live. Managing people and
resources safely and responsibly is more challenging than ever. Geospatial analysis
is the best approach to understanding our world more efficiently and deeply. The
more politicians, activists, relief workers, parents, teachers, first responders, medical
professionals, and small businesses harness the power of geospatial analysis, the
more our potential for a better, healthier, safer, and fairer world will be realized.
Geographic information system concepts
In order to begin geospatial analysis, it is important to understand some key
underlying concepts unique to the field. The list isn't long, but nearly every
aspect of analysis traces back to one of these ideas.
Thematic maps
As its name suggests, a thematic map portrays a specific theme. A general reference
map visually represents features as they relate geographically for navigation or
planning. A thematic map goes beyond location to provide the geographic context
for information around a central idea. Usually, a thematic map is designed for a
targeted audience to answer specific questions. The value of thematic maps lies in
what they do not show. A thematic map will use minimal geographic features to
avoid distracting the reader from the theme. Most thematic maps include political
boundaries such as country or state borders but omit navigational features, such as
street names or points of interest beyond major landmarks that orient the reader.
The cholera map by Dr. John Snow earlier in this chapter is a perfect example of a
thematic map. Common uses for thematic maps are visualizing health issues, such as
disease, election results, and environmental phenomena such as rainfall. These maps
are also the most common output of geospatial analysis. The following map from the
United States Census Bureau shows cancer mortality rates by state:
Thematic maps tell a story and are very useful. However, it is important to
remember that while thematic maps are models of reality as any other map, they are
also generalizations of information. Two different analysts using the same source
of information will often come up with very different thematic maps depending
on how they analyze and summarize the data. They may also choose to focus on
different aspects of the dataset. The technical nature of thematic maps often leads
people to treat them as if they are scientific evidence. However, geospatial analysis
is often inconclusive. While the analysis may be based on scientific data, the analyst
does not always follow the rigor of the scientific method. In his classic book How
to Lie with Maps (University of Chicago Press), Mark Monmonier demonstrates in great
detail how maps are easily manipulated models of reality, which are commonly
abused. This fact doesn't degrade the value of these tools. The legendary statistician,
George Box, wrote in his 1987 book, Empirical Model-Building and Response Surfaces
that, "Essentially, all models are wrong, but some are useful." Thematic maps have been
used as guides to start (and end) wars, stop deadly diseases in their tracks, win
elections, feed nations, fight poverty, protect endangered species, and rescue those
impacted by disaster. Thematic maps may be the most useful models ever created.
Spatial databases
In its purest form, a database is simply an organized collection of information.
A database management system (DBMS) is an interactive suite of software that
can interact with a database. People often use the word database as a catch-all term
referring to both the DBMS and underlying data structure. Databases typically
contain alphanumeric data and, in some cases, binary large objects or blobs, which
can store binary data such as images. Most databases also allow a relational database
structure in which entries in normalized tables can be referenced to each other in
order to create many-to-one and one-to-many relationships among data.
Spatial databases, also known as geodatabases, use specialized software to extend
a traditional relational database management system (RDBMS) to store and
query data dened in two-dimensional or three-dimensional space. Some systems
also account for a series of data over time. In a spatial database, attributes about
geographic features are stored and queried as traditional relational database
structures. The spatial extensions allow you to query geometries using Structured
Query Language (SQL) in a similar way as traditional database queries. Spatial
queries and attribute queries can also be combined to select results based on both
location and attributes.
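The combined query pattern can be sketched in plain Python without a database; the feature records, attribute filter, and search box below are all invented, and a real spatial database would express the same thing in SQL with a spatial predicate:

```python
# Select features by attribute AND location, the way a spatial SQL query
# such as "... WHERE type = 'well' AND ST_Within(geom, region)" would.
# The records and the search region are fictional.
features = [
    {"name": "Well A", "type": "well", "x": 2.0, "y": 3.0},
    {"name": "Well B", "type": "well", "x": 9.0, "y": 9.0},
    {"name": "Tank C", "type": "tank", "x": 2.5, "y": 2.5},
]

# Query region: an axis-aligned bounding box (min_x, min_y, max_x, max_y)
region = (0.0, 0.0, 5.0, 5.0)

def in_region(feature, box):
    min_x, min_y, max_x, max_y = box
    return min_x <= feature["x"] <= max_x and min_y <= feature["y"] <= max_y

# Combine the spatial predicate with an attribute predicate
results = [f["name"] for f in features
           if f["type"] == "well" and in_region(f, region)]
print(results)  # ['Well A']
```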
Spatial indexing
Spatial indexing is a process that organizes geospatial vector data for faster
retrieval. It is a way of prefiltering the data for common queries or rendering.
Indexing is commonly used in large databases to speed up query returns. Spatial
data is no different. Even a moderately-sized geodatabase can contain millions of
points or objects. If you perform a spatial query, every point in the database must
be considered by the system in order to include it or eliminate it in the results.
Spatial indexing groups data in ways that allow large portions of the dataset to be
eliminated from consideration by doing computationally simpler checks before
going into a detailed and slower analysis of the remaining items.
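The cheap pre-check idea can be sketched with bounding boxes in plain Python; the features and query box below are invented for illustration:

```python
# Illustration of a spatial index's cheap pre-check: compare a feature's
# bounding box to the query box before doing any slower, detailed
# geometry test. All data here is fictional.

# Each feature: (name, bounding box as (min_x, min_y, max_x, max_y))
features = [
    ("park",  (0, 0, 10, 10)),
    ("lake",  (50, 50, 60, 60)),
    ("field", (200, 200, 210, 210)),
]

query_box = (5, 5, 55, 55)  # area of interest

def boxes_overlap(a, b):
    # Computationally simple check: two boxes overlap unless one is
    # entirely to one side of the other.
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# The cheap test eliminates "field" without any detailed geometry work;
# only the surviving candidates would need the slower exact analysis.
candidates = [name for name, box in features
              if boxes_overlap(box, query_box)]
print(candidates)  # ['park', 'lake']
```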
Metadata is dened as data about data. Accordingly, geospatial metadata is data
about geospatial datasets that provide traceability for the source and history of
a dataset as well as summary of the technical details. Metadata also provides
long-term preservation of information holdings. Geospatial metadata can be
represented by several possible standards. One of the most prominent standards
is the international standard, ISO 19115-1, which includes hundreds of potential
fields to describe a single geospatial dataset. Additionally, ISO 19115-2 includes
extensions for geospatial imagery and gridded data. Some example fields include
spatial representation, temporal extent, and lineage. The primary use of metadata is
cataloging datasets. Modern metadata can be ingested by geographic search engines
making it potentially discoverable by other systems automatically. It also lists points
of contact for a dataset if you have questions. Metadata is an important support tool
for geospatial analysts and adds credibility and accessibility to your work. The Open
Geospatial Consortium (OGC) created the Catalog Service for the Web (CSW) to
manage metadata. The pycsw Python library implements the CSW standard. You can
learn more about it at
Map projections
Map projections have entire books devoted to them and can be a challenge for new
analysts. If you take any three-dimensional object and flatten it on a plane, such as
your screen or a sheet of paper, the object is distorted. Many grade school geography
classes demonstrate this concept by having students peel an orange and then attempt
to lay the peel flat on their desk in order to understand the resulting distortion. The
same effect occurs when you take the round shape of the Earth and project it on a
computer screen.
In geospatial analysis, you can manipulate this distortion to preserve common
properties, such as area, scale, bearing, distance, or shape. There is no one-size-fits-all
solution to map projections. The choice of projection is always a compromise, gaining
accuracy in one dimension in exchange for error in another. Projections are typically
represented as a set of over 40 parameters, as either XML or a text format called
Well-Known Text (WKT), which is used to define the transformation algorithm.
The International Association of Oil & Gas Producers (IOGP) maintains a registry
of the best-known projections. The organization was formerly known as the
European Petroleum Survey Group (EPSG), and the entries in the registry are still
known as EPSG codes. The EPSG maintained the registry as a common benefit for
the oil and gas industry, which is a prolific user of geospatial analysis for energy
exploration. At the last count, this registry contained over 5,000 entries.
As recently as 10 years ago, map projections were a primary concern for a geospatial
analyst. Data storage was expensive, high-speed Internet was rare, and cloud
computing didn't really exist. Geospatial data was typically exchanged among small
groups working in separate areas of interest. The technology constraints at the time
meant that geospatial analysis was highly localized. Analysts would use the best
projection for their area of interest. Data in different projections cannot be displayed
on the same map because they represent two different models of the Earth. Any
time an analyst received data from a third party, it had to be reprojected before
using it with the existing data. This process was tedious and time-consuming. Most
geospatial data formats do not provide a way to store the projection information.
This information is usually stored as text or XML in an ancillary file. As analysts
didn't exchange data often, many people wouldn't bother defining projection
information. Every analyst's nightmare was to come across an extremely valuable
dataset missing its projection information, which rendered the dataset useless. The
coordinates in the file are just numbers and offer no clue to the projection. With over
5,000 choices, it was nearly impossible to guess.
Now, thanks to modern software and the Internet making data exchange easier and
more common, nearly every data format has added a metadata format that defines
the projection or places it in the file header if supported. Advances in technology
have also allowed for global basemaps, which allow for more common uses of
projections such as the common Google Mercator projection used for Google Maps.
This projection is also known as Web Mercator and uses the code EPSG:3857 (or the
deprecated EPSG:900913). Geospatial portal projects such as OpenStreetMap.org
have consolidated datasets for much of the world in common projections. Modern
geospatial software can also reproject data on the fly, saving the analyst the trouble
of preprocessing the data before using it.

Closely related to map projections are geodetic datums. A datum is a model of the
Earth's surface used to match the location of features on the Earth to a coordinate
system. One common datum is called WGS 84, which is used by GPS devices.
Rendering
The exciting part of geospatial analysis is visualization. As geospatial analysis is a
computer-based process, it is good to be aware of how geographic data appears on a
computer screen.
Geographic data including points, lines, and polygons are stored numerically as
one or more points, which come in (x,y) pairs or (x,y,z) tuples. The x represents the
horizontal axis on a graph. The y represents the vertical axis. The z represents terrain
elevation. In computer graphics, a computer screen is represented by an x and y axis.
A z axis is not used because the computer screen is treated as a two-dimensional
plane by most graphics software APIs. However, as desktop computing power
continues to improve, three-dimensional maps are starting to become more common.
Another important factor is screen coordinates versus world coordinates. Geographic
data is stored in a coordinate system representing a grid overlaid on the Earth, which
is three-dimensional and round. Screen coordinates, also known as pixel coordinates,
represent a grid of pixels on a at, two-dimensional computer screen. Mapping x
and y world coordinates to pixel coordinates is fairly straightforward and involves a
simple scaling algorithm. However, if a z coordinate exists, then a more complicated
transform must be performed to map coordinates from three-dimensional space to
a two-dimensional plane. These transformations can be computationally costly and
therefore slow if not handled correctly.
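The simple scaling algorithm mentioned above can be sketched in a few lines of Python. This is a generic illustration, not code from the book; it maps a longitude/latitude pair into a pixel grid whose origin is at the top-left, given the world-coordinate bounding box of the map:

```python
def world_to_screen(lon, lat, bbox, width, height):
    """Map world coordinates to pixel coordinates.

    bbox is (minx, miny, maxx, maxy) in world coordinates;
    width and height are the screen size in pixels.
    """
    minx, miny, maxx, maxy = bbox
    x = (lon - minx) / (maxx - minx) * width
    # Screen y grows downward, while world y (latitude) grows upward.
    y = (maxy - lat) / (maxy - miny) * height
    return int(x), int(y)

# The center of the bounding box lands in the center of the screen.
print(world_to_screen(-105.5, 39.0, (-109, 37, -102, 41), 400, 300))  # (200, 150)
```

The SimpleGIS program at the end of this chapter performs the same kind of scaling, with an extra offset for the turtle module's center origin.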
In the case of remote sensing data, the challenge is typically the file size. Even a
moderately-sized satellite image that is compressed can be tens, if not hundreds,
of megabytes. Images can be compressed using lossless or lossy methods. Lossless
methods use tricks to reduce the file size without discarding any data. Lossy
compression algorithms reduce the file size by reducing the amount of data in the
image while avoiding a significant change in the appearance of the image. Rendering
an image on the screen can be computationally-intensive. Most remote sensing file
formats allow for the storing of multiple lower-resolution versions of the image—
called overviews or pyramids—for the sole purpose of faster rendering at different
scales. When zoomed out from the image to a scale where you couldn't see the detail
of the full resolution image, a preprocessed, lower-resolution version of the image is
displayed quickly and seamlessly.
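The lossless case is easy to demonstrate with Python's built-in zlib module (a generic illustration, not tied to any particular image format): decompressing returns the exact original bytes, so no information is lost.

```python
import zlib

# A toy "image band": one row of 8-bit pixel values with long runs,
# which compress very well losslessly.
row = bytes([50] * 500 + [200] * 500)

compressed = zlib.compress(row)
restored = zlib.decompress(compressed)

print(len(row), len(compressed))   # the compressed copy is far smaller
print(restored == row)             # True: no data was discarded
```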
Remote sensing concepts
Most of the GIS concepts described also apply to raster data. However, raster data
has some unique properties as well. Earlier in this chapter, in the history of remote
sensing, the focus was on Earth imaging from aerial platforms. It is important to note
that raster data can come in many forms including ground-based radar, laser
rangefinders, and other specialized devices to detect gases, radiation, and other forms of
energy in a geographic context. For the purpose of this book, we will focus on remote
sensing platforms that capture large amounts of Earth data. These sources include
Earth imaging systems, certain types of elevation data, and some weather systems
where applicable.
Images as data
Raster data is captured digitally as square tiles. This means that the data is stored on
a computer as a numerical array of rows and columns. If the data is multispectral, the
dataset will usually contain multiple arrays of the same size, which are geospatially
referenced together to represent a single area on the Earth. These different arrays are
called bands. Any numerical array can be represented on a computer as an image. In
fact, all computer data is ultimately numbers. It is important in geospatial analysis
to think of images as a numeric array because mathematical formulas are used to
process them.
In remotely sensed images, each pixel represents both space (location on the Earth
of a certain size) and the reflectance captured as light reflected from the Earth at
this location into space. So, each pixel has a ground size and contains a number
representing the intensity. As each pixel is a number, we can perform mathematical
equations on this data to combine data from different bands and highlight specific
classes of objects in the image. If the wavelength value is beyond the visible
spectrum, we can highlight features not visible to the human eye. Substances such
as chlorophyll in plants can be greatly contrasted using a specific formula called the
Normalized Difference Vegetation Index (NDVI).
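The NDVI formula itself is simple: the difference between the near-infrared and red bands, divided by their sum. A minimal pure-Python sketch (the band values here are invented reflectances for illustration, not data from the book):

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    return (nir - red) / (nir + red)

# Hypothetical reflectance values: healthy vegetation reflects
# strongly in near-infrared and absorbs red light.
vegetation = ndvi(nir=0.5, red=0.08)   # close to +1
bare_soil = ndvi(nir=0.3, red=0.25)    # close to 0
print(round(vegetation, 3), round(bare_soil, 3))  # 0.724 0.091
```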
By processing remotely sensed images, we can turn this data into visual information.
Using the NDVI formula, we can answer the question, what is the relative health of
the plants in this image? You can also create new types of digital information, which
can be used as input for computer programs to output other types of information.
Remote sensing and color
Computer screens display images as combinations of Red, Green, and Blue (RGB) to
match the capability of the human eye. Satellites and other remote sensing imaging
devices can capture light beyond this visible spectrum. On a computer, wavelengths
beyond the visible spectrum are represented in the visible spectrum so that we
can see them. These images are known as false color images. In remote sensing,
for instance, infrared light makes moisture highly visible. This phenomenon has
a variety of uses such as monitoring ground saturation during a flood or finding
hidden leaks in a roof or levee.
Common vector GIS concepts
This section will discuss the different types of GIS processes commonly used in
geospatial analysis. This list is not exhaustive; however, it provides you with the
essential operations that all other operations are based on. If you understand these
operations, you can quickly understand much more complex processes as they are
either derivatives or combinations of these processes.
Data structures
GIS vector data uses coordinates consisting of, at a minimum, an x horizontal value
and a y vertical value to represent a location on the Earth. In many cases, a point may
also contain a z value. Other ancillary values are possible including measurements
or timestamps.
These coordinates are used to form points, lines, and polygons to model real-world
objects. Points can be a geometric feature in and of themselves or they can connect
line segments. Closed areas created by line segments are considered polygons.
Polygons model objects such as buildings, terrain, or political boundaries.
A GIS feature can consist of a single point, line, or polygon or it can consist of more
than one shape. For example, in a GIS polygon dataset containing world country
boundaries, the Philippines, which is made up of 7,107 islands, would be represented
as a single country made up of thousands of polygons.
Vector data typically represents topographic features better than raster data. Vector
data has better accuracy potential and is more precise. However, collecting vector
data on a large scale is traditionally more costly than collecting raster data.
Two other important terms related to vector data structures are bounding box and
convex hull. The bounding box, or minimum bounding box, is the smallest possible
rectangle that contains all of the points in a dataset. The following image demonstrates
a bounding box for a collection of points:
The convex hull of a dataset is similar to the bounding box, but instead of a rectangle,
it is the smallest possible polygon that can contain the dataset. The bounding box of a
dataset always contains its convex hull. The following image shows the same point
data as the previous example with the convex hull polygon shown in red:
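Computing a minimum bounding box is straightforward: it is just the minimum and maximum of the x and y values in the dataset. A short sketch using invented points:

```python
def bounding_box(points):
    """Return (minx, miny, maxx, maxy) for a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

points = [(2, 3), (5, 1), (4, 6), (1, 2)]
print(bounding_box(points))  # (1, 1, 5, 6)
```

Computing a convex hull takes more work (for example, the Graham scan or gift-wrapping algorithms), but the bounding box of the result will always match the bounding box of the input points.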
Buffer
A buffer operation can be applied to spatial objects including points, lines, or
polygons. This operation creates a polygon around the object at a specied distance.
Buffer operations are used for proximity analysis, for example, establishing a safety
zone around a dangerous area. In the following image, the black shapes represent
the original geometry while the red outlines represent the larger buffer polygons
generated from the original shape:
Dissolve
A dissolve operation creates a single polygon out of adjacent polygons. A common
use for a dissolve operation is to merge two adjacent properties in a tax database
that have been purchased by a single owner. Dissolves are also used to simplify data
extracted from remote sensing:
Generalize
Objects that have more points than necessary for the geospatial model can be
generalized to reduce the number of points used to represent the shape. This
operation usually requires a few attempts to get the optimal number of points
without compromising the overall shape. It is a data optimization technique to
simplify data for the efciency of computing or better visualization. This technique
is useful in web-mapping applications. Computer screens have a resolution of 72
dots per inch (dpi). Highly-detailed point data, which would not be visible, can be
reduced so that less bandwidth is used to send a visually equivalent map to the user:
Intersection
An intersection operation is used to see whether one part of a feature intersects with
one or more features. This operation is used for spatial queries in proximity analysis and is often
a follow-on operation to a buffer analysis:
Merge
A merge operation combines two or more non-overlapping shapes in a single
multishape object. Multishape objects mean that the shapes maintain separate
geometries but are treated as a single feature with a single set of attributes by the GIS:
Point in polygon
A fundamental geospatial operation is checking to see whether a point is inside a
polygon. This one operation is the atomic building block of many different types
of spatial queries. If the point is on the boundary of the polygon, it is considered
inside. Very few spatial queries exist that do not rely on this calculation in some way.
However, it can be very slow on a large number of points.
The most common and efcient algorithm to detect if a point is inside a polygon is
called the ray casting algorithm. First, a test is performed to see if the point is on
the polygon boundary. Next, the algorithm draws a line from the point in question
in a single direction. The program counts the number of times the line crosses the
polygon boundary until it reaches the bounding box of the polygon. The bounding
box is the smallest box that can be drawn around the entire polygon. If the number is
odd, the point is inside. If the number of boundary intersections is even, the point
is outside:
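A compact version of the ray casting test can be written in pure Python. This is a generic textbook implementation, not the book's own code; note that it omits the separate boundary check described above, so points exactly on an edge should be tested first if they must count as inside:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting test: cast a ray to the right of (x, y) and
    count how many polygon edges it crosses. Odd means inside."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does edge (j, i) straddle the ray's y value, and is the
        # crossing point to the right of the test point?
        if (yi > y) != (yj > y):
            if x < (xj - xi) * (y - yi) / (yj - yi) + xi:
                inside = not inside
        j = i
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 5, square))  # False
```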
Union
The union operation is less common but very useful to combine two or more
overlapping polygons in a single shape. It is similar to dissolve, but in this case, the
polygons are overlapping as opposed to being adjacent. Usually, this operation is used
to clean up automatically-generated feature datasets from remote sensing operations:
Join
A join, or SQL join, is a database operation used to combine two or more tables
of information. Relational databases are designed to avoid storing redundant
information for one-to-many relationships. For example, a U.S. state may contain
many cities. Rather than creating a table for each state containing all of its cities, a
table of states with numeric IDs is created, while a table for all the cities in every
state is created with a state numeric ID. In a GIS, you can also have spatial joins that
are part of the spatial extension software for a database. In a spatial join, you combine
the attributes of two features in the same way as in a SQL join, but the relation
is based on the spatial proximity of the two features. To follow the previous cities
example, we could add the county name that each city resides in using a spatial
join. The cities layer could be loaded over a county polygon layer whose attributes
contain the county name. The spatial join would determine which city is in which
county and perform a SQL join to add the county name to each city's attribute row.
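The one-to-many pattern described above can be sketched with Python's built-in sqlite3 module (the table names and data here are invented for illustration):

```python
import sqlite3

# In-memory database with a one-to-many relationship:
# one state row, many city rows pointing at it by numeric ID.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE cities (name TEXT, state_id INTEGER)")
cur.execute("INSERT INTO states VALUES (1, 'Colorado')")
cur.executemany("INSERT INTO cities VALUES (?, ?)",
                [("Denver", 1), ("Boulder", 1)])

# The join re-attaches the state name to each city row.
rows = cur.execute(
    "SELECT cities.name, states.name "
    "FROM cities JOIN states ON cities.state_id = states.id "
    "ORDER BY cities.name").fetchall()
print(rows)  # [('Boulder', 'Colorado'), ('Denver', 'Colorado')]
```

A spatial join works the same way conceptually, except the join condition is a geometric relationship (for example, "city point falls inside county polygon") rather than a shared ID.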
Geospatial rules about polygons
In geospatial analysis, there are several general rules of thumb regarding polygons
that are different from mathematical descriptions of polygons:
Polygons must have at least four points—the first and last points must be
the same
A polygon boundary should not overlap itself
Polygons in a layer shouldn't overlap
A polygon in a layer inside another polygon is considered as a hole in the
underlying polygon
Different geospatial software packages and libraries handle exceptions to these rules
differently, which can lead to confusing errors or software behaviors. The safest route is
to make sure that your polygons obey these rules. There is one more important piece
of information about polygons. A polygon is, by definition, a closed shape, which
means that the first and last vertices of the polygon are identical. Some geospatial
software will throw an error if you haven't explicitly duplicated the first point as the
last point in the polygon dataset. Other software will automatically close the polygon
without complaining. The data format that you use to store your geospatial data may
also dictate how polygons are defined. This issue is a gray area and so it didn't make
the polygon rules, but knowing this quirk will come in handy someday when you
run into an error that you can't explain easily.
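A quick defensive check for the closed-shape rule can be written in a couple of lines; this is a generic sketch, not code from the book:

```python
def close_polygon(points):
    """Ensure the first and last vertices are identical,
    duplicating the first point if necessary."""
    if points and points[0] != points[-1]:
        return points + [points[0]]
    return points

triangle = [[0, 0], [4, 0], [2, 3]]
closed = close_polygon(triangle)
print(closed[0] == closed[-1], len(closed))  # True 4
```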
Common raster data concepts
As mentioned earlier, remotely-sensed raster data is a matrix of numbers. Remote
sensing contains thousands of operations that can be performed on data. This field
changes on almost a daily basis as new satellites are put into space and computer
power increases. Despite its decades-long history, we haven't even scratched the
surface of the knowledge that this field can provide to the human race. Once again,
similar to the common GIS processes, this minimal list of operations gives you the
basis to evaluate any technique used in remote sensing.
Band math
Band math is multidimensional array mathematics. In array math, arrays are treated
as single units, which are added, subtracted, multiplied, and divided. However, in
an array, the corresponding numbers in each row and column across multiple arrays
are computed simultaneously. These arrays are termed matrices and computations
involving matrices are the focus of linear algebra.
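For example, element-wise subtraction of two small "bands" stored as nested lists might look like this (in practice a library such as NumPy does this far more efficiently, but the idea is the same):

```python
band1 = [[10, 20],
         [30, 40]]
band2 = [[1, 2],
         [3, 4]]

# Subtract band2 from band1, cell by cell.
difference = [[a - b for a, b in zip(row1, row2)]
              for row1, row2 in zip(band1, band2)]
print(difference)  # [[9, 18], [27, 36]]
```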
Change detection
Change detection is the process of taking two images of the same location at
different times and highlighting the changes. A change can be due to the addition
of something on the ground, such as a new building or the loss of a feature such as
coastal erosion. There are many algorithms to detect changes among images and
also determine qualitative factors such as how long ago the change took place. The
following image from a research project by the U.S. Oak Ridge National Laboratory
(ORNL) shows rainforest deforestation between 1984 and 2000 in the state of
Rondonia, Brazil. Colors are used to show how recently the forest was cut. Green
represents virgin rainforests, white is a forest cut within two years of the end of the
date range, red is within 22 years, and the other colors fall in between as described in
the legend:
Histogram
A histogram is the statistical distribution of values in a dataset. The horizontal
axis represents a unique value in a dataset while the vertical axis represents the
frequency of this unique value in the raster. A histogram is a key operation in most
raster processing. It can be used for everything from enhancing contrast in an image
to serving as a basis for object classification and image comparison. The following
example from NASA shows a histogram for a satellite image that has been classified
into different categories representing the underlying surface features:
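Counting pixel values into a histogram takes only a few lines with Python's collections.Counter (the pixel values here are invented):

```python
from collections import Counter

# A tiny 3 x 4 "raster" of pixel values.
raster = [[1, 1, 2, 3],
          [2, 2, 3, 3],
          [3, 3, 3, 1]]

# Map each unique value to how often it occurs.
histogram = Counter(value for row in raster for value in row)
print(sorted(histogram.items()))  # [(1, 3), (2, 3), (3, 6)]
```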
Feature extraction
Feature extraction is the manual or automatic digitization of features in an image
to points, lines, or polygons. This process serves as the basis for the vectorization
of images in which a raster is converted to a vector dataset. An example of feature
extraction is extracting a coastline from a satellite image and saving it as a vector
dataset. If this extraction is performed over several years, you could monitor the
erosion or other changes along this coastline.
Supervised classification
Objects on the Earth reect different wavelengths of light depending on the materials
that they are made of. In remote sensing, analysts collect wavelength signatures for
specic types of land cover (for example, concrete) and build a library for a specic
area. A computer can then use this library to automatically locate classes in the
library in a new image of the same area.
Unsupervised classification
In an unsupervised classification, a computer groups pixels with similar reflectance
values in an image without any other reference information other than the histogram
of the image.
Creating the simplest possible Python GIS
Now that we have a better understanding of geospatial analysis, the next step is to
build a simple GIS using Python called SimpleGIS. This small program will be a
technically complete GIS with a geographic data model and the ability to render the
data as a visual thematic map showing the population of different cities.
The data model will also be structured so that you can perform basic queries. Our
SimpleGIS will contain the state of Colorado, three cities, and population counts for
each city.
Most importantly, we will demonstrate the power and simplicity of Python
programming by building this tiny system in pure Python. We will only use
modules available in the standard Python distribution without downloading
any third-party libraries.
Getting started with Python
As stated earlier, this book assumes that you have some basic knowledge of Python.
The examples in this book are based on Python 3.4.3, which you can download from the official Python website.
The only module used in the following example is the turtle module that provides
a very simple graphics engine based on the Tkinter library included with Python.
If you used the installers for Windows or Mac OS X, the Tkinter library should be
included already. If you compiled Python yourself or are using a distribution from
another source, then make sure that you can import the turtle
module by typing the following at a command prompt to run the turtle demo script:
python -m turtle
If your Python distribution does not have Tkinter, you can find information on
installing it from the following page. The information is for Python 2.3 but the
process is still the same:
The official Python wiki page for Tkinter can be found here:
The documentation for Tkinter is in the Python Standard Library documentation.
If you are new to Python, Dive into Python is a free online book, which covers all
the basics of Python and will bring you up to speed.
Building SimpleGIS
The code is divided into two different sections. The first is the data model section
and the second is the map renderer that draws this data. For the data model, we
will use simple Python lists. A Python list is a native data type, which serves as a
container for other Python objects in a specified order. Python lists can contain other
lists and are great for simple data structures. They also map well to more complex
structures or even databases if you decide you want to develop your script further.
The second portion of the code will render the map using the Python turtle graphics
engine. We will have only one function in the GIS that converts the world coordinates,
in this case, longitude and latitude, into pixel coordinates. All graphics engines have
an origin point of (0,0) that is usually in the top-left or lower-left corner of the canvas.
Turtle graphics are designed to teach programming visually. The turtle graphics
canvas uses an origin of (0,0) in the center, similar to a graphing calculator. The
following image illustrates the type of Cartesian graph that the turtle module uses. In
the following graph, some points are plotted in both positive and negative space:
This also means that the turtle graphics engine can have negative pixel coordinates,
which is uncommon for graphics canvases. However, for this example, the turtle
module is the quickest and simplest way to render our map.
Step by step
You can run this program interactively in the Python interpreter or you can save
the complete program as a script and run it. The Python interpreter is an incredibly
powerful way to learn new concepts because it gives you real-time feedback on
errors or unexpected program behavior. You can easily recover from these issues
and try something else until you get the results that you want:
1. In Python, you usually import modules at the beginning of the script, so we'll
import the turtle module first. We'll use Python's import as feature to assign
the module the name t to save space and time when typing turtle commands:
import turtle as t
2. Next, we'll set up the data model starting with some simple variables that
allow us to access list indexes by name instead of numbers to make the code
easier to follow. Python lists index the contained objects starting with the
number 0. So, if we want to access the first item in a list called myList, we
would reference it as follows:
myList[0]
3. To make our code easier to read, we can also use a variable name assigned to
commonly used indexes:
firstItem = 0
In computer science, assigning commonly used numbers to an easy-to-remember
variable is a common practice. These variables are called constants.
So, for our example, we'll assign constants for some common elements
used for all the cities. All cities will have a name, one or more points,
and a population count:
NAME = 0
POINTS = 1
POP = 2
4. Now, we'll set up the data for Colorado as a list with name, polygon points,
and population. Note that the coordinates are a list within a list:
state = ["COLORADO", [[-109, 37], [-109, 41], [-102, 41], [-102, 37]], 5187582]
5. The cities will be stored as nested lists. Each city's location consists of a single
point as a longitude and latitude pair. These entries will complete our GIS
data model. We'll start with an empty list called cities and then append the
data to this list for each city:
cities = []
cities.append(["DENVER",[-104.98, 39.74], 634265])
cities.append(["BOULDER",[-105.27, 40.02], 98889])
cities.append(["DURANGO",[-107.88,37.28], 17069])
6. We will now render our GIS data as a map by first defining a map size.
The width and height can be anything that you want depending on your
screen resolution:
map_width = 400
map_height = 300
7. In order to scale the map to the graphics canvas, we must first determine
the bounding box of the largest layer, which is the state. We'll set the map's
bounding box to a global scale and reduce it to the size of the state. To do so,
we'll loop through the longitude and latitude of each point and compare it
with the current minimum and maximum x and y values. If it is larger than
the current maximum or smaller than the current minimum, we'll make this
value the new maximum or minimum, respectively:
minx = 180
maxx = -180
miny = 90
maxy = -90
for x, y in state[POINTS]:
    if x < minx: minx = x
    elif x > maxx: maxx = x
    if y < miny: miny = y
    elif y > maxy: maxy = y
8. The second step to scaling is to calculate a ratio between the actual state and
the tiny canvas that we will render it on. This ratio is used for coordinate to
pixel conversion. We get the size along the x and y axes of the state and then
we divide the map width and height by these numbers to get our scaling ratio:
dist_x = maxx - minx
dist_y = maxy - miny
x_ratio = map_width / dist_x
y_ratio = map_height / dist_y
9. The following function, called convert(), is our only function in SimpleGIS.
It transforms a point in the map coordinates from one of our data layers to
pixel coordinates using the previous calculations. You'll notice that, at the
end, we divide the map width and height in half and subtract it from the
final conversion to account for the unusual center origin of the turtle graphics
canvas. Every geospatial program has some form of this function:
def convert(point):
    lon = point[0]
    lat = point[1]
    x = map_width - ((maxx - lon) * x_ratio)
    y = map_height - ((maxy - lat) * y_ratio)
    # Python turtle graphics start in the
    # middle of the screen
    # so we must offset the points so they are centered
    x = x - (map_width / 2)
    y = y - (map_height / 2)
    return [x, y]
10. Now for the exciting part! We're ready to render our GIS as a thematic map.
The turtle module uses the concept of a cursor called a pen. Moving the
cursor around the canvas is exactly the same as moving a pen around a piece
of paper. The cursor will draw a line when you move it. So, you'll notice that
throughout the code, we use the t.up() and t.down() commands to pick
the pen up when we want to move to a new location and put it down when
we're ready to draw. We have some important steps in this section. As the
border of Colorado is a polygon, we must draw a line between the last point
and the first point to close the polygon. We can also leave out the closing step
and just add a duplicate point to the Colorado dataset. Once we draw the
state, we'll use the write() method to label the polygon:
first_pixel = None
t.up()
for point in state[POINTS]:
    pixel = convert(point)
    if not first_pixel:
        first_pixel = pixel
    t.goto(pixel)
    t.down()
t.goto(first_pixel)
t.up()
t.goto([0, 0])
t.write(state[NAME], align="center", font=("Arial", 16, "bold"))
11. If we were to run the code at this point, we would see a simplified map of the
state of Colorado, as shown in the following screenshot:
If you do try to run the code, you'll need to temporarily add
the following line at the end, or the Tkinter window will
close as soon as it finishes drawing:
t.done()
12. Now, we'll render the cities as point locations and label them with their
names and population. As the cities are a group of features in a list, we'll
loop through them to render them. Instead of drawing lines by moving the
pen around, we'll use the turtle dot() method to plot a small circle at the
pixel coordinate returned by our SimpleGIS convert() function. We'll then
label the dot with the city's name and add the population. You'll notice that
we must convert the population number to a string in order to use it in the
turtle write() method. To do so, we use Python's built-in function, str():
for city in cities:
    pixel = convert(city[POINTS])
    t.up()
    t.goto(pixel)
    # Place a point for the city
    t.dot(10)
    # Label the city
    t.write(city[NAME] + ", Pop.: " + str(city[POP]), align="left")
13. Now we will perform one last operation to prove that we have created a real
GIS. We will perform an attribute query on our data to determine which city
has the largest population. Then, we'll perform a spatial query to see which
city lies the furthest west. Finally, we'll print the answers to our questions on
our thematic map page safely out of the range of the map.
14. As our query engine, we'll use Python's built-in min() and max() functions.
These functions take a list as an argument and return the minimum and
maximum values of this list. These functions have a special feature called
a key argument that allows you to sort complex objects. As we are dealing
with nested lists in our data model, we'll take advantage of the key argument
in these functions. The key argument accepts a function that temporarily
alters the list for evaluation before a final value is returned. In this case, we
want to isolate the population values for comparison and then the points.
We could write a whole new function to return the specified value, but we
can use Python's lambda keyword instead. The lambda keyword defines an
anonymous function that is used inline. Other Python functions can be used
inline, for example, the string function, str(), but they are not anonymous.
This temporary function will isolate our value of interest.
15. So our rst question is, which city has the largest population?
biggest_city = max(cities, key=lambda city: city[POP])
t.goto(0, -200)
t.write("The biggest city is: " + biggest_city[NAME])
16. The next question is, which city lies the furthest west?
western_city = min(cities, key=lambda city: city[POINTS])
t.goto(0, -220)
t.write("The western-most city is: " + western_city[NAME])
17. In the preceding query, we use Python's built-in min() function to select
the smallest longitude value. This works because we represented our
city locations as longitude and latitude pairs. It is possible to use other
representations for points with which this code
would need modification to work correctly. However, for our SimpleGIS, we
are using a common point representation to make it as intuitive as possible.
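To make the caveat explicit, here is the same pair of queries with the longitude pulled out by index instead of relying on list comparison; with the data model used in this chapter, both forms give the same answer:

```python
cities = [["DENVER", [-104.98, 39.74], 634265],
          ["BOULDER", [-105.27, 40.02], 98889],
          ["DURANGO", [-107.88, 37.28], 17069]]
NAME, POINTS, POP = 0, 1, 2

biggest_city = max(cities, key=lambda city: city[POP])
# Index [0] isolates the longitude explicitly; comparing the whole
# point list only works because longitude happens to come first.
western_city = min(cities, key=lambda city: city[POINTS][0])
print(biggest_city[NAME], western_city[NAME])  # DENVER DURANGO
```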
t.pen(shown=False)
t.done()
These last two commands are just for cleanup purposes. First, we hide the cursor.
Then, we call the turtle done() method, which will keep the turtle graphics window
with our map open until we choose to close it using the close handle at the top of
the window.
Whether you followed along using the Python interpreter or you ran the complete
program as a script, you should see the following map rendered in real time:
Congratulations! You have followed in the footsteps of Paleolithic hunters, the father
of GIS Dr. Roger Tomlinson, geospatial pioneer Howard Fisher, and game-changing
humanitarian programmers to create a functional, extensible, and technically
complete geographic information system. It took less than 60 lines of pure Python
code! You will be hard-pressed to find a programming language that can create a
complete GIS using only its core libraries in such a finite amount of readable code
as Python. Even if you did, it is highly unlikely that the language would survive the
geospatial Python journey that you'll take through the rest of this book.
As you can see, there is lots of room for expansion of SimpleGIS. Here are some
other ways that you might expand this simple tool using the reference material for
Tkinter and Python linked at the beginning of this section:
Create an overview map in the top-right corner with a U.S. border outline
and Colorado's location in the U.S.
Add color for visual appeal and further clarity
Create a map key for different features
Make a list of states and cities and add more states and cities
Add a title to the map
Create a bar chart to compare population numbers visually
The possibilities are endless. SimpleGIS can also be used as a way to quickly test and
visualize geospatial algorithms that you come across. If you want to add more data
layers, you can create more lists, but these lists would become difficult to manage. In
this case, you can use another Python module included in the standard distribution.
The sqlite3 module provides a SQL database in Python that can be saved to disk
or run in memory.
Well done! You are now a geospatial analyst. In this chapter, you learned the state
of the art of geospatial analysis through the story of Ushahidi and the Ebola virus.
You learned the history of geospatial analysis and the technologies that support it.
You became familiar with foundational GIS and remote sensing concepts that will
serve you through the rest of the book. You actually built a working GIS that can be
expanded to do whatever you can imagine! In the next chapter, we'll tackle the data
formats that you'll encounter as a geospatial analyst. Geospatial analysts spend far
more time dealing with data than actually performing analysis. Understanding the
data that you're working with is essential to working efficiently and having fun.
Geospatial Data
One of the most challenging aspects of geospatial analysis is the data. Geospatial
data already includes dozens of file formats and database structures and continues
to evolve and grow to include new types of data and standards. Additionally, almost
any file format can technically contain geospatial information simply by adding a
location. In this chapter, we'll examine some common traits of geospatial data. Then,
we'll look at some of the most widely used vector data types, followed by raster data
types. We'll also gain some insight into newer, more complex types, including point
cloud data and web services.
An overview of common data formats
As a geospatial analyst, you may frequently encounter the following general
data types:
Spreadsheets and comma-separated files (CSV files) or tab-separated files
(TSV files)
Geotagged photos
Lightweight binary points, lines, and polygons
Multi-gigabyte satellite or aerial images
Elevation data such as grids, point clouds, or integer-based images
XML files
JSON files
Databases (both servers and file databases)
Web services
Each format presents its own challenges for access and processing. When you
perform analysis on data, you usually have to do some form of preprocessing first.
You might clip or subset a satellite image of a large area down to just your area
of interest, or you might reduce the number of points in a collection to just the
ones meeting certain criteria in your data model. A good example of this type of
preprocessing is the SimpleGIS example at the end of Chapter 1, Learning Geospatial
Analysis with Python. The state dataset included just the state of Colorado rather than
all 50 states. The city dataset included only three sample cities demonstrating
three levels of population along with different relative locations.
The common geospatial operations in Chapter 1 are the building blocks for this type
of preprocessing. However, it is important to note that there has been a gradual
shift in the eld of geospatial analysis towards readily available base maps. Until
around 2004, geospatial data was difcult to acquire and desktop computing power
was much less than it is today. Preprocessing data was an absolute rst step for any
geospatial project. However, in 2004, Google released Google Maps not long after
Google Earth. Microsoft had also been developing a technology acquisition called
TerraServer, which they relaunched around this time. In 2004, the Open Geospatial
Consortium (OGC) updated the version of its Web Map Service (WMS) to 1.3.0.
This same year, Esri also released version 9 of their ArcGIS server system. These
innovations were driven by Google's web map tiling model that allowed for smooth,
global, scrolling maps at many different resolutions often called slippy maps.
People used map servers on the Internet before Google Maps, most famously
with the MapQuest driving directions website. However, these map servers
offered only small amounts of data at a time and usually over limited areas. The
Google tiling system converted global maps to tiered image tiles for both images
and mapping data. These were served dynamically using JavaScript and the
browser-based XMLHttpRequest API, more commonly known as asynchronous
JavaScript and XML (AJAX). Google's system scaled to millions of users using
ordinary web browsers. More importantly, it allowed programmers to leverage
JavaScript to create mash-ups that use the Google Maps JavaScript API
to add additional data to the maps. The mash-up concept is effectively a distributed
geospatial layers system. Users can combine and recombine data from different
locations into a single map as long as the data is web accessible. Other commercial
and open source systems quickly mimicked the idea of distributed layers.
A notable example of distributed geospatial layers is OpenLayers, which provides
an open source Google-like API that has now gone beyond Google's API, offering
additional features. Complementary to OpenLayers is OpenStreetMap, the open
source answer to the tiled map services consumed by systems such as
OpenLayers. OpenStreetMap has global, street-level vector data and other spatial
features collected from available government data sources and the contributions
of thousands of editors worldwide. OpenStreetMap's data maintenance model is
similar to the way the Wikipedia online encyclopedia crowdsources the creation
and updating of articles. More recently, additional mapping APIs have appeared,
including Leaflet and Mapbox, which continue to increase in flexibility, simplicity,
and capability.
The mash-up revolution had interesting and beneficial side effects on data.
Geospatial data is traditionally difficult to obtain. The cost of collecting, processing,
and distributing data kept geospatial analysis constrained to those who could
afford this steep overhead cost by producing data or purchasing it. For decades,
geospatial analysis was the tool of governments, very large organizations, and
universities. Once the web mapping trend shifted to large-scale, globally tiled maps,
organizations began essentially providing base map layers for free in order to draw
developers to their platforms. The massively scalable global map system required
massively scalable, high-resolution data to be useful. Geospatial software producers
and data providers wanted to maintain their market share and kept up with the
technology trend.
Geospatial analysts benefited greatly from this market shift in several ways. First of
all, data providers began distributing data in a common projection called Mercator.
The Mercator projection is a nautical navigation projection introduced over 400
years ago. As mentioned in Chapter 1, Learning Geospatial Analysis with Python,
all projections have practical benefits as well as distortions. The distortion in the
Mercator projection is size. In a global view, Greenland appears bigger than the
continent of South America. However, like every projection, it also has a benefit:
Mercator preserves angles. Predictable angles allowed medieval navigators to draw
straight bearing lines when plotting a course across oceans. Google Maps didn't
launch with Mercator. However, it quickly became clear that roads in high and low
latitudes met at odd angles on the map instead of the 90 degrees found in reality.
As the primary purpose of Google Maps was street-level driving directions, Google
sacrificed global view accuracy for far better relative accuracy among streets
when viewing a single city. Competing mapping systems followed suit. Google also
standardized on the WGS 84 datum. This datum defines a specific model of the
Earth's shape called a geoid, which defines normal sea level. What is significant
about this choice by Google is that the Global Positioning System (GPS) also uses
this datum. Therefore, most GPS units default to this datum as well, making Google
Maps easily compatible with raw GPS data.
The Google variation of the Mercator projection is often called Google Mercator.
The European Petroleum Survey Group (EPSG) assigns short numeric codes to
projections as an easy way to reference them. Rather than waiting for the EPSG
to approve or assign a code for a projection that was at first only relevant to Google,
developers began calling the projection EPSG:900913, which is "Google" spelled
with numbers. Later, EPSG assigned the code EPSG:3857, deprecating the older
code. Most GIS systems recognize the two codes as synonymous. It should be noted
that Google tweaked the standard Mercator projection slightly for its use; however,
this variation is almost imperceptible: Google uses spherical formulas at all map
scales, while the standard Mercator projection assumes an ellipsoidal form at
large scales.
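The spherical simplification is easy to see in code. The following is a minimal sketch of the spherical ("Google") Mercator forward projection using only the standard library; the radius value is the WGS 84 semi-major axis, and the function and variable names are ours, not from any particular library:

```python
import math

R = 6378137.0  # WGS 84 semi-major axis, in meters

def spherical_mercator(lon, lat):
    """Forward "Google" Mercator: treats the Earth as a sphere of
    radius R at every map scale, with no ellipsoidal correction."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# The equator maps to y of (almost exactly) 0, and longitude
# scales linearly with x
print(spherical_mercator(0, 0))
print(spherical_mercator(-90, 30))
```

The ellipsoidal form used by the standard Mercator projection at large scales adds an eccentricity correction to the y term, which this sketch deliberately omits.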
The following image, taken from Wikipedia, shows the distortion caused by the
Mercator projection using Tissot's indicatrix, which projects small ellipses of equal
size onto the map. The distortion of the ellipses clearly shows how the projection
affects size and distance:
Web mapping services have reduced the chore of hunting for data and eliminated
much of the preprocessing analysts once needed to do to create base maps. However,
to create anything of value, you must understand geospatial data and how to work
with it. This chapter provides an overview of the common data types and issues that
you will encounter in geospatial analysis. Throughout this chapter, two terms will be
used frequently: vector data and raster data. These are the two primary categories
under which most geospatial datasets can be grouped. Vector data includes any
format that minimally represents geolocation data using points, lines, or polygons.
Raster data includes any format that stores data in a grid of rows and columns,
which includes all image formats.
If you want to see a projection that shows the relative size of continents
more accurately, refer to the Goode homolosine projection.
Data structures
Despite the dozens of formats, geospatial data shares common traits. Understanding
these traits can help you approach and understand unfamiliar data formats by
identifying the ingredients common to nearly all spatial data. The structure of a
given data format is usually driven by its intended use. Some data is optimized for
efficient storage or compression, some is optimized for efficient access, some is
designed to be lightweight and readable (web formats), while other data formats
seek to contain as many different data types as possible.
Interestingly, some of the most popular formats today are also some of the simplest
and even lack features found in more capable and sophisticated formats. Ease of
use is extremely important to geospatial analysts because so much time is spent
integrating data into geographic information systems as well as exchanging data
among analysts. Simple data formats facilitate these activities the best.
Common traits
Geospatial analysis is an approach that applies information processing techniques to
data with a geographic context. This definition contains the most important elements
of geospatial data: geolocation data and subject information. These two factors are
present in every format that can be considered geospatial data. Another common
feature of geospatial data is spatial indexing. Overview datasets are also related
to indexing.
Geolocation information can be as simple as a single point on the Earth referencing
where a photo was taken. It can also be as complex as a satellite camera engineering
model and orbital mechanics information to reconstruct the exact conditions and
location under which the satellite captured the image.
Subject information
Subject information can also cover a wide range of possibilities. Sometimes, the
pixels in an image are the data, in terms of a visual representation of the ground.
At other times, an image may be processed using multispectral bands, such as
infrared light, to provide information not visible in the image. Processed images are
often classified using a structured color palette that is linked to a key describing
the information each color represents. Other possibilities include some form of
database with rows and columns of information for each geolocated feature, such as
the population associated with each city in our SimpleGIS from Chapter 1, Learning
Geospatial Analysis with Python.
Spatial indexing
Geospatial datasets are often very large files, easily reaching hundreds of megabytes
or even several gigabytes in size. Geospatial software can be quite slow when it has
to repeatedly access large files while performing analysis. As discussed briefly in
Chapter 1, Learning Geospatial Analysis with Python, spatial indexing creates a guide
that allows software to quickly locate query results without examining every
single feature in the dataset. Spatial indexes allow software to eliminate possibilities
and perform more detailed searches or comparisons on a much smaller subset of
the data.
Indexing algorithms
Many spatial indexing algorithms are derivatives of well-established algorithms
used for decades on nonspatial information. The two most common spatial indexing
algorithms are Quadtree index and R-tree index.
Quadtree index
The Quadtree algorithm actually represents a series of different algorithms based on
a common theme. Each node in a Quadtree index contains four children. These child
nodes are typically square or rectangular in shape. When a node contains a specified
number of features, it splits as additional features are added. The concept of
dividing a space into nested squares speeds up spatial searches. Software must only
handle five points at a time and use simple greater-than/less-than comparisons to
check whether a point is inside a node. Quadtree indexes are most commonly found
as file-based index formats.
The following image shows a point dataset sorted by a Quadtree algorithm. The
black points are the actual dataset, while the boxes are the bounding boxes of
the index. Note that none of the bounding boxes overlap. The left image shows
the spatial representation of the index. The right image shows the hierarchical
relationship of a typical index, which is how spatial software sees the index and data.
This structure allows a spatial search algorithm to quickly eliminate possibilities
when trying to locate one or more points in relation to some other set of features.
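The splitting idea can be sketched in pure Python. This is a minimal point Quadtree for illustration only; real implementations (and the file-based formats mentioned above) add node capacity tuning, bounding-box queries, and disk storage:

```python
class QuadtreeNode:
    """A minimal point Quadtree node. Leaf nodes hold points; once a
    leaf exceeds MAX_POINTS, it splits into four children."""
    MAX_POINTS = 4

    def __init__(self, xmin, ymin, xmax, ymax):
        self.bounds = (xmin, ymin, xmax, ymax)
        self.points = []
        self.children = []  # four child nodes after a split

    def contains(self, x, y):
        xmin, ymin, xmax, ymax = self.bounds
        return xmin <= x <= xmax and ymin <= y <= ymax

    def insert(self, x, y):
        if not self.contains(x, y):
            return False
        if self.children:
            # Descend into the first child whose box contains the point
            return any(c.insert(x, y) for c in self.children)
        self.points.append((x, y))
        if len(self.points) > self.MAX_POINTS:
            self.split()
        return True

    def split(self):
        """Divide this node into four quadrants and redistribute."""
        xmin, ymin, xmax, ymax = self.bounds
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        self.children = [
            QuadtreeNode(xmin, ymin, xmid, ymid),
            QuadtreeNode(xmid, ymin, xmax, ymid),
            QuadtreeNode(xmin, ymid, xmid, ymax),
            QuadtreeNode(xmid, ymid, xmax, ymax),
        ]
        for x, y in self.points:
            any(c.insert(x, y) for c in self.children)
        self.points = []

    def query(self, x, y):
        """Follow one branch down, ignoring the other quadrants."""
        if self.children:
            for c in self.children:
                if c.contains(x, y):
                    return c.query(x, y)
        return self.points

tree = QuadtreeNode(0, 0, 100, 100)
for pt in [(10, 10), (20, 15), (80, 85), (82, 88), (55, 55), (12, 90)]:
    tree.insert(*pt)
# Only the north-east quadrant's points are examined
print(tree.query(81, 86))
```

A search only ever descends one branch of the tree, which is how the index eliminates most of the dataset without examining it.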
R-tree index
R-tree indexes are more sophisticated than Quadtrees. R-trees are designed to
handle three-dimensional data and are optimized to store the index in a way that
is compatible with the way databases use disk space and memory. Nearby objects
are grouped together using one of a variety of spatial algorithms.
All objects in a group are bounded by a minimum rectangle. These rectangles are
aggregated into hierarchical nodes that are balanced at each level. Unlike a
Quadtree, the bounding boxes of an R-tree may overlap across nodes. Due to their
relative complexity and database-oriented structure, R-trees are most commonly
found in spatial databases as opposed to file-based formats.
The following diagram shows a balanced R-tree for a two-dimensional point dataset:
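While a full balanced R-tree is beyond a short example, its core building block, the minimum bounding rectangle (MBR) of a group of nearby objects, can be sketched quickly. The grouping here is hard-coded for illustration; a real R-tree chooses groups with a spatial algorithm:

```python
def mbr(points):
    """Minimum bounding rectangle of a group of (x, y) points,
    returned as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, y in points]
    ys = [y for x, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def overlaps(a, b):
    """True if two rectangles share any area, which is allowed in an
    R-tree, unlike the non-overlapping boxes of a Quadtree."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# Two hard-coded groups of nearby points
group1 = [(1, 1), (3, 2), (4, 5)]
group2 = [(3, 4), (6, 6), (7, 3)]

box1 = mbr(group1)  # (1, 1, 4, 5)
box2 = mbr(group2)  # (3, 3, 7, 6)
print(overlaps(box1, box2))  # True
```

In an actual R-tree, these group rectangles would themselves be grouped under larger parent rectangles, level by level, to form the balanced hierarchy in the diagram.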
Spatial indexes also often employ the concept of an integer grid. Geospatial
coordinates are usually floating point decimal numbers with anywhere from 2 to
16 decimal places. Performing comparisons on floating point numbers is far more
computationally expensive than working with integers. Indexed searching is about
first eliminating possibilities that don't require precision. Most spatial indexing
algorithms, therefore, map floating point coordinates to a fixed-size integer grid.
When searching for a particular feature, the software can use more efficient integer
comparisons rather than working with floating point numbers. Once the results are
narrowed down, the software can access the full-resolution data.
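This mapping can be sketched in a few lines. The function name and the bounding box below are made up for the example, and the 256 by 256 grid matches the size used by simple file formats:

```python
def to_grid(x, y, bbox, size=256):
    """Map a floating point coordinate to an integer grid cell
    within the dataset's bounding box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    col = int((x - xmin) / (xmax - xmin) * (size - 1))
    row = int((y - ymin) / (ymax - ymin) * (size - 1))
    return col, row

# Bounding box of a hypothetical dataset in decimal degrees
bbox = (-90.0, 30.0, -89.0, 31.0)

# High-precision coordinates collapse to small integer cells, so
# cheap integer comparisons can rule out most of the dataset
print(to_grid(-89.690454, 30.173943, bbox))  # (78, 44)
print(to_grid(-89.322275, 30.648357, bbox))  # (172, 165)
```

Only features whose grid cells survive this coarse integer filter need to be compared at full floating point precision.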
Grid sizes can be as small as 256 by 256 for simple file formats or as large as
three million by three million in large geospatial databases designed to incorporate
every known coordinate system and possible resolution. The integer mapping
technique is very similar to the rendering technique used to plot data on a
graphics canvas in mapping programs. The SimpleGIS script in Chapter 1, Learning
Geospatial Analysis with Python, also uses this technique to render points and
polygons using the built-in Python turtle graphics engine.
Overviews
Overview data is most commonly found in raster formats. Overviews are resampled,
lower-resolution versions of raster datasets that provide thumbnail views or simply
faster-loading image views at different map scales. They are also known as pyramids,
and the process of creating them is known as pyramiding an image. These overviews
are usually preprocessed and stored with the full-resolution data, either embedded
in the file or in a separate file. The compromise of this convenience is that the
additional images add to the overall file size of the dataset; however, they speed up
image viewers.
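The idea behind pyramiding can be sketched with a plain nested list standing in for a raster. This nearest-neighbor resampling simply keeps every second cell; real software works on the image's native format and uses better resampling methods:

```python
def build_overview(grid, factor=2):
    """Resample a row-major grid by keeping every factor-th cell."""
    return [row[::factor] for row in grid[::factor]]

def pyramid(grid, levels=2):
    """Build successively smaller overviews of a full-resolution grid."""
    overviews = []
    current = grid
    for _ in range(levels):
        current = build_overview(current)
        overviews.append(current)
    return overviews

full = [[c + r * 4 for c in range(4)] for r in range(4)]  # 4 x 4 raster
levels = pyramid(full)
print(levels[0])  # 2 x 2 overview: [[0, 2], [8, 10]]
print(levels[1])  # 1 x 1 overview: [[0]]
```

Each level holds a quarter of the cells of the one below it, which is why the stacked set of images resembles a pyramid.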
Vector data also has a concept of overviews, usually to give a dataset geographic
context in an overview map. However, because vector data is scalable, reduced-size
overviews are usually created on the fly by software using a generalization operation,
as mentioned in Chapter 1, Learning Geospatial Analysis with Python.
Occasionally, vector data is rasterized by converting it into a thumbnail image,
which is stored with the dataset or embedded in the file header. The following
image demonstrates the concept of image overviews and also shows visually why
they are often called pyramids:
Metadata
As discussed in Chapter 1, Learning Geospatial Analysis with Python, metadata is any
data that describes the associated dataset. Common examples of metadata include
basic elements such as the footprint of the dataset on the Earth as well as more
detailed information such as spatial projection and information describing how the
dataset was created. Most data formats contain the footprint or bounding box of
the data on the Earth. Detailed metadata is typically stored in a separate location in
a standard format such as the U.S. Federal Geographic Data Committee (FGDC)
Content Standard for Digital Geospatial Metadata (CSDGM), ISO, or the newer
European Union initiative, which includes metadata requirements, called the
Infrastructure for Spatial Information in the European Community (INSPIRE).
File structure
The preceding elements can be stored in a variety of ways in a single file, multiple
files, or a database, depending on the format. Additionally, this geospatial
information can be stored in a variety of formats, including embedded binary
headers, XML, database tables, spreadsheets/CSV, and separate text or binary files.
Human-readable formats such as XML files, spreadsheets, and structured text
files require only a text editor to investigate. These files are also easily parsed and
processed using Python's built-in modules, data types, and string manipulation
functions. Binary-based formats are more complicated, so it is typically easier to
use a third-party library to deal with them.
However, you don't have to use a third-party library, especially if you just want to
investigate the data at a high level. Python's built-in struct module has everything
that you need. The struct module lets you read and write binary data as byte strings.
When using the struct module, you need to be aware of the concept of byte order.
Byte order refers to how the bytes of information that make up a file are stored in
memory. This order is usually platform-specific, but in some rare cases, including
shapefiles, the byte order is mixed within the file. The Python struct module uses the
greater-than (>) and less-than (<) symbols to specify byte order (big endian and little
endian, respectively).
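A quick experiment with struct shows what the two byte orders look like for the same value:

```python
import struct

# Pack the 4-byte integer 1 in big-endian and little-endian order
big = struct.pack(">i", 1)
little = struct.pack("<i", 1)

print(big)     # b'\x00\x00\x00\x01'
print(little)  # b'\x01\x00\x00\x00'

# Unpacking with the wrong byte order silently gives the wrong number
print(struct.unpack("<i", big))  # (16777216,)
```

The last line is why byte order matters: the bytes are read without error, but the value is nonsense.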
The following brief example demonstrates the usage of the Python struct module to
parse the bounding box coordinates from an Esri shapefile vector dataset. You can
download this shapefile as a zipped file at the following URL:
When you unzip it, you will see three files. For this example, we'll be using
hancock.shp. The Esri shapefile format stores the minimum x, minimum y,
maximum x, and maximum y bounding box values as doubles at a fixed location in
the file header, starting at byte 36. In this example, we will execute the following
steps:
1. Import the struct module.
2. Open the shapefile in binary read mode.
3. Navigate to byte 36.
4. Read each 8-byte double, specified by the format character d, and unpack it
using the struct module in little-endian order, as designated by the < sign.
The best way to execute this script is in the interactive Python interpreter. We
will read the minimum longitude, minimum latitude, maximum longitude, and
maximum latitude:
>>> import struct
>>> f = open("hancock.shp","rb")
>>> f.seek(36)
36
>>> struct.unpack("<d", f.read(8))
(-89.6904544701547,)
>>> struct.unpack("<d", f.read(8))
(30.173943486533133,)
>>> struct.unpack("<d", f.read(8))
(-89.32227546981174,)
>>> struct.unpack("<d", f.read(8))
You'll notice that when the struct module unpacks a value, it returns a Python
tuple with one value. You can shorten the preceding unpacking code to one line
by specifying all four doubles at once and increasing the number of bytes read to
32, as shown in the following code:
>>> f.seek(36)
36
>>> struct.unpack("<dddd", f.read(32))
(-89.6904544701547, 30.173943486533133, -89.32227546981174,
Vector data
Vector data is, by far, the most common geospatial format because it is the most
efficient way to store spatial information and, in general, requires fewer computing
resources to store and process than raster data. The OGC has over 16 formats directly
related to vector data. Vector data stores only geometric primitives, including points,
lines, and polygons. However, only the points are stored for each type of shape. For
example, in the case of a simple straight vector line shape, only the end points need
to be stored and defined as a line. Software displaying this data reads the shape
type and then connects the end points with a line dynamically.
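This storage model is easy to picture: a straight line is just its two endpoints, and any intermediate coordinates are produced on demand. A tiny sketch (the helper name is ours):

```python
# A straight vector "line" stored as only its two end points
line = [(-89.69, 30.17), (-89.32, 30.64)]

def densify(p1, p2, steps):
    """Generate intermediate points along the segment at draw time,
    the way display software connects end points dynamically."""
    (x1, y1), (x2, y2) = p1, p2
    return [
        (x1 + (x2 - x1) * i / steps, y1 + (y2 - y1) * i / steps)
        for i in range(steps + 1)
    ]

points = densify(line[0], line[1], 4)
print(len(points))  # 5 points rendered from only 2 stored
print(points[0])    # the first stored end point itself
```

The stored dataset never grows, no matter how finely the line is drawn, which is one reason vector data is so compact.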
Geospatial vector data is similar to the concept of vector computer graphics with
some notable exceptions. Geospatial vector data contains positive and negative
Earth-based coordinates, while vector graphics typically store computer screen
coordinates. Geospatial vector data is also usually linked to other information about
the object represented by the geometry. This information may be as simple as a
timestamp in the case of GPS data or an entire database table for larger geographic
information systems. Vector graphics often store styling information describing colors,
shadows, and other display-related instructions, while geospatial vector data typically
does not. Another important difference is shapes. Geospatial vectors typically only
include very primitive geometries based on points, straight lines, and straight-line
polygons, while many computer graphic vector formats have concepts of curves and
circles. However, geospatial vectors can model these shapes using more points.
Other human-readable formats such as Comma-Separated Values (CSV), simple
text strings, GeoJSON, and XML-based formats are technically vector data because
they store geometry, as opposed to rasters, which represent all the data within the
bounding box of the dataset. Until the explosion of XML in the late 1990s, vector
data formats were nearly all binary. XML provided a hybrid approach that was
both computer and human readable. The compromise is that text formats such as
GeoJSON and XML greatly increase the file size over binary formats. These
formats are discussed later in this section.
The number of vector formats to choose from is staggering. The open source vector
library OGR lists over 86 supported vector formats. Its commercial counterpart,
Safe Software's Feature Manipulation Engine (FME), lists over 188 supported vector
formats. These lists include a few vector graphics formats as well as human-readable
geospatial formats. There are still dozens of formats out there to at least be aware of,
in case you come across them.
The most ubiquitous geospatial format is the Esri shapefile. The geospatial software
company Esri released the shapefile format specification as an open format in 1998.
Esri developed it as a format for their ArcView software, designed as a lower-end
GIS option to complement their high-end professional package, ArcInfo, formerly
called ARC/INFO. However, the open specification, efficiency, and simplicity of the
format turned it into an unofficial GIS standard, still extremely popular over 15 years
later. Virtually every piece of software labeled as geospatial software supports
shapefiles because the format is so common. For this reason, you can almost get by
as an analyst by being intimately familiar with shapefiles and mostly ignoring other
formats. You can convert almost any other format to shapefiles through the source
format's native software or a third-party converter such as the OGR library, for
which there is a Python module. Other Python libraries that handle vector data
include Shapely, which is based on the GEOS library, and Fiona, which is based
on OGR.
One of the most striking features of a shapefile is that the format consists of multiple
files. At a minimum, there are three, and there can be as many as 15 different
files! The following list describes the supporting files and their purposes. The .shp,
.shx, and .dbf files are required for a valid shapefile.

.shp: The shapefile itself, which contains the geometry. Required. Some software
needing only geometry will accept a .shp file without the .shx or .dbf files.

.shx: The shape index file, a fixed-size record index referencing the geometry for
faster access. Required. This file is meaningless without the .shp file.

.dbf: The database file, which contains the attributes for each geometry record.
Required. Some software will access this format without the .shp file present, as
the specification predates shapefiles. It is based on the very old FoxPro and dBase
formats; an open specification called Xbase exists. The .dbf files can be opened by
most types of spreadsheet software.

.sbn: The spatial bin file, the shapefile's spatial index. It contains bounding boxes
of features mapped to a 256-by-256 integer grid. Frequently seen.

.sbx: A fixed-size record index for the .sbn file; a traditional ordered record index
of a spatial index. Frequently seen.

.prj: Map projection information stored in well-known text format. A very common
file, required for on-the-fly projection by GIS software. The same format can also
accompany rasters.

.fbn: A spatial index of read-only features. Very rarely seen.

.fbx: A fixed-size record index of the .fbn spatial index. Very rarely seen.

.ixs: A geocoding index. Common in geocoding applications, including
driving-direction applications.

.mxs: Another type of geocoding index. Less common than the .ixs format.

.ain: An attribute index. Mostly a legacy format, rarely used by modern software.

.aih: An attribute index that accompanies the .ain file.

.qix: A Quadtree index. A spatial index format created by the open source
community because the Esri .sbn and .sbx formats were undocumented for
many years.

.atx: An attribute index. A more recent, Esri software-specific attribute index
used to speed up attribute queries.

.shp.xml: Metadata. A geospatial metadata XML container that can follow any of
the multiple XML standards, including FGDC and ISO.

.cpg: A code page file for the .dbf file, used for the internationalization of
.dbf files.
You will probably never encounter all of these formats at once. However, any
shapefile that you use will have multiple files. You will commonly see .shp, .shx,
.dbf, .prj, .sbn, .sbx, and, occasionally, .shp.xml files. If you want to rename a
shapefile, you must rename all of the associated files to the same base name;
however, in Esri software and other GIS packages, these datasets will appear as a
single file.
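Renaming all of the sidecar files by hand is error-prone, so a small helper is handy. The following is a sketch using only the standard library; the function name is ours, and it assumes the files live in the current directory:

```python
import glob
import os

def rename_shapefile(src_base, dst_base):
    """Rename every file sharing the shapefile's base name, so
    .shp, .shx, .dbf, .prj, and even .shp.xml stay in sync."""
    for path in glob.glob(src_base + ".*"):
        ext = path[len(src_base):]  # includes the dot, e.g. ".shp.xml"
        os.rename(path, dst_base + ext)
```

For example, rename_shapefile("hancock", "coastline") would move every hancock.* file to the coastline.* name in one step.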
Another important feature of shapefiles is that the records are not numbered.
Records include the geometry, the .shx index record, and the .dbf record. These
records are stored in a fixed order. When you examine shapefile records using
software, they appear to be numbered, which often confuses people who delete
a shapefile record, save the file, and reopen it: the number of the deleted record
still appears. The reason is that shapefile records are numbered dynamically on
loading, but the numbers are not saved. So, if you delete record number 23 and save
the shapefile, record number 24 will become 23 the next time you read the shapefile.
Many people expect to open the shapefile and see the records jump from 22 to 24.
The only way to track shapefile records persistently is to create a new attribute
called ID or similar in the .dbf file and assign each record a permanent, unique
identifier.
Just like renaming shapefiles, care must be taken when editing them. It's best to
use software that treats the shapefile as a single dataset. If you edit any of the files
individually and add or delete a record without editing the accompanying files, the
shapefile will be seen as corrupt by most geospatial software.
CAD les
CAD stands for computer-aided design. The primary formats for CAD data
were created by Autodesk for their leading AutoCAD package. The two formats
commonly seen are Drawing Exchange Format (DXF) and AutoCAD native
Drawing (DWG) format. DWG was traditionally a closed format but it has
become more open.
CAD software is used for everything that is engineering-related, from designing
bicycles to cars, parks, and city sewer systems. As a geospatial analyst, you don't
have to worry about mechanical engineering designs; however, civil engineering
designs become quite an issue. Most engineering firms use geospatial analysis
to a very limited degree but store nearly all of their data in CAD formats. The
DWG and DXF formats can represent objects using features not found in geospatial
software or only weakly supported by geospatial systems. Some examples of these
features include the following:
Surfaces (for objects that are different from geospatial elevation surfaces)
3D solids
Text (rendered as an object)
Text styling
Viewport configuration
These CAD- and engineering-specific features make it difficult to cleanly convert
CAD data into geospatial formats. If you encounter CAD data, the easiest option is to
ask the data provider if they have shapefiles or some other geospatial-centric format.
Tag-based and markup-based formats
Tag-based markup formats are typically XML formats. They also include other
structured text formats such as the well-known text (WKT) format used for
projection information files as well as different types of data exchange. XML formats
include Keyhole Markup Language (KML), the OpenStreetMap (OSM) format, and
the Garmin GPX format for GPS data, which has become a popular exchange format.
The Open Geospatial Consortium's Geography Markup Language (GML) standard
is one of the oldest and most widely used XML-based geographic formats. It is also
the basis for the OGC Web Feature Service (WFS) standard for web applications.
However, GML has been largely superseded by KML and the GeoJSON format.
XML formats often contain more than just geometry. They also contain attributes and
rendering instructions such as color, styling, and symbology. Google's KML format
has become a fully supported OGC standard. The following is a sample of KML
showing a simple placemark:
<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Mockingbird Cafe</name>
    <description>Coffee Shop</description>
  </Placemark>
</kml>
The XML format is attractive to geospatial analysts for several reasons:
It is a human readable format
It can be edited in a text editor
It is well-supported by programming languages (especially Python!)
It is, by definition, easily extensible
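Python's standard library bears this out. The following is a minimal sketch of parsing a KML placemark with the built-in xml.etree.ElementTree module; the sample document here is illustrative:

```python
# A minimal sketch of parsing KML with Python's built-in ElementTree
import xml.etree.ElementTree as ET

kml = b"""<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Mockingbird Cafe</name>
    <description>Coffee Shop</description>
  </Placemark>
</kml>"""

# KML elements live in the OGC KML namespace, so searches must qualify it
ns = {"kml": "http://www.opengis.net/kml/2.2"}
root = ET.fromstring(kml)
placemark = root.find("kml:Placemark", ns)
name = placemark.find("kml:name", ns).text
description = placemark.find("kml:description", ns).text
print(name, "-", description)  # Mockingbird Cafe - Coffee Shop
```

Note that the document is passed as bytes because ElementTree refuses Unicode strings that carry an encoding declaration.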
XML is not perfect though. It is an inefficient storage mechanism for very large data
formats and can quickly become cumbersome to edit. Errors in datasets are common
and most parsers do not handle errors robustly. Despite the downsides, XML is
widely used in geospatial analysis.
Scalable Vector Graphics (SVG) is a widely supported XML format for computer
graphics. It is supported well by browsers and is often used for geospatial rendering.
However, SVG was not designed as a geographic format.
The WKT format is also an older OGC standard. The most common use for it is to
define projection information, usually stored in .prj projection files alongside a
shapefile or raster. The WKT string for the WGS84 coordinate system is as follows:
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0,
        AUTHORITY["EPSG","8901"]],
    UNIT["degree",0.0174532925199433,
        AUTHORITY["EPSG","9122"]],
    AUTHORITY["EPSG","4326"]]
The parameters dening a projection can be quite long. A standards committee
created by the EPSG created a numerical coding system to reference projections.
These codes, such as EPSG:4326, are used as shorthand for strings like the preceding
code. There are also short names for commonly used projections such as Mercator,
which can be used in different software packages to reference a projection. More
information on these reference systems can be found at the spatial reference website
GeoJSON is a relatively new and brilliant text format based on the JavaScript
Object Notation (JSON) format, which has been a commonly used data exchange
format for years. Despite its short history, GeoJSON can be found embedded in all
major geospatial software systems and most websites that distribute data because
JavaScript is the language of the dynamic web and GeoJSON can be directly fed
into JavaScript.
GeoJSON is a completely backwards-compatible extension for the popular JSON
format. The structure of JSON is very similar and in some cases identical to existing
data structures of common programming languages. JSON is almost identical to
Python's dictionary and list data types. Due to this similarity, parsing JSON in a
script is simple to do from scratch but there are also many libraries to make it even
easier. Python contains a built-in library aptly named json.
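Because JSON maps directly onto Python's built-in types, a few lines of the standard library are enough to parse GeoJSON. The tiny point geometry below is an illustrative sample:

```python
# Parse GeoJSON text into Python data structures with the built-in json module
import json

geojson = """{
  "type": "Point",
  "coordinates": [-89.33, 30.0]
}"""

point = json.loads(geojson)   # GeoJSON text becomes a plain Python dict
print(point["type"])          # Point
print(point["coordinates"])   # [-89.33, 30.0]

# Serializing back to GeoJSON text is one call in the other direction
text = json.dumps(point)
```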
GeoJSON provides you with a standard way to define geometry, attributes,
bounding boxes, and projection information. GeoJSON has all of the advantages of
XML including human-readable syntax, excellent software support, and wide use in
the industry. In some ways, it also surpasses XML. GeoJSON is far more compact than
XML, largely because it uses simple symbols to define objects rather than opening and
closing text-laden tags. The compactness also helps with the readability and manageability
of larger datasets. However, it is still inferior to binary formats from a data volume
standpoint. The following is a sample of the GeoJSON syntax, defining a geometry
collection with both a point and a line:
{ "type": "GeometryCollection",
  "geometries": [
    { "type": "Point",
      "coordinates": [-89.33, 30.0]
    },
    { "type": "LineString",
      "coordinates": [ [-89.33, 30.30], [-89.36, 30.28] ]
    }
  ]
}
The preceding code is a valid GeoJSON, but it is also a valid Python data structure.
You can copy the preceding code sample directly to the Python interpreter as a
variable denition and it will evaluate without error as follows:
>>> gc = { "type": "GeometryCollection",
...   "geometries": [
...     { "type": "Point",
...       "coordinates": [-89.33, 30.0] },
...     { "type": "LineString",
...       "coordinates": [ [-89.33, 30.30], [-89.36, 30.28] ] }
...   ]}
>>> gc
{'type': 'GeometryCollection', 'geometries': [{'type': 'Point',
'coordinates': [-89.33, 30.0]}, {'type': 'LineString', 'coordinates':
[[-89.33, 30.3], [-89.36, 30.28]]}]}
Due to its compact size, Internet-friendly syntax by virtue of being written in
JavaScript, and support from major programming languages, GeoJSON is a key
component of leading REST geospatial web APIs, which will be covered later
in this chapter. It currently offers the best compromise among the computer
resource efciency of binary formats, the human-readability of text formats, and
programmatic utility.
Raster data
Raster data consists of rows and columns of cells or pixels, with each cell
representing a single value. The easiest way to think of raster data is as images,
which is how they are typically represented by software. However, raster datasets
are not necessarily stored as images. They can also be ASCII text files or Binary
Large Objects (BLOBs) in databases.
Another difference between geospatial raster data and regular digital images is
resolution. Digital images express resolution as dots-per-inch if printed in full size.
Resolution can also be expressed as the total number of pixels in the image, defined
as megapixels. However, geospatial raster data uses the ground distance that each
cell represents. For example, a raster dataset with a two-foot resolution means that
a single cell represents two feet on the ground, which also means that only objects
larger than two feet can be identified visually in the dataset.
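The relationship between cell resolution and ground coverage is simple multiplication. A quick sketch with illustrative numbers:

```python
# Illustrative numbers: ground footprint of a raster from its cell resolution
resolution = 2.0           # ground units (feet) per cell
ncols, nrows = 6000, 4000  # raster dimensions in cells

width = ncols * resolution   # ground distance covered east-west
height = nrows * resolution  # ground distance covered north-south
print(width, height)  # 12000.0 8000.0
```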
Raster datasets may contain multiple bands, meaning that different wavelengths of
light can be collected at the same time over the same area. Often, this range is from
3-7 bands but can be several hundred in hyperspectral systems. These bands are
viewed individually or swapped in and out as the RGB bands of an image. They can
also be recombined into a derivative single-band image using mathematics and then
recolored using a set number of classes representing like values within the dataset.
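A derivative single-band image of the kind described above can be sketched in pure Python. The two tiny 2 x 2 bands below are made-up values, and the normalized-difference formula is the one behind vegetation indexes such as NDVI; real imagery would use a library such as NumPy for speed:

```python
# Recombine two bands into a derivative single band using band math
red = [[0.1, 0.2], [0.3, 0.4]]  # made-up red reflectance values
nir = [[0.5, 0.6], [0.7, 0.8]]  # made-up near-infrared reflectance values

# Normalized difference, the formula behind vegetation indexes such as NDVI
derived = [
    [(n - r) / (n + r) for r, n in zip(red_row, nir_row)]
    for red_row, nir_row in zip(red, nir)
]
print(derived[0])  # first row of the derived single-band "image"
```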
Another common application of raster data is in the field of scientific computing,
which shares many elements of geospatial remote sensing but adds some interesting
twists. Scientific computing often uses complex raster formats, including Network
Common Data Form (NetCDF), GRIB, and HDF5, which store entire data models.
These formats are more like directories in a filesystem and can contain multiple
datasets or multiple versions of the same dataset. Oceanography and meteorology
are the most common applications of this kind of analysis. An example of a scientific
computing dataset is the output of a weather model, where the cells of the raster
dataset in different bands may represent different variables' output from the model
in a time series.
Like vector data, raster data can come in a variety of formats. The open source raster
library called the Geospatial Data Abstraction Library (GDAL), which actually includes
the vector OGR library mentioned earlier, lists over 130 supported raster formats.
The FME software package supports this many as well. However, just like shapefiles
and CAD data, there are a few standout raster formats.
TIFF les
The Tagged Image File Format (TIFF) is the most common geospatial raster
format. The TIFF format's flexible tagging system allows it to store any type of data
whatsoever in a single file. TIFFs can contain overview images, multiple bands,
integer elevation data, basic metadata, internal compression, and a variety of other
data typically stored in additional supporting files by other formats. Anyone can
extend the TIFF format unofficially by adding tagged data to the file structure. This
extensibility has benefits and drawbacks, however. A TIFF file may work fine in
one piece of software but fail when accessed in another because the two software
packages implement the massive TIFF specification to different degrees. An old
joke about TIFFs has a frustrating amount of truth to it: TIFF stands for Thousands
of Incompatible File Formats. The GeoTIFF extension defines how geospatial data is
stored. Geospatial rasters stored as TIFF files may have any of the following file
extensions: .tiff, .tif, or .gtif.
JPEG, GIF, BMP, and PNG formats are common image formats in general but can
be used for basic geospatial data storage as well. Typically, these formats rely on
accompanying supporting text files, such as WKT .prj files or the world files
described later in this chapter, to georeference the image and make it compatible
with GIS software.
The JPEG format is also fairly common for geospatial data. JPEGs have a built-in
metadata tagging system similar to TIFFs called EXIF. JPEGs are commonly used
for geotagged photographs in addition to raster GIS layers. Bitmap (BMP) images
are used for desktop applications and document graphics. However, JPEG, GIF,
and PNG are the formats used in web mapping applications, especially for server
pregenerated map tiles for quick access via slippy maps.
Compressed formats
As geospatial rasters tend to be very large, they are often stored using advanced
compression techniques. The latest open standard is the JPEG 2000 format, which
is an update of the JPEG format and includes wavelet compression and a few other
features such as georeferencing data. Multi-resolution Seamless Image Database
(MrSID) (.sid) and Enhanced Compression Wavelet (ECW) (.ecw) are two
proprietary wavelet compression formats often seen in geospatial contexts. The TIFF
format supports compression including the Lempel-Ziv-Welch (LZW) algorithm.
It must be noted that compressed data is suitable as part of a base map but should
not be used for remote sensing processing. Compressed images are designed to
look visually correct but often alter the original cell value. Lossless compression
algorithms try to avoid degrading the source data but it's generally considered a bad
idea to attempt spectral analysis on data that has been through compression. The
JPEG format is designed to be a lossy format that sacrifices data for a smaller file size.
It is also commonly encountered so it is important to remember this fact to avoid
invalid results.
Another means of storing raster data, often elevation data, is in ASCII Grid files. This
file format was created by Esri but has become an unofficial standard supported by
most software packages. An ASCII Grid is a simple text file containing (x, y) values
as rows and columns. The spatial information for the raster is contained in a simple
header. The format of the file is as follows:
<NCOLS xxx>
<NROWS xxx>
<XLLCENTER xxx | XLLCORNER xxx>
<YLLCENTER xxx | YLLCORNER xxx>
<CELLSIZE xxx>
<NODATA_VALUE xxx>
row 1
row 2
...
row n
While not the most efficient way to store data, ASCII Grid files are very popular
because they don't require any special data libraries to create or access geospatial
raster data. These files are often distributed as zip files. The header values in the
preceding format contain the following information:
The number of columns
The number of rows
The x-axis cell center coordinate | x-axis lower-left corner coordinate
The y-axis cell center coordinate | y-axis lower-left corner coordinate
The cell size in mapping units
The no-data value (typically, -9999)
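Because the format is plain text, the header and rows can be read with nothing but the standard library. The grid below is an illustrative sample, not a real dataset:

```python
# Read an ASCII Grid using only the standard library
grid = """ncols 4
nrows 2
xllcorner 277750.0
yllcorner 6122250.0
cellsize 30.0
NODATA_value -9999
1 2 3 4
5 6 7 8"""

lines = grid.splitlines()

# The first six lines are the header: keyword/value pairs
header = {}
for line in lines[:6]:
    key, value = line.split()
    header[key.lower()] = float(value)

# The remaining lines are rows of cell values
rows = [[float(v) for v in line.split()] for line in lines[6:]]
print(header["cellsize"], rows[0])
```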
World les
World les are simple text les, which can provide geospatial referencing
information to any image externally for le formats that typically have no native
support for spatial information, including JPEG, GIF, PNG, and BMP. The world
le is recognized by geospatial software due to its naming convention. The most
common way to name a world le is to use the raster le name and then alter the
extension to remove the middle letter and add w at the end. The following table
shows some examples of raster images in different formats and the associated world
le name based on the convention:
Raster file name World file name
World.jpg World.jpw
World.tif World.tfw
World.bmp World.bpw
World.png World.pgw
World.gif World.gfw
The structure of a world file is very simple. It is a six-line text file as follows:
Line 1: The cell size along the x axis in ground units
Line 2: The rotation on the y axis
Line 3: The rotation on the x axis
Line 4: The cell size along the y axis in ground units
Line 5: The center x-coordinate of the upper left cell
Line 6: The center y-coordinate of the upper left cell
The following is an illustrative example of world file values (the numbers are
hypothetical: two-meter cells, no rotation, and a made-up upper-left origin):
2.0
0.0
0.0
-2.0
343262.0
5043599.0
The (x, y) coordinates and the (x, y) cell size contained in lines 1, 4, 5, and 6 allow you
to calculate the coordinate of any cell or the distance across a set of cells. The rotation
values are important for geospatial software because remotely sensed images are
often rotated due to the data collection platform. Rotating the images runs the risk
of resampling the data and, therefore, data loss so the rotation values allow the
software to account for the distortion. The surrounding pixels outside the image are
typically assigned a no data value and represented as the color black. The following
image, courtesy of the U.S. Geological Survey (USGS), demonstrates image rotation,
where the satellite collection path is oriented from southeast to northeast but the
underlying base map is north:
World les are a great tool when working with raster data in Python. Most geospatial
software and data libraries support world les so they are usually a good choice for
the georeferencing.
You'll nd that world les are very useful, but as you use
them infrequently, you forget what the unlabeled contents
represent. A quick reference for world les is available at
Point cloud data
Point cloud data is any data collected as the (x, y, z) location of a surface point based
on some sort of focused energy return. Point cloud data can be created using lasers,
radar waves, acoustic soundings, or other waveform generation devices. The spacing
between points is arbitrary and dependent on the type and position of the sensor
collecting the data. In this book, we will primarily be concerned with LIDAR data
and radar data. Radar point cloud data is typically collected on space missions while
LIDAR is typically collected by terrestrial or airborne vehicles. Conceptually, both
the types of data are similar.
LIDAR uses powerful laser range-finding systems to model the world with very high
precision. The term LIDAR or LiDAR is a combination of the words light and radar.
Some people claim it also stands for Light Detection and Ranging. LIDAR sensors
can be mounted on aerial platforms including satellites, airplanes, or helicopters.
They can also be mounted on vehicles for ground-based collection.
Due to the high-speed, continuous data collection provided by LIDAR, and a wide
field of view—often 360 degrees of the sensor—LIDAR data doesn't typically have
a rectangular footprint the way other forms of raster data do. LIDAR datasets are
usually called point clouds because the data is a stream of (x,y,z) locations, where z is
the distance from the laser to a detected object and the (x,y) values are the projected
location of the object calculated from the location of the sensor. The following image,
courtesy of USGS, shows a point cloud LIDAR dataset in an urban area using a
terrestrial sensor as opposed to an aerial one. The colors are based on the strength of
the laser's energy return with red areas being closer to the LIDAR sensor and green
areas farther away, which can give a precise height to within a few centimeters:
The most common data format for LIDAR data is LIDAR Exchange Format (LAS),
which is a community standard. LIDAR data can be represented in many ways
including a simple text le with one (x, y, z) tuple per line. Sometimes, LIDAR data
can be colorized using image pixel colors collected at the same time. LIDAR data can
also be used to create 2D elevation rasters. This technique is the most common use
for LIDAR in geospatial analysis. Any other use requires specialized software that
allows the user to work in 3D. In that case, other geospatial data cannot be combined
with the point cloud.
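The simple text representation mentioned above is easy to work with in pure Python. The three (x, y, z) tuples below are illustrative values, not a real LIDAR collection:

```python
# Parse a text point cloud: one (x, y, z) tuple per line
cloud = """-89.33 30.30 12.7
-89.34 30.31 13.1
-89.35 30.29 9.4"""

points = [tuple(float(v) for v in line.split())
          for line in cloud.splitlines()]

# With the data in memory, simple analysis is easy, e.g. the highest return
highest = max(points, key=lambda p: p[2])
print(len(points), highest)  # 3 (-89.34, 30.31, 13.1)
```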
Web services
Geospatial web services allow users to perform data discovery, data visualization,
and data access across the web. Web services are usually accessed by applications
based on user input such as zooming an online map or searching a data catalogue.
The most common protocols are the Web Map Service (WMS), which returns a
rendered map image, and the Web Feature Service (WFS), which typically returns
GML, mentioned earlier in this chapter. Many WFS services can also
return KML, JSON, zipped shapefiles, and other formats. These services are called
through HTTP GET requests. The following URL is an example of a WMS GET
request, which returns a map image of the world that is 640 pixels wide and 400
pixels tall and an EPSG code of 900913:
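A GetMap request of this kind can be assembled with the standard library. In this sketch, the server address and layer name are hypothetical, while the parameters mirror the request described above (a 640 x 400 world map in EPSG:900913):

```python
# Assemble a WMS GetMap request URL with the standard library
from urllib.parse import urlencode

base_url = "http://example.com/wms"  # hypothetical WMS endpoint
params = {
    "SERVICE": "WMS",
    "VERSION": "1.1.1",
    "REQUEST": "GetMap",
    "LAYERS": "world",  # hypothetical layer name
    "SRS": "EPSG:900913",
    "BBOX": "-20037508,-20037508,20037508,20037508",
    "WIDTH": "640",
    "HEIGHT": "400",
    "FORMAT": "image/png",
}
url = base_url + "?" + urlencode(params)
print(url)
# The resulting image could then be fetched with urllib.request.urlopen(url)
```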
Web services are rapidly evolving. The Open Geospatial Consortium is adding new
standards for sensor networks and other geospatial contexts. Representational State
Transfer (REST) services are also commonly used. REST services use simple URLs
to make the requesting of data very easy to implement in nearly any programming
language by tailoring URL parameters and their values accordingly. Nearly every
programming language has robust HTTP client libraries capable of using the REST
services. The REST services can return many types of data including images, XML,
or JSON. There is no overarching geospatial REST standard yet, but the OGC has
been working on one for quite some time. Esri has created a working implementation
that is currently widely used. The following URL is an example of an Esri geospatial
REST service that would return KML based on a weather radar image layer. You
can add this URL to Google Earth as a network link or you can download it as
compressed KML (KMZ) in a browser to import to another program:
You can nd tutorials on the myriad of OGC services here:
Summary
You now have the background needed to work with common types of geospatial
data. You also know the common traits of geospatial datasets, which will allow you
to evaluate unfamiliar types of data and identify the key elements that determine
which tools to use when interacting with that data. In Chapter 3, The Geospatial
Technology Landscape, we'll examine the modules and libraries available to work
with these datasets. As with all the code in this book, whenever possible, we'll use
only pure Python and the Python standard library.
The Geospatial Technology Landscape
The geospatial technology ecosystem consists of hundreds of software libraries and
packages. This vast array of choices is overwhelming for newcomers to geospatial
analysis. The secret to learning geospatial analysis quickly is to understand
the handful of libraries and packages that really matter. Most software, both
commercial and open source, is derived from these critical packages. Understanding
the ecosystem of geospatial software and how it's used allows you to quickly
comprehend and evaluate any geospatial tool.
Geospatial libraries can be assigned to one or more of the following high-level core
capabilities, which they implement to some degree:
Data access
Computational geometry (including data reprojection)
Metadata tools
Another important category is image processing for remote sensing; however, this
category is very fragmented, containing dozens of software packages which are
rarely integrated into derivative software. Most image processing software for remote
sensing is based on the same data access libraries with custom image processing
algorithms implemented on top of them. Take a look at the following examples of this
type of software, which include both open source and commercial packages:
Open Source Software Image Map (OSSIM)
Geographic Resources Analysis Support System (GRASS)
Orfeo ToolBox (OTB)
Data access libraries such as GDAL and OGR are mostly written in either C or C++ for
speed and cross-platform compatibility. Speed is important due to the commonly large
sizes of geospatial datasets. However, you will also see many packages written in Java.
When it's well written, pure Java can approach speeds acceptable for processing large
vector or raster datasets and is usually sufficient for most applications.
The following concept map shows the major geospatial software libraries and packages
and how they are related. The libraries in bold represent root libraries that are actively
maintained and not significantly derived from any other libraries. These root libraries
implement geospatial operations that are sufficiently difficult that the vast majority of
people choose to use one of them rather than create a competing library. As you can
see, a handful of libraries make up a disproportionate amount of geospatial analysis
software, and the following diagram is by no means exhaustive. In this book, we'll
discuss only the most commonly used packages:
The libraries GDAL, OGR, GEOS, and PROJ.4 are the heart and soul of the geospatial
analysis community on both the commercial and open source sides. It is important
to note that these libraries are all written in C or C++. There is also significant work
done in Java in the form of the GeoTools and JTS core libraries, which are used
across a range of desktop, server, and mobile software. Given that there are hundreds
of geospatial packages available and nearly all rely on these libraries to do
anything meaningful, you begin to get an idea of the complexity of geospatial data
access and computational geometry. Compare this software domain with that of text
editors, which return over 5,000 options when you search an open source project site.
Geospatial analysis is a truly worldwide community, with significant contributions
to the field coming from every corner of the globe. But as you learn more about
the heavy-hitting packages at the center of the software landscape, you'll see that
these programs tend to come from Canada or are contributed to heavily by Canadian
developers. Canada is credited as the birthplace of modern GIS, and geospatial
analysis there is a matter of national pride. Also, the Canadian government and the
public-private GeoConnections program have invested heavily in research and
companies, both to fuel the industry for economic reasons and out of necessity, to
better manage the country's vast natural resources and the needs of its population.
In this chapter, we examine the packages that have had the largest impact on
geospatial analysis and those that you are likely to encounter frequently.
However, as with any filtering of information, you are encouraged to do your
own research and draw your own conclusions. The following resources offer more
information on software not included in this chapter:
The Wikipedia list of GIS software
The OSGeo project list, incubator projects, and software database
Data access
As described in Chapter 2, Geospatial Data, geospatial datasets are typically large,
complex, and varied. This challenge makes libraries that efficiently read, and in
some cases write, this data essential to geospatial analysis. These libraries are
also the most important: without access to data, geospatial analysis doesn't begin.
Furthermore, accuracy and precision are key factors in geospatial analysis. An image
library that resamples data without permission, or a computational geometry library
that rounds a coordinate by even a couple of decimal places, can adversely affect the
quality of analysis. These libraries must also manage memory efficiently. A complex
geospatial process can last for hours or even days. If a data access library has a
memory fault, it can delay an entire project or even an entire workflow involving
dozens of people dependent on the output of that analysis.
GDAL
The Geospatial Data Abstraction Library (GDAL) does the heaviest lifting
in the geospatial industry. The GDAL website lists over 80 pieces of software
using the library, and this list is by no means complete. Many of these packages
are industry-leading open source and commercial tools. The list doesn't include
hundreds of smaller projects and individual analysts using the library for geospatial
analysis. GDAL also includes a set of command-line tools that can perform a variety
of operations without any programming.
A list of projects using GDAL can be found on the GDAL website.
GDAL provides a single, abstract data model for the vast array of raster data types
found in the geospatial industry. It consolidates unique data access libraries for
different formats and provides a common API for reading and writing data. Before
developer Frank Warmerdam created GDAL in the late 1990s, each data format
required a separate data access library with a different API to read data; worse,
developers often wrote custom data access routines from scratch.
The following diagram provides a visual description of how GDAL abstracts
raster data:
In the software concept map earlier in this chapter, you can see that GDAL has had
the greatest impact of any single piece of geospatial software. Combine GDAL with
its sister library OGR for vector data and the impact almost doubles. The PROJ.4
library has also had tremendous impact, but it is usually accessed via OGR or GDAL.
The GDAL homepage can be found online.
OGR
The OGR Simple Features Library is the vector data companion to GDAL. OGR
lists at least partial support for over 70 vector data formats. OGR originally stood for
OpenGIS Simple Features Reference Implementation; however, it never evolved into a
reference implementation of the Simple Features standard, even though the name stuck.
OGR serves the same purpose for vector data as GDAL does for raster data, and it
is almost equally prolific in the geospatial industry. Part of the success of the
GDAL/OGR package is its X11/MIT open source license, which is both
commercial and open source friendly. The GDAL/OGR library can be included in
proprietary software without revealing proprietary source code to users.
OGR has the following capabilities:
Uniform vector data and modeling abstraction
Vector data re-projection
Vector data format conversion
Attribute data filtering
Basic geometry filtering including clipping and point-in-polygon testing
Like GDAL, OGR has several command-line utility programs that demonstrate its
capabilities, which can also be accessed through its programming API. The
following diagram outlines the OGR architecture:
The OGR architecture is fairly concise, considering this model is able to
represent over 70 different data formats. The Geometry object represents the
OGC Simple Features Specification data model for points, linestrings, polygons,
geometry collections, multipolygons, multipoints, and multilinestrings. The Feature
Definition object contains the attribute definitions of a group of related features.
The Feature object ties the Geometry and Feature Definition information together.
The Spatial Reference object contains an OGC spatial reference definition. The
Layer object represents features grouped as layers within a data source. The Data
Source is the file or database object accessed by OGR. The Driver object contains the
translators for the 70-plus data formats available to OGR.
This architecture works smoothly with one minor quirk: the Layer concept is used
even for data formats that only contain a single layer. For example, shapefiles can
only represent a single layer, but when you access a shapefile using OGR, after
opening the data source, you must still invoke a new Layer object using the base
name of the shapefile without a file extension. This design quirk is only a minor
inconvenience, heavily outweighed by the power that OGR provides.
Computational geometry
Computational geometry encompasses the algorithms needed to perform operations
on vector data. The field is very old in computer science; however, most of the
libraries used for geospatial operations are separate from computer graphics libraries
because of geospatial coordinate systems. As described near the end of Chapter 1,
Learning Geospatial Analysis with Python, computer screen coordinates are almost
always expressed as positive numbers, while geospatial coordinate systems often use
negative numbers when moving west and south.
Several different geospatial libraries fit into this category but serve a wide range of
uses, from spatial selection to rendering. It should be noted that some features of the
OGR Simple Features Library described previously move it beyond the category of
data access and into the realm of computational geometry. But it was included in the
prior category because that is its primary purpose.
Computational geometry is a fascinating subject. While writing a simple script
to automate a geospatial operation, you inevitably need some spatial algorithm.
The question then arises: do you try to implement this algorithm yourself, or take
on the overhead of using a third-party library? The choice is always deceptive
because some tasks are visually easy to understand and implement, some look
complex but turn out to be easy, and some are trivial to comprehend but
extraordinarily difficult to implement. One such example is a geospatial buffer
operation: the concept is easy enough, but the algorithm turns out to be quite
difficult. The libraries in this section are the major packages for computational
geometry algorithms.
The PROJ.4 projection library
U.S. Geological Survey (USGS) analyst Jerry Evenden created what is now the
PROJ.4 projection library in the mid-1990s. Since then, it has become a project of
the Open Source Geospatial Foundation with contributions from many other
developers. PROJ.4 accomplishes the herculean task of transforming data among
thousands of coordinate systems. The math for converting points among so many
coordinate systems is extremely complex. No other library comes close to the
capability of PROJ.4. That fact, and the routine need for applications to convert
datasets from different sources to a common projection, makes PROJ.4 the
undisputed leader in this area.
The following plot is an example of how specific the projections supported by PROJ.4
can be. The image represents the Line/Station coordinate system of the California
Cooperative Oceanic Fisheries Investigations (CalCOFI) program, a pseudo-projection
used only by NOAA, the University of California Scripps Institution of Oceanography,
and the California Department of Fish and Wildlife to collect oceanographic and
fisheries data along the California coastline over the last 60 years:
PROJ.4 uses a simple syntax capable of describing any projection including custom,
localized ones as shown in the previous diagram. PROJ.4 can be found in virtually
every major GIS package, which provides reprojection support and has its own
command-line tools as well. It is available through both GDAL and OGR for vector
and raster data. However, it is often useful to access the library directly because
it gives you the ability to reproject individual points. Most of the libraries that
incorporate PROJ.4 only let you reproject entire datasets.
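As an example of that syntax, the PROJ.4 definition of the familiar WGS84 geographic coordinate system (EPSG:4326) fits on a single line:

```
+proj=longlat +datum=WGS84 +no_defs
```

Each +key=value pair sets one aspect of the coordinate system; a localized pseudo-projection such as CalCOFI's simply chains more parameters together.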
For more information on PROJ.4, visit the project's website.
CGAL
The Computational Geometry Algorithms Library (CGAL), originally released in
the late 1990s, is a robust and well-established open source computational geometry
library. It is not specifically designed for geospatial analysis but is commonly used in
the field.
CGAL is often referenced as a source of reliable geometry-processing algorithms.
The following image from the CGAL User and Reference Manual visualizes one of
its frequently cited algorithms, the polygon straight skeleton, which is needed to
accurately grow or shrink a polygon:
The Geospatial Technology Landscape
The straight skeleton algorithm is complex and important because shrinking or
growing a polygon isn't just a matter of making it bigger or smaller. The polygon
actually changes shape. As a polygon shrinks, non-adjacent edges collide and
eliminate connecting edges. As a polygon grows, adjacent edges separate and
new edges are formed to connect them. This process is key to buffering geospatial
polygons. The following image, also from the CGAL User and Reference Manual,
shows this effect using insets on the preceding polygon:
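The collide-and-eliminate behavior is easy to see in the simplest possible case. The following hypothetical sketch insets an axis-aligned rectangle; a general polygon requires the full straight skeleton, which is exactly what CGAL provides:

```python
def inset_rectangle(xmin, ymin, xmax, ymax, d):
    """Shrink an axis-aligned rectangle inward by distance d on every side.

    Returns None when the inset distance collapses the rectangle entirely,
    mirroring how a straight-skeleton inset can eliminate edges.
    """
    if xmax - xmin <= 2 * d or ymax - ymin <= 2 * d:
        return None  # opposite edges have collided
    return (xmin + d, ymin + d, xmax - d, ymax - d)

# A 10 x 4 rectangle survives a 1-unit inset but not a 2-unit inset
shrunk = inset_rectangle(0, 0, 10, 4, 1)      # (1, 1, 9, 3)
collapsed = inset_rectangle(0, 0, 10, 4, 2)   # None
```

For arbitrary polygons the short edges disappear first as the inset grows, which is why a general-purpose library is needed for buffering real geospatial data.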
CGAL can be found online at
JTS
The Java Topology Suite (JTS) is a geospatial computational geometry library
written in 100 percent pure Java. JTS separates itself from other computational
geometry libraries by implementing the Open Geospatial Consortium (OGC)
Simple Features Specification for SQL. Interestingly, other developers have ported
JTS to other languages including C++, Microsoft .NET, and even JavaScript.
JTS includes a fantastic test program called the JTS Test Builder, which provides
a GUI to test functions without setting up an entire program. One of the most
frustrating aspects of geospatial analysis concerns bizarre geometry shapes
that break algorithms that otherwise work most of the time. Another common issue
is unexpected results due to tiny errors in data such as polygons that intersect
themselves in very small areas that are not easily visible. The JTS Test Builder lets
you interactively test JTS algorithms to verify data or just visually understand a
process, as shown here:
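The self-intersection problem just described can also be detected programmatically. The following pure-Python sketch (not JTS itself) flags a ring whose non-adjacent edges cross, which is how a "bowtie" polygon fails validity:

```python
def _ccw(a, b, c):
    """Cross product sign: is the turn a -> b -> c counterclockwise?"""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly crosses segment p3-p4."""
    d1, d2 = _ccw(p3, p4, p1), _ccw(p3, p4, p2)
    d3, d4 = _ccw(p1, p2, p3), _ccw(p1, p2, p4)
    return ((d1 > 0) != (d2 > 0)) and ((d3 > 0) != (d4 > 0))

def ring_is_simple(ring):
    """Check a closed ring for crossings between non-adjacent edges."""
    n = len(ring)
    edges = [(ring[i], ring[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edges share a vertex
            if segments_cross(*edges[i], *edges[j]):
                return False
    return True

simple = ring_is_simple([(0, 0), (1, 0), (1, 1), (0, 1)])   # a square: True
broken = ring_is_simple([(0, 0), (2, 2), (2, 0), (0, 2)])   # bowtie crossing at (1, 1): False
```

JTS's own IsValidOp performs a far more complete check (duplicate points, ring orientation, nested holes), but the principle is the same.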
This tool is handy even if you aren't using JTS itself but one of its several ports to
another language. It should be noted that Vivid Solutions, the maintainer of JTS,
hasn't released a new version since JTS Version 1.8 in December 2006. The package is
nonetheless quite stable and still in active use. The JTS homepage is available at http://www.
GEOS
GEOS, which stands for Geometry Engine – Open Source, is the C++ port of the
JTS library described previously. It is mentioned here because this port has had
a much larger impact on geospatial analysis than the original JTS. The C++
version can be compiled on many platforms as it avoids any platform-specific
dependencies. Another factor responsible for the popularity of GEOS is that a fair
amount of infrastructure exists to create automated or semi-automated bindings to
various scripting languages, including Python. One more factor is that the majority of
geospatial analysis software is written in C or C++. The most common use of GEOS
is through the other APIs that include it.
GEOS provides the following capabilities:
OGC Simple Features
Geospatial predicate functions
Geospatial operations
Symmetric difference
Convex hull
Polygon assembly
Polygon validation
Spatial indexing
OGC well-known text (WKT) and well-known binary (WKB) input/output
C and C++ API
Thread safety
GEOS can be compiled with GDAL to give OGR all of its capability. GEOS can be
found online at
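The WKT input/output listed above is a simple text serialization. The helper below is a hypothetical, pure-Python sketch of writing a POLYGON in well-known text, the same representation GEOS (and Shapely, its best-known Python binding) reads and writes:

```python
def polygon_wkt(exterior):
    """Serialize a ring of (x, y) tuples as OGC well-known text.

    WKT polygon rings must be closed, so the first vertex is repeated
    at the end if the caller left the ring open.
    """
    ring = list(exterior)
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    coords = ", ".join("{:g} {:g}".format(x, y) for x, y in ring)
    return "POLYGON (({}))".format(coords)

wkt = polygon_wkt([(0, 0), (10, 0), (10, 10), (0, 10)])
# "POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))"
```

WKB carries the same structure in a compact binary encoding, which is what databases such as PostGIS store internally.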
PostGIS
As far as open source geospatial databases go, PostGIS is the most commonly
used spatial database. PostGIS is essentially a module on top of the well-known
PostgreSQL relational database. Much of the power of PostGIS comes from the GEOS
library mentioned earlier. Like JTS, it also implements the OGC Simple Features
Specification for SQL. This combination of computational geometry ability in a
geospatial context sets PostGIS in a category of its own.
PostGIS allows you to execute both attribute and spatial queries against a dataset.
Recall from Chapter 2, Geospatial Data, that a typical spatial dataset is composed
of multiple data types including geometry, attributes (one or more columns of data
in a row), and, in most cases, indexing data. In PostGIS, you can query attribute
data as you would any database table using SQL. This capability is not surprising,
as attribute data is stored in a traditional database structure. However, you can
also query geometry using SQL syntax. Spatial operations are available through SQL
functions, which you include as part of queries. The following sample PostGIS SQL
statement creates a 14.5 kilometer buffer around the state of Florida:
SELECT ST_Buffer(the_geom, 14500)
FROM usa_states
WHERE state = 'Florida'
The FROM clause designates the usa_states layer as the location of the query. We
filter that layer by isolating Florida in the WHERE clause. Florida is a value in the
state column of the usa_states layer. The SELECT clause performs the actual spatial
selection on the geometry of Florida, normally contained in the the_geom column,
using the PostGIS ST_Buffer() function. The column the_geom is the geometry
column for the PostGIS layer in this instance. The ST in the function name stands
for Spatial Type. The ST_Buffer() function accepts a column containing spatial
geometries and a distance in the map units of the underlying layer. The map units
of the usa_states layer are meters, so 14.5 km is expressed as 14,500 meters in the
preceding example. Recall from Chapter 1, Learning Geospatial Analysis with
Python, that buffers like this one are used for proximity analysis. It just so happens
that the State of Florida water boundary extends 9 nautical miles, or approximately
14.5 km, into the Gulf of Mexico from the state's western and northwestern coastlines.
The following image shows the official Florida state water boundary as a dotted line,
which is labeled on the map:
After applying the 9 nautical mile buffer, you can see in the following image that the
result, highlighted in orange, is quite close to the official legal boundary based on
detailed ground surveys:
The PostGIS website has an excellent interactive online
tutorial, which allows you to execute spatial queries against a
continental U.S. map and see the result immediately on a web
map. A solid written introductory PostGIS tutorial can be found at
the following URL:
Currently, PostGIS maintains the following feature set:
Geospatial geometry types including points, linestrings, polygons,
multipoints, multilinestrings, multipolygons, and geometry collections,
which can store different types of geometries including other collections
Spatial functions for testing geometric relationships (for example,
point-in-polygon or unions)
Spatial functions for deriving new geometries (for example, buffers
and intersects)
Spatial measurements including perimeter, length, and area
Spatial indexing using an R-Tree algorithm
A basic geospatial raster data type
Topology data types
US Geocoder based on TIGER census data
A new JSONB data type which allows indexing and querying of JSON
and GeoJSON
The PostGIS feature set is competitive among all geodatabases and the most
extensive among open source or free geodatabases. The active momentum of the
PostGIS development community is another reason this system is the best of its
breed. PostGIS is maintained at
Other spatially-enabled databases
PostGIS is the gold standard among free and open source geospatial databases.
However, there are several other systems that you should be aware of as a geospatial
analyst. This list includes both commercial and open source systems with varying
degrees of geospatial support.
Geodatabases have evolved in parallel with geospatial software, standards, and
the Web. The Internet has driven the need for large, multiuser geospatial database
servers able to serve large amounts of data. The following image shows how
geospatial architectures have evolved, with a significant portion of this evolution
happening at the database level:
Oracle Spatial and Graph
The Oracle relational database is a widely used database system typically used by
very large organizations because of its cost and large scalability. It is also extremely
stable and fast. It runs some of the largest and most complicated databases in the
world. It is often found in hospitals, banks, and government agencies managing
millions of critical records.
Geospatial data capability first appeared in Oracle Version 4 as a modification by
the Canadian Hydrographic Service (CHS). CHS also implemented Oracle's first
spatial index in the form of an unusual but efficient three-dimensional helical spiral.
Oracle subsequently incorporated the modification and released the Spatial Database
Option (SDO) in Version 7 of the main database. The SDO system became Oracle
Spatial in Oracle Version 8. The database schema of Oracle Spatial still has the SDO
prefix on some column and table names, similar to how PostGIS uses the OGC
convention ST (spatial type) to separate spatial information from traditional
relational database tables and functions at the schema level.
As of 2012, Oracle began calling the package Oracle Spatial and Graph to emphasize
the network data module. This module is used for analyzing networked datasets
such as transportation or utilities. However, the module can also be used against
abstract networks such as social networks. The analysis of social network data is a
common target for big data analysis, which is now a growing trend. Big data social
network analysis is likely the reason Oracle changed the name of the product.
As a spatial database, Oracle Spatial has the following capabilities:
A geospatial data schema
A spatial indexing system which is now based on an R-tree index
An SQL API for performing geometric operations
A spatial data tuning API to optimize a particular dataset
A topology data model
A network data model
A GeoRaster data type to store, index, query, and retrieve raster data
Three-dimensional data types including Triangulated Irregular Networks
(TINs) and LIDAR point clouds
A geocoder to search location names and return coordinates
A routing engine for driving direction-type queries
Open Geospatial Consortium compliance
Oracle Spatial and PostGIS are reasonably comparable and are both commonly used.
You will see these two systems sooner or later as data sources while performing
geospatial analysis.
Oracle Spatial and Graph is sold separately from Oracle itself.
A little-known fact is that the SDO data type is native to the
main Oracle database. If you have a simple application which
just inputs points and retrieves them, you can use the main
Oracle API to add, update, and retrieve SDOs without Oracle
Spatial and Graph.
The U.S. Bureau of Ocean Energy, Management, Regulation, and Enforcement
(BOEMRE) uses Oracle to manage environmental, business, and geospatial data for
billions of dollars' worth of oil, gas, and mineral rights in one of the largest geospatial
systems in the world. The following map is courtesy of BOEMRE:
Oracle Spatial and Graph can be found online at the following URL: http://www.
ArcSDE
ArcSDE is Esri's Spatial Database Engine (SDE). It is now rolled into Esri's ArcGIS
Server product after over a decade of being a standalone product. What makes
ArcSDE interesting is that the engine is mostly database independent, supporting
multiple database backends. ArcSDE supports IBM DB2, Informix, Microsoft SQL
Server, Oracle, and PostgreSQL as data storage systems. While ArcSDE has the ability
to create and manage a spatial schema from scratch on systems such as Microsoft
SQL Server and Oracle, it uses native spatial engines if they are available. This
arrangement is the case for IBM DB2, Oracle, and PostgreSQL. For Oracle, ArcSDE
manages the table structure but can rely on the Oracle SDO data type for feature storage.
Like the previously mentioned geodatabases, ArcSDE also has a rich spatial selection
API and can handle raster data. However, ArcSDE does not have as rich a SQL
spatial API as Oracle or PostGIS. Esri technically supports basic SQL functionality
related to ArcSDE, but it encourages users and developers to use Esri software or
programming APIs to manipulate data stored through ArcSDE, as it is designed to
be a data source for Esri software. Esri does provide software libraries for developers
to build applications outside of Esri software using ArcSDE or Esri's file-based
geodatabase, called a personal geodatabase. But these libraries are black boxes,
and the communication protocol ArcSDE uses has never been reverse engineered.
Typically, interaction between ArcSDE and third-party applications happens at the
web services level using the ArcGIS Server API, which supports OGC services to
some degree as well as a fairly straightforward REST API that returns GeoJSON.
The following screenshot is of a U.S. federal geospatial data catalog, a very large
catalog based on ArcSDE, which in turn networks U.S. federal data including other
ArcSDE installations from other federal agencies:
ArcSDE is integrated into ArcGIS Server; however, information on it remains at
Microsoft SQL Server
Microsoft added spatial data support to its flagship database product in Microsoft SQL
Server 2008. It has gradually improved since that version but is still nowhere near as
sophisticated as Oracle Spatial or PostGIS. Microsoft supports the same data types
as PostGIS with slightly different naming conventions, with the exception of rasters,
which are not directly supported. It also supports output to WKT and WKB formats.
It offers some very basic support for spatial selection, but this is obviously not a
priority for Microsoft at the moment. This limited support is likely because it is
sufficient for Microsoft's own mapping components, and several third-party engines
can provide spatial support on top of SQL Server.
Microsoft's support for spatial data in SQL Server is documented at the following link:
MySQL
MySQL, another highly popular free database, provides nearly the same support as
Microsoft SQL Server. The OGC geometry types are supported with basic spatial
relationship functions. Through a series of buyouts, MySQL has become the property
of Oracle. While Oracle currently remains committed to MySQL as an open source
database, this purchase has brought the ultimate future of the world's most popular
open source database into question. As far as geospatial analysis is concerned,
however, MySQL is barely a contender and unlikely to be the first choice for any project.
For more information on MySQL spatial support visit the following link:
SpatiaLite
SpatiaLite is an extension for the open source SQLite database engine. SQLite uses a
file-based database and is designed to be embedded into applications instead of
following the typical client-server model used by most relational database servers.
SQLite has spatial data types and spatial indexing already, but SpatiaLite adds
support for the OGC Simple Features Specification as well as map projections.
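Because SQLite ships with Python's standard library, the embedded model is easy to demonstrate. The sketch below uses plain sqlite3 with ordinary numeric columns; with the SpatiaLite extension loaded (typically via SQLite's extension-loading mechanism), these would become true OGC geometry columns with spatial functions:

```python
import sqlite3

# SQLite embeds the whole database engine inside the application process;
# SpatiaLite layers OGC geometry types and spatial functions on top of it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, lon REAL, lat REAL)")
conn.execute("INSERT INTO places VALUES ('New Orleans', -90.07, 29.95)")
row = conn.execute(
    "SELECT lon, lat FROM places WHERE name = 'New Orleans'").fetchone()
# row == (-90.07, 29.95)
```

The same single-file database can then be handed to any SpatiaLite-aware application, which is what makes the format attractive for exchanging small spatial datasets.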
SpatiaLite can be found at
Routing
Routing is a niche area of computational geometry. It is also a very rich field of
study that goes far beyond the familiar driving-directions use case. The requirements
for a routing algorithm are simply a networked dataset and impedance values that
affect the speed of travel on that network. Typically, the dataset is vector-based, but
raster data can also be used for certain applications. The two major contenders in this
area are Esri's Network Analyst and the open source pgRouting engine for PostGIS.
The most common routing problem is finding the most efficient way to visit a number
of point locations. This problem is called the travelling salesman problem (TSP). The
TSP is one of the most intensely studied problems in computational geometry, and it
is often considered the benchmark for any routing algorithm. More information on
the TSP can be found at
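At its core, every routing engine performs a shortest-path search over a weighted graph, where the weights are the impedance values mentioned above. The following minimal sketch uses Dijkstra's algorithm over a small hypothetical road network:

```python
import heapq

def dijkstra(graph, start, goal):
    """Find the cheapest path through a weighted network.

    Edge weights play the role of impedance values (travel time,
    distance, and so on).
    """
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

# A hypothetical road network: nodes are intersections, weights are minutes
roads = {
    "A": {"B": 2, "C": 5},
    "B": {"C": 1, "D": 4},
    "C": {"D": 1},
    "D": {},
}
cost, path = dijkstra(roads, "A", "D")  # 4 minutes via A -> B -> C -> D
```

Production engines such as Network Analyst and pgRouting add turn restrictions, one-way streets, and heuristics such as A*, but the underlying graph search is the same idea.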
Esri Network Analyst and Spatial Analyst
Esri's entry into the routing arena, Network Analyst, is a truly generic routing
engine, which can tackle most routing applications regardless of context. Spatial
Analyst is another Esri extension that is raster focused and can perform least cost
path analysis on raster terrain data.
The ArcGIS Network Analyst product page is located on Esri's website at
pgRouting
The pgRouting extension for PostGIS adds routing functionality to the geodatabase.
It is oriented towards road networks but can be adapted to work with other types of
networked data. The following image shows a driving-distance radius calculation
output by pgRouting and displayed in QGIS. The points are color-coded from
green to red based on proximity to the starting location. As shown in the following
diagram, the points are nodes in the network dataset, which in this case are roads:
The pgRouting PostGIS extension is maintained at
Desktop tools (including visualization)
Geospatial analysis requires the ability to visualize output in order to be complete.
This fact makes tools that can visualize data absolutely critical to the field. There
are two categories of geospatial visualization tools: geospatial viewers and
geospatial analysis software. The first category, geospatial viewers, allows you to
access, query, and visualize data but not edit it in any way. The second category
allows you to perform those tasks and edit data too. The main advantage of viewers
is that they are typically lightweight pieces of software that launch and load data
quickly. Geospatial analysis software requires far more resources to be able to edit
complex geospatial data, so it loads more slowly and often renders data more slowly
in order to provide dynamic editing functionality.
Quantum GIS
Quantum GIS, more commonly known as QGIS, is a complete open source
geographic information system. Of the two categories of visualization software,
QGIS falls well within the geospatial analysis category. The development of the
system began in 2002, and version 1.0 was released in 2009.
It is the best showcase of most of the libraries mentioned earlier in this chapter. QGIS
is written in C++ using the Qt library for the GUI. The GUI is well designed and
easy to use. In fact, a geospatial analyst trained on a proprietary package like Esri's
ArcGIS or Manifold System will be right at home using QGIS. The tools and menu
system are logical and typical of a GIS. The overall speed of QGIS is as good as or
better than that of any other system available.
A nice feature of QGIS is that the underlying libraries and utility programs are just
below the surface. Modules can be written by any third party in Python and added to
the system. QGIS also has a robust online package management system to search for,
install, and update these extensions. The Python integration includes a console that
allows you to issue commands at the console and see the results in the GUI. QGIS
isn't the only software to offer this capability.
Like most geospatial software packages with Python integration, QGIS installs a
complete version of Python if you use the automated installer. There's no reason to
worry if you already have Python installed. Having multiple versions of Python on
a single machine is fairly common and well supported. Many people have multiple
versions of Python on their computers for testing software or because Python is such
a common scripting environment for so many different software packages. When the
Python console is running in QGIS, the entire program API is available through an
automatically loaded object called qgis.utils.iface. The following screenshot
shows QGIS with the Python console running:
Because QGIS is based on GDAL/OGR and GEOS, and can use PostGIS, it supports
all of the data sources offered by those packages. It also has nice raster processing
features. QGIS works well for producing paper maps or entire map books using
available extensions.
QGIS is well documented through the QGIS website at the following link:
You can also nd numerous online and video tutorials by searching for QGIS and a
particular operation.
OpenEV
OpenEV is an open source geospatial viewer originally developed by Atlantis
Scientific around 2002; the company became Vexcel before a buyout by Microsoft.
Vexcel developed OpenEV as a freely downloadable satellite image viewer for the
Canadian Geospatial Data Infrastructure. It is built using GDAL and Python and is
partially maintained by GDAL creator Frank Warmerdam.
OpenEV is one of the fastest raster viewers available. Despite being originally
designed as a viewer, OpenEV offers all of the utility of GDAL/OGR and PROJ.4.
While created as a raster tool, it can overlay vector data such as shapefiles and
even supports basic editing. Raster images can also be altered using the built-in
raster calculator, and data formats can be converted, reprojected, and clipped. The
following screenshot shows a 25-megabyte, 16-bit integer GeoTIFF elevation file in
an OpenEV viewer window:
OpenEV is built largely in Python and offers a Python console with access to the full
capability of the program. The OpenEV GUI isn't as sophisticated as other tools like
QGIS. For example, you cannot drag and drop geospatial datasets into the viewer
like you can do in QGIS. But, the raw speed of OpenEV makes it very attractive for
simple raster viewing or basic processing and data conversion.
The OpenEV homepage is available at
GRASS
The Geographic Resources Analysis Support System (GRASS) is one of the oldest
continuously developed geospatial systems in existence. The U.S. Army Corps of
Engineers began GRASS development in 1982, and it was originally designed to
run on UNIX systems. In 1995, the Army released the last patch, and the software
was transferred to community development, where it has remained ever since.
Even though the user interface has been redesigned, GRASS still feels somewhat
esoteric to modern GIS users. However, because of its decades-old legacy and
non-existent price tag, many geospatial workflows and highly specialized modules
have been implemented in GRASS over the years, making it highly relevant to many
organizations and individuals, especially in research communities. For these reasons,
GRASS is still actively developed.
GRASS has also been integrated with QGIS, so the more modern and familiar QGIS
GUI can be used to run GRASS functions. GRASS is also deeply integrated with Python
and can be used as a library or command-line tool. The following screenshot shows
some landform analysis in the native GRASS GUI built using the wxPython library:
GRASS is housed online at
uDig
The program uDig is a Java-based GIS viewer. It is built on top of the Eclipse
platform originally created by IBM. Eclipse is designed as an integrated development
environment (IDE) for programmers, but there are a lot of interface similarities
between an IDE and a GIS, so the modification works quite well. The following
screenshot shows the core Eclipse IDE platform as typically used by programmers:
The following screenshot demonstrates the uDig GIS built on top of Eclipse:
uDig is designed primarily as a thick-client viewer for web services (for example,
WMS and WFS) and common data types. Because of the Eclipse platform,
the developers encourage third-party plugins or even full-blown application
development on top of uDig. The program does support more advanced analysis by
using the GRASS GIS program and its Java bindings called JGRASS.
The uDig homepage is located at
gvSIG
Another Java-based desktop GIS is gvSIG. The gvSIG project began in 2004 as part of
a larger project to migrate the IT systems of the Regional Ministry of Infrastructure
and Transport of Valencia, Spain, to free software. The result was gvSIG, which has
continued to mature. The feature set is mostly comparable to QGIS, with some unique
capabilities as well. The official gvSIG project has a very active fork called gvSIG
Community Edition (gvSIG CE). There is also a mobile version called gvSIG Mobile.
The gvSIG code base is open source. The official homepage for gvSIG is available at
OpenJUMP
OpenJUMP is another open source Java-based desktop GIS. JUMP stands for Java
Unified Mapping Platform and was originally created by Vivid Solutions for the
Government of British Columbia. After Vivid Solutions delivered JUMP, development
stopped. Vivid Solutions eventually released JUMP to the open source community,
where it was renamed OpenJUMP. OpenJUMP can read and write shapefiles and
OGC GML, and it supports PostGIS databases. It can also display some image
formats and data from OGC WMS and WFS services. It has a plugin architecture and
can serve as a development platform for custom applications. You can find out
more about OpenJUMP on the official web page at
Google Earth
Google Earth is so ubiquitous that it hardly seems worth mentioning. But as
you learn more about geospatial analysis, you'll discover that there is a lot of
misinformation surrounding Google Earth. The first release of EarthViewer 3D came
in 2001 from a company called Keyhole Inc. The EarthViewer 3D project was funded
by the non-profit venture capital firm In-Q-Tel, which in turn is funded by the U.S.
Central Intelligence Agency. This cloak-and-dagger spy agency lineage, and the
subsequent purchase of Keyhole by Google to create and distribute Google Earth,
brought global attention to the geospatial analysis field.
Since the rst release of the software as Google Earth in 2005, Google has continually
rened the software. Some of the notable additions are creating Google Moon,
Google Mars, Google Sky, and Google Oceans. These are virtual globe applications,
which feature data from the Moon and Mars with the exception of Google Oceans,
which adds sea-oor elevation mapping known as bathymetry to Google Earth.
Google also released a Google Earth browser plugin allowing a simplied version
of the globe in a browser. The plugin has a JavaScript API, which allows reasonably
sophisticated control over the data and position of the view for the plugin. However
this API is ending in December, 2015.
Google Earth introduced the spinning virtual globe concept for the exploration of
geographic data. After centuries of looking at 2D maps or low-resolution physical
globes, flying around the Earth virtually and dropping onto a street corner anywhere
in the world was mind-blowing, especially for geospatial analysts and other
geography enthusiasts, as depicted in the following screenshot of Google Earth
overlooking the New Orleans, Louisiana Central Business District:
Just as Google had revolutionized web mapping with its tile-based slippy mapping
approach, the virtual globe concept was a major boost to geospatial visualization.
After the initial excitement wore off, many geospatial analysts realized that Google
Earth was a very animated and fun geographic exploration tool, but it had very
limited utility for any kind of meaningful geospatial analysis. Google Earth falls
squarely into the realm of geospatial viewer software. The only data format it
consumes is its native Keyhole Markup Language (KML), which is an all-in-one
data and styling format discussed in Chapter 2, Geospatial Data. Even though this
format is now an OGC standard, consuming only one data format immediately limits
the utility of any tool. Any project involving Google Earth must first begin with
complete data conversion and styling in KML, reminiscent of geospatial analysis
from around 10-20 years ago. Furthermore, the tools that do support KML, including
Google Maps, support only a limited subset of KML.
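KML is plain XML, so the conversion step described above is straightforward to script. This hypothetical helper builds a minimal Placemark document; note that KML stores coordinate tuples in longitude,latitude order:

```python
from xml.sax.saxutils import escape

def kml_placemark(name, lon, lat):
    """Build a minimal KML document containing a single point Placemark."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Placemark>'
        f'<name>{escape(name)}</name>'
        f'<Point><coordinates>{lon},{lat}</coordinates></Point>'
        '</Placemark></kml>'
    )

doc = kml_placemark('New Orleans', -90.07, 29.95)
```

Styling (icons, colors, balloons) is added with further KML elements, which is where most of the conversion effort in a real project goes.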
Google Earth's native dataset has global coverage, but it is a mixture of datasets
spanning several years and sources. Google has greatly improved the inline
metadata in the tool, which identifies the source and approximate date of the current
view. But this method creates confusion among laypeople. Many people believe that
the data in Google Earth is updated far more frequently than it really is. The Google
Street View system, showing street-level, 360-degree views of much of the world,
has helped correct this misperception somewhat: people are able to easily identify
images of familiar locations as several years old. Another common misperception
created by Google Earth is that the entire world has been mapped in detail, and
therefore creating a base map for geospatial analysis should be trivial. As discussed
in Chapter 2, Geospatial Data, mapping an area of interest is far easier than it was a
few years ago thanks to modern data and software, but it is still a complex and
labor-intensive endeavor. This misperception is one of the first customer expectations
a geospatial analyst must manage when starting a project.
Despite these misperceptions, the impact Google has had on geospatial analysis is
almost entirely positive. For decades, one of the most difficult challenges to growing
the geospatial industry was convincing potential stakeholders that geospatial
analysis is almost always the best approach when making decisions about people,
resources, and the environment. This hurdle stands in sharp contrast to, say, selling
cars. When a potential customer comes to a car lot, the salesman doesn't have to
convince the buyer that they need a car, just about which type of car. Geospatial
analysts have had to first educate project sponsors on the technology and then
convince them that the geospatial approach is the best way to address a challenge.
Google has largely eliminated those steps for analysts.
Google Earth can be found online at
NASA World Wind
NASA World Wind is an open source virtual globe and geospatial viewer originally
released by the U.S. National Aeronautics and Space Administration (NASA) in
2004. It was originally based on Microsoft's .NET framework, making it a
Windows-centric application. The following screenshot of NASA World Wind
looks similar to Google Earth:
In 2007, a Java-based software development kit (SDK) called World Wind Java
was released, which made World Wind more cross-platform. The transition to Java
also led to the creation of a browser plugin for World Wind.
The World Wind Java SDK is considered an SDK and not a desktop application like
the .NET version. However, the demos included with the SDK provide a viewer
without any additional development. While NASA World Wind was originally
inspired by Google Earth, its status as an open source project takes it in an entirely
different direction. Google Earth is a generalist tool bounded by the limits of the
KML specification. NASA World Wind is a platform upon which anyone can
develop without limits. As new types of data become available and computing
resources grow, the virtual globe paradigm holds potential for geospatial
visualization that has not yet been explored.
NASA World Wind is available online at
ArcGIS
Esri walks the line between being one of the greatest promoters of the geospatial
analytical approach to understanding our world and being a privately held,
profit-making business, which must look out for its own interests to a certain degree.
The ArcGIS software suite represents every type of geospatial visualization known,
including vector, raster, globes, and 3D. It is also a market leader in many countries.
As described in the geospatial software map earlier in this chapter, Esri has
increasingly incorporated open source software into its suite of tools, including
GDAL for raster display and Python as the scripting language for ArcGIS.
The following screenshot shows the core ArcGIS application, ArcMap, with a marine
tracking density data analysis. The interface has a lot in common with QGIS, as
shown in this image:
The ArcGIS product page is available online at
Metadata management
Internet distribution of data has increased the importance of metadata. Data
custodians are able to release a dataset to the entire world for download without any
personal interaction. The metadata record of a geospatial dataset can follow it to
help ensure that the integrity of and accountability for that data are maintained.
Properly formatted metadata also allows for automated cataloguing, search indexing,
and integration of datasets. Metadata has become so important that a common mantra
within the geospatial community is "Data without metadata isn't data," meaning that a
geospatial dataset cannot be fully utilized and understood without metadata. The
following section lists some of the common metadata tools that are available. The
OGC standard for metadata management is the Catalog Service for the Web (CSW),
which creates a metadata-based catalog system and an API for distributing and
discovering datasets.
For an excellent example of a CSW and client built using Python's
pycsw library, see the Pacific Islands Ocean Observing System
(PacIOOS) catalog available at the following link:
GeoNetwork is an open source, Java-based catalog server to manage geospatial
data. It includes a metadata editor and search engine as well as an interactive web
map viewer. The system is designed to connect spatial data infrastructures globally.
It can publish metadata through the web using the metadata editing tools. It can
publish the spatial data as well through the embedded GeoServer map server. It has
user and group security permissions and web and desktop configuration utilities.
GeoNetwork can also be configured to harvest metadata from other catalogs at
scheduled intervals. The following screenshot is of the United Nations Food and
Agriculture Organization's implementation of GeoNetwork:
You can nd out more about GeoNetwork at
CatMDEdit is another Java-based metadata editor focused on geospatial data,
developed by the National Geographic Institute of Spain and several collaborators.
CatMDEdit can exchange metadata records using the XML and Resource Description
Framework (RDF) standards, style and transform metadata for viewing in different
formats, visualize geospatial data, and integrate with gvSIG, among many other
features. The CatMDEdit website is located at http://catmdedit.
In this chapter, you learned the hierarchy of geospatial analysis software. You
learned a framework for approaching the hundreds of existing geospatial software
packages and libraries by categorizing them into one or more major functions,
including data access, computational geometry, raster processing, visualization, and
metadata management.
We also examined the commonly used foundation libraries, including GDAL, OGR,
PROJ.4, and GEOS, found again and again in geospatial software. You can approach
any new piece of geospatial software, trace it back to these core libraries, and then
ask, "What is the value added?" to better understand the package. If the software isn't
using one of these libraries, you need to ask, "Why are these developers going against
the grain?" to understand what that system brings to the table.
Python was only mentioned a few times in this chapter to avoid any distraction
in understanding the geospatial software landscape. But, as we will see, Python is
interwoven into every single piece of software in this chapter and is a fully capable
geospatial tool in its own right. It is no coincidence that Python is the official
scripting language of ArcGIS, QGIS, GRASS, and many other packages. It is also not
by chance that GDAL, OGR, PROJ.4, CGAL, JTS, GEOS, and PostGIS all have Python
bindings. And as for the packages not mentioned here, they are all within Python's
grasp as well, through the Jython Java distribution, the IronPython .NET distribution,
Python's various database and web APIs, and the built-in ctypes module. As a
geospatial analyst, if there's one technology you can't afford to pass up, it's Python.
Geospatial Python Toolbox
The rst three chapters of this book covered the history of geospatial analysis, the
types of geospatial data used by analysts, and the major software and libraries found
within the geospatial industry. We used some simple Python examples here and
there to illustrate certain points, but we focused mainly on the eld of geospatial
analysis independent of any specic technology.
Starting here, we will be using Python to conquer geospatial analysis and we will
continue with that approach for the rest of the book. In this chapter, we'll discover
the Python libraries used to access the different types of data found in the Vector
data and Raster data sections of Chapter 2, Geospatial Data. Some of these libraries
are pure Python and some are bindings to the different software packages found in
Chapter 3, The Geospatial Technology Landscape.
We will examine pure Python solutions whenever possible. Python is a very capable
programming language, but some operations, particularly in remote sensing, are
too computationally intensive and therefore impractical in pure Python or other
interpreted languages. Fortunately, nearly every aspect of geospatial analysis can be
addressed in some way through Python, even if that means binding to a highly
efficient C, C++, or other compiled-language library.
We will avoid using broad scientific libraries which cover other domains beyond
geospatial analysis to keep solutions as simple as possible. There are many reasons
to use Python for geospatial analysis, but one of the strongest arguments is its
portability. Python is a ubiquitous programming language, officially available as a
compiled installation on over 20 platforms according to the website.
It comes as standard with most Linux distributions and is available on most major
smartphone operating systems as well. The Python source distribution usually
compiles on any platform supporting C.
Furthermore, Python has been ported to Java as the Jython distribution and to the
.NET Common Language Runtime (CLR) as IronPython. Python also has versions
such as Stackless Python for massively concurrent programs. There are versions of
Python designed to run on cluster computers for distributed processing. Python
is also available on many hosted application servers that do not allow you to install
custom executables, such as the Google App Engine platform, which has a Python
API. Modules written in pure Python using the standard library will almost always
run on any of the platforms that we just mentioned.
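As a quick illustration of this portability, the standard library can report which implementation and platform a script is running on; the calls below are standard library features and are not specific to any one interpreter:

```python
import platform
import sys

# These calls work the same on CPython, Jython, IronPython, and others
print(platform.python_implementation())   # for example, 'CPython'
print(platform.system())                  # for example, 'Linux' or 'Windows'
print(sys.version_info[:2])               # (major, minor) version tuple
```

A pure-Python script that only branches on values like these, rather than on compiled extensions, stays portable across every platform mentioned above.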
Each time you add a third-party module which relies on bindings to external libraries
in other languages, you reduce Python's inherent portability. You also add a layer
of complexity when changing the code, by bringing another language into the
mix. Pure Python keeps things simple. Also, Python bindings to external libraries tend
to be automatically or semi-automatically generated. These automatically generated
bindings are very generic and esoteric; they simply connect Python to a C or C++
API using the method names from that API, instead of following best practices for
Python. There are, of course, notable exceptions to this approach, driven by project
requirements which may include speed, unique library features, or frequently updated
libraries, where an automatically generated interface is preferable.
Installing third-party Python modules
We'll make a distinction between modules which are included as part of Python's
standard library and modules which must be installed. In Python, the words module
and library are used interchangeably. To install libraries, you either get them from
the Python Package Index (PyPI) or, in the case of a lot of geospatial modules, you
download a specialized installer. PyPI acts as the official software repository for
libraries and offers some easy-to-use setup programs, which simplify installing
packages. You can use the easy_install program, which is especially good on
Windows, or the pip program more commonly found on Linux and Unix systems.
Once easy_install is installed, you can install third-party packages simply by
running the following code:
easy_install <package name>
If you are using pip instead, you run the following code:
pip install <package name>
Links will be provided to installers and instructions for packages not available on PyPI.
You can manually install third-party Python modules by downloading the Python
source code and putting it in your current working directory, or you can put it in your
Python site-packages directory. These two directories are available in Python's
search path when you try to import a module. If you put a module in your current
working directory, it'll only be available when you start Python from that directory.
If you put it in your site-packages directory, it'll be available every time you start
Python. The site-packages directory is specifically meant for third-party modules.
To locate the site-packages directory for your installation, you ask Python's
sys module. The sys module has a path attribute that is a list of all directories in
Python's search path. The site-packages directory should be the last one, which
you can locate by specifying an index of -1, as shown in the following code:
>>> import sys
>>> sys.path[-1]
If that call doesn't return the site-packages path, just look at the entire list to locate
it, as shown in the following code:
>>> sys.path
['', 'C:\\WINDOWS\\system32\\', 'C:\\Python34\\DLLs',
'C:\\Python34\\lib', 'C:\\Python34\\lib\\plat-win',
'C:\\Python34\\lib\\lib-tk', 'C:\\Python34', ...]
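A more direct way to locate the directory, assuming a standard CPython installation, is the standard library's site module:

```python
import site

# getsitepackages() returns the site-packages directories as a list,
# without scanning sys.path manually
print(site.getsitepackages())
```

This avoids guessing at the position of site-packages within sys.path, which can vary between installations.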
These installation methods will be used for the rest of the book. You can find the
latest Python version, source code for your platform installation, and compilation
instructions at
The Python virtualenv module allows you to easily create an
isolated copy of Python for a specic project without affecting your
main Python installation or other projects. Using this module, you
can have different projects with different versions of the same library.
Once you have a working code base, you can then keep it isolated
from changes to the modules you used or even Python itself. The
virtualenv module is simple to use and can be used for any example
in this book; however, explicit instructions on its use are not included.
To get started with virtualenv, follow this simple guide:
Installing GDAL
The Geospatial Data Abstraction Library (GDAL), which includes OGR, is critical
to many of the examples in this book and is also one of the more complicated Python
setups. For these reasons, we'll discuss it separately here. The latest GDAL
bindings are available on PyPI; however, the installation requires a few more steps
because of the additional resources needed by the GDAL library.
There are three ways to install GDAL for use with Python:
Compile it from source code
Install it as part of a larger software package
Install a binary distribution and then Python bindings
If you have experience with compiling C libraries as well as the required compiler
software, then the first option gives you the most control. However, it is not
recommended if you just want to get going as quickly as possible, because even
experienced software developers can find compiling GDAL and the associated
Python bindings challenging. Instructions for compiling GDAL on leading platforms
can be found online. There are also basic build instructions on the PyPI GDAL page.
Have a look at https://pypi.
The second option is by far the quickest and the easiest. The Open Source
Geospatial Foundation (OSGeo) distributes an installer called OSGeo4W which
installs all of the top open source geospatial packages on Windows at the click of
a button. If you are on Linux, there is another package with distributions for both
Linux and Windows called FWTools. OSGeo4W can be found at http://trac.
FWTools is available online at
While these packages are the easiest to work with, they come with their own version
of Python. If you already have Python installed, then having another Python
distribution just to use certain libraries may be problematic. In that case, the third
option may be for you.
The third option installs a pre-compiled binary specific to your Python version. This
method is the best compromise between ease of installation and customization. The
catch is that you must make sure the binary distribution and the corresponding Python
bindings are compatible with each other, your Python version, and in many cases
your operating system configuration.
The installation of GDAL for Python on Windows becomes easier and easier each
year. To install GDAL on Windows, you must first check whether you are running
the 32-bit or 64-bit version of Python. To do so, just start your Python interpreter at a
command prompt, as shown in the following code:
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 6 2014, 22:15:05) [MSC v.1600
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
So, based on this instance, we see Python is version 3.4.2 for win32, which means it is
the 32-bit version. Once you have this information, go to the following URL:
This web page contains Python Windows binaries and bindings for nearly every
open source scientific library. On that web page, in the GDAL section, find the
release that matches your version of Python. The release names use the abbreviation
cp for CPython, followed by the major Python version number and either win32
for 32-bit Windows or win_amd64 for 64-bit Windows. In the previous example, we
would download the file named GDAL-1.11.3-cp34-none-win32.whl.
This download package is in the newer Python wheel format, which is installed
with pip. To install it, simply open a command prompt and type the following code:
pip install GDAL-1.11.3-cp34-none-win32.whl
Once the package is installed, open your Python interpreter and run the following
commands to verify that GDAL is installed by checking its version:
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 6 2014, 22:15:05) [MSC v.1600 32
bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from osgeo import gdal
>>> gdal.__version__
'1.11.3'
GDAL should return its version as 1.11.3.
If you have trouble installing modules using easy_install or pip
and PyPI, try downloading and installing the wheel package from
the same site as in the GDAL example.
GDAL installation on Linux varies widely by distribution. The following
binaries web page lists installation instructions for several distributions:
Typically, your package manager will install both GDAL and Python bindings. For
example, on Ubuntu, to install GDAL you run the following code:
sudo apt-get install gdal-bin
Then to install the Python bindings, you run the following command:
sudo apt-get install python-gdal
Also, most Linux distributions are set up to compile software already and
their instructions are much simpler than those on Windows. Depending on the
installation, you may have to import gdal and ogr as part of the osgeo package as
shown in the following command:
>>> from osgeo import gdal
>>> from osgeo import ogr
Mac OS X
To install GDAL on OS X, you can also use the Homebrew package management
system available at
Alternatively, you can use the MacPorts package management system available at
Both of these systems are well documented and contain GDAL packages for Python
3. You really only need them for libraries that require a properly compiled C binary
with many dependencies, which includes many of the scientific and
geospatial libraries.
Python networking libraries for acquiring data
The vast majority of geospatial data sharing is accomplished via the Internet. And
Python is well equipped when it comes to networking libraries for almost any
protocol. Automated data downloads are often an important step in automating a
geospatial process. Data is typically retrieved from a website Uniform Resource
Locator (URL) or a File Transfer Protocol (FTP) server. And because geospatial
datasets often contain multiple files, data is often distributed as ZIP files.
A nice feature of Python is its concept of a file-like object. Most Python libraries
which read and write data use a standard set of methods which allow you to access
data from all different types of resources as if you were writing a simple file on disk.
The networking modules in the Python standard library use this convention as well.
The benefit of this approach is that it allows you to pass file-like objects to other
libraries and methods which recognize the convention, without a lot of setup for
different types of data distributed in different ways.
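The convention is easy to see with the standard library's io module: an in-memory buffer supports the same read methods as a file opened from disk, so either one can be passed to code expecting a file-like object:

```python
import io

# An in-memory text buffer that behaves like an open file
buf = io.StringIO("id,lat,lon\n1,30.17,-89.69\n")

print(buf.readline().strip())   # reads the header line, like a real file
for line in buf:                # iterates over the remaining lines
    print(line.strip())
```

Any function written against readline() and iteration will accept this buffer, a local file, or a network response interchangeably.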
The Python urllib module
The Python urllib package is designed for simple access to any file with a URL
address. The urllib package in Python 3 consists of several modules which handle
different parts of managing web requests and responses. These modules implement
some of Python's file-like object conventions, starting with the urlopen() method.
When you call urlopen(), it prepares a connection to the resource but does not access
any data. Sometimes, you just want to grab a file and save it to disk instead of
accessing it in memory. This functionality is available through the
urllib.request.urlretrieve() function.
The following example uses the urllib.request.urlretrieve() function to download
the zipped shapefile named, which is used in other examples. We
define the URL and the local file name as variables. The URL is passed as an
argument, as well as the filename we want to use to save it to our local machine,
which in this case is just
>>> import urllib.request
>>> import urllib.parse
>>> import urllib.error
>>> url = "
>>> fileName = ""
>>> urllib.request.urlretrieve(url, fileName)
('', <http.client.HTTPMessage object at 0x00CAD378>)
The message from the underlying http.client module confirms that the file was
downloaded to the current directory. The URL and filename could have been passed
to the urlretrieve() function directly as strings as well. If you specify just the
filename, the download saves to the current working directory. You can also specify
a fully qualified pathname to save it somewhere else. You can also specify a callback
function as a third argument, which will receive download status information for the
file, so you can create a simple download status indicator or perform some other action.
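A minimal sketch of such a callback follows; the function name and the percentage format are our own invention, but the three arguments (block count, block size, and total size) are what urlretrieve() passes to its report hook:

```python
def download_progress(block_count, block_size, total_size):
    """Report hook for urllib.request.urlretrieve(): prints percent done."""
    downloaded = block_count * block_size
    if total_size > 0:
        percent = min(100, int(downloaded * 100 / total_size))
        print("{}% complete".format(percent))
    return downloaded

# Usage (with url and fileName defined as in the example above):
# urllib.request.urlretrieve(url, fileName, download_progress)
```

The hook is called once before the transfer starts and again after each block, so the printed percentage climbs as the download proceeds.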
The urllib.request.urlopen() method allows you to access an online resource
with more precision and control. As mentioned previously, it implements most of
the Python file-like object methods with the exception of the seek() method, which
allows you to jump to arbitrary locations within a file. You can read a file online
one line at a time, read all lines as a list, read a specified number of bytes, or iterate
through each line of the file. All of these functions are performed in memory, so you
don't have to store the data on disk. This ability is useful for accessing frequently
updated data online, which you may want to process without saving to disk.
In the following example, we demonstrate this concept by accessing the United
States Geological Survey (USGS) earthquake feed to view all of the earthquakes
in the world which have occurred within the last hour. This data is distributed
as a Comma-Separated Values (CSV) file, which we can read line by line like a
text file. CSV files are similar to spreadsheets and can be opened in a text editor
or spreadsheet program. First, you'll open the URL and read the header with the
column names in the file, and then you'll read the first line containing a record of a
recent earthquake, as shown in the following lines of code:
>>> url = "
>>> earthquakes = urllib.request.urlopen(url)
>>> earthquakes.readline()
>>> earthquakes.readline()
ak10739050,2013-06-14T14:39:09.442Z,"3km E of Fairbanks,
We can also iterate through this file, which is a memory-efficient way to read through
large files. If you are running this example in the Python interpreter as shown next,
you will need to press the Enter or Return key twice to execute the loop. This action is
necessary because it signals to the interpreter that you are done building the loop. In
the following example, we abbreviate the output:
>>> for record in earthquakes: print(record)
ak10739046,2013-06-14T14:37:02.318Z,"13km ESE of Glennallen,
72008115,2013-06-14T13:53:11.592Z,"6km NW of The Geysers,
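Since each record is comma-separated, the standard library's csv module can split the fields for you. The following sketch uses a small synthetic sample laid out like the feed (the real feed has many more columns), rather than reading from the network:

```python
import csv
import io

# A tiny synthetic sample in the feed's comma-separated layout
sample = ("time,latitude,longitude,mag\n"
          "2013-06-14T14:39:09.442Z,64.86,-147.55,1.4\n")

reader = csv.reader(io.StringIO(sample))
header = next(reader)                  # the column names
for row in reader:
    record = dict(zip(header, row))    # map each value to its column
    print(record["latitude"], record["mag"])
```

Because urlopen() returns a file-like object, the same reader pattern works on the live feed once the response lines are decoded to text.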
FTP allows you to browse an online directory and download data using FTP client
software. Until around 2004, when geospatial web services became very common,
FTP was one of the most common ways to distribute geospatial data. FTP is less
common now, but you occasionally encounter it when you're searching for data.
Once again, Python's "batteries included" standard library has a reasonable FTP
module called ftplib, with a main class called FTP().
In the following example, we will access an FTP server hosted by the U.S. National
Oceanic and Atmospheric Administration (NOAA) to access a text file containing
data from the Deep-ocean Assessment and Reporting of Tsunamis (DART) buoy
network used to watch for tsunamis around the world. This particular buoy is off the
coast of Peru. We'll define the server and the directory path. Then we will access the
server. All FTP servers require a username and password. Most public servers have a
user called anonymous with the password anonymous, as this one does. Using Python's
ftplib, you can just call the login() method without any arguments to log in as the
default anonymous user. Otherwise, you can add the username and password as
string arguments. Once we're logged in, we'll change to the directory containing the
DART data file. To download the file, we open up a local file called out and pass its
write() method as a callback function to the ftplib.FTP.retrbinary() method,
which simultaneously downloads the file and writes it to our local file.
Once the le is downloaded, we can close it to save it. Then we'll read the le and
look for the line containing the latitude and longitude of the buoy to make sure that
the data was downloaded successfully, as shown in the following lines of code:
>>> import ftplib
>>> server = ""
>>> dir = "hazards/DART/20070815_peru"
>>> fileName = "21415_from_20070727_08_55_15_tides.txt"
>>> ftp = ftplib.FTP(server)
>>> ftp.login()
'230 Login successful.'
>>> ftp.cwd(dir)
'250 Directory successfully changed.'
>>> out = open(fileName, "wb")
>>> ftp.retrbinary("RETR " + fileName, out.write)
'226 Transfer complete.'
>>> out.close()
>>> dart = open(fileName)
>>> for line in dart:
... if "LAT," in line:
... print(line)
... break
LAT, LON 50.1663 171.8360
In this example, we opened the local file in binary write mode, and we used the
retrbinary() ftplib method as opposed to retrlines(), which uses ASCII mode.
Binary mode works for both ASCII and binary files, so it's always a safer bet. In fact,
in Python, the binary read and write modes for a file are only required on Windows.
If you are just downloading a simple file from an FTP server, many FTP servers have
a web interface as well. In that case, you could use urllib to read the file. FTP URLs
use the following format to access data:
This format is insecure for password-protected directories because you are
transmitting your login information over the Internet. But for anonymous FTP servers,
there is no additional security risk. To demonstrate this, the following example
accesses the same file that we just saw, but by using urllib instead of ftplib:
>>> dart = urllib.request.urlopen("ftp://" + server + "/" + dir +
"/" + fileName)
>>> for line in dart:
... line = str(line, encoding="utf8")
... if "LAT," in line:
... print(line)
... break
LAT, LON 50.1663 171.8360
ZIP and TAR les
Geospatial datasets often consist of multiple les. For this reason, they are often
distributed as ZIP or TAR le archives. These formats can also compress data,
but their ability to bundle multiple les is the primary reason they are used for
geospatial data. While the TAR format doesn't contain a compression algorithm,
it incorporates the gzip compression and offers it as a program option. Python
has standard modules for reading and writing both ZIP and TAR archives. These
modules are called zipfile and tarfile respectively.
The following example extracts the hancock.shp, hancock.shx, and hancock.dbf
files contained in the file we downloaded using urllib in the
previous examples. This example assumes that the ZIP file is in the current directory:
>>> import zipfile
>>> zip = open("", "rb")
>>> zipShape = zipfile.ZipFile(zip)
>>> shpName, shxName, dbfName = zipShape.namelist()
>>> shpFile = open(shpName, "wb")
>>> shxFile = open(shxName, "wb")
>>> dbfFile = open(dbfName, "wb")
>>> shpFile.write(
>>> shxFile.write(
>>> dbfFile.write(
>>> shpFile.close()
>>> shxFile.close()
>>> dbfFile.close()
This example is more verbose than necessary for clarity. We can shorten this example
and make it more robust by using a for loop around the ZipFile.namelist()
method without explicitly defining the different files as variables. This method is
a more flexible and Pythonic approach, which could be used on ZIP archives with
unknown contents, as shown in the following lines of code:
>>> import zipfile
>>> zip = open("", "rb")
>>> zipShape = zipfile.ZipFile(zip)
>>> for fileName in zipShape.namelist():
... out = open(fileName, "wb")
... out.write(
... out.close()
Now that you understand the basics of the zipfile module, let's take the files
we just unzipped and create a TAR archive with them. In this example, when we
open the TAR archive for writing, we specify the write mode as w:gz for gzipped
compression. We also specify the file extension as tar.gz to reflect this mode, as
shown in the following lines of code:
>>> import tarfile
>>> tar ="hancock.tar.gz", "w:gz")
>>> tar.add("hancock.shp")
>>> tar.add("hancock.shx")
>>> tar.add("hancock.dbf")
>>> tar.close()
We can extract the les using the simple tarfile.extractall() method. First, we
open the le using the method, and then extract it, as shown in the
following lines of code:
>>> tar ="hancock.tar.gz", "r:gz")
>>> tar.extractall()
>>> tar.close()
We'll work through one more example by combining elements we've learned in this
chapter as well as elements from the Vector data section of Chapter 2, Geospatial Data.
We'll read the bounding box coordinates from the file without ever
saving it to disk. We'll use the power of Python's file-like object convention to pass
around the data. Then, we'll use Python's struct module to read the bounding box
as we did in Chapter 2. In this case, we read the unzipped .shp file into a variable
and access the data using Python array slicing by specifying the starting and ending
indexes of the data separated by a colon (:). We are able to use slicing because
Python allows you to treat strings as sequences. In this example, we also use the
BytesIO class from Python's io module to temporarily store data in memory in a
file-like object that implements all of the methods, including the seek() method,
which is absent from most Python networking modules, as shown in the following
lines of code:
>>> import urllib.request
>>> import urllib.parse
>>> import urllib.error
>>> import zipfile
>>> import io
>>> import struct
>>> url =
>>> cloudshape = urllib.request.urlopen(url)
>>> memoryshape = io.BytesIO(
>>> zipshape = zipfile.ZipFile(memoryshape)
>>> cloudshp ="hancock.shp")
# Access Python string as an array
>>> struct.unpack("<dddd", cloudshp[36:68])
(-89.6904544701547, 30.173943486533133, -89.32227546981174,
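The struct arithmetic above can be verified offline by packing four known doubles and reading them back; the values below are rounded stand-ins, not the real shapefile's bounding box:

```python
import struct

# Four little-endian doubles occupy 32 bytes, the same layout as the
# bounding box stored at bytes 36-68 of a shapefile header
bbox = struct.pack("<dddd", -89.69, 30.17, -89.32, 30.65)
print(len(bbox))                        # 32 bytes
print(struct.unpack("<dddd", bbox))     # round-trips to the same values
```

The slice cloudshp[36:68] in the example above isolates exactly such a 32-byte region, which is why a single unpack call recovers all four coordinates.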
As you can see from the examples so far, Python's standard library packs a lot of
punch. Most of the time, you don't have to download a third-party library just to
access a file online.
Python markup and tag-based parsers
Tag-based data, particularly different XML dialects, has become a very popular
way to distribute geospatial data. Formats that are both machine and human
readable are generally easy to work with, though they sacrifice storage efficiency for
usability. These formats can become unmanageable for very large datasets but work
very well in most cases.
While most formats are some form of XML (such as KML or GML), there is a notable
exception. The well-known text (WKT) format is fairly common but uses keywords
and square brackets ([]) to surround data, instead of the angle-bracketed tags that
XML uses.
Python has standard library support for XML, as well as some excellent third-party
libraries available. Proper XML formats all follow the same structure, so you can
use a generic XML library to read them. Because XML is text-based, it is often easy to
write it as a string instead of using an XML library. The vast majority of applications
which output XML do so in this way. The primary advantage of using XML libraries
for writing XML is that your output is usually validated. It is very easy to create an
error while creating your own XML format. A single missing quotation mark can
derail an XML parser and throw an error for somebody trying to read your data.
When these errors happen, they virtually render your dataset useless. You will find
that this problem is very common among XML-based geospatial data. What you'll
discover is that some parsers are more forgiving with incorrect XML than others.
Often, reliability is more important than speed or memory efficiency. An analysis
available online provides benchmarks for memory usage
and speed among the different Python XML parsers.
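As a sketch of how little it takes to break a strict parser, the mismatched tag below makes minidom raise an exception rather than return partial data:

```python
from xml.dom import minidom

bad_xml = "<Placemark><name>Test</Placemark>"   # <name> is never closed

try:
    minidom.parseString(bad_xml)
except Exception as err:
    # Strict parsers refuse the whole document on a single error
    print("parse failed:", err)
```

A single unclosed tag is enough to make the entire document unreadable, which is exactly the failure mode hand-written XML tends to produce.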
The minidom module
The Python minidom module is a very old and simple-to-use XML parser. It is a part
of Python's built-in set of XML tools in the xml package. It can parse XML files or
XML fed in as a string. The minidom module is best for small to medium-sized XML
documents of less than about 20 megabytes, before speed begins to decrease.
To demonstrate the minidom module, we'll use a sample KML file which is a part
of Google's KML documentation that you can download. The data available at the
following link represents time-stamped point locations transferred from a GPS device:
First, we'll parse this data by reading it in from the file and creating a minidom
parser object. The file contains a series of <Placemark> tags, which contain a point
and a timestamp at which that point was collected. So, we'll get a list of all of the
Placemarks in the file, and we can count them by checking the length of that list, as
shown in the following lines of code:
>>> from xml.dom import minidom
>>> kml = minidom.parse("time-stamp-point.kml")
>>> Placemarks = kml.getElementsByTagName("Placemark")
>>> len(Placemarks)
361
As you can see, we retrieved all of the Placemarks, which totaled 361. Now, let's take a
look at the first Placemark element in the list:
>>> Placemarks[0]
<DOM Element: Placemark at 0x2045a30>
Each <Placemark> tag is now a DOM Element data type. To really see what that
element is, we call the toxml() method, as you can see in the following lines of code:
>>> Placemarks[0].toxml()
u'<Placemark>\n <TimeStamp>\n <when>2007-01-14T21:05:02Z</when>\n
</TimeStamp>\n <styleUrl>#paddle-a</styleUrl>\n <Point>\n
</Point>\n </Placemark>'
The toxml() function outputs everything contained in the Placemark tag as a
string object. If we want to print this information to a text file, we can call the
toprettyxml() method instead, which adds additional indentation to make the
XML more readable.
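The difference between the two methods is easy to see on a small synthetic element (not the KML file from the text):

```python
from xml.dom import minidom

doc = minidom.parseString("<Placemark><name>A</name></Placemark>")
print(doc.documentElement.toxml())                    # compact, one line
print(doc.documentElement.toprettyxml(indent="  "))   # indented copy
```

toxml() reproduces the markup exactly as stored, while toprettyxml() inserts line breaks and the indentation string you pass it.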
Now what if we want to grab just the coordinates from this Placemark? The
coordinates are buried inside the coordinates tag, which is contained in the Point
tag and nested inside the Placemark tag. Each element of a minidom object is called a
node. Nested nodes are called children or child nodes. The child nodes include more
than just tags. They can also include whitespace separating tags as well as the data
inside tags. So, we can drill down to the coordinates tag using the tag name, but
then we'll need to access the data node. All the minidom elements have a childNodes
list as well as a firstChild() method to access the rst node. We'll combine these
methods to get to the data attribute of the rst coordinate's data node, which we
reference using index 0 in the list of coordinates tags:
>>> coordinates = Placemarks[0].getElementsByTagName("coordinates")
>>> point = coordinates[0]
>>> point
u'-122.536226,37.86047,0'
If you're new to Python, you'll notice that the text output in these examples is tagged
with the letter u. This markup is how Python 2 denotes Unicode strings, which support
internationalization into multiple languages with different character sets. Python 3
changes this convention slightly: it leaves Unicode (str) strings unmarked and
marks byte strings with a b.
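A quick check makes the Python 3 convention concrete: text literals are plain str values with no prefix, while encoding them produces bytes objects that display with the b prefix:

```python
# In Python 3, text is str; encoding it yields a bytes object.
point = "-122.536226,37.86047,0"
encoded = point.encode("utf-8")
print(type(point).__name__)    # str
print(type(encoded).__name__)  # bytes
```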
We can go a little further and convert this point string into usable data by splitting
the string and converting the resulting strings to Python float types, as shown here:
>>> x,y,z = point.split(",")
>>> x
u'-122.536226'
>>> y
u'37.86047'
>>> z
u'0'
>>> x = float(x)
>>> y = float(y)
>>> z = float(z)
>>> x,y,z
(-122.536226, 37.86047, 0.0)
Using a Python list comprehension, we can perform this operation in a single step, as
you can see in the following lines of code:
>>> x,y,z = [float(c) for c in point.split(",")]
>>> x,y,z
(-122.536226, 37.86047, 0.0)
Geospatial Python Toolbox
[ 130 ]
This example scratches the surface of what the minidom library can do. For a great
tutorial on this library, have a look at the 9.3. Parsing XML section of the excellent
book Dive Into Python, by Mark Pilgrim, which is available in print or online.
The minidom module is pure Python, easy to work with, and has been around since
Python 2.0. However, Python 2.5 added a more efficient yet high-level XML parser to
the standard library called ElementTree. ElementTree is interesting because it has
been implemented in multiple versions. There is a pure Python version and a faster
version written in C called cElementTree. You should use cElementTree wherever
possible, but you may be on a platform that doesn't include the
C-based version. When you import cElementTree, you can test whether it's available
and fall back to the pure Python version if necessary:
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
One of the great features of ElementTree is its implementation of a subset of the
XPath query language. XPath is short for XML Path and allows you to search an
XML document using a path-style syntax. If you work with XML frequently,
learning XPath is essential.
One catch with this feature is that if the document specifies a namespace, as most XML
documents do, you must insert that namespace into your queries. ElementTree does not
automatically handle the namespace for you. Your options are to manually specify it
or try to extract it using string parsing from the root element's tag name.
We'll repeat the minidom XML parsing example using ElementTree. First, we'll parse
the document and then we'll manually define the KML namespace; later, we'll use an
XPath expression and the find() method to find the first Placemark element. Finally,
we'll find the coordinates child node and then grab the text containing
the latitude and longitude. In both cases, we could have searched directly for the
coordinates tag. But by grabbing the Placemark element, it gives us the option of
grabbing the corresponding timestamp child element later, if we choose, as shown in
the following lines of code:
>>> tree = ET.ElementTree(file="time-stamp-point.kml")
>>> ns = "{}"
>>> placemark = tree.find(".//%sPlacemark" % ns)
>>> coordinates = placemark.find("./%sPoint/%scoordinates" % (ns, ns))
>>> coordinates.text
'-122.536226,37.86047,0'
In this example, notice that we used the Python string formatting syntax, which
is based on the string formatting concept found in C. When we defined the XPath
expression for the placemark variable, we used the %s placeholder to specify the
insertion of a string. Then after the string, we use the % operator followed by a
variable name to insert the ns namespace variable where the placeholder is. In the
coordinates variable, we use the ns variable twice, so we specify a tuple containing
ns twice after the string.
String formatting is a simple yet extremely powerful and useful tool
in Python, which is worth learning. You can find more information
in Python's documentation online.
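As a small side-by-side sketch, the older %s placeholder style and the newer str.format() method build exactly the same XPath expression:

```python
# Both styles insert the namespace string into the XPath expression.
ns = "{}"
old_style = ".//%sPlacemark" % ns
new_style = ".//{0}Placemark".format(ns)
print(old_style == new_style)  # True
```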
Building XML
Most of the time, XML can be built by concatenating strings, as you can see in the
following command:
xml = '<?xml version="1.0" encoding="utf-8"?>'
xml += '<kml xmlns="">'
xml += '  <Placemark>'
xml += '    <name>Office</name>'
xml += '    <description>Office Building</description>'
xml += '    <Point>'
xml += '      <coordinates>'
xml += '        -122.087461,37.422069'
xml += '      </coordinates>'
xml += '    </Point>'
xml += '  </Placemark>'
xml += '</kml>'
But, this method can be quite prone to typos, which creates invalid XML documents.
A safer way is to use an XML library. Let's build this simple KML document using
ElementTree. We'll dene the root KML element and assign it a namespace. Then,
we'll systematically append subelements to the root, and nally wrap the elements
as an ElementTree object, declare the XML encoding, and write it out to a le called
placemark.xml, as shown in the following lines of code:
>>> root = ET.Element("kml")
>>> root.attrib["xmlns"] = ""
>>> placemark = ET.SubElement(root, "Placemark")
>>> office = ET.SubElement(placemark, "name")
>>> office.text = "Office"
>>> point = ET.SubElement(placemark, "Point")
>>> coordinates = ET.SubElement(point, "coordinates")
>>> coordinates.text = "-122.087461,37.422069"
>>> tree = ET.ElementTree(root)
>>> tree.write("placemark.kml", xml_declaration=True, encoding="utf-8")
The output is nearly identical to the previous string building example, except that
ElementTree does not indent the tags but rather writes the document as one long string. The
minidom module has a similar interface, which is documented in the book Dive Into
Python, by Mark Pilgrim, referenced in the minidom example that we just saw.
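For comparison, here is a hedged sketch of the same Placemark built with minidom's document interface; the method names differ, but the append pattern is the same:

```python
from xml.dom import minidom

# Build the document node by node, then serialize it.
doc = minidom.Document()
kml = doc.createElement("kml")
kml.setAttribute("xmlns", "")
doc.appendChild(kml)
placemark = doc.createElement("Placemark")
kml.appendChild(placemark)
name = doc.createElement("name")
name.appendChild(doc.createTextNode("Office"))
placemark.appendChild(name)
print(doc.toxml())
```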
XML parsers such as minidom and ElementTree work very well on perfectly
formatted XML documents. Unfortunately, the vast majority of XML documents out
there don't follow the rules and contain formatting errors or invalid characters. You'll
find that you are often forced to work with this data and must resort to extraordinary
string parsing techniques to get the small subset of data you actually need. But
thanks to Python and BeautifulSoup, you can elegantly work with bad and even
terrible, tag-based data.
BeautifulSoup is a module specifically designed to robustly handle broken XML.
It is oriented towards HTML, which is notorious for incorrect formatting but works
with other XML dialects too. BeautifulSoup is available on PyPI, so use either
easy_install or pip to install it, as you can see in the following command:
easy_install beautifulsoup4
Or you can execute the following command:
pip install beautifulsoup4
Then, to use it, you simply import it:
>>> from bs4 import BeautifulSoup
To try it out, we'll use a GPS Exchange Format (GPX) tracking file from a
smartphone application, which has a glitch and exports slightly broken data.
You can download this sample file.
This 2,347-line data file is in pristine condition except that it is missing a closing
</trkseg> tag, which should be located at the very end of the file just before the
closing </trk> tag. This error was caused by a data export function in the source
program. This defect is most likely a result of the original developer manually
generating the GPX XML on export and forgetting the line of code that adds this
closing tag. Watch what happens if we try to parse this file with minidom:
>>> gpx = minidom.parse("broken_data.gpx")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\xml\dom\", line 1914, in parse
    return expatbuilder.parse(file)
  File "C:\Python34\lib\xml\dom\", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python34\lib\xml\dom\", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 2346, column 2
As you can see from the last line in the error message, the underlying XML parser
in minidom knows exactly what the problem is: a mismatched tag right at the end
of the file. But it refused to do anything more than report the error. You must have
perfectly formed XML or none at all to avoid this.
Now, let's try the more sophisticated and efficient ElementTree module with the
same data:
>>> ET.ElementTree(file="broken_data.gpx")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\xml\etree\", line 611, in __init__
  File "C:\Python34\lib\xml\etree\", line 653, in parse
  File "C:\Python34\lib\xml\etree\", line 1624, in parse
  File "C:\Python34\lib\xml\etree\", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 2346,
column 2
As you can see, different parsers face the same problem. Poorly formed XML is an
all too common reality in geospatial analysis, and every XML parser assumes that all
the XML in the world is perfect. Every parser, that is, except one. Enter BeautifulSoup. This library
shreds bad XML into usable data without a second thought. And it can handle far
worse defects than missing tags. It will work despite missing punctuation or other
syntax and will give you the best data it can. It was originally developed for parsing
HTML, which is notoriously horrible for being poorly formed, but it works fairly
well with XML as well, as shown here:
>>> from bs4 import BeautifulSoup
>>> gpx = open("broken_data.gpx")
>>> soup = BeautifulSoup(, features="xml")
No complaints from BeautifulSoup! Just to make sure the data is actually usable,
let's try and access some of the data. One of the fantastic features of BeautifulSoup
is that it turns tags into attributes of the parse tree. If there are multiple tags with the
same name, it grabs the first one. Our sample data file has hundreds of <trkpt> tags.
Let's access the first one:
>>> soup.trkpt
<trkpt lat="30.307267000" lon="-89.332444000">
We're now certain that the data has been parsed correctly and we can access it. If we
want to access all of the <trkpt> tags, we can use the findAll() method to grab
them and then use the built-in Python len() function to count them, as shown here:
>>> tracks = soup.findAll("trkpt")
>>> len(tracks)
If we write the parsed data back out to a file, BeautifulSoup outputs the corrected
version. We'll save the fixed data as a new GPX file using the BeautifulSoup module's
prettify() method to format the XML with nice indentation, as you can see in the
following lines of code:
>>> fixed = open("fixed_data.gpx", "w")
>>> fixed.write(soup.prettify())
>>> fixed.close()
BeautifulSoup is a very rich library with many more features. To explore it
further, visit the BeautifulSoup documentation online.
While minidom, ElementTree, and cElementTree come with the
Python standard library, there is an even more powerful and popular
XML library for Python called lxml. The lxml module provides a
Pythonic interface to the libxml2 and libxslt C libraries using the
ElementTree API. An even better fact is that lxml also works with
BeautifulSoup to parse bad tag-based data. On some installations,
BeautifulSoup4 may require lxml. The lxml module is available
via PyPI but requires some additional steps for the C libraries. More
information is available on the lxml homepage.
Well-known text (WKT)
The WKT format has been around for years and is a simple text-based format for
representing geometries and spatial reference systems. It is primarily used as a data
exchange format by systems that implement the OGC Simple Features for SQL
specification. Take a look at the following sample WKT representation of a polygon:
POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))
Currently, the best way to read and write WKT is the Shapely library. Shapely
provides a very Python-oriented or Pythonic interface to the Geometry Engine – Open
Source (GEOS) library described in Chapter 3, The Geospatial Technology Landscape.
You can install Shapely using either easy_install or pip. You can also use the wheel
from the site mentioned in the previous section. Shapely has a WKT module which can
load and export this data. Let's use Shapely to load the previous polygon sample and
then verify that it has been loaded as a polygon object by calculating its area:
>>> import shapely.wkt
>>> wktPoly = "POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,
1 1))"
>>> poly = shapely.wkt.loads(wktPoly)
>>> poly.area
15.0
We can convert any Shapely geometry back to a WKT by simply calling its wkt
attribute, as shown here:
>>> poly.wkt
'POLYGON ((0.0 0.0, 4.0 0.0, 4.0 4.0, 0.0 4.0, 0.0 0.0), (1.0 1.0,
2.0 1.0, 2.0 2.0, 1.0 2.0, 1.0 1.0))'
Shapely can also handle the WKT binary counterpart called Well-known binary
(WKB) used to store WKT strings as binary objects in databases. Shapely loads
WKB using its wkb module in the same way as the wkt module, and it can convert
geometries by calling that object's wkb attribute.
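To see what WKB actually looks like on the wire, here is a small pure-Python sketch that packs and unpacks a point by hand with the struct module; in real code, you would let Shapely's wkb module do this for you:

```python
import struct

# WKB layout for a 2D point: 1 byte of byte order (1 = little-endian),
# a 4-byte unsigned int geometry type (1 = Point), then two doubles.
wkb = struct.pack("<BIdd", 1, 1, -92.0, 37.0)
order = "<" if wkb[0] == 1 else ">"
geom_type, = struct.unpack(order + "I", wkb[1:5])
x, y = struct.unpack(order + "dd", wkb[5:21])
print(geom_type, x, y)  # 1 -92.0 37.0
```

This is the same 21-byte record a spatial database would store for a simple point.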
Shapely is the most Pythonic way to work with WKT data, but you can also use the
Python bindings to the OGR library, which we installed earlier in this chapter.
For this example, we'll use a shapele with one simple polygon, which can be
downloaded as a ZIP le available at the following link:
In the following example, we'll open the polygon.shp le from the shapefile
dataset, call the required GetLayer() method, get the first (and only) feature, and
then export it to WKT:
>>> from osgeo import ogr
>>> shape = ogr.Open("polygon.shp")
>>> layer = shape.GetLayer()
>>> feature = layer.GetNextFeature()
>>> geom = feature.GetGeometryRef()
>>> wkt = geom.ExportToWkt()
>>> wkt
' POLYGON ((-99.904679362176353 51.698147686745074,
-75.010398603076666 46.56036851832075,-75.010398603076666
46.56036851832075,-75.010398603076666 46.56036851832075,
-76.975736557742451 23.246272688996914,-76.975736557742451
23.246272688996914,-76.975736557742451 23.246272688996914,
-114.31715769639194 26.220870210283724,-114.31715769639194
26.220870210283724,-99.904679362176353 51.698147686745074))'
Note that with OGR, you have to access each feature and export it
individually, as the ExportToWkt() method is at the feature level. We can now turn
around and read a WKT string using the wkt variable containing the export. We'll
import it back into ogr and get the bounding box, also known as an envelope, of the
polygon, as you can see here:
>>> poly = ogr.CreateGeometryFromWkt(wkt)
>>> poly.GetEnvelope()
(-114.31715769639194, -75.01039860307667, 23.246272688996914,
51.698147686745074)
Shapely and OGR are used for reading and writing valid WKT strings. Of course,
just like XML, which is also text, you could manipulate small amounts of WKT as
strings in a pinch.
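For instance, the outer ring of a simple, single-ring polygon can be pulled out with nothing but string methods, which is sometimes all you need (this sketch does not handle interior rings or multipart geometries):

```python
# Split a one-ring WKT polygon into (x, y) tuples without any libraries.
wkt = "POLYGON((0 0,4 0,4 4,0 4,0 0))"
body = wkt[wkt.index("((") + 2:wkt.rindex("))")]
ring = [tuple(float(n) for n in pair.split()) for pair in body.split(",")]
print(ring)  # [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (0.0, 0.0)]
```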
Python JSON libraries
JavaScript Object Notation (JSON) is rapidly becoming the number one data
exchange format across a lot of fields. Its lightweight syntax and the similarity of
JavaScript data structures to Python's own make it a perfect match for Python.
The following GeoJSON sample document contains a single point:
"type": "Feature",
"id": "OpenLayers.Feature.Vector_314",
"properties": {},
"geometry": {
"type": "Point",
"coordinates": [
"crs": {
"type": "name",
"properties": {
"name": "urn:ogc:def:crs:OGC:1.3:CRS84"
This sample is just a simple point with no attributes, which would otherwise be stored in
the properties data structure. First, we'll compact the sample
document into a single string to make it easier to handle:
>>> jsdata = """{"type": "Feature", "id": "OpenLayers.Feature.Vector_314",
... "properties": {}, "geometry": {"type": "Point", "coordinates":
... [97.03125, 39.7265625]}, "crs": {"type": "name", "properties":
... {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}}}"""
The json module
GeoJSON looks very similar to a nested set of Python's dictionaries and lists. Just for
fun, let's just try and use Python's eval() function to parse it as Python code:
>>> point = eval(jsdata)
>>> point["geometry"]
{'type': 'Point', 'coordinates': [97.03125, 39.7265625]}
Wow! That just worked! We turned that random GeoJSON string into native
Python data in one easy step. Keep in mind that the JSON data format is based on
JavaScript syntax, which happens to be similar to Python. Also, as you get deeper into
GeoJSON data and work with larger datasets, you'll find that JSON allows characters
that Python does not. Using Python's eval() function is also considered very insecure.
But as far as keeping things simple is concerned, it doesn't get any
simpler than that!
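If you like this trick but want it to be safe, the standard library's ast.literal_eval() accepts only Python literals, so arbitrary code embedded in the string cannot execute:

```python
import ast

# literal_eval parses literal structures but refuses to run code.
jsdata = '{"type": "Point", "coordinates": [97.03125, 39.7265625]}'
point = ast.literal_eval(jsdata)
print(point["coordinates"])  # [97.03125, 39.7265625]
```

It still shares eval()'s limitation that the text must also be valid Python literal syntax.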
Thanks to Python's drive towards simplicity, the more advanced method doesn't
get much more complicated. Let's use Python's json module, which is part of the
standard library, to turn the same string into Python the right way:
>>> import json
>>> json.loads(jsdata)
{u'geometry': {u'type': u'Point', u'coordinates': [97.03125,
39.7265625]}, u'crs': {u'type': u'name', u'properties': {u'name':
u'urn:ogc:def:crs:OGC:1.3:CRS84'}}, u'type': u'Feature',
u'id': u'OpenLayers.Feature.Vector_314', u'properties': {}}
As a side note, in the previous example the CRS84 property is a synonym for the
common WGS84 coordinate system. The json module adds some nice features such
as safer parsing and conversion of strings to Unicode. We can export Python data
structures to JSON in almost the same way:
>>> pydata = json.loads(jsdata)
>>> json.dumps(pydata)
'{"geometry": {"type": "Point", "coordinates": [97.03125,
39.7265625]}, "crs": {
"type": "name", "properties": {"name":
"urn:ogc:def:crs:OGC:1.3:CRS84"}}, "type"
: "Feature", "id": "OpenLayers.Feature.Vector_314", "properties":
The geojson module
We could happily go on reading and writing GeoJSON data using the json module
forever, but there's an even better way. The geojson module available on PyPI offers
some distinct advantages. For starters, it knows the requirements of the GeoJSON
specification, which can save a lot of typing. Let's create a simple point using this
module and export it to GeoJSON:
>>> import geojson
>>> p = geojson.Point([-92, 37])
>>> geojs = geojson.dumps(p)
>>> geojs
'{"type": "Point", "coordinates": [-92, 37]}'
Notice that the geojson module has an interface for different data types and saves
us from setting the type and coordinates attributes manually. Now imagine if you
had a geographic object with hundreds of features. You could programmatically
build this data structure instead of building a very large string. The geojson module
is also the reference implementation for the Python __geo_interface__ convention.
This interface allows cooperating programs to exchange data seamlessly and in a
Pythonic way without the programmer explicitly exporting and importing GeoJSON
strings. So, if we wanted to feed the point we created with the geojson module to the
Shapely module, we could perform the following command, which reads the geojson
module's point object straight into Shapely after which we'll export it as WKT:
>>> from shapely.geometry import asShape
>>> point = asShape(p)
>>> point.wkt
'POINT (-92.0000000000000000 37.0000000000000000)'
More and more geospatial Python libraries are implementing both the geojson
and __geo_interface__ functionality including PyShp, Fiona, Karta, and ArcGIS.
Third-party implementations exist for QGIS.
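The convention itself is tiny. Any object that exposes a __geo_interface__ property returning a GeoJSON-like dictionary can participate; this hypothetical class is a minimal sketch of everything a consumer such as asShape() needs:

```python
# A minimal, hypothetical class implementing __geo_interface__.
class City:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    @property
    def __geo_interface__(self):
        return {"type": "Point", "coordinates": (self.x, self.y)}

print(City(-92, 37).__geo_interface__)
```

Cooperating libraries simply read this property instead of requiring a particular class.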
OGR
We touched on OGR as a way to handle WKT strings, but its real power is as a
universal vector library. This book strives for pure Python solutions, but no single
library even comes close to the variety of formats that OGR can process.
Let's read a sample point shapefile using the OGR Python API. The sample shapefile
can be downloaded as a ZIP file.
This point shapele has ve points with single digit, positive coordinates. The
attributes list the order in which the points were created, making it useful for testing.
This simple example will read in the point shapele and loop through each feature;
then it'll print the x and y value of each point plus the value of the rst attribute eld:
>>> from osgeo import ogr
>>> shp = ogr.Open("point.shp")
>>> layer = shp.GetLayer()
>>> for feature in layer:
... geometry = feature.GetGeometryRef()
... print(geometry.GetX(), geometry.GetY(), feature.GetField("FIRST_FLD"))
1.0 1.0 First
3.0 1.0 Second
4.0 3.0 Third
2.0 2.0 Fourth
0.0 0.0 Appended
This example is simple, but OGR can become quite verbose as your script becomes
more complex.
PyShp
PyShp is a simple, pure Python library that reads and writes shapefiles. It doesn't
perform any geometry operations and only uses Python's standard library. It's
contained in a single file that's easy to move around, squeeze onto small embedded
platforms, and modify. It is compatible with Python 3 and also implements
__geo_interface__. The PyShp module is available on PyPI.
Let's repeat the previous OGR example with PyShp:
>>> import shapefile
>>> shp = shapefile.Reader("point.shp")
>>> for feature in shp.shapeRecords():
... point = feature.shape.points[0]
... rec = feature.record[0]
... print(point[0], point[1], rec)
1.0 1.0 First
3.0 1.0 Second
4.0 3.0 Third
2.0 2.0 Fourth
0.0 0.0 Appended
dbfpy
Both OGR and PyShp read and write dbf files because they are part of the
shapefile specification. The dbf files contain the attributes and fields for the shapefiles.
However, both libraries have very basic dbf support. Occasionally, you will need
to do some heavy-duty dbf work. The dbfpy3 module is a pure Python module
dedicated to working with dbf files. It is not hosted on PyPI, so you
can force easy_install to find the download by specifying the download file:
easy_install -f
If you are using pip to install packages, use the following command:
pip install
The following shapefile has over 600 dbf records representing U.S. Census Bureau
tracts, which makes it a good sample for trying out dbfpy.
Let's open up the dbf file of this shapefile and look at the first record:
>>> from dbfpy3 import dbf
>>> db = dbf.Dbf("GIS_CensusTract_poly.dbf")
>>> db[0]
GEODB_OID: 4029 (<type 'int'>)
OBJECTID: 4029 (<type 'int'>)
PERMANE0: 61be9239-8f3b-4876-8c4c-0908078bc597 (<type 'str'>)
SOURCE_1: NA (<type 'str'>)
SOURCE_2: 20006 (<type 'str'>)
SOURCE_3: Census Tracts (<type 'str'>)
SOURCE_4: Census Bureau (<type 'str'>)
DATA_SE5: 5 (<type 'str'>)
DISTRIB6: E4 (<type 'str'>)
LOADDATE: 2007-03-13 (<type ''>)
QUALITY: 2 (<type 'str'>)
SCALE: 1 (<type 'str'>)
FCODE: 1734 (<type 'str'>)
STCO_FI7: 22071 (<type 'str'>)
STATE_NAME: 22 (<type 'str'>)
COUNTY_8: 71 (<type 'str'>)
CENSUST9: 22071001734 (<type 'str'>)
POPULAT10: 1760 (<type 'int'>)
AREASQKM: 264.52661934 (<type 'float'>)
GNIS_ID: NA (<type 'str'>)
POPULAT11: 1665 (<type 'int'>)
DB2GSE_12: 264526619.341 (<type 'float'>)
DB2GSE_13: 87406.406192 (<type 'float'>)
The module very quickly and easily gives us both the column names and the data
values. Now, let's increment the population field contained in POPULAT10 by 1:
>>> rec = db[0]
>>> field = rec["POPULAT10"]
>>> rec["POPULAT10"] = field + 1
>>>
>>> del rec
>>> db[0]["POPULAT10"]
1761
Keep in mind that both OGR and PyShp can do this same procedure, but dbfpy3
makes it a little easier if you are making a lot of changes to dbf files.
Shapely
Shapely was mentioned in the Well-known text (WKT) section for its import and export
ability. But its true purpose is as a generic geometry library. Shapely is a high-level,
Pythonic interface to the GEOS library for geometric operations. In fact, Shapely
intentionally avoids reading or writing files. It relies completely on data import and
export from other modules and maintains its focus on geometry manipulation.
Let's do a quick Shapely demonstration in which we'll define a single WKT polygon
and then import it into Shapely. Then, we'll measure its area. Our computational
geometry will consist of buffering that polygon by a measure of five arbitrary units,
which will return a new, bigger polygon whose area we'll also measure:
>>> from shapely import wkt, geometry
>>> wktPoly = "POLYGON((0 0,4 0,4 4,0 4,0 0))"
>>> poly = wkt.loads(wktPoly)
>>> poly.area
16.0
>>> buf = poly.buffer(5.0)
>>> buf.area
We can then perform a difference on the area of the buffer and the original polygon
area, as shown here:
>>> buf.difference(poly).area
If you can't have pure Python, a Pythonic API as clean as Shapely that packs such a
punch is certainly the next best thing.
Fiona
The Fiona library provides a simple Python API around the OGR library for data
access and nothing more. This approach makes it easy to use and less verbose than
OGR in Python. Fiona outputs GeoJSON by default. You can find a wheel
file for Fiona online.
As an example, we'll use the GIS_CensusTract_poly.shp file from the dbfpy
example seen earlier in this chapter.
First we'll import fiona and Python's pprint module to format the output. Then,
we'll open the shapele and check its driver type:
>>> import fiona
>>> import pprint
>>> f ="GIS_CensusTract_poly.shp")
>>> f.driver
'ESRI Shapefile'
Next, we'll check its coordinate reference system and get the data bounding box, as
shown here:
{'init': 'epsg:4269'}
>>> f.bounds
(-89.8744162216216, 30.161122135135138, -89.1383837783784,
Now, we'll view the data schema and format it using the pprint module,
as you can see in the following lines of code:
>>> pprint.pprint(f.schema)
{'geometry': 'Polygon',
'properties': {'GEODB_OID': 'float:11',
'OBJECTID': 'float:11',
'PERMANE0': 'str:40',
'SOURCE_1': 'str:40',
'SOURCE_2': 'str:40',
'SOURCE_3': 'str:100',
'SOURCE_4': 'str:130',
'DATA_SE5': 'str:46',
'DISTRIB6': 'str:188',
'LOADDATE': 'date',
'QUALITY': 'str:35',
'SCALE': 'str:52',
'FCODE': 'str:38',
'STCO_FI7': 'str:5',
'STATE_NAME': 'str:140',
'COUNTY_8': 'str:60',
'CENSUST9': 'str:20',
'POPULAT10': 'float:11',
'AREASQKM': 'float:31.15',
'GNIS_ID': 'str:10',
'POPULAT11': 'float:11',
'DB2GSE_12': 'float:31.15',
'DB2GSE_13': 'float:31.15'}}
Next let's get a count of the number of features:
>>> len(list(f))
Finally, we'll print one of the records as formatted GeoJSON, as shown here:
>>> pprint.pprint(f[1])
{'geometry': {'coordinates': [[[(-89.86412366375093,
30.661213864864862), (-89.86418691770497, 30.660764012731285),
(-89.86443391770518, 30.659652012730202),
'type': 'MultiPolygon'},
'id': '1',
'properties': {'GEODB_OID': 4360.0,
'OBJECTID': 4360.0,
'PERMANE0': '9a914eef-9249-44cf-a05f-af4b48876c59',
'SOURCE_1': 'NA',
'SOURCE_2': '20006',
'DB2GSE_12': 351242560.967882,
'DB2GSE_13': 101775.283967268},
'type': 'Feature'}
GDAL
GDAL is the dominant geospatial library for raster data. Its raster capability is so
significant that it is a part of virtually every geospatial toolkit in any language, and
Python is no exception. To see the basics of how GDAL works in Python,
download the sample raster satellite image as a ZIP file and unzip it.
Let's open this image and see how many bands it has and how many pixels are
present along each axis:
>>> from osgeo import gdal
>>> raster = gdal.Open("SatImage.tif")
>>> raster.RasterCount
3
>>> raster.RasterXSize
2592
>>> raster.RasterYSize
2693
So, we see that the image has three bands, 2,592 columns of pixels, and
2,693 rows of pixels, which we can confirm by viewing it in OpenEV.
GDAL is an extremely fast geospatial raster reader and writer within Python. It can
also reproject images quite well in addition to a few other tricks. However, the true
value of GDAL comes from its interaction with the next Python module, which we'll
examine now.
NumPy
NumPy is an extremely fast, multidimensional Python array processor designed
specifically for Python and scientific computing, but written in C. It is available
via PyPI or as a wheel file and installs easily. In addition to its amazing speed, the magic
of NumPy includes its interaction with other libraries. NumPy can exchange data
with GDAL, Shapely, the Python Imaging Library (PIL), and many other scientific
computing Python libraries in other fields.
As a quick example of NumPy's ability, we'll combine it with GDAL to read in
our sample satellite image and then create a histogram of it. The interface between
GDAL and NumPy is a GDAL module called gdal_array, which has NumPy as a
dependency. Numeric is the legacy name of the NumPy module. The gdal_array
module imports NumPy.
In the following example, we'll use gdal_array, which imports NumPy, to read the
image in as an array, grab the first band, and save it back out as a JPEG image:
>>> from osgeo import gdal_array
>>> srcArray = gdal_array.LoadFile("SatImage.tif")
>>> band1 = srcArray[0]
>>> gdal_array.SaveArray(band1, "band1.jpg", format="JPEG")
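The histogram mentioned earlier is a one-liner once a band is a NumPy array. This sketch uses a synthetic band so it runs without the satellite image, assuming only that NumPy is installed:

```python
import numpy as np

# A fake 8-bit band standing in for gdal_array.LoadFile("SatImage.tif")[0].
band = np.random.randint(0, 256, size=(100, 100))
# 16 bins across the 0-255 range; every pixel lands in exactly one bin.
counts, bin_edges = np.histogram(band, bins=16, range=(0, 256))
print(counts.sum() == band.size)  # True
```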
This operation gives us a grayscale image of the first band, which can be viewed in OpenEV.
PIL
PIL was originally developed for remote sensing, but has evolved into a general
image editing library for Python. Like NumPy, it is written in C for speed, but is
designed specifically for Python. In addition to image creation and processing, it
also has a useful raster drawing module. PIL is available via PyPI; however, in
Python 3, you may want to use the Pillow module, which is an upgraded version of
PIL. As you'll see in the example below, we use a Python try statement to import
PIL using two possible variations depending on how you installed it.
In this example, we'll combine PyShp and PIL to rasterize the hancock shapefile
from previous examples and save it as an image. We'll use a world-to-pixel coordinate
transformation similar to the one in our SimpleGIS from Chapter 1, Learning Geospatial Analysis
with Python. We'll create an image to use as a canvas in PIL, and then we'll use the
PIL ImageDraw module to render the polygon. Finally, we'll save it as a PNG image,
as you can see in the following lines of code:
>>> try:
...     import Image
...     import ImageDraw
... except ImportError:
...     from PIL import Image
...     from PIL import ImageDraw
>>> import shapefile
>>> r = shapefile.Reader("hancock.shp")
>>> xdist = r.bbox[2] - r.bbox[0]
>>> ydist = r.bbox[3] - r.bbox[1]
>>> iwidth = 400
>>> iheight = 600
>>> xratio = iwidth/xdist
>>> yratio = iheight/ydist
>>> pixels = []
>>> for x,y in r.shapes()[0].points:
... px = int(iwidth - ((r.bbox[2] - x) * xratio))
... py = int((r.bbox[3] - y) * yratio)
... pixels.append((px,py))
>>> img ="RGB", (iwidth, iheight), "white")
>>> draw = ImageDraw.Draw(img)
>>> draw.polygon(pixels, outline="rgb(203, 196, 190)",
fill="rgb(198, 204, 189)")
>>>"hancock.png")
This example creates an image of the rasterized county polygon.
Sometimes, you may nd that PIL is overkill for your purposes, or you are not
allowed to install PIL because you do not have administrative rights to the machine
that you're using to install Python modules created and compiled in C. In those
cases, you can usually get away with the lightweight pure Python PNGCanvas
module. You can install it using easy_install or pip.
Using this module, we can repeat the shapefile rasterization example we performed with
PIL, but in pure Python, as you can see here:
>>> import shapefile
>>> import pngcanvas
>>> r = shapefile.Reader("hancock.shp")
>>> xdist = r.bbox[2] - r.bbox[0]
>>> ydist = r.bbox[3] - r.bbox[1]
>>> iwidth = 400
>>> iheight = 600
>>> xratio = iwidth/xdist
>>> yratio = iheight/ydist
>>> pixels = []
>>> for x,y in r.shapes()[0].points:
... px = int(iwidth - ((r.bbox[2] - x) * xratio))
... py = int((r.bbox[3] - y) * yratio)
... pixels.append([px,py])
>>> c = pngcanvas.PNGCanvas(iwidth,iheight)
>>> c.polyline(pixels)
>>> f = open("hancock_pngcvs.png", "wb")
>>> f.write(c.dump())
>>> f.close()
This example gives us a simple outline, as PNGCanvas does not have a built-in
fill method:
GeoPandas
Pandas is a high-performance Python data analysis library which can handle
large datasets that are tabular (similar to a database), ordered/unordered, labeled
matrices, or unlabeled statistical data. GeoPandas is simply a geospatial extension
to Pandas that builds upon Shapely, Fiona, PyProj, matplotlib, and Descartes, all
of which must be installed. It enables you to easily perform operations in Python
which would otherwise require a spatial database such as PostGIS. You can
download a wheel file for GeoPandas from PyPI.
The following script opens a shapefile and dumps it to GeoJSON; it then creates a
map with matplotlib:
>>> import geopandas
>>> import matplotlib.pyplot as plt
>>> gdf = geopandas.GeoDataFrame
>>> census = gdf.from_file("GIS_CensusTract_poly.shp")
>>> census.to_json()
>>> census.plot()
The following image is the resulting map plot of the previous commands:
PyMySQL
The popular MySQL database (available at https://www.mysql.com) is gradually
evolving spatial functions. It has support for OGC geometries and
a few spatial functions. It also has a pure Python API available in the PyMySQL
library. The limited spatial functions use planar geometry and bounding rectangles
as opposed to spherical geometry and shapes. The latest development release of
MySQL contains some additional functions that improve this capability.
In the following example, we'll create a database in MySQL called spatial_db.
Then, we'll add a table called PLACES with a geometry column. Next, we'll add two
cities as point locations. And finally, we'll calculate the distance using MySQL's
ST_Distance function and then convert the result from degrees to miles:
>>> import pymysql
>>> conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='', db='mysql')
>>> cur = conn.cursor()
>>> cur.execute("DROP DATABASE IF EXISTS spatial_db")
>>> cur.execute("CREATE DATABASE spatial_db")
>>> cur.close()
>>> conn.close()
>>> conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='', db='spatial_db')
>>> cur = conn.cursor()
>>> cur.execute("CREATE TABLE PLACES (id int NOT NULL AUTO_INCREMENT PRIMARY KEY, Name varchar(50) NOT NULL, location Geometry NOT NULL)")
>>> cur.execute("INSERT INTO PLACES (name, location) VALUES ('NEW ORLEANS', GeomFromText('POINT(30.03 90.03)'))")
>>> cur.execute("INSERT INTO PLACES (name, location) VALUES ('MEMPHIS', GeomFromText('POINT(35.05 90.00)'))")
>>> conn.commit()
>>> cur.execute("SELECT AsText(location) FROM PLACES")
>>> p1, p2 = [p[0] for p in cur.fetchall()]
>>> cur.execute("SET @p1 = ST_GeomFromText('{}')".format(p1))
>>> cur.execute("SET @p2 = ST_GeomFromText('{}')".format(p2))
>>> cur.execute("SELECT ST_Distance(@p1, @p2)")
>>> d = float(cur.fetchone()[0])
>>> # Convert degrees to miles (approx. 70 miles per degree)
>>> print("{:.2f} miles from New Orleans to Memphis".format(d * 70))
351.41 miles from New Orleans to Memphis
There are other spatial database options including PostGIS and
SpatiaLite; however, Python 3 support for these spatial engines
is developmental at best. You can access PostGIS and MySQL
through the OGR library; however, MySQL support is limited.
PyFPDF
The pure Python PyFPDF library is a lightweight way to create PDFs, including maps.
Because the PDF format is a widely used standard, PDFs are commonly used to
distribute maps. You can install it via PyPI as fpdf. The official name of the software
is PyFPDF because it is a port of the free PHP library called FPDF. This module
uses a concept called a cell to lay out items at specific locations on a page. As a quick
example, we'll import our hancock.png image created from the PIL example into a
PDF called map.pdf to create a simple PDF map. The map will have header text at
the top that says Hancock County Boundary, followed by the map image:
>>> import fpdf
>>> # PDF constructor:
>>> # Portrait, millimeter units, A4 page size
>>> pdf = fpdf.FPDF("P", "mm", "A4")
>>> # Create a new page
>>> pdf.add_page()
>>> # Set font: arial, bold, size 20
>>> pdf.set_font('Arial', 'B', 20)
>>> # Layout cell: 160 x 25mm, title, no border, centered
>>> pdf.cell(160, 25, 'Hancock County Boundary', \
...     border=0, align="C")
>>> # Write the image, specifying the size
>>> pdf.image("hancock.png", 25, 50, 110, 160)
>>> # Save the file: filename, F = to file system
>>> pdf.output('map.pdf', 'F')
If you open the PDF file named map.pdf in Adobe Acrobat Reader or another PDF
reader such as SumatraPDF, you'll see that the image is now centered on an A4 page.
Geospatial products are often included as part of larger reports, and the PyFPDF
module simplifies automatically generating reports as PDFs.
Spectral Python
Spectral Python (SPy) is a very advanced Python package for remote sensing. It goes
far beyond what you would typically do with GDAL and NumPy and focuses on
hyperspectral processing for images, which may have hundreds of bands. The basic
package installs easily via PyPI, but SPy can provide a fully windowed processing
environment if you install some additional dependencies. The remote sensing we'll
perform in the rest of this book won't require SPy, but it is worth mentioning here
because it is a well-maintained, powerful package that is competitive with many
commercial software products in this domain. You can find out more about SPy
at the project's homepage.
Summary
In this chapter, we surveyed the Python-specific tools for geospatial analysis. Many
of these tools include bindings to the libraries discussed in Chapter 3, The Geospatial
Technology Landscape, for best-of-breed solutions for specific operations, such as GDAL's
raster access functions. We also included pure Python libraries as much as possible and
will continue to include pure Python algorithms as we work through the upcoming
chapters. In the next chapter, we'll begin applying these tools for GIS analysis.
Python and Geographic Information Systems
This chapter will focus on applying Python to functions typically performed by a
geographic information system (GIS) such as QGIS or ArcGIS. We will continue to
use as few external dependencies as possible outside Python itself, so that you have
tools which are as reusable as possible in different environments. In this book, we
separate GIS analysis and remote sensing from a programming perspective, which
means that in this chapter, we'll focus mostly on vector data.
As with other chapters in this book, the items presented here are core functions
that serve as building blocks which can be recombined to solve challenges you will
encounter beyond this book. This chapter includes the following topics:
Measuring distance
Converting coordinates
Reprojecting vector data
Editing shapefiles
Selecting data from within larger datasets
Creating thematic maps
Conversion of non-GIS data types
This chapter contains many code samples. In addition to the text, code comments are
included as guides within the samples.
Measuring distance
The essence of geospatial analysis is discovering the relationships of objects on the
Earth. Items which are closer together tend to have a stronger relationship than those
which are farther apart. This concept is known as Tobler's First Law of Geography.
Therefore, measuring distance is a critical function of geospatial analysis.
As you have learned, every map is a model of the Earth, and they are all wrong to
some degree. For this reason, measuring accurate distance between two points on
the Earth while sitting in front of a computer is impossible. Even professional land
surveyors who go out in the field with both traditional sighting equipment and
very precise GPS equipment fail to account for every anomaly on the Earth's surface
between point A and point B. So, in order to measure distance, we must look at what
we are measuring, how much we are measuring, and how much accuracy we need.
There are three models of the Earth we can use to calculate distance:
Flat plane
In the flat plane model, standard Euclidean geometry is used. The Earth is
considered a flat plane with no curvature, as shown in the following figure:
Chapter 5
[ 159 ]
This model makes the math quite simple because you work with straight lines. The most
common format for geospatial coordinates is decimal degrees. However, decimal
degree coordinates are reference measurements on a sphere taken as angles between
the longitude and the prime meridian, and the latitude and the equator. Furthermore,
the lines of longitude converge toward zero at the poles. The circumference of each
line of latitude becomes smaller toward the poles as well. These facts mean decimal
degrees are not a valid coordinate system for Euclidean geometry, which uses
infinite planes.
Map projections attempt to simplify the issues of dealing with a three-dimensional
ellipsoid in a two-dimensional plane, which could be either paper or a computer
screen. As discussed in Chapter 1, Learning Geospatial Analysis with Python, map
projections flatten a round model of the Earth to a plane and introduce distortion in
exchange for the convenience of a map. Once this projection is in place and decimal
degrees are traded for a Cartesian coordinate system with x and y coordinates, we
can use the simplest form of Euclidean geometry: the Pythagorean theorem.
At a large enough scale, a sphere or an ellipsoid like the Earth appears more like a
plane than a sphere. In fact, for centuries, everyone thought that the Earth was flat!
If the difference in degrees of longitude is small enough, you can often get away
with using Euclidean geometry and then converting the measurements to meters,
kilometers, or miles. This method is generally not recommended, but the decision is
ultimately up to you and your requirements for accuracy as an analyst.
The spherical model approach tries to better approximate reality by avoiding
the problems resulting from smashing the Earth onto a flat surface. As the name
suggests, this model uses a perfect sphere to represent the Earth (similar to a
physical globe), which allows us to work with degrees directly. This model ignores
the fact that the Earth is really more of an egg-shaped ellipsoid with varying degrees
of thickness in its crust. But by working with distance on the surface of a sphere,
we can begin to measure longer distances with more accuracy. The following figure
illustrates this concept:
Using the ellipsoid model of the Earth, analysts strive for the best model of the
Earth's surface. There are several ellipsoid models, which are called datums. A datum
is a set of values which define an estimated shape for the Earth, also known as a
geodetic system. Like any other georeferencing system, a datum can be optimized
for a localized area. The most commonly used datum is called WGS84, which
is designed for global use. You should be aware that WGS84 is occasionally
updated as assessment techniques and technology improve. The most recent
revision occurred in 2004. In North America, the NAD83 datum is used to optimize
referencing over the continent. In the Eastern Hemisphere, the European Terrestrial
Reference System 1989 (ETRS89) is used more frequently. ETRS89 is fixed to the
stable part of the Eurasian Plate. Maps of Europe based on ETRS89 are immune to
continental drift, which changes up to 2.5 cm per year as the Earth's crust shifts.
An ellipsoid does not have a constant radius from the center. This fact means
the formulas used in the spherical model of the Earth begin to have issues in the
ellipsoid model. Though not a perfect approximation, it is much closer to reality than
the spherical model. The following figure shows a generic ellipsoid model denoted
by a black line contrasted against a representation of the Earth's uneven crust, with
the red line representing the geoid. Although we will not use it for these examples,
another model is the geoid model. The geoid is the most precise and accurate model
of the Earth, which is based on the Earth's surface with no influences except gravity
and rotation. The following graphic is a representation of a geoid, ellipsoid, and
spherical model to demonstrate the differences:
Pythagorean theorem
Now that we've discussed these different models of the Earth and the issues in
measuring them, let's look at some solutions using Python. We'll start measuring
with the simplest method, using the Pythagorean theorem, also known as Euclidean
distance. If you remember your geometry lessons from school, the Pythagorean
theorem asserts the following equation:

a^2 + b^2 = c^2

In this assertion, the variables a, b, and c are all sides of a triangle. You can solve
for any one side if you know the other two. In this example, we'll start with two
projected points in the Mississippi Transverse Mercator (MSTM) projection.
The units of this projection are in meters. The x axis locations are measured from
the central meridian defined by the westernmost location in the state. The y axis
is defined from the NAD83 horizontal datum. The first point, defined as (x1,y1),
represents Jackson, the state capital of Mississippi. The second point, defined
as (x2,y2), represents the city of Biloxi, which is a coastal town, as shown in the
following figure:
In the following example, the double asterisk (**) in Python is the
syntax for exponents, which we'll use to square the distances.
We'll import the Python math module for its square root function called sqrt().
Then, we'll calculate the x axis and y axis distances. Finally, we'll use these variables
to execute the Euclidean distance formula to get the distance across the bounding
box in meters from an (x, y) origin used in the MSTM projection:
>>> import math
>>> x1 = 456456.23
>>> y1 = 1279721.064
>>> x2 = 576628.34
>>> y2 = 1071740.33
>>> x_dist = x1 - x2
>>> y_dist = y1 - y2
>>> dist_sq = x_dist**2 + y_dist**2
>>> distance = math.sqrt(dist_sq)
>>> distance
So, the distance is approximately 240,202 meters, which is around 240.2 kilometers or
150 miles. This calculation is reasonably accurate because this projection is optimized
for measuring distance and area in Mississippi using Cartesian coordinates.
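For reuse, the Pythagorean steps above can be collapsed into a one-line helper; math.hypot does the squaring and square root for us. This is a sketch of our own; the function name euclidean_distance is not from the original:

```python
import math

def euclidean_distance(x1, y1, x2, y2):
    # Equivalent to math.sqrt((x1 - x2)**2 + (y1 - y2)**2)
    return math.hypot(x1 - x2, y1 - y2)

# Jackson to Biloxi in the MSTM projection (meters)
d = euclidean_distance(456456.23, 1279721.064, 576628.34, 1071740.33)
print(round(d, 2))
```

This only makes sense for coordinates that are already in a projected, Cartesian system like MSTM.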
We can also measure distance using decimal degrees, but we must perform a few
additional steps. In order to measure using degrees, we must first convert the angles
to radians, which accounts for the curved surface distance between the coordinates.
We'll also multiply our output in radians by the radius of the Earth in meters to
convert back from radians. We'll perform this conversion using the Python
math.radians() method when we calculate the x and y distances, as shown here:
>>> import math
>>> x1 = -90.21
>>> y1 = 32.31
>>> x2 = -88.95
>>> y2 = 30.43
>>> x_dist = math.radians(x1 - x2)
>>> y_dist = math.radians(y1 - y2)
>>> dist_sq = x_dist**2 + y_dist**2
>>> dist_rad = math.sqrt(dist_sq)
>>> dist_rad * 6371251.46
OK, this time we came up with around 251 kilometers, which is 11 kilometers
more than our first measurement. So, as you can see, your choice of measurement
algorithm and Earth model can have significant consequences. Using the same
equation, we come up with radically different answers, depending on our choice of
coordinate system and Earth model.
You can read more about Euclidean distance on Wolfram MathWorld.
Haversine formula
Part of the problem with just plugging unprojected decimal degrees into the
Pythagorean theorem is the concept of Great Circle distance. A Great Circle is the
shortest path between two points on a sphere. Another defining feature of a Great
Circle is that, if followed all the way around the sphere, it bisects the sphere into
two equal halves, as shown in the following Wikipedia figure (Jhbdel, Wikipedia):
So what is the right way to measure in decimal degrees? The most popular method
is the Haversine formula, which uses trigonometry to calculate the Great Circle
distance using coordinates defined in decimal degrees as input. Once again, we'll
convert the axis distances from degrees to radians before we apply the formula,
just like in the previous example. But this time, we'll also convert the latitude (y axis)
coordinates to radians separately, as shown here:
>>> import math
>>> x1 = -90.212452861859035
>>> y1 = 32.316272202663704
>>> x2 = -88.952170968942525
>>> y2 = 30.438559624660321
>>> x_dist = math.radians(x1 - x2)
>>> y_dist = math.radians(y1 - y2)
>>> y1_rad = math.radians(y1)
>>> y2_rad = math.radians(y2)
>>> a = math.sin(y_dist/2)**2 + math.sin(x_dist/2)**2 \
...     * math.cos(y1_rad) * math.cos(y2_rad)
>>> c = 2 * math.asin(math.sqrt(a))
>>> distance = c * 6371 # kilometers
>>> print(distance)
Wow! 240.6 kilometers using the Haversine formula compared to 240.2 kilometers
using the optimized and more accurate projection. This difference is less than half a
kilometer, which is not bad for a distance calculation of two cities that are 150 miles
apart. The Haversine formula is the most commonly used distance measuring formula
because it is relatively lightweight from a coding perspective and reasonably accurate
in most cases. It is considered to be accurate within about a meter.
To summarize what you've learned so far, most of the point coordinates you
encounter as an analyst are in unprojected decimal degrees. So, these are some of the
options for measurement:
Reproject to a distance-accurate Cartesian projection and measure
Just use the Haversine formula and see how far it takes you for your analysis
Use the even more precise Vincenty's formula
That's right! There's another formula which seeks to provide an even better
measurement than Haversine.
Vincenty's formula
So we've examined distance measurement using the Pythagorean theorem (flat
Earth model) and the Haversine formula (spherical Earth model). Vincenty's
formula accounts for the ellipsoid model of the Earth, and if you are using a
localized ellipsoid, it can be accurate to well within a meter. In the following
implementation of this formula, you can change the semi-major axis value and
flattening ratio to fit the definition of any ellipsoid. Let's see what the distance is
when we measure using Vincenty's formula on the NAD83 ellipsoid:
import math

distance = None
x1 = -90.212452861859035
y1 = 32.316272202663704
x2 = -88.952170968942525
y2 = 30.438559624660321

# Ellipsoid Parameters
# Example is NAD83
a = 6378137  # semi-major axis
f = 1/298.257222101  # inverse flattening
b = abs((f*a)-a)  # semi-minor axis

L = math.radians(x2-x1)
U1 = math.atan((1-f) * math.tan(math.radians(y1)))
U2 = math.atan((1-f) * math.tan(math.radians(y2)))
sinU1 = math.sin(U1)
cosU1 = math.cos(U1)
sinU2 = math.sin(U2)
cosU2 = math.cos(U2)
lam = L

for i in range(100):
    sinLam = math.sin(lam)
    cosLam = math.cos(lam)
    sinSigma = math.sqrt((cosU2*sinLam)**2 +
                         (cosU1*sinU2 - sinU1*cosU2*cosLam)**2)
    if sinSigma == 0:
        distance = 0  # coincident points
        break
    cosSigma = sinU1*sinU2 + cosU1*cosU2*cosLam
    sigma = math.atan2(sinSigma, cosSigma)
    sinAlpha = cosU1 * cosU2 * sinLam / sinSigma
    cosSqAlpha = 1 - sinAlpha**2
    cos2SigmaM = cosSigma - 2*sinU1*sinU2/cosSqAlpha
    if math.isnan(cos2SigmaM):
        cos2SigmaM = 0  # equatorial line
    C = f/16*cosSqAlpha*(4+f*(4-3*cosSqAlpha))
    LP = lam
    lam = L + (1-C) * f * sinAlpha * \
        (sigma + C*sinSigma*(cos2SigmaM+C*cosSigma *
                             (-1+2*cos2SigmaM*cos2SigmaM)))
    if not abs(lam-LP) > 1e-12:
        break

uSq = cosSqAlpha * (a**2 - b**2) / b**2
A = 1 + uSq/16384*(4096+uSq*(-768+uSq*(320-175*uSq)))
B = uSq/1024 * (256+uSq*(-128+uSq*(74-47*uSq)))
deltaSigma = B*sinSigma*(cos2SigmaM+B/4 *
                         (cosSigma*(-1+2*cos2SigmaM*cos2SigmaM) -
                          B/6*cos2SigmaM*(-3+4*sinSigma*sinSigma) *
                          (-3+4*cos2SigmaM*cos2SigmaM)))
s = b*A*(sigma-deltaSigma)
distance = s
Using Vincenty's formula, our measurement came to 240.1 kilometers, which was
only 100 meters off from our projected measurement using Euclidean distance. That's
impressive! While it's many times more mathematically complex than the Haversine
formula, you can see that it is also much more accurate.
The pure Python geopy module includes an implementation of
Vincenty's formula and can also geocode locations by turning
place names into latitude and longitude coordinates.
The points used in these examples are reasonably close to the equator. As you move
towards the poles or work with larger or extremely small distances, the choices you
make become increasingly more important. If you're just trying to make a radius
around a city to select locations for a marketing campaign promoting a concert, then
an error of a few kilometers is probably acceptable. However, if you're trying to
estimate the volume of fuel required for an airplane to make a flight between two
airports, then you want to be spot on!
If you'd like to learn more about the issues with measuring distance and direction,
and how to work around them with programming, visit Chris Veness's website.
On this site, he goes into great detail on this topic and provides online
calculators as well as examples written in JavaScript, which are easily ported to
Python. The Vincenty formula implementation that we just saw is ported from
the JavaScript on his site.
Calculating line direction
In addition to distance, you'll often want to know the bearing of a line between its
end points. We can calculate this line direction from one of the points using only the
Python math module, as shown in the following calculation:
>>> from math import atan2, cos, sin, degrees, radians
>>> lon1 = -90.21
>>> lat1 = 32.31
>>> lon2 = -88.95
>>> lat2 = 30.43
>>> lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
>>> angle = atan2(sin(lon2-lon1)*cos(lat2), cos(lat1)*sin(lat2) -
...     sin(lat1)*cos(lat2)*cos(lon2-lon1))
>>> bearing = (degrees(angle) + 360) % 360
>>> print(bearing)
Sometimes, you end up with a negative bearing value. To avoid this issue, we add
360 to the result and then use the Python modulo operator to keep the value from
climbing over 360. The math in the angle calculation reverse engineers a right
triangle and then figures out the acute angle of the triangle.
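The whole calculation fits in a small function that always returns a compass bearing in the 0 to 360 range. This is a sketch of our own, using the standard forward-azimuth formula with inputs converted to radians first:

```python
from math import atan2, cos, sin, degrees, radians

def bearing(lon1, lat1, lon2, lat2):
    """Initial compass bearing in degrees from point 1 to point 2."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    angle = atan2(sin(lon2 - lon1) * cos(lat2),
                  cos(lat1) * sin(lat2) -
                  sin(lat1) * cos(lat2) * cos(lon2 - lon1))
    # Normalize from (-180, 180] to [0, 360)
    return (degrees(angle) + 360) % 360

# Jackson toward Biloxi: roughly south-southeast
print(round(bearing(-90.21, 32.31, -88.95, 30.43)))
```

Due north comes out as 0 and due east as 90, which matches compass convention.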
Coordinate conversion
When you start working with multiple datasets, you'll inevitably end up with data
in different coordinate systems and projections. You can convert back and forth
between Universal Transverse Mercator (UTM) and latitude/longitude using a pure
Python module called utm, which you can install using easy_install or pip from PyPI.
The utm module is straightforward to use. To convert from UTM to latitude and
longitude, use the following commands:
>>> import utm
>>> y = 479747.0453210057
>>> x = 5377685.825323031
>>> zone = 32
>>> band = 'U'
>>> print(utm.to_latlon(y, x, zone, band))
(48.55199390882121, 8.725555729071763)
The UTM zones are numbered horizontally. However, vertically, the bands of
latitude are ordered by the English alphabet with a few exceptions. The letters A, B, Y,
and Z cover the poles. The letters I and O are omitted because they look too much
like 1 and 0. Letters N through X are in the northern hemisphere, while C through M
are in the southern hemisphere. The following figure from the Atlas Florae
Europaeae website illustrates the UTM zones over Europe:
Converting from latitude and longitude is even easier. We just pass the latitude
and longitude to the from_latlon() method, which returns a tuple with the same
parameters accepted by the to_latlon() method:
>>> import utm
>>> utm.from_latlon(48.55199390882121, 8.725555729071763)
(479747.04524576373, 5377691.373080335, 32, 'U')
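You can sanity-check the zone and band yourself with a little arithmetic: zones are 6-degree slices of longitude starting at 180 degrees west, and bands are 8-degree slices of latitude starting at 80 degrees south, lettered C through X while skipping I and O. This is our own sketch, not part of the utm module, and it ignores the special zone exceptions around Norway and Svalbard:

```python
def utm_zone(longitude):
    # Each of the 60 zones spans 6 degrees, starting at 180 degrees west
    return int((longitude + 180) // 6) + 1

def utm_band(latitude):
    # 8-degree bands from 80S to 84N, lettered C..X without I and O
    bands = "CDEFGHJKLMNPQRSTUVWX"
    return bands[int((latitude + 80) // 8)]

# The point from the from_latlon() example above
print(utm_zone(8.725555729071763), utm_band(48.55199390882121))
```

The result agrees with the (32, 'U') returned by from_latlon() above.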
The algorithms used in this Python implementation are described in detail in the
module's documentation.
Reprojecting vector data
While reprojection is less common these days because of more advanced methods of
data distribution, sometimes you do need to reproject a shapefile. The pure Python
utm module works for reference system conversion, but for a full reprojection, we
need some help from the OGR Python API. The OGR API contained in the osgeo
module has the Open Spatial Reference module, also known as osr, which we'll use
for reprojection.
As an example, we'll use a point shapefile containing New York City museum and
gallery locations in the Lambert conformal projection. We'll reproject it to WGS84
geographic (or rather, unproject it). You can download this zipped shapefile.
The following minimalist script reprojects the shapefile. The geometry is transformed
and written to the new file, but the dbf file is simply copied to the new name as we
aren't changing it. The standard Python shutil module, which is short for shell
utilities, is used to copy the dbf. The source and target shapefile names are variables
at the beginning of the script. The target projection is also near the top, which is set
using a European Petroleum Survey Group (EPSG) code. The script assumes there
is a .prj projection file which defines the source projection. If there isn't, you can
manually define it using the same syntax as the target projection. Each section is
marked with comments, as you can see here:
from osgeo import ogr
from osgeo import osr
import os
import shutil

srcName = "NYC_MUSEUMS_LAMBERT.shp"
tgtName = "NYC_MUSEUMS_GEO.shp"
tgt_spatRef = osr.SpatialReference()
tgt_spatRef.ImportFromEPSG(4326)
driver = ogr.GetDriverByName("ESRI Shapefile")
src = driver.Open(srcName, 0)
srcLyr = src.GetLayer()
src_spatRef = srcLyr.GetSpatialRef()
if os.path.exists(tgtName):
    driver.DeleteDataSource(tgtName)
tgt = driver.CreateDataSource(tgtName)
lyrName = os.path.splitext(tgtName)[0]
# Use well-known binary format (WKB) to specify geometry
tgtLyr = tgt.CreateLayer(lyrName, geom_type=ogr.wkbPoint)
featDef = srcLyr.GetLayerDefn()
trans = osr.CoordinateTransformation(src_spatRef, tgt_spatRef)
srcFeat = srcLyr.GetNextFeature()
while srcFeat:
    geom = srcFeat.GetGeometryRef()
    geom.Transform(trans)
    feature = ogr.Feature(featDef)
    feature.SetGeometry(geom)
    tgtLyr.CreateFeature(feature)
    feature.Destroy()
    srcFeat = srcLyr.GetNextFeature()
# Convert geometry to Esri flavor of Well-Known Text (WKT) format
# for export to the projection (prj) file.
tgt_spatRef.MorphToESRI()
prj = open(lyrName + ".prj", "w")
prj.write(tgt_spatRef.ExportToWkt())
prj.close()
srcDbf = os.path.splitext(srcName)[0] + ".dbf"
tgtDbf = lyrName + ".dbf"
shutil.copyfile(srcDbf, tgtDbf)
The following gure shows the reprojected points in QGIS with satellite imagery in
the background:
If you are working with a set of points, you can reproject them
programmatically instead of reprojecting a shapele using PyProj
as given in the following link:
Editing shapeles
Shapeles are a fundamental data format in GIS used both for exchanging data as
well as performing GIS analysis. In this section, we'll learn how to work with these
les extensively. In Chapter 2, Geospatial Data, we discussed shapeles as a format
which can have many different le types associated with it. For editing shapeles
and most other operations, we are only concerned with two types: the .shp le and
the .dbf le.
The .shp le contains the geometry while the .dbf le contains the attributes of the
corresponding geometry. For each geometry record in a shapele, there is one dbf
record. The records aren't numbered or identied in any way. This means while
adding and deleting information from a shapele, you must be careful to remove or
add a record to each le type to match.
As discussed in Chapter 4, Geospatial Python Toolbox, there are two libraries to edit
shapefiles in Python. One is the Python bindings to the OGR library. The other is the
PyShp library, which is written in pure Python. We'll use PyShp, in keeping with the
pure-Python-when-possible theme of this book. To install PyShp, use easy_install
or pip.
To begin editing shapefiles, we'll start with a point shapefile containing cities for the
state of Mississippi, which you can download as a ZIP file. Download the file to your
working directory and unzip it:
The points we are working with can be seen in the following figure:
Accessing the shapele
Let's use PyShp to open this shapele:
>>> import shapefile
>>> r = shapefile.Reader("MSCities_Geo_Pts")
>>> r
<shapefile.Reader instance at 0x00BCB760>
We created a shapele Reader object instance and set it to the variable r. Notice that
when we passed the lename to the Reader class, we didn't use any le extensions.
Remember that we are dealing with at least two different les ending in .shp and
.dbf. So, the base lename without the extension that is common to these two les is
all we really need.
You can, however, use a le extension. PyShp will just ignore it and use the base
lename. So why would you add an extension? Most operating systems allow an
arbitrary number of periods in a lename. For example, you might have a shapele
with the following base name as myShapefile.version.1.2.
In this case, PyShp will try to interpret the characters after the last period as a
le extension which would be .2. This issue will prevent you from opening the
shapele. So, if your shapele has periods in the base name, you would need to
add a le extension such as .shp or .dbf to the lename.
Once you have opened a shapefile and created a Reader object, you can get some
information about the geographic data. In the following sample, we'll get the
bounding box, shape type, and the number of records in the shapefile from our
Reader object:
>>> r.bbox
[-91.38804855553174, 30.29314882296931, -88.18631833931401, ...]
>>> r.shapeType
1
>>> r.numRecords
298
The bounding box, stored in the bbox property, is returned as a list containing the
minimum x value, minimum y value, maximum x value, and maximum y value. The
shape type, available as the shapeType property, is a numeric code defined by the
official shapefile specification. In this case, 1 represents a point shapefile, 3 represents
lines, and 5 represents polygons. And finally, the numRecords property tells us that
there are 298 records in this shapefile. Because it is a simple point shapefile, we then
know that there are 298 points, each with its own dbf record.
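To make those numeric codes easier to read, you can map them to names from the shapefile specification. This is a sketch of our own; the bounding box values below are illustrative placeholders, not the exact values from the Mississippi dataset:

```python
# Shape type codes from the shapefile specification
SHAPE_TYPES = {0: "Null", 1: "Point", 3: "PolyLine", 5: "Polygon"}

def summarize(bbox, shape_type, num_records):
    """Build a one-line, human-readable summary of a Reader's metadata."""
    xmin, ymin, xmax, ymax = bbox
    name = SHAPE_TYPES.get(shape_type, "Other")
    return "{} {} records in a {:.2f} x {:.2f} degree box".format(
        num_records, name, xmax - xmin, ymax - ymin)

# Illustrative values only; the max y here is a placeholder
print(summarize([-91.39, 30.29, -88.19, 35.00], 1, 298))
```

In practice you would pass r.bbox, r.shapeType, and r.numRecords straight from the Reader object.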
Reading shapele attributes
The dbf le is a simple database format which is structured in a way similar
to a spreadsheet with rows and columns, each column as a label dening what
information it contains. We can view that information by checking the fields
property of the Reader object:
>>> r.fields
[('DeletionFlag', 'C', 1, 0), ['STATEFP10', 'C', 2, 0],
['PLACEFP10', 'C', 5, 0], ['PLACENS10', 'C', 8, 0], ['GEOID10',
'C', 7, 0], ['NAME10', 'C', 100, 0], ['NAMELSAD10', 'C', 100,
0], ['LSAD10', 'C', 2, 0], ['CLASSFP10', 'C', 2, 0],
['PCICBSA10', 'C', 1, 0], ['PCINECTA10', 'C', 1, 0], ['MTFCC10',
'C', 5, 0], ['FUNCSTAT10', 'C', 1, 0], ['ALAND10', 'N', 14, 0],
['AWATER10', 'N', 14,0], ['INTPTLAT10', 'C', 11, 0],
['INTPTLON10', 'C', 12, 0]]
The fields property returns quite a bit of information. The fields are returned as a list
with information about each field, called field descriptors. For each field, the
following information is presented:
Field name: This is the name of the field as text, which can be no longer than
10 characters for shapefiles.
Field type: This is the type of the field, which can be text, number,
date, floating point number, or Boolean, represented as C, N, D, F, and L
respectively. The shapefile specification says it uses the dbf format specified
as dBASE III, but most GIS software seems to support dBASE IV. In version
IV (4), the number and floating point types are equivalent.
Field length: This is the length of the data in characters or digits.
Decimal length: This is the number of decimal places in a number or floating
point field.
The first field descriptor outlines a hidden field which is a part of the dbf file
format specification. DeletionFlag allows software to mark records for deletion
without actually deleting them. That way, the information is still in the file but can
be removed from the displayed record list or search queries.
If we just want the field name and not the other metadata, we can use a Python list
comprehension to return just the first item in each descriptor and also ignore the
DeletionFlag field. This example creates a list comprehension that returns the first
item in each descriptor (the field name), starting with the second descriptor to skip the
deletion flag:
>>> [item[0] for item in r.fields[1:]]
['STATEFP10', 'PLACEFP10', 'PLACENS10', 'GEOID10', 'NAME10',
'NAMELSAD10', 'LSAD10', 'CLASSFP10', 'PCICBSA10', 'PCINECTA10',
'MTFCC10', 'FUNCSTAT10', 'ALAND10', 'AWATER10', 'INTPTLAT10',
'INTPTLON10']
Now, we have just the field names, which are much easier to read. For clarity:
the field names all contain the number 10 because this is the 2010 version of this
shapefile, which is created as a part of each census. These kinds of abbreviations are
common in shapefile dbf files due to the 10-character limit on field names.
Next, let's examine some of the records that these fields describe. We can view an
individual record using the r.record() method. We know from the first example
that there are 298 records. So, let's examine the third record as an example. The
records are accessed using list indexes. In Python, indexes start at 0, so we have to
subtract one from the desired record number to get the index. For record 3, the index
would be 2. You just pass the index to the record() method:
>>> r.record(2)
['28', '16620', '02406337', '2816620', 'Crosby', 'Crosby town',
'43', 'C1', 'N', 'N', 'G4110', 'A', 5489412, 21336,
'+31.2742552', '-091.0614840']
As you can see, the field names are stored separately from the actual records. If you
want to select a record value, you need its index. The index of the city name in each
record is 4:
>>> r.record(2)[4]
'Crosby'
But counting indexes is tedious. It's much easier to reference a value by its
field name. There are several ways we can associate a field name with the value
of a particular record. The first is to use the index() method of Python lists to
programmatically get the index from the field name:
>>> fieldNames = [item[0] for item in r.fields[1:]]
>>> name10 = fieldNames.index("NAME10")
>>> name10
4
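Another common approach is to pair the field names with a record's values so that each value can be looked up by name. This sketch uses plain Python (zip and dict) with the field names and the Crosby record shown above, copied as literals so the example stands alone:

```python
# Field names (DeletionFlag already removed) and the record
# values for Crosby, both taken from the examples above.
field_names = ['STATEFP10', 'PLACEFP10', 'PLACENS10', 'GEOID10',
               'NAME10', 'NAMELSAD10', 'LSAD10', 'CLASSFP10',
               'PCICBSA10', 'PCINECTA10', 'MTFCC10', 'FUNCSTAT10',
               'ALAND10', 'AWATER10', 'INTPTLAT10', 'INTPTLON10']
record = ['28', '16620', '02406337', '2816620', 'Crosby',
          'Crosby town', '43', 'C1', 'N', 'N', 'G4110', 'A',
          5489412, 21336, '+31.2742552', '-091.0614840']

# Pair each field name with its value in a dictionary.
attributes = dict(zip(field_names, record))
print(attributes['NAME10'])  # Crosby
```

In a live session you would build field_names from r.fields[1:] and pass r.record(2) as the values, but the technique is the same.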