Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, by Nathan Yau

Table of Contents
Cover
Chapter 1: Telling Stories with Data
More Than Numbers
What to Look For
Design
Wrapping Up
Chapter 2: Handling Data
Gather Data
Formatting Data
Wrapping Up
Chapter 3: Choosing Tools to Visualize Data
Out-of-the-Box Visualization
Programming
Illustration
Mapping
Survey Your Options
Wrapping Up
Chapter 4: Visualizing Patterns over Time
What to Look for over Time
Discrete Points in Time
Continuous Data
Wrapping Up
Chapter 5: Visualizing Proportions
What to Look for in Proportions
Parts of a Whole
Proportions over Time
Wrapping Up
Chapter 6: Visualizing Relationships
What Relationships to Look For
Correlation
Distribution
Comparison
Wrapping Up
Chapter 7: Spotting Differences
What to Look For
Comparing across Multiple Variables
Reducing Dimensions
Searching for Outliers
Wrapping Up
Chapter 8: Visualizing Spatial Relationships
What to Look For
Specific Locations
Regions
Over Space and Time
Wrapping Up
Chapter 9: Designing with a Purpose
Prepare Yourself
Prepare Your Readers
Visual Cues
Good Visualization
Wrapping Up
Introduction
Learning Data

Chapter 1
Telling Stories with Data
Think of all the popular data visualization works out
there—the ones that you always hear in lectures or read
about in blogs, and the ones that popped into your head
as you were reading this sentence. What do they all
have in common? They all tell an interesting story.
Maybe the story was to convince you of something.
Maybe it was to compel you to action, enlighten you with
new information, or force you to question your own
preconceived notions of reality. Whatever it is, the best
data visualization, big or small, for art or a slide
presentation, helps you see what the data have to say.

More Than Numbers
Face it. Data can be boring if you don’t know what
you’re looking for or don’t know that there’s something
to look for in the first place. It’s just a mix of numbers
and words that mean nothing other than their raw
values. The great thing about statistics and visualization
is that they help you look beyond that. Remember, data
is a representation of real life. It’s not just a bucket of
numbers. There are stories in that bucket. There’s
meaning, truth, and beauty. And just like real life,
sometimes the stories are simple and straightforward;
and other times they’re complex and roundabout. Some
stories belong in a textbook. Others come in novel form.
It’s up to you, the statistician, programmer, designer, or
data scientist to decide how to tell the story.
This was one of the first things I learned as a
statistics graduate student. I have to admit that before
entering the program, I thought of statistics as pure
analysis, and I thought of data as the output of a
mechanical process. This is actually the case a lot of
the time. I mean, I did major in electrical engineering, so
it’s not all that surprising I saw data in that light.
Don’t get me wrong. That’s not necessarily a bad
thing, but what I’ve learned over the years is that data,
while objective, often has a human dimension to it.
For example, look at unemployment again. It’s easy
to spout state averages, but as you’ve seen, it can vary
a lot within the state. It can vary a lot by neighborhood.
Probably someone you know lost a job over the past
few years, and as the saying goes, they’re not just
another statistic, right? The numbers represent
individuals, so you should approach the data in that
way. You don’t have to tell every individual’s story.
However, there’s a subtle yet important difference
between the unemployment rate increasing by
5 percentage points and several hundred thousand
people left jobless. The former reads as a number
without much context, whereas the latter is more
relatable.

Journalism
A graphics internship at The New York Times drove the
point home for me. It was only for 3 months during the
summer after my second year of graduate school, but
it’s had a lasting impact on how I approach data. I didn’t
just learn how to create graphics for the news. I learned
how to report data as the news, and with that came a lot
of design, organization, fact checking, sleuthing, and
research.
There was one day when my only goal was to verify
three numbers in a dataset, because when The New
York Times graphics desk creates a graphic, it makes
sure what it reports is accurate. Only after we knew the
data was reliable did we move on to the presentation.
It’s this attention to detail that makes its graphics so
good.
Take a look at any New York Times graphic. It
presents the data clearly, concisely, and ever so nicely.
What does that mean though? When you look at a
graphic, you get the chance to understand the data.
Important points or areas are annotated; symbols and
colors are carefully explained in a legend or with points;
and the Times makes it easy for readers to see the
story in the data. It’s not just a graph. It’s a graphic.
The graphic in Figure 1-1 is similar to what you will
find in The New York Times. It shows the increasing
probability that you will die within one year given your
age.

Figure 1-1: Probability of death given your age

Check out some of the best New York Times
graphics at http://datafl.ws/nytimes.

The base of the graphic is simply a line chart.
However, design elements help tell the story better.
Labeling and pointers provide context and help you see
why the data is interesting; and line width and color
direct your eyes to what’s important.
Chart and graph design isn’t just about making
statistical visualization but also explaining what the
visualization shows.

Note
See Geoff McGhee’s video documentary
“Journalism in the Age of Data” for more on how
journalists use data to report current events.
This includes great interviews with some of the
best in the business.

Art
The New York Times is objective. It presents the data
and gives you the facts. It does a great job at that. On
the opposite side of the spectrum, visualization is less
about analytics and more about tapping into your
emotions. Jonathan Harris and Sep Kamvar did this
quite literally in We Feel Fine (Figure 1-2).

Figure 1-2: We Feel Fine by Jonathan Harris and Sep
Kamvar

The interactive piece scrapes sentences and
phrases from personal public blogs and then visualizes
them as a box of floating bubbles. Each bubble
represents an emotion and is color-coded accordingly.
As a whole, it is like individuals floating through space,
but watch a little longer and you see bubbles start to
cluster. Apply sorts and categorization through the
interface to see how these seemingly random vignettes
connect. Click an individual bubble to see a single story.
It’s poetic and revealing at the same time.
Interact and explore people’s emotions in Jonathan Harris and Sep Kamvar’s live and online piece at http://wefeelfine.org.

There are lots of other examples such as Golan
Levin’s The Dumpster, which explores blog entries that
mention breaking up with a significant other; Kim
Asendorf’s Sumedicina, which tells a fictional story of a
man running from a corrupt organization, with not words,
but graphs and charts; or Andreas Nicolas Fischer’s
physical sculptures that show economic downturn in the
United States.
See FlowingData for many more examples of
art and data at http://datafl.ws/art.

The main point is that data and visualization don’t
always have to be just about the cold, hard facts.
Sometimes you’re not looking for analytical insight.
Rather, sometimes you can tell the story from an
emotional point of view that encourages viewers to
reflect on the data. Think of it like this. Not all movies
have to be documentaries, and not all visualization has
to be traditional charts and graphs.

Entertainment
Somewhere in between journalism and art, visualization
has also found its way into entertainment. If you think of
data in the more abstract sense, outside of
spreadsheets and comma-delimited text files, where
photos and status updates also qualify, this is easy to
see.
Facebook used status updates to gauge the happiest
day of the year, and online dating site OkCupid used
online information to estimate the lies people tell to
make their digital selves look better, as shown in Figure
1-3. These analyses had little to do with improving a
business, increasing revenues, or finding glitches in a
system. They circulated the web like wildfire because of
their entertainment value. The data revealed a little bit
about ourselves and society.
Facebook found the happiest day to be
Thanksgiving, and OkCupid found that people tend to
exaggerate their height by about 2 inches.

Figure 1-3: Male Height Distribution on OkCupid

Check out the OkTrends blog for more
revelations from online dating such as what
white people really like and how not to be ugly
by accident: http://blog.okcupid.com.

Compelling
Of course, stories aren’t always to keep people
informed or entertained. Sometimes they’re meant to
provide urgency or compel people to action. Who can
forget that point in An Inconvenient Truth when Al Gore
stands on that scissor lift to show rising levels of carbon
dioxide?
For my money though, no one has done this better
than Hans Rosling, professor of International Health and
director of the Gapminder Foundation. Using a tool
called Trendalyzer, as shown in Figure 1-4, Rosling runs
an animation that shows changes in poverty by country.
He does this during a talk that first draws you in deep to
the data and by the end, everyone is on their feet
applauding. It’s an amazing talk, so if you haven’t seen it
yet, I highly recommend it.
The visualization itself is fairly basic. It’s a motion
chart. Bubbles represent countries and move based on
the corresponding country’s poverty during a given year.
Why is the talk so popular then? Because Rosling
speaks with conviction and excitement. He tells a story.
How often have you seen a presentation with charts and
graphs that put everyone to sleep? Instead Rosling gets
the meaning of the data and uses that to his advantage.
Plus, the sword-swallowing at the end of his talk drives
the point home. After I saw Rosling’s talk, I wanted to
get my hands on that data and take a look myself. It was
a story I wanted to explore, too.

Figure 1-4: Trendalyzer by the Gapminder Foundation

Watch Hans Rosling wow the audience with
data and an amazing demonstration at
http://datafl.ws/hans.

I later saw a Gapminder talk on the same topic with
the same visualizations but with a different speaker. It
wasn’t nearly as exciting. To be honest, it was kind of a
snoozer. There wasn’t any emotion. I didn’t feel any
conviction or excitement about the data. So it’s not just
about the data that makes for interesting chatter. It’s
how you present it and design it that can help people
remember.

When it’s all said and done, here’s what you need to
know. Approach visualization as if you were telling a
story. What kind of story are you trying to tell? Is it a
report, or is it a novel? Do you want to convince people
that action is necessary?
Think character development. Every data point has a
story behind it in the same way that every character in a
book has a past, present, and future. There are
interactions and relationships between those data
points. It’s up to you to find them. Of course, before
expert storytellers write novels, they must first learn to
construct sentences.

What to Look For
Okay, stories. Check. Now what kind of stories do you
tell with data? Well, the specifics vary by dataset, but
generally speaking, you should always be on the lookout
for these two things whatever your graphic is for:
patterns and relationships.

Patterns
Stuff changes as time goes by. You get older, your hair
grays, and your sight starts to get kind of fuzzy (Figure
1-5). Prices change. Logos change. Businesses are
born. Businesses die. Sometimes these changes are
sudden and without warning. Other times the change

happens so slowly you don’t even notice.

Figure 1-5: A comic look at aging

Whatever it is you’re looking at, the change itself can
be interesting as can the changing process. It is here
you can explore patterns over time. For example, say
you looked at stock prices over time. They of course
increase and decrease, but by how much do they
change per day? Per week? Per month? Are there
periods when the stock went up more than usual? If so,
why did it go up? Were there any specific events that
triggered the change?
As you can see, when you start with a single question
as a starting point, it can lead you to additional
questions. This isn’t just for time series data, but with all
types of data. Try to approach your data in a more
exploratory fashion, and you’ll most likely end up with
more interesting answers.
You can split your time series data in different ways.
In some cases it makes sense to show hourly or daily
values. Other times, it could be better to see that data
on a monthly or annual basis. When you go with the
former, your time series plot could show more noise,
whereas the latter is more of an aggregate view.
Those with websites and some analytics software in
place can identify with this quickly. When you look at
traffic to your site on a daily basis, as shown in Figure
1-6, the graph is bumpier. There are a lot more
fluctuations.

Figure 1-6: Daily unique visitors to FlowingData

When you look at it on a monthly basis, as shown in
Figure 1-7, fewer data points are on the same graph,
covering the same time span, so it looks much
smoother.
I’m not saying one graph is better than the other. In
fact, they can complement each other. How you split
your data depends on how much detail you need (or
don’t need).
Of course, patterns over time are not the only ones to
look for. You can also find patterns in aggregates that
can help you compare groups, people, and things. What
do you tend to eat or drink each week? What does the
President usually talk about during the State of the
Union address? What states usually vote Republican?
Looking at patterns over geographic regions would be
useful in this case. While the questions and data types
are different, your approach is similar, as you’ll see in
the following chapters.

Figure 1-7: Monthly unique visitors to FlowingData

Relationships
Have you ever seen a graphic with a whole bunch of
charts on it that seemed like they’ve been randomly
placed? I’m talking about the graphics that seem to be
missing that special something, as if the designer gave
only a little bit of thought to the data itself and then
belted out a graphic to meet a deadline. Often, that
special something is relationships.
In statistics, this usually means correlation and
causation. Multiple variables might be related in some
way. Chapter 6, “Visualizing Relationships,” covers
these concepts and how to visualize them.

At a more abstract level though, where you’re not
thinking about equations and hypothesis tests, you can
design your graphics to compare and contrast values
and distributions visually. For a simple example, look at
this excerpt on technology from the World Progress
Report in Figure 1-8.
The World Progress Report was a graphical
report that compared progress around the
world using data from UNdata. See the full
version at http://datafl.ws/12i.

These are histograms that show the number of users
of the Internet, Internet subscriptions, and broadband
per 100 inhabitants. Notice that the range for Internet
users (0 to 95 per 100 inhabitants) is much wider than
that of the other two datasets.

Figure 1-8: Technology adoption worldwide

The quick-and-easy thing to do would have been to
let your software decide what range to use for each
histogram. However, each histogram was made on the
same range even though there were no countries that
had 95 Internet subscribers or broadband users per
100 inhabitants. This enables you to easily compare the
distributions between the groups.
So when you end up with a lot of different datasets, try
to think of them as several groups instead of separate
compartments that do not interact with each other. It can
make for more interesting results.

Questionable Data
While you’re looking for the stories in your data, you
should always question what you see. Remember, just
because it’s numbers doesn’t mean it’s true.
I have to admit. Data checking is definitely my least
favorite part of graph-making. I mean, when someone, a
group, or a service provides you with a bunch of data, it
should be up to them to make sure all their data is legit.
But this is what good graph designers do. After all,
reliable builders don’t use shoddy cement for a house’s
foundation, so don’t use shoddy data to build your data
graphic.
Data-checking and verification is one of the most important parts—if not the most important part—of graph design.
Basically, what you’re looking for is stuff that makes
no sense. Maybe there was an error at data entry and
someone added an extra zero or missed one. Maybe
there were connectivity issues during a data scrape,
and some bits got mucked up in random spots.
Whatever it is, you need to verify with the source if
anything looks funky.
The person who supplied the data usually has a
sense of what to expect. If you were the one who
collected the data, just ask yourself if it makes sense:
That state is 90 percent of whatever and all other states
are only in the 10 to 20 percent range. What’s going on
there?
Often, an anomaly is simply a typo, and other times
it’s actually an interesting point in your dataset that
could form the whole drive for your story. Just make sure
you know which one it is.

Design
When you have all your data in order, you’re ready to
visualize. Whatever you’re making, whether it is for a
report, an infographic online, or a piece of data art, you
should follow a few basic rules. There’s wiggle room
with all of them, and you should think of what follows as
more of a framework than a hard set of rules, but this is
a good place to start if you are just getting into data
graphics.

Explain Encodings
The design of every graph follows a familiar flow. You
get the data; you encode the data with circles, bars, and
colors; and then you let others read it. The readers have
to decode your encodings at this point. What do these
circles, bars, and colors represent?
William Cleveland and Robert McGill have written
about encodings in detail. Some encodings work better
than others. But it won’t matter what you choose if
readers don’t know what the encodings represent in the
first place. If they can’t decode, the time you spend
designing your graphic is a waste.

Note
See Cleveland and McGill’s paper on Graphical
Perception and Graphical Methods for
Analyzing Data for more on how people encode
shapes and colors.

You sometimes see this lack of context with graphics
that are somewhere in between data art and
infographic. You definitely see it a lot with data art. A
label or legend can completely mess up the vibe of a
piece of work, but at the least, you can include some
information in a short description paragraph. It helps
others appreciate your efforts.
Other times you see this in actual data graphics,
which can be frustrating for readers, which is the last
thing you want. Sometimes you might forget because
you’re actually working with the data, so you know what
everything means. Readers come to a graphic blind
though without the context that you gain from analyses.
So how can you make sure readers can decode your
encodings? Explain what they mean with labels,
legends, and keys. Which one you choose can vary
depending on the situation. For example, take a look at
the world map in Figure 1-9 that shows usage of Firefox
by country.

Figure 1-9: Firefox usage worldwide by country

You can see different shades of blue for different
countries, but what do they mean? Does dark blue
mean more or less usage? If dark blue means high
usage, what qualifies as high usage? As-is, this map is
pretty useless to us. But if you provide the legend in
Figure 1-10, it clears things up. The color legend also
serves double time as a histogram showing the
distribution of usage by number of users.

Figure 1-10: Legend for Firefox usage map

You can also directly label shapes and objects in your
graphic if you have enough space and not too many
categories, as shown in Figure 1-11. This is a graph
that shows the number of nominations an actor had
before winning an Oscar for best actor.

Figure 1-11: Directly labeled objects

A theory floated around the web that actors who had
the most nominations among their cohorts in a given
year generally won the statue. As labeled, dark orange
shows actors who did have the most nominations,
whereas light orange shows actors who did not.
As you can see, plenty of options are available to you.
They’re easy to use, but these small details can make a
huge difference on how your graphic reads.

Label Axes
Along the same lines as explaining your encodings, you
should always label your axes. Without labels or an
explanation, your axes are just there for decoration.
Label your axes so that readers know what scale points
are plotted on. Is it logarithmic, incremental,
exponential, or per 100 flushing toilets? Personally, I
always assume it’s that last one when I don’t see labels.
To demonstrate my point, rewind to a contest I held
on FlowingData a couple of years ago. I posted the
image in Figure 1-12 and asked readers to label the
axes for maximum amusement.

Figure 1-12: Add your caption here.

There were about 60 different captions for the same
graph; Figure 1-13 shows a few.
As you can see, even though everyone looked at the
same graph, a simple change in axis labels told a
completely different story. Of course, this was just for
play. Now just imagine if your graph were meant to be
taken seriously. Without labels, your graph is
meaningless.

Figure 1-13: Some of the results from a caption
contest on FlowingData

Keep Your Geometry in Check
When you design a graph, you use geometric shapes.
A bar graph uses rectangles, and you use the length of
the rectangles to represent values. In a dot plot, the
position indicates value—same thing with a standard
time series chart. Pie charts use angles to indicate
value, and the sum of the values always equals 100
percent (see Figure 1-14). This is easy stuff, so be
careful because it’s also easy to mess up. You’re going
to make a mistake if you don’t pay attention, and when
you do mess up, people, especially on the web, won’t
be afraid to call you out on it.

Figure 1-14: The right and wrong way to make a pie
chart

Another common mistake is when designers start to
use two-dimensional shapes to represent values, but
size them as if they were using only a single dimension.
The rectangles in a bar chart are two-dimensional, but
you only use one length as an indicator. The width
doesn’t mean anything. However, when you create a
bubble chart, you use an area to represent values.
Beginners often use radius or diameter instead, and the
scale is totally off.
Figure 1-15 shows a pair of circles that have been
sized by area. This is the right way to do it.

Figure 1-15: The right way to size bubbles

Figure 1-16 shows a pair of circles sized by diameter. The first circle has twice the diameter of the second but four times the area.
It’s the same deal with rectangles, like in a treemap.
You use the area of the rectangles to indicate values
instead of the length or width.

Figure 1-16: The wrong way to size bubbles
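Here's a minimal sketch of that arithmetic in Python (the values and names are made up for illustration, not from the book). The point is to derive each radius from the intended area, since area = pi * radius^2, rather than scaling the radius by the value directly.

import math

values = [10.0, 20.0, 40.0]  # hypothetical data values

# Right: make area proportional to the value, then derive the radius.
# A value twice as big gets a circle with twice the area.
radii = [math.sqrt(v / math.pi) for v in values]

# Wrong: using the value as the radius. A value twice as big gets
# four times the area, which exaggerates the difference.
wrong_radii = [v for v in values]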

Include Your Sources
This should go without saying, but so many people miss
this one. Where did the data come from? If you look at
the graphics printed in the newspaper, you always see
the source somewhere, usually in small print along the
bottom. You should do the same. Otherwise readers
have no idea how accurate your graphic is.
There’s no way for them to know that the data wasn’t
just made up. Of course, you would never do that, but
not everyone will know that. Other than making your
graphics more reputable, including your source also lets
others fact check or analyze the data.
Inclusion of your data source also provides more
context to the numbers. Obviously a poll taken at a state
fair is going to have a different interpretation than one
conducted door-to-door by the U.S. Census.

Consider Your Audience
Finally, always consider your audience and the purpose
of your graphics. For example, a chart designed for a
slide presentation should be simple. You can include a
bunch of details, but only the people sitting up front will
see them. On the other hand, if you design a poster
that’s meant to be studied and examined, you can
include a lot more details.
Are you working on a business report? Then don’t try
to create the most beautiful piece of data art the world
has ever seen. Instead, create a clear and straight-to-the-point graphic. Are you using graphics in analyses?
Then the graphic is just for you, and you probably don’t
need to spend a lot of time on aesthetics and
annotation. Is your graphic meant for publication to a
mass audience? Don’t get too complicated, and explain
any challenging concepts.

Wrapping Up
In short, start with a question, investigate your data with
a critical eye, and figure out the purpose of your
graphics and who they’re for. This will help you design a
clear graphic that’s worth people’s time—no matter
what kind of graphic it is.
You learn how to do this in the following chapters. You
learn how to handle and visualize data. You learn how to
design graphics from start to finish. You then apply what
you learn to your own data. Figure out what story you
want to tell and design accordingly.

Chapter 2
Handling Data
Before you start working on the visual part of any
visualization, you actually need data. The data is what
makes a visualization interesting. If you don’t have
interesting data, you just end up with a forgettable graph
or a pretty but useless picture. Where can you find good
data? How can you access it?
When you have your data, it needs to be formatted so
that you can load it into your software. Maybe you got
the data as a comma-delimited text file or an Excel
spreadsheet, and you need to convert it to something
such as XML, or vice versa. Maybe the data you want is
accessible point-by-point from a web application, but
you want an entire spreadsheet.
Learn to access and process data, and your
visualization skills will follow.

Gather Data
Data is the core of any visualization. Fortunately, there
are a lot of places to find it. You can get it from experts
in the area you’re interested in, a variety of online
applications, or you can gather it yourself.

Provided by Others
This route is common, especially if you’re a freelance
designer or work in a graphics department of a larger
organization. This is a good thing a lot of the time
because someone else did all the data gathering work
for you, but you still need to be careful. A lot of mistakes
can happen along the way before that nicely formatted
spreadsheet gets into your hands.
When you share data with spreadsheets, the most
common mistake to look for is typos. Are there any
missing zeros? Did your client or data supplier mean
six instead of five? At some point, data was read from
one source and then input into Excel or a different
spreadsheet program (unless a delimited text file was
imported), so it’s easy for an innocent typo to make its
way through the vetting stage and into your hands.
You also need to check for context. You don’t need to
become an expert in the data’s subject matter, but you
should know where the original data came from, how it
was collected, and what it’s about. This can help you
build a better graphic and tell a more complete story
when you design your graphic. For example, say you’re
looking at poll results. When did the poll take place?
Who conducted the poll? Who answered? Obviously,
poll results from 1970 are going to take on a different
meaning from poll results from the present day.

Finding Sources
If the data isn’t directly sent to you, it’s your job to go out
and find it. The bad news is that, well, that’s more work
on your shoulders, but the good news is that it’s
getting easier and easier to find data that’s relevant and
machine-readable (as in, you can easily load it into
software). Here’s where you can start your search.

Search Engines
How do you find anything online nowadays? You Google
it. This is a no-brainer, but you’d be surprised how many
times people email me asking if I know where to find a
particular dataset and a quick search provided relevant
results. Personally, I turn to Google and occasionally
look to Wolfram|Alpha, the computational search
engine.
See Wolfram|Alpha at http://wolframalpha.com. The search engine can be especially useful if you’re looking for some basic statistics on a topic.

Direct from the Source
If a direct query for “data” doesn’t provide anything of
use, try searching for academics who specialize in the
area you’re interested in finding data for. Sometimes
they post data on their personal sites. If not, scan their
papers and studies for possible leads. You can also try
emailing them, but make sure they’ve actually done
related studies. Otherwise, you’ll just be wasting
everyone’s time.
You can also spot sources in graphics published by
news outlets such as The New York Times. Usually
data sources are included in small print somewhere on
the graphic. If it’s not in the graphic, it should be
mentioned in the related article. This is particularly
useful when you see a graphic in the paper or online
that uses data you’re interested in exploring. Search for
a site for the source, and the data might be available.
This won’t always work because finding contacts
seems to be a little easier when you email saying that
you’re a reporter for the so-and-so paper, but it’s worth
a shot.

Universities
As a graduate student, I frequently make use of the
academic resources available to me, namely the library.
Many libraries have amped up their technology
resources and actually have some expansive data
archives. A number of statistics departments also keep
a list of data files, many of which are publicly
accessible. Albeit, many of the datasets made available
by these departments are intended for use with course
labs and homework. I suggest visiting the following
resources:
Data and Story Library (DASL) (http://lib.stat.cmu.edu/DASL/)—An online library of data files and stories that illustrate the use of basic statistics methods, from Carnegie Mellon
Berkeley Data Lab (http://sunsite3.berkeley.edu/wikis/datalab/)—Part of the University of California, Berkeley library system
UCLA Statistics Data Sets (www.stat.ucla.edu/data/)—Some of the data that the UCLA Department of Statistics uses in their labs and assignments

General Data Applications
A growing number of general data-supplying
applications are available. Some applications provide
large data files that you can download for free or for a
fee. Others are built with developers in mind with data
accessible via Application Programming Interface
(API). This lets you use data from a service, such as
Twitter, and integrate the data with your own
application. Following are a few suggested resources:
Freebase (www.freebase.com)—A community effort that mostly provides data on people, places, and things. It’s like Wikipedia for data but more structured. Download data dumps or use it as a backend for your application.
Infochimps (http://infochimps.org)—A data marketplace with free and for-sale datasets. You can also access some datasets via their API.
Numbrary (http://numbrary.com)—Serves as a catalog for (mostly government) data on the web.
AggData (http://aggdata.com)—Another repository of for-sale datasets, mostly focused on comprehensive lists of retail locations.
Amazon Public Data Sets (http://aws.amazon.com/publicdatasets)—There’s not a lot of growth here, but it does host some large scientific datasets.
Wikipedia (http://wikipedia.org)—A lot of smaller datasets in the form of HTML tables on this community-run encyclopedia.

Topical Data
Outside more general data suppliers, there’s no
shortage of subject-specific sites offering loads of free
data.
Following is a small taste of what’s available for the
topic of your choice.

Geography
Do you have mapping software, but no geographic
data? You’re in luck. Plenty of shapefiles and other
geographic file types are at your disposal.
TIGER (www.census.gov/geo/www/tiger/)—From the Census Bureau, probably the most extensive detailed data about roads, railroads, rivers, and ZIP codes you can find
OpenStreetMap (www.openstreetmap.org/)—One of the best examples of data and community effort
Geocommons (www.geocommons.com/)—Both data and a mapmaker
Flickr Shapefiles (www.flickr.com/services/api/)—Geographic boundaries as defined by Flickr users

Sports
People love sports statistics, and you can find decades’
worth of sports data. You can find it on Sports Illustrated
or team organizations’ sites, but you can also find more
on sites dedicated to the data specifically.
Basketball Reference (www.basketball-reference.com/)—Provides data as specific as play-by-play for NBA games.
Baseball DataBank (http://baseballdatabank.org/)—Super basic site where you can download full datasets.
databaseFootball (www.databasefootball.com/)—Browse data for NFL games by team, player, and season.

World
Several noteworthy international organizations keep
data about the world, mainly health and development
indicators. It does take some sifting though, because a
lot of the datasets are quite sparse. It’s not easy to get
standardized data across countries with varied
methods.
Global Health Facts (www.globalhealthfacts.org/)—Health-related data about countries in the world.
UNdata (http://data.un.org/)—Aggregator of world data from a variety of sources
World Health Organization (www.who.int/research/en/)—Again, a variety of health-related datasets such as mortality and life expectancy
OECD Statistics (http://stats.oecd.org/)—Major source for economic indicators
World Bank (http://data.worldbank.org/)—Data for hundreds of indicators and developer-friendly

Government and Politics
There has been a fresh emphasis on data and
transparency in recent years, so many government
organizations supply data, and groups such as the
Sunlight Foundation encourage developers and
designers to make use of it. Government organizations
have been doing this for awhile, but with the launch of
data.gov, much of the data is available in one place.
You can also find plenty of nongovernmental sites that
aim to make politicians more accountable.
Census Bureau (www.census.gov/)—Find extensive demographics here.
Data.gov (http://data.gov/)—Catalog for data supplied by government organizations. Still relatively new, but has a lot of sources.
Data.gov.uk (http://data.gov.uk/)—The Data.gov equivalent for the United Kingdom.
DataSF (http://datasf.org/)—Data specific to San Francisco.
NYC DataMine (http://nyc.gov/data/)—Just like the above, but for New York.
Follow the Money (www.followthemoney.org/)—Big set of tools and datasets to investigate money in state politics.
OpenSecrets (www.opensecrets.org/)—Also provides details on government spending and lobbying.

Data Scraping
Often you can find the exact data that you need, except
there’s one problem. It’s not all in one place or in one
file. Instead it’s in a bunch of HTML pages or on multiple
websites. What should you do?
The straightforward, but most time-consuming
method would be to visit every page and manually enter
your data point of interest in a spreadsheet. If you have
only a few pages, sure, no problem.
What if you have a thousand pages? That would take
too long—even a hundred pages would be tedious. It
would be much easier if you could automate the
process, which is what data scraping is for. You write
some code to visit a bunch of pages automatically, grab
some content from that page, and store it in a database
or a text file.

Note
Although coding is the most flexible way to
scrape the data you need, you can also try tools
such as Needlebase and Able2Extract PDF
converter. Use is straightforward, and they can
save you time.

Example: Scrape a Website
The best way to learn how to scrape data is to jump
right into an example. Say you wanted to download
temperature data for the past year, but you can’t find a
source that provides all the numbers for the right time
frame or the correct city. Go to almost any weather
website, and at the most, you’ll usually see only
temperatures for an extended 10-day forecast. That’s
not even close to what you want. You want actual
temperatures from the past, not predictions about future
weather.

Fortunately, the Weather Underground site does
provide historic temperatures; however, you can see
only one day at a time.
Visit Weather Underground at http://wunderground.com.
To make things more concrete, look up temperature
in Buffalo. Go to the Weather Underground site and
search for BUF in the search box. This should take you
to the weather page for Buffalo Niagara International,
which is the airport in Buffalo (see Figure 2-1).

Figure 2-1: Temperature in Buffalo, New York,
according to Weather Underground

Figure 2-2: Drop-down menu to see historical data for
a selected date

The top of the page provides the current temperature,
a 5-day forecast, and other details about the current
day. Scroll down toward the middle of the page to the
History & Almanac panel, as shown in Figure 2-2.
Notice the drop-down menu where you can select a
specific date.
Adjust the menu to show October 1, 2010, and click
the View button. This takes you to a different view that
shows you details for your selected date (see Figure 2-3).

Figure 2-3: Temperature data for a single day

There’s temperature, degree days, moisture,
precipitation, and plenty of other data points, but for
now, all you’re interested in is maximum temperature
per day, which you can find in the second column,
second row down. On October 1, 2010, the maximum
temperature in Buffalo was 62 degrees Fahrenheit.
Getting that single value was easy enough. Now how
can you get that maximum temperature value every day,
during the year 2009? The easy-and-straightforward
way would be to keep changing the date in the drop-down. Do that 365 times and you’re done.
Wouldn’t that be fun? No. You can speed up the
process with a little bit of code and some know-how,
and for that, turn to the Python programming language
and Leonard Richardson’s Python library called
Beautiful Soup.
You’re about to get your first taste of code in the next
few paragraphs. If you have programming experience,
you can go through the following relatively quickly. Don’t
worry if you don’t have any programming experience
though—I’ll take you through it step-by-step. A lot of
people like to keep everything within a safe click
interface, but trust me. Pick up just a little bit of
programming skills, and you can open up a whole bag
of possibilities for what you can do with data. Ready?
Here you go.
First, you need to make sure your computer has all
the right software installed. If you work on Mac OS X,
you should have Python installed already. Open the
Terminal application and type python to start (see
Figure 2-4).

Figure 2-4: Starting Python in OS X

If you’re on a Windows machine, you can visit the
Python site and follow the directions on how to
download and install.
Visit http://python.org to download and install Python. Don’t worry; it’s not too hard.

Next, you need to download Beautiful Soup, which
can help you read web pages quickly and easily. Save
the Beautiful Soup Python (.py) file in the directory that
you plan to save your code in. If you know your way
around Python, you can also put Beautiful Soup in your
library path, but it’ll work the same either way.
Visit www.crummy.com/software/BeautifulSoup/ to download Beautiful Soup. Download the version that matches the version of Python that you use.

After you install Python and download Beautiful Soup,
start a file in your favorite text or code editor, and save it
as get-weather-data.py. Now you can code.
The first thing you need to do is load the page that
shows historical weather information. The URL for
historical weather in Buffalo on October 1, 2010,
follows:
www.wunderground.com/history/airport/KBUF/2010/10/1/DailyHistory.html?
req_city=NA&req_state=NA&req_statename=NA
If you remove everything after .html in the preceding
URL, the same page still loads, so get rid of those. You
don’t care about those right now.
www.wunderground.com/history/airport/KBUF/2010/10/1/DailyHistory.html
The date is indicated in the URL with /2010/10/1. Using
the drop-down menu, change the date to January 1,
2009, because you’re going to scrape temperature for
all of 2009. The URL is now this:
www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html
Everything is the same as the URL for October 1,
except the portion that indicates the date. It’s /2009/1/1
now. Interesting. Without using the drop-down menu,
how can you load the page for January 2, 2009? Simply
change the date parameter so that the URL looks like
this:
www.wunderground.com/history/airport/KBUF/2009/1/2/DailyHistory.html
Load the preceding URL in your browser and you get
the historical summary for January 2, 2009. So all you
have to do to get the weather for a specific date is to
modify the Weather Underground URL. Keep this in
mind for later.
Now load a single page with Python, using the urllib2
library by importing it with the following line of code:
import urllib2

To load the January 1 page with Python, use the urlopen function.

page = urllib2.urlopen("http://www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html")

This loads all the HTML that the URL points to in the
page variable. The next step is to extract the maximum
temperature value you’re interested in from that HTML,
and for that, Beautiful Soup makes your task much
easier. After urllib2, import Beautiful Soup like so:
from BeautifulSoup import BeautifulSoup

At the end of your file, use Beautiful Soup to read
(that is, parse) the page.
soup = BeautifulSoup(page)

Without getting into nitty-gritty details, this line of code
reads the HTML, which is essentially one long string,
and then stores elements of the page, such as the
header or images, in a way that is easier to work with.

Note
Beautiful Soup provides good documentation
and straightforward examples, so if any of this
is confusing, I strongly encourage you to check
those out on the same Beautiful Soup site you
used to download the library.

For example, if you want to find all the images in the
page, you can use this:
images = soup.findAll('img')

This gives you a list of all the images on the Weather Underground page displayed with the <img /> HTML tag.
Want the first image on the page? Do this:
first_image = images[0]

Want the second image? Change the zero to a one. If you want the src value in the first <img /> tag, you would use this:

src = first_image['src']

Okay, you don’t want images. You just want that one
value: maximum temperature on January 1, 2009, in
Buffalo, New York. It was 26 degrees Fahrenheit. It’s a
little trickier finding that value in your soup than it was
finding images, but you still use the same method. You
just need to figure out what to put in findAll(), so look at
the HTML source.
You can easily do this in all the major browsers. In
Firefox, go to the View menu, and select Page Source.
A window with the HTML for your current page appears,
as shown in Figure 2-5.
Scroll down to where it shows Mean Temperature, or
just search for it, which is faster. Spot the 26. That’s what
you want to extract.
The row is enclosed in a <span> tag with a nobr class. That’s your key. You can find all the elements in the page with the nobr class.
nobrs = soup.findAll(attrs={"class":"nobr"})

Figure 2-5: HTML source for a page on Weather
Underground

As before, this gives you a list of all the occurrences of nobr. The one that you’re interested in is the sixth occurrence, which you can find with the following:
print nobrs[5]

This gives you the whole element, but you just want the 26. Inside the <span> tag with the nobr class is another <span> tag and then the 26. So here’s what you need to use:

dayTemp = nobrs[5].span.string
print dayTemp

Ta Da! You scraped your first value from an HTML
web page. Next step: scrape all the pages for 2009. For
that, return to the original URL.
www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html
Remember that you changed the URL manually to get
the weather data for the date you want. The preceding
code is for January 1, 2009. If you want the page for
January 2, 2009, simply change the date portion of the
URL to match that. To get the data for every day of
2009, load every month (1 through 12) and then load
every day of each month. Here’s the script in full with
comments. Save it to your get-weather-data.py file.
import urllib2
from BeautifulSoup import BeautifulSoup

# Create/open a file called wunder-data.txt (which will be
# a comma-delimited file)
f = open('wunder-data.txt', 'w')

# Iterate through months and day
for m in range(1, 13):
    for d in range(1, 32):

        # Check if already gone through month
        if (m == 2 and d > 28):
            break
        elif (m in [4, 6, 9, 11] and d > 30):
            break

        # Open wunderground.com url
        timestamp = '2009' + str(m) + str(d)
        print "Getting data for " + timestamp
        url = "http://www.wunderground.com/history/airport/KBUF/2009/" + str(m) + "/" + str(d) + "/DailyHistory.html"
        page = urllib2.urlopen(url)

        # Get temperature from page
        soup = BeautifulSoup(page)
        # dayTemp = soup.body.nobr.b.string
        dayTemp = soup.findAll(attrs={"class":"nobr"})[5].span.string

        # Format month for timestamp
        if len(str(m)) < 2:
            mStamp = '0' + str(m)
        else:
            mStamp = str(m)

        # Format day for timestamp
        if len(str(d)) < 2:
            dStamp = '0' + str(d)
        else:
            dStamp = str(d)

        # Build timestamp
        timestamp = '2009' + mStamp + dStamp

        # Write timestamp and temperature to file
        f.write(timestamp + ',' + dayTemp + '\n')

# Done getting data! Close file.
f.close()

You should recognize the first two lines of code to
import the necessary libraries, urllib2 and
BeautifulSoup.
import urllib2
from BeautifulSoup import BeautifulSoup

Next, start a text file called wunder-data.txt with write
permissions, using the open() method. All the data that
you scrape will be stored in this text file, in the same
directory that you saved this script in.
# Create/open a file called wunder-data.txt (which will be
# a comma-delimited file)
f = open('wunder-data.txt', 'w')

With the next line of code, use a for loop, which tells
the computer to visit each month. The month number is
stored in the m variable. The loop that follows then tells
the computer to visit each day of each month. The day
number is stored in the d variable.
# Iterate through months and day
for m in range(1, 13):
    for d in range(1, 32):

See Python documentation for more on how loops and iteration work: http://docs.python.org/reference/compound_stmts.html

Notice that you used range(1, 32) to iterate through the
days. This means you can iterate through the numbers 1
to 31. However, not every month of the year has 31
days. February has 28 days; April, June, September,
and November have 30 days. There’s no temperature
value for April 31 because it doesn’t exist. So check
what month it is and act accordingly. If the current month
is February and the day is greater than 28, break and
move on to the next month. If you want to scrape multiple
years, you need to use an additional if statement to
handle leap years.
Similarly, if it’s not February, but instead April, June,
September, or November, move on to the next month if
the current day is greater than 30.
# Check if already gone through month
if (m == 2 and d > 28):
    break
elif (m in [4, 6, 9, 11] and d > 30):
    break
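As a rough sketch only (none of this is in the original script, which covers just 2009, and the y variable for the year is hypothetical), a multi-year version of the February check might look like this inside the loops:

# Hypothetical leap-year-aware check for a multi-year scrape
y = 2008  # example year; in practice this comes from an outer loop
is_leap = (y % 4 == 0 and (y % 100 != 0 or y % 400 == 0))
if m == 2 and d > (29 if is_leap else 28):
    break
elif m in [4, 6, 9, 11] and d > 30:
    break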

Again, the next few lines of code should look familiar.
You used them to scrape a single page from Weather
Underground. The difference is in the month and day
variable in the URL. Change that for each day instead of
leaving it static; the rest is the same. Load the page
with the urllib2 library, parse the contents with Beautiful
Soup, and then extract the maximum temperature, but
look for the sixth appearance of the nobr class.

# Open wunderground.com url
url = "http://www.wunderground.com/history/airport/KBUF/2009/" + str(m) + "/" + str(d) + "/DailyHistory.html"
page = urllib2.urlopen(url)
# Get temperature from page
soup = BeautifulSoup(page)
# dayTemp = soup.body.nobr.b.string
dayTemp = soup.findAll(attrs={"class":"nobr"})[5].span.string

The next to last chunk of code puts together a
timestamp based on the year, month, and day.
Timestamps are put into this format: yyyymmdd. You
can construct any format here, but keep it simple for
now.
# Format day for timestamp
if len(str(d)) < 2:
    dStamp = '0' + str(d)
else:
    dStamp = str(d)

# Build timestamp
timestamp = '2009' + mStamp + dStamp

Finally, the timestamp and temperature are written to wunder-data.txt using the write() method.

# Write timestamp and temperature to file
f.write(timestamp + ',' + dayTemp + '\n')

Then use close() when you finish with all the months and days.
# Done getting data! Close file.
f.close()

The only thing left to do is run the code, which you do
in your terminal with the following:
$ python get-weather-data.py

It takes a little while to run, so be patient. In the
process of running, your computer is essentially loading
365 pages, one for each day of 2009. You should have
a file named wunder-data.txt in your working directory
when the script is done running. Open it up, and there’s
your data, as a comma-separated file. The first column
is for the timestamps, and the second column is
temperatures. It should look similar to Figure 2-6.

Figure 2-6: One year’s worth of scraped temperature
data
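Each row is just a timestamp,temperature pair. The first row below matches the January 1 value scraped earlier; the later temperatures are placeholders for illustration, not real scraped values:

20090101,26
20090102,28
20090103,33
...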

Generalizing the Example
Although you just scraped weather data from Weather
Underground, you can generalize the process for use
with other data sources. Data scraping typically involves
three steps:
1. Identify the patterns.
2. Iterate.
3. Store the data.

In this example, you had to find two patterns. The first
was in the URL, and the second was in the loaded web
page to get the actual temperature value. To load the
page for a different day in 2009, you changed the month
and day portions of the URL. The temperature value
was enclosed in the sixth occurrence of the nobr class in
the HTML page. If there is no obvious pattern to the
URL, try to figure out how you can get the URLs of all the
pages you want to scrape. Maybe the site has a site
map, or maybe you can go through the index via a
search engine. In the end, you need to know all the
URLs of the pages of data.
After you find the patterns, you iterate. That is, you
visit all the pages programmatically, load them, and
parse them. Here you did it with Beautiful Soup, which
makes parsing XML and HTML easy in Python. There’s
probably a similar library if you choose a different
programming language.
Lastly, you need to store it somewhere. The easiest
solution is to store the data as a plain text file with
comma-delimited values, but if you have a database set
up, you can also store the values in there.
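To tie the three steps together, here's a bare-bones skeleton in the same Python 2 setup used earlier. The example.com URL pattern and the temp class name are made-up stand-ins for whatever patterns you actually find on your target site:

import urllib2
from BeautifulSoup import BeautifulSoup

f = open('scraped-data.txt', 'w')
for i in range(1, 11):
    # Step 1: the URLs follow a pattern, so build each one from the counter
    url = "http://example.com/reports/page-" + str(i) + ".html"
    # Step 2: iterate -- load and parse each page, then pull out the value
    soup = BeautifulSoup(urllib2.urlopen(url))
    value = soup.find(attrs={"class": "temp"}).string
    # Step 3: store each result as a comma-delimited row
    f.write(str(i) + ',' + value + '\n')
f.close()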
Things can get trickier as you run into web pages that
use JavaScript to load all their data into view, but the
process is still the same.

Formatting Data
Different visualization tools use different data formats,
and the structure you use varies by the story you want to
tell. So the more flexible you are with the structure of
your data, the more possibilities you can gain. Make
use of data formatting applications, and couple that with
a little bit of programming know-how, and you can get
your data in any format you want to fit your specific
needs.
The easy way of course is to find a programmer who
can format and parse all of your data, but you’ll always
be waiting on someone. This is especially evident
during the early stages of any project where iteration
and data exploration are key in designing a useful
visualization. Honestly, if I were in a hiring position, I’d
likely just get the person who knows how to work with
data, over the one who needs help at the beginning of
every project.

What I Learned about
Formatting
When I first learned statistics in high school, the data
was always provided in a nice, rectangular format. All I
had to do was plug some numbers into an Excel
spreadsheet or my awesome graphing calculator (which
was the best way to look like you were working in class,
but actually playing Tetris). That’s how it was all the way
through my undergraduate education. Because I was
learning about techniques and theorems for analyses,
my teachers didn’t spend any time on working with raw,
unprocessed data. The data always seemed to be in
just the right format.
This is perfectly understandable, given time constraints
and such, but in graduate school, I realized that data in
the real world never seems to be in the format that you
need. There are missing values, inconsistent labels,
typos, and values without any context. Often the data is
spread across several tables, but you need everything in
one, joined across a value, like a name or a unique id
number.
This was also true when I started to work with
visualization. It became increasingly important because I
wanted to do more with the data I had. Nowadays, it’s not
out of the ordinary that I spend just as much time getting
data in the format that I need as I do putting the visual
part of a data graphic together. Sometimes I spend more
time getting all my data in place. This might seem
strange at first, but you’ll find that the design of your data
graphics comes much easier when you have your data
neatly organized, just like it was back in that introductory
statistics course in high school.

Various data formats, the tools available to deal with
these formats, and finally, some programming, using the
same logic you used to scrape data in the previous
example are described next.

Data Formats
Most people are used to working with data in Excel.
This is fine if you’re going to do everything from
analyses to visualization in the program, but if you want
to step beyond that, you need to familiarize yourself with
other data formats. The point of these formats is to
make your data machine-readable, or in other words, to
structure your data in a way that a computer can
understand. Which data format you use can change by
visualization tool and purpose, but the three following
formats can cover most of your bases: delimited text,
JavaScript Object Notation, and Extensible Markup
Language.

Delimited Text

Most people are familiar with delimited text. You did
after all just make a comma-delimited text file in your
data scraping example. If you think of a dataset in the
context of rows and columns, a delimited text file splits
columns by a delimiter. The delimiter is a comma in a
comma-delimited file. The delimiter might also be a tab.
It can be spaces, semicolons, colons, slashes, or
whatever you want; although a comma and tab are the
most common.
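For example, a tiny made-up table of cities and temperatures looks like this as comma-delimited text, with the first row naming the columns:

city,date,temperature
Buffalo,20090101,26
Buffalo,20090102,27

Swap the commas for tabs, and you have tab-delimited text with the same structure.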
Delimited text is widely used and can be read into
most spreadsheet programs such as Excel or Google
Documents. You can also export spreadsheets as
delimited text. If multiple sheets are in your workbook,
you usually have multiple delimited files, unless you
specify otherwise.
This format is also good for sharing data with others
because it doesn’t depend on any particular program.

JavaScript Object Notation (JSON)
This is a common format offered by web APIs. It’s
designed to be both machine- and human-readable;
although, if you have a lot of it in front of you, it’ll
probably make you cross-eyed if you stare at it too long.
It’s based on JavaScript notation, but it’s not dependent
on the language. There are a lot of specifications for
JSON, but you can get by for the most part with just the
basics.
JSON works with keywords and values, and treats
items like objects. If you were to convert JSON data to
comma-separated values (CSV), each object might be
a row.
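For example, the same made-up rows from the delimited text example look like this in JSON. Each object stands in for a row, and each keyword names a column:

[
  { "city": "Buffalo", "date": "20090101", "temperature": 26 },
  { "city": "Buffalo", "date": "20090102", "temperature": 27 }
]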
As you can see later in this book, a number of
applications, languages, and libraries accept JSON as
input. If you plan to design data graphics for the web,
you’re likely to run into this format.
Visit http://json.org for the full specification of
JSON. You don’t need to know every detail of
the format, but it can be handy at times when
you don’t understand a JSON data source.

Extensible Markup Language (XML)
XML is another popular format on the web, often used
to transfer data via APIs. There are lots of different
types and specifications for XML, but at the most basic
level, it is a text document with values enclosed by tags.
For example, the Really Simple Syndication (RSS) feed

that people use to subscribe to blogs, such as
FlowingData, is actually an XML file, as shown in Figure
2-7.
The RSS lists recently published items, each enclosed in an <item> tag, and each item has a title, description, author, and publish date, along with some other attributes.
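Stripped down to the essentials, a single item in a feed might look something like this (a simplified sketch, not FlowingData’s actual markup):

<item>
    <title>Post title</title>
    <author>Author name</author>
    <pubDate>Mon, 03 Jan 2011 08:12:00 +0000</pubDate>
    <description>A short summary of the post.</description>
</item>

Each value is enclosed by an opening tag and a matching closing tag, and tags can nest inside other tags.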

Figure 2-7: Snippet of FlowingData’s RSS feed

XML is relatively easy to parse with libraries such as
Beautiful Soup in Python. You can get a better feel for
XML, along with CSV and JSON, in the sections that
follow.

Formatting Tools
Just a couple of years ago, you almost always had to write quick scripts to handle and format data. After you’ve written a few scripts, you start to notice patterns in the logic, so it’s not super hard to write new scripts for specific datasets, but it does take time. Luckily, with growing volumes of data, some tools have been developed to handle the boilerplate routines.

Google Refine
Google Refine is the evolution of Freebase Gridworks.
Gridworks was first developed as an in-house tool for
an open data platform, Freebase; however, Freebase was acquired by Google, hence the new name. Google Refine is essentially Gridworks 2.0 with an easier-to-use interface (Figure 2-8) and more features.

It runs on your desktop (but still through your browser),
which is great, because you don’t need to worry about
uploading private data to Google’s servers. All the
processing happens on your computer. Refine is also
open source, so if you feel ambitious, you can cater the
tool to your own needs with extensions.
When you open Refine, you see a familiar
spreadsheet interface with your rows and columns. You
can easily sort by field and search for values. You can
also find inconsistencies in your data and consolidate them in a relatively easy way.
For example, say for some reason you have an
inventory list for your kitchen. You can load the data in
Refine and quickly find inconsistencies such as typos or
differing classifications. Maybe a fork was misspelled
as “frk,” or you want to reclassify all the forks, spoons,
and knives as utensils. You can easily find these things
with Refine and make changes. If you don’t like the
changes you made or make a mistake, you can revert
to the old dataset with a simple undo.

Figure 2-8: Google Refine user interface

Getting into the more advanced stuff, you can also combine your own data with a dataset from Freebase to create a richer dataset.
If anything, Google Refine is a good tool to keep in
your back pocket. It’s powerful, and it’s a free download,
so I highly recommend you at least fiddle around with
the tool.

Download the open-source Google Refine and view tutorials on how to make the most out of the tool at http://code.google.com/p/googlerefine/.

Mr. Data Converter
Often, you might get all your data in Excel but then need
to convert it to another format to fit your needs. This is
almost always the case when you create graphics for
the web. You can already export Excel spreadsheets as
CSV, but what if you need something other than that?
Mr. Data Converter can help you.
Mr. Data Converter is a simple and free tool created
by Shan Carter, who is a graphics editor for The New
York Times. Carter spends most of his work time
creating interactive graphics for the online version of the
paper. He has to convert data often to fit the software
that he uses, so it’s not surprising he made a tool that
streamlines the process.
It’s easy to use, and as Figure 2-9 shows, the interface is equally simple. All you need to do is copy and paste data from Excel into the input section at the top and then select the output format you want in the bottom half of the screen. Choose from variants of XML, JSON, and a number of others.

Figure 2-9: Mr. Data Converter makes switching
between data formats easy.

The source code for Mr. Data Converter is also available if you want to make your own version or extend it.
Try out Mr. Data Converter at http://www.shancarter.com/data_converter/ or download the source on github at https://github.com/shancarter/MrData-Converter to convert your Excel spreadsheets to a web-friendly format.

Mr. People
Inspired by Carter’s Mr. Data Converter, The New York
Times graphics deputy director Matthew Ericson
created Mr. People. Like Mr. Data Converter, Mr.
People enables you to copy and paste data into a text
field, and the tool parses and extracts for you. Mr.
People, however, as you might guess, is specifically for
parsing names.
Maybe you have a long list of names without a
specific format, and you want to identify the first and last
names, along with middle initial, prefix, and suffix.
Maybe multiple people are listed on a single row. That’s
where Mr. People comes in. Copy and paste names, as
shown in Figure 2-10, and you get a nice clean table
that you can copy into your favorite spreadsheet
software, as shown in Figure 2-11.
Like Mr. Data Converter, Mr. People is also available
as open-source software on github.
Use Mr. People at http://people.ericson.net/ or download the Ruby source on github at http://github.com/mericson/people to use the name parser in your own scripts.

Spreadsheet Software
Of course, if all you need is simple sorting, or you just
need to make some small changes to individual data
points, your favorite spreadsheet software is always
available. Take this route if you’re okay with manually editing data. Otherwise, try the preceding tools first (especially if you have a giganto dataset), or go with a custom coding solution.

Figure 2-10: Input page for names on Mr. People

Figure 2-11: Parsed names in table format with Mr.
People

Formatting with Code
Although point-and-click software can be useful, if you work with data long enough, you’ll eventually run into something the applications can’t quite do. Some software doesn’t handle large data files well; it gets slow or it crashes.
What do you do at this point? You can throw your
hands in the air and give up; although, that wouldn’t be
productive. Instead, you can write some code to get the
job done. With code you become much more flexible,
and you can tailor your scripts specifically for your data.
Now jump right into an example on how to easily
switch between data formats with just a few lines of
code.

Example: Switch Between Data
Formats
This example uses Python, but you can of course use
any language you want. The logic is the same, but the
syntax will be different. (I like to develop applications in
Python, so managing raw data with Python fits into my
workflow.)
Going back to the previous example on scraping
data, use the resulting wunder-data.txt file, which has
dates and temperatures in Buffalo, New York, for 2009.
The first rows look like this:
20090101,26
20090102,34
20090103,27
20090104,34
20090105,34
20090106,31
20090107,35
20090108,30
20090109,25
...

This is a CSV file, but say you want the data as XML in the following format:

<weather_data>
    <observation>
        <date>20090101</date>
        <max_temperature>26</max_temperature>
    </observation>
    <observation>
        <date>20090102</date>
        <max_temperature>34</max_temperature>
    </observation>
    <observation>
        <date>20090103</date>
        <max_temperature>27</max_temperature>
    </observation>
    <observation>
        <date>20090104</date>
        <max_temperature>34</max_temperature>
    </observation>
    ...
</weather_data>
Each day’s temperature is enclosed in <observation> tags with a <date> and the <max_temperature>.
To convert the CSV into the preceding XML format,
you can use the following code snippet:
import csv

reader = csv.reader(open('wunder-data.txt', 'r'), delimiter=",")

print '<weather_data>'

for row in reader:
    print '<observation>'
    print '<date>' + row[0] + '</date>'
    print '<max_temperature>' + row[1] + '</max_temperature>'
    print '</observation>'

print '</weather_data>'

As before, you import the necessary modules. You need only the csv module in this case to read in wunder-data.txt.
import csv

The second line of code opens wunder-data.txt for reading using open() and then reads it with the csv.reader() method.
reader = csv.reader(open('wunder-data.txt', 'r'), delimiter=",")

Notice the delimiter is specified as a comma. If the file were a tab-delimited file, you could specify the delimiter as '\t'.
Then you can print the opening line of the XML file in line 3.
print '<weather_data>'

In the main chunk of the code, you can loop through each row of data and print in the format that you need the XML to be in. In this example, each row in the CSV file is equivalent to one observation in the XML.
for row in reader:
    print '<observation>'
    print '<date>' + row[0] + '</date>'
    print '<max_temperature>' + row[1] + '</max_temperature>'
    print '</observation>'

Each row has two values: the date and the maximum
temperature.
End the XML conversion with its closing tag.
print '</weather_data>'

Two main things are at play here. First, you read the
data in, and then you iterate over the data, changing
each row in some way. It’s the same logic if you were to
convert the resulting XML back to CSV. As shown in the
following snippet, the difference is that you use a
different module to parse the XML file.
from BeautifulSoup import BeautifulStoneSoup

f = open('wunder-data.xml', 'r')
xml = f.read()
soup = BeautifulStoneSoup(xml)
observations = soup.findAll('observation')
for o in observations:
    print o.date.string + "," + o.max_temperature.string

The code looks different, but you’re basically doing
the same thing. Instead of importing the csv module, you
import BeautifulStoneSoup from BeautifulSoup.

Remember you used BeautifulSoup to parse the HTML
from Weather Underground. BeautifulStoneSoup
parses the more general XML.
You can open the XML file for reading with open() and then load the contents into the xml variable. At this point, the contents are stored as a string. To parse, pass the xml string to BeautifulStoneSoup, which lets you iterate through each <observation> in the XML file. Use findAll() to fetch all the observations, and finally, like you did with the CSV to XML conversion, loop through each observation, printing the values in your desired format.
This takes you back to where you began:
20090101,26
20090102,34
20090103,27
20090104,34
...

To drive the point home, here’s the code to convert
your CSV to JSON format.
import csv

reader = csv.reader(open('wunder-data.txt', 'r'), delimiter=",")

print '{ "observations": ['
rows_so_far = 0
for row in reader:
    rows_so_far += 1
    print '{'
    print '"date": "' + row[0] + '", '
    print '"temperature": ' + row[1]
    if rows_so_far < 365:
        print " },"
    else:
        print " }"
print "] }"

Go through the lines to figure out what’s going on, but
again, it’s the same logic with different output. Here’s
what the JSON looks like if you run the preceding code.
{
    "observations": [
        {
            "date": "20090101",
            "temperature": 26
        },
        {
            "date": "20090102",
            "temperature": 34
        },
        ...
    ]
}

This is still the same data, with date and temperature
but in a different format. Computers just love variety.
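As a side note, if you would rather not manage the quotes and commas yourself, Python’s built-in json module can do the formatting for you. Here’s a minimal sketch of the same conversion (this assumes Python 2.6 or later, where the module is included):

import csv
import json

reader = csv.reader(open('wunder-data.txt', 'r'), delimiter=",")

# Build a list of dictionaries, one per row of the CSV file.
observations = []
for row in reader:
    observations.append({"date": row[0], "temperature": int(row[1])})

# dumps() takes care of quoting, commas, and indentation.
print json.dumps({"observations": observations}, indent=4)

The loop-and-print version shows the logic more explicitly, which is the point of this example, but the module saves you from comma bookkeeping in day-to-day work.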

Put Logic in the Loop
If you look at the code to convert your CSV file to JSON, you should notice the if-else statement in the for loop, after the three print lines. This checks whether the current iteration is the last row of data. If it isn’t, a comma goes at the end of the observation; if it is, the comma is left off. This is part of the JSON specification. You can do more here.
You can check if the max temperature is more than a
certain amount and create a new field that is 1 if a day
is more than the threshold, or 0 if it is not. You can
create categories or flag days with missing values.
Actually, it doesn’t have to be just a check for a
threshold. You can calculate a moving average or the
difference between the current day and the previous.
There are lots of things you can do within the loop to
augment the raw data. Everything isn’t covered here
because you can do anything from trivial changes to
advanced analyses, but now look at a simple example.
Going back to your original CSV file, wunder-data.txt,
create a third column that indicates whether a day’s
maximum temperature was at or below freezing. A 0
indicates above freezing, and 1 indicates at or below
freezing.
import csv

reader = csv.reader(open('wunder-data.txt', 'r'), delimiter=",")

for row in reader:
    if int(row[1]) <= 32:
        is_freezing = '1'
    else:
        is_freezing = '0'
    print row[0] + "," + row[1] + "," + is_freezing

Like before, read the data from the CSV file into
Python, and then iterate over each row. Check each day
and flag accordingly.
This is of course a simple example, but it should be
easy to see how you can expand on this logic to format
or augment your data to your liking. Remember the
three steps of load, loop, and process, and expand from
there.
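As a rough sketch of that kind of expansion, here’s how you might add the difference between the current day and the previous one, mentioned earlier, by remembering the last temperature as you loop. (The variable names are my own, not from the scraping example.)

import csv

reader = csv.reader(open('wunder-data.txt', 'r'), delimiter=",")

previous_temp = None
for row in reader:
    current_temp = int(row[1])
    if previous_temp is None:
        change = ''  # the first day has no previous day to compare to
    else:
        change = str(current_temp - previous_temp)
    print row[0] + "," + row[1] + "," + change
    previous_temp = current_temp

It’s still load, loop, and process; only the processing step changed.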

Wrapping Up
This chapter covered where you can find the data you
need and how to manage it after you have it. This is an
important step, if not the most important, in the
visualization process. A data graphic is only as interesting as its underlying data. You can dress up a
graphic all you want, but the data (or the results from
your analysis of the data) is still the substance; and now
that you know where and how to get your data, you’re
already a step ahead of the pack.
You also got your first taste of programming. You
scraped data from a website and then formatted and
rearranged that data, which will be a useful trick in later
chapters. The main takeaway, however, is the logic in
the code. You used Python, but you easily could have
used Ruby, Perl, or PHP. The logic is the same across
languages. When you learn one programming language
(and if you’re a programmer already, you can attest to
this), it’s much easier to learn other languages later.
You don’t always have to turn to code. Sometimes
there are click-and-drag applications that make your job
a lot easier, and you should take advantage of that
when you can. In the end, the more tools you have in
your toolbox, the less likely you’re going to get stuck
somewhere in the process.
Okay, you have your data. Now it’s time to get visual.

Chapter 3
Choosing Tools to Visualize
Data
In the last chapter, you learned where to find your data
and how to get it in the format you need, so you’re ready
to start visualizing. One of the most common questions
people ask me at this point is “What software should I
use to visualize my data?”
Luckily, you have a lot of options. Some are out-of-the-box and click-and-drag. Others require a little bit of
programming, whereas some tools weren’t designed
specifically for data graphics but are useful
nevertheless. This chapter covers these options.
The more visualization tools you know how to use and
take advantage of, the less likely you’ll get stuck not
knowing what to do with a dataset and the more likely
you can make a graphic that matches your vision.

Out-of-the-Box Visualization
The out-of-the-box solutions are by far the easiest for
beginners to pick up. Copy and paste some data or
load a CSV file and you’re set. Just click the graph type
you want—maybe change some options here and there.

Options
The out-of-the-box tools available vary quite a bit,
depending on the application they’ve been designed
for. Some, such as Microsoft Excel or Google
Documents, are meant for basic data management and
graphs, whereas others were built for more thorough
analyses and visual exploration.

Microsoft Excel
You know this one. You have the all-familiar
spreadsheet where you put your data, such as in Figure
3-1.

Figure 3-1: Microsoft Excel spreadsheet

Then you can click the button with the little bar graph on it to make the chart you want. You get all your
standard chart types (Figure 3-2) such as the bar chart,
line, pie, and scatterplot.
Some people scoff at Excel, but it’s not all that bad
for the right tasks. For example, I don’t use Excel for any
sort of deep analyses or graphics for a publication, but
if I get a small dataset in an Excel file, as is often the
case, and I want a quick feel for what is in front of me,
then sure, I’ll whip up a graph with a few clicks in
everyone’s favorite spreadsheet program.

Graphs Really Can Be Fun
The first graph I made on a computer was in Microsoft
Excel for my fifth grade science fair project. My project
partner and I tried to find out which surface snails moved
on the fastest. It was ground-breaking research, I assure
you.
Even back then I remember enjoying the graph-making. It
took me forever to learn (the computer was still new to
me), but when I finally did, it was a nice treat. I entered
numbers in a spreadsheet and then got a graph instantly
that I could change to any color I wanted—blinding, bright
yellow it is.

Figure 3-2: Microsoft Excel chart options

This ease of use is what makes Excel so appealing
to the masses, and that’s fine. If you want higher quality
data graphics, don’t stop here. Other tools are a better
fit for that.

Google Spreadsheets
Google Spreadsheets is essentially the cloud version of
Microsoft Excel with the familiar spreadsheet interface,
obviously (Figure 3-3).

Figure 3-3: Google Spreadsheets

It also offers your standard chart types, as shown in
Figure 3-4.

Figure 3-4: Google Spreadsheets charting options

Google Spreadsheets offers some advantages over
Excel, however. First, because your data is stored on
the Google servers, you can see your data on any
computer as long as it has a web browser installed. Log
in to your Google account and go. You can also easily
share your spreadsheet with others and collaborate in
real-time. Google Spreadsheets also offers some
additional charting options via the Gadget option, as
shown in Figure 3-5.
A lot of the gadgets are useless, but a few good ones
are available. You can, for example, easily make a
motion chart with your time series data (just like Hans
Rosling). There’s also an interactive time series chart
that you might be familiar with if you’ve visited Google
Finance, as shown in Figure 3-6.
Visit Google Docs at http://docs.google.com to try spreadsheets.

Figure 3-5: Google gadgets

Figure 3-6: Google Finance

Many Eyes
Many Eyes is an ongoing research project by the IBM
Visual Communication Lab. It’s an online application that enables you to upload your data as a delimited text file and explore it through a set of interactive visualization
tools. The original premise of Many Eyes was to see if
people could explore large datasets as groups—
therefore the name. If you have a lot of eyes on a large
dataset, can a group find interesting points in the data
quicker or more efficiently or find things in the data that
you would not have found on your own?
Although social data analyses never caught on with
Many Eyes, the tools can still be useful to the individual.
Most traditional visualization types are available, such
as the line graph (Figure 3-7) and the scatterplot
(Figure 3-8).
One of the great things about all the visualizations on
Many Eyes is that they are interactive and provide a
number of customization options. The scatterplot, for
example, enables you to scale dots by a third metric, and you can view individual values by rolling over a
point of interest.

Figure 3-7: Line graph on Many Eyes

Figure 3-8: Scatterplot on Many Eyes

Many Eyes also provides a variety of more advanced
and experimental visualizations, along with some basic
mapping tools. A word tree helps you explore a full body
of text, such as in a book or news article. You choose a word or a phrase, and you can see how your selection
is used throughout the text by looking at what follows.
Figure 3-9, for example, shows the results of a search for “right” in the United States Constitution.

Figure 3-9: Word tree on Many Eyes showing parts of
the United States Constitution

Alternatively, you can easily switch between tools,
using the same data. Figure 3-10 shows the
Constitution visualized with a stylized word cloud, known
as a Wordle. Words used more often are sized larger.

Figure 3-10: Wordle of the United States Constitution

As you can see, Many Eyes has a lot of options to help you play with your data and is by far the most extensive (and in my eyes, the best) free tool for data
exploration; however, a couple of caveats exist. The first
is that most of the tools are Java applets, so you can’t
do much if you don’t have Java installed. (This isn’t a
big deal for most, but I know some people, for whatever
reason, who are particular about what they put on their
computer.)
The other caveat, which can be a deal breaker, is that
all the data you upload to the site is in the public
domain. So you can’t use Many Eyes, for example, to
dig into customer information or sales made by your
business.
Try uploading and visualizing your own data at http://many-eyes.com.

Tableau Software
Tableau Software, which is Windows-only, is
relatively new but has been growing in popularity for the
past couple of years. It’s designed mainly to explore
and analyze data visually. It’s clear that careful thought
has been given to aesthetics and design, which is why
so many people like it.
Tableau Software offers lots of interactive
visualization tools and does a good job with data
management, too. You can import data from Excel, text
files, and database servers. Standard time series
charts, bar graphs, pie charts, basic mapping, and so
on are available. You can mix and match these displays and hook in a dynamic data source for a custom view, or a dashboard, for a snapshot of what’s going on in your
data.
Most recently, Tableau released Tableau Public,
which is free and offers a subset of the functionality in
the desktop editions. You can upload your data to
Tableau’s servers, build an interactive display, and
easily publish it to your website or blog. Any data you
upload to the servers though, like with Many Eyes, does
become publicly available, so keep that in mind.
If you want to use Tableau and keep your data
private, you need to go with the desktop editions. At the
time of this writing, the desktop software is on the
pricier side at $999 and $1,999 for the Personal and
Professional editions, respectively.
Visit Tableau Software at http://tableausoftware.com. It has a fully functioning free trial.

your.flowingdata
My interest in personal data collection inspired my own
application, your.flowingdata (YFD). It’s an online
application that enables you to collect data via Twitter
and then explore patterns and relationships with a set of
interactive visualization tools. Some people track their
eating habits or when they go to sleep and wake up.
Others have logged the habits of their newborn as sort
of a baby scrapbook, with a data twist.
YFD was originally designed with personal data in
mind, but many have found the application useful for
more general types of data collection, such as web activity or train arrivals and departures.
Try personal data collection via Twitter at
http://your.flowingdata.com.

Trade-Offs
Although these tools are easy to use, there are some
drawbacks. In exchange for click-and-drag, you give up
some flexibility in what you can do. You can usually
change colors, fonts, and titles, but you’re restricted to
what the software offers. If there is no button for the
chart you want, you’re out of luck.
On the flip side, some software might have a lot of
functions, but in turn have a ton of buttons that you need
to learn. For example, there was one program (not
listed here) that I took a weekend crash course for, and
it was obvious that it could do a lot if I put in the time.
The processes to get things done though were so
counterintuitive that it made me not want to learn
anymore. It was also hard to repeat my work for different
datasets, because I had to remember everything I
clicked. In contrast, when you write code to handle your
data, it’s often easy to reuse code and plug in a
different dataset.
Don’t get me wrong. I’m not saying to avoid out-of-the-box software completely. These tools can help you explore your data quickly and easily. But as you work with more
datasets, there will be times when the software doesn’t
fit, and when that time comes you can turn to
programming.

Programming
This can’t be stressed enough: Gain just a little bit of programming skill, and you can do so much more with data than if you were to stick only with out-of-the-box software. Programming skills make you more flexible and better able to adapt to different types of data.
If you’ve ever been impressed by a data graphic that looked custom-made, most likely it was coded or designed in illustration software. A lot of the time it’s both. The latter is covered a little later.
Code can look cryptic to beginners—I’ve been there.
But think of it as a new language because that’s what it
is. Each line of code tells the computer to do something.
Your computer doesn’t understand the way you talk to
your friends, so you have to talk to the computer in its
own language or syntax.
Like any language, you can’t immediately start a
conversation. Start with the basics first and then work
your way up. Before you know it, you’ll be coding. The
cool thing about programming is that after you learn one
language, it’s much easier to learn others because the
logic is similar.

Options
So you decide to get your hands dirty with code—good
for you. A lot of options are freely available. Some
languages are better at certain tasks than others. Some solutions can handle large amounts of data, whereas others are not as robust in that
department but can produce much better visuals or
provide interaction. Which language you use largely
depends on what your goals are for a specific data
graphic and what you’re most comfortable with.
Some people stick with one language and get to
know it well. This is fine, and if you’re new to
programming, I highly recommend this strategy.
Familiarize yourself with the basics and important
concepts of code.
Use the language that best suits your needs.
However, it’s fun to learn new languages and new ways
to play with data; so you should develop a good bit of
programming experience before you decide on your
favorite solution.

Python
The previous chapter discussed how Python can handle
data. Python is good at that and can handle large
amounts of data without crashing. This makes the
language especially useful for analyses and heavy
computation.
Python also has a clean and easy-to-read syntax that
programmers like, and you can work off of a lot of
modules to create data graphics, such as the graph in
Figure 3-11.
From an aesthetic point of view, it’s not great. You probably don’t want to take a graphic from Python directly to publication. The output usually looks kind of rough around the edges. Nevertheless, it can be a good starting point in the data exploration stages. You might also export images and then touch them up or add information using graphic editing software.
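To give you a sense of how little code a basic graph takes, here’s a minimal sketch using the matplotlib module, one popular option for graphing in Python (this assumes you have matplotlib installed; the numbers are sample values, not the data behind Figure 3-11):

import matplotlib.pyplot as plt

# Sample maximum temperatures for the first days of the year
temperatures = [26, 34, 27, 34, 34, 31, 35, 30, 25]

plt.plot(temperatures)
plt.title("Daily maximum temperature")
plt.savefig("temperatures.png")

A few lines load the values, draw a line graph, and save it as an image you can refine elsewhere.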

Figure 3-11: Graph produced in Python

Useful Python Resources
Official Python website (http://python.org)
NumPy and SciPy (http://numpy.scipy.org/)—Scientific computing

PHP
PHP was the first language I learned when I started
programming for the web. Some people say it’s messy, which it can be, but you can just as easily keep it
organized. It’s usually an easy setup because most web
servers already have it installed, so it’s easy to jump
right in.

Figure 3-12: Sparklines using a PHP graphing library

There’s a flexible PHP graphics library called GD
that’s also usually included in standard installs. The
library enables you to create images from scratch or
manipulate existing ones. Also a number of PHP
graphing libraries exist that enable you to create basic
charts and graphs. The most popular is the Sparklines
Graphing Library, which enables you to embed small
word-size graphs in text or add a visual component to a
numeric table, as shown in Figure 3-12.
Most of the time PHP is coupled with a database
such as MySQL, instead of working with a lot of CSV
files, to maximize usage and to work with hefty
datasets.

Useful PHP Resources
Official PHP website (http://php.net)
Sparkline PHP Graphing Library (http://sparkline.org)

Processing
Processing is an open-source programming language
geared toward designers and data artists. It started as
a coding sketchbook in which you could produce
graphics quickly; however, it has developed a lot since its early days, and many high-quality projects have been
created in Processing. For example, We Feel Fine,
mentioned in Chapter 1, “Telling Stories with Data,” was
created in Processing.
The great thing about Processing is that you can
quickly get up and running. The programming
environment is lightweight, and with just a few lines of
code, you can create an animated and interactive
graphic. It would of course be basic, but because it was
designed with the creation of visuals in mind, you can
easily learn how to create more advanced pieces.
Although the original audience was designers and artists, the community around Processing has grown to be a diverse group. Many libraries can help
you do more with the language.
One of the drawbacks is that you do end up with a
Java applet, which can be slow to load on some
people’s computers, and not everyone has Java
installed. (Although most people do.) There’s a solution for that, though: a JavaScript version of Processing is recently out of development and ready to use.
Nevertheless, this is a great place to start for beginners. Even those who don’t have any
programming experience can make something useful.

Useful Processing Resource
Processing (http://processing.org)—Official site for Processing

Flash and ActionScript
Most interactive and animated data graphics on the
web, especially on major news sites such as The New
York Times, are built in Flash and ActionScript. You
can design graphics in just Flash, which is a click-and-drag interface, but with ActionScript you have more
control over interactions. Many applications are written
completely in ActionScript, without the use of the Flash
environment. However, the code compiles as a Flash
application.

Note
Although there are many free and open-source
ActionScript libraries, Flash and Flash builders
can be pricey, which you should consider in
your choice of software.

For example, an interactive map that animates the
growth of Walmart, as shown in Figure 3-13, was written
in ActionScript. The Modest Maps library was used,
which is a display and interaction library for tile-based
maps. It’s BSD-licensed, meaning it’s free, and you can
use it for whatever you want.

Figure 3-13: Map animating the growth of Walmart,
written in ActionScript

The interactive stacked area chart in Figure 3-14 was
also written in ActionScript. It enables you to search for
spending categories over the years. The Flare
ActionScript library by the UC Berkeley Visualization
Lab was used to do most of the heavy lifting.

Figure 3-14: Interactive stacked area chart showing
consumer spending breakdowns, written in ActionScript

If you want to get into interactive graphics for the web,
Flash and ActionScript is an excellent option. Flash
applications are relatively quick to load, and most
people already have Flash installed on their computers.
It’s not the easiest language to pick up; the syntax
isn’t that complicated, but the setup and code
organization can overwhelm beginners. You’re not
going to have an application running with just a few lines of code like you would with Processing. Later chapters
take you through the basic steps, and you can find a
number of useful tutorials online because Flash is so
widely used.
Also, as web browsers improve in speed and
efficiency, you have a growing number of alternatives.

Useful Flash and ActionScript
Resources
Adobe Support (www.adobe.com/products/flash/whatisflash/)—Official documentation for Flash and ActionScript (and other Adobe products)
Flare Visualization Toolkit (http://flare.prefuse.org)
Modest Maps (http://modestmaps.com)

HTML, JavaScript, and CSS
Web browsers continue to get faster and improve in
functionality. A lot of people spend more time using their
browsers than any other application on their computers.
More recently, there has been a shift toward visualization that runs native in your browser via HTML, JavaScript, and CSS. Data graphics used to be built primarily in Flash and ActionScript if there was an interactive component, or saved as a static image otherwise. This is still often the case, but these are no longer the only options.
Now there are several robust packages and libraries that can help you quickly build interactive and static
visualizations. They also provide a lot of options so that
you can customize the tools for your data needs.
For example, Protovis, maintained by the Stanford Visualization Group, is a free and open-source visualization library that enables you to create web-native visualizations. Protovis provides a number of out-of-the-box visualizations, but you’re not at all limited in what you can make, geometrically speaking. Figure 3-15 shows a stacked area chart, which can be interactive.
This chart type is built into Protovis, but you can
also go with a less traditional streamgraph, as shown in
Figure 3-16.

Figure 3-15: Stacked area chart with Protovis

Figure 3-16: Custom-made streamgraph with Protovis

You can also easily use multiple libraries for
increased functionality. This is possible in Flash, but
JavaScript can be a lot less heavy code-wise.
JavaScript is also a lot easier to read and use with
libraries such as jQuery and MooTools. These are not
visualization-specific but are useful. They provide a lot
of basic functionality with only a few lines of code.
Without the libraries, you’d have to write a lot more, and
your code can get messy in a hurry.
Plugins for the libraries can also help you with some
of your basic graphics. For example, you can use a
Sparkline plugin for jQuery to make small charts (see
Figure 3-17).

Figure 3-17: Sparklines with jQuery Sparklines plugin

You can also do this with PHP, but the JavaScript method has a couple of advantages. First, the graphic is generated in a user’s browser instead of on the server. This relieves stress on your own machines, which can be an issue if you have a website with a lot of traffic.

The other advantage is that you don’t need to set up
your server with the PHP graphics library. A lot of
servers are set up with the graphics library installed, but
sometimes they are not. Installation can be tedious if
you’re unfamiliar with the system.
You might not want to use a plugin at all. You can also
design a custom visualization with standard web
programming. Figure 3-18, for example, is an
interactive calendar that doubles as a heatmap in
your.flowingdata.
There are, however, a couple of caveats. Because
the software and technology are relatively new, your
designs might look different in different browsers. Some
of the previously mentioned tools won’t work correctly in
an old browser such as Internet Explorer 6. This is
becoming less of a problem though, because most
people use modern browsers such as Firefox or Google
Chrome. In the end it depends on your audience. Less
than 5 percent of visitors to FlowingData use old
versions of Internet Explorer, so compatibility isn’t much
of an issue.

Figure 3-18: Interactive calendar that also serves as a
heatmap in your.flowingdata

Also related to the age of the technology, there aren’t
as many libraries available for visualization in
JavaScript as there are in Flash and ActionScript. This is why many major news organizations still use a lot of
Flash, but this will change as development continues.

Useful HTML, JavaScript, and
CSS Resources
jQuery (http://jquery.com/)—A JavaScript library that makes coding in the language much more efficient and makes your finished product easier to read.
jQuery Sparklines (http://omnipotent.net/jquery.sparkline/)—Make static and animated sparklines in JavaScript.
Protovis (http://vis.stanford.edu/protovis/)—A visualization-specific JavaScript library designed to learn by example.
JavaScript InfoVis Toolkit (http://datafl.ws/15f)—Another visualization library, although not quite as developed as Protovis.
Google Charts API (http://code.google.com/apis/chart/)—Build traditional charts on-the-fly, simply by modifying a URL.

R
If you read FlowingData, you probably know that my
favorite software for data graphics is R. It’s free and
open-source statistical computing software, which also
has good statistical graphics functionality. It is also most
statisticians’ analysis software of choice. There are paid alternatives such as S-plus and SAS, but it’s hard
to beat the price of free and an active development
community.
One of the advantages that R has over the previously
mentioned software is that it was specifically designed
to analyze data. HTML was designed to make web
pages, and Flash is used for tons of other things, such
as video and animated advertisements. R, on the other
hand, was built and is maintained by statisticians for
statisticians, which can be good and bad depending on
what angle you’re looking from.
There are lots of R packages that enable you to make
data graphics with just a few lines of code. Load your
data into R, and you can have a graphic with even just
one line of code. For example, you can quickly make a
treemap using the Portfolio package, as shown in
Figure 3-19.
Just as easily, you can build a heatmap, as shown in
Figure 3-20.
And of course, you can also make more traditional
statistical graphics, such as scatterplots and time
series charts, which are discussed in Chapter 4,
“Visualizing Patterns over Time.”

Figure 3-19: Treemap generated in R with the Portfolio
package

Figure 3-20: Heatmap generated in R

To be completely honest though, the R site looks horribly outdated (Figure 3-21), and the software itself
isn’t very helpful in guiding new users. You need to
remember though that R is a programming language,
and you’re going to get that with any language you use.
The few bad things that I’ve read about R are usually written by people who are used to buttons, clicking, and dragging. So when you come to R, don’t expect a point-and-click interface, or you will of course find it unfriendly.

Figure 3-21: R homepage, www.r-project.org

But get past that, and there’s a lot you can do. You
can make publication-quality graphics (or at least the
beginnings of them), and you can learn to embrace R’s
flexibility. If you like, you can write your own functions
and packages to make graphics the way you want, or
you can use the ones that others have made available in
the R library.
R provides base drawing functions that basically
enable you to draw what you want. You can draw lines,
shapes, and axes within a plotting framework, so again,
like the other programming solutions, you’re limited only
by your imagination. Then again, practically every chart
type is available via some R package.

Tip
When you search for something about R on the
web via search engines, the basic name can
sometimes throw off your results. Instead, try
searching for r-project instead of just R, along
with what you’re looking for. You’ll usually find
more relevant search results.

Why would you use anything besides R? Why not just
do everything in R? Following are a few reasons. R
works on your desktop, so it’s not a good fit for the
dynamic web. Saving graphics and images and putting
them on a web page isn’t a problem, but it’s not going
to happen automatically. You can generate graphics on-the-fly via the web, but so far, the solutions aren’t particularly robust when you compare them to the web-native stuff such as JavaScript.
R is also not good with interactive graphics and
animation. Again, you can do this in R, but there are
better, more elegant ways to accomplish this using, for
example, Flash or Processing.
Finally, you might have noticed that the graphics in
Figures 3-19 and 3-20 lack a certain amount of polish.
You probably won’t see graphics like that in a
newspaper any time soon. You can tighten up the
design in R by messing with different options or writing
additional code, but my strategy is usually to make the
base graphic in R and then edit and refine in design
software such as Adobe Illustrator, which is discussed
soon. For analyses, the raw output from R does just fine,
but for presentation and storytelling, it’s best to adjust
aesthetics.

Useful R Resource
R Project for Statistical Computing
(www.r-project.org)

Trade-Offs
Learning programming is learning a new language. It’s
your computer’s language of bits and logic. When you
work with Excel or Tableau for example, you essentially
work with a translator. The buttons and menus are in
your language, and when you click items, the software
translates your interaction and then sends the
translation to your computer. The computer then does
something for you, such as makes a graph or
processes some data.
So time is definitely a major hurdle. It takes time for
you to learn a new language. For a lot of people, this
hurdle is too high, which I can relate to. You need to get
work done now because you have a load of data sitting
in front of you, and people waiting on results. If that’s the case, and you have only this single data-related task with nothing else on the horizon, it might be better to go with the out-of-the-box visualization tools.
However, if you want to tackle your data and will most
likely have (or want) lots of data-related projects in the
future, the time spent learning how to program now
could end up as saved time on other projects, with more
impressive results. You’ll get better at programming on
each project you go through, and it’ll start to come much
easier. Just like any foreign language, you don’t start writing books in that language; you start with the
essentials and then branch out.
Here’s another way to look at it. Hypothetically
speaking, say you’re tossed into a foreign country, and
you don’t speak the language. Instead, you have a
translator. (Stay with me on this one. I have a point.) To
talk to a local, you speak, and then your translator
forwards the message. What if the translator doesn’t
know the meaning or the right word for something you
just said? He could leave the word out, or if he’s
resourceful, he can look it up in a translation dictionary.
For out-of-the-box visualization tools, the software is
the translator. If it doesn’t know how to do something,
you’re stuck or have to try an alternative method. Unlike
the speaking translator, software usually doesn’t
instantly learn new words, or in this case, graph types or
data handling features. New functions come in the form
of software updates, which you have to wait for. So what
if you learn the language yourself?
Again, I’m not saying to avoid out-of-the-box tools. I
use them all the time. They make a lot of tedious tasks
quick and easy, which is great. Just don’t let the
software restrict you.
As you see in later chapters, programming can help
you get a lot done with much less effort than if you were
to do it all by hand. That said, there are also things
better done by hand, especially when you’re telling
stories with data. That brings you to the next section on
illustration: the opposite end of the visualization
spectrum.

Illustration
Now you’re in graphic designers’ comfort zone. If you’re
an analyst or in a more technical field, this is probably
unfamiliar territory. You can do a lot with a combination
of code and out-of-the-box visualization tools, but the
resulting data graphics almost always have that look of
something that was automatically generated. Maybe
labels are out of place or a legend feels cluttered. For
analyses, this is usually fine—you know what you’re
looking at.
However, when you make graphics for a
presentation, a report, or a publication, more polished
data graphics are usually appropriate so that people
can clearly see the story you’re telling.
For example, Figure 3-19 is the raw output from R. It
shows views and comments on FlowingData for 100
popular posts. Posts are separated by category such
as Mapping. The brighter the green, the more
comments on that post, and the larger the rectangle, the
more views. You wouldn’t know that from the original,
but when I was looking at the numbers, I knew what I
was looking at, because I’m the one who wrote the code
in R.
Figure 3-22 is a revised version. The labels have
been adjusted so that they’re all readable; lead-in copy
has been added on the top so that readers know what
they’re looking at; and the red portion of the color
legend was removed because there is no such thing as
a post having a negative number of comments. I also
changed the background to white from gray just
because I think it looks better.

I could have edited the code to fit my specific needs,
but it was a lot easier to click-and-drag in Adobe
Illustrator. You can either make graphics completely with illustration software, or you can import graphics that you’ve made in, for example, R, and edit them to your liking.
For the former, your visualization choices are limited
because visualization is not the primary purpose of the
software. For anything more complex than a bar chart,
your best bet is to go with the latter. Otherwise, you will
have to do a lot of things by hand, which is prone to
mistakes.
The great thing about using illustration software is that
you have more control over individual elements, and you
can do everything by clicking and dragging. Change the
color of bars or a single bar, modify axes width, or
annotate important features with a few mouse clicks.

Figure 3-22: Treemap created in R, and edited in
Adobe Illustrator

Options
A lot of illustration programs are available but only a few
that most people use—and one that almost everyone
uses. Cost will most likely be your deciding factor.
Prices range from free (and open-source) to several
hundred dollars.

Adobe Illustrator
Any static data graphic that looks custom-made or is in
a major news publication most likely passed through
Adobe Illustrator at some point. Adobe Illustrator is the
industry standard. Every graphic that goes to print at
The New York Times either was created or edited in Illustrator.
Illustrator is so popular for print because you work
with vectors instead of pixels. This means you can
make the graphics big without decreasing the quality of
your image. In contrast, if you were to blow up a low-resolution photograph, which is a set number of pixels,
you would end up with a pixelated image.
The software was originally designed for font
development and later became popular among
designers for illustrations such as logos and more art-focused graphics. And that’s still what Illustrator is
primarily used for.
However, Illustrator does offer some basic graphing
functionality via its Graph tool. You can make the more
basic graph types such as bar graphs, pie charts, and
time series plots. You can paste your data into a small
spreadsheet, but that’s about the extent of the data
management capabilities.
The best part about using Illustrator, in terms of data
graphics, is the flexibility that it provides and its ease of
use, with a lot of buttons and functions. It can be kind of
confusing at first because there are so many, but it’s
easy to pick up, as you’ll see in Chapter 4, “Visualizing
Patterns over Time.” It’s this flexibility that enables the
best data designers to create the most clear and
concise graphics.
Illustrator is available for Windows and Mac. The
downside though is that it’s expensive when you
compare it to doing everything with code, which is free,
assuming you already have the machine to install things
on. However, compared to some of the out-of-the-box
solutions, Illustrator might not seem so pricey.

As of this writing, the most recent version of Illustrator
is priced at $599 on the Adobe site, but you should find
substantial discounts elsewhere (or go for an older
version). Adobe also provides large discounts to
students and those in academia, so be sure to check
those out. (It’s the most expensive software I’ve ever
purchased, but I use it almost every day.)

Useful Adobe Illustrator
Resources
Adobe Illustrator Product Page (www.adobe.com/products/illustrator/)
VectorTuts (http://vectortuts.com)—Thorough and straightforward tutorials on how to use Illustrator

Inkscape
Inkscape is the free and open-source alternative to
Adobe Illustrator. So if you want to avoid the hefty price
tag, Inkscape is your best bet. I always use Illustrator
because when I started to learn the finer points of data
graphics on the job, Illustrator was what everyone used,
so it just made sense. I have heard good things about
Inkscape though, and because it’s free, there’s no harm
in trying it. Just don’t expect as many resources on how
to use the software.

Useful Inkscape Resources
Inkscape (http://inkscape.org)
Inkscape Tutorials (http://inkscapetutorials.wordpress.com/)

Tip
Parts of this book use Adobe Illustrator to refine
your data graphics; however, it shouldn’t be too
hard to figure out how to do the same thing in
Inkscape. Many of the tools and functions are
similarly named.

Others
Illustrator and Inkscape are certainly not your only
options to create and polish your data graphics. They
just happen to be the programs that most people use.
You might be comfortable with something else. Some
people are fond of Corel Draw, which is Windows-only
software and approximately the same price as
Illustrator. It might be slightly cheaper, depending on
where you look.
There are also programs such as Raven by Aviary and Lineform, which offer a smaller toolset. Remember
that Illustrator and Inkscape are general tools for graphic
designers, so they provide a lot of functionality. But if
you just want to make a few edits to existing graphics,
you might opt for the simpler (lower-priced) software.

Trade-Offs
Illustration software is for just that—illustration. It’s not
made specifically for data graphics. It’s meant for
graphic design, so many people do not use a lot of the functions offered by Illustrator or Inkscape. The software
is also not good for handling a lot of data, compared to
when you program or use visualization-specific tools.
Because of that, you can’t explore your data in these
programs.
That said, these programs are a must if you want to
make publication-level data graphics. They don’t just
help with aesthetics, but also readability and clarity
that’s often hard to achieve with automatically
generated output.

Mapping
Some overlap exists between the covered visualization
tools and the ones that you use to map geographic
data. However, the amount of geographic data has increased significantly in the past few years, as has the number of ways you can map it. With mobile location services on the rise, there will be more data with latitude and longitude coordinates attached to it. Maps are also an incredibly intuitive way to visualize data, so this area deserves a closer look.
Mapping in the early days of the web wasn’t easy; it
wasn’t elegant either. Remember the days you would go
to MapQuest, look up directions, and get this small
static map? Yahoo had the same thing for a while.
It wasn’t until a couple of years later that Google provided a slippy map implementation (Figure 3-23). The technology had been around for a while, but it wasn’t useful until most people’s Internet speed was fast enough to handle the continuous updating. Slippy maps are what we’re used to nowadays. We can pan and
zoom maps with ease, and in some cases, maps aren’t
just for directions; they’re the main interface to browse a
dataset.

Note
Slippy maps are the map implementation that
is now practically universal. Large maps that would normally not fit on your screen are split into smaller images, or tiles. Only the tiles that
fit in your window display, and the rest are
hidden from view. As you drag the map, other
tiles display, making it seem as if you’re moving
around a single large map. You might have
also seen this done with high-resolution
photographs.

Figure 3-23: Google Maps to look up directions

Options
Along with all the geographic data making its way into
the public domain, a variety of tools to map that data
have also sprung up. Some require only a tiny bit of
programming to get something up and running whereas
others need a little more work. There are also a few
other solutions that don’t require programming.

Google, Yahoo, and Microsoft Maps
This is your easiest online solution, although it does require a little bit of programming. The better you can code, the more you can do with the mapping APIs offered by Google, Yahoo, and Microsoft.
The base functionality of the three is fairly similar, but
if you’re just starting out, I recommend you go with
Google. It seems to be the most reliable. Google has a Maps API in both JavaScript and Flash, along with other geo-related services such as geocoding and directions. Go through the Getting Started tutorial and then branch out to other items such as placing markers (Figure 3-24), drawing paths, and adding overlays. The comprehensive set of code snippets and tutorials should quickly get you up and running.

Figure 3-24: Marker placement on Google Maps

Yahoo also has JavaScript and Flash APIs for
mapping, plus some geoservices, but I’m not sure how
long it’ll be around given the current state of the
company. As of this writing, Yahoo has shifted focus from applications and development to being a content provider.
Microsoft also provides a JavaScript API (under the
Bing name) and one in Silverlight, which was its answer
to Flash.

Useful Mapping API Resources

Google Maps API Family (http://code.google.com/apis/maps/)
Yahoo! Maps Web Services (http://developer.yahoo.com/maps/)
Bing Maps API (http://www.microsoft.com/maps/developers/web.aspx)

ArcGIS
The previously mentioned online mapping services are
fairly basic in what they can do at the core. If you want
more advanced mapping, you’ll most likely need to
implement the functionality yourself. ArcGIS, built for
desktop mapping, is the opposite. It’s a massive
program that enables you to map lots of data and do
lots of stuff with it, such as smoothing and processing.
You can do all this through a user interface, so there’s
no code required.
Any graphics department with mapping specialists
most likely uses ArcGIS. Professional cartographers
use ArcGIS. Some people love it. So if you’re interested
in producing detailed maps, it’s worth checking out
ArcGIS.
I have used ArcGIS only for a few projects because I
tend to take the programming route when I can, and I just didn’t need all that functionality. The downside of
such a rich feature set is that there are so many buttons
and menus to go through. Online and server solutions
are also available, but they feel kind of clunky compared
to other implementations.

Useful ArcGIS Resource
ArcGIS Product Page (www.esri.com/software/arcgis/)

Modest Maps
I mentioned Modest Maps earlier, with an example in
Figure 3-13. It shows the growth of Walmart. Modest
Maps is a Flash and ActionScript library for tile-based
maps, and there is support for Python. It’s maintained
by a group of people who know their online mapping
and do great work for both clients and for fun, which
should tell you a little something about the quality of the
library.
The fun thing about Modest Maps is that it’s more of a
framework than a mapping API like the one offered by
Google. It provides the bare minimum of what it takes to
create an online map and then gets out of the way to let
you implement what you want. You can use tiles from
different providers, and you can customize the maps to
fit with your application. For example, Figure 3-13 has a
black-and-blue theme, but you can just as easily change
that to white and red, as shown in Figure 3-25.

Figure 3-25: White-and-red themed map using Modest Maps

It’s BSD-licensed, so you can do just about anything
you want with it at no cost. You do have to know the
ropes around Flash and ActionScript, but the basics are
covered in Chapter 8, “Visualizing Spatial
Relationships.”

Polymaps
Polymaps is kind of like the JavaScript version of
Modest Maps. It was developed and is maintained by
some of the same people and provides the same
functionality—and then some. Modest Maps provides
only the basics of mapping, but Polymaps has some
built-in features such as choropleths (Figure 3-26) and
bubbles.

Figure 3-26: Choropleth map showing unemployment,
implemented in Polymaps

Because it’s JavaScript, it does feel more lightweight
(because it requires less code), and it works in modern
browsers. Polymaps uses Scalable Vector Graphics
(SVG) to display data, so it doesn’t work in the old
versions of Internet Explorer, but most people are up-to-date. As a reference, only about 5 percent of
FlowingData visitors use a browser that’s too old, and I
suspect that percentage will approach zero soon.
My favorite plus of a mapping library in JavaScript is
that all the code runs native in the browser. You don’t
have to do any compiling or Flash exports, which makes
it easier to get things running and to make updates
later.

Useful Polymaps Resource
Polymaps (http://polymaps.org/)

R

R doesn’t provide mapping functionality in the base
distribution, but there are a few packages that let you do so. Figure 3-27 is a map that I made in R. The
annotation was added after the fact in Adobe Illustrator.
Maps in R are limited in what they can do, and the
documentation isn’t great. So I use R for mapping if I
have something simple and I happen to be using R.
Otherwise, I tend to use the tools already mentioned.

Figure 3-27: United States map created in R

Useful R Mapping Resources
Analysis of Spatial Data (http://cran.r-project.org/web/views/Spatial.html)—Comprehensive list of packages in R for spatial analysis

A Practical Guide to Geostatistical Mapping (http://spatialanalyst.net/book/download)—Free book download on how to use R and other tools for spatial data
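
If you want a quick taste of mapping in R before committing to anything heavier, the maps package is a common starting point. The following is a minimal sketch, not tied to any figure in this chapter; the plotted coordinates are made up for illustration.

library(maps)

# Draw a base map of the contiguous United States, by state
map("state", col = "#cccccc", fill = TRUE, border = "white")

# Made-up locations (longitude, latitude) plotted on top
lon <- c(-122.4, -87.6, -74.0)
lat <- c(37.8, 41.9, 40.7)
points(lon, lat, pch = 19, col = "#821122")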

Online-Based Solutions
Figure 3-28: Choropleth map created in Indiemapper

A few online mapping solutions make it easy to
visualize your geographic data. For the most part,
they’ve taken the map types that people use the most
and then stripped away the other stuff—kind of like a
simplified ArcGIS. Many Eyes and GeoCommons are
two free ones. The former, discussed previously, has
only basic functionality for data by country or by state in
the United States. GeoCommons, however, has more

features and richer interaction. It also handles common
geospatial file formats such as shapefiles and KML.
A number of paid solutions exist, but Indiemapper
and SpatialKey are the most helpful. SpatialKey is
geared more toward business and decision making
whereas Indiemapper is geared toward cartographers
and designers. Figure 3-28 shows an example I
whipped up in just a few minutes in Indiemapper.

Trade-Offs
Mapping software comes in all shapes and sizes suited
to fit lots of different needs. It'd be great if you could
learn one program and be able to design every kind of
map imaginable. Unfortunately, it doesn't work that way.
For example, ArcGIS has a lot of functions, but it
might not be worth the time to learn or the money to
purchase if you only want to create simple maps. On the
other hand, R, which has basic mapping functionality
and is free, could be too simple for what you want. If
online and interactive maps are your goal, you can go
open-source with Modest Maps or Polymaps, but that
requires more programming skills. You’ll learn more
about how to use what’s available in Chapter 8.

Survey Your Options
This isn’t a comprehensive list of what you can use to
visualize data, but it should be enough to get you
started. There’s a lot to consider and play with here. The
tools you end up using largely depend on what you want

to accomplish, and there are always multiple ways to
accomplish a single task, even within the same
software. Want to design static data graphics? Maybe
try R or Illustrator. Do you want to build an interactive
tool for a web application? Try JavaScript or Flash.
On FlowingData, I ran a poll that asked people what
they mainly used to analyze and visualize data. A little
more than 1,000 people responded. The results are
shown in Figure 3-29.

Figure 3-29: What FlowingData readers use to analyze
and visualize data

There are some obvious leaders, given the topic of
FlowingData. Excel was first, and R followed in second.

But after that, there was a variety of software picks.
More than 200 people chose the Other category. In the
comments, many people stated that they use a
combination of tools to fill different needs, which is
usually the best route for the long term.

Combining Them
A lot of people like to stick to one program—it’s
comfortable and easy. They don’t have to learn anything
new. If that works, then by all means they should keep at
it. But after you've worked with data long enough, there
comes a point when you hit the software's limits. You
know what you want to do with your data or how you
want to visualize it, but the software doesn't let you do it
or makes the process harder than it has to be.
You can either accept that, or you can use different
software, which could take time to learn but helps you
design what you envision—I say go with the latter.
Learning a variety of tools ensures that you won’t get
stuck on a dataset, and you can be versatile enough to
accomplish a variety of visualization tasks to get actual
results.

Wrapping Up
Remember that none of these tools is a cure-all. In the
end, the analysis and data design are still up to you. The
tools are just that—they're tools. Just because you have
a hammer doesn't mean you can build a house.
Likewise, you can have great software and a
supercomputer, but if you don't know how to use your tools,
they might as well not exist. You decide what questions
to ask, what data to use, and what facets to highlight,
and this all becomes easier with practice.
But hey, you’re in luck. That’s what the rest of this
book is for. The following chapters cover important data
design concepts and teach you how to put the abstract
into practice, using a combination of the tools that were
just covered. You can learn what to look for in your data
and how to visualize it.

Chapter 4
Visualizing Patterns over Time
Time series data is just about everywhere. Public
opinion changes, populations shift, and businesses
grow. You look to time series data to see how much
these things have changed. This chapter looks at
discrete and continuous data because the type of data
graphics you use depends on the type of data you have.
You also get your hands dirty with R and Adobe
Illustrator—the two programs go great together.

What to Look for over Time
You look at time every day. It’s on your computer, your
watch, your phone, and just about anywhere else you
look. Even without a clock, you feel time as you wake up
and go to sleep and the sun rises and sets. So it’s only
natural to have data over time. It lets you see how things
change.
The most common thing you look for in time series, or
temporal, data is trends. Is something increasing or
decreasing? Are there seasonal cycles? To find these
patterns, you have to look beyond individual data points

to get the whole picture. It’s easy to pick out a single
value from a point in time and call it a day, but when you
look at what came before and after, you gain a better
understanding of what that single value means, and the
more you know about your data, the better the story that
you can tell.
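
One way to look beyond individual points is to let R separate a series into its components for you. Here is a small sketch using the co2 dataset that ships with R (not data from this chapter) to split a monthly series into trend, seasonal, and leftover parts:

# Decompose monthly CO2 readings into seasonal, trend,
# and remainder components, then plot all three
parts <- stl(co2, s.window = "periodic")
plot(parts)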
For example, there was a chart the Obama
administration released a year into the new presidency,
reproduced in Figure 4-1. It showed job loss during the
tail end of the Bush administration through the first part
of Obama’s.

Figure 4-1: Change in job loss since Barack Obama
took office

It looks like the new administration had a significant
positive effect on job loss, but what if you zoom out and
look at a larger time frame, as shown in Figure 4-2?
Does it make a difference?

Figure 4-2: Change in job loss from 2001 through 2010

Although you always want to get the big picture, it’s
also useful to look at your data in more detail. Are there
outliers? Are there any periods of time that look out of
place? Are there spikes or dips? If so, what happened
during that time? Often, these irregularities are where
you want to focus. Other times the outliers can end up
being a mistake in data entry. Looking at the big picture
—the context—can help you determine what is what.
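
If you want a rough first pass at spotting irregularities, you can flag values that sit far from the rest and then eyeball them in context. A minimal sketch in R follows; the two-standard-deviation cutoff is an arbitrary choice for illustration, not a rule, and the co2 series just stands in for your own data.

# Flag values more than two standard deviations from the mean
values <- as.numeric(co2)   # any numeric series works here
cutoff <- 2 * sd(values)
flagged <- abs(values - mean(values)) > cutoff

# Plot the series and mark the flagged points in red
plot(values, type = "l")
points(which(flagged), values[flagged], col = "red", pch = 19)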

Discrete Points in Time
Temporal data can be categorized as discrete or
continuous. Knowing which category your data belongs
to can help you decide how to visualize it. In the discrete
case, values are from specific points or blocks of time,
and there is a finite number of possible values. For
example, the percentage of people who pass a test
each year is discrete. People take the test, and that’s it.

Their scores don’t change afterward, and the test is
taken on a specific date. Something like temperature,
however, is continuous. It can be measured at any time
of day during any interval, and it is constantly changing.
In this section you look at chart types that help you
visualize discrete temporal data, and you see concrete
examples of how to create these charts in R and
Illustrator. The beginning serves as the main introduction,
and then you apply the same design patterns
throughout the chapter. This part is important. Although
the examples are for specific charts, you can apply the
same principles to all sorts of visualization. Remember,
it's all about the big picture.

Bars
The bar graph is one of the most common chart types.
Most likely you’ve seen lots of them. You’ve probably
made some. The bar graph can be used for various
data types, but now take a look at how it can be used
for temporal data.
Figure 4-3 shows a basic framework. The time axis
(the horizontal one, that is, x-axis) provides a place for
points in time that are ordered chronologically. In this
case the points in time are months, from January to
June 2011, but it could just as easily be by year, by day,
or by some other time unit. Bar width and bar spacing
typically do not represent values.

Figure 4-3: Framework of bar graphs

The value axis (the vertical one, that is, y-axis)
indicates the scale of the graph. Figure 4-3 shows a
linear scale where units are evenly spaced across the
full axis. Bar height matches up with the value axis. The
first bar, for example, goes up to one unit, whereas the
highest bar goes up to four units.
This is important. The visual cue for value is bar
height. The lower the value is, the shorter the bar will be.
The greater a value is, the taller a bar will be. So you
can see that the four-unit bar in April is twice as tall as
the two-unit bar in February.

Figure 4-4: Bar graph with non-zero axis

Many programs, by default, set the lowest value of the
value axis to the minimum of the dataset, as shown in
Figure 4-4. In this case, the minimum is 1. However, if
you were to start the value axis at 1, the height of the
February bar wouldn’t be half the height of the April bar
anymore. It would look like February was one-third that
of April. The bar for January would also be nonexistent.
The point: Always start the value axis at zero.
Otherwise, your bar graph could display incorrect
relationships.

Tip
Always start the value axis of your bar graph at
zero when you’re dealing with all positive
values. Anything else makes it harder to visually
compare the height of the bars.
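
To see the distortion for yourself, try plotting the same numbers both ways in R. This sketch uses made-up values matching the Figure 4-3 example; xpd = FALSE clips the bars so the truncated axis actually cuts them off.

months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
units <- c(1, 2, 1.5, 4, 3, 2.5)
par(mfrow = c(1, 2))

# Left: value axis starts at zero, so heights compare honestly
barplot(units, names.arg = months, main = "Baseline at zero")

# Right: value axis starts at the minimum, exaggerating differences
barplot(units, names.arg = months, ylim = c(1, 4), xpd = FALSE,
    main = "Baseline at one")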

Create a Bar Graph
It’s time to make your first graph, using real data, and
it’s an important part of history that is an absolute must
for all people who call themselves a human. It’s the
results from the past three decades of Nathan’s Hot
Dog Eating Contest. Oh, yes.
Figure 4-5 is the final graph you’re after. Do this in
two steps. First, create a basic bar graph in R, and then
you can refine that graph in Illustrator.
In case you’re not in the know of the competitive
eating circuit, Nathan’s Hot Dog Eating Contest is an
annual event that happens every July 4. That’s
Independence Day in the United States. The event has
become so popular that it’s even televised on ESPN.
Throughout the late 1990s, the winners ate 10 to 20
hot dogs and buns (HDBs) in about 15 minutes.
However, in 2001, Takeru Kobayashi, a professional
eater from Japan, obliterated the competition by eating
50 HDBs. That was more than twice the amount anyone
in the world had eaten before him. And this is where the
story begins.

Figure 4-5: Bar graph showing results from Nathan’s
Hot Dog Eating Contest

Wikipedia has results from the contest dating back to
1916, but the hot dog eating didn't become a regular
event until 1980, so that's where we start. The data is in
an HTML table and includes the year, name, number of
HDBs eaten, and country where the winner is from. I've
compiled the data in a CSV file that you can download
at http://datasets.flowingdata.com/hot-dog-contest-winners.csv.
Here's what the first five rows of data look like:
"Year","Winner","Dogs eaten","Country","New record"
1980,"Paul Siederman & Joe Baldini",9.1,"United States",0
1981,"Thomas DeBerry ",11,"United States",0
1982,"Steven Abrams ",11,"United States",0
1983,"Luis Llamas ",19.5,"Mexico",1
1984,"Birgit Felden ",9.5,"Germany",0

Download the data in CSV format from
http://datasets.flowingdata.com/hot-dog-contest-winners.csv.
See the page for "Nathan's Hot Dog Eating Contest" on
Wikipedia for precompiled data and history of the contest.

To load the data in R, use the read.csv() command.
You can either load the file locally from your own
computer, or you can use a URL. Enter this line of code
in R to do the latter:

hotdogs <- read.csv("http://datasets.flowingdata.com/hot-dog-contest-winners.csv", sep=",", header=TRUE)
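
With the data loaded, a first-pass bar graph takes just one more line with barplot(). This is a quick sketch; note that R converts the "Dogs eaten" column header to Dogs.eaten.

# Bar heights from hot dogs and buns eaten, labeled by year
barplot(hotdogs$Dogs.eaten, names.arg = hotdogs$Year,
    col = "red", border = NA,
    xlab = "Year", ylab = "Hot dogs and buns (HDBs) eaten")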

Donut Chart




If you’ve ever created a web page, this should be straightforward, but in case you haven’t, the preceding is basic HTML that you’ll find almost everywhere online. Every page starts with an tag and is followed by a that contains information about the page but doesn’t show in your browser window. Everything enclosed by the tag is visible. Title the page Donut Chart and load the Protovis library, a JavaScript file, with the Okay, first things first: the data. You’re still looking at the results from the FlowingData poll, which you store in arrays. The vote counts are stored in one array, and the corresponding category names are stored in another. var data = [172,136,135,101,80,68,50,29,19,41]; var cats = ["Statistics", "Design", "Business", "Cartography", "Information Science", "Web Analytics", "Programming", "Engineering", "Mathematics", "Other"]; Then specify the width and height of the donut chart and the radius length and scale for arc length. var w h r a = = = = 350, 350, w / 2, pv.Scale.linear(0, pv.sum(data)).range(0, 2 * Math.PI); The width and height of the donut chart are both 350 pixels, and the radius (that is, the center of the chart to the outer edge) is half the width, or 175 pixels. The fourth line specifies the arc scale. Here’s how to read it. The actual data is on a linear scale from 0 to the sum of all votes, or total votes. This scale is then translated to the scale to that of the donut, which is from 0 to 2π radians, or 0 to 360 degrees if you want to think of it in that way. Next create a color scale. The more votes a category receives, the darker the red it should be. In Illustrator, you did this by hand, but Protovis can pick the colors for you. You just pick the range of colors you want. var depthColors = pv.Scale.linear(0, 172).range("white", "#821122"); Now you have a color scale from white to a dark red (that is #821122) on a linear range from 0 to 172, the highest vote count. In other words, a category with 0 votes will be white, and one with 172 votes will be dark red. Categories with vote counts in between will be somewhere in between white and red. So far all you have are variables. You specified size and scale. To create the actual chart, first make a blank panel 350 (w) by 350 (h) pixels. var vis = new pv.Panel() .width(w) .height(h); Then add stuff to the panel, in this case wedges. It might be a little confusing, but now look over it line by line. vis.add(pv.Wedge) .data(data) .bottom(w / 2) .left(w / 2) .innerRadius(r - 120) .outerRadius(r) .fillStyle(function(d) depthColors(d)) .strokeStyle("#fff") .angle(a) .title(function(d) String(d) + " votes") .anchor("center").add(pv.Label) .text(function(d) cats[this.index]); The first line says that you’re adding wedges to the panel, one for each point in the data array. The bottom() a nd left() properties orient the wedges so that the points are situated in the center of the circle. The innerRadius() specifies the radius of the hole in the middle whereas the outerRadius is the radius of the full circle. That covers the structure of the donut chart. Rather than setting the fill style to a static shade, fill colors are determined by the value of the data point and the color scale stored as depthColors, or in other words, color is determined by a function of each point. A white (#fff) border is used, which is specified by strokeStyle(). The circular scale you made can determine the angle of each wedge. To get a tooltip that says how many votes there were when you mouse over a section, title() is used. 
Another option would be to create a mouseover event where you specify what happens when a user places a pointer over an object, but because browsers automatically show the value of the title attribute, it's easier to use title(). Make the title the value of each data point followed by "votes." Finally, add labels for each section.

The only thing left to do is add May 2009 in the hole of the chart.

vis.anchor("center").add(pv.Label)
    .font("bold 14px Georgia")
    .text("May 2009");

This reads as, "Put a label in the center of the chart in bold 14-pixel Georgia font that says May 2009." The full chart is now built, so now you can render it.

vis.render();

When you open donut.html in your browser, you should see Figure 5-10.

Visit http://book.flowingdata.com/ch05/donut.html to see the live chart and view the source for the code in its entirety.

If you're new to programming, this section might have felt kind of daunting, but the good news is that Protovis was designed to be learned by example. The library's site has many working examples to learn from and that you can use with your own data. It has everything from traditional statistical graphics to more advanced interactive and animated graphics. So don't get discouraged if you were a little confused. The effort you put in now will pay off after you get the hang of things. Now have another look at Protovis in the next section.

Stack Them Up

In the previous chapter you used the stacked bar chart to show data over time, but it's not just for temporal data. As shown in Figure 5-11, you can also use the stacked bar chart for categorical data.

Figure 5-11: Stacked bar chart with categories

For example, look at approval ratings for Barack Obama as estimated from Gallup and CBS polls taken in July and August 2010. Participants were asked whether they approved or disapproved of how Obama had dealt with 13 issues. Here are the numbers in table form.

Issue                        Approve   Disapprove   No Opinion
Race relations                    52           38           10
Education                         49           40           11
Terrorism                         48           45            7
Energy policy                     47           42           11
Foreign affairs                   44           48            8
Environment                       43           51            6
Situation in Iraq                 41           53            6
Taxes                             41           54            5
Healthcare policy                 40           57            3
Economy                           38           59            3
Situation in Afghanistan          36           57            7
Federal budget deficit            31           64            5
Immigration                       29           62            9

One option would be to make a pie chart for every issue, as shown in Figure 5-12. To do this in Illustrator, all you have to do is enter multiple rows of data instead of just a single one. One pie chart is generated for each row. However, a stacked bar chart enables you to compare approval ratings for the issues more easily because it's easier to judge bar length than wedge angles, so try that. In the previous chapter, you made a stacked bar chart in Illustrator using the Stacked Graph tool. This time you add some simple interactions.

Figure 5-12: Series of pie charts

Create an Interactive Stacked Bar Chart

Like in the donut chart example, use Protovis to create an interactive stacked bar chart. Figure 5-13 shows the final graphic. There are two basic interactions to implement. The first shows the percentage value of any given stack when you place the mouse pointer over it. The second highlights bars in the approve, disapprove, and no opinion categories based on where you put your mouse.

Figure 5-13: Interactive stacked bar chart in Protovis

To start, set up the HTML page and load the necessary Protovis JavaScript file.

<html>
<head>
<title>Stacked Bar Chart</title>
<script type="text/javascript" src="protovis-r3.2.js"></script>
</head>
<body>
<div id="figure-wrapper">
<div id="figure">
</div>
</div>
</body>
</html>
This should look familiar. You did the same thing to make a donut chart with Protovis. The only difference is that the title of the page is "Stacked Bar Chart" and there's an additional <div> with a "figure-wrapper" id. We also haven't added any CSS yet to style the page, because we're saving that for later.

Now on to JavaScript. Within the figure <div>, load and prepare the data (Obama ratings, in this case) in arrays.

var data = {
    "Issue": ["Race relations", "Education", "Terrorism",
        "Energy policy", "Foreign affairs", "Environment",
        "Situation in Iraq", "Taxes", "Healthcare policy",
        "Economy", "Situation in Afghanistan",
        "Federal budget deficit", "Immigration"],
    "Approve": [52, 49, 48, 47, 44, 43, 41, 41, 40, 38, 36, 31, 29],
    "Disapprove": [38, 40, 45, 42, 48, 51, 53, 54, 57, 59, 57, 64, 62],
    "None": [10, 11, 7, 11, 8, 6, 6, 5, 3, 3, 7, 5, 9]
};

You can read this as 52 percent and 38 percent approval and disapproval ratings, respectively, for race relations. Similarly, there were 49 percent and 40 percent approval and disapproval ratings for education. To make it easier to code the actual graph, you can split the data and store it in two variables.

var cat = data.Issue;
var data = [data.Approve, data.Disapprove, data.None];

The issues array is stored in cat, and the data is now an array of arrays. Set up the necessary variables for width, height, scale, and colors with the following:

var w = 400,
    h = 250,
    x = pv.Scale.ordinal(cat).splitBanded(0, w, 4/5),
    y = pv.Scale.linear(0, 100).range(0, h),
    fill = ["#809EAD", "#B1C0C9", "#D7D6CB"];

The graph will be 400 pixels wide and 250 pixels tall. The horizontal scale is ordinal, meaning you have set categories, as opposed to a continuous scale. The categories are the issues that the polls covered. Four-fifths of the graph width will be used for the bars, whereas the rest is for padding in between the bars. The vertical axis, which represents percentages, is a linear scale from 0 to 100 percent. The height of the bars can be anywhere between 0 pixels and the height of the graph, or 250 pixels. Finally, fill is specified in an array with hexadecimal numbers. That's dark blue for approval, light blue for disapproval, and light gray for no opinion.

You can change the colors to whatever you like. If you're not sure what colors to use, ColorBrewer at http://colorbrewer2.org is a good place to start. The tool enables you to specify the number of colors you want to use and the type of colors, and it provides a color scale that you can copy in various formats. 0to255 at http://0to255.com is a more general color tool, but I use it often.

Next step: Initialize the visualization with specified width and height. The rest provides padding around the actual graph, so you can fit axis labels. For example, bottom(90) moves the zero-axis up 90 pixels. Think of this part as setting up a blank canvas.

var vis = new pv.Panel()
    .width(w)
    .height(h)
    .bottom(90)
    .left(32)
    .right(10)
    .top(15);

To add stacked bars to your canvas, Protovis provides a special layout for stacked charts appropriately named Stack. Although you use this for a stacked bar chart in this example, the layout can also be used with stacked area charts and streamgraphs. Store the new layout in the bar variable.

var bar = vis.add(pv.Layout.Stack)
    .layers(data)
    .x(function() x(this.index))
    .y(function(d) y(d))
  .layer.add(pv.Bar)
    .fillStyle(function() fill[this.parent.index])
    .width(x.range().band)
    .title(function(d) d + "%")
    .event("mouseover", function() this.fillStyle("#555"))
    .event("mouseout", function() this.fillStyle(fill[this.parent.index]));

Another way to think about this chart is as a set of three layers, one each for approval, disapproval, and no opinion. Remember how you structured those three as an array of three arrays? That goes in layers(), where x and y follow the scales that you already made. For each layer, add bars using pv.Bar. Specify the fill style with fillStyle(). Notice that we used a function that goes by this.parent.index. This is so that the bar is colored by what layer it belongs to, of which there are three. If you were to use this.index, you would need color specifications for every bar, of which there are 39 (3 times 13). The width of each bar is the same across, and you can get that from the ordinal scale you already specified.
The final three lines of the preceding code are what make the graph interactive. Using title() in Protovis is the equivalent of setting the title attribute of an HTML element such as an image. When you roll over an image on a web page, a tooltip shows up if you set the title. Similarly, a tooltip appears as you place the mouse pointer over a bar for a second. Here simply make the tooltip show the percentage value that the bar represents, followed by a percent sign (%).

To make the layers highlight whenever you mouse over a bar, use event(). On "mouseover" the fill color is set to a dark gray (#555), and when the mouse pointer is moved off, the bar is set to its original color using the "mouseout" event.

Tip
Interaction in Protovis isn't just limited to mouse over and out. You can also set events for things such as click and double-click. See the Protovis documentation for more details.

To make the graph appear, you need to render it. Enter this at the end of your JavaScript.

vis.render();

This basically says, "Okay, we've put together all the pieces. Now draw the visualization." Open the page in your web browser (a modern one, such as Firefox or Safari), and you should see something like Figure 5-14. Mouse over a bar, and the layer appears highlighted. A tooltip shows up, too.

Figure 5-14: Stacked bar graph without any labels

A few things are still missing, namely the axes and labels. Add those now. In Figure 5-13, a number of labels are on the bars. It's only on the larger bars though, that is, not the gray ones. Here's how to do that. Keep in mind that this goes before vis.render(). Always save rendering for last.

bar.anchor("center").add(pv.Label)
    .visible(function(d) d > 11)
    .textStyle("white")
    .text(function(d) d.toFixed(0));

For each bar, look to see if it is greater than 11 percent. If it is, a white label that reads the percentage rounded to the nearest integer is drawn in the middle of the bar.

Now add the labels for each issue on the x-axis. Ideally, you want to make all labels read horizontally, but there is obviously not enough space to do that. If the graph were a horizontal bar chart, you could fit horizontal labels, but for this you want to set them at 45-degree angles. You could make the labels completely vertical, but that'd make them harder to read.

bar.anchor("bottom").add(pv.Label)
    .visible(function() !this.parent.index)
    .textAlign("right")
    .top(260)
    .left(function() x(this.index) + 20)
    .textAngle(-Math.PI / 4)
    .text(function() cat[this.index]);

This works in the same way you added number labels to the middle of each bar. However, this time around add labels only to the bars at the bottom, that is, the ones for approval. Then right-align the text and set their absolute vertical position with textAlign() and top(). Their x-position is based on what bar they label, each is rotated 45 degrees, and the text is the category. That gives you the categorical labels.

The labels for values on the vertical axis are added in the same way, but you also need to add tick marks.

vis.add(pv.Rule)
    .data(y.ticks())
    .bottom(y)
    .left(-15)
    .width(15)
    .strokeStyle(function(d) d > 0 ? "rgba(0,0,0,0.3)" : "#000")
  .anchor("top").add(pv.Label)
    .bottom(function(d) y(d) + 2)
    .text(function(d) d == 100 ? "100%" : d.toFixed(0));

This adds a Rule, or lines, according to y.ticks(). If the tick mark is for anything other than the zero line, its color is gray. Otherwise, the tick is black. The second section then adds labels on top of the tick marks.
You're still missing the horizontal axis, so add another Rule separately to get what you see in Figure 5-15.

vis.add(pv.Rule)
    .bottom(y)
    .left(-15)
    .right(0)
    .strokeStyle("#000")

Figure 5-15: Adding the horizontal axis

Lead-in copy and remaining labels are added with HTML and CSS. There are entire books on web design though, so I'll leave it at that. The cool thing here is that you can easily combine the HTML and CSS with Protovis, which is just JavaScript, and still make it look seamless.

To see and interact with the stacked bar graph, visit http://book.flowingdata.com/ch05/stackedbar.html. Check out the source code to see how HTML, CSS, and JavaScript fit together.

Hierarchy and Rectangles

In 1990, Ben Shneiderman, of the University of Maryland, wanted to visualize what was going on in his always-full hard drive. He wanted to know what was taking up so much space. Given the hierarchical structure of directories and files, he first tried a tree diagram. It got too big too fast to be useful though. Too many nodes. Too many branches.

See http://datafl.ws/11m for a full history of treemaps and additional examples described by the creator, Ben Shneiderman.

The treemap was his solution. As shown in Figure 5-16, it's an area-based visualization where the size of each rectangle represents a metric. Outer rectangles represent parent categories, and rectangles within the parent are like subcategories. You can use a treemap to visualize straight-up proportions, but to fully put the technique to use, it's best served with hierarchical, or rather, tree-structured data.

Figure 5-16: Treemap generalized

Create a Treemap

Illustrator doesn't have a Treemap tool, but there is an R package by Jeff Enos and David Kane called Portfolio. It was originally intended to visualize stock market portfolios (hence the name), but you can easily apply it to your own data. Look at page views and comments for 100 popular posts on FlowingData and separate them by their post categories, such as visualization or data design tips.

Tip
R is an open-source software environment for statistical computing. You can download it for free from www.r-project.org/. The great thing about R is that there is an active community around the software that is always developing packages to add functionality. If you're looking to make a static chart and don't know where to start, the R archives are a great place to look.

As always, the first step is to load the data into R. You can load data directly from your computer or point to a URL. Do the latter in this example because the data is already available online. If, however, you want to do the former when you apply the following steps to your own data, just make sure you put your data file in your working directory in R. You can change your working directory through the Miscellaneous menu.

Loading a CSV file from a URL is easy. It's only one line of code with the read.csv() function in R (Figure 5-17).

posts <- read.csv("http://datasets.flowingdata.com/post-data.txt")

Figure 5-17: Loading CSV in R

Easy, right? We've loaded a text file (in CSV format) using read.csv() and stored the values for page views and comments in a variable called posts. As mentioned in the previous chapter, the read.csv() function assumes that your data file is comma-delimited. If your data were, say, tab-delimited, you would use the sep argument and set the value to \t. If you want to load the data from a local directory, the preceding line might look something like this.
posts <- read.csv("post-data.txt")

This is assuming you've changed your working directory accordingly. For more options and instructions on how to load data using the read.csv() function, type the following in the R console:

?read.csv

Moving on, now that the data is stored in the posts variable, enter the following line to see the first five rows of the data.

posts[1:5,]

You should see four columns that correspond to the original CSV file, with id, views, comments, and category. Now that the data is loaded in R, make use of the Portfolio package. Try loading it with the following:

library(portfolio)

Get an error? You probably need to install the package before you begin:

install.packages("portfolio")

You should be able to load the package now. Go ahead and do that. Loaded with no errors? Okay, good, now go to the next step.

Tip
You can also install packages in R through the user interface. Go to Packages & Data ⇒ Package Installer. Click Get List, and then find the package of interest. Double-click to install.

The Portfolio package does the hard work with a function called map.market(). The function takes several arguments, but you use only five of them.

map.market(id=posts$id, area=posts$views,
    group=posts$category, color=posts$comments,
    main="FlowingData Map")

The id is the column that indicates a unique point, and you tell R to use views to decide the areas of the rectangles in the treemap, the categories to form groups, and the number of comments on a post to decide color. Finally, enter FlowingData Map as the main title. Press Enter on your keyboard to get a treemap, as shown in Figure 5-18.

It's still kind of rough around the edges, but the base and hierarchy are set up, which is the hard part. Just as you specified, rectangles, each of which represents a post, are sized by the number of page views and sorted by category. Brighter shades of green indicate posts that received more comments; posts with a lot of views don't necessarily get the most comments.

You can save the image as a PDF in R and then open the file in Illustrator. All the regular edit options apply. You can change stroke and fill colors and fonts, remove anything extraneous, and add comments if you like.

Figure 5-18: Default treemap in R

For this particular graphic you need to change the scale of the legend that goes from –90 to 90. It doesn't make sense to have a negative scale because there's no such thing as a negative number of comments. You can also fix the labels. Some of them are obscured in the small rectangles. Size the labels by popularity instead of the uniform scale they have now, using the Selection tool. Also thicken the category borders so that they're more prominent. That should give you something like Figure 5-19.

There you go. The graphic is much more readable now with unobscured labeling and a color scale that makes more sense. You also got rid of the dark gray background, which makes it cleaner. Oh, and of course, you included a title and lead-in to briefly explain what the graphic shows.

The New York Times used an animated treemap to show changes in the stock market during the financial crisis in its piece titled "How the Giants of Finance Shrank, Then Grew, Under the Financial Crisis." See it in action at http://nyti.ms/9JUkWL.

Because the Portfolio package does most of the heavy lifting, the only tough part in applying this to your own data is getting it into the right format. Remember, you need three things. You need a unique id for each row, a metric to size rectangles, and parent categories.
Optionally, you can use a fourth metric to color your rectangles. Check out Chapter 2, "Handling Data," for instructions on how to get your data into the format you need.

Figure 5-19: Revised treemap from R to Illustrator

Proportions over Time

Often you'll have a set of proportions over time. Instead of results for a series of questions from a single polling session, you might have results from the same poll run every month for a year. You're not just interested in individual poll results; you also want to see how views have changed over time. How has opinion changed from one year ago until now? This doesn't just apply to polls, of course. There are plenty of distributions that change over time. In the following examples, you take a look at the distribution of age groups in the United States from 1860 to 2005. With improving healthcare and average family size shrinking, the population as a whole is living longer than the generation before.

Stacked Continuous

Imagine you have several time series charts. Now stack each line on top of the other. Fill the empty space. What you have is a stacked area chart, where the horizontal axis is time, and the vertical axis is a range from 0 to 100 percent, as shown in Figure 5-20.

Figure 5-20: Stacked area chart generalized

So if you were to take a vertical slice of the area chart, you would get the distribution of that time slice. Another way to look at it is as a series of stacked bar charts connected by time.

Create a Stacked Area Chart

In this example, look at the aging population. Download the data at http://book.flowingdata.com/ch05/data/us-population-by-age.xls. Medicine and healthcare have improved over the decades, and the average lifespan continues to rise. As a result, the percentage of the population in older age brackets has increased. By how much has this age distribution changed over the years? Data from the U.S. Census Bureau can help you see via a stacked area chart. You want to see how the proportion of older age groups has increased and how the proportion of the younger age groups has decreased.

You can do this in a variety of ways, but first use Illustrator. For the stacked area graph, it comes in the form of the Area Graph tool (Figure 5-21).

Figure 5-21: Area Graph tool

Click and drag somewhere on a new document, and enter the data in the spreadsheet that pops up. You're familiar with the load data, generate graphic, and refine process now, right? You can see a stacked area chart, as shown in Figure 5-22, after you enter the data.

Figure 5-22: Default stacked area chart in Illustrator

The top area goes above the 100 percent line. This happened because the stacked area graph is not just for normalized proportions or a set of values that add up to 100 percent. It can also be used for raw values, so if you want each time slice to add up to 100 percent, you need to normalize the data. The above image was actually from a mistake on my part; I entered the data incorrectly. Oops. A quick fix, and you can see the graph in Figure 5-23. Although you probably entered the data correctly the first time, so you're already here.

Figure 5-23: Fixed area chart

Keep an eye out for stuff like this in your graph design though. It's better to spot typos and small data entry errors in the beginning than it is to finish a design and have to backtrack to figure out where things went wrong.

Tip
Be careful when you enter data manually. A lot of silly mistakes come from transferring data from one source to another.
Now that you have a proper base, clean up the axis and lines. Make use of the Direct Selection tool to select specific elements. I like to remove the vertical axis line and leave thinner tick marks for a cleaner, less clunky look, and add the percentage sign to the numbers because that's what we're dealing with. I also typically change the stroke color of the actual graph fills from the default black to a simpler white. Also bring in some shades of blue. That takes you to Figure 5-24.

Figure 5-24: Modified colors from default

Again, this is just my design taste, and you can do what you want. Color selection can also vary by case. The more graphs that you design, the better feel you'll develop for what you like and what works best.

Tip
Use colors that fit your theme, and guide your readers' eyes with varying shades.

Are you missing anything else? Well, there are no labels for the horizontal axis. Now put them in. And while you're at it, label the areas to indicate the age groups (Figure 5-25).

Figure 5-25: Labeled stacked area chart

I also added annotation on the right of the graph. What we're most interested in here is the change in age distribution. We can see that from the graph, but the actual numbers can help drive the point home. Lastly, put in the title and lead-in copy, along with the data source on the bottom. Tweak the colors of the right annotations a little bit to add some more meaning to the display, and you have the final graphic, as shown in Figure 5-26.

Figure 5-26: Final stacked area chart

Create an Interactive Stacked Area Chart

One of the drawbacks to using stacked area charts is that they become hard to read and practically useless when you have a lot of categories and data points. The chart type worked for age breakdowns because there were only five categories. Start adding more, and the layers start to look like thin strips. Likewise, if you have one category that has relatively small counts, it can easily get dwarfed by the more prominent categories. Making the stacked area graph interactive, however, can help solve that problem. You can provide a way for readers to search for categories and then adjust the axis to zoom in on points of interest. Tooltips can help readers see values in places that are too small to place labels. Basically, you can take data that wouldn't work as a static stacked area chart and make it easy to browse and explore with an interactive one.

You could do this in JavaScript with Protovis, but for the sake of learning more tools (because it's super fun), use Flash and ActionScript.

The NameVoyager by Martin Wattenberg made the interactive stacked area chart popular. It is used to show baby names over time, and the graph automatically updates as you type names in the search box. Try it out at www.babynamewizard.com/voyager.

Note
Online visualization has slowly been shifting away from Flash toward JavaScript and HTML5, but not all browsers support the latter, namely Internet Explorer. Also, because Flash has been around for years, there are libraries and packages that make certain tasks easier than if you were to try to do it with native browser functionality.

Luckily you don't have to start from scratch. Most of the work has already been done for you via the Flare visualization toolkit, designed and maintained by the UC Berkeley Visualization Lab. It's an ActionScript library, which was actually a port of a Java visualization toolkit called Prefuse.
We’ll work off one of the sample applications on the Flare site, JobVoyager, which is like NameVoyager, but an explorer for jobs. After you get your development environment set up, it’s just a matter of switching in your data and then customizing the look and feel. Note Download Flare for free at http://flare.prefuse.org/. You can write the code completely in ActionScript and then compile it into a Flash file. Basically this means you write the code, which is a language that you understand, and then use a compiler to translate the code into bits so that your computer, or the Flash player, can understand what you told it to do. So you need two things: a place to write and a way to compile. The hard way to do this is to write code in a standard text editor and then use one of Adobe’s free compilers. I say hard because the steps are definitely more roundabout, and you have to install separate things on your computer. The easy way to do this, and the way I highly recommend if you’re planning on doing a lot of work in Flash and ActionScript, is to use Adobe Flex Builder. It makes the tedious part of programming with ActionScript quicker, because you code, compile, and debug all in the same place. The downside is that it does cost money, although it’s free for students. If you’re not sure if it’s worth the money, you can always download a free trial and make your decision later. For the stacked area chart example, I’ll explain the steps you have to take in Flex Builder. Note At the time of this writing, Adobe changed the name of Flex Builder to Flash Builder. They are similar but there are some variations between the two. While the following steps use the former, you can still do the same in the latter. Download Flash Builder at www.adobe.com/products/flashbuilder/. Be sure to take advantage of the student discount. Simply provide a copy of your student ID, and you get a free license. Alternatively, find an old, lower-priced copy of Flex Builder. When you’ve downloaded and installed Flex Builder, go ahead and open it; you should see a window, as shown in Figure 5-27. Figure 5-27: Initial window on opening Flex Builder Right-click the Flex Navigator (left sidebar) and click Import. You’ll see a pop-up that looks like Figure 5-28. Select Existing Projects into Workspace and click Next. Browse to where you put the Flare files. Select the flare directory, and then make sure Flare is checked in the project window, as shown in Figure 5-29. Figure 5-28: Import window in Flex Builder Figure 5-29: Existing projects window Do the same thing with the flare.apps folder. Your Flex Builder window should look like Figure 5-30 after you expand the flare.apps/flare/apps/ folder and click JobVoyager.as. Figure 5-30: JobVoyager code opened If you click the run button right now (the green button with the white play triangle at the top left), you should see the working JobVoyager, as shown in Figure 5-31. Get that working, and you’re done with the hardest part: the setup. Now you just need to plug in your own data and customize it to your liking. Sound familiar? Figure 5-32 shows what you’re after. It’s a voyager for consumer spending from 1984 to 2008, as reported by the U.S. Census Bureau. The horizontal axis is still years, but instead of jobs, there are spending categories such as housing and food. Vi s i t http://datafl.ws/16r to try the final visualization and to see how the explorer works with consumer spending. Now you need to change the data source, which is specified on line 57 of JobVoyager.as. 
private var _url:String = "http://flare.prefuse.org/data/jobs.txt";

Figure 5-31: JobVoyager application

Change the _url to point at the spending data available at http://datasets.flowingdata.com/expenditures.txt. Like jobs.txt, the data is also a tab-delimited file. The first column is year, the second category, and the last column is expenditure.

private var _url:String = "http://datasets.flowingdata.com/expenditures.txt";

Now the file will read in your spending data instead of the data for jobs. Easy stuff so far.

The next two lines, lines 58 and 59, are the column names, or in this case, the distinct years that job data was available. It's by decade from 1850 to 2000. You could make things more robust by finding the years in the loaded data, but because the data isn't changing, you can save some time and explicitly specify the years.

Figure 5-32: Interactive voyager for consumer spending

The expenditures data is annual from 1984 to 2008, so change lines 58–59 accordingly.

private var _cols:Array =
    [1984,1985,1986,1987,1988,1989,1990,1991,1992,
    1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,
    2003,2004,2005,2006,2007,2008];

Next change references to the data headers. The original data file (jobs.txt) has four columns: year, occupation, people, and sex. The spending data has only three columns: year, category, and expenditure. You need to adapt the code to this new data structure. Luckily, it's easy. The year column is the same, so you just need to change any people references to expenditure (the vertical axis) and any occupation references to category (the layers). Finally, remove all uses of gender.

At line 74 the data is reshaped and prepared for the stacked area chart. It specifies occupation and sex as the categories (that is, layers) and uses year on the x-axis and people on the y-axis.

var dr:Array = reshape(ds.nodes.data, ["occupation","sex"],
    "year", "people", _cols);

Change it to this:

var dr:Array = reshape(ds.nodes.data, ["category"],
    "year", "expenditure", _cols);

You have only one category (sans sex), and that's, uh, category. The x-axis is still year, and the y-axis is expenditure.

Line 84 sorts the data by occupation (alphabetically) and then sex (numerically). Now just sort by category:

data.nodes.sortBy("data.category");

Are you starting to get the idea here? Mostly everything is laid out for you. You just need to adjust the variables to accommodate the data.

Tip
There's some great open-source work going on in visualization, and although coding can seem daunting in the beginning, many times you can use existing code with your own data just by changing variables. The challenge is reading the code and figuring out how everything works.

Line 92 colors layers by sex, but you don't have that split in the data, so you don't need to do that. Remove the entire row:

data.nodes.setProperty("fillHue", iff(eq("data.sex",1), 0.7, 0));

We'll come back to customizing the colors of the stacks a little later. Line 103 adds labels based on occupation:

_vis.operators.add(new StackedAreaLabeler("data.occupation"));

You want to label based on spending category, so change the line accordingly:

_vis.operators.add(new StackedAreaLabeler("data.category"));

Lines 213–231 handle filtering in JobVoyager. First, there's the male/female filter; then there's the filter by occupation. You don't need the former, so you can get rid of lines 215–218 and then make line 219 a plain if statement. Similarly, lines 264–293 create buttons to trigger the male/female filter. You can get rid of that, too.
You’re close to fully customizing the voyager to the spending data. Go back to the filter() function at line 213. Again, update the function so that you can filter by the spending category instead of occupation. Here’s line 222 as-is: var s:String = String(d.data["occupation"]).toLowerCase(); Change occupation to category: var s:String = String(d.data["category"]).toLowerCase(); Next up on the customization checklist is color. If you compiled the code now and ran it, you would get a reddish stacked area graph, as shown in Figure 5-33. You want more contrast though. Color is specified in two places. First lines 86–89 specify stroke color and color everything red: shape: Shapes.POLYGON, lineColor: 0, fillValue: 1, fillSaturation: 0.5 Then line 105 updates saturation (the level of red), by count. The code for the SaturationEncoder() is in lines 360– 383. We’re not going to use saturation; instead, explicitly specify the color scheme. First, update lines 86–89 to this: shape: Shapes.POLYGON, lineColor: 0xFFFFFFFF Now make stroke color white with lineColor. If there were more spending categories, you probably wouldn’t do this because it’d be cluttered. You don’t have that many though, so it’ll make reading a little easier. Next, make an array of the colors you want to use ordered by levels. Put it toward the top around line 50: private var _reds:Array = [0xFFFEF0D9, 0xFFFDD49E, 0xFFFDBB84, 0xFFFC8D59, 0xFFE34A33, 0xFFB30000]; Figure 5-33: Stacked area graph with basic coloring I used the ColorBrewer (referenced earlier) for these colors, which suggests color schemes based on criteria that you set. It’s intended to choose colors for maps but works great for general visualization, too. Now add a new ColorEncoder around line 110: var colorPalette:ColorPalette = new ColorPalette(_reds); vis.operators.add(new ColorEncoder("data.max", "nodes", "fillColor", null, colorPalette)); note If you get an error when you try to compile your code, check the top of JobVoyager.as to see if the following two lines to import the ColorPallete and Encoder objects are specified. Add them if they are not there already. import flare.util.palette.*; import flare.vis.operator.encoder.*; Ta Da! You now have something that looks like what we’re after (Figure 5-32). Of course, you don’t have to stop here. You can do a lot of things with this. You can apply this to your own data, use a different color scheme, and further customize to fit your needs. Maybe change the font or the tooltip format. Then you can get fancier and integrate it with other tools or add more ActionScript, and so on. Point-by-Point One disadvantage of the stacked area graph is that it can be hard to see trends for each group because the placement of each point is affected by the points below it. So sometimes a better way is to plot proportions as a straight up time series like the previous chapter covered. Luckily, it’s easy to switch between the two in Illustrator. The data entry is the same, so you just need to change the graph type. Select the line plot instead of the stacked area in the beginning, and you get this, the default graph in Figure 5-34. Clean up and format to your liking in the same way you did with the time series examples, and you have the same data from a different point of view (Figure 5-35). It’s easier to see the individual trends in each age group with this time series plot. On the other hand, you do lose the sense of a whole and distributions. 
The graph you choose should reflect the point you want to get across or what you want to find in your data. You can even show both views if you have the space.

Figure 5-34: Default line plot

Figure 5-35: Labeled line plot cleaned up

Wrapping Up

The main thing that sets proportions apart from other data types is that they represent parts of a whole. Each individual value means something, but so does the sum of all the parts or just a subset of the parts. The visualization you design should represent these ideas. Only have a few values? The pie chart might be your best bet. Use donut charts with care. If you have several values and several categories, consider the stacked bar chart instead of multiple pie charts. If you're looking for patterns over time, look to your friend the stacked area chart or go for the classic time series. With these steady foundations, your proportions will be good to go.

When it comes time to design and implement, ask yourself what you want to know about your data, and then go from there. Does a static graphic tell your story completely? A lot of the time the answer will be yes, and that's fine. If, however, you decide you need to go with an interactive graphic, map out on paper what should happen when you click objects and what shouldn't. It gets complicated quickly if you add too much functionality, so do your best to keep it simple. Have other people try interacting with your designs to see if they understand what's going on.

Finally, while you're programming—especially if you're new to code—you're undoubtedly going to reach a point where you're not sure what to do next. This happens to me all the time. When you get stuck, there's no better place than the web to find your solution. Look at documentation if it's available, or study examples that are similar to what you're trying to do. Don't just look at the syntax. Learn the logic, because that's what's going to help you the most. Luckily there are libraries such as Protovis and Flare that have many examples and great documentation.

In the next chapter, we move toward deeper analysis and data interpretation and come back to your good statistical friend. You put R to good use as you study relationships between datasets and variables. Ready? Let's go.

Chapter 6
Visualizing Relationships

Statistics is about finding relationships in data. What are the similarities between groups? Within groups? Within subgroups? The relationship that most people are familiar with for statistics is correlation. For example, as average height goes up in a population, most likely average weight will go up, too. This is a simple positive correlation. The relationships in your data, just like in real life, can get more complicated though as you consider more factors or find patterns that aren't so linear. This chapter discusses how to use visualization to find such relationships and highlight them for storytelling.

As you get into more complex statistical graphics, you make heavy use of R in this chapter and the next. This is where the open-source software shines. As in previous chapters, R does the grunt work, and then you can use Illustrator to make the graphic more readable for an audience.

What Relationships to Look For

So far you've looked at basic relationships with patterns in time and proportions. You learned about temporal trends and compared proportions and percentages to see what's the least and greatest and everything in between. The next step is to look for relationships between different variables.
As something goes up, does another thing go down, and is it a causal or correlative relationship? The former is usually quite hard to prove quantitatively, which makes it even less likely you can prove it with a graphic. You can, however, easily show correlation, which can lead to a deeper, more exploratory analysis. You can also take a step back to look at the big picture, or the distribution of your data. Is it actually spaced out, or is it clustered in the middle? Such comparisons can lead to stories about the citizens of a country or how you compare to those around you. You can see how different countries compare to one another or how development is progressing around the world, which can aid in decisions about where to provide aid. You can also compare multiple distributions for an even wider view of your data. How has the makeup of a population changed over time? How has it stayed the same?

Most important, in the end, when you have all your graphics in front of you, ask what the results mean. Are they what you expected? Does anything surprise you? This might seem abstract and hand-wavy, so now jump right into some concrete examples on how to look at relationships in your data.

Correlation

Correlation is probably the first thing you think of when you hear about relationships in data. The second thing is probably causation. Now maybe you're thinking about the mantra that correlation doesn't equal causation. The first, correlation, means one thing tends to change a certain way as another thing changes. For example, the price of milk per gallon and the price of gasoline per gallon are positively correlated. Both have been increasing over the years.

Now here's the difference between correlation and causation. If you increase the price of gas, will the price of milk go up by default? More important, if the price of milk did go up, was it because of the increase in the gas price, or was it an outside factor, such as a dairy strike? It's difficult to account for every outside, or confounding, factor, which makes it difficult to prove causation. Researchers spend years figuring stuff like that out. You can, however, easily find and see correlation, which can still be useful, as you see in the following sections. Correlation can help you predict one metric by knowing another. To see this relationship, return to the scatterplot and multiple scatterplots.

More with Points

In Chapter 4, "Visualizing Patterns over Time," you used a scatterplot to graph measurements over time, where time was on the horizontal axis and a metric of interest was on the vertical axis. This helped spot temporal changes (or nonchanges). The relationship was between time and another factor, or a variable. As shown in Figure 6-1, however, you can use the scatterplot for variables other than time; you can use a scatterplot to look for relationships between two variables.

If two metrics are positively correlated (Figure 6-2, left), dots move higher up as you read the graph from left to right. Conversely, if a negative correlation exists, the dots appear lower, moving from left to right, as shown in the middle of Figure 6-2.

Sometimes the relationship is straightforward, such as the correlation between people's height and weight. Usually, as height increases, weight increases. Other times the correlation is not as obvious, such as that between health and body mass index (BMI). A high BMI typically indicates that someone is overweight; however, muscular people, for example, who can be athletically fit, could have a high BMI.
What if the sample population were bodybuilders or football players? What would the relationship between health and BMI look like?

Figure 6-1: Scatterplot framework, comparing two variables

Figure 6-2: Correlations shown in scatterplots

Remember, the graph is only part of the story. It's still up to you to interpret the results. This is particularly important with relationships. You might be tempted to assume a cause-and-effect relationship, but most of the time that's not the case at all. Just because the price of a gallon of gas and world population have both increased over the years doesn't mean the price of gas should be decreased to slow population growth.

Create a Scatterplot

In this example, look at United States crime rates at the state level in 2005, with rates per 100,000 population for crime types such as murder, robbery, and aggravated assault, as reported by the Census Bureau. There are seven crime types in total. Look at two of them to start: burglary and murder. How do these relate? Do states with relatively high murder rates also have high burglary rates? You can turn to R to investigate.

As always, the first thing you do is load the data into R using read.csv(). You can download the CSV file at http://datasets.flowingdata.com/crimeRatesByState2005.csv, but now load it directly into R via the URL.

# Load the data
crime <- read.csv("http://datasets.flowingdata.com/crimeRatesByState2005.csv")
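
From there, a first look at the burglary-murder relationship is one line with plot(). A minimal sketch, assuming murder and burglary are the relevant column names in the CSV:

# Scatterplot: murder rate on the x-axis, burglary rate on the y-axis
plot(crime$murder, crime$burglary,
    xlab = "Murders per 100,000", ylab = "Burglaries per 100,000")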
