Cassandra: The Definitive Guide Cassandra.The.Definitive.Guide

Cassandra%20-%20The%20Definitive%20Guide

Cassandra.The.Definitive.Guide-www.gocit.vn

Cassandra.The.Definitive.Guide

Cassandra.The.Definitive.Guide-www.gocit.vn

Cassandra.The.Definitive.Guide-www.gocit.vn

User Manual: Pdf

Open the PDF directly: View PDF PDF.
Page Count: 330 [warning: Documents this large are best viewed by clicking the View PDF Link!]

Cassandra: The Definitive Guide
Cassandra: The Definitive Guide
Eben Hewitt
Beijing
Cambridge
Farnham
Köln
Sebastopol
Tokyo
Cassandra: The Definitive Guide
by Eben Hewitt
Copyright © 2011 Eben Hewitt. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.
Editor: Mike Loukides
Production Editor: Holly Bauer
Copyeditor: Genevieve d’Entremont
Proofreader: Emily Quill
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano
Printing History:
November 2010: First Edition.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Cassandra: The Definitive Guide, the image of a Paradise flycatcher, and related
trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and author assume
no responsibility for errors or omissions, or for damages resulting from the use of the information con-
tained herein.
TM
This book uses RepKover™, a durable and flexible lay-flat binding.
ISBN: 978-1-449-39041-9
[M]
1289577822
This book is dedicated to my sweetheart,
Alison Brown. I can hear the sound of violins,
long before it begins.
Table of Contents
Foreword . .................................................................. xv
Preface .................................................................... xvii
1. Introducing Cassandra ................................................... 1
What’s Wrong with Relational Databases? 1
A Quick Review of Relational Databases 6
RDBMS: The Awesome and the Not-So-Much 6
Web Scale 12
The Cassandra Elevator Pitch 14
Cassandra in 50 Words or Less 14
Distributed and Decentralized 14
Elastic Scalability 16
High Availability and Fault Tolerance 16
Tuneable Consistency 17
Brewer’s CAP Theorem 19
Row-Oriented 23
Schema-Free 24
High Performance 24
Where Did Cassandra Come From? 24
Use Cases for Cassandra 25
Large Deployments 25
Lots of Writes, Statistics, and Analysis 26
Geographical Distribution 26
Evolving Applications 26
Who Is Using Cassandra? 26
Summary 28
2. Installing Cassandra . ................................................... 29
Installing the Binary 29
Extracting the Download 29
vii
What’s In There? 29
Building from Source 30
Additional Build Targets 32
Building with Maven 32
Running Cassandra 33
On Windows 33
On Linux 33
Starting the Server 34
Running the Command-Line Client Interface 35
Basic CLI Commands 36
Help 36
Connecting to a Server 36
Describing the Environment 37
Creating a Keyspace and Column Family 38
Writing and Reading Data 39
Summary 40
3. The Cassandra Data Model . . ............................................. 41
The Relational Data Model 41
A Simple Introduction 42
Clusters 45
Keyspaces 46
Column Families 47
Column Family Options 49
Columns 49
Wide Rows, Skinny Rows 51
Column Sorting 52
Super Columns 53
Composite Keys 55
Design Differences Between RDBMS and Cassandra 56
No Query Language 56
No Referential Integrity 56
Secondary Indexes 56
Sorting Is a Design Decision 57
Denormalization 57
Design Patterns 58
Materialized View 59
Valueless Column 59
Aggregate Key 59
Some Things to Keep in Mind 60
Summary 60
viii | Table of Contents
4. Sample Application .................................................... 61
Data Design 61
Hotel App RDBMS Design 62
Hotel App Cassandra Design 63
Hotel Application Code 64
Creating the Database 65
Data Structures 66
Getting a Connection 67
Prepopulating the Database 68
The Search Application 80
Twissandra 85
Summary 85
5. The Cassandra Architecture .............................................. 87
System Keyspace 87
Peer-to-Peer 88
Gossip and Failure Detection 88
Anti-Entropy and Read Repair 90
Memtables, SSTables, and Commit Logs 91
Hinted Handoff 93
Compaction 94
Bloom Filters 95
Tombstones 95
Staged Event-Driven Architecture (SEDA) 96
Managers and Services 97
Cassandra Daemon 97
Storage Service 97
Messaging Service 97
Hinted Handoff Manager 98
Summary 98
6. Configuring Cassandra .................................................. 99
Keyspaces 99
Creating a Column Family 102
Transitioning from 0.6 to 0.7 103
Replicas 103
Replica Placement Strategies 104
Simple Strategy 105
Old Network Topology Strategy 106
Network Topology Strategy 107
Replication Factor 107
Increasing the Replication Factor 108
Partitioners 110
Table of Contents | ix
Random Partitioner 110
Order-Preserving Partitioner 110
Collating Order-Preserving Partitioner 111
Byte-Ordered Partitioner 111
Snitches 111
Simple Snitch 111
PropertyFileSnitch 112
Creating a Cluster 113
Changing the Cluster Name 113
Adding Nodes to a Cluster 114
Multiple Seed Nodes 116
Dynamic Ring Participation 117
Security 118
Using SimpleAuthenticator 118
Programmatic Authentication 121
Using MD5 Encryption 122
Providing Your Own Authentication 122
Miscellaneous Settings 123
Additional Tools 124
Viewing Keys 124
Importing Previous Configurations 125
Summary 127
7. Reading and Writing Data . ............................................. 129
Query Differences Between RDBMS and Cassandra 129
No Update Query 129
Record-Level Atomicity on Writes 129
No Server-Side Transaction Support 129
No Duplicate Keys 130
Basic Write Properties 130
Consistency Levels 130
Basic Read Properties 132
The API 133
Ranges and Slices 133
Setup and Inserting Data 134
Using a Simple Get 140
Seeding Some Values 142
Slice Predicate 142
Getting Particular Column Names with Get Slice 142
Getting a Set of Columns with Slice Range 144
Getting All Columns in a Row 145
Get Range Slices 145
Multiget Slice 147
x | Table of Contents
Deleting 149
Batch Mutates 150
Batch Deletes 151
Range Ghosts 152
Programmatically Defining Keyspaces and Column Families 152
Summary 153
8. Clients .............................................................. 155
Basic Client API 156
Thrift 156
Thrift Support for Java 159
Exceptions 159
Thrift Summary 160
Avro 160
Avro Ant Targets 162
Avro Specification 163
Avro Summary 164
A Bit of Git 164
Connecting Client Nodes 165
Client List 165
Round-Robin DNS 165
Load Balancer 165
Cassandra Web Console 165
Hector (Java) 168
Features 169
The Hector API 170
HectorSharp (C#) 170
Chirper 175
Chiton (Python) 175
Pelops (Java) 176
Kundera (Java ORM) 176
Fauna (Ruby) 177
Summary 177
9. Monitoring . ......................................................... 179
Logging 179
Tailing 181
General Tips 182
Overview of JMX and MBeans 183
MBeans 185
Integrating JMX 187
Interacting with Cassandra via JMX 188
Cassandra’s MBeans 190
Table of Contents | xi
org.apache.cassandra.concurrent 193
org.apache.cassandra.db 193
org.apache.cassandra.gms 194
org.apache.cassandra.service 194
Custom Cassandra MBeans 196
Runtime Analysis Tools 199
Heap Analysis with JMX and JHAT 199
Detecting Thread Problems 203
Health Check 204
Summary 204
10. Maintenance ......................................................... 207
Getting Ring Information 208
Info 208
Ring 208
Getting Statistics 209
Using cfstats 209
Using tpstats 210
Basic Maintenance 211
Repair 211
Flush 213
Cleanup 213
Snapshots 213
Taking a Snapshot 213
Clearing a Snapshot 214
Load-Balancing the Cluster 215
loadbalance and streams 215
Decommissioning a Node 218
Updating Nodes 220
Removing Tokens 220
Compaction Threshold 220
Changing Column Families in a Working Cluster 220
Summary 221
11. Performance Tuning . .................................................. 223
Data Storage 223
Reply Timeout 225
Commit Logs 225
Memtables 226
Concurrency 226
Caching 227
Buffer Sizes 228
Using the Python Stress Test 228
xii | Table of Contents
Generating the Python Thrift Interfaces 229
Running the Python Stress Test 230
Startup and JVM Settings 232
Tuning the JVM 232
Summary 234
12. Integrating Hadoop ................................................... 235
What Is Hadoop? 235
Working with MapReduce 236
Cassandra Hadoop Source Package 236
Running the Word Count Example 237
Outputting Data to Cassandra 239
Hadoop Streaming 239
Tools Above MapReduce 239
Pig 240
Hive 241
Cluster Configuration 241
Use Cases 242
Raptr.com: Keith Thornhill 243
Imagini: Dave Gardner 243
Summary 244
Appendix: The Nonrelational Landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Glossary ................................................................... 271
Index ..................................................................... 285
Table of Contents | xiii
Foreword
Cassandra was open-sourced by Facebook in July 2008. This original version of
Cassandra was written primarily by an ex-employee from Amazon and one from Mi-
crosoft. It was strongly influenced by Dynamo, Amazon’s pioneering distributed key/
value database. Cassandra implements a Dynamo-style replication model with no sin-
gle point of failure, but adds a more powerful “column family” data model.
I became involved in December of that year, when Rackspace asked me to build them
a scalable database. This was good timing, because all of today’s important open source
scalable databases were available for evaluation. Despite initially having only a single
major use case, Cassandra’s underlying architecture was the strongest, and I directed
my efforts toward improving the code and building a community.
Cassandra was accepted into the Apache Incubator, and by the time it graduated in
March 2010, it had become a true open source success story, with committers from
Rackspace, Digg, Twitter, and other companies that wouldn’t have written their own
database from scratch, but together built something important.
Today’s Cassandra is much more than the early system that powered (and still powers)
Facebook’s inbox search; it has become “the hands down winner for transaction pro-
cessing performance,” to quote Tony Bain, with a deserved reputation for reliability
and performance at scale.
As Cassandra matured and began attracting more mainstream users, it became clear
that there was a need for commercial support; thus, Matt Pfeil and I cofounded Riptano
in April 2010. Helping drive Cassandra adoption has been very rewarding, especially
seeing the uses that don’t get discussed in public.
Another need has been a book like this one. Like many open source projects, Cassan-
dra’s documentation has historically been weak. And even when the documentation
ultimately improves, a book-length treatment like this will remain useful.
xv
Thanks to Eben for tackling the difficult task of distilling the art and science of devel-
oping against and deploying Cassandra. You, the reader, have the opportunity to learn
these new concepts in an organized fashion.
—Jonathan Ellis
Project Chair, Apache Cassandra, and Cofounder, Riptano
xvi | Foreword
Preface
Why Apache Cassandra?
Apache Cassandra is a free, open source, distributed data storage system that differs
sharply from relational database management systems.
Cassandra first started as an incubation project at Apache in January of 2009. Shortly
thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, re-
leased version 0.3 of Cassandra, and have steadily made minor releases since that time.
Though as of this writing it has not yet reached a 1.0 release, Cassandra is being used
in production by some of the biggest properties on the Web, including Facebook,
Twitter, Cisco, Rackspace, Digg, Cloudkick, Reddit, and more.
Cassandra has become so popular because of its outstanding technical features. It is
durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes,
can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s
no single point of failure. It is highly available and offers a schema-free data model.
Is This Book for You?
This book is intended for a variety of audiences. It should be useful to you if you are:
A developer working with large-scale, high-volume websites, such as Web 2.0 so-
cial applications
An application architect or data architect who needs to understand the available
options for high-performance, decentralized, elastic data stores
A database administrator or database developer currently working with standard
relational database systems who needs to understand how to implement a fault-
tolerant, eventually consistent data store
xvii
A manager who wants to understand the advantages (and disadvantages) of Cas-
sandra and related columnar databases to help make decisions about technology
strategy
A student, analyst, or researcher who is designing a project related to Cassandra
or other non-relational data store options
This book is a technical guide. In many ways, Cassandra represents a new way of
thinking about data. Many developers who gained their professional chops in the last
15–20 years have become well-versed in thinking about data in purely relational or
object-oriented terms. Cassandra’s data model is very different and can be difficult to
wrap your mind around at first, especially for those of us with entrenched ideas about
what a database is (and should be).
Using Cassandra does not mean that you have to be a Java developer. However, Cas-
sandra is written in Java, so if you’re going to dive into the source code, a solid under-
standing of Java is crucial. Although it’s not strictly necessary to know Java, it can help
you to better understand exceptions, how to build the source code, and how to use
some of the popular clients. Many of the examples in this book are in Java. But because
of the interface used to access Cassandra, you can use Cassandra from a wide variety
of languages, including C#, Scala, Python, and Ruby.
Finally, it is assumed that you have a good understanding of how the Web works, can
use an integrated development environment (IDE), and are somewhat familiar with the
typical concerns of data-driven applications. You might be a well-seasoned developer
or administrator but still, on occasion, encounter tools used in the Cassandra world
that you’re not familiar with. For example, Apache Ivy is used to build Cassandra, and
a popular client (Hector) is available via Git. In cases where I speculate that you’ll need
to do a little setup of your own in order to work with the examples, I try to support that.
What’s in This Book?
This book is designed with the chapters acting, to a reasonable extent, as standalone
guides. This is important for a book on Cassandra, which has a variety of audiences
and is changing rapidly. To borrow from the software world, I wanted the book to be
“modular”—sort of. If you’re new to Cassandra, it makes sense to read the book in
order; if you’ve passed the introductory stages, you will still find value in later chapters,
which you can read as standalone guides.
Here is how the book is organized:
Chapter 1, Introducing Cassandra
This chapter introduces Cassandra and discusses what’s exciting and different
about it, who is using it, and what its advantages are.
Chapter 2, Installing Cassandra
This chapter walks you through installing Cassandra on a variety of platforms.
xviii | Preface
Chapter 3, The Cassandra Data Model
Here we look at Cassandra’s data model to understand what columns, super col-
umns, and rows are. Special care is taken to bridge the gap between the relational
database world and Cassandra’s world.
Chapter 4, Sample Application
This chapter presents a complete working application that translates from a rela-
tional model in a well-understood domain to Cassandra’s data model.
Chapter 5, The Cassandra Architecture
This chapter helps you understand what happens during read and write operations
and how the database accomplishes some of its notable aspects, such as durability
and high availability. We go under the hood to understand some of the more com-
plex inner workings, such as the gossip protocol, hinted handoffs, read repairs,
Merkle trees, and more.
Chapter 6, Configuring Cassandra
This chapter shows you how to specify partitioners, replica placement strategies,
and snitches. We set up a cluster and see the implications of different configuration
choices.
Chapter 7, Reading and Writing Data
This is the moment we’ve been waiting for. We present an overview of what’s
different about Cassandra’s model for querying and updating data, and then get
to work using the API.
Chapter 8, Clients
There are a variety of clients that third-party developers have created for many
different languages, including Java, C#, Ruby, and Python, in order to abstract
Cassandra’s lower-level API. We help you understand this landscape so you can
choose one that’s right for you.
Chapter 9, Monitoring
Once your cluster is up and running, you’ll want to monitor its usage, memory
patterns, and thread patterns, and understand its general activity. Cassandra has
a rich Java Management Extensions (JMX) interface baked in, which we put to use
to monitor all of these and more.
Chapter 10, Maintenance
The ongoing maintenance of a Cassandra cluster is made somewhat easier by some
tools that ship with the server. We see how to decommission a node, load-balance
the cluster, get statistics, and perform other routine operational tasks.
Chapter 11, Performance Tuning
One of Cassandra’s most notable features is its speed—it’s very fast. But there are
a number of things, including memory settings, data storage, hardware choices,
caching, and buffer sizes, that you can tune to squeeze out even more performance.
Preface | xix
Chapter 12, Integrating Hadoop
In this chapter, written by Jeremy Hanna, we put Cassandra in a larger context and
see how to integrate it with the popular implementation of Google’s Map/Reduce
algorithm, Hadoop.
Appendix
Many new databases have cropped up in response to the need to scale at Big Data
levels, or to take advantage of a “schema-free” model, or to support more recent
initiatives such as the Semantic Web. Here we contextualize Cassandra against a
variety of the more popular nonrelational databases, examining document-
oriented databases, distributed hashtables, and graph databases, to better
understand Cassandra’s offerings.
Glossary
It can be difficult to understand something that’s really new, and Cassandra has
many terms that might be unfamiliar to developers or DBAs coming from the re-
lational application development world, so I’ve included this glossary to make it
easier to read the rest of the book. If you’re stuck on a certain concept, you can flip
to the glossary to help clarify things such as Merkle trees, vector clocks, hinted
handoffs, read repairs, and other exotic terms.
This book is developed against Cassandra 0.6 and 0.7. The project team
is working hard on Cassandra, and new minor releases and bug fix re-
leases come out frequently. Where possible, I have tried to call out rel-
evant differences, but you might be using a different version by the time
you read this, and the implementation may have changed.
Finding Out More
If you’d like to find out more about Cassandra, and to get the latest updates, visit this
book’s companion website at http://www.cassandraguide.com.
It’s also an excellent idea to follow me on Twitter at @ebenhewitt.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
xx | Preface
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values deter-
mined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Cassandra: The Definitive Guide by Eben
Hewitt. Copyright 2011 Eben Hewitt, 978-1-449-39041-9.”
If you feel your use of code examples falls outside fair use or the permission given here,
feel free to contact us at permissions@oreilly.com.
Safari® Enabled
Safari Books Online is an on-demand digital library that lets you easily
search over 7,500 technology and creative reference books and videos to
find the answers you need quickly.
With a subscription, you can read any page and watch any video from our library online.
Read books on your cell phone and mobile devices. Access new titles before they are
available for print, and get exclusive access to manuscripts in development and post
feedback for the authors. Copy and paste code samples, organize your favorites,
Preface | xxi
download chapters, bookmark key sections, create notes, print out pages, and benefit
from tons of other time-saving features.
O’Reilly Media has uploaded this book to the Safari Books Online service. To have full
digital access to this book and others on similar topics from O’Reilly and other pub-
lishers, sign up for free at http://my.safaribooksonline.com
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707 829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at:
http://oreilly.com/catalog/0636920010852/
To comment or ask technical questions about this book, send email to:
bookquestions@oreilly.com
For more information about our books, conferences, Resource Centers, and the
O’Reilly Network, see our website at:
http://www.oreilly.com
Acknowledgments
There are many wonderful people to whom I am grateful for helping bring this book
to life.
Thanks to Jeremy Hanna, for writing the Hadoop chapter, and for being so easy to
work with.
Thank you to my technical reviewers. Stu Hood’s insightful comments in particular
really improved the book. Robert Schneider and Gary Dusbabek contributed thought-
ful reviews.
Thank you to Jonathan Ellis for writing the foreword.
Thanks to my editor, Mike Loukides, for being a charming conversationalist at dinner
in San Francisco.
Thank you to Rain Fletcher for supporting and encouraging this book.
xxii | Preface
I’m inspired by the many terrific developers who have contributed to Cassandra. Hats
off for making such a pretty and powerful database.
As always, thank you to Alison Brown, who read drafts, gave me notes, and made sure
that I had time to work; this book would not have happened without you.
Preface | xxiii
CHAPTER 1
Introducing Cassandra
If at first the idea is not absurd,
then there is no hope for it.
—Albert Einstein
Welcome to Cassandra: The Definitive Guide. The aim of this book is to help developers
and database administrators understand this important new database, explore how it
compares to the relational database management systems we’re used to, and help you
put it to work in your own environment.
What’s Wrong with Relational Databases?
If I had asked people what they wanted, they
would have said faster horses.
—Henry Ford
I ask you to consider a certain model for data, invented by a small team at a company
with thousands of employees. It is accessible over a TCP/IP interface and is available
from a variety of languages, including Java and web services. This model was difficult
at first for all but the most advanced computer scientists to understand, until broader
adoption helped make the concepts clearer. Using the database built around this model
required learning new terms and thinking about data storage in a different way. But as
products sprang up around it, more businesses and government agencies put it to use,
in no small part because it was fast—capable of processing thousands of operations a
second. The revenue it generated was tremendous.
And then a new model came along.
The new model was threatening, chiefly for two reasons. First, the new model was very
different from the old model, which it pointedly controverted. It was threatening be-
cause it can be hard to understand something different and new. Ensuing debates can
help entrench people stubbornly further in their views—views that might have been
1
largely inherited from the climate in which they learned their craft and the circumstan-
ces in which they work. Second, and perhaps more importantly, as a barrier, the new
model was threatening because businesses had made considerable investments in the
old model and were making lots of money with it. Changing course seemed ridiculous,
even impossible.
Of course I’m talking about the Information Management System (IMS) hierarchical
database, invented in 1966 at IBM.
IMS was built for use in the Saturn V moon rocket. Its architect was Vern Watts, who
dedicated his career to it. Many of us are familiar with IBM’s database DB2. IBM’s
wildly popular DB2 database gets its name as the successor to DB1—the product built
around the hierarchical data model IMS. IMS was released in 1968, and subsequently
enjoyed success in Customer Information Control System (CICS) and other applica-
tions. It is still used today.
But in the years following the invention of IMS, the new model, the disruptive model,
the threatening model, was the relational database.
In his 1970 paper “A Relational Model of Data for Large Shared Data Banks,” Dr.
Edgar F. Codd, also at IBM, advanced his theory of the relational model for data while
working at IBM’s San Jose research laboratory. This paper, still available at http://www
.seas.upenn.edu/~zives/03f/cis550/codd.pdf, became the foundational work for rela-
tional database management systems.
Codd’s work was antithetical to the hierarchical structure of IMS. Understanding and
working with a relational database required learning new terms that must have sounded
very strange indeed to users of IMS. It presented certain advantages over its predecessor,
in part because giants are almost always standing on the shoulders of other giants.
While these ideas and their application have evolved in four decades, the relational
database still is clearly one of the most successful software applications in history. It’s
used in the form of Microsoft Access in sole proprietorships, and in giant multinational
corporations with clusters of hundreds of finely tuned instances representing multi-
terabyte data warehouses. Relational databases store invoices, customer records, prod-
uct catalogues, accounting ledgers, user authentication schemes—the very world, it
might appear. There is no question that the relational database is a key facet of the
modern technology and business landscape, and one that will be with us in its various
forms for many years to come, as will IMS in its various forms. The relational model
presented an alternative to IMS, and each has its uses.
So the short answer to the question, “What’s wrong with relational databases?” is
“Nothing.”
There is, however, a rather longer answer that I gently encourage you to consider. This
answer takes the long view, which says that every once in a while an idea is born that
ostensibly changes things, and engenders a revolution of sorts. And yet, in another way,
such revolutions, viewed structurally, are simply history’s business as usual. IMS,
2 | Chapter 1:Introducing Cassandra
RDBMS, NoSQL. The horse, the car, the plane. They each build on prior art, they each
attempt to solve certain problems, and so they’re each good at certain things—and less
good at others. They each coexist, even now.
So let’s examine for a moment why, at this point, we might consider an alternative to
the relational database, just as Codd himself four decades ago looked at the Information
Management System and thought that maybe it wasn’t the only legitimate way of or-
ganizing information and solving data problems, and that maybe, for certain problems,
it might prove fruitful to consider an alternative.
We encounter scalability problems when our relational applications become successful
and usage goes up. Joins are inherent in any relatively normalized relational database
of even modest size, and joins can be slow. The way that databases gain consistency is
typically through the use of transactions, which require locking some portion of the
database so it’s not available to other clients. This can become untenable under very
heavy loads, as the locks mean that competing users start queuing up, waiting for their
turn to read or write the data.
We typically address these problems in one or more of the following ways, sometimes
in this order:
Throw hardware at the problem by adding more memory, adding faster processors,
and upgrading disks. This is known as vertical scaling. This can relieve you for a
time.
When the problems arise again, the answer appears to be similar: now that one
box is maxed out, you add hardware in the form of additional boxes in a database
cluster. Now you have the problem of data replication and consistency during
regular usage and in failover scenarios. You didn’t have that problem before.
Now we need to update the configuration of the database management system.
This might mean optimizing the channels the database uses to write to the under-
lying filesystem. We turn off logging or journaling, which frequently is not a
desirable (or, depending on your situation, legal) option.
Having put what attention we could into the database system, we turn to our ap-
plication. We try to improve our indexes. We optimize the queries. But presumably
at this scale we weren’t wholly ignorant of index and query optimization, and
already had them in pretty good shape. So this becomes a painful process of picking
through the data access code to find any opportunities for fine tuning. This might
include reducing or reorganizing joins, throwing out resource-intensive features
such as XML processing within a stored procedure, and so forth. Of course, pre-
sumably we were doing that XML processing for a reason, so if we have to do it
somewhere, we move that problem to the application layer, hoping to solve it there
and crossing our fingers that we don’t break something else in the meantime.
What’s Wrong with Relational Databases? | 3
We employ a caching layer. For larger systems, this might include distributed
caches such as memcached, EHCache, Oracle Coherence, or other related prod-
ucts. Now we have a consistency problem between updates in the cache and
updates in the database, which is exacerbated over a cluster.
We turn our attention to the database again and decide that, now that the appli-
cation is built and we understand the primary query paths, we can duplicate some
of the data to make it look more like the queries that access it. This process, called
denormalization, is antithetical to the five normal forms that characterize the re-
lational model, and violate Codd’s 12 Commandments for relational data. We
remind ourselves that we live in this world, and not in some theoretical cloud, and
then undertake to do what we must to make the application start responding at
acceptable levels again, even if it’s no longer “pure.”
I imagine that this sounds familiar to you. At web scale, engineers have started to won-
der whether this situation isn’t similar to Henry Ford’s assertion that at a certain point,
it’s not simply a faster horse that you want. And they’ve done some impressive, inter-
esting work.
We must therefore begin here in recognition that the relational model is simply a
model. That is, it’s intended to be a useful way of looking at the world, applicable to
certain problems. It does not purport to be exhaustive, closing the case on all other
ways of representing data, never again to be examined, leaving no room for alternatives.
If we take the long view of history, Dr. Codd’s model was a rather disruptive one in its
time. It was new, with strange new vocabulary and terms such as “tuples”—familiar
words used in a new and different manner. The relational model was held up to sus-
picion, and doubtless suffered its vehement detractors. It encountered opposition even
in the form of Dr. Codd’s own employer, IBM, which had a very lucrative product set
around IMS and didn’t need a young upstart cutting into its pie.
But the relational model now arguably enjoys the best seat in the house within the data
world. SQL is widely supported and well understood. It is taught in introductory uni-
versity courses. There are free databases that come installed and ready to use with a
$4.95 monthly web hosting plan. Often the database we end up using is dictated to us
by architectural standards within our organization. Even absent such standards, it’s
prudent to learn whatever your organization already has for a database platform. Our
colleagues in development and infrastructure have considerable hard-won knowledge.
If by nothing more than osmosis—or inertia—we have learned over the years that a
relational database is a one-size-fits-all solution.
So perhaps the real question is not, “What’s wrong with relational databases?” but
rather, “What problem do you have?”
That is, you want to ensure that your solution matches the problem that you have.
There are certain problems that relational databases solve very well.
4 | Chapter 1:Introducing Cassandra
If massive, elastic scalability is not an issue for you, the trade-offs in relative complexity
of a system such as Cassandra may simply not be worth it. No proponent of Cassandra
that I know of is asking anyone to throw out everything they’ve learned about relational
databases, surrender their years of hard-won knowledge around such systems, and
unnecessarily jeopardize their employer’s carefully constructed systems in favor of the
flavor of the month.
Relational data has served all of us developers and DBAs well. But the explosion of the
Web, and in particular social networks, means a corresponding explosion in the sheer
volume of data we must deal with. When Tim Berners-Lee first worked on the Web in
the early 1990s, it was for the purpose of exchanging scientific documents between
PhDs at a physics laboratory. Now, of course, the Web has become so ubiquitous that
it’s used by everyone, from those same scientists to legions of five-year-olds exchanging
emoticons about kittens. That means in part that it must support enormous volumes
of data; the fact that it does stands as a monument to the ingenious architecture of the
Web.
But some of this infrastructure is starting to bend under the weight.
In 1966, a company like IBM was in a position to really make people listen to their
innovations. They had the problems, and they had the brain power to solve them.
As we enter the second decade of the 21st century, we’re starting to see similar inno-
vations, even from young companies such as Facebook and Twitter.
So perhaps the real question, then, is not “What problem do I have?” but rather, “What
kinds of things would I do with data if it wasn’t a problem?” What if you could easily
achieve fault tolerance, availability across multiple data centers, consistency that you
tune, and massive scalability even to the hundreds of terabytes, all from a client lan-
guage of your choosing? Perhaps, you say, you don’t need that kind of availability or
that level of scalability. And you know best. You’re certainly right, in fact, because if
your current database didn’t suit your current database needs, you’d have a nonfunc-
tioning system.
It is not my intention to convince you by clever argument to adopt a non-relational
database such as Apache Cassandra. It is only my intention to present what Cassandra
can do and how it does it so that you can make an informed decision and get started
working with it in practical ways if you find it applies. Only you know what your data
needs are. I do not ask you to reconsider your database—unless you’re miserable with
your current database, or you can’t scale how you need to already, or your data model
isn’t mapping to your application in a way that’s flexible enough for you. I don’t ask
you to consider your database, but rather to consider your organization, its dreams for
the future, and its emerging problems. Would you collect more information about your
business objects if you could?
Don’t ask how to make Cassandra fit into your existing environment. Ask what kinds
of data problems you’d like to have instead of the ones you have today. Ask what new
What’s Wrong with Relational Databases? | 5
kinds of data you would like. What understanding of your organization would you like
to have, if only you could enable it?
A Quick Review of Relational Databases
Though you are likely familiar with them, let’s briefly turn our attention to some of the
foundational concepts in relational databases. This will give us a basis on which to
consider more recent advances in thought around the trade-offs inherent in distributed
data systems, especially very large distributed data systems, such as those that are
required at web scale.
RDBMS: The Awesome and the Not-So-Much
There are many reasons that the relational database has become so overwhelmingly
popular over the last four decades. An important one is the Structured Query Language
(SQL), which is feature-rich and uses a simple, declarative syntax. SQL was first offi-
cially adopted as an ANSI standard in 1986; since that time it’s gone through several
revisions and has also been extended with vendor proprietary syntax such as Micro-
soft’s T-SQL and Oracle’s PL/SQL to provide additional implementation-specific
features.
SQL is powerful for a variety of reasons. It allows the user to represent complex rela-
tionships with the data, using statements that form the Data Manipulation Language
(DML) to insert, select, update, delete, truncate, and merge data. You can perform a
rich variety of operations using functions based on relational algebra to find a maximum
or minimum value in a set, for example, or to filter and order results. SQL statements
support grouping aggregate values and executing summary functions. SQL provides a
means of directly creating, altering, and dropping schema structures at runtime using
Data Definition Language (DDL). SQL also allows you to grant and revoke rights for
users and groups of users using the same syntax.
SQL is easy to use. The basic syntax can be learned quickly, and conceptually SQL and
RDBMS offer a low barrier to entry. Junior developers can become proficient readily,
and as is often the case in an industry beset by rapid changes, tight deadlines, and
exploding budgets, ease of use can be very important. And it’s not just the syntax that’s
easy to use; there are many robust tools that include intuitive graphical interfaces for
viewing and working with your database.
In part because it’s a standard, SQL allows you to easily integrate your RDBMS with a
wide variety of systems. All you need is a driver for your application language, and
you’re off to the races in a very portable way. If you decide to change your application
implementation language (or your RDBMS vendor), you can often do that painlessly,
assuming you haven’t backed yourself into a corner using lots of proprietary extensions.
6 | Chapter 1:Introducing Cassandra
Transactions, ACID-ity, and two-phase commit
In addition to the features mentioned already, RDBMS and SQL also support transac-
tions. A database transaction is, as Jim Gray puts it, “a transformation of state” that
has the ACID properties (see http://research.microsoft.com/en-us/um/people/gray/pa
pers/theTransactionConcept.pdf). A key feature of transactions is that they execute vir-
tually at first, allowing the programmer to undo (using ROLLBACK) any changes that
may have gone awry during execution; if all has gone well, the transaction can be reli-
ably committed. The debate about support for transactions comes up very quickly as
a sore spot in conversations around non-relational data stores, so let’s take a moment
to revisit what this really means.
ACID is an acronym for Atomic, Consistent, Isolated, Durable, which are the gauges
we can use to assess that a transaction has executed properly and that it was successful:
Atomic
Atomic means “all or nothing”; that is, when a statement is executed, every update within the
transaction must succeed in order to be called successful. There is no partial failure where one
update was successful and another related update failed. The common example here is with
monetary transfers at an ATM: the transfer requires subtracting money from one account and
adding it to another account. This operation cannot be subdivided; they must both succeed.
Consistent
Consistent means that data moves from one correct state to another correct state, with no
possibility that readers could view different values that don’t make sense together. For example,
if a transaction attempts to delete a Customer and her Order history, it cannot leave Order rows
that reference the deleted customer’s primary key; this is an inconsistent state that would cause
errors if someone tried to read those Order records.
Isolated
Isolated means that transactions executing concurrently will not become entangled with each
other; they each execute in their own space. That is, if two different transactions attempt to
modify the same data at the same time, then one of them will have to wait for the other to
complete.
Durable
Once a transaction has succeeded, the changes will not be lost. This doesn’t imply another
transaction won’t later modify the same data; it just means that writers can be confident that
the changes are available for the next transaction to work with as necessary.
On the surface, these properties seem so obviously desirable as to not even merit con-
versation. Presumably no one who runs a database would suggest that data updates
don’t have to endure for some length of time; that’s the very point of making updates—
that they’re there for others to read. However, a more subtle examination might lead
us to want to find a way to tune these properties a bit and control them slightly. There
is, as they say, no free lunch on the Internet, and once we see how we’re paying for our
transactions, we may start to wonder whether there’s an alternative.
Transactions become difficult under heavy load. When you first attempt to horizontally
scale a relational database, making it distributed, you must now account for distributed
A Quick Review of Relational Databases | 7
transactions, where the transaction isn’t simply operating inside a single table or a single
database, but is spread across multiple systems. In order to continue to honor the ACID
properties of transactions, you now need a transaction manager to orchestrate across
the multiple nodes.
In order to account for successful completion across multiple hosts, the idea of a two-
phase commit (sometimes referred to as “2PC”) is introduced. But then, because
two-phase commit locks all associate resources, it is useful only for operations that can
complete very quickly. Although it may often be the case that your distributed opera-
tions can complete in sub-second time, it is certainly not always the case. Some use
cases require coordination between multiple hosts that you may not control yourself.
Operations coordinating several different but related activities can take hours to
update.
Two-phase commit blocks; that is, clients (“competing consumers”) must wait for a
prior transaction to finish before they can access the blocked resource. The protocol
will wait for a node to respond, even if it has died. It’s possible to avoid waiting forever
in this event, because a timeout can be set that allows the transaction coordinator node
to decide that the node isn’t going to respond and that it should abort the transaction.
However, an infinite loop is still possible with 2PC; that’s because a node can send a
message to the transaction coordinator node agreeing that it’s OK for the coordinator
to commit the entire transaction. The node will then wait for the coordinator to send
a commit response (or a rollback response if, say, a different node can’t commit); if the
coordinator is down in this scenario, that node conceivably will wait forever.
So in order to account for these shortcomings in two-phase commit of distributed
transactions, the database world turned to the idea of compensation. Compensation,
often used in web services, means in simple terms that the operation is immediately
committed, and then in the event that some error is reported, a new operation is invoked
to restore proper state.
There are a few basic, well-known patterns for compensatory action that architects
frequently have to consider as an alternative to two-phase commit. These include writ-
ing off the transaction if it fails, deciding to discard erroneous transactions and
reconciling later. Another alternative is to retry failed operations later on notification.
In a reservation system or a stock sales ticker, these are not likely to meet your require-
ments. For other kinds of applications, such as billing or ticketing applications, this
can be acceptable.
Gregor Hohpe, a Google architect, wrote a wonderful and often-cited
blog entry called “Starbucks Does Not Use Two-Phase Commit.” It
shows in real-world terms how difficult it is to scale two-phase commit
and highlights some of the alternatives that are mentioned here. Check
it out at http://www.eaipatterns.com/ramblings/18_starbucks.html. It’s
an easy, fun, and enlightening read.
8 | Chapter 1:Introducing Cassandra
The problems that 2PC introduces for application developers include loss of availability
and higher latency during partial failures. Neither of these is desirable. So once you’ve
had the good fortune of being successful enough to necessitate scaling your database
past a single machine, you now have to figure out how to handle transactions across
multiple machines and still make the ACID properties apply. Whether you have 10 or
100 or 1,000 database machines, atomicity is still required in transactions as if you were
working on a single node. But it’s now a much, much bigger pill to swallow.
Schema
One often-lauded feature of relational database systems is the rich schemas they afford.
You can represent your domain objects in a relational model. A whole industry has
sprung up around (expensive) tools such as the CA ERWin Data Modeler to support
this effort. In order to create a properly normalized schema, however, you are forced
to create tables that don’t exist as business objects in your domain. For example, a
schema for a university database might require a Student table and a Course table. But
because of the “many-to-many” relationship here (one student can take many courses
at the same time, and one course has many students at the same time), you have to
create a join table. This pollutes a pristine data model, where we’d prefer to just have
students and courses. It also forces us to create more complex SQL statements to join
these tables together. The join statements, in turn, can be slow.
Again, in a system of modest size, this isn’t much of a problem. But complex queries
and multiple joins can become burdensomely slow once you have a large number of
rows in many tables to handle.
Finally, not all schemas map well to the relational model. One type of system that has
risen in popularity in the last decade is the complex event processing system, which
represents state changes in a very fast stream. It’s often useful to contextualize events
at runtime against other events that might be related in order to infer some conclusion
to support business decision making. Although event streams could be represented in
terms of a relational database, it is an uncomfortable stretch.
And if you’re an application developer, you’ll no doubt be familiar with the many
object-relational mapping (ORM) frameworks that have sprung up in recent years to
help ease the difficulty in mapping application objects to a relational model. Again, for
small systems, ORM can be a relief. But it also introduces new problems of its own,
such as extended memory requirements, and it often pollutes the application code with
increasingly unwieldy mapping code. Here’s an example of a Java method using
Hibernate to “ease the burden” of having to write the SQL code:
A Quick Review of Relational Databases | 9
@CollectionOfElements
@JoinTable(name="store_description",
joinColumns = @JoinColumn(name="store_code"))
@MapKey(columns={@Column(name="for_store",length=3)})
@Column(name="description")
private Map<String, String> getMap() {
return this.map;
}
//... etc.
Is it certain that we’ve done anything but move the problem here? Of course, with some
systems, such as those that make extensive use of document exchange, as with services
or XML-based applications, there are not always clear mappings to a relational data-
base. This exacerbates the problem.
Sharding and shared-nothing architecture
If you can’t split it, you can’t scale it.
—Randy Shoup, Distinguished Architect, eBay
Another way to attempt to scale a relational database is to introduce sharding to your
architecture. This has been used to good effect at large websites such as eBay, which
supports billions of SQL queries a day, and in other Web 2.0 applications. The idea
here is that you split the data so that instead of hosting all of it on a single server or
replicating all of the data on all of the servers in a cluster, you divide up portions of the
data horizontally and host them each separately.
For example, consider a large customer table in a relational database. The least dis-
ruptive thing (for the programming staff, anyway) is to vertically scale by adding CPU,
adding memory, and getting faster hard drives, but if you continue to be successful and
add more customers, at some point (perhaps into the tens of millions of rows), you’ll
likely have to start thinking about how you can add more machines. When you do so,
do you just copy the data so that all of the machines have it? Or do you instead divide
up that single customer table so that each database has only some of the records, with
their order preserved? Then, when clients execute queries, they put load only on the
machine that has the record they’re looking for, with no load on the other machines.
It seems clear that in order to shard, you need to find a good key by which to order
your records. For example, you could divide your customer records across 26 machines,
one for each letter of the alphabet, with each hosting only the records for customers
whose last names start with that particular letter. It’s likely this is not a good strategy,
however—there probably aren’t many last names that begin with “Q” or “Z,” so those
machines will sit idle while the “J,” “M,” and “S” machines spike. You could shard
according to something numeric, like phone number, “member since” date, or the
name of the customer’s state. It all depends on how your specific data is likely to be
distributed.
10 | Chapter 1:Introducing Cassandra
There are three basic strategies for determining shard structure:
Feature-based shard or functional segmentation
This is the approach taken by Randy Shoup, Distinguished Architect at eBay, who
in 2006 helped bring their architecture into maturity to support many billions of
queries per day. Using this strategy, the data is split not by dividing records in a
single table (as in the customer example discussed earlier), but rather by splitting
into separate databases the features that don’t overlap with each other very much.
For example, at eBay, the users are in one shard, and the items for sale are in
another. At Flixster, movie ratings are in one shard and comments are in another.
This approach depends on understanding your domain so that you can segment
data cleanly.
Key-based sharding
In this approach, you find a key in your data that will evenly distribute it across
shards. So instead of simply storing one letter of the alphabet for each server as in
the (naive and improper) earlier example, you use a one-way hash on a key data
element and distribute data across machines according to the hash. It is common
in this strategy to find time-based or numeric keys to hash on.
Lookup table
In this approach, one of the nodes in the cluster acts as a “yellow pages” directory
and looks up which node has the data you’re trying to access. This has two obvious
disadvantages. The first is that you’ll take a performance hit every time you have
to go through the lookup table as an additional hop. The second is that the lookup
table not only becomes a bottleneck, but a single point of failure.
To read about how they used data sharding strategies to improve per-
formance at Flixster, see http://lsvp.wordpress.com/2008/06/20.
Sharding can minimize contention depending on your strategy and allows you not just
to scale horizontally, but then to scale more precisely, as you can add power to the
particular shards that need it.
Sharding could be termed a kind of “shared-nothing” architecture that’s specific to
databases. A shared-nothing architecture is one in which there is no centralized (shared)
state, but each node in a distributed system is independent, so there is no client con-
tention for shared resources. The term was first coined by Michael Stonebraker at
University of California at Berkeley in his 1986 paper “The Case for Shared Nothing.”
Shared Nothing was more recently popularized by Google, which has written systems
such as its Bigtable database and its MapReduce implementation that do not share
state, and are therefore capable of near-infinite scaling. The Cassandra database is a
shared-nothing architecture, as it has no central controller and no notion of master/
slave; all of its nodes are the same.
A Quick Review of Relational Databases | 11
You can read the 1986 paper “The Case for Shared Nothing” online at
http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf. It’s only a few pa-
ges. If you take a look, you’ll see that many of the features of shared-
nothing distributed data architecture, such as ease of high availability
and the ability to scale to a very large number of machines, are the very
things that Cassandra excels at.
MongoDB also provides auto-sharding capabilities to manage failover and node bal-
ancing. That many nonrelational databases offer this automatically and out of the box
is very handy; creating and maintaining custom data shards by hand is a wicked prop-
osition. It’s good to understand sharding in terms of data architecture in general, but
especially in terms of Cassandra more specifically, as it can take an approach similar
to key-based sharding to distribute data across nodes, but does so automatically.
Summary
In summary, relational databases are very good at solving certain data storage problems,
but because of their focus, they also can create problems of their own when it’s time
to scale. Then, you often need to find a way to get rid of your joins, which means
denormalizing the data, which means maintaining multiple copies of data and seriously
disrupting your design, both in the database and in your application. Further, you
almost certainly need to find a way around distributed transactions, which will quickly
become a bottleneck. These compensatory actions are not directly supported in any
but the most expensive RDBMS. And even if you can write such a huge check, you still
need to carefully choose partitioning keys to the point where you can never entirely
ignore the limitation.
Perhaps more importantly, as we see some of the limitations of RDBMS and conse-
quently some of the strategies that architects have used to mitigate their scaling issues,
a picture slowly starts to emerge. It’s a picture that makes some NoSQL solutions seem
perhaps less radical and less scary than we may have thought at first, and more like a
natural expression and encapsulation of some of the work that was already being done
to manage very large databases.
Web Scale
An invention has to make sense in the world in which it
is finished, not the world in which it is started.
—Ray Kurzweil
Because of some of the inherent design decisions in RDBMS, it is not always as easy to
scale as some other, more recent possibilities that take the structure of the Web into
consideration. But it’s not only the structure of the Web we need to consider, but also
its phenomenal growth, because as more and more data becomes available, we need
12 | Chapter 1:Introducing Cassandra
architectures that allow our organizations to take advantage of this data in near-time
to support decision making and to offer new and more powerful features and
capabilities to our customers.
It has been said, though it is hard to verify, that the 17th-century English
poet John Milton had actually read every published book on the face of
the earth. Milton knew many languages (he was even learning Navajo
at the time of his death), and given that the total number of published
books at that time was in the thousands, this would have been possible.
The size of the world’s data stores have grown somewhat since then.
We all know the Web is growing. But let’s take a moment to consider some numbers
from the IDC research paper “The Expanding Digital Universe.” (The complete
paper is available at http://www.emc.com/collateral/analyst-reports/expanding-digital
-idc-white-paper.pdf.)
YouTube serves 100 million videos every day.
Chevron accumulates 2TB of data every day.
In 2006, the amount of data on the Internet was approximately 166 exabytes
(166EB). In 2010, that number reached nearly 1,000 exabytes. An exabyte is one
quintillion bytes, or 1.1 million terabytes. To put this statistic in perspective, 1EB
is roughly the equivalent of 50,000 years of DVD-quality video. 166EB is approx-
imately three million times the amount of information contained in all the books
ever written.
Wal-Mart’s database of customer transactions is reputed to have stored 110 tera-
bytes in 2000, recording tens of millions of transactions per day. By 2004, it had
grown to half a petabyte.
The movie Avatar required 1PB storage space, or the equivalent of a single MP3
song—if that MP3 were 32 years long (source: http://bit.ly/736XCz).
As of May 2010, Google was provisioning 100,000 Android phones every day, all
of which have Internet access as a foundational service.
In 1998, the number of email accounts was approximately 253 million. By 2010,
that number is closer to 2 billion.
As you can see, there is great variety to the kinds of data that need to be stored, pro-
cessed, and queried, and some variety to the businesses that use such data. Consider
not only customer data at familiar retailers or suppliers, and not only digital video
content, but also the required move to digital television and the explosive growth of
email, messaging, mobile phones, RFID, Voice Over IP (VoIP) usage, and more. We
now have Blu-ray players that stream movies and music. As we begin departing from
physical consumer media storage, the companies that provide that content—and the
third-party value-add businesses built around them—will require very scalable data
solutions. Consider too that as a typical business application developer or database
A Quick Review of Relational Databases | 13
administrator, we may be used to thinking of relational databases as the center of our
universe. You might then be surprised to learn that within corporations, around 80%
of data is unstructured.
Or perhaps you think the kind of scale afforded by NoSQL solutions such as Cassandra
don’t apply to you. And maybe they don’t. It’s very possible that you simply don’t have
a problem that Cassandra can help you with. But I’m not asking you to envision your
database and its data as they exist today and figure out ways to migrate to Cassandra.
That would be a very difficult exercise, with a payoff that might be hard to see. It’s
almost analytic that the database you have today is exactly the right one for your ap-
plication of today. But if you could incorporate a wider array of rich data sets to help
improve your applications, what kinds of qualities would you then be looking for in a
database? The question becomes what kind of application would you want to have if
durability, elastic scalability, vast storage, and blazing-fast writes weren’t a problem?
In a world now working at web scale and looking to the future, Apache Cassandra
might be one part of the answer.
The Cassandra Elevator Pitch
Hollywood screenwriters and software startups are often advised to have their “elevator
pitch” ready. This is a summary of exactly what their product is all about—concise,
clear, and brief enough to deliver in just a minute or two, in the lucky event that they
find themselves sharing an elevator with an executive or agent or investor who might
consider funding their project. Cassandra has a compelling story, so let's boil it down
to an elevator pitch that you can present to your manager or colleagues should the
occasion arise.
Cassandra in 50 Words or Less
“Apache Cassandra is an open source, distributed, decentralized, elastically scalable,
highly available, fault-tolerant, tuneably consistent, column-oriented database that
bases its distribution design on Amazon’s Dynamo and its data model on Google’s
Bigtable. Created at Facebook, it is now used at some of the most popular sites on the
Web.” That’s exactly 50 words.
Of course, if you were to recite that to your boss in the elevator, you'd probably get a
blank look in return. So let's break down the key points in the following sections.
Distributed and Decentralized
Cassandra is distributed, which means that it is capable of running on multiple
machines while appearing to users as a unified whole. In fact, there is little point in
running a single Cassandra node. Although you can do it, and that’s acceptable for
getting up to speed on how it works, you quickly realize that you’ll need multiple
14 | Chapter 1:Introducing Cassandra
machines to really realize any benefit from running Cassandra. Much of its design and
code base is specifically engineered toward not only making it work across many dif-
ferent machines, but also for optimizing performance across multiple data center racks,
and even for a single Cassandra cluster running across geographically dispersed data
centers. You can confidently write data to anywhere in the cluster and Cassandra will
get it.
Once you start to scale many other data stores (MySQL, Bigtable), some nodes need
to be set up as masters in order to organize other nodes, which are set up as slaves.
Cassandra, however, is decentralized, meaning that every node is identical; no Cas-
sandra node performs certain organizing operations distinct from any other node.
Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and
keep in sync a list of nodes that are alive or dead.
The fact that Cassandra is decentralized means that there is no single point of failure.
All of the nodes in a Cassandra cluster function exactly the same. This is sometimes
referred to as “server symmetry.” Because they are all doing the same thing, by defini-
tion there can’t be a special host that is coordinating activities, as with the master/slave
setup that you see in MySQL, Bigtable, and so many others.
In many distributed data solutions (such as RDBMS clusters), you set up multiple cop-
ies of data on different servers in a process called replication, which copies the data to
multiple machines so that they can all serve simultaneous requests and improve per-
formance. Typically this process is not decentralized, as in Cassandra, but is rather
performed by defining a master/slave relationship. That is, all of the servers in this kind
of cluster don’t function in the same way. You configure your cluster by designating
one server as the master and others as slaves. The master acts as the authoritative
source of the data, and operates in a unidirectional relationship with the slave nodes,
which must synchronize their copies. If the master node fails, the whole database is in
jeopardy. The decentralized design is therefore one of the keys to Cassandra’s high
availability. Note that while we frequently understand master/slave replication in the
RDBMS world, there are NoSQL databases such as MongoDB that follow the master/
slave scheme as well.
Decentralization, therefore, has two key advantages: it’s simpler to use than master/
slave, and it helps you avoid outages. It can be easier to operate and maintain a decen-
tralized store than a master/slave store because all nodes are the same. That means that
you don’t need any special knowledge to scale; setting up 50 nodes isn’t much different
from setting up one. There’s next to no configuration required to support it. Moreover,
in a master/slave setup, the master can become a single point of failure (SPOF). To
avoid this, you often need to add some complexity to the environment in the form of
multiple masters. Because all of the replicas in Cassandra are identical, failures of a
node won’t disrupt service.
In short, because Cassandra is distributed and decentralized, there is no single point
of failure, which supports high availability.
The Cassandra Elevator Pitch | 15
Elastic Scalability
Scalability is an architectural feature of a system that can continue serving a greater
number of requests with little degradation in performance. Vertical scaling—simply
adding more hardware capacity and memory to your existing machine—is the easiest
way to achieve this. Horizontal scaling means adding more machines that have all or
some of the data on them so that no one machine has to bear the entire burden of
serving requests. But then the software itself must have an internal mechanism for
keeping its data in sync with the other nodes in the cluster.
Elastic scalability refers to a special property of horizontal scalability. It means that
your cluster can seamlessly scale up and scale back down. To do this, the cluster must
be able to accept new nodes that can begin participating by getting a copy of some or
all of the data and start serving new user requests without major disruption or recon-
figuration of the entire cluster. You don’t have to restart your process. You don’t have
to change your application queries. You don’t have to manually rebalance the data
yourself. Just add another machine—Cassandra will find it and start sending it work.
Scaling down, of course, means removing some of the processing capacity from your
cluster. You might have to do this if you move parts of your application to another
platform, or if your application loses users and you need to start selling off hardware.
Let’s hope that doesn’t happen. But if it does, you won’t need to upset the entire apple
cart to scale back.
High Availability and Fault Tolerance
In general architecture terms, the availability of a system is measured according to its
ability to fulfill requests. But computers can experience all manner of failure, from
hardware component failure to network disruption to corruption. Any computer is
susceptible to these kinds of failure. There are of course very sophisticated (and often
prohibitively expensive) computers that can themselves mitigate many of these cir-
cumstances, as they include internal hardware redundancies and facilities to send
notification of failure events and hot swap components. But anyone can accidentally
break an Ethernet cable, and catastrophic events can beset a single data center. So for
a system to be highly available, it must typically include multiple networked computers,
and the software they’re running must then be capable of operating in a cluster and
have some facility for recognizing node failures and failing over requests to another part
of the system.
Cassandra is highly available. You can replace failed nodes in the cluster with no
downtime, and you can replicate data to multiple data centers to offer improved local
performance and prevent downtime if one data center experiences a catastrophe such
as fire or flood.
16 | Chapter 1:Introducing Cassandra
Tuneable Consistency
Consistency essentially means that a read always returns the most recently written
value. Consider two customers are attempting to put the same item into their shopping
carts on an ecommerce site. If I place the last item in stock into my cart an instant after
you do, you should get the item added to your cart, and I should be informed that the
item is no longer available for purchase. This is guaranteed to happen when the state
of a write is consistent among all nodes that have that data.
But there’s no free lunch, and as we’ll see later, scaling data stores means making certain
trade-offs between data consistency, node availability, and partition tolerance. Cas-
sandra is frequently called “eventually consistent,” which is a bit misleading. Out of
the box, Cassandra trades some consistency in order to achieve total availability. But
Cassandra is more accurately termed “tuneably consistent,” which means it allows
you to easily decide the level of consistency you require, in balance with the level of
availability.
Let’s take a moment to unpack this, as the term “eventual consistency” has caused
some uproar in the industry. Some practitioners hesitate to use a system that is descri-
bed as “eventually consistent.”
For detractors of eventual consistency, the broad argument goes something like this:
eventual consistency is maybe OK for social web applications where data doesn’t
really matter. After all, you’re just posting to mom what little Billy ate for breakfast,
and if it gets lost, it doesn’t really matter. But the data I have is actually really
important, and it’s ridiculous to think that I could allow eventual consistency in my
model.
Set aside the fact that all of the most popular web applications (Amazon, Facebook,
Google, Twitter) are using this model, and that perhaps there’s something to it. Pre-
sumably such data is very important indeed to the companies running these
applications, because that data is their primary product, and they are multibillion-
dollar companies with billions of users to satisfy in a sharply competitive world. It may
be possible to gain guaranteed, immediate, and perfect consistency throughout a highly
trafficked system running in parallel on a variety of networks, but if you want clients
to get their results sometime this year, it’s a very tricky proposition.
The detractors claim that some Big Data databases such as Cassandra have merely
eventual consistency, and that all other distributed systems have strict consistency. As
with so many things in the world, however, the reality is not so black and white, and
the binary opposition between consistent and not-consistent is not truly reflected in
practice. There are instead degrees of consistency, and in the real world they are very
susceptible to external circumstance.
Eventual consistency is one of several consistency models available to architects. Let’s
take a look at these models so we can understand the trade-offs:
The Cassandra Elevator Pitch | 17
Strict consistency
This is sometimes called sequential consistency, and is the most stringent level of
consistency. It requires that any read will always return the most recently written
value. That sounds perfect, and it’s exactly what I’m looking for. I’ll take it! How-
ever, upon closer examination, what do we find? What precisely is meant by “most
recently written”? Most recently to whom? In one single-processor machine, this
is no problem to observe, as the sequence of operations is known to the one clock.
But in a system executing across a variety of geographically dispersed data centers,
it becomes much more slippery. Achieving this implies some sort of global clock
that is capable of timestamping all operations, regardless of the location of the data
or the user requesting it or how many (possibly disparate) services are required to
determine the response.
Causal consistency
This is a slightly weaker form of strict consistency. It does away with the fantasy
of the single global clock that can magically synchronize all operations without
creating an unbearable bottleneck. Instead of relying on timestamps, causal con-
sistency instead takes a more semantic approach, attempting to determine the
cause of events to create some consistency in their order. It means that writes that
are potentially related must be read in sequence. If two different, unrelated oper-
ations suddenly write to the same field, then those writes are inferred not to be
causally related. But if one write occurs after another, we might infer that they
are causally related. Causal consistency dictates that causal writes must be read in
sequence.
Weak (eventual) consistency
Eventual consistency means on the surface that all updates will propagate through-
out all of the replicas in a distributed system, but that this may take some time.
Eventually, all replicas will be consistent.
Eventual consistency becomes suddenly very attractive when you consider what is re-
quired to achieve stronger forms of consistency.
When considering consistency, availability, and partition tolerance, we can achieve
only two of these goals in a given distributed system (we explore the CAP Theorem in
the section “Brewer’s CAP Theorem” on page 19). At the center of the problem is
data update replication. To achieve a strict consistency, all update operations will be
performed synchronously, meaning that they must block, locking all replicas until the
operation is complete, and forcing competing clients to wait. A side effect of such a
design is that during a failure, some of the data will be entirely unavailable. As
Amazon CTO Werner Vogels puts it, “rather than dealing with the uncertainty of the
correctness of an answer, the data is made unavailable until it is absolutely certain that
it is correct” ("Dynamo: Amazon’s Highly Distributed Key-Value Store”: [http://www
.allthingsdistributed.com/2007/10/amazons_dynamo.html], 207).
18 | Chapter 1:Introducing Cassandra
We could alternatively take an optimistic approach to replication, propagating updates
to all replicas in the background in order to avoid blowing up on the client. The diffi-
culty this approach presents is that now we are forced into the situation of detecting
and resolving conflicts. A design approach must decide whether to resolve these con-
flicts at one of two possible times: during reads or during writes. That is, a distributed
database designer must choose to make the system either always readable or always
writable.
Dynamo and Cassandra choose to be always writable, opting to defer the complexity
of reconciliation to read operations, and realize tremendous performance gains. The
alternative is to reject updates amidst network and server failures.
In Cassandra, consistency is not an all-or-nothing proposition, so we might more ac-
curately term it “tuneable consistency” because the client can control the number of
replicas to block on for all updates. This is done by setting the consistency level against
the replication factor.
The replication factor lets you decide how much you want to pay in performance to
gain more consistency. You set the replication factor to the number of nodes in the
cluster you want the updates to propagate to (remember that an update means any
add, update, or delete operation).
The consistency level is a setting that clients must specify on every operation and that
allows you to decide how many replicas in the cluster must acknowledge a write op-
eration or respond to a read operation in order to be considered successful. That’s the
part where Cassandra has pushed the decision for determining consistency out to the
client.
So if you like, you could set the consistency level to a number equal to the replication
factor, and gain stronger consistency at the cost of synchronous blocking operations
that wait for all nodes to be updated and declare success before returning. This is not
often done in practice with Cassandra, however, for reasons that should be clear (it
defeats the availability goal, would impact performance, and generally goes against the
grain of why you’d want to use Cassandra in the first place). So if the client sets
the consistency level to a value less than the replication factor, the update is considered
successful even if some nodes are down.
Brewer’s CAP Theorem
In order to understand Cassandra’s design and its label as an “eventually consistent”
database, we need to understand the CAP theorem. The CAP theorem is sometimes
called Brewer’s theorem after its author, Eric Brewer.
While working at University of California at Berkeley, Eric Brewer posited his CAP
theorem in 2000 at the ACM Symposium on the Principles of Distributed Computing.
The theorem states that within a large-scale distributed data system, there are three
The Cassandra Elevator Pitch | 19
requirements that have a relationship of sliding dependency: Consistency, Availability,
and Partition Tolerance.
Consistency
All database clients will read the same value for the same query, even given con-
current updates.
Availability
All database clients will always be able to read and write data.
Partition Tolerance
The database can be split into multiple machines; it can continue functioning in
the face of network segmentation breaks.
Brewer’s theorem is that in any given system, you can strongly support only two of the
three. This is analogous to the saying you may have heard in software development:
“You can have it good, you can have it fast, you can have it cheap: pick two.”
We have to choose between them because of this sliding mutual dependency. The more
consistency you demand from your system, for example, the less partition-tolerant
you’re likely to be able to make it, unless you make some concessions around
availability.
The CAP theorem was formally proved to be true by Seth Gilbert and Nancy Lynch of
MIT in 2002. In distributed systems, however, it is very likely that you will have network
partitioning, and that at some point, machines will fail and cause others to become
unreachable. Packet loss, too, is nearly inevitable. This leads us to the conclusion that
a distributed system must do its best to continue operating in the face of network
partitions (to be Partition-Tolerant), leaving us with only two real options to choose
from: Availability and Consistency.
Figure 1-1 illustrates visually that there is no overlapping segment where all three are
obtainable.
Figure 1-1. CAP Theorem indicates that you can realize only two of these properties at once
20 | Chapter 1:Introducing Cassandra
It might prove useful at this point to see a graphical depiction of where each of the
nonrelational data stores we’ll look at falls within the CAP spectrum. The graphic in
Figure 1-2 was inspired by a slide in a 2009 talk given by Dwight Merriman, CEO and
founder of MongoDB, to the MySQL User Group in New York City (you can watch it
online at http://bit.ly/7r6kRg). However, I have modified the placement of some systems
based on my research.
Figure 1-2 shows the general focus of some of the different databases we discuss in this
chapter. Note that placement of the databases in this chart could change based on
configuration. As Stu Hood points out, a distributed MySQL database can count as a
consistent system only if you’re using Google’s synchronous replication patches; oth-
erwise, it can only be Available and Partition-Tolerant (AP).
It’s interesting to note that the design of the system around CAP placement is inde-
pendent of the orientation of the data storage mechanism; for example, the CP edge is
populated by graph databases and document-oriented databases alike.
Figure 1-2. Where different databases appear on the CAP continuum
In this depiction, relational databases are on the line between Consistency and Avail-
ability, which means that they can fail in the event of a network failure (including a
cable breaking). This is typically achieved by defining a single master server, which
could itself go down, or an array of servers that simply don’t have sufficient mechanisms
built in to continue functioning in the case of network partitions.
Graph databases such as Neo4J and the set of databases derived at least in part from
the design of Google’s Bigtable database (such as MongoDB, HBase, Hypertable, and
Redis) all are focused slightly less on Availability and more on ensuring Consistency
and Partition Tolerance.
The Cassandra Elevator Pitch | 21
If you’re interested in the properties of other Big Data or NoSQL data-
bases, see this book’s Appendix.
Finally, the databases derived from Amazon’s Dynamo design include Cassandra,
Project Voldemort, CouchDB, and Riak. These are more focused on Availability and
Partition-Tolerance. However, this does not mean that they dismiss Consistency as
unimportant, any more than Bigtable dismisses Availability. According to the Bigtable
paper, the average percentage of server hours that “some data” was unavailable is
0.0047% (section 4), so this is relative, as we’re talking about very robust systems
already. If you think of each of these letters (C, A, P) as knobs you can tune to arrive
at the system you want, Dynamo derivatives are intended for employment in the many
use cases where “eventual consistency” is tolerable and where “eventual” is a matter
of milliseconds, read repairs mean that reads will return consistent values, and you can
achieve strong consistency if you want to.
So what does it mean in practical terms to support only two of the three facets of CAP?
CA
To primarily support Consistency and Availability means that you’re likely using
two-phase commit for distributed transactions. It means that the system will block
when a network partition occurs, so it may be that your system is limited to a single
data center cluster in an attempt to mitigate this. If your application needs only
this level of scale, this is easy to manage and allows you to rely on familiar, simple
structures.
CP
To primarily support Consistency and Partition Tolerance, you may try to
advance your architecture by setting up data shards in order to scale. Your data
will be consistent, but you still run the risk of some data becoming unavailable if
nodes fail.
AP
To primarily support Availability and Partition Tolerance, your system may return
inaccurate data, but the system will always be available, even in the face of network
partitioning. DNS is perhaps the most popular example of a system that is mas-
sively scalable, highly available, and partition-tolerant.
22 | Chapter 1:Introducing Cassandra
Note that this depiction is intended to offer an overview that helps draw
distinctions between the broader contours in these systems; it is not
strictly precise. For example, it’s not entirely clear where Google’s
Bigtable should be placed on such a continuum. The Google paper de-
scribes Bigtable as “highly available,” but later goes on to say that if
Chubby (the Bigtable persistent lock service) “becomes unavailable for
an extended period of time [caused by Chubby outages or network is-
sues], Bigtable becomes unavailable” (section 4). On the matter of data
reads, the paper says that “we do not consider the possibility of multiple
copies of the same data, possibly in alternate forms due to views or
indices.” Finally, the paper indicates that “centralized control and By-
zantine fault tolerance are not Bigtable goals” (section 10). Given such
variable information, you can see that determining where a database
falls on this sliding scale is not an exact science.
Row-Oriented
Cassandra is frequently referred to as a “column-oriented” database, which is not in-
correct. It’s not relational, and it does represent its data structures in sparse
multidimensional hashtables. “Sparse” means that for any given row you can have one
or more columns, but each row doesn’t need to have all the same columns as other
rows like it (as in a relational model). Each row has a unique key, which makes its data
accessible. So although it’s not wrong to say that Cassandra is columnar or column-
oriented, it might be more helpful to think of it as an indexed, row-oriented store, as
we examine more thoroughly in Chapter 3. I list the data orientation as a feature, be-
cause there are several data models that are easy to visualize and use in a nonrelational
model; it’s a weird mixture of laziness and possibly inviting far more work than nec-
essary to just assume that the relational model is always best, regardless of your
application.
Cassandra stores data in what can be thought of for now as a multidimensional hash
table. That means you don’t have to decide ahead of time precisely what your data
structure must look like, or what fields your records will need. This can be useful if
you’re in startup mode and are adding or changing features with some frequency. It is
also attractive if you need to support an Agile development methodology and aren’t
free to take months for up-front analysis. If your business changes and you later need
to add or remove new fields on the fly without disrupting service, go ahead; Cassandra
lets you.
That’s not to say that you don’t have to think about your data, though. On the contrary,
Cassandra requires a shift in how you think about it. Instead of designing a pristine
data model and then designing queries around the model as in RDBMS, you are free
to think of your queries first, and then provide the data that answers them.
The Cassandra Elevator Pitch | 23
Schema-Free
Cassandra requires you to define an outer container, called a keyspace, that contains
column families. The keyspace is essentially just a logical namespace to hold column
families and certain configuration properties. The column families are names for asso-
ciated data and a sort order. Beyond that, the data tables are sparse, so you can just
start adding data to it, using the columns that you want; there’s no need to define your
columns ahead of time. Instead of modeling data up front using expensive data mod-
eling tools and then writing queries with complex join statements, Cassandra asks you
to model the queries you want, and then provide the data around them.
High Performance
Cassandra was designed specifically from the ground up to take full advantage of
multiprocessor/multicore machines, and to run across many dozens of these machines
housed in multiple data centers. It scales consistently and seamlessly to hundreds of
terabytes. Cassandra has been shown to perform exceptionally well under heavy load.
It consistently can show very fast throughput for writes per second on a basic com-
modity workstation. As you add more servers, you can maintain all of Cassandra’s
desirable properties without sacrificing performance.
Where Did Cassandra Come From?
The Cassandra data store is an open source Apache project available at http://cassandra
.apache.org. Cassandra originated at Facebook in 2007 to solve that company’s inbox
search problem, in which they had to deal with large volumes of data in a way that was
difficult to scale with traditional methods. Specifically, the team had requirements to
handle huge volumes of data in the form of message copies, reverse indices of messages,
and many random reads and many simultaneous random writes.
The team was led by Jeff Hammerbacher, with Avinash Lakshman, Karthik Rangana-
than, and Facebook engineer on the Search Team Prashant Malik as key engineers.
The code was released as an open source Google Code project in July 2008. During its
tenure as a Google Code project in 2008, the code was updateable only by Facebook
engineers, and little community was built around it as a result. So in March 2009 it was
moved to an Apache Incubator project, and on February 17, 2010 it was voted into a
top-level project.
A central paper on Cassandra by Facebook’s Lakshman and Malik
called “A Decentralized Structured Storage System” is available at: http:
//www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009
.pdf.
24 | Chapter 1:Introducing Cassandra
Cassandra today presents a kind of paradox: it feels new and radical, and yet it’s solidly
rooted in many standard, traditional computer science concepts and maxims that suc-
cessful predecessors have already institutionalized. Cassandra is a realist’s kind of
database; it doesn’t depart from the relational model to be a fun art project or experi-
ment for smart developers. It was created specifically to solve a real-world problem that
existing tools weren’t able to solve. It acknowledges the limitations of prior methods
and faces our new world of big data head-on.
How Did Cassandra Get Its Name?
I’m a little surprised how often people ask me where the database got its name. It’s not
the first thing I think of when I hear about a project. But it is interesting, and in the case
of this database, it’s felicitously meaningful.
In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of
Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the
future. But when she refused his amorous advances, he cursed her such that she would
still be able to accurately predict everything that would happen—but no one would
believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless
to stop it. The Cassandra distributed database is named for her. I speculate that it is
also named as kind of a joke on the Oracle at Delphi, another seer for whom a database
is named.
Use Cases for Cassandra
We have now unpacked the elevator pitch and have an understanding of Cassandra’s
advantages. Despite Cassandra’s sophisticated design and smart features, it is not the
right tool for every job. So in this section let’s take a quick look at what kind of projects
Cassandra is a good fit for.
Large Deployments
You probably don’t drive a semi truck to pick up your dry cleaning; semis aren’t well
suited for that sort of task. Lots of careful engineering has gone into Cassandra’s high
availability, tuneable consistency, peer-to-peer protocol, and seamless scaling, which
are its main selling points. None of these qualities is even meaningful in a single-node
deployment, let alone allowed to realize its full potential.
There are, however, a wide variety of situations where a single-node relational database
is all we may need. So do some measuring. Consider your expected traffic, throughput
needs, and SLAs. There are no hard and fast rules here, but if you expect that you can
reliably serve traffic with an acceptable level of performance with just a few relational
databases, it might be a better choice to do so, simply because RDBMS are easier to
run on a single machine and are more familiar.
Use Cases for Cassandra | 25
If you think you’ll need at least several nodes to support your efforts, however, Cas-
sandra might be a good fit. If your application is expected to require dozens of nodes,
Cassandra might be a great fit.
Lots of Writes, Statistics, and Analysis
Consider your application from the perspective of the ratio of reads to writes. Cassandra
is optimized for excellent throughput on writes.
Many of the early production deployments of Cassandra involve storing user activity
updates, social network usage, recommendations/reviews, and application statistics.
These are strong use cases for Cassandra because they involve lots of writing with less
predictable read operations, and because updates can occur unevenly with sudden
spikes. In fact, the ability to handle application workloads that require high perform-
ance at significant write volumes with many concurrent client threads is one of the
primary features of Cassandra.
According to the project wiki, Cassandra has been used to create a variety of applica-
tions, including a windowed time-series store, an inverted index for document
searching, and a distributed job priority queue.
Geographical Distribution
Cassandra has out-of-the-box support for geographical distribution of data. You can
easily configure Cassandra to replicate data across multiple data centers. If you have a
globally deployed application that could see a performance benefit from putting the
data near the user, Cassandra could be a great fit.
Evolving Applications
If your application is evolving rapidly and you’re in “startup mode,” Cassandra might
be a good fit given its schema-free data model. This makes it easy to keep your
database in step with application changes as you rapidly deploy.
Who Is Using Cassandra?
Cassandra is still in its early stages in many ways, not yet seeing its 1.0 release at the
time of this writing. There are few easy, graphical tools to help manage it, and the
community has not settled on certain key internal and external design questions that
have been revisited. But what does it say about the promise, usefulness, and stability
of a data store that even in its early stages is being used in production by many large,
well-known companies?
26 | Chapter 1:Introducing Cassandra
It is a logical fallacy, informally called the Bandwagon Fallacy, to argue
that just because something is growing in popularity means that it is
“true.” Cassandra is without a doubt enjoying skyrocketing growth in
popularity, especially over the past year or so. Still, my point here is
that the many successful production deployments at a variety of com-
panies for a variety of purposes is sufficient to suggest its usefulness and
readiness.
The list of companies using Cassandra is growing. These companies include:
Twitter is using Cassandra for analytics. In a much-publicized blog post (at http://
engineering.twitter.com/2010/07/cassandra-at-twitter-today.html), Twitter’s pri-
mary Cassandra engineer, Ryan King, explained that Twitter had decided against
using Cassandra as its primary store for tweets, as originally planned, but would
instead use it in production for several different things: for real-time analytics, for
geolocation and places of interest data, and for data mining over the entire user
store.
Mahalo uses it for its primary near-time data store.
Facebook still uses it for inbox search, though they are using a proprietary fork.
Digg uses it for its primary near-time data store.
Rackspace uses it for its cloud service, monitoring, and logging.
Reddit uses it as a persistent cache.
Cloudkick uses it for monitoring statistics and analytics.
Ooyala uses it to store and serve near real-time video analytics data.
SimpleGeo uses it as the main data store for its real-time location infrastructure.
Onespot uses it for a subset of its main data store.
Cassandra is also being used by Cisco and Platform64, and is starting to see use at
Comcast and bee.tv for personalized television streaming to the Web and to mobile
devices. There are others. The bottom line is that the uses are real. A wide variety of
companies are finding use cases for Cassandra and seeing success with it. As of this
writing, the largest known Cassandra installation is at Facebook, where they have more
than 150TB of data on more than 100 machines.
Many more companies are currently evaluating Cassandra for production use in dif-
ferent projects, and a services company called Riptano, cofounded by Jonathan Ellis,
the Apache Project Chair for Cassandra, was started in April of 2010. As more features
are added and better tooling and support options are rolled out, anticipate even broader
adoption.
Who Is Using Cassandra? | 27
Summary
In this chapter, we’ve taken an introductory look at Cassandra’s defining characteris-
tics, history, and major features. We have seen which major companies are using it
and what they’re using it for. We also examined a bit of history of the evolution of
important contributions to the database field in order to gain a historical view of Cas-
sandra’s value proposition.
28 | Chapter 1:Introducing Cassandra
CHAPTER 2
Installing Cassandra
For those among us who like instant gratification, we’ll start by installing Cassandra.
Because Cassandra introduces a lot of new vocabulary, there might be some unfamiliar
terms as we walk through this. That’s OK; the idea here is to get set up quickly in a
simple configuration to make sure everything is running properly. This will serve as an
orientation. Then, we’ll take a step back and understand Cassandra in its larger context.
Installing the Binary
Cassandra is available for download from the Web at http://cassandra.apache.org. Just
click the link on the home page to download the latest release version as a gzipped
tarball. The prebuilt binary is named apache-cassandra-x.x.x-bin.tar.gz, where x.x.x
represents the version number. The download is around 10MB.
Extracting the Download
The simplest way to get started is to download the prebuilt binary. You can unpack the
compressed file using any regular ZIP utility. On Linux, GZip extraction utilities should
be preinstalled; on Windows, you’ll need to get a program such as WinZip, which is
commercial, or something like 7-Zip, which is freeware. You can download the freeware
program 7-Zip from http://www.7-zip.org.
Open your extracting program. You might have to extract the ZIP file and the TAR file
in separate steps. Once you have a folder on your filesystem called apache-cassandra-
x.x.x, you’re ready to run Cassandra.
What’s In There?
Once you decompress the tarball, you’ll see that the Cassandra binary distribution
includes several directories. Let’s take a moment to look around and see what we have.
29
bin
This directory contains the executables to run Cassandra and the command-line
interface (CLI) client. It also has scripts to run the nodetool, which is a utility for
inspecting a cluster to determine whether it is properly configured, and to perform
a variety of maintenance operations. We look at nodetool in depth later. It also has
scripts for converting SSTables (the datafiles) to JSON and back.
conf
This directory, which is present in the source version at this location under the
package root, contains the files for configuring your Cassandra instance. There are
three basic functions: the storage-conf.xml file allows you to create your data store
by configuring your keyspace and column families; there are files related to setting
up authentication; and finally, the log4j properties let you change the logging levels
to suit your needs. We see how to use all of these when we discuss configuration
in Chapter 6.
interface
For versions 0.6 and earlier, this directory contains a single file, called
cassandra.thrift. This file represents the Remote Procedure Call (RPC) client API
that Cassandra makes available. The interface is defined using the Thrift syntax
and provides an easy means to generate clients. For a quick way to see all of the
operations that Cassandra supports, open this file in a regular text editor. You can
see that Cassandra supports clients for Java, C++, PHP, Ruby, Python, Perl, and
C# through this interface.
javadoc
This directory contains a documentation website generated using Java’s JavaDoc
tool. Note that JavaDoc reflects only the comments that are stored directly in the
Java code, and as such does not represent comprehensive documentation. It’s
helpful if you want to see how the code is laid out. Moreover, Cassandra is a
wonderful project, but the code contains precious few comments, so you might
find the JavaDoc’s usefulness limited. It may be more fruitful to simply read the
class files directly if you’re familiar with Java. Nonetheless, to read the JavaDoc,
open the javadoc/index.html file in a browser.
lib
This directory contains all of the external libraries that Cassandra needs to run.
For example, it uses two different JSON serialization libraries, the Google collec-
tions project, and several Apache Commons libraries. This directory includes the
Thrift and Avro RPC libraries for interacting with Cassandra.
Building from Source
Cassandra uses Apache Ant for its build scripting language and the Ivy plug-in for
dependency management.
30 | Chapter 2:Installing Cassandra
You can download Ant from http://ant.apache.org. You don’t need to
download Ivy separately just to build Cassandra.
Ivy requires Ant, and building from source requires the complete JDK, version 1.6.0_20
or better, not just the JRE. If you see a message about how Ant is missing tools.jar,
either you don’t have the full JDK or you’re pointing to the wrong path in your envi-
ronment variables.
If you want to download the most cutting-edge builds, you can get the
source from Hudson, which the Cassandra project uses as its Continu-
ous Integration tool. See http://hudson.zones.apache.org/hudson/job/Cas
sandra/ for the latest builds and test coverage information.
If you are a Git fan, you can get a read-only trunk version of the Cassandra source using
this command:
>git clone git://git.apache.org/cassandra.git
Git is a source code management system created by Linus Torvalds to
manage development of the Linux kernel. It’s increasingly popular and
is used by projects such as Android, Fedora, Ruby on Rails, Perl, and
many Cassandra clients (as we’ll see in Chapter 8). If you’re on a Linux
distribution such as Ubuntu, it couldn’t be easier to get Git. At a
console, just type >apt-get install git and it will be installed and ready
for commands. For more information, visit http://git-scm.com/.
Because Ivy takes care of all the dependencies, it’s easy to build Cassandra once you
have the source. Just make sure you’re in the root directory of your source download
and execute the ant program, which will look for a file called build.xml in the current
directory and execute the default build target. Ant and Ivy take care of the rest. To
execute the Ant program and start compiling the source, just type:
>ant
That’s it. Ivy will retrieve all of the necessary dependencies, and Ant will build the nearly
350 source files and execute the tests. If all went well, you should see a BUILD SUCCESS
FUL message. If all did not go well, make sure that your path settings are all correct,
that you have the most recent versions of the required programs, and that you down-
loaded a stable Cassandra build. You can check the Hudson report to make sure that
the source you downloaded actually can compile.
Building from Source | 31
If you want to see detailed information on what is happening during the
build, you can pass Ant the -v option to cause it to output verbose details
regarding each operation it performs.
Additional Build Targets
To compile the server, you can simply execute ant as shown previously. But there are
a couple of other targets in the build file that you might be interested in:
test
Users will probably find this the most helpful, as it executes the battery of unit
tests. You can also check out the unit test sources themselves for some useful
examples of how to interact with Cassandra.
gen-thrift-java
This target generates the Apache Thrift client interface for interacting with the
database in Java.
gen-thrift-py
This target generates the Thrift client interface for Python users.
build-jar
To create a Java Archive (JAR) file for distribution, execute the command >ant
jar. This will perform a complete build and output a file into the build directory
called apache-cassandra-x.x.x.jar.
Building with Maven
The original authors of Cassandra apparently didn’t care much for Maven, so the early
releases did not include any Maven POM file. But because so many Java developers
have begun to favor Maven over Ant, and the tooling support in IDEs for Maven has
become so strong, there’s a pom.xml contribution to the project so you can build from
Maven if you prefer.
To build the source from Maven, navigate to <cassandra-home>/contrib/maven and
execute this command:
$ mvn clean install
If you have any difficulties building with Maven, you may have to get some of the
required JARs manually. As of version 0.6.3, the Maven POM doesn’t work out of
the box because some dependencies, such as the libthrift.jar file, are unavailable in a
repository.
Few developers are using Maven with Cassandra, so Maven lacks strong
support. Which is to say, use caution, because the Maven POM is often
broken.
32 | Chapter 2:Installing Cassandra
Running Cassandra
In earlier versions of Cassandra, before you could start the server there was a bit of
fiddling to be done with Ivy and setting environment variables. But the developers have
done a terrific job of making it very easy to start using Cassandra immediately.
Cassandra requires Java Standard Edition JDK 6. Preferably, use
1.6.0_20 or greater. It has been tested on both the Open JDK and Sun’s
JDK. You can check your installed Java version by opening a command
prompt and executing >java -version. If you need a JDK, you can get
one at http://java.sun.com/javase/downloads.
On Windows
Once you have the binary or the source downloaded and compiled, you’re ready to
start the database server.
You also might need to set your JAVA_HOME environment variable. To do this on
Windows 7, click the Start button and then right-click on Computer. Click Advanced
System Settings, and then click the Environment Variables... button. Click New... to
create a new system variable. In the Variable Name field, type JAVA_HOME. In the Variable
Value field, type the path to your JDK installation. This is probably something like C:
\Program Files\Java\jdk1.6.0_20. Remember that if you create a new environment var-
iable, you’ll need to reopen any currently open terminals in order for the system to
become aware of the new variable. To make sure your environment variable is set
correctly and that Cassandra can subsequently find Java on Windows, execute this
command in a new terminal: >echo %JAVA_HOME%. This prints the value of your
environment variable.
Once you’ve started the server for the first time, Cassandra will add two directories to
your system. The first is C:\var\lib\cassandra, which is where it will store its data in files
called commitlog. The other is C:\var\log\cassandra; logs will be written to a file called
system.log. If you encounter any difficulties, consult the files in these directories to see
what might have happened. If you’ve been trying different versions of the database and
aren’t worried about losing data, you can delete these directories and restart the server
as a last resort.
On Linux
The process on Linux is similar to that on Windows. Make sure that your JAVA_HOME
variable is properly set to version 1.6.0_20 or better. Then, you need to extract the
Cassandra gzipped tarball using gunzip. Finally, create a couple of directories for Cas-
sandra to store its data and logs, and give them the proper permissions, as shown here:
Running Cassandra | 33
ehewitt@morpheus$ cd /home/eben/books/cassandra/dist/apache-cassandra-0.7.0-beta1
ehewitt@morpheus$ sudo mkdir -p /var/log/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/log/cassandra
ehewitt@morpheus$ sudo mkdir -p /var/lib/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/lib/cassandra
Instead of ehewitt, of course, substitute your own username.
Starting the Server
To start the Cassandra server on any OS, open a command prompt or terminal window,
navigate to the <cassandra-directory>/bin where you unpacked Cassandra, and run the
following command to start your server. In a clean installation, you should see some
log statements like this:
eben@morpheus$ bin/cassandra -f
INFO 13:23:22,367 DiskAccessMode 'auto' determined to be standard, indexAccessMode
is standard
INFO 13:23:22,475 Couldn't detect any schema definitions in local storage.
INFO 13:23:22,476 Found table data in data directories.
Consider using JMX to call org.apache.cassandra.service.StorageService
.loadSchemaFromYaml().
INFO 13:23:22,497 Cassandra version: 0.7.0-beta1
INFO 13:23:22,497 Thrift API version: 10.0.0
INFO 13:23:22,498 Saved Token not found. Using qFABQw5XJMvs47lg
INFO 13:23:22,498 Saved ClusterName not found. Using Test Cluster
INFO 13:23:22,502 Creating new commitlog segment /var/lib/cassandra/commitlog/
CommitLog-1282508602502.log
INFO 13:23:22,507 switching in a fresh Memtable for LocationInfo at CommitLogContext(
file='/var/lib/cassandra/commitlog/CommitLog-1282508602502.log', position=276)
INFO 13:23:22,510 Enqueuing flush of Memtable-LocationInfo@29857804(178 bytes,
4 operations)
INFO 13:23:22,511 Writing Memtable-LocationInfo@29857804(178 bytes, 4 operations)
INFO 13:23:22,691 Completed flushing /var/lib/cassandra/data/system/
LocationInfo-e-1-Data.db
INFO 13:23:22,701 Starting up server gossip
INFO 13:23:22,750 Binding thrift service to localhost/127.0.0.1:9160
INFO 13:23:22,752 Using TFramedTransport with a max frame size of 15728640 bytes.
INFO 13:23:22,753 Listening for thrift clients...
INFO 13:23:22,792 mx4j successfuly loaded
HttpAdaptor version 3.0.2 started on port 8081
Using the -f switch tells Cassandra to stay in the foreground instead of
running as a background process, so that all of the server logs will print
to standard out and you can see them in your terminal window, which
is useful for testing.
Congratulations! Now your Cassandra server should be up and running with a new
single node cluster called Test Cluster listening on port 9160.
34 | Chapter 2:Installing Cassandra
The committers work hard to ensure that data is readable from one
minor dot release to the next and from one major version to the next.
The commit log, however, needs to be completely cleared out from ver-
sion to version (even minor versions).
If you have any previous versions of Cassandra installed, you may want
to clear out the data directories for now, just to get up and running. If
you’ve messed up your Cassandra installation and want to get started
cleanly again, you can delete the folders in /var/lib/cassandra and /var/
log/cassandra.
Running the Command-Line Client Interface
Now that you have a Cassandra installation up and running, let’s give it a quick try to
make sure everything is set up properly. On Linux, running the command-line interface
just works. On Windows, you might have to do a little additional work.
On Windows, navigate to the Cassandra home directory and open a new terminal in
which to run our client process:
>bin\cassandra-cli
It’s possible that on Windows you will see an error like this when starting the client:
Starting Cassandra Client
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/cassandra/cli/CliMain
This probably means that you started Cassandra directly from within the bin directory,
and it therefore sets up its Java classpath incorrectly and can’t find the CliMain file to
start the client. You can define an environment variable called CASSANDRA_HOME that
points to the top-level directory where you have placed or built Cassandra, so you don’t
have to pay as much attention to where you’re starting Cassandra from.
For a little reminder on setting environment variables on Windows, see
the section “On Windows” on page 33.
To run the command-line interface program on Linux, navigate to the Cassandra home
directory and run the cassandra-cli program in the bin directory:
>bin/cassandra-cli
The Cassandra client will start:
eben@morpheus$ bin/cassandra-cli
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]
Running the Command-Line Client Interface | 35
You now have an interactive shell at which you can issue commands.
Note, however, that if you’re used to Oracle’s SQL*Plus or similar command-line
database clients, you may become frustrated. The Cassandra CLI is not intended to be
used as a full-blown client, as it’s really for development. That makes it a good way to
get started using Cassandra, because you don’t have to write lots of code to test inter-
actions with your database and get used to the environment.
Basic CLI Commands
Before we get too deep into how Cassandra works, let’s get an overview of the client
API so that you can see what kinds of commands you can send to the server. We’ll see
how to use the basic environment commands and how to do a round trip of inserting
and retrieving some data.
Help
To get help for the command-line interface, type help or ? to see the list of available
commands. The following list shows only the commands related to metadata and con-
figuration; there are other commands for getting and setting values that we explore
later.
[default@Keyspace1] help
List of all CLI commands:
? Display this message.
help Display this help.
help <command> Display detailed, command-specific help.
connect <hostname>/<port> Connect to thrift service.
use <keyspace> [<username> 'password'] Switch to a keyspace.
describe keyspace <keyspacename> Describe keyspace.
exit Exit CLI.
quit Exit CLI.
show cluster name Display cluster name.
show keyspaces Show list of keyspaces.
show api version Show server API version.
create keyspace <keyspace> [with <att1>=<value1> [and <att2>=<value2> ...]]
Add a new keyspace with the specified attribute and value(s).
create column family <cf> [with <att1>=<value1> [and <att2>=<value2> ...]]
Create a new column family with the specified attribute and value(s).
drop keyspace <keyspace> Delete a keyspace.
drop column family <cf> Delete a column family.
rename keyspace <keyspace> <keyspace_new_name> Rename a keyspace.
rename column family <cf> <new_name> Rename a column family.
Connecting to a Server
Starting the client this way does not automatically connect to a Cassandra server in-
stance. So to connect to a particular server after you have started Cassandra this way,
use the connect command:
36 | Chapter 2:Installing Cassandra
eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/cassandra-cli
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown] connect localhost/9160
Connected to: "Test Cluster" on localhost/9160
[default@unknown]
As a shortcut, you can start the client and connect to a particular server instance by
passing the host and port parameters at startup, like this:
eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/
cassandra-cli localhost/9160
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]
If you see this error while trying to connect to a server:
Exception connecting to localhost/9160 - java.net.ConnectException:
Connection refused: connect
make sure that a Cassandra instance is started at that host and port and
that you can ping the host you’re trying to reach. There may be firewall
rules preventing you from connecting. Also make sure that you’re using
the new 0.7 syntax as described earlier, as it has changed from previous
versions.
The CLI indicates that you’re connected to a Cassandra server cluster called “Test
Cluster”. That’s because this cluster of one node at localhost is set up for you by
default.
In a production environment, be sure to remove the Test Cluster from
the configuration.
Describing the Environment
After connecting to your Cassandra instance Test Cluster, if you’re using the binary
distribution, an empty keyspace, or Cassandra database, is set up for you to test with.
To see the name of the current cluster you’re working in, type:
[default@unknown] show cluster name
Test Cluster
To see which keyspaces are available in the cluster, issue this command:
[default@unknown] show keyspaces
system
Basic CLI Commands | 37
If you have created any of your own keyspaces, they will be shown as well. The
system keyspace is used internally by Cassandra, and isn’t for us to put data into. In
this way, it’s similar to the master and temp databases in Microsoft SQL Server. This
keyspace contains the schema definitions and is aware of any modifications to the
schema made at runtime. It can propagate any changes made in one node to the rest
of the cluster based on timestamps.
To see the version of the API you’re using, type:
[default@Keyspace1] show api version
10.0.0
There are a variety of other commands with which you can experiment. For now, let’s
add some data to the database and get it back out again.
Creating a Keyspace and Column Family
A Cassandra keyspace is sort of like a relational database. It defines one or more column
families, which are very roughly analogous to tables in the relational world. When you
start the CLI client without specifying a keyspace, the output will look like this:
>bin/cassandra-cli --host localhost --port 9160
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]
Your shell prompt is for default@unknown because you haven’t authenticated as a par-
ticular user (which we’ll see how to do in Chapter 6) and you didn’t specify a keyspace.
This authentication scheme is familiar if you’ve used MySQL before.
Authentication and authorization are very much works in progress at
the time of this writing. The recommended deployment is to put a fire-
wall around your cluster.
Let’s create our own keyspace so we have something to write data to:
[default@unknown] create keyspace MyKeyspace with replication_factor=1
ab67bad0-ae2c-11df-b642-e700f669bcfc
Don’t worry about the replication_factor for now. That’s a setting we’ll look at in
detail later. After you have created your own keyspace, you can switch to it in the shell
by typing:
[default@unknown] use MyKeyspace
Authenticated to keyspace: MyKeyspace
[default@MyKeyspace]
We’re “authorized” to the keyspace because MyKeyspace doesn’t require credentials.
38 | Chapter 2:Installing Cassandra
Now we can create a column family in our keyspace. To do this on the CLI, use the
following command:
[default@MyKeyspace] create column family User
991590d3-ae2e-11df-b642-e700f669bcfc
[default@MyKeyspace]
This creates a new column family called “User” in our current keyspace, and takes the
defaults for column family settings. We can use the CLI to get a description of a key-
space using the describe keyspace command, and make sure it has our column family
definition, as shown here:
[default@MyKeyspace] describe keyspace MyKeyspace
Keyspace: MyKeyspace
Column Family Name: User
Column Family Type: Standard
Column Sorted By: org.apache.cassandra.db.marshal.BytesType
flush period: null minutes
------
[default@MyKeyspace]
We’ll worry about the Type, Sorted By, and flush period settings later. For now, we
have enough to get started.
Writing and Reading Data
Now that we have a keyspace and a column family, we’ll write some data to the database
and read it back out again. It’s OK at this point not to know quite what’s going on.
We’ll come to understand Cassandra’s data model in depth later. For now, you have
a keyspace (database), which has a column family. For our purposes here, it’s enough
to think of a column family as a multidimensional ordered map that you don’t have to
define further ahead of time. Column families hold columns, and columns are the
atomic unit of data storage.
To write a value, use the set command:
[default@MyKeyspace] set User['ehewitt']['fname']='Eben'
Value inserted.
[default@MyKeyspace] set User['ehewitt']['email']='me@example.com'
Value inserted.
[default@MyKeyspace]
Here we have created two columns for the key ehewitt, to store a set of related values.
The column names are fname and email. We can use the count command to make sure
that we have written two columns for our single key:
[default@MyKeyspace] count User['ehewitt']
2 columns
Now that we know the data is there, let’s read it, using the get command:
Basic CLI Commands | 39
[default@MyKeyspace] get User['ehewitt']
=> (column=666e616d65, value=Eben, timestamp=1282510290343000)
=> (column=656d61696c, value=me@example.com, timestamp=1282510313429000)
Returned 2 results.
You can delete a column using the del command. Here we will delete the email column
for the ehewitt row key:
[default@MyKeyspace] del User['ehewitt']['email']
column removed.
Now we’ll clean up after ourselves by deleting the entire row. It’s the same command,
but we don’t specify a column name:
[default@MyKeyspace] del User['ehewitt']
row removed.
To make sure that it’s removed, we can query again:
[default@Keyspace1] get User['ehewitt']
Returned 0 results.
Summary
Now you should have a Cassandra installation up and running. You’ve worked with
the CLI client to insert and retrieve some data, and you’re ready to take a step back and
get the big picture on Cassandra before really diving into the details.
40 | Chapter 2:Installing Cassandra
CHAPTER 3
The Cassandra Data Model
In this chapter, we’ll gain an understanding of Cassandra’s design goals, data model,
and some general behavior characteristics.
For developers and administrators coming from the relational world, the Cassandra
data model can be very difficult to understand initially. Some terms, such as
“keyspace,” are completely new, and some, such as “column,” exist in both worlds but
have different meanings. It can also be confusing if you’re trying to sort through the
Dynamo or Bigtable source papers, because although Cassandra may be based on them,
it has its own model.
So in this chapter we start from common ground and then work through the unfamiliar
terms. Then, we do some actual modeling to help understand how to bridge the gap
between the relational world and the world of Cassandra.
The Relational Data Model
In a relational database, we have the database itself, which is the outermost container
that might correspond to a single application. The database contains tables. Tables
have names and contain one or more columns, which also have names. When we add
data to a table, we specify a value for every column defined; if we don’t have a value
for a particular column, we use null. This new entry adds a row to the table, which we
can later read if we know the row’s unique identifier (primary key), or by using a SQL
statement that expresses some criteria that row might meet. If we want to update values
in the table, we can update all of the rows or just some of them, depending on the filter
we use in a “where” clause of our SQL statement.
For the purposes of learning Cassandra, it may be useful to suspend for a moment what
you know from the relational world.
41
A Simple Introduction
In this section, we’ll take a bottom-up approach to understanding Cassandra’s data
model.
The simplest data store you would conceivably want to work with might be an array
or list. It would look like Figure 3-1.
Figure 3-1. A list of values
If you persisted this list, you could query it later, but you would have to either examine
each value in order to know what it represented, or always store each value in the same
place in the list and then externally maintain documentation about which cell in the
array holds which values. That would mean you might have to supply empty place-
holder values (nulls) in order to keep the uniform size in case you didn’t have a value
for an optional attribute (such as a fax number or apartment number). An array is a
clearly useful data structure, but not semantically rich.
So we’d like to add a second dimension to this list: names to match the values. We’ll
give names to each cell, and now we have a map structure, as shown in Figure 3-2.
Figure 3-2. A map of name/value pairs
This is an improvement because we can know the names of our values. So if we decided
that our map would hold User information, we could have column names like first
Name, lastName, phone, email, and so on. This is a somewhat richer structure to work
with.
42 | Chapter 3:The Cassandra Data Model
But the structure we’ve built so far works only if we have one instance of a given entity,
such as a single Person or User or Hotel or Tweet. It doesn’t give us much if we want
to store multiple entities with the same structure, which is certainly what we want to
do. There’s nothing to unify some collection of name/value pairs, and no way to repeat
the same column names. So we need something that will group some of the column
values together in a distinctly addressable group. We need a key to reference a group
of columns that should be treated together as a set. We need rows. Then, if we get a
single row, we can get all of the name/value pairs for a single entity at once, or just get
the values for the names we’re interested in. We could call these name/value pairs
columns. We could call each separate entity that holds some set of columns rows. And
the unique identifier for each row could be called a row key.
Cassandra defines a column family to be a logical division that associates similar data.
For example, we might have a User column family, a Hotel column family, an
AddressBook column family, and so on. In this way, a column family is somewhat
analogous to a table in the relational world.
Putting this all together, we have the basic Cassandra data structures: the column,
which is a name/value pair (and a client-supplied timestamp of when it was last upda-
ted), and a column family, which is a container for rows that have similar, but not
identical, column sets.
In relational databases, we’re used to storing column names as strings only—that’s all
we’re allowed. But in Cassandra, we don’t have that limitation. Both row keys and
column names can be strings, like relational column names, but they can also be long
integers, UUIDs, or any kind of byte array. So there’s some variety to how your key
names can be set.
This reveals another interesting quality to Cassandra’s columns: they don’t have to be
as simple as predefined name/value pairs; you can store useful data in the key itself,
not only in the value. This is somewhat common when creating indexes in Cassandra.
But let’s not get ahead of ourselves.
Now we don’t need to store a value for every column every time we store a new entity.
Maybe we don’t know the values for every column for a given entity. For example,
some people have a second phone number and some don’t, and in an online form
backed by Cassandra, there may be some fields that are optional and some that are
required. That’s OK. Instead of storing null for those values we don’t know, which
would waste space, we just won’t store that column at all for that row. So now we have
a sparse, multidimensional array structure that looks like Figure 3-3.
A Simple Introduction | 43
Figure 3-3. A column family
It may help to think of it in terms of JavaScript Object Notation (JSON) instead of a
picture:
Musician: ColumnFamily 1
bootsy: RowKey
email: bootsy@pfunk.com, ColumnName:Value
instrument: bass ColumnName:Value
george: RowKey
email: george@pfunk.com ColumnName:Value
Band: ColumnFamily 2
george: RowKey
pfunk: 1968-2010 ColumnName:Value
Here we have two column families, Musician and Band. The Musician column family
has two rows, “bootsy” and “george”. These two rows have a ragged set of columns
associated with them: the bootsy record has two columns (email and instrument), and
the george record has only one column. That’s fine in Cassandra. The second column
family is Band, and it also has a “george” row, with a column named “pfunk”.
Columns in Cassandra actually have a third aspect: the timestamp, which records the
last time the column was updated. This is not an automatic metadata property, how-
ever; clients have to provide the timestamp along with the value when they perform
writes. You cannot query by the timestamp; it is used purely for conflict resolution on
the server side.
Rows do not have timestamps. Only each individual column has a
timestamp.
44 | Chapter 3:The Cassandra Data Model
And what if we wanted to create a group of related columns, that is, add another di-
mension on top of this? Cassandra allows us to do this with something called a super
column family. A super column family can be thought of as a map of maps. The super
column family is shown in Figure 3-4.
Figure 3-4. A super column family
Where a row in a column family holds a collection of name/value pairs, the super
column family holds subcolumns, where subcolumns are named groups of columns.
So the address of a value in a regular column family is a row key pointing to a column
name pointing to a value, while the address of a value in a column family of type
“super” is a row key pointing to a column name pointing to a subcolumn name
pointing to a value. Put slightly differently, a row in a super column family still contains
columns, each of which then contains subcolumns.
So that’s the bottom-up approach to looking at Cassandra’s data model. Now that we
have this basic understanding, let’s switch gears and zoom out to a higher level, in order
to take a top-down approach. There is so much confusion on this topic that it’s worth
it to restate things in a different way in order to thoroughly understand the data model.
Clusters
Cassandra is probably not the best choice if you only need to run a single node. As
previously mentioned, the Cassandra database is specifically designed to be distributed
over several machines operating together that appear as a single instance to the end
user. So the outermost structure in Cassandra is the cluster, sometimes called the
ring, because Cassandra assigns data to nodes in the cluster by arranging them in
a ring.
A node holds a replica for different ranges of data. If the first node goes down, a replica
can respond to queries. The peer-to-peer protocol allows the data to replicate across
nodes in a manner transparent to the user, and the replication factor is the number of
Clusters | 45
machines in your cluster that will receive copies of the same data. We’ll examine this
in greater detail in Chapter 6.
Keyspaces
A cluster is a container for keyspaces—typically a single keyspace. A keyspace is the
outermost container for data in Cassandra, corresponding closely to a relational data-
base. Like a relational database, a keyspace has a name and a set of attributes that define
keyspace-wide behavior. Although people frequently advise that it’s a good idea to
create a single keyspace per application, this doesn’t appear to have much practical
basis. It’s certainly an acceptable practice, but it’s perfectly fine to create as many key-
spaces as your application needs. Note, however, that you will probably run into trou-
ble creating thousands of keyspaces per application.
Depending on your security constraints and partitioner, it’s fine to run multiple key-
spaces on the same cluster. For example, if your application is called Twitter, you
would probably have a cluster called Twitter-Cluster and a keyspace called Twitter.
To my knowledge, there are currently no naming conventions in Cassandra for such
items.
In Cassandra, the basic attributes that you can set per keyspace are:
Replication factor
In simplest terms, the replication factor refers to the number of nodes that will act
as copies (replicas) of each row of data. If your replication factor is 3, then three
nodes in the ring will have copies of each row, and this replication is transparent
to clients.
The replication factor essentially allows you to decide how much you want to pay
in performance to gain more consistency. That is, your consistency level for reading
and writing data is based on the replication factor.
Replica placement strategy
The replica placement refers to how the replicas will be placed in the ring. There
are different strategies that ship with Cassandra for determining which nodes
will get copies of which keys. These are SimpleStrategy (formerly known as
RackUnawareStrategy), OldNetworkTopologyStrategy (formerly known as Rack-
AwareStrategy), and NetworkTopologyStrategy (formerly known as Datacenter-
ShardStrategy).
Column families
In the same way that a database is a container for tables, a keyspace is a container
for a list of one or more column families. A column family is roughly analagous to
a table in the relational model, and is a container for a collection of rows. Each row
contains ordered columns. Column families represent the structure of your data.
Each keyspace has at least one and often many column families.
46 | Chapter 3:The Cassandra Data Model
I mention the replication factor and replica placement strategy here because they are
set per keyspace. However, they don’t have an immediate impact on your data model
per se.
It is possible, but generally not recommended, to create multiple keyspaces per appli-
cation. The only time you would want to split your application into multiple keyspaces
is if you wanted a different replication factor or replica placement strategy for some of
the column families. For example, if you have some data that is of lower priority,
you could put it in its own keyspace with a lower replication factor so that Cassandra
doesn’t have to work as hard to replicate it. But this may be more complicated than it’s
worth. It’s probably a better idea to start with one keyspace and see whether you really
need to tune at that level.
Column Families
A column family is a container for an ordered collection of rows, each of which is itself
an ordered collection of columns. In the relational world, when you are physically
creating your database from a model, you specify the name of the database (keyspace),
the names of the tables (remotely similar to column families, but don’t get stuck on the
idea that column families equal tables—they don’t), and then you define the names of
the columns that will be in each table.
There are a few good reasons not to go too far with the idea that a column family is like
a relational table. First, Cassandra is considered schema-free because although the
column families are defined, the columns are not. You can freely add any column to
any column family at any time, depending on your needs. Second, a column family has
two attributes: a name and a comparator. The comparator value indicates how columns
will be sorted when they are returned to you in a query—according to long, byte, UTF8,
or other ordering.
In a relational database, it is frequently transparent to the user how tables are stored
on disk, and it is rare to hear of recommendations about data modeling based on how
the RDBMS might store tables on disk. That’s another reason to keep in mind that a
column family is not a table. Because column families are each stored in separate files
on disk, it’s important to keep related columns defined together in the same column
family.
Another way that column families differ from relational tables is that relational tables
define only columns, and the user supplies the values, which are the rows. But in Cas-
sandra, a table can hold columns, or it can be defined as a super column family. The
benefit of using a super column family is to allow for nesting.
For standard column families, which is the default, you set the type to Standard; for a
super column family, you set the type to Super.
Column Families | 47
When you write data to a column family in Cassandra, you specify values for one or
more columns. That collection of values together with a unique identifier is called a
row. That row has a unique key, called the row key, which acts like the primary key
unique identifier for that row. So while it’s not incorrect to call it column-oriented, or
columnar, it might be easier to understand the model if you think of rows as containers
for columns. This is also why some people refer to Cassandra column families as similar
to a four-dimensional hash:
[Keyspace][ColumnFamily][Key][Column]
We can use a JSON-like notation to represent a Hotel column family, as shown here:
Hotel {
key: AZC_043 { name: Cambria Suites Hayden, phone: 480-444-4444,
address: 400 N. Hayden Rd., city: Scottsdale, state: AZ, zip: 85255}
key: AZS_011 { name: Clarion Scottsdale Peak, phone: 480-333-3333,
address: 3000 N. Scottsdale Rd, city: Scottsdale, state: AZ, zip: 85255}
key: CAS_021 { name: W Hotel, phone: 415-222-2222,
address: 181 3rd Street, city: San Francisco, state: CA, zip: 94103}
key: NYN_042 { name: Waldorf Hotel, phone: 212-555-5555,
address: 301 Park Ave, city: New York, state: NY, zip: 10019}
}
I’m leaving out the timestamp attribute of the columns here for sim-
plicity, but just remember that every column has a timestamp.
In this example, the row key is a unique primary key for the hotel, and the columns are
name, phone, address, city, state, and zip. Although these rows happen to define values
for all of the same columns, you could easily have one row with 4 columns and another
row in the same column family with 400 columns, and none of them would have to
overlap.
It’s an inherent part of Cassandra’s replica design that all data for a single
row must fit on a single machine in the cluster. The reason for this
limitation is that rows have an associated row key, which is used to
determine the nodes that will act as replicas for that row. Further, the
value of a single column cannot exceed 2GB. Keep these things in mind
as you design your data model.
We can query a column family such as this one using the CLI, like this:
cassandra> get Hotelier.Hotel['NYN_042']
=> (column=zip, value=10019, timestamp=3894166157031651)
=> (column=state, value=NY, timestamp=3894166157031651)
=> (column=phone, value=212-555-5555, timestamp=3894166157031651)
=> (column=name, value=The Waldorf=Astoria, timestamp=3894166157031651)
=> (column=city, value=New York, timestamp=3894166157031651)
48 | Chapter 3:The Cassandra Data Model
=> (column=address, value=301 Park Ave, timestamp=3894166157031651)
Returned 6 results.
This indicates that we have one hotel in New York, New York, but we see six results
because the results are column-oriented, and there are six columns for that row in the
column family. Note that while there are six columns for that row, other rows might
have more or fewer columns.
Column Family Options
There are a few additional parameters that you can define for each column family. These
are:
keys_cached
The number of locations to keep cached per SSTable. This doesn’t refer to
column name/values at all, but to the number of keys, as locations of rows per
column family, to keep in memory in least-recently-used order.
rows_cached
The number of rows whose entire contents (the complete list of name/value pairs
for that unique row key) will be cached in memory.
comment
This is just a standard comment that helps you remember important things about
your column family definitions.
read_repair_chance
This is a value between 0 and 1 that represents the probability that read repair
operations will be performed when a query is performed without a specified quo-
rum, and it returns the same row from two or more replicas and at least one of the
replicas appears to be out of date. You may want to lower this value if you are
performing a much larger number of reads than writes.
preload_row_cache
Specifies whether you want to prepopulate the row cache on server startup.
I have simplified these definitions somewhat, as they are really more about configura-
tion and server behavior than they are about the data model. They are covered in detail
in Chapter 6.
Columns
A column is the most basic unit of data structure in the Cassandra data model. A column
is a triplet of a name, a value, and a clock, which you can think of as a timestamp for
now. Again, although we’re familiar with the term “columns” from the relational
world, it’s confusing to think of them in the same way in Cassandra. First of all, when
designing a relational database, you specify the structure of the tables up front by
Columns | 49
assigning all of the columns in the table a name; later, when you write data, you’re
simply supplying values for the predefined structure.
But in Cassandra, you don’t define the columns up front; you just define the column
families you want in the keyspace, and then you can start writing data without defining
the columns anywhere. That’s because in Cassandra, all of a column’s names are
supplied by the client. This adds considerable flexibility to how your application works
with data, and can allow it to evolve organically over time.
Cassandra’s clock was introduced in version 0.7, but its fate is uncertain.
Prior to 0.7, it was called a timestamp, and was simply a Java long type.
It was changed to support Vector Clocks, which are a popular mecha-
nism for replica conflict resolution in distributed systems, and it’s how
Amazon Dynamo implements conflict resolution. That’s why you’ll
hear the third aspect of the column referred to both as a timestamp and
a clock. Vector Clocks may or may not ultimately become how time-
stamps are represented in Cassandra 0.7, which is in beta at the time of
this writing.
The data types for the name and value are Java byte arrays, frequently supplied as
strings. Because the name and value are binary types, they can be of any length. The
data type for the clock is an org.apache.cassandra.db.IClock, but for the 0.7 release, a
timestamp still works to keep backward compatibility. This column structure is illus-
trated in Figure 3-5.
Cassandra 0.7 introduced an optional time to live (TTL) value, which
allows columns to expire a certain amount of time after creation. This
can potentially prove very useful.
Figure 3-5. The structure of a column
Here’s an example of a column you might define, represented with JSON notation just
for clarity of structure:
{
"name": "email",
"value: "me@example.com",
"timestamp": 1274654183103300
}
50 | Chapter 3:The Cassandra Data Model
In this example, this column is named “email”; more precisely, the value of its name
attribute is “email”. But recall that a single column family will have multiple keys (or
row keys) representing different rows that might also contain this column. This is why
moving from the relational model is hard: we think of a relational table as holding the
same set of columns for every row. But in Cassandra, a column family holds many rows,
each of which may hold the same, or different, sets of columns.
On the server side, columns are immutable in order to prevent multithreading
issues. The column is defined in Cassandra by the org.apache.cassandra.db.IColumn
interface, which allows a variety of operations, including getting the value of the column
as a byte array or, in the case of a super column, getting its subcolumns as a
Collection<IColumn> and finding the time of the most recent change.
In a relational database, rows are stored together. This wasn’t the case for early versions
of Cassandra, but as of version 0.6, rows for the same column family are stored together
on disk.
You cannot perform joins in Cassandra. If you have designed a data
model and find that you need something like a join, you’ll have to either
do the work on the client side, or create a denormalized second column
family that represents the join results for you. This is common among
Cassandra users. Performing joins on the client should be a very rare
case; you really want to duplicate (denormalize) the data instead.
Wide Rows, Skinny Rows
When designing a table in a traditional relational database, you’re typically dealing
with “entities,” or the set of attributes that describe a particular noun (Hotel, User,
Product, etc.). Not much thought is given to the size of the rows themselves, because
row size isn’t negotiable once you’ve decided what noun your table represents. How-
ever, when you’re working with Cassandra, you actually have a decision to make about
the size of your rows: they can be wide or skinny, depending on the number of columns
the row contains.
A wide row means a row that has lots and lots (perhaps tens of thousands or even
millions) of columns. Typically there is a small number of rows that go along with so
many columns. Conversely, you could have something closer to a relational model,
where you define a smaller number of columns and use many different rows—that’s
the skinny model.
Wide rows typically contain automatically generated names (like UUIDs or time-
stamps) and are used to store lists of things. Consider a monitoring application as an
example: you might have a row that represents a time slice of an hour by using a modi-
fied timestamp as a row key, and then store columns representing IP addresses that
accessed your application within that interval. You can then create a new row key after
an hour elapses.
Columns | 51
Skinny rows are slightly more like traditional RDBMS rows, in that each row will con-
tain similar sets of column names. They differ from RDBMS rows, however, because
all columns are essentially optional.
Another difference between wide and skinny rows is that only wide rows will typically
be concerned about sorting order of column names. Which brings us to the next section.
Column Sorting
Columns have another aspect to their definition. In Cassandra, you specify how column
names will be compared for sort order when results are returned to the client. Columns
are sorted by the “Compare With” type defined on their enclosing column family, and
you can choose from the following: AsciiType, BytesType, LexicalUUIDType, Integer
Type, LongType, TimeUUIDType, or UTF8Type.
AsciiType
This sorts by directly comparing the bytes, validating that the input can be parsed
as US-ASCII. US-ASCII is a character encoding mechanism based on the lexical
order of the English alphabet. It defines 128 characters, 94 of which are printable.
BytesType
This is the default, and sorts by directly comparing the bytes, skipping the valida-
tion step. BytesType is the default for a reason: it provides the correct sorting for
most types of data (UTF-8 and ASCII included).
LexicalUUIDType
A 16-byte (128-bit) Universally Unique Identifier (UUID), compared lexically (by
byte value).
LongType
This sorts by an 8-byte (64-bit) long numeric type.
IntegerType
Introduced in 0.7, this is faster than LongType and allows integers of both fewer and
more bits than the 64 bits provided by LongType.
TimeUUIDType
This sorts by a 16-byte (128-bit) timestamp. There are five common versions of
generating timestamp UUIDs. The scheme Cassandra uses is a version one UUID,
which means that it is generated based on conflating the computer’s MAC address
and the number of 100-nanosecond intervals since the beginning of the Gregorian
calendar.
UTF8Type
A string using UTF-8 as the character encoder. Although this may seem like a
good default type, that’s probably because it’s comfortable to programmers who
are used to using XML or other data exchange mechanism that requires common
encoding. In Cassandra, however, you should use UTF8Type only if you want your
data validated.
52 | Chapter 3:The Cassandra Data Model
Custom
You can create your own column sorting mechanism if you like. This, like many
things in Cassandra, is pluggable. All you have to do is extend the org.apache
.cassandra.db.marshal.AbstractType and specify your class name.
Column names are stored in sorted order according to the value of compare_with. Rows,
on the other hand, are stored in an order defined by the partitioner (for example,
with RandomPartitioner, they are in random order, etc.). We examine partitioners in
Chapter 6.
It is not possible in Cassandra to sort by value, as we’re used to doing in relational
databases. This may seem like an odd limitation, but Cassandra has to sort by column
name in order to allow fetching individual columns from very large rows without
pulling the entire row into memory. Performance is an important selling point of Cas-
sandra, and sorting at read time would harm performance.
Column sorting is controllable, but key sorting isn’t; row keys always
sort in byte order.
Super Columns
A super column is a special kind of column. Both kinds of columns are name/value
pairs, but a regular column stores a byte array value, and the value of a super column
is a map of subcolumns (which store byte array values). Note that they store only a
map of columns; you cannot define a super column that stores a map of other super
columns. So the super column idea goes only one level deep, but it can have an
unbounded number of columns.
The basic structure of a super column is its name, which is a byte array (just as with a
regular column), and the columns it stores (see Figure 3-6). Its columns are held as a
map whose keys are the column names and whose values are the columns.
Figure 3-6. The basic structure of a super column
Each column family is stored on disk in its own separate file. So to optimize perform-
ance, it’s important to keep columns that you are likely to query together in the same
column family, and a super column can be helpful for this.
Super Columns | 53
The SuperColumn class implements both the IColumn and the IColumnContainer classes,
both from the org.apache.cassandra.db package. The Thrift API is the underlying RPC
serialization mechanism for performing remote operations on Cassandra. Because the
Thrift API has no notion of inheritance, you will sometimes see the API refer to a
ColumnOrSupercolumn type; when data structures use this type, you are expected to
know whether your underlying column family is of type Super or Standard.
Fun fact: super columns were one of the updates that Facebook added
to Google’s Bigtable data model.
Here we see some more of the richness of the data model. When using regular columns,
as we saw earlier, Cassandra looks like a four-dimensional hashtable. But for super
columns, it becomes more like a five-dimensional hash:
[Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]
To use a super column, you define your column family as type Super. Then, you still
have row keys as you do in a regular column family, but you also reference the super
column, which is simply a name that points to a list or map of regular columns (some-
times called the subcolumns).
Here is an example of a super column family definition called PointOfInterest. In the
hotelier domain, a “point of interest” is a location near a hotel that travelers might like
to visit, such as a park, museum, zoo, or tourist attraction:
PointOfInterest (SCF)
SCkey: Cambria Suites Hayden
{
key: Phoenix Zoo
{
phone: 480-555-9999,
desc: They have animals here.
},
key: Spring Training
{
phone: 623-333-3333,
desc: Fun for baseball fans.
},
}, //end of Cambria row
SCkey: (UTF8) Waldorf=Astoria
{
key: Central Park
desc: Walk around. It's pretty.
},
key: Empire State Building
{
phone: 212-777-7777,
desc: Great view from the 102nd floor.
54 | Chapter 3:The Cassandra Data Model
}
}
}
The PointOfInterest super column family has two super columns, each named for a
different hotel (Cambria Suites Hayden and Waldorf=Astoria). The row keys are names
of different points of interest, such as “Phoenix Zoo” and “Central Park”. Each row
has columns for a description (the “desc” column); some of the rows have a phone
number, and some don’t. Unlike relational tables, which group rows of identical struc-
ture, column families and super column families group merely similar records.
Using the CLI, we could query a super column family like this:
cassandra> get PointOfInterest['Central Park']['The Waldorf=Astoria']['desc']
=> (column=desc, value=Walk around in the park. It's pretty., timestamp=1281301988847)
This query is asking: in the PointOfInterest column family (which happens to be de-
fined as type Super), use the row key “Central Park”; for the super column named
“Waldorf=Astoria”, get me the value of the “desc” column (which is the plain language
text describing the point of interest).
Composite Keys
There is an important consideration when modeling with super columns: Cassandra
does not index subcolumns, so when you load a super column into memory, all of its
columns are loaded as well.
This limitation was discovered by Ryan King, the Cassandra lead at
Twitter. It might be fixed in a future release, but the change is pending
an update to the underlying storage file (the SSTable).
You can use a composite key of your own design to help you with queries. A composite
key might be something like <userid:lastupdate>.
This could just be something that you consider when modeling, and then check back
on later when you come to a hardware sizing exercise. But if your data model anticipates
more than several thousand subcolumns, you might want to take a different approach
and not use super columns. The alternative involves creating a composite key. Instead
of representing columns within a super column, the composite key approach means
that you use a regular column family with regular columns, and then employ a custom
delimiter in your key name and parse it on client retrieval.
Here’s an example of a composite key pattern, used in combination with an example
of a Cassandra design pattern I call Materialized View, as well as a common Cassandra
design pattern I call Valueless Column:
Super Columns | 55
HotelByCity (CF) Key: city:state {
key: Phoenix:AZ {AZC_043: -, AZS_011: -}
key: San Francisco:CA {CAS_021: -}
key: New York:NY {NYN_042: -}
}
There are three things happening here. First, we already have defined hotel information
in another column family called Hotel. But we can create a second column family called
HotelByCity that denormalizes the hotel data. We repeat the same information we
already have, but store it in a way that acts similarly to a view in RDBMS, because it
allows us a quick and direct way to write queries. When we know that we’re going to
look up hotels by city (because that’s how people tend to search for them), we can
create a table that defines a row key for that search. However, there are many states
that have cities with the same name (Springfield comes to mind), so we can’t just name
the row key after the city; we need to combine it with the state.
We then use another pattern called Valueless Column. All we need to know is what
hotels are in the city, and we don’t need to denormalize further. So we use the column’s
name as the value, and the column has no corresponding value. That is, when the
column is inserted, we just store an empty byte array with it.
Design Differences Between RDBMS and Cassandra
There are several differences between Cassandra’s model and query methods compared
to what’s available in RDBMS, and these are important to keep in mind.
No Query Language
SQL is the standard query language used in relational databases. Cassandra has no
query language. It does have an API that you access through its RPC serialization
mechanism, Thrift.
No Referential Integrity
Cassandra has no concept of referential integrity, and therefore has no concept of joins.
In a relational database, you could specify foreign keys in a table to reference the pri-
mary key of a record in another table. But Cassandra does not enforce this. It is still a
common design requirement to store IDs related to other entities in your tables, but
operations such as cascading deletes are not available.
Secondary Indexes
Here’s why secondary indexes are a feature: say that you want to find the unique ID
for a hotel property. In a relational database, you might use a query like this:
SELECT hotelID FROM Hotel WHERE name = 'Clarion Midtown';
56 | Chapter 3:The Cassandra Data Model
This is the query you’d have to use if you knew the name of the hotel you were looking
for but not the unique ID. When handed a query like this, a relational database will
perform a full table scan, inspecting each row’s name column to find the value you’re
looking for. But this can become very slow once your table grows very large. So the
relational answer to this is to create an index on the name column, which acts as a copy
of the data that the relational database can look up very quickly. Because the hotelID
is already a unique primary key constraint, it is automatically indexed, and that is the
primary index; for us to create another index on the name column would constitute a
secondary index, and Cassandra does not currently support this.
To achieve the same thing in Cassandra, you create a second column family that holds
the lookup data. You create one column family to store the hotel names, and map them
to their IDs. The second column family acts as an explicit secondary index.
Support for secondary indexes is currently being added to Cassandra
0.7. This allows you to create indexes on column values. So, if you want
to see all the users who live in a given city, for example, secondary index
support will save you from doing it from scratch.
Sorting Is a Design Decision
In RDBMS, you can easily change the order in which records are returned to you by
using ORDER BY in your query. The default sort order is not configurable; by default,
records are returned in the order in which they are written. If you want to change the
order, you just modify your query, and you can sort by any list of columns. In Cassan-
dra, however, sorting is treated differently; it is a design decision. Column family
definitions include a CompareWith element, which dictates the order in which your rows
will be sorted on reads, but this is not configurable per query.
Where RDBMS constrains you to sorting based on the data type stored in the column,
Cassandra only stores byte arrays, so that approach doesn’t make sense. What you can
do, however, is sort as if the column were one of several different types (ASCII, Long
integer, TimestampUUID, lexicographically, etc.). You can also use your own pluggable
comparator for sorting if you wish.
Otherwise, there is no support for ORDER BY and GROUP BY statements in Cassandra
as there is in SQL. There is a query type called a SliceRange, which we examine in
Chapter 4; it is similar to ORDER BY in that it allows a reversal.
Denormalization
In relational database design, we are often taught the importance of normalization.
This is not an advantage when working with Cassandra because it performs best when
the data model is denormalized. It is often the case that companies end up denormal-
izing data in a relational database. There are two common reasons for this. One is
Design Differences Between RDBMS and Cassandra | 57
performance. Companies simply can’t get the performance they need when they have
to do so many joins on years’ worth of data, so they denormalize along the lines of
known queries. This ends up working, but goes against the grain of how relational
databases are intended to be designed, and ultimately makes one question whether
using a relational database is the best approach in these circumstances.
A second reason that relational databases get denormalized on purpose is a business
document structure that requires retention. That is, you have an enclosing table that
refers to a lot of external tables whose data could change over time, but you need to
preserve the enclosing document as a snapshot in history. The common example here
is with invoices. You already have Customer and Product tables, and you’d think that
you could just make an invoice that refers to those tables. But this should never be done
in practice. Customer or price information could change, and then you would lose the
integrity of the Invoice document as it was on the invoice date, which could violate
audits, reports, or laws, and cause other problems.
In the relational world, denormalization violates Codd's normal forms, and we try to
avoid it. But in Cassandra, denormalization is, well, perfectly normal. It's not required
if your data model is simple. But don't be afraid of it.
The important point is that instead of modeling the data first and then writing queries,
with Cassandra you model the queries and let the data be organized around them.
Think of the most common query paths your application will use, and then create the
column families that you need to support them.
Detractors have suggested that this is a problem. But it is perfectly reasonable to expect
that you should think hard about the queries in your application, just as you would,
presumably, think hard about your relational domain. You may get it wrong, and then
you’ll have problems in either world. Or your query needs might change over time, and
then you’ll have to work to update your data set. But this is no different from defining
the wrong tables, or needing additional tables, in RDBMS.
For an interesting article on how Cloudkick is using Cassandra to store
metrics and monitoring data, see https://www.cloudkick.com/blog/2010/
mar/02/4_months_with_cassandra.
Design Patterns
There are a few ways that people commonly use Cassandra that might be described
as design patterns. I’ve given names to these common patterns: Materialized View,
Valueless Column, and Aggregate Key.
58 | Chapter 3:The Cassandra Data Model
Materialized View
It is common to create a secondary index that represents additional queries. Because
you don’t have a SQL WHERE clause, you can recreate this effect by writing your data to
a second column family that is created specifically to represent that query.
For example, if you have a User column family and you want to find users in a
particular city, you might create a second column family called UserCity that stores
user data with the city as keys (instead of the username) and that has columns named
for the users who live in that city. This is a denormalization technique that will speed
queries and is an example of specifically designing your data around your queries (and
not the other way around). This usage is common in the Cassandra world. When you
want to query for users in a city, you just query the UserCity column family, instead of
querying the User column family and doing a bunch of pruning work on the client across
a potentially large data set.
Note that in this context, “materialized” means storing a full copy of the original data
so that everything you need to answer a query is right there, without forcing you to
look up the original data. If you are performing a second query because you’re only
storing column names that you use, like foreign keys in the second column family, that’s
a secondary index.
As of 0.7, Cassandra has native support for secondary indexes.
Valueless Column
Let’s build on our User/UserCity example. Because we’re storing the reference data in
the User column family, two things arise: one, you need to have unique and thoughtful
keys that can enforce referential integrity; and two, the columns in the UserCity
column family don’t necessarily need values. If you have a row key of Boise, then the
column names can be the names of the users in that city. Because your reference data
is in the User column family, the columns don’t really have any meaningful value; you’re
just using it as a prefabricated list, but you’ll likely want to use values in that list to get
additional data from the reference column family.
Aggregate Key
When you use the Valueless Column pattern, you may also need to employ the
Aggregate Key pattern. This pattern fuses together two scalar values with a separator
to create an aggregate. To extend our example further, city names typically aren’t
unique; many states in the US have a city called Springfield, and there’s a Paris, Texas,
and a Paris, Tennessee. So what will work better here is to fuse together the state name
Design Patterns | 59
and the city name to create an Aggregate Key to use in our Materialized View. This key
would look something like: TX:Paris or TN:Paris. By convention, many Cassandra users
employ the colon as the separator, but it could be a pipe character or any other character
that is not otherwise meaningful in your keys.
Some Things to Keep in Mind
Let’s look briefly at a few things to keep in mind when you’re trying to move from a
relational mindset to Cassandra’s data model. I’ll just say it: if you have been working
with relational databases for a long time, it’s not always easy. Here are a few pointers:
Start with your queries. Ask what queries your application will need, and model
the data around that instead of modeling the data first, as you would in the rela-
tional world. This can be shocking to some people. Some very smart people have
told me that this approach will cause trouble for the Cassandra practitioner down
the line when new queries come up, as they tend to in business. Fair enough. My
response is to ask why they assume their data types would be more static than their
queries.
You have to supply a timestamp (or clock) with each query, so you need a strategy
to synchronize those with multiple clients. This is crucial in order for Cassandra
to use the timestamps to determine the most recent write value. One good strategy
here is the use of a Network Time Protocol (NTP) server. Again, some smart people
have asked me, why not let the server take care of the clock? My response is that
in a symmetrical distributed database, the server side actually has the same
problem.
Summary
In this chapter we took a gentle approach to understanding Cassandra’s data model of
keyspaces, column families, columns, and super columns. We also explored a few of
the contrasts between RDBMS and Cassandra.
60 | Chapter 3:The Cassandra Data Model
CHAPTER 4
Sample Application
In this chapter, we create a complete sample application so we can see how all the parts
fit together. We will use various parts of the API to see how to insert data, perform
batch updates, and search column families and super column families.
To create the example, we want to use something that is complex enough to show the
various data structures and basic API operations, but not something that will bog you
down with details. In order to get enough data in the database to make our searches
work right (by finding stuff we’re looking for and leaving out stuff we’re not looking
for), there’s a little redundancy in the prepopulation of the database. Also, I wanted to
use a domain that’s familiar to everyone so we can concentrate on how to work with
Cassandra, not on what the application domain is all about.
The code in this chapter has been tested against the 0.7 beta 1 release,
and works as shown. It is possible that API changes may necessitate
minor tuning on your part, depending on your version.
Data Design
When you set out to build a new data-driven application that will use a relational
database, you might start by modeling the domain as a set of properly normalized tables
and use foreign keys to reference related data in other tables. Now that we have an
understanding of how Cassandra stores data, let’s create a little domain model that is
easy to understand in the relational world, and then see how we might map it from a
relational to a distributed hashtable model in Cassandra.
Relational modeling, in simple terms, means that you start from the conceptual domain
and then represent the nouns in the domain in tables. You then assign primary keys
and foreign keys to model relationships. When you have a many-to-many relationship,
you create the join tables that represent just those keys. The join tables don’t exist in
the real world, and are a necessary side effect of the way relational models work. After
you have all your tables laid out, you can start writing queries that pull together
61
disparate data using the relationships defined by the keys. The queries in the relational
world are very much secondary. It is assumed that you can always get the data you want
as long as you have your tables modeled properly. Even if you have to use several
complex subqueries or join statements, this is usually true.
By contrast, in Cassandra you don’t start with the data model; you start with the query
model.
For this example, let’s use a domain that is easily understood and that everyone can
relate to: a hotel that wants to allow guests to book a reservation.
Our conceptual domain includes hotels, guests that stay in the hotels, a collection of
rooms for each hotel, and a record of the reservation, which is a certain guest in a certain
room for a certain period of time (called the “stay”). Hotels typically also maintain a
collection of “points of interest,” which are parks, museums, shopping galleries,
monuments, or other places near the hotel that guests might want to visit during their
stay. Both hotels and points of interest need to maintain geolocation data so that they
can be found on maps for mashups, and to calculate distances.
Obviously, in the real world there would be many more considerations
and much more complexity. For example, hotel rates are notoriously
dynamic, and calculating them involves a wide array of factors. Here
we’re defining something complex enough to be interesting and touch
on the important points, but simple enough to maintain the focus on
learning Cassandra.
Here’s how we would start this application design with Cassandra. First, determine
your queries. We’ll likely have something like the following:
Find hotels in a given area.
Find information about a given hotel, such as its name and location.
Find points of interest near a given hotel.
Find an available room in a given date range.
Find the rate and amenities for a room.
Book the selected room by entering guest information.
Hotel App RDBMS Design
Figure 4-1 shows how we might represent this simple hotel reservation system using a
relational database model. The relational model includes a couple of “join” tables in
order to resolve the many-to-many relationships of hotels-to-points of interest, and for
rooms-to-amenities.
62 | Chapter 4:Sample Application
Figure 4-1. A simple hotel search system using RDBMS
Hotel App Cassandra Design
Although there are many possible ways to do it, we could represent the same logical
data model using a Cassandra physical model such as that shown in Figure 4-2.
In this design, we’re doing all the same things as in the relational design. We have
transferred some of the tables, such as Hotel and Guest, to column families. Other
tables, such as PointOfInterest, have been denormalized into a super column family.
In the relational model, you can look up hotels by the city they’re in using a SQL
statement. But because we don’t have SQL in Cassandra, we’ve created an index in the
form of the HotelByCity column family.
I’m using a stereotype notation here, so <<CF>> refers to a column family,
<<SCF>> refers to a super column family, and so on.
We have combined room and amenities into a single column family, Room. The columns
such as type and rate will have corresponding values; other columns, such as
hot tub, will just use the presence of the column name itself as the value, and be oth-
erwise empty.
Hotel App Cassandra Design | 63
Hotel Application Code
In this section we walk through the code and show how to implement the given design.
This is useful because it illustrates several different API functions in action.
The purpose of this sample application is to show how different ideas
in Cassandra can be combined. It is by no means the only way to
transfer the relational design into this model. There is a lot of low-level
plumbing here that uses the Thrift API. Thrift is (probably) changing to
Avro, so although the basic ideas here work, you don’t want to follow
this example in a real application. Instead, check out Chapter 8 and use
one of the many available third-party clients for Cassandra, depending
on the language that your application uses and other needs that you
have.
Figure 4-2. The hotel search represented with Cassandra’s model
64 | Chapter 4:Sample Application
The application we’re building will do the following things:
1. Create the database structure.
2. Prepopulate the database with hotel and point of interest data. The hotels are
stored in standard column families, and the points of interest are in super column
families.
3. Search for a list of hotels in a given city. This uses a secondary index.
4. Select one of the hotels returned in the search, and then search for a list of points
of interest near the chosen hotel.
5. Booking the hotel by doing an insert into the Reservation column family should
be straightforward at this point, and is left to the reader.
Space doesn’t permit implementing the entire application. But we’ll walk through the
major parts, and finishing the implementation is just a matter of creating a variation of
what is shown.
Creating the Database
The first step is creating the schema definition. For this example, we’ll define the
schema in YAML and then load it, although you could also use client code to define it.
The YAML file shown in Example 4-1 defines the necessary keyspace and column
families.
Example 4-1. Schema definition in cassandra.yaml
keyspaces:
- name: Hotelier
replica_placement_strategy: org.apache.cassandra.locator.RackUnawareStrategy
replication_factor: 1
column_families:
- name: Hotel
compare_with: UTF8Type
- name: HotelByCity
compare_with: UTF8Type
- name: Guest
compare_with: BytesType
- name: Reservation
compare_with: TimeUUIDType
- name: PointOfInterest
column_type: Super
compare_with: UTF8Type
compare_subcolumns_with: UTF8Type
- name: Room
Hotel Application Code | 65
column_type: Super
compare_with: BytesType
compare_subcolumns_with: BytesType
- name: RoomAvailability
column_type: Super
compare_with: BytesType
compare_subcolumns_with: BytesType
This definition provides all of the column families to run the example, and a couple
more that we don’t directly reference in the application code, because it rounds out the
design as transferred from RDBMS.
Loading the schema
Once you have the schema defined in YAML, you need to load it. To do this, open a
console, start the jconsole application, and connect to Cassandra via JMX. Then,
execute the operation loadSchemaFromYAML, which is part of the org.apache
.cassandra.service.StorageService MBean. Now Cassandra knows about your
schema and you can start using it. You can also use the API itself to create keyspaces
and column families.
Data Structures
The application requires some standard data structures that will just act as transfer
objects for us. These aren’t particularly interesting, but are required to keep things
organized. We’ll use a Hotel data structure to hold all of the information about a hotel,
shown in Example 4-2.
Example 4-2. Hotel.java
package com.cassandraguide.hotel;
//data transfer object
public class Hotel {
public String id;
public String name;
public String phone;
public String address;
public String city;
public String state;
public String zip;
}
This structure just holds the column information for convenience in the application.
We also have a POI data structure to hold information about points of interest. This is
shown in Example 4-3.
66 | Chapter 4:Sample Application
Example 4-3. POI.java
package com.cassandraguide.hotel;
//data transfer object for a Point of Interest
public class POI {
public String name;
public String desc;
public String phone;
}
We also have a Constants class, which keeps commonly used strings in one easy-to-
change place, shown in Example 4-4.
Example 4-4. Constants.java
package com.cassandraguide.hotel;
import org.apache.cassandra.thrift.ConsistencyLevel;
public class Constants {
public static final String CAMBRIA_NAME = "Cambria Suites Hayden";
public static final String CLARION_NAME= "Clarion Scottsdale Peak";
public static final String W_NAME = "The W SF";
public static final String WALDORF_NAME = "The Waldorf=Astoria";
public static final String UTF8 = "UTF8";
public static final String KEYSPACE = "Hotelier";
public static final ConsistencyLevel CL = ConsistencyLevel.ONE;
public static final String HOST = "localhost";
public static final int PORT = 9160;
}<