Cassandra: The Definitive Guide OReilly.Cassandra.The.Definitive.Guide.2nd.Edition.1491933666

User Manual: Pdf

Open the PDF directly: View PDF PDF.
Page Count: 369 [warning: Documents this large are best viewed by clicking the View PDF Link!]

Je Carpenter & Eben Hewitt
Cassandra
The Denitive Guide
DISTRIBUTED DATA AT WEB SCALE
2nd Edition
Je Carpenter and Eben Hewitt
Cassandra: The Denitive Guide
SECOND EDITION
Boston Farnham Sebastopol Tokyo
Beijing Boston Farnham Sebastopol Tokyo
Beijing
978-1-491-93366-4
[LSI]
Cassandra: The Denitive Guide
by Jeff Carpenter and Eben Hewitt
Copyright © 2016 Jeff Carpenter, Eben Hewitt. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/insti‐
tutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Mike Loukides and Marie Beaugureau
Production Editor: Colleen Cole
Copyeditor: Jasmine Kwityn
Proofreader: James Fraleigh
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2016: Second Edition
Revision History for the Second Edition
2010-11-12: First Release
2016-06-27: Second Release
2017-04-07: Third Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491933664 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Cassandra: e Denitive Guide, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
is book is dedicated to my sweetheart, Alison Brown.
I can hear the sound of violins, long before it begins.
—E.H.
For Stephanie, my inspiration, unfailing support,
and the love of my life.
—J.C.
Table of Contents
Foreword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Foreword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
1. Beyond Relational Databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Whats Wrong with Relational Databases? 1
A Quick Review of Relational Databases 5
RDBMSs: The Awesome and the Not-So-Much 5
Web Scale 12
The Rise of NoSQL 13
Summary 15
2. Introducing Cassandra. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
The Cassandra Elevator Pitch 17
Cassandra in 50 Words or Less 17
Distributed and Decentralized 18
Elastic Scalability 19
High Availability and Fault Tolerance 19
Tuneable Consistency 20
Brewer’s CAP Theorem 23
Row-Oriented 26
High Performance 28
Where Did Cassandra Come From? 28
Release History 30
Is Cassandra a Good Fit for My Project? 35
Large Deployments 35
v
Lots of Writes, Statistics, and Analysis 36
Geographical Distribution 36
Evolving Applications 36
Getting Involved 36
Summary 38
3. Installing Cassandra. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Installing the Apache Distribution 39
Extracting the Download 39
Whats In There? 40
Building from Source 41
Additional Build Targets 43
Running Cassandra 43
On Windows 44
On Linux 45
Starting the Server 45
Stopping Cassandra 47
Other Cassandra Distributions 48
Running the CQL Shell 49
Basic cqlsh Commands 50
cqlsh Help 50
Describing the Environment in cqlsh 51
Creating a Keyspace and Table in cqlsh 52
Writing and Reading Data in cqlsh 55
Summary 56
4. The Cassandra Query Language. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
The Relational Data Model 57
Cassandras Data Model 58
Clusters 61
Keyspaces 61
Tables 61
Columns 63
CQL Types 65
Numeric Data Types 66
Textual Data Types 67
Time and Identity Data Types 67
Other Simple Data Types 69
Collections 70
User-Defined Types 73
Secondary Indexes 76
Summary 78
vi | Table of Contents
5. Data Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Conceptual Data Modeling 79
RDBMS Design 80
Design Differences Between RDBMS and Cassandra 81
Defining Application Queries 84
Logical Data Modeling 85
Hotel Logical Data Model 87
Reservation Logical Data Model 89
Physical Data Modeling 91
Hotel Physical Data Model 92
Reservation Physical Data Model 93
Materialized Views 94
Evaluating and Refining 96
Calculating Partition Size 96
Calculating Size on Disk 97
Breaking Up Large Partitions 99
Defining Database Schema 100
DataStax DevCenter 102
Summary 103
6. The Cassandra Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Data Centers and Racks 105
Gossip and Failure Detection 106
Snitches 108
Rings and Tokens 109
Virtual Nodes 110
Partitioners 111
Replication Strategies 112
Consistency Levels 113
Queries and Coordinator Nodes 114
Memtables, SSTables, and Commit Logs 115
Caching 117
Hinted Handoff 117
Lightweight Transactions and Paxos 118
Tombstones 120
Bloom Filters 120
Compaction 121
Anti-Entropy, Repair, and Merkle Trees 122
Staged Event-Driven Architecture (SEDA) 124
Managers and Services 125
Cassandra Daemon 125
Storage Engine 126
Table of Contents | vii
Storage Service 126
Storage Proxy 126
Messaging Service 127
Stream Manager 127
CQL Native Transport Server 127
System Keyspaces 128
Summary 130
7. Conguring Cassandra. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Cassandra Cluster Manager 131
Creating a Cluster 132
Seed Nodes 135
Partitioners 136
Murmur3 Partitioner 136
Random Partitioner 137
Order-Preserving Partitioner 137
ByteOrderedPartitioner 137
Snitches 138
Simple Snitch 138
Property File Snitch 138
Gossiping Property File Snitch 139
Rack Inferring Snitch 139
Cloud Snitches 140
Dynamic Snitch 140
Node Configuration 140
Tokens and Virtual Nodes 141
Network Interfaces 142
Data Storage 143
Startup and JVM Settings 144
Adding Nodes to a Cluster 144
Dynamic Ring Participation 146
Replication Strategies 147
SimpleStrategy 147
NetworkTopologyStrategy 148
Changing the Replication Factor 150
Summary 150
8. Clients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Hector, Astyanax, and Other Legacy Clients 151
DataStax Java Driver 152
Development Environment Configuration 152
Clusters and Contact Points 153
viii | Table of Contents
Sessions and Connection Pooling 155
Statements 156
Policies 164
Metadata 167
Debugging and Monitoring 171
DataStax Python Driver 172
DataStax Node.js Driver 173
DataStax Ruby Driver 174
DataStax C# Driver 175
DataStax C/C++ Driver 176
DataStax PHP Driver 177
Summary 177
9. Reading and Writing Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Writing 179
Write Consistency Levels 180
The Cassandra Write Path 181
Writing Files to Disk 183
Lightweight Transactions 185
Batches 188
Reading 190
Read Consistency Levels 191
The Cassandra Read Path 192
Read Repair 195
Range Queries, Ordering and Filtering 195
Functions and Aggregates 198
Paging 202
Speculative Retry 205
Deleting 205
Summary 206
10. Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Logging 207
Tailing 209
Examining Log Files 210
Monitoring Cassandra with JMX 211
Connecting to Cassandra via JConsole 213
Overview of MBeans 215
Cassandras MBeans 219
Database MBeans 222
Networking MBeans 226
Metrics MBeans 227
Table of Contents | ix
Threading MBeans 228
Service MBeans 228
Security MBeans 228
Monitoring with nodetool 229
Getting Cluster Information 230
Getting Statistics 232
Summary 234
11. Maintenance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Health Check 235
Basic Maintenance 236
Flush 236
Cleanup 237
Repair 238
Rebuilding Indexes 242
Moving Tokens 243
Adding Nodes 243
Adding Nodes to an Existing Data Center 243
Adding a Data Center to a Cluster 244
Handling Node Failure 246
Repairing Nodes 246
Replacing Nodes 247
Removing Nodes 248
Upgrading Cassandra 251
Backup and Recovery 252
Taking a Snapshot 253
Clearing a Snapshot 255
Enabling Incremental Backup 255
Restoring from Snapshot 255
SSTable Utilities 256
Maintenance Tools 257
DataStax OpsCenter 257
Netflix Priam 260
Summary 260
12. Performance Tuning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Managing Performance 261
Setting Performance Goals 261
Monitoring Performance 262
Analyzing Performance Issues 264
Tracing 265
Tuning Methodology 268
x | Table of Contents
Caching 268
Key Cache 269
Row Cache 269
Counter Cache 270
Saved Cache Settings 270
Memtables 271
Commit Logs 272
SSTables 273
Hinted Handoff 274
Compaction 275
Concurrency and Threading 278
Networking and Timeouts 279
JVM Settings 280
Memory 281
Garbage Collection 281
Using cassandra-stress 283
Summary 286
13. Security. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
Authentication and Authorization 289
Password Authenticator 289
Using CassandraAuthorizer 292
Role-Based Access Control 293
Encryption 294
SSL, TLS, and Certificates 295
Node-to-Node Encryption 296
Client-to-Node Encryption 298
JMX Security 299
Securing JMX Access 299
Security MBeans 301
Summary 301
14. Deploying and Integrating. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Planning a Cluster Deployment 303
Sizing Your Cluster 303
Selecting Instances 305
Storage 306
Network 307
Cloud Deployment 308
Amazon Web Services 308
Microsoft Azure 310
Google Cloud Platform 311
Table of Contents | xi
Integrations 312
Apache Lucene, SOLR, and Elasticsearch 312
Apache Hadoop 312
Apache Spark 313
Summary 319
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
xii | Table of Contents
Foreword
Cassandra was open-sourced by Facebook in July 2008. This original version of
Cassandra was written primarily by an ex-employee from Amazon and one from
Microsoft. It was strongly influenced by Dynamo, Amazons pioneering distributed
key/value database. Cassandra implements a Dynamo-style replication model with no
single point of failure, but adds a more powerful “column family” data model.
I became involved in December of that year, when Rackspace asked me to build them
a scalable database. This was good timing, because all of today’s important open
source scalable databases were available for evaluation. Despite initially having only a
single major use case, Cassandras underlying architecture was the strongest, and I
directed my efforts toward improving the code and building a community.
Cassandra was accepted into the Apache Incubator, and by the time it graduated in
March 2010, it had become a true open source success story, with committers from
Rackspace, Digg, Twitter, and other companies that wouldn’t have written their own
database from scratch, but together built something important.
Today’s Cassandra is much more than the early system that powered (and still pow‐
ers) Facebooks inbox search; it has become “the hands-down winner for transaction
processing performance,” to quote Tony Bain, with a deserved reputation for reliabil‐
ity and performance at scale.
As Cassandra matured and began attracting more mainstream users, it became clear
that there was a need for commercial support; thus, Matt Pfeil and I cofounded Rip‐
tano in April 2010. Helping drive Cassandra adoption has been very rewarding, espe‐
cially seeing the uses that don’t get discussed in public.
Another need has been a book like this one. Like many open source projects, Cassan‐
dras documentation has historically been weak. And even when the documentation
ultimately improves, a book-length treatment like this will remain useful.
xiii
Thanks to Eben for tackling the difficult task of distilling the art and science of devel‐
oping against and deploying Cassandra. You, the reader, have the opportunity to
learn these new concepts in an organized fashion.
— Jonathan Ellis
Project Chair, Apache Cassandra, and
Cofounder and CTO, DataStax
xiv | Foreword
Foreword
I am so excited to be writing the foreword for the new edition of Cassandra: e
Denitive Guide. Why? Because there is a new edition! When the original version of
this book was written, Apache Cassandra was a brand new project. Over the years, so
much has changed that users from that time would barely recognize the database
today. It’s notoriously hard to keep track of fast moving projects like Apache Cassan‐
dra, and I’m very thankful to Jeff for taking on this task and communicating the latest
to the world.
One of the most important updates to the new edition is the content on modeling
your data. I have said this many times in public: a data model can be the difference
between a successful Apache Cassandra project and a failed one. A good portion of
this book is now devoted to understanding how to do it right. Operations folks, you
haven’t been left out either. Modern Apache Cassandra includes things such as virtual
nodes and many new options to maintain data consistency, which are all explained in
the second edition. Theres so much ground to cover—it’s a good thing you got the
definitive guide!
Whatever your focus, you have made a great choice in learning more about Apache
Cassandra. There is no better time to add this skill to your toolbox. Or, for experi‐
enced users, maintaining your knowledge by keeping current with changes will give
you an edge. As recent surveys have shown, Apache Cassandra skills are some of the
highest paying and most sought after in the world of application development and
infrastructure. This also shows a very clear trend in our industry. When organiza‐
tions need a highly scaling, always-on, multi-datacenter database, you can’t find a bet‐
ter choice than Apache Cassandra. A quick search will yield hundreds of companies
that have staked their success on our favorite database. This trust is well founded, as
you will see as you read on. As applications are moving to the cloud by default, Cas‐
sandra keeps up with dynamic and global data needs. This book will teach you why
and how to apply it in your application. Build something amazing and be yet another
success story.
xv
And finally, I invite you to join our thriving Apache Cassandra community. World‐
wide, the community has been one of the strongest non-technical assets for new
users. We are lucky to have a thriving Cassandra community, and collaboration
among our members has made Apache Cassandra a stronger database. There are
many ways you can participate. You can start with simple things like attending meet‐
ups or conferences, where you can network with your peers. Eventually you may
want to make more involved contributions like writing blog posts or giving presenta‐
tions, which can add to the group intelligence and help new users following behind
you. And, the most critical part of an open source project, make technical contribu‐
tions. Write some code to fix a bug or add a feature. Submit a bug report or feature
request in a JIRA. These contributions are a great measurement of the health and
vibrancy of a project. You dont need any special status, just create an account and go!
And when you need help, refer back to this book, or reach out to our community. We
are here to help you be successful.
Excited yet? Good!
Enough of me talking, its time for you to turn the page and start learning.
— Patrick McFadin
Chief Evangelist for
Apache Cassandra, DataStax
xvi | Foreword
Preface
Why Apache Cassandra?
Apache Cassandra is a free, open source, distributed data storage system that differs
sharply from relational database management systems (RDBMSs).
Cassandra first started as an Incubator project at Apache in January of 2009. Shortly
thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis,
released version 0.3 of Cassandra, and have steadily made releases ever since. Cassan‐
dra is being used in production by some of the biggest companies on the Web, includ‐
ing Facebook, Twitter, and Netflix.
Its popularity is due in large part to the outstanding technical features it provides. It is
durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes,
can store hundreds of terabytes of data, and is decentralized and symmetrical so
theres no single point of failure. It is highly available and offers a data model based on
the Cassandra Query Language (CQL).
Is This Book for You?
This book is intended for a variety of audiences. It should be useful to you if you are:
A developer working with large-scale, high-volume applications, such as Web 2.0
social applications or ecommerce sites
An application architect or data architect who needs to understand the available
options for high-performance, decentralized, elastic data stores
A database administrator or database developer currently working with standard
relational database systems who needs to understand how to implement a fault-
tolerant, eventually consistent data store
xvii
A manager who wants to understand the advantages (and disadvantages) of Cas‐
sandra and related columnar databases to help make decisions about technology
strategy
A student, analyst, or researcher who is designing a project related to Cassandra
or other non-relational data store options
This book is a technical guide. In many ways, Cassandra represents a new way of
thinking about data. Many developers who gained their professional chops in the last
15–20 years have become well versed in thinking about data in purely relational or
object-oriented terms. Cassandras data model is very different and can be difficult to
wrap your mind around at first, especially for those of us with entrenched ideas about
what a database is (and should be).
Using Cassandra does not mean that you have to be a Java developer. However, Cas‐
sandra is written in Java, so if you’re going to dive into the source code, a solid under‐
standing of Java is crucial. Although its not strictly necessary to know Java, it can
help you to better understand exceptions, how to build the source code, and how to
use some of the popular clients. Many of the examples in this book are in Java. But
because of the interface used to access Cassandra, you can use Cassandra from a wide
variety of languages, including C#, Python, node.js, PHP, and Ruby.
Finally, it is assumed that you have a good understanding of how the Web works, can
use an integrated development environment (IDE), and are somewhat familiar with
the typical concerns of data-driven applications. You might be a well-seasoned devel‐
oper or administrator but still, on occasion, encounter tools used in the Cassandra
world that youre not familiar with. For example, Apache Ant is used to build Cassan‐
dra, and the Cassandra source code is available via Git. In cases where we speculate
that you’ll need to do a little setup of your own in order to work with the examples,
we try to support that.
What’s in This Book?
This book is designed with the chapters acting, to a reasonable extent, as standalone
guides. This is important for a book on Cassandra, which has a variety of audiences
and is changing rapidly. To borrow from the software world, the book is designed to
be “modular.” If you’re new to Cassandra, it makes sense to read the book in order; if
you’ve passed the introductory stages, you will still find value in later chapters, which
you can read as standalone guides.
Here is how the book is organized:
Chapter 1, Beyond Relational Databases
This chapter reviews the history of the enormously successful relational database
and the recent rise of non-relational database technologies like Cassandra.
xviii | Preface
Chapter 2, Introducing Cassandra
This chapter introduces Cassandra and discusses whats exciting and different
about it, where it came from, and what its advantages are.
Chapter 3, Installing Cassandra
This chapter walks you through installing Cassandra, getting it running, and try‐
ing out some of its basic features.
Chapter 4, e Cassandra Query Language
Here we look at Cassandras data model, highlighting how it differs from the tra‐
ditional relational model. We also explore how this data model is expressed in the
Cassandra Query Language (CQL).
Chapter 5, Data Modeling
This chapter introduces principles and processes for data modeling in Cassandra.
We analyze a well-understood domain to produce a working schema.
Chapter 6, e Cassandra Architecture
This chapter helps you understand what happens during read and write opera‐
tions and how the database accomplishes some of its notable aspects, such as
durability and high availability. We go under the hood to understand some of the
more complex inner workings, such as the gossip protocol, hinted handoffs, read
repairs, Merkle trees, and more.
Chapter 7, Conguring Cassandra
This chapter shows you how to specify partitioners, replica placement strategies,
and snitches. We set up a cluster and see the implications of different configura‐
tion choices.
Chapter 8, Clients
There are a variety of clients available for different languages, including Java,
Python, node.js, Ruby, C#, and PHP, in order to abstract Cassandras lower-level
API. We help you understand common driver features.
Chapter 9, Reading and Writing Data
We build on the previous chapters to learn how Cassandra works “under the cov‐
ers” to read and write data. We’ll also discuss concepts such as batches, light‐
weight transactions, and paging.
Chapter 10, Monitoring
Once your cluster is up and running, youll want to monitor its usage, memory
patterns, and thread patterns, and understand its general activity. Cassandra has
a rich Java Management Extensions (JMX) interface baked in, which we put to
use to monitor all of these and more.
Preface | xix
Chapter 11, Maintenance
The ongoing maintenance of a Cassandra cluster is made somewhat easier by
some tools that ship with the server. We see how to decommission a node, load
balance the cluster, get statistics, and perform other routine operational tasks.
Chapter 12, Performance Tuning
One of Cassandras most notable features is its speed—its very fast. But there are
a number of things, including memory settings, data storage, hardware choices,
caching, and buffer sizes, that you can tune to squeeze out even more perfor‐
mance.
Chapter 13, Security
NoSQL technologies are often slighted as being weak on security. Thankfully,
Cassandra provides authentication, authorization, and encryption features,
which we’ll learn how to configure in this chapter.
Chapter 14, Deploying and Integrating
We close the book with a discussion of considerations for planning cluster
deployments, including cloud deployments using providers such as Amazon,
Microsoft, and Google. We also introduce several technologies that are frequently
paired with Cassandra to extend its capabilities.
Cassandra Versions Used in This Book
This book was developed using Apache Cassandra 3.0 and the
DataStax Java Driver version 3.0. The formatting and content of
tool output, log files, configuration files, and error messages are as
they appear in the 3.0 release, and may change in future releases.
When discussing features added in releases 2.0 and later, we cite
the release in which the feature was added for readers who may be
using earlier versions and are considering whether to upgrade.
New for the Second Edition
The first edition of Cassandra: e Denitive Guide was the first book published on
Cassandra, and has remained highly regarded over the years. However, the Cassandra
landscape has changed significantly since 2010, both in terms of the technology itself
and the community that develops and supports that technology. Heres a summary of
the key updates we’ve made to bring the book up to date:
A sense of history
The first edition was written against the 0.7 release in 2010. As of 2016, were up
to the 3.X series. The most significant change has been the introduction of CQL
and deprecation of the old Thrift API. Other new architectural features include
xx | Preface
secondary indexes, materialized views, and lightweight transactions. We provide
a summary release history in Chapter 2 to help guide you through the changes.
As we introduce new features throughout the text, we frequently cite the releases
in which these features were added.
Giving developers a leg up
Development and testing with Cassandra has changed a lot over the years, with
the introduction of the CQL shell (cqlsh) and the gradual replacement of
community-developed clients with the drivers provided by DataStax. We give in-
depth treatment to cqlsh in Chapters 3 and 4, and the drivers in Chapters 8 and
9. We also provide an expanded description of Cassandras read path and write
path in Chapter 9 to enhance your understanding of the internals and help you
understand the impact of decisions.
Maturing Cassandra operations
As more and more individuals and organizations have deployed Cassandra in
production environments, the knowledge base of production challenges and best
practices to meet those challenges has increased. Weve added entirely new chap‐
ters on security (Chapter 13) and deployment and integration (Chapter 14), and
greatly expanded the monitoring, maintenance, and performance tuning chap‐
ters (Chapters 10 through 12) in order to relate this collected wisdom.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program ele‐
ments such as variable or function names, databases, data types, environment
variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values deter‐
mined by context.
Preface | xxi
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Using Code Examples
The code examples found in this book are available for download at https://
github.com/jereyscarpenter/cassandra-guide.
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless youre reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example
code from this book into your products documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the
title, author, publisher, and ISBN. For example: “Cassandra: e Denitive Guide, Sec‐
ond Edition, by Jeff Carpenter. Copyright 2016 Jeff Carpenter, 978-1-491-93366-4.
If you feel your use of code examples falls outside fair use or the permission given
here, feel free to contact us at permissions@oreilly.com.
O’Reilly Safari
Safari (formerly Safari Books Online) is a membership-based
training and reference platform for enterprise, government,
educators, and individuals.
xxii | Preface
Members have access to thousands of books, training videos, Learning Paths, interac‐
tive tutorials, and curated playlists from over 250 publishers, including O’Reilly
Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Profes‐
sional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press,
John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe
Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and
Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/cassandra2e.
To comment or ask technical questions about this book, send email to bookques‐
tions@oreilly.com.
For more information about our books, courses, conferences, and news, see our web‐
site at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
There are many wonderful people to whom we are grateful for helping bring this
book to life.
Thank you to our technical reviewers: Stu Hood, Robert Schneider, and Gary Dusba‐
bek contributed thoughtful reviews to the first edition, while Andrew Baker, Ewan
Elliot, Kirk Damron, Corey Cole, Jeff Jirsa, and Patrick McFadin reviewed the second
edition. Chris Judsons feedback was key to the maturation of Chapter 14.
Preface | xxiii
Thank you to Jonathan Ellis and Patrick McFadin for writing forewords for the first
and second editions, respectively. Thanks also to Patrick for his contributions to the
Spark integration section in Chapter 14.
Thanks to our editors, Mike Loukides and Marie Beaugureau, for their constant sup‐
port and making this a better book.
Jeff would like to thank Eben for entrusting him with the opportunity to update such
a well-regarded, foundational text, and for Ebens encouragement from start to finish.
Finally, weve been inspired by the many terrific developers who have contributed to
Cassandra. Hats off for making such an elegant and powerful database.
xxiv | Preface
CHAPTER 1
Beyond Relational Databases
If at rst the idea is not absurd, then there is no hope for it.
—Albert Einstein
Welcome to Cassandra: e Denitive Guide. The aim of this book is to help develop‐
ers and database administrators understand this important database technology.
During the course of this book, we will explore how Cassandra compares to tradi‐
tional relational database management systems, and help you put it to work in your
own environment.
What’s Wrong with Relational Databases?
If I had asked people what they wanted, they would have said faster horses.
—Henry Ford
We ask you to consider a certain model for data, invented by a small team at a com‐
pany with thousands of employees. It was accessible over a TCP/IP interface and was
available from a variety of languages, including Java and web services. This model
was difficult at first for all but the most advanced computer scientists to understand,
until broader adoption helped make the concepts clearer. Using the database built
around this model required learning new terms and thinking about data storage in a
different way. But as products sprang up around it, more businesses and government
agencies put it to use, in no small part because it was fast—capable of processing
thousands of operations a second. The revenue it generated was tremendous.
And then a new model came along.
The new model was threatening, chiefly for two reasons. First, the new model was
very different from the old model, which it pointedly controverted. It was threatening
because it can be hard to understand something different and new. Ensuing debates
1
can help entrench people stubbornly further in their views—views that might have
been largely inherited from the climate in which they learned their craft and the cir‐
cumstances in which they work. Second, and perhaps more importantly, as a barrier,
the new model was threatening because businesses had made considerable invest‐
ments in the old model and were making lots of money with it. Changing course
seemed ridiculous, even impossible.
Of course, we are talking about the Information Management System (IMS) hierarch‐
ical database, invented in 1966 at IBM.
IMS was built for use in the Saturn V moon rocket. Its architect was Vern Watts, who
dedicated his career to it. Many of us are familiar with IBMs database DB2. IBMs
wildly popular DB2 database gets its name as the successor to DB1—the product built
around the hierarchical data model IMS. IMS was released in 1968, and subsequently
enjoyed success in Customer Information Control System (CICS) and other applica‐
tions. It is still used today.
But in the years following the invention of IMS, the new model, the disruptive model,
the threatening model, was the relational database.
In his 1970 paper “A Relational Model of Data for Large Shared Data Banks,” Dr.
Edgar F. Codd, also at advanced his theory of the relational model for data while
working at IBMs San Jose research laboratory. This paper, still available at http://
www.seas.upenn.edu/~zives/03f/cis550/codd.pdf, became the foundational work for
relational database management systems.
Codds work was antithetical to the hierarchical structure of IMS. Understanding and
working with a relational database required learning new terms, including “relations,
tuples,” and “normal form,” all of which must have sounded very strange indeed to
users of IMS. It presented certain key advantages over its predecessor, such as the
ability to express complex relationships between multiple entities, well beyond what
could be represented by hierarchical databases.
While these ideas and their application have evolved in four decades, the relational
database still is clearly one of the most successful software applications in history. It’s
used in the form of Microsoft Access in sole proprietorships, and in giant multina‐
tional corporations with clusters of hundreds of finely tuned instances representing
multi-terabyte data warehouses. Relational databases store invoices, customer
records, product catalogues, accounting ledgers, user authentication schemes—the
very world, it might appear. There is no question that the relational database is a key
facet of the modern technology and business landscape, and one that will be with us
in its various forms for many years to come, as will IMS in its various forms. The
relational model presented an alternative to IMS, and each has its uses.
So the short answer to the question, “Whats wrong with relational databases?” is
“Nothing.
2 | Chapter 1: Beyond Relational Databases
There is, however, a rather longer answer, which says that every once in a while an
idea is born that ostensibly changes things, and engenders a revolution of sorts. And
yet, in another way, such revolutions, viewed structurally, are simply history’s busi‐
ness as usual. IMS, RDBMSs, NoSQL. The horse, the car, the plane. They each build
on prior art, they each attempt to solve certain problems, and so they’re each good at
certain things—and less good at others. They each coexist, even now.
So lets examine for a moment why, at this point, we might consider an alternative to
the relational database, just as Codd himself four decades ago looked at the Informa‐
tion Management System and thought that maybe it wasn’t the only legitimate way of
organizing information and solving data problems, and that maybe, for certain prob‐
lems, it might prove fruitful to consider an alternative.
We encounter scalability problems when our relational applications become success‐
ful and usage goes up. Joins are inherent in any relatively normalized relational data‐
base of even modest size, and joins can be slow. The way that databases gain
consistency is typically through the use of transactions, which require locking some
portion of the database so its not available to other clients. This can become untena‐
ble under very heavy loads, as the locks mean that competing users start queuing up,
waiting for their turn to read or write the data.
We typically address these problems in one or more of the following ways, sometimes
in this order:
Throw hardware at the problem by adding more memory, adding faster process‐
ors, and upgrading disks. This is known as vertical scaling. This can relieve you
for a time.
When the problems arise again, the answer appears to be similar: now that one
box is maxed out, you add hardware in the form of additional boxes in a database
cluster. Now you have the problem of data replication and consistency during
regular usage and in failover scenarios. You didnt have that problem before.
Now we need to update the configuration of the database management system.
This might mean optimizing the channels the database uses to write to the
underlying filesystem. We turn off logging or journaling, which frequently is not
a desirable (or, depending on your situation, legal) option.
Having put what attention we could into the database system, we turn to our
application. We try to improve our indexes. We optimize the queries. But pre‐
sumably at this scale we weren’t wholly ignorant of index and query optimiza‐
tion, and already had them in pretty good shape. So this becomes a painful
process of picking through the data access code to find any opportunities for
fine-tuning. This might include reducing or reorganizing joins, throwing out
resource-intensive features such as XML processing within a stored procedure,
and so forth. Of course, presumably we were doing that XML processing for a
What’s Wrong with Relational Databases? | 3
reason, so if we have to do it somewhere, we move that problem to the applica‐
tion layer, hoping to solve it there and crossing our fingers that we dont break
something else in the meantime.
We employ a caching layer. For larger systems, this might include distributed
caches such as memcached, Redis, Riak, EHCache, or other related products.
Now we have a consistency problem between updates in the cache and updates in
the database, which is exacerbated over a cluster.
We turn our attention to the database again and decide that, now that the appli‐
cation is built and we understand the primary query paths, we can duplicate
some of the data to make it look more like the queries that access it. This process,
called denormalization, is antithetical to the five normal forms that characterize
the relational model, and violates Codd’s 12 Rules for relational data. We remind
ourselves that we live in this world, and not in some theoretical cloud, and then
undertake to do what we must to make the application start responding at
acceptable levels again, even if its no longer “pure.
Codd’s Twelve Rules
Codd provided a list of 12 rules (there are actually 13, numbered 0
to 12) formalizing his definition of the relational model as a
response to the divergence of commercial databases from his origi‐
nal concepts. Codd introduced his rules in a pair of articles in
CompuWorld magazine in October 1985, and formalized them in
the second edition of his book e Relational Model for Database
Management, which is now out of print.
This likely sounds familiar to you. At web scale, engineers may legitimately ponder
whether this situation isn’t similar to Henry Fords assertion that at a certain point, its
not simply a faster horse that you want. And they’ve done some impressive, interest‐
ing work.
We must therefore begin here in recognition that the relational model is simply a
model. That is, its intended to be a useful way of looking at the world, applicable to
certain problems. It does not purport to be exhaustive, closing the case on all other
ways of representing data, never again to be examined, leaving no room for alterna‐
tives. If we take the long view of history, Dr. Codds model was a rather disruptive one
in its time. It was new, with strange new vocabulary and terms such as “tuples”—
familiar words used in a new and different manner. The relational model was held up
to suspicion, and doubtless suffered its vehement detractors. It encountered opposi‐
tion even in the form of Dr. Codds own employer, IBM, which had a very lucrative
product set around IMS and didnt need a young upstart cutting into its pie.
4 | Chapter 1: Beyond Relational Databases
But the relational model now arguably enjoys the best seat in the house within the
data world. SQL is widely supported and well understood. It is taught in introductory
university courses. There are open source databases that come installed and ready to
use with a $4.95 monthly web hosting plan. Cloud-based Platform-as-a-Service
(PaaS) providers such as Amazon Web Services, Google Cloud Platform, Rackspace,
and Microsoft Azure provide relational database access as a service, including auto‐
mated monitoring and maintenance features. Often the database we end up using is
dictated to us by architectural standards within our organization. Even absent such
standards, its prudent to learn whatever your organization already has for a database
platform. Our colleagues in development and infrastructure have considerable hard-
won knowledge.
If by nothing more than osmosis (or inertia), we have learned over the years that a
relational database is a one-size-fits-all solution.
So perhaps a better question is not, “What’s wrong with relational databases?” but
rather, “What problem do you have?
That is, you want to ensure that your solution matches the problem that you have.
There are certain problems that relational databases solve very well. But the explosion
of the Web, and in particular social networks, means a corresponding explosion in
the sheer volume of data we must deal with. When Tim Berners-Lee first worked on
the Web in the early 1990s, it was for the purpose of exchanging scientific documents
between PhDs at a physics laboratory. Now, of course, the Web has become so ubiqui‐
tous that its used by everyone, from those same scientists to legions of five-year-olds
exchanging emoticons about kittens. That means in part that it must support enor‐
mous volumes of data; the fact that it does stands as a monument to the ingenious
architecture of the Web.
But some of this infrastructure is starting to bend under the weight.
A Quick Review of Relational Databases
Though you are likely familiar with them, lets briefly turn our attention to some of
the foundational concepts in relational databases. This will give us a basis on which to
consider more recent advances in thought around the trade-offs inherent in dis‐
tributed data systems, especially very large distributed data systems, such as those
that are required at web scale.
RDBMSs: The Awesome and the Not-So-Much
There are many reasons that the relational database has become so overwhelmingly
popular over the last four decades. An important one is the Structured Query Lan‐
guage (SQL), which is feature-rich and uses a simple, declarative syntax. SQL was first
officially adopted as an ANSI standard in 1986; since that time, its gone through sev‐
A Quick Review of Relational Databases | 5
eral revisions and has also been extended with vendor proprietary syntax such as
Microsofts T-SQL and Oracles PL/SQL to provide additional implementation-
specific features.
SQL is powerful for a variety of reasons. It allows the user to represent complex rela‐
tionships with the data, using statements that form the Data Manipulation Language
(DML) to insert, select, update, delete, truncate, and merge data. You can perform a
rich variety of operations using functions based on relational algebra to find a maxi‐
mum or minimum value in a set, for example, or to filter and order results. SQL state‐
ments support grouping aggregate values and executing summary functions. SQL
provides a means of directly creating, altering, and dropping schema structures at
runtime using Data Definition Language (DDL). SQL also allows you to grant and
revoke rights for users and groups of users using the same syntax.
SQL is easy to use. The basic syntax can be learned quickly, and conceptually SQL
and RDBMSs offer a low barrier to entry. Junior developers can become proficient
readily, and as is often the case in an industry beset by rapid changes, tight deadlines,
and exploding budgets, ease of use can be very important. And its not just the syntax
thats easy to use; there are many robust tools that include intuitive graphical inter‐
faces for viewing and working with your database.
In part because its a standard, SQL allows you to easily integrate your RDBMS with a
wide variety of systems. All you need is a driver for your application language, and
youre off to the races in a very portable way. If you decide to change your application
implementation language (or your RDBMS vendor), you can often do that painlessly,
assuming you haven’t backed yourself into a corner using lots of proprietary exten‐
sions.
Transactions, ACID-ity, and two-phase commit
In addition to the features mentioned already, RDBMSs and SQL also support trans‐
actions. A key feature of transactions is that they execute virtually at first, allowing
the programmer to undo (using rollback) any changes that may have gone awry dur‐
ing execution; if all has gone well, the transaction can be reliably committed. As Jim
Gray puts it, a transaction is “a transformation of state” that has the ACID properties
(see The Transaction Concept: Virtues and Limitations).
ACID is an acronym for Atomic, Consistent, Isolated, Durable, which are the gauges
we can use to assess that a transaction has executed properly and that it was success‐
ful:
Atomic
Atomic means “all or nothing”; that is, when a statement is executed, every
update within the transaction must succeed in order to be called successful.
There is no partial failure where one update was successful and another related
6 | Chapter 1: Beyond Relational Databases
update failed. The common example here is with monetary transfers at an ATM:
the transfer requires subtracting money from one account and adding it to
another account. This operation cannot be subdivided; they must both succeed.
Consistent
Consistent means that data moves from one correct state to another correct state,
with no possibility that readers could view different values that dont make sense
together. For example, if a transaction attempts to delete a customer and her
order history, it cannot leave order rows that reference the deleted customer’s
primary key; this is an inconsistent state that would cause errors if someone tried
to read those order records.
Isolated
Isolated means that transactions executing concurrently will not become entan‐
gled with each other; they each execute in their own space. That is, if two differ‐
ent transactions attempt to modify the same data at the same time, then one of
them will have to wait for the other to complete.
Durable
Once a transaction has succeeded, the changes will not be lost. This doesn’t imply
another transaction wont later modify the same data; it just means that writers
can be confident that the changes are available for the next transaction to work
with as necessary.
The debate about support for transactions comes up very quickly as a sore spot in
conversations around non-relational data stores, so let’s take a moment to revisit what
this really means. On the surface, ACID properties seem so obviously desirable as to
not even merit conversation. Presumably no one who runs a database would suggest
that data updates dont have to endure for some length of time; thats the very point of
making updates—that they’re there for others to read. However, a more subtle exami‐
nation might lead us to want to find a way to tune these properties a bit and control
them slightly. There is, as they say, no free lunch on the Internet, and once we see how
were paying for our transactions, we may start to wonder whether theres an alterna‐
tive.
Transactions become difficult under heavy load. When you first attempt to horizon‐
tally scale a relational database, making it distributed, you must now account for dis‐
tributed transactions, where the transaction isnt simply operating inside a single table
or a single database, but is spread across multiple systems. In order to continue to
honor the ACID properties of transactions, you now need a transaction manager to
orchestrate across the multiple nodes.
In order to account for successful completion across multiple hosts, the idea of a two-
phase commit (sometimes referred to as “2PC”) is introduced. But then, because
two-phase commit locks all associated resources, it is useful only for operations that
A Quick Review of Relational Databases | 7
can complete very quickly. Although it may often be the case that your distributed
operations can complete in sub-second time, it is certainly not always the case. Some
use cases require coordination between multiple hosts that you may not control your‐
self. Operations coordinating several different but related activities can take hours to
update.
Two-phase commit blocks; that is, clients (“competing consumers”) must wait for a
prior transaction to finish before they can access the blocked resource. The protocol
will wait for a node to respond, even if it has died. It’s possible to avoid waiting for‐
ever in this event, because a timeout can be set that allows the transaction coordinator
node to decide that the node isn’t going to respond and that it should abort the trans‐
action. However, an infinite loop is still possible with 2PC; thats because a node can
send a message to the transaction coordinator node agreeing that it’s OK for the coor‐
dinator to commit the entire transaction. The node will then wait for the coordinator
to send a commit response (or a rollback response if, say, a different node can’t com‐
mit); if the coordinator is down in this scenario, that node conceivably will wait for‐
ever.
So in order to account for these shortcomings in two-phase commit of distributed
transactions, the database world turned to the idea of compensation. Compensation,
often used in web services, means in simple terms that the operation is immediately
committed, and then in the event that some error is reported, a new operation is
invoked to restore proper state.
There are a few basic, well-known patterns for compensatory action that architects
frequently have to consider as an alternative to two-phase commit. These include
writing off the transaction if it fails, deciding to discard erroneous transactions and
reconciling later. Another alternative is to retry failed operations later on notification.
In a reservation system or a stock sales ticker, these are not likely to meet your
requirements. For other kinds of applications, such as billing or ticketing applica‐
tions, this can be acceptable.
The Problem with Two-Phase Commit
Gregor Hohpe, a Google architect, wrote a wonderful and often-
cited blog entry called “Starbucks Does Not Use Two-Phase Com‐
mit. It shows in real-world terms how difficult it is to scale two-
phase commit and highlights some of the alternatives that are
mentioned here. Its an easy, fun, and enlightening read.
The problems that 2PC introduces for application developers include loss of availabil‐
ity and higher latency during partial failures. Neither of these is desirable. So once
you’ve had the good fortune of being successful enough to necessitate scaling your
database past a single machine, you now have to figure out how to handle transac‐
8 | Chapter 1: Beyond Relational Databases
tions across multiple machines and still make the ACID properties apply. Whether
you have 10 or 100 or 1,000 database machines, atomicity is still required in transac‐
tions as if you were working on a single node. But its now a much, much bigger pill
to swallow.
Schema
One often-lauded feature of relational database systems is the rich schemas they
afford. You can represent your domain objects in a relational model. A whole indus‐
try has sprung up around (expensive) tools such as the CA ERWin Data Modeler to
support this effort. In order to create a properly normalized schema, however, you are
forced to create tables that don’t exist as business objects in your domain. For exam‐
ple, a schema for a university database might require a “student” table and a “course
table. But because of the “many-to-many” relationship here (one student can take
many courses at the same time, and one course has many students at the same time),
you have to create a join table. This pollutes a pristine data model, where wed prefer
to just have students and courses. It also forces us to create more complex SQL state‐
ments to join these tables together. The join statements, in turn, can be slow.
Again, in a system of modest size, this isnt much of a problem. But complex queries
and multiple joins can become burdensomely slow once you have a large number of
rows in many tables to handle.
Finally, not all schemas map well to the relational model. One type of system that has
risen in popularity in the last decade is the complex event processing system, which
represents state changes in a very fast stream. It’s often useful to contextualize events
at runtime against other events that might be related in order to infer some conclu‐
sion to support business decision making. Although event streams could be repre‐
sented in terms of a relational database, it is an uncomfortable stretch.
And if youre an application developer, you’ll no doubt be familiar with the many
object-relational mapping (ORM) frameworks that have sprung up in recent years to
help ease the difficulty in mapping application objects to a relational model. Again,
for small systems, ORM can be a relief. But it also introduces new problems of its
own, such as extended memory requirements, and it often pollutes the application
code with increasingly unwieldy mapping code. Heres an example of a Java method
using Hibernate to “ease the burden” of having to write the SQL code:
@CollectionOfElements
@JoinTable(name="store_description",
joinColumns = @JoinColumn(name="store_code"))
@MapKey(columns={@Column(name="for_store",length=3)})
@Column(name="description")
private Map<String, String> getMap() {
return this.map;
}
//... etc.
A Quick Review of Relational Databases | 9
Is it certain that weve done anything but move the problem here? Of course, with
some systems, such as those that make extensive use of document exchange, as with
services or XML-based applications, there are not always clear mappings to a rela‐
tional database. This exacerbates the problem.
Sharding and shared-nothing architecture
If you can’t split it, you cant scale it.
—Randy Shoup, Distinguished Architect, eBay
Another way to attempt to scale a relational database is to introduce sharding to your
architecture. This has been used to good effect at large websites such as eBay, which
supports billions of SQL queries a day, and in other modern web applications. The
idea here is that you split the data so that instead of hosting all of it on a single server
or replicating all of the data on all of the servers in a cluster, you divide up portions of
the data horizontally and host them each separately.
For example, consider a large customer table in a relational database. The least dis‐
ruptive thing (for the programming staff, anyway) is to vertically scale by adding
CPU, adding memory, and getting faster hard drives, but if you continue to be suc‐
cessful and add more customers, at some point (perhaps into the tens of millions of
rows), youll likely have to start thinking about how you can add more machines.
When you do so, do you just copy the data so that all of the machines have it? Or do
you instead divide up that single customer table so that each database has only some
of the records, with their order preserved? Then, when clients execute queries, they
put load only on the machine that has the record they’re looking for, with no load on
the other machines.
It seems clear that in order to shard, you need to find a good key by which to order
your records. For example, you could divide your customer records across 26
machines, one for each letter of the alphabet, with each hosting only the records for
customers whose last names start with that particular letter. It’s likely this is not a
good strategy, however—there probably arent many last names that begin with “Q
or “Z,” so those machines will sit idle while the “J,” “M,” and “S” machines spike. You
could shard according to something numeric, like phone number, “member since
date, or the name of the customer’s state. It all depends on how your specific data is
likely to be distributed.
10 | Chapter 1: Beyond Relational Databases
There are three basic strategies for determining shard structure:
Feature-based shard or functional segmentation
This is the approach taken by Randy Shoup, Distinguished Architect at eBay,
who in 2006 helped bring the sites architecture into maturity to support many
billions of queries per day. Using this strategy, the data is split not by dividing
records in a single table (as in the customer example discussed earlier), but rather
by splitting into separate databases the features that dont overlap with each other
very much. For example, at eBay, the users are in one shard, and the items for
sale are in another. At Flixster, movie ratings are in one shard and comments are
in another. This approach depends on understanding your domain so that you
can segment data cleanly.
Key-based sharding
In this approach, you find a key in your data that will evenly distribute it across
shards. So instead of simply storing one letter of the alphabet for each server as in
the (naive and improper) earlier example, you use a one-way hash on a key data
element and distribute data across machines according to the hash. It is common
in this strategy to find time-based or numeric keys to hash on.
Lookup table
In this approach, one of the nodes in the cluster acts as a “yellow pages” directory
and looks up which node has the data youre trying to access. This has two obvi‐
ous disadvantages. The first is that you’ll take a performance hit every time you
have to go through the lookup table as an additional hop. The second is that the
lookup table not only becomes a bottleneck, but a single point of failure.
Sharding can minimize contention depending on your strategy and allows you not
just to scale horizontally, but then to scale more precisely, as you can add power to
the particular shards that need it.
Sharding could be termed a kind of “shared-nothing” architecture thats specific to
databases. A shared-nothing architecture is one in which there is no centralized
(shared) state, but each node in a distributed system is independent, so there is no
client contention for shared resources. The term was first coined by Michael Stone‐
braker at the University of California at Berkeley in his 1986 paper “The Case for
Shared Nothing.
Shared-nothing architecture was more recently popularized by Google, which has
written systems such as its Bigtable database and its MapReduce implementation that
do not share state, and are therefore capable of near-infinite scaling. The Cassandra
database is a shared-nothing architecture, as it has no central controller and no
notion of master/slave; all of its nodes are the same.
A Quick Review of Relational Databases | 11
More on Shared-Nothing Architecture
You can read the 1986 paper “The Case for Shared Nothing” online
at http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf. Its only a few
pages. If you take a look, you’ll see that many of the features of
shared-nothing distributed data architecture, such as ease of high
availability and the ability to scale to a very large number of
machines, are the very things that Cassandra excels at.
MongoDB also provides auto-sharding capabilities to manage failover and node bal‐
ancing. That many non-relational databases offer this automatically and out of the
box is very handy; creating and maintaining custom data shards by hand is a wicked
proposition. It’s good to understand sharding in terms of data architecture in general,
but especially in terms of Cassandra more specifically, as it can take an approach sim‐
ilar to key-based sharding to distribute data across nodes, but does so automatically.
Web Scale
In summary, relational databases are very good at solving certain data storage prob‐
lems, but because of their focus, they also can create problems of their own when it’s
time to scale. Then, you often need to find a way to get rid of your joins, which means
denormalizing the data, which means maintaining multiple copies of data and seri‐
ously disrupting your design, both in the database and in your application. Further,
you almost certainly need to find a way around distributed transactions, which will
quickly become a bottleneck. These compensatory actions are not directly supported
in any but the most expensive RDBMSs. And even if you can write such a huge check,
you still need to carefully choose partitioning keys to the point where you can never
entirely ignore the limitation.
Perhaps more importantly, as we see some of the limitations of RDBMSs and conse‐
quently some of the strategies that architects have used to mitigate their scaling
issues, a picture slowly starts to emerge. It’s a picture that makes some NoSQL solu‐
tions seem perhaps less radical and less scary than we may have thought at first, and
more like a natural expression and encapsulation of some of the work that was
already being done to manage very large databases.
Because of some of the inherent design decisions in RDBMSs, it is not always as easy
to scale as some other, more recent possibilities that take the structure of the Web into
consideration. However, its not only the structure of the Web we need to consider,
but also its phenomenal growth, because as more and more data becomes available,
we need architectures that allow our organizations to take advantage of this data in
near real time to support decision making and to offer new and more powerful fea‐
tures and capabilities to our customers.
12 | Chapter 1: Beyond Relational Databases
Data Scale, Then and Now
It has been said, though it is hard to verify, that the 17th-century
English poet John Milton had actually read every published book
on the face of the earth. Milton knew many languages (he was even
learning Navajo at the time of his death), and given that the total
number of published books at that time was in the thousands, this
would have been possible. The size of the worlds data stores have
grown somewhat since then.
With the rapid growth in the Web, there is great variety to the kinds of data that need
to be stored, processed, and queried, and some variety to the businesses that use such
data. Consider not only customer data at familiar retailers or suppliers, and not only
digital video content, but also the required move to digital television and the explo‐
sive growth of email, messaging, mobile phones, RFID, Voice Over IP (VoIP) usage,
and the Internet of Things (IoT). As we have departed from physical consumer media
storage, companies that provide content—and the third-party value-add businesses
built around them—require very scalable data solutions. Consider too that as a typi‐
cal business application developer or database administrator, we may be used to
thinking of relational databases as the center of our universe. You might then be sur‐
prised to learn that within corporations, around 80% of data is unstructured.
The Rise of NoSQL
The recent interest in non-relational databases reflects the growing sense of need in
the software development community for web scale data solutions. The term
“NoSQL” began gaining popularity around 2009 as a shorthand way of describing
these databases. The term has historically been the subject of much debate, but a con‐
sensus has emerged that the term refers to non-relational databases that support “not
only SQL” semantics.
Various experts have attempted to organize these databases in a few broad categories;
we’ll examine a few of the most common:
Key-value stores
In a key-value store, the data items are keys that have a set of attributes. All data
relevant to a key is stored with the key; data is frequently duplicated. Popular
key-value stores include Amazons Dynamo DB, Riak, and Voldemort. Addition‐
ally, many popular caching technologies act as key-value stores, including Oracle
Coherence, Redis, and MemcacheD.
Column stores
Column stores are also frequently known as wide-column stores. Googles
Bigtable served as the inspiration for implementations including Cassandra,
Hypertable, and Apache Hadoops HBase.
The Rise of NoSQL | 13
Document stores
The basic unit of storage in a document database is the complete document,
often stored in a format such as JSON, XML, or YAML. Popular document stores
include MongoDB and CouchDB.
Graph databases
Graph databases represent data as a graph—a network of nodes and edges that
connect the nodes. Both nodes and edges can have properties. Because they give
heightened importance to relationships, graph databases such as FlockDB, Neo4J,
and Polyglot have proven popular for building social networking and semantic
web applications.
Object databases
Object databases store data not in terms of relations and columns and rows, but
in terms of the objects themselves, making it straightforward to use the database
from an object-oriented application. Object databases such as db4o and InterSys‐
tems Caché allow you to avoid techniques like stored procedures and object-
relational mapping (ORM) tools.
XML databases
XML databases are a special form of document databases, optimized specifically
for working with XML. So-called “XML native” databases include Tamino from
Software AG and eXist.
For a comprehensive list of NoSQL databases, see the site http://nosql-database.org.
There is wide variety in the goals and features of these databases, but they tend to
share a set of common characteristics. The most obvious of these is implied by the
name NoSQL—these databases support data models, data definition languages
(DDLs), and interfaces beyond the standard SQL available in popular relational data‐
bases. In addition, these databases are typically distributed systems without central‐
ized control. They emphasize horizontal scalability and high availability, in some
cases at the cost of strong consistency and ACID semantics. They tend to support
rapid development and deployment. They take flexible approaches to schema defini‐
tion, in some cases not requiring any schema to be defined up front. They provide
support for Big Data and analytics use cases.
Over the past several years, there have been a large number of open source and com‐
mercial offerings in the NoSQL space. The adoption and quality of these have varied
widely, but leaders have emerged in the categories just discussed, and many have
become mature technologies with large installation bases and commercial support.
Were happy to report that Cassandra is one of those technologies, as well dig into
more in the next chapter.
14 | Chapter 1: Beyond Relational Databases
Summary
The relational model has served the software industry well over the past four decades,
but the level of availability and scalability required for modern applications has
stretched relational database technology to the breaking point.
The intention of this book is not to convince you by clever argument to adopt a non-
relational database such as Apache Cassandra. It is only our intention to present what
Cassandra can do and how it does it so that you can make an informed decision and
get started working with it in practical ways if you find it applies.
Perhaps the ultimate question, then, is not “What’s wrong with relational databases?
but rather, “What kinds of things would I do with data if it wasnt a problem?” In a
world now working at web scale and looking to the future, Apache Cassandra might
be one part of the answer.
Summary | 15
CHAPTER 2
Introducing Cassandra
An invention has to make sense in the world in which it is nished,
not the world in which it is started.
—Ray Kurzweil
In the previous chapter, we discussed the emergence of non-relational database tech‐
nologies in order to meet the increasing demands of modern web scale applications.
In this chapter, we’ll focus on Cassandras value proposition and key tenets to show
how it rises to the challenge. Youll also learn about Cassandras history and how you
can get involved in the open source community that maintains Cassandra.
The Cassandra Elevator Pitch
Hollywood screenwriters and software startups are often advised to have their “eleva‐
tor pitch” ready. This is a summary of exactly what their product is all about—con‐
cise, clear, and brief enough to deliver in just a minute or two, in the lucky event that
they find themselves sharing an elevator with an executive, agent, or investor who
might consider funding their project. Cassandra has a compelling story, so let’s boil it
down to an elevator pitch that you can present to your manager or colleagues should
the occasion arise.
Cassandra in 50 Words or Less
Apache Cassandra is an open source, distributed, decentralized, elastically scalable,
highly available, fault-tolerant, tuneably consistent, row-oriented database that bases
its distribution design on Amazons Dynamo and its data model on Googles Bigtable.
Created at Facebook, it is now used at some of the most popular sites on the Web.
Thats exactly 50 words.
17
Of course, if you were to recite that to your boss in the elevator, youd probably get a
blank look in return. So lets break down the key points in the following sections.
Distributed and Decentralized
Cassandra is distributed, which means that it is capable of running on multiple
machines while appearing to users as a unified whole. In fact, there is little point in
running a single Cassandra node. Although you can do it, and thats acceptable for
getting up to speed on how it works, you quickly realize that you’ll need multiple
machines to really realize any benefit from running Cassandra. Much of its design
and code base is specifically engineered toward not only making it work across many
different machines, but also for optimizing performance across multiple data center
racks, and even for a single Cassandra cluster running across geographically dis‐
persed data centers. You can confidently write data to anywhere in the cluster and
Cassandra will get it.
Once you start to scale many other data stores (MySQL, Bigtable), some nodes need
to be set up as masters in order to organize other nodes, which are set up as slaves.
Cassandra, however, is decentralized, meaning that every node is identical; no Cas‐
sandra node performs certain organizing operations distinct from any other node.
Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and
keep in sync a list of nodes that are alive or dead.
The fact that Cassandra is decentralized means that there is no single point of failure.
All of the nodes in a Cassandra cluster function exactly the same. This is sometimes
referred to as “server symmetry.” Because they are all doing the same thing, by defini‐
tion there cant be a special host that is coordinating activities, as with the master/
slave setup that you see in MySQL, Bigtable, and so many others.
In many distributed data solutions (such as RDBMS clusters), you set up multiple
copies of data on different servers in a process called replication, which copies the
data to multiple machines so that they can all serve simultaneous requests and
improve performance. Typically this process is not decentralized, as in Cassandra, but
is rather performed by defining a master/slave relationship. That is, all of the servers
in this kind of cluster dont function in the same way. You configure your cluster by
designating one server as the master and others as slaves. The master acts as the
authoritative source of the data, and operates in a unidirectional relationship with the
slave nodes, which must synchronize their copies. If the master node fails, the whole
database is in jeopardy. The decentralized design is therefore one of the keys to Cas‐
sandras high availability. Note that while we frequently understand master/slave rep‐
lication in the RDBMS world, there are NoSQL databases such as MongoDB that
follow the master/slave scheme as well.
Decentralization, therefore, has two key advantages: its simpler to use than master/
slave, and it helps you avoid outages. It can be easier to operate and maintain a decen‐
18 | Chapter 2: Introducing Cassandra
tralized store than a master/slave store because all nodes are the same. That means
that you don’t need any special knowledge to scale; setting up 50 nodes isn’t much dif‐
ferent from setting up one. Theres next to no configuration required to support it.
Moreover, in a master/slave setup, the master can become a single point of failure
(SPOF). To avoid this, you often need to add some complexity to the environment in
the form of multiple masters. Because all of the replicas in Cassandra are identical,
failures of a node wont disrupt service.
In short, because Cassandra is distributed and decentralized, there is no single point
of failure, which supports high availability.
Elastic Scalability
Scalability is an architectural feature of a system that can continue serving a greater
number of requests with little degradation in performance. Vertical scaling—simply
adding more hardware capacity and memory to your existing machine—is the easiest
way to achieve this. Horizontal scaling means adding more machines that have all or
some of the data on them so that no one machine has to bear the entire burden of
serving requests. But then the software itself must have an internal mechanism for
keeping its data in sync with the other nodes in the cluster.
Elastic scalability refers to a special property of horizontal scalability. It means that
your cluster can seamlessly scale up and scale back down. To do this, the cluster must
be able to accept new nodes that can begin participating by getting a copy of some or
all of the data and start serving new user requests without major disruption or recon‐
figuration of the entire cluster. You dont have to restart your process. You don’t have
to change your application queries. You don’t have to manually rebalance the data
yourself. Just add another machine—Cassandra will find it and start sending it work.
Scaling down, of course, means removing some of the processing capacity from your
cluster. You might do this for business reasons, such as adjusting to seasonal work‐
loads in retail or travel applications. Or perhaps there will be technical reasons such
as moving parts of your application to another platform. As much as we try to mini‐
mize these situations, they still happen. But when they do, you won’t need to upset the
entire apple cart to scale back.
High Availability and Fault Tolerance
In general architecture terms, the availability of a system is measured according to its
ability to fulfill requests. But computers can experience all manner of failure, from
hardware component failure to network disruption to corruption. Any computer is
susceptible to these kinds of failure. There are of course very sophisticated (and often
prohibitively expensive) computers that can themselves mitigate many of these cir‐
cumstances, as they include internal hardware redundancies and facilities to send
notification of failure events and hot swap components. But anyone can accidentally
The Cassandra Elevator Pitch | 19
break an Ethernet cable, and catastrophic events can beset a single data center. So for
a system to be highly available, it must typically include multiple networked comput‐
ers, and the software they’re running must then be capable of operating in a cluster
and have some facility for recognizing node failures and failing over requests to
another part of the system.
Cassandra is highly available. You can replace failed nodes in the cluster with no
downtime, and you can replicate data to multiple data centers to offer improved local
performance and prevent downtime if one data center experiences a catastrophe such
as fire or flood.
Tuneable Consistency
Consistency essentially means that a read always returns the most recently written
value. Consider two customers are attempting to put the same item into their shop‐
ping carts on an ecommerce site. If I place the last item in stock into my cart an
instant after you do, you should get the item added to your cart, and I should be
informed that the item is no longer available for purchase. This is guaranteed to hap‐
pen when the state of a write is consistent among all nodes that have that data.
But as we’ll see later, scaling data stores means making certain trade-offs between data
consistency, node availability, and partition tolerance. Cassandra is frequently called
eventually consistent,” which is a bit misleading. Out of the box, Cassandra trades
some consistency in order to achieve total availability. But Cassandra is more accu‐
rately termed “tuneably consistent,” which means it allows you to easily decide the
level of consistency you require, in balance with the level of availability.
Lets take a moment to unpack this, as the term “eventual consistency” has caused
some uproar in the industry. Some practitioners hesitate to use a system that is
described as “eventually consistent.
For detractors of eventual consistency, the broad argument goes something like this:
eventual consistency is maybe OK for social web applications where data doesn’t
really matter. After all, you’re just posting to Mom what little Billy ate for breakfast,
and if it gets lost, it doesnt really matter. But the data I have is actually really
important, and it’s ridiculous to think that I could allow eventual consistency in my
model.
Set aside the fact that all of the most popular web applications (Amazon, Facebook,
Google, Twitter) are using this model, and that perhaps theres something to it. Pre‐
sumably such data is very important indeed to the companies running these
applications, because that data is their primary product, and they are multibillion-
dollar companies with billions of users to satisfy in a sharply competitive world. It
may be possible to gain guaranteed, immediate, and perfect consistency throughout a
20 | Chapter 2: Introducing Cassandra
highly trafficked system running in parallel on a variety of networks, but if you want
clients to get their results sometime this year, it’s a very tricky proposition.
The detractors claim that some Big Data databases such as Cassandra have merely
eventual consistency, and that all other distributed systems have strict consistency. As
with so many things in the world, however, the reality is not so black and white, and
the binary opposition between consistent and not-consistent is not truly reflected in
practice. There are instead degrees of consistency, and in the real world they are very
susceptible to external circumstance.
Eventual consistency is one of several consistency models available to architects. Lets
take a look at these models so we can understand the trade-offs:
Strict consistency
This is sometimes called sequential consistency, and is the most stringent level of
consistency. It requires that any read will always return the most recently written
value. That sounds perfect, and its exactly what I’m looking for. I’ll take it! How‐
ever, upon closer examination, what do we find? What precisely is meant by
most recently written”? Most recently to whom? In one single-processor
machine, this is no problem to observe, as the sequence of operations is known
to the one clock. But in a system executing across a variety of geographically dis‐
persed data centers, it becomes much more slippery. Achieving this implies some
sort of global clock that is capable of timestamping all operations, regardless of
the location of the data or the user requesting it or how many (possibly disparate)
services are required to determine the response.
Causal consistency
This is a slightly weaker form of strict consistency. It does away with the fantasy
of the single global clock that can magically synchronize all operations without
creating an unbearable bottleneck. Instead of relying on timestamps, causal con‐
sistency instead takes a more semantic approach, attempting to determine the
cause of events to create some consistency in their order. It means that writes that
are potentially related must be read in sequence. If two different, unrelated oper‐
ations suddenly write to the same field, then those writes are inferred not to be
causally related. But if one write occurs after another, we might infer that they
are causally related. Causal consistency dictates that causal writes must be read in
sequence.
Weak (eventual) consistency
Eventual consistency means on the surface that all updates will propagate
throughout all of the replicas in a distributed system, but that this may take some
time. Eventually, all replicas will be consistent.
Eventual consistency becomes suddenly very attractive when you consider what is
required to achieve stronger forms of consistency.
The Cassandra Elevator Pitch | 21
1“Dynamo: Amazons Highly Distributed Key-Value Store, 207.
When considering consistency, availability, and partition tolerance, we can achieve
only two of these goals in a given distributed system, a trade-off known as the CAP
theorem (we explore this theorem in more depth in “Brewer’s CAP Theorem” on
page 23). At the center of the problem is data update replication. To achieve a strict
consistency, all update operations will be performed synchronously, meaning that
they must block, locking all replicas until the operation is complete, and forcing com‐
peting clients to wait. A side effect of such a design is that during a failure, some of
the data will be entirely unavailable. As Amazon CTO Werner Vogels puts it, “rather
than dealing with the uncertainty of the correctness of an answer, the data is made
unavailable until it is absolutely certain that it is correct.1
We could alternatively take an optimistic approach to replication, propagating
updates to all replicas in the background in order to avoid blowing up on the client.
The difficulty this approach presents is that now we are forced into the situation of
detecting and resolving conflicts. A design approach must decide whether to resolve
these conflicts at one of two possible times: during reads or during writes. That is, a
distributed database designer must choose to make the system either always readable
or always writable.
Dynamo and Cassandra choose to be always writable, opting to defer the complexity
of reconciliation to read operations, and realize tremendous performance gains. The
alternative is to reject updates amidst network and server failures.
In Cassandra, consistency is not an all-or-nothing proposition. We might more accu‐
rately term it “tuneable consistency” because the client can control the number of
replicas to block on for all updates. This is done by setting the consistency level
against the replication factor.
The replication factor lets you decide how much you want to pay in performance to
gain more consistency. You set the replication factor to the number of nodes in the
cluster you want the updates to propagate to (remember that an update means any
add, update, or delete operation).
The consistency level is a setting that clients must specify on every operation and that
allows you to decide how many replicas in the cluster must acknowledge a write oper‐
ation or respond to a read operation in order to be considered successful. Thats the
part where Cassandra has pushed the decision for determining consistency out to the
client.
So if you like, you could set the consistency level to a number equal to the replication
factor, and gain stronger consistency at the cost of synchronous blocking operations
that wait for all nodes to be updated and declare success before returning. This is not
22 | Chapter 2: Introducing Cassandra
often done in practice with Cassandra, however, for reasons that should be clear (it
defeats the availability goal, would impact performance, and generally goes against
the grain of why youd want to use Cassandra in the first place). So if the client sets
the consistency level to a value less than the replication factor, the update is consid‐
ered successful even if some nodes are down.
Brewer’s CAP Theorem
In order to understand Cassandras design and its label as an “eventually consistent”
database, we need to understand the CAP theorem. The CAP theorem is sometimes
called Brewer’s theorem after its author, Eric Brewer.
While working at the University of California at Berkeley, Eric Brewer posited his
CAP theorem in 2000 at the ACM Symposium on the Principles of Distributed Com‐
puting. The theorem states that within a large-scale distributed data system, there are
three requirements that have a relationship of sliding dependency:
Consistency
All database clients will read the same value for the same query, even given con‐
current updates.
Availability
All database clients will always be able to read and write data.
Partition tolerance
The database can be split into multiple machines; it can continue functioning in
the face of network segmentation breaks.
Brewer’s theorem is that in any given system, you can strongly support only two of
the three. This is analogous to the saying you may have heard in software develop‐
ment: “You can have it good, you can have it fast, you can have it cheap: pick two.
We have to choose between them because of this sliding mutual dependency. The
more consistency you demand from your system, for example, the less partition-
tolerant youre likely to be able to make it, unless you make some concessions around
availability.
The CAP theorem was formally proved to be true by Seth Gilbert and Nancy Lynch
of MIT in 2002. In distributed systems, however, it is very likely that you will have
network partitioning, and that at some point, machines will fail and cause others to
become unreachable. Networking issues such as packet loss or high latency are nearly
inevitable and have the potential to cause temporary partitions. This leads us to the
conclusion that a distributed system must do its best to continue operating in the face
of network partitions (to be partition tolerant), leaving us with only two real options
to compromise on: availability and consistency.
The Cassandra Elevator Pitch | 23
Figure 2-1 illustrates visually that there is no overlapping segment where all three are
obtainable.
Figure 2-1. CAP theorem indicates that you can realize only two of these properties at
once
It might prove useful at this point to see a graphical depiction of where each of the
non-relational data stores well look at falls within the CAP spectrum. The graphic in
Figure 2-2 was inspired by a slide in a 2009 talk given by Dwight Merriman, CEO and
founder of MongoDB, to the MySQL User Group in New York City. However, we
have modified the placement of some systems based on research.
Figure 2-2. Where dierent databases appear on the CAP continuum
Figure 2-2 shows the general focus of some of the different databases we discuss in
this chapter. Note that placement of the databases in this chart could change based on
24 | Chapter 2: Introducing Cassandra
configuration. As Stu Hood points out, a distributed MySQL database can count as a
consistent system only if youre using Googles synchronous replication patches;
otherwise, it can only be available and partition tolerant (AP).
It’s interesting to note that the design of the system around CAP placement is inde‐
pendent of the orientation of the data storage mechanism; for example, the CP edge is
populated by graph databases and document-oriented databases alike.
In this depiction, relational databases are on the line between consistency and availa‐
bility, which means that they can fail in the event of a network failure (including a
cable breaking). This is typically achieved by defining a single master server, which
could itself go down, or an array of servers that simply dont have sufficient mecha‐
nisms built in to continue functioning in the case of network partitions.
Graph databases such as Neo4J and the set of databases derived at least in part from
the design of Googles Bigtable database (such as MongoDB, HBase, Hypertable, and
Redis) all are focused slightly less on availability and more on ensuring consistency
and partition tolerance.
Finally, the databases derived from Amazons Dynamo design include Cassandra,
Project Voldemort, CouchDB, and Riak. These are more focused on availability and
partition tolerance. However, this does not mean that they dismiss consistency as
unimportant, any more than Bigtable dismisses availability. According to the Bigtable
paper, the average percentage of server hours that “some data” was unavailable is
0.0047% (section 4), so this is relative, as were talking about very robust systems
already. If you think of each of these letters (C, A, P) as knobs you can tune to arrive
at the system you want, Dynamo derivatives are intended for employment in the
many use cases where “eventual consistency” is tolerable and where “eventual” is a
matter of milliseconds, read repairs mean that reads will return consistent values, and
you can achieve strong consistency if you want to.
So what does it mean in practical terms to support only two of the three facets of
CAP?
CA To primarily support consistency and availability means that you’re likely using
two-phase commit for distributed transactions. It means that the system will
block when a network partition occurs, so it may be that your system is limited to
a single data center cluster in an attempt to mitigate this. If your application
needs only this level of scale, this is easy to manage and allows you to rely on
familiar, simple structures.
CP To primarily support consistency and partition tolerance, you may try to
advance your architecture by setting up data shards in order to scale. Your data
The Cassandra Elevator Pitch | 25
will be consistent, but you still run the risk of some data becoming unavailable if
nodes fail.
AP To primarily support availability and partition tolerance, your system may return
inaccurate data, but the system will always be available, even in the face of net‐
work partitioning. DNS is perhaps the most popular example of a system that is
massively scalable, highly available, and partition tolerant.
Note that this depiction is intended to offer an overview that helps draw distinctions
between the broader contours in these systems; it is not strictly precise. For example,
its not entirely clear where Googles Bigtable should be placed on such a continuum.
The Google paper describes Bigtable as “highly available,” but later goes on to say that
if Chubby (the Bigtable persistent lock service) “becomes unavailable for an extended
period of time [caused by Chubby outages or network issues], Bigtable becomes
unavailable” (section 4). On the matter of data reads, the paper says that “we do not
consider the possibility of multiple copies of the same data, possibly in alternate
forms due to views or indices.” Finally, the paper indicates that “centralized control
and Byzantine fault tolerance are not Bigtable goals” (section 10). Given such variable
information, you can see that determining where a database falls on this sliding scale
is not an exact science.
An Updated Perspective on CAP
In February 2012, Eric Brewer provided an updated perspective on his CAP theorem
in the article CAP Twelve Years Later: How the ‘Rules’ Have Changed in IEEE’s
Computer. Brewer now describes the “2 out of 3” axiom as somewhat misleading. He
notes that designers only need sacrifice consistency or availability in the presence of
partitions, and that advances in partition recovery techniques have made it possible
for designers to achieve high levels of both consistency and availability.
These advances in partition recovery certainly would include Cassandras usage of
mechanisms such as hinted handoff and read repair. We’ll explore these in Chapter 6.
However, it is important to recognize that these partition recovery mechanisms are
not infallible. There is still immense value in Cassandras tuneable consistency, allow‐
ing Cassandra to function effectively in a diverse set of deployments in which it is not
possible to completely prevent partitions.
Row-Oriented
Cassandras data model can be described as a partitioned row store, in which data is
stored in sparse multidimensional hashtables. “Sparse” means that for any given row
you can have one or more columns, but each row doesn’t need to have all the same
columns as other rows like it (as in a relational model). “Partitioned” means that each
26 | Chapter 2: Introducing Cassandra
row has a unique key which makes its data accessible, and the keys are used to dis‐
tribute the rows across multiple data stores.
Row-Oriented Versus Column-Oriented
Cassandra has frequently been referred to as a “column-oriented
database, which has proved to be the source of some confusion. A
column-oriented database is one in which the data is actually
stored by columns, as opposed to relational databases, which store
data in rows. Part of the confusion that occurs in classifying data‐
bases is that there can be a difference between the API exposed by
the database and the underlying storage on disk. So Cassandra is
not really column-oriented, in that its data store is not organized
primarily around columns.
In the relational storage model, all of the columns for a table are defined beforehand
and space is allocated for each column whether it is populated or not. In contrast,
Cassandra stores data in a multidimensional, sorted hash table. As data is stored in
each column, it is stored as a separate entry in the hash table. Column values are
stored according to a consistent sort order, omitting columns that are not populated,
which enables more efficient storage and query processing. We’ll examine Cassandras
data model in more detail in Chapter 4.
Is Cassandra “Schema-Free”?
In its early versions. Cassandra was faithful to the original Bigtable whitepaper in
supporting a “schema-free” data model in which new columns can be defined dynam‐
ically. Schema-free databases such as Bigtable and MongoDB have the advantage of
being very extensible and highly performant in accessing large amounts of data. The
major drawback of schema-free databases is the difficulty in determining the meaning
and format of data, which limits the ability to perform complex queries. These disad‐
vantages proved a barrier to adoption for many, especially as startup projects which
benefitted from the initial flexibility matured into more complex enterprises involv‐
ing multiple developers and administrators.
The solution for those users was the introduction of the Cassandra Query Language
(CQL), which provides a way to define schema via a syntax similar to the Structured
Query Language (SQL) familiar to those coming from a relational background. Ini‐
tially, CQL was provided as another interface to Cassandra alongside the schema-free
interface based on the Apache Thrift project. During this transitional phase, the term
Schema-optional” was used to describe that data models could be defined by schema
using CQL, but could also be dynamically extended to add new columns via the
Thrift API. During this period, the underlying data storage continued to be based on
the Bigtable model.
The Cassandra Elevator Pitch | 27
Starting with the 3.0 release, the Thrift-based API that supported dynamic column
creation has been deprecated, and Cassandras underlying storage has been re-
implemented to more closely align with CQL. Cassandra does not entirely limit the
ability to dynamically extend the schema on the fly, but the way it works is signifi‐
cantly different. CQL collections such as lists, sets, and especially maps provide the
ability to add content in a less structured form that can be leveraged to extend an
existing schema. CQL also provides the ability to change the type of columns in cer‐
tain instances, and facilities to support the storage of JSON-formatted text.
So perhaps the best way to describe Cassandras current posture is that it supports
flexible schema.
High Performance
Cassandra was designed specifically from the ground up to take full advantage of
multiprocessor/multi-core machines, and to run across many dozens of these
machines housed in multiple data centers. It scales consistently and seamlessly to
hundreds of terabytes. Cassandra has been shown to perform exceptionally well
under heavy load. It consistently can show very fast throughput for writes per second
on basic commodity computers, whether physical hardware or virtual machines. As
you add more servers, you can maintain all of Cassandras desirable properties
without sacrificing performance.
Where Did Cassandra Come From?
The Cassandra data store is an open source Apache project. Cassandra originated at
Facebook in 2007 to solve its inbox search problem—the company had to deal with
large volumes of data in a way that was difficult to scale with traditional methods.
Specifically, the team had requirements to handle huge volumes of data in the form of
message copies, reverse indices of messages, and many random reads and many
simultaneous random writes.
The team was led by Jeff Hammerbacher, with Avinash Lakshman, Karthik Rangana‐
than, and Facebook engineer on the Search Team Prashant Malik as key engineers.
The code was released as an open source Google Code project in July 2008. During its
tenure as a Google Code project in 2008, the code was updatable only by Facebook
engineers, and little community was built around it as a result. So in March 2009, it
was moved to an Apache Incubator project, and on February 17, 2010, it was voted
into a top-level project. On the Apache Cassandra Wiki, you can find a list of the
committers, many of whom have been with the project since 2010/2011. The commit‐
ters represent companies including Twitter, LinkedIn, Apple, as well as independent
developers.
28 | Chapter 2: Introducing Cassandra
The Paper that Introduced Cassandra to the World
A Decentralized Structured Storage System by Facebooks Laksh‐
man and Malik was a central paper on Cassandra. An updated
commentary on this paper was provided by Jonathan Ellis corre‐
sponding to the 2.0 release, noting changes to the technology since
the transition to Apache. We’ll unpack some of these changes in
more detail in “Release History” on page 30.
How Did Cassandra Get Its Name?
In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba
of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the
future. But when she refused his amorous advances, he cursed her such that she
would still be able to accurately predict everything that would happen—but no one
would believe her. Cassandra foresaw the destruction of her city of Troy, but was
powerless to stop it. The Cassandra distributed database is named for her. We specu‐
late that it is also named as kind of a joke on the Oracle at Delphi, another seer for
whom a database is named.
As commercial interest in Cassandra grew, the need for production support became
apparent. Jonathan Ellis, the Apache Project Chair for Cassandra, and his colleague
Matt Pfeil formed a services company called DataStax (originally known as Riptano)
in April of 2010. DataStax has provided leadership and support for the Cassandra
project, employing several Cassandra committers.
DataStax provides free products including Cassandra drivers for various languages
and tools for development and administration of Cassandra. Paid product offerings
include enterprise versions of the Cassandra server and tools, integrations with other
data technologies, and product support. Unlike some other open source projects that
have commercial backing, changes are added first to the Apache open source project,
and then rolled into the commercial offering shortly after each Apache release.
DataStax also provides the Planet Cassandra website as a resource to the Cassandra
community. This site is a great location to learn about the ever-growing list of compa‐
nies and organizations that are using Cassandra in industry and academia. Industries
represented run the gamut: financial services, telecommunications, education, social
media, entertainment, marketing, retail, hospitality, transportation, healthcare,
energy, philanthropy, aerospace, defense, and technology. Chances are that you will
find a number of case studies here that are relevant to your needs.
Where Did Cassandra Come From? | 29
Release History
Now that weve learned about the people and organizations that have shaped Cassan‐
dra, lets take a look at how Cassandra has matured through its various releases since
becoming an official Apache project. If you’re new to Cassandra, dont worry if some
of these concepts and terms are new to you—we’ll dive into them in more depth in
due time. You can return to this list later to get a sense of the trajectory of how Cas‐
sandra has matured over time and its future directions. If youve used Cassandra in
the past, this summary will give you a quick primer on what’s changed.
Performance and Reliability Improvements
This list focuses primarily on features that have been added over
the course of Cassandras lifespan. This is not to discount the steady
and substantial improvements in reliability and read/write perfor‐
mance.
Release 0.6
This was the first release after Cassandra graduated from the Apache Incubator
to a top-level project. Releases in this series ran from 0.6.0 in April 2010 through
0.6.13 in April 2011. Features in this series included:
Integration with Apache Hadoop, allowing easy data retrieval from Cassan‐
dra via MapReduce
Integrated row caching, which helped eliminate the need for applications to
deploy other caching technologies alongside Cassandra
Release 0.7
Releases in this series ran from 0.7.0 in January 2011 through 0.7.10 in October
2011. Key features and improvements included:
Secondary indexes—that is, indexes on non-primary columns
Support for large rows, containing up to two billion columns
Online schema changes, including adding, renaming, and removing keyspa‐
ces and column families in live clusters without a restart, via the Thrift API
Expiring columns, via specification of a time-to-live (TTL) per column
The NetworkTopologyStrategy was introduced to support multi-data center
deployments, allowing a separate replication factor per data center, per key‐
space
Configuration files were converted from XML to the more readable YAML
format
30 | Chapter 2: Introducing Cassandra
Release 0.8
This release began a major shift in Cassandra APIs with the introduction of CQL.
Releases in this series ran from 0.8.0 in June 2011 through 0.8.10 in February
2012. Key features and improvements included:
Distributed counters were added as a new data type that incrementally
counts up or down
The sstableloader tool was introduced to support bulk loading of data into
Cassandra clusters
An off-heap row cache was provided to allow usage of native memory
instead of the JVM heap
Concurrent compaction allowed for multi-threaded execution and throttling
control of SSTable compaction
Improved memory configuration parameters allowed more flexible control
over the size of memtables
Release 1.0
In keeping with common version numbering practice, this is officially the first
production release of Cassandra, although many companies were using Cassan‐
dra in production well before this point. Releases in this series ran from 1.0.0 in
October 2011 through 1.0.12 in October 2012. In keeping with the focus on pro‐
duction readiness, improvements focused on performance and enhancements to
existing features:
CQL 2 added several improvements, including the ability to alter tables and
columns, support for counters and TTL, and the ability to retrieve the count
of items matching a query
The leveled compaction strategy was introduced as an alternative to the orig‐
inal size-tiered compaction strategy, allowing for faster reads at the expense
of more I/O on writes
Compression of SSTable files, configurable on a per-table level
Release 1.1
Releases in this series ran from 1.1.0 in April 2011 through 1.1.12 in May 2013.
Key features and improvements included:
CQL 3 added the timeuuid type, and the ability to create tables with com‐
pound primary keys including clustering keys. Clustering keys support
order by” semantics to allow sorting. This was a much anticipated feature
that allowed the creation of “wide rows” via CQL.
Support for importing and exporting comma-separated values (CSV) files
via cqlsh
Flexible data storage settings allow the storage of data in SSDs or magnetic
storage, selectable by table
Where Did Cassandra Come From? | 31
The schema update mechanism was reimplemented to allow concurrent
changes and improve reliability. Schema are now stored in tables in the
system keyspace.
Caching was updated to provide more straightforward configuration of
cache sizes
A utility to leverage the bulk loader from Hadoop, allowing efficient export
of data from Hadoop to Cassandra
Row-level isolation was added to assure that when multiple columns are
updated on a write, it is not possible for a read to get a mix of new and old
column values
Release 1.2
Releases in this series ran from 1.2.0 in January 2013 through 1.2.19 in Septem‐
ber 2014. Notable features and improvements included:
CQL 3 added collection types (sets, lists, and maps), prepared statements,
and a binary protocol as a replacement for Thrift
Virtual nodes spread data more evenly across the nodes in a cluster, improv‐
ing performance, especially when adding or replacing nodes
Atomic batches ensure that all writes in a batch succeed or fail as a unit
The system keyspace contains the local table containing information about
the local node and the peers table describing other nodes in the cluster
Request tracing can be enabled to allow clients to see the interactions
between nodes for reads and writes. Tracing provides valuable insight into
what is going on behind the scenes and can help developers understand the
implications of various table design options.
Most data structures were moved off of the JVM heap to native memory
Disk failure policies allow flexible configuration of behaviors, including
removing a node from the cluster on disk failure or making a best effort to
access data from memory, even if stale
Release 2.0
The 2.0 release was an especially significant milestone in the history of Cassan‐
dra, as it marked the culmination of the CQL capability, as well as a new level of
production maturity. This included significant performance improvements and
cleanup of the codebase to pay down 5 years of accumulated technical debt.
Releases in this series ran from 2.0.0 in September 2013 through 2.0.16 in June
2015. Highlights included:
Lightweight transactions were added using the Paxos consensus protocol
CQL3 improvements included the addition of DROP semantics on the
ALTER command, conditional schema modifications (IF EXISTS, IF NOT
32 | Chapter 2: Introducing Cassandra
EXISTS), and the ability to create secondary indexes on primary key col‐
umns
Native CQL protocol improvements began to make CQL demonstrably more
performant than Thrift
A prototype implementation of triggers was added, providing an extensible
way to react to write operations. Triggers can be implemented in any JVM
language.
Java 7 was required for the first time
Static columns were added in the 2.0.6 release
Release 2.1
Releases in this series ran from 2.1.0 in September 2014 through 2.1.8 in June
2015. Key features and improvements included:
CQL3 added user-defined types (UDT), and the ability to create secondary
indexes on collections
Configuration options were added to move memtable data off heap to native
memory
Row caching was made more configurable to allow setting the number of
cached rows per partition
Counters were re-implemented to improve performance and reliability
Release 2.2
The original release plan outlined by the Cassandra developers did not contain a
2.2 release. The intent was to do some major “under the covers” rework for a 3.0
release to follow the 2.1 series. However, due to the amount and complexity of
the changes involved, it was decided to release some of completed features sepa‐
rately in order to make them available while allowing some of the more complex
changes time to mature. Release 2.2.0 became available in July 2015, and support
releases are scheduled through fall 2016. Notable features and improvements in
this series included:
CQL3 improvements, including support for JSON-formatted input/output
and user-defined functions
With this release, Windows became a fully supported operating system.
Although Cassandra still performs best on Linux systems, improvements in
file I/O and scripting have made it much easier to run Cassandra on Win‐
dows.
The Date Tiered Compaction Strategy (DTCS) was introduced to improve
performance of time series data
Where Did Cassandra Come From? | 33
Role-based access control (RBAC) was introduced to allow more flexible
management of authorization
Tick-Tock Releases
In June 2015, the Cassandra team announced plans to adopt a tick-tock release model
as part of increased emphasis on improving agility and the quality of releases.
The tick-tock release model popularized by Intel was originally intended for chip
design, and referred to changing chip architecture and production processes in alter‐
nate builds. You can read more about this approach at http://www.intel.com/
content/www/us/en/silicon-innovations/intel-tick-tock-model-general.html.
The tick-tock approach has proven to be useful in software development as well.
Starting with the Cassandra 3.0 release, even-numbered releases are feature releases
with some bug fixes, while odd-numbered releases are focused on bug fixes, with the
goal of releasing each month.
Release 3.0 (Feature release - November 2015)
The underlying storage engine was rewritten to more closely match CQL con‐
structs
Support for materialized views (sometimes also called global indexes) was added
Java 8 is now the supported version
The Thrift-based command-line interface (CLI) was removed
Release 3.1 (Bug x release - December 2015)
Release 3.2 (Feature release - January 2016)
The way in which Cassandra allocates SSTable file storage across multiple disk in
just a bunch of disks” or JBOD configurations was reworked to improve reliabil‐
ity and performance and to enable backup and restore of individual disks
The ability to compress and encrypt hints was added
Release 3.3 (Bug x release - February 2016)
Release 3.4 (Feature release - March 2016)
SSTableAttachedSecondaryIndex, or “SASI” for short, is an implementation of
Cassandras SecondaryIndex interface that can be used as an alternative to the
existing implementations.
Release 3.5 (Bug x release - April 2016)
The 4.0 release series is scheduled to begin in Fall 2016.
34 | Chapter 2: Introducing Cassandra
As you will have noticed, the trends in these releases include:
Continuous improvement in the capabilities of CQL
A growing list of clients for popular languages built on a common set of
metaphors
Exposure of configuration options to tune performance and optimize resource
usage
Performance and reliability improvements, and reduction of technical debt
Supported Releases
There are two officially supported releases of Cassandra at any one
time: the latest stable release, which is considered appropriate for
production, and the latest development release. You can see the
officially supported versions on the projects download page.
Users of Cassandra are strongly recommended to track the latest
stable release in production. Anecdotally, a substantial majority of
issues and questions posted to the Cassandra-users email list per‐
tain to releases that are no longer supported. Cassandra experts are
very gracious in answering questions and diagnosing issues with
these unsupported releases, but more often than not the recom‐
mendation is to upgrade as soon as possible to a release that
addresses the issue.
Is Cassandra a Good Fit for My Project?
We have now unpacked the elevator pitch and have an understanding of Cassandras
advantages. Despite Cassandras sophisticated design and smart features, it is not the
right tool for every job. So in this section, lets take a quick look at what kind of
projects Cassandra is a good fit for.
Large Deployments
You probably dont drive a semitruck to pick up your dry cleaning; semis aren’t well
suited for that sort of task. Lots of careful engineering has gone into Cassandras high
availability, tuneable consistency, peer-to-peer protocol, and seamless scaling, which
are its main selling points. None of these qualities is even meaningful in a single-node
deployment, let alone allowed to realize its full potential.
There are, however, a wide variety of situations where a single-node relational data‐
base is all we may need. So do some measuring. Consider your expected traffic,
throughput needs, and SLAs. There are no hard-and-fast rules here, but if you expect
that you can reliably serve traffic with an acceptable level of performance with just a
Is Cassandra a Good Fit for My Project? | 35
few relational databases, it might be a better choice to do so, simply because RDBMSs
are easier to run on a single machine and are more familiar.
If you think youll need at least several nodes to support your efforts, however, Cas‐
sandra might be a good fit. If your application is expected to require dozens of nodes,
Cassandra might be a great fit.
Lots of Writes, Statistics, and Analysis
Consider your application from the perspective of the ratio of reads to writes. Cassan‐
dra is optimized for excellent throughput on writes.
Many of the early production deployments of Cassandra involve storing user activity
updates, social network usage, recommendations/reviews, and application statistics.
These are strong use cases for Cassandra because they involve lots of writing with less
predictable read operations, and because updates can occur unevenly with sudden
spikes. In fact, the ability to handle application workloads that require high perfor‐
mance at significant write volumes with many concurrent client threads is one of the
primary features of Cassandra.
According to the project wiki, Cassandra has been used to create a variety of applica‐
tions, including a windowed time-series store, an inverted index for document
searching, and a distributed job priority queue.
Geographical Distribution
Cassandra has out-of-the-box support for geographical distribution of data. You can
easily configure Cassandra to replicate data across multiple data centers. If you have a
globally deployed application that could see a performance benefit from putting the
data near the user, Cassandra could be a great fit.
Evolving Applications
If your application is evolving rapidly and youre in “startup mode,” Cassandra might
be a good fit given its support for flexible schemas. This makes it easy to keep your
database in step with application changes as you rapidly deploy.
Getting Involved
The strength and relevance of any technology depend on the investment of individu‐
als in a vibrant community environment. Thankfully, the Cassandra community is
active and healthy, offering a number of ways for you to participate. We’ll start with a
few steps in Chapter 3 such as downloading Cassandra and building from the source.
Here are a few other ways to get involved:
36 | Chapter 2: Introducing Cassandra
Chat
Many of the Cassandra developers and community members hang out in the
#cassandra channel on webchat.freenode.net. This informal environment is a
great place to get your questions answered or offer up some answers of your own.
Mailing lists
The Apache project hosts several mailing lists to which you can subscribe to learn
about various topics of interest:
user@cassandra.apache.org provides a general discussion list for users and is
frequently used by new users or those needing assistance.
dev@cassandra.apache.org is used by developers to discuss changes, prioritize
work, and approve releases.
client-dev@cassandra.apache.org is used for discussion specific to develop‐
ment of Cassandra clients for various programming languages.
commits@cassandra.apache.org tracks Cassandra code commits. This is a
fairly high volume list and is primarily of interest to committers.
Releases are typically announced to both the developer and user mailing lists.
Issues
If you encounter issues using Cassandra and feel you have discovered a defect,
you should feel free to submit an issue to the Cassandra JIRA. In fact, users who
identify defects on the user@cassandra.apache.org list are frequently encouraged
to create JIRA issues.
Blogs
The DataStax developer blog features posts on using Cassandra, announcements
of Apache Cassandra and DataStax product releases, as well as occasional deep-
dive technical articles on Cassandra implementation details and features under
development. The Planet Cassandra blog provides similar technical content, but
has a greater focus on profiling companies using Cassandra.
The Apache Cassandra Wiki provides helpful articles on getting started and con‐
figuration, but note that some content may not be fully up to date with current
releases.
Meetups
A meetup group is a local community of people who meet face to face to discuss
topics of common interest. These groups provide an excellent opportunity to
network, learn, or share your knowledge by offering a presentation of your own.
There are Cassandra meetups on every continent, so you stand a good chance of
being able to find one in your area.
Getting Involved | 37
Training and conferences
DataStax offers online training, and in June 2015 announced a partnership with
O’Reilly Media to produce Cassandra certifications. DataStax also hosts annual
Cassandra Summits in locations around the world.
A Marketable Skill
There continues to be increased demand for Cassandra developers
and administrators. A 2015 Dice.com salary survey placed Cassan‐
dra as the second most highly compensated skill set.
Summary
In this chapter, weve taken an introductory look at Cassandras defining characteris‐
tics, history, and major features. We have learned about the Cassandra user commu‐
nity and how companies are using Cassandra. Now were ready to start getting some
hands-on experience.
38 | Chapter 2: Introducing Cassandra
CHAPTER 3
Installing Cassandra
For those among us who like instant gratification, well start by installing Cassandra.
Because Cassandra introduces a lot of new vocabulary, there might be some unfami‐
liar terms as we walk through this. Thats OK; the idea here is to get set up quickly in
a simple configuration to make sure everything is running properly. This will serve as
an orientation. Then, we’ll take a step back and understand Cassandra in its larger
context.
Installing the Apache Distribution
Cassandra is available for download from the Web at http://cassandra.apache.org. Just
click the link on the home page to download a version as a gzipped tarball. Typically
two versions of Cassandra are provided. The latest release is recommended for those
starting new projects not yet in production. The most stable release is the one recom‐
mended for production usage. For all releases, the prebuilt binary is named apache-
cassandra-x.x.x-bin.tar.gz, where x.x.x represents the version number. The download
is around 23MB.
Extracting the Download
The simplest way to get started is to download the prebuilt binary. You can unpack
the compressed file using any regular ZIP utility. On Unix-based systems such as
Linux or MacOS, GZip extraction utilities should be preinstalled; on Windows, you’ll
need to get a program such as WinZip, which is commercial, or something like 7-Zip,
which is freeware.
Open your extracting program. You might have to extract the ZIP file and the TAR
file in separate steps. Once you have a folder on your filesystem called apache-
cassandra-x.x.x, youre ready to run Cassandra.
39
What’s In There?
Once you decompress the tarball, you’ll see that the Cassandra binary distribution
includes several files and directories.
The files include the NEWS.txt file, which includes the release notes describing fea‐
tures included in the current and prior releases, and the CHANGES.txt, which is simi‐
lar but focuses on bug fixes. Youll want to make sure to review these files whenever
you are upgrading to a new version so you know what changes to expect.
Lets take a moment to look around in the directories and see what we have.
bin This directory contains the executables to run Cassandra as well as clients,
including the query language shell (cqlsh) and the command-line interface
(CLI) client. It also has scripts to run the nodetool, which is a utility for inspect‐
ing a cluster to determine whether it is properly configured, and to perform a
variety of maintenance operations. We look at nodetool in depth later. The direc‐
tory also contains several utilities for performing operations on SSTables, includ‐
ing listing the keys of an SSTable (sstablekeys), bulk extraction and restoration
of SSTable contents (sstableloader), and upgrading SSTables to a new version
of Cassandra (sstableupgrade).
confThis directory contains the files for configuring your Cassandra instance. The
required configuration files include: the cassandra.yaml file, which is the primary
configuration for running Cassandra; and the logback.xml file, which lets you
change the logging settings to suit your needs. Additional files can optionally be
used to configure the network topology, archival and restore commands, and
triggers. We see how to use these configuration files when we discuss configura‐
tion in Chapter 7.
interface
This directory contains a single file, called cassandra.thri. This file defines a leg‐
acy Remote Procedure Call (RPC) API based on the Thrift syntax. The Thrift
interface was used to create clients in Java, C++, PHP, Ruby, Python, Perl, and C#
prior to the creation of CQL. The Thrift API has been officially marked as depre‐
cated in the 3.2 release and will be deleted in the 4.0 release.
javadoc
This directory contains a documentation website generated using Javas JavaDoc
tool. Note that JavaDoc reflects only the comments that are stored directly in the
Java code, and as such does not represent comprehensive documentation. Its
helpful if you want to see how the code is laid out. Moreover, Cassandra is a
wonderful project, but the code contains relatively few comments, so you might
40 | Chapter 3: Installing Cassandra
find the JavaDoc’s usefulness limited. It may be more fruitful to simply read the
class files directly if youre familiar with Java. Nonetheless, to read the JavaDoc,
open the javadoc/index.html file in a browser.
lib This directory contains all of the external libraries that Cassandra needs to run.
For example, it uses two different JSON serialization libraries, the Google collec‐
tions project, and several Apache Commons libraries.
pylib
This directory contains Python libraries that are used by cqlsh.
toolsThis directory contains tools that are used to maintain your Cassandra nodes.
We’ll look at these tools in Chapter 11.
Additional Directories
If you’ve already run Cassandra using the default configuration,
you will notice two additional directories under the main Cassan‐
dra directory: data and log. We’ll discuss the contents of these
directories momentarily.
Building from Source
Cassandra uses Apache Ant for its build scripting language and Maven for depend‐
ency management.
Downloading Ant
You can download Ant from http://ant.apache.org. You don’t need
to download Maven separately just to build Cassandra.
Building from source requires a complete Java 7 or 8 JDK, not just the JRE. If you see
a message about how Ant is missing tools.jar, either you dont have the full JDK or
youre pointing to the wrong path in your environment variables. Maven downloads
files from the Internet so if your connection is invalid or Maven cannot determine the
proxy, the build will fail.
Building from Source | 41
Downloading Development Builds
If you want to download the most cutting-edge builds, you can get
the source from Jenkins, which the Cassandra project uses as its
Continuous Integration tool. See http://cassci.datastax.com for the
latest builds and test coverage information.
If you are a Git fan, you can get a read-only trunk version of the Cassandra source
using this command:
$ git clone git://git.apache.org/cassandra.git
What Is Git?
Git is a source code management system created by Linus Torvalds
to manage development of the Linux kernel. Its increasingly popu‐
lar and is used by projects such as Android, Fedora, Ruby on Rails,
Perl, and many Cassandra clients (as well see in Chapter 8). If
you’re on a Linux distribution such as Ubuntu, it couldnt be easier
to get Git. At a console, just type >apt-get install git and it will be
installed and ready for commands. For more information, visit
http://git-scm.com.
Because Maven takes care of all the dependencies, it’s easy to build Cassandra once
you have the source. Just make sure youre in the root directory of your source down‐
load and execute the ant program, which will look for a file called build.xml in the
current directory and execute the default build target. Ant and Maven take care of the
rest. To execute the Ant program and start compiling the source, just type:
$ ant
Thats it. Maven will retrieve all of the necessary dependencies, and Ant will build the
hundreds of source files and execute the tests. If all went well, you should see a BUILD
SUCCESSFUL message. If all did not go well, make sure that your path settings are all
correct, that you have the most recent versions of the required programs, and that
you downloaded a stable Cassandra build. You can check the Jenkins report to make
sure that the source you downloaded actually can compile.
More Build Output
If you want to see detailed information on what is happening dur‐
ing the build, you can pass Ant the -v option to cause it to output
verbose details regarding each operation it performs.
42 | Chapter 3: Installing Cassandra
Additional Build Targets
To compile the server, you can simply execute ant as shown previously. This com‐
mand executes the default target, jar. This target will perform a complete build
including unit tests and output a file into the build directory called apache-cassandra-
x.x.x.jar.
If you want to see a list of all of the targets supported by the build file, simply pass
Ant the -p option to get a description of each target. Here are a few others you might
be interested in:
test Users will probably find this the most helpful, as it executes the battery of unit
tests. You can also check out the unit test sources themselves for some useful
examples of how to interact with Cassandra.
stress-build
This target builds the Cassandra stress tool, which we will try out in Chapter 12.
clean
This target removes locally created artifacts such as generated source files and
classes and unit test results. The related target realclean performs a clean and
additionally removes the Cassandra distribution JAR files and JAR files downloa‐
ded by Maven.
Running Cassandra
In earlier versions of Cassandra, before you could start the server there were some
required steps to edit configuration files and set environment variables. But the devel‐
opers have done a terrific job of making it very easy to start using Cassandra immedi‐
ately. Well note some of the available configuration options as we go.
Required Java Version
Cassandra requires a Java 7 or 8 JVM, preferably the latest stable
version. It has been tested on both the Open JDK and Oracles JDK.
You can check your installed Java version by opening a command
prompt and executing java -version. If you need a JDK, you can
get one at http://www.oracle.com/technetwork/java/javase/down‐
loads/index.html.
Running Cassandra | 43
On Windows
Once you have the binary or the source downloaded and compiled, youre ready to
start the database server.
Setting the JAVA_HOME environment variable is recommended. To do this on
Windows 7, click the Start button and then right-click on Computer. Click Advanced
System Settings, and then click the Environment Variables... button. Click New... to
create a new system variable. In the Variable Name field, type JAVA_HOME. In the Vari‐
able Value field, type the path to your Java installation. This is probably something
like C:\Program Files\Java\jre7 if running Java 7 or C:\Program Files\Java\jre1.8.0_25
if running Java 8.
Remember that if you create a new environment variable, you’ll need to reopen any
currently open terminals in order for the system to become aware of the new variable.
To make sure your environment variable is set correctly and that Cassandra can sub‐
sequently find Java on Windows, execute this command in a new terminal: echo
%JAVA_HOME%. This prints the value of your environment variable.
You can also define an environment variable called CASSANDRA_HOME that points to the
top-level directory where you have placed or built Cassandra, so you dont have to
pay as much attention to where you’re starting Cassandra from. This is useful for
other tools besides the database server, such as nodetool and cqlsh.
Once you’ve started the server for the first time, Cassandra will add directories to
your system to store its data files. The default configuration creates these directories
under the CASSANDRA_HOME directory.
dataThis directory is where Cassandra stores its data. By default, there are three sub-
directories under the data directory, corresponding to the various data files Cas‐
sandra uses: commitlog, data, and saved_caches. We’ll explore the significance of
each of these data files in Chapter 6. If youve been trying different versions of the
database and aren’t worried about losing data, you can delete these directories
and restart the server as a last resort.
logs This directory is where Cassandra stores its logs in a file called system.log. If you
encounter any difficulties, consult the log to see what might have happened.
44 | Chapter 3: Installing Cassandra
Data File Locations
The data file locations are configurable in the cassandra.yaml file,
located in the conf directory. The properties are called
data_file_directories, commit_log_directory, and saved_
caches_directory. Well discuss the recommended configuration
of these directories in Chapter 7.
On Linux
The process on Linux and other *nix operating systems (including Mac OS) is similar
to that on Windows. Make sure that your JAVA_HOME variable is properly set, accord‐
ing to the earlier description. Then, you need to extract the Cassandra gzipped tarball
using gunzip. Many users prefer to use the /var/lib directory for data storage. If you
are changing this configuration, you will need to edit the conf/cassandra.yaml file and
create the referenced directories for Cassandra to store its data and logs, making sure
to configure write permissions for the user that will be running Cassandra:
$ sudo mkdir -p /var/lib/cassandra
$ sudo chown -R username /var/lib/cassandra
Instead of username, substitute your own username, of course.
Starting the Server
To start the Cassandra server on any OS, open a command prompt or terminal win‐
dow, navigate to the <cassandra-directory>/bin where you unpacked Cassandra, and
run the command cassandra -f to start your server.
Starting Cassandra in the Foreground
Using the -f switch tells Cassandra to stay in the foreground
instead of running as a background process, so that all of the server
logs will print to standard out and you can see them in your termi‐
nal window, which is useful for testing. In either case, the logs will
append to the system.log file, described earlier.
In a clean installation, you should see quite a few log statements as the server gets
running. The exact syntax of logging statements will vary depending on the release
youre using, but there are a few highlights we can look for. If you search for “cassan‐
dra.yaml, you’ll quickly run into the following:
DEBUG [main] 2015-12-08 06:02:38,677 YamlConfigurationLoader.java:104 -
Loading settings from file:/.../conf/cassandra.yaml
INFO [main] 2015-12-08 06:02:38,781 YamlConfigurationLoader.java:179 -
Node configuration:[authenticator=AllowAllAuthenticator;
Running Cassandra | 45
authorizer=AllowAllAuthorizer; auto_bootstrap=false; auto_snapshot=true;
batch_size_fail_threshold_in_kb=50; ...
These log statements indicate the location of the cassandra.yaml file containing the
configured settings. The Node configuration statement lists out the settings from
the config file.
Now search for “JVM” and youll find something like this:
INFO [main] 2015-12-08 06:02:39,239 CassandraDaemon.java:436 -
JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.8.0_60
INFO [main] 2015-12-08 06:02:39,239 CassandraDaemon.java:437 -
Heap size: 519045120/519045120
These log statements provide information describing the JVM being used, including
memory settings.
Next, search for versions in use—“Cassandra version, “Thrift API Version, “CQL
supported versions”:
INFO [main] 2015-12-08 06:02:43,931 StorageService.java:586 -
Cassandra version: 3.0.0
INFO [main] 2015-12-08 06:02:43,932 StorageService.java:587 -
Thrift API version: 20.1.0
INFO [main] 2015-12-08 06:02:43,932 StorageService.java:588 -
CQL supported versions: 3.3.1 (default: 3.3.1)
We can also find statements where Cassandra is initializing internal data structures
such as caches:
INFO [main] 2015-12-08 06:02:43,633 CacheService.java:115 -
Initializing key cache with capacity of 24 MBs.
INFO [main] 2015-12-08 06:02:43,679 CacheService.java:137 -
Initializing row cache with capacity of 0 MBs
INFO [main] 2015-12-08 06:02:43,686 CacheService.java:166 -
Initializing counter cache with capacity of 12 MBs
If we search for terms like “JMX, “gossip, and “clients, we can find statements like
the following:
WARN [main] 2015-12-08 06:08:06,078 StartupChecks.java:147 -
JMX is not enabled to receive remote connections.
Please see cassandra-env.sh for more info.
INFO [main] 2015-12-08 06:08:18,463 StorageService.java:790 -
Starting up server gossip
INFO [main] 2015-12-08 06:02:48,171 Server.java:162 -
Starting listening for CQL clients on /127.0.0.1:9042 (unencrypted)
These log statements indicate the server is beginning to initiate communications with
other servers in the cluster and expose publicly available interfaces. By default, the
management interface via the Java Management Extensions (JMX) is disabled for
remote access. We’ll explore the management interface in Chapter 10.
46 | Chapter 3: Installing Cassandra
Finally, search for “state jump” and youll see the following:
INFO [main] 2015-12-08 06:02:47,351 StorageService.java:1936 -
Node /127.0.0.1 state jump to normal
Congratulations! Now your Cassandra server should be up and running with a new
single node cluster called Test Cluster listening on port 9160. If you continue to mon‐
itor the output, youll begin to see periodic output such as memtable flushing and
compaction, which well learn about soon.
Starting Over
The committers work hard to ensure that data is readable from one
minor dot release to the next and from one major version to the
next. The commit log, however, needs to be completely cleared out
from version to version (even minor versions).
If you have any previous versions of Cassandra installed, you may
want to clear out the data directories for now, just to get up and
running. If you’ve messed up your Cassandra installation and want
to get started cleanly again, you can delete the data folders.
Stopping Cassandra
Now that weve successfully started a Cassandra server, you may be wondering how to
stop it. You may have noticed the stop-server command in the bin directory. Lets
try running that command. Heres what youll see on Unix systems:
$ ./stop-server
please read the stop-server script before use
So you see that our server has not been stopped, but instead we are directed to read
the script. Taking a look inside with our favorite code editor, you’ll learn that the way
to stop Cassandra is to kill the JVM process that is running Cassandra. The file sug‐
gests a couple of different techniques by which you can identify the JVM process and
kill it.
The first technique is to start Cassandra using the -p option, which provides Cassan‐
dra with the name of a file to which it should write the process identifier (PID) upon
starting up. This is arguably the most straightforward approach to making sure we
kill the right process.
However, because we did not start Cassandra with the -p option, well need to find
the process ourselves and kill it. The script suggests using pgrep to locate processes
for the current user containing the term “cassandra”:
user=`whoami`
pgrep -u $user -f cassandra | xargs kill -9
Running Cassandra | 47
Stopping Cassandra on Windows
On Windows installations, you can find the JVM process and kill it
using the Task Manager.
Other Cassandra Distributions
The instructions we just reviewed showed us how to install the Apache distribution of
Cassandra. In addition to the Apache distribution, there are a couple of other ways to
get Cassandra:
DataStax Community Edition
This free distribution is provided by DataStax via the Planet Cassandra website.
Installation options for various platforms include RPM and Debian (Linux), MSI
(Windows), and a MacOS library. The community edition provides additional
tools, including an integrated development environment (IDE) known as Dev‐
Center, and the OpsCenter monitoring tool. Another useful feature is the ability
to configure Cassandra as an OS-managed service on Windows. Releases of the
community edition generally track the Apache releases, with availability soon
after each Apache release.
DataStax Enterprise Edition
DataStax also provides a fully supported version certified for production use. The
product line provides an integrated database platform with support for comple‐
mentary data technologies such as Hadoop and Apache Spark. We’ll explore
some of these integrations in Chapter 14.
Virtual machine images
A frequent model for deployment of Cassandra is to package one of the preced‐
ing distributions in a virtual machine image. For example, multiple such images
are available in the Amazon Web Services (AWS) Marketplace.
We’ll take a deeper look at several options for deploying Cassandra in production
environments, including cloud computing environments, in Chapter 14.
Selecting the right distribution will depend on your deployment environment; your
needs for scale, stability, and support; and your development and maintenance budg‐
ets. Having both open source and commercial deployment options provides the flexi‐
bility to make the right choice for your organization.
48 | Chapter 3: Installing Cassandra
Running the CQL Shell
Now that you have a Cassandra installation up and running, let’s give it a quick try to
make sure everything is set up properly. We’ll use the CQL shell (cqlsh) to connect to
our server and have a look around.
Deprecation of the CLI
If you’ve used Cassandra in releases prior to 3.0, you may also be
familiar with the command-line client interface known as
cassandra-cli. The CLI was removed in the 3.0 release because it
depends on the legacy Thrift API.
To run the shell, create a new terminal window, change to the Cassandra home direc‐
tory, and type the following command (you should see output similar to that shown
here):
$ bin/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.0 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh>
Because we did not specify a node to which we wanted to connect, the shell helpfully
checks for a node running on the local host, and finds the node we started earlier. The
shell also indicates that you’re connected to a Cassandra server cluster called “Test
Cluster”. That’s because this cluster of one node at localhost is set up for you by
default.
Renaming the Default Cluster
In a production environment, be sure to change the cluster name
to something more suitable to your application.
To connect to a specific node, specify the hostname and port on the command line.
For example, the following will connect to our local node:
$ bin/cqlsh localhost 9042
Another alternative for configuring the cqlsh connection is to set the environment
variables $CQLSH_HOST and $CQLSH_PORT. This approach is useful if you will be fre‐
quently connecting to a specific node on another host. The environment variables
will be overriden if you specify the host and port on the command line.
Running the CQL Shell | 49
Connection Errors
Have you run into an error like this while trying to connect to a
server?
Exception connecting to localhost/9160. Reason:
Connection refused.
If so, make sure that a Cassandra instance is started at that host and
port, and that you can ping the host youre trying to reach. There
may be firewall rules preventing you from connecting.
To see a complete list of the command-line options supported by cqlsh, type the
command cqlsh -help.
Basic cqlsh Commands
Lets take a quick tour of cqlsh to learn what kinds of commands you can send to the
server. We’ll see how to use the basic environment commands and how to do a
round-trip of inserting and retrieving some data.
Case in cqlsh
The cqlsh commands are all case insensitive. For our examples,
we’ll adopt the convention of uppercase to be consistent with the
way the shell describes its own commands in help topics and out‐
put.
cqlsh Help
To get help for cqlsh, type HELP or ? to see the list of available commands:
cqlsh> HELP
Documented shell commands:
===========================
CAPTURE COPY DESCRIBE EXPAND PAGING SOURCE
CONSISTENCY DESC EXIT HELP SHOW TRACING
CQL help topics:
================
ALTER CREATE_TABLE_TYPES PERMISSIONS
ALTER_ADD CREATE_USER REVOKE
ALTER_ALTER DATE_INPUT REVOKE_ROLE
ALTER_DROP DELETE SELECT
ALTER_RENAME DELETE_COLUMNS SELECT_COLUMNFAMILY
ALTER_USER DELETE_USING SELECT_EXPR
ALTER_WITH DELETE_WHERE SELECT_LIMIT
APPLY DROP SELECT_TABLE
ASCII_OUTPUT DROP_AGGREGATE SELECT_WHERE
50 | Chapter 3: Installing Cassandra
BEGIN DROP_COLUMNFAMILY TEXT_OUTPUT
BLOB_INPUT DROP_FUNCTION TIMESTAMP_INPUT
BOOLEAN_INPUT DROP_INDEX TIMESTAMP_OUTPUT
COMPOUND_PRIMARY_KEYS DROP_KEYSPACE TIME_INPUT
CREATE DROP_ROLE TRUNCATE
CREATE_AGGREGATE DROP_TABLE TYPES
CREATE_COLUMNFAMILY DROP_USER UPDATE
CREATE_COLUMNFAMILY_OPTIONS GRANT UPDATE_COUNTERS
CREATE_COLUMNFAMILY_TYPES GRANT_ROLE UPDATE_SET
CREATE_FUNCTION INSERT UPDATE_USING
CREATE_INDEX INT_INPUT UPDATE_WHERE
CREATE_KEYSPACE LIST USE
CREATE_ROLE LIST_PERMISSIONS UUID_INPUT
CREATE_TABLE LIST_ROLES
CREATE_TABLE_OPTIONS LIST_USERS
cqlsh Help Topics
You’ll notice that the help topics listed differ slightly from the
actual command syntax. The CREATE_TABLE help topic describes
how to use the syntax > CREATE TABLE ..., for example.
To get additional documentation about a particular command, type HELP <command>.
Many cqlsh commands may be used with no parameters, in which case they print
out the current setting. Examples include CONSISTENCY, EXPAND, and PAGING.
Describing the Environment in cqlsh
After connecting to your Cassandra instance Test Cluster, if you’re using the binary
distribution, an empty keyspace, or Cassandra database, is set up for you to test with.
To learn about the current cluster you’re working in, type:
cqlsh> DESCRIBE CLUSTER;
Cluster: Test Cluster
Partitioner: Murmur3Partitioner
...
For releases 3.0 and later, this command also prints out a list of token ranges owned
by each node in the cluster, which have been omitted here for brevity.
To see which keyspaces are available in the cluster, issue this command:
cqlsh> DESCRIBE KEYSPACES;
system_auth system_distributed system_schema
system system_traces
Initially this list will consist of several system keyspaces. Once you have created your
own keyspaces, they will be shown as well. The system keyspaces are managed inter‐
nally by Cassandra, and arent for us to put data into. In this way, these keyspaces are
Basic cqlsh Commands | 51
similar to the master and temp databases in Microsoft SQL Server. Cassandra uses
these keyspaces to store the schema, tracing, and security information. We’ll learn
more about these keyspaces in Chapter 6.
You can use the following command to learn the client, server, and protocol versions
in use:
cqlsh> SHOW VERSION;
[cqlsh 5.0.1 | Cassandra 3.0.0 | CQL spec 3.3.1 | Native protocol v4]
You may have noticed that this version info is printed out when cqlsh starts. There
are a variety of other commands with which you can experiment. For now, lets add
some data to the database and get it back out again.
Creating a Keyspace and Table in cqlsh
A Cassandra keyspace is sort of like a relational database. It defines one or more
tables or “column families.” When you start cqlsh without specifying a keyspace, the
prompt will look like this: cqlsh>, with no keyspace specified.
Lets create our own keyspace so we have something to write data to. In creating our
keyspace, there are some required options. To walk through these options, we could
use the command HELP CREATE_KEYSPACE, but instead we’ll use the helpful
command-completion features of cqlsh. Type the following and then hit the Tab key:
cqlsh> CREATE KEYSPACE my_keyspace WITH
When you hit the Tab key, cqlsh begins completing the syntax of our command:
cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': '
This is informing us that in order to specify a keyspace, we also need to specify a rep‐
lication strategy. Let’s Tab again to see what options we have:
cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': '
NetworkTopologyStrategy SimpleStrategy
OldNetworkTopologyStrategy
Now cqlsh is giving us three strategies to choose from. We’ll learn more about these
strategies in Chapter 6. For now, we will choose the SimpleStrategy by typing the
name. We’ll indicate were done with a closing quote and Tab again:
cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class':
'SimpleStrategy', 'replication_factor':
The next option were presented with is a replication factor. For the simple strategy,
this indicates how many nodes the data in this keyspace will be written to. For a pro‐
duction deployment, wed want copies of our data stored on multiple nodes, but
because were just running a single node at the moment, well ask for a single copy.
Lets specify a value of “1” and Tab again:
52 | Chapter 3: Installing Cassandra
cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class':
'SimpleStrategy', 'replication_factor': 1};
We see that cqlsh has now added a closing bracket, indicating weve completed all of
the required options. Lets complete our command with a semicolon and return, and
our keyspace will be created.
Keyspace Creation Options
For a production keyspace, we would probably never want to use a
value of 1 for the replication factor. There are additional options on
creating a keyspace depending on the replication strategy that is
chosen. The command completion feature will walk through the
different options.
Lets have a look at our keyspace using theDESCRIBE KEYSPACE command:
cqlsh> DESCRIBE KEYSPACE my_keyspace
CREATE KEYSPACE my_keyspace WITH replication = {'class':
'SimpleStrategy', 'replication_factor': '1'} AND
durable_writes = true;
We see that the table has been created with the SimpleStrategy, a replication_fac
tor of one, and durable writes. Notice that our keyspace is described in much the
same syntax that we used to create it, with one additional option that we did not spec‐
ify: durable_writes = true. Don’t worry about these settings now; we’ll look at
them in detail later.
After you have created your own keyspace, you can switch to it in the shell by typing:
cqlsh> USE my_keyspace;
cqlsh:my_keyspace>
Notice that the prompt has changed to indicate that were using the keyspace.
Using Snake Case
You may have wondered why we chose to name our keyspace in “snake case
(my_keyspace) as opposed to “camel case” (MyKeyspace), which is familiar to devel‐
opers using Java and other languages.
As it turns out, Cassandra naturally handles keyspace, table, and column names as
lowercase. When you enter names in mixed case, Cassandra stores them as all lower‐
case.
This behavior can be overridden by enclosing your names in double quotes (e.g.,
CREATE KEYSPACE "MyKeyspace"...). However, it tends to be a lot simpler to use
snake case than to go against the grain.
Basic cqlsh Commands | 53
Now that we have a keyspace, we can create a table in our keyspace. To do this in
cqlsh, use the following command:
cqlsh:my_keyspace> CREATE TABLE user ( first_name text ,
last_name text, PRIMARY KEY (first_name)) ;
This creates a new table called “user” in our current keyspace with two columns to
store first and last names, both of type text. The text and varchar types are synony‐
mous and are used to store strings. Weve specified the first_name column as our
primary key and taken the defaults for other table options.
Using Keyspace Names in cqlsh
We could have also created this table without switching to our key‐
space by using the syntax CREATE TABLE my_keyspace.user (... .
We can use cqlsh to get a description of a the table we just created using the
DESCRIBE TABLE command:
cqlsh:my_keyspace> DESCRIBE TABLE user;
CREATE TABLE my_keyspace.user (
first_name text PRIMARY KEY,
last_name text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.
SizeTieredCompactionStrategy', 'max_threshold': '32',
'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Youll notice that cqlsh prints a nicely formatted version of the CREATE TABLE com‐
mand that we just typed in but also includes values for all of the available table
options that we did not specify. These values are the defaults, as we did not specify
them. We’ll worry about these settings later. For now, we have enough to get started.
54 | Chapter 3: Installing Cassandra
Writing and Reading Data in cqlsh
Now that we have a keyspace and a table, well write some data to the database and
read it back out again. It’s OK at this point not to know quite whats going on. We’ll
come to understand Cassandras data model in depth later. For now, you have a key‐
space (database), which has a table, which holds columns, the atomic unit of data
storage.
To write a value, use the INSERT command:
cqlsh:my_keyspace> INSERT INTO user (first_name, last_name )
VALUES ('Bill', 'Nguyen');
Here we have created a new row with two columns for the key Bill, to store a set of
related values. The column names are first_name and last_name. We can use the
SELECT COUNT command to make sure that the row was written:
cqlsh:my_keyspace> SELECT COUNT (*) FROM user;
count
-------
1
(1 rows)
Now that we know the data is there, lets read it, using the SELECT command:
cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill';
first_name | last_name
------------+-----------
Bill | Nguyen
(1 rows)
In this command, we requested to return rows matching the primary key Bill
including all columns. You can delete a column using the DELETE command. Here we
will delete the last_name column for the Bill row key:
cqlsh:my_keyspace> DELETE last_name FROM USER WHERE
first_name='Bill';
To make sure that it’s removed, we can query again:
cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill';
first_name | last_name
------------+-----------
Bill | null
(1 rows)
Basic cqlsh Commands | 55
Now well clean up after ourselves by deleting the entire row. Its the same command,
but we dont specify a column name:
cqlsh:my_keyspace> DELETE FROM USER WHERE first_name='Bill';
To make sure that it’s removed, we can query again:
cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill';
first_name | last_name
------------+-----------
(0 rows)
If we really want to clean up after ourselves, we can remove all data from the table
using the TRUNCATE command, or even delete the table schema using the DROP TABLE
command.
cqlsh:my_keyspace> TRUNCATE user;
cqlsh:my_keyspace> DROP TABLE user;
cqlsh Command History
Now that you’ve been using cqlsh for a while, you may have
noticed that you can navigate through commands you’ve executed
previously with the up and down arrow key. This history is stored
in a file called cqlsh_history, which is located in a hidden directory
called .cassandra within your home directory. This acts like your
bash shell history, listing the commands in a plain-text file in the
order Cassandra executed them. Nice!
Summary
Now you should have a Cassandra installation up and running. You’ve worked with
the cqlsh client to insert and retrieve some data, and youre ready to take a step back
and get the big picture on Cassandra before really diving into the details.
56 | Chapter 3: Installing Cassandra
CHAPTER 4
The Cassandra Query Language
In this chapter, you’ll gain an understanding of Cassandras data model and how that
data model is implemented by the Cassandra Query Language (CQL). We’ll show
how CQL supports Cassandras design goals and look at some general behavior char‐
acteristics.
For developers and administrators coming from the relational world, the Cassandra
data model can be difficult to understand initially. Some terms, such as “keyspace,
are completely new, and some, such as “column,” exist in both worlds but have
slightly different meanings. The syntax of CQL is similar in many ways to SQL, but
with some important differences. For those familiar with NoSQL technologies such as
Dynamo or Bigtable, it can also be confusing, because although Cassandra may be
based on those technologies, its own data model is significantly different.
So in this chapter, we start from relational database terminology and introduce
Cassandras view of the world. Along the way well get more familiar with CQL and
learn how it implements this data model.
The Relational Data Model
In a relational database, we have the database itself, which is the outermost container
that might correspond to a single application. The database contains tables. Tables
have names and contain one or more columns, which also have names. When we add
data to a table, we specify a value for every column defined; if we dont have a value
for a particular column, we use null. This new entry adds a row to the table, which
we can later read if we know the row’s unique identifier (primary key), or by using a
SQL statement that expresses some criteria that row might meet. If we want to update
values in the table, we can update all of the rows or just some of them, depending on
the filter we use in a “where” clause of our SQL statement.
57
Now that weve had this review, were in good shape to look at Cassandras data model
in terms of its similarities and differences.
Cassandra’s Data Model
In this section, we’ll take a bottom-up approach to understanding Cassandras data
model.
The simplest data store you would conceivably want to work with might be an array
or list. It would look like Figure 4-1.
Figure 4-1. A list of values
If you persisted this list, you could query it later, but you would have to either exam‐
ine each value in order to know what it represented, or always store each value in the
same place in the list and then externally maintain documentation about which cell in
the array holds which values. That would mean you might have to supply empty
placeholder values (nulls) in order to keep the predetermined size of the array in case
you didnt have a value for an optional attribute (such as a fax number or apartment
number). An array is a clearly useful data structure, but not semantically rich.
So wed like to add a second dimension to this list: names to match the values. We’ll
give names to each cell, and now we have a map structure, as shown in Figure 4-2.
Figure 4-2. A map of name/value pairs
This is an improvement because we can know the names of our values. So if we deci‐
ded that our map would hold User information, we could have column names like
first_name, last_name, phone, email, and so on. This is a somewhat richer structure
to work with.
But the structure weve built so far works only if we have one instance of a given
entity, such as a single person, user, hotel, or tweet. It doesnt give us much if we want
to store multiple entities with the same structure, which is certainly what we want to
do. Theres nothing to unify some collection of name/value pairs, and no way to
repeat the same column names. So we need something that will group some of the
column values together in a distinctly addressable group. We need a key to reference
58 | Chapter 4: The Cassandra Query Language
a group of columns that should be treated together as a set. We need rows. Then, if we
get a single row, we can get all of the name/value pairs for a single entity at once, or
just get the values for the names were interested in. We could call these name/value
pairs columns. We could call each separate entity that holds some set of columns rows.
And the unique identifier for each row could be called a row key or primary key.
Figure 4-3 shows the contents of a simple row: a primary key, which is itself one or
more columns, and additional columns.
Figure 4-3. A Cassandra row
Cassandra defines a table to be a logical division that associates similar data. For
example, we might have a user table, a hotel table, an address book table, and so
on. In this way, a Cassandra table is analogous to a table in the relational world.
Now we dont need to store a value for every column every time we store a new entity.
Maybe we don’t know the values for every column for a given entity. For example,
some people have a second phone number and some don’t, and in an online form
backed by Cassandra, there may be some fields that are optional and some that are
required. Thats OK. Instead of storing null for those values we dont know, which
would waste space, we just wont store that column at all for that row. So now we have
a sparse, multidimensional array structure that looks like Figure 4-4.
When designing a table in a traditional relational database, you’re typically dealing
with “entities,” or the set of attributes that describe a particular noun (hotel, user,
product, etc.). Not much thought is given to the size of the rows themselves, because
row size isnt negotiable once you’ve decided what noun your table represents. How‐
ever, when youre working with Cassandra, you actually have a decision to make
about the size of your rows: they can be wide or skinny, depending on the number of
columns the row contains.
A wide row means a row that has lots and lots (perhaps tens of thousands or even
millions) of columns. Typically there is a smaller number of rows that go along with
so many columns. Conversely, you could have something closer to a relational model,
where you define a smaller number of columns and use many different rows—that’s
the skinny model. Weve already seen a skinny model in Figure 4-4.
Cassandra’s Data Model | 59
Figure 4-4. A Cassandra table
Cassandra uses a special primary key called a composite key (or compound key) to
represent wide rows, also called partitions. The composite key consists of a partition
key, plus an optional set of clustering columns. The partition key is used to determine
the nodes on which rows are stored and can itself consist of multiple columns. The
clustering columns are used to control how data is sorted for storage within a parti‐
tion. Cassandra also supports an additional construct called a static column, which is
for storing data that is not part of the primary key but is shared by every row in a
partition.
Figure 4-5 shows how each partition is uniquely identified by a partition key, and
how the clustering keys are used to uniquely identify the rows within a partition.
Figure 4-5. A Cassandra wide row
60 | Chapter 4: The Cassandra Query Language
For this chapter, we will concern ourselves with simple primary keys consisting of a
single column. In these cases, the primary key and the partition key are the same,
because we have no clustering columns. We’ll examine more complex primary keys in
Chapter 5.
Putting this all together, we have the basic Cassandra data structures:
The column, which is a name/value pair
The row, which is a container for columns referenced by a primary key
The table, which is a container for rows
The keyspace, which is a container for tables
The cluster, which is a container for keyspaces that spans one or more nodes
So thats the bottom-up approach to looking at Cassandras data model. Now that we
know the basic terminology, lets examine each structure in more detail.
Clusters
As previously mentioned, the Cassandra database is specifically designed to be dis‐
tributed over several machines operating together that appear as a single instance to
the end user. So the outermost structure in Cassandra is the cluster, sometimes called
the ring, because Cassandra assigns data to nodes in the cluster by arranging them in
a ring.
Keyspaces
A cluster is a container for keyspaces. A keyspace is the outermost container for data
in Cassandra, corresponding closely to a relational database. In the same way that a
database is a container for tables in the relational model, a keyspace is a container for
tables in the Cassandra data model. Like a relational database, a keyspace has a name
and a set of attributes that define keyspace-wide behavior.
Because were currently focusing on the data model, we’ll leave questions about set‐
ting up and configuring clusters and keyspaces until later. Well examine these topics
in Chapter 7.
Tables
A table is a container for an ordered collection of rows, each of which is itself an
ordered collection of columns. The ordering is determined by the columns, which are
identified as keys. We’ll soon see how Cassandra uses additional keys beyond the pri‐
mary key.
When you write data to a table in Cassandra, you specify values for one or more col‐
umns. That collection of values is called a row. At least one of the values you specify
must be a primary key that serves as the unique identifier for that row.
Cassandra’s Data Model | 61
Lets go back to the user table we created in the previous chapter. Remember how we
wrote a row of data and then read it using the SELECT command in cqlsh:
cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill';
first_name | last_name
------------+-----------
Bill | Nguyen
(1 rows)
Youll notice in the last row that the shell tells us that one row was returned. It turns
out to be the row identified by the first_name “Bill. This is the primary key that
identifies this row.
Data Access Requires a Primary Key
This is an important detail—the SELECT, INSERT, UPDATE, and
DELETE commands in CQL all operate in terms of rows.
As we stated earlier, we don’t need to include a value for every column when we add a
new row to the table. Lets test this out with our user table using the ALTER TABLE
command and then view the results using the DESCRIBE TABLE command:
cqlsh:my_keyspace> ALTER TABLE user ADD title text;
cqlsh:my_keyspace> DESCRIBE TABLE user;
CREATE TABLE my_keyspace.user (
first_name text PRIMARY KEY,
last_name text,
title text
) ...
We see that the title column has been added. Note that we’ve shortened the output
to omit the various table settings. Youll learn more about these settings and how to
configure them in Chapter 7.
Now, lets write a couple of rows, populate different columns for each, and view the
results:
cqlsh:my_keyspace> INSERT INTO user (first_name, last_name, title)
VALUES ('Bill', 'Nguyen', 'Mr.');
cqlsh:my_keyspace> INSERT INTO user (first_name, last_name) VALUES
('Mary', 'Rodriguez');
cqlsh:my_keyspace> SELECT * FROM user;
first_name | last_name | title
------------+-----------+-------
Mary | Rodriguez | null
Bill | Nguyen | Mr.
62 | Chapter 4: The Cassandra Query Language
(2 rows)
Now that weve learned more about the structure of a table and done some data mod‐
eling, lets dive deeper into columns.
Columns
A column is the most basic unit of data structure in the Cassandra data model. So far
weve seen that a column contains a name and a value. We constrain each of the val‐
ues to be of a particular type when we define the column. We’ll want to dig into the
various types that are available for each column, but first let’s take a look into some
other attributes of a column that we havent discussed yet: timestamps and time to
live. These attributes are key to understanding how Cassandra uses time to keep data
current.
Timestamps
Each time you write data into Cassandra, a timestamp is generated for each column
value that is updated. Internally, Cassandra uses these timestamps for resolving any
conflicting changes that are made to the same value. Generally, the last timestamp
wins.
Lets view the timestamps that were generated for our previous writes by adding the
writetime() function to our SELECT command. We’ll do this on the lastname col‐
umn and include a couple of other values for context:
cqlsh:my_keyspace> SELECT first_name, last_name,
writetime(last_name) FROM user;
first_name | last_name | writetime(last_name)
------------+-----------+----------------------
Mary | Rodriguez | 1434591198790252
Bill | Nguyen | 1434591198798235
(2 rows)
We might expect that if we ask for the timestamp on first_name wed get a similar
result. However, it turns out were not allowed to ask for the timestamp on primary
key columns:
cqlsh:my_keyspace> SELECT WRITETIME(first_name) FROM user;
InvalidRequest: code=2200 [Invalid query] message="Cannot use
selection function writeTime on PRIMARY KEY part first_name"
Cassandra’s Data Model | 63
Cassandra also allows us to specify a timestamp we want to use when performing
writes. To do this, we’ll use the CQL UPDATE command for the first time. We’ll use the
optional USING TIMESTAMP option to manually set a timestamp (note that the time‐
stamp must be later than the one from our SELECT command, or the UPDATE will be
ignored):
cqlsh:my_keyspace> UPDATE user USING TIMESTAMP 1434373756626000
SET last_name = 'Boateng' WHERE first_name = 'Mary' ;
cqlsh:my_keyspace> SELECT first_name, last_name,
WRITETIME(last_name) FROM user WHERE first_name = 'Mary';
first_name | last_name | writetime(last_name)
------------+-------------+---------------------
Mary | Boateng | 1434373756626000
(1 rows)
This statement has the effect of adding the last name column to the row identified by
the primary key “Mary”, and setting the timestamp to the value we provided.
Working with Timestamps
Setting the timestamp is not required for writes. This functionality
is typically used for writes in which there is a concern that some of
the writes may cause fresh data to be overwritten with stale data.
This is advanced behavior and should be used with caution.
There is currently not a way to convert timestamps produced by
writetime() into a more friendly format in cqlsh.
Time to live (TTL)
One very powerful feature that Cassandra provides is the ability to expire data that is
no longer needed. This expiration is very flexible and works at the level of individual
column values. The time to live (or TTL) is a value that Cassandra stores for each
column value to indicate how long to keep the value.
The TTL value defaults to null, meaning that data that is written will not expire. Let’s
show this by adding the TTL() function to a SELECT command in cqlsh to see the
TTL value for Mary’s last name:
cqlsh:my_keyspace> SELECT first_name, last_name, TTL(last_name)
FROM user WHERE first_name = 'Mary';
first_name | last_name | ttl(last_name)
------------+-----------+----------------
Mary | Boateng | null
(1 rows)
64 | Chapter 4: The Cassandra Query Language
Now let’s set the TTL on the last name column to an hour (3,600 seconds) by adding
the USING TTL option to our UPDATE command:
cqlsh:my_keyspace> UPDATE user USING TTL 3600 SET last_name =
'McDonald' WHERE first_name = 'Mary' ;
cqlsh:my_keyspace> SELECT first_name, last_name, TTL(last_name)
FROM user WHERE first_name = 'Mary';
first_name | last_name | ttl(last_name)
------------+-------------+---------------
Mary | McDonald | 3588
(1 rows)
As you can see, the clock is already counting down our TTL, reflecting the several
seconds it took to type the second command. If we run this command again in an
hour, Mary’s last_name will be set to null. We can also set TTL on INSERTS using the
same USING TTL option.
Using TTL
Remember that TTL is stored on a per-column level. There is cur‐
rently no mechanism for setting TTL at a row level directly. As
with the timestamp, there is no way to obtain or set the TTL value
of a primary key column, and the TTL can only be set for a column
when we provide a value for the column.
If we want to set TTL across an entire row, we must provide a value for every non-
primary key column in our INSERT or UPDATE command.
CQL Types
Now that weve taken a deeper dive into how Cassandra represents columns including
time-based metadata, lets look at the various types that are available to us for our val‐
ues.
As weve seen in our exploration so far, each column in our table is of a specified type.
Up until this point, we’ve only used the varchar type, but there are plenty of other
options available to us in CQL, so lets explore them.
CQL supports a flexible set of data types, including simple character and numeric
types, collections, and user-defined types. We’ll describe these data types and provide
some examples of how they might be used to help you learn to make the right choice
for your data model.
CQL Types | 65
Numeric Data Types
CQL supports the numeric types youd expect, including integer and floating-point
numbers. These types are similar to standard types in Java and other languages:
int
A 32-bit signed integer (as in Java)
bigint
A 64-bit signed long integer (equivalent to a Java long)
smallint
A 16-bit signed integer (equivalent to a Java short)
tinyint
An 8-bit signed integer (as in Java)
varint
A variable precision signed integer (equivalent to java.math.BigInteger)
float
A 32-bit IEEE-754 floating point (as in Java)
double
A 64-bit IEEE-754 floating point (as in Java)
decimal
A variable precision decimal (equivalent to java.math.BigDecimal)
Additional Integer Types
The smallint and tinyint types were added in the Cassandra 2.2
release.
While enumerated types are common in many languages, there is no direct equiva‐
lent in CQL. A common practice is to store enumerated values as strings. For exam‐
ple, using the Enum.name() method to convert an enumerated value to a String for
writing to Cassandra as text, and the Enum.valueOf() method to convert from text
back to the enumerated value.
66 | Chapter 4: The Cassandra Query Language
Textual Data Types
CQL provides two data types for representing text, one of which weve made quite a
bit of use of already (text):
text, varchar
Synonyms for a UTF-8 character string
ascii
An ASCII character string
UTF-8 is the more recent and widely used text standard and supports internationali‐
zation, so we recommend using text over ascii when building tables for new data.
The ascii type is most useful if you are dealing with legacy data that is in ASCII for‐
mat.
Setting the Locale in cqlsh
By default, cqlsh prints out control and other unprintable charac‐
ters using a backslash escape. You can control how cqlsh displays
non-ASCII characters by setting the locale via the $LANG environ‐
ment variable before running the tool. See the cqlsh command
HELP TEXT_OUTPUT for more information.
Time and Identity Data Types
The identity of data elements such as rows and partitions is important in any data
model in order to be able to access the data. Cassandra provides several types which
prove quite useful in defining unique partition keys. Lets take some time (pun
intended) to dig into these:
timestamp
While we noted earlier that each column has a timestamp indicating when it was
last modified, you can also use a timestamp as the value of a column itself. The
time can be encoded as a 64-bit signed integer, but it is typically much more use‐
ful to input a timestamp using one of several supported ISO 8601 date formats.
For example:
2015-06-15 20:05-0700
2015-06-15 20:05:07-0700
2015-06-15 20:05:07.013-0700
2015-06-15T20:05-0700
2015-06-15T20:05:07-0700
2015-06-15T20:05:07.013+-0700
CQL Types | 67
The best practice is to always provide time zones rather than relying on the oper‐
ating system time zone configuration.
date, time
Releases through Cassandra 2.1 only had the timestamp type to represent times,
which included both a date and a time of day. The 2.2 release introduced date
and time types that allowed these to be represented independently; that is, a date
without a time, and a time of day without reference to a specific date. As with
timestamp, these types support ISO 8601 formats.
Although there are new java.time types available in Java 8, the date type maps
to a custom type in Cassandra in order to preserve compatibility with older
JDKs. The time type maps to a Java long representing the number of nanosec‐
onds since midnight.
uuid
A universally unique identier (UUID) is a 128-bit value in which the bits con‐
form to one of several types, of which the most commonly used are known as
Type 1 and Type 4. The CQL uuid type is a Type 4 UUID, which is based entirely
on random numbers. UUIDs are typically represented as dash-separated sequen‐
ces of hex digits. For example:
1a6300ca-0572-4736-a393-c0b7229e193e
The uuid type is often used as a surrogate key, either by itself or in combination
with other values.
Because UUIDs are of a finite length, they are not absolutely guaranteed to be
unique. However, most operating systems and programming languages provide
utilities to generate IDs that provide adequate uniqueness, and cqlsh does as
well. You can obtain a Type 4 UUID value via the uuid() function and use this
value in an INSERT or UPDATE.
timeuuid
This is a Type 1 UUID, which is based on the MAC address of the computer, the
system time, and a sequence number used to prevent duplicates. This type is fre‐
quently used as a conflict-free timestamp. cqlsh provides several convenience
functions for interacting with the timeuuid type: now(), dateOf() and
unixTimestampOf().
The availability of these convenience functions is one reason why timeuuid tends
to be used more frequently than uuid.
68 | Chapter 4: The Cassandra Query Language
Building on our previous examples, we might determine that wed like to assign a
unique ID to each user, as first_name is perhaps not a sufficiently unique key for our
user table. After all, its very likely that well run into users with the same first name at
some point. If we were starting from scratch, we might have chosen to make this
identifier our primary key, but for now well add it as another column.
Primary Keys Are Forever
After you create a table, there is no way to modify the primary key,
because this controls how data is distributed within the cluster, and
even more importantly, how it is stored on disk.
Lets add the identifier using a uuid :
cqlsh:my_keyspace> ALTER TABLE user ADD id uuid;
Next, well insert an ID for Mary using the uuid() function and then view the results:
cqlsh:my_keyspace> UPDATE user SET id = uuid() WHERE first_name =
'Mary';
cqlsh:my_keyspace> SELECT first_name, id FROM user WHERE
first_name = 'Mary';
first_name | id
------------+--------------------------------------
Mary | e43abc5d-6650-4d13-867a-70cbad7feda9
(1 rows)
Notice that the id is in UUID format.
Now we have a more robust table design, which we can extend with even more col‐
umns as we learn about more types.
Other Simple Data Types
CQL provides several other simple data types that dont fall nicely into one of the cat‐
egories weve looked at already:
boolean
This is a simple true/false value. The cqlsh is case insensitive in accepting these
values but outputs True or False.
blob
A binary large object (blob) is a colloquial computing term for an arbitrary array
of bytes. The CQL blob type is useful for storing media or other binary file types.
Cassandra does not validate or examine the bytes in a blob. CQL represents the
data as hexadecimal digits—for example, 0x00000ab83cf0. If you want to encode
CQL Types | 69
arbitrary textual data into the blob you can use the textAsBlob() function in
order to specify values for entry. See the cqlsh help function HELP BLOB_INPUT
for more information.
inet
This type represents IPv4 or IPv6 Internet addresses. cqlsh accepts any legal for‐
mat for defining IPv4 addresses, including dotted or non-dotted representations
containing decimal, octal, or hexadecimal values. However, the values are repre‐
sented using the dotted decimal format in cqlsh output—for example,
192.0.2.235.
IPv6 addresses are represented as eight groups of four hexadecimal digits, separa‐
ted by colons—for example, 2001:0db8:85a3:0000:0000:8a2e:0370:7334. The
IPv6 specification allows the collapsing of consecutive zero hex values, so the
preceding value is rendered as follows when read using SELECT: 2001:
db8:85a3:a::8a2e:370:7334.
counter
The counter data type provides 64-bit signed integer, whose value cannot be set
directly, but only incremented or decremented. Cassandra is one of the few data‐
bases that provides race-free increments across data centers. Counters are fre‐
quently used for tracking statistics such as numbers of page views, tweets, log
messages, and so on. The counter type has some special restrictions. It cannot be
used as part of a primary key. If a counter is used, all of the columns other than
primary key columns must be counters.
A Warning About Counters
Remember: the increment and decrement operators are not
idempotent. There is no operation to reset a counter directly,
but you can approximate a reset by reading the counter value
and decrementing by that value. Unfortunately, this is not
guaranteed to work perfectly, as the counter may have been
changed elsewhere in between reading and writing.
Collections
Lets say we wanted to extend our user table to support multiple email addresses.
One way to do this would be to create additional columns such as email2, email3, and
so on. While this is an approach that will work, it does not scale very well and might
cause a lot of rework. It is much simpler to deal with the email addresses as a group or
collection.” CQL provides three collection types to help us out with these situations:
sets, lists, and maps. Lets now take a look at each of them:
70 | Chapter 4: The Cassandra Query Language
set
The set data type stores a collection of elements. The elements are unordered,
but cqlsh returns the elements in sorted order. For example, text values are
returned in alphabetical order. Sets can contain the simple types we reviewed ear‐
lier as well as user-defined types (which we’ll discuss momentarily) and even
other collections. One advantage of using set is the ability to insert additional
items without having to read the contents first.
Lets modify our user table to add a set of email addresses:
cqlsh:my_keyspace> ALTER TABLE user ADD emails set<text>;
Then we’ll add an email address for Mary and check that it was added success‐
fully:
cqlsh:my_keyspace> UPDATE user SET emails = {
'mary@example.com' } WHERE first_name = 'Mary';
cqlsh:my_keyspace> SELECT emails FROM user WHERE first_name =
'Mary';
emails
----------------------
{'mary@example.com'}
(1 rows)
Note that in adding that first email address, we replaced the previous contents of
the set, which in this case was null. We can add another email address later
without replacing the whole set by using concatenation:
cqlsh:my_keyspace> UPDATE user SET emails = emails + {
'mary.mcdonald.AZ@gmail.com' } WHERE first_name = 'Mary';
cqlsh:my_keyspace> SELECT emails FROM user WHERE first_name =
'Mary';
emails
---------------------------------------------------
{'mary.mcdonald.AZ@gmail.com', 'mary@example.com'}
(1 rows)
Other Set Operations
We can also clear items from the set by using the subtraction
operator: SET emails = emails - {'mary@example.com'}.
Alternatively, we could clear out the entire set by using the
empty set notation: SET emails = {}.
CQL Types | 71
list
The list data type contains an ordered list of elements. By default, the values are
stored in order of insertion. Lets modify our user table to add a list of phone
numbers:
cqlsh:my_keyspace> ALTER TABLE user ADD
phone_numbers list<text>;
Then we’ll add a phone number for Mary and check that it was added success‐
fully:
cqlsh:my_keyspace> UPDATE user SET phone_numbers = [
'1-800-999-9999' ] WHERE first_name = 'Mary';
cqlsh:my_keyspace> SELECT phone_numbers FROM user WHERE
first_name = 'Mary';
phone_numbers
--------------------
['1-800-999-9999']
(1 rows)
Lets add a second number by appending it:
cqlsh:my_keyspace> UPDATE user SET phone_numbers =
phone_numbers + [ '480-111-1111' ] WHERE first_name = 'Mary';
cqlsh:my_keyspace> SELECT phone_numbers FROM user WHERE
first_name = 'Mary';
phone_numbers
------------------------------------
['1-800-999-9999', '480-111-1111']
(1 rows)
The second number we added now appears at the end of the list.
We could also have prepended the number to the front of the
list by reversing the order of our values: SET phone_numbers =
[‘4801234567’] + phone_numbers.
We can replace an individual item in the list when we reference it by its index:
cqlsh:my_keyspace> UPDATE user SET phone_numbers[1] =
'480-111-1111' WHERE first_name = 'Mary';
72 | Chapter 4: The Cassandra Query Language
As with sets, we can also use the subtraction operator to remove items that match
a specified value:
cqlsh:my_keyspace> UPDATE user SET phone_numbers =
phone_numbers - [ '480-111-1111' ] WHERE first_name = 'Mary';
Finally, we can delete a specific item directly using its index:
cqlsh:my_keyspace> DELETE phone_numbers[0] from user WHERE
first_name = 'Mary';
map
The map data type contains a collection of key/value pairs. The keys and the val‐
ues can be of any type except counter. Lets try this out by using a map to store
information about user logins. We’ll create a column to track login session time
in seconds, with a timeuuid as the key:
cqlsh:my_keyspace> ALTER TABLE user ADD
login_sessions map<timeuuid, int>;
Then we’ll add a couple of login sessions for Mary and see the results:
cqlsh:my_keyspace> UPDATE user SET login_sessions =
{ now(): 13, now(): 18} WHERE first_name = 'Mary';
cqlsh:my_keyspace> SELECT login_sessions FROM user WHERE
first_name = 'Mary';
login_sessions
-----------------------------------------------
{6061b850-14f8-11e5-899a-a9fac1d00bce: 13,
6061b851-14f8-11e5-899a-a9fac1d00bce: 18}
(1 rows)
We can also reference an individual item in the map by using its key.
Collection types are very useful in cases where we need to store a variable number of
elements within a single column.
User-Dened Types
Now we might decide that we need to keep track of physical addresses for our users.
We could just use a single text column to store these values, but that would put the
burden of parsing the various components of the address on the application. It would
be better if we could define a structure in which to store the addresses to maintain the
integrity of the different components.
CQL Types | 73
Fortunately, Cassandra gives us a way to define our own types. We can then create
columns of these user-defined types (UDTs). Lets create our own address type,
inserting some line breaks in our command for readability:
cqlsh:my_keyspace> CREATE TYPE address (
... street text,
... city text,
... state text,
... zip_code int);
A UDT is scoped by the keyspace in which it is defined. We could have written
CREATE TYPE my_keyspace.address. If you run the command DESCRIBE KEYSPACE
my_keyspace, youll see that the address type is part of the keyspace definition.
Now that we have defined our address type, well try to use it in our user table, but we
immediately run into a problem:
cqlsh:my_keyspace> ALTER TABLE user ADD
addresses map<text, address>;
InvalidRequest: code=2200 [Invalid query] message="Non-frozen
collections are not allowed inside collections: map<text,
address>"
What is going on here? It turns out that a user-defined data type is considered a col‐
lection, as its implementation is similar to a set, list, or map.
Freezing Collections
Cassandra releases prior to 2.2 do not fully support the nesting of
collections. Specifically, the ability to access individual attributes of
a nested collection is not yet supported, because the nested collec‐
tion is serialized as a single object by the implementation.
Freezing is a concept that the Cassandra community has intro‐
duced as a forward compatibility mechanism. For now, you can
nest a collection within another collection by marking it as frozen.
In the future, when nested collections are fully supported, there
will be a mechanism to “unfreeze” the nested collections, allowing
the individual attributes to be accessed.
You can also use a collection as a primary key if it is frozen.
Now that weve taken a short detour to discuss freezing and nested tables, let’s get
back to modifying our table, this time marking the address as frozen:
cqlsh:my_keyspace> ALTER TABLE user ADD addresses map<text,
frozen<address>>;
74 | Chapter 4: The Cassandra Query Language
Now let’s add a home address for Mary:
cqlsh:my_keyspace> UPDATE user SET addresses = addresses +
{'home': { street: '7712 E. Broadway', city: 'Tucson',
state: 'AZ', zip_code: 85715} } WHERE first_name = 'Mary';
Now that weve finished learning about the various types, let’s take a step back and
look at the tables we’ve created so far by describing my_keyspace:
cqlsh:my_keyspace> DESCRIBE KEYSPACE my_keyspace ;
CREATE KEYSPACE my_keyspace WITH replication = {'class':
'SimpleStrategy', 'replication_factor': '1'} AND
durable_writes = true;
CREATE TYPE my_keyspace.address (
street text,
city text,
state text,
zip_code int
);
CREATE TABLE my_keyspace.user (
first_name text PRIMARY KEY,
addresses map<text, frozen<address>>,
emails set<text>,
id uuid,
last_name text,
login_sessions map<timeuuid, int>,
phone_numbers list<text>,
title text
) WITH bloom_filter_fp_chance = 0.01
AND caching = '{'keys':'ALL', 'rows_per_partition':'NONE'}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class':
'org.apache.cassandra.db.compaction.
SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CQL Types | 75
Secondary Indexes
If you try to query on column in a Cassandra table that is not part of the primary key,
youll soon realize that this is not allowed. For example, consider our user table from
the previous chapter, which uses first_name as the primary key. Attempting to query
by last_name results in the following output:
cqlsh:my_keyspace> SELECT * FROM user WHERE last_name = 'Nguyen';
InvalidRequest: code=2200 [Invalid query] message="No supported
secondary index found for the non primary key columns restrictions"
As the error message instructs us, we need to create a secondary index for the
last_name column. A secondary index is an index on a column that is not part of the
primary key:
cqlsh:my_keyspace> CREATE INDEX ON user ( last_name );
We can also give an optional name to the index with the syntax CREATE INDEX
<name> ON.... If you dont specify a name, cqlsh creates a name automatically
according to the form <table name>_<column name>_idx. For example, we can learn
the name of the index we just created using DESCRIBE KEYSPACE:
cqlsh:my_keyspace> DESCRIBE KEYSPACE;
...
CREATE INDEX user_last_name_idx ON my_keyspace.user (last_name);
Now that weve created the index, our query will work as expected:
cqlsh:my_keyspace> SELECT * FROM user WHERE last_name = 'Nguyen';
first_name | last_name
------------+-----------
Bill | Nguyen
(1 rows)
Were not limited just to indexes based only on simple type columns. It’s also possible
to create indexes that are based on values in collections. For example, we might wish
to be able to search based on user addresses, emails, or phone numbers, which we
have implemented using map, set, and list, respectively:
cqlsh:my_keyspace> CREATE INDEX ON user ( addresses );
cqlsh:my_keyspace> CREATE INDEX ON user ( emails );
cqlsh:my_keyspace> CREATE INDEX ON user ( phone_numbers );
Note that for maps in particular, we have the option of indexing either the keys (via
the syntax KEYS(addresses)) or the values (which is the default), or both (in Cassan‐
dra 2.2 or later).
76 | Chapter 4: The Cassandra Query Language
Finally, we can use the DROP INDEX command to remove an index:
cqlsh:my_keyspace> DROP INDEX user_last_name_idx;
Secondary Index Pitfalls
Because Cassandra partitions data across multiple nodes, each
node must maintain its own copy of a secondary index based on
the data stored in partitions it owns. For this reason, queries
involving a secondary index typically involve more nodes, making
them significantly more expensive.
Secondary indexes are not recommended for several specific cases:
Columns with high cardinality. For example, indexing on the
user.addresses column could be very expensive, as the vast
majority of addresses are unique.
Columns with very low data cardinality. For example, it would
make little sense to index on the user.title column in order
to support a query for every “Mrs.” in the user table, as this
would result in a massive row in the index.
Columns that are frequently updated or deleted. Indexes built
on these columns can generate errors if the amount of deleted
data (tombstones) builds up more quickly than the compac‐
tion process can handle.
For optimal read performance, denormalized table designs or materialized views are
generally preferred to using secondary indexes. We’ll learn more about these in Chap‐
ter 5. However, secondary indexes can be a useful way of supporting queries that were
not considered in the initial data model design.
Secondary Indexes | 77
SASI: A New Secondary Index Implementation
The Cassandra 3.4 release included an alternative implementation of secondary
indexes known as the SSTable Attached Secondary Index (SASI). SASI was developed
by Apple and released as an open source implementation of Cassandras secondary
index API. As the name implies, SASI indexes are calculated and stored as part of
each SSTable file, differing from the original Cassandra implementation, which stores
indexes in separate, “hidden” tables.
The SASI implementation exists alongside traditional secondary indexes, and you can
create a SASI index with the CQL CREATE CUSTOM INDEX command:
CREATE CUSTOM INDEX user_last_name_sasi_idx ON user (last_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex';
SASI indexes do offer functionality beyond the traditional secondary index imple‐
mentation, such as the ability to do inequality (greater than or less than) searches on
indexed columns. You can also use the new CQL LIKE keyword to do text searches
against indexed columns. For example, you could use the following query to find
users whose last name begins with “N”:
SELECT * FROM user WHERE last_name LIKE 'N%';
While SASI indexes do perform better than traditional indexes by eliminating the
need to read from additional tables, they still require reads from a greater number of
nodes than a denormalized design.
Summary
In this chapter, we took a quick tour of Cassandras data model of clusters, keyspaces,
tables, keys, rows, and columns. In the process, we learned a lot of CQL syntax and
gained more experience working with tables and columns in cqlsh. If youre interes‐
ted in diving deeper on CQL, you can read the full language specification.
78 | Chapter 4: The Cassandra Query Language
CHAPTER 5
Data Modeling
In this chapter, you’ll learn how to design data models for Cassandra, including a data
modeling process and notation. To apply this knowledge, well design the data model
for a sample application, which well build over the next several chapters. This will
help show how all the parts fit together. Along the way, well use a tool to help us
manage our CQL scripts.
Conceptual Data Modeling
First, lets create a simple domain model that is easy to understand in the relational
world, and then see how we might map it from a relational to a distributed hashtable
model in Cassandra.
To create the example, we want to use something that is complex enough to show the
various data structures and design patterns, but not something that will bog you
down with details. Also, a domain thats familiar to everyone will allow you to con‐
centrate on how to work with Cassandra, not on what the application domain is all
about.
For our example, we’ll use a domain that is easily understood and that everyone can
relate to: making hotel reservations.
Our conceptual domain includes hotels, guests that stay in the hotels, a collection of
rooms for each hotel, the rates and availability of those rooms, and a record of reser‐
vations booked for guests. Hotels typically also maintain a collection of “points of
interest,” which are parks, museums, shopping galleries, monuments, or other places
near the hotel that guests might want to visit during their stay. Both hotels and points
of interest need to maintain geolocation data so that they can be found on maps for
mashups, and to calculate distances.
79
We depict our conceptual domain in Figure 5-1 using the entity–relationship model
popularized by Peter Chen. This simple diagram represents the entities in our
domain with rectangles, and attributes of those entities with ovals. Attributes that
represent unique identifiers for items are underlined. Relationships between entities
are represented as diamonds, and the connectors between the relationship and each
entity show the multiplicity of the connection.
Figure 5-1. Hotel domain entity–relationship diagram
Obviously, in the real world, there would be many more considerations and much
more complexity. For example, hotel rates are notoriously dynamic, and calculating
them involves a wide array of factors. Here were defining something complex enough
to be interesting and touch on the important points, but simple enough to maintain
the focus on learning Cassandra.
RDBMS Design
When you set out to build a new data-driven application that will use a relational
database, you might start by modeling the domain as a set of properly normalized
tables and use foreign keys to reference related data in other tables.
Figure 5-2 shows how we might represent the data storage for our application using a
relational database model. The relational model includes a couple of “join” tables in
order to realize the many-to-many relationships from our conceptual model of
hotels-to-points of interest, rooms-to-amenities, rooms-to-availability, and guests-to-
rooms (via a reservation).
80 | Chapter 5: Data Modeling
Figure 5-2. A simple hotel search system using RDBMS
Design Dierences Between RDBMS and Cassandra
Of course, because this is a Cassandra book, what we really want is to model our data
so we can store it in Cassandra. Before we start creating our Cassandra data model,
lets take a minute to highlight some of the key differences in doing data modeling for
Cassandra versus a relational database.
No joins
You cannot perform joins in Cassandra. If you have designed a data model and find
that you need something like a join, you’ll have to either do the work on the client
side, or create a denormalized second table that represents the join results for you.
This latter option is preferred in Cassandra data modeling. Performing joins on the
client should be a very rare case; you really want to duplicate (denormalize) the data
instead.
RDBMS Design | 81
No referential integrity
Although Cassandra supports features such as lightweight transactions and batches,
Cassandra itself has no concept of referential integrity across tables. In a relational
database, you could specify foreign keys in a table to reference the primary key of a
record in another table. But Cassandra does not enforce this. It is still a common
design requirement to store IDs related to other entities in your tables, but operations
such as cascading deletes are not available.
Denormalization
In relational database design, we are often taught the importance of normalization.
This is not an advantage when working with Cassandra because it performs best
when the data model is denormalized. It is often the case that companies end up
denormalizing data in relational databases as well. There are two common reasons for
this. One is performance. Companies simply can’t get the performance they need
when they have to do so many joins on years’ worth of data, so they denormalize
along the lines of known queries. This ends up working, but goes against the grain of
how relational databases are intended to be designed, and ultimately makes one ques‐
tion whether using a relational database is the best approach in these circumstances.
A second reason that relational databases get denormalized on purpose is a business
document structure that requires retention. That is, you have an enclosing table that
refers to a lot of external tables whose data could change over time, but you need to
preserve the enclosing document as a snapshot in history. The common example here
is with invoices. You already have customer and product tables, and youd think that
you could just make an invoice that refers to those tables. But this should never be
done in practice. Customer or price information could change, and then you would
lose the integrity of the invoice document as it was on the invoice date, which could
violate audits, reports, or laws, and cause other problems.
In the relational world, denormalization violates Codds normal forms, and we try to
avoid it. But in Cassandra, denormalization is, well, perfectly normal. It’s not required
if your data model is simple. But don’t be afraid of it.
Server-Side Denormalization with Materialized Views
Historically, denormalization in Cassandra has required designing
and managing multiple tables using techniques we will introduce
momentarily. Beginning with the 3.0 release, Cassandra provides a
feature known as materialized views which allows us to create mul‐
tiple denormalized views of data based on a base table design. Cas‐
sandra manages materialized views on the server, including the
work of keeping the views in sync with the table. In this chapter,
we’ll see examples of both classic denormalization and materialized
views.
82 | Chapter 5: Data Modeling
Query-rst design
Relational modeling, in simple terms, means that you start from the conceptual
domain and then represent the nouns in the domain in tables. You then assign pri‐
mary keys and foreign keys to model relationships. When you have a many-to-many
relationship, you create the join tables that represent just those keys. The join tables
dont exist in the real world, and are a necessary side effect of the way relational mod‐
els work. After you have all your tables laid out, you can start writing queries that pull
together disparate data using the relationships defined by the keys. The queries in the
relational world are very much secondary. It is assumed that you can always get the
data you want as long as you have your tables modeled properly. Even if you have to
use several complex subqueries or join statements, this is usually true.
By contrast, in Cassandra you don’t start with the data model; you start with the
query model. Instead of modeling the data first and then writing queries, with Cas‐
sandra you model the queries and let the data be organized around them. Think of
the most common query paths your application will use, and then create the tables
that you need to support them.
Detractors have suggested that designing the queries first is overly constraining on
application design, not to mention database modeling. But it is perfectly reasonable to
expect that you should think hard about the queries in your application, just as you
would, presumably, think hard about your relational domain. You may get it wrong,
and then youll have problems in either world. Or your query needs might change
over time, and then youll have to work to update your data set. But this is no differ‐
ent from defining the wrong tables, or needing additional tables, in an RDBMS.
Designing for optimal storage
In a relational database, it is frequently transparent to the user how tables are stored
on disk, and it is rare to hear of recommendations about data modeling based on how
the RDBMS might store tables on disk. However, that is an important consideration
in Cassandra. Because Cassandra tables are each stored in separate files on disk, its
important to keep related columns defined together in the same table.
A key goal that we will see as we begin creating data models in Cassandra is to mini‐
mize the number of partitions that must be searched in order to satisfy a given query.
Because the partition is a unit of storage that does not get divided across nodes, a
query that searches a single partition will typically yield the best performance.
Sorting is a design decision
In an RDBMS, you can easily change the order in which records are returned to you
by using ORDER BY in your query. The default sort order is not configurable; by
default, records are returned in the order in which they are written. If you want to
RDBMS Design | 83
change the order, you just modify your query, and you can sort by any list of col‐
umns.
In Cassandra, however, sorting is treated differently; it is a design decision. The sort
order available on queries is fixed, and is determined entirely by the selection of clus‐
tering columns you supply in the CREATE TABLE command. The CQL SELECT state‐
ment does support ORDER BY semantics, but only in the order specified by the
clustering columns.
Dening Application Queries
Lets try the query-first approach to start designing the data model for our hotel
application. The user interface design for the application is often a great artifact to use
to begin identifying queries. Lets assume that weve talked with the project stakehold‐
ers and our UX designers have produced user interface designs or wireframes for the
key use cases. We’ll likely have a list of shopping queries like the following:
Q1. Find hotels near a given point of interest.
Q2. Find information about a given hotel, such as its name and location.
Q3. Find points of interest near a given hotel.
Q4. Find an available room in a given date range.
Q5. Find the rate and amenities for a room.
Number Your Queries
It is often helpful to be able to refer to queries by a shorthand num‐
ber rather that explaining them in full. The queries listed here are
numbered Q1, Q2, and so on, which is how we will reference them
in diagrams as we move throughout our example.
Now if our application is to be a success, well certainly want our customers to be able
to book reservations at our hotels. This includes steps such as selecting an available
room and entering their guest information. So clearly we will also need some queries
that address the reservation and guest entities from our conceptual data model. Even
here, however, well want to think not only from the customer perspective in terms of
how the data is written, but also in terms of how the data will be queried by down‐
stream use cases.
Our natural tendency as data modelers would be to focus first on designing the tables
to store reservation and guest records, and only then start thinking about the queries
that would access them. You may have felt a similar tension already when we began
discussing the shopping queries before, thinking “but where did the hotel and point
of interest data come from?” Dont worry, we will get to this soon enough. Here are
some queries that describe how our users will access reservations:
84 | Chapter 5: Data Modeling
Q6. Lookup a reservation by confirmation number.
Q7. Lookup a reservation by hotel, date, and guest name.
Q8. Lookup all reservations by guest name.
Q9. View guest details.
We show all of our queries in the context of the workflow of our application in
Figure 5-3. Each box on the diagram represents a step in the application workflow,
with arrows indicating the flows between steps and the associated query. If weve
modeled our application well, each step of the workflow accomplishes a task that
unlocks” subsequent steps. For example, the “View hotels near POI” task helps the
application learn about several hotels, including their unique keys. The key for a
selected hotel may be used as part of Q2, in order to obtain detailed description of the
hotel. The act of booking a room creates a reservation record that may be accessed by
the guest and hotel staff at a later time through various additional queries.
Figure 5-3. Hotel application queries
Logical Data Modeling
Now that we have defined our queries, were ready to begin designing our Cassandra
tables. First, we’ll create a logical model containing a table for each query, capturing
entities and relationships from the conceptual model.
To name each table, we’ll identify the primary entity type for which we are querying
and use that to start the entity name. If we are querying by attributes of other related
entities, we append those to the table name, separated with _by_. For example,
hotels_by_poi.
Next, we identify the primary key for the table, adding partition key columns based
on the required query attributes, and clustering columns in order to guarantee
uniqueness and support desired sort ordering.
Logical Data Modeling | 85
We complete each table by adding any additional attributes identified by the query. If
any of these additional attributes are the same for every instance of the partition key,
we mark the column as static.
Now that was a pretty quick description of a fairly involved process, so it will be
worth our time to work through a detailed example. First, lets introduce a notation
that we can use to represent our logical models.
Introducing Chebotko Diagrams
Several individuals within the Cassandra community have proposed notations for
capturing data models in diagrammatic form. Weve elected to use a notation popu‐
larized by Artem Chebotko which provides a simple, informative way to visualize the
relationships between queries and tables in our designs. Figure 5-4 shows the Che‐
botko notation for a logical data model.
Figure 5-4. A Chebotko logical diagram
Each table is shown with its title and a list of columns. Primary key columns are iden‐
tified via symbols such as K for partition key columns and C or C to represent clus‐
tering columns. Lines are shown entering tables or between tables to indicate the
queries that each table is designed to support.
86 | Chapter 5: Data Modeling
Hotel Logical Data Model
Figure 5-5 shows a Chebotko logical data model for the queries involving hotels,
points of interest, rooms, and amenities. One thing we notice immediately is that our
Cassandra design doesnt include dedicated tables for rooms or amenities, as we had
in the relational design. This is because our workflow didn’t identify any queries
requiring this direct access.
Figure 5-5. Hotel domain logical model
Lets explore the details of each of these tables.
Our first query Q1 is to find hotels near a point of interest, so we’ll call our table
hotels_by_poi. Were searching by a named point of interest, so that is a clue that the
point of interest should be a part of our primary key. Let’s reference the point of inter‐
est by name, because according to our workflow that is how our users will start their
search.
Youll note that we certainly could have more than one hotel near a given point of
interest, so well need another component in our primary key in order to make sure
we have a unique partition for each hotel. So we add the hotel key as a clustering col‐
umn.
Logical Data Modeling | 87
Make Your Primary Keys Unique
An important consideration in designing your tables primary key
is making sure that it defines a unique data element. Otherwise you
run the risk of accidentally overwriting data.
Now for our second query (Q2), we’ll need a table to get information about a specific
hotel. One approach would have been to put all of the attributes of a hotel in the
hotels_by_poi table, but we chose to add only those attributes that were required by
our application workflow.
From our workflow diagram, we know that the hotels_by_poi table is used to dis‐
play a list of hotels with basic information on each hotel, and the application knows
the unique identifiers of the hotels returned. When the user selects a hotel to view
details, we can then use Q2, which is used to obtain details about the hotel. Because
we already have the hotel_id from Q1, we use that as our reference to the hotel were
looking for. Therefore our second table is just called hotels.
Another option would have been to store a set of poi_names in the hotels table. This
is an equally valid approach. You’ll learn through experience which approach is best
for your application.
Using Unique Identiers as References
You’ll find that its often helpful to use unique IDs to uniquely ref‐
erence elements, and to use these uuids as references in tables rep‐
resenting other entities. This helps to minimize coupling between
different entity types. This may prove especially helpful if you are
using a microservice architectural style for your application, in
which there are separate services responsible for each entity type.
For the purposes of this book, however, well use mostly text
attributes as identifiers, to keep our samples simple and readable.
For example, a common convention in the hospitality industry is to
reference properties by short codes like “AZ123” or “NY229”. Well
use these values for our hotel_ids, while acknowledging they are
not necessarily globally unique.
Q3 is just a reverse of Q1—looking for points of interest near a hotel, rather than
hotels near a point of interest. This time, however, we need to access the details of
each point of interest, as represented by the pois_by_hotel table. As we have done
previously, we add the point of interest name as a clustering key to guarantee unique‐
ness.
88 | Chapter 5: Data Modeling
At this point, lets now consider how to support query Q4 to help our user find avail‐
able rooms at a selected hotel for the nights they are interested in staying. Note that
this query involves both a start date and an end date. Because were querying over a
range instead of a single date, we know that we’ll need to use the date as a clustering
key. We use the hotel_id as a primary key to group room data for each hotel on a
single partition, which should help our search be super fast. Lets call this the
available_rooms_by_hotel_date table.
Searching Over a Range
Use clustering columns to store attributes that you need to access
in a range query. Remember that the order of the clustering col‐
umns is important. Well learn more about range queries in Chap‐
ter 9.
In order to round out the shopping portion of our data model, we add the
amenities_by_room table to support Q5. This will allow our user to view the ameni‐
ties of one of the rooms that is available for the desired stay dates.
Reservation Logical Data Model
Now we switch gears to look at the reservation queries. Figure 5-6 shows a logical
data model for reservations. You’ll notice that these tables represent a denormalized
design; the same data appears in multiple tables, with differing keys.
Figure 5-6. A denormalized logical model for reservations
Logical Data Modeling | 89
In order to satisfy Q6, the reservations_by_confirmation table supports the look
up of reservations by a unique confirmation number provided to the customer at the
time of booking.
If the guest doesnt have the confirmation number, the reservations_by_guest table
can be used to look up the reservation by guest name. We could envision query Q7
being used on behalf of a guest on a self-serve website or a call center agent trying to
assist the guest. Because the guest name might not be unique, we include the guest ID
here as a clustering column as well.
The hotel staff might wish to see a record of upcoming reservations by date in order
to get insight into how the hotel is performing, such as what dates the hotel is sold out
or undersold. Q8 supports the retrieval of reservations for a given hotel by date.
Finally, we create a guests table. You’ll notice that it has similar attributes to our user
table from Chapter 4. This provides a single location that we can use to store our
guests. In this case, we specify a separate unique identifier for our guest records, as it
is not uncommon for guests to have the same name. In many organizations, a cus‐
tomer database such as our guests table would be part of a separate customer man‐
agement application, which is why weve omitted other guest access patterns from our
example.
Design Queries for All Stakeholders
Q8 and Q9 in particular help to remind us that we need to create
queries that support various stakeholders of our application, not
just customers but staff as well, and perhaps even the analytics
team, suppliers, and so on.
Patterns and Anti-Patterns
As with other types of software design, there are some well-known patterns and anti-
patterns for data modeling in Cassandra. Weve already used one of the most com‐
mon patterns in our hotel model—the wide row.
The time series pattern is an extension of the wide row pattern. In this pattern, a ser‐
ies of measurements at specific time intervals are stored in a wide row, where the
measurement time is used as part of the partition key. This pattern is frequently used
in domains including business analysis, sensor data management, and scientific
experiments.
The time series pattern is also useful for data other than measurements. Consider the
example of a banking application. We could store each customer’s balance in a row,
but that might lead to a lot of read and write contention as various customers check
their balance or make transactions. Wed probably be tempted to wrap a transaction
around our writes just to protect the balance from being updated in error. In contrast,
90 | Chapter 5: Data Modeling
a time series–style design would store each transaction as a timestamped row and
leave the work of calculating the current balance to the application.
One design trap that many new users fall into is attempting to use Cassandra as a
queue. Each item in the queue is stored with a timestamp in a wide row. Items are
appended to the end of the queue and read from the front, being deleted after they are
read. This is a design that seems attractive, especially given its apparent similarity to
the time series pattern. The problem with this approach is that the deleted items are
now tombstones that Cassandra must scan past in order to read from the front of the
queue. Over time, a growing number of tombstones begins to degrade read perfor‐
mance.
The queue anti-pattern serves as a reminder that any design that relies on the deletion
of data is potentially a poorly performing design.
Physical Data Modeling
Once we have a logical data model defined, creating the physical model is a relatively
simple process.
We walk through each of our logical model tables, assigning types to each item. We
can use any of the types we covered in Chapter 4, including the basic types, collec‐
tions, and user-defined types. We may identify additional user-defined types that can
be created to simplify our design.
After weve assigned our data types, we analyze our model by performing size calcula‐
tions and testing out how the model works. We may make some adjustments based
on our findings. Once again we’ll cover the data modeling process in more detail by
working through our example.
Before we get started, let’s look at a few additions to the Chebotko notation for physi‐
cal data models.
Chebotko Physical Diagrams
To draw physical models, we need to be able to add the typing information for each
column. Figure 5-7 shows the addition of a type for each column in a sample table.
The figure includes a designation of the keyspace containing each table and visual
cues for columns represented using collections and user-defined types. We also note
the designation of static columns and secondary index columns. There is no restric‐
tion on assigning these as part of a logical model, but they are typically more of a
physical data modeling concern.
Physical Data Modeling | 91
Figure 5-7. Extending the Chebotko notation for physical data models
Hotel Physical Data Model
Now let’s get to work on our physical model. First, we need keyspaces for our tables.
To keep the design relatively simple, we’ll create a hotel keyspace to contain our
tables for hotel and availability data, and a reservation keyspace to contain tables for
reservation and guest data. In a real system, we might divide the tables across even
more keyspaces in order to separate concerns.
For our hotels table, well use Cassandras text type to represent the hotels id. For
the address, we’ll use the address type that we created in Chapter 4. We use the text
type to represent the phone number, as there is considerable variance in the format‐
ting of numbers between countries.
As we work to create physical representations of various tables in our logical hotel
data model, we use the same approach. The resulting design is shown in Figure 5-8.
92 | Chapter 5: Data Modeling
Figure 5-8. Hotel physical model
Note that we have also included the address type in our design. It is designated with
an asterisk to denote that it is a user-defined type, and has no primary key columns
identified. We make use of this type in the hotels and hotels_by_poi tables.
Taking Advantage of User-Dened Types
It is often helpful to make use of user-defined types to help reduce
duplication of non-primary key columns, as we have done with the
address user-defined type. This can reduce complexity in the
design.
Remember that the scope of a UDT is the keyspace in which it is
defined. To use address in the reservation keyspace were about
to design, we’ll have to declare it again. This is just one of the many
trade-offs we have to make in data model design.
Reservation Physical Data Model
Now, lets turn our attention to the reservation tables in our design. Remember that
our logical model contained three denormalized tables to support queries for reserva‐
tions by confirmation number, guest, and hotel and date. As we work to implement
these different designs, well want to consider whether to manage the denormaliza‐
tion manually or use Cassandras materialized view capability.
The design shown for the reservation keyspace in Figure 5-9 uses both approaches.
We chose to implement reservations_by_hotel_date and reservations_by_guest
as regular tables, and reservations_by_confirmation as a materialized view on the
Physical Data Modeling | 93
reservations_by_hotel_date table. We’ll discuss the reasoning behind this design
choice momentarily.
Figure 5-9. Reservation physical model
Note that we have reproduced the address type in this keyspace and modeled the
guest_id as a uuid type in all of our tables.
Materialized Views
Materialized views were introduced to help address some of the shortcomings of sec‐
ondary indexes, which we discussed in Chapter 4. Creating indexes on columns with
high cardinality tends to result in poor performance, because most or all of the nodes
in the ring need are queried.
Materialized views address this problem by storing preconfigured views that support
queries on additional columns which are not part of the original clustering key. Mate‐
rialized views simplify application development: instead of the application having to
keep multiple denormalized tables in sync, Cassandra takes on the responsibility of
updating views in order to keep them consistent with the base table.
Materialized views incur a small performance impact on writes in order to maintain
this consistency. However, materialized views demonstrate more efficient perfor‐
mance compared to managing denormalized tables in application clients. Internally,
materialized view updates are implemented using batching, which we will discuss in
Chapter 9.
94 | Chapter 5: Data Modeling
Similar to secondary indexes, materialized views can be created on existing tables.
To understand the syntax and constraints associated with materialized views, we’ll
take a look at the CQL command that creates the reservations_by_confirmation
table from the reservation physical model:
cqlsh> CREATE MATERIALIZED VIEW reservation.reservations_by_confirmation
AS SELECT *
FROM reservation.reservations_by_hotel_date
WHERE confirm_number IS NOT NULL and hotel_id IS NOT NULL and
start_date IS NOT NULL and room_number IS NOT NULL
PRIMARY KEY (confirm_number, hotel_id, start_date, room_number);
The order of the clauses in the CREATE MATERIALIZED VIEW command can appear
somewhat inverted, so we’ll walk through these clauses in an order that is a bit easier
to process.
The first parameter after the command is the name of the materialized view—in this
case, reservations_by_confirmation. The FROM clause identifies the base table for
the materialized view, reservations_by_hotel_date.
The PRIMARY KEY clause identifies the primary key for the materialized view, which
must include all of the columns in the primary key of the base table. This restriction
keeps Cassandra from collapsing multiple rows in the base table into a single row in
the materialized view, which would greatly increase the complexity of managing
updates.
The grouping of the primary key columns uses the same syntax as an ordinary table.
The most common usage is to place the additional column first as the partition key,
followed by the base table primary key columns, used as clustering columns for pur‐
poses of the materialized view.
The WHERE clause provides support for filtering.Note that a filter must be specified for
every primary key column of the materialized view, even if it is as simple as designat‐
ing that the value IS NOT NULL.
The AS SELECT clause identifies the columns from the base table that we want our
materialized view to contain. We can reference individual columns, but in this case
have chosen for all columns to be part of the view by using the wildcard *.
Physical Data Modeling | 95
Enhanced Materialized View Capabilities
The initial implementation of materialized views in the 3.0 release
has some limitations on the selection of primary key columns and
filters. There are several JIRA issues in progress to add capabilities
such as multiple non-primary key columns in materialized view
primary keys CASSANDRA-9928 or using aggregates in material‐
ized views CASSANDRA-9778. If you’re interested in these fea‐
tures, track the JIRA issues to see when they will be included in a
release.
Now that we have a better understanding of the design and use of materialized views,
we can revisit the prior decision made for the reservation physical design. Specifically,
reservations_by_confirmation is a good candidate for implementation as a mate‐
rialized view due to the high cardinality of the confirmation numbers—after all, you
cant get any higher cardinality than a unique value per reservation.
An alternate design would have been to use reservations_by_confirmation as the
base table and reservations_by_hotel_date as a materialized view. However,
because we cannot (at least in early 3.X releases) create a materialized view with mul‐
tiple non-primary key column from the base table, this would have required us to
designate either hotel_id or date as a clustering column in reservations_by_con
firmation. Both designs are acceptable, but this should give some insight into the
trade-offs youll want to consider in selecting which of several denormalized table
designs to use as the base table.
Evaluating and Rening
Once weve created our physical model, there are some steps we’ll want to take to
evaluate and refine our table designs to help ensure optimal performance.
Calculating Partition Size
The first thing that we want to look for is whether our tables will have partitions that
will be overly large, or to put it another way, partitions that are too wide. Partition
size is measured by the number of cells (values) that are stored in the partition. Cas‐
sandras hard limit is 2 billion cells per partition, but we’ll likely run into performance
issues before reaching that limit.
In order to calculate the size of our partitions, we use the following formula:
Nv=NrNcNpk Ns+Ns
The number of values (or cells) in the partition (Nv) is equal to the number of static
columns (Ns) plus the product of the number of rows (Nr) and the number of of val‐
96 | Chapter 5: Data Modeling
ues per row. The number of values per row is defined as the number of columns (Nc)
minus the number of primary key columns (Npk) and static columns (Ns).
The number of columns tends to be relatively static, although as we have seen it is
quite possible to alter tables at runtime. For this reason, a primary driver of partition
size is the number of rows in the partition. This is a key factor that you must consider
in determining whether a partition has the potential to get too large. Two billion val‐
ues sounds like a lot, but in a sensor system where tens or hundreds of values are
measured every millisecond, the number of values starts to add up pretty fast.
Lets take a look at one of our tables to analyze the partition size. Because it has a wide
row design with a partition per hotel, we’ll choose the available_rooms_
by_hotel_date table. The table has four columns total (Nc = 4), including three pri‐
mary key columns (Npk = 3) and no static columns (Ns = 0). Plugging these values
into our formula, we get:
Nv=Nr4 − 3 − 0 + 0 = 1Nr
So the number of values for this table is equal to the number of rows. We still need to
determine a number of rows. To do this, we make some estimates based on the appli‐
cation were designing. Our table is storing a record for each room, in each of our
hotels, for every night. Let’s assume that our system will be used to store two years of
inventory at a time, and there are 5,000 hotels in our system, with an average of 100
rooms in each hotel.
Since there is a partition for each hotel, our estimated number of rows per partition is
as follows:
Nr= 100 rooms/hotel × 730 days = 73, 000 rows
This relatively small number of rows per partition is not going to get us in too much
trouble, but if we start storing more dates of inventory, or dont manage the size of
our inventory well using TTL, we could start having issues. We still might want to
look at breaking up this large partition, which well do shortly.
Estimate for the Worst Case
When performing sizing calculations, it is tempting to assume the
nominal or average case for variables such as the number of rows.
Consider calculating the worst case as well, as these sorts of predic‐
tions have a way of coming true in successful systems.
Calculating Size on Disk
In addition to calculating the size of our partition, it is also an excellent idea for us to
estimate the amount of disk space that will be required for each table we plan to store
Evaluating and Rening | 97
in the cluster. In order to determine the size, we use the following formula to deter‐
mine the size St of a partition:
St=
isizeOf cki+
jsizeOf csj+Nr×
ksizeOf crk+
lsizeOf ccl+
Nv×sizeOf tavg
This is a bit more complex than our previous formula, but we’ll break it down a bit at
a time. Lets take a look at the notation first:
In this formula, ck refers to partition key columns, cs to static columns, cr to regu‐
lar columns, and cc to clustering columns.
The term tavg refers to the average number of bytes of metadata stored per cell,
such as timestamps. It is typical to use an estimate of 8 bytes for this value.
We recognize the number of rows Nr and number of values Nv from our previous
calculations.
The sizeOf() function refers to the size in bytes of the CQL data type of each ref‐
erenced column.
The first term asks us to sum the size of the partition key columns. For our example,
the available_rooms_by_hotel_date table has a single partition key column, the
hotel_id, which we chose to make of type text. Assuming our hotel identifiers are
simple 5-character codes, we have a 5-byte value, so the sum of our partition key col‐
umn sizes is 5 bytes.
The second term asks us to sum the size of our static columns. Our table has no static
columns, so in our case this is 0 bytes.
The third term is the most involved, and for good reason—it is calculating the size of
the cells in the partition. We sum the size of the clustering columns and regular col‐
umns. Our two clustering columns are the date, which we assume is 4 bytes, and the
room_number, which is a 2-byte short integer, giving us a sum of 6 bytes. There is only
a single regular column, the boolean is_available, which is 1 byte in size. Summing
the regular column size (1 byte) plus the clustering column size (6 bytes) gives us a
total of 7 bytes. To finish up the term, we multiply this value by the number of rows
(73,000), giving us 511,000 bytes (0.51 MB).
The fourth term is simply counting the metadata that that Cassandra stores for each
cell. In the storage format used by Cassandra 3.0 and later, the amount of metadata
for a given cell varies based on the type of data being stored, and whether or not cus‐
tom timestamp or TTL values are specified for individual cells. For our table, we
reuse the number of values from our previous calculation (73,000) and multiply by 8,
which gives us 0.58 MB.
Adding these terms together, we get our final estimate:
98 | Chapter 5: Data Modeling
Partition size = 16 bytes + 0 bytes + 0.51 MB + 0.58 MB = 1.1 MB
This formula is an approximation of the actual size of a partition on disk, but is accu‐
rate enough to be quite useful. Remembering that the partition must be able to fit on
a single node, it looks like our table design will not put a lot of strain on our disk
storage.
A More Compact Storage Format
As mentioned in Chapter 2, Cassandras storage engine was re-
implemented for the 3.0 release, including a new format for
SSTable files. The previous format stored a separate copy of the
clustering columns as part of the record for each cell. The newer
format eliminates this duplication, which reduces the size of stored
data and simplifies the formula for computing that size.
Keep in mind also that this estimate only counts a single replica of our data. We will
need to multiply the value obtained here by the number of partitions and the number
of replicas specified by the keyspaces replication strategy in order to determine the
total required total capacity for each table. This will come in handy when we discuss
how to plan our clusters in Chapter 14.
Breaking Up Large Partitions
As discussed previously, our goal is to design tables that can provide the data we need
with queries that touch a single partition, or failing that, the minimum possible num‐
ber of partitions. However, as we have seen in our examples, it is quite possible to
design wide row-style tables that approach Cassandras built-in limits. Performing siz‐
ing analysis on tables may reveal partitions that are potentially too large, either in
number of values, size on disk, or both.
The technique for splitting a large partition is straightforward: add an additional col‐
umn to the partition key. In most cases, moving one of the existing columns into the
partition key will be sufficient. Another option is to introduce an additional column
to the table to act as a sharding key, but this requires additional application logic.
Continuing to examine our available rooms example, if we add the date column to
the partition key for the available_rooms_by_hotel_date table, each partition
would then represent the availability of rooms at a specific hotel on a specific date.
This will certainly yield partitions that are significantly smaller, perhaps too small, as
the data for consecutive days will likely be on separate nodes.
Another technique known as bucketing is often used to break the data into moderate-
size partitions. For example, we could bucketize our available_rooms_
by_hotel_date table by adding a month column to the partition key. While this col‐
Evaluating and Rening | 99
umn is partially duplicative of the date, it provides a nice way of grouping related
data in a partition that will not get too large.
If we really felt strongly about preserving a wide row design, we could instead add the
room_id to the partition key, so that each partition would represent the availability of
the room across all dates. Because we haven’t identified a query that involves search‐
ing availability of a specific room, the first or second design approach is most suitable
to our application needs.
Dening Database Schema
Once we have finished evaluating and refining our physical model, were ready to
implement the schema in CQL. Here is the schema for the hotel keyspace, using
CQLs comment feature to document the query pattern supported by each table:
CREATE KEYSPACE hotel
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
CREATE TYPE hotel.address (
street text,
city text,
state_or_province text,
postal_code text,
country text
);
CREATE TABLE hotel.hotels_by_poi (
poi_name text,
hotel_id text,
name text,
phone text,
address frozen<address>,
PRIMARY KEY ((poi_name), hotel_id)
) WITH comment = 'Q1. Find hotels near given poi'
AND CLUSTERING ORDER BY (hotel_id ASC) ;
CREATE TABLE hotel.hotels (
id text PRIMARY KEY,
name text,
phone text,
address frozen<address>,
pois set<text>
) WITH comment = 'Q2. Find information about a hotel';
CREATE TABLE hotel.pois_by_hotel (
poi_name text,
hotel_id text,
description text,
PRIMARY KEY ((hotel_id), poi_name)
) WITH comment = 'Q3. Find pois near a hotel';
100 | Chapter 5: Data Modeling
CREATE TABLE hotel.available_rooms_by_hotel_date (
hotel_id text,
date date,
room_number smallint,
is_available boolean,
PRIMARY KEY ((hotel_id), date, room_number)
) WITH comment = 'Q4. Find available rooms by hotel / date';
CREATE TABLE hotel.amenities_by_room (
hotel_id text,
room_number smallint,
amenity_name text,
description text,
PRIMARY KEY ((hotel_id, room_number), amenity_name)
) WITH comment = 'Q5. Find amenities for a room';
Identify Partition Keys Explicitly
We chose to represent our tables by surrounding the elements of
our partition key with parentheses, even though the partition key
consists of the single column poi_name. This is a best practice that
makes our selection of partition key more explicit to others reading
our CQL.
Similarly, here is the schema for the reservation keyspace:
CREATE KEYSPACE reservation
WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
CREATE TYPE reservation.address (
street text,
city text,
state_or_province text,
postal_code text,
country text
);
CREATE TABLE reservation.reservations_by_hotel_date (
hotel_id text,
start_date date,
end_date date,
room_number smallint,
confirm_number text,
guest_id uuid,
PRIMARY KEY ((hotel_id, start_date), room_number)
) WITH comment = 'Q7. Find reservations by hotel and date';
CREATE MATERIALIZED VIEW reservation.reservations_by_confirmation AS
SELECT * FROM reservation.reservations_by_hotel_date
WHERE confirm_number IS NOT NULL and hotel_id IS NOT NULL and
Dening Database Schema | 101
start_date IS NOT NULL and room_number IS NOT NULL
PRIMARY KEY (confirm_number, hotel_id, start_date, room_number);
CREATE TABLE reservation.reservations_by_guest (
guest_last_name text,
hotel_id text,
start_date date,
end_date date,
room_number smallint,
confirm_number text,
guest_id uuid,
PRIMARY KEY ((guest_last_name), hotel_id)
) WITH comment = 'Q8. Find reservations by guest name';
CREATE TABLE reservation.guests (
guest_id uuid PRIMARY KEY,
first_name text,
last_name text,
title text,
emails set<text>,
phone_numbers list<text>,
addresses map<text, frozen<address>>,
confirm_number text
) WITH comment = 'Q9. Find guest by ID';
DataStax DevCenter
Weve already had quite a bit of practice creating schema using cqlsh, but now that
were starting to create an application data model with more tables, it starts to be
more of a challenge to keep track of all of that CQL.
Thankfully, there is a great development tool provided by DataStax called DevCenter.
This tool is available as a free download from the DataStax Academy. Figure 5-10
shows the hotel schema being edited in DevCenter.
The middle pane shows the currently selected CQL file, featuring syntax highlighting
for CQL commands, CQL types, and name literals. DevCenter provides command
completion as you type out CQL commands and interprets the commands you type,
highlighting any errors you make. The tool provides panes for managing multiple
CQL scripts and connections to multiple clusters. The connections are used to run
CQL commands against live clusters and view the results.
102 | Chapter 5: Data Modeling
Figure 5-10. Editing the Hotel schema in DataStax DevCenter
Summary
In this chapter, we saw how to create a complete, working Cassandra data model and
compared it with an equivalent relational model. We represented our data model in
both logical and physical forms and learned a new tool for realizing our data models
in CQL. Now that we have a working data model, we’ll continue building our hotel
application in the coming chapters.
Summary | 103
CHAPTER 6
The Cassandra Architecture
3.2 Architecture - fundamental concepts or properties of a system in its environment embod‐
ied in its elements, relationships, and in the principles of its design and evolution.
—ISO/IEC/IEEE 42010
In this chapter, we examine several aspects of Cassandras architecture in order to
understand how it does its job. We’ll explain the topology of a cluster, and how nodes
interact in a peer-to-peer design to maintain the health of the cluster and exchange
data, using techniques like gossip, anti-entropy, and hinted handoff. Looking inside
the design of a node, we examine architecture techniques Cassandra uses to support
reading, writing, and deleting data, and examine how these choices affect architec‐
tural considerations such as scalability, durability, availability, manageability, and
more. We also discuss Cassandras adoption of a Staged Event-Driven Architecture,
which acts as the platform for request delegation.
As we introduce these topics, we also provide references to where you can find their
implementations in the Cassandra source code.
Data Centers and Racks
Cassandra is frequently used in systems spanning physically separate locations. Cas‐
sandra provides two levels of grouping that are used to describe the topology of a
cluster: data center and rack. A rack is a logical set of nodes in close proximity to each
other, perhaps on physical machines in a single rack of equipment. A data center is a
logical set of racks, perhaps located in the same building and connected by reliable
network. A sample topology with multiple data centers and racks is shown in
Figure 6-1.
105
Figure 6-1. Topology of a sample cluster with data centers, racks, and nodes
Out of the box, Cassandra comes with a default configuration of a single data center
("DC1") containing a single rack ("RAC1"). We’ll learn in Chapter 7 how to build a
larger cluster and define its topology.
Cassandra leverages the information you provide about your cluster’s topology to
determine where to store data, and how to route queries efficiently. Cassandra tries to
store copies of your data in multiple data centers to maximize availability and parti‐
tion tolerance, while preferring to route queries to nodes in the local data center to
maximize performance.
Gossip and Failure Detection
To support decentralization and partition tolerance, Cassandra uses a gossip protocol
that allows each node to keep track of state information about the other nodes in the
cluster. The gossiper runs every second on a timer.
Gossip protocols (sometimes called “epidemic protocols”) generally assume a faulty
network, are commonly employed in very large, decentralized network systems, and
are often used as an automatic mechanism for replication in distributed databases.
They take their name from the concept of human gossip, a form of communication in
which peers can choose with whom they want to exchange information.
106 | Chapter 6: The Cassandra Architecture
The Origin of “Gossip Protocol”
The term “gossip protocol” was originally coined in 1987 by Alan
Demers, a researcher at Xerox’s Palo Alto Research Center, who
was studying ways to route information through unreliable net‐
works.
The gossip protocol in Cassandra is primarily implemented by the org.apache.cas
sandra.gms.Gossiper class, which is responsible for managing gossip for the local
node. When a server node is started, it registers itself with the gossiper to receive
endpoint state information.
Because Cassandra gossip is used for failure detection, the Gossiper class maintains a
list of nodes that are alive and dead.
Here is how the gossiper works:
1. Once per second, the gossiper will choose a random node in the cluster and initi‐
alize a gossip session with it. Each round of gossip requires three messages.
2. The gossip initiator sends its chosen friend a GossipDigestSynMessage.
3. When the friend receives this message, it returns a GossipDigestAckMessage.
4. When the initiator receives the ack message from the friend, it sends the friend a
GossipDigestAck2Message to complete the round of gossip.
When the gossiper determines that another endpoint is dead, it “convicts” that end‐
point by marking it as dead in its local list and logging that fact.
Cassandra has robust support for failure detection, as specified by a popular algo‐
rithm for distributed computing called Phi Accrual Failure Detection. This manner of
failure detection originated at the Advanced Institute of Science and Technology in
Japan in 2004.
Accrual failure detection is based on two primary ideas. The first general idea is that
failure detection should be flexible, which is achieved by decoupling it from the appli‐
cation being monitored. The second and more novel idea challenges the notion of
traditional failure detectors, which are implemented by simple “heartbeats” and
decide whether a node is dead or not dead based on whether a heartbeat is received
or not. But accrual failure detection decides that this approach is naive, and finds a
place in between the extremes of dead and alive—a suspicion level.
Therefore, the failure monitoring system outputs a continuous level of “suspicion
regarding how confident it is that a node has failed. This is desirable because it can
take into account fluctuations in the network environment. For example, just because
one connection gets caught up doesn’t necessarily mean that the whole node is dead.
So suspicion offers a more fluid and proactive indication of the weaker or stronger
Gossip and Failure Detection | 107
possibility of failure based on interpretation (the sampling of heartbeats), as opposed
to a simple binary assessment.
Phi Threshold and Accrual Failure Detectors
Accrual Failure Detectors output a value associated with each process (or node). This
value is called Phi. The value is output in a manner that is designed from the ground
up to be adaptive in the face of volatile network conditions, so it’s not a binary condi‐
tion that simply checks whether a server is up or down.
The Phi convict threshold in the configuration adjusts the sensitivity of the failure
detector. Lower values increase the sensitivity and higher values decrease it, but not in
a linear fashion.
The Phi value refers to a level of suspicion that a server might be down. Applications
such as Cassandra that employ an AFD can specify variable conditions for the Phi
value they emit. Cassandra can generally detect a failed node in about 10 seconds
using this mechanism.
You can read the original Phi Accrual Failure Detection paper by Naohiro Hayashi‐
bara et al. at http://www.jaist.ac.jp/~defago/les/pdf/IS_RR_2004_010.pdf.
Failure detection is implemented in Cassandra by the org.apache.cassandra.gms.
FailureDetector class, which implements the org.apache.cassandra.gms.IFailur
eDetector interface. Together, they allow operations including:
isAlive(InetAddress)
What the detector will report about a given nodes alive-ness.
interpret(InetAddress)
Used by the gossiper to help it decide whether a node is alive or not based on
suspicion level reached by calculating Phi (as described in the Hayashibara
paper).
report(InetAddress)
When a node receives a heartbeat, it invokes this method.
Snitches
The job of a snitch is to determine relative host proximity for each node in a cluster,
which is used to determine which nodes to read and write from. Snitches gather
information about your network topology so that Cassandra can efficiently route
requests. The snitch will figure out where nodes are in relation to other nodes.
108 | Chapter 6: The Cassandra Architecture
As an example, lets examine how the snitch participates in a read operation. When
Cassandra performs a read, it must contact a number of replicas determined by the
consistency level. In order to support the maximum speed for reads, Cassandra
selects a single replica to query for the full object, and asks additional replicas for
hash values in order to ensure the latest version of the requested data is returned. The
role of the snitch is to help identify the replica that will return the fastest, and this is
the replica which is queried for the full data.
The default snitch (the SimpleSnitch) is topology unaware; that is, it does not know
about the racks and data centers in a cluster, which makes it unsuitable for multi-data
center deployments. For this reason, Cassandra comes with several snitches for dif‐
ferent cloud environments including Amazon EC2, Google Cloud, and Apache
Cloudstack.
The snitches can be found in the package org.apache.cassandra.locator. Each
snitch implements the IEndpointSnitch interface. Well learn how to select and con‐
figure an appropriate snitch for your environment in Chapter 7.
While Cassandra provides a pluggable way to statically describe your cluster’s topol‐
ogy, it also provides a feature called dynamic snitching that helps optimize the routing
of reads and writes over time. Heres how it works. Your selected snitch is wrapped
with another snitch called the DynamicEndpointSnitch. The dynamic snitch gets its
basic understanding of the topology from the selected snitch. It then monitors the
performance of requests to the other nodes, even keeping track of things like which
nodes are performing compaction. The performance data is used to select the best
replica for each query. This enables Cassandra to avoid routing requests to replicas
that are performing poorly.
The dynamic snitching implementation uses a modified version of the Phi failure
detection mechanism used by gossip. The “badness threshold” is a configurable
parameter that determines how much worse a preferred node must perform than the
best-performing node in order to lose its preferential status. The scores of each node
are reset periodically in order to allow a poorly performing node to demonstrate that
it has recovered and reclaim its preferred status.
Rings and Tokens
So far weve been focusing on how Cassandra keeps track of the physical layout of
nodes in a cluster. Lets shift gears and look at how Cassandra distributes data across
these nodes.
Cassandra represents the data managed by a cluster as a ring. Each node in the ring is
assigned one or more ranges of data described by a token, which determines its posi‐
tion in the ring. A token is a 64-bit integer ID used to identify each partition. This
gives a possible range for tokens from –263 to 263–1.
Rings and Tokens | 109
A node claims ownership of the range of values less than or equal to each token and
greater than the token of the previous node. The node with lowest token owns the
range less than or equal to its token and the range greater than the highest token,
which is also known as the “wrapping range.” In this way, the tokens specify a com‐
plete ring. Figure 6-2 shows a notional ring layout including the nodes in a single
data center. This particular arrangement is structured such that consecutive token
ranges are spread across nodes in different racks.
Figure 6-2. Example ring arrangement of nodes in a data center
Data is assigned to nodes by using a hash function to calculate a token for the parti‐
tion key. This partition key token is compared to the token values for the various
nodes to identify the range, and therefore the node, that owns the data.
Token ranges are represented by the org.apache.cassandra.dht.Range class.
Virtual Nodes
Early versions of Cassandra assigned a single token to each node, in a fairly static
manner, requiring you to calculate tokens for each node. Although there are tools
available to calculate tokens based on a given number of nodes, it was still a manual
process to configure the initial_token property for each node in the cassandra.yaml
file. This also made adding or replacing a node an expensive operation, as rebalanc‐
ing the cluster required moving a lot of data.
110 | Chapter 6: The Cassandra Architecture
Cassandras 1.2 release introduced the concept of virtual nodes, also called vnodes for
short. Instead of assigning a single token to a node, the token range is broken up into
multiple smaller ranges. Each physical node is then assigned multiple tokens. By
default, each node will be assigned 256 of these tokens, meaning that it contains 256
virtual nodes. Virtual nodes have been enabled by default since 2.0.
Vnodes make it easier to maintain a cluster containing heterogeneous machines. For
nodes in your cluster that have more computing resources available to them, you can
increase the number of vnodes by setting the num_tokens property in the cassan‐
dra.yaml file. Conversely, you might set num_tokens lower to decrease the number of
vnodes for less capable machines.
Cassandra automatically handles the calculation of token ranges for each node in the
cluster in proportion to their num_tokens value. Token assignments for vnodes are
calculated by the org.apache.cassandra.dht.tokenallocator.ReplicationAware
TokenAllocator class.
A further advantage of virtual nodes is that they speed up some of the more heavy‐
weight Cassandra operations such as bootstrapping a new node, decommissioning a
node, and repairing a node. This is because the load associated with operations on
multiple smaller ranges is spread more evenly across the nodes in the cluster.
Partitioners
A partitioner determines how data is distributed across the nodes in the cluster. As we
learned in Chapter 5, Cassandra stores data in wide rows, or “partitions.” Each row
has a partition key that is used to identify the partition. A partitioner, then, is a hash
function for computing the token of a partition key. Each row of data is distributed
within the ring according to the value of the partition key token.
Cassandra provides several different partitioners in the org.apache.cassandra.dht
package (DHT stands for “distributed hash table”). The Murmur3Partitioner was
added in 1.2 and has been the default partitioner since then; it is an efficient Java
implementation on the murmur algorithm developed by Austin Appleby. It generates
64-bit hashes. The previous default was the RandomPartitioner.
Because of Cassandras generally pluggable design, you can also create your own par‐
titioner by implementing the org.apache.cassandra.dht.IPartitioner class and
placing it on Cassandras classpath.
Partitioners | 111
Replication Strategies
A node serves as a replica for different ranges of data. If one node goes down, other
replicas can respond to queries for that range of data. Cassandra replicates data across
nodes in a manner transparent to the user, and the replication factor is the number of
nodes in your cluster that will receive copies (replicas) of the same data. If your repli‐
cation factor is 3, then three nodes in the ring will have copies of each row.
The first replica will always be the node that claims the range in which the token falls,
but the remainder of the replicas are placed according to the replication strategy
(sometimes also referred to as the replica placement strategy).
For determining replica placement, Cassandra implements the Gang of Four Strategy
pattern, which is outlined in the common abstract class org.apache.cassandra.loca
tor.AbstractReplicationStrategy, allowing different implementations of an algo‐
rithm (different strategies for accomplishing the same work). Each algorithm
implementation is encapsulated inside a single class that extends the AbstractRepli
cationStrategy.
Out of the box, Cassandra provides two primary implementations of this interface
(extensions of the abstract class): SimpleStrategy and NetworkTopologyStrategy.
The SimpleStrategy places replicas at consecutive nodes around the ring, starting
with the node indicated by the partitioner. The NetworkTopologyStrategy allows you
to specify a different replication factor for each data center. Within a data center, it
allocates replicas to different racks in order to maximize availability.
Legacy Replication Strategies
A third strategy, OldNetworkTopologyStrategy, is provided for
backward compatibility. It was previously known as the RackAware
Strategy, while the SimpleStrategy was previously known as the
RackUnawareStrategy. NetworkTopologyStrategy was previously
known as DataCenterShardStrategy. These changes were effective
in the 0.7 release.
The strategy is set independently for each keyspace and is a required option to create
a keyspace, as we saw in Chapter 5.
112 | Chapter 6: The Cassandra Architecture
Consistency Levels
In Chapter 2, we discussed Brewer’s CAP theorem, in which consistency, availability,
and partition tolerance are traded off against one another. Cassandra provides tunea‐
ble consistency levels that allow you to make these trade-offs at a fine-grained level.
You specify a consistency level on each read or write query that indicates how much
consistency you require. A higher consistency level means that more nodes need to
respond to a read or write query, giving you more assurance that the values present
on each replica are the same.
For read queries, the consistency level specifies how many replica nodes must
respond to a read request before returning the data. For write operations, the consis‐
tency level specifies how many replica nodes must respond for the write to be
reported as successful to the client. Because Cassandra is eventually consistent,
updates to other replica nodes may continue in the background.
The available consistency levels include ONE, TWO, and THREE, each of which specify an
absolute number of replica nodes that must respond to a request. The QUORUM consis‐
tency level requires a response from a majority of the replica nodes (sometimes
expressed as “replication factor / 2 + 1”). The ALL consistency level requires a
response from all of the replicas. We’ll examine these consistency levels and others in
more detail in Chapter 9.
For both reads and writes, the consistency levels of ANY, ONE, TWO, and THREE are con‐
sidered weak, whereas QUORUM and ALL are considered strong. Consistency is tuneable
in Cassandra because clients can specify the desired consistency level on both reads
and writes. There is an equation that is popularly used to represent the way to achieve
strong consistency in Cassandra: R + W > N = strong consistency. In this equation, R,
W, and N are the read replica count, the write replica count, and the replication fac‐
tor, respectively; all client reads will see the most recent write in this scenario, and
you will have strong consistency.
Distinguishing Consistency Levels and Replication Factors
If you’re new to Cassandra, the replication factor can sometimes be
confused with the consistency level. The replication factor is set per
keyspace. The consistency level is specified per query, by the client.
The replication factor indicates how many nodes you want to use
to store a value during each write operation. The consistency level
specifies how many nodes the client has decided must respond in
order to feel confident of a successful read or write operation. The
confusion arises because the consistency level is based on the repli‐
cation factor, not on the number of nodes in the system.
Consistency Levels | 113
Queries and Coordinator Nodes
Lets bring these concepts together to discuss how Cassandra nodes interact to sup‐
port reads and writes from client applications. Figure 6-3 shows the typical path of
interactions with Cassandra.
Figure 6-3. Clients, coordinator nodes, and replicas
A client may connect to any node in the cluster to initiate a read or write query. This
node is known as the coordinator node. The coordinator identifies which nodes are
replicas for the data that is being written or read and forwards the queries to them.
For a write, the coordinator node contacts all replicas, as determined by the consis‐
tency level and replication factor, and considers the write successful when a number
of replicas commensurate with the consistency level acknowledge the write.
For a read, the coordinator contacts enough replicas to ensure the required consis‐
tency level is met, and returns the data to the client.
These, of course, are the “happy path” descriptions of how Cassandra works. We’ll
soon discuss some of Cassandras high availability mechanisms, including hinted
handoff.
114 | Chapter 6: The Cassandra Architecture
Memtables, SSTables, and Commit Logs
Now let’s take a look at some of Cassandras internal data structures and files, sum‐
marized in Figure 6-4. Cassandra stores data both in memory and on disk to provide
both high performance and durability. In this section, well focus on Cassandras use
of constructs called memtables, SSTables, and commit logs to support the writing and
reading of data from tables.
Figure 6-4. Internal data structures and les of a Cassandra node
When you perform a write operation, its immediately written to a commit log. The
commit log is a crash-recovery mechanism that supports Cassandras durability goals.
A write will not count as successful until it’s written to the commit log, to ensure that
if a write operation does not make it to the in-memory store (the memtable, dis‐
cussed in a moment), it will still be possible to recover the data. If you shut down the
database or it crashes unexpectedly, the commit log can ensure that data is not lost.
Thats because the next time you start the node, the commit log gets replayed. In fact,
thats the only time the commit log is read; clients never read from it.
After its written to the commit log, the value is written to a memory-resident data
structure called the memtable. Each memtable contains data for a specific table. In
early implementations of Cassandra, memtables were stored on the JVM heap, but
improvements starting with the 2.1 release have moved the majority of memtable data
to native memory. This makes Cassandra less susceptible to fluctuations in perfor‐
mance due to Java garbage collection.
When the number of objects stored in the memtable reaches a threshold, the contents
of the memtable are flushed to disk in a file called an SSTable. A new memtable is
then created. This flushing is a non-blocking operation; multiple memtables may
Memtables, SSTables, and Commit Logs | 115
exist for a single table, one current and the rest waiting to be flushed. They typically
should not have to wait very long, as the node should flush them very quickly unless
it is overloaded.
Each commit log maintains an internal bit flag to indicate whether it needs flushing.
When a write operation is first received, it is written to the commit log and its bit flag
is set to 1. There is only one bit flag per table, because only one commit log is ever
being written to across the entire server. All writes to all tables will go into the same
commit log, so the bit flag indicates whether a particular commit log contains any‐
thing that hasn’t been flushed for a particular table. Once the memtable has been
properly flushed to disk, the corresponding commit log’s bit flag is set to 0, indicating
that the commit log no longer has to maintain that data for durability purposes. Like
regular logfiles, commit logs have a configurable rollover threshold, and once this file
size threshold is reached, the log will roll over, carrying with it any extant dirty bit
flags.
The SSTable is a concept borrowed from Googles Bigtable. Once a memtable is
flushed to disk as an SSTable, it is immutable and cannot be changed by the applica‐
tion. Despite the fact that SSTables are compacted, this compaction changes only
their on-disk representation; it essentially performs the “merge” step of a mergesort
into new files and removes the old files on success.
Why Are They Called “SSTables”?
The idea that “SSTable” is a compaction of “Sorted String Table” is
somewhat inaccurate for Cassandra, because the data is not stored
as strings on disk.
Since the 1.0 release, Cassandra has supported the compression of SSTables in order
to maximize use of the available storage. This compression is configurable per table.
Each SSTable also has an associated Bloom filter, which is used as an additional per‐
formance enhancer (see “Bloom Filters” on page 120).
All writes are sequential, which is the primary reason that writes perform so well in
Cassandra. No reads or seeks of any kind are required for writing a value to Cassan‐
dra because all writes are append operations. This makes one key limitation on per‐
formance the speed of your disk. Compaction is intended to amortize the
reorganization of data, but it uses sequential I/O to do so. So the performance benefit
is gained by splitting; the write operation is just an immediate append, and then com‐
paction helps to organize for better future read performance. If Cassandra naively
inserted values where they ultimately belonged, writing clients would pay for seeks up
front.
116 | Chapter 6: The Cassandra Architecture
On reads, Cassandra will read both SSTables and memtables to find data values, as
the memtable may contain values that have not yet been flushed to disk. Memtables
are implemented by the org.apache.cassandra.db.Memtable class.
Caching
As we saw in Figure 6-4, Cassandra provides three forms of caching:
The key cache stores a map of partition keys to row index entries, facilitating
faster read access into SSTables stored on disk. The key cache is stored on the
JVM heap.
The row cache caches entire rows and can greatly speed up read access for fre‐
quently accessed rows, at the cost of more memory usage. The row cache is
stored in off-heap memory.
The counter cache was added in the 2.1 release to improve counter performance
by reducing lock contention for the most frequently accessed counters.
By default, key and counter caching are enabled, while row caching is disabled, as it
requires more memory. Cassandra saves its caches to disk periodically in order to
warm them up more quickly on a node restart. We’ll investigate how to tune these
caches in Chapter 12.
Hinted Hando
Consider the following scenario: a write request is sent to Cassandra, but a replica
node where the write properly belongs is not available due to network partition,
hardware failure, or some other reason. In order to ensure general availability of the
ring in such a situation, Cassandra implements a feature called hinted hando. You
might think of a hint as a little Post-it note that contains the information from the
write request. If the replica node where the write belongs has failed, the coordinator
will create a hint, which is a small reminder that says, “I have the write information
that is intended for node B. I’m going to hang onto this write, and I’ll notice when
node B comes back online; when it does, I’ll send it the write request.” That is, once it
detects via gossip that node B is back online, node A will “hand off” to node B the
“hint” regarding the write. Cassandra holds a separate hint for each partition that is to
be written.
This allows Cassandra to be always available for writes, and generally enables a clus‐
ter to sustain the same write load even when some of the nodes are down. It also
reduces the time that a failed node will be inconsistent after it does come back online.
In general, hints do not count as writes for the purposes of consistency level. The
exception is the consistency level ANY, which was added in 0.6. This consistency level
means that a hinted handoff alone will count as sufficient toward the success of a
Caching | 117
write operation. That is, even if only a hint was able to be recorded, the write still
counts as successful. Note that the write is considered durable, but the data may not
be readable until the hint is delivered to the target replica.
Hinted Hando and Guaranteed Delivery
Hinted handoff is used in Amazons Dynamo and is familiar to
those who are aware of the concept of guaranteed delivery in mes‐
saging systems such as the Java Message Service (JMS). In a durable
guaranteed-delivery JMS queue, if a message cannot be delivered to
a receiver, JMS will wait for a given interval and then resend the
request until the message is received.
There is a practical problem with hinted handoffs (and guaranteed delivery
approaches, for that matter): if a node is offline for some time, the hints can build up
considerably on other nodes. Then, when the other nodes notice that the failed node
has come back online, they tend to flood that node with requests, just at the moment
it is most vulnerable (when it is struggling to come back into play after a failure). To
address this problem, Cassandra limits the storage of hints to a configurable time
window. It is also possible to disable hinted handoff entirely.
As its name suggests, org.apache.cassandra.db.HintedHandOffManager is the class
that manages hinted handoffs internally.
Although hinted handoff helps increase Cassandras availability, it does not fully
replace the need for manual repair to ensure consistency.
Lightweight Transactions and Paxos
As we discussed in Chapter 2, Cassandra provides tuneable consistency, including the
ability to achieve strong consistency by specifying sufficiently high consistency levels.
However, strong consistency is not enough to prevent race conditions in cases where
clients need to read, then write data.
To help explain this with an example, let’s revisit our my_keyspace.user table from
Chapter 5. Imagine we are building a client that wants to manage user records as part
of an account management application. In creating a new user account, wed like to
make sure that the user record doesn’t already exist, lest we unintentionally overwrite
existing user data. So we do a read to see if the record exists first, and then only per‐
form the create if the record doesn’t exist.
The behavior were looking for is called linearizable consistency, meaning that wed
like to guarantee that no other client can come in between our read and write queries
with their own modification. Since the 2.0 release, Cassandra supports a lightweight
transaction (or “LWT”) mechanism that provides linearizable consistency.
118 | Chapter 6: The Cassandra Architecture
Cassandras LWT implementation is based on Paxos. Paxos is a consensus algorithm
that allows distributed peer nodes to agree on a proposal, without requiring a master
to coordinate a transaction. Paxos and other consensus algorithms emerged as alter‐
natives to traditional two-phase commit based approaches to distributed transactions
(reference the note on Two-Phase Commit in The Problem with Two-Phase Com‐
mit).
The basic Paxos algorithm consists of two stages: prepare/promise, and propose/
accept. To modify data, a coordinator node can propose a new value to the replica
nodes, taking on the role of leader. Other nodes may act as leaders simultaneously for
other modifications. Each replica node checks the proposal, and if the proposal is the
latest it has seen, it promises to not accept proposals associated with any prior pro‐
posals. Each replica node also returns the last proposal it received that is still in pro‐
gress. If the proposal is approved by a majority of replicas, the leader commits the
proposal, but with the caveat that it must first commit any in-progress proposals that
preceded its own proposal.
The Cassandra implementation extends the basic Paxos algorithm in order to support
the desired read-before-write semantics (also known as “check-and-set”), and to
allow the state to be reset between transactions. It does this by inserting two addi‐
tional phases into the algorithm, so that it works as follows:
1. Prepare/Promise
2. Read/Results
3. Propose/Accept
4. Commit/Ack
Thus, a successful transaction requires four round-trips between the coordinator
node and replicas. This is more expensive than a regular write, which is why you
should think carefully about your use case before using LWTs.
More on Paxos
Several papers have been written about the Paxos protocol. One of
the best explanations available is Leslie Lamports “Paxos Made
Simple.
Cassandras lightweight transactions are limited to a single partition. Internally, Cas‐
sandra stores a Paxos state for each partition. This ensures that transactions on differ‐
ent partitions cannot interfere with each other.
You can find Cassandras implementation of the Paxos algorithm in the package
org.apache.cassandra.service.paxos. These classes are leveraged by the Storage
Service, which we will learn about soon.
Lightweight Transactions and Paxos | 119
Tombstones
In the relational world, you might be accustomed to the idea of a “soft delete.” Instead
of actually executing a delete SQL statement, the application will issue an update
statement that changes a value in a column called something like “deleted.” Program‐
mers sometimes do this to support audit trails, for example.
Theres a similar concept in Cassandra called a tombstone. This is how all deletes work
and is therefore automatically handled for you. When you execute a delete operation,
the data is not immediately deleted. Instead, it’s treated as an update operation that
places a tombstone on the value. A tombstone is a deletion marker that is required to
suppress older data in SSTables until compaction can run.
Theres a related setting called Garbage Collection Grace Seconds. This is the amount
of time that the server will wait to garbage-collect a tombstone. By default, its set to
864,000 seconds, the equivalent of 10 days. Cassandra keeps track of tombstone age,
and once a tombstone is older than GCGraceSeconds, it will be garbage-collected. The
purpose of this delay is to give a node that is unavailable time to recover; if a node is
down longer than this value, then it is treated as failed and replaced.
Bloom Filters
Bloom filters are used to boost the performance of reads. They are named for their
inventor, Burton Bloom. Bloom filters are very fast, non-deterministic algorithms for
testing whether an element is a member of a set. They are non-deterministic because
it is possible to get a false-positive read from a Bloom filter, but not a false-negative.
Bloom filters work by mapping the values in a data set into a bit array and condens‐
ing a larger data set into a digest string using a hash function. The digest, by defini‐
tion, uses a much smaller amount of memory than the original data would. The filters
are stored in memory and are used to im