The Definitive Guide to MongoDB

A complete guide to dealing with Big Data using MongoDB

Third Edition

David Hows

Peter Membrey

Eelco Plugge

Tim Hawkins

BOOKS FOR PROFESSIONALS BY PROFESSIONALS® THE EXPERT’S VOICE® IN OPEN SOURCE

The Definitive Guide to MongoDB, Third Edition, is updated for MongoDB 3 and includes all of

the latest MongoDB features, including the aggregation framework introduced in version 2.2,

the hashed indexes introduced in version 2.4, and WiredTiger from 3.2. The Third Edition also

now includes Node.js along with Python.

MongoDB is the most popular of the “Big Data” NoSQL database technologies, and it’s still

growing. David Hows from 10gen, along with experienced MongoDB authors

Peter Membrey and Eelco Plugge, provide their expertise and experience in teaching you

everything you need to know to become a MongoDB pro.

• Set up MongoDB on all major server platforms, including Windows, Linux,

OS X, and cloud platforms like Rackspace, Azure, and Amazon EC2

• Work with GridFS and the new aggregation framework

• Work with your data using non-SQL commands

• Write applications using either Node.js or Python

• Optimize MongoDB

• Master MongoDB administration, including replication, replication tagging,

and tag-aware sharding

ISBN 978-1-4842-1183-0

Shelve in:

Databases/General

User level:

Beginning–Advanced

The Definitive Guide to MongoDB: A complete guide to dealing with Big Data using MongoDB

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the

material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,

broadcasting, reproduction on microfilms or in any other physical way, and transmission or information

storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now

known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with

reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed

on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or

parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its

current version, and permission for use must always be obtained from Springer. Permissions for use may be

obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under

the respective Copyright Law.

ISBN-13 (pbk): 978-1-4842-1183-0

ISBN-13 (electronic): 978-1-4842-1182-3

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with

every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an

editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are

not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to

proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication,

neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or

omissions that may be made. The publisher makes no warranty, express or implied, with respect to the

material contained herein.

Managing Director: Welmoed Spahr

Lead Editor: Michelle Lowman

Technical Reviewer: Stephen Steneker

Editorial Board: Steve Anglin, Louise Corrigan, Jonathan Gennick, Robert Hutchinson,

Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper,

Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing

Coordinating Editor: Mark Powers

Copy Editor: Mary Bearden

Compositor: SPi Global

Indexer: SPi Global

Artist: SPi Global

Distributed to the book trade worldwide by Springer Science+Business Media New York,

233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail

orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC

and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM

Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use.

eBook versions and licenses are also available for most titles. For more information, reference our Special

Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this text is available to readers

at www.apress.com/9781484211830. For detailed information about how to locate your book’s source

code, go to www.apress.com/source-code/. Readers can also access source code at SpringerLink in the

Supplementary Material section for each chapter.

For Dr. Rocky Chang, for going the extra mile and always being there when I need him.

I hope one day I can properly thank him for his support.

—Peter Membrey

To my uncle, Luut, who introduced me to the vast and

ever-challenging world of IT. Thank you.

—Eelco Plugge

Contents at a Glance

About the Authors

About the Technical Reviewer

About the Contributor

Acknowledgments

Introduction

■Chapter 1: Introduction to MongoDB

■Chapter 2: Installing MongoDB

■Chapter 3: The Data Model

■Chapter 4: Working with Data

■Chapter 5: GridFS

■Chapter 6: PHP and MongoDB

■Chapter 7: Python and MongoDB

■Chapter 8: Advanced Queries

■Chapter 9: Database Administration

■Chapter 10: Optimization

■Chapter 11: Replication

■Chapter 12: Sharding

Index

Contents

About the Authors

About the Technical Reviewer

About the Contributor

Acknowledgments

Introduction

■Chapter 1: Introduction to MongoDB

Reviewing the MongoDB Philosophy

Using the Right Tool for the Right Job

Lacking Innate Support for Transactions

JSON and MongoDB

Adopting a Nonrelational Approach

Opting for Performance vs. Features

Running the Database Anywhere

Fitting Everything Together

Generating or Creating a Key

Using Keys and Values

Implementing Collections

Understanding Databases

Reviewing the Feature List

WiredTiger

Using Document-Oriented Storage (BSON)

Supporting Dynamic Queries

Indexing Your Documents

Leveraging Geospatial Indexes

Profiling Queries

Updating Information In Place (Memory Mapped Database Only)

Storing Binary Data

Replicating Data

Implementing Sharding

Using Map and Reduce Functions

The Aggregation Framework

Getting Help

Visiting the Website

Cutting and Pasting MongoDB Code

Finding Solutions on Google Groups

Finding Solutions on Stack Overflow

Leveraging the JIRA Tracking System

Chatting with the MongoDB Developers

Summary

■Chapter 2: Installing MongoDB

Choosing Your Version

Understanding the Version Numbers

Installing MongoDB on Your System

Installing MongoDB under Linux

Installing MongoDB under Windows

Running MongoDB

Prerequisites

Surveying the Installation Layout

Using the MongoDB Shell

Installing Additional Drivers

Installing the PHP Driver

Confirming That Your PHP Installation Works

Installing the Python Driver

Confirming That Your PyMongo Installation Works

Summary

■Chapter 3: The Data Model

Designing the Database

Drilling Down on Collections

Using Documents

Creating the _id Field

Building Indexes

Impacting Performance with Indexes

Implementing Geospatial Indexing

Querying Geospatial Information

Pluggable Storage Engines

Using MongoDB in the Real World

Summary

■Chapter 4: Working with Data

Navigating Your Databases

Viewing Available Databases and Collections

Inserting Data into Collections

Querying for Data

Using the Dot Notation

Using the Sort, Limit, and Skip Functions

Working with Capped Collections, Natural Order, and $natural

Retrieving a Single Document

Using the Aggregation Commands

Working with Conditional Operators

Leveraging Regular Expressions

Updating Data

Updating with update()

Implementing an Upsert with the save() Command

Updating Information Automatically

Removing Elements from an Array

Specifying the Position of a Matched Array

Atomic Operations

Modifying and Returning a Document Atomically

Processing Data in Bulk

Executing Bulk Operations

Evaluating the Output

Renaming a Collection

Deleting Data

Referencing a Database

Referencing Data Manually

Referencing Data with DBRef

Implementing Index-Related Functions

Surveying Index-Related Commands

Summary

■Chapter 5: GridFS

Filling in Some Background

Working with GridFS

Getting Started with the Command-Line Tools

Using the _id Key

Working with Filenames

The File’s Length

Working with Chunk Sizes

Tracking the Upload Date

Hashing Your Files

Looking Under MongoDB’s Hood

Using the search Command

Deleting

Retrieving Files from MongoDB

Summing Up mongofiles

Exploiting the Power of Python

Connecting to the Database

Accessing the Words

Putting Files into MongoDB

Retrieving Files from GridFS

Deleting Files

Summary

■Chapter 6: PHP and MongoDB

Comparing Documents in MongoDB and PHP

MongoDB Classes

Connecting and Disconnecting

Inserting Data

Listing Your Data

Returning a Single Document

Listing All Documents

Using Query Operators

Querying for Specific Information

Sorting, Limiting, and Skipping Items

Counting the Number of Matching Results

Grouping Data with the Aggregation Framework

Specifying the Index with Hint

Refining Queries with Conditional Operators

Determining Whether a Field Has a Value

Regular Expressions

Modifying Data with PHP

Updating via update()

Saving Time with Update Operators

Upserting Data with save()

Modifying a Document Atomically

Processing Data in Bulk

Executing Bulk Operations

Evaluating the Output

Deleting Data

DBRef

Retrieving the Information

GridFS and the PHP Driver

Storing Files

Adding More Metadata to Stored Files

Retrieving Files

Deleting Data

Summary

■Chapter 7: Python and MongoDB

Working with Documents in Python

Using PyMongo Modules

Connecting and Disconnecting

Inserting Data

Finding Your Data

Finding a Single Document

Finding Multiple Documents

Using Dot Notation

Returning Fields

Simplifying Queries with sort(), limit(), and skip()

Aggregating Queries

Specifying an Index with hint()

Refining Queries with Conditional Operators

Conducting Searches with Regular Expressions

Modifying the Data

Updating Your Data

Modifier Operators

Replacing Documents with replace_one()

Modifying a Document Atomically

Putting the Parameters to Work

Processing Data in Bulk

Executing Bulk Operations

Deleting Data

Creating a Link Between Two Documents

Retrieving the Information

Summary

■Chapter 8: Advanced Queries

Text Search

Text Search Costs and Limitations

Using Text Search

Text Indexes in Other Languages

Compound Indexing with Text Indexes

The Aggregation Framework

Using the $group Command

Using the $limit Operator

Using the $match Operator

Using the $sort Operator

Using the $unwind Operator

Using the $skip Operator

Using the $out Operator

Using the $lookup Operator

MapReduce

How MapReduce Works

Setting Up Testing Documents

Working with Map Functions

Advanced MapReduce

Debugging MapReduce

Summary

■Chapter 9: Database Administration

Using Administrative Tools

mongo, the MongoDB Console

Using Third-Party Administration Tools

Backing Up the MongoDB Server

Creating a Backup 101

Backing Up a Single Database

Backing Up a Single Collection

Digging Deeper into Backups

Restoring Individual Databases or Collections

Restoring a Single Database

Restoring a Single Collection

Automating Backups

Using a Local Datastore

Using a Remote (Cloud-Based) Datastore

Backing Up Large Databases

Using a Hidden Secondary Server for Backups

Creating Snapshots with a Journaling Filesystem

Disk Layout to Use with Volume Managers

Importing Data into MongoDB

Exporting Data from MongoDB

Securing Your Data by Restricting Access to a MongoDB Server

Protecting Your Server with Authentication

Adding an Admin User

Enabling Authentication

Authenticating in the mongo Console

MongoDB User Roles

Changing a User’s Credentials

Adding a Read-Only User

Deleting a User

Using Authenticated Connections in a PHP Application

Managing Servers

Starting a Server

Getting the Server’s Version

Getting the Server’s Status

Shutting Down a Server

Using MongoDB Log Files

Validating and Repairing Your Data

Repairing a Server

Validating a Single Collection

Repairing Collection Validation Faults

Repairing a Collection’s Data Files

Compacting a Collection’s Data Files

Upgrading MongoDB

Rolling Upgrade of MongoDB

Monitoring MongoDB

Using MongoDB Cloud Manager

Summary

■Chapter 10: Optimization

Optimizing Your Server Hardware for Performance

Understanding MongoDB’s Storage Engines

Understanding MongoDB Memory Use Under MMAPv1

Understanding Working Set Size in MMAPv1

Understanding MongoDB Memory Use Under WiredTiger

Compression in WiredTiger

Choosing the Right Database Server Hardware

Evaluating Query Performance

The MongoDB Profiler

Analyzing a Specific Query with explain()

Using the Profiler and explain() to Optimize a Query

Managing Indexes

Listing Indexes

Creating a Simple Index

Creating a Compound Index

Three-Step Compound Indexes By A. Jesse Jiryu Davis

The Setup

Range Query

Equality Plus Range Query

Digression: How MongoDB Chooses an Index

Equality, Range Query, and Sort

Final Method

Specifying Index Options

Creating an Index in the Background with {background:true}

Creating an Index with a Unique Key {unique:true}

Creating Sparse Indexes with {sparse:true}

Creating Partial Indexes

TTL Indexes

Text Search Indexes

Dropping an Index

Reindexing a Collection

Using hint() to Force Using a Specific Index

Using Index Filters

Optimizing the Storage of Small Objects

Summary

■Chapter 11: Replication

Spelling Out MongoDB’s Replication Goals

Improving Scalability

Improving Durability/Reliability

Providing Isolation

Replication Fundamentals

What Is a Primary?

What Is a Secondary?

What Is an Arbiter?

Drilling Down on the Oplog

Implementing a Replica Set

Creating a Replica Set

Getting a Replica Set Member Up and Running

Adding a Server to a Replica Set

Adding an Arbiter

Replica Set Chaining

Managing Replica Sets

Configuring the Options for Replica Set Members

Connecting to a Replica Set from Your Application

Read Concern

Summary

■Chapter 12: Sharding

Exploring the Need for Sharding

Partitioning Horizontal and Vertical Data

Partitioning Data Vertically

Partitioning Data Horizontally

Analyzing a Simple Sharding Scenario

Implementing Sharding with MongoDB

Setting Up a Sharding Configuration

Determining How You’re Connected

Listing the Status of a Sharded Cluster

Using Replica Sets to Implement Shards

The Balancer

Hashed Shard Keys

Tag Sharding

Adding More Config Servers

Summary

Index

About the Authors

David Hows is an Honors graduate from the University of Wollongong

in NSW, Australia. He got his start in computing trying to drive more

performance out of his family PC without spending a fortune. This led

to a career in IT, where David has worked as a Systems Administrator,

Performance Engineer, Software Developer, Solutions Architect, and

Database Engineer. David has tried in vain for many years to play soccer well,

and his coffee mug reads “Grumble Bum.”

Peter Membrey is a Chartered IT Fellow with over 15 years of experience

using Linux and Open Source solutions to solve problems in the real

world. An RHCE since the age of 17, he has also had the honor of working

for Red Hat and writing several books covering Open Source solutions.

He holds a master's degree in IT (Information Security) from the

University of Liverpool and is currently an EngD candidate at the Hong

Kong Polytechnic University, where his research interests include time

synchronization, cloud computing, big data, and security. He lives in

Hong Kong with his wonderful wife Sarah and son Kaydyn.

Eelco Plugge is a techie who works and lives in the Netherlands. He currently

works as an engineer in the mobile device management industry, where he

spends most of his time analyzing logs, configs, and errors. He previously

worked as a data encryption specialist at McAfee and held a handful of

IT/system engineering jobs. Eelco is the author of various books on

MongoDB and load balancing, a skilled troubleshooter, and holds a

casual interest in IT security-related subjects, complementing his

MSc in IT Security.

Eelco is a father of two, and any leisure time left is spent behind the

screen or sporadically reading a book. Interested in science and nature’s

oddities, currency trading (FX), programming, security and sushi.

Tim Hawkins produced one of the world’s first online classifieds portals in 1993, loot.com, before moving on

to run engineering for many of Yahoo EU’s non-media-based properties, such as search, local search, mail,

messenger, and its social networking products. He is currently managing a large offshore team for a major

US eTailer, developing and deploying next-gen eCommerce applications. Loves hats, hates complexity.

About the Technical Reviewer

Stephen Steneker (aka Stennie) is an experienced full stack software

developer, consultant, and instructor. Stephen has a long history working

for Australian technology startups including founding technical roles at

Yahoo! Australia & NZ, HomeScreen Entertainment, and Grox. He holds a

BSc (Computer Science) from the University of British Columbia.

In his current role as a Technical Services Engineer for MongoDB,

Inc., Stephen provides support, consulting, and training for MongoDB. He

frequently speaks at user groups and conferences, and is the founder and

wrangler for the Sydney MongoDB User Group (http://www.meetup.com/SydneyMUG/).

You can find him on Twitter, Stack Overflow, or GitHub as @stennie.

About the Contributor

A. Jesse Jiryu Davis is a Staff Engineer at MongoDB in New York City,

specializing in C, Python, and asynchronous I/O. He is the lead developer

of the MongoDB C Driver, author of Motor, and a contributor to Python,

PyMongo, and Tornado. He is the co-author with Guido van Rossum of the

chapter “A Web Crawler With asyncio Coroutines” in 500 Lines or Less, the

fourth book in the Architecture of Open Source Applications series.

Acknowledgments

My thanks to all members of the MongoDB team, past and present. Without them we would not be here, and

the way people think about the storage of data would be radically different. I would like to pay extra special

thanks to my colleagues at the MongoDB team in Sydney, as without them I would not be here today.

—David Hows

Writing a book is always a team effort. Even when there is just a single author, there are many people

working behind the scenes to pull everything together. With that in mind I want to thank everyone in the

MongoDB community and everyone at Apress for all their hard work, patience, and support. Thanks go to

Dave and Eelco for really driving the Third Edition home.

I’d also like to thank Dou Yi, a PhD student also at the Hong Kong Polytechnic University (who is

focusing on security and cryptographic based research), for helping to keep me sane and (patiently)

explaining mathematical concepts that I really should have grasped a long time ago. She has saved me hours

of banging my head against a very robust brick wall.

Special thanks go to Dr. Rocky Chang for agreeing to supervise my EngD studies and for introducing

me to the world of Internet Measurement (which includes time synchronization). His continued support,

patience and understanding are greatly appreciated.

—Peter Membrey

To the 9gag community, without whom this book would have been finished months ago.

—Eelco Plugge

I would like to acknowledge the members of the mongodb-user and mongodb-dev mail lists for putting up

with my endless questions.

—Tim Hawkins

Introduction

I am a relative latecomer to the world of databases, starting with MySQL in 2006. This followed the logical

course for any computer science undergraduate, leading me to develop on a full LAMP stack backed

by rudimentary tables. At the time I thought little about the complexities of what went into SQL table

management. However, as time has gone on, I have seen the need to store more and more heterogeneous

data and how a simple schema can grow and morph over time as life takes its toll on systems.

My first introduction to MongoDB was in 2011, when Peter Membrey suggested that instead of a 0 context

table of 30 key and 30 value rows, I simply use a MongoDB instance to store data. And like all developers faced

with a new technology I scoffed and did what I had originally planned. It wasn’t until I was halfway through

writing the code to use my horrible monstrosity that Peter insisted I try MongoDB, and I haven’t looked back

since. Like all newcomers from SQL-land, I was awed by the ability of this system to simply accept whatever

data I threw at it and then return it based on whatever criteria I asked. I am still hooked.

Our Approach

And now, in this book, Peter, Eelco Plugge, Tim Hawkins, and I have the goal of presenting you with the same

experiences we had in learning the product: teaching you how you can put MongoDB to use for yourself,

while keeping things simple and clear. Each chapter presents an individual sample database, so you can read

the book in a modular or linear fashion; it’s entirely your choice. This means you can skip a certain chapter if

you like, without breaking your example databases.

Throughout the book, you will find example commands followed by their output. Both appear in a

fixed-width “code” font, with the commands also in boldface to distinguish them from the resulting output.

In most chapters, you will also come across tips, warnings, and notes that contain useful, and sometimes

vital, information.

—David Hows

Chapter 1

Introduction to MongoDB

Imagine a world where using a database is so simple that you soon forget you’re even using it. Imagine a

world where speed and scalability just work, and there’s no need for complicated configuration or setup.

Imagine being able to focus only on the task at hand, get things done, and then—just for a change—leave

work on time. That might sound a bit fanciful, but MongoDB promises to help you accomplish all these

things (and more).

MongoDB (derived from the word humongous) is a relatively new breed of database that has no concept

of tables, schemas, SQL, or rows. It doesn’t have transactions, ACID compliance, joins, foreign keys, or many

of the other features that tend to cause headaches in the early hours of the morning. In short, MongoDB

is a very different database than you’re probably used to, especially if you’ve used a relational database

management system (RDBMS) in the past. In fact, you might even be shaking your head in wonder at the

lack of so-called “standard” features.

Fear not! In the following pages, you will learn about MongoDB’s background and guiding principles

and why the MongoDB team made the design decisions it did. We’ll also take a whistle-stop tour of

MongoDB’s feature list, providing just enough detail to ensure that you’ll be completely hooked on this topic

for the rest of the book.

We’ll start by looking at the philosophy and ideas behind the creation of MongoDB, as well as some

of the interesting and somewhat controversial design decisions. We’ll explore the concept of

document-oriented databases, how they fit together, and what their strengths and weaknesses are. We’ll also explore

JavaScript Object Notation and examine how it applies to MongoDB. To wrap things up, we’ll step through

some of the notable features of MongoDB.

Reviewing the MongoDB Philosophy

Like all projects, MongoDB has a set of design philosophies that help guide its development. In this section,

we’ll review some of the database’s founding principles.

Using the Right Tool for the Right Job

The most important of the philosophies that underpin MongoDB is the notion that one size does not fit all.

For many years, traditional relational (SQL) databases have been used for storing content of all types

(MongoDB, by contrast, is a document-oriented database). It didn’t matter whether the data were a good fit for the

relational model (which is used in all RDBMS products, such as MySQL, PostgreSQL, SQLite, Oracle, MS SQL Server,

and so on); the data were stuffed in there anyway. Part of the reason for this is that, generally speaking,

it’s much easier (and more secure) to read and write to a database than it is to write to a file system. If you

pick up any book that teaches PHP, such as PHP for Absolute Beginners 2nd edition, by Jason Lengstorf and

Thomas Blom Hansen (Apress, 2014), you’ll probably discover almost right away that the database is used

to store information, not the file system. It’s just so much easier to do things that way. And while using a

database as a storage bin works, developers always have to work against the flow. It’s usually obvious when

we’re not using the database the way it was intended; anyone who has ever tried to store information with

even slightly complex data and had to set up several tables and then try to pull them all together knows what

we’re talking about!

The MongoDB team decided that it wasn’t going to create another database that tries to do everything

for everyone. Instead, the team wanted to create a database that worked with documents rather than rows

and that was blindingly fast, massively scalable, and easy to use. To do this, the team had to leave some

features behind, which means that MongoDB is not an ideal candidate for certain situations. For example,

its lack of transaction support means that you wouldn’t want to use MongoDB to write an accounting

application. That said, MongoDB might be perfect for part of the aforementioned application (such as

storing complex data). That’s not a problem, though, because there is no reason why you can’t use a

traditional RDBMS for the accounting components and MongoDB for the document storage. Such hybrid

solutions are quite common, and you can see them in production apps such as the one used for the New

York Times website.

Once you’re comfortable with the idea that MongoDB may not solve all your problems, you will

discover that there are certain problems that MongoDB is a perfect fit for resolving, such as analytics (think

a real-time Google Analytics for your website) and complex data structures (for example, blog posts and

comments). If you’re still not convinced that MongoDB is a serious database tool, feel free to skip ahead to

the “Reviewing the Feature List” section, where you will find an impressive list of features for MongoDB.
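To make the "complex data structures" point concrete, here is a minimal sketch of how a blog post and its comments might be held together as a single document. The field names (title, author, tags, comments) are illustrative assumptions rather than a schema taken from this book, and the example uses a plain Python dictionary, so no MongoDB server is needed to follow it.

```python
import json

# Hypothetical blog post: the field names are assumptions for illustration.
post = {
    "title": "Hello, MongoDB",
    "author": "alice",
    "tags": ["intro", "nosql"],
    # Comments are embedded in the post instead of living in a separate table.
    "comments": [
        {"user": "bob", "text": "Nice post!"},
        {"user": "carol", "text": "Looking forward to more."},
    ],
}

# The whole post, comments included, travels as one unit.
print(json.dumps(post, indent=2))

# Reading the nested data is ordinary traversal; no join is required.
commenters = [comment["user"] for comment in post["comments"]]
print(commenters)
```

In a relational design, the same data would typically be split across a posts table and a comments table and stitched back together at query time.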

■Note The lack of transactions and other traditional database features doesn’t mean that MongoDB is

unstable or that it cannot be used for managing important data.

Another key concept behind MongoDB’s design is that there should always be more than one copy of

the database. If a single database should fail, then it can simply be restored from the other servers. Because

MongoDB aims to be as fast as possible, it takes some shortcuts that make it more difficult to recover from

a crash. The developers believe that most serious crashes are likely to remove an entire computer from

service anyway; this means that even if the database were perfectly restored, it would still not be usable.

Remember: MongoDB does not try to be everything to everyone. But for many purposes (such as building a

web application), MongoDB can be an awesome tool for implementing your solution.

So now you know where MongoDB is coming from. It’s not trying to be the best at everything, and

it readily acknowledges that it’s not for everyone. However, for those who choose to use it, MongoDB

provides a rich document-oriented database that’s optimized for speed and scalability. It can also run nearly

anywhere you might want to run it. MongoDB’s website includes downloads for Linux, Mac OS, Windows,

and Solaris.

MongoDB succeeds at all these goals, and this is why using MongoDB (at least for us) is somewhat

dream-like. You don’t have to worry about squeezing your data into a table—just put the data together, and

then pass them to MongoDB for handling.

Consider this real-world example. A recent application that co-author Peter Membrey worked on

needed to store a set of eBay search results. There could be any number of results (up to 100 of them), and

he needed an easy way to associate the results with the users in his database. Had Peter been using MySQL,

he would have had to design a table to store the data, write the code to store his results, and then write more

code to piece it all back together again. This is a fairly common scenario and one most developers face on

a regular basis. Normally, we just get on with it; however, for this project, he was using MongoDB, so things

went a bit differently.

CHAPTER 1 ■ INTRODUCTION TO MONGODB

3

Specifically, he added these two lines of code:

request['ebay_results'] = ebay_results_array

collection.save(request)

In this example, request is Peter’s document, ebay_results is the key, and ebay_results_array contains

the results from eBay. The second line saves the changes. When he accesses this document in the future, he

will have the eBay results in exactly the same format as before. He doesn’t need any SQL; he doesn’t need to

perform any conversions; nor does he need to create any new tables or write any special code—MongoDB

just worked. It got out of the way, he finished his work early, and he got to go home on time.

Lacking Innate Support for Transactions

Here’s another important design decision by MongoDB developers: The database does not include

transactional semantics (the element that offers guarantees about data consistency and storage). This

is a solid tradeoff based on MongoDB’s goal of being simple, fast, and scalable. Once you leave those

heavyweight features at the door, it becomes much easier to scale horizontally.

Normally with a traditional RDBMS, you improve performance by buying a bigger, more powerful

machine. This is scaling vertically, but you can only take it so far. With horizontal scaling, rather than having

one big machine, you have lots of less powerful small machines. Historically, clusters of servers like this were

excellent for load-balancing websites, but databases had always been a problem because of internal design

limitations.

You might think this missing support constitutes a deal-breaker; however, many people forget that one

of the most popular table types in MySQL (MyISAM, which was the default before MySQL 5.5) doesn’t support

transactions either. This fact hasn’t stopped MySQL from becoming and remaining the dominant open

source database for well over a decade. As with most choices when developing solutions, using MongoDB is

going to be a matter of personal preference and whether the tradeoffs fit your project.

■Note MongoDB offers durability when used in tandem with at least two data-bearing servers as part of a

three-node cluster. This is the recommended minimum for production deployments. MongoDB also supports

the concept of “write concerns.” This is where a given number of nodes can be made to confirm the write was

successful, giving a stronger guarantee that the data are safely stored.

Single server durability is ensured since version 1.8 of MongoDB with a transaction log. This log is

append only and is flushed to disk every 100 milliseconds.
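To make the idea of a write concern more concrete, here is a minimal sketch in plain Python. It does not talk to MongoDB at all; the Node class and the replicated_write function are hypothetical names invented for this illustration, and the only thing modeled is "the write counts as successful once at least w nodes have confirmed it."

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A stand-in for one data-bearing server (hypothetical, in-memory only)."""
    name: str
    healthy: bool = True
    data: dict = field(default_factory=dict)

    def apply(self, key, value):
        # A real server would journal the write; here we just store it.
        if not self.healthy:
            return False
        self.data[key] = value
        return True


def replicated_write(nodes, key, value, w):
    """Send the write to every node; succeed only if at least w confirm."""
    acks = sum(1 for node in nodes if node.apply(key, value))
    return acks >= w


# Three nodes, one of which is currently down.
nodes = [Node("a"), Node("b"), Node("c", healthy=False)]
majority = len(nodes) // 2 + 1  # 2 of 3

ok_majority = replicated_write(nodes, "user:1", {"name": "Peter"}, w=majority)
ok_all = replicated_write(nodes, "user:1", {"name": "Peter"}, w=len(nodes))
```

With one node down, the majority write is confirmed while the write that demands all three nodes is not. That is the tradeoff a write concern expresses: a higher w gives a stronger guarantee but tolerates fewer failures.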

JSON and MongoDB

JSON (JavaScript Object Notation) is more than a great way to exchange data; it’s also a nice way to store

data. An RDBMS is highly structured, with multiple files (tables) that store the individual pieces. MongoDB,

on the other hand, stores everything together in a single document. MongoDB is like JSON in this way,

and this model provides a rich and expressive way of storing data. Moreover, JSON effectively describes all

the content in a given document, so there is no need to specify the structure of the document in advance.

JSON is effectively schemaless (that is, it doesn’t require a schema), because documents can be updated

individually or changed independently of any other documents. As an added bonus, JSON also provides

excellent performance by keeping all of the related data in one place.

MongoDB doesn’t actually use JSON to store the data; rather, it uses an open data format developed

by the MongoDB team called BSON (pronounced Bee-Son), which is short for binary JSON. For the most

part, using BSON instead of JSON won’t change how you work with your data. BSON makes MongoDB even

faster by making it much easier for a computer to process and search documents. BSON also adds a couple

of features that aren’t available in standard JSON, including a number of extended types for numeric data

(such as int32 and int64) and support for handling binary data. We’ll look at BSON in more depth in “Using

Document-Oriented Storage (BSON),” later in this chapter.

The original specification for JSON, written by Douglas Crockford, is RFC 4627; RFC 7159 later replaced it.

JSON allows complex data structures to be represented in a simple, human-readable text format that is

generally considered to be much easier to read and understand than XML. Like XML, JSON was envisaged

as a way to exchange data between a web client (such as a browser) and web applications. When combined

with the rich way that it can describe objects, its simplicity has made it the exchange format of choice for the

majority of developers.

You might wonder what is meant here by complex data structures. Historically, data were exchanged

using the comma-separated values (CSV) format (indeed, this approach remains very common today). CSV

is a simple text format that separates rows with a new line and fields with a comma. For example, a CSV file

might look like this:

Membrey, Peter, +852 1234 5678

Thielen, Wouter, +81 1234 5678

Someone can look at this information and see quite quickly what information is being communicated.

Or maybe not—is that number in the third column a phone number or a fax number? It might even be the

number for a pager. To avoid this ambiguity, CSV files often begin with a header row that names each

column. The following snippet takes the previous example one step further:

Lastname, Firstname, Phone Number

Membrey, Peter, +852 1234 5678

Thielen, Wouter, +81 1234 5678

Okay, that’s a bit better. But now assume some people in the CSV file have more than one phone

number. You could add another field for an office phone number, but you face a new set of issues if you want

several office phone numbers. And you face yet another set of issues if you also want to incorporate multiple

e-mail addresses. Most people have more than one, and these addresses can’t usually be neatly defined

as either home or work. Suddenly, CSV starts to show its limitations. CSV files are only good for storing

data that are flat and don’t have repeating values. Similarly, it’s not uncommon for several CSV files to be

provided, each with the separate bits of information. These files are then combined (usually in an RDBMS)

to create the whole picture. As an example, a large retail company may receive sales data in the form of CSV

files from each of its stores at the end of each day. These files must be combined before the company can see

how it performed on a given day. This process is not exactly straightforward, and it certainly increases the

chances of a mistake as the number of required files grows.
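The flat nature of CSV described above is easy to see with Python’s standard csv module. The sketch below reuses the sample rows from this section and adds a second phone number to the first row (the extra number is borrowed from the JSON example later in this chapter); everything else is standard-library behavior.

```python
import csv
import io

raw = (
    "Lastname, Firstname, Phone Number\n"
    "Membrey, Peter, +852 1234 5678, +44 1234 565 555\n"
    "Thielen, Wouter, +81 1234 5678\n"
)

# skipinitialspace drops the blank that follows each comma in the sample.
reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
rows = list(reader)

peter = rows[0]
named_phone = peter["Phone Number"]  # the one value the header accounts for
leftovers = peter[None]              # extra fields land under the None key
```

The second phone number has no column to live in, so DictReader parks it in a list under the key None. You would have to add a column, or invent a separator inside the field, to store it properly.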

XML largely solves this problem, but using XML for most things is a bit like using a sledgehammer

to crack a nut: it works, but it feels like overkill. The reason for this is that XML is not only designed for

machines to read (whereas JSON is designed for humans), but it is also highly extensible. Rather than define

a particular data format, XML defines how you define a data format. This can be useful when you need to

exchange complex and highly structured data; however, for simple data exchange, it often results in too

much work. Indeed, this scenario is the source of the phrase “XML hell.”

JSON provides a happy medium. Unlike CSV, it can store structured content; but unlike XML, JSON

makes the content easy to understand and simple to use. Let’s revisit the previous example; however, this

time we use JSON rather than CSV:

{
    "firstname": "Peter",
    "lastname": "Membrey",
    "phone_numbers": [
        "+852 1234 5678",
        "+44 1234 565 555"
    ]
}

In this version of the example, each JSON object (or document) contains all the information needed to

understand it. If you look at phone_numbers, you can see that it contains a list of different numbers. This list

can be as large as you want. You could also be more specific about the type of number being recorded, as in

this example:

{
    "firstname": "Peter",
    "lastname": "Membrey",
    "numbers": [
        {
            "phone": "+852 1234 5678"
        },
        {
            "fax": "+44 1234 565 555"
        }
    ]
}

This version of the example improves on things a bit more. Now you can clearly see what each number

is for. JSON is extremely expressive, and, although it’s quite easy to write JSON from scratch, it is usually

generated automatically in software. For example, Python includes a module called (somewhat predictably)

json that takes existing Python objects and automatically converts them to JSON. Because JSON is

supported and used on so many platforms, it is an ideal choice for exchanging data.
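As a quick illustration of the json module mentioned above, the following sketch converts a Python dictionary to JSON text and back again. The data is the same sample document used in this section; the variable names are ours.

```python
import json

person = {
    "firstname": "Peter",
    "lastname": "Membrey",
    "numbers": [
        {"phone": "+852 1234 5678"},
        {"fax": "+44 1234 565 555"},
    ],
}

text = json.dumps(person, indent=2)  # Python object -> JSON text
restored = json.loads(text)          # JSON text -> Python object
```

The round trip returns a structure equal to the original, lists and nested dictionaries included, which is what makes JSON convenient for exchanging data between programs.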

When you add items such as the list of phone numbers, you are actually creating what is known as

an embedded document. This happens whenever you add complex content such as a list (or array, to use

the term favored in JSON). Generally speaking, there is also a logical distinction. For example, a Person

document might have several Address documents embedded inside it. Similarly, an Invoice document

might have numerous LineItem documents embedded inside it. Of course, the embedded Address

document could also have its own embedded document that contains phone numbers, for example.

Whether you choose to embed a particular document is determined when you decide how to store your

information. This is usually referred to as schema design. It might seem odd to refer to schema design when

MongoDB is considered a schemaless database. However, while MongoDB doesn’t force you to create a

schema or enforce one that you create, you do still need to think about how your data fit together. We’ll look

at this in more depth in Chapter 3.
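To picture the embedding described above, here is a small sketch of a Person document with embedded Address documents, each carrying its own list of phone numbers. The field names (addresses, label, city) and the city values are made up for illustration; only the nesting pattern comes from the text.

```python
person = {
    "firstname": "Peter",
    "lastname": "Membrey",
    "addresses": [
        {
            "label": "home",
            "city": "Hong Kong",
            "phone_numbers": ["+852 1234 5678"],
        },
        {
            "label": "office",
            "city": "London",
            "phone_numbers": ["+44 1234 565 555"],
        },
    ],
}


def all_phone_numbers(doc):
    """Collect the phone numbers from every embedded address document."""
    return [
        number
        for address in doc.get("addresses", [])
        for number in address.get("phone_numbers", [])
    ]


numbers = all_phone_numbers(person)
```

Everything about the person travels in one document, so reading it back needs no joins. Whether that is the right shape for your data is exactly the schema design question raised above.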

Adopting a Nonrelational Approach

Improving performance with a relational database is usually straightforward: you buy a bigger, faster server.

And this works great until you reach the point where there isn’t a bigger server available to buy. At that point,

the only option is to spread out to two servers. This might sound easy, but it is a stumbling block for most

databases. For example, PostgreSQL can’t run a single database on two servers, where both servers can

read and write data (often referred to as an active/active cluster), and MySQL can only do it with a special

add-on package. And although Oracle can do this with its impressive Real Application Clusters (RAC)

architecture, you can expect to take out a mortgage if you want to use that solution—implementing a

RAC-based solution requires multiple servers, shared storage, and several software licenses.

You might wonder why having an active/active cluster on two servers is so difficult. When you query

your database, the database has to find all the relevant data and link them all together. RDBMS solutions

feature many ingenious ways to improve performance, but they all rely on having a complete picture of the

data available. And this is where you hit a wall: this approach simply doesn’t work when half the data are on

another server.

Of course you might have a small database that simply gets lots of requests, so you just need to share

the workload. Unfortunately, here you hit another wall. You need to ensure that data written to the first

server are available to the second server. And you face additional issues if updates are made on two separate

masters simultaneously. For example, you need to determine which update is the correct one. Another

problem you can encounter is if someone queries the second server for information that has just been

written to the first server, but that information hasn’t been updated yet on the second server. When you

consider all these issues, it becomes easy to see why the Oracle solution is so expensive—these problems are

extremely hard to address.

MongoDB solves the active/active cluster problems in a very clever way—it avoids them completely.

Recall that MongoDB stores data in BSON documents, so the data are self-contained. That is, although

similar documents are stored together, individual documents aren’t made up of relationships. This means

that everything you need is all in one place. Because queries in MongoDB look for specific keys and values

in a document, this information can be easily spread across as many servers as you have available. Each

server checks the content it has and returns the result. This effectively allows almost linear scalability and

performance.
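The idea that each server checks its own content and returns its results can be sketched in a few lines of Python. This is only a toy model: the servers are plain lists, the functions matches and query_all are hypothetical, and the document with the first name "Alex" is invented so that two servers have something to return.

```python
def matches(document, criteria):
    """True if every key in criteria is present with an equal value."""
    return all(
        key in document and document[key] == value
        for key, value in criteria.items()
    )


def query_all(servers, criteria):
    """Ask each server to check its own documents, then merge the results."""
    results = []
    for documents in servers:
        results.extend(doc for doc in documents if matches(doc, criteria))
    return results


servers = [
    [{"lastname": "Membrey", "firstname": "Peter"}],
    [
        {"lastname": "Thielen", "firstname": "Wouter"},
        {"lastname": "Membrey", "firstname": "Alex"},
    ],
]

found = query_all(servers, {"lastname": "Membrey"})
```

Because each document is self-contained, no server needs to consult another one to decide whether a document matches. The work divides cleanly, which is why adding servers helps.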

Admittedly, MongoDB does not offer master/master replication, in which two separate servers can

both accept write requests. However, it does have sharding, which allows data to be partitioned across

multiple machines, with each machine responsible for updating different parts of the dataset. The benefit of

a sharded cluster is that additional shards can be added to increase resource capacity in your deployment

without any changes to your application code. Nonsharded database deployments are limited to vertical

scaling: you can add more RAM/CPU/disk, but this can quickly get expensive. Sharded deployments

can also be scaled vertically, but more importantly, they can be scaled horizontally based on capacity

requirements: a sharded cluster can be composed of many affordable commodity servers rather than a

few very expensive ones. Horizontal scaling is a great fit for elastic provisioning with cloud-hosted instances

and containers.
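The partitioning described above can be sketched as a simple routing function. This is a deliberately simplified model under our own assumptions: the split points, the shard_for and insert helpers, and the choice of lastname as the shard key are all hypothetical, and MongoDB’s real machinery for tracking and balancing data is not modeled.

```python
import bisect

# Hypothetical split points on the shard key (here, lastname).
# Keys below "H" live on shard 0, "H" up to (not including) "P" on shard 1,
# and everything from "P" onward on shard 2.
SPLIT_POINTS = ["H", "P"]
shards = [[] for _ in range(len(SPLIT_POINTS) + 1)]


def shard_for(key):
    """Return the index of the shard responsible for this shard-key value."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def insert(document):
    """Route the write to exactly one shard, based on the shard key."""
    shards[shard_for(document["lastname"])].append(document)


for lastname in ["Crockford", "Hows", "Membrey", "Plugge", "Hawkins"]:
    insert({"lastname": lastname})

placement = [[doc["lastname"] for doc in shard] for shard in shards]
```

Each machine is responsible for updating only its own slice of the data, and the calling code never names a shard directly. In this model, adding capacity means adding a split point and a shard, not rewriting the application.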

Opting for Performance vs. Features

Performance is important, but MongoDB also provides a large feature set. We’ve already discussed some

of the features MongoDB doesn’t implement, and you might be somewhat skeptical of the claim that

MongoDB achieves its impressive performance partly by judiciously excising certain features common to

other databases. However, there are analogous database systems available that are extremely fast, but also

extremely limited, such as those that implement a key/value store.

A perfect example is memcached. This application was written to provide high-speed data caching, and

it is mind-numbingly fast. When used to cache website content, it can speed up an application many times

over. This application is used by extremely large websites, such as Facebook and LiveJournal. The catch is

that this application has two significant shortcomings. First, it is a memory-only database. If the power goes

out, then all the data are lost. Second, you can’t actually search for data using memcached; you can only

request specific keys.

These might sound like serious limitations; however, you must remember the problems that

memcached is designed to solve. First and foremost, memcached is a data cache. That is, it’s not supposed

to be a permanent data store, but only a means to provide a caching layer for your existing database. When

you build a dynamic web page, you generally request very specific data (such as the current top ten articles).

This means you can specifically ask memcached for that data—there is no need to perform a search. If the

cache is outdated or empty, you would query your database as normal, build up the data, and then store it in

memcached for future use.
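The lookup pattern just described (check the cache, fall back to the database, then store the result) can be sketched as follows. A plain dictionary stands in for memcached here, and query_top_articles is a hypothetical placeholder for a real database query; no memcached client library is used.

```python
cache = {}                  # stands in for memcached: exact-key lookups only
stats = {"db_queries": 0}   # counts how often we fall through to the database


def query_top_articles():
    """Placeholder for an expensive database query (made-up data)."""
    stats["db_queries"] += 1
    return ["article-%d" % number for number in range(1, 11)]


def get_top_articles():
    """Return the top ten articles, touching the database only on a miss."""
    key = "top_ten_articles"
    if key not in cache:
        cache[key] = query_top_articles()
    return cache[key]


first = get_top_articles()   # cache is empty: queries the database
second = get_top_articles()  # served straight from the cache
```

Note that the cache is only ever asked for one specific key. Nothing in this pattern requires searching, which is why memcached can get away with not offering it.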

Once you accept these limitations, you can see how memcached offers superb performance by

implementing a very limited feature set. This performance, by the way, is unmatched by that of a traditional

database. That said, memcached certainly can’t replace an RDBMS. The important thing to keep in mind is

that it’s not supposed to.

Compared to memcached, MongoDB is itself feature-rich. To be useful, MongoDB must offer a strong

set of features, such as the ability to search for specific documents. It must also be able to store those

documents on disk, so they can survive a reboot. Fortunately, MongoDB provides enough features to be a

strong contender for most web applications and many other types of applications as well.

Like memcached, MongoDB is not a one-size-fits-all database. As is usually the case in computing,

tradeoffs must be made to achieve the intended goals of the application.

Running the Database Anywhere

MongoDB is written in C++, which makes it relatively easy to port or run the application practically

anywhere. Currently, binaries can be downloaded from the MongoDB website for Linux, Mac OS, Windows,

and Solaris. Officially supported Linux packages include Amazon Linux, RHEL, Ubuntu Server LTS, and

SUSE. You can even download the source code and build your own MongoDB, although it is recommended

that you use the provided binaries wherever possible.

■Caution The 32-bit version of MongoDB is limited to databases of 2GB or less. This is because MongoDB

uses memory-mapped files internally to achieve high performance. Anything larger than 2GB on a 32-bit system

would require some fancy footwork that wouldn’t be fast and would also complicate the application’s code.

The official stance on this limitation is that 64-bit environments are easily available; therefore, increasing code

complexity is not a good tradeoff. The 64-bit version for all intents and purposes has no such restriction.

MongoDB’s modest requirements allow it to run on high-powered servers or virtual machines, and

even to power cloud-based applications. By keeping things simple and focusing on speed and efficiency,

MongoDB provides solid performance wherever you choose to deploy it.

Fitting Everything Together

Before we look at MongoDB’s feature list, we need to review a few basic terms. MongoDB doesn’t require

much in the way of specialized knowledge to get started, and many of the terms specific to MongoDB can be

loosely translated to RDBMS equivalents that you are probably already familiar with. Don’t worry, though;

we’ll explain each term fully. Even if you’re not familiar with standard database terminology, you will still be

able to follow along easily.

Generating or Creating a Key

A document represents the unit of storage in MongoDB. In an RDBMS, this would be called a row. However,

documents are much more than rows because they can store complex information such as lists, dictionaries,

and even lists of dictionaries. In contrast to a traditional database, where a row is fixed, a document in

MongoDB can be made up of any number of keys and values (you’ll learn more about this in the next

section). Ultimately, a key is nothing more than a label; it is roughly equivalent to the name you might give to

a column in an RDBMS. You use a key to reference pieces of data inside your document.

In a relational database, there should always be some way to uniquely identify a given record; otherwise

it becomes impossible to refer to a specific row. To that end, you are supposed to include a field that holds a

unique value (called a primary key) or a collection of fields that can uniquely identify the given row (called a

compound primary key).

MongoDB requires that each document have a unique identifier for much the same reason; in

MongoDB, this identifier is called _id. Unless you specify a value for this field, MongoDB will generate

a unique value for you. Even in the well-established world of RDBMS databases, opinion is divided as to

whether you should use a unique key provided by the database or generate a unique key yourself. Recently,

it has become more popular to allow the database to create the key for you. MongoDB is a distributed

database, so one of the main goals is to remove dependencies on shared resources (for example, checking

if a primary key is actually unique). Nondistributed databases often use a simple primary key such as an

auto-incrementing sequence number. MongoDB’s default _id format is an ObjectId, which is a 12-byte unique

identifier that can be generated independently in a distributed environment.

One reason database-generated keys have become popular is that human-created unique numbers such as car registration numbers have

a nasty habit of changing. For example, in 2001, the United Kingdom implemented a new number plate

scheme that was completely different from the previous system. It happens that MongoDB can cope with

this type of change perfectly well; however, chances are that you would need to do some careful thinking if

you used the registration plate as your primary key. A similar scenario may have occurred when the ISBN

(International Standard Book Number) scheme was upgraded from 10 digits to 13.

Previously, most developers who used MongoDB seemed to prefer creating their own unique keys,

taking it upon themselves to ensure that the number would remain unique. Today, though, general

consensus seems to point at using the default ID value that MongoDB creates for you. However, as is the

case when working with RDBMS databases, the approach you choose mostly comes down to personal

preference. We prefer to use a database-provided value because it means we can be sure the key is unique

and independent of anything else.

Ultimately, you must decide what works best for you. If you are confident that your key is unique (and

likely to remain unchanged), then feel free to use it. If you’re unsure about your key’s uniqueness or you

don’t want to worry about it, then you can simply use the default key provided by MongoDB.
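To show how a 12-byte identifier like the ObjectId mentioned above can be generated without consulting any shared resource, here is a rough sketch. It follows the layout documented for recent MongoDB versions (a 4-byte timestamp, 5 random bytes chosen once per process, and a 3-byte counter); older versions used a machine identifier and process ID in place of the random bytes. The function new_object_id is our own illustration, not the code a driver actually runs.

```python
import itertools
import os
import struct
import time

_PROCESS_RANDOM = os.urandom(5)  # chosen once per process
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))


def new_object_id(now=None):
    """Return 12 bytes: 4-byte timestamp + 5 random bytes + 3-byte counter."""
    timestamp = int(time.time() if now is None else now)
    count = next(_counter) % 0x1000000  # wrap around at 3 bytes
    return (
        struct.pack(">I", timestamp)    # big-endian seconds since the epoch
        + _PROCESS_RANDOM
        + count.to_bytes(3, "big")
    )


a = new_object_id()
b = new_object_id()
```

Two identifiers created in the same second by the same process still differ, because the counter advances. Two different processes differ in their random bytes with very high probability, so no server has to ask another whether a value is already taken.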

Using Keys and Values

Documents are made up of keys and values. Let’s take another look at the example discussed previously in

this chapter:

{
    "firstname": "Peter",
    "lastname": "Membrey",
    "phone_numbers": [
        "+852 1234 5678",
        "+44 1234 565 555"
    ]
}

Keys and values always come in pairs. Unlike an RDBMS, where every field must have a value, even

if it’s NULL (somewhat paradoxically, this means unknown), MongoDB does not require every document

to have the same fields, or that every field with the same name has the same type of value. For example,

"phone_numbers" could be a single value in some documents and a list in others. If you don’t know the

phone number for a particular person on your list, you simply leave it out. A popular analogy for this sort of

thing is a business card. If you have a fax number, you usually put it on your business card; however, if you

don’t have one, you don’t write: “Fax number: none.” Instead, you simply leave the information out. If the

key/value pair isn’t included in a MongoDB document, it is assumed not to exist.
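Because a field can be a single value, a list, or missing altogether, application code usually has to cope with all three shapes. The sketch below shows one way to do that in Python; the sample documents and the phone_list helper are hypothetical.

```python
people = [
    # a list of values
    {"firstname": "Peter", "phone_numbers": ["+852 1234 5678", "+44 1234 565 555"]},
    # a single value
    {"firstname": "Wouter", "phone_numbers": "+81 1234 5678"},
    # the key is simply left out
    {"firstname": "Eelco"},
]


def phone_list(document):
    """Normalize the three shapes above to a list of numbers."""
    value = document.get("phone_numbers")
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


normalized = [phone_list(person) for person in people]
```

This is the price of flexibility: the database does not enforce a shape, so the code that reads the documents has to decide what a missing or differently typed field means.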

Implementing Collections

Collections are somewhat analogous to tables, but they are far less rigid. A collection is a lot like a box with

a label on it. You might have a box at home labeled “DVDs” into which you put, well, your DVDs. This

makes sense, but there is nothing stopping you from putting CDs or even cassette tapes into this box if you

wanted to. In an RDBMS, tables are strictly defined, and you can only put designated items into the table.

In MongoDB, a collection is simply that: a collection of similar items. The items don’t have to be similar

(MongoDB is inherently flexible); however, once we start looking at indexing and more advanced queries,

you’ll soon see the benefits of placing similar items in a collection.

While you could mix various items together in a collection, there’s little need to do so. Had the

collection been called media, then all of the DVDs, CDs, and cassette tapes would be at home there. After all,

these items all have things in common, such as an artist name, a release date, and content. In other words, it

really does depend on your application whether certain documents should be stored in the same collection.

Performance-wise, having multiple collections is no slower than having only one collection. Remember:

MongoDB is about making your life easier, so you should do whatever feels right to you.

Last but not least, collections are usually created on demand. Specifically, a collection is created when

you first attempt to save a document that references it. This means that your application can create new

collections at runtime (not that it necessarily should). Because MongoDB also lets you create indexes and perform

other database-level commands dynamically, you can leverage this behavior to build some very dynamic

applications.

Understanding Databases

Perhaps the easiest way to think of a database in MongoDB is as a group of collections. Like collections,

databases can be created on demand. This means that it’s easy to create a database for each

customer—your application code can even do it for you. You can do this with databases other than

MongoDB, as well; however, creating databases in this manner with MongoDB is a very natural process.
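The on-demand behavior of databases and collections can be mimicked with nested dictionaries. This is a toy, in-memory model only; the save helper and the database and collection names are invented for illustration and do not correspond to any driver call.

```python
from collections import defaultdict

# database name -> collection name -> list of documents
server = defaultdict(lambda: defaultdict(list))


def save(db_name, collection_name, document):
    """Store a document; containers come into existence on first write."""
    server[db_name][collection_name].append(document)


# Nothing was declared up front: one database per (hypothetical) customer.
save("customer_acme", "media", {"type": "DVD", "title": "Example Title"})
save("customer_acme", "media", {"type": "CD", "title": "Another Title"})
save("customer_globex", "invoices", {"total": 42})

database_names = sorted(server)
```

No step in this sketch creates a database or a collection explicitly. Both appear as a side effect of saving the first document, which is the behavior the text describes.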

Reviewing the Feature List

Now that you understand what MongoDB is and what it offers, it’s time to run through its feature list. You

can find a complete list of MongoDB’s features on the database’s website at www.mongodb.org/; be sure to

visit this site for an up-to-date list of them. The feature list in this chapter covers a fair bit of material that

goes on behind the scenes, but you don’t need to be familiar with every feature listed to use MongoDB itself.

In other words, if you feel your eyes beginning to close as you review this list, feel free to jump to the end of

the section!

WiredTiger

This is the third release of this book on MongoDB, and there have been some significant changes along the

way. At the forefront of these is the introduction of MongoDB’s pluggable storage API and WiredTiger, a very

high-performance database engine. WiredTiger was an optional storage engine introduced in MongoDB 3.0

and is now the default storage engine as of MongoDB 3.2. The classic MMAPv1 (memory-mapped) storage

engine is still available, but WiredTiger is more efficient and performant for the majority of use cases.

WiredTiger itself can be said to have taken MongoDB to a whole new level, replacing the older MMAP

model of internal data storage and management. WiredTiger allows MongoDB to (among other things)

far better optimize what data reside in memory and what data reside on disk, without some of the messy

overflows that were present before. The upshot of this is that more often than not, WiredTiger represents

a real performance gain for all users. WiredTiger also better optimizes how data are stored on disk and

provides an in-built compression API that makes for massive savings on disk space. It’s safe to say that with

WiredTiger onboard, MongoDB looks to be making another huge move in the database landscape, one of

similar size to that made when MongoDB was first released.

Using Document-Oriented Storage (BSON)

We’ve already discussed MongoDB’s document-oriented design. We’ve also briefly touched on BSON.

As you learned, JSON makes it much easier to store and retrieve documents in their real form, effectively

removing the need for any sort of mapper or special conversion code. The fact that this feature also makes it

much easier for MongoDB to scale up is icing on the cake.

BSON is an open standard; you can find its specification at http://bsonspec.org/. When people

hear that BSON is a binary form of JSON, they expect it to take up much less room than text-based JSON.

However, that isn’t necessarily the case; indeed, there are many cases where the BSON version takes up

more space than its JSON equivalent.
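To see why BSON can be larger than JSON, it helps to encode a tiny document by hand. The sketch below implements a very small subset of the BSON format (flat documents whose values are strings or 32-bit integers) using only Python’s struct module. The encode_bson function is our own illustration, not the encoder a MongoDB driver uses.

```python
import json
import struct


def encode_bson(doc):
    """Encode a flat dict of str -> str or int32 values as BSON (tiny subset)."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"      # keys are null-terminated
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (including the null), then bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # type 0x10: 32-bit little-endian integer
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError("type not supported in this sketch")
    # The total length counts its own 4 bytes and the trailing null byte.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


doc = {"hello": "world"}
as_bson = encode_bson(doc)
as_json = json.dumps(doc).encode("utf-8")
sizes = (len(as_bson), len(as_json))
```

For this document the BSON form is 22 bytes and the JSON text is 18. The extra bytes are the length prefixes and type markers, which are exactly what lets a program skip over fields without parsing them.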

You might wonder why you should use BSON at all. After all, CouchDB (another powerful document-

oriented database) uses pure JSON, and it’s reasonable to wonder whether it’s worth the trouble of

converting documents back and forth between BSON and JSON.

First, you must remember that MongoDB is designed to be fast, rather than space-efficient. This doesn’t

mean that MongoDB wastes space (it doesn’t); however, a small bit of overhead in storing a document is

perfectly acceptable if that makes it faster to process the data (which it does). In short, BSON is much easier

to traverse (that is, to look through) and to index quickly. Although BSON can require slightly more disk space

than JSON, this extra space is unlikely to be a problem, because disks are inexpensive, and MongoDB can

scale across machines. The tradeoff in this case is quite reasonable: you exchange a bit of extra disk space

for better query and indexing performance. The WiredTiger storage engine supports multiple compression

libraries and has index and data compression enabled by default. Compression level can be set at a per-

server default as well as per-collection (on creation). Higher levels of compression will use more CPU when

data are stored but can result in significant disk space savings.

The second key benefit to using BSON is that it is easy and quick to convert BSON to a programming

language’s native data format. If the data were stored in pure JSON, a relatively high-level conversion would

need to take place. There are MongoDB drivers for a large number of programming languages (such as

Python, Ruby, PHP, C, C++, and C#), and each works slightly differently. Using a simple binary format, native

data structures can be quickly built for each language, without requiring that you first process JSON. This

makes the code simpler and faster, both of which are in keeping with MongoDB’s stated goals.

BSON also provides some extensions to JSON. For example, it enables you to store binary data and to

use data types that JSON lacks, such as dates. Thus, while BSON can store any JSON document, a valid BSON document

may not be valid in JSON. This doesn’t matter, because each language has its own driver that converts data

to and from BSON without needing to use JSON as an intermediary language.
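The point about binary data is easy to demonstrate from the JSON side. In the sketch below, Python’s json module refuses to serialize a bytes value, and the usual workaround is to encode the bytes as base64 text first. The sample payload and its field names are made up.

```python
import base64
import json

payload = {"filename": "logo.png", "data": b"\x89PNG\r\n"}

# JSON has no binary type, so the standard encoder rejects bytes outright.
try:
    json.dumps(payload)
    json_accepts_bytes = True
except TypeError:
    json_accepts_bytes = False

# A common workaround: carry the bytes as base64-encoded text.
as_text = dict(payload, data=base64.b64encode(payload["data"]).decode("ascii"))
encoded = json.dumps(as_text)
decoded = base64.b64decode(json.loads(encoded)["data"])
```

A format with a native binary type avoids this detour, because the bytes can be stored as they are.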

At the end of the day, BSON is not likely to be a big factor in how you use MongoDB. Like all great

tools, MongoDB will quietly sit in the background and do what it needs to do. Apart from possibly using a

graphical tool to look at your data, you will generally work in your native language and let the driver worry

about persisting to MongoDB.

Supporting Dynamic Queries

MongoDB’s support for dynamic queries means that you can run a query without planning for it in advance.

This is similar to being able to run SQL queries against an RDBMS. You might wonder why this is listed as a

feature; surely it is something that every database supports—right?

Actually, no. For example, CouchDB (which is generally considered MongoDB’s biggest “competitor”)

doesn’t support dynamic queries. This is because CouchDB has come up with a completely new (and

admittedly exciting) way of thinking about data. A traditional RDBMS has static data and dynamic queries.

This means that the structure of the data is fixed in advance—tables must be defined, and each row has to fit

into that structure. Because the database knows in advance how the data are structured, it can make certain

assumptions and optimizations that enable fast dynamic queries.

CouchDB has turned this on its head. As a document-oriented database, CouchDB is schemaless, so the

data are dynamic. However, the new idea here is that queries are static. That is, you define them in advance,

before you can use them.

This isn’t as bad as it might sound, because many queries can be easily defined in advance. For

example, a system that lets you search for a book will probably let you search by ISBN. In CouchDB, you

would create an index that builds a list of all the ISBNs for all the documents. When you punch in an ISBN,

the query is very fast because it doesn’t actually need to search for any data. Whenever a new piece of data is

added to the system, CouchDB will automatically update its index.

Technically, you can run a query against CouchDB without generating an index; in that case, however,

CouchDB will have to create the index itself before it can process your query. This won’t be a problem if you

only have a hundred books; however, it will result in poor performance if you’re filing hundreds of thousands

of books, because each query will generate the index again (and again). For this reason, the CouchDB team

does not recommend dynamic queries—that is, queries that haven’t been predefined—in production.

CouchDB also lets you write your queries as map and reduce functions. If that sounds like a lot of effort,

then you’re in good company; CouchDB has a somewhat severe learning curve. In fairness to CouchDB, an

experienced programmer can probably pick it up quite quickly; for most people, however, the learning curve

is probably steep enough that they won’t bother with the tool.

Fortunately for us mere mortals, MongoDB is much easier to use. We’ll cover how to use MongoDB in

more detail throughout the book, but here’s the short version: in MongoDB, you simply provide the parts of the

document you want to match against, and MongoDB does the rest. MongoDB can do much more, however. For

example, you won’t find MongoDB lacking if you want to use map or reduce functions. At the same time, you

can ease into using MongoDB; you don’t have to know all of the tool’s advanced features up front.
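
The idea of matching on parts of a document can be sketched in a few lines of Python. This is only an illustration of the concept, not a description of how MongoDB evaluates queries internally; the matches and find helpers and the sample book documents are hypothetical.

```python
# Hypothetical in-memory "collection": a list of schemaless documents.
books = [
    {"title": "Book A", "isbn": "111", "author": "Smith"},
    {"title": "Book B", "isbn": "222", "author": "Jones", "year": 2015},
    {"title": "Book C", "isbn": "333", "author": "Smith", "year": 2015},
]


def matches(document, query):
    """Return True if every key/value pair in query appears in document."""
    return all(key in document and document[key] == value
               for key, value in query.items())


def find(collection, query):
    """Return all documents that match the partial document in query."""
    return [doc for doc in collection if matches(doc, query)]


# No query was defined in advance; any combination of keys can be used.
smith_2015 = find(books, {"author": "Smith", "year": 2015})
print([doc["title"] for doc in smith_2015])
```

Because the query is just a partial document, a new search (say, by isbn alone) needs no predefined view or index to run, which is the sense in which the queries are dynamic.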

Indexing Your Documents

MongoDB includes extensive support for indexing your documents, a feature that really comes in handy

when you’re dealing with tens of thousands of documents. Without an index, MongoDB will have to look at

each individual document in turn to see whether it is something that you want to see. This is like asking a

librarian for a particular book and watching as he works his way around the library looking at each and every

book. With an indexing system (libraries tend to use the Dewey Decimal system), he can find the area where

the book you are looking for lives and very quickly determine if it is there.

Unlike a library book, all documents in MongoDB are automatically indexed on the _id key. This key is

considered a special case because you cannot delete it; the index is what ensures that each value is unique.

One of the benefits of this key is that you can be assured that each document is uniquely identifiable,

something that isn’t guaranteed by an RDBMS.

When you create your own indexes, you can decide whether you want them to enforce uniqueness. By

default, an error will be returned if you try to create a unique index on a key that has duplicate values.

There are many occasions where you will want to create an index that allows duplicates. For example, if

your application searches by last name, it makes sense to build an index on the lastname key. Of course, you

cannot guarantee that each last name will be unique; and in any database of a reasonable size, duplicates are

practically guaranteed.
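
The difference between a unique index and one that allows duplicates can be illustrated with a small Python sketch. It assumes a plain dictionary as the index structure, which is far simpler than what MongoDB actually uses; build_index, DuplicateKeyError, and the sample documents are hypothetical names chosen for this example.

```python
from collections import defaultdict


class DuplicateKeyError(Exception):
    """Raised when a unique index would contain the same value twice."""


def build_index(collection, key, unique=False):
    """Map each value of key to the positions of the documents holding it."""
    index = defaultdict(list)
    for position, document in enumerate(collection):
        if key not in document:
            continue
        value = document[key]
        if unique and index[value]:
            raise DuplicateKeyError(f"duplicate value for {key!r}: {value!r}")
        index[value].append(position)
    return dict(index)


people = [
    {"_id": 1, "firstname": "Ann", "lastname": "Lee"},
    {"_id": 2, "firstname": "Bob", "lastname": "Lee"},
    {"_id": 3, "firstname": "Cy", "lastname": "Ray"},
]

# A non-unique index on lastname tolerates duplicates.
by_lastname = build_index(people, "lastname")
print(by_lastname["Lee"])

# A unique index on the same key is rejected because "Lee" appears twice.
try:
    build_index(people, "lastname", unique=True)
except DuplicateKeyError as error:
    print("index not created:", error)
```

With the index in place, a lookup by last name goes straight to the matching positions instead of inspecting every document in turn.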

MongoDB’s indexing abilities don’t end there, however. MongoDB can also create indexes on

embedded documents. For example, if you store numerous addresses in the address key, you can create an

index on the ZIP or postal code. This means that you can easily pull back a document based on any postal

code—and do so very quickly.

MongoDB takes this a step further by allowing composite indexes. In a composite index, two or more

keys are used to build a given index. For example, you might build an index that combines both the

lastname and firstname keys. A search for a full name would be very quick because MongoDB can quickly

isolate the last name and then, just as quickly, isolate the first name.
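
A composite index can be pictured as a sorted list of (lastname, firstname) entries. The following Python sketch assumes that simplified model and uses a binary search to narrow down first the last name and then the first name; the function names and sample data are hypothetical.

```python
import bisect

people = [
    {"_id": 1, "lastname": "Lee", "firstname": "Bob"},
    {"_id": 2, "lastname": "Ray", "firstname": "Cy"},
    {"_id": 3, "lastname": "Lee", "firstname": "Ann"},
    {"_id": 4, "lastname": "Kim", "firstname": "Dee"},
]

# Index entries are (lastname, firstname, _id) tuples kept in sorted order.
compound_index = sorted(
    (doc["lastname"], doc["firstname"], doc["_id"]) for doc in people
)


def find_by_full_name(index, lastname, firstname):
    """Binary-search the sorted entries for an exact (lastname, firstname)."""
    start = bisect.bisect_left(index, (lastname, firstname))
    ids = []
    for entry in index[start:]:
        if entry[:2] != (lastname, firstname):
            break
        ids.append(entry[2])
    return ids


def find_by_lastname(index, lastname):
    """The same sorted entries can also answer a query on lastname alone."""
    start = bisect.bisect_left(index, (lastname,))
    ids = []
    for entry in index[start:]:
        if entry[0] != lastname:
            break
        ids.append(entry[2])
    return ids


print(find_by_full_name(compound_index, "Lee", "Ann"))
print(find_by_lastname(compound_index, "Lee"))
```

Because the entries are ordered by last name first, the sketch can serve searches on the full name or on the last name alone, but it would not help a search on first name only.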

We will look at indexing in more depth in Chapter 10, but suffice it to say that MongoDB has you

covered as far as indexing is concerned.

Leveraging Geospatial Indexes

One form of indexing worthy of special mention is geospatial indexing. This new, specialized indexing

technique was introduced in MongoDB 1.4. You use this feature to index location-based data, enabling you

to answer queries such as how many items are within a certain distance from a given set of coordinates.

As an increasing number of web applications start making use of location-based data, this feature will

play an increasingly prominent role in everyday development.
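
To make the kind of question concrete, here is a minimal Python sketch that counts how many items lie within a given distance of a point. It assumes coordinates stored as [longitude, latitude] pairs in a hypothetical loc key and uses the haversine formula on a spherical Earth; it scans every item, which is exactly the work a geospatial index is meant to avoid.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_within(places, center, max_km):
    """Count the places whose loc is at most max_km away from center."""
    lon, lat = center
    return sum(
        1 for place in places
        if haversine_km(lon, lat, place["loc"][0], place["loc"][1]) <= max_km
    )


# Hypothetical documents; loc is [longitude, latitude] in degrees.
places = [
    {"name": "cafe", "loc": [0.01, 0.0]},      # roughly 1 km from the origin
    {"name": "museum", "loc": [0.0, 0.05]},    # roughly 5.5 km away
    {"name": "airport", "loc": [1.0, 0.0]},    # roughly 111 km away
]

print(count_within(places, (0.0, 0.0), 10))
```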

Profiling Queries

A built-in profiling tool lets you see how MongoDB works out which documents to return. This is useful

because, in many cases, a query can be easily improved simply by adding an index, the number one cause of

painfully slow queries. If you have a complicated query, and you’re not really sure why it’s running so slowly,

then the query profiler (MongoDB’s query planner explain()) can provide you with extremely valuable

information. Again, you’ll learn more about the MongoDB profiler in Chapter 10.

Updating Information In Place (Memory Mapped Database Only)

When a database updates a row (or in the case of MongoDB, a document), it has a couple of choices about

how to do it. Many databases choose the multiversion concurrency control (MVCC) approach, which allows

multiple users to see different versions of the data. This approach is useful because it ensures that the data

won’t be changed partway through by another program during a given transaction.

The downside to this approach is that the database needs to track multiple copies of the data. For

example, CouchDB provides very strong versioning, but this comes at the cost of writing the data out in its

entirety. While this ensures that the data are stored in a robust fashion, it also increases complexity and

reduces performance.

MongoDB, on the other hand, updates information in place. This means that (in contrast to CouchDB)

MongoDB can update the data wherever it happens to be. This typically means that no extra space needs to

be allocated, and the indexes can be left untouched.

Another benefit of this method is that MongoDB performs lazy writes. Reading from and writing to memory

is very fast, but writing to disk is thousands of times slower. This means that you want to limit reading from

and writing to the disk as much as possible. This isn’t possible in CouchDB, because that program ensures

that each document is quickly written to disk. While this approach guarantees that the data are written safely

to disk, it also impacts performance significantly.

MongoDB only writes to disk when it has to, which is usually once every 100 milliseconds or so. This

means that if a value is being updated many times a second—a not uncommon scenario if you’re using

a value as a page counter or for live statistics—then the value will only be written once, rather than the

thousands of times that CouchDB would require.
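
The effect of lazy writes on a frequently updated counter can be sketched as follows. The LazyCounterStore class is a hypothetical toy model, not MongoDB's storage engine: updates change a value in memory, and only an explicit flush copies the latest value to a dictionary that stands in for the disk.

```python
class LazyCounterStore:
    """Keep updates in memory and write to 'disk' only on flush."""

    def __init__(self):
        self.memory = {}      # latest values, updated in place
        self.disk = {}        # stand-in for the data files
        self.dirty = set()    # keys changed since the last flush
        self.disk_writes = 0  # how many values were written to "disk"

    def increment(self, key):
        self.memory[key] = self.memory.get(key, 0) + 1
        self.dirty.add(key)

    def flush(self):
        for key in self.dirty:
            self.disk[key] = self.memory[key]
            self.disk_writes += 1
        self.dirty.clear()


store = LazyCounterStore()
for _ in range(1000):
    store.increment("page_views")
store.flush()
print(store.disk["page_views"], store.disk_writes)

# The tradeoff: an update made after the flush exists only in memory
# until the next flush, so it would be lost if the process died now.
store.increment("page_views")
print(store.memory["page_views"], store.disk["page_views"])
```

A thousand updates result in a single write, but anything changed since the last flush is not yet safe on disk.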

This approach makes MongoDB much faster, but, again, it comes with a tradeoff. CouchDB may be

slower, but it does guarantee that data are stored safely on the disk. MongoDB makes no such guarantee,

and this is why a traditional RDBMS is probably a better solution for managing critical data such as billing or

accounts receivable.

Storing Binary Data

GridFS is MongoDB’s solution to storing binary data in the database. BSON supports saving up to 16MB of

binary data in a document, and this may well be enough for your needs. For example, if you want to store

a profile picture or a sound clip, then 16MB might be more space than you need. On the other hand, if you

want to store movie clips, high-quality audio clips, or even files that are several hundred megabytes in size,

then MongoDB has you covered here, too.

GridFS works by storing the information about the file (called metadata) in the files collection. The

data themselves are broken down into pieces called chunks that are stored in the chunks collection. This

approach makes storing data both easy and scalable; it also makes range operations (such as retrieving

specific parts of a file) much easier to use.
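
The split between metadata and chunks can be sketched in plain Python. The field names below mirror those GridFS uses, but the helper functions, the file contents, and the tiny chunk size are assumptions made purely for illustration; in practice your driver performs these steps for you.

```python
def split_into_chunks(file_id, filename, data, chunk_size):
    """Return one metadata document plus the list of chunk documents."""
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    chunks = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunks


def read_range(chunks, chunk_size, start, end):
    """Return data[start:end] by touching only the chunks that overlap it."""
    if end <= start:
        return b""
    first = start // chunk_size
    last = (end - 1) // chunk_size
    needed = sorted(
        (chunk for chunk in chunks if first <= chunk["n"] <= last),
        key=lambda chunk: chunk["n"],
    )
    joined = b"".join(chunk["data"] for chunk in needed)
    offset = start - first * chunk_size
    return joined[offset:offset + (end - start)]


payload = b"0123456789"
files_doc, chunks = split_into_chunks(1, "demo.bin", payload, chunk_size=4)
print(files_doc["length"], len(chunks))
print(read_range(chunks, 4, 4, 8))
```

Reading bytes 4 through 7 only needs the second chunk, which is why range operations on a large file do not require loading the whole file.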

Generally speaking, you would use GridFS through your programming language’s MongoDB driver, so

it’s unlikely you’d ever have to get your hands dirty at such a low level. As with everything else in MongoDB,

GridFS is designed for both speed and scalability. This means you can be confident that MongoDB will be up

to the task if you want to work with large data files.

Replicating Data

When we talked about the guiding principles behind MongoDB, we mentioned that RDBMS databases

offer certain guarantees for data storage that are not available in MongoDB. These guarantees weren’t

implemented for a handful of reasons. First, these features would slow the database down. Second, they

would greatly increase the complexity of the program. Third, it was felt that the most common failure on

a server would be hardware, which would render the data unusable anyway, even if the data were safely

saved to disk.

Of course, none of this means that data safety isn’t important. MongoDB wouldn’t be of much use if you

couldn’t count on being able to access the data when you need them. Initially, MongoDB provided a safety

net with a feature called master-slave replication, in which only one database is active for writing at any given

time, an approach that is also fairly common in the RDBMS world. This feature has since been replaced with

replica sets, and basic master-slave replication has been deprecated and should no longer be used.

Replica sets have one primary server (similar to a master), which handles all the write requests from

clients. Because there is only one primary server in a given set, it can guarantee that all writes are handled

properly. When a write occurs, it is logged in the primary’s oplog.

The oplog is replicated by the secondary servers (of which there can be many) and used to bring them

up to date with the current primary. Should the primary fail at any given time, the surviving members of

the replica set will hold an election and one of the secondaries will become the primary and take over

responsibility for handling client write requests. Application drivers will automatically detect any changes to

the replica set configuration or replica set status and reestablish connectivity based on the updated replica

set state. In order for a replica set to maintain a primary, a strict majority of its members must be healthy

and able to connect with one another. For example, a three-node replica set requires two healthy nodes

to maintain a primary.
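
The majority rule is simple arithmetic, sketched here in Python. The function names are hypothetical, and the sketch assumes every member of the set has one vote.

```python
def votes_needed(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def can_maintain_primary(members, reachable):
    """True if the reachable members form a strict majority of the set."""
    return reachable >= votes_needed(members)


for size in (3, 4, 5, 7):
    print(size, "members need", votes_needed(size), "healthy nodes")
```

Note that a four-member set needs three healthy nodes, so it tolerates only one failure, the same as a three-member set.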

Implementing Sharding

For those involved with large-scale deployments, autosharding will probably prove to be one of MongoDB’s

most significant and oft-used features.

In an autosharding scenario, MongoDB takes care of all the data splitting and recombination for you.

It makes sure the data go to the right server and that queries are run and combined in the most efficient

manner possible. In fact, from a developer’s point of view, there is no difference between talking to a

MongoDB database with a hundred shards and talking to a single MongoDB server.

In the meantime, if you’re just starting out or you’re building your first MongoDB-based website, then

you’ll probably find that a single instance of MongoDB is sufficient for your needs (although for a production

environment, we still recommend using a replica set). If you end up building the next Facebook or Amazon,

however, you will be glad that you built your site on a technology that can scale so limitlessly. Sharding is the

topic of Chapter 12 of this book.

Using Map and Reduce Functions

For many people, hearing the term MapReduce sends shivers down their spines. At the other extreme,

many RDBMS advocates scoff at the complexity of map and reduce functions. It’s scary for some because

these functions require a completely different way of thinking about finding and sorting your data, and

many professional programmers have trouble getting their heads around the concepts that underpin map

and reduce functions. That said, these functions provide an extremely powerful way to query data. In fact,

CouchDB supports only this approach, which is one reason it has such a high learning curve.

MongoDB doesn’t require that you use map and reduce functions. In fact, MongoDB relies on a simple

querying syntax that is more akin to what you see in MySQL. However, MongoDB does make these functions

available for those who want them. The map and reduce functions are written in JavaScript and run on the

server. The job of the map function is to find all the documents that meet certain criteria. These results are

then passed to the reduce function, which processes the data. The reduce function doesn’t usually return

a collection of documents; rather, it returns a new document that contains the information derived. As a

general rule, if you would normally use GROUP BY in SQL, then the map and reduce functions are probably

the right tools for the job in MongoDB.
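
The division of labor between the two functions can be illustrated in plain Python. In MongoDB itself the functions are written in JavaScript and run on the server; this sketch only mimics the flow, and the orders data, map_order, reduce_totals, and map_reduce are hypothetical names. The result corresponds to a SQL query that sums totals with GROUP BY customer.

```python
from collections import defaultdict

# Hypothetical order documents.
orders = [
    {"customer": "ann", "total": 20, "status": "shipped"},
    {"customer": "bob", "total": 15, "status": "shipped"},
    {"customer": "ann", "total": 5, "status": "shipped"},
    {"customer": "bob", "total": 40, "status": "cancelled"},
]


def map_order(doc):
    """Emit a (key, value) pair for each document that meets the criteria."""
    if doc["status"] == "shipped":
        yield doc["customer"], doc["total"]


def reduce_totals(key, values):
    """Collapse all values emitted for one key into a single result."""
    return sum(values)


def map_reduce(collection, map_fn, reduce_fn):
    """Group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in collection:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


print(map_reduce(orders, map_order, reduce_totals))
```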

The Aggregation Framework

MapReduce is a very powerful tool, but it has one major drawback: it’s not exactly high performance. This is

because of how MapReduce is implemented behind the scenes. In short, a lot of work has to be done moving

the data about and converting between the native storage format (BSON) and JSON, applying filters, and

so forth. With the aggregation framework, a large number of operators are provided that are written in C++

and are highly performant. The operators available are growing all the time, with each release bringing new

features.

The aggregation framework is pipeline based, and it allows you to take individual pieces of a query and

string them together in order to get the result you’re looking for. This maintains the benefits of MongoDB’s

document-oriented design while still providing high performance.
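
The pipeline idea, in which each stage consumes the output of the previous one, can be sketched in Python. The stage helpers below (match, group_sum, sort_by) are hypothetical stand-ins that loosely resemble the framework's $match, $group, and $sort stages; they are not MongoDB's API.

```python
from collections import defaultdict
from functools import reduce

orders = [
    {"customer": "ann", "total": 20, "status": "shipped"},
    {"customer": "bob", "total": 15, "status": "shipped"},
    {"customer": "ann", "total": 5, "status": "shipped"},
    {"customer": "bob", "total": 40, "status": "cancelled"},
]


def match(predicate):
    """Stage that keeps only the documents satisfying predicate."""
    def stage(docs):
        return [doc for doc in docs if predicate(doc)]
    return stage


def group_sum(key, field):
    """Stage that sums field for each distinct value of key."""
    def stage(docs):
        totals = defaultdict(int)
        for doc in docs:
            totals[doc[key]] += doc[field]
        return [{"_id": k, "total": v} for k, v in totals.items()]
    return stage


def sort_by(field, descending=False):
    """Stage that orders the documents by field."""
    def stage(docs):
        return sorted(docs, key=lambda doc: doc[field], reverse=descending)
    return stage


def run_pipeline(docs, stages):
    """Feed the documents through each stage in order."""
    return reduce(lambda current, stage: stage(current), stages, docs)


result = run_pipeline(orders, [
    match(lambda doc: doc["status"] == "shipped"),
    group_sum("customer", "total"),
    sort_by("total", descending=True),
])
print(result)
```

Each stage is small and independent, so a different report is usually a matter of reordering or swapping stages rather than rewriting the whole query.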

So if you need all the power of MapReduce, you still have it at your beck and call. If you just want to do

some basic statistics and number crunching, you’re going to love the aggregation framework. You’ll learn

more about the aggregation framework and its commands in Chapters 4 and 6.

Getting Help

MongoDB has a great support community. The core developers are very active, easily approachable,

and typically go to great lengths to help other members of the community. MongoDB is easy to use and

comes with great documentation; however, it’s still nice to know that you’re not alone, and help is available,

should you need it.

Visiting the Website

The first place to look for updated information or help is on the MongoDB website (www.mongodb.org). This

site is updated regularly and contains all the latest MongoDB goodness. On this site, you can find drivers,

tutorials, examples, frequently asked questions, and much more.

Cutting and Pasting MongoDB Code

Pastie (http://pastie.org) is not strictly a MongoDB site; however, it is something you will come across

if you float about in #MongoDB for any length of time. The Pastie site basically lets you cut and paste (hence

the name) some output or program code, and then put it online for others to view. In IRC, pasting multiple

lines of text can be messy or hard to read. If you need to post a fair bit of text (such as three lines or more),

then you should visit http://pastie.org, paste in your content, and then paste the link to your new page

into the channel.

Finding Solutions on Google Groups

MongoDB also has a discussion group called mongodb-user (http://groups.google.com/group/mongodb-user).

This group is a great place to ask questions or search for answers. You can also interact with the group via

e-mail. Unlike IRC, which is very transient, the Google group is a great long-term resource. If you really want

to get involved with the MongoDB community, joining the group is a great way to start.

Finding Solutions on Stack Overflow

Stack Overflow (www.stackoverflow.com) is one of the most popular programming Q&A sites on the

Internet and has a repository of tens of thousands of questions and answers available for anyone to view.

Stack Overflow is best suited for when you have a specific question and are looking for a specific answer.

Answers are rated by the community, so there is a very high chance you’ll find something useful here and

quite often the exact answer you’re looking for. MongoDB, Inc., the company behind the product, maintains

an active support presence on Stack Overflow, making it a great place to start hunting for your answers.

Stack Overflow specifically targets programming questions, but there are also “Stack Exchanges,” such

as DBA Stack Exchange and Server Fault, which cover database and sysadmin questions, respectively.

Leveraging the JIRA Tracking System

MongoDB uses the JIRA issue-tracking system. You can view the tracking site at http://jira.mongodb.org/,

and you are actively encouraged to report any bugs or problems that you come across to this site. Reporting

such issues is viewed by the community as a genuinely good thing to do. Of course, you can also search

through previous issues, and you can even view the roadmap and planned updates for the next release.

If you haven’t posted to JIRA before, you might want to try the mongodb-user group first. You will quickly

find out whether you’ve found something new, and if so, you will be shown how to go about reporting it.

Chatting with the MongoDB Developers

Some MongoDB developers often hang out on Internet Relay Chat (IRC) at #MongoDB on the Freenode

network (www.freenode.net). Of course, the developers do need to sleep at some point (coffee only works

for so long!); fortunately, there are also many knowledgeable MongoDB users from around the world who

are ready to help out. Many people who visit the #MongoDB channel aren’t experts; however, the general

atmosphere is so friendly that they stick around anyway. Please feel free to join the #MongoDB channel and chat

with people there—you may find some great hints and tips. If you’re really stuck, you’ll probably be able to

quickly get back on track.

Summary

This chapter has provided a whistle-stop tour of the benefits MongoDB brings to the table. We’ve looked

at the philosophies and guiding principles behind MongoDB’s creation and development, as well as the

tradeoffs MongoDB’s developers made when implementing these ideals. We’ve also looked at some of the

key terms used in conjunction with MongoDB, how they fit together, and their rough SQL equivalents.

Next, we looked at some of the features MongoDB offers, including how and where you might want to

use them. Finally, we wrapped up the chapter with a quick overview of the community and where you can go

to get help, should you need it.

Now that we've given you a taste of what MongoDB can do for you, let's move on to Chapter 2 where we

will show you how to get MongoDB installed and ready to go.

Chapter 2

Installing MongoDB

In Chapter 1, you got a taste of what MongoDB can do for you. In this chapter, you will learn how to

install and expand MongoDB to do even more, enabling you to use it in combination with your favorite

programming language.

MongoDB is a cross-platform database, and you can find a significant list of available packages to

download from the MongoDB website (www.mongodb.org). The wealth of available versions might make it

difficult to decide which version is the right one for you. The right choice for you probably depends on the

operating system your server uses, the kind of processor in your server, and whether you prefer a stable

release or would like to take a dive into a version that is still in development but offers exciting new features.

Perhaps you’d like to install both a stable and a forward-looking version of the database. It’s also possible

you’re not entirely sure which version you should choose yet. In any case, read on!

Choosing Your Version

When you look at the Download section on the MongoDB website, you will see a rather straightforward

overview of the packages available for download. The first thing you need to pay attention to is the operating

system you are going to run the MongoDB software on. Currently, there are precompiled packages available

for Windows, various flavors of the Linux operating system, Mac OS, and Solaris.

■Note An important thing to remember here is the difference between the 32-bit release and the 64-bit

release of the product. The 32-bit release is only supported as legacy and may lack performance optimizations

present in the 64-bit version. The 32-bit release also does not support the WiredTiger storage engine. It is

strongly recommended to use the 64-bit release for production environments.

You will also need to pay attention to the version of the MongoDB software itself: there are production

releases, previous releases, and development releases. The production release is the most

recent stable version available. When a newer, generally improved version is released, the

prior stable version is made available as a previous release. This designation means the

release is stable and reliable, but it usually has fewer features than the current production release. Finally, there’s the development

release. This release is generally referred to as the unstable version. This version is still in development, and

it will include many changes, including significant new features. Although it has not been fully developed

and tested yet, the developers of MongoDB have made it available to the public to test or otherwise try out.

Understanding the Version Numbers

MongoDB uses the “odd-numbered versions for development releases” approach. In other words, you can

tell by looking at the second part of the version number (also called the release number) whether a version

is a development version or a stable version. If the second number is even, then it’s a stable release. If the

second number is odd, then it’s an unstable, or development, release.

Let’s take a closer look at the three parts of a version number, A, B, and C:

• A, the first (or leftmost) number: Represents the major version and only changes

when there is a full version upgrade.

• B, the second (or middle) number: Represents the release number and indicates

whether a version is a development version or a stable version. If the number is even,

the version is stable; if the number is odd, the version is unstable and considered a

development release.

• C, the third (or rightmost) number: Represents the revision number; this is used for

bugs and security issues.

For example, at the time of writing, the following versions were available from the MongoDB website:

• 3.0.6 (Production release)

• 2.6.11 (Previous release)

• 3.1.8 (Development release)
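
The even/odd rule is easy to check mechanically. The following Python sketch applies it to the three versions listed above; classify_version is a hypothetical helper, and it assumes a version string of exactly three dot-separated numbers.

```python
def classify_version(version):
    """Split an A.B.C version string and label it stable or development."""
    major, release, revision = (int(part) for part in version.split("."))
    kind = "stable" if release % 2 == 0 else "development"
    return {
        "major": major,
        "release": release,
        "revision": revision,
        "kind": kind,
    }


for version in ("3.0.6", "2.6.11", "3.1.8"):
    print(version, classify_version(version)["kind"])
```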

Installing MongoDB on Your System

So far, you’ve learned which versions of MongoDB are available and—hopefully—were able to select one.

Now you’re ready to take a closer look at how to install MongoDB on your particular system. The two main

operating systems for servers at the moment are based on Linux and Microsoft Windows, so this chapter will

walk you through how to install MongoDB on both of these operating systems, beginning with Linux.

Installing MongoDB under Linux

The Unix-based operating systems are extremely popular choices at the moment for hosting services,

including web services, mail services, and, of course, database services. In this chapter, we’ll walk you

through how to get MongoDB running on a popular Linux distribution: Ubuntu.

Depending on your needs, you have two ways of installing MongoDB under Ubuntu: you can install the

packages automatically through so-called repositories, or you can install it manually. The next two sections

will walk you through both options.

Installing MongoDB through the Repositories

Repositories are basically online directories filled with software. Every package contains information about

the version number, prerequisites, and possible incompatibilities. This information is useful when you

need to install a software package that requires another piece of software to be installed first because the

prerequisites can be installed at the same time.

The default repositories available in Ubuntu’s LTS (long-term support) editions contain MongoDB, but

they may be out-of-date versions of the software. Therefore, let’s tell apt-get (the software you use to install

software from repositories) to look at a custom repository. To do this, you need to create a custom MongoDB

list file and specify the repository URL using the following command:

$ echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list

Next, you need to import MongoDB’s public GPG key, which is used to sign the packages and verify their

integrity; you can do so by using the apt-key command:

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

When that is done, you need to tell apt-get that a new repository is available; you can do so using

apt-get’s update command:

$ sudo apt-get update

This command makes apt-get aware of your manually added repository. This means you can now tell apt-get

to install the software itself. You do this by typing the following command in the shell:

$ sudo apt-get install -y mongodb-org

This command installs the current stable (production) version of the MongoDB community edition. If you wish

to install a different version instead, you need to specify the version number. For example, to

install a specific release (3.0.6 in this example), type in the following command instead:

$ sudo apt-get install -y mongodb-org=3.0.6 mongodb-org-server=3.0.6 mongodb-org-shell=3.0.6 mongodb-org-mongos=3.0.6 mongodb-org-tools=3.0.6

That’s all there is to it. At this point, MongoDB has been installed and is (almost) ready to use!

■Note Running apt-get upgrade on a system running an older version of MongoDB will upgrade the

software to the latest stable version available. You can prevent this from happening by running these commands:

$ echo "mongodb-org hold" | sudo dpkg --set-selections

$ echo "mongodb-org-server hold" | sudo dpkg --set-selections

$ echo "mongodb-org-shell hold" | sudo dpkg --set-selections

$ echo "mongodb-org-mongos hold" | sudo dpkg --set-selections

$ echo "mongodb-org-tools hold" | sudo dpkg --set-selections

Installing MongoDB Manually

Next, we’ll cover how to install MongoDB manually. Given how easy it is to install MongoDB automatically

with apt-get on Ubuntu LTS editions, you might wonder why you would want to install the software

manually. For starters, the packaging remains a work in progress, so it might be the case that there are

versions not yet available through the repositories. It’s also possible that the version of MongoDB you want

to use isn’t included in the repository or that you simply don’t run Ubuntu or an LTS version of it. Installing

the software manually also gives you the ability to run multiple versions of MongoDB at the same time.

You’ve decided which version of MongoDB you would like to use, and you’ve downloaded it from the MongoDB

website, http://mongodb.org/downloads, to your Home directory. Next, you need to extract the package

with the following command:

$ tar xzvf mongodb-linux-x86_64-<distribution version>-<mongodb version>.tgz

This command extracts the entire contents of the package to a new directory called

mongodb-linux-x86_64-<distribution version>-<mongodb version>; this directory is located under your current

directory. This directory will contain a number of subdirectories and files. The directory that contains the

executable files is called the bin directory. We will cover which applications perform which tasks shortly.

However, you don’t need to do anything further to install the application. Indeed, it doesn’t take much

more time to install MongoDB manually—depending on what else you need to install, it might even be

faster. Manually installing MongoDB does have some downsides, however. For example, the executables that

you just extracted into the bin directory can only be run from within that directory

unless you add the bin directory to your $PATH environment variable. Thus, if you want to run the mongod

service and the bin directory isn’t part of your $PATH, you will need to start it directly from that

directory. Another critical downside here is that the mongod service won’t start

automatically as a service after a restart and does not include Secure Sockets Layer (SSL) support. These

downsides highlight some of the benefits of installing MongoDB through repositories.
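
The $PATH behavior described above can be demonstrated with Python's standard library on a Unix-like system. In this sketch, locate is a hypothetical helper and demo-mongod is a made-up stand-in executable created in a temporary directory, so nothing here depends on MongoDB being installed.

```python
import os
import shutil
import stat
import tempfile


def locate(program, extra_dirs=()):
    """Return the full path of program, searching extra_dirs before $PATH."""
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which(program, path=search_path)


# Create a stand-in executable in a temporary "bin" directory.
bin_dir = tempfile.mkdtemp()
fake_binary = os.path.join(bin_dir, "demo-mongod")
with open(fake_binary, "w") as handle:
    handle.write("#!/bin/sh\necho demo\n")
os.chmod(fake_binary, os.stat(fake_binary).st_mode | stat.S_IXUSR)

# Not found while the directory is missing from the search path ...
print(locate("demo-mongod"))
# ... but found once the directory is searched as well.
print(locate("demo-mongod", extra_dirs=[bin_dir]))
```

Adding the extracted bin directory to $PATH in your shell profile has the same effect for every command you type.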

Installing MongoDB under Windows

Microsoft’s Windows is also a popular choice for server software, including Internet-based services.

MongoDB comes with an installer for Windows-based operating systems. All you need to do is select

the MongoDB version of your choice, download the installer, and run it to get it set up. The installer comes

with two options—Complete and Custom—allowing you to choose the features to be installed and where

they will be installed. In most cases, the Complete setup type would be recommended.

Alternatively, the legacy build from MongoDB can be downloaded in ZIP format. With this, you do

not need to walk through any setup process; installing the software is a simple matter of downloading the

package, extracting it, and running the application itself. Similar to the Linux legacy builds, this version will

not include SSL support, however.

For example, assume you’ve decided to download the latest legacy version of MongoDB for your

64-bit Windows 2008 R2+ server. You begin by extracting the package

(mongodb-win32-x86_64-2008plus-x.y.z.zip) to the root of your C:\ drive. At this point, all you need to do is open a command prompt

(Start ➤ Run ➤ cmd ➤ OK) and browse to the directory you extracted the contents to:

> cd C:\mongodb-win32-x86_64-2008plus-x.y.z\

> cd bin\

Doing this brings you to the directory that contains the MongoDB executables. That’s all there is to it: as

noted previously, with this approach, there’s no installation necessary.

Running MongoDB

At long last you’re ready to get your hands dirty. You’ve learned where to get the MongoDB version that best

suits your needs and hardware, and you’ve also seen how to install the software. Now it’s finally time to look

at running and using MongoDB.

Prerequisites

Before you can start the MongoDB service, you need to create a data directory for MongoDB to store its files

in. By default, MongoDB stores the data in the /data/db directory on Unix-based systems (such as Linux and

OS X) and in the C:\data\db directory on Windows.

■Note MongoDB does not create these data directories for you, so you need to create them manually;

otherwise, MongoDB will fail to run and throw an error message. Also, be sure that you set the permissions

correctly: MongoDB must have read, write, and directory creation permissions to function properly.

If you wish to use a directory other than /data/db or C:\data\db, then you can tell MongoDB to look at

the desired directory by using the --dbpath flag when executing the service.
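
The preparation steps can be scripted. This Python sketch creates a data directory, checks that the current user has the permissions mentioned in the note, and builds the command line with the --dbpath flag. The helper names are hypothetical, and a temporary directory stands in for /data/db so the example is safe to run; it only prints the command rather than starting the service.

```python
import os
import tempfile


def prepare_data_dir(path):
    """Create the data directory if needed and verify it is usable."""
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        raise PermissionError(f"insufficient permissions on {path}")
    return path


def mongod_command(dbpath):
    """Build the argument list for starting mongod with a custom data path."""
    return ["mongod", "--dbpath", dbpath]


# A temporary directory stands in for /data/db in this sketch.
base = tempfile.mkdtemp()
data_dir = prepare_data_dir(os.path.join(base, "data", "db"))
print(" ".join(mongod_command(data_dir)))
```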

Once you create the required directory and assign the appropriate permissions, you can start the

MongoDB core database service by executing the mongod application. You can do this from the command

prompt or the shell in Windows and Linux, respectively.

Surveying the Installation Layout

After you install or extract MongoDB successfully, you will have the applications shown in Table 2-1

available in the bin directory (in both Linux and Windows).

Table 2-1. The Included MongoDB Applications

Application     Function
bsondump        Reads contents of BSON-formatted rollback files.
mongo           The database shell.
mongod          The core database server.
mongodump       Database backup utility.
mongoexport     Export utility (JSON, CSV, TSV), not reliable for backup.
mongofiles      Manipulates files in GridFS objects.
mongoimport     Import utility (JSON, CSV, TSV), not reliable for recoveries.
mongooplog      Pulls oplog entries from another mongod instance.
mongoperf       Check disk I/O performance.
mongorestore    Database backup restore utility.
mongos          MongoDB shard process.
mongostat       Returns counters of database operation.
mongotop        Tracks/reports MongoDB read/write activities.
mongorestore    Restore/import utility.

Note: All applications are within the bin directory.
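If you want to check which of these applications your download actually contains, a short script can survey the bin directory. The sketch below is only an illustration: the tool names come from Table 2-1, while the survey_bin_dir function and its rule of stripping file extensions (so that mongod.exe matches mongod) are assumptions made for this example.

```python
import os
import tempfile

# Application names taken from Table 2-1.
MONGODB_TOOLS = [
    "bsondump", "mongo", "mongod", "mongodump", "mongoexport",
    "mongofiles", "mongoimport", "mongooplog", "mongoperf",
    "mongorestore", "mongos", "mongostat", "mongotop",
]


def survey_bin_dir(bin_dir, tools=MONGODB_TOOLS):
    """Return (present, missing) tool names for the given bin directory."""
    # Strip extensions so that mongod.exe on Windows matches "mongod".
    found = {os.path.splitext(name)[0] for name in os.listdir(bin_dir)}
    present = [tool for tool in tools if tool in found]
    missing = [tool for tool in tools if tool not in found]
    return present, missing
```

You would call survey_bin_dir with the path to your own bin directory; any tool reported as missing is simply not shipped for your platform or version.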

The installed software includes 14 applications (or 13, under Microsoft Windows) that you will be using

in conjunction with your MongoDB databases. The two “most important” applications are the mongo and

mongod applications. The mongo application allows you to use the database shell; this shell enables you to

accomplish practically anything you’d want to do with MongoDB.

The mongod application starts the service or daemon, as it’s also called. There are also many flags you

can set when launching the MongoDB applications. For example, the service lets you specify the path where

the database is located (--dbpath), show version information (--version), and even print some diagnostic

system information (with the --sysinfo flag)! You can view the entire list of options by including the --help

flag when you launch the service. For now, you can just use the defaults and start the service by typing

mongod as any user in your shell or command prompt.

Using the MongoDB Shell

Once you create the database directory and start the mongod database application successfully, you’re ready

to fire up the shell and take a sneak peek at the powers of MongoDB.

Fire up your shell (Unix) or your command prompt (Windows); when you do so, make sure you are in

the correct location, so that the mongo executable can be found. You can start the shell by typing mongo at the

command prompt and hitting the Return key. You will be immediately presented with a blank window and a

blinking cursor (see Figure 2-1). Ladies and gentlemen, welcome to MongoDB!

If you start the MongoDB service with the default parameters, and start the shell with the default

settings, you will be connected to the default test database running on your local host. You do not need to

create this database beforehand. This is one of MongoDB’s most powerful features: if you

attempt to connect to a database that does not exist, MongoDB will automatically create it for you once you

insert data into it. This can be either good or bad, depending on how well you handle your keyboard.

Before taking any further steps, such as implementing any additional drivers that will enable you to

work with your favorite programming language, you might find it helpful to take a quick peek at some of the

more useful commands available in the MongoDB shell (see Table 2-2).

Figure 2-1. The MongoDB shell

■Tip You can get a full list of commands by typing the help command in the MongoDB shell.

Installing Additional Drivers

You might think that you are ready to take on the world now that you have set up MongoDB and know

how to use its shell. That’s partially true; however, you probably want to use your preferred programming

language rather than the shell when querying or otherwise manipulating the MongoDB database. MongoDB

offers multiple official drivers, and many more are offered in the community that let you do precisely that.

For example, drivers for the following programming languages can be found on the MongoDB website:

• C

• C++

• C#

• Java

• Node.js

• Perl

• PHP

• Python

• Motor

• Ruby

• Scala

In this section, you will learn how to implement MongoDB support for two of the more popular

programming languages in use today: PHP and Python.

■Tip There are many community-driven MongoDB drivers available. A long list can be found on the

MongoDB website at docs.mongodb.org/ecosystem.

Table 2-2. Basic Commands within the MongoDB Shell

Command             Function
show dbs            Shows the names of the available databases.
show collections    Shows the collections in the current database.
show users          Shows the users in the current database.
use <db name>       Sets the current database to <db name>.

Installing the PHP Driver

PHP is one of the most popular programming languages in existence today. This language is specifically

aimed at web development, and it can be incorporated into HTML easily. This fact makes the language

the perfect candidate for designing a web application, such as a blog, a guestbook, or even a business-card

database. The next few sections cover your options for installing and using the MongoDB PHP driver.

Getting MongoDB for PHP

Like MongoDB, PHP is a cross-platform development tool, and the steps required to set up MongoDB in

PHP vary depending on the intended platform. Previously, this chapter showed you how to install MongoDB

on both Ubuntu and Windows; we’ll adopt the same approach here, demonstrating how to install the driver

for PHP on both Ubuntu and Windows.

Begin by downloading the PHP driver for your operating system. Do this by firing up your browser and

navigating to docs.mongodb.org. At the time of writing, the website includes a separate menu option called

Drivers. Click this option to bring up a list of currently available language drivers (see Figure 2-2).

Next, select PHP from the list of languages and follow the links to download the latest (stable) version of

the driver. Different operating systems will require different approaches for installing the MongoDB extension

for PHP automatically. That’s right; just as you were able to install MongoDB on Ubuntu automatically, you

can do the same for the PHP driver. And just as when installing MongoDB under Ubuntu, you can also choose

to install the PHP language driver manually. Let’s look at the two options available to you.

Figure 2-2. A short list of currently available language drivers for MongoDB

Installing the PHP Driver on Unix-Based Platforms Automatically

The developers of PHP came up with a great solution that allows you to expand your PHP installation with

other popular extensions: PECL. PECL is a repository solely designed for PHP; it provides a directory of all

known extensions that you can use to download, install, and even develop PHP extensions. If you are already

acquainted with the package-management system called aptitude (which you used previously to install

MongoDB), then you will be pleased by how similar PECL’s interface is to the one in aptitude.

Assuming that you have PECL installed on your system, open up a console and type the following

command to install the MongoDB extension:

$ sudo pecl install mongo

Entering this command causes PECL to download and install the MongoDB extension for PHP

automatically. In other words, PECL will download the extension for your PHP version and place it in the

PHP extensions directory. There’s just one catch: PECL does not automatically add the extension to the

list of loaded extensions; you will need to do this step manually. To do so, open a text editor (vim, nano, or

whichever text editor you prefer) and alter the file called php.ini, which is the main configuration file PHP

uses to control its behavior, including the extensions it should load.

Next, open the php.ini file, scroll down to the extensions section, and add the following line to tell PHP

to load the MongoDB driver:

extension=mongo.so

■Note The preceding step is mandatory; if you don’t do this, then the MongoDB commands in PHP will not

function. To find the php.ini file on your system, you can use the grep command in your shell:

php -i | grep Configuration.

The “Confirming That Your PHP Installation Works” section later in this chapter will cover how to

confirm that an extension has been loaded successfully.
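Because a forgotten or commented-out extension line is an easy mistake to make, it can help to check the file programmatically. The following Python sketch scans the text of a php.ini file for an active extension line. It is a simplified illustration rather than a full php.ini parser: extension_enabled is a hypothetical helper, and the sketch assumes that anything after a semicolon is a comment.

```python
def extension_enabled(ini_text, extension="mongo.so"):
    """Return True if ini_text contains an active extension=<extension> line."""
    for raw_line in ini_text.splitlines():
        # Treat everything after ';' as a comment.
        line = raw_line.split(";", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "extension" and value.strip('"') == extension:
            return True
    return False


sample = "[PHP]\n;extension=mongo.so\nextension=mongo.so\n"
print(extension_enabled(sample))
```

To use it on a real system, you would read the contents of the php.ini file reported by the grep command above and pass that text to the function.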

That’s all, folks! You’ve just installed the MongoDB extension for your PHP installation, and you are now

ready to use it. Next, you will learn how to install the driver manually.

Installing the PHP Driver on Unix-Based Platforms Manually

If you would prefer to compile the driver yourself or for some reason are unable to use the PECL application

as described previously (your hosting provider might not support this option, for instance), then you can

also choose to download the source driver and compile it manually.

To download the driver, go to the GitHub website (http://github.com). This site offers the latest source

package for the PHP driver. Once you download it, you will need to extract the package and make the driver

by running the following set of commands:

$ unzip mongo-php-driver-master.zip

$ cd mongo-php-driver-master

$ phpize

$ ./configure

$ sudo make install

This process can take a while, depending on the speed of your system. Once the process completes,

your MongoDB PHP driver is installed and ready to use! After you execute the commands, you will be shown

where the driver has been placed; typically, the output looks something like this:

Installing '/usr/lib/php5/20121212/mongo.so'

You do need to confirm that this directory is the same directory where PHP stores its extensions by

default. You can use the following command to confirm where PHP stores its extensions:

$ php -i | grep extension_dir

This line outputs the directory where all PHP extensions should be placed. If this directory doesn’t

match the one where the mongo.so driver was placed, then you must move the mongo.so driver to the proper

directory, so PHP knows where to find it.
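The comparison described above can be sketched in a few lines of Python. The helper names below are hypothetical, and the sketch assumes that php -i prints the setting in the form "extension_dir => local value => master value"; if your PHP build prints it differently, the parsing would need to be adjusted.

```python
import posixpath
import re


def installed_dir(make_output):
    """Extract the directory from a line such as Installing '/path/mongo.so'."""
    match = re.search(r"Installing '([^']+)'", make_output)
    return posixpath.dirname(match.group(1)) if match else None


def configured_extension_dir(php_info_output):
    """Extract the local value from an 'extension_dir => ...' line."""
    for line in php_info_output.splitlines():
        if line.startswith("extension_dir"):
            parts = [part.strip() for part in line.split("=>")]
            return parts[1] if len(parts) > 1 else None
    return None


def driver_needs_moving(make_output, php_info_output):
    """Return True when the driver was installed outside PHP's extension_dir."""
    installed = installed_dir(make_output)
    configured = configured_extension_dir(php_info_output)
    if installed is None or configured is None:
        raise ValueError("could not find both directories in the output")
    return posixpath.normpath(installed) != posixpath.normpath(configured)
```

If the function returns True for the output on your system, move mongo.so into the directory that PHP reports.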

As before, you will need to tell PHP that the newly created extension has been placed in its extension

directory and that it should load this extension. You can specify this by modifying the php.ini file’s

extensions section; add the following line to that section:

extension=mongo.so

Finally, a restart of your web service is required. When using the Apache HTTPd service, you can

accomplish this using the following service command:

$ sudo /etc/init.d/apache2 restart

That’s it! This process is a little lengthier than using PECL’s automated method; however, if you are

unable to use PECL, or if you are a driver developer and interested in bug fixes, then you would want to use

the manual method instead.

Installing the PHP Driver on Windows

You have seen previously how to install MongoDB on your Windows operating system. Now let’s look at how

to implement the MongoDB driver for PHP on Windows.

For Windows, there are precompiled DLLs available for each release of the PHP driver for MongoDB. You

can get these binaries from the PECL website (http://pecl.php.net/package/mongo). The biggest challenge

in this case is choosing the correct package to install for your version of PHP (a wide variety of packages

are available). If you aren’t certain which package version you need, you can use the <?php phpinfo(); ?>

command in a PHP page to learn exactly which one suits your specific environment. We’ll take a closer look at

the phpinfo() command in the next section.

After downloading the correct package and extracting its contents, all you need to do is copy the driver

file (called php_mongo.dll) to your PHP’s extension directory; this enables PHP to pick it up.

Depending on your version of PHP, the extension directory may be called either Ext or Extensions.

If you aren’t certain which directory it should be, you can review the PHP documentation that came with the

version of PHP installed on your system.

Once you place the driver DLL into the PHP extensions directory, you still need to tell PHP to load the

driver. Do this by altering the php.ini file and adding the following line in the extensions section:

extension=php_mongo.dll

When this is done, restart the HTTP service on your system, and you are now ready to use the MongoDB

driver in PHP. Before you start leveraging the magic of MongoDB with PHP, however, you need to confirm

that the extension is loaded correctly.

Confirming That Your PHP Installation Works

So far you’ve successfully installed both MongoDB and the MongoDB driver in PHP. Now it’s time to do

a quick check to confirm whether the driver is being loaded correctly by PHP. PHP gives you a simple

and straightforward method to accomplish this: the phpinfo() command. This command shows you an

extended overview of all the modules loaded, including version numbers, compilation options, server

information, operating system information, and so on.

To use the phpinfo() command, open a text or HTML editor, and type the following:

<?php phpinfo(); ?>

Next, save the document in your webserver’s www directory and call it whatever you like. For example,

you might call it test.php or phpinfo.php. Now open your browser and go to your localhost or external

server (that is, go to whatever server you are working on) and look at the page you just created. You will see

a good overview of all the PHP components and all sorts of other relevant information. The thing you need

to focus on here is the section that displays your MongoDB information. This section will list the version

number, port numbers, hostname, and so on (see Figure 2-3).

Once you confirm that the installation was successful and that the driver loaded successfully, you’re

ready to write some PHP code and walk through a MongoDB example that leverages PHP.

Connecting to and Disconnecting from the PHP Driver

You’ve confirmed that the MongoDB PHP driver has been loaded correctly, so it’s time to start writing some

PHP code! Let’s take a look at two simple yet fundamental options for working with MongoDB: initiating a

connection between MongoDB and PHP, and then severing that connection.

Figure 2-3. Displaying your MongoDB information in PHP

You use the MongoClient class to initiate a connection between MongoDB and PHP; this same class

also lets you use the database server commands. A simple yet typical connection command looks like this:

$connection = new MongoClient();

If you use this command without providing any parameters, it will connect to the MongoDB service on

the default MongoDB port (27017) on your localhost. If your MongoDB service is running somewhere else,

then you simply specify the hostname of the remote host you want to connect to:

$connection = new MongoClient("example.com");

This line instantiates a fresh connection to the MongoDB service running on the server that answers to

the example.com domain name (note that it will still connect to the default port: 27017). If you want to

connect to a different port number, however (for example, if you don’t want to use the default port, or you’re

already running another session of the MongoDB service on that port), you can do so by specifying the port

number and hostname:

$connection = new MongoClient("example.com:12345");

This example creates a connection to the database service. Next, you will learn how to disconnect

from the service. Assuming you used the method just described to connect to your database, you can call

$connection again to pass the close() command to terminate the connection, as in this example:

$connection->close();

The close() function doesn’t need to be called, except in unusual circumstances. The reason for this is that the PHP

driver closes the connection to the database once the MongoClient object goes out of scope. Nevertheless,

it is recommended that you call close() at the end of your PHP code; this helps you avoid keeping old

connections from hanging around until they eventually time out. It also helps you ensure that any existing

connection is closed, thereby enabling a new connection to happen, as in the following example:

$connection = new MongoClient();

$connection->close();

$connection->connect();

The following snippet shows how this would look in PHP:

<?php

// Establish the database connection

$connection = new MongoClient();

// Close the database connection

$connection->close();

?>
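The connection defaults described in this section (localhost, port 27017, and an optional host:port string) are not specific to PHP. As a rough illustration, the following Python sketch resolves an address string in the same way. Note that resolve_server is a hypothetical helper written for this example, not a function of any MongoDB driver, and that it does not handle IPv6 addresses.

```python
DEFAULT_HOST = "localhost"
DEFAULT_PORT = 27017


def resolve_server(address=None):
    """Return (host, port), filling in the defaults for any missing part."""
    if not address:
        return DEFAULT_HOST, DEFAULT_PORT
    host, separator, port_text = address.partition(":")
    host = host or DEFAULT_HOST
    # Fall back to the default port when no port follows the hostname.
    port = int(port_text) if separator and port_text else DEFAULT_PORT
    return host, port


for address in (None, "example.com", "example.com:12345"):
    print(address, resolve_server(address))
```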

Installing the Python Driver

Python is a general-purpose and easy-to-read programming language. These qualities make Python a

good language to start with when you are new to programming and scripting. It’s also a great language

to look into if you are familiar with programming and you’re looking for a multiparadigm programming

language that permits several styles of programming (object-oriented programming, structured

programming, and so on). In the upcoming sections, you’ll learn how to install Python and enable

MongoDB support for the language.

Installing PyMongo under Linux

Python offers a specific package for MongoDB support called PyMongo. This package allows you to interact

with the MongoDB database, but you will need to get this driver up and running before you can use this

powerful combination. As when installing the PHP driver, there are two methods you can use to install

PyMongo: an automated approach that relies on pip (which in turn uses setuptools) or a manual approach where you download

the source code for the project. The following sections show you how to install PyMongo using both

approaches.

Installing PyMongo Automatically

The pip application that comes bundled with the python-pip package lets you automatically download,

build, install, and manage Python packages. This is incredibly convenient: you can extend your

Python installation with new modules while pip does all the work for you.

■Note You must have setuptools installed before you can use the pip application. This will be done

automatically when installing the python-pip package.

To install pip, all you need to do is tell apt-get to download and install it, like so:

$ sudo apt-get install python-pip

When this line executes, pip will detect the currently running version of Python and install itself on the

system. That’s all there is to it. Now you are ready to use the pip command to download, make, and install

the MongoDB module, as in this example:

$ sudo pip install pymongo

Again, that’s all there is to it! PyMongo is now installed and ready to use.

■Tip You can also install previous versions of the PyMongo module with pip using the pip install

pymongo==x.y.z command. Here, x.y.z denotes the version of the module.

Installing PyMongo Manually

You can also choose to install PyMongo manually. Begin by going to the download section of the site that

hosts the PyMongo plug-in (http://pypi.python.org/pypi/pymongo). Next, download the tarball and

extract it. A typical download and extract procedure might look like this in your console:

$ wget http://pypi.python.org/packages/source/p/pymongo/pymongo-3.0.3.tar.gz

$ tar xzf pymongo-3.0.3.tar.gz

Once you successfully download and extract this file, make your way to the extracted contents directory

and invoke the installation of PyMongo by running the setup.py script with Python:

$ cd pymongo-3.0.3

$ sudo python setup.py install

The preceding snippet outputs the entire creation and installation process of the PyMongo module.

Eventually, this process brings you back to your prompt, at which time you’re ready to start using PyMongo.

Installing PyMongo under Windows

Installing PyMongo under Windows is a straightforward process. Just as pip automates the installation under Linux,

Easy Install can simplify installing PyMongo under Windows. If you don’t have setuptools installed

yet (this package includes the easy_install command), then go to the Python Package Index website

(http://pypi.python.org) to locate the setuptools installer.

For example, assume you have Python version 3.4.3 installed on your system. Next, you will need

to download the setuptools bootstrapper, ez_setup.py, from the Python Package Index website. Simply

double-click the ez_setup.py Python file to install setuptools on your system! It is that simple.

■Caution If you have previously installed an older version of setuptools, then you will need to uninstall that

version using your system’s Add/Remove Programs feature before installing the newer version.

Once the installation is complete, you will find the easy_install.exe file in Python’s Scripts

subdirectory. At this point, you’re ready to install PyMongo on Windows.

Once you’ve successfully installed setuptools, you can open a command prompt and cd your way to

Python’s Scripts directory. By default, this is set to C:\Pythonxy\Scripts\, where xy represents your version

number. Once you navigate to this location, you can use the same syntax shown previously for installing the

Unix variant:

C:\Python34\Scripts> easy_install PyMongo

Unlike the output you get when installing this program on a Linux machine, the output here is rather

brief, indicating only that the extension has been downloaded and installed (see Figure 2-4). That said, this

information is sufficient for your purposes in this case.

Confirming That Your PyMongo Installation Works

To confirm whether the PyMongo installation has completed successfully, you can open your Python shell.

In Linux, you do this by opening a console and typing python. In Windows, you do this by clicking Start

➤ Programs ➤ Python xy ➤ Python (command line). At this point, you will be welcomed to the world of

Python (see Figure2-5).

Figure 2-4. Installing PyMongo under Windows

Figure 2-5. The Python shell

You can use the import command to tell Python to start using the freshly installed extension:

>>> import pymongo

>>>

■Note You must run the import pymongo command in every Python session or script in which you want to use PyMongo.

If all went well, you will not see a thing, and you can start firing off some fancy MongoDB commands.

If you received an error message, however, something went wrong, and you might need to review the steps

just taken to discover where the error occurred.
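If you would rather check for the driver from a script than from the interactive shell, the standard library can tell you whether a module is installed without raising an error. The sketch below is one possible way to do this; driver_available is a name chosen for this example.

```python
import importlib.util


def driver_available(module_name="pymongo"):
    """Return True if the named top-level module can be found."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


if driver_available():
    print("PyMongo is installed and ready to import.")
else:
    print("PyMongo was not found; review the installation steps.")
```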

Summary

In this chapter, we examined how to obtain the MongoDB software, including how to select the correct

version you need for your environment. We also discussed the version numbers, how to install and run

MongoDB, and how to install and run its prerequisites. Next, we covered how to establish a connection to a

database through a combination of the shell, PHP, and Python.

We also explored how to expand MongoDB so it will work with your favorite programming languages, as

well as how to confirm whether the language-specific drivers have installed correctly.

In the next chapter, we will explore how to design and structure MongoDB databases and data properly.

Along the way, you’ll learn how to index information to speed up queries, how to reference data, and how to

leverage a fancy new feature called geospatial indexing.

Chapter 3

The Data Model

In Chapter 2, you learned how to install MongoDB on two commonly used platforms (Windows and Linux), as

well as how to extend the database with some additional drivers. In this chapter, you will shift your attention

from the operating system and instead examine the general design of a MongoDB database. Specifically,

you’ll learn what collections are, what documents look like, how indexes work and what they do, and finally,

when and where to reference data instead of embedding it. We touched on some of these concepts briefly

in Chapter 1, but in this chapter, we’ll explore them in more detail. Throughout this chapter, you will see

code examples designed to give you a good feeling for the concepts being discussed. Do not worry too much

about the commands you’ll be looking at, however, because they will be discussed extensively in Chapter 4.

Designing the Database

As you learned in Chapters 1 and 2, a MongoDB database is nonrelational and schemaless. This means

that a MongoDB database isn’t bound to any predefined columns or data types as relational databases are

(such as MySQL). The biggest benefit of this implementation is that working with data is extremely flexible

because there is no predefined structure required in your documents.

To put it more simply, you are perfectly capable of having one collection that contains hundreds or

even thousands of documents that all carry a different structure—without breaking any of the MongoDB

database’s rules.

One of the benefits of this flexible schemaless design is that you won’t be restricted when programming

in a dynamically typed language such as Python or PHP. Indeed, it would be a severe limitation if your

extremely flexible and dynamically capable programming language couldn’t be used to its full potential

because of the innate limitations of your database.

Let’s take another glance at what the data design of a document in MongoDB looks like, paying

particular attention to how flexible data in MongoDB are compared to data in a relational database. In

MongoDB, a document is an item that contains the actual data, comparable to a row in SQL. In the following

example, you will see how two completely different types of documents can coexist in a single collection

named Media (note that a collection is roughly equivalent to a table in the world of SQL):

{
    "Type": "CD",
    "Artist": "Nirvana",
    "Title": "Nevermind",
    "Genre": "Grunge",
    "Releasedate": "1991.09.24",
    "Tracklist": [
        {
            "Track": "1",
            "Title": "Smells Like Teen Spirit",
            "Length": "5:02"
        },
        {
            "Track": "2",
            "Title": "In Bloom",
            "Length": "4:15"
        }
    ]
}

{
    "Type": "Book",
    "Title": "Definitive Guide to MongoDB: A complete guide to dealing with Big Data using MongoDB 3rd ed., The",
    "ISBN": "978-1-4842-1183-0",
    "Publisher": "Apress",
    "Author": [
        "Hows, David",
        "Plugge, Eelco",
        "Membrey, Peter",
        "Hawkins, Tim"
    ]
}

As you might have noticed when looking at this pair of documents, most of the fields aren’t closely

related to one another. Yes, they both have fields called Title and Type; but apart from that similarity, the

documents are completely different. Nevertheless, these two documents are contained in a single collection

called Media.
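To make the comparison concrete, you can model the two documents as plain Python dictionaries and ask which fields they have in common. This is only a sketch that uses in-memory data rather than a running MongoDB instance; the media list stands in for the Media collection, and shared_fields is a helper invented for this example.

```python
cd = {
    "Type": "CD",
    "Artist": "Nirvana",
    "Title": "Nevermind",
    "Genre": "Grunge",
    "Releasedate": "1991.09.24",
    "Tracklist": [
        {"Track": "1", "Title": "Smells Like Teen Spirit", "Length": "5:02"},
        {"Track": "2", "Title": "In Bloom", "Length": "4:15"},
    ],
}

book = {
    "Type": "Book",
    "Title": "Definitive Guide to MongoDB, The",
    "ISBN": "978-1-4842-1183-0",
    "Publisher": "Apress",
    "Author": ["Hows, David", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"],
}

# A list of differently shaped documents stands in for the collection.
media = [cd, book]


def shared_fields(documents):
    """Return the set of top-level keys present in every document."""
    return set.intersection(*(set(document) for document in documents))


print(sorted(shared_fields(media)))
```

Only Title and Type appear in both documents; every other field belongs to just one of them, and nothing about the list objects to that.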

MongoDB is called a schemaless database, but that doesn’t mean MongoDB’s data structure is

completely devoid of schema. For example, you do define collections and indexes in MongoDB (you will

learn more about this later in the chapter). Nevertheless, you do not need to predefine a structure for any of

the documents you will be adding, as is the case when working with MySQL, for example.

Simply stated, MongoDB is an extraordinarily dynamic database; the preceding example would never

work in a relational database unless you also added each possible field to your table. Doing so would be a

waste of both space and performance, not to mention highly disorganized.

Drilling Down on Collections

As mentioned previously, collection is a commonly used term in MongoDB. You can think of a collection as a

container that stores your documents (that is, your data), as shown in Figure 3-1.

Now compare the MongoDB database model to a typical model for a relational database (see Figure 3-2).

As you can see, the general structure is the same between the two types of databases; nevertheless,

you do not use them in even remotely similar manners. There are several types of collections in MongoDB.

The default collection type is expandable in size: the more data you add to it, the larger it becomes. It’s also

possible to define collections that are capped. These capped collections can only contain a certain amount

of data before the oldest document is replaced by a newer document (you will learn more about these

collections in Chapter 4).
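The replace-the-oldest behavior of a capped collection can be imitated with a bounded queue. The following Python sketch is an analogy only: it limits the number of documents, whereas a real capped collection is limited primarily by its size in bytes, and make_capped is a name invented for this example.

```python
from collections import deque


def make_capped(max_documents):
    """Return a container that discards its oldest item once it is full."""
    return deque(maxlen=max_documents)


log = make_capped(3)
for number in range(1, 6):
    # Once three documents are stored, each append drops the oldest one.
    log.append({"event": number})

print(list(log))
```

After five insertions into a container capped at three documents, only the three most recent documents remain.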

Figure 3-1. The MongoDB database model

Figure 3-2. A typical relational database model

Every collection in MongoDB has a unique name. This name should, for the sake of best practice, begin

with a letter, or optionally, an underscore (_) when created using the createCollection function. The

name can contain numbers and letters; however, the $ symbol is reserved by MongoDB. Similarly, using an

empty string ("") is not allowed, the null character cannot be used in the name, and the name cannot start with the

system. prefix. Generally, it’s recommended that you keep the collection’s name simple and short (to around

nine characters or so); however, the maximum number of allowed characters in a collection name is 128.

Obviously, there isn’t much practical reason to create such a long name.
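The naming rules above translate naturally into a small validation routine. The sketch below is a Python illustration of the rules as stated in this section, including the 128-character limit quoted above; it is not the check MongoDB itself performs, and collection_name_problems is a hypothetical helper.

```python
MAX_NAME_LENGTH = 128  # limit quoted in the text above


def collection_name_problems(name):
    """Return a list of rule violations; an empty list means the name is fine."""
    problems = []
    if name == "":
        problems.append("name is empty")
    if "$" in name:
        problems.append("contains the reserved $ symbol")
    if "\0" in name:
        problems.append("contains the null character")
    if name.startswith("system."):
        problems.append("starts with the reserved system. prefix")
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than %d characters" % MAX_NAME_LENGTH)
    # Best practice rather than a hard rule: letter or underscore first.
    if name and not (name[0].isalpha() or name[0] == "_"):
        problems.append("best practice: start with a letter or underscore")
    return problems


for candidate in ("media", "system.users", "cash$"):
    print(candidate, collection_name_problems(candidate))
```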

A single database running the default MMAPv1 storage engine has a default limit of approximately

24,000 namespaces, whereas the WiredTiger storage engine is not subject to this limitation. Each collection

accounts for at least two namespaces: one for the collection itself and one more for the first index created

in the collection. Each additional index you add to a collection uses another namespace.

In theory, this means that each database can have up to 12,000 collections by default, assuming each

collection only carries one index. However, this limit can be raised by increasing the size of the namespace file up to

2047MB with the nsSize parameter when executing the MongoDB service application (mongod).
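The arithmetic behind the 12,000 figure is simple enough to write down. The following Python sketch assumes the approximate default of 24,000 namespaces quoted above and one namespace per collection plus one per index; max_collections is a name made up for this example.

```python
DEFAULT_NAMESPACE_LIMIT = 24000  # approximate MMAPv1 default quoted above


def max_collections(indexes_per_collection=1,
                    namespace_limit=DEFAULT_NAMESPACE_LIMIT):
    """Estimate how many collections fit within the namespace limit."""
    # One namespace for the collection itself, plus one for each index.
    namespaces_per_collection = 1 + indexes_per_collection
    return namespace_limit // namespaces_per_collection


print(max_collections())   # one index per collection
print(max_collections(3))  # three indexes per collection
```

With one index per collection the estimate is 12,000 collections; with three indexes per collection it drops to 6,000.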

Using Documents

Recall that a document consists of key-value pairs. For example, the pair "type" : "Book" consists of a key

named type, and its value, Book. Keys are written as strings, but the values in them can vary tremendously.

Values can be any of a rich set of datatypes, such as arrays or even binary data. Remember: MongoDB stores

its data in BSON format (see Chapter 1 for more information on this topic).

Next, let’s look at all of the possible types of data you can add to a document, and what you use them for:

• String: This commonly used datatype contains a string of text (or any other kind

of characters). This datatype is used mostly for storing text values (for example,

{"Country" : "Japan"}).

• Integer (32-bit and 64-bit): This type is used to store a numerical value (for example,

{ "Rank" : 1 }). Note that there are no quotes placed before or after the integer.

• Boolean: This datatype can be set to either true or false.

• Double: This datatype is used to store floating-point values.

• Min / Max keys: This datatype is used to compare a value against the lowest and

highest BSON elements, respectively.

• Arrays: This datatype is used to store arrays (for example, ["Membrey,

Peter","Plugge, Eelco","Hows, David"]).

• Timestamp: This datatype is used to store a timestamp. This can be handy for

recording when a document has been modified or added.

• Object: This datatype is used for embedded documents.

• Null: This datatype is used for a Null value.

• Symbol: This datatype is used identically to a string; however, it’s generally reserved

for languages that use a specific symbol type.

• Date: This datatype is used to store a date and time as the number of milliseconds since the Unix epoch

(POSIX time).

• Object ID: This datatype is used to store the document’s ID.

• Binary data: This datatype is used to store binary data.

• Regular expression: This datatype is used for regular expressions. All options are

represented by specific characters provided in alphabetical order. You will learn

more about regular expressions in Chapter 4.

• JavaScript code: This datatype is used for JavaScript code.

In Chapter 4, you will learn how to identify your datatypes by using the $type operator.
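To get a feel for how everyday values line up with the datatypes in the preceding list, consider the following Python sketch. It maps plain Python values to a BSON type name and the numeric code defined for that type in the BSON specification (the number the $type operator works with). The bson_type_for function is an illustration written for this section and covers only types that have an obvious standard-library counterpart; a real driver such as PyMongo makes these decisions for you and may differ in detail.

```python
import datetime
import re

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
REGEX_TYPE = type(re.compile(""))


def bson_type_for(value):
    """Return (type name, BSON type number) for a plain Python value."""
    if value is None:
        return ("Null", 10)
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return ("Boolean", 8)
    if isinstance(value, int):
        if INT32_MIN <= value <= INT32_MAX:
            return ("32-bit integer", 16)
        return ("64-bit integer", 18)
    if isinstance(value, float):
        return ("Double", 1)
    if isinstance(value, str):
        return ("String", 2)
    if isinstance(value, bytes):
        return ("Binary data", 5)
    if isinstance(value, datetime.datetime):
        return ("Date", 9)
    if isinstance(value, REGEX_TYPE):
        return ("Regular expression", 11)
    if isinstance(value, list):
        return ("Array", 4)
    if isinstance(value, dict):
        return ("Object", 3)
    raise TypeError("no mapping in this sketch for %r" % type(value))


document = {"Country": "Japan", "Rank": 1, "Active": True, "Tags": ["a", "b"]}
for key, value in document.items():
    print(key, bson_type_for(value))
```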

In theory, this all probably sounds straightforward. However, you might wonder how you go about

actually designing the document, including what information to put in it. Because a document can contain

any type of data, you might think there is no need to reference information from inside another document.

In the next section, we’ll look at the pros and cons of embedding information in a document compared to

referencing that information from another document.

Embedding vs. Referencing Information in Documents

You can choose either to embed information into a document or reference that information from another

document. Embedding information simply means that you place a certain type of data (for example, an array

containing more data) into the document itself. Referencing information means that you create a reference

to another document that contains that specific data. Typically, you reference information when you use a

relational database. For example, assume you wanted to use a relational database to keep track of your CDs,

DVDs, and books. In this database, you might have one table for your CD collection and another table that

stores the track lists of your CDs. Thus, you would probably need to query multiple tables to acquire a list of

tracks from a specific CD.

With MongoDB (and other nonrelational databases), however, it would be much easier to embed such

information instead. After all, the documents are natively capable of doing so. Adopting this approach keeps

your database nice and tidy, ensures that all related information is kept in one single document, and even

works much faster because the data are then co-located on the disk.

Now let’s look at the differences between embedding and referencing information by looking at a

real-world scenario: storing CD data in a database.

In the relational approach, your data structure might look something like this:

|_media
    |_cds
        |_id, artist, title, genre, releasedate
    |_cd_tracklists
        |_cd_id, songtitle, length

In the nonrelational approach, your data structure might look something like this:

|_media
    |_items
        |_<document>

In the nonrelational approach, the document might look something like the following:

{
    "Type": "CD",
    "Artist": "Nirvana",
    "Title": "Nevermind",
    "Genre": "Grunge",
    "Releasedate": "1991.09.24",
    "Tracklist": [
        {
            "Track" : "1",
            "Title" : "Smells Like Teen Spirit",
            "Length" : "5:02"
        },
        {
            "Track" : "2",
            "Title" : "In Bloom",
            "Length" : "4:15"
        }
    ]
}

In this example, the track list information is embedded in the document itself. This approach is both

incredibly efficient and well organized. All the information that you wish to store regarding this CD is added

to a single document. In the relational version of the CD database, this requires at least two tables; in the

nonrelational database, it requires only one collection and one document.

When information is retrieved for a given CD, that information only needs to be loaded from one

document into RAM, not from multiple documents. Remember that every reference requires another query

in the database.
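The difference in the number of lookups can be sketched with plain JavaScript arrays standing in for collections. This is only a simulation under stated assumptions: the helper names (findOne, findMany) and the lookups counter are hypothetical and do not call MongoDB; the data mirrors the CD example above.

```javascript
// In-memory stand-ins for collections; not real MongoDB collections.
const embeddedItems = [
  {
    _id: 1,
    Title: "Nevermind",
    Tracklist: [
      { Track: "1", Title: "Smells Like Teen Spirit" },
      { Track: "2", Title: "In Bloom" }
    ]
  }
];

const referencedCds = [{ _id: 1, Title: "Nevermind" }];
const referencedTracks = [
  { cd_id: 1, Track: "1", Title: "Smells Like Teen Spirit" },
  { cd_id: 1, Track: "2", Title: "In Bloom" }
];

let lookups = 0; // counts simulated queries

function findOne(collection, predicate) {
  lookups += 1;
  return collection.find(predicate);
}

function findMany(collection, predicate) {
  lookups += 1;
  return collection.filter(predicate);
}

// Embedded: one lookup returns the CD together with its tracks.
lookups = 0;
const embeddedTitles = findOne(embeddedItems, d => d.Title === "Nevermind")
  .Tracklist.map(t => t.Title);
const embeddedLookups = lookups;

// Referenced: one lookup for the CD, then a second one for its tracks.
lookups = 0;
const cd = findOne(referencedCds, d => d.Title === "Nevermind");
const referencedTitles = findMany(referencedTracks, t => t.cd_id === cd._id)
  .map(t => t.Title);
const referencedLookups = lookups;

console.log(embeddedLookups, referencedLookups);
```

Both approaches return the same track titles; the referenced layout simply needs one more round trip to get them.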

■Tip The rule of thumb when using MongoDB is to embed data whenever you can. This approach is far more

efficient and almost always viable.

At this point, you might be wondering about the use case in which an application has multiple users.

Generally speaking, a relational database version of the aforementioned CD app would require that you

have one table that contains all your users and two tables for the items added. For a nonrelational database,

it would be good practice to have separate collections for the users and the items added. For these kinds of

problems, MongoDB allows you to create references in two ways: manually or automatically. In the latter

case, you use the DBRef specification, which provides more flexibility in case a collection changes from one

document to the next. You will learn more about these two approaches in Chapter 4.

Creating the _id Field

Every object within the MongoDB database contains a unique identifier to distinguish that object from every

other object. This identifier is called the _id key, and it is added automatically to every document you create

in a collection.

The _id key is the first attribute added in each new document you create. This remains true even if

you do not tell MongoDB to create the key. For example, none of the code in the preceding examples used

the _id key. Nevertheless, MongoDB created an _id key for you automatically in each document. It did so

because the _id key is a mandatory element for each document in the collection.

If you do not specify the _id value manually, the type will be set to a special ObjectId BSON datatype

that consists of a 12-byte binary value. Thanks to its design, this value has a reasonably high probability of

being unique. The 12-byte value consists of a 4-byte timestamp (seconds since epoch, or January 1, 1970),

a 3-byte machine ID, a 2-byte process ID, and a 3-byte counter. It’s good to know that the counter and

timestamp fields are stored in Big Endian format. This is because MongoDB wants to ensure that there is an

increasing order to these values, and a Big Endian approach suits this requirement best.

■Note The terms Big Endian and Little Endian refer to how individual bytes/bits are stored in a longer data

word in the memory. Big Endian simply means that the most significant value is saved first. Similarly, Little

Endian means that the least significant value is saved first.
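To make the two byte orders concrete, here is a small sketch in plain JavaScript (Node.js) that writes the same 32-bit number both ways. The value 0x51ace0f3 is simply the first four bytes of one of the ObjectId values printed later in this chapter; the helper name toBytes is made up for the example.

```javascript
// Write a 32-bit unsigned value into four bytes in the requested byte order.
function toBytes(value, littleEndian) {
  const buffer = new ArrayBuffer(4);
  new DataView(buffer).setUint32(0, value, littleEndian);
  return Array.from(new Uint8Array(buffer));
}

const value = 0x51ace0f3;
const bigEndian = toBytes(value, false);   // most significant byte first
const littleEndian = toBytes(value, true); // least significant byte first

// Format a byte array as space-separated two-digit hex values.
const hex = bytes => bytes.map(b => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex(bigEndian));    // 51 ac e0 f3
console.log(hex(littleEndian)); // f3 e0 ac 51
```

Because the Big Endian form puts the most significant byte first, comparing two such values byte by byte gives the same order as comparing the numbers themselves.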

Figure3-3 shows how the value of the _id key is built up and where the values come from.

Every additional supported driver that you load when working with MongoDB (such as the PHP driver

or the Python driver) supports this special BSON datatype and uses it whenever new data are created. You

can also invoke ObjectId() from the MongoDB shell to create a value for an _id key. Optionally, you can

specify your own value by using ObjectId(string), where string represents the specified hex string.
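The byte layout described above can be sketched as a small parser in plain JavaScript (Node.js). The function parseObjectId is a hypothetical helper written for this illustration, not part of the MongoDB shell or any driver, and it assumes exactly the 4/3/2/3-byte split given in the text; other MongoDB versions may lay out the middle bytes differently.

```javascript
// Split a 24-character hex ObjectId string into the four fields
// described above: 4-byte timestamp, 3-byte machine ID,
// 2-byte process ID, and 3-byte counter.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hexadecimal characters");
  }
  const timestamp = parseInt(hex.slice(0, 8), 16); // seconds since epoch
  return {
    timestamp,
    created: new Date(timestamp * 1000),        // JavaScript dates use milliseconds
    machineId: hex.slice(8, 14),                // kept as hex text
    processId: parseInt(hex.slice(14, 18), 16),
    counter: parseInt(hex.slice(18, 24), 16)
  };
}

// An ObjectId value taken from the query output shown later in this chapter.
const parsed = parseObjectId("51ace0f380523d89efd199ac");
console.log(parsed.created.toISOString(), parsed.machineId, parsed.processId, parsed.counter);
```

Reading the timestamp back out this way shows why an ObjectId doubles as a rough record of when a document was created.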

Building Indexes

As mentioned in Chapter 1, an index is nothing more than a data structure that collects information about

the values of specified fields in the documents of a collection. This data structure is used by MongoDB’s

query optimizer to quickly sort through and order the documents in a collection.

Remember that indexing ensures a quick lookup from data in your documents. Basically, you should

view an index as a predefined query that was executed and had its results stored. As you can imagine, this

enhances query-performance dramatically. The general rule of thumb in MongoDB is that you should create

an index for the same sort of scenarios where you would want to have an index in relational databases.

The biggest benefit of creating your own indexes is that querying for often-used information will be

incredibly fast because your query won’t need to go through your entire database to collect this information.
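The idea can be sketched in plain JavaScript (Node.js) with a Map standing in for the index. This is only a toy model under stated assumptions: the function names are hypothetical, the extra album titles are sample data, and a real MongoDB index is a far more capable structure than a Map.

```javascript
// A tiny collection; the documents are sample data for illustration.
const collection = [
  { _id: 1, Artist: "Nirvana", Title: "Nevermind" },
  { _id: 2, Artist: "Nirvana", Title: "In Utero" },
  { _id: 3, Artist: "Pearl Jam", Title: "Ten" }
];

// Build a lookup table from each value of `field` to the positions
// of the documents that hold that value.
function buildIndex(docs, field) {
  const index = new Map();
  docs.forEach((doc, position) => {
    const key = doc[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(position);
  });
  return index;
}

// Without an index: inspect every document.
function findByScan(docs, field, value) {
  return docs.filter(doc => doc[field] === value);
}

// With an index: jump straight to the matching positions.
function findByIndex(docs, index, value) {
  return (index.get(value) || []).map(position => docs[position]);
}

const artistIndex = buildIndex(collection, "Artist");
const scanned = findByScan(collection, "Artist", "Nirvana");
const indexed = findByIndex(collection, artistIndex, "Nirvana");
console.log(scanned.length, indexed.length);
```

Both lookups return the same documents, but the indexed one never touches the Pearl Jam entry. The price is that every insert or delete would also have to update the Map, which is the write cost discussed in the next section.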

Creating (or deleting) an index is relatively easy—once you get the hang of it, anyway. You will learn

how to do so in Chapter 4, which covers working with data. You will also learn some more advanced

techniques for taking advantage of indexing in Chapter 10, which covers how to maximize performance.

Impacting Performance with Indexes

You might wonder why you would ever need to delete an index, rebuild your indexes, or even delete all

indexes within a collection. The simple answer is that doing so lets you clean up some irregularities. For

instance, sometimes the size of a database can increase dramatically for no apparent reason. At other times,

the space used by the indexes might strike you as excessive.

Another good thing to keep in mind: you can have a maximum of 64 indexes per collection. Generally

speaking, this is far more than you should need, but you could potentially hit this limit someday.

Figure 3-3. Creating the _id key in MongoDB

■Note Adding an index potentially increases query speed, but it reduces insertion or deletion speed. It’s best

to consider only adding indexes for collections where the number of reads is higher than the number of writes.

When more writes occur than reads, indexes may even prove to be counterproductive.

Finally, you can run the listIndexes database command to take a quick peek at the indexes that have been stored so far. To see the indexes created for a specific collection, you can also use the getIndexes() function:

db.collection.getIndexes()

Indexing, and how indexing can affect MongoDB’s performance, will be covered in more detail in the

Optimization chapter.

Implementing Geospatial Indexing

Ever since version 1.4, MongoDB has implemented geospatial indexing. This means that, in addition to the

various other index types, MongoDB also supports geospatial indexes that are designed to work in an optimal

way with location-based queries. For example, you can use this feature to find a number of closest known

items to the user’s current location. Or you might further refine your search to query for a specified number

of restaurants near the current location. This type of query can be particularly helpful if you are designing an

application where you want to find the closest available branch office to a given customer’s ZIP code.

A document for which you want to add geospatial information must contain either a subobject or an

array whose first element specifies the object type, followed by the item’s longitude and latitude, as in the

following example:

> db.restaurants.insert({name: "Kimono", loc: { type: "Point",

coordinates: [ 52.370451, 5.217497]}})

Note that the type parameter can be used to specify the document’s GeoJSON object type, which

can be a Point, a MultiPoint, a LineString, a MultiLineString, a Polygon, a MultiPolygon, or a

GeometryCollection. As can be expected, the Point type is used to specify that the item (in this case, a

restaurant) is located at exactly the spot given, thus requiring exactly two values, the longitude and latitude.

The LineString type can be used to specify that the item extends along a specific line (say, a street), and

thus requires a beginning and end point, as in the following example:

> db.streets.insert( {name: "Westblaak", loc: { type: "LineString",

coordinates: [ [52.36881,4.890286],[52.368762,4.890021] ] } } )

The Polygon type can be used to specify a (nondefault) shape (say, a shopping area). When using

this type, you need to ensure that the first and last points are identical, to close the loop. Also, the point

coordinates are to be provided as an array within an array, as in the following example:

> db.stores.insert( {name: "SuperMall", loc: { type: "Polygon",

coordinates: [ [ [52.146917,5.374337], [52.146966,5.375471], [52.146722,5.375085],

[52.146744,5.37437], [52.146917,5.374337] ] ] } } )

For all of these, the Multi- version (MultiPoint, MultiLineString, etc.) is an array of the datatype selected,

as in the following MultiPoint example:

> db.restaurants.insert({name: "Shabu Shabu", loc: { type: "MultiPoint",
coordinates: [ [52.1487441, 5.3873406], [52.3569665, 4.890517] ] }})

In most cases, the Point type will be appropriate.
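The requirement that a Polygon's first and last points match is easy to get wrong, so it can help to check it before inserting. The following plain JavaScript sketch (Node.js) does that for a GeoJSON-style object; isClosedPolygon is a hypothetical helper written for this example, not a MongoDB function, and it checks only ring closure and a minimum of four points, nothing else about the shape.

```javascript
// Returns true when every ring of a GeoJSON-style Polygon has at least
// four points and starts and ends on the same coordinate pair.
function isClosedPolygon(geometry) {
  if (geometry.type !== "Polygon") return false;
  return geometry.coordinates.every(ring => {
    const first = ring[0];
    const last = ring[ring.length - 1];
    return ring.length >= 4 && first[0] === last[0] && first[1] === last[1];
  });
}

// The SuperMall shape from the example above.
const superMall = {
  type: "Polygon",
  coordinates: [ [
    [52.146917, 5.374337], [52.146966, 5.375471], [52.146722, 5.375085],
    [52.146744, 5.37437], [52.146917, 5.374337]
  ] ]
};

// The same ring with its closing point left off.
const openShape = {
  type: "Polygon",
  coordinates: [ superMall.coordinates[0].slice(0, -1) ]
};

console.log(isClosedPolygon(superMall), isClosedPolygon(openShape));
```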

Once this geospatial information is added to a document, you can create the index (or even create the

index beforehand, of course) and give the ensureIndex() function the 2dsphere parameter:

> db.restaurants.ensureIndex( { loc: "2dsphere" } )

■Note The ensureIndex() function is used to add a custom index. Don’t worry about the syntax of this

function yet—you will learn how to use ensureIndex() in depth in Chapter 4.

The 2dsphere parameter tells ensureIndex() that it’s indexing a coordinate or some other form of

two-dimensional information on an Earth-like sphere. By default, ensureIndex() assumes that a

latitude/longitude key is given, and it uses a range of -180 to 180. However, you can overwrite these values

using the min and max parameters:

> db.restaurants.ensureIndex( { loc: "2dsphere" }, { min : -500 , max : 500 } )

You can also expand your geospatial indexes by using secondary key values (also known as compound keys).

This structure can be useful when you intend to query on multiple values, such as a location (geospatial

information) and a category (sort ascending):

> db.restaurants.ensureIndex( { loc: "2dsphere", category: 1 } )

Querying Geospatial Information

In this chapter, we are concerned primarily with two things: how to model the data and how a database

works in the background of an application. That said, manipulating geospatial information is increasingly

important in a wide variety of applications, so we’ll take a few moments to explain how to leverage

geospatial information in a MongoDB database.

Before getting started, a mild word of caution. If you are completely new to MongoDB and haven’t

had the opportunity to work with (geospatial) indexed data in the past, this section may seem a little

overwhelming at first. Not to worry, however; you can safely skip it for now and come back to it later if you

wish to. The examples given serve to show you a practical example of how (and why) to use geospatial

indexing, making it easier to comprehend. With that out of the way, and if you are feeling brave, read on.

Once you’ve added data to your collection, and once the index has been created, you can do a

geospatial query. For example, let’s look at a few lines of simple yet powerful code that demonstrate how to

use geospatial indexing.

Begin by starting up your MongoDB shell and selecting a database with the use function. In this case,

the database is named restaurants:

> use restaurants

Once you’ve selected the database, you can define a few documents that contain geospatial

information, and then insert them into the restaurants collection (remember: you do not need to create the

collection beforehand):

> db.restaurants.insert( { name: "Kimono", loc: { type: "Point",

coordinates: [ 52.370451, 5.217497] } } )

> db.restaurants.insert( {name: "Shabu Shabu", loc: { type: "Point",

coordinates: [51.915288,4.472786] } } )

> db.restaurants.insert( {name: "Tokyo Cafe", loc: { type: "Point",

coordinates: [52.368736, 4.890530] } } )

After you add the data, you need to tell the MongoDB shell to create an index based on the location

information that was specified in the loc key, as in this example:

> db.restaurants.ensureIndex ( { loc: "2dsphere" } )

Once the index has been created, you can start searching for your documents. Begin by searching on an

exact value (so far this is a “normal” query; it has nothing to do with the geospatial information at this point):

> db.restaurants.find( { loc : [52,5] } )

>

The preceding search returns no results. This is because the query is too specific. A better approach in

this case would be to search for documents that contain information near a given value. You can accomplish

this using the $near operator. Note that this requires the $geometry operator, including its type, to be specified, as in the following

example:

> db.restaurants.find( { loc : { $near : { $geometry : { type : "Point",

coordinates: [52.338433,5.513629] } } } } )

This produces the following output:

{
    "_id" : ObjectId("51ace0f380523d89efd199ac"),
    "name" : "Kimono",
    "loc" : {
        "type" : "Point",
        "coordinates" : [ 52.370451, 5.217497 ]
    }
}
{
    "_id" : ObjectId("51ace13380523d89efd199ae"),
    "name" : "Tokyo Cafe",
    "loc" : {
        "type" : "Point",
        "coordinates" : [ 52.368736, 4.89053 ]
    }
}
{
    "_id" : ObjectId("51ace11b80523d89efd199ad"),
    "name" : "Shabu Shabu",
    "loc" : {
        "type" : "Point",
        "coordinates" : [ 51.915288, 4.472786 ]
    }
}

Although this set of results certainly looks better, there’s still one problem: all of the documents are

returned! When used without any additional operators, $near returns the first 100 entries and sorts them

based on their distance from the given coordinates. Now, while you can choose to limit your results to say,

the first two items (or 200, if you want) using the limit function, even better would be to limit the results to

those within a given range.

This can be achieved by appending the $maxDistance or $minDistance operators. Using one of these

operators you can tell MongoDB to return only those results falling within a maximum or minimum distance

(measured in meters) from the given point, as in the following example and its output:

> db.restaurants.find( { loc : { $near : { $geometry : { type : "Point",
coordinates: [52.338433,5.513629] }, $maxDistance : 40000 } } } )
{
    "_id" : ObjectId("51ace0f380523d89efd199ac"),
    "name" : "Kimono",
    "loc" : {
        "type" : "Point",
        "coordinates" : [ 52.370451, 5.217497 ]
    }
}

As you can see, this returns only a single result: a restaurant located within 40 kilometers (or, roughly

25 miles) from the starting point.
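To get a feel for what a distance-limited, distance-sorted search involves, here is a plain JavaScript sketch (Node.js) that imitates it in memory. It is not how MongoDB implements $near: the near function, the haversine formula, and the Earth radius constant are assumptions made for this illustration. Each pair is read as [longitude, latitude], first value first, which is how GeoJSON coordinates are interpreted.

```javascript
const EARTH_RADIUS_METERS = 6378100; // assumed radius; an approximation

function toRadians(degrees) {
  return degrees * Math.PI / 180;
}

// Great-circle (haversine) distance between two [longitude, latitude] pairs.
function distanceMeters([lon1, lat1], [lon2, lat2]) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
}

// The three sample restaurants inserted earlier in this section.
const restaurants = [
  { name: "Kimono", coordinates: [52.370451, 5.217497] },
  { name: "Shabu Shabu", coordinates: [51.915288, 4.472786] },
  { name: "Tokyo Cafe", coordinates: [52.368736, 4.890530] }
];

// Keep documents within maxDistance meters of center, nearest first.
function near(docs, center, maxDistance = Infinity) {
  return docs
    .map(doc => ({ name: doc.name, dis: distanceMeters(center, doc.coordinates) }))
    .filter(result => result.dis <= maxDistance)
    .sort((a, b) => a.dis - b.dis);
}

const center = [52.338433, 5.513629];
const everything = near(restaurants, center);
const within40km = near(restaurants, center, 40000);
console.log(everything.map(r => r.name), within40km.map(r => r.name));
```

With the 40,000-meter limit applied, only Kimono remains, matching the query result above.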

■Note There is a direct correlation between the number of results returned and the time a given query takes

to execute.

In addition to the $near operator, MongoDB also includes a $geoWithin operator. You use this operator

to find items in a particular shape. At this time, you can find items located in a $box, $polygon, $center,

and $centerSphere shape, where $box represents a rectangle, $polygon represents a specific shape of your

choosing, $center represents a circle, and $centerSphere defines a circle on a sphere. Let’s look at a couple

of additional examples that illustrate how to use these shapes.

■Note With version 2.4 of MongoDB the $within operator was deprecated and replaced by $geoWithin.

This operator does not strictly require a geospatial index. Also, unlike the $near operator, $geoWithin does

not sort the returned results, which improves its performance.

To use the $box shape, you first need to specify the lower-left, followed by the upper-right, coordinates

of the box, as in the following example:

> db.restaurants.find( { loc: { $geoWithin : { $box : [ [52.368549,4.890238],

[52.368849,4.89094] ] } } } )

Similarly, to find items within a specific polygon form, you need to specify the coordinates of your

points as a set of nested arrays. Again note that the first and last coordinates must be identical to close the

shape properly, as shown in the following example:

> db.restaurants.find( { loc :
    { $geoWithin :
        { $geometry :
            { type : "Polygon" ,
              coordinates : [ [
                [52.368739,4.890203], [52.368872,4.890477], [52.368726,4.890793],
                [52.368608,4.89049], [52.368739,4.890203]
              ] ]
            }
        }
    }
} )

The code to find items in a basic $center (circle) shape is quite simple. In this case, you need to specify the

center of the circle and its radius, measured in the units used by the coordinate system, before executing the

find() function:

> db.restaurants.find( { loc: { $geoWithin : { $center : [ [52.370524, 5.217682], 10] } } } )

Note that ever since MongoDB version 2.2.3, the $center operator can be used without having a

geospatial index in place. However, it is recommended to create one to improve performance.
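The containment tests behind $box and $center can be sketched in plain JavaScript (Node.js) as simple flat-plane checks. The helper names withinBox and withinCenter are hypothetical, and the sketch ignores anything specific to a sphere; it only mirrors the argument shapes used in the two queries above.

```javascript
// The three sample restaurants inserted earlier in this section.
const restaurants = [
  { name: "Kimono", coordinates: [52.370451, 5.217497] },
  { name: "Shabu Shabu", coordinates: [51.915288, 4.472786] },
  { name: "Tokyo Cafe", coordinates: [52.368736, 4.890530] }
];

// True when point lies inside the rectangle given as [lowerLeft, upperRight].
function withinBox(point, [lowerLeft, upperRight]) {
  return point[0] >= lowerLeft[0] && point[0] <= upperRight[0] &&
         point[1] >= lowerLeft[1] && point[1] <= upperRight[1];
}

// True when point lies inside the flat circle given as [center, radius],
// with the radius in the same units as the coordinates.
function withinCenter(point, [center, radius]) {
  return Math.hypot(point[0] - center[0], point[1] - center[1]) <= radius;
}

const box = [ [52.368549, 4.890238], [52.368849, 4.89094] ];
const inBox = restaurants
  .filter(r => withinBox(r.coordinates, box))
  .map(r => r.name);

const wideCircle = [ [52.370524, 5.217682], 10 ];
const tightCircle = [ [52.370524, 5.217682], 0.01 ];
const inWideCircle = restaurants
  .filter(r => withinCenter(r.coordinates, wideCircle))
  .map(r => r.name);
const inTightCircle = restaurants
  .filter(r => withinCenter(r.coordinates, tightCircle))
  .map(r => r.name);

console.log(inBox, inWideCircle, inTightCircle);
```

A radius of 10 coordinate units is generous enough to cover all three sample restaurants, whereas the hypothetical radius of 0.01 leaves only the one closest to the center.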

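Finally, to find items located within a circular shape on a sphere (say, our planet) you can use the $centerSphere operator. This operator is similar to $center, except that its radius is expressed in radians (a distance in kilometers divided by the Earth's radius of roughly 6378.1 km), like so:

> db.restaurants.find( { loc: { $geoWithin : { $centerSphere : [ [52.370524, 5.217682], 10 / 6378.1 ] } } } )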

By default, the find() function is ideal for running queries. However, MongoDB also provides the geoNear database command, which works like the find() function but also reports the distance from the specified point for each item in the results. The geoNear command also includes some additional diagnostics. The following example uses geoNear to list the restaurants closest to the specified position:

> db.runCommand( { geoNear : "restaurants", near : { type : "Point", coordinates:

[52.338433,5.513629] }, spherical : true})

It returns the following results:

{
    "ns" : "stores.restaurants",
    "results" : [
        {
            "dis" : 33155.517810497055,
            "obj" : {
                "_id" : ObjectId("51ace0f380523d89efd199ac"),
                "name" : "Kimono",
                "loc" : {
                    "type" : "Point",
                    "coordinates" : [
                        52.370451,
                        5.217497
                    ]
                }
            }
        },
        {
            "dis" : 69443.96264213261,
            "obj" : {
                "_id" : ObjectId("51ace13380523d89efd199ae"),
                "name" : "Tokyo Cafe",
                "loc" : {
                    "type" : "Point",
                    "coordinates" : [
                        52.368736,
                        4.89053
                    ]
                }
            }
        },
        {
            "dis" : 125006.87383713324,
            "obj" : {
                "_id" : ObjectId("51ace11b80523d89efd199ad"),
                "name" : "Shabu Shabu",
                "loc" : {
                    "type" : "Point",
                    "coordinates" : [
                        51.915288,
                        4.472786
                    ]
                }
            }
        }
    ],
    "stats" : {
        "time" : 6,
        "nscanned" : 3,
        "avgDistance" : 75868.7847632543,
        "maxDistance" : 125006.87383713324
    },
    "ok" : 1
}

That completes our introduction to geospatial information for now; however, you’ll see a few more

examples that show you how to leverage geospatial functions in this book’s upcoming chapters.

Pluggable Storage Engines

Now that we’ve briefly touched upon MongoDB’s performance features, it’s time to look at the storage

engines available since version 3.0 and what these can mean for you. MongoDB’s storage engine is that

part of the database in charge of storing your data on the disk. Prior to version 3.0 you were limited to using

MongoDB’s native MMAPv1 storage engine. While this is still the default storage engine used in any version

prior to 3.2, you can choose to use the added alternative, the WiredTiger storage engine, or even develop

your own using the storage engine API.

■Note Each storage engine comes with its own pros and cons; where one might be best suited for

read-heavy tasks, another might perform better for write-heavy tasks. You can decide which storage engine is

a best fit for your use case. It is worth noting at this stage that multiple storage engines may coexist within a

single replica set.

By default, MongoDB v3.0 and later come with two supported storage engines: the legacy MMAPv1,

and the new WiredTiger storage engine. Compared to MMAPv1, the WiredTiger storage engine offers more

granular concurrency control as well as native compression capabilities. This allows for better utilization of

the hardware, reduced storage costs, as well as more predictable performance. MongoDB’s storage engines

and their capabilities will be discussed in full detail in Chapter 10.

Using MongoDB in the Real World

Now that you have MongoDB and its associated plug-ins installed and you have gained an understanding

of the data model, it’s time to get to work. In the next five chapters of the book, you will learn how to build,

query, and otherwise manipulate a variety of sample MongoDB databases (see Table 3-1 for a quick view

of the topics to come). Each chapter will stick primarily to using a single database that is unique to that

chapter; we took this approach to make it easier to read this book in a modular fashion.

Table 3-1. MongoDB Sample Databases Covered in This Book

Chapter   Database Name   Topic
4         Library         Working with data and indexes
5         Test            GridFS
6         Contacts        PHP and MongoDB
7         Inventory       Python and MongoDB
8         Test            Advanced queries

Summary

In this chapter, we looked at what’s happening in the background of your database. We also explored the

primary concepts of collections and documents in more depth; and we covered the datatypes supported in

MongoDB, as well as how to embed and reference data.

Next, we examined what indexes do, including when and why they should be used (or not).

We also touched on the concepts of geospatial indexing. For example, we covered how geospatial data

can be stored; we also explained how you can search for such data using either the regular find() function

or the more geospatially based geoNear database command.

In the next chapter, we’ll take a closer look at how the MongoDB shell works, including which functions

can be used to insert, find, update, or delete your data. We will also explore how conditional operators can

help you with all of these functions.

Chapter 4

Working with Data

In Chapter 3, you learned how the database works on the backend, what indexes are, how to use a database

to quickly find the data you are looking for, and what the structure of a document looks like. You also saw a

brief example that illustrated how to add data and find it again using the MongoDB shell. In this chapter, we

will focus more on working with data from your shell.

We will use one database (named library) throughout this chapter, and we will perform actions

such as adding data, searching data, modifying data, deleting data, and creating indexes. We’ll also look

at how to navigate the database using various commands, as well as what DBRef is and what it does. If

you have followed the instructions in the previous chapters to set up the MongoDB software, you can

follow the examples in this chapter to get used to the interface. Along the way, you will also attain a solid

understanding of which commands can be used for what kind of operations.

Navigating Your Databases

The first thing you need to know is how to navigate your databases and collections. With traditional SQL

databases, the first thing you would need to do is create an actual database; however, as you probably

remember from previous chapters, this is not required with MongoDB because the program creates the

database and underlying collection for you automatically the moment you store data in it.

To switch to an existing database or create a new one, you can use the use function in the shell, followed

by the name of the database you would like to use, whether or not it exists. This snippet shows how to use

the library database:

> use library

switched to db library

The mere act of invoking the use function, followed by the database’s name, sets your db (database) global

variable to library. Doing this means that all the commands you pass down into the shell will automatically

assume they need to be executed on the library database until you reset this variable to another database.

Viewing Available Databases and Collections

MongoDB automatically assumes a database needs to be created the moment you save data to it. It is also

case sensitive. For these reasons, it can be quite tricky to ensure that you’re working in the correct database.

Therefore, it’s best to view a list of all current databases available to MongoDB prior to switching to one, in

case you forgot the database’s name or its exact spelling. You can do this using the show dbs function:

> show dbs

local 0.000GB

Note that this function will only show a database that already exists. At this stage, the database does

not contain any data yet, so nothing else will be listed. If you want to view all available collections for your

current database, you can use the show collections function:

> show collections

>

■Tip To view the database you are currently working in, simply type db into the MongoDB shell.

Inserting Data into Collections

One of the most frequently used pieces of functionality you will want to learn about is how to insert data into

your collection. All data are stored in BSON format (which is both compact and reasonably fast to scan), so

you will need to insert the data in BSON format as well. You can do this in several ways. For example, you can

define it first and then save it in the collection using the insert function, or you can type the document

directly into the insert function on the fly:

> document = ( { "Type" : "Book", "Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0", "Publisher" : "Apress",
"Author" : ["Hows, David", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"] } )

■Note When you define a variable in the shell (for example, document = ( { ... } ) ), the contents of the

variable will be printed out immediately.

> db.media.insert(document)

WriteResult({ "nInserted" : 1 })

Notice the WriteResult() output returned after inserting a document into the collection.

WriteResult() will carry the status of the operation, as well as the action performed. When inserting a

document, the nInserted property is returned, together with the number of documents inserted.

Line breaks can also be used while typing in the shell. This can be convenient if you are writing a rather

lengthy document, as in this example:

> document = ( { "Type" : "Book",
... "Title" : "Definitive Guide to MongoDB 3rd ed., The",
... "ISBN" : "978-1-4842-1183-0",
... "Publisher" : "Apress",
... "Author" : ["Hows, David", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"]
... } )

> db.media.insert(document)
WriteResult({ "nInserted" : 1 })

As mentioned previously, the other option is to insert your data directly through the shell, without

defining the document first. You can do this by invoking the insert function immediately, followed by the

document’s contents:

> db.media.insert( { "Type" : "CD", "Artist" : "Nirvana", "Title" : "Nevermind" })

WriteResult({ "nInserted" : 1 })

Or you can insert the data while using line breaks, as before. For example, you can expand the

preceding example by adding an array of tracks to it. Pay close attention to how the commas and brackets

are used in the following example:

> db.media.insert( { "Type" : "CD",
... "Artist" : "Nirvana",
... "Title" : "Nevermind",
... "Tracklist" : [
...     {
...         "Track" : "1",
...         "Title" : "Smells Like Teen Spirit",
...         "Length" : "5:02"
...     },
...     {
...         "Track" : "2",
...         "Title" : "In Bloom",
...         "Length" : "4:15"
...     }
... ]
... }
... )
WriteResult({ "nInserted" : 1 })

As you can see, inserting data through the Mongo shell is straightforward.

The process of inserting data is extremely flexible, but you must adhere to some rules when doing so.

For example, the names of the keys while inserting documents have the following limitations:

• The $ character must not be the first character in the key name. Example: $tags

• The period [.] character must not appear anywhere in the key name. Example: ta.gs

• The name _id is reserved for use as a primary key ID; although it is not

recommended, it can store anything unique as a value, such as a string or an integer.

Similarly, some restrictions apply when creating a collection. For example, the name of a collection

must adhere to the following rules:

• The collection’s namespace (including the database name and a “.” separator) cannot

exceed 120 characters.

• An empty string ("") cannot be used as a collection name.

• The collection’s name must start with either a letter or an underscore.

• The collection name system is reserved for MongoDB and cannot be used.

• The collection’s name cannot contain the “\0” null character.
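The rules in the two lists above can be sketched as a pair of checks in plain JavaScript (Node.js). The function names isValidKeyName and isValidCollectionName are hypothetical and are not part of the MongoDB shell; the sketch encodes only the rules as stated here, and it reads the reserved "system" rule as covering both the bare name and names beginning with "system.", which is an assumption.

```javascript
// Key names: must not start with "$" and must not contain a period.
function isValidKeyName(key) {
  return typeof key === "string" && !key.startsWith("$") && !key.includes(".");
}

// Collection names, checked against the rules listed above.
function isValidCollectionName(databaseName, collectionName) {
  // Must be a non-empty string.
  if (typeof collectionName !== "string" || collectionName === "") return false;
  // Must start with a letter or an underscore.
  if (!/^[A-Za-z_]/.test(collectionName)) return false;
  // Must not contain the null character.
  if (collectionName.includes("\0")) return false;
  // "system" is reserved (assumed here to include the "system." prefix).
  if (collectionName === "system" || collectionName.startsWith("system.")) return false;
  // The full namespace, including the "." separator, is limited to 120 characters.
  const namespace = `${databaseName}.${collectionName}`;
  return namespace.length <= 120;
}

console.log(isValidKeyName("tags"), isValidKeyName("$tags"), isValidKeyName("ta.gs"));
console.log(isValidCollectionName("library", "media"));
```

Checks like these are cheap to run in application code before a document or collection name ever reaches the database.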

Querying for Data

You’ve seen how to switch to your database and how to insert data; next, you will learn how to query for data

in your collection. Let’s build on the preceding example and look at all the possible ways to get a good clear

view of your data in a given collection.

■Note When querying your data, you have an extraordinary range of options, operators, expressions, filters,

and so on available to you. We will spend the next few sections reviewing these options.

The find() function provides the easiest way to retrieve data from multiple documents within one of

your collections. This function is one that you will be using often.

Let’s assume that you have inserted the preceding two examples into a collection called media in the

library database. If you were to use a simple find() function on this collection, you would get all of the

documents you’ve added so far printed out for you:

> db.media.find()

{ "_id" : "ObjectId("4c1a8a56c603000000007ecb"), "Type" : "Book", "Title" : "Definitive

Guide to MongoDB 3rd ed., The", "ISBN" : "978-1-4842-1183-0", "Publisher" : "Apress",

"Author" : ["Hows, David ", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"]}

{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" :

"Nirvana", "Title" : "Nevermind", "Tracklist" : [

{

"Track" : "1",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

] }

This is simple stuff, but typically you would not want to retrieve all the information from all the

documents in your collection. Instead, you probably want to retrieve a certain type of document. For

example, you might want to return all the CDs from Nirvana. If so, you can specify that only the desired

information is requested and returned:

> db.media.find ( { Artist : "Nirvana" } )

{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",

"Title" : "Nevermind", "Tracklist" : [

{

"Track" : "1",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

] }

Okay, so this looks much better! You don’t have to see all the information from all the other items you’ve

added to your collection, only the information that interests you. However, what if you’re still not satisfied

with the results returned? For example, assume you want to get a list back that shows only the titles of the

CDs you have by Nirvana, ignoring any other information, such as track lists. You can do this by inserting an

additional parameter into your query that specifies the name of the keys you want to return, followed by a 1:

> db.media.find ( {Artist : "Nirvana"}, {Title: 1} )

{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Title" : "Nevermind" }

Inserting the { Title : 1 } information specifies that only the information from the Title field should

be returned. The _id field is always returned, unless you specifically exclude it using { _id: 0 }.

■Note If you do not specify a sort order, the order of results is undefined. Sorting is covered later in

this chapter.

You can also accomplish the opposite: inserting { Type : 0 } retrieves a list of all items you have

stored from Nirvana, showing all information except for the Type field.

■Note The _id field will by default remain visible unless you explicitly ask it not to show itself.

Take a moment to run the revised query with the { Title : 1 } insertion; no unnecessary information

is returned at all. This saves you time because you see only the information you want. It also spares your

database the time required to return unnecessary information.
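The inclusion and exclusion rules for the second find() parameter can be modeled in a few lines of plain JavaScript. This is only a sketch of the behavior described above, not how MongoDB implements it; project is a hypothetical helper that handles top-level keys only, and the shortened _id value is made up for the example.

```javascript
// Hypothetical helper: applies a find()-style projection to one document.
// Only flat (top-level) keys are handled in this sketch.
function project(doc, spec) {
  const keys = Object.keys(spec).filter((k) => k !== "_id");
  // A spec containing a 1 lists the keys to keep; otherwise it lists keys to drop.
  const including = keys.some((k) => spec[k] === 1);
  const result = {};
  for (const [key, value] of Object.entries(doc)) {
    if (key === "_id") {
      if (spec._id !== 0) result._id = value; // kept unless { _id: 0 } is given
    } else if (including ? spec[key] === 1 : spec[key] !== 0) {
      result[key] = value;
    }
  }
  return result;
}

// Sample data; the _id is a shortened placeholder.
const cd = { _id: "4c1a86bb", Type: "CD", Artist: "Nirvana", Title: "Nevermind" };

console.log(project(cd, { Title: 1 }));         // _id and Title only
console.log(project(cd, { Type: 0 }));          // everything except Type
console.log(project(cd, { Title: 1, _id: 0 })); // Title only
```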

Using the Dot Notation

When you start working with more complex document structures such as documents containing arrays

or embedded objects, you can begin using other methods for querying information from those objects as

well. For example, assume you want to find all CDs that contain a specific song you like. The following code

executes a more detailed query:

> db.media.find( { "Tracklist.Title" : "In Bloom" } )

{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",

"Title" : "Nevermind", "Tracklist" : [

{

"Track" : "1",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

] }

Using a period [.] after the key’s name tells your find function to look for information embedded in

your documents. Things are a little simpler when working with arrays. For example, you can execute the

following query if you want to find a list of books written by Peter Membrey:

> db.media.find( { "Author" : "Membrey, Peter" } )

{ "_id" : ObjectId("4c1a8a56c603000000007ecb"), "Type" : "Book", "Title" : "Definitive

Guide to MongoDB 3rd ed., The", "ISBN" : "978-1-4842-1183-0", "Publisher" : "Apress",

"Author" : ["Hows, David ", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"] }

However, the following command will not match any documents, even though it might appear identical

to the earlier track list query:

> db.media.find ( { "Tracklist" : {"Track" : "1" }} )

Subobjects must match exactly; therefore, the preceding query would only match a document that

contains no other information, such as Tracklist.Title:

{"Type" : "CD",

"Artist" : "Nirvana",

"Title" : "Nevermind",

"Tracklist" : [

{

"Track" : "1"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

]

}
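The difference between a dot-notation query and an exact subobject match can be sketched in plain JavaScript. The helpers below (matchesDotted and matchesExact) are hypothetical names used for illustration, and the comparison through JSON.stringify is a simplification that, like an exact subobject match, is sensitive to the order of keys.

```javascript
// Dot notation: true if any element of doc[arrayKey] has field === value.
function matchesDotted(doc, arrayKey, field, value) {
  return (doc[arrayKey] || []).some((item) => item[field] === value);
}

// Exact subobject match: some element must equal the query object,
// with the same keys and values and nothing more.
function matchesExact(doc, arrayKey, wanted) {
  const same = (a, b) => JSON.stringify(a) === JSON.stringify(b);
  return (doc[arrayKey] || []).some((item) => same(item, wanted));
}

// Sample data modeled on the documents in this section.
const nevermind = {
  Title: "Nevermind",
  Tracklist: [
    { Track: "1", Title: "Smells Like Teen Spirit", Length: "5:02" },
    { Track: "2", Title: "In Bloom", Length: "4:15" },
  ],
};
const trackOnly = { Title: "Nevermind", Tracklist: [{ Track: "1" }] };

console.log(matchesDotted(nevermind, "Tracklist", "Title", "In Bloom")); // true
console.log(matchesExact(nevermind, "Tracklist", { Track: "1" }));       // false
console.log(matchesExact(trackOnly, "Tracklist", { Track: "1" }));       // true
```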

Using the Sort, Limit, and Skip Functions

MongoDB includes several functions that you can use for more precise control over your queries. We’ll cover

how to use the sort, limit, and skip functions in this section.

You can use the sort function to sort the results returned from a query. You can sort the results in

ascending or descending order using 1 or -1, respectively. The function itself is analogous to the ORDER BY

statement in SQL, and it uses the key’s name and sorting method as criteria, as in this example:

> db.media.find().sort( { Title: 1 })

This example sorts the results based on the Title key’s value in ascending order. This is the default

sorting order when no parameters are specified. You would use -1 instead of 1 to sort in descending order.

■Note If you specify a key for sorting that does not exist, the order of results will be undefined.

You can use the limit() function to specify the maximum number of results returned. This function

requires only one parameter: the number of the desired results returned. When you specify 0, all results will

be returned. The following example returns only ten items in your media collection:

> db.media.find().limit( 10 )

Another thing you might want to do is skip the first n documents in a collection. The following example

skips 20 documents in your media collection:

> db.media.find().skip( 20 )

As you probably surmised, this command returns all documents within your collection, except for the

first 20 it finds.

MongoDB wouldn’t be particularly powerful if it weren’t able to combine these commands. Fortunately,

practically any function can be combined and used in conjunction with any other function. The following

example sorts the results in descending order, skips the first 20 documents, and limits the output to 10:

> db.media.find().sort ( { Title : -1 } ).limit ( 10 ).skip ( 20 )

You might use this example if you want to implement paging in your application. As you might have

guessed, this command wouldn’t return any results in the media collection created so far, because the

collection contains fewer documents than were skipped in this example.
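For paging, the skip value is usually derived from a page number and a page size. The sketch below is plain JavaScript working on an in-memory array rather than a real collection; pageWindow and pageOfTitles are hypothetical helpers, and the sample titles are taken from this chapter. It applies the sort first, then the skip, then the limit.

```javascript
// Hypothetical helper: turns a 1-based page number into skip/limit values.
function pageWindow(pageNumber, pageSize) {
  return { skip: (pageNumber - 1) * pageSize, limit: pageSize };
}

// In-memory stand-in for find().sort({ Title: -1 }).skip(n).limit(m).
function pageOfTitles(docs, pageNumber, pageSize) {
  const { skip, limit } = pageWindow(pageNumber, pageSize);
  return docs
    .map((doc) => doc.Title)
    .sort()      // ascending
    .reverse()   // descending, like { Title: -1 }
    .slice(skip, skip + limit);
}

const titles = [
  { Title: "Nevermind" },
  { Title: "Blade Runner" },
  { Title: "Toy Story 3" },
  { Title: "Matrix, The" },
];

console.log(pageWindow(3, 10));          // { skip: 20, limit: 10 }
console.log(pageOfTitles(titles, 1, 2)); // first page of two titles
console.log(pageOfTitles(titles, 3, 2)); // empty: fewer documents than were skipped
```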

■Note You can use the following shortcut in the find() function to skip and limit your results:

find ( {}, {}, 10, 20 ). Here, you limit the results to ten and skip the first 20 documents found.

Working with Capped Collections, Natural Order, and $natural

There are some additional concepts and features you should be aware of when sorting queries with

MongoDB, including capped collections, natural order, and $natural. We’ll explain in this section what all

of these terms mean and how you can leverage them in your sorts.

The natural order is the database’s native ordering method for objects within a (normal) collection.

When you query for items in a collection without specifying an explicit sort order, the items are returned

by default in forward natural order. This may initially appear identical to the order in which items were

inserted; however, the natural order for a normal collection is not defined and may vary depending on

document growth patterns, indexes used for a query, and the storage engine used.

A capped collection is a collection in your database where the natural order is guaranteed to be the order

in which the documents were inserted. Guaranteeing that the natural order will always match the insertion

order can be particularly useful when you’re querying data and need to be absolutely certain that the results

returned are already sorted based on their order of insertion.

Capped collections have another great benefit: they are a fixed size. Once a capped collection is full,

the oldest data will be purged and newer data will be added at the end, ensuring that the natural order

follows the order in which the records were inserted. This type of collection can be used for logging and

autoarchiving data.

Unlike a standard collection, a capped collection must be created explicitly, using the

createCollection function. You must also supply parameters that specify the size (in bytes) of the

collection you want to add. For example, imagine you want to create a capped collection named audit with

a maximum size of 20480 bytes:

> db.createCollection("audit", {capped:true, size:20480})

{ "ok" : 1 }

Given that a capped collection guarantees that the natural order matches the insertion order, you don’t

need to include any special parameters or any other special commands or functions when querying the data

either, except of course when you want to reverse the default results. This is where the $natural parameter

comes in. For example, assume you want to find the ten most recent entries from your capped collection that

lists failed login attempts. You could use the $natural parameter to find this information:

> db.audit.find().sort( { $natural: -1 } ).limit ( 10 )

■Note Documents already added to a capped collection can be updated, but they must not grow in size.

The update will fail if they do. Deleting documents from a capped collection is also not possible; instead, the

entire collection must be dropped and re-created if you want to do this. You will learn more about dropping a

collection later in this chapter.

You can also limit the number of items added into a capped collection using the max: parameter

when you create the collection. However, you must ensure that there is enough space in the collection for

the number of items you want to add. If the collection becomes full before the number of items has been

reached, the oldest item in the collection will be removed. The MongoDB shell includes a utility that lets

you see the amount of space used by an existing collection, whether it’s capped or uncapped. You invoke

this utility using the stats() function. This can be particularly useful if you want to estimate how large a

collection might become.

As stated previously, you can use the max: parameter to cap the number of items that can be inserted

into a collection, as in this example:

> db.createCollection("audit100", { capped:true, size:20480, max: 100})

{ "ok" : 1 }

Next, use the stats() function to check the size of the collection:

> db.audit100.stats()

{

"ns" : "library.audit100",

"count" : 0,

"size" : 0,

"storageSize" : 4096,

"capped" : true,

"max" : 100,

"maxSize" : 20480,

"sleepCount" : 0,

"sleepMS" : 0,

"wiredTiger" : {

[..]

},

"nindexes" : 1,

"totalIndexSize" : 4096,

"indexSizes" : {

"_id_" : 4096

},

"ok" : 1

}

The resulting output shows that the collection (named audit100) is a capped collection with a maximum of

100 items to be added, and it currently contains zero items.
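The way the size and max limits interact can be modeled with a small in-memory class. This is a plain JavaScript sketch of the eviction behavior described in this section, not of MongoDB's storage layer: CappedModel is a hypothetical name, and measuring a document by the length of its JSON text is an assumption made purely for illustration.

```javascript
// In-memory model of a capped collection.
class CappedModel {
  constructor({ size, max = Infinity }) {
    this.size = size; // space budget (approximated, see usedBytes)
    this.max = max;   // optional limit on the number of documents
    this.docs = [];   // kept in insertion order
  }

  // Assumption: a document's size is the length of its JSON text.
  usedBytes() {
    return this.docs.reduce((total, doc) => total + JSON.stringify(doc).length, 0);
  }

  insert(doc) {
    this.docs.push(doc);
    // Evict the oldest documents until both limits are satisfied.
    while (this.docs.length > this.max || this.usedBytes() > this.size) {
      this.docs.shift();
    }
  }

  // Comparable to find().sort({ $natural: -1 }).limit(n).
  newest(n) {
    return this.docs.slice(Math.max(this.docs.length - n, 0)).reverse();
  }
}

const auditModel = new CappedModel({ size: 20480, max: 3 });
for (let n = 1; n <= 5; n++) auditModel.insert({ n });

console.log(auditModel.docs);      // only the three most recent inserts remain
console.log(auditModel.newest(2)); // newest first
```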

Retrieving a Single Document

So far we’ve only looked at examples that show how to retrieve multiple documents. If you want to receive

only one result, however, querying for all documents—which is what you generally do when executing a

find() function—would be a waste of CPU time and memory. For this case, you can use the findOne()

function to retrieve a single item from your collection. Overall, the result and execution methods are

identical to what occurs when you append the limit(1) function, but why make it harder on yourself than

you should?

The syntax of the findOne() function is identical to the syntax of the find() function:

> db.media.findOne()

It’s generally advised to use the findOne() function if you expect only one result.

Using the Aggregation Commands

MongoDB comes with a nice set of aggregation commands. You might not see their significance at first,

but once you get the hang of using them, you will see that the aggregation commands form an extremely

powerful set of tools. For instance, you might use them to get an overview of some basic statistics about your

database. In this section, we will take a closer look at how to use three of the functions from the available

aggregate commands: count, distinct, and group.

In addition to these three basic aggregation commands, MongoDB also includes an aggregation

framework. This powerful feature will allow you to calculate aggregated values without needing to use the

map/reduce framework. The aggregation framework will be discussed in Chapter 5.

Returning the Number of Documents with count()

The count() function returns the number of documents in the specified collection. So far you’ve added a

number of documents in the media collection. The count() function can tell you exactly how many:

> db.media.count()

2

You can also perform additional filtering by combining count() with conditional operators,

as shown here:

> db.media.find( { Publisher : "Apress", Type: "Book" } ).count()

1

This example returns only the number of documents added in the collection that are published by

Apress and of the type Book. Note that the count() function ignores a skip() or limit() parameter by

default. To ensure that your query doesn’t skip these parameters and that your count results will match the

limit and/or skip parameters, use count(true):

> db.media.find( { Publisher: "Apress", Type: "Book" }).skip ( 2 ) .count (true)

0
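The effect of the count(true) argument can be sketched on an in-memory array. The helper below, countMatches, is a hypothetical plain JavaScript stand-in that supports equality filters on top-level keys only; it is meant to illustrate the rule, not to reproduce the shell's implementation.

```javascript
// In-memory stand-in for find(filter).skip(s).limit(l).count(applySkipLimit).
function countMatches(docs, filter, { skip = 0, limit = 0 } = {}, applySkipLimit = false) {
  const matches = docs.filter((doc) =>
    Object.entries(filter).every(([key, value]) => doc[key] === value)
  );
  if (!applySkipLimit) return matches.length; // skip and limit are ignored
  const remaining = Math.max(matches.length - skip, 0);
  return limit > 0 ? Math.min(remaining, limit) : remaining; // a limit of 0 means no limit
}

const mediaSample = [
  { Type: "Book", Publisher: "Apress" },
  { Type: "CD", Artist: "Nirvana" },
];
const apressBooks = { Publisher: "Apress", Type: "Book" };

console.log(countMatches(mediaSample, apressBooks));                     // 1
console.log(countMatches(mediaSample, apressBooks, { skip: 2 }));        // 1, skip ignored
console.log(countMatches(mediaSample, apressBooks, { skip: 2 }, true));  // 0, skip applied
```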

Retrieving Unique Values with distinct()

The preceding example shows a great way to retrieve the total number of documents from a specific

publisher. However, this approach is definitely not precise. After all, if you own more than one book with the

same title (for instance, the hardcopy and the e-book), then you would technically have just one book. This

is where distinct() can help you: it will only return unique values.

For the sake of completeness, you can add an additional item to the collection. This item carries the

same title, but has a different ISBN number:

> document = ( { "Type" : "Book", "Title" : "Definitive Guide to MongoDB 3rd ed., The", ISBN:

" 978-1-4842-1183-1", "Publisher" : "Apress", "Author" : ["Hows, David", "Membrey, Peter",

"Plugge, Eelco", "Hawkins, Tim"] } )

> db.media.insert (document)

WriteResult({ "nInserted" : 1 })

At this point, you should have two books in the database with identical titles. When using the

distinct() function on the titles in this collection, you will get a total of two unique items. However, the

titles of the two books are identical, so they will be grouped into one item. The other result will be the title of

the album “Nevermind”:

> db.media.distinct( "Title")

[ "Definitive Guide to MongoDB 3rd ed., The", "Nevermind" ]

Similarly, you will get two results if you query for a list of unique ISBN numbers:

> db.media.distinct ("ISBN")

[ "978-1-4842-1183-0", " 978-1-4842-1183-1" ]

The distinct() function also takes nested keys when querying; for instance, this command will give

you a list of the unique track titles on your CDs:

> db.media.distinct ("Tracklist.Title")

[ "In Bloom", "Smells Like Teen Spirit" ]
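How distinct() reaches into nested keys and arrays can be sketched in plain JavaScript. The helpers valuesAt and distinctValues are hypothetical, and the final sort is added here only to keep the output predictable; it is not a claim about the order MongoDB returns.

```javascript
// Collects the values found at a dotted path, descending into arrays on the way.
function valuesAt(value, parts) {
  if (Array.isArray(value)) return value.flatMap((item) => valuesAt(item, parts));
  if (parts.length === 0) return value === undefined ? [] : [value];
  if (value === null || typeof value !== "object") return [];
  return valuesAt(value[parts[0]], parts.slice(1));
}

// In-memory stand-in for db.collection.distinct(path).
function distinctValues(docs, path) {
  const all = docs.flatMap((doc) => valuesAt(doc, path.split(".")));
  return [...new Set(all)].sort();
}

const library = [
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The" },
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The" },
  {
    Type: "CD",
    Title: "Nevermind",
    Tracklist: [{ Title: "Smells Like Teen Spirit" }, { Title: "In Bloom" }],
  },
];

console.log(distinctValues(library, "Title"));
console.log(distinctValues(library, "Tracklist.Title"));
```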

Grouping Your Results

Last but not least, you can group your results. MongoDB’s group() function is similar to SQL’s GROUP BY

clause, although the syntax is a little different. The purpose of the command is to return an array of

grouped items. The group() function takes three parameters: key, initial, and reduce.

The key parameter specifies which results you want to group. For example, assume you want to group

results by Title. The initial parameter lets you provide a base for each grouped result (that is, the base

number of items to start off with). By default, you want to leave this parameter at zero if you want an exact

number returned. The reduce parameter groups all similar items together. Reduce takes two arguments: the

current document being iterated over and the aggregation counter object. These arguments are called items

and prev in the example that follows. Essentially, the reduce parameter adds a 1 to the sum of every item it

encounters that matches a title it has already found.

The group() function is ideal when you’re looking for a tagcloud kind of function. For example, assume

you want to obtain a list of all unique titles of any type of item in your collection. Additionally, assume you

want to group them together if any doubles are found, based on the title:

> db.media.group (

{

key: {Title : true},

initial: {Total : 0},

reduce : function (items,prev)

{

prev.Total += 1

}

}

)

[

{

"Title" : "Nevermind",

"Total" : 1

},

{

"Title" : "Definitive Guide to MongoDB 3rd ed., The",

"Total" : 2

}

]

In addition to the key, initial, and reduce parameters, you can specify three more optional parameters:

• keyf: You can use this parameter to replace the key parameter if you do not wish to

group the results on an existing key in your documents. Instead, you would group

them using another function you design that specifies how to do grouping.

• cond: You can use this parameter to specify an additional statement that must be true

before a document will be grouped. You can use this much as you use the find()

query to search for documents in your collection. If this parameter isn’t set (the

default), then all documents in the collection will be checked.

• finalize: You can use this parameter to specify a function you want to execute

before the final results are returned. For instance, you might calculate an average or

perform a count and include this information in the results.
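The interplay of key, initial, reduce, cond, and finalize can be modeled on an in-memory array. The following plain JavaScript sketch uses a hypothetical helper named groupDocs; keyf is left out, and cond is written as a predicate function rather than a query document, which is a simplification made for this example.

```javascript
// In-memory stand-in for group(); keyf is not modeled.
function groupDocs(docs, { key, initial, reduce, cond = () => true, finalize }) {
  const groups = new Map();
  for (const doc of docs.filter(cond)) {
    // Build the grouping key from the fields named in `key`.
    const keyValues = {};
    for (const field of Object.keys(key)) keyValues[field] = doc[field];
    const id = JSON.stringify(keyValues);
    // Each new group starts from a copy of `initial`.
    if (!groups.has(id)) groups.set(id, { ...keyValues, ...initial });
    reduce(doc, groups.get(id)); // updates the group's counter object in place
  }
  const results = [...groups.values()];
  if (finalize) results.forEach((group) => finalize(group));
  return results;
}

const items = [
  { Type: "CD", Title: "Nevermind" },
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The" },
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The" },
];

const grouped = groupDocs(items, {
  key: { Title: true },
  initial: { Total: 0 },
  reduce: (item, prev) => { prev.Total += 1; },
  cond: (doc) => doc.Type === "Book",                          // only books are grouped
  finalize: (group) => { group.Double = group.Total > 1; },   // flag duplicates
});

console.log(grouped);
```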

■Note The group() function does not currently work in sharded environments. For these, you should use the

mapReduce() function instead. Also, the resulting output cannot contain more than 20,000 keys in all with the

group() function or an exception will be raised. This, too, can be bypassed by using mapReduce().

Working with Conditional Operators

MongoDB supports a large set of conditional operators to better filter your results. The following sections

provide an overview of these operators, including some basic examples that show you how to use them.

Before walking through these examples, however, you should add a few more items to the database; doing so

will let you see the effects of these operators more plainly:

> dvd = ( { "Type" : "DVD", "Title" : "Matrix, The", "Released" : 1999,

"Cast" : ["Keanu Reeves","Carrie-Anne Moss","Laurence Fishburne","Hugo

Weaving","Gloria Foster","Joe Pantoliano"] } )

{

"Type" : "DVD",

"Title" : "Matrix, The",

"Released" : 1999,

"Cast" : [

"Keanu Reeves",

"Carrie-Anne Moss",

"Laurence Fishburne",

"Hugo Weaving",

"Gloria Foster",

"Joe Pantoliano"

]

}

> db.media.insertOne(dvd)

> dvd = ( { "Type" : "DVD", Title : "Blade Runner", Released : 1982 } )

{ "Type" : "DVD", "Title" : "Blade Runner", "Released" : 1982 }

> db.media.insertOne(dvd)

> dvd = ( { "Type" : "DVD", Title : "Toy Story 3", Released : 2010 } )

{ "Type" : "DVD", "Title" : "Toy Story 3", "Released" : 2010 }

> db.media.insertOne(dvd)

Performing Greater-Than and Less-Than Comparisons

You can use the following special parameters to perform greater-than and less-than comparisons in queries:

$gt, $lt, $gte, and $lte. In this section, we’ll look at how to use each of these parameters.

The first one we’ll cover is the $gt (greater-than) parameter. You can use this to specify that a certain

integer should be greater than a specified value in order to be returned:

> db.media.find ( { Released : {$gt : 2000} }, { "Cast" : 0 } )

{ "_id" : ObjectId("4c4369a3c603000000007ed3"), "Type" : "DVD", "Title" : "Toy Story 3",

"Released" : 2010 }

Note that the year 2000 itself will not be included in the preceding query. For that, you use the $gte

(greater-than or equal-to) parameter:

> db.media.find ( { Released : {$gte : 1999 } }, { "Cast" : 0 } )

{ "_id" : ObjectId("4c43694bc603000000007ed1"), "Type" : "DVD", "Title" :

"Matrix, The", "Released" : 1999 }

{ "_id" : ObjectId("4c4369a3c603000000007ed3"), "Type" : "DVD", "Title" :

"Toy Story 3", "Released" : 2010 }

Likewise, you can use the $lt (less-than) parameter to find items in your collection that predate the

year 1999:

> db.media.find ( { Released : {$lt : 1999 } }, { "Cast" : 0 } )

{ "_id" : ObjectId("4c436969c603000000007ed2"), "Type" : "DVD", "Title" : "Blade Runner",

"Released" : 1982 }

You can also get a list of items older than or equal to the year 1999 by using the $lte (less-than or

equal-to) parameter:

> db.media.find( {Released : {$lte: 1999}}, { "Cast" : 0 })

{ "_id" : ObjectId("4c43694bc603000000007ed1"), "Type" : "DVD", "Title" :

"Matrix, The", "Released" : 1999 }

{ "_id" : ObjectId("4c436969c603000000007ed2"), "Type" : "DVD", "Title" :

"Blade Runner", "Released" : 1982 }

You can also combine these parameters to specify a range:

> db.media.find( {Released : {$gte: 1990, $lt : 2010}}, { "Cast" : 0 })

{ "_id" : ObjectId("4c43694bc603000000007ed1"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999 }

These parameters might strike you as relatively simple to use; however, you will be using them a lot

when querying for a specific range of data.
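The four comparison parameters, and the way they combine into a range, can be sketched in plain JavaScript. The names comparisons and inRange are hypothetical; the sketch only models numeric comparisons on a single value.

```javascript
// One small function per comparison parameter.
const comparisons = {
  $gt: (value, bound) => value > bound,
  $gte: (value, bound) => value >= bound,
  $lt: (value, bound) => value < bound,
  $lte: (value, bound) => value <= bound,
};

// Tests one value against a condition such as { $gte: 1990, $lt: 2010 }.
function inRange(value, condition) {
  // Every parameter in the condition must hold, which is what makes a range work.
  return Object.entries(condition).every(([op, bound]) => comparisons[op](value, bound));
}

const releaseYears = [1999, 1982, 2010];

console.log(releaseYears.filter((year) => inRange(year, { $gt: 2000 })));              // [ 2010 ]
console.log(releaseYears.filter((year) => inRange(year, { $gte: 1990, $lt: 2010 })));  // [ 1999 ]
```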

Retrieving All Documents but Those Specified

You can use the $ne (not-equals) parameter to retrieve every document in your collection, except for the

ones that match certain criteria. It should be noted that $ne may be performance heavy when the field of

choice has many potential values. For example, you can use this snippet to obtain a list of all books where

the author is not Eelco Plugge:

> db.media.find( { Type : "Book", Author: {$ne : "Plugge, Eelco"}})

Specifying an Array of Matches

You can use the $in operator to specify an array of possible matches. The SQL equivalent is the IN operator.

You can use the following snippet to retrieve data from the media collection using the $in operator:

> db.media.find( {Released : {$in : [1999,2008,2009] } }, { "Cast" : 0 } )

{ "_id" : ObjectId("4c43694bc603000000007ed1"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999 }

This example returns only one item, because only one item matches the release year of 1999, and there

are no matches for the years 2008 and 2009.
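The matching rule behind $in, and behind the closely related $nin and $all operators covered in the next two sections, can be modeled in plain JavaScript. The helper names are hypothetical, and treating a scalar field as a one-element list is a simplification made for this sketch.

```javascript
// A scalar field is treated as a one-element list.
const asList = (fieldValue) => (Array.isArray(fieldValue) ? fieldValue : [fieldValue]);

// $in: at least one of the wanted values is present.
const matchesIn = (fieldValue, wanted) => asList(fieldValue).some((v) => wanted.includes(v));

// $nin: none of the wanted values is present.
const matchesNin = (fieldValue, wanted) => !matchesIn(fieldValue, wanted);

// $all: every wanted value is present.
const matchesAll = (fieldValue, wanted) => wanted.every((w) => asList(fieldValue).includes(w));

console.log(matchesIn(1999, [1999, 2008, 2009]));   // true
console.log(matchesNin(1982, [1999, 2008, 2009]));  // true
console.log(matchesAll(2010, [2010, 2009]));        // false: 2009 is not present
console.log(matchesIn("2010", [2010, 2009]));       // false: a string is not a number
```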

Finding a Value Not in an Array

The $nin operator functions similarly to the $in operator, except that it searches for the objects where the

specified field does not have a value in the specified array:

> db.media.find( {Released : {$nin : [1999,2008,2009] },Type : "DVD" },

{ "Cast" : 0 } )

{ "_id" : ObjectId("4c436969c603000000007ed2"), "Type" : "DVD", "Title" :

"Blade Runner", "Released" : 1982 }

{ "_id" : ObjectId("4c4369a3c603000000007ed3"), "Type" : "DVD", "Title" :

"Toy Story 3", "Released" : 2010 }

Matching All Attributes in a Document

The $all operator also works similarly to the $in operator. However, $all requires that every listed value

match in the document, whereas only one of the listed values must match for the $in operator. Let’s look at an example that

illustrates these differences. First, here’s an example that uses $in:

> db.media.find ( { Released : {$in : [2010,2009] } }, { "Cast" : 0 } )

{ "_id" : ObjectId("4c4369a3c603000000007ed3"), "Type" : "DVD", "Title" : "Toy Story 3",

"Released" : 2010 }

One document is returned for the $in operator because there’s a match for 2010, but not for 2009.

However, the $all parameter doesn’t return any results, because there are no matching documents with

2009 in the value:

> db.media.find ( { Released : {$all : [2010,2009] } }, { "Cast" : 0 } )

Searching for Multiple Expressions in a Document

You can use the $or operator to search for multiple expressions in a single query, where only one criterion

needs to match to return a given document. Unlike the $in operator, $or allows you to specify both the key

and the value, rather than only the value:

> db.media.find({ $or : [ { "Title" : "Toy Story 3" }, { "ISBN" : "978-1-4842-1183-0" } ] } )

{ "_id" : ObjectId("4c5fc7d8db290000000067c5"), "Type" : "Book", "Title" : "Definitive Guide

to MongoDB 3rd ed., The", "ISBN" : "978-1-4842-1183-0", "Publisher" : "Apress", "Author" :

["Hows, David", "Membrey, Peter", "Plugge, Eelco", "Hawkins, Tim" ] }

{ "_id" : ObjectId("4c5fc943db290000000067ca"), "Type" : "DVD", "Title" : "Toy Story 3",

"Released" : 2010 }

It’s also possible to combine the $or operator with another query parameter. This will restrict the

returned documents to only those that match the first query (mandatory), and then either of the two

key/value pairs specified at the $or operator, as in this example:

> db.media.find({ "Type" : "DVD", $or : [ { "Title" : "Toy Story 3" },

{ "ISBN" : "978-1-4842-1183-0" } ] })

{ "_id" : ObjectId("4c5fc943db290000000067ca"), "Type" : "DVD", "Title" : "Toy Story 3",

"Released" : 2010 }

You could say that the $or operator allows you to perform two queries at the same time, combining

the results of two otherwise unrelated queries on the same collection. It is worth noting here that, if all the

queries in an $or clause can be supported by indexes, MongoDB will perform index scans. If not, a collection

scan will be used instead. Lastly, each clause of the $or can use its own index.
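The way a mandatory key/value pair combines with an $or list can be sketched in plain JavaScript. The helper matchesQuery is hypothetical and models only equality on top-level keys; it says nothing about how MongoDB chooses indexes.

```javascript
// In-memory stand-in for a query such as { Type: "DVD", $or: [ {...}, {...} ] }.
function matchesQuery(doc, query) {
  return Object.entries(query).every(([key, expected]) =>
    key === "$or"
      ? expected.some((clause) => matchesQuery(doc, clause)) // any one clause is enough
      : doc[key] === expected                                // every other pair is mandatory
  );
}

const shelf = [
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The", ISBN: "978-1-4842-1183-0" },
  { Type: "DVD", Title: "Toy Story 3", Released: 2010 },
  { Type: "DVD", Title: "Blade Runner", Released: 1982 },
];

const eitherOne = { $or: [{ Title: "Toy Story 3" }, { ISBN: "978-1-4842-1183-0" }] };
const dvdOnly = { Type: "DVD", ...eitherOne };

console.log(shelf.filter((doc) => matchesQuery(doc, eitherOne)).map((doc) => doc.Title));
console.log(shelf.filter((doc) => matchesQuery(doc, dvdOnly)).map((doc) => doc.Title));
```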

Retrieving a Document with $slice

You can use the $slice projection to limit an array field to a subset of the array for each matching result.

This can be particularly useful if you want to limit the number of array items returned to save bandwidth. The

operator also lets you retrieve the results of n items per page, a feature generally known as paging.

The operator takes one or two parameters. A single parameter indicates the total number of items to be

returned. When two parameters are given, the first defines the offset and the second defines the limit.

A negative value for the single parameter, or for the offset, returns items starting from the

end of an array instead of the beginning.
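The $slice forms can be modeled with JavaScript's own Array.prototype.slice. The helper sliceArray below is a hypothetical plain JavaScript sketch of the rules just described, using the cast list from this chapter as sample data; it is not the server's implementation.

```javascript
// In-memory stand-in for the $slice projection on one array.
function sliceArray(items, spec) {
  if (Array.isArray(spec)) {
    const [skip, limit] = spec;
    // A negative skip counts back from the end of the array.
    const start = skip < 0 ? Math.max(items.length + skip, 0) : skip;
    return items.slice(start, start + limit);
  }
  // A single number: the first n items, or the last |n| items when n is negative.
  return spec >= 0 ? items.slice(0, spec) : items.slice(spec);
}

const cast = [
  "Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne",
  "Hugo Weaving", "Gloria Foster", "Joe Pantoliano",
];

console.log(sliceArray(cast, 3));       // first three
console.log(sliceArray(cast, -3));      // last three
console.log(sliceArray(cast, [2, 3]));  // skip two, then take three
console.log(sliceArray(cast, [-5, 4])); // start five from the end, take four
```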

The following example limits the items from the Cast list to the first three items:

> db.media.find({"Title" : "Matrix, The"}, {"Cast" : {$slice: 3}})

{ "_id" : ObjectId("4c5fcd3edb290000000067cb"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999, "Cast" : [ "Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne" ] }

You can also get only the last three items by making the integer negative:

> db.media.find({"Title" : "Matrix, The"}, {"Cast" : {$slice: -3}})

{ "_id" : ObjectId("4c5fcd3edb290000000067cb"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999, "Cast" : [ "Hugo Weaving", "Gloria Foster", "Joe Pantoliano" ] }

Or you can skip the first two items and limit the results to three from that particular point (pay careful

attention to the brackets):

> db.media.find({"Title" : "Matrix, The"}, {"Cast" : {$slice: [2,3] }})

{ "_id" : ObjectId("4c5fcd3edb290000000067cb"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999, "Cast" : [ "Laurence Fishburne", "Hugo Weaving", "Gloria Foster" ] }

Finally, when specifying a negative integer, you can skip to the last five items and limit the results to

four, as in this example:

> db.media.find({"Title" : "Matrix, The"}, {"Cast" : {$slice: [-5,4] }})

{ "_id" : ObjectId("4c5fcd3edb290000000067cb"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999, "Cast" : [ "Carrie-Anne Moss", "Laurence Fishburne", "Hugo Weaving",

"Gloria Foster"] }

■Note With version 2.4, MongoDB also introduced the $slice operator for $push operations, allowing you to

limit the number of array elements when appending values to an array. This operator is discussed later in this

chapter. Do not confuse the two, however.

Searching for Odd/Even Integers

The $mod operator lets you search for specific data that consists of an even or odd number. This works

because the operator divides the field’s value by 2 and checks for a remainder of 0, thereby providing

even-numbered results only.

For example, the following code returns any item in the collection that has an even-numbered integer

set to its Released field:

> db.media.find ( { Released : { $mod: [2,0] } }, {"Cast" : 0 } )

{ "_id" : ObjectId("4c45b5c18e0f0000000062aa"), "Type" : "DVD", "Title" : "Blade Runner",

"Released" : 1982 }

{ "_id" : ObjectId("4c45b5df8e0f0000000062ab"), "Type" : "DVD", "Title" : "Toy Story 3",

"Released" : 2010 }

Likewise, you can find any documents containing an odd value in the Released field by changing

the parameters in $mod, as follows:

> db.media.find ( { Released : { $mod: [2,1] } }, { "Cast" : 0 } )

{ "_id" : ObjectId("4c45b5b38e0f0000000062a9"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999 }

■Note The $mod operator only works on integer values, not on strings that contain a numbered value. For

example, you can’t use the operator on { Released : "2010" } because it’s in quotes and therefore a string.
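The $mod check, including the restriction to numeric values, can be sketched in plain JavaScript. The helper matchesMod is a hypothetical name, and the type test is a simplified reading of the note above.

```javascript
// In-memory stand-in for { $mod: [divisor, remainder] }.
function matchesMod(value, [divisor, remainder]) {
  // Strings such as "2010" never match, mirroring the note above.
  return typeof value === "number" && value % divisor === remainder;
}

const years = [1999, 1982, 2010];

console.log(years.filter((year) => matchesMod(year, [2, 0]))); // [ 1982, 2010 ]
console.log(years.filter((year) => matchesMod(year, [2, 1]))); // [ 1999 ]
console.log(matchesMod("2010", [2, 0]));                       // false
```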

Filtering Results with $size

The $size operator lets you filter your results to match an array with the specified number of elements in it.

For example, you might use this operator to do a search for those CDs that have exactly two songs on them:

> db.media.find ( { Tracklist : {$size : 2} } )

{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",

"Title" : "Nevermind", "Tracklist" : [

{

"Track" : "1",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

] }

■Note You cannot use the $size operator to find a range of sizes. For example, you cannot use it to find

arrays with more than one element in them.

Returning a Specific Field Object

The $exists operator allows you to return a specific object if a specified field is either missing or found.

The following example returns all items in the collection with a key named Author:

> db.media.find ( { Author : {$exists : true } } )

Similarly, if you invoke this operator with a value of false, then all documents that don’t have a key

named Author will be returned:

> db.media.find ( { Author : {$exists : false } } )

■Warning Currently, the $exists operator is unable to use an index; therefore, using it requires a full collection scan.

Matching Results Based on the BSON Type

The $type operator lets you match results based on their BSON type. For instance, the following snippet lets

you find all items that have a track list of the type Embedded Object (that is, it contains a list of information):

> db.media.find ( { Tracklist: { $type : 3 } } )

{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",

"Title" : "Nevermind", "Tracklist" : [

{

"Track" : "1",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

] }

The known data types are defined in Table 4-1.

Table 4-1. Known BSON Types and Codes

Code Data Type

–1 MinKey

1 Double

2 Character string (UTF8)

3 Embedded object

4 Embedded array

5 Binary data

7 Object ID

8 Boolean type

9 Date type

10 Null type

11 Regular expression

13 JavaScript code

14 Symbol

15 JavaScript code with scope

16 32-bit integer

17 Timestamp

18 64-bit integer

127 MaxKey

255 MinKey

Matching an Entire Array

If you want all of your criteria to match within a single element of an array, you can use the $elemMatch operator. This is

particularly useful if you have multiple documents within your collection, some of which have some of the

same information. This can make a default query incapable of finding the exact document you are looking

for. This is because the standard query syntax doesn’t restrict itself to a single document within an array.

Let’s look at an example that illustrates this principle. For this to work, you need to add another

document to the collection, one that has an identical item in it but is otherwise different. Specifically, let’s

add another CD from Nirvana that happens to have the same track on it as the aforementioned CD

(“Smells Like Teen Spirit”). However, on this version of the CD, the song is track 5, not track 1:

{

"Type" : "CD",

"Artist" : "Nirvana",

"Title" : "Nirvana",

"Tracklist" : [

{

"Track" : "1",

"Title" : "You Know You're Right",

"Length" : "3:38"

},

{

"Track" : "5",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

}

]

}

> nirvana = ( { "Type" : "CD", "Artist" : "Nirvana", "Title" : "Nirvana", "Tracklist" :

[ { "Track" : "1", "Title" : "You Know You're Right", "Length" : "3:38"}, {"Track" : "5",

"Title" : "Smells Like Teen Spirit", "Length" : "5:02" } ] } )

> db.media.insertOne(nirvana)

If you want to search for an album from Nirvana that has the song “Smells Like Teen Spirit” as Track 1

on the CD, you might think that the following query would do the job:

> db.media.find ( { "Tracklist.Title" : "Smells Like Teen Spirit", "Tracklist.Track" : "1" } )

Unfortunately, the preceding query will return both documents. The reason for this is that both

documents have a track with the title called “Smells Like Teen Spirit” and both have a track number 1. If you

want to match an entire document within the array, you can use $elemMatch, as in this example:

> db.media.find ( { Tracklist: { "$elemMatch" : { Title: "Smells Like Teen Spirit",

Track : "1" } } } )

{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",

"Title" : "Nevermind", "Tracklist" : [

{

"Track" : "1",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

] }

This query gave the desired result and only returned the first document.
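The difference between the two query styles can be pictured with a small plain-JavaScript model. This is only a sketch of the matching rules described above, not MongoDB code; the helper names (elementMatches, matchesDotted, matchesElemMatch) are made up for this illustration.

```javascript
// Plain-JavaScript model of the two query styles; this is not MongoDB code.
const nevermind = {
  Title: "Nevermind",
  Tracklist: [
    { Track: "1", Title: "Smells Like Teen Spirit", Length: "5:02" },
    { Track: "2", Title: "In Bloom", Length: "4:15" }
  ]
};
const nirvana = {
  Title: "Nirvana",
  Tracklist: [
    { Track: "1", Title: "You Know You're Right", Length: "3:38" },
    { Track: "5", Title: "Smells Like Teen Spirit", Length: "5:02" }
  ]
};

// True when every key/value pair in criteria is present on this one element.
function elementMatches(element, criteria) {
  return Object.keys(criteria).every(key => element[key] === criteria[key]);
}

// Dot notation: each condition may be satisfied by a different array element.
function matchesDotted(doc, criteria) {
  return Object.keys(criteria).every(key =>
    doc.Tracklist.some(track => track[key] === criteria[key]));
}

// $elemMatch: a single array element has to satisfy all conditions at once.
function matchesElemMatch(doc, criteria) {
  return doc.Tracklist.some(track => elementMatches(track, criteria));
}

const criteria = { Title: "Smells Like Teen Spirit", Track: "1" };
const albums = [nevermind, nirvana];
const dottedTitles = albums.filter(d => matchesDotted(d, criteria)).map(d => d.Title);
const elemMatchTitles = albums.filter(d => matchesElemMatch(d, criteria)).map(d => d.Title);

console.log(dottedTitles);    // both albums match
console.log(elemMatchTitles); // only "Nevermind" matches
```

In the model, the "Nirvana" album passes the dotted check because one track supplies the title and another supplies track number 1, which is exactly the situation $elemMatch rules out.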

Using the $not Metaoperator

You can use the $not metaoperator to negate any check performed by a standard operator. It should

be noted that $not may be performance heavy when the field of choice has many potential values. The

following example returns all documents in your collection, except for the one seen in the $elemMatch

example:

> db.media.find ( { Tracklist : { $not : { "$elemMatch" : { Title: "Smells Like Teen Spirit", "Track" : "1" } } } } )

Specifying Additional Query Expressions

Apart from the structured query syntax you’ve seen so far, you can also specify additional query expressions

in JavaScript. The big advantage of this is that JavaScript is extremely flexible and allows you to do tons of

additional things. The downside of using JavaScript is that it’s a tad slower than the native operators baked

into MongoDB, as it cannot take advantage of indexes.

For example, assume you want to search for a DVD within your collection that was released before 1995. All of the following code examples would return this information (note that the last two forms filter on the release year only, not on the Type):

db.media.find ( { "Type" : "DVD", "Released" : { $lt : 1995 } } )

db.media.find ( { "Type" : "DVD", $where: "this.Released < 1995" } )

db.media.find ("this.Released < 1995")

f = function() { return this.Released < 1995 }

db.media.find(f)

And that’s how flexible MongoDB is! Using these operators should enable you to find just about

anything throughout your collections.

Leveraging Regular Expressions

Regular expressions are another powerful tool you can use to query information. Regular expressions—regex,

for short—are special text strings that you can use to describe your search pattern. These work much like

wildcards, but they are far more powerful and flexible.

MongoDB allows you to use these regular expressions when searching for data in your collections;

however, to improve performance it will attempt to use an index whenever possible for simple prefix

expressions. Prefix expressions are those regular expressions that start with either a left anchor (“\A”) or a

caret (“^”) followed by a few characters (example: “^Matrix”). Querying with regular expressions that are not

prefix expressions cannot efficiently make use of an index.

■Note Please bear in mind that case-insensitive (“i”) regular-expression queries can cause poor performance because of the number of comparisons they need to perform.

The following example uses regex in a query to find all items in the media collection that start with the

word “Matrix” (case insensitive):

> db.media.find ( { Title : /^Matrix/i } )
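Because the shell is a JavaScript environment, you can try a pattern out on a few strings before using it in a query. The following is a minimal sketch in plain JavaScript; the sample titles are made up for illustration and are not part of the media collection.

```javascript
// Hypothetical sample titles, used only to try out the patterns.
const titles = ["Matrix, The", "matrix revisited", "Animatrix, The"];

// Prefix expression with the case-insensitive flag, as in the query above.
const insensitivePrefix = /^Matrix/i;
// The same prefix expression without the flag.
const sensitivePrefix = /^Matrix/;

const insensitiveMatches = titles.filter(title => insensitivePrefix.test(title));
const sensitiveMatches = titles.filter(title => sensitivePrefix.test(title));

console.log(insensitiveMatches); // titles starting with "Matrix" in any case
console.log(sensitiveMatches);   // titles starting with exactly "Matrix"
```

The caret anchors the pattern to the start of the string, so a title that merely contains the word (such as the third sample) is not matched by either pattern.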

Regular expressions can make your life much simpler, so we recommend exploring this feature in greater detail as time permits or whenever your circumstances can benefit from it.

Updating Data

So far you’ve learned how to insert and query for data in your database. Next, you’ll learn how to

update those data. MongoDB supports quite a few update operators that you’ll learn how to use in the

following sections.

Updating with update()

MongoDB comes with the update() function for performing updates to your data. The update() function

takes three primary arguments: criteria, objNew, and options.

The criteria argument lets you specify the query that selects the record you want to update. You use

the objNew argument to specify the updated information; or you can use an operator to do this for you.

The options argument lets you specify your options when updating the document, and it has two possible

values: upsert and multi. The upsert option lets you specify whether the update should be an upsert—that

is, it tells MongoDB to update the record if it exists and create it if it doesn’t. Finally, the multi option lets you

specify whether all matching documents should be updated or just the first one (the default action).

The following simple example uses the update() function without any fancy operators:

> db.media.update( { "Title" : "Matrix, The"}, {"Type" : "DVD", "Title" : "Matrix, The", "Released" : 1999, "Genre" : "Action"}, { upsert: true} )

This example updates a matching document in the collection if one exists or saves a new document

with the new values specified. Note that any fields you leave out are removed (the document is basically

being rewritten).

If multiple documents match the criteria and you wish to update them all, you can use the updateMany() function instead of updateOne(), together with the $set modifier operator, as shown here:

> db.media.updateMany( { "Title" : "Matrix, The"}, {$set: {"Type" : "DVD", "Title" :

"Matrix, The", "Released" : 1999, "Genre" : "Action"} }, {upsert: true} )

■Note An upsert tells the database to “update a record if a document is present or to insert the record

if it isn’t.”
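To make the difference between rewriting a document and using $set concrete, here is a small model in plain JavaScript. It is a sketch of the behavior described in this section, not MongoDB code, and the helper names replaceDocument and applySet are made up for this illustration.

```javascript
// Plain-JavaScript model; the stored document mirrors the example above.
const stored = { _id: 1, Type: "DVD", Title: "Matrix, The", Released: 1999, Genre: "Action" };

// Replacement-style update: only _id survives, everything else is rewritten.
function replaceDocument(doc, replacement) {
  return Object.assign({ _id: doc._id }, replacement);
}

// $set-style update: named fields change, all other fields are kept.
function applySet(doc, fields) {
  return Object.assign({}, doc, fields);
}

const replaced = replaceDocument(stored, { Type: "DVD", Title: "Matrix, The" });
const modified = applySet(stored, { Genre: "Sci-Fi" });

console.log(replaced); // Released and Genre are gone
console.log(modified); // Released is kept, Genre is now "Sci-Fi"
```

The model shows why leaving a field out of a replacement document removes it, whereas $set touches only the fields you name.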

Implementing an Upsert with the save() Command

You can also perform an upsert with the save() command. To do this, you need to specify the _id value; you

can have this value added automatically or specify it manually yourself. If you do not specify the _id value,

the save() command will assume it’s an insert and simply add the document into your collection.

The main benefit of using the save() command is that you do not need to specify the upsert option, as you would with the update() command. Thus, the save() command gives you a quicker way to upsert data. In practice, the save() and update() commands look similar:

> db.media.update( { "Title" : "Matrix, The"}, {"Type" : "DVD", "Title" : "Matrix, The", "Released" : "1999", "Genre" : "Action"}, { upsert: true} )

> db.media.save( { "_id" : "Matrix, The", "Type" : "DVD", "Title" : "Matrix, The", "Released" : "1999", "Genre" : "Action"} )

Note that save() takes a single document and matches on its _id field, so this example assumes that the Title value also acts as the _id field.

Updating Information Automatically

You can use the modifier operations to update information quickly and simply in your documents, without

needing to type everything in manually. For example, you might use these operations to increase a number

or to remove an element from an array.

We’ll be exploring these operators next, providing practical examples that show you how to use them.

Incrementing a Value with $inc

The $inc operator enables you to perform an (atomic) update on a key to increase the value by the given

increment, assuming that the field exists. If the field doesn’t exist, it will be created. To see this in action,

begin by adding another document to the collection:

> manga = ( { "Type" : "Manga", "Title" : "One Piece", "Volumes" : 612, "Read" : 520 } )

{

"Type" : "Manga",

"Title" : "One Piece",

"Volumes" : 612,

"Read" : 520

}

> db.media.insertOne(manga)

Now you’re ready to update the document. For example, assume you’ve read another four volumes

of the One Piece manga, and you want to increment the number of Read volumes in the document. The

following example shows you how to do this:

> db.media.updateOne ( { "Title" : "One Piece"}, {$inc: {"Read" : 4} } )

> db.media.find ( { "Title" : "One Piece" } )

{

"Type" : "Manga",

"Title" : "One Piece",

"Volumes" : 612,

"Read" : 524

}
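The behavior of $inc, including the creation of a missing field, can be sketched in plain JavaScript as follows. This is a model for illustration only, not MongoDB code; the helper name applyInc and the field name Reread are made up for this example.

```javascript
// Plain-JavaScript model of $inc; applyInc is a hypothetical helper.
function applyInc(doc, increments) {
  const result = Object.assign({}, doc);
  for (const [field, amount] of Object.entries(increments)) {
    // A missing field is created, starting from zero.
    const current = field in result ? result[field] : 0;
    result[field] = current + amount;
  }
  return result;
}

const manga = { Type: "Manga", Title: "One Piece", Volumes: 612, Read: 520 };

const afterReading = applyInc(manga, { Read: 4 });
const withNewField = applyInc(manga, { Reread: 1 }); // "Reread" is a made-up field

console.log(afterReading.Read);   // 524
console.log(withNewField.Reread); // 1
```

Note that the model, like the operator, only makes sense for numeric values; this is why Volumes and Read are stored as numbers rather than strings.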

Setting a Field’s Value

You can use the $set operator to set a field’s value to one you specify. This works for any datatype, as in the

following example:

> db.media.update ( { "Title" : "Matrix, The" }, {$set : { Genre : "Sci-Fi" } } )

This snippet would update the genre in the document created earlier, setting it to Sci-Fi instead.

Deleting a Specified Field

The $unset operator lets you delete a given field, as in this example:

> db.media.updateOne ( {"Title": "Matrix, The"}, {$unset : { "Genre" : 1 } } )

This snippet would delete the Genre key and its value from the document.

Appending a Value to a Specified Field

The $push operator allows you to append a value to a specified field. If the field is an existing array, then the

value will be added. If the field doesn’t exist yet, then the field will be set to the array value. If the field exists

but it isn’t an array, then an error condition will be raised.

Begin by adding another author to your entry in the collection:

> db.media.updateOne ( {"ISBN" : "978-1-4842-1183-0"}, {$push: { Author : "Griffin, Stewie"} } )

The next snippet raises an error message because the Title field is not an array:

> db.media.updateOne ( {"ISBN" : "978-1-4842-1183-0"}, {$push: { Title :

"This isn't an array"} } )

Cannot apply $push/$pushAll modifier to non-array

The following example shows how the document looks in the meantime:

> db.media.find ( { "ISBN" : "978-1-4842-1183-0" } )

{

"Author" :

[

"Hows, David",

"Membrey, Peter",

"Plugge, Eelco",

"Hawkins, Tim",

"Griffin, Stewie"

],

"ISBN" : "978-1-4842-1183-0",

"Publisher" : "Apress",

"Title" : "Definitive Guide to MongoDB 3rd ed., The",

"Type" : "Book",

"_id" : ObjectId("4c436231c603000000007ed0")

}

Specifying Multiple Values in an Array

When working with arrays, the $push operator will append the value specified to the given array, expanding

the data stored within the given element. If you wish to add several separate values to the given array, you

can use the optional $each modifier, as in this example:

> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, { $push: { Author : { $each:

["Griffin, Peter", "Griffin, Brian"] } } } )

{

"Author" :

[

"Hows, David",

"Membrey, Peter",

"Plugge, Eelco",

"Hawkins, Tim",

"Griffin, Stewie",

"Griffin, Peter",

"Griffin, Brian"

],

"ISBN" : "978-1-4842-1183-0",

"Publisher" : "Apress",

"Title" : "Definitive Guide to MongoDB 3rd ed., The",

"Type" : "Book",

"_id" : ObjectId("4c436231c603000000007ed0")

}

Optionally, you can use the $slice modifier when using $each. This allows you to limit the number of elements within an array during a $push operation. Using a negative number ensures that only the last n elements will be kept within the array, whereas using zero would empty the array (a positive number keeps the first n elements instead). Note that $slice only works in combination with the $each modifier:

> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, { $push: { Author : { $each:

["Griffin, Meg", "Griffin, Louis"], $slice: -2 } } } )

{

"Author" :

[

"Griffin, Meg",

"Griffin, Louis"

],

"ISBN" : "978-1-4842-1183-0",

"Publisher" : "Apress",

"Title" : "Definitive Guide to MongoDB 3rd ed., The",

"Type" : "Book",

"_id" : ObjectId("4c436231c603000000007ed0")

}

As you can see, the $slice operator ensured that not only were the two new values pushed, but that the

data kept within the array was also limited to the value specified (2). The $slice operator can be a valuable

tool when working with fixed-sized arrays.
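The combined effect of $each and $slice can be modeled in a few lines of plain JavaScript. This is a sketch for illustration, not MongoDB code; the helper name pushEachSlice is made up, and the model only covers the negative and zero values discussed above.

```javascript
// Plain-JavaScript model of $push with $each and $slice (negative or zero only).
function pushEachSlice(array, values, slice) {
  const combined = array.concat(values);   // $each appends every value
  if (slice === undefined) return combined;
  if (slice === 0) return [];               // zero empties the array
  return combined.slice(slice);             // negative keeps the last n elements
}

const authors = ["Hows, David", "Membrey, Peter", "Plugge, Eelco", "Hawkins, Tim"];
const newAuthors = ["Griffin, Meg", "Griffin, Louis"];

const unsliced = pushEachSlice(authors, newAuthors);
const lastTwo = pushEachSlice(authors, newAuthors, -2);
const emptied = pushEachSlice(authors, newAuthors, 0);

console.log(unsliced.length); // 6
console.log(lastTwo);         // only the two values just pushed
console.log(emptied.length);  // 0
```

The values are appended first and the array is trimmed afterward, which is why a slice of -2 leaves exactly the two names that were just pushed.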

Adding Data to an Array with $addToSet

The $addToSet operator is another command that lets you add data to an array. However, unlike $push, this operator only adds the data to the array if the data are not already there. By default, the $addToSet operator takes one argument. However, you can use the $each modifier to specify additional arguments when using $addToSet. The following snippet adds the author Griffin, Brian into the authors array because it isn’t there yet:

> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, {$addToSet : { Author : "Griffin, Brian" } } )

Executing the snippet again won’t change anything because the author is already in the array.

To add more than one value, however, you should take a different approach and use the $each operator

as well:

> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, {$addToSet : { Author : { $each :

["Griffin, Brian","Griffin, Meg"] } } } )

At this point, our document, which once looked tidy and trustworthy, has been transformed into

something like this:

{

"Author" :

[

"Hows, David",

"Membrey, Peter",

"Plugge, Eelco",

"Hawkins, Tim",

"Griffin, Stewie",

"Griffin, Peter",

"Griffin, Brian",

"Griffin, Louis",

"Griffin, Meg"

],

"ISBN" : "978-1-4842-1183-0",

"Publisher" : "Apress",

"Title" : "Definitive Guide to MongoDB 3rd ed., The",

"Type" : "Book",

"_id" : ObjectId("4c436231c603000000007ed0")

}
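The contrast between $push and $addToSet can be sketched in plain JavaScript. This is only a model of the behavior described in this section, not MongoDB code; the helper names pushEach and addToSetEach are made up for this illustration.

```javascript
// Plain-JavaScript model; both helpers return a new array.
function pushEach(array, values) {
  return array.concat(values);              // always appends, duplicates included
}

function addToSetEach(array, values) {
  const result = array.slice();
  for (const value of values) {
    if (!result.includes(value)) {           // skip values that are already there
      result.push(value);
    }
  }
  return result;
}

const authors = ["Griffin, Meg", "Griffin, Louis"];
const additions = ["Griffin, Brian", "Griffin, Meg"];

const pushed = pushEach(authors, additions);
const asSet = addToSetEach(authors, additions);

console.log(pushed.length); // 4, "Griffin, Meg" appears twice
console.log(asSet.length);  // 3, the duplicate was skipped
```

Running addToSetEach a second time with the same additions returns an array with the same contents, mirroring the observation that executing the $addToSet snippet again won't change anything.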

Removing Elements from an Array

MongoDB also includes several methods that let you remove elements from an array, including $pop,

$pull, and $pullAll. In the sections that follow, you’ll learn how to use each of these methods for removing

elements from an array.

The $pop operator lets you remove a single element from an array. This operator lets you remove the

first or last value in the array, depending on the parameter you pass down with it. For example, the following

snippet removes the last element from the array:

> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, {$pop : {Author : 1 } } )

In this case, the $pop operator will pop Meg’s name off the list of authors. Passing down a negative number would remove the first element from the array. The following example removes David Hows’s name, the first entry in the list of authors:

> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, {$pop : {Author : -1 } } )

■Note Specifying a value of -2 or 1000 wouldn’t change which element gets removed. Any negative

number would remove the first element, while any positive number would remove the last element. Using the

number 0 removes the last element from the array.
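The rule in the preceding note can be written down as a one-line model in plain JavaScript. This is a sketch of the behavior as described here, not MongoDB code; the helper name applyPop is made up for this illustration.

```javascript
// Plain-JavaScript model of $pop as described in the note above.
function applyPop(array, direction) {
  // Any negative number drops the first element; anything else drops the last.
  return direction < 0 ? array.slice(1) : array.slice(0, -1);
}

const authors = ["Hows, David", "Membrey, Peter", "Plugge, Eelco"];

const withoutLast = applyPop(authors, 1);
const withoutFirst = applyPop(authors, -2);
const zeroCase = applyPop(authors, 0);

console.log(withoutLast);  // the last name is removed
console.log(withoutFirst); // the first name is removed
console.log(zeroCase);     // same result as passing 1
```

Only the sign of the number matters in this model, which is why -2 behaves like -1 and 1000 behaves like 1.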

Removing Each Occurrence of a Specified Value

The $pull operator lets you remove each occurrence of a specified value from an array. This can be particularly useful if you have multiple elements with the same value in your array. Let’s begin this example by using the $push operator to add Stewie back to the list of authors:

> db.media.updateOne ( {"ISBN" : "978-1-4842-1183-0"}, {$push: { Author : "Griffin, Stewie"} } )

Stewie will be in and out of the database a couple more times as we walk through this book’s examples.

You can remove all occurrences of this author in the document with the following code:

> db.media.updateOne ( {"ISBN" : "978-1-4842-1183-0"}, {$pull : { Author : "Griffin, Stewie" } } )

Removing Multiple Elements from an Array

You can also remove multiple elements with different values from an array. The $pullAll operator enables

you to accomplish this. The $pullAll operator takes an array with all the elements you want to remove, as in

the following example:

> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0"}, {$pullAll : { Author : ["Griffin, Louis","Griffin, Peter","Griffin, Brian"] } } )

The field from which you remove the elements (Author in the preceding example) needs to be an array.

If it isn’t, you’ll receive an error message.
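Both removal operators can be modeled as simple filters in plain JavaScript. This is a sketch for illustration, not MongoDB code; the helper names applyPull and applyPullAll are made up for this example.

```javascript
// Plain-JavaScript model of $pull and $pullAll on an array of simple values.
function applyPull(array, value) {
  // Removes every occurrence of one value.
  return array.filter(item => item !== value);
}

function applyPullAll(array, values) {
  // Removes every occurrence of each listed value.
  return array.filter(item => !values.includes(item));
}

const authors = [
  "Hows, David", "Griffin, Stewie", "Griffin, Peter",
  "Griffin, Brian", "Griffin, Stewie", "Griffin, Louis"
];

const withoutStewie = applyPull(authors, "Griffin, Stewie");
const withoutOthers = applyPullAll(withoutStewie,
  ["Griffin, Louis", "Griffin, Peter", "Griffin, Brian"]);

console.log(withoutStewie.length); // 4, both Stewie entries are gone
console.log(withoutOthers);        // only "Hows, David" is left
```

The first call removes both occurrences of the duplicated name in one go, which is the property that distinguishes $pull from $pop.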

Specifying the Position of a Matched Array

You can use the $ operator in an update to refer to the position of the array item that your query matched. This lets you manipulate an array member after finding it. For instance, assume you’ve added another track to your track list, but you accidentally made a typo when entering the track number:

> db.media.updateOne( { "Artist" : "Nirvana" }, {$addToSet : { Tracklist : {"Track" :

2,"Title": "Been a Son", "Length":"2:23"} } } )

{

"Artist" : "Nirvana",

"Title" : "Nevermind",

"Tracklist" : [

{

"Track" : "1",

"Title" : "You Know You're Right",

"Length" : "3:38"

},

{

"Track" : "5",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : 2,

"Title" : "Been a Son",

"Length" : "2:23"

}

],

"Type" : "CD",

"_id" : ObjectId("4c443ad6c603000000007ed5")

}

It so happens you know that the track number of the most recent item should be 3 rather than 2.

You can use the $inc method in conjunction with the $ operator to increase the value from 2 to 3, as in

this example:

> db.media.updateOne( { "Tracklist.Title" : "Been a Son"}, {$inc:{"Tracklist.$.Track" : 1} } )
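A rough way to picture what the positional $ operator does in the preceding update is the following plain-JavaScript model. It is a sketch only, not MongoDB code; the helper name incFirstMatch is made up, and the duplicate track entry exists purely to show which element is touched.

```javascript
// Plain-JavaScript model of an $inc applied through the positional $ operator.
function incFirstMatch(tracklist, title, amount) {
  // $ stands for the index of the first element that matched the query.
  const index = tracklist.findIndex(track => track.Title === title);
  if (index === -1) return tracklist;        // nothing matched, nothing changes
  return tracklist.map((track, i) =>
    i === index ? Object.assign({}, track, { Track: track.Track + amount }) : track);
}

const tracklist = [
  { Track: 1, Title: "You Know You're Right", Length: "3:38" },
  { Track: 2, Title: "Been a Son", Length: "2:23" },
  { Track: 2, Title: "Been a Son", Length: "2:23" } // hypothetical duplicate
];

const updated = incFirstMatch(tracklist, "Been a Son", 1);

console.log(updated[1].Track); // 3, the first match was incremented
console.log(updated[2].Track); // 2, the second match was left alone
```
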

Note that only the first item it matches will be updated. Thus, if there are two identical elements in the Tracklist array, only the first element will be incremented.

Atomic Operations

MongoDB supports atomic operations executed against single documents. An atomic operation is a set of

operations that can be combined in such a way that the set of operations appears to be merely one single

operation to the rest of the system. This set of operations will have either a positive or a negative outcome as

the final result.

You can call a set of operations an atomic operation if it meets the following pair of conditions:

1. No other process knows about the changes being made until the entire set of

operations has completed.

2. If one of the operations fails, the entire set of operations (the entire atomic

operation) will fail, resulting in a full rollback, where the data are restored to their

state prior to running the atomic operation.

A standard behavior when executing atomic operations is that the data will be locked and therefore

unable to be reached by other queries. However, MongoDB does not support locking or complex

transactions for a number of reasons:

• In sharded environments (see Chapter 12 for more information on such

environments), distributed locks can be expensive and slow. MongoDB’s goal is to be

lightweight and fast, so expensive and slow go against this principle.

• MongoDB developers don’t like the idea of deadlocks. In their view, it’s preferable for

a system to be simple and predictable instead.

• MongoDB is designed to work well for real-time problems. When an operation is

executed that locks large amounts of data, it would also stop some smaller light

queries for an extended period of time. Again, this goes against the MongoDB goal

of speed.

MongoDB includes several update operators (as noted previously), all of which can atomically update

an element:

• $set: Sets a particular value.

• $unset: Removes a particular value.

• $inc: Increments a particular value by a certain amount.

• $push: Appends a value to an array.

• $pull: Removes one or more values from an existing array.

• $pullAll: Removes several values from an existing array.

Using the Update-If-Current Method

Another strategy for performing atomic updates is the update-if-current method. This method takes the following three steps:

1. It fetches the object from the document.

2. It modifies the object locally (with any of the previously mentioned operations,

or a combination of them).

3. It sends an update request to update the object to the new value, in case the

current value still matches the old value fetched.

You can check the WriteResult output to see whether all went well. Note that all of this happens

automatically. Let’s take a new look at an example shown previously:

> db.media.updateOne( { "Tracklist.Title" : "Been a Son"}, {$inc:{"Tracklist.$.Track" : 1} } )

Here, you can use the WriteResult output to check whether the update went smoothly:

WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })

In this example, you incremented Tracklist.Track using the track list title as an identifier. But now

consider what happens if the track list data are changed by another user using the same method while

MongoDB was modifying your data. Because Tracklist.Title remains the same, you might assume

(incorrectly) that you are updating the original data, when in fact you are overwriting the changes.

This is known as the ABA problem. This scenario might seem unlikely, but in a multiuser environment,

where many applications are working on data at the same time, this can be a significant problem.

To avoid this problem, you can do one of the following:

• Use the entire object in the update’s query expression, instead of just the field you happen to match on (Tracklist.Title in the preceding example).

• Use $set to set the field you care about. If other fields have changed, they won’t be

affected by this.

• Put a version variable in the object and increment it on each update.

• When possible, use a $ operator instead of an update-if-current sequence of

operations.
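The version-variable approach from the list above can be sketched as a small simulation in plain JavaScript. This is a model of the idea, not MongoDB code: the store object stands in for the collection, and the names updateIfCurrent, readerA, and readerB are made up for this illustration.

```javascript
// Plain-JavaScript simulation of update-if-current with a version field.
const store = { doc: { _id: 1, Title: "One Piece", Read: 520, version: 1 } };

// Applies the changes only if nobody has updated the document since it was read.
function updateIfCurrent(target, expectedVersion, changes) {
  if (target.doc.version !== expectedVersion) {
    return false;                             // stale read, caller must retry
  }
  target.doc = Object.assign({}, target.doc, changes, { version: expectedVersion + 1 });
  return true;
}

// Two clients fetch the same version of the document.
const readerA = Object.assign({}, store.doc);
const readerB = Object.assign({}, store.doc);

// B writes first and succeeds; A's write is then rejected as stale.
const resultB = updateIfCurrent(store, readerB.version, { Read: readerB.Read + 1 });
const resultA = updateIfCurrent(store, readerA.version, { Read: readerA.Read + 4 });

// A fetches the current document again and retries.
const fresh = Object.assign({}, store.doc);
const retryA = updateIfCurrent(store, fresh.version, { Read: fresh.Read + 4 });

console.log(resultB, resultA, retryA); // true false true
console.log(store.doc.Read);           // 525
```

Because every successful write bumps the version, a client working from an old copy cannot silently overwrite someone else's change; it finds out and has to retry.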

■Note MongoDB does not support updating multiple documents atomically in a single operation. Instead,

you can use nested objects, which effectively make them one document for atomic purposes.

Modifying and Returning a Document Atomically

The findAndModify command also allows you to perform an atomic update on a document. This command

modifies the document and returns it. The command takes three main operators: <query>, which is used

to specify the document you’re executing it against; <sort>, which is used to sort the matching documents

when multiple documents match, and <operations>, which is used to specify what needs to be done.

Now let’s look at a handful of examples that illustrate how to use this command. The first example finds

the document you’re searching for and removes it once it is found:

> db.media.findAndModify( { query: { "Title" : "One Piece" }, sort: {"Title": -1}, remove: true} )

{

"_id" : ObjectId("4c445218c603000000007ede"),

"Type" : "Manga",

"Title" : "One Piece",

"Volumes" : 612,

"Read" : 524

}

This code returned the document it found matching the criteria. In this case, it found and removed

the first item it found with the title “One Piece.” If you execute a find() function now, you will see that the

document is no longer within the collection.

The next example modifies the document rather than removing it:

> db.media.findAndModify( { query: { "ISBN" : "978-1-4842-1183-0" }, sort: {"Title":-1}, update: {$set: {"Title" : "Different Title"} } } )

The preceding example updates the title from “Definitive Guide to MongoDB 3rd ed., The” to “Different Title” and returns the old document (as it was before the update) to your shell. If you would rather see the results of the update on the document, you can add the new option after your update:

> db.media.findAndModify( { query: { "ISBN" : "978-1-4842-1183-0" }, sort: {"Title":-1}, update: {$set: {"Title" : "Different Title"} }, new: true } )

Note that you can use any modifier operation with this command, not just $set.

Processing Data in Bulk

MongoDB also allows you to perform write operations in bulk. This way, you can first define the dataset

prior to writing it all in a single go. Bulk write operations are limited to a single collection only and can be

used to insert, update, or remove data.

Before you can write your data in bulk, you will first need to tell MongoDB how those data are to be

written: ordered or unordered. When executing the operation in an ordered fashion, MongoDB will go over

the list of operations serially. That is, were an error to occur while processing one of the write operations,

the remaining operations will not be processed. In contrast, using an unordered write operation, MongoDB

will execute the operations in a parallel manner. Were an error to occur during one of the writing operations

here, MongoDB will continue to process the remaining write operations.
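The difference in error handling between the two modes can be illustrated with a small plain-JavaScript model. This is a sketch only, not MongoDB code: the helper name runBulk is made up, each operation is represented by a function, and the model runs everything sequentially even though a real unordered list may be processed in parallel.

```javascript
// Plain-JavaScript model of ordered versus unordered error handling.
function runBulk(operations, ordered) {
  const result = { applied: [], errors: [] };
  for (let i = 0; i < operations.length; i++) {
    try {
      result.applied.push(operations[i]());
    } catch (err) {
      result.errors.push({ index: i, message: err.message });
      if (ordered) break;                     // ordered: stop at the first error
    }
  }
  return result;
}

// Three hypothetical write operations; the second one fails.
const operations = [
  () => "Deadpool",
  () => { throw new Error("simulated write error"); },
  () => "Paper Towns"
];

const orderedResult = runBulk(operations, true);
const unorderedResult = runBulk(operations, false);

console.log(orderedResult.applied);   // only the first operation was applied
console.log(unorderedResult.applied); // the first and third were applied
```

In both runs the failing operation is reported with its position in the list, but only the unordered run goes on to process the operation that follows it.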

For example, let’s assume you want to insert data in bulk to your media collection in an ordered fashion,

so that if an error were to occur the operation would halt. You first will need to initialize your ordered list

using the initializeOrderedBulkOp() function, as follows:

> var bulk = db.media.initializeOrderedBulkOp();

Now you can continue to insert the data into your ordered list, named bulk, before finally executing the

operations using the execute() command, like so:

> bulk.insert({ "Type" : "Movie", "Title" : "Deadpool", "Released" : 2016});

> bulk.insert({ "Type" : "CD", "Artist" : "Iron Maiden", "Title" : "Book of Souls, The" });

> bulk.insert({ "Type" : "Book", "Title" : "Paper Towns", "ISBN" : "978-0142414934", "Author" : "Green, John" });

■Note Your list can contain a maximum of 1000 operations. MongoDB will automatically split and process

your list into separate groups of 1000 operations or less when your list exceeds this limit.

Executing Bulk Operations

Now that the list has been filled, you will notice that the data themselves have not been written into the

collection yet. You can verify this by doing a simple find() on the media collection, which will only show the

previously added content:

> db.media.find()

{ "_id" : ObjectId("55e6d1d8b54fe7a2c96567d4"),

"Type" : "Book",

"Title" : "Definitive Guide to MongoDB 3rd ed., The",

"ISBN" : "978-1-4842-1183-0",

"Publisher" : "Apress",

"Author" : [

"Hows, David",

"Plugge, Eelco",

"Membrey, Peter",

"Hawkins, Tim"

] }

{ "_id" : ObjectId("4c1a86bb2955000000004076"),

"Type" : "CD",

"Artist" : "Nirvana",

"Title" : "Nevermind",

"Tracklist" : [

{

"Track" : "1",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

] }

To process the list of operations, the execute() command can be used like so:

> bulk.execute();

BulkWriteResult({

"writeErrors" : [ ],

"writeConcernErrors" : [ ],

"nInserted" : 3,

"nUpserted" : 0,

"nMatched" : 0,

"nModified" : 0,

"nRemoved" : 0,

"upserted" : [ ]

})

As you can tell from the output, nInserted reports 3, meaning three items were inserted into your

collection. If your list were to include other operations such as upserts or removals, those would have been

listed here as well.

Evaluating the Output

Once the bulk operations have been executed using the execute() command, you are also able to review

the write operations performed. This can be used to evaluate whether all the data were written successfully

and in what order this was done. Moreover, when something does go wrong during the write operation, the

output will help you understand what has been executed. To review the write operations executed through

execute(), you can use the getOperations() command, like so:

> bulk.getOperations();

[

{

"originalZeroIndex" : 0,

"batchType" : 1,

"operations" : [

{

"_id" : ObjectId("55e7fa1db54fe7a2c96567d6"),

"Type" : "Movie",

"Title" : "Deadpool",

"Released" : 2016

},

{

"_id" : ObjectId("55e7fa1db54fe7a2c96567d7"),

"Type" : "CD",

"Artist" : "Iron Maiden",

"Title" : "Book of Souls, The"

},

{

"_id" : ObjectId("55e7fa1db54fe7a2c96567d8"),

"Type" : "Book",

"Title" : "Paper Towns",

"ISBN" : "978-0142414934",

"Author" : "Green, John"

}

]

}

]

Notice how the array returned includes all the data processed under the operations key, as well as the

batchType key indicating the type of operation performed. Here, its value is 1, indicating the items were

inserted into the collection. Table4-2 describes the types of operations performed and their subsequent

batchType values.

Table 4-2. BatchType Values and Their Meaning

BatchType   Operation
1           Insert
2           Update
3           Remove

■Note When processing various types of operations in unordered lists, MongoDB will group these together

by type (inserts, updates, removals) to increase performance. As such, be sure your applications do not depend

on the order of operations performed. Ordered lists’ operations will only group contiguous operations of the

same type so that these are still processed in order.

Bulk operations can be extremely useful for processing a large set of data in a single go without

influencing the available dataset beforehand.

Renaming a Collection

It might happen that you discover you have named a collection incorrectly, but you’ve already inserted some

data into it. This might make it troublesome to remove the data and add it again from scratch.

Instead, you can use the renameCollection() function to rename your existing collection. The following

example shows you how to use this simple and straightforward command:

> db.media.renameCollection("newname")

{ "ok" : 1 }

If the command executes successfully, an OK will be returned. If it fails, however (if the collection

doesn’t exist, for example), then the following message is returned:

{ "errmsg" : "assertion: source namespace does not exist", "ok" : 0 }

The renameCollection command doesn’t take many parameters (unlike some commands you’ve seen

so far); however, it can be quite useful in the right circumstances.

Deleting Data

So far we’ve explored how to add, search for, and modify data. Next, we’ll examine how to delete documents,

entire collections, and the databases themselves.

Previously, you learned how to delete data from a specific document (using the $pop command,

for instance). In this section, you will learn how to delete full documents and collections. Just as the

insertOne() function is used for inserting and updateOne() is used for modifying a document, deleteOne()

is used to delete a document.

To delete a single document from your collection, you need to specify the criteria you’ll use to find

the document. A good approach is to perform a find() first; this ensures that the criteria used are specific

to your document. Once you are sure of the criterion, you can invoke the deleteOne() function using that

criterion as a parameter:

> db.newname.deleteOne( { "Title" : "Different Title" } )

This statement removes a single matching document. Any other item in your collection that matches

the criteria will not be removed when using the deleteOne() function. To delete multiple documents

matching your criteria, you can use the deleteMany() function instead.

Or you can use the following snippet to delete all documents from the newname collection (remember, we renamed the media collection to this previously):

> db.newname.deleteMany({})

■Warning When deleting a document, you need to remember that any reference to that document will

remain within the database. For this reason, be sure you manually delete or update those references as well;

otherwise, these references will return null when evaluated. Referencing will be discussed in the next section.

If you want to delete an entire collection, you can use either the drop() or remove() function. Using remove() will be a lot slower than drop(), because it deletes the documents one by one and keeps the collection’s indexes in place. A drop() will be faster if you need to remove all data as well as indexes from a collection. The following snippet removes the entire newname collection, including all of its documents:

> db.newname.drop()

true

The drop() function returns either true or false, depending on whether the operation has completed

successfully. Likewise, if you want to remove an entire database from MongoDB, you can use the

dropDatabase() function, as in this example:

> db.dropDatabase()

{ "dropped" : "library", "ok" : 1 }

Note that this snippet will remove the database you are currently working in (again, be sure to check db

to see which database is your current database).

Referencing a Database

At this point, you have an empty database again. You’re also familiar with inserting various kinds of data into

a collection. Now you’re ready to take things a step further and learn about database referencing (DBRef).

As you’ve already seen, there are plenty of scenarios where embedding data into your document will suffice for

your application (such as the track list or the list of authors in the book entry). However, sometimes you do need

to reference information in another document. The following sections will explain how to go about doing so.

Just as with SQL, references between documents in MongoDB are resolved by performing additional

queries on the server. MongoDB gives you two ways to accomplish this: referencing them manually or using

the DBRef standard, which many drivers also support.

Referencing Data Manually

The simplest and most straightforward way to reference data is to do so manually. When referencing data

manually, you store the value from the _id of the other document in your document, either through the full

ID or through a simpler common term. Before proceeding with an example, let’s add a new document and

specify the publisher’s information in it (pay close attention to the _id field):

> apress = ( { "_id" : "Apress", "Type" : "Technical Publisher", "Category" : ["IT",

"Software","Programming"] } )

{

"_id" : "Apress",

"Type" : "Technical Publisher",

"Category" : [

"IT",

"Software",

"Programming"

]

}

> db.publisherscollection.insertOne(apress)

Once you add the publisher’s information, you’re ready to add an actual document (for example, a

book’s information) into the media collection. The following example adds a document, specifying Apress as

the name of the publisher:

> book = ( { "Type" : "Book", "Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0", "Publisher" : "Apress",
"Author" : ["Hows, David","Membrey, Peter","Plugge, Eelco","Hawkins, Tim"] } )

{
"Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Publisher": "Apress",
"Author" : [
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim"
]
}

> db.media.insertOne(book)

All the information you need has been inserted into the publisherscollection and media collections, respectively. You can now start using the database reference. First, assign the document that contains the reference to the publisher to a variable:

> book = db.media.findOne()

{
"_id" : ObjectId("4c458e848e0f00000000628e"),
"Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress",
"Author" : [
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim"
]
}

To obtain the information itself, you combine the findOne function with some dot notation:

> db.publisherscollection.findOne( { _id : book.Publisher } )

{

"_id" : "Apress",

"Type" : "Technical Publisher",

"Category" : [

"IT",

"Software",

"Programming"

]

}

As this example illustrates, referencing data manually is straightforward and doesn’t require much brainwork. Here, the _id in the document placed in the publisherscollection collection has been manually set and has not been generated by MongoDB (otherwise, the _id would be an object ID).

Referencing Data with DBRef

The DBRef standard provides a more formal specification for referencing data between documents. The

main reason for using DBRef over a manual reference is that the collection can change from one document

to the next. So, if your referenced collection will always be the same, referencing data manually (as just

described) is fine.

With DBRef, the database reference is stored as a standard embedded (JSON/BSON) object. Having a

standard way to represent references means that drivers and data frameworks can add helper methods that

manipulate the references in standard ways.

The syntax for adding a DBRef reference value looks like this:

{ $ref : <collectionname>, $id : <id value>[, $db : <database name>] }

Here, <collectionname> represents the name of the collection referenced (for example,

publisherscollection); <id value> represents the value of the _id field for the object you are referencing;

and the optional $db allows you to reference documents that are placed in other databases.

Let’s look at another example using DBRef from scratch. Begin by emptying your two collections and

adding a new document:

> db.publisherscollection.drop()

true

> db.media.drop()

true

> apress = ( { "Type" : "Technical Publisher", "Category" :

["IT","Software","Programming"] } )

{

"Type" : "Technical Publisher",

"Category" : [

"IT",

"Software",

"Programming"

]

}

> db.publisherscollection.save(apress)

So far you’ve defined the variable apress and saved it using the save() function. Next, display the

updated contents of the variable by typing in its name:

> apress

{

"Type" : "Technical Publisher",

"Category" : [

"IT",

"Software",

"Programming"

],

"_id" : ObjectId("4c4597e98e0f000000006290")

}

So far you’ve defined the publisher and saved it to the publisherscollection collection. Now you’re

ready to add an item to the media collection that references the data:

> book = { "Type" : "Book", "Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Author": ["Hows, David","Membrey, Peter","Plugge, Eelco","Hawkins, Tim"],
Publisher : [ new DBRef ('publisherscollection',apress._id) ] }

{
"Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Author" : [
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim"
],
"Publisher" : [
DBRef("publisherscollection", ObjectId("4c4597e98e0f000000006290"))
]
}

> db.media.save(book)

And that’s it! Granted, the example looks a little less simple than the manual method of referencing

data; however, it’s a good alternative for cases where collections can change from one document to the next.
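To make the idea of a driver-side helper more concrete, the following is a minimal sketch in plain JavaScript of how a DBRef could be resolved. It does not talk to MongoDB: the collections object is a hypothetical in-memory stand-in for a database, resolveDBRef is an illustrative name rather than a shell or driver function, and the _id is kept as a plain string for simplicity (a real document would typically hold an ObjectId).

```javascript
// Hypothetical in-memory stand-in for a database: collection name -> documents.
const collections = {
  publisherscollection: [
    { _id: "4c4597e98e0f000000006290", Type: "Technical Publisher" }
  ],
  media: []
};

// Resolve a DBRef-shaped object ({ $ref, $id }) by looking in the named
// collection for the document whose _id equals $id.
function resolveDBRef(db, ref) {
  const docs = db[ref.$ref];
  if (!docs) {
    return null; // the referenced collection does not exist
  }
  const match = docs.find((doc) => doc._id === ref.$id);
  return match === undefined ? null : match;
}

// A book document that references its publisher through a DBRef-shaped value.
const book = {
  Title: "Definitive Guide to MongoDB 3rd ed., The",
  Publisher: [{ $ref: "publisherscollection", $id: "4c4597e98e0f000000006290" }]
};

const publisher = resolveDBRef(collections, book.Publisher[0]);
console.log(publisher.Type);
```

Because the collection name travels with each reference, the same helper works no matter which collection a given reference points at, which is the situation DBRef is meant for.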

Implementing Index-Related Functions

In Chapter 3, you looked at what indexes can do for your database. Now it’s time to briefly learn how to

create and use indexes. Indexing will be discussed in greater detail in Chapter 10, but for now let’s look at the

basics. MongoDB includes a fair number of functions available for maintaining your indexes; we’ll begin by

creating an index with the createIndex() function.

The createIndex() function takes at least one parameter, which is the name of a key in one of your

documents that you will use to build the index. In the previous example, you added a document to the media

collection that used the Title key. This collection would be well served by an index on this key.

■Tip The rule of thumb in MongoDB is to create an index for the same sort of scenarios where you’d want to

create one in relational databases and to support your more common queries.

You can create an index for this collection by invoking the following command:

> db.media.createIndex( { Title : 1 } )

This command ensures that an index will be created for all the Title values from all documents in the

media collection. The :1 at the end of the line specifies the direction of the index: 1 would order the index

entries in ascending order, whereas -1 would order the index entries in descending order:

// Ensure ascending index

db.media.createIndex( { Title :1 } )

// Ensure descending index

db.media.createIndex( { Title :-1 } )

■Tip Searching through indexed information is fast. Searching for nonindexed information is slow, as each

document needs to be checked to see if it’s a match.

BSON allows you to store full arrays in a document; however, it would also be beneficial to be able to

create an index on an embedded key. Luckily, the developers of MongoDB thought of this, too, and added

support for this feature. Let’s build on one of the earlier examples in this chapter, adding another document

into the database that has embedded information:

> db.media.insertOne( { "Type" : "CD", "Artist" : "Nirvana","Title" : "Nevermind",

"Tracklist" : [ { "Track" : "1", "Title" : "Smells Like Teen Spirit", "Length" : "5:02" },

{"Track" : "2","Title" : "In Bloom", "Length" : "4:15" } ] } )

{ "_id" : ObjectId("4c45aa2f8e0f000000006293"), "Type" : "CD", "Artist" : "Nirvana",

"Title" : "Nevermind", "Tracklist" : [

{

"Track" : "1",

"Title" : "Smells Like Teen Spirit",

"Length" : "5:02"

},

{

"Track" : "2",

"Title" : "In Bloom",

"Length" : "4:15"

}

] }

Next, you can create an index on the Title key for all entries in the track list:

> db.media.createIndex( { "Tracklist.Title" : 1 } )

The next time you perform a search for any of the titles in the collection (assuming they are nested under Tracklist), the query can use the index instead of checking every document. Next, you can take this concept one step further and use an entire (sub)document as a key, as in this example:

> db.media.createIndex( { "Tracklist" : 1 } )

This statement indexes each element of the array, which means you can now search for any object in

the array. These types of keys are also known as multikeys. You can also create an index based on multiple

keys in a set of documents. This process is known as compound indexing. The method you use to create a

compound index is mostly the same; the difference is that you specify several keys instead of one, as in

this example:

> db.media.createIndex({"Tracklist.Title": 1, "Tracklist.Length": -1})

The benefit of this approach is that you can make an index on multiple keys (as in the previous example,

where you indexed an entire subdocument). Unlike the subdocument method, however, compound

indexing lets you specify whether you want one of the two fields to be indexed in descending order. If you

perform your index with the subdocument method, you are limited to ascending or descending order only.

There is more on compound indexes in Chapter 10.
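The effect of mixing directions in a compound index can be illustrated without a database. The sketch below, in plain JavaScript, orders a few hypothetical entries the way a { Title: 1, Length: -1 } specification describes: ascending by Title first, then descending by Length within equal titles. The helper names and the second "In Bloom" entry are made up for illustration; this is not how MongoDB stores an index internally.

```javascript
// Hypothetical index entries taken from track list fields.
const entries = [
  { Title: "In Bloom", Length: "4:15" },
  { Title: "Smells Like Teen Spirit", Length: "5:02" },
  { Title: "In Bloom", Length: "4:59" }
];

// Compare two values; returns -1, 0, or 1.
function compare(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Build a comparator from an index-style specification such as
// { Title: 1, Length: -1 }: earlier keys take precedence, and the sign
// sets the direction for that key.
function comparatorFor(spec) {
  const keys = Object.keys(spec);
  return (x, y) => {
    for (const key of keys) {
      const result = compare(x[key], y[key]) * spec[key];
      if (result !== 0) {
        return result;
      }
    }
    return 0;
  };
}

const ordered = [...entries].sort(comparatorFor({ Title: 1, Length: -1 }));
console.log(ordered.map((e) => `${e.Title} ${e.Length}`));
```

Here the two "In Bloom" entries stay together, with the longer track listed first.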

Surveying Index-Related Commands

So far you’ve taken a quick glance at one of the index-related commands, createIndex(). Without a doubt,

this is the command you will primarily use to create your indexes. However, you might also find a pair of

additional functions useful: hint() and min()/max(). You use these functions to query for data. We haven’t

covered them to this point because they won’t function without a custom index. But now let’s take a look at

what they can do for you.

Forcing a Specified Index to Query Data

You can use the hint() function to force the use of a specified index when querying for data. The intended

benefit of using this command is to improve the query performance where the query planner does not

consistently use a good index for a given query. This option should be used with caution, because you can just as easily force the use of an index that results in poor performance.

To see this principle in action, try performing a find with the hint() function without defining an index:

> db.media.find( { ISBN: "978-1-4842-1183-0"} ) . hint ( { ISBN: -1 } )

error: { "$err" : "bad hint", "code" : 10113 }

If you create an index on ISBN numbers, this technique will be more successful. Note that the first command’s background parameter ensures that the indexing is done in the background. This is useful because, by default, initial index builds are done in the foreground, which is a blocking operation for other writes. The background option allows the initial index build to happen without blocking other writes:

> db.media.createIndex({ISBN: 1}, {background: true});

> db.media.find( { ISBN: "978-1-4842-1183-0"} ) . hint ( { ISBN: 1 } )

{ "_id" : ObjectId("4c45a5418e0f000000006291"), "Type" : "Book", "Title" : "Definitive Guide

to MongoDB 3rd ed., The", "ISBN" : "978-1-4842-1183-0", "Author" : ["Hows, David","Membrey,

Peter", "Plugge, Eelco","Hawkins,Tim"], "Publisher" : [

{

"$ref" : "publisherscollection",

"$id" : ObjectId("4c4597e98e0f000000006290")

}

] }

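To confirm that the given index is being used, you can optionally add the explain() function, which returns information about the query plan chosen. In the output, look at the winningPlan section: an IXSCAN stage, together with its indexBounds value, indicates that an index was used, whereas a COLLSCAN stage indicates that the whole collection was scanned: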

> db.media.find( { ISBN: "978-1-4842-1183-0"} ) . hint ( { ISBN: 1 } ).explain()

{

"waitedMS" : NumberLong(0),

"queryPlanner" : {

"plannerVersion" : 1,

"namespace" : "library.media",

"indexFilterSet" : false,

"parsedQuery" : {

"$and" : [ ]

},

"winningPlan" : {

"stage" : "COLLSCAN",

"filter" : {

"$and" : [ ]

},

"direction" : "forward"

},

"rejectedPlans" : [ ]

},

"serverInfo" : {

"host" : "localhost",

"port" : 27017,

"version" : "3.1.7",

"gitVersion" : "7d7f4fb3b6f6a171eacf53384053df0fe728db42"

},

"ok" : 1

}

Constraining Query Matches

The min() and max() functions enable you to constrain query matches to only those that have index keys

between the min and max keys specified. Therefore, you will need to have an index for the keys you are

specifying. Also, you can either combine the two functions or use them separately. Let’s begin by adding a few documents that enable you to take advantage of these functions, and then create an index on the Released field:

> db.media.insertOne( { "Type" : "DVD", "Title" : "Matrix, The", "Released" : 1999} )

> db.media.insertOne( { "Type" : "DVD", "Title" : "Blade Runner", "Released" : 1982 } )

> db.media.insertOne( { "Type" : "DVD", "Title" : "Toy Story 3", "Released" : 2010} )

> db.media.createIndex( { "Released": 1 } )

You can now use the max() and min() commands, as in this example:

> db.media.find() . min ( { Released: 1995 } ) . max ( { Released : 2005 } )

{ "_id" : ObjectId("4c45b5b38e0f0000000062a9"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999 }

If no index is created, then an error message will be returned, saying that no index has been found for the specified key pattern. You can also state explicitly which index must be used by adding the hint() function:

> db.media.find() . min ( { Released: 1995 } ) .

max ( { Released : 2005 } ). hint ( { Released : 1 } )

{ "_id" : ObjectId("4c45b5b38e0f0000000062a9"), "Type" : "DVD", "Title" : "Matrix, The",

"Released" : 1999 }

■Note The min() value will be included in the results, whereas the max() value will be excluded from the

results.
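The inclusive lower bound and exclusive upper bound can be pictured as a half-open interval. The following plain JavaScript sketch filters an in-memory copy of the three DVDs in the same way; the between helper is hypothetical and only mimics the boundary behavior described in the note, not the index lookup itself.

```javascript
// Sample data mirroring the three DVDs inserted above.
const dvds = [
  { Title: "Matrix, The", Released: 1999 },
  { Title: "Blade Runner", Released: 1982 },
  { Title: "Toy Story 3", Released: 2010 }
];

// Hypothetical helper: keep documents whose key lies in [min, max),
// that is, min is inclusive and max is exclusive.
function between(docs, key, min, max) {
  return docs.filter((doc) => doc[key] >= min && doc[key] < max);
}

// Same bounds as the min()/max() query shown earlier.
const firstRange = between(dvds, "Released", 1995, 2005).map((d) => d.Title);
// Bounds that sit exactly on two release years.
const secondRange = between(dvds, "Released", 1982, 1999).map((d) => d.Title);

console.log(firstRange);
console.log(secondRange);
```

In the second call, the 1982 title is kept because it equals the lower bound, while the 1999 title is dropped because it equals the upper bound. Expressed with query operators, the same interval corresponds to { Released: { $gte: 1995, $lt: 2005 } }.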

Generally speaking, it is recommended that you use $gt and $lt (greater than and less than,

respectively) rather than min() and max() because $gt and $lt don’t require an index. The min() and max()

functions are used primarily for compound keys.

Summary

In this chapter, we’ve taken a look at the most commonly used commands and options that can be

performed with the MongoDB shell to manipulate data. We also examined how to search for, add, modify,

and delete data, and how to modify your collections and databases. Next, we took a quick look at atomic

operations, how to use aggregation, and when to use operators such as $elemMatch. Finally, we explored

how to create indexes and when to use them. We examined what indexes are used for, how you can drop

them, how to search for your data using the indexes created, and how to check for running indexing

operations.

In the next chapter, we’ll look into the fundamentals of GridFS, including what it is, what it does, and

how it can be used to your benefit.

Chapter 5

GridFS

We live in a world of high-definition video, 12MP cameras, and storage media that can hold 50GB of data on

a disc the size of a CD-ROM. In that context, the 16MB limit for the maximum size of a MongoDB document

might seem laughably inadequate. Indeed, you might wonder why MongoDB, which has been designed as a

database for today’s high-tech age, has such a seemingly strange limitation. The short answer is performance.

If data were stored in the document itself, it would obviously get very large, which in turn would make the

data harder to work with. For example, pulling back the whole document would require loading the files in the

document as well. You could work around this issue by requesting only a small subset of the fields projected

in your result, but the MongoDB server still needs to load the entire document into memory. Fortunately,

MongoDB features a unique and somewhat elegant solution to this problem. MongoDB enables you to store

large files quite easily, yet it also allows you to access parts of the file without retrieving the entire thing—all

while maintaining high performance. It achieves this by leveraging a specification known as GridFS.

■Note One interesting thing about GridFS is that it isn’t actually a software feature. For example, there isn’t

any special server-side code in MongoDB that manages GridFS (although there are some helper functions to

make it easier to write GridFS drivers). Instead, GridFS is a simple specification used by all of the supported

drivers on MongoDB. The key benefit of such a specification is that files stored by one driver can be accessed

by any other driver that follows the same convention.

This approach adheres closely to the MongoDB principle of keeping things simple. Because GridFS uses

standard MongoDB features, it’s easy to implement and work with the specification from the driver’s point

of view. It also means you can poke around by hand if you really want to, because, as far as MongoDB is concerned, files stored using the GridFS specification live in normal collections containing documents.

Filling in Some Background

Chapter 1 touched on the fact that we have been taught to use databases for even simple storage for many

years. For example, the book one of us bought to help improve his PHP more than 15 years ago introduced

MySQL in Chapter 3. Considering the complexity of SQL and databases in the real world (not to mention in

theory), you might wonder why a book intended for beginners would practically start off with SQL. After all,

it was a PHP book and not a MySQL book.

One thing most people don’t appreciate until they try it is that reading and writing data directly to disk

is hard. Some people don’t agree with us on this point—after all, opening and reading files in Python might

seem trivial. And it is: in simpler scenarios, working with files is rather painless when using Python. If all you

want to do is read in lines and process them, you’re unlikely to have any trouble.

CHAPTER 5 ■ GRIDFS

92

On the other hand, things become a lot harder if you want to search a file or store complicated or

structured data. Even if you can work out how to do this and create a solution, your solution is unlikely to

be faster or more efficient than relying on a database instead. Today’s applications depend on finding and

storing data quickly—and databases make this possible for those of us who can’t or don’t want to write such

a system ourselves.

One area that is glossed over by many books is the storing of files. Most books that teach you to use a

database to store your data also teach you to read and write to the filesystem instead when you need to

store files. In some ways, this isn’t usually a problem, because it’s much easier to read and write simple files

than to process what’s in them. There are some issues, however. First, the developer must have permission

to write those files in the first place, and that requires giving the web server permission to write to the

local filesystem. This might not seem likely to pose a problem, but it gives system administrators

nightmares—getting files onto a server is the first stage in being able to compromise it.

Databases can store binary files; typically, it’s just not elegant for them to do so. MySQL has a special

column type called BLOB. PostgreSQL requires special procedures to be followed to store such files—and the

data aren’t stored in the table itself. In other words, it’s messy. These solutions are obviously bolt-ons. Thus,

it’s not surprising that people choose to write data to the disk instead. But that approach also has issues.

Apart from the problems with security, it adds another directory that needs to be backed up, and you must

also ensure that this information is replicated to all the appropriate servers. There are filesystems that provide

the ability to write to disk and have that content fully replicated (including GridFS); but these solutions are

complex and add overhead; moreover, these features typically make your solution harder to maintain.

MongoDB, on the other hand, enforces a maximum document size of 16MB. This is more than enough

for storing rich documents, and it might have sufficed a few years ago for storing many other types of files as

well. However, this limit is wholly inadequate for today’s environment.

Working with GridFS

Next, we’ll take a brief look at how GridFS is implemented. As the MongoDB website points out, you do not

need to understand or be aware of the underlying implementation of GridFS to use it. In fact, you can simply

let the driver handle the heavy lifting for you. For the most part, the drivers that support GridFS implement

file handling in a language-specific way. For example, the MongoDB driver for Python works in a manner

that is wholly consistent with Python, as you’ll see shortly. If the ins-and-outs of GridFS don’t interest you, then just skip ahead to the section “Getting Started with the Command-Line Tools.” We promise you won’t

miss anything that enables you to use MongoDB effectively!

GridFS consists of two parts. More specifically, it consists of two collections. One collection holds the

filename and related information such as size (called metadata), while the other collection holds the file data

itself, usually in 255K chunks. The specification calls for these to be named files and chunks, respectively.

By default, the files and chunks collections are created in the fs namespace, but this can be changed. The

ability to change the default namespace is useful if you want to store different types of files. For example, you

might want to keep image and movie files separate.
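As a rough illustration of the two-collection layout, the plain JavaScript sketch below splits a buffer into chunk-shaped documents and builds a matching file document. It only models the shapes in memory: toChunks is a hypothetical helper, the _id and filename values are made up, and the field names files_id, n, and data follow the GridFS specification as we understand it. A real driver does this work for you.

```javascript
const DEFAULT_CHUNK_SIZE = 255 * 1024; // 261120 bytes

// Hypothetical helper: split a Buffer into chunk documents shaped like the
// entries of the chunks collection (files_id, n, data).
function toChunks(fileId, data, chunkSize = DEFAULT_CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n += 1) {
    chunks.push({
      files_id: fileId, // links the chunk back to its file document
      n: n,             // zero-based position of the chunk within the file
      data: data.subarray(offset, offset + chunkSize)
    });
  }
  return chunks;
}

const payload = Buffer.alloc(600 * 1024, 0x61); // 600K filled with the letter "a"

// Shaped like an entry of the files collection (illustrative values).
const fileDoc = {
  _id: "example-file",
  filename: "/tmp/example",
  length: payload.length,
  chunkSize: DEFAULT_CHUNK_SIZE
};

const chunkDocs = toChunks(fileDoc._id, payload);
console.log(chunkDocs.length);         // number of chunk documents
console.log(chunkDocs[2].data.length); // size of the final, partial chunk
```

Every chunk except the last is exactly chunkSize bytes long; the last one holds whatever remains.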

Getting Started with the Command-Line Tools

Now that we have some of the background out of the way, let’s look at how to get started with GridFS by

exploring the command-line tools available to leverage it. First, we will need a file to play with. To keep

things simple, let’s use the dictionary file. On Ubuntu, you can find this at /usr/share/dict/words.

However, there are various levels of symbolic links, so you might want to run this command first:

root@core2:/usr/share/dict# cat words > /tmp/dictionary

■Note In Ubuntu, you might need to use apt-get install wbritish to get the dictionary file installed.

This command copies all the contents of the file to a nice and simple path that you can use easily. Of

course, you can use any file that you wish for this example; it doesn’t need to be any particular size or type.

Rather than describe all the options you can use with mongofiles, let’s jump right in and start playing

with some of the tool’s features. This book assumes that you’re running mongofiles on the same machine as

MongoDB. If you’re not, then you’ll need to use the -h option to specify the host that MongoDB is running on.

You’ll learn about the other options available in the mongofiles command after putting it through its paces.

First, let’s list all the files in the database. We’re not expecting any files to be in there yet, but let’s make

sure. The list command lists the files in the database so far:

$ mongofiles list

2015-10-01T08:54:51.901+0000 connected to: localhost

$

Okay, so that probably wasn’t very exciting. Keep in mind that mongofiles is a proof-of-concept tool; it’s

probably not a tool you will use much with your own applications. However, mongofiles is great for learning

and testing. Once you create a file, you can use the tool to explore the files and chunks that are created.

Let’s kick things up a notch with the put command to add the dictionary file created previously

(remember: you can use any file that you like for this example):

$ mongofiles put /tmp/dictionary

2015-10-01T08:56:14.605+0000 connected to: localhost

added file: /tmp/dictionary

$

This doesn’t give us much in the way of information, so let’s see if we can get some confirmation that it

actually did what we thought it did. Do so by rerunning the list command:

$ mongofiles list

2015-10-01T08:57:37.290+0000 connected to: localhost

/tmp/dictionary 938969

$

This example shows the dictionary file, along with its size. The information clearly comes from the

files collection, but we’re getting ahead of ourselves.

Using the _id Key

As you know, each document in MongoDB includes a unique identifier stored in the _id key. Like MySQL’s

auto_increment field, the _id key is not of much direct interest, apart from the fact that it allows you to pick

out a specific file.

Working with Filenames

Inserted files have a filename key, which itself needs a little explanation. Generally, you will want to keep this

field unique to help prevent major confusion; however, that’s not entirely necessary. In fact, if you run the put

command again, you’ll end up with two documents that look identical. In this case, the files and metadata

are identical, apart from the _id key. You might be surprised by this and wonder why MongoDB doesn’t

update the file that exists rather than create a new one. The reason is that there could be many cases where

you would have filenames that are identical. For example, if you built a system to store student assignments,

then chances are pretty good that at least some of the filenames would be the same. MongoDB cannot

assume that identical filenames (even those with identical sizes) are in fact the same file. Thus, there are

many cases where it would be a mistake for MongoDB to update the file. Of course, you can use the _id key

to update a specific file; and you’ll learn more about this topic in the upcoming Python-based experiments.

The File’s Length

The file’s length is both useful information and critical to how GridFS works. While it is nice to know how

big a file is for reference, the file’s size also plays a big part when you write your own applications. For

example, when sending a file over the Web (through HTTP, for example), you need to specify how big the

file is. Not all servers do this; for example, when downloading files from certain sites, you may have noticed

that your browser can tell you the speed you’re downloading the file at, but not how long it will take to finish

downloading the file. This is because the server did not provide size information.

Knowing the size of your file is important in one other respect. Earlier, we mentioned that a file is

broken up into chunks—that is, the file is split into smaller pieces. By default, the chunk size is 255K, but that

can be changed to another value if you wish. To work out how many chunks a file takes up, you need to know

two things. First you must know how big each chunk is; and second, you must know the file size, so that you

can tell how many chunks there are.

You might think that this shouldn’t be important. After all, if you have a 1MB file and the chunk size is

255K, then you know that you must start with chunk number four if you want to access data starting at the

800K mark. Yet you still need to know how big the overall file is for the following reason: if you don’t know

the size, you cannot work out how many valid chunks there are. In the previous example, there’s nothing to

stop you asking for data that starts at 1.26MB (that is, the sixth chunk). In this case, that chunk doesn’t exist,

but there is no way to know that without a reference to the file size. Of course, the driver handles all of this

for you, so there’s no need to worry too much about it; however, knowing how GridFS works “behind the

scenes” will certainly help when it comes to debugging your applications.
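The arithmetic in this section can be written out as a short plain JavaScript sketch. The function names are our own, and the code assumes the default chunk size of 255K (261,120 bytes); it reproduces the 1MB example above and the dictionary file uploaded earlier.

```javascript
const CHUNK_SIZE = 255 * 1024; // default chunk size: 261120 bytes

// Number of chunks needed to hold a file of the given length.
function chunkCount(length, chunkSize = CHUNK_SIZE) {
  return Math.ceil(length / chunkSize);
}

// Zero-based index of the chunk that contains a byte offset, or -1 when the
// offset lies beyond the end of the file.
function chunkIndexFor(offset, length, chunkSize = CHUNK_SIZE) {
  if (offset < 0 || offset >= length) {
    return -1;
  }
  return Math.floor(offset / chunkSize);
}

const oneMB = 1024 * 1024;
console.log(chunkCount(oneMB));                              // 5
console.log(chunkIndexFor(800 * 1024, oneMB));               // 3, that is, the fourth chunk
console.log(chunkIndexFor(Math.round(1.26 * oneMB), oneMB)); // -1, past the end of the file
console.log(chunkCount(938969));                             // 4, the dictionary file
```

Without the length argument, chunkIndexFor would happily report index 5 for the 1.26MB offset, even though a 1MB file has only five chunks (indexes 0 through 4).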

Working with Chunk Sizes

Although there is a default chunk size, this default can be changed on a file-by-file basis. This allows flexible

sizing. If your website streams video, you might want to have many chunks so that you can easily skip to any part of a given video. If you had one big file, you would have to return the whole file, and then

find the starting point for the specified section in it. With GridFS, you can pull back data at the chunk level.

If you’re using the default size, then you can start retrieving data from any 255K chunk. Of course, you can

also specify the bit of data you actually want (for example, you might want only five minutes in the middle of

a 60-minute movie). This is a very efficient system, and 255K is a pretty good chunk size for most purposes.

If you decide to change it, you should have a good reason for doing so. As always, don’t forget to benchmark

and test the performance of your custom chunk size; it’s not uncommon for theoretically better systems to

fail to live up to expectations.

■Note MongoDB has a 16MB restriction on document size. Because GridFS is simply a different way of

storing files in the standard MongoDB framework, this restriction also exists in GridFS. That is, you can’t create

chunks larger than 16MB. This shouldn’t pose a problem, because the whole point of GridFS is to alleviate the

need for huge document sizes. If you’re worried that you’re storing huge files, and this will give you too many

chunk documents, you needn’t worry—there are MongoDB systems in production with significantly more than a

billion documents!

Tracking the Upload Date

The uploadDate key does exactly what its name suggests: it stores the date the file was created in MongoDB.

This is a good time to mention that the files collection is just a normal MongoDB collection, containing

normal documents. This means that you can add any additional key and value pairs that you need, in the

same way you would for any other collection.

For example, consider the case of a real-world application that needs to store text content that you

extract from various files. You might need to do this so you could perform some additional indexing and

searching. To accomplish this, you might add a file_text key and store the text in there. The elegance of

the GridFS system means that you can do anything with this system you can do with any other MongoDB

documents. Elegance and power are two of the defining characteristics of working in MongoDB.

Hashing Your Files

MongoDB ships with the MD5 hashing algorithm. You may have come across the algorithm previously

when downloading software over the Internet. The theory behind MD5 is that each file has a unique

signature. Changing a single bit anywhere in that file will drastically (and noticeably) change the signature.

This signature is used for two reasons: security and integrity. For security, if you know what the MD5 hash

is supposed to be and you trust the source (perhaps a friend gave it to you), then you can be assured that

the file has not been altered if the hash (often called the checksum) is correct. This also ensures that the file

integrity has been maintained and that no data have been lost or damaged. The MD5 hash of a particular file

acts like a fingerprint for a file. The hash can also be used to identify files that have different filenames but

have the same contents.

■Warning The MD5 algorithm is no longer considered secure, and it has been demonstrated that it is

possible to create two different files that have the same MD5 checksum, even though their contents are

different. In cryptographic terms, this is called a collision. Such collisions are bad because this means it is

possible for an attacker to alter a file in such a way that it cannot be detected. This caveat remains somewhat

theoretical because a great deal of time and effort would be required to create such collisions intentionally; and

even then, the files could be so different as to be obviously not the same file. For this reason, and because it
is so widely supported, MD5 is still commonly used for determining file integrity.

Looking Under MongoDB’s Hood

At this point, you have some data in a MongoDB database. Now let’s take a closer look at that data under the

covers. To do this, you’ll again use some command-line tools to connect to the database and query it. For

example, try running the find() command against the file created earlier:

$ mongo test

MongoDB shell version: 3.1.7

connecting to: test

Welcome to the MongoDB shell.

> db.fs.files.find()

{ "_id" : ObjectId("560cf6ab73f0fc3ab9000001"), "chunkSize" : 261120,

"uploadDate" : ISODate("2015-10-01T09:02:35.397Z"), "length" : 938969, "md5" :

"7e2877e5dad6e8e97b0fa43d28f2feca", "filename" : "/tmp/dictionary" }

>

This output shows you how the different keys discussed earlier fit together.

Next, let’s take a look at the chunks collection (you have to add a projection to exclude the binary data;

otherwise, it will show you all of the raw binary data as well):

$ mongo test

MongoDB shell version: 3.1.7

connecting to: test

> db.fs.chunks.find({},{"data":0});

{ "_id" : ObjectId("560cf6ab73f0fc3ab9000002"), "files_id" :

ObjectId("560cf6ab73f0fc3ab9000001"), "n" : 0 }

{ "_id" : ObjectId("560cf6ab73f0fc3ab9000005"), "files_id" :

ObjectId("560cf6ab73f0fc3ab9000001"), "n" : 3 }

{ "_id" : ObjectId("560cf6ab73f0fc3ab9000004"), "files_id" :

ObjectId("560cf6ab73f0fc3ab9000001"), "n" : 2 }

{ "_id" : ObjectId("560cf6ab73f0fc3ab9000003"), "files_id" :

ObjectId("560cf6ab73f0fc3ab9000001"), "n" : 1 }

>
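The four chunk documents in this output follow directly from the length and chunkSize values stored in the files collection. The arithmetic can be sketched as follows; the function names are illustrative and not part of any driver.

```python
def chunk_count(length, chunk_size):
    """Return how many chunk documents a file of `length` bytes needs."""
    # Integer ceiling division: a partly filled final chunk still counts.
    return (length + chunk_size - 1) // chunk_size


def last_chunk_size(length, chunk_size):
    """Return the number of bytes stored in the final chunk."""
    remainder = length % chunk_size
    return remainder if remainder else min(length, chunk_size)


length = 938969       # "length" from the fs.files document
chunk_size = 261120   # "chunkSize" from the fs.files document (255 * 1024)

print(chunk_count(length, chunk_size))      # 4, matching n = 0 through 3
print(last_chunk_size(length, chunk_size))  # 155609 bytes in the last chunk
```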

■Warning Accessing documents and collections directly is a powerful feature, but you need to be careful.

This feature also makes it much easier to shoot yourself in both feet at the same time. Make sure you know

what you’re doing and that you perform a great deal of testing if you decide to edit these documents and

collections manually. Also, keep in mind that the GridFS support in MongoDB’s drivers won’t know anything

about any customizations you’ve made.

Using the search Command

Next, let’s take a closer look at the mongofiles search command. Thus far, there is only a single file in the

database, which greatly limits the types of searches you might conduct! So let’s add something else. The

following snippet copies the dictionary to another file, and then imports that file:

$ cp /tmp/dictionary /tmp/hello_world

$ mongofiles put /tmp/hello_world

2015-10-01T09:30:00.183+0000 connected to: localhost

added file: /tmp/hello_world

$ mongofiles list

2015-10-01T09:30:41.894+0000 connected to: localhost

/tmp/dictionary 938969

/tmp/hello_world 938969

$

The first line copies the file, and the second line imports it into MongoDB. Next, you might run the

mongofiles command list to check that the files were correctly stored. If you do so, you can see that there

are now two files in the collection; unsurprisingly, both files have the same size.

The search command works exactly as you would expect. All you need to do is tell mongofiles what you

are looking for, and it will try to find it for you, as in this example:

$ mongofiles search hello

2015-10-01T09:31:31.471+0000 connected to: localhost

/tmp/hello_world 938969

$ mongofiles search dict

2015-10-01T09:31:37.514+0000 connected to: localhost

/tmp/dictionary 938969

$

Again, nothing too exciting happens here. However, there is an important takeaway that’s worth noting.

MongoDB can be as simple or as complex as you need it to be. The mongofiles tool is only for reference

use, and it includes very basic debugging. The good news is that MongoDB makes it easy to perform simple

searches against your files. The even better news is that MongoDB also has your back if you want to write

some insanely complicated searches.
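The matching seen in the search output above behaves like a substring test on the filename. The following sketch models that behavior over a plain Python list; it is an assumption-based illustration, not how mongofiles is implemented.

```python
def search(filenames, fragment):
    """Return the filenames that contain `fragment` anywhere in the name."""
    return [name for name in filenames if fragment in name]


stored = ["/tmp/dictionary", "/tmp/hello_world"]

print(search(stored, "hello"))  # ['/tmp/hello_world']
print(search(stored, "dict"))   # ['/tmp/dictionary']
print(search(stored, "tmp"))    # both files match
```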

Deleting

The mongofiles command delete doesn’t require much explanation, but it does deserve a big warning. This

command deletes files based on the filename. Thus, if you have more than one file with the same name, this

command will delete all of them. The following snippet shows how to use the delete command:

$ mongofiles delete /tmp/hello_world

2015-10-01T09:32:34.131+0000 connected to: localhost

successfully deleted all instances of '/tmp/hello_world' from GridFS

$ mongofiles list

2015-10-01T09:32:54.103+0000 connected to: localhost

/tmp/dictionary 938969

$

■Note Many people have commented in connection with this issue that deleting multiple files with the same

name is not a problem because no application would have duplicate names. This is simply not true; and in

many cases, it doesn’t even make sense to enforce unique names. For example, if your app lets users upload

photos to their profiles, there’s a good chance that half the files you receive will be called photo.jpg or me.png.

Of course, you are unlikely to use mongofiles to manage your live data (and in truth no one ever expected it
to be used that way), so treat this mainly as a reminder to be careful when deleting data in general.
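To see why duplicate filenames matter, consider this small simulation that uses a list of dictionaries in place of the files collection. The documents, the owner key, and both helper functions are hypothetical; the point is only the difference between deleting by name and deleting by _id.

```python
# Hypothetical file documents; two users uploaded a file called photo.jpg.
files = [
    {"_id": 1, "filename": "photo.jpg", "owner": "alice"},
    {"_id": 2, "filename": "photo.jpg", "owner": "bob"},
    {"_id": 3, "filename": "me.png", "owner": "carol"},
]


def delete_by_filename(docs, filename):
    """Remove every document with this filename, like the delete command above."""
    return [doc for doc in docs if doc["filename"] != filename]


def delete_by_id(docs, file_id):
    """Remove only the single document with this _id."""
    return [doc for doc in docs if doc["_id"] != file_id]


print(len(delete_by_filename(files, "photo.jpg")))  # 1: both photos are gone
print(len(delete_by_id(files, 1)))                  # 2: only one photo is gone
```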

Retrieving Files from MongoDB

So far, you haven’t actually pulled any files out from MongoDB. The most important feature of any database

is that it lets you find and retrieve data once they have been input. The following snippet retrieves a file from

MongoDB using the mongofiles command get:

$ mongofiles get /tmp/dictionary

2015-10-01T09:33:54.820+0000 connected to: localhost

finished writing to /tmp/dictionary

$

This example includes an intentional mistake. Because it specifies the full name and path of the file you

want to retrieve (as required), mongofiles writes the data to a file with the same name and path. Effectively,

this overwrites the original dictionary file! This isn’t exactly a great loss, because it is being overwritten by the

same file—and the dictionary file was only a temporary copy in the first place. Nevertheless, this behavior

could give you a rather nasty shock if you accidentally erase two weeks of work. Trust us, you won’t figure out

where all your work went until sometime after the event! As when using the delete command, you need to

be careful when using the get command.
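If you write retrieved data to disk from your own code, you can guard against this kind of accident. The sketch below assumes Python 3, where opening a file in exclusive mode ("x") fails rather than overwriting; the helper name and the temporary paths are illustrative only.

```python
import os
import tempfile


def write_without_overwrite(path, data):
    """Write data to path, raising FileExistsError if the path already exists."""
    # Mode "xb" creates the file exclusively; it fails instead of truncating.
    with open(path, "xb") as handle:
        handle.write(data)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "dictionary")

write_without_overwrite(target, b"two weeks of work\n")
try:
    write_without_overwrite(target, b"contents retrieved from GridFS\n")
    overwritten = True
except FileExistsError:
    overwritten = False

print(overwritten)  # False: the existing file was left untouched
```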

Summing Up mongofiles

The mongofiles utility is a useful tool for quickly looking at what’s in your database. If you’ve written some

software, and you suspect something might be amiss with it, then you can use mongofiles to double check

what’s going on.

It’s an extremely simple implementation, so it doesn’t require any fancy logic that could complicate

accomplishing the task at hand. Whether you would use mongofiles in a production environment is a matter

of personal taste. It’s not exactly a Swiss army knife; however, it does provide a useful set of commands that

you’ll be grateful to have if your application begins misbehaving. In short, you should be familiar with this

tool because someday it might be exactly the tool you require to solve an otherwise nettlesome problem.

Exploiting the Power of Python

At this point, you have a solid idea of how GridFS works. Next, you will learn how to access GridFS from

Python. Chapter 2 covered how to install PyMongo; if you have any trouble with the examples, please refer

back to Chapter 2 and make sure everything is installed correctly.

If you’ve been following along with the previous examples in this chapter, you should now have one file

in GridFS. You’ll also recall that the file is a dictionary file, so it contains a list of words. In this section, you

will learn how to write a simple Python script that prints out all the words in the dictionary file. Sure, it would

be simpler and more efficient to simply cat the original file—but where would the fun be in that?

Begin by firing up Python:

Python 2.7.9 (default, Apr 2 2015, 15:33:21)

[GCC 4.9.2] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>>

The standard driver for Python is called PyMongo. Because the PyMongo driver is supported directly by

MongoDB, Inc., the company that publishes MongoDB, you can rest assured that it will be regularly updated

and maintained. So, let’s go ahead and import the library. You should see something like the following:

>>> from pymongo import MongoClient

>>> import gridfs

>>>

If PyMongo isn’t installed correctly, you will get an error similar to this:

>>> import gridfs

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

ImportError: No module named gridfs

>>>

If you see the latter message, chances are something was missed during installation. In that case, pop

back to Chapter 2 and follow the instructions to install PyMongo again.

Connecting to the Database

Before you can retrieve information from a database, you must first establish a connection to it. When you
were using the mongofiles utility earlier in this chapter, you probably noticed the reference to localhost in
its output. This hostname resolves to 127.0.0.1, your computer’s loopback address, and is simply a shortcut
for telling a computer to talk to itself. The reason mongofiles mentioned this host is that it was actually
connecting to MongoDB through the network. The default is to connect to the local machine on the default
MongoDB port (27017). Because you haven’t changed the default settings, mongofiles can find and connect
to your database without any trouble.

When using MongoDB with Python, however, you need to connect to the database and then set up

GridFS. Fortunately, this is easy to do:

>>> db = MongoClient().test

>>> fs = gridfs.GridFS(db)

>>>

The first line opens the connection and selects the database. By default, mongofiles uses the test

database; hence, you’ll find your dictionary file in test. The second line sets up GridFS and prepares it for use.

Accessing the Words

In its original implementation, the PyMongo driver used a file-like interface to leverage GridFS. This is

somewhat different from what you saw in this chapter’s earlier examples with mongofiles, which were more

FTP-like in nature. In the original implementation of PyMongo, you could read and write data just as you do

for a normal file.

This made PyMongo very much like Python to use, and it allowed for easy integration with existing

scripts. However, this behavior was changed in version 1.6 of the driver, and this functionality is no longer

supported. While very Python-like, the behavior had some problems that made the tool less effective overall.

Generally speaking, the PyMongo driver attempts to make GridFS files look and feel like ordinary files

on the filesystem. On the one hand, this is nice because it means there’s no learning curve, and the driver

is usable with any method that requires a file. On the other hand, this approach is somewhat limiting and

doesn’t give a good feel for how powerful GridFS is.

Putting Files into MongoDB

Getting files into GridFS through PyMongo is straightforward and intentionally similar to the way you do

so using command-line tools. MongoDB is all about throughput, and the changes to the API in the revised

version of PyMongo reflect this. Not only do you get better performance, but the changes also bring the

Python driver in line with the other GridFS implementations.

Let’s put the dictionary into GridFS (again):

>>> with open("/tmp/dictionary") as dictionary:

...     uid = fs.put(dictionary)

...

>>> uid

ObjectId('560d00b273f0fc5d7178f4a7')

>>>

In this example, you use the put method to insert the file. It’s important that you capture the result

from this method because it contains the document _id for your file. PyMongo takes a different approach

than mongofiles, which assumes the filename is effectively the key (even though you can have duplicates).

Instead, PyMongo references files based on their _id. If you don’t capture this information, then you won’t

be able to reliably find the file again. Actually, that’s not strictly true—you could search for a file quite

easily—but if you want to link this file to a particular user account, then you need this _id.

Two useful arguments that can be used in conjunction with the put command are filename and

content_type. As you might expect, these arguments let you set the filename and the content type of the

file, respectively. This is useful for loading files directly from disk. However, it is even handier when you’re

handling files that have been received over the Internet or generated in memory because, in those cases, you

can use file-like semantics, but without actually having to create a real file on the disk.
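The in-memory case can be sketched with io.BytesIO, which gives bytes a file-like read() method. So that the example runs without a database, it uses RecordingStore, a hypothetical stand-in for a gridfs.GridFS object; only the filename and content_type keyword names come from the text above, and the payload is made up.

```python
import io


class RecordingStore(object):
    """Hypothetical stand-in for a gridfs.GridFS object; it only records put() calls."""

    def __init__(self):
        self.calls = []

    def put(self, data, **kwargs):
        self.calls.append((data.read(), kwargs))
        return len(self.calls)  # stand-in for the _id a real put() would return


def store_upload(fs, payload, filename, content_type):
    """Wrap raw bytes in a file-like object and hand them to fs.put()."""
    return fs.put(io.BytesIO(payload), filename=filename, content_type=content_type)


payload = b"bytes received over the network"
fs = RecordingStore()
file_id = store_upload(fs, payload, "me.png", "image/png")
print(file_id, fs.calls[0][1])
```

With a real GridFS object in place of the stand-in, the same call should store the bytes without a temporary file ever touching the disk.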

Retrieving Files from GridFS

At long last, you’re now ready to return your data! At this point, you have your unique _id, so finding the file

is easy. The get method retrieves a file from GridFS:

>>> new_dictionary = fs.get(uid)

That’s it! The preceding snippet returns a file-like object; thus, you can print all the words in the

dictionary using the following snippet:

>>> for word in new_dictionary:

...     print word

Now watch in awe as a list of words quickly scrolls up the screen! Okay, so this isn’t exactly rocket

science. However, the fact that it isn’t rocket science or in any way difficult is part of the beauty of GridFS—it

does work as advertised, and it does so in an intuitive and easily understood way!
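The same idea in Python 3 syntax looks like the sketch below, where print is a function. Here io.BytesIO stands in for the object returned by fs.get(uid), and the three words are sample data. The sketch reads the whole file and splits it into lines explicitly, because what a loop over the returned object yields may differ between driver versions.

```python
import io

# Stand-in for the file-like object returned by fs.get(uid); sample data only.
new_dictionary = io.BytesIO(b"aardvark\nabacus\nabandon\n")

# Read everything, decode the bytes, and split into one word per line.
words = new_dictionary.read().decode("utf-8").splitlines()

for word in words:
    print(word)
```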

Deleting Files

Deleting a file is also easy. All you have to do is call fs.delete() and pass the _id of the file, as in the

following example:

>>> fs.delete(uid)

>>> new_dictionary = fs.get(uid)

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "/usr/local/lib/python2.7/dist-packages/gridfs/__init__.py", line 149, in get

gout._ensure_file()

File "/usr/local/lib/python2.7/dist-packages/gridfs/grid_file.py", line 410,

in _ensure_file

(self.__files, self.__file_id))

gridfs.errors.NoFile: no file in gridfs collection Collection(Database(MongoClient

('localhost', 27017), u'test'), u'fs.files') with _id ObjectId('560d00b273f0fc5d7178f4a7')

>>>

These results could look a bit scary, but they are just PyMongo’s way of saying that it couldn’t find the

file. This isn’t surprising, because you just deleted it!

Summary

In this chapter, we took you on a fast-paced tour of GridFS. You learned what GridFS is, how it fits together

with MongoDB, and how to use its basic syntax. This chapter didn’t explore GridFS in great depth, but in the

next chapter, you’ll learn how to integrate GridFS with a real application using PHP. For now, it’s enough to

understand how GridFS can save you time and hassle when storing files and other large pieces of data.

In the next chapter, you’ll start putting what you’ve learned to real use—specifically, you’ll learn how to

build a fully functional address book!

Chapter 6

PHP and MongoDB

Through Chapters 1 to 5, you’ve learned how to perform all sorts of actions in the MongoDB shell. For example,

you’ve learned how to add, modify, and delete a document. You’ve also learned about the workings of

DBRef and GridFS, including how to use them.

So far, however, most of the things you’ve learned about have taken place in the MongoDB shell. It is a

very capable application, but the MongoDB software also comes with a vast number of additional drivers

(see Chapter 2 for more information on these) that let you step outside the shell to accomplish many other

sorts of tasks programmatically.

One such tool is the PHP driver, which allows you to extend your PHP installation to connect, modify,

and manage your MongoDB databases when you want to use PHP rather than the shell. This can be helpful

when you need to design a web application or don’t have access to the MongoDB shell. As this chapter will

demonstrate, most of the actions you can perform with the PHP driver closely resemble functions you can

execute in the MongoDB shell; however, the PHP driver requires that the options be specified in an array,

rather than between two curly brackets. Similarities notwithstanding, you will need to be aware of quite a

few howevers when working with the PHP driver. This chapter will walk you through the benefits of using

PHP with MongoDB, as well as how to overcome the aforementioned howevers.

This chapter brings you back to the beginning in many ways. You will start by learning to navigate the

database and use collections in PHP. Next you will learn how to insert, modify, and delete posts in PHP. You

will also learn how to use GridFS and DBRef again; this time, however, the focus will be on how to use them

in PHP, rather than the theory behind these technologies.

Comparing Documents in MongoDB and PHP

As you’ve learned previously, a document in a MongoDB collection is stored using a JSON-like format that

consists of keys and values. This is similar to the way PHP defines an associative array, so it shouldn’t be too

difficult to get used to this format.

For example, assume a document looks like the following in the MongoDB shell:

contact = ( {

"First Name" : "Philip",

"Last Name" : "Moran",

"Address" : [

{

"Street" : "681 Hinkle Lake Road",

"Place" : "Newton",

"Postal Code" : "MA 02160",

"Country" : "USA"

}

],

"E-Mail" : [

"pm@example.com",

"pm@example.net",

"philip@example.com",

"philip@example.net",

"moran@example.com",

"moran@example.net",

"pmoran@example.com",

"pmoran@example.net"

],

"Phone" : "617-546-8428",

"Age" : 60

})

The same document would look like this when contained in an array in PHP:

$contact = array(

"First Name" => "Philip",

"Last Name" => "Moran",

"Address" => array(

"Street" => "681 Hinkle Lake Road",

"Place" => "Newton",

"Postal Code" => "MA 02160",

"Country" => "USA"

),

"E-Mail" => array(

"pm@example.com",

"pm@example.net",

"philip@example.com",

"philip@example.net",

"moran@example.com",

"moran@example.net",

"pmoran@example.com",

"pmoran@example.net"

),

"Phone" => "617-546-8428",

"Age" => 60

);

The two versions of the document look a lot alike. The obvious difference is that the colon (:) is replaced

as the key/value separator by an arrow-like symbol (=>) in PHP. You will get used to these syntactical

differences relatively quickly.

MongoDB Classes

The PHP driver version 1.6 for MongoDB contains four core classes, a few others for dealing with GridFS,

and several more to represent MongoDB datatypes. The core classes make up the most important part of the

driver. Together, these classes allow you to execute a rich set of commands. The four core classes available

are as follows:

• MongoClient: Initiates a connection to the database and provides database
server commands such as connect(), close(), listDBs(), selectDB(), and
selectCollection().

• MongoDB: Interacts with the database and provides commands such as

createCollection(), selectCollection(), createDBRef(), getDBRef(), drop(),

and getGridFS().

• MongoCollection: Interacts with the collection. It includes commands such as

count(), find(), findOne(), insert(), remove(), save(), and update().

• MongoCursor: Interacts with the results returned by a find() command and includes

commands such as getNext(), count(), hint(), limit(), skip(), and sort().

In this chapter, we’ll look at all of the preceding commands; without a doubt, you’ll use these

commands the most.

■Note This chapter will not discuss the preceding commands grouped by class; instead, the commands will

be sorted in as logical an order as possible.

Connecting and Disconnecting

Let’s begin by examining how to use the MongoDB driver to connect to and select a database and a
collection. You establish connections using the MongoClient class, which is also used for database server
commands. The following example shows how to quickly connect to your database in PHP:

// Connect to the database

$c = new MongoClient();

// Select the database you want to connect to, e.g. contacts

$c->contacts;

The MongoClient class also includes the selectDB() function, which you can use to select a database:

// Connect to the database

$c = new MongoClient();

// Select the database you want to connect to, e.g. contacts

$c->selectDB("contacts");

The next example shows how to select the collection you want to work with. The same rules apply as

when working in the shell: if you select a collection that does not yet exist, it will be created when you save

data to it. The process for selecting the collection you want to connect to is similar to that for connecting to

the database; in other words, you use the (->) syntax to literally point to the collection in question, as in the

following example:

// Connect to the database

$c = new MongoClient();

// Selecting the database ('contacts') and collection ('people') you want

// to connect to

$c->contacts->people;

The selectCollection() function also lets you select—or switch—collections, as in the following

example:

// Connect to the database

$c = new MongoClient();

// Selecting the database ('contacts') and collection ('people') you want

// to connect to

$c->selectDB("contacts")->selectCollection("people");

Before you can select a database or a collection, you sometimes need to find the desired database or
collection. The driver includes two commands for this: listDBs() in the MongoClient class lists the available
databases, and listCollections() in the MongoDB class lists the available collections. You can acquire a list
of available databases by invoking the listDBs() function and printing the output (which will be placed in
an array):

// Connecting to the database

$c = new MongoClient();

// Listing the available databases

print_r($c->listDBs());

Likewise, you can use listCollections() to get a list of available collections in a database:

// Connecting to the database

$c = new MongoClient();

// Listing the available collections within the 'contacts' database

print_r($c->contacts->listCollections());

■Note The print_r command used in this example is a PHP command that prints the contents of an array.

The listDBs() function returns an array directly, so the command can be used as a parameter of the print_r

function.

The MongoClient class also contains a close() function that you can use to disconnect the PHP session
from the database server. However, using it is generally not required, except in unusual circumstances,
because the driver will automatically close the connection to the database cleanly whenever the MongoClient
object goes out of scope.

Sometimes, however, you may want to forcibly close a connection. For example, you may not be sure of the
actual state of the connection, or you may wish to ensure that a new connection can be established. In this
case, you can use the close() function, as shown in the following example:

// Connecting to the database

$c = new MongoClient();

// Closing the connection

$c->close();

Inserting Data

So far you’ve seen how to establish a connection to the database. Now it’s time to learn how to insert data

into your collection. The process for doing this is no different in PHP than when using the MongoDB shell.

The process has two steps. First, you define the document in a variable. Second, you insert it using the

insert() function.

Defining a document is not specifically related to MongoDB—instead, you create an array with keys and

values stored in it, as in the following example:

$contact = array(

"First Name" => "Philip",

"Last Name" => "Moran",

"Address" => array(

"Street" => "681 Hinkle Lake Road",

"Place" => "Newton",

"Postal Code" => "MA 02160",

"Country" => "USA"

),

"E-Mail" => array(

"pm@example.com",

"pm@office.com",

"philip@example.com",

"philip@office.com",

"moran@example.com",

"moran@office.com",

"pmoran@example.com",

"pmoran@office.com"

),

"Phone" => "617-546-8428",

"Age" => 60

);

■Warning Strings sent to the database need to be UTF-8 formatted to prevent an exception from occurring.

Once you’ve assigned your data properly to a variable (called $contact in this case), you can use the
insert() function of the MongoCollection class to insert it into the collection:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people'

$collection = $c->contacts->people;

// Insert the document '$contact' into the people collection '$collection'

$collection->insert($contact);

The insert() function takes five options, specified in an array: fsync, j, w, wTimeoutMS, and
socketTimeoutMS. The fsync option can be set to TRUE or FALSE; FALSE is the default value for this option. If
set to TRUE, fsync forces the data to be written to the hard disk before it indicates the insertion was a success.
Setting fsync to TRUE implies an acknowledged write, so it overrides a w setting of 0. Generally, you will want
to avoid using this option. The j option can be set to TRUE or FALSE, where FALSE is the default. If set to TRUE,
the j option will force the data to be written to the journal before indicating the insertion was a success. If you
are unfamiliar with journaling, think of it as a log file that keeps track of the changes made to your data before
they are finally written to disk. This ensures that, were mongod to stop unexpectedly, it would be able to recover
the changes written to the journal, thereby preventing your data from entering an inconsistent state.

The w option controls whether a write operation is acknowledged (this option also applies to remove()
and update() operations). If w is set to 0, the write operation will not be acknowledged; set it to 1 and the
write will be acknowledged by the (primary) server. When working with replica sets, w can also be set to n,
ensuring that the primary server acknowledges the write operation only once it has been successfully
replicated to n nodes. The w option can also be set to 'majority' (a reserved string), ensuring that the majority
of the replica set will acknowledge the write, or to a specific tag, ensuring that those tagged nodes will
acknowledge the write. The default setting for this option is 1. The wTimeoutMS option can be used to specify
how long to wait for acknowledgment of the write (in milliseconds). By default, this option is set to 10000.
Lastly, the socketTimeoutMS option allows you to specify how long (in milliseconds) the client will wait for a
response from the database. By default, this option is set to 30000.

■Warning The wTimeoutMS and socketTimeoutMS options determine how long the client will wait for

a response, but do not interrupt any operations executed server-side when the timeout expires. As such,

operations may complete after the timeout, but the application will not know about this, having given up waiting

for a response.
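To keep the defaults straight, it can help to see them merged with your own settings. The following sketch is written in Python, the language used for the examples in Chapter 5, and models only the option array, not the PHP driver itself. The option names and default values are those given in the text above; effective_options is a hypothetical helper.

```python
# Defaults for the insert() options as described in the text above.
DEFAULT_OPTIONS = {
    "fsync": False,
    "j": False,
    "w": 1,
    "wTimeoutMS": 10000,
    "socketTimeoutMS": 30000,
}


def effective_options(overrides=None):
    """Merge caller overrides into the defaults, rejecting unknown option names."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise ValueError("unknown options: " + ", ".join(sorted(unknown)))
    options = dict(DEFAULT_OPTIONS)
    options.update(overrides)
    return options


options = effective_options({"w": 1, "wTimeoutMS": 5000})
print(options["wTimeoutMS"])       # 5000: overridden by the caller
print(options["socketTimeoutMS"])  # 30000: default kept
```

Rejecting unknown names is a choice made for this sketch; it catches casing slips in option names early.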

The following example illustrates how to use the w and wTimeoutMS options to insert data:

// Define another contact

$contact = array(

"First Name" => "Victoria",

"Last Name" => "Wood",

"Address" => array(

"Street" => "50 Ash lane",

"Place" => "Ystradgynlais",

"Postal Code" => "SA9 6XS",

"Country" => "UK"

),

"E-Mail" => array(

"vw@example.com",

"vw@example.net"

),

"Phone" => "078-8727-8049",

"Age" => 28

);

// Connect to the database

$c = new MongoClient();

// Select the collection 'people'

$collection = $c->contacts->people;

// Specify the w and wTimeoutMS options

$options = array("w" => 1, "wTimeoutMS" => 5000);

// Insert the document '$contact' into the people collection '$collection'

$collection->insert($contact,$options);

And that’s all there is to inserting data into your database with the PHP driver. For the most part, you will
probably be working on defining the array that contains the data, rather than on inserting it into the database.

Listing Your Data

Typically, you will use the find() function to query for data. It takes a parameter that you use to specify your

search criteria; once you specify your criteria, you execute find() to get the results. By default, the find()

function simply returns all documents in the collection. This is similar to the shell examples discussed in

Chapter 4. Most of the time, however, you will not want to do this. Instead, you will want to define specific

information for which to return results. The next sections will cover commonly used options and parameters

that you can use with the find() function to filter your results.

Returning a Single Document

Listing a single document is easy: simply executing the findOne() function without any parameters specified
will grab the first document it finds in the collection. The findOne() function returns that document
as an array and leaves it up to you to print it out, as in this example:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Find the very first document within the collection, and print it out

// using print_r

print_r($collection->findOne());

As noted previously, it’s easy to list a single document in a collection: all you need to do is call the
findOne() function itself. Naturally, you can use the findOne() function with additional filters. For instance, if
you know the last name of a person you’re looking for, you can pass it as a query to the findOne() function:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Define the last name of the person in the $lastname variable

$lastname = array("Last Name" => "Moran");

// Find the very first person in the collection with the last name "Moran"

print_r($collection->findOne($lastname));

Of course, many more options exist for filtering the data; you’ll learn more about these additional
options later in this chapter. First, let’s look at some sample output returned by using the print_r()
command (the example adds a few line breaks for the sake of making the output easier to read):

Array (

[_id] => MongoId Object ( )

[First Name] => Philip

[Last Name] => Moran

[Address] => Array (

[Street] => 681 Hinkle Lake Road

[Place] => Newton

[Postal Code] => MA 02160

[Country] => USA

)

[E-Mail] => Array (

[0] => pm@example.com

[1] => pm@office.com

[2] => philip@example.com

[3] => philip@office.com

[4] => moran@example.com

[5] => moran@office.com

[6] => pmoran@example.com

[7] => pmoran@office.com

)

[Phone] => 617-546-8428

[Age] => 60

)
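Because findOne() hands back a plain PHP array, individual fields can be read with ordinary array syntax. The following is a minimal sketch that uses a hard-coded copy of the sample output above in place of a live query (the $document literal is a stand-in, shortened to two e-mail addresses and without the _id field), so it runs without a database:

```php
<?php
// Stand-in for the array returned by $collection->findOne($lastname);
// the values are copied from the sample output above.
$document = array(
    "First Name" => "Philip",
    "Last Name"  => "Moran",
    "Address"    => array(
        "Street"      => "681 Hinkle Lake Road",
        "Place"       => "Newton",
        "Postal Code" => "MA 02160",
        "Country"     => "USA"
    ),
    "E-Mail" => array("pm@example.com", "pm@office.com"),
    "Age"    => 60
);

// Top-level fields are read by key
$fullName = $document["First Name"] . " " . $document["Last Name"];

// Embedded documents are nested arrays
$place = $document["Address"]["Place"];

// Array fields are numerically indexed lists
$firstEmail = $document["E-Mail"][0];

echo $fullName . " lives in " . $place . " (" . $firstEmail . ")\n";
```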

Listing All Documents

While you can use the findOne() function to list a single document, you will use the find() function for

pretty much everything else. Don’t misunderstand, please: you can find a single document with the find()

function by limiting your results; but if you are unsure about the number of documents to be returned, or if

you are expecting more than one document to be returned, then the find() function is your friend.

As detailed in the previous chapters, the find() function has many, many options that you can use

to filter your results to suit just about any circumstance you can imagine. We’ll start off with a few simple

examples and build from there.

First, let’s see how you can display all the documents in a certain collection using PHP and the find()

function. The only thing that you should be wary of when printing out multiple documents is that each

document is returned in an array, and that each array needs to be printed individually. You can do this using PHP’s while loop. As just indicated, you will need to instruct the loop to print each document before proceeding with the next one. The getNext() command gets the next document in the cursor from

MongoDB; this command effectively returns the next object in the cursor and advances the cursor. The

following snippet lists all the documents found in a collection:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Execute the query and store it under the $cursor variable

$cursor = $collection->find();

// For each document it finds within the collection, print the contents

while ($document = $cursor->getNext())

{

print_r($document);

}

■Note You can implement the syntax for the preceding example several different ways. For example, a more compact way to write the preceding query would look like this: $cursor = $c->contacts->people->find(). For the sake of clarity, however, code examples like this one will be split up into two lines in this chapter, leaving more room for comments.

At this stage, the resulting output would still show only two arrays, assuming you have added the

documents described previously in this chapter (and nothing else). If you were to add more documents,

then each document would be printed in its own array. Granted, this doesn’t look pretty; however, that’s

nothing you can’t fix with a little additional code.
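As a rough sketch of what such additional code might look like, the following turns each document into a single readable line. The formatContact() helper and the $documents array are hypothetical: the array stands in for the documents a cursor would yield, so the example runs without a database. With the driver, the same loop body would sit inside the while loop shown earlier.

```php
<?php
// Hypothetical helper: turn one contact document into a single readable line.
function formatContact(array $document)
{
    $name  = $document["First Name"] . " " . $document["Last Name"];
    $place = isset($document["Address"]["Place"])
        ? $document["Address"]["Place"]
        : "unknown";
    return $name . " (" . $place . ")";
}

// Stand-in for the documents a cursor would yield
$documents = array(
    array("First Name" => "Philip", "Last Name" => "Moran",
          "Address" => array("Place" => "Newton")),
    array("First Name" => "Victoria", "Last Name" => "Wood",
          "Address" => array("Place" => "Ystradgynlais"))
);

$lines = array();
foreach ($documents as $document) {
    $lines[] = formatContact($document);
}
echo implode("\n", $lines) . "\n";
```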

Using Query Operators

Whatever you can do in the MongoDB shell, you can also accomplish using the PHP driver. As you’ve seen

in the previous chapters, the shell includes dozens of options for filtering your results. For example, you

can use dot notation; sort or limit the results; skip, count, or group a number of items; or even use regular

expressions, among many other things. The following sections will walk you through how to use most of

these options with the PHP driver.

Querying for Specific Information

As you might remember from Chapter 4, you can use dot notation to query for specific information in an

embedded object in a document. For instance, if you want to find one of your contacts for which you know a

portion of the address details, you can use dot notation to find this, as in the following example:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Use dot notation to search for a document in which the place

// is set to "Newton"

$address = array("Address.Place" => "Newton");

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($address);

// For each document it finds within the collection, print the ID

// and its contents

while ($document = $cursor->getNext())

{

print_r($document);

}
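To make the meaning of a dotted key such as "Address.Place" more concrete, here is a simplified, purely local sketch of how such a path walks through a document’s nested arrays. The getByPath() helper is hypothetical and is not part of the driver; the server’s own matching does more (for instance, it also looks inside array fields), so treat this only as an illustration of the idea.

```php
<?php
// Hypothetical helper: resolve a dot-notation path such as "Address.Place"
// against a document array. Returns null when any part of the path is missing.
function getByPath(array $document, $path)
{
    $current = $document;
    foreach (explode(".", $path) as $key) {
        if (!is_array($current) || !array_key_exists($key, $current)) {
            return null;
        }
        $current = $current[$key];
    }
    return $current;
}

$document = array(
    "Last Name" => "Moran",
    "Address"   => array("Place" => "Newton", "Country" => "USA")
);

var_dump(getByPath($document, "Address.Place"));   // string(6) "Newton"
var_dump(getByPath($document, "Address.Region"));  // NULL
```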

In a similar fashion, you can search for information in a document’s array by specifying one of the

items in that array, such as an e-mail address. Because an e-mail address is (usually) unique, the findOne()

function will suffice in this example:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Define the e-mail address you want to search for under $email

$email = array("E-Mail" => "vw@example.com");

// Find the very first person in the collection matching the e-mail address

print_r($collection->findOne($email));

As expected, this example returns the first document that matches the e-mail address vw@example.com—the

address of Victoria Wood in this case. The document is returned in the form of an array:

Array (

[_id] => MongoId Object ( )

[First Name] => Victoria

[Last Name] => Wood

[Address] => Array (

[Street] => 50 Ash lane

[Place] => Ystradgynlais

[Postal Code] => SA9 6XS

[Country] => UK

)

[E-Mail] => Array (

[0] => vw@example.com

[1] => vw@office.com

)

[Phone] => 078-8727-8049

[Age] => 28

)

Sorting, Limiting, and Skipping Items

The MongoCursor class provides sort(), limit(), and skip() functions, which allow you to sort your results,

limit the total number of returned results, and skip a specific number of results, respectively. Let’s use the

PHP driver to examine each function and how it is used.

The cursor’s sort() function (not to be confused with PHP’s built-in sort() function for arrays) takes one array as a parameter. In that array, you can specify the field by which it should sort the documents. As when using the shell, you use the value 1 to sort the results in ascending order and -1 to sort the results in descending order. Note that you call these functions on an existing cursor (that is, on the object returned by a find() call) before you start reading results from it.

The following example sorts your contacts based on their age in ascending order:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Execute the query and store it under the $cursor variable

$cursor = $collection->find();

// Use the sort command to sort all results in $cursor, based on their age

$cursor->sort(array('Age' => 1));

// Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

You execute the limit() function on the actual cursor; this takes a stunning total of one parameter, which specifies the number of results you would like to have returned. The limit() command returns the first n items it finds in the collection that match your search criteria. The following example returns only one document (granted, you could use the findOne() function for this instead, but limit() does the job):

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Execute the query and store it under the $cursor variable

$cursor = $collection->find();

// Use the limit function to limit the number of results to 1

$cursor->limit(1);

//Print the result

while($document = $cursor->getNext())

{

print_r($document);

}

Finally, you can use the skip() function to skip the first n results that match your criteria. This function

also works on a cursor:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Execute the query and store it under the $cursor variable

$cursor = $collection->find();

// Use the skip function to skip the first result found

$cursor->skip(1);

// Print the result

while($document = $cursor->getNext())

{

print_r($document);

}
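The three functions are often combined to page through results. The sketch below shows one way to compute the values for skip() and limit() from a page number; the pageWindow() helper and its parameter names are assumptions made for illustration. The cursor calls appear only as comments, since they need a live connection; they rely on each of these cursor functions returning the cursor itself, which allows the calls to be chained.

```php
<?php
// Hypothetical helper: translate a 1-based page number into the values
// to pass to skip() and limit().
function pageWindow($page, $perPage)
{
    $page    = max(1, (int) $page);
    $perPage = max(1, (int) $perPage);
    return array(
        "skip"  => ($page - 1) * $perPage,
        "limit" => $perPage
    );
}

$window = pageWindow(3, 10);
print_r($window);

// With a live connection, the values would be applied to the cursor, e.g.:
// $cursor = $collection->find()
//     ->sort(array('Age' => 1))
//     ->skip($window["skip"])
//     ->limit($window["limit"]);
```

Sorting before skipping keeps the pages stable from one request to the next.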

Counting the Number of Matching Results

You can use the cursor’s count() function to count the number of documents matching your criteria; it returns that number as an integer. This function is part of the MongoCursor class and thus operates on the cursor (it is unrelated to PHP’s built-in count() function for arrays). The following example shows how to get a count of contacts in the collection for people who live in the United States:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search parameters

$country = array("Address.Country" => "USA");

// Execute the query and store under the $cursor variable for further processing

$cursor = $collection->find($country);

// Count the results and return the value

print_r($cursor->count());

This query prints 1, because only one of the sample contacts lives in the United States. Such counts can be useful for all sorts of operations, whether it’s counting

comments, the total number of registered users, or anything else.

Grouping Data with the Aggregation Framework

The aggregation framework is easily one of the more powerful features built into MongoDB, as it allows you to

calculate aggregated values without needing to use the Map/Reduce functionality. One of the most useful pipeline

operators the framework includes is the $group operator, which can loosely be compared to SQL’s GROUP

BY functionality. This operator allows you to calculate aggregate values based on a collection of documents.

For example, the aggregation function $max can be used to find and return a group’s highest value; the $min

function to find and return the lowest value, and $sum to calculate the total number of occurrences of a given value.

Let’s say that you want to get a list of all contacts in your collection, grouped by the country where they live.

The aggregation framework lets you do this easily. Let’s take a look at an example:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Execute the query and store it under the $result variable

$result = $collection->aggregate(array(

'$group' => array(

'_id' => '$Address.Country',

'total' => array('$sum' => 1)

)

));

// Print the results

print_r($result);

As you can see, the aggregate function accepts one or more arrays with pipeline operators (in this case,

the $group operator). Here, you can specify how the resulting output is returned, and any optional aggregation

functions to execute: in this case the $sum function. In this example a unique document is returned for

every unique country found, represented by the document’s _id field. Next, the total count of each country

is summarized using the $sum function and returned using the total field. Note that the $sum function is

represented by an array and given the value of 1, as you want every match to increase the total by 1.

You might wonder what the resulting output will look like. Here’s an example of the output, given that

there are two contacts living in the United Kingdom and one in the United States:

Array (
[result] => Array (
[0] => Array (
[_id] => UK
[total] => 2
)
[1] => Array (
[_id] => USA
[total] => 1
)
)
[ok] => 1
)

This example is but a simple one, but the aggregation framework is quite powerful indeed, as you will

see when we look into it more closely in Chapter 8.
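The grouped documents sit under the result key of the returned array, so a short loop is enough to turn them into something easier to work with. The following sketch uses a hard-coded copy of the sample output (the $result literal is a stand-in for the value returned by aggregate()) and builds a simple country-to-total lookup table:

```php
<?php
// Stand-in for the array returned by $collection->aggregate(...),
// copied from the sample output above.
$result = array(
    "result" => array(
        array("_id" => "UK",  "total" => 2),
        array("_id" => "USA", "total" => 1)
    ),
    "ok" => 1
);

// Build a country => total lookup table from the grouped documents
$totals = array();
foreach ($result["result"] as $group) {
    $totals[$group["_id"]] = $group["total"];
}

foreach ($totals as $country => $total) {
    echo $country . ": " . $total . "\n";
}
```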

Specifying the Index with Hint

You use the cursor’s hint() function to specify which index should be used when querying for data; doing so can

help you increase query performance in case the query planner isn’t able to consistently choose a good

index. Bear in mind, however, that using hint() can harm performance if the use of a poor index is forced.

For instance, assume you have thousands of contacts in your collection, and you generally search for a

person based on last name. In this case, it’s recommended that you create an index on the Last Name key in

the collection.

■Note The hint() example shown next will not return anything if an index is not created first.

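To use the hint() function, you must apply it to the cursor and pass it the same key specification that was used to create the index (here, a descending index on Last Name), as in the following example: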

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Execute the query and store it under the $cursor variable

$cursor = $collection->find(array("Last Name" => "Moran"));

// Use the hint function to specify which index to use

$cursor->hint(array("Last Name" => -1));

//Print the result

while($document = $cursor->getNext())

{

print_r($document);

}

■Note See Chapter 4 for more details on how to create an index. It is also possible to use the PHP driver’s

createIndex() function to create an index, as discussed there.

Refining Queries with Conditional Operators

You can use conditional operators to refine your queries. PHP comes with a nice set of default conditional

operators, such as < (less than), > (greater than), <= (less than or equal to), and >= (greater than or equal to).

Now for the bad news: you cannot use these operators with the PHP driver. Instead, you will need to use

MongoDB’s version of these operators. Fortunately, MongoDB itself comes with a vast set of conditional

operators (you can find more information about these operators in Chapter 4). You can use all of these

operators when querying for data through PHP, passing them on through the find() function.

While you can use all of these operators with the PHP driver, you must use specific syntax to do so; that

is, you must place them in an array and pass this array to the find() function. The following sections will

walk you through how to use several commonly used operators.
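One PHP-specific detail is worth keeping in mind when writing these arrays: operator names begin with a dollar sign, which PHP treats as the start of a variable name inside double-quoted strings. The short sketch below shows the two safe spellings; json_encode() is used only to display the array in a form that resembles the shell syntax, and is not something the driver requires.

```php
<?php
// Single quotes keep the dollar sign literal, so '$lt' is the operator name.
$cond = array('Age' => array('$lt' => 30));

// In double quotes the dollar sign must be escaped; otherwise PHP would
// try to substitute a variable named $lt.
$sameCond = array("Age" => array("\$lt" => 30));

// json_encode() is used here only to display the shell-style equivalent
$asJson = json_encode($cond);
echo $asJson . "\n";

// Both spellings produce the same array
var_dump($cond === $sameCond);
```

This is why the examples in this chapter write operator names in single quotes.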

Using the $lt, $gt, $lte, and $gte Operators

MongoDB’s $lt, $gt, $lte, and $gte operators allow you to perform the same actions as the <, >, <=, and >=

operators, respectively. These operators are useful in situations where you want to search for documents that

store integer values.

You can use the $lt (less than) operator to find any kind of data for which the integer value is less than n,

as shown in the following example:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('Age' => array('$lt' => 30));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

The resulting output shows only one result in the current documents: the contact information for

Victoria Wood, who happens to be younger than 30:

Array (

[_id] => MongoId Object ( )

[First Name] => Victoria

[Last Name] => Wood

[Address] => Array (

[Street] => 50 Ash lane

[Place] => Ystradgynlais

[Postal Code] => SA9 6XS

[Country] => UK

)

[E-Mail] => Array (

[0] => vw@example.com

[1] => vw@office.com

)

[Phone] => 078-8727-8049

[Age] => 28

)

Similarly, you can use the $gt operator to find any contacts who are older than 30. The following example does that by changing the $lt operator to $gt (greater than) instead:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('Age' => array('$gt' => 30));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

This will return the document for Philip Moran because he’s older than 30:

Array (

[_id] => MongoId Object ( )

[First Name] => Philip

[Last Name] => Moran

[Address] => Array (

[Street] => 681 Hinkle Lake Road

[Place] => Newton

[Postal Code] => MA 02160

[Country] => USA

)

[E-Mail] => Array (

[0] => pm@example.com

[1] => pm@office.com

[2] => philip@example.com

[3] => philip@office.com

[4] => moran@example.com

[5] => moran@office.com

[6] => pmoran@example.com

[7] => pmoran@office.com

)

[Phone] => 617-546-8428

[Age] => 60

)

You can use the $lte operator to specify that the value must either match exactly or be lower than the

value specified. Remember: $lt will find anyone who is younger than 30, but not anyone who is exactly 30.

The same goes for the $gte operator, which finds any value that is greater than or equal to the integer

specified. Now let’s look at a pair of examples.

The first example will return both items from your collection to your screen:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('Age' => array('$lte' => 60));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

The second example will display only one document because the collection only holds one contact who

is either 60 or older:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('Age' => array('$gte' => 60));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}
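These operators can also be combined on the same key to express a range. As a small sketch, the hypothetical between() helper below builds a condition that matches values from a lower bound (inclusive) up to an upper bound (exclusive); with the two sample contacts, the resulting condition would match Victoria Wood (28) but not Philip Moran (60). The find() call is shown only as a comment, because it needs a live connection.

```php
<?php
// Hypothetical helper: build a condition matching values that are
// greater than or equal to $min and less than $max.
function between($field, $min, $max)
{
    return array($field => array('$gte' => $min, '$lt' => $max));
}

$cond = between('Age', 28, 60);

// Display the condition in a shell-like form
echo json_encode($cond) . "\n";

// With a live connection:
// $cursor = $collection->find($cond);
```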

Finding Documents That Don’t Match a Value

You can use the $ne (not equals) operator to find any documents that don’t match the value specified in

the $ne operator. The syntax for this operator is straightforward. The next example will display any contact

whose age is not equal to 28:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('Age' => array('$ne' => 28));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

Matching Any of Multiple Values with $in

The $in operator lets you search for documents that match any of several possible values added to an array,

as in the following example:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('Address.Country' => array('$in' => array("USA","UK")));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

The resulting output shows the contact information for every person in the collection who lives in either the United States or the United Kingdom. Note that the list of possible values must be passed as an array; it cannot be typed in “just like that.”
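When the list of values comes from user input or from other array functions, it may be worth renumbering it before handing it to $in. The sketch below rests on an assumption: that the driver, much like json_encode(), treats a PHP array as a list only when its keys run from 0 upward without gaps, and otherwise as an embedded document. The inList() helper is hypothetical.

```php
<?php
// Hypothetical helper: build an $in condition from arbitrary input values.
function inList($field, array $values)
{
    // array_unique() keeps the original keys, which can leave gaps;
    // array_values() renumbers them so the result is a proper list.
    $values = array_values(array_unique($values));
    return array($field => array('$in' => $values));
}

// The duplicate "USA" is removed, which would otherwise leave keys 0 and 2
$cond = inList('Address.Country', array("USA", "USA", "UK"));

echo json_encode($cond) . "\n";
```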

Matching All Criteria in a Query with $all

Like the $in operator, the $all operator lets you compare against multiple values in an additional array. The difference is that the $all operator returns a document only if the queried field contains all of the items in that array. The following example shows how to conduct such a query:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('E-Mail' => array('$all' => array("vw@example.com","vw@office.com")));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}
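The condition that $all expresses can be mimicked locally to make its meaning concrete. The following sketch is only an illustration of the semantics for an array field, not how the server evaluates the query; the containsAll() helper is hypothetical.

```php
<?php
// Hypothetical helper: true when $field contains every value in $required,
// which is the condition $all expresses for an array field.
function containsAll(array $field, array $required)
{
    return count(array_diff($required, $field)) === 0;
}

$emails = array("vw@example.com", "vw@office.com");

// Every required address is present
var_dump(containsAll($emails, array("vw@example.com", "vw@office.com"))); // bool(true)

// One required address is missing
var_dump(containsAll($emails, array("vw@example.com", "pm@example.com"))); // bool(false)
```

By contrast, $in would accept the second case, because one matching value is enough for it.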

Searching for Multiple Expressions with $or

You can use the $or operator to specify multiple expressions, any one of which a document may satisfy to count as a match. The difference between $or and the $in operator is that $in only lets you list alternative values for a single key, whereas $or lets you specify complete key/value expressions, each with its own key. You can combine the $or operator with any other key/value combination. Let’s look at two examples.

The first example searches for and returns any document that contains either an Age key with the

integer value of 28 or an Address.Country key with the value of USA:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('$or' => array(

array("Age" => 28),

array("Address.Country" => "USA")

) );

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

The second example searches for and returns any document that has the Address.Country key set

to USA (mandatory), as well as a key/value set either to "Last Name" : "Moran" or to "E-Mail" :

"vw@example.com":

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array(

"Address.Country" => "USA",

'$or' => array(

array("Last Name" => "Moran"),

array("E-Mail" => "vw@example.com")

)

);

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

The $or operator allows you to conduct two searches at once and then combine the resulting output,

even if the searches have nothing in common.

Retrieving a Specified Number of Items with $slice

You can use the $slice projection operator to retrieve a specified number of items from an array in your

document. This function is similar to the skip() and limit() functions detailed previously in this chapter.

The difference is that the skip() and limit() functions work on full documents, whereas the $slice operator works on the items of an array inside a document.

The $slice projection operator is a great method for limiting the number of items per page (this is

generally known as paging). The next example shows how to limit the number of e-mail addresses returned

from one of the contacts specified earlier (Philip Moran); in this case, you only return the first three e-mail

addresses:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify our search operator

$query = array("Last Name" => "Moran");

// Create a new object from an array using the $slice operator

$cond = (object)array('E-Mail' => array('$slice' => 3));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($query, $cond);

// For each document it finds within the collection, print the contents

while ($document = $cursor->getNext())

{

print_r($document);

}

Similarly, you can get only the last three e-mail addresses in the list by making the integer negative, as

shown in the following example:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify our search operator

$query = array("Last Name" => "Moran");

// Specify the conditional operator

$cond = (object)array('E-Mail' => array('$slice' => -3));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($query, $cond);

// For each document it finds within the collection, print the contents

while ($document = $cursor->getNext())

{

print_r($document);

}

Or, you can skip the first two entries and limit the results to three:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify our search operator

$query = array("Last Name" => "Moran");

// Specify the conditional operator

$cond = (object)array('E-Mail' => array('$slice' => array(2, 3)));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($query, $cond);

// For each document it finds within the collection, print the contents

while ($document = $cursor->getNext())

{

print_r($document);

}

The $slice operator is a great method for limiting the number of items in an array; you’ll definitely

want to keep this operator in mind when programming with the MongoDB driver and PHP.
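If you are used to PHP’s array_slice() function, the three forms of $slice shown above map onto it fairly directly. The following local sketch applies array_slice() to a hard-coded copy of Philip Moran’s e-mail addresses to show which items each form would keep; it is an analogy for illustration, not a description of how the server implements the operator.

```php
<?php
// Philip Moran's e-mail addresses, as stored in the sample document
$emails = array(
    "pm@example.com", "pm@office.com",
    "philip@example.com", "philip@office.com",
    "moran@example.com", "moran@office.com",
    "pmoran@example.com", "pmoran@office.com"
);

$firstThree = array_slice($emails, 0, 3);  // like '$slice' => 3
$lastThree  = array_slice($emails, -3);    // like '$slice' => -3
$skipTwo    = array_slice($emails, 2, 3);  // like '$slice' => array(2, 3)

print_r($firstThree);
print_r($lastThree);
print_r($skipTwo);
```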

Determining Whether a Field Has a Value

You can use the $exists operator to return a result based on whether a field holds a value (regardless of the value

of this field). As illogical as this may sound, it’s actually very handy. For example, you can search for contacts

where the Age field has not been set yet; or you can search for contacts for whom you have a street name.

The following example returns any contacts that do not have an Age field set:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array('Age' => array('$exists' => false));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

Similarly, the next example returns any contacts that have the Street field set:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the conditional operator

$cond = array("Address.Street" => array('$exists' => true));

// Execute the query and store it under the $cursor variable

$cursor = $collection->find($cond);

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

Regular Expressions

Regular expressions are neat. You can use them for just about everything (except for making coffee,

perhaps); and they can greatly simplify your life when searching for data. The PHP driver comes with its own

class for regular expressions: the MongoRegex class. You can use this class to create regular expressions, and

then use them to find data.

The MongoRegex class knows six regular expression flags that you can use to query your data. You may

already be familiar with some of them:

• i: Triggers case insensitivity.

• m: Searches for content that is spread over multiple lines (line breaks).

• x: Allows your search to contain #comments.

• l: Specifies a locale.

• s: Also known as dotall, "." can be specified to match everything, including new lines.

• u: Matches Unicode.

Now let’s take a closer look at how to use regular expressions in PHP to search for data in your

collection. Obviously, this is best demonstrated with a simple example.

Let’s assume you want to search for a contact about whom you have very little information. For

example, you may vaguely recall the place where the person lives and that it contains something like

stradgynl in the middle somewhere. Regular expressions give you a simple yet elegant way to search for such

a person:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the regular expression

$regex = new MongoRegex("/stradgynl/i");

// Execute the query and store it under the $cursor variable

$cursor = $collection->find(array("Address.Place" => $regex));

//Print the results

while($document = $cursor->getNext())

{

print_r($document);

}

When creating a PHP application, you’ll typically want to search for specific data. In the preceding

example, you would probably replace the text ("stradgynl", in this case) with a $_POST variable.
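When the search text comes from a form, it is prudent to escape it first, so that characters such as "." or "*" are matched literally rather than interpreted as part of the pattern. A minimal sketch using PHP’s preg_quote() function follows; the containsPattern() helper and the "place" form field are hypothetical, and the MongoRegex call is shown only as a comment because it needs the driver.

```php
<?php
// Hypothetical helper: build a case-insensitive "contains" pattern from
// untrusted input, escaping any regular expression metacharacters.
function containsPattern($input)
{
    return "/" . preg_quote($input, "/") . "/i";
}

echo containsPattern("stradgynl") . "\n";  // /stradgynl/i
echo containsPattern("a.b") . "\n";        // /a\.b/i

// With the driver, the pattern string would be handed to MongoRegex:
// $regex = new MongoRegex(containsPattern($_POST["place"]));
```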

Modifying Data with PHP

If we lived in a world where all data remained static and humans never made any typos, we would never

need to update our documents. But the world is a little more flexible than that, and there are times when we

make mistakes that we’d like to correct.

For such situations, you can use a set of modifier functions in MongoDB to update (and therefore change) your existing data. You can do this in several ways. For example, you might use the update() function to change existing information in place, or the save() function to write a complete document back to the collection. The following sections look at a handful of these functions and the modifier operators that go with them, illustrating how to use them effectively.

Updating via update()

As detailed in Chapter 4, you use the update() function to perform most document updates. Like the version

of update() in the MongoDB shell, the update() function that comes with the PHP driver allows you to

use an assortment of modifier operators to update your documents quickly and easily. PHP’s version of the

update() function operates almost identically; nevertheless, using the PHP version successfully requires

a significantly different approach. The upcoming section will walk you through how to use the function

successfully with PHP.

PHP’s update() function takes a minimum of two parameters: the first describes the object(s) to

update, and the second describes the object you want to update the matching record(s) with. Additionally,

you can specify a third parameter for an expanded set of options.

The options parameter provides seven additional flags you can use with the update() function; this list

explains what they are and how to use them:

• upsert: If set to true, this Boolean option causes a new document to be created if the

search criteria are not matched.

• multiple: If set to true, this Boolean option causes all documents matching the

search criteria to be updated.

• fsync: If set to true, this Boolean option causes the data to be synced to disk before returning a success. If this option is set to true, then an acknowledged write is implied, overriding a w value of 0. It defaults to false.

• w: If set to 0, the update operation will not be acknowledged. When working with

replica sets, w can also be set to n, ensuring that the primary server acknowledges

the update operation when successfully replicated to n nodes. It can also be set

to 'majority'—a reserved string—to ensure that the majority of replica nodes

will acknowledge the update or to a specific tag, ensuring that those nodes tagged

will acknowledge the update. This option defaults to 1, acknowledging the update

operation.

• j: If set to true, this Boolean option will force the data to be written to the journal

before indicating the update was a success. It defaults to false.

• wTimeoutMS: Used to specify how long the server is to wait for receiving

acknowledgment (in milliseconds). It defaults to 10000.

• socketTimeoutMS: Used to specify how long the driver is to wait for socket communication (in milliseconds). It defaults to 30000.
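The options in the list above are passed together as one array. As a hedged illustration, the following array describes an update that should touch every matching document, must not create a new one, and waits at most five seconds for a majority of replica set members to acknowledge the write. The particular values are arbitrary choices for this sketch, and the "majority" setting only makes sense when you are connected to a replica set.

```php
<?php
// Example option values chosen for illustration only
$options = array(
    "upsert"     => false,       // do not create a document if nothing matches
    "multiple"   => true,        // update every matching document
    "w"          => "majority",  // wait for a majority of replica set members
    "wTimeoutMS" => 5000         // give up waiting after five seconds
);

print_r($options);

// With a live connection, the array is passed as the third parameter:
// $collection->update($criteria, $update, $options);
```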

Now let’s look at a common example that changes Victoria Wood’s first name to “Vicky” without using

any of the modifier operators (these will be discussed momentarily):

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Last Name" => "Wood");

// Specify the information to be changed

$update = array(

"First Name" => "Vicky",

"Last Name" => "Wood",

"Address" => array(

"Street" => "50 Ash lane",

"Place" => "Ystradgynlais",

"Postal Code" => "SA9 6XS",

"Country" => "UK"

)

,

"E-Mail" => array(

"vw@example.com",

"vw@office.com"

),

"Phone" => "078-8727-8049",

"Age" => 28

);

// Options

$options = array("upsert" => true);

// Perform the update

$collection->update($criteria,$update,$options);

// Show the result

print_r($collection->findOne($criteria));

The resulting output would look like this:

Array (

[_id] => MongoId Object ()

[First Name] => Vicky

[Last Name] => Wood

[Address] => Array (

[Street] => 50 Ash lane

[Place] => Ystradgynlais

[Postal Code] => SA9 6XS

[Country] => UK

)

[E-Mail] => Array (

[0] => vw@example.com

[1] => vw@office.com

)

[Phone] => 078-8727-8049

[Age] => 28

)

This is a lot of work just to change one value—not exactly what you’d want to be doing to make a living.

However, this is precisely what you would have to do if you didn’t use MongoDB’s update operators. Now let’s

look at how you can use these operators in PHP to make life easier and consume less time.

■Warning If you don’t specify any of the update operators when applying the change, the data in the

matching document(s) will be replaced by the information in the array. Generally, it’s best to use $set if you

want to change only one field.
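
One way to guard against accidentally replacing a document is to inspect the update array before sending it. The following is a minimal sketch of such a check in plain PHP. Note that usesUpdateOperators() is a hypothetical helper of our own and not part of the PHP driver:

```php
<?php
// Hypothetical helper: returns true only when every top-level key of the
// update array is an operator, that is, a string starting with a dollar sign.
function usesUpdateOperators(array $update)
{
    if (count($update) === 0) {
        return false;
    }
    foreach (array_keys($update) as $key) {
        if (!is_string($key) || strpos($key, '$') !== 0) {
            return false;
        }
    }
    return true;
}

// This array would replace the whole matching document
$replacement = array("First Name" => "Vicky", "Last Name" => "Wood");

// This array would change a single field only
$modifier = array('$set' => array("First Name" => "Vicky"));

var_dump(usesUpdateOperators($replacement)); // bool(false)
var_dump(usesUpdateOperators($modifier));    // bool(true)
```

Calling a check like this just before update() makes an unintended replacement easier to spot during development.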

Saving Time with Update Operators

The update operators are going to save you loads of typing. As you’ll probably agree, the preceding example

is just not practical to work with. Fortunately, the PHP driver lets you use MongoDB’s update operators

for quickly modifying your data, without going through the trouble of writing it out fully. The purpose of

each operator will be briefly summarized again, although you are probably familiar with most of them at this

point (you can find more information about all the update operators discussed in this section in Chapter 4).

However, the way you use them in PHP differs significantly, as do the options associated with them. We’ll

look at examples for each of these operators, so you can familiarize yourself with their syntax in PHP.

■Note None of the update operators that follow will include PHP code to review the changes made; rather,

the examples that follow only apply the changes. It’s suggested that you fire up the MongoDB shell alongside

the PHP code, so you can perform searches and confirm that the desired changes have been applied.

Alternatively, you can write additional PHP code to perform these checks.

Increasing the Value of a Specific Key with $inc

The $inc operator allows you to increase the value of a specific key by n, assuming that the key exists. If the

key does not exist, it will be created instead. The following example increases the age of each person younger

than 40 by three years:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Search for anyone that's younger than 40

$criteria = array("Age" => array('$lt' => 40));

// Use $inc to increase their age by 3 years

$update = array('$inc' => array('Age' => 3));

// Options: update every matching document, not just the first one

$options = array("multiple" => true);

// Perform the update

$collection->update($criteria,$update,$options);
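
Notice that the operator names in this example are written in single quotes. In a double-quoted PHP string, $inc would be parsed as a variable, so the dollar sign would have to be escaped. The following sketch, which runs without a database connection, shows that the two spellings are equivalent. It also uses PHP’s json_encode() function to print the arrays in the document notation you know from the MongoDB shell:

```php
<?php
// Single quotes keep the dollar sign literal; double quotes need a backslash
$single  = '$inc';
$escaped = "\$inc";
var_dump($single === $escaped); // bool(true)

// The same criteria and update arrays as in the preceding example
$criteria = array("Age" => array('$lt' => 40));
$update   = array('$inc' => array('Age' => 3));

// Print both arrays as JSON to compare them with the shell syntax
echo json_encode($criteria), "\n"; // {"Age":{"$lt":40}}
echo json_encode($update), "\n";   // {"$inc":{"Age":3}}
```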

Changing the Value of a Key with $set

The $set operator lets you change the value of a key while ignoring any other fields. As noted previously, this

would have been a much better choice for updating Victoria’s first name to "Vicky" in the earlier example.

The following example shows how to use the $set operator to change the contact’s name to "Vicky":

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Last Name" => "Wood");

// Specify the information to be changed

$update = array('$set' => array("First Name" => "Vicky"));

// Options

$options = array("upsert" => true);

// Perform the update

$collection->update($criteria,$update,$options);

You can also use $set to add a field for every occurrence found matching your query:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria using regular expressions

$criteria = array("E-Mail" => new MongoRegex("/@office.com/i"));

// Add "Category => Work" into every occurrence found

$update = array('$set' => array('Category' => 'Work'));

// Options

$options = array('upsert' => true, 'multiple' => true);

// Perform the update

$collection->update($criteria,$update,$options);

Deleting a Field with $unset

The $unset operator works similarly to the $set operator. The difference is that $unset lets you delete a

given field from a document. For instance, the following example removes the Phone field and its associated

data from the contact information for Victoria Wood:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Last Name" => "Wood");

// Specify the information to be removed

$update = array('$unset' => array("Phone" => 1));

// Perform the update

$collection->update($criteria,$update);

Renaming a Field with $rename

The $rename operator can be used to rename a field. This can be helpful when you’ve accidentally made a typo

or simply wish to change its name to a more accurate one. The operator looks for the given field name

in each matching document. Fields in subdocuments can be addressed using dot notation, but fields inside array elements cannot be renamed.

■Warning Be careful when using this operator. If the document already contains a field that has the new

name, that field will be deleted, after which the old field name will be renamed to the new one as specified.

Let’s look at an example where the First Name and Last Name fields will be renamed to Given Name

and Family Name, respectively, for Vicky Wood:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Last Name" => "Wood");

// Specify the information to be changed

$update = array('$rename' => array("First Name" => "Given Name", "Last Name" => "Family Name"));

// Perform the update

$collection->update($criteria,$update);

Changing the Value of a Key During Upsert with $setOnInsert

MongoDB’s $setOnInsert operator can be used to assign a specific value only in case the update function

performs an insert when using the upsert option. This might sound a bit confusing at first, but you

can think of this operator as a conditional statement that only sets the given value when upsert inserts a

document, rather than updates one. Let’s look at an example to clarify how this works. First, we’ll perform an

upsert that matches an existing document, thus ignoring the $setOnInsert criteria specified:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wood");

// Specify the information to be set on upsert-inserts only

$update = array('$setOnInsert' => array("Country" => "Unknown"));

// Specify the upsert options

$options = array("upsert" => true);

// Perform the update

$collection->update($criteria,$update,$options);

Next, let’s look at an example where an upsert performs an insert as the document does not yet exist.

Here you’ll find that the $setOnInsert criteria given will be successfully applied:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wallace");

// Specify the information to be set on upsert-inserts only

$update = array('$setOnInsert' => array("Country" => "Unknown"));

// Specify the upsert options

$options = array("upsert" => true);

// Perform the update

$collection->update($criteria,$update,$options);

This piece of code will search for any document where the Family Name field (remember we renamed it

previously) is set to "Wallace". If it’s not found, an upsert will be done, as a result of which the Country field

will be set to "Unknown", creating the following empty-looking document:

{

"_id" : ObjectId("1"),

"Country" : "Unknown",

"Family Name" : "Wallace"

}
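
If the difference between the two runs is still hard to picture, the following rough model in plain PHP may help. It mimics the behavior on an in-memory array rather than a real collection. The matches() and simulateUpsert() functions are hypothetical names of our own, and the model only covers simple equality criteria:

```php
<?php
// Returns true when every field in the criteria equals the document's value
function matches(array $document, array $criteria)
{
    foreach ($criteria as $field => $value) {
        if (!array_key_exists($field, $document) || $document[$field] !== $value) {
            return false;
        }
    }
    return true;
}

// Simplified model of an upsert whose update holds only $setOnInsert
function simulateUpsert(array $collection, array $criteria, array $update)
{
    foreach ($collection as $document) {
        if (matches($document, $criteria)) {
            // A match was found: $setOnInsert is ignored and nothing changes
            return $collection;
        }
    }
    // No match: insert the criteria fields plus the $setOnInsert fields
    $onInsert = isset($update['$setOnInsert']) ? $update['$setOnInsert'] : array();
    $collection[] = array_merge($criteria, $onInsert);
    return $collection;
}

$people = array(array("Family Name" => "Wood", "Country" => "UK"));
$update = array('$setOnInsert' => array("Country" => "Unknown"));

// First run matches Wood; second run finds no Wallace and inserts one
$people = simulateUpsert($people, array("Family Name" => "Wood"), $update);
$people = simulateUpsert($people, array("Family Name" => "Wallace"), $update);

print_r($people);
```

In this model, the existing contact keeps its Country value of "UK", whereas the newly inserted contact receives "Unknown".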

Appending a Value to a Specified Field with $push

MongoDB’s $push operator lets you append a value to a specified field. If the field is an existing array, the

data will be added; if the field does not exist, it will be created. If the field exists, but it is not an array, then

an error condition will be raised. The following example shows how to use $push to add some data into an

existing array:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wood");

// Specify the information to be added

$update = array('$push' => array("E-Mail" => "vw@mongo.db"));

// Perform the update

$collection->update($criteria,$update);

Adding Multiple Values to a Key with $push and $each

The $push operator also lets you append multiple values to a key. For this, the $each modifier needs to be

added. The values, presented in an array, will all be appended, even if they already exist within the given field.

As the $push operator is being used, the same general rules apply: if the field exists, and it is an array, then

the data will be added; if it does not exist, then it will be created; if it exists, but it isn’t an array, then an error

condition will be raised. The following example illustrates how to use the $each modifier:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wood");

// Specify the information to be added

$update = array(

'$push' => array(

"E-Mail" => array(

'$each' => array(

"vicwo@mongo.db",

"vicwo@example.com"

)

)

)

);

// Perform the update

$collection->update($criteria,$update);

Adding Data to an Array with $addToSet

The $addToSet operator is similar to the $push operator, with one important difference: $addToSet ensures

that data are added to an array only if they are not already present. The $addToSet operator takes one array as a

parameter:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wood");

// Specify the information to be added (successful because it doesn't exist yet)

$update = array('$addToSet' => array("E-Mail" => "vic@example.com"));

// Perform the update

$collection->update($criteria,$update);

Similarly, you can add a number of items that don’t exist by combining the $addToSet operator with the

$each operator:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wood");

// Specify the information to be added (partially successful; some

// examples were already there)

$update = array(

'$addToSet' => array

(

"E-Mail" => array

(

'$each' => array

(

"vw@mongo.db",

"vicky@mongo.db",

"vicky@example.com"

)

)

)

);

// Perform the update

$collection->update($criteria,$update);
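
The difference between $push and $addToSet, each combined with $each, can be pictured with ordinary PHP arrays. The following sketch is only a model of the two behaviors and does not use the driver; the sample addresses are made up for the illustration:

```php
<?php
$emails = array("vw@example.com", "vw@mongo.db");
$new    = array("vw@mongo.db", "vicky@mongo.db");

// $push with $each: every value is appended, duplicates included
$pushed = array_merge($emails, $new);

// $addToSet with $each: a value is appended only if it is not present yet
$added = $emails;
foreach ($new as $value) {
    if (!in_array($value, $added, true)) {
        $added[] = $value;
    }
}

echo count($pushed), "\n"; // 4
echo count($added), "\n";  // 3
```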

Removing an Element from an Array with $pop

MongoDB’s $pop operator lets you remove an element from an array. Keep in mind that you can remove only

the first or last element in the array—and nothing in between. You remove the first element by specifying a

value of -1; similarly, you remove the last element by specifying a value of 1:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wood");

// Pop out the first e-mail address found in the list

$update = array('$pop' => array("E-Mail" => -1));

// Perform the update

$collection->update($criteria,$update);

■Note Specifying a value of -2 or 1000 wouldn’t change which element is removed. Any negative number

will remove the first element, whereas any positive number removes the last element. Using a value of 0

removes the last element from the array.
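
The rule described in this note can be modeled in a few lines of plain PHP. The simulatePop() function is a hypothetical name of our own; it merely mirrors the behavior described above on a regular array:

```php
<?php
// Model of $pop as described in the note: a negative value removes the
// first element, any other value removes the last one.
function simulatePop(array $values, $direction)
{
    if ($direction < 0) {
        array_shift($values);
    } else {
        array_pop($values);
    }
    return $values;
}

$emails = array("a@example.com", "b@example.com", "c@example.com");

print_r(simulatePop($emails, -1));   // b and c remain
print_r(simulatePop($emails, 1));    // a and b remain
print_r(simulatePop($emails, 1000)); // a and b remain
```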

Removing Each Occurrence of a Value with $pull

You can use MongoDB’s $pull operator to remove each occurrence of a given value from an array. For

example, this is handy if you’ve accidentally added duplicates to an array when using $push, with or without $each.

The following example removes every occurrence of a given e-mail address:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wood");

// Pull out each occurrence of the e-mail address "vicky@example.com"

$update = array('$pull' => array("E-Mail" => "vicky@example.com"));

// Perform the update

$collection->update($criteria,$update);

Removing Each Occurrence of Multiple Elements with $pullAll

Similarly, you can use the $pullAll operator to remove each occurrence of multiple elements from your

documents, as shown in the following example:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Wood");

// Pull out each occurrence of the e-mail addresses below

$update = array(

'$pullAll' => array(

"E-Mail" => array("vw@mongo.db","vw@office.com")

)

);

// Perform the update

$collection->update($criteria,$update);
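
To get a feel for what $pull and $pullAll leave behind, you can model both on a regular PHP array. The following sketch uses array_diff() for this purpose; it is only an illustration with made-up sample data and does not involve the driver:

```php
<?php
$emails = array(
    "vw@example.com",
    "vicky@example.com",
    "vw@mongo.db",
    "vicky@example.com",
    "vw@office.com"
);

// $pull: drop every occurrence of a single value
$afterPull = array_values(array_diff($emails, array("vicky@example.com")));

// $pullAll: drop every occurrence of each listed value
$afterPullAll = array_values(
    array_diff($emails, array("vw@mongo.db", "vw@office.com"))
);

print_r($afterPull);
print_r($afterPullAll);
```

Here array_values() is used only to renumber the remaining elements, because array_diff() preserves the original keys.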

Upserting Data with save()

Like the insert() function, the save() function allows you to insert data into your collection. The only

difference is that you can also use save() to update a document that already exists. As you might recall, this

is called an upsert. The way you execute the save() function shouldn’t come as a surprise at this point. Like

the save() function in the MongoDB shell, PHP’s save() takes two parameters: an array that contains the

information you wish to save and any options for the save. The following options can be used:

• fsync: If set to true, this Boolean option causes the data to be synced to disk before

returning a success. If this option is set to true, then an acknowledged write is implied,

overriding w even if w is set to 0.

• w: If set to 0, the save operation will not be acknowledged. When working with

replica sets, w can also be set to n, ensuring that the primary server acknowledges

the save operation when successfully replicated to n nodes. It can also be set to

'majority'—a reserved string—to ensure that the majority of replica nodes will

acknowledge the save, or to a specific tag, ensuring that those nodes tagged will

acknowledge the save. This option defaults to 1, acknowledging the save operation.

• j: If set to true, this Boolean option will force the data to be written to the journal

before indicating the save was a success. It defaults to false.

• wTimeoutMS: Used to specify how long the server is to wait for receiving

acknowledgment (in milliseconds). It defaults to 10000.

• socketTimeoutMS: Used to specify how long to wait for socket communication to the

server. It defaults to 30000.

The syntax for PHP’s save() version is similar to that in the MongoDB shell, as the following example

illustrates:

// Specify the document to be saved

$contact = array(

"Given Name" => "Kenji",

"Family Name" => "Kitahara",

"Address" => array(

"Street" => "149 Bartlett Avenue",

"Place" => "Southfield",

"Postal Code" => "MI 48075",

"Country" => "USA"

)

,

"E-Mail" => array(

"kk@example.com",

"kk@office.com"

),

"Phone" => "248-510-1562",

"Age" => 34

);

// Connect to the database

$c = new MongoClient();

// Select the collection 'people'

$collection = $c->contacts->people;

// Specify the save() options

$options = array("fsync" => true);

// Save via the save() function

$collection->save($contact,$options);

// Realizing you forgot something, let's upsert this contact:

$contact['Category'] = 'Work';

// Perform the upsert

$collection->save($contact);

Modifying a Document Atomically

Like the save() and update() functions, the findAndModify() function can be invoked from the PHP driver.

Remember that you can use the findAndModify() function to modify a document atomically and return

the results after the update executes successfully. You use the findAndModify() function to update a single

document—and nothing more. You may recall that, by default, the document returned will not show the

modifications made—returning the document with the modifications made would require specifying an

additional argument: the new parameter.

The findAndModify() function takes four parameters: query, update, fields, and options. Some of these

are optional, depending on your actions. For example, when specifying the update criteria, the fields and

options are optional. However, when you wish to use the remove option, the update and fields parameters

need to be specified (using null, for example). The following list details the available parameters:

• query: Specifies a filter for the query. If this parameter isn’t specified, then all

documents in the collection will be seen as possible candidates, and the first

document encountered will be updated or removed.

• update: Specifies the information to update the document. Note that any of the

modifier operators specified previously can be used to accomplish this.

• fields: Specifies the fields you would like to see returned, rather than the entire

document. This parameter behaves identically to the fields parameter in the

find() function. Note that the _id field will always be returned, even if that field isn’t

included in your list of fields to return.

• options: Specifies the options to apply. The following options can be used:

• sort: Sorts the matching documents in a specified order.

• remove: If set to true, the first matching document will be removed.

• update: If set to true, an update will be performed on the selected document.

• new: If set to true, returns the updated document, rather than the selected

document. Note that this parameter is not set by default, which might be a bit

confusing in some circumstances.

• upsert: If set to true, performs an upsert.
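
The effect of the new option can be pictured with a small model in plain PHP. The simulateFindAndModify() function below is purely illustrative and only mimics a $push on the E-Mail field of an in-memory array; it says nothing about how the driver is implemented:

```php
<?php
// Illustrative model: appends an e-mail address to the stored document and
// returns either the old or the new version, depending on $returnNew.
function simulateFindAndModify(array &$stored, $email, $returnNew)
{
    $before = $stored;             // arrays are copied by value in PHP
    $stored["E-Mail"][] = $email;  // the modification is always applied
    return $returnNew ? $stored : $before;
}

$contact = array(
    "Family Name" => "Kitahara",
    "E-Mail"      => array("kk@example.com")
);

$returned = simulateFindAndModify($contact, "kitahara@mongo.db", false);
echo count($returned["E-Mail"]), "\n"; // 1: the document before the update
echo count($contact["E-Mail"]), "\n";  // 2: the stored document did change

$returned = simulateFindAndModify($contact, "kk@office.com", true);
echo count($returned["E-Mail"]), "\n"; // 3: the document after the update
```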

Now let’s look at a set of examples that illustrate how to use these parameters. The first example

searches for a contact with the family name "Kitahara" and adds an e-mail address to his contact card by

passing an update document that uses the $push operator. The new option is not set in the following example, so

the resulting output still displays the old information:

// Connect to the database

$c = new MongoClient();

// Specify the database and collection in which to work

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Family Name" => "Kitahara");

// Specify the update criteria

$update = array('$push' => array("E-Mail" => "kitahara@mongo.db"));

// Perform a findAndModify()

$collection->findAndModify($criteria,$update);

The result returned looks like this:

Array (

[value] => Array (

[Given Name] => Kenji

[Family Name] => Kitahara

[Address] => Array (

[Street] => 149 Bartlett Avenue

[Place] => Southfield

[Postal Code] => MI 48075

[Country] => USA

)

[E-Mail] => Array (

[0] => kk@example.com

[1] => kk@office.com

)

[Phone] => 248-510-1562

[Age] => 34

[_id] => MongoId Object ( )

[Category] => Work

)

[ok] => 1

)

The following example shows how to use the remove and sort parameters:

// Connect to the database

$c = new MongoClient();

// Specify the database and collection in which to work

$collection = $c->contacts->people;

// Specify the search criteria

$criteria = array("Category" => "Work");

// Specify the options

$options = array("sort" => array("Age" => -1), "remove" => true);

// Perform a findAndModify()

$collection->findAndModify($criteria,null,null,$options);

Processing Data in Bulk

The MongoDB PHP driver also allows you to perform multiple write operations in bulk. Similar to how this is

done on the MongoDB shell, you will first need to define your dataset as well as write options before writing

it all in a single go using the execute() command. Bulk write operations are limited to a single collection

only, and can be used to insert, update, or remove data using the MongoInsertBatch, MongoUpdateBatch, or

MongoDeleteBatch class, respectively.

Before you can write your data in bulk, you will first need to define your connectivity details, the type of

batch operation to be executed, and the dataset—or array—that will hold your data. For example, if you wish

to batch insert a set of documents, you will use the MongoInsertBatch class to create a new instance of the

class as follows:

// Connect to the database

$c = new MongoClient();

// Select the collection 'people' from the database 'contacts'

$collection = $c->contacts->people;

// Create the batch object that will hold the bulk insert operations

$bulk = new MongoInsertBatch($collection);

Next, you will need to define your dataset to be inserted. Here, you can create a single array to store

all your to-be-inserted documents in, prior to inserting each document into the previously created

MongoInsertBatch instance, like so:

// Initialize the data set to be inserted

$data = array();

// Add your data

$data[] = array(

"First Name" => "Nick",

"Last Name" => "Scheffer",

"E-Mail" => array(

"nick@example.com",

"nick@domain.com"

),

);

$data[] = array(

"First Name" => "Max",

"Last Name" => "Scheffer",

"E-Mail" => array(

"max@example.com",

"max@domain.com"

),

);

When your dataset has been defined you will need to iterate over it using the foreach() command to

add each of the documents to the previously created MongoInsertBatch instance—called $bulk—using the

add() function. Let’s look at an example:

// Insert each document defined in the dataset '$data'

foreach($data as $document) {

$bulk->add($document);

}

Now that your dataset has been filled and your bulk operation has been defined, you are but one step

away from executing it all in a single go.

■Note Your bulk instance can contain a maximum of 1000 documents, or up to 16777216 bytes of data, by

default. MongoDB will automatically split and process your list into separate groups of 1000 operations or less

when your list exceeds this.
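
To picture the splitting described in this note, you can divide a list in the same way using PHP’s array_chunk() function. This is only an illustration of the grouping, as the splitting itself is done for you; the figure of 2,500 documents is an arbitrary example:

```php
<?php
// Build a hypothetical list of 2,500 small documents
$documents = array();
for ($i = 0; $i < 2500; $i++) {
    $documents[] = array("n" => $i);
}

// Split the list into groups of at most 1,000 documents each
$groups = array_chunk($documents, 1000);

echo count($groups), "\n";    // 3
echo count($groups[0]), "\n"; // 1000
echo count($groups[2]), "\n"; // 500
```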

Executing Bulk Operations

Before executing your previously defined bulk operation, you may first want to specify the operation’s

write options. These write options are similar to the ones previously discussed, with the exception of the

ordered write option, used to tell MongoDB how the data are to be written: ordered or unordered. When

executing the operation in an ordered fashion, MongoDB will go over the list of operations serially. That

is, were an error to occur while processing one of the write operations, the remaining operations would

not be processed. In contrast, using an unordered write operation, MongoDB will execute the operations

in a parallel manner. Were an error to occur during one of the writing operations here, MongoDB would

continue to process the remaining write operations. The complete list of write options for bulk operations

is given below:

• continueOnError: If set to true, bulk inserts would continue to be processed even if

one fails. It defaults to false.

• w: If set to 0, the write operation will not be acknowledged. When working with

replica sets, w can also be set to n, ensuring the primary server acknowledges

the write operation when successfully replicated to n nodes. It can also be set

to 'majority' (a reserved string) to ensure that the majority of replica nodes

will acknowledge the write, or to a specific tag, ensuring those nodes tagged will

acknowledge the write. This option defaults to 1, acknowledging the write operation.

• wTimeoutMS: Used to specify how long the server is to wait for receiving

acknowledgment (in milliseconds). It defaults to 10000.

• socketTimeoutMS: Used to specify how long to wait for socket communication to the

server. It defaults to 30000.

• ordered: Used to determine if MongoDB should process this batch sequentially—

one item at a time—or if it can rearrange the operations. It defaults to true.

• fsync: If set to true, this Boolean option causes the data to be synced to disk before

returning a success. If this option is set to true, then an acknowledged write is implied,

overriding w even if w is set to 0.

• j: If set to true, this Boolean option will force the data to be written to the journal

before indicating the save was a success. It defaults to false.
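
The practical difference between an ordered and an unordered batch is what happens after the first error. The following sketch models this in plain PHP. The simulateBatch() function and the valid flag are hypothetical, as is the errors counter; only nInserted is a key the driver actually reports. The model walks the list one item at a time in both cases, so it illustrates the error handling only and not the parallel execution:

```php
<?php
// Illustrative model: each operation is flagged as valid or not
function simulateBatch(array $operations, $ordered)
{
    $result = array("nInserted" => 0, "errors" => 0);
    foreach ($operations as $operation) {
        if ($operation["valid"]) {
            $result["nInserted"]++;
            continue;
        }
        $result["errors"]++;
        if ($ordered) {
            break; // ordered: stop at the first error
        }
    }
    return $result;
}

// The second of three operations fails
$operations = array(
    array("valid" => true),
    array("valid" => false),
    array("valid" => true)
);

print_r(simulateBatch($operations, true));  // nInserted is 1
print_r(simulateBatch($operations, false)); // nInserted is 2
```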

Having determined your write options, you can finally execute the bulk operation using the execute()

command on the previously created $bulk instance, providing the write options as a parameter:

// Specify the write options

$options = array("w" => 1);

// Execute the batch operation

$result = $bulk->execute($options);

Evaluating the Output

If you wish to review the output of the bulk operations executed to ensure all write operations went well, you

may print the output generated by the execute() function and stored in the $result variable using PHP’s

var_dump() function in your PHP document:

// Return the results

var_dump($result);

If both documents were inserted properly, your output will look as follows:

array(2) {

["nInserted"]=>

int(2)

["ok"]=>

bool(true)

}

Here, the nInserted key will report the number of documents inserted (two). Similarly, nModified and

nRemoved will report the number of documents changed or removed when using the MongoUpdateBatch

and MongoDeleteBatch classes, respectively. Finally, the "ok" key will tell you if the operations executed

successfully.

Bulk operations can be extremely useful for processing a large set of data in a single go, because

nothing is written to the collection until execute() is called.

Deleting Data

You can use the remove() function to remove a document like the one in the preceding example from the

MongoDB shell. The PHP driver also includes a remove() function you can use to remove data. The PHP

version of this function takes two parameters: one contains the description of the record or records to

remove, while the other specifies the additional write options governing the removal process.

There are five options available:

• justOne: If set to true, at most one record matching the criteria will be

removed.