The Definitive Guide to MongoDB
Page Count: 361
- Contents at a Glance
- Contents
- About the Authors
- About the Technical Reviewer
- About the Contributor
- Acknowledgments
- Introduction
- Chapter 1: Introduction to MongoDB
- Reviewing the MongoDB Philosophy
- Fitting Everything Together
- Reviewing the Feature List
- WiredTiger
- Using Document-Oriented Storage (BSON)
- Supporting Dynamic Queries
- Indexing Your Documents
- Leveraging Geospatial Indexes
- Profiling Queries
- Updating Information In Place (Memory Mapped Database Only)
- Storing Binary Data
- Replicating Data
- Implementing Sharding
- Using Map and Reduce Functions
- The Aggregation Framework
- Getting Help
- Summary
- Chapter 2: Installing MongoDB
- Chapter 3: The Data Model
- Chapter 4: Working with Data
- Navigating Your Databases
- Inserting Data into Collections
- Querying for Data
- Using the Dot Notation
- Using the Sort, Limit, and Skip Functions
- Working with Capped Collections, Natural Order, and $natural
- Retrieving a Single Document
- Using the Aggregation Commands
- Working with Conditional Operators
- Performing Greater-Than and Less-Than Comparisons
- Retrieving All Documents but Those Specified
- Specifying an Array of Matches
- Finding a Value Not in an Array
- Matching All Attributes in a Document
- Searching for Multiple Expressions in a Document
- Retrieving a Document with $slice
- Searching for Odd/Even Integers
- Filtering Results with $size
- Returning a Specific Field Object
- Matching Results Based on the BSON Type
- Matching an Entire Array
- Using the $not Metaoperator
- Specifying Additional Query Expressions
- Leveraging Regular Expressions
- Updating Data
- Processing Data in Bulk
- Renaming a Collection
- Deleting Data
- Referencing a Database
- Implementing Index-Related Functions
- Summary
- Chapter 5: GridFS
- Chapter 6: PHP and MongoDB
- Comparing Documents in MongoDB and PHP
- MongoDB Classes
- Listing Your Data
- Using Query Operators
- Modifying Data with PHP
- Updating via update()
- Saving Time with Update Operators
- Increasing the Value of a Specific Key with $inc
- Changing the Value of a Key with $set
- Deleting a Field with $unset
- Renaming a Field with $rename
- Changing the Value of a Key During Upsert with $setOnInsert
- Appending a Value to a Specified Field with $push
- Adding Multiple Values to a Key with $push and $each
- Adding Data to an Array with $addToSet
- Removing an Element from an Array with $pop
- Removing Each Occurrence of a Value with $pull
- Removing Each Occurrence of Multiple Elements with $pullAll
- Upserting Data with save()
- Modifying a Document Atomically
- Processing Data in Bulk
- Deleting Data
- DBRef
- GridFS and the PHP Driver
- Summary
- Chapter 7: Python and MongoDB
- Working with Documents in Python
- Using PyMongo Modules
- Connecting and Disconnecting
- Inserting Data
- Finding Your Data
- Finding a Single Document
- Finding Multiple Documents
- Using Dot Notation
- Returning Fields
- Simplifying Queries with sort(), limit(), and skip()
- Aggregating Queries
- Specifying an Index with hint()
- Refining Queries with Conditional Operators
- Using the $lt, $gt, $lte, and $gte Operators
- Searching for Nonmatching Values with $ne
- Specifying an Array of Matches with $in
- Specifying Against an Array of Matches with $nin
- Finding Documents That Match an Array’s Values
- Specifying Multiple Expressions to Match with $or
- Retrieving Items from an Array with $slice
- Conducting Searches with Regular Expressions
- Modifying the Data
- Updating Your Data
- Modifier Operators
- Increasing an Integer Value with $inc
- Changing an Existing Value with $set
- Removing a Key/Value Field with $unset
- Adding a Value to an Array with $push
- Adding Multiple Values to an Array with $push and $each
- Adding a Value to an Existing Array with $addToSet
- Removing an Element from an Array with $pop
- Removing a Specific Value with $pull
- Replacing Documents with replace_one()
- Modifying a Document Atomically
- Putting the Parameters to Work
- Processing Data in Bulk
- Deleting Data
- Creating a Link Between Two Documents
- Summary
- Chapter 8: Advanced Queries
- Chapter 9: Database Administration
- Using Administrative Tools
- Backing Up the MongoDB Server
- Digging Deeper into Backups
- Restoring Individual Databases or Collections
- Automating Backups
- Backing Up Large Databases
- Importing Data into MongoDB
- Exporting Data from MongoDB
- Securing Your Data by Restricting Access to a MongoDB Server
- Protecting Your Server with Authentication
- Managing Servers
- Using MongoDB Log Files
- Validating and Repairing Your Data
- Upgrading MongoDB
- Monitoring MongoDB
- Using MongoDB Cloud Manager
- Summary
- Chapter 10: Optimization
- Optimizing Your Server Hardware for Performance
- Understanding MongoDB’s Storage Engines
- Understanding MongoDB Memory Use Under MMAPv1
- Understanding MongoDB Memory Use Under WiredTiger
- Evaluating Query Performance
- Managing Indexes
- Three-Step Compound Indexes By A. Jesse Jiryu Davis
- Specifying Index Options
- Using hint( ) to Force Using a Specific Index
- Using Index Filters
- Optimizing the Storage of Small Objects
- Summary
- Chapter 11: Replication
- Spelling Out MongoDB’s Replication Goals
- Replication Fundamentals
- Drilling Down on the Oplog
- Implementing a Replica Set
- Read Concern
- Summary
- Chapter 12: Sharding
- Index
www.apress.com
Hows · Membrey · Plugge · Hawkins The Definitive Guide to MongoDB
The Definitive
Guide to
MongoDB
A complete guide to dealing with Big Data
using MongoDB
—
Third Edition
—
David Hows
Peter Membrey
Eelco Plugge
Tim Hawkins
BOOKS FOR PROFESSIONALS BY PROFESSIONALS® THE EXPERT’S VOICE® IN OPEN SOURCE
The Definitive Guide to MongoDB, Third Edition, is updated for MongoDB 3 and includes all of
the latest MongoDB features, including the aggregation framework introduced in version 2.2,
the hashed indexes introduced in version 2.4, and WiredTiger from 3.2. The Third Edition also
now includes Node.js along with Python.
MongoDB is the most popular of the “Big Data” NoSQL database technologies, and it’s still
growing. David Hows from 10gen, along with experienced MongoDB authors
Peter Membrey and Eelco Plugge, provide their expertise and experience in teaching you
everything you need to know to become a MongoDB pro.
• Set up MongoDB on all major server platforms, including Windows, Linux,
OS X, and cloud platforms like Rackspace, Azure, and Amazon EC2
• Work with GridFS and the new aggregation framework
• Work with your data using non-SQL commands
• Write applications using either Node.js or Python
• Optimize MongoDB
• Master MongoDB administration, including replication, replication tagging,
and tag-aware sharding
ISBN 978-1-4842-1183-0
Shelve in:
Databases/General
User level:
Beginning–Advanced
The Definitive Guide to MongoDB: A complete guide to dealing with Big Data using MongoDB
Copyright © 2015 by David Hows, Peter Membrey, Eelco Plugge, Tim Hawkins
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with
reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed
on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or
parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its
current version, and permission for use must always be obtained from Springer. Permissions for use may be
obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under
the respective Copyright Law.
ISBN-13 (pbk): 978-1-4842-1183-0
ISBN-13 (electronic): 978-1-4842-1182-3
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director: Welmoed Spahr
Lead Editor: Michelle Lowman
Technical Reviewer: Stephen Steneker
Editorial Board: Steve Anglin, Louise Corrigan, Jonathan Gennick, Robert Hutchinson,
Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper,
Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing
Coordinating Editor: Mark Powers
Copy Editor: Mary Bearden
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Distributed to the book trade worldwide by Springer Science+Business Media New York,
233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail
orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC
and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM
Finance Inc is a Delaware corporation.
For information on translations, please e-mail rights@apress.com, or visit www.apress.com.
Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use.
eBook versions and licenses are also available for most titles. For more information, reference our Special
Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this text is available to readers
at www.apress.com/9781484211830. For detailed information about how to locate your book’s source
code, go to www.apress.com/source-code/. Readers can also access source code at SpringerLink in the
Supplementary Material section for each chapter.
For Dr. Rocky Chan, for going the extra mile and always being there when I need him.
I hope one day I can properly thank him for his support.
—Peter Membrey
To my uncle, Luut, who introduced me to the vast and
ever-challenging world of IT. Thank you.
—Eelco Plugge
Contents at a Glance
About the Authors
About the Technical Reviewer
About the Contributor
Acknowledgments
Introduction
■Chapter 1: Introduction to MongoDB
■Chapter 2: Installing MongoDB
■Chapter 3: The Data Model
■Chapter 4: Working with Data
■Chapter 5: GridFS
■Chapter 6: PHP and MongoDB
■Chapter 7: Python and MongoDB
■Chapter 8: Advanced Queries
■Chapter 9: Database Administration
■Chapter 10: Optimization
■Chapter 11: Replication
■Chapter 12: Sharding
Index
Contents
About the Authors
About the Technical Reviewer
About the Contributor
Acknowledgments
Introduction
■Chapter 1: Introduction to MongoDB
Reviewing the MongoDB Philosophy
Using the Right Tool for the Right Job
Lacking Innate Support for Transactions
JSON and MongoDB
Adopting a Nonrelational Approach
Opting for Performance vs. Features
Running the Database Anywhere
Fitting Everything Together
Generating or Creating a Key
Using Keys and Values
Implementing Collections
Understanding Databases
Reviewing the Feature List
WiredTiger
Using Document-Oriented Storage (BSON)
Supporting Dynamic Queries
Indexing Your Documents
Leveraging Geospatial Indexes
Profiling Queries
Updating Information In Place (Memory Mapped Database Only)
Storing Binary Data
Replicating Data
Implementing Sharding
Using Map and Reduce Functions
The Aggregation Framework
Getting Help
Visiting the Website
Cutting and Pasting MongoDB Code
Finding Solutions on Google Groups
Finding Solutions on Stack Overflow
Leveraging the JIRA Tracking System
Chatting with the MongoDB Developers
Summary
■Chapter 2: Installing MongoDB
Choosing Your Version
Understanding the Version Numbers
Installing MongoDB on Your System
Installing MongoDB under Linux
Installing MongoDB under Windows
Running MongoDB
Prerequisites
Surveying the Installation Layout
Using the MongoDB Shell
Installing Additional Drivers
Installing the PHP Driver
Confirming That Your PHP Installation Works
Installing the Python Driver
Confirming That Your PyMongo Installation Works
Summary
■Chapter 3: The Data Model
Designing the Database
Drilling Down on Collections
Using Documents
Creating the _id Field
Building Indexes
Impacting Performance with Indexes
Implementing Geospatial Indexing
Querying Geospatial Information
Pluggable Storage Engines
Using MongoDB in the Real World
Summary
■Chapter 4: Working with Data
Navigating Your Databases
Viewing Available Databases and Collections
Inserting Data into Collections
Querying for Data
Using the Dot Notation
Using the Sort, Limit, and Skip Functions
Working with Capped Collections, Natural Order, and $natural
Retrieving a Single Document
Using the Aggregation Commands
Working with Conditional Operators
Leveraging Regular Expressions
Updating Data
Updating with update()
Implementing an Upsert with the save() Command
Updating Information Automatically
Removing Elements from an Array
Specifying the Position of a Matched Array
Atomic Operations
Modifying and Returning a Document Atomically
Processing Data in Bulk
Executing Bulk Operations
Evaluating the Output
Renaming a Collection
Deleting Data
Referencing a Database
Referencing Data Manually
Referencing Data with DBRef
Implementing Index-Related Functions
Surveying Index-Related Commands
Summary
■Chapter 5: GridFS
Filling in Some Background
Working with GridFS
Getting Started with the Command-Line Tools
Using the _id Key
Working with Filenames
The File’s Length
Working with Chunk Sizes
Tracking the Upload Date
Hashing Your Files
Looking Under MongoDB’s Hood
Using the search Command
Deleting
Retrieving Files from MongoDB
Summing Up mongofiles
Exploiting the Power of Python
Connecting to the Database
Accessing the Words
Putting Files into MongoDB
Retrieving Files from GridFS
Deleting Files
Summary
■Chapter 6: PHP and MongoDB
Comparing Documents in MongoDB and PHP
MongoDB Classes
Connecting and Disconnecting
Inserting Data
Listing Your Data
Returning a Single Document
Listing All Documents
Using Query Operators
Querying for Specific Information
Sorting, Limiting, and Skipping Items
Counting the Number of Matching Results
Grouping Data with the Aggregation Framework
Specifying the Index with Hint
Refining Queries with Conditional Operators
Determining Whether a Field Has a Value
Regular Expressions
Modifying Data with PHP
Updating via update()
Saving Time with Update Operators
Upserting Data with save()
Modifying a Document Atomically
Processing Data in Bulk
Executing Bulk Operations
Evaluating the Output
Deleting Data
DBRef
Retrieving the Information
GridFS and the PHP Driver
Storing Files
Adding More Metadata to Stored Files
Retrieving Files
Deleting Data
Summary
Chapter 7: Python and MongoDB
Working with Documents in Python
Using PyMongo Modules
Connecting and Disconnecting
Inserting Data
Finding Your Data
Finding a Single Document
Finding Multiple Documents
Using Dot Notation
Returning Fields
Simplifying Queries with sort(), limit(), and skip()
Aggregating Queries
Specifying an Index with hint()
Refining Queries with Conditional Operators
Conducting Searches with Regular Expressions
Modifying the Data
Updating Your Data
Modifier Operators
Replacing Documents with replace_one()
Modifying a Document Atomically
Putting the Parameters to Work
Processing Data in Bulk
Executing Bulk Operations
Deleting Data
Creating a Link Between Two Documents
Retrieving the Information
Summary
Chapter 8: Advanced Queries
Text Search
Text Search Costs and Limitations
Using Text Search
Text Indexes in Other Languages
Compound Indexing with Text Indexes
The Aggregation Framework
Using the $group Command
Using the $limit Operator
Using the $match Operator
Using the $sort Operator
Using the $unwind Operator
Using the $skip Operator
Using the $out Operator
Using the $lookup Operator
MapReduce
How MapReduce Works
Setting Up Testing Documents
Working with Map Functions
Advanced MapReduce
Debugging MapReduce
Summary
Chapter 9: Database Administration
Using Administrative Tools
mongo, the MongoDB Console
Using Third-Party Administration Tools
Backing Up the MongoDB Server
Creating a Backup 101
Backing Up a Single Database
Backing Up a Single Collection
Digging Deeper into Backups
Restoring Individual Databases or Collections
Restoring a Single Database
Restoring a Single Collection
Automating Backups
Using a Local Datastore
Using a Remote (Cloud-Based) Datastore
Backing Up Large Databases
Using a Hidden Secondary Server for Backups
Creating Snapshots with a Journaling Filesystem
Disk Layout to Use with Volume Managers
Importing Data into MongoDB
Exporting Data from MongoDB
Securing Your Data by Restricting Access to a MongoDB Server
Protecting Your Server with Authentication
Adding an Admin User
Enabling Authentication
Authenticating in the mongo Console
MongoDB User Roles
Changing a User’s Credentials
Adding a Read-Only User
Deleting a User
Using Authenticated Connections in a PHP Application
Managing Servers
Starting a Server
Getting the Server’s Version
Getting the Server’s Status
Shutting Down a Server
Using MongoDB Log Files
Validating and Repairing Your Data
Repairing a Server
Validating a Single Collection
Repairing Collection Validation Faults
Repairing a Collection’s Data Files
Compacting a Collection’s Data Files
Upgrading MongoDB
Rolling Upgrade of MongoDB
Monitoring MongoDB
Using MongoDB Cloud Manager
Summary
Chapter 10: Optimization
Optimizing Your Server Hardware for Performance
Understanding MongoDB’s Storage Engines
Understanding MongoDB Memory Use Under MMAPv1
Understanding Working Set Size in MMAPv1
Understanding MongoDB Memory Use Under WiredTiger
Compression in WiredTiger
Choosing the Right Database Server Hardware
Evaluating Query Performance
The MongoDB Profiler
Analyzing a Specific Query with explain()
Using the Profiler and explain() to Optimize a Query
Managing Indexes
Listing Indexes
Creating a Simple Index
Creating a Compound Index
Three-Step Compound Indexes By A. Jesse Jiryu Davis
The Setup
Range Query
Equality Plus Range Query
Digression: How MongoDB Chooses an Index
Equality, Range Query, and Sort
Final Method
Specifying Index Options
Creating an Index in the Background with {background:true}
Creating an Index with a Unique Key {unique:true}
Creating Sparse Indexes with {sparse:true}
Creating Partial Indexes
TTL Indexes
Text Search Indexes
Dropping an Index
Reindexing a Collection
Using hint() to Force Using a Specific Index
Using Index Filters
Optimizing the Storage of Small Objects
Summary
Chapter 11: Replication
Spelling Out MongoDB’s Replication Goals
Improving Scalability
Improving Durability/Reliability
Providing Isolation
Replication Fundamentals
What Is a Primary?
What Is a Secondary?
What Is an Arbiter?
Drilling Down on the Oplog
Implementing a Replica Set
Creating a Replica Set
Getting a Replica Set Member Up and Running
Adding a Server to a Replica Set
Adding an Arbiter
Replica Set Chaining
Managing Replica Sets
Configuring the Options for Replica Set Members
Connecting to a Replica Set from Your Application
Read Concern
Summary
Chapter 12: Sharding
Exploring the Need for Sharding
Partitioning Horizontal and Vertical Data
Partitioning Data Vertically
Partitioning Data Horizontally
Analyzing a Simple Sharding Scenario
Implementing Sharding with MongoDB
Setting Up a Sharding Configuration
Determining How You’re Connected
Listing the Status of a Sharded Cluster
Using Replica Sets to Implement Shards
The Balancer
Hashed Shard Keys
Tag Sharding
Adding More Config Servers
Summary
Index
About the Authors
David Hows is an Honors graduate from the University of Wollongong
in NSW, Australia. He got his start in computing trying to drive more
performance out of his family PC without spending a fortune. This led
to a career in IT, where David has worked as a Systems Administrator,
Performance Engineer, Software Developer, Solutions Architect, and
Database Engineer. David has tried in vain for many years to play soccer well,
and his coffee mug reads “Grumble Bum.”
Peter Membrey is a Chartered IT Fellow with over 15 years of experience
using Linux and Open Source solutions to solve problems in the real
world. An RHCE since the age of 17, he has also had the honor of working
for Red Hat and writing several books covering Open Source solutions.
He holds a master's degree in IT (Information Security) from the
University of Liverpool and is currently an EngD candidate at the Hong
Kong Polytechnic University, where his research interests include time
synchronization, cloud computing, big data, and security. He lives in
Hong Kong with his wonderful wife Sarah and son Kaydyn.
Eelco Plugge is a techie who works and lives in the Netherlands. He currently
works as an engineer in the mobile device management industry, where he
spends most of his time analyzing logs, configs, and errors. He
previously worked as a data encryption specialist at McAfee and held
a handful of IT/system engineering jobs. Eelco is the author of various
books on MongoDB and load balancing and a skilled troubleshooter, and he
holds a casual interest in IT security-related subjects that complements his
MSc in IT Security.
Eelco is a father of two, and any leisure time left is spent behind the
screen or sporadically reading a book. He is interested in science and nature’s
oddities, currency trading (FX), programming, security, and sushi.
Tim Hawkins produced one of the world’s first online classifieds portals in 1993, loot.com, before moving on
to run engineering for many of Yahoo EU’s non-media-based properties, such as search, local search, mail,
messenger, and its social networking products. He is currently managing a large offshore team for a major
US eTailer, developing and deploying next-gen eCommerce applications. Loves hats, hates complexity.
About the Technical Reviewer
Stephen Steneker (aka Stennie) is an experienced full stack software
developer, consultant, and instructor. Stephen has a long history working
for Australian technology startups including founding technical roles at
Yahoo! Australia & NZ, HomeScreen Entertainment, and Grox. He holds a
BSc (Computer Science) from the University of British Columbia.
In his current role as a Technical Services Engineer for MongoDB,
Inc., Stephen provides support, consulting, and training for MongoDB. He
frequently speaks at user groups and conferences, and is the founder and
wrangler for the Sydney MongoDB User Group (http://www.meetup.com/SydneyMUG/).
You can find him on Twitter, Stack Overflow, or GitHub as @stennie.
About the Contributor
A. Jesse Jiryu Davis is a Staff Engineer at MongoDB in New York City,
specializing in C, Python, and asynchronous I/O. He is the lead developer
of the MongoDB C Driver, author of Motor, and a contributor to Python,
PyMongo, and Tornado. He is the co-author with Guido van Rossum of the
chapter “A Web Crawler With asyncio Coroutines” in 500 Lines or Less, the
fourth book in the Architecture of Open Source Applications series.
Acknowledgments
My thanks to all members of the MongoDB team, past and present. Without them we would not be here, and
the way people think about the storage of data would be radically different. I would like to pay extra special
thanks to my colleagues at the MongoDB team in Sydney, as without them I would not be here today.
—David Hows
Writing a book is always a team effort. Even when there is just a single author, there are many people
working behind the scenes to pull everything together. With that in mind I want to thank everyone in the
MongoDB community and everyone at Apress for all their hard work, patience, and support. Thanks go to
Dave and Eelco for really driving the Third Edition home.
I’d also like to thank Dou Yi, a PhD student also at the Hong Kong Polytechnic University (who is
focusing on security- and cryptography-based research), for helping to keep me sane and (patiently)
explaining mathematical concepts that I really should have grasped a long time ago. She has saved me hours
of banging my head against a very robust brick wall.
Special thanks go to Dr. Rocky Chang for agreeing to supervise my EngD studies and for introducing
me to the world of Internet Measurement (which includes time synchronization). His continued support,
patience and understanding are greatly appreciated.
—Peter Membrey
To the 9gag community, without whom this book would have been finished months ago.
—Eelco Plugge
I would like to acknowledge the members of the mongodb-user and mongodb-dev mailing lists for putting up
with my endless questions.
—Tim Hawkins
Introduction
I am a relative latecomer to the world of databases, starting with MySQL in 2006. This followed the logical
course for any computer science undergraduate, leading me to develop on a full LAMP stack backed
by rudimentary tables. At the time I thought little about the complexities of what went into SQL table
management. However, as time has gone on, I have seen the need to store more and more heterogeneous
data and how a simple schema can grow and morph over time as life takes its toll on systems.
My first introduction to MongoDB was in 2011, when Peter Membrey suggested that instead of a zero-context
table of 30 key and 30 value rows, I simply use a MongoDB instance to store data. And like all developers faced
with a new technology I scoffed and did what I had originally planned. It wasn’t until I was halfway through
writing the code to use my horrible monstrosity that Peter insisted I try MongoDB, and I haven’t looked back
since. Like all newcomers from SQL-land, I was awed by the ability of this system to simply accept whatever
data I threw at it and then return it based on whatever criteria I asked. I am still hooked.
Our Approach
And now, in this book, Peter, Eelco Plugge, Tim Hawkins, and I have the goal of presenting you with the same
experiences we had in learning the product: teaching you how you can put MongoDB to use for yourself,
while keeping things simple and clear. Each chapter presents an individual sample database, so you can read
the book in a modular or linear fashion; it’s entirely your choice. This means you can skip a certain chapter if
you like, without breaking your example databases.
Throughout the book, you will find example commands followed by their output. Both appear in a
fixed-width “code” font, with the commands also in boldface to distinguish them from the resulting output.
In most chapters, you will also come across tips, warnings, and notes that contain useful, and sometimes
vital, information.
—David Hows
Chapter 1
Introduction to MongoDB
Imagine a world where using a database is so simple that you soon forget you’re even using it. Imagine a
world where speed and scalability just work, and there’s no need for complicated configuration or set up.
Imagine being able to focus only on the task at hand, get things done, and then—just for a change—leave
work on time. That might sound a bit fanciful, but MongoDB promises to help you accomplish all these
things (and more).
MongoDB (derived from the word humongous) is a relatively new breed of database that has no concept
of tables, schemas, SQL, or rows. It doesn’t have transactions, ACID compliance, joins, foreign keys, or many
of the other features that tend to cause headaches in the early hours of the morning. In short, MongoDB
is a very different database than you’re probably used to, especially if you’ve used a relational database
management system (RDBMS) in the past. In fact, you might even be shaking your head in wonder at the
lack of so-called “standard” features.
Fear not! In the following pages, you will learn about MongoDB’s background and guiding principles
and why the MongoDB team made the design decisions it did. We’ll also take a whistle-stop tour of
MongoDB’s feature list, providing just enough detail to ensure that you’ll be completely hooked on this topic
for the rest of the book.
We’ll start by looking at the philosophy and ideas behind the creation of MongoDB, as well as some
of the interesting and somewhat controversial design decisions. We’ll explore the concept of document-
oriented databases, how they fit together, and what their strengths and weaknesses are. We’ll also explore
JavaScript Object Notation and examine how it applies to MongoDB. To wrap things up, we’ll step through
some of the notable features of MongoDB.
Reviewing the MongoDB Philosophy
Like all projects, MongoDB has a set of design philosophies that help guide its development. In this section,
we’ll review some of the database’s founding principles.
Using the Right Tool for the Right Job
The most important of the philosophies that underpin MongoDB is the notion that one size does not fit all.
For many years, traditional relational (SQL) databases (MongoDB is a document-oriented database) have
been used for storing content of all types. It didn’t matter whether the data were a good fit for the relational
model (which is used in all RDBMS databases, such as MySQL, PostgreSQL, SQLite, Oracle, MS SQL Server,
and so on); the data were stuffed in there anyway. Part of the reason for this is that, generally speaking,
it’s much easier (and more secure) to read and write to a database than it is to write to a file system. If you
pick up any book that teaches PHP, such as PHP for Absolute Beginners 2nd edition, by Jason Lengstorf and
Thomas Blom Hansen (Apress, 2014), you’ll probably discover almost right away that the database is used
to store information, not the file system. It’s just so much easier to do things that way. And while using a
database as a storage bin works, developers always have to work against the flow. It’s usually obvious when
we’re not using the database the way it was intended; anyone who has ever tried to store information with
even slightly complex data and had to set up several tables and then try to pull them all together knows what
we’re talking about!
The MongoDB team decided that it wasn’t going to create another database that tries to do everything
for everyone. Instead, the team wanted to create a database that worked with documents rather than rows
and that was blindingly fast, massively scalable, and easy to use. To do this, the team had to leave some
features behind, which means that MongoDB is not an ideal candidate for certain situations. For example,
its lack of transaction support means that you wouldn’t want to use MongoDB to write an accounting
application. That said, MongoDB might be perfect for part of the aforementioned application (such as
storing complex data). That’s not a problem, though, because there is no reason why you can’t use a
traditional RDBMS for the accounting components and MongoDB for the document storage. Such hybrid
solutions are quite common, and you can see them in production apps such as the one used for the New
York Times website.
Once you’re comfortable with the idea that MongoDB may not solve all your problems, you will
discover that there are certain problems that MongoDB is a perfect fit for resolving, such as analytics (think
a real-time Google Analytics for your website) and complex data structures (for example, blog posts and
comments). If you’re still not convinced that MongoDB is a serious database tool, feel free to skip ahead to
the “Reviewing the Feature List” section, where you will find an impressive list of features for MongoDB.
■Note The lack of transactions and other traditional database features doesn’t mean that MongoDB is
unstable or that it cannot be used for managing important data.
Another key concept behind MongoDB’s design is that there should always be more than one copy of
the database. If a single database should fail, then it can simply be restored from the other servers. Because
MongoDB aims to be as fast as possible, it takes some shortcuts that make it more difficult to recover from
a crash. The developers believe that most serious crashes are likely to remove an entire computer from
service anyway; this means that even if the database were perfectly restored, it would still not be usable.
Remember: MongoDB does not try to be everything to everyone. But for many purposes (such as building a
web application), MongoDB can be an awesome tool for implementing your solution.
So now you know where MongoDB is coming from. It’s not trying to be the best at everything, and
it readily acknowledges that it’s not for everyone. However, for those who choose to use it, MongoDB
provides a rich document-oriented database that’s optimized for speed and scalability. It can also run nearly
anywhere you might want to run it. MongoDB’s website includes downloads for Linux, Mac OS, Windows,
and Solaris.
MongoDB succeeds at all these goals, and this is why using MongoDB (at least for us) is somewhat
dream-like. You don’t have to worry about squeezing your data into a table—just put the data together, and
then pass them to MongoDB for handling.
Consider this real-world example. A recent application that co-author Peter Membrey worked on
needed to store a set of eBay search results. There could be any number of results (up to 100 of them), and
he needed an easy way to associate the results with the users in his database. Had Peter been using MySQL,
he would have had to design a table to store the data, write the code to store his results, and then write more
code to piece it all back together again. This is a fairly common scenario and one most developers face on
a regular basis. Normally, we just get on with it; however, for this project, he was using MongoDB, so things
went a bit differently.
Specifically, he added these two lines of code:
request['ebay_results'] = ebay_results_array
collection.save(request)
In this example, request is Peter’s document, ebay_results is the key, and ebay_results_array contains
the results from eBay. The second line saves the changes. When he accesses this document in the future, he
will have the eBay results in exactly the same format as before. He doesn’t need any SQL; he doesn’t need to
perform any conversions; nor does he need to create any new tables or write any special code—MongoDB
just worked. It got out of the way, he finished his work early, and he got to go home on time.
Lacking Innate Support for Transactions
Here’s another important design decision by MongoDB developers: The database does not include
transactional semantics (the element that offers guarantees about data consistency and storage). This
is a solid tradeoff based on MongoDB’s goal of being simple, fast, and scalable. Once you leave those
heavyweight features at the door, it becomes much easier to scale horizontally.
Normally with a traditional RDBMS, you improve performance by buying a bigger, more powerful
machine. This is scaling vertically, but you can only take it so far. With horizontal scaling, rather than having
one big machine, you have lots of less powerful small machines. Historically, clusters of servers like this were
excellent for load-balancing websites, but databases had always been a problem because of internal design
limitations.
You might think this missing support constitutes a deal-breaker; however, many people forget that one
of the most popular table types in MySQL (MyISAM, which was also the default until MySQL 5.5) doesn’t support
transactions either. This fact hasn’t stopped MySQL from becoming and remaining the dominant open
source database for well over a decade. As with most choices when developing solutions, using MongoDB is
going to be a matter of personal preference and whether the tradeoffs fit your project.
■Note MongoDB offers durability when used in tandem with at least two data-bearing servers as part of a
three-node cluster. This is the recommended minimum for production deployments. MongoDB also supports
the concept of “write concerns.” This is where a given number of nodes can be made to confirm the write was
successful, giving a stronger guarantee that the data are safely stored.
Single-server durability has been provided since version 1.8 of MongoDB by a transaction log. This log is
append only and is flushed to disk every 100 milliseconds.
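The idea behind a write concern can be sketched with a toy model. The snippet below is only an in-memory simulation, not MongoDB's API: the cluster, the node names, and the write_with_concern helper are all hypothetical. A write is treated as successful only when at least w nodes confirm it.

```python
# Toy model of a write concern: a write counts as successful only when at
# least `w` nodes have acknowledged it. This simulates the idea in memory;
# it does not talk to MongoDB.

def write_with_concern(nodes, document, w):
    """Try to store `document` on every node; succeed if >= w nodes confirm."""
    acknowledged = 0
    for node in nodes:
        if node["up"]:                    # an unreachable node cannot confirm
            node["data"].append(document)
            acknowledged += 1
    return acknowledged >= w


# A hypothetical three-node cluster in which one node is currently down.
cluster = [
    {"name": "node-a", "up": True, "data": []},
    {"name": "node-b", "up": True, "data": []},
    {"name": "node-c", "up": False, "data": []},
]

ok_two = write_with_concern(cluster, {"user": "peter"}, w=2)
ok_three = write_with_concern(cluster, {"user": "eelco"}, w=3)
print(ok_two, ok_three)
```

With one node down, a write that asks for two confirmations can still succeed, while one that asks for all three cannot. That is the tradeoff a write concern lets you choose: a stronger guarantee in exchange for depending on more nodes.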
JSON and MongoDB
JSON (JavaScript Object Notation) is more than a great way to exchange data; it’s also a nice way to store
data. An RDBMS is highly structured, with multiple files (tables) that store the individual pieces. MongoDB,
on the other hand, stores everything together in a single document. MongoDB is like JSON in this way,
and this model provides a rich and expressive way of storing data. Moreover, JSON effectively describes all
the content in a given document, so there is no need to specify the structure of the document in advance.
JSON is effectively schemaless (that is, it doesn’t require a schema), because documents can be updated
individually or changed independently of any other documents. As an added bonus, JSON also provides
excellent performance by keeping all of the related data in one place.
MongoDB doesn’t actually use JSON to store the data; rather, it uses an open data format developed
by the MongoDB team called BSON (pronounced Bee-Son), which is short for binary JSON. For the most
part, using BSON instead of JSON won’t change how you work with your data. BSON makes MongoDB even
faster by making it much easier for a computer to process and search documents. BSON also adds a couple
of features that aren’t available in standard JSON, including a number of extended types for numeric data
(such as int32 and int64) and support for handling binary data. We’ll look at BSON in more depth in “Using
Document-Oriented Storage (BSON),” later in this chapter.
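To see why extra types are useful, consider what plain JSON cannot express. The following sketch uses only Python's standard json module (not a BSON library) and a made-up document; it shows that raw binary data has no JSON representation, which is one of the gaps BSON fills.

```python
import json

# JSON has a single "number" type and no binary type, so Python's standard
# json module refuses to serialize raw bytes.
document = {"name": "thumbnail", "size": 2048}
print(json.dumps(document))

try:
    json.dumps({"name": "thumbnail", "data": b"\x89PNG"})
    binary_supported = True
except TypeError:
    # bytes are not JSON serializable; BSON adds a dedicated binary type
    binary_supported = False

print(binary_supported)
```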
The original specification for JSON was written by Douglas Crockford and published as RFC 4627; it was later revised as RFC 7159.
JSON allows complex data structures to be represented in a simple, human-readable text format that is
generally considered to be much easier to read and understand than XML. Like XML, JSON was envisaged
as a way to exchange data between a web client (such as a browser) and web applications. When combined
with the rich way that it can describe objects, its simplicity has made it the exchange format of choice for the
majority of developers.
You might wonder what is meant here by complex data structures. Historically, data were exchanged
using the comma-separated values (CSV) format (indeed, this approach remains very common today). CSV
is a simple text format that separates rows with a new line and fields with a comma. For example, a CSV file
might look like this:
Membrey, Peter, +852 1234 5678
Thielen, Wouter, +81 1234 5678
Someone can look at this information and see quite quickly what information is being communicated.
Or maybe not—is that number in the third column a phone number or a fax number? It might even be the
number for a pager. To avoid this ambiguity, CSV files often begin with a header row that defines
what each column contains. The following snippet takes the previous example one step further:
Lastname, Firstname, Phone Number
Membrey, Peter, +852 1234 5678
Thielen, Wouter, +81 1234 5678
Okay, that’s a bit better. But now assume some people in the CSV file have more than one phone
number. You could add another field for an office phone number, but you face a new set of issues if you want
several office phone numbers. And you face yet another set of issues if you also want to incorporate multiple
e-mail addresses. Most people have more than one, and these addresses can’t usually be neatly defined
as either home or work. Suddenly, CSV starts to show its limitations. CSV files are only good for storing
data that are flat and don’t have repeating values. Similarly, it’s not uncommon for several CSV files to be
provided, each with the separate bits of information. These files are then combined (usually in an RDBMS)
to create the whole picture. As an example, a large retail company may receive sales data in the form of CSV
files from each of its stores at the end of each day. These files must be combined before the company can see
how it performed on a given day. This process is not exactly straightforward, and it certainly increases the
chances of a mistake as the number of required files grows.
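The flat nature of CSV is easy to see when you parse it. Here is a minimal sketch using Python's standard csv module on the sample data shown earlier; the variable names are ours. Each row comes back with exactly one slot per column, so there is no natural place for a second phone number.

```python
import csv
import io

csv_text = """Lastname, Firstname, Phone Number
Membrey, Peter, +852 1234 5678
Thielen, Wouter, +81 1234 5678
"""

# skipinitialspace drops the blank that follows each comma in the sample.
reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
rows = list(reader)

for row in rows:
    # Every value is a single string; a list of numbers would need extra
    # columns or a second file.
    print(row["Firstname"], row["Lastname"], row["Phone Number"])
```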
XML largely solves this problem, but using XML for most things is a bit like using a sledgehammer
to crack a nut: it works, but it feels like overkill. The reason for this is that XML is not only designed for
machines to read (whereas JSON is designed for humans), but it is also highly extensible. Rather than define
a particular data format, XML defines how you define a data format. This can be useful when you need to
exchange complex and highly structured data; however, for simple data exchange, it often results in too
much work. Indeed, this scenario is the source of the phrase “XML hell.”
JSON provides a happy medium. Unlike CSV, it can store structured content; but unlike XML, JSON
makes the content easy to understand and simple to use. Let’s revisit the previous example; however, this
time we’ll use JSON rather than CSV:
{
"firstname": "Peter",
"lastname": "Membrey",
"phone_numbers": [
"+852 1234 5678",
"+44 1234 565 555"
]
}
In this version of the example, each JSON object (or document) contains all the information needed to
understand it. If you look at phone_numbers, you can see that it contains a list of different numbers. This list
can be as large as you want. You could also be more specific about the type of number being recorded, as in
this example:
{
"firstname": "Peter",
"lastname": "Membrey",
"numbers": [
{
"phone": "+852 1234 5678"
},
{
"fax": "+44 1234 565 555"
}
]
}
This version of the example improves on things a bit more. Now you can clearly see what each number
is for. JSON is extremely expressive, and, although it’s quite easy to write JSON from scratch, it is usually
generated automatically in software. For example, Python includes a module called (somewhat predictably)
json that takes existing Python objects and automatically converts them to JSON. Because JSON is
supported and used on so many platforms, it is an ideal choice for exchanging data.
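As a quick illustration of the json module mentioned above, the following sketch converts a Python dictionary (mirroring the example document) to JSON text and back again. The variable names are our own.

```python
import json

person = {
    "firstname": "Peter",
    "lastname": "Membrey",
    "numbers": [
        {"phone": "+852 1234 5678"},
        {"fax": "+44 1234 565 555"},
    ],
}

encoded = json.dumps(person, indent=2)   # Python object -> JSON text
print(encoded)

decoded = json.loads(encoded)            # JSON text -> Python object
print(decoded["numbers"][1]["fax"])
```

The round trip returns an object equal to the one you started with, which is what makes JSON convenient as an exchange format.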
When you add items such as the list of phone numbers, you are actually creating what is known as
an embedded document. This happens whenever you add complex content such as a list (or array, to use
the term favored in JSON). Generally speaking, there is also a logical distinction. For example, a Person
document might have several Address documents embedded inside it. Similarly, an Invoice document
might have numerous LineItem documents embedded inside it. Of course, the embedded Address
document could also have its own embedded document that contains phone numbers, for example.
Whether you choose to embed a particular document is determined when you decide how to store your
information. This is usually referred to as schema design. It might seem odd to refer to schema design when
MongoDB is considered a schemaless database. However, while MongoDB doesn’t force you to create a
schema or enforce one that you create, you do still need to think about how your data fit together. We’ll look
at this in more depth in Chapter 3.
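To make the idea of embedding concrete, here is a small sketch of a Person document with embedded Address documents, written as a plain Python dictionary. The field names and the address values are hypothetical, chosen only for illustration.

```python
# A hypothetical Person document with two embedded Address documents.
person = {
    "firstname": "Peter",
    "lastname": "Membrey",
    "addresses": [
        {"type": "home", "city": "Hong Kong", "phone_numbers": ["+852 1234 5678"]},
        {"type": "work", "city": "London", "phone_numbers": ["+44 1234 565 555"]},
    ],
}

# Everything about the person travels with the document, so no join is needed.
work_phones = [
    number
    for address in person["addresses"]
    if address["type"] == "work"
    for number in address["phone_numbers"]
]
print(work_phones)
```

Whether addresses belong inside the Person document or in a collection of their own is exactly the kind of schema design decision described above.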
Adopting a Nonrelational Approach
Improving performance with a relational database is usually straightforward: you buy a bigger, faster server.
And this works great until you reach the point where there isn’t a bigger server available to buy. At that point,
the only option is to spread out to two servers. This might sound easy, but it is a stumbling block for most
databases. For example, PostgreSQL can’t run a single database on two servers, where both servers can
read and write data (often referred to as an active/active cluster), and MySQL can only do it with a special
add-on package. And although Oracle can do this with its impressive Real Application Clusters (RAC)
architecture, you can expect to take out a mortgage if you want to use that solution—implementing a
RAC-based solution requires multiple servers, shared storage, and several software licenses.
You might wonder why having an active/active cluster across two servers is so difficult. When you query
your database, the database has to find all the relevant data and link them all together. RDBMS solutions
feature many ingenious ways to improve performance, but they all rely on having a complete picture of the
data available. And this is where you hit a wall: this approach simply doesn’t work when half the data are on
another server.
Of course you might have a small database that simply gets lots of requests, so you just need to share
the workload. Unfortunately, here you hit another wall. You need to ensure that data written to the first
server are available to the second server. And you face additional issues if updates are made on two separate
masters simultaneously. For example, you need to determine which update is the correct one. Another
problem you can encounter is if someone queries the second server for information that has just been
written to the first server, but that information hasn’t been updated yet on the second server. When you
consider all these issues, it becomes easy to see why the Oracle solution is so expensive—these problems are
extremely hard to address.
MongoDB solves the active/active cluster problems in a very clever way—it avoids them completely.
Recall that MongoDB stores data in BSON documents, so the data are self-contained. That is, although
similar documents are stored together, individual documents aren’t made up of relationships. This means
that everything you need is all in one place. Because queries in MongoDB look for specific keys and values
in a document, this information can be easily spread across as many servers as you have available. Each
server checks the content it has and returns the result. This effectively allows almost linear scalability and
performance.
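The "each server checks the content it has" idea can be sketched in a few lines. This is a toy model under our own assumptions, with plain Python lists standing in for servers and a hypothetical find helper; it is not how MongoDB is implemented internally.

```python
# Toy scatter-gather query: each "server" holds some self-contained documents
# and checks only its own content. Plain lists stand in for real servers.

servers = [
    [{"kind": "dvd", "title": "A"}, {"kind": "cd", "title": "B"}],
    [{"kind": "cd", "title": "C"}],
    [{"kind": "dvd", "title": "D"}],
]

def matches(document, criteria):
    """True if the document has every requested key with the requested value."""
    return all(document.get(key) == value for key, value in criteria.items())

def find(servers, criteria):
    results = []
    for documents in servers:   # in a real cluster these checks run in parallel
        results.extend(d for d in documents if matches(d, criteria))
    return results

dvds = find(servers, {"kind": "dvd"})
print([d["title"] for d in dvds])
```

Because no document depends on data held elsewhere, each server can answer from its own documents and the results are simply merged.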
Admittedly, MongoDB does not offer master/master replication, in which two separate servers can
both accept write requests. However, it does have sharding, which allows data to be partitioned across
multiple machines, with each machine responsible for updating different parts of the dataset. The benefit of
a sharded cluster is that additional shards can be added to increase resource capacity in your deployment
without any changes to your application code. Nonsharded database deployments are limited to vertical
scaling: you can add more RAM/CPU/disk, but this can quickly get expensive. Sharded deployments
can also be scaled vertically, but more importantly, they can be scaled horizontally based on capacity
requirements: a sharded cluster can be comprised of many more affordable commodity servers rather than a
few very expensive ones. Horizontal scaling is a great fit for elastic provisioning with cloud-hosted instances
and containers.
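A rough sketch of how writes can be partitioned follows. It assumes a simple stable hash of a hypothetical shard key (the customer name) spread over three shards; MongoDB's own distribution of data across shards is more involved, so treat this purely as an illustration of the principle.

```python
import zlib

def shard_for(key, shard_count):
    """Pick a shard index from a stable hash of the shard key."""
    return zlib.crc32(key.encode("utf-8")) % shard_count

shards = [[] for _ in range(3)]        # three hypothetical shards

customers = ["alice", "bob", "carol", "dave", "erin", "frank"]
for name in customers:
    # Each document is written to exactly one shard, chosen by its key.
    shards[shard_for(name, len(shards))].append({"customer": name})

for index, documents in enumerate(shards):
    print(index, [d["customer"] for d in documents])
```

Since the same key always maps to the same shard, each machine is responsible for updating only its own part of the dataset.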
Opting for Performance vs. Features
Performance is important, but MongoDB also provides a large feature set. We’ve already discussed some
of the features MongoDB doesn’t implement, and you might be somewhat skeptical of the claim that
MongoDB achieves its impressive performance partly by judiciously excising certain features common to
other databases. However, there are analogous database systems available that are extremely fast, but also
extremely limited, such as those that implement a key/value store.
A perfect example is memcached. This application was written to provide high-speed data caching, and
it is mind-numbingly fast. When used to cache website content, it can speed up an application many times
over. This application is used by extremely large websites, such as Facebook and LiveJournal. The catch is
that this application has two significant shortcomings. First, it is a memory-only database. If the power goes
out, then all the data are lost. Second, you can’t actually search for data using memcached; you can only
request specific keys.
These might sound like serious limitations; however, you must remember the problems that
memcached is designed to solve. First and foremost, memcached is a data cache. That is, it’s not supposed
to be a permanent data store, but only a means to provide a caching layer for your existing database. When
you build a dynamic web page, you generally request very specific data (such as the current top ten articles).
This means you can specifically ask memcached for that data—there is no need to perform a search. If the
cache is outdated or empty, you would query your database as normal, build up the data, and then store it in
memcached for future use.
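The caching pattern just described can be sketched as follows. A Python dictionary stands in for memcached here, and query_top_articles is a hypothetical placeholder for a real database query, so this shows the flow rather than a real client.

```python
cache = {}            # stands in for memcached: lookups by exact key only
database_queries = 0  # counts how often we fall through to the database

def query_top_articles():
    """Hypothetical slow database query."""
    global database_queries
    database_queries += 1
    return ["article-1", "article-2", "article-3"]

def get_top_articles():
    articles = cache.get("top_articles")
    if articles is None:                  # cache is empty: ask the database
        articles = query_top_articles()
        cache["top_articles"] = articles  # store for future requests
    return articles

first = get_top_articles()   # misses the cache, queries the database
second = get_top_articles()  # served from the cache
print(database_queries)
```

Notice that the cache is only ever asked for one specific key; nothing in this flow requires searching, which is why memcached can get away with such a small feature set.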
Once you accept these limitations, you can see how memcached offers superb performance by
implementing a very limited feature set. This performance, by the way, is unmatched by that of a traditional
database. That said, memcached certainly can’t replace an RDBMS. The important thing to keep in mind is
that it’s not supposed to.
Compared to memcached, MongoDB is itself feature-rich. To be useful, MongoDB must offer a strong
set of features, such as the ability to search for specific documents. It must also be able to store those
documents on disk, so they can survive a reboot. Fortunately, MongoDB provides enough features to be a
strong contender for most web applications and many other types of applications as well.
Like memcached, MongoDB is not a one-size-fits-all database. As is usually the case in computing,
tradeoffs must be made to achieve the intended goals of the application.
Running the Database Anywhere
MongoDB is written in C++, which makes it relatively easy to port or run the application practically
anywhere. Currently, binaries can be downloaded from the MongoDB website for Linux, Mac OS, Windows,
and Solaris. Officially supported Linux packages include Amazon Linux, RHEL, Ubuntu Server LTS, and
SUSE. You can even download the source code and build your own MongoDB, although it is recommended
that you use the provided binaries wherever possible.
■Caution The 32-bit version of MongoDB is limited to databases of 2GB or less. This is because MongoDB
uses memory-mapped files internally to achieve high performance. Anything larger than 2GB on a 32-bit system
would require some fancy footwork that wouldn’t be fast and would also complicate the application’s code.
The official stance on this limitation is that 64-bit environments are easily available; therefore, increasing code
complexity is not a good tradeoff. The 64-bit version for all intents and purposes has no such restriction.
MongoDB’s modest requirements allow it to run on high-powered servers or virtual machines, and
even to power cloud-based applications. By keeping things simple and focusing on speed and efficiency,
MongoDB provides solid performance wherever you choose to deploy it.
Fitting Everything Together
Before we look at MongoDB’s feature list, we need to review a few basic terms. MongoDB doesn’t require
much in the way of specialized knowledge to get started, and many of the terms specific to MongoDB can be
loosely translated to RDBMS equivalents that you are probably already familiar with. Don’t worry, though;
we’ll explain each term fully. Even if you’re not familiar with standard database terminology, you will still be
able to follow along easily.
Generating or Creating a Key
A document represents the unit of storage in MongoDB. In an RDBMS, this would be called a row. However,
documents are much more than rows because they can store complex information such as lists, dictionaries,
and even lists of dictionaries. In contrast to a traditional database, where a row is fixed, a document in
MongoDB can be made up of any number of keys and values (you’ll learn more about this in the next
section). Ultimately, a key is nothing more than a label; it is roughly equivalent to the name you might give to
a column in an RDBMS. You use a key to reference pieces of data inside your document.
In a relational database, there should always be some way to uniquely identify a given record; otherwise
it becomes impossible to refer to a specific row. To that end, you are supposed to include a field that holds a
unique value (called a primary key) or a collection of fields that can uniquely identify the given row (called a
compound primary key).
MongoDB requires that each document have a unique identifier for much the same reason; in
MongoDB, this identifier is called _id. Unless you specify a value for this field, MongoDB will generate
a unique value for you. Even in the well-established world of RDBMS databases, opinion is divided as to
whether you should use a unique key provided by the database or generate a unique key yourself. Recently,
it has become more popular to allow the database to create the key for you. MongoDB is a distributed
database, so one of the main goals is to remove dependencies on shared resources (for example, checking
if a primary key is actually unique). Nondistributed databases often use a simple primary key such as an
auto-incrementing sequence number. MongoDB’s default _id format is an ObjectId, which is a 12-byte unique
identifier that can be generated independently in a distributed environment.
One reason for preferring a database-generated key is that human-created unique numbers such as car registration numbers have
a nasty habit of changing. For example, in 2001, the United Kingdom implemented a new number plate
scheme that was completely different from the previous system. It happens that MongoDB can cope with
this type of change perfectly well; however, chances are that you would need to do some careful thinking if
you used the registration plate as your primary key. A similar scenario may have occurred when the ISBN
(International Standard Book Number) scheme was upgraded from 10 digits to 13.
Previously, most developers who used MongoDB seemed to prefer creating their own unique keys,
taking it upon themselves to ensure that the number would remain unique. Today, though, general
consensus seems to point at using the default ID value that MongoDB creates for you. However, as is the
case when working with RDBMS databases, the approach you choose mostly comes down to personal
preference. We prefer to use a database-provided value because it means we can be sure the key is unique
and independent of anything else.
Ultimately, you must decide what works best for you. If you are confident that your key is unique (and
likely to remain unchanged), then feel free to use it. If you’re unsure about your key’s uniqueness or you
don’t want to worry about it, then you can simply use the default key provided by MongoDB.
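To give a feel for how a 12-byte identifier like the default ObjectId described earlier in this section can be generated without consulting a shared resource, here is a sketch using only Python's standard library. The layout used (a 4-byte timestamp, a 5-byte per-process random value, and a 3-byte counter) reflects our understanding of the published ObjectId layout; older MongoDB versions used a machine identifier and process ID in place of the random value. The new_object_id function is our own illustration, not the driver's implementation.

```python
import itertools
import os
import struct
import time

# Counter starts at a random value and is incremented for every new id.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))
_process_random = os.urandom(5)   # chosen once per process

def new_object_id():
    """Build a 12-byte identifier: timestamp + random value + counter."""
    timestamp = struct.pack(">I", int(time.time()))             # 4 bytes
    counter = (next(_counter) % 0x1000000).to_bytes(3, "big")   # 3 bytes
    return timestamp + _process_random + counter

ids = [new_object_id() for _ in range(1000)]
print(ids[0].hex())
```

No step in this sketch needs to ask another server whether the value is already taken, which is what makes such identifiers practical in a distributed environment.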
Using Keys and Values
Documents are made up of keys and values. Let’s take another look at the example discussed previously in
this chapter:
{
"firstname": "Peter",
"lastname": "Membrey",
"phone_numbers": [
"+852 1234 5678",
"+44 1234 565 555"
]
}
Keys and values always come in pairs. Unlike an RDBMS, where every field must have a value, even
if it’s NULL (somewhat paradoxically, this means unknown), MongoDB does not require every document
to have the same fields, or that every field with the same name has the same type of value. For example,
"phone_numbers" could be a single value in some documents and a list in others. If you don’t know the
phone number for a particular person on your list, you simply leave it out. A popular analogy for this sort of
thing is a business card. If you have a fax number, you usually put it on your business card; however, if you
don’t have one, you don’t write: “Fax number: none.” Instead, you simply leave the information out. If the
key/value pair isn’t included in a MongoDB document, it is assumed not to exist.
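The following sketch shows what this flexibility looks like from application code. Plain Python dictionaries stand in for documents, and phone_list is a hypothetical helper: one document stores a list, one stores a single value, and one leaves the key out entirely.

```python
people = [
    {"firstname": "Peter", "phone_numbers": ["+852 1234 5678", "+44 1234 565 555"]},
    {"firstname": "Wouter", "phone_numbers": "+81 1234 5678"},  # single value
    {"firstname": "Eelco"},                                     # key left out
]

def phone_list(person):
    """Normalize the optional, mixed-type field into a list."""
    value = person.get("phone_numbers")   # None when the key does not exist
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

for person in people:
    print(person["firstname"], phone_list(person))
```

The database will accept all three documents; it is your application that has to cope with the variation, which is one reason to keep similar documents consistent where you can.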
Implementing Collections
Collections are somewhat analogous to tables, but they are far less rigid. A collection is a lot like a box with
a label on it. You might have a box at home labeled “DVDs” into which you put, well, your DVDs. This
makes sense, but there is nothing stopping you from putting CDs or even cassette tapes into this box if you
wanted to. In an RDBMS, tables are strictly defined, and you can only put designated items into the table.
In MongoDB, a collection is simply that: a collection of similar items. The items don’t have to be similar
(MongoDB is inherently flexible); however, once we start looking at indexing and more advanced queries,
you’ll soon see the benefits of placing similar items in a collection.
While you could mix various items together in a collection, there’s little need to do so. Had the
collection been called media, then all of the DVDs, CDs, and cassette tapes would be at home there. After all,
these items all have things in common, such as an artist name, a release date, and content. In other words, it
really does depend on your application whether certain documents should be stored in the same collection.
Performance-wise, having multiple collections is no slower than having only one collection. Remember:
MongoDB is about making your life easier, so you should do whatever feels right to you.
Last but not least, collections are usually created on demand. Specifically, a collection is created when
you first attempt to save a document that references it. This means that your application can create new
collections at run time (not that you necessarily should). Because MongoDB also lets you create indexes and perform
other database-level commands dynamically, you can leverage this behavior to build some very dynamic
applications.
Understanding Databases
Perhaps the easiest way to think of a database in MongoDB is as a group of collections. Like collections,
databases can be created on demand. This means that it’s easy to create a database for each
customer—your application code can even do it for you. You can do this with databases other than
MongoDB, as well; however, creating databases in this manner with MongoDB is a very natural process.
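On-demand creation of databases and collections can be mimicked with a toy in-memory model. The sketch below is not MongoDB's API: the server mapping, the save helper, and the customer database names are all hypothetical. A database and a collection come into existence the first time a document is saved into them.

```python
from collections import defaultdict

# server -> database name -> collection name -> list of documents.
# Nothing exists until the first document is saved.
server = defaultdict(lambda: defaultdict(list))

def save(database, collection, document):
    """Store a document, creating the database and collection if needed."""
    server[database][collection].append(document)

save("customer_acme", "media", {"kind": "dvd", "title": "Example"})
save("customer_acme", "media", {"kind": "cd", "title": "Another"})
save("customer_globex", "invoices", {"total": 100})

print(sorted(server))
print(sorted(server["customer_acme"]))
```

No separate "create database" or "create table" step appears anywhere, which is why a per-customer database layout feels natural in this style.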
Reviewing the Feature List
Now that you understand what MongoDB is and what it offers, it’s time to run through its feature list. You
can find a complete list of MongoDB’s features on the database’s website at www.mongodb.org/; be sure to
visit this site for an up-to-date list of them. The feature list in this chapter covers a fair bit of material that
goes on behind the scenes, but you don’t need to be familiar with every feature listed to use MongoDB itself.
In other words, if you feel your eyes beginning to close as you review this list, feel free to jump to the end of
the section!
WiredTiger
This is the third release of this book on MongoDB, and there have been some significant changes along the
way. At the forefront of these is the introduction of MongoDB’s pluggable storage API and WiredTiger, a very
high-performance database engine. WiredTiger was an optional storage engine introduced in MongoDB 3.0
and is now the default storage engine as of MongoDB 3.2. The classic MMAP (memory-mapped) storage
engine is still available, but WiredTiger is more efficient and performant for the majority of use cases.
WiredTiger itself can be said to have taken MongoDB to a whole new level, replacing the older MMAP
model of internal data storage and management. WiredTiger allows MongoDB to (among other things)
far better optimize what data reside in memory and what data reside on disk, without some of the messy
overflows that were present before. The upshot of this is that more often than not, WiredTiger represents
a real performance gain for all users. WiredTiger also better optimizes how data are stored on disk and
provides an in-built compression API that makes for massive savings on disk space. It’s safe to say that with
WiredTiger onboard, MongoDB looks to be making another huge move in the database landscape, one of
similar size to that made when MongoDB was first released.
Using Document-Oriented Storage (BSON)
We’ve already discussed MongoDB’s document-oriented design. We’ve also briefly touched on BSON.
As you learned, JSON makes it much easier to store and retrieve documents in their real form, effectively
removing the need for any sort of mapper or special conversion code. The fact that this feature also makes it
much easier for MongoDB to scale up is icing on the cake.
BSON is an open standard; you can find its specification at http://bsonspec.org/. When people
hear that BSON is a binary form of JSON, they expect it to take up much less room than text-based JSON.
However, that isn’t necessarily the case; indeed, there are many cases where the BSON version takes up
more space than its JSON equivalent.
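The size comparison can be checked by hand. The following is a minimal sketch, assuming a flat document with only string and 32-bit integer values; encode_bson is a hypothetical helper written for this illustration (it is not part of any MongoDB driver), and it follows the layout published at bsonspec.org.

```python
import json
import struct


def encode_bson(doc):
    """Encode a flat dict of str -> (str | int32) values as a BSON document."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"  # field name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (including the trailing NUL), then the bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            # type 0x10: 32-bit little-endian integer
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError("this sketch handles only str and int32 values")
    # total length (an int32 that counts itself) + elements + terminating NUL
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"


doc = {"hello": "world"}
bson_bytes = encode_bson(doc)
json_bytes = json.dumps(doc, separators=(",", ":")).encode("utf-8")
print(len(bson_bytes), len(json_bytes))  # prints: 22 17
```

For this small document the BSON form is five bytes larger than the compact JSON text, because BSON stores explicit lengths and type markers alongside the data.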
You might wonder why you should use BSON at all. After all, CouchDB (another powerful document-
oriented database) uses pure JSON, and it’s reasonable to wonder whether it’s worth the trouble of
converting documents back and forth between BSON and JSON.
First, you must remember that MongoDB is designed to be fast, rather than space-efficient. This doesn’t
mean that MongoDB wastes space (it doesn’t); however, a small bit of overhead in storing a document is
perfectly acceptable if that makes it faster to process the data (which it does). In short, BSON is much faster
to traverse (that is, to look through) and to index. Although BSON can require slightly more disk space
than JSON, this extra space is unlikely to be a problem, because disks are inexpensive, and MongoDB can
scale across machines. The tradeoff in this case is quite reasonable: you exchange a bit of extra disk space
for better query and indexing performance. The WiredTiger storage engine supports multiple compression
libraries and has index and data compression enabled by default. Compression level can be set at a per-
server default as well as per-collection (on creation). Higher levels of compression will use more CPU when
data are stored but can result in a significant disk space savings.
The second key benefit to using BSON is that it is easy and quick to convert BSON to a programming
language’s native data format. If the data were stored in pure JSON, a relatively high-level conversion would
need to take place. There are MongoDB drivers for a large number of programming languages (such as
Python, Ruby, PHP, C, C++, and C#), and each works slightly differently. Using a simple binary format, native
data structures can be quickly built for each language, without requiring that you first process JSON. This
makes the code simpler and faster, both of which are in keeping with MongoDB’s stated goals.
BSON also provides some extensions to JSON. For example, it enables you to store binary data and to
use data types that JSON lacks, such as dates. Thus, while BSON can store any JSON document, a valid BSON document
may not be valid in JSON. This doesn’t matter, because each language has its own driver that converts data
to and from BSON without needing to use JSON as an intermediary language.
At the end of the day, BSON is not likely to be a big factor in how you use MongoDB. Like all great
tools, MongoDB will quietly sit in the background and do what it needs to do. Apart from possibly using a
graphical tool to look at your data, you will generally work in your native language and let the driver worry
about persisting to MongoDB.
Supporting Dynamic Queries
MongoDB’s support for dynamic queries means that you can run a query without planning for it in advance.
This is similar to being able to run SQL queries against an RDBMS. You might wonder why this is listed as a
feature; surely it is something that every database supports—right?
Actually, no. For example, CouchDB (which is generally considered MongoDB’s biggest “competitor”)
doesn’t support dynamic queries. This is because CouchDB has come up with a completely new (and
admittedly exciting) way of thinking about data. A traditional RDBMS has static data and dynamic queries.
This means that the structure of the data is fixed in advance—tables must be defined, and each row has to fit
into that structure. Because the database knows in advance how the data are structured, it can make certain
assumptions and optimizations that enable fast dynamic queries.
CouchDB has turned this on its head. As a document-oriented database, CouchDB is schemaless, so the
data are dynamic. However, the new idea here is that queries are static. That is, you define them in advance,
before you can use them.
This isn’t as bad as it might sound, because many queries can be easily defined in advance. For
example, a system that lets you search for a book will probably let you search by ISBN. In CouchDB, you
would create an index that builds a list of all the ISBNs for all the documents. When you punch in an ISBN,
the query is very fast because it doesn’t actually need to search for any data. Whenever a new piece of data is
added to the system, CouchDB will automatically update its index.
Technically, you can run a query against CouchDB without generating an index; in that case, however,
CouchDB will have to create the index itself before it can process your query. This won’t be a problem if you
only have a hundred books; however, it will result in poor performance if you’re filing hundreds of thousands
of books, because each query will generate the index again (and again). For this reason, the CouchDB team
does not recommend dynamic queries—that is, queries that haven’t been predefined—in production.
CouchDB also lets you write your queries as map and reduce functions. If that sounds like a lot of effort,
then you’re in good company; CouchDB has a somewhat severe learning curve. In fairness to CouchDB, an
experienced programmer can probably pick it up quite quickly; for most people, however, the learning curve
is probably steep enough that they won’t bother with the tool.
Fortunately for us mere mortals, MongoDB is much easier to use. We’ll cover how to use MongoDB in
more detail throughout the book, but here’s the short version: in MongoDB, you simply provide the parts of the
document you want to match against, and MongoDB does the rest. MongoDB can do much more, however. For
example, you won’t find MongoDB lacking if you want to use map or reduce functions. At the same time, you
can ease into using MongoDB; you don’t have to know all of the tool’s advanced features up front.
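The idea of matching against "the parts of the document you want" can be modeled in a few lines. This is a toy sketch over plain Python dictionaries, not a description of how MongoDB's query engine works internally; the matches and find functions and the sample media list are hypothetical names chosen for this illustration.

```python
def matches(document, query):
    """Return True if every key/value pair in `query` appears in `document`."""
    return all(key in document and document[key] == value
               for key, value in query.items())


def find(collection, query):
    """Scan a list of dicts and return those that match the query document."""
    return [doc for doc in collection if matches(doc, query)]


media = [
    {"type": "CD", "artist": "Nirvana", "title": "Nevermind"},
    {"type": "DVD", "title": "The Matrix"},
    {"type": "CD", "artist": "Nirvana", "title": "In Utero"},
]

# No query was defined in advance; the "query" is just a partial document.
nirvana_cds = find(media, {"type": "CD", "artist": "Nirvana"})
print([doc["title"] for doc in nirvana_cds])
```

Note that documents without an artist key are simply skipped, which mirrors how a schemaless collection can hold items of different shapes.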
Indexing Your Documents
MongoDB includes extensive support for indexing your documents, a feature that really comes in handy
when you’re dealing with tens of thousands of documents. Without an index, MongoDB will have to look at
each individual document in turn to see whether it is something that you want to see. This is like asking a
librarian for a particular book and watching as he works his way around the library looking at each and every
book. With an indexing system (libraries tend to use the Dewey Decimal system), he can find the area where
the book you are looking for lives and very quickly determine if it is there.
Unlike a library book, all documents in MongoDB are automatically indexed on the _id key. This key is
considered a special case because you cannot delete it; the index is what ensures that each value is unique.
One of the benefits of this key is that you can be assured that each document is uniquely identifiable,
something that isn’t guaranteed by an RDBMS.
When you create your own indexes, you can decide whether you want them to enforce uniqueness. By
default, an error will be returned if you try to create a unique index on a key that has duplicate values.
There are many occasions where you will want to create an index that allows duplicates. For example, if
your application searches by last name, it makes sense to build an index on the lastname key. Of course, you
cannot guarantee that each last name will be unique; and in any database of a reasonable size, duplicates are
practically guaranteed.
MongoDB’s indexing abilities don’t end there, however. MongoDB can also create indexes on
embedded documents. For example, if you store numerous addresses in the address key, you can create an
index on the ZIP or postal code. This means that you can easily pull back a document based on any postal
code—and do so very quickly.
MongoDB takes this a step further by allowing composite indexes (MongoDB’s own documentation calls these
compound indexes). In a composite index, two or more keys are used to build a given index. For example, you
might build an index that combines both the lastname and firstname keys. A search for a full name would be
very quick because MongoDB can quickly isolate the last name and then, just as quickly, isolate the first name.
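To see why a two-key index narrows a search so quickly, consider this rough sketch. It assumes the index is simply a sorted list of (lastname, firstname, _id) tuples searched with binary search; MongoDB's real indexes are B-tree structures, and the people list and find_ids function are made up for the example.

```python
from bisect import bisect_left

people = [
    {"_id": 1, "lastname": "Smith", "firstname": "Anna"},
    {"_id": 2, "lastname": "Jones", "firstname": "Bob"},
    {"_id": 3, "lastname": "Smith", "firstname": "Zed"},
    {"_id": 4, "lastname": "Smith", "firstname": "Anna"},
]

# The "index": entries sorted by lastname first, then firstname.
index = sorted((p["lastname"], p["firstname"], p["_id"]) for p in people)


def find_ids(lastname, firstname=None):
    """Look up _id values by last name, optionally narrowed by first name."""
    prefix = (lastname,) if firstname is None else (lastname, firstname)
    start = bisect_left(index, prefix)  # jump straight to the first candidate
    ids = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break  # entries are sorted, so no later entry can match
        ids.append(entry[-1])
    return ids


print(find_ids("Smith"))
print(find_ids("Smith", "Anna"))
```

Because the entries are ordered by last name first, the same structure also serves queries on the last name alone, and duplicates are allowed because nothing here enforces uniqueness.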
We will look at indexing in more depth in Chapter 10, but suffice it to say that MongoDB has you
covered as far as indexing is concerned.
Leveraging Geospatial Indexes
One form of indexing worthy of special mention is geospatial indexing. This new, specialized indexing
technique was introduced in MongoDB 1.4. You use this feature to index location-based data, enabling you
to answer queries such as how many items are within a certain distance from a given set of coordinates.
As an increasing number of web applications start making use of location-based data, this feature will
play an increasingly prominent role in everyday development.
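The kind of question a geospatial index answers can be written out as a brute-force calculation. The sketch below assumes coordinates are (longitude, latitude) pairs in degrees and uses the haversine formula on a spherical Earth; it checks every item in turn, which is exactly the work a geospatial index lets the database avoid. The function names and the sample data are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (longitude, latitude) pairs."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within(items, center, max_km):
    """Return the items whose 'loc' lies within max_km of center."""
    return [item for item in items if haversine_km(item["loc"], center) <= max_km]


shops = [
    {"name": "a", "loc": (0.0, 0.0)},
    {"name": "b", "loc": (0.5, 0.0)},
    {"name": "c", "loc": (10.0, 0.0)},
]
nearby = within(shops, (0.0, 0.0), 100)
print(len(nearby))
```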
Profiling Queries
A built-in profiling tool lets you see how MongoDB works out which documents to return. This is useful
because, in many cases, a query can be easily improved simply by adding an index; a missing index is the
number one cause of painfully slow queries. If you have a complicated query, and you’re not really sure why
it’s running so slowly, then the query profiler (MongoDB’s query planner explain()) can provide you with
extremely valuable information. Again, you’ll learn more about the MongoDB profiler in Chapter 10.
Updating Information In Place (Memory Mapped Database Only)
When a database updates a row (or in the case of MongoDB, a document), it has a couple of choices about
how to do it. Many databases choose the multiversion concurrency control (MVCC) approach, which allows
multiple users to see different versions of the data. This approach is useful because it ensures that the data
won’t be changed partway through by another program during a given transaction.
The downside to this approach is that the database needs to track multiple copies of the data. For
example, CouchDB provides very strong versioning, but this comes at the cost of writing the data out in its
entirety. While this ensures that the data are stored in a robust fashion, it also increases complexity and
reduces performance.
MongoDB, on the other hand, updates information in place. This means that (in contrast to CouchDB)
MongoDB can update the data wherever it happens to be. This typically means that no extra space needs to
be allocated, and the indexes can be left untouched.
Another benefit of this method is that MongoDB performs lazy writes. Reading from and writing to memory
is very fast, but writing to disk is thousands of times slower. This means that you want to limit reading from
and writing to the disk as much as possible. This isn’t possible in CouchDB, because that program ensures
that each document is quickly written to disk. While this approach guarantees that the data are written safely
to disk, it also impacts performance significantly.
MongoDB only writes to disk when it has to, which is usually once every 100 milliseconds or so. This
means that if a value is being updated many times a second—a not uncommon scenario if you’re using
a value as a page counter or for live statistics—then the value will only be written once, rather than the
thousands of times that CouchDB would require.
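The effect of lazy writes on a busy counter can be illustrated with a toy model. This sketch assumes nothing about MongoDB's internals; LazyCounter is a hypothetical class that keeps the live value in memory and counts how often it would have to touch the disk.

```python
class LazyCounter:
    """Toy page counter that updates in memory and persists only on flush."""

    def __init__(self):
        self.value = 0        # current value, held in memory
        self.disk_value = 0   # last value that was "written to disk"
        self.disk_writes = 0  # number of simulated disk writes
        self.dirty = False

    def increment(self):
        self.value += 1       # cheap in-memory update
        self.dirty = True

    def flush(self):
        """Persist the current value if it changed since the last flush."""
        if self.dirty:
            self.disk_value = self.value
            self.disk_writes += 1
            self.dirty = False


counter = LazyCounter()
for _ in range(1000):
    counter.increment()
counter.flush()               # one periodic flush covers all 1,000 updates
print(counter.disk_value, counter.disk_writes)
```

The tradeoff is visible in the model as well: anything incremented after the last flush exists only in memory until the next one.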
This approach makes MongoDB much faster, but, again, it comes with a tradeoff. CouchDB may be
slower, but it does guarantee that data are stored safely on the disk. MongoDB makes no such guarantee,
and this is why a traditional RDBMS is probably a better solution for managing critical data such as billing or
accounts receivable.
Storing Binary Data
GridFS is MongoDB’s solution to storing binary data in the database. BSON supports saving up to 16MB of
binary data in a document, and this may well be enough for your needs. For example, if you want to store
a profile picture or a sound clip, then 16MB might be more space than you need. On the other hand, if you
want to store movie clips, high-quality audio clips, or even files that are several hundred megabytes in size,
then MongoDB has you covered here, too.
GridFS works by storing the information about the file (called metadata) in the files collection. The
data themselves are broken down into pieces called chunks that are stored in the chunks collection. This
approach makes storing data both easy and scalable; it also makes range operations (such as retrieving
specific parts of a file) much easier to use.
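The split into one metadata document and many chunk documents can be sketched as follows. This is a simplified model rather than a driver API: to_gridfs_documents and read_range are hypothetical helpers, the field names follow the GridFS convention (length, chunkSize, files_id, n, data), and the 255 KB default chunk size is an assumption that may differ between MongoDB versions.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # assumed default; check your driver's setting


def to_gridfs_documents(file_id, filename, data, chunk_size=DEFAULT_CHUNK_SIZE):
    """Split raw bytes into one files-style document and a list of chunk documents."""
    files_doc = {"_id": file_id, "filename": filename,
                 "length": len(data), "chunkSize": chunk_size}
    chunks = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunks


def read_range(chunks, chunk_size, start, end):
    """Return bytes [start, end), touching only the chunks that overlap the range."""
    first, last = start // chunk_size, (end - 1) // chunk_size
    needed = b"".join(c["data"] for c in chunks if first <= c["n"] <= last)
    offset = start - first * chunk_size
    return needed[offset:offset + (end - start)]


payload = bytes(range(256)) * 10  # 2,560 bytes of sample data
meta, parts = to_gridfs_documents(1, "clip.bin", payload, chunk_size=1000)
print(meta["length"], len(parts))
```

Because each chunk carries its sequence number n, a range read only has to fetch the chunks that overlap the requested bytes, which is what makes partial retrieval cheap.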
Generally speaking, you would use GridFS through your programming language’s MongoDB driver, so
it’s unlikely you’d ever have to get your hands dirty at such a low level. As with everything else in MongoDB,
GridFS is designed for both speed and scalability. This means you can be confident that MongoDB will be up
to the task if you want to work with large data files.
Replicating Data
When we talked about the guiding principles behind MongoDB, we mentioned that RDBMS databases
offer certain guarantees for data storage that are not available in MongoDB. These guarantees weren’t
implemented for a handful of reasons. First, these features would slow the database down. Second, they
would greatly increase the complexity of the program. Third, it was felt that the most common failure on
a server would be hardware, which would render the data unusable anyway, even if the data were safely
saved to disk.
Of course, none of this means that data safety isn’t important. MongoDB wouldn’t be of much use if you
couldn’t count on being able to access the data when you need them. Initially, MongoDB provided a safety
net with a feature called master-slave replication, in which only one database is active for writing at any given
time, an approach that is also fairly common in the RDBMS world. This feature has since been replaced with
replica sets, and basic master-slave replication has been deprecated and should no longer be used.
Replica sets have one primary server (similar to a master), which handles all the write requests from
clients. Because there is only one primary server in a given set, it can guarantee that all writes are handled
properly. When a write occurs, it is logged in the primary’s oplog.
The oplog is replicated by the secondary servers (of which there can be many) and used to bring them
up to date with the current primary. Should the primary fail at any given time, the surviving members of
the replica set will hold an election and one of the secondaries will become the primary and take over
responsibility for handling client write requests. Application drivers will automatically detect any changes to
the replica set configuration or replica set status and reestablish connectivity based on the updated replica
set state. In order for a replica set to maintain a primary, a strict majority of the replica set’s voting
members must be healthy and able to connect with one another. For example, a three-node replica set requires
two healthy nodes to maintain a primary.
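The majority rule is simple arithmetic, sketched here under the assumption that every member of the set has one vote; the function names are invented for the illustration.

```python
def votes_needed(total_members):
    """Smallest number of members that forms a strict majority of the set."""
    return total_members // 2 + 1


def can_keep_primary(total_members, healthy_members):
    """True if enough members can see one another to elect or keep a primary."""
    return healthy_members >= votes_needed(total_members)


for size in (3, 4, 5):
    print(size, votes_needed(size))
```

Note that a four-node set needs three healthy nodes, so it tolerates no more failures than a three-node set does.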
Implementing Sharding
For those involved with large-scale deployments, autosharding will probably prove to be one of MongoDB’s
most significant and oft-used features.
In an autosharding scenario, MongoDB takes care of all the data splitting and recombination for you.
It makes sure the data go to the right server and that queries are run and combined in the most efficient
manner possible. In fact, from a developer’s point of view, there is no difference between talking to a
MongoDB database with a hundred shards and talking to a single MongoDB server.
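A toy model can show what "splitting and recombination" means in practice. The sketch below assumes documents are routed by ranges of a username key and that a query is sent to every shard and the results merged; ToyCluster and its split points are hypothetical, and a real sharded cluster manages chunk ranges, balancing, and targeted queries far more carefully.

```python
from bisect import bisect_right


class ToyCluster:
    """Routes documents to shards by key range and merges query results."""

    def __init__(self, split_points):
        self.split_points = sorted(split_points)
        self.shards = [[] for _ in range(len(self.split_points) + 1)]

    def shard_for(self, key):
        # Keys below the first split point go to shard 0, and so on.
        return bisect_right(self.split_points, key)

    def insert(self, doc):
        self.shards[self.shard_for(doc["username"])].append(doc)

    def find(self, predicate):
        # Scatter the query to every shard, then gather the matches.
        return [doc for shard in self.shards for doc in shard if predicate(doc)]


cluster = ToyCluster(["h", "p"])
for name in ("alice", "harry", "zoe"):
    cluster.insert({"username": name})  # the caller never picks a shard

print([len(shard) for shard in cluster.shards])
```

The caller only ever uses insert and find, which is the point the text makes: the interface looks the same whether there is one shard or a hundred.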
In the meantime, if you’re just starting out or you’re building your first MongoDB-based website, then
you’ll probably find that a single instance of MongoDB is sufficient for your needs (although for a production
environment, we still recommend using a replica set). If you end up building the next Facebook or Amazon,
however, you will be glad that you built your site on a technology that can scale so limitlessly. Sharding is the
topic of Chapter 12 of this book.
Using Map and Reduce Functions
For many people, hearing the term MapReduce sends shivers down their spines. At the other extreme,
many RDBMS advocates scoff at the complexity of map and reduce functions. It’s scary for some because
these functions require a completely different way of thinking about finding and sorting your data, and
many professional programmers have trouble getting their heads around the concepts that underpin map
and reduce functions. That said, these functions provide an extremely powerful way to query data. In fact,
CouchDB supports only this approach, which is one reason it has such a high learning curve.
MongoDB doesn’t require that you use map and reduce functions. In fact, MongoDB relies on a simple
querying syntax that is more akin to what you see in MySQL. However, MongoDB does make these functions
available for those who want them. The map and reduce functions are written in JavaScript and run on the
server. The job of the map function is to find all the documents that meet certain criteria. These results are
then passed to the reduce function, which processes the data. The reduce function doesn’t usually return
a collection of documents; rather, it returns a new document that contains the information derived. As a
general rule, if you would normally use GROUP BY in SQL, then the map and reduce functions are probably
the right tools for the job in MongoDB.
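The GROUP BY analogy can be made concrete with a small model. MongoDB's own map and reduce functions are written in JavaScript and run on the server; the Python sketch below only imitates the flow, assuming a map step that emits key/value pairs and a reduce step that combines the values for each key. All names here are hypothetical.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Group the values emitted by map_fn by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_by_artist(doc):
    # Emit one (artist, 1) pair per document, like COUNT(*) ... GROUP BY artist.
    yield doc["artist"], 1


def count(key, values):
    return sum(values)


albums = [
    {"artist": "Nirvana", "title": "Nevermind"},
    {"artist": "Muse", "title": "Absolution"},
    {"artist": "Nirvana", "title": "In Utero"},
]
per_artist = map_reduce(albums, map_by_artist, count)
print(per_artist)
```

The result is not a set of the original documents but a new structure holding the derived information, one entry per group.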
The Aggregation Framework
MapReduce is a very powerful tool, but it has one major drawback: it’s not exactly high performance. This is
because of how MapReduce is implemented behind the scenes. In short, a lot of work has to be done moving
the data about and converting between the native storage format (BSON) and JSON, applying filters, and
so forth. With the aggregation framework, a large number of operators are provided that are written in C++
and are highly performant. The operators available are growing all the time, with each release bringing new
features.
The aggregation framework is pipeline based, and it allows you to take individual pieces of a query and
string them together in order to get the result you’re looking for. This maintains the benefits of MongoDB’s
document-oriented design while still providing high performance.
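The pipeline idea, in which each stage consumes the output of the previous one, can be imitated in plain Python. This is only a model of the concept: the stage functions below are invented for the example and are not MongoDB operators, although they loosely resemble matching, grouping, and sorting stages.

```python
from collections import Counter
from functools import reduce


def match(field, value):
    """Stage that keeps only documents whose field equals value."""
    return lambda docs: [d for d in docs if d.get(field) == value]


def group_count(field):
    """Stage that counts documents per distinct value of field."""
    return lambda docs: [{"_id": key, "count": n}
                         for key, n in Counter(d[field] for d in docs).items()]


def sort_by(field, descending=False):
    """Stage that orders documents by field."""
    return lambda docs: sorted(docs, key=lambda d: d[field], reverse=descending)


def run_pipeline(docs, stages):
    """Feed the documents through each stage in order."""
    return reduce(lambda current, stage: stage(current), stages, docs)


media = [
    {"type": "CD", "artist": "Nirvana"},
    {"type": "CD", "artist": "Muse"},
    {"type": "DVD", "artist": "Wachowskis"},
    {"type": "CD", "artist": "Nirvana"},
]
result = run_pipeline(media, [match("type", "CD"),
                              group_count("artist"),
                              sort_by("count", descending=True)])
print(result)
```

Each stage is small and independent, so stages can be reordered or swapped without rewriting the whole query.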
So if you need all the power of MapReduce, you still have it at your beck and call. If you just want to do
some basic statistics and number crunching, you’re going to love the aggregation framework. You’ll learn
more about the aggregation framework and its commands in Chapters 4 and 6.
Getting Help
MongoDB has a great support community, and the core developers are very active and easily approachable
and typically go to great lengths to help other members of the community. MongoDB is easy to use and
comes with great documentation; however, it’s still nice to know that you’re not alone, and help is available,
should you need it.
Visiting the Website
The first place to look for updated information or help is on the MongoDB website (www.mongodb.org). This
site is updated regularly and contains all the latest MongoDB goodness. On this site, you can find drivers,
tutorials, examples, frequently asked questions, and much more.
Cutting and Pasting MongoDB Code
Pastie (http://pastie.org) is not strictly a MongoDB site; however, it is something you will come across
if you float about in #MongoDB for any length of time. The Pastie site basically lets you cut and paste (hence
the name) some output or program code, and then put it online for others to view. In IRC, pasting multiple
lines of text can be messy or hard to read. If you need to post a fair bit of text (such as three lines or more),
then you should visit http://pastie.org, paste in your content, and then paste the link to your new page
into the channel.
Finding Solutions on Google Groups
MongoDB also has a discussion group called mongodb-user (http://groups.google.com/group/mongodb-user).
This group is a great place to ask questions or search for answers. You can also interact with the group via
e-mail. Unlike IRC, which is very transient, the Google group is a great long-term resource. If you really want
to get involved with the MongoDB community, joining the group is a great way to start.
Finding Solutions on Stack Overflow
Stack Overflow (www.stackoverflow.com) is one of the most popular programming Q&A sites on the
Internet and has a repository of tens of thousands of questions and answers available for anyone to view.
Stack Overflow is best suited for when you have a specific question and are looking for a specific answer.
Answers are rated by the community, so there is a very high chance you’ll find something useful here and
quite often the exact answer you’re looking for. MongoDB, Inc., the company behind the product, maintains
an active support presence on Stack Overflow, making it a great place to start hunting for your answers.
Stack Overflow specifically targets programming questions, but there are also “Stack Exchanges,” such
as DBA Stack Exchange and Server Fault, which cover database and sysadmin questions, respectively.
Leveraging the JIRA Tracking System
MongoDB uses the JIRA issue-tracking system. You can view the tracking site at http://jira.mongodb.org/,
and you are actively encouraged to report any bugs or problems that you come across to this site. Reporting
such issues is viewed by the community as a genuinely good thing to do. Of course, you can also search
through previous issues, and you can even view the roadmap and planned updates for the next release.
If you haven’t posted to JIRA before, you might want to try the mongodb-user list first. You will quickly
find out whether you’ve found something new, and if so, you will be shown how to go about reporting it.
Chatting with the MongoDB Developers
Some MongoDB developers often hang out on Internet Relay Chat (IRC) at #MongoDB on the Freenode
network (www.freenode.net). Of course, the developers do need to sleep at some point (coffee only works
for so long!); fortunately, there are also many knowledgeable MongoDB users from around the world who
are ready to help out. Many people who visit the #MongoDB channel aren’t experts; however, the general
atmosphere is so friendly that they stick around anyway. Please feel free to join the #MongoDB channel and chat
with people there—you may find some great hints and tips. If you’re really stuck, you’ll probably be able to
quickly get back on track.
Summary
This chapter has provided a whistle-stop tour of the benefits MongoDB brings to the table. We’ve looked
at the philosophies and guiding principles behind MongoDB’s creation and development, as well as the
tradeoffs MongoDB’s developers made when implementing these ideals. We’ve also looked at some of the
key terms used in conjunction with MongoDB, how they fit together, and their rough SQL equivalents.
Next, we looked at some of the features MongoDB offers, including how and where you might want to
use them. Finally, we wrapped up the chapter with a quick overview of the community and where you can go
to get help, should you need it.
Now that we've given you a taste of what MongoDB can do for you, let's move on to Chapter 2, where we
will show you how to get MongoDB installed and ready to go.
Chapter 2
Installing MongoDB
In Chapter 1, you got a taste of what MongoDB can do for you. In this chapter, you will learn how to
install and expand MongoDB to do even more, enabling you to use it in combination with your favorite
programming language.
MongoDB is a cross-platform database, and you can find a significant list of available packages to
download from the MongoDB website (www.mongodb.org). The wealth of available versions might make it
difficult to decide which version is the right one for you. The right choice for you probably depends on the
operating system your server uses, the kind of processor in your server, and whether you prefer a stable
release or would like to take a dive into a version that is still in development but offers exciting new features.
Perhaps you’d like to install both a stable and a forward-looking version of the database. It’s also possible
you’re not entirely sure which version you should choose yet. In any case, read on!
Choosing Your Version
When you look at the Download section on the MongoDB website, you will see a rather straightforward
overview of the packages available for download. The first thing you need to pay attention to is the operating
system you are going to run the MongoDB software on. Currently, there are precompiled packages available
for Windows, various flavors of the Linux operating system, Mac OS, and Solaris.
■Note An important thing to remember here is the difference between the 32-bit release and the 64-bit
release of the product. The 32-bit release is only supported as legacy and may lack performance optimizations
present in the 64-bit version. The 32-bit release also does not support the WiredTiger storage engine. It is
strongly recommended to use the 64-bit release for production environments.
You will also need to pay attention to the version of the MongoDB software itself: there are production
releases, previous releases, and development releases. The production release indicates that it’s the most
recent stable version available. When a newer and generally improved or enhanced version is released, the
prior most recent stable version will be made available as a previous release. This designation means the
release is stable and reliable, but it usually has fewer features available in it. Finally, there’s the development
release. This release is generally referred to as the unstable version. This version is still in development, and
it will include many changes, including significant new features. Although it has not been fully developed
and tested yet, the developers of MongoDB have made it available to the public to test or otherwise try out.
CHAPTER 2 ■ INSTALLING MONGODB
Understanding the Version Numbers
MongoDB uses the “odd-numbered versions for development releases” approach. In other words, you can
tell by looking at the second part of the version number (also called the release number) whether a version
is a development version or a stable version. If the second number is even, then it’s a stable release. If the
second number is odd, then it’s an unstable, or development, release.
Let’s take a closer look at the three parts of a version number, A, B, and C:
• A, the first (or leftmost) number: Represents the major version and only changes
when there is a full version upgrade.
• B, the second (or middle) number: Represents the release number and indicates
whether a version is a development version or a stable version. If the number is even,
the version is stable; if the number is odd, the version is unstable and considered a
development release.
• C, the third (or rightmost) number: Represents the revision number; this is used for
bugs and security issues.
For example, at the time of writing, the following versions were available from the MongoDB website:
• 3.0.6 (Production release)
• 2.6.11 (Previous release)
• 3.1.8 (Development release)
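The even/odd rule is easy to check in code. The sketch below assumes the three-part A.B.C scheme described above; parse_version and is_stable are hypothetical helper names.

```python
def parse_version(version):
    """Split an 'A.B.C' version string into (major, release, revision) integers."""
    major, release, revision = (int(part) for part in version.split("."))
    return major, release, revision


def is_stable(version):
    """A version is stable when its second (release) number is even."""
    return parse_version(version)[1] % 2 == 0


for version in ("3.0.6", "2.6.11", "3.1.8"):
    kind = "stable" if is_stable(version) else "development"
    print(version, kind)
```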
Installing MongoDB on Your System
So far, you’ve learned which versions of MongoDB are available and—hopefully—were able to select one.
Now you’re ready to take a closer look at how to install MongoDB on your particular system. The two main
operating systems for servers at the moment are based on Linux and Microsoft Windows, so this chapter will
walk you through how to install MongoDB on both of these operating systems, beginning with Linux.
Installing MongoDB under Linux
The Unix-based operating systems are extremely popular choices at the moment for hosting services,
including web services, mail services, and, of course, database services. In this chapter, we’ll walk you
through how to get MongoDB running on a popular Linux distribution: Ubuntu.
Depending on your needs, you have two ways of installing MongoDB under Ubuntu: you can install the
packages automatically through so-called repositories, or you can install it manually. The next two sections
will walk you through both options.
Installing MongoDB through the Repositories
Repositories are basically online directories filled with software. Every package contains information about
the version number, prerequisites, and possible incompatibilities. This information is useful when you
need to install a software package that requires another piece of software to be installed first because the
prerequisites can be installed at the same time.
The default repositories available in Ubuntu’s LTS (long-term support) editions contain MongoDB, but
they may be out-of-date versions of the software. Therefore, let’s tell apt-get (the software you use to install
software from repositories) to look at a custom repository. To do this, you need to create a custom MongoDB
list file and specify the repository URL using the following command:
$ echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | \
  sudo tee /etc/apt/sources.list.d/mongodb-org-3.0.list
Next, you need to import MongoDB’s public GPG key, which is used to sign the packages so that their
integrity can be verified; you can do so by using the apt-key command:
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
When that is done, you need to tell apt-get that a new repository is available; you can do so using
apt-get’s update command:
$ sudo apt-get update
This line makes apt-get aware of your manually added repository. This means you can now tell apt-get
to install the software itself. You do this by typing the following command in the shell:
$ sudo apt-get install -y mongodb-org
This line installs the latest stable (production) release of the MongoDB community edition available in the
repository. If you wish to install a specific release instead, you need to specify the version number for each
package. For example, to install version 3.0.6, type in the following command instead:
$ sudo apt-get install -y mongodb-org=3.0.6 mongodb-org-server=3.0.6 mongodb-org-shell=3.0.6 \
    mongodb-org-mongos=3.0.6 mongodb-org-tools=3.0.6
That’s all there is to it. At this point, MongoDB has been installed and is (almost) ready to use!
■Note Running apt-get upgrade (after apt-get update has refreshed the package lists) on a system running
an older version of MongoDB will upgrade the software to the latest stable version available. You can prevent
this from happening by running these commands:
$ echo "mongodb-org hold" | sudo dpkg --set-selections
$ echo "mongodb-org-server hold" | sudo dpkg --set-selections
$ echo "mongodb-org-shell hold" | sudo dpkg --set-selections
$ echo "mongodb-org-mongos hold" | sudo dpkg --set-selections
$ echo "mongodb-org-tools hold" | sudo dpkg --set-selections
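The five hold commands in the preceding note differ only in the package name, so the input they feed to dpkg can be generated rather than typed. The following is a minimal Python sketch of that idea; the helper name hold_selections is hypothetical, and the script only prints the text, leaving it to you to pipe the result into sudo dpkg --set-selections.

```python
# Package names taken from the note above.
MONGODB_PACKAGES = [
    "mongodb-org",
    "mongodb-org-server",
    "mongodb-org-shell",
    "mongodb-org-mongos",
    "mongodb-org-tools",
]


def hold_selections(packages):
    """Return one '<package> hold' line per package (hypothetical helper)."""
    return "".join(f"{name} hold\n" for name in packages)


if __name__ == "__main__":
    # Print the selections so they can be piped into:
    #   sudo dpkg --set-selections
    print(hold_selections(MONGODB_PACKAGES), end="")
```

If you save the sketch as hold.py (an assumed file name), running python hold.py | sudo dpkg --set-selections should have the same effect as the five separate commands.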
Installing MongoDB Manually
Next, we’ll cover how to install MongoDB manually. Given how easy it is to install MongoDB automatically
with apt-get on Ubuntu LTS editions, you might wonder why you would want to install the software
manually. For starters, the packaging remains a work in progress, so it might be the case that there are
versions not yet available through the repositories. It’s also possible that the version of MongoDB you want
to use isn’t included in the repository or that you simply don’t run Ubuntu or an LTS version of it. Installing
the software manually also gives you the ability to run multiple versions of MongoDB at the same time.
Suppose you’ve decided which version of MongoDB you would like to use, and you’ve downloaded it from the
MongoDB website, http://mongodb.org/downloads, to your Home directory. Next, you need to extract the
package with the following command:
$ tar xzvf mongodb-linux-x86_64-<distribution version>-<mongodb version>.tgz
This command extracts the entire contents of the package to a new directory called
mongodb-linux-x86_64-<distribution version>-<mongodb version>; this directory is located under your
current directory. This directory will contain a number of subdirectories and files. The directory that contains
the executable files is called the bin directory. We will cover which applications perform which tasks shortly.
However, you don’t need to do anything further to install the application. Indeed, it doesn’t take much
more time to install MongoDB manually—depending on what else you need to install, it might even be
faster. Manually installing MongoDB does have some downsides, however. For example, the executables that
you just extracted and found in the bin directory can’t be executed from anywhere except the bin directory
by default unless you add them to your $PATH environment variable. Thus, if you want to run the mongod
service, you will need to do so directly from the aforementioned bin directory if this directory isn’t part of
your $PATH environment variable. Another critical downside here is that the mongod service won’t start
automatically as a service after a reboot, and the manually installed build does not include Secure Sockets
Layer (SSL) support. These downsides highlight some of the benefits of installing MongoDB through repositories.
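To see whether the extracted executables can be run from anywhere, you can check whether mongod is reachable through your $PATH. The following is a minimal Python sketch using only the standard library; the function name find_executable and its extra_dir parameter are hypothetical and exist only for this illustration.

```python
import os
import shutil


def find_executable(name, extra_dir=None):
    """Return the full path to `name`, or None if it cannot be found.

    `extra_dir` is an optional directory (for example, the extracted
    bin directory) that is searched in addition to $PATH.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = extra_dir + os.pathsep + search_path
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_executable("mongod")
    if location is None:
        print("mongod is not on your PATH; run it from the bin directory "
              "or add that directory to PATH.")
    else:
        print("mongod found at", location)
```

If the sketch reports that mongod is missing, adding the extracted bin directory to $PATH in your shell profile is the usual remedy.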
Installing MongoDB under Windows
Microsoft’s Windows is also a popular choice for server software, including Internet-based services.
MongoDB comes with an installer for Windows-based operating systems. All you need to do is select
the MongoDB version of your choice, download the installer, and run it to get it set up. The installer comes
with two options—Complete and Custom—allowing you to choose the features to be installed and where
they will be installed. In most cases, the Complete setup type would be recommended.
Alternatively, the legacy build from MongoDB can be downloaded in ZIP format. With this, you do
not need to walk through any setup process; installing the software is a simple matter of downloading the
package, extracting it, and running the application itself. Similar to the Linux legacy builds, this version will
not include SSL support, however.
For example, assume you’ve decided to download the latest legacy version of MongoDB for your
64-bit Windows 2008 R2+ server. You begin by extracting the package
(mongodb-win32-x86_64-2008plus-x.y.z.zip) to the root of your C:\ drive. At this point, all you need to do
is open a command prompt (Start ➤ Run ➤ cmd ➤ OK) and browse to the directory you extracted the contents to:
> cd C:\mongodb-win32-x86_64-2008plus-x.y.z\
> cd bin\
Doing this brings you to the directory that contains the MongoDB executables. That’s all there is to it: as
I noted previously, with this approach, there’s no installation necessary.
Running MongoDB
At long last you’re ready to get your hands dirty. You’ve learned where to get the MongoDB version that best
suits your needs and hardware, and you’ve also seen how to install the software. Now it’s finally time to look
at running and using MongoDB.
Prerequisites
Before you can start the MongoDB service, you need to create a data directory for MongoDB to store its files
in. By default, MongoDB stores the data in the /data/db directory on Unix-based systems (such as Linux and
OS X) and in the C:\data\db directory on Windows.
■Note MongoDB does not create these data directories for you, so you need to create them manually;
otherwise, MongoDB will fail to run and throw an error message. Also, be sure that you set the permissions
correctly: MongoDB must have read, write, and directory creation permissions to function properly.
If you wish to use a directory other than /data/db or C:\data\db, then you can tell MongoDB to look at
the desired directory by using the --dbpath flag when executing the service.
Once you create the required directory and assign the appropriate permissions, you can start the
MongoDB core database service by executing the mongod application. You can do this from the command
prompt or the shell in Windows and Linux, respectively.
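The prerequisites above boil down to two checks: the data directory must exist, and the user running mongod must be able to read, write, and traverse it. The following is a minimal Python sketch of those checks; the function name prepare_data_directory is hypothetical, and a temporary location is used instead of /data/db so the example runs without administrator rights.

```python
import os
import tempfile


def prepare_data_directory(path):
    """Create `path` if needed and report whether it is usable.

    Returns True when the current user can read, write, and traverse it.
    """
    os.makedirs(path, exist_ok=True)
    return os.access(path, os.R_OK | os.W_OK | os.X_OK)


if __name__ == "__main__":
    # A temporary location keeps the sketch runnable without root rights;
    # creating the default /data/db usually requires sudo.
    dbpath = os.path.join(tempfile.mkdtemp(), "data", "db")
    if prepare_data_directory(dbpath):
        print("Ready. Start the server with: mongod --dbpath", dbpath)
    else:
        print("Insufficient permissions on", dbpath)
```

Note that the check applies to the user running the script, so it is only meaningful if you run it as the same user that will start mongod.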
Surveying the Installation Layout
After you install or extract MongoDB successfully, you will have the applications shown in Table 2-1
available in the bin directory (in both Linux and Windows).
Table 2-1. The Included MongoDB Applications
Application     Function
bsondump        Reads contents of BSON-formatted rollback files.
mongo           The database shell.
mongod          The core database server.
mongodump       Database backup utility.
mongoexport     Export utility (JSON, CSV, TSV), not reliable for backup.
mongofiles      Manipulates files in GridFS objects.
mongoimport     Import utility (JSON, CSV, TSV), not reliable for recoveries.
mongooplog      Pulls oplog entries from another mongod instance.
mongoperf       Check disk I/O performance.
mongorestore    Database backup restore utility.
mongos          MongoDB shard process.
mongostat       Returns counters of database operation.
mongotop        Tracks/reports MongoDB read/write activities.
mongorestore    Restore/import utility.
Note: All applications are within the bin directory.
The installed software includes 14 applications (or 13, under Microsoft Windows) that you will be using
in conjunction with your MongoDB databases. The two “most important” applications are the mongo and
mongod applications. The mongo application allows you to use the database shell; this shell enables you to
accomplish practically anything you’d want to do with MongoDB.
The mongod application starts the service or daemon, as it’s also called. There are also many flags you
can set when launching the MongoDB applications. For example, the service lets you specify the path where
the database is located (--dbpath), show version information (--version), and even print some diagnostic
system information (with the --sysinfo flag)! You can view the entire list of options by including the --help
flag when you launch the service. For now, you can just use the defaults and start the service by typing
mongod as any user in your shell or command prompt.
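If you prefer to launch or inspect the service from a script, the flags mentioned above can be passed as an ordinary argument list. The following is a minimal Python sketch under that assumption; the helper names mongod_command and mongod_version are hypothetical, and only the --dbpath and --version flags described in the text are used.

```python
import shutil
import subprocess


def mongod_command(dbpath, executable="mongod"):
    """Build the argument list for starting mongod with a custom data path."""
    return [executable, "--dbpath", dbpath]


def mongod_version(executable="mongod"):
    """Return the text printed by `mongod --version`, or None if not found."""
    path = shutil.which(executable)
    if path is None:
        return None
    completed = subprocess.run(
        [path, "--version"], capture_output=True, text=True
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    print("Command to start the server:", " ".join(mongod_command("/data/db")))
    version = mongod_version()
    print(version if version else "mongod was not found on your PATH")
```

The sketch only builds the start command rather than running it, because starting the server would block the script until mongod exits.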
Using the MongoDB Shell
Once you create the database directory and start the mongod database application successfully, you’re ready
to fire up the shell and take a sneak peek at the powers of MongoDB.
Fire up your shell (Unix) or your command prompt (Windows); when you do so, make sure you are in
the correct location, so that the mongo executable can be found. You can start the shell by typing mongo at the
command prompt and hitting the Return key. You will be immediately presented with a blank window and a
blinking cursor (see Figure 2-1). Ladies and gentlemen, welcome to MongoDB!
If you start the MongoDB service with the default parameters, and start the shell with the default
settings, you will be connected to the default test database running on your local host. MongoDB does not
actually create this database until you store data in it. This is one of MongoDB’s most powerful features: if you
attempt to connect to a database that does not exist, MongoDB will automatically create it for you once you
insert data into it. This can be either good or bad, depending on how well you handle your keyboard.
Before taking any further steps, such as implementing any additional drivers that will enable you to
work with your favorite programming language, you might find it helpful to take a quick peek at some of the
more useful commands available in the MongoDB shell (see Table 2-2).
Figure 2-1. The MongoDB shell
■Tip You can get a full list of commands by typing the help command in the MongoDB shell.
Installing Additional Drivers
You might think that you are ready to take on the world now that you have set up MongoDB and know
how to use its shell. That’s partially true; however, you probably want to use your preferred programming
language rather than the shell when querying or otherwise manipulating the MongoDB database. MongoDB
offers multiple official drivers, and many more are offered in the community that let you do precisely that.
For example, drivers for the following programming languages can be found on the MongoDB website:
• C
• C++
• C#
• Java
• Node.js
• Perl
• PHP
• Python
• Motor (an asynchronous Python driver)
• Ruby
• Scala
In this section, you will learn how to implement MongoDB support for two of the more popular
programming languages in use today: PHP and Python.
■Tip There are many community-driven MongoDB drivers available. A long list can be found on the
MongoDB website docs.mongodb.org/ecosystem.
Table 2-2. Basic Commands within the MongoDB Shell
Command             Function
show dbs            Shows the names of the available databases.
show collections    Shows the collections in the current database.
show users          Shows the users in the current database.
use <db name>       Sets the current database to <db name>.
Installing the PHP Driver
PHP is one of the most popular programming languages in existence today. This language is specifically
aimed at web development, and it can be incorporated into HTML easily. This fact makes the language
the perfect candidate for designing a web application, such as a blog, a guestbook, or even a business-card
database. The next few sections cover your options for installing and using the MongoDB PHP driver.
Getting MongoDB for PHP
Like MongoDB, PHP is a cross-platform development tool, and the steps required to set up MongoDB in
PHP vary depending on the intended platform. Previously, this chapter showed you how to install MongoDB
on both Ubuntu and Windows; we’ll adopt the same approach here, demonstrating how to install the driver
for PHP on both Ubuntu and Windows.
Begin by downloading the PHP driver for your operating system. Do this by firing up your browser and
navigating to docs.mongodb.org. At the time of writing, the website includes a separate menu option called
Drivers. Click this option to bring up a list of currently available language drivers (see Figure 2-2).
Next, select PHP from the list of languages and follow the links to download the latest (stable) version of
the driver. Different operating systems will require different approaches for installing the MongoDB extension
for PHP automatically. That’s right; just as you were able to install MongoDB on Ubuntu automatically, you
can do the same for the PHP driver. And just as when installing MongoDB under Ubuntu, you can also choose
to install the PHP language driver manually. Let’s look at the two options available to you.
Figure 2-2. A short list of currently available language drivers for MongoDB
Installing the PHP Driver on Unix-Based Platforms Automatically
The developers of PHP came up with a great solution that allows you to expand your PHP installation with
other popular extensions: PECL. PECL is a repository solely designed for PHP; it provides a directory of all
known extensions that you can use to download, install, and even develop PHP extensions. If you are already
acquainted with the package-management tool called apt-get (which you used previously to install
MongoDB), then you will be pleased by how similar PECL’s interface is to the one in apt-get.
Assuming that you have PECL installed on your system, open up a console and type the following
command to install the MongoDB extension:
$ sudo pecl install mongo
Entering this command causes PECL to download and install the MongoDB extension for PHP
automatically. In other words, PECL will download the extension for your PHP version and place it in the
PHP extensions directory. There’s just one catch: PECL does not automatically add the extension to the
list of loaded extensions; you will need to do this step manually. To do so, open a text editor (vim, nano, or
whichever text editor you prefer) and alter the file called php.ini, which is the main configuration file PHP
uses to control its behavior, including the extensions it should load.
Next, open the php.ini file, scroll down to the extensions section, and add the following line to tell PHP
to load the MongoDB driver:
extension=mongo.so
■Note The preceding step is mandatory; if you don’t do this, then the MongoDB commands in PHP will not
function. To find the php.ini file on your system, you can pipe PHP’s configuration output through the grep
command in your shell: php -i | grep Configuration.
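Because a commented-out or mistyped extension line is an easy mistake to make, it can help to check the php.ini contents programmatically. The following is a minimal Python sketch that scans the text of a php.ini file for an active extension line; the function name extension_enabled is hypothetical, and the sketch assumes the usual php.ini convention that a semicolon starts a comment.

```python
def extension_enabled(ini_text, extension="mongo.so"):
    """Return True if `ini_text` has an active extension=<extension> line."""
    for raw_line in ini_text.splitlines():
        # Drop anything after a semicolon, which starts a comment.
        line = raw_line.split(";", 1)[0].strip()
        if not line:
            continue
        key, separator, value = line.partition("=")
        if (separator and key.strip() == "extension"
                and value.strip().strip('"') == extension):
            return True
    return False


if __name__ == "__main__":
    sample = "[PHP]\n;extension=mongo.so\nextension=mongo.so\n"
    print("mongo.so enabled:", extension_enabled(sample))
```

To check a real configuration, you could read the file reported by php -i and pass its contents to the function.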
The “Confirming That Your PHP Installation Works” section later in this chapter will cover how to
confirm that an extension has been loaded successfully.
That’s all, folks! You’ve just installed the MongoDB extension for your PHP installation, and you are now
ready to use it. Next, you will learn how to install the driver manually.
Installing the PHP Driver on Unix-Based Platforms Manually
If you would prefer to compile the driver yourself or for some reason are unable to use the PECL application
as described previously (your hosting provider might not support this option, for instance), then you can
also choose to download the source driver and compile it manually.
To download the driver, go to the GitHub website (http://github.com). This site offers the latest source
package for the PHP driver. Once you download it, you will need to extract the package and make the driver
by running the following set of commands:
$ unzip mongo-php-driver-master.zip
$ cd mongo-php-driver-master
$ phpize
$ ./configure
$ sudo make install
This process can take a while, depending on the speed of your system. Once the process completes,
your MongoDB PHP driver is installed and ready to use! After you execute the commands, you will be shown
where the driver has been placed; typically, the output looks something like this:
Installing '/usr/lib/php5/20121212/mongo.so'
You do need to confirm that this directory is the same directory where PHP stores its extensions by
default. You can use the following command to confirm where PHP stores its extensions:
$ php -i | grep extension_dir
This line outputs the directory where all PHP extensions should be placed. If this directory doesn’t
match the one where the mongo.so driver was placed, then you must move the mongo.so driver to the proper
directory, so PHP knows where to find it.
As before, you will need to tell PHP that the newly created extension has been placed in its extension
directory and that it should load this extension. You can specify this by modifying the php.ini file’s
extensions section; add the following line to that section:
extension=mongo.so
Finally, a restart of your web service is required. When using the Apache HTTPd service, you can
accomplish this using the following service command:
$ sudo /etc/init.d/apache2 restart
That’s it! This process is a little lengthier than using PECL’s automated method; however, if you are
unable to use PECL, or if you are a driver developer and interested in bug fixes, then you would want to use
the manual method instead.
Installing the PHP Driver on Windows
You have seen previously how to install MongoDB on your Windows operating system. Now let’s look at how
to implement the MongoDB driver for PHP on Windows.
For Windows, there are precompiled DLLs available for each release of the PHP driver for MongoDB. You
can get these binaries from the PECL website (http://pecl.php.net/package/mongo). The biggest challenge
in this case is choosing the correct package to install for your version of PHP (a wide variety of packages
are available). If you aren’t certain which package version you need, you can use the <?php phpinfo(); ?>
command in a PHP page to learn exactly which one suits your specific environment. We’ll take a closer look at
the phpinfo() command later in this chapter.
After downloading the correct package and extracting its contents, all you need to do is copy the driver
file (called php_mongo.dll) to your PHP’s extension directory; this enables PHP to pick it up.
Depending on your version of PHP, the extension directory may be called either Ext or Extensions.
If you aren’t certain which directory it should be, you can review the PHP documentation that came with the
version of PHP installed on your system.
Once you place the driver DLL into the PHP extensions directory, you still need to tell PHP to load the
driver. Do this by altering the php.ini file and adding the following line in the extensions section:
extension=php_mongo.dll
When this is done, restart the HTTP service on your system, and you are now ready to use the MongoDB
driver in PHP. Before you start leveraging the magic of MongoDB with PHP, however, you need to confirm
that the extension is loaded correctly.
Confirming That Your PHP Installation Works
So far you’ve successfully installed both MongoDB and the MongoDB driver in PHP. Now it’s time to do
a quick check to confirm whether the driver is being loaded correctly by PHP. PHP gives you a simple
and straightforward method to accomplish this: the phpinfo() command. This command shows you an
extended overview of all the modules loaded, including version numbers, compilation options, server
information, operating system information, and so on.
To use the phpinfo() command, open a text or HTML editor, and type the following:
<?php phpinfo(); ?>
Next, save the document in your webserver’s www directory and call it whatever you like. For example,
you might call it test.php or phpinfo.php. Now open your browser and go to your localhost or external
server (that is, go to whatever server you are working on) and look at the page you just created. You will see
a good overview of all the PHP components and all sorts of other relevant information. The thing you need
to focus on here is the section that displays your MongoDB information. This section will list the version
number, port numbers, hostname, and so on (see Figure 2-3).
Once you confirm that the installation was successful and that the driver loaded successfully, you’re
ready to write some PHP code and walk through a MongoDB example that leverages PHP.
Connecting to and Disconnecting from the PHP Driver
You’ve confirmed that the MongoDB PHP driver has been loaded correctly, so it’s time to start writing some
PHP code! Let’s take a look at two simple yet fundamental options for working with MongoDB: initiating a
connection between MongoDB and PHP, and then severing that connection.
Figure 2-3. Displaying your MongoDB information in PHP
You use the MongoClient class to initiate a connection between MongoDB and PHP; this same class
also lets you use the database server commands. A simple yet typical connection command looks like this:
$connection = new MongoClient();
If you use this command without providing any parameters, it will connect to the MongoDB service on
the default MongoDB port (27017) on your localhost. If your MongoDB service is running somewhere else,
then you simply specify the hostname of the remote host you want to connect to:
$connection = new MongoClient("example.com");
This line instantiates a fresh connection to the MongoDB service running on the host example.com
(note that it will still connect to the default port: 27017). If you want to
connect to a different port number, however (for example, if you don’t want to use the default port, or you’re
already running another session of the MongoDB service on that port), you can do so by specifying the port
number and hostname:
$connection = new MongoClient("example.com:12345");
This example creates a connection to the database service. Next, you will learn how to disconnect
from the service. Assuming you used the method just described to connect to your database, you can call
the close() method on $connection to terminate the connection, as in this example:
$connection->close();
The close() method doesn’t need to be called, except in unusual circumstances. The reason for this is that
the PHP driver closes the connection to the database once the MongoClient object goes out of scope.
Nevertheless, it is recommended that you call close() at the end of your PHP code; this helps you avoid
having old connections hang around until they eventually time out. It also helps you ensure that any existing
connection is closed, thereby enabling a new connection to happen, as in the following example:
$connection = new MongoClient();
$connection->close();
$connection->connect();
The following snippet shows how this would look in PHP:
<?php
// Establish the database connection
$connection = new MongoClient();
// Close the database connection
$connection->close();
?>
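The address conventions used in this section (no argument means localhost, a bare hostname means the default port 27017, and host:port overrides the port) can be made explicit in a few lines. The following is a minimal Python sketch of that rule only; the helper name split_address is hypothetical, it is not part of any MongoDB driver, and it deliberately ignores IPv6 addresses and full connection URIs.

```python
DEFAULT_HOST = "localhost"
DEFAULT_PORT = 27017  # MongoDB's default port, as noted in the text


def split_address(address=""):
    """Split "", "host", or "host:port" into a (host, port) tuple."""
    host, _, port = address.partition(":")
    return (host or DEFAULT_HOST, int(port) if port else DEFAULT_PORT)


if __name__ == "__main__":
    for example in ("", "example.com", "example.com:12345"):
        print(repr(example), "->", split_address(example))
```

Running the sketch shows the three cases side by side, which mirrors the three MongoClient calls shown earlier in this section.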
Installing the Python Driver
Python is a general-purpose and easy-to-read programming language. These qualities make Python a
good language to start with when you are new to programming and scripting. It’s also a great language
to look into if you are familiar with programming and you’re looking for a multiparadigm programming
language that permits several styles of programming (object-oriented programming, structured
programming, and so on). In the upcoming sections, you’ll learn how to install Python and enable
MongoDB support for the language.
Installing PyMongo under Linux
Python offers a specific package for MongoDB support called PyMongo. This package allows you to interact
with the MongoDB database, but you will need to get this driver up and running before you can use this
powerful combination. As when installing the PHP driver, there are two methods you can use to install
PyMongo: an automated approach that relies on setuptools or a manual approach where you download
the source code for the project. The following sections show you how to install PyMongo using both
approaches.
Installing PyMongo Automatically
The pip application that comes bundled with the python-pip package lets you automatically download,
build, install, and manage Python packages. This is incredibly convenient: pip lets you extend your Python
installation with new modules while it does all the work for you.
■Note You must have setuptools installed before you can use the pip application. This will be done
automatically when installing the python-pip package.
To install pip, all you need to do is tell apt-get to download and install it, like so:
$ sudo apt-get install python-pip
When this line executes, pip will detect the currently running version of Python and install itself on the
system. That’s all there is to it. Now you are ready to use the pip command to download, make, and install
the MongoDB module, as in this example:
$ sudo pip install pymongo
Again, that’s all there is to it! PyMongo is now installed and ready to use.
■Tip You can also install previous versions of the PyMongo module with pip using the
pip install pymongo==x.y.z command. Here, x.y.z denotes the version of the module.
Installing PyMongo Manually
You can also choose to install PyMongo manually. Begin by going to the download section of the site that
hosts the PyMongo plug-in (http://pypi.python.org/pypi/pymongo). Next, download the tarball and
extract it. A typical download and extract procedure might look like this in your console:
$ wget http://pypi.python.org/packages/source/p/pymongo/pymongo-3.0.3.tar.gz
$ tar xzf pymongo-3.0.3.tar.gz
Once you successfully download and extract this file, make your way to the extracted contents directory
and invoke the installation of PyMongo by running the setup.py script with Python:
$ cd pymongo-3.0.3
$ sudo python setup.py install
The preceding snippet outputs the entire creation and installation process of the PyMongo module.
Eventually, this process brings you back to your prompt, at which time you’re ready to start using PyMongo.
Installing PyMongo under Windows
Installing PyMongo under Windows is a straightforward process. As when installing PyMongo under Linux,
Easy Install can simplify installing PyMongo under Windows as well. If you don’t have setuptools installed
yet (this package includes the easy_install command), then go to the Python Package Index website
(http://pypi.python.org) to locate the setuptools installer.
For example, assume you have Python version 3.4.3 installed on your system. Next, you will need
to download the setuptools bootstrapper, ez_setup.py, from the Python Package Index website. Simply
double-click the ez_setup.py Python file to install setuptools on your system! It is that simple.
■Caution If you have previously installed an older version of setuptools, then you will need to uninstall that
version using your system’s Add/Remove Programs feature before installing the newer version.
Once the installation is complete, you will find the easy_install.exe file in Python’s Scripts
subdirectory. At this point, you’re ready to install PyMongo on Windows.
Once you’ve successfully installed setuptools, you can open a command prompt and cd your way to
Python’s Scripts directory. By default, this is set to C:\Pythonxy\Scripts\, where xy represents your version
number. Once you navigate to this location, you can use the same syntax shown previously for installing the
Unix variant:
C:\Python34\Scripts> easy_install PyMongo
Unlike the output you get when installing this program on a Linux machine, the output here is rather
brief, indicating only that the extension has been downloaded and installed (see Figure 2-4). That said, this
information is sufficient for your purposes in this case.
Confirming That Your PyMongo Installation Works
To confirm whether the PyMongo installation has completed successfully, you can open your Python shell.
In Linux, you do this by opening a console and typing python. In Windows, you do this by clicking Start
➤ Programs ➤ Python xy ➤ Python (command line). At this point, you will be welcomed to the world of
Python (see Figure 2-5).
Figure 2-5. The Python shell
Figure 2-4. Installing PyMongo under Windows
You can use the import command to tell Python to start using the freshly installed extension:
>>> import pymongo
>>>
■Note You must run the import pymongo command in each Python session or script in which you want to use PyMongo.
If all went well, you will not see a thing, and you can start firing off some fancy MongoDB commands.
If you received an error message, however, something went wrong, and you might need to review the steps
just taken to discover where the error occurred.
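If you would rather get a clear message than a traceback, you can ask Python whether the module can be found before importing it. The following is a minimal sketch using the standard library’s importlib; the helper name is_installed is hypothetical, and the version is read from pymongo.version only when the module is actually present.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("pymongo"):
        import pymongo
        print("PyMongo", pymongo.version, "is available")
    else:
        print("PyMongo is not installed for this Python interpreter")
```

If the sketch reports that PyMongo is missing even though you installed it, the most likely cause is that it was installed for a different Python interpreter than the one you are running.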
Summary
In this chapter, we examined how to obtain the MongoDB software, including how to select the correct
version you need for your environment. We also discussed the version numbers, how to install and run
MongoDB, and how to install and run its prerequisites. Next, we covered how to establish a connection to a
database through a combination of the shell, PHP, and Python.
We also explored how to expand MongoDB so it will work with your favorite programming languages, as
well as how to confirm whether the language-specific drivers have installed correctly.
In the next chapter, we will explore how to design and structure MongoDB databases and data properly.
Along the way, you’ll learn how to index information to speed up queries, how to reference data, and how to
leverage a fancy new feature called geospatial indexing.
Chapter 3
The Data Model
In Chapter 2, you learned how to install MongoDB on two commonly used platforms (Windows and Linux), as
well as how to extend the database with some additional drivers. In this chapter, you will shift your attention
from the operating system and instead examine the general design of a MongoDB database. Specifically,
you’ll learn what collections are, what documents look like, how indexes work and what they do, and finally,
when and where to reference data instead of embedding it. We touched on some of these concepts briefly
in Chapter 1, but in this chapter, we’ll explore them in more detail. Throughout this chapter, you will see
code examples designed to give you a good feeling for the concepts being discussed. Do not worry too much
about the commands you’ll be looking at, however, because they will be discussed extensively in Chapter 4.
Designing the Database
As you learned in Chapters 1 and 2, a MongoDB database is nonrelational and schemaless. This means
that a MongoDB database isn’t bound to any predefined columns or data types as relational databases are
(such as MySQL). The biggest benefit of this implementation is that working with data is extremely flexible
because there is no predefined structure required in your documents.
To put it more simply, you are perfectly capable of having one collection that contains hundreds or
even thousands of documents that all carry a different structure—without breaking any of the MongoDB
database’s rules.
One of the benefits of this flexible schemaless design is that you won’t be restricted when programming
in a dynamically typed language such as Python or PHP. Indeed, it would be a severe limitation if your
extremely flexible and dynamically capable programming language couldn’t be used to its full potential
because of the innate limitations of your database.
Let’s take another glance at what the data design of a document in MongoDB looks like, paying
particular attention to how flexible data in MongoDB are compared to data in a relational database. In
MongoDB, a document is an item that contains the actual data, comparable to a row in SQL. In the following
example, you will see how two completely different types of documents can coexist in a single collection
named Media (note that a collection is roughly equivalent to a table in the world of SQL):
{
    "Type": "CD",
    "Artist": "Nirvana",
    "Title": "Nevermind",
    "Genre": "Grunge",
    "Releasedate": "1991.09.24",
    "Tracklist": [
        {
            "Track": "1",
            "Title": "Smells Like Teen Spirit",
            "Length": "5:02"
        },
        {
            "Track": "2",
            "Title": "In Bloom",
            "Length": "4:15"
        }
    ]
}
{
    "Type": "Book",
    "Title": "Definitive Guide to MongoDB: A complete guide to dealing with Big Data using MongoDB 3rd ed., The",
    "ISBN": "978-1-4842-1183-0",
    "Publisher": "Apress",
    "Author": [
        "Hows, David",
        "Plugge, Eelco",
        "Membrey, Peter",
        "Hawkins, Tim"
    ]
}
As you might have noticed when looking at this pair of documents, most of the fields aren’t closely
related to one another. Yes, they both have fields called Title and Type; but apart from that similarity, the
documents are completely different. Nevertheless, these two documents are contained in a single collection
called Media.
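To make the point concrete, here is a minimal sketch in plain JavaScript that treats an ordinary array as a stand-in for the Media collection and lists the fields the two documents have in common. The sharedFields helper is a hypothetical name used only for this illustration, and the documents are abbreviated versions of the pair shown above.

```javascript
// Two differently shaped documents, abbreviated from the example above.
const cd = { Type: "CD", Artist: "Nirvana", Title: "Nevermind", Genre: "Grunge" };
const book = {
  Type: "Book",
  Title: "Definitive Guide to MongoDB 3rd ed., The",
  ISBN: "978-1-4842-1183-0",
  Publisher: "Apress",
};

// A plain array stands in for the Media collection (illustration only).
const media = [cd, book];

// Return the field names that every document in the list has in common.
function sharedFields(docs) {
  if (docs.length === 0) return [];
  return Object.keys(docs[0]).filter((key) => docs.every((doc) => key in doc));
}

console.log(sharedFields(media)); // [ 'Type', 'Title' ]
```

Nothing stops the two objects from living side by side in the same array, which is roughly how the two documents coexist in one collection.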
MongoDB is called a schemaless database, but that doesn’t mean MongoDB’s data structure is
completely devoid of schema. For example, you do define collections and indexes in MongoDB (you will
learn more about this later in the chapter). Nevertheless, you do not need to predefine a structure for any of
the documents you will be adding, as is the case when working with MySQL, for example.
Simply stated, MongoDB is an extraordinarily dynamic database; the preceding example would never
work in a relational database unless you also added each possible field to your table. Doing so would be a
waste of both space and performance, not to mention highly disorganized.
Drilling Down on Collections
As mentioned previously, collection is a commonly used term in MongoDB. You can think of a collection as a
container that stores your documents (that is, your data), as shown in Figure 3-1.
Now compare the MongoDB database model to a typical model for a relational database (see Figure 3-2).
As you can see, the general structure is the same between the two types of databases; nevertheless,
you do not use them in even remotely similar manners. There are several types of collections in MongoDB.
The default collection type is expandable in size: the more data you add to it, the larger it becomes. It’s also
possible to define collections that are capped. These capped collections can only contain a certain amount
of data before the oldest document is replaced by a newer document (you will learn more about these
collections in Chapter 4).
Figure 3-1. The MongoDB database model
Figure 3-2. A typical relational database model
Every collection in MongoDB has a unique name. For the sake of best practice, this name should begin
with a letter or, optionally, an underscore (_) when created using the createCollection function. The
name can contain numbers and letters; however, the $ symbol is reserved by MongoDB. Similarly, an
empty string ("") is not allowed, the null character cannot be used in the name, and the name cannot start
with the system. prefix. Generally, it’s recommended that you keep the collection’s name simple and short
(to around nine characters or so); however, the maximum number of allowed characters in a collection
name is 128. Obviously, there isn’t much practical reason to create such a long name.
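The naming rules above can be sketched as a small checker in plain JavaScript. This is an illustrative helper under the assumptions stated in the text, not MongoDB's own validation logic; the function name checkCollectionName is hypothetical.

```javascript
// Checks a proposed collection name against the rules listed in the text.
// Illustrative only: this is not MongoDB's own validation logic.
function checkCollectionName(name) {
  if (typeof name !== "string" || name.length === 0) {
    return ["name must be a non-empty string"];
  }
  const problems = [];
  if (name.includes("$")) problems.push("contains the reserved $ symbol");
  if (name.includes("\0")) problems.push("contains the null character");
  if (name.startsWith("system.")) problems.push("starts with the system. prefix");
  if (name.length > 128) problems.push("is longer than 128 characters");
  // Best practice from the text, not a hard rule: start with a letter or underscore.
  if (!/^[A-Za-z_]/.test(name)) {
    problems.push("does not start with a letter or underscore");
  }
  return problems;
}

console.log(checkCollectionName("media"));        // no problems reported
console.log(checkCollectionName("system.users")); // one problem reported
console.log(checkCollectionName("price$"));       // one problem reported
```

An empty result means the name passes every rule in this sketch; the server remains the final authority on what it accepts.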
A single database running the default MMAPv1 storage engine has a default limit of approximately
24,000 namespaces, whereas the WiredTiger storage engine is not subject to this limitation. Each collection
accounts for at least two namespaces: one for the collection itself and one more for the first index created
in the collection. Every additional index you add to a collection uses another namespace. In theory, this
means that each database can have up to 12,000 collections by default, assuming each collection only
carries one index. However, this limit can be raised by enlarging the namespace file, up to a maximum of
2047MB, using the nsSize parameter when executing the MongoDB service application (mongod).
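The arithmetic behind these figures can be sketched as follows. The 24,000 limit is the approximate default quoted above, and the function name maxCollections is made up for this example.

```javascript
// Approximate default namespace limit per database under MMAPv1, as quoted in the text.
const NAMESPACE_LIMIT = 24000;

// Each collection uses one namespace for itself plus one per index.
function maxCollections(indexesPerCollection, namespaceLimit = NAMESPACE_LIMIT) {
  const namespacesPerCollection = 1 + indexesPerCollection;
  return Math.floor(namespaceLimit / namespacesPerCollection);
}

console.log(maxCollections(1)); // 12000
console.log(maxCollections(3)); // 6000
```

With three indexes per collection, the same database would hold roughly half as many collections as with one.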
Using Documents
Recall that a document consists of key-value pairs. For example, the pair "type" : "Book" consists of a key
named type, and its value, Book. Keys are written as strings, but the values in them can vary tremendously.
Values can be any of a rich set of datatypes, such as arrays or even binary data. Remember: MongoDB stores
its data in BSON format (see Chapter 1 for more information on this topic).
Next, let’s look at all of the possible types of data you can add to a document, and what you use them for:
• String: This commonly used datatype contains a string of text (or any other kind
of characters). This datatype is used mostly for storing text values (for example,
{"Country" : "Japan"}).
• Integer (32-bit and 64-bit): This type is used to store a numerical value (for example,
{ "Rank" : 1 }). Note that there are no quotes placed before or after the integer.
• Boolean: This datatype can be set to either true or false.
• Double: This datatype is used to store floating-point values.
• Min / Max keys: This datatype is used to compare a value against the lowest and
highest BSON elements, respectively.
• Arrays: This datatype is used to store arrays (for example,
["Membrey, Peter","Plugge, Eelco","Hows, David"]).
• Timestamp: This datatype is used to store a timestamp. This can be handy for
recording when a document has been modified or added.
• Object: This datatype is used for embedded documents.
• Null: This datatype is used for a Null value.
• Symbol: This datatype is used identically to a string; however, it’s generally reserved
for languages that use a specific symbol type.
• Date: This datatype is used to store a date and time in Unix time format
(POSIX time), counted in milliseconds since the Unix epoch.
• Object ID: This datatype is used to store the document’s ID.
• Binary data: This datatype is used to store binary data.
• Regular expression: This datatype is used for regular expressions. All options are
represented by specific characters provided in alphabetical order. You will learn
more about regular expressions in Chapter 4.
• JavaScript code: This datatype is used for JavaScript code.
In Chapter 4, you will learn how to identify your datatypes by using the $type operator.
In theory, this all probably sounds straightforward. However, you might wonder how you go about
actually designing the document, including what information to put in it. Because a document can contain
any type of data, you might think there is no need to reference information from inside another document.
In the next section, we’ll look at the pros and cons of embedding information in a document compared to
referencing that information from another document.
Embedding vs. Referencing Information in Documents
You can choose either to embed information into a document or reference that information from another
document. Embedding information simply means that you place a certain type of data (for example, an array
containing more data) into the document itself. Referencing information means that you create a reference
to another document that contains that specific data. Typically, you reference information when you use a
relational database. For example, assume you wanted to use a relational database to keep track of your CDs,
DVDs, and books. In this database, you might have one table for your CD collection and another table that
stores the track lists of your CDs. Thus, you would probably need to query multiple tables to acquire a list of
tracks from a specific CD.
With MongoDB (and other nonrelational databases), however, it would be much easier to embed such
information instead. After all, the documents are natively capable of doing so. Adopting this approach keeps
your database nice and tidy, ensures that all related information is kept in one single document, and even
works much faster because the data are then co-located on the disk.
Now let’s look at the differences between embedding and referencing information by looking at a
real-world scenario: storing CD data in a database.
In the relational approach, your data structure might look something like this:
|_media
    |_cds
        |_id, artist, title, genre, releasedate
    |_cd_tracklists
        |_cd_id, songtitle, length
In the nonrelational approach, your data structure might look something like this:
|_media
    |_items
        |_<document>
In the nonrelational approach, the document might look something like the following:
{
    "Type": "CD",
    "Artist": "Nirvana",
    "Title": "Nevermind",
    "Genre": "Grunge",
    "Releasedate": "1991.09.24",
    "Tracklist": [
        {
            "Track" : "1",
            "Title" : "Smells Like Teen Spirit",
            "Length" : "5:02"
        },
        {
            "Track" : "2",
            "Title" : "In Bloom",
            "Length" : "4:15"
        }
    ]
}
In this example, the track list information is embedded in the document itself. This approach is both
incredibly efficient and well organized. All the information that you wish to store regarding this CD is added
to a single document. In the relational version of the CD database, this requires at least two tables; in the
nonrelational database, it requires only one collection and one document.
When information is retrieved for a given CD, that information only needs to be loaded from one
document into RAM, not from multiple documents. Remember that every reference requires another query
in the database.
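The difference in the number of lookups can be sketched in plain JavaScript, with arrays standing in for tables and collections. The field names follow the cds and cd_tracklists layout above; the id values and function names are made up for this illustration.

```javascript
// Relational-style layout: tracks live in a separate list and point back to their CD.
const cds = [{ id: 1, artist: "Nirvana", title: "Nevermind" }];
const cdTracklists = [
  { cd_id: 1, songtitle: "Smells Like Teen Spirit", length: "5:02" },
  { cd_id: 1, songtitle: "In Bloom", length: "4:15" },
];

// Two lookups: first find the CD, then find the tracks that reference it.
function tracksByReference(title) {
  const cd = cds.find((c) => c.title === title);
  if (!cd) return [];
  return cdTracklists.filter((t) => t.cd_id === cd.id).map((t) => t.songtitle);
}

// Embedded layout: the track list travels with the document itself.
const items = [
  {
    Type: "CD",
    Artist: "Nirvana",
    Title: "Nevermind",
    Tracklist: [
      { Track: "1", Title: "Smells Like Teen Spirit", Length: "5:02" },
      { Track: "2", Title: "In Bloom", Length: "4:15" },
    ],
  },
];

// One lookup: the tracks are already inside the document that was found.
function tracksEmbedded(title) {
  const doc = items.find((d) => d.Title === title);
  return doc ? doc.Tracklist.map((t) => t.Title) : [];
}

console.log(tracksByReference("Nevermind"));
console.log(tracksEmbedded("Nevermind"));
```

Both functions return the same track titles; the embedded version simply gets there with a single search.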
■Tip The rule of thumb when using MongoDB is to embed data whenever you can. This approach is far more
efficient and almost always viable.
At this point, you might be wondering about the use case in which an application has multiple users.
Generally speaking, a relational database version of the aforementioned CD app would require that you
have one table that contains all your users and two tables for the items added. For a nonrelational database,
it would be good practice to have separate collections for the users and the items added. For these kinds of
problems, MongoDB allows you to create references in two ways: manually or automatically. In the latter
case, you use the DBRef specification, which provides more flexibility in case a collection changes from one
document to the next. You will learn more about these two approaches in Chapter 4.
Creating the _id Field
Every object within the MongoDB database contains a unique identifier to distinguish that object from every
other object. This identifier is called the _id key, and it is added automatically to every document you create
in a collection.
The _id key is the first attribute added in each new document you create. This remains true even if
you do not tell MongoDB to create the key. For example, none of the code in the preceding examples used
the _id key. Nevertheless, MongoDB created an _id key for you automatically in each document. It did so
because the _id key is a mandatory element for each document in the collection.
If you do not specify the _id value manually, the type will be set to a special ObjectId BSON datatype
that consists of a 12-byte binary value. Thanks to its design, this value has a reasonably high probability of
being unique. The 12-byte value consists of a 4-byte timestamp (seconds since epoch, or January 1, 1970),
a 3-byte machine ID, a 2-byte process ID, and a 3-byte counter. It’s good to know that the counter and
timestamp fields are stored in Big Endian format. This is because MongoDB wants to ensure that there is an
increasing order to these values, and a Big Endian approach suits this requirement best.
■Note The terms Big Endian and Little Endian refer to how individual bytes/bits are stored in a longer data
word in the memory. Big Endian simply means that the most significant value is saved first. Similarly, Little
Endian means that the least significant value is saved first.
Figure 3-3 shows how the value of the _id key is built up and where the values come from.
Every additional supported driver that you load when working with MongoDB (such as the PHP driver
or the Python driver) supports this special BSON datatype and uses it whenever new data are created. You
can also invoke ObjectId() from the MongoDB shell to create a value for an _id key. Optionally, you can
specify your own value by using ObjectId(string), where string represents the specified hex string.
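As a rough illustration of the layout described above, the following plain JavaScript sketch splits a 24-character hex string into its four parts. It assumes the 4-3-2-3 byte layout given in this section; the function name parseObjectId is hypothetical, and the sample value is one of the _id values that appears in this chapter's query output.

```javascript
// Splits a 24-character ObjectId hex string into the four parts described in the text:
// 4-byte timestamp, 3-byte machine ID, 2-byte process ID, 3-byte counter.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16); // big-endian seconds since epoch
  return {
    timestamp: seconds,
    created: new Date(seconds * 1000), // JavaScript dates count milliseconds
    machineId: hex.slice(8, 14),
    processId: parseInt(hex.slice(14, 18), 16),
    counter: parseInt(hex.slice(18, 24), 16),
  };
}

const parts = parseObjectId("51ace0f380523d89efd199ac");
console.log(parts.created.toISOString());
console.log(parts.machineId, parts.processId, parts.counter);
```

Because the timestamp occupies the leading bytes, values generated later sort after earlier ones, which is the ordering property mentioned above.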
Building Indexes
As mentioned in Chapter 1, an index is nothing more than a data structure that collects information about
the values of specified fields in the documents of a collection. This data structure is used by MongoDB’s
query optimizer to quickly sort through and order the documents in a collection.
Remember that indexing ensures a quick lookup of data in your documents. Basically, you should
view an index as a predefined query that was executed and had its results stored. As you can imagine, this
enhances query performance dramatically. The general rule of thumb in MongoDB is that you should create
an index for the same sort of scenarios where you would want to have an index in relational databases.
The biggest benefit of creating your own indexes is that querying for often-used information will be
incredibly fast because your query won’t need to go through your entire database to collect this information.
Creating (or deleting) an index is relatively easy—once you get the hang of it, anyway. You will learn
how to do so in Chapter 4, which covers working with data. You will also learn some more advanced
techniques for taking advantage of indexing in Chapter 10, which covers how to maximize performance.
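The idea of an index as stored lookup information can be sketched with a toy example in plain JavaScript. A Map stands in for the index here; MongoDB's real indexes are considerably more sophisticated, and the function names are invented for this illustration.

```javascript
// A plain array stands in for a collection (illustration only).
const restaurants = [
  { name: "Kimono" },
  { name: "Shabu Shabu" },
  { name: "Tokyo Cafe" },
];

// Without an index: inspect every document and count how many were examined.
function findByScan(docs, field, value) {
  let examined = 0;
  const matches = [];
  for (const doc of docs) {
    examined += 1;
    if (doc[field] === value) matches.push(doc);
  }
  return { matches, examined };
}

// A toy index: a Map from each field value to the documents that hold it.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const bucket = index.get(doc[field]) || [];
    bucket.push(doc);
    index.set(doc[field], bucket);
  }
  return index;
}

const byName = buildIndex(restaurants, "name");
console.log(findByScan(restaurants, "name", "Tokyo Cafe").examined); // 3
console.log(byName.get("Tokyo Cafe").length); // 1
```

The scan touches all three documents, whereas the Map goes straight to the match. The price is that every insert or delete must also update the Map, which is the write overhead that indexes introduce.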
Impacting Performance with Indexes
You might wonder why you would ever need to delete an index, rebuild your indexes, or even delete all
indexes within a collection. The simple answer is that doing so lets you clean up some irregularities. For
instance, sometimes the size of a database can increase dramatically for no apparent reason. At other times,
the space used by the indexes might strike you as excessive.
Another good thing to keep in mind: you can have a maximum of 64 indexes per collection. Generally
speaking, this is far more than you should need, but you could potentially hit this limit someday.
Figure 3-3. Creating the _id key in MongoDB
■Note Adding an index potentially increases query speed, but it reduces insertion or deletion speed. It’s best
to consider only adding indexes for collections where the number of reads is higher than the number of writes.
When more writes occur than reads, indexes may even prove to be counterproductive.
Finally, you can run the listIndexes command to take a quick peek at the indexes that have been
stored so far. To see the indexes created for a specific collection, you can use the getIndexes() function:
db.collection.getIndexes()
Indexing, and how indexing can affect MongoDB’s performance, will be covered in more detail in
Chapter 10.
Implementing Geospatial Indexing
Ever since version 1.4, MongoDB has implemented geospatial indexing. This means that, in addition to the
various other index types, MongoDB also supports geospatial indexes that are designed to work in an optimal
way with location-based queries. For example, you can use this feature to find a number of closest known
items to the user’s current location. Or you might further refine your search to query for a specified number
of restaurants near the current location. This type of query can be particularly helpful if you are designing an
application where you want to find the closest available branch office to a given customer’s ZIP code.
A document for which you want to add geospatial information must contain either a subobject or an
array whose first element specifies the object type, followed by the item’s longitude and latitude, as in the
following example:
> db.restaurants.insert({name: "Kimono", loc: { type: "Point",
coordinates: [ 52.370451, 5.217497]}})
Note that the type parameter can be used to specify the document’s GeoJSON object type, which
can be a Point, a MultiPoint, a LineString, a MultiLineString, a Polygon, a MultiPolygon, or a
GeometryCollection. As can be expected, the Point type is used to specify that the item (in this case, a
restaurant) is located at exactly the spot given, thus requiring exactly two values: the longitude followed by
the latitude. The LineString type can be used to specify that the item extends along a specific line (say, a
street), and thus requires a beginning and end point, as in the following example:
> db.streets.insert( {name: "Westblaak", loc: { type: "LineString",
coordinates: [ [52.36881,4.890286],[52.368762,4.890021] ] } } )
The Polygon type can be used to specify a (nondefault) shape (say, a shopping area). When using
this type, you need to ensure that the first and last points are identical, to close the loop. Also, the point
coordinates are to be provided as an array within an array, as in the following example:
> db.stores.insert( {name: "SuperMall", loc: { type: "Polygon",
coordinates: [ [ [52.146917,5.374337], [52.146966,5.375471], [52.146722,5.375085],
[52.146744,5.37437], [52.146917,5.374337] ] ] } } )
For all of these, the Multi- version (MultiPoint, MultiLineString, etc.) is an array of the datatype selected,
as in the following MultiPoint example:
> db.restaurants.insert({name: "Shabu Shabu", loc: { type: "MultiPoint",
coordinates: [ [52.1487441, 5.3873406], [52.3569665,4.890517] ] }})
In most cases, the Point type will be appropriate.
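The shapes described in this section can be sketched as plain JavaScript objects. The helper names below are hypothetical; only the resulting object shapes follow the text. The polygon helper also enforces the closed-loop requirement mentioned above, and additionally assumes that a ring needs at least four positions.

```javascript
// Small helpers that build GeoJSON-style objects as plain JavaScript values.
function point(coordinates) {
  return { type: "Point", coordinates };
}

function lineString(start, end) {
  return { type: "LineString", coordinates: [start, end] };
}

// The Multi- variants wrap an array of the single-item coordinates.
function multiPoint(points) {
  return { type: "MultiPoint", coordinates: points };
}

// A polygon is an array of rings; the first and last position of a ring must match.
function polygon(ring) {
  if (!Array.isArray(ring) || ring.length < 4) {
    throw new Error("a polygon ring needs at least four positions");
  }
  const first = ring[0];
  const last = ring[ring.length - 1];
  if (first[0] !== last[0] || first[1] !== last[1]) {
    throw new Error("polygon ring is not closed: first and last points differ");
  }
  return { type: "Polygon", coordinates: [ring] };
}

const mall = polygon([
  [52.146917, 5.374337], [52.146966, 5.375471], [52.146722, 5.375085],
  [52.146744, 5.37437], [52.146917, 5.374337],
]);
console.log(JSON.stringify(mall));
console.log(JSON.stringify(multiPoint([[52.1487441, 5.3873406], [52.3569665, 4.890517]])));
```

Note how the polygon coordinates end up as an array within an array, and how the MultiPoint coordinates are an array of positions rather than a single position.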
Once this geospatial information is added to a document, you can create the index (or even create the
index beforehand, of course) and give the ensureIndex() function the 2dsphere parameter:
> db.restaurants.ensureIndex( { loc: "2dsphere" } )
■Note The ensureIndex() function is used to add a custom index. Don’t worry about the syntax of this
function yet—you will learn how to use ensureIndex() in depth in Chapter 4.
The 2dsphere parameter tells ensureIndex() that it’s indexing a coordinate or some other form of
two-dimensional information on an Earth-like sphere, where longitudes range from -180 to 180 and
latitudes from -90 to 90. Custom bounds apply only to the flat 2d index type, which works with legacy
coordinate pairs rather than GeoJSON objects. For a 2d index, ensureIndex() uses a range of -180 to 180
by default; however, you can override these values using the min and max parameters:
> db.restaurants.ensureIndex( { loc: "2d" }, { min : -500 , max : 500 } )
You can also expand your geospatial indexes by using secondary key values (also known as compound keys).
This structure can be useful when you intend to query on multiple values, such as a location (geospatial
information) and a category (sort ascending):
> db.restaurants.ensureIndex( { loc: "2dsphere", category: 1 } )
Querying Geospatial Information
In this chapter, we are concerned primarily with two things: how to model the data and how a database
works in the background of an application. That said, manipulating geospatial information is increasingly
important in a wide variety of applications, so we’ll take a few moments to explain how to leverage
geospatial information in a MongoDB database.
Before getting started, a mild word of caution. If you are completely new to MongoDB and haven’t
had the opportunity to work with (geospatial) indexed data in the past, this section may seem a little
overwhelming at first. Not to worry, however; you can safely skip it for now and come back to it later if you
wish to. The examples given serve to show you a practical example of how (and why) to use geospatial
indexing, making it easier to comprehend. With that out of the way, and if you are feeling brave, read on.
Once you’ve added data to your collection, and once the index has been created, you can do a
geospatial query. For example, let’s look at a few lines of simple yet powerful code that demonstrate how to
use geospatial indexing.
Begin by starting up your MongoDB shell and selecting a database with the use function. In this case,
the database is named stores:
> use stores
Once you’ve selected the database, you can define a few documents that contain geospatial
information, and then insert them into the restaurants collection (remember: you do not need to create the
collection beforehand):
> db.restaurants.insert( { name: "Kimono", loc: { type: "Point",
coordinates: [ 52.370451, 5.217497] } } )
> db.restaurants.insert( {name: "Shabu Shabu", loc: { type: "Point",
coordinates: [51.915288,4.472786] } } )
> db.restaurants.insert( {name: "Tokyo Cafe", loc: { type: "Point",
coordinates: [52.368736, 4.890530] } } )
After you add the data, you need to tell the MongoDB shell to create an index based on the location
information that was specified in the loc key, as in this example:
> db.restaurants.ensureIndex( { loc: "2dsphere" } )
Once the index has been created, you can start searching for your documents. Begin by searching on an
exact value (so far this is a “normal” query; it has nothing to do with the geospatial information at this point):
> db.restaurants.find( { loc : [52,5] } )
>
The preceding search returns no results. This is because the query is too specific. A better approach in
this case would be to search for documents that contain information near a given value. You can accomplish
this using the $near operator. Note that this requires a $geometry document that specifies the GeoJSON
type and the coordinates of the point to search from, as in the following example:
> db.restaurants.find( { loc : { $near : { $geometry : { type : "Point",
coordinates: [52.338433,5.513629] } } } } )
This produces the following output:
{
"_id" : ObjectId("51ace0f380523d89efd199ac"),
"name" : "Kimono",
"loc" : {
"type" : "Point",
"coordinates" : [ 52.370451, 5.217497 ]
}
}
{
"_id" : ObjectId("51ace13380523d89efd199ae"),
"name" : "Tokyo Cafe",
"loc" : {
"type" : "Point",
"coordinates" : [ 52.368736, 4.89053 ]
}
}
{
"_id" : ObjectId("51ace11b80523d89efd199ad"),
"name" : "Shabu Shabu",
"loc" : {
"type" : "Point",
"coordinates" : [ 51.915288, 4.472786 ]
}
}
Although this set of results certainly looks better, there’s still one problem: all of the documents are
returned! When used without any additional operators, $near returns the first 100 entries and sorts them
based on their distance from the given coordinates. Now, while you can choose to limit your results to say,
the first two items (or 200, if you want) using the limit function, even better would be to limit the results to
those within a given range.
This can be achieved by appending the $maxDistance or $minDistance operators. Using one of these
operators you can tell MongoDB to return only those results falling within a maximum or minimum distance
(measured in meters) from the given point, as in the following example and its output:
> db.restaurants.find( { loc : { $near : { $geometry : { type : "Point",
coordinates: [52.338433,5.513629] }, $maxDistance : 40000 } } } )
{
"_id" : ObjectId("51ace0f380523d89efd199ac"),
"name" : "Kimono",
"loc" : {
"type" : "Point",
"coordinates" : [ 52.370451, 5.217497 ]
}
}
As you can see, this returns only a single result: a restaurant located within 40 kilometers (or, roughly
25 miles) from the starting point.
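To see where the 40-kilometer cutoff falls, the distances can be approximated in plain JavaScript with the haversine formula. This is a sketch under stated assumptions, not MongoDB's implementation: the Earth radius constant is assumed, the function names are made up, and each position is read in GeoJSON order, that is, the first number as the longitude and the second as the latitude.

```javascript
// Earth radius in meters. This value is an assumption made for the sketch;
// the text does not state which radius MongoDB uses.
const EARTH_RADIUS_M = 6378100;

const toRadians = (degrees) => (degrees * Math.PI) / 180;

// Great-circle distance between two positions, each given as
// [longitude, latitude], using the haversine formula.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// The three sample documents, reduced to name and coordinates.
const restaurants = [
  { name: "Kimono", coordinates: [52.370451, 5.217497] },
  { name: "Shabu Shabu", coordinates: [51.915288, 4.472786] },
  { name: "Tokyo Cafe", coordinates: [52.368736, 4.89053] },
];

// Keep documents within maxDistance meters of origin, nearest first.
function near(docs, origin, maxDistance = Infinity) {
  return docs
    .map((doc) => ({ name: doc.name, dis: haversineMeters(origin, doc.coordinates) }))
    .filter((entry) => entry.dis <= maxDistance)
    .sort((a, b) => a.dis - b.dis);
}

const origin = [52.338433, 5.513629];
console.log(near(restaurants, origin));        // all three, nearest first
console.log(near(restaurants, origin, 40000)); // only Kimono
```

Under these assumptions Kimono comes out at roughly 33 kilometers from the starting point, while the other two restaurants lie well beyond 40 kilometers, which matches the single result shown above.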
■Note There is a direct correlation between the number of results returned and the time a given query takes
to execute.
In addition to the $near operator, MongoDB also includes a $geoWithin operator. You use this operator
to find items in a particular shape. At this time, you can find items located in a $box, $polygon, $center,
and $centerSphere shape, where $box represents a rectangle, $polygon represents a specific shape of your
choosing, $center represents a circle, and $centerSphere defines a circle on a sphere. Let’s look at a couple
of additional examples that illustrate how to use these shapes.
■Note With version 2.4 of MongoDB the $within operator was deprecated and replaced by $geoWithin.
This operator does not strictly require a geospatial index. Also, unlike the $near operator, $geoWithin does
not sort the returned results, which improves its performance.
To use the $box shape, you first need to specify the lower-left, followed by the upper-right, coordinates
of the box, as in the following example:
> db.restaurants.find( { loc: { $geoWithin : { $box : [ [52.368549,4.890238],
[52.368849,4.89094] ] } } } )
Similarly, to find items within a specific polygon form, you need to specify the coordinates of your
points as a set of nested arrays. Again note that the first and last coordinates must be identical to close the
shape properly, as shown in the following example:
> db.restaurants.find( { loc :
    { $geoWithin :
        { $geometry :
            { type : "Polygon" ,
              coordinates : [ [
                [52.368739,4.890203], [52.368872,4.890477], [52.368726,4.890793],
                [52.368608,4.89049], [52.368739,4.890203]
              ] ]
            }
        }
    }
} )
The code to find items in a basic $center shape is quite simple. In this case, you need to specify the
center of the circle and its radius, measured in the units used by the coordinate system, before executing the
find() function:
> db.restaurants.find( { loc: { $geoWithin : { $center : [ [52.370524, 5.217682], 10] } } } )
Note that ever since MongoDB version 2.2.3, the $center operator can be used without having a
geospatial index in place. However, it is recommended to create one to improve performance.
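What the $box and $center shapes test for can be sketched with two flat-plane checks in plain JavaScript. The function names are invented for this illustration, and the sketch counts points on the border as inside, which is an assumption rather than documented behavior.

```javascript
// Planar checks that mirror the $box and $center shapes on flat coordinates.
// Points are [x, y] pairs in the same order as the stored coordinates.

// True when the point lies inside the rectangle spanned by the two corners.
function withinBox([x, y], [[minX, minY], [maxX, maxY]]) {
  return x >= minX && x <= maxX && y >= minY && y <= maxY;
}

// True when the point lies within `radius` coordinate units of the center.
function withinCenter([x, y], [[cx, cy], radius]) {
  return Math.hypot(x - cx, y - cy) <= radius;
}

const tokyoCafe = [52.368736, 4.89053];
const kimono = [52.370451, 5.217497];
const box = [[52.368549, 4.890238], [52.368849, 4.89094]];

console.log(withinBox(tokyoCafe, box)); // true
console.log(withinBox(kimono, box));    // false
console.log(withinCenter(kimono, [[52.370524, 5.217682], 10])); // true
```

With the sample data from this chapter, only Tokyo Cafe falls inside the box, whereas a radius of 10 coordinate units around the given center is generous enough to include Kimono.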
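Finally, to find items located within a circular shape on a sphere (say, our planet) you can use the
$centerSphere operator. This operator is similar to $center, except that the radius is measured in radians;
to convert a distance to radians, divide it by the radius of the Earth in the same unit (roughly 6,378.1
kilometers). The following example uses a radius of 10 kilometers:
> db.restaurants.find( { loc: { $geoWithin : { $centerSphere : [ [52.370524, 5.217682],
10 / 6378.1 ] } } } )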
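Because $centerSphere expects its radius in radians, a small conversion helper can be handy. The sketch below assumes an Earth radius of about 6,378.1 kilometers; both the constant and the function names are assumptions made for this example.

```javascript
// Approximate radius of the Earth in kilometers (an assumed constant).
const EARTH_RADIUS_KM = 6378.1;

// Convert a distance along the surface to an angle in radians, and back.
function kmToRadians(km) {
  return km / EARTH_RADIUS_KM;
}

function radiansToKm(radians) {
  return radians * EARTH_RADIUS_KM;
}

console.log(kmToRadians(10));
console.log(radiansToKm(kmToRadians(25)));
```

A radius of 10 kilometers therefore corresponds to a very small angle, a little under 0.0016 radians.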
By default, the find() function is ideal for running queries. However, MongoDB also provides the
geoNear database command, which works like the find() function, but also displays the distance from the
specified point for each item in the results. The geoNear command also includes some additional
diagnostics. The following example uses the geoNear command to find the results closest to the specified
position:
> db.runCommand( { geoNear : "restaurants", near : { type : "Point", coordinates:
[52.338433,5.513629] }, spherical : true})
It returns the following results:
{
"ns" : "stores.restaurants",
"results" : [
{
"dis" : 33155.517810497055,
"obj" : {
"_id" : ObjectId("51ace0f380523d89efd199ac"),
"name" : "Kimono",
"loc" : {
"type" : "Point",
"coordinates" : [
52.370451,
5.217497
]
}
}
},
{
"dis" : 69443.96264213261,
"obj" : {
"_id" : ObjectId("51ace13380523d89efd199ae"),
"name" : "Tokyo Cafe",
"loc" : {
"type" : "Point",
"coordinates" : [
52.368736,
4.89053
]
}
}
},
{
"dis" : 125006.87383713324,
"obj" : {
"_id" : ObjectId("51ace11b80523d89efd199ad"),
"name" : "Shabu Shabu",
"loc" : {
"type" : "Point",
"coordinates" : [
51.915288,
4.472786
]
}
}
}
],
"stats" : {
"time" : 6,
"nscanned" : 3,
"avgDistance" : 75868.7847632543,
"maxDistance" : 125006.87383713324
},
"ok" : 1
}
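The figures in the stats document can be cross-checked against the individual dis values. The following plain JavaScript sketch recomputes the average and maximum from the three distances in the output above; the summarize function is a hypothetical helper, not part of the geoNear command.

```javascript
// Distances (in meters) copied from the dis fields of the geoNear output above.
const distances = [33155.517810497055, 69443.96264213261, 125006.87383713324];

// Recompute the summary figures reported in the stats document.
function summarize(values) {
  const total = values.reduce((sum, v) => sum + v, 0);
  return {
    count: values.length,
    avgDistance: total / values.length,
    maxDistance: Math.max(...values),
  };
}

console.log(summarize(distances));
```

The recomputed average and maximum line up with the avgDistance and maxDistance values reported by the command, and the count corresponds to the three entries in the results array.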
That completes our introduction to geospatial information for now; however, you’ll see a few more
examples that show you how to leverage geospatial functions in this book’s upcoming chapters.
Pluggable Storage Engines
Now that we’ve briefly touched upon MongoDB’s performance features, it’s time to look at the storage
engines available since version 3.0 and what these can mean for you. MongoDB’s storage engine is that
part of the database in charge of storing your data on the disk. Prior to version 3.0 you were limited to using
MongoDB’s native MMAPv1 storage engine. While this is still the default storage engine used in any version
prior to 3.2, you can choose to use the added alternative, the WiredTiger storage engine, or even develop
your own using the storage engine API.
■Note Each storage engine comes with its own pros and cons; where one might be best suited for
read-heavy tasks, another might perform better for write-heavy tasks. You can decide which storage engine is
the best fit for your use case. It is worth noting at this stage that multiple storage engines may coexist within a
single replica set.
By default, MongoDB v3.0 and later come with two supported storage engines: the legacy MMAPv1,
and the new WiredTiger storage engine. Compared to MMAPv1, the WiredTiger storage engine offers more
granular concurrency control as well as native compression capabilities. This allows for better utilization of
the hardware, reduced storage costs, as well as more predictable performance. MongoDB’s storage engines
and their capabilities will be discussed in full detail in Chapter 10.
Using MongoDB in the Real World
Now that you have MongoDB and its associated plug-ins installed and you have gained an understanding
of the data model, it’s time to get to work. In the next five chapters of the book, you will learn how to build,
query, and otherwise manipulate a variety of sample MongoDB databases (see Table 3-1 for a quick view
of the topics to come). Each chapter will stick primarily to using a single database that is unique to that
chapter; we took this approach to make it easier to read this book in a modular fashion.
Table 3-1. MongoDB Sample Databases Covered in This Book
Chapter   Database Name   Topic
4         Library         Working with data and indexes
5         Test            GridFS
6         Contacts        PHP and MongoDB
7         Inventory       Python and MongoDB
8         Test            Advanced queries
Summary
In this chapter, we looked at what’s happening in the background of your database. We also explored the
primary concepts of collections and documents in more depth; and we covered the datatypes supported in
MongoDB, as well as how to embed and reference data.
Next, we examined what indexes do, including when and why they should be used (or not).
We also touched on the concepts of geospatial indexing. For example, we covered how geospatial data
can be stored; we also explained how you can search for such data using either the regular find() function
or the more geospatially based geoNear database command.
In the next chapter, we’ll take a closer look at how the MongoDB shell works, including which functions
can be used to insert, find, update, or delete your data. We will also explore how conditional operators can
help you with all of these functions.
Chapter 4
Working with Data
In Chapter 3, you learned how the database works on the backend, what indexes are, how to use a database
to quickly find the data you are looking for, and what the structure of a document looks like. You also saw a
brief example that illustrated how to add data and find it again using the MongoDB shell. In this chapter, we
will focus more on working with data from your shell.
We will use one database (named library) throughout this chapter, and we will perform actions
such as adding data, searching data, modifying data, deleting data, and creating indexes. We’ll also look
at how to navigate the database using various commands, as well as what DBRef is and what it does. If
you have followed the instructions in the previous chapters to set up the MongoDB software, you can
follow the examples in this chapter to get used to the interface. Along the way, you will also attain a solid
understanding of which commands can be used for what kind of operations.
Navigating Your Databases
The first thing you need to know is how to navigate your databases and collections. With traditional SQL
databases, the first thing you would need to do is create an actual database; however, as you probably
remember from previous chapters, this is not required with MongoDB because the program creates the
database and underlying collection for you automatically the moment you store data in it.
To switch to an existing database or create a new one, you can use the use function in the shell, followed
by the name of the database you would like to use, whether or not it exists. This snippet shows how to use
the library database:
> use library
switched to db library
The mere act of invoking the use function, followed by the database’s name, sets your db (database) global
variable to library. Doing this means that all the commands you pass down into the shell will automatically
assume they need to be executed on the library database until you reset this variable to another database.
Viewing Available Databases and Collections
MongoDB automatically assumes a database needs to be created the moment you save data to it. It is also
case sensitive. For these reasons, it can be quite tricky to ensure that you’re working in the correct database.
Therefore, it’s best to view a list of all current databases available to MongoDB prior to switching to one, in
case you forgot the database’s name or its exact spelling. You can do this using the show dbs function:
> show dbs
local 0.000GB
CHAPTER 4 ■ WORKING WITH DATA
Note that this function only lists databases that already contain data. At this stage, the library database
does not contain any data yet, so it is not listed. If you want to view all available collections for your
current database, you can use the show collections function:
> show collections
>
■Tip To view the database you are currently working in, simply type db into the MongoDB shell.
Inserting Data into Collections
One of the most frequently used pieces of functionality you will want to learn about is how to insert data into
your collection. All data are stored in BSON format (which is both compact and reasonably fast to scan); the
shell converts the JSON-like documents you type into BSON for you. You can insert data in several ways. For
example, you can define a document first and then save it in the collection using the insertOne function, or
you can type the document directly into the insertOne function on the fly:
> document = ( { "Type" : "Book", "Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0", "Publisher" : "Apress",
"Author" : ["Hows, David", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"] } )
■Note When you define a variable in the shell (for example, document = ( { ... } ) ), the contents of the
variable will be printed out immediately.
> db.media.insertOne(document)
{ "acknowledged" : true, "insertedId" : ObjectId("...") }
Notice the document returned after inserting a document into the collection. It carries the status of the
operation: the acknowledged property confirms that the server acknowledged the write, and the insertedId
property holds the _id value (elided here) of the newly inserted document.
Line breaks can also be used while typing in the shell. This can be convenient if you are writing a rather
lengthy document, as in this example:
> document = ( { "Type" : "Book",
..."Title" : "Definitive Guide to MongoDB 3rd ed., The",
..."ISBN" : "978-1-4842-1183-0",
..."Publisher" : "Apress",
..."Author" : ["Hows, David", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"]
...} )
> db.media.insertOne(document)
{ "acknowledged" : true, "insertedId" : ObjectId("...") }
As mentioned previously, the other option is to insert your data directly through the shell, without
defining the document first. You can do this by invoking the insertOne function immediately, followed by the
document’s contents:
> db.media.insertOne( { "Type" : "CD", "Artist" : "Nirvana", "Title" : "Nevermind" })
{ "acknowledged" : true, "insertedId" : ObjectId("...") }
Or you can insert the data while using line breaks, as before. For example, you can expand the
preceding example by adding an array of tracks to it. Pay close attention to how the commas and brackets
are used in the following example:
> db.media.insertOne( { "Type" : "CD",
..."Artist" : "Nirvana",
..."Title" : "Nevermind",
... "Tracklist" : [
... {
... "Track" : "1",
... "Title" : "Smells Like Teen Spirit",
... "Length" : "5:02"
... },
... {
... "Track" : "2",
... "Title" : "In Bloom",
... "Length" : "4:15"
... }
... ]
...}
... )
{ "acknowledged" : true, "insertedId" : ObjectId("...") }
As you can see, inserting data through the Mongo shell is straightforward.
The process of inserting data is extremely flexible, but you must adhere to some rules when doing so.
For example, the names of the keys while inserting documents have the following limitations:
• The $ character must not be the first character in the key name. Example: $tags
• The period [.] character must not appear anywhere in the key name. Example: ta.gs
• The name _id is reserved for use as a primary key ID; although it is not
recommended, it can store anything unique as a value, such as a string or an integer.
Similarly, some restrictions apply when creating a collection. For example, the name of a collection
must adhere to the following rules:
• The collection’s namespace (including the database name and a “.” separator) cannot
exceed 120 characters.
• An empty string ("") cannot be used as a collection name.
• The collection’s name must start with either a letter or an underscore.
• Collection names that begin with the system. prefix are reserved for MongoDB and cannot be used.
• The collection’s name cannot contain the “\0” null character.
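The naming rules above can be checked before you ever talk to the server. The following is a minimal sketch in plain JavaScript; the helper names isValidKeyName and isValidCollectionName are hypothetical and not part of MongoDB, and the treatment of the reserved system name as a prefix is our assumption:

```javascript
// Hypothetical helpers that encode the rules listed above; they are not part of MongoDB.
function isValidKeyName(key) {
  // No leading "$" and no "." anywhere in the key.
  return !key.startsWith("$") && !key.includes(".");
}

function isValidCollectionName(dbName, name) {
  if (name === "") return false;                     // empty names are not allowed
  if (!/^[A-Za-z_]/.test(name)) return false;        // must start with a letter or underscore
  if (name.includes("\0")) return false;             // no null character
  // Assumption: "reserved" covers both the bare name and the "system." prefix.
  if (name === "system" || name.startsWith("system.")) return false;
  return `${dbName}.${name}`.length <= 120;          // namespace limit, including the "."
}

console.log(isValidKeyName("$tags"), isValidKeyName("ta.gs"), isValidKeyName("tags"));
console.log(isValidCollectionName("library", "media"));
```

A check like this is only a convenience; the server remains the final authority on what it accepts.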
Querying for Data
You’ve seen how to switch to your database and how to insert data; next, you will learn how to query for data
in your collection. Let’s build on the preceding example and look at all the possible ways to get a good clear
view of your data in a given collection.
■Note When querying your data, you have an extraordinary range of options, operators, expressions, filters,
and so on available to you. We will spend the next few sections reviewing these options.
The find() function provides the easiest way to retrieve data from multiple documents within one of
your collections. This function is one that you will be using often.
Let’s assume that you have inserted the preceding two examples into a collection called media in the
library database. If you were to use a simple find() function on this collection, you would get all of the
documents you’ve added so far printed out for you:
> db.media.find()
{ "_id" : ObjectId("4c1a8a56c603000000007ecb"), "Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The", "ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress", "Author" : ["Hows, David", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"] }
{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",
"Title" : "Nevermind", "Tracklist" : [
{
"Track" : "1",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
] }
This is simple stuff, but typically you would not want to retrieve all the information from all the
documents in your collection. Instead, you probably want to retrieve a certain type of document. For
example, you might want to return all the CDs from Nirvana. If so, you can specify that only the desired
information is requested and returned:
> db.media.find ( { Artist : "Nirvana" } )
{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",
"Title" : "Nevermind", "Tracklist" : [
{
"Track" : "1",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
] }
Okay, so this looks much better! You don’t have to see all the information from all the other items you’ve
added to your collection, only the information that interests you. However, what if you’re still not satisfied
with the results returned? For example, assume you want to get a list back that shows only the titles of the
CDs you have by Nirvana, ignoring any other information, such as track lists. You can do this by inserting an
additional parameter into your query that specifies the name of the keys you want to return, followed by a 1:
> db.media.find ( {Artist : "Nirvana"}, {Title: 1} )
{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Title" : "Nevermind" }
Inserting the { Title : 1 } information specifies that only the information from the Title field should
be returned. The _id field is always returned, unless you specifically exclude it using { _id: 0 }.
■Note If you do not specify a sort order, the order of results is undefined. Sorting is covered later in
this chapter.
You can also accomplish the opposite: inserting { Type : 0 } retrieves a list of all items you have
stored from Nirvana, showing all information except for the Type field.
■Note The _id field will by default remain visible unless you explicitly ask it not to show itself.
Take a moment to run the revised query with the { Title : 1 } insertion; no unnecessary information
is returned at all. This saves you time because you see only the information you want. It also spares your
database the time required to return unnecessary information.
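To make the include and exclude rules concrete, here is a small sketch in plain JavaScript that mimics what a projection does to a single document. It is a simplified model, not MongoDB’s implementation: the project function is a hypothetical name, the _id value is made up, and nested fields and projection operators are ignored.

```javascript
// Simplified model of a find() projection; nested fields and operators are ignored.
function project(doc, fields) {
  const keys = Object.keys(fields).filter((k) => k !== "_id");
  const including = keys.some((k) => fields[k] === 1);   // { Title: 1 } style
  const result = {};
  for (const [key, value] of Object.entries(doc)) {
    if (key === "_id") {
      if (fields._id !== 0) result._id = value;          // _id stays unless excluded
    } else if (including ? fields[key] === 1 : fields[key] !== 0) {
      result[key] = value;
    }
  }
  return result;
}

const cd = { _id: 1, Type: "CD", Artist: "Nirvana", Title: "Nevermind" };
console.log(project(cd, { Title: 1 }));           // keeps _id and Title
console.log(project(cd, { Type: 0 }));            // keeps everything except Type
console.log(project(cd, { Title: 1, _id: 0 }));   // keeps Title only
```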
Using the Dot Notation
When you start working with more complex document structures such as documents containing arrays
or embedded objects, you can begin using other methods for querying information from those objects as
well. For example, assume you want to find all CDs that contain a specific song you like. The following code
executes a more detailed query:
> db.media.find( { "Tracklist.Title" : "In Bloom" } )
{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",
"Title" : "Nevermind", "Tracklist" : [
{
"Track" : "1",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
] }
Using a period [.] after the key’s name tells your find function to look for information embedded in
your documents. Things are a little simpler when working with arrays. For example, you can execute the
following query if you want to find a list of books written by Peter Membrey:
> db.media.find( { "Author" : "Membrey, Peter" } )
{ "_id" : ObjectId("4c1a8a56c603000000007ecb"), "Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The", "ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress", "Author" : ["Hows, David", "Plugge, Eelco", "Membrey, Peter", "Hawkins, Tim"] }
However, the following command will not match any documents, even though it might appear identical
to the earlier track list query:
> db.media.find ( { "Tracklist" : {"Track" : "1" }} )
Subobjects must match exactly; therefore, the preceding query would only match a document in which a
Tracklist entry contains nothing but the Track key, with no other information such as Tracklist.Title:
{"Type" : "CD",
"Artist" : "Nirvana",
"Title" : "Nevermind",
"Tracklist" : [
{
"Track" : "1"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
]
}
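The difference between the two kinds of match is easier to see side by side. The sketch below models both in plain JavaScript; matchesPath and matchesSubobject are hypothetical names of our own, and the comparison through JSON.stringify is only a stand-in for the exact, order-sensitive comparison the server performs on subobjects.

```javascript
const nevermind = {
  Type: "CD", Artist: "Nirvana", Title: "Nevermind",
  Tracklist: [
    { Track: "1", Title: "Smells Like Teen Spirit", Length: "5:02" },
    { Track: "2", Title: "In Bloom", Length: "4:15" },
  ],
};

// Dot notation: descend into embedded documents; on an array, any element may match.
function matchesPath(doc, path, expected) {
  const [head, ...rest] = path.split(".");
  const value = doc[head];
  if (rest.length === 0) {
    return Array.isArray(value) ? value.includes(expected) : value === expected;
  }
  const children = Array.isArray(value) ? value : [value];
  return children.some(
    (child) => child !== null && typeof child === "object" &&
      matchesPath(child, rest.join("."), expected)
  );
}

// Exact subobject match: the same keys and values in the same order, nothing extra.
function matchesSubobject(doc, field, expected) {
  const wanted = JSON.stringify(expected);
  const candidates = Array.isArray(doc[field]) ? doc[field] : [doc[field]];
  return candidates.some((candidate) => JSON.stringify(candidate) === wanted);
}

console.log(matchesPath(nevermind, "Tracklist.Title", "In Bloom"));      // true
console.log(matchesSubobject(nevermind, "Tracklist", { Track: "1" }));   // false
```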
Using the Sort, Limit, and Skip Functions
MongoDB includes several functions that you can use for more precise control over your queries. We’ll cover
how to use the sort, limit, and skip functions in this section.
You can use the sort function to sort the results returned from a query. You can sort the results in
ascending or descending order using 1 or -1, respectively. The function itself is analogous to the ORDER BY
statement in SQL, and it uses the key’s name and sorting method as criteria, as in this example:
> db.media.find().sort( { Title: 1 })
This example sorts the results based on the Title key’s value in ascending order. To sort in descending
order, you would specify -1 instead of 1.
■Note If you specify a key for sorting that does not exist, the order of results will be undefined.
You can use the limit() function to specify the maximum number of results returned. This function
requires only one parameter: the number of the desired results returned. When you specify 0, all results will
be returned. The following example returns only ten items in your media collection:
> db.media.find().limit( 10 )
Another thing you might want to do is skip the first n documents in a collection. The following example
skips 20 documents in your media collection:
> db.media.find().skip( 20 )
As you probably surmised, this command returns all documents within your collection, except for the
first 20 it finds.
MongoDB wouldn’t be particularly powerful if it weren’t able to combine these commands. Fortunately,
practically any function can be combined and used in conjunction with any other function. The following
example sorts the results in descending order, skips the first 20, and limits the remainder to ten (the server
applies the sort first, then the skip, and then the limit, regardless of the order in which you chain them):
> db.media.find().sort ( { Title : -1 } ).limit ( 10 ).skip ( 20 )
You might use this example if you want to implement paging in your application. As you might have
guessed, this command wouldn’t return any results in the media collection created so far, because the
collection contains fewer documents than were skipped in this example.
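If you do use this pattern for paging, the skip value is usually derived from a page number. The following plain JavaScript sketch shows the arithmetic on an in-memory array; pageOf is a hypothetical helper, and page numbers are assumed to start at 1. In the shell, the same numbers would simply be passed to skip() and limit().

```javascript
const media = [
  { Title: "Nevermind" }, { Title: "Matrix, The" }, { Title: "Blade Runner" },
  { Title: "Toy Story 3" }, { Title: "Definitive Guide to MongoDB 3rd ed., The" },
];

// Hypothetical paging helper: sort first, then skip, then limit.
function pageOf(docs, sortKey, direction, pageNumber, pageSize) {
  const sorted = [...docs].sort((a, b) => {
    if (a[sortKey] < b[sortKey]) return -direction;
    if (a[sortKey] > b[sortKey]) return direction;
    return 0;
  });
  const skip = (pageNumber - 1) * pageSize;     // documents that belong to earlier pages
  return sorted.slice(skip, skip + pageSize);   // at most pageSize documents
}

console.log(pageOf(media, "Title", -1, 2, 2));  // second page of two, titles descending
console.log(pageOf(media, "Title", -1, 3, 10)); // skips 20 documents, so nothing is left
```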
■Note You can use the following shortcut in the find() function to skip and limit your results:
find ( {}, {}, 10, 20 ). Here, you limit the results to ten and skip the first 20 documents found.
Working with Capped Collections, Natural Order, and $natural
There are some additional concepts and features you should be aware of when sorting queries with
MongoDB, including capped collections, natural order, and $natural. We’ll explain in this section what all
of these terms mean and how you can leverage them in your sorts.
The natural order is the database’s native ordering method for objects within a (normal) collection.
When you query for items in a collection without specifying an explicit sort order, the items are returned
by default in forward natural order. This may initially appear identical to the order in which items were
inserted; however, the natural order for a normal collection is not defined and may vary depending on
document growth patterns, indexes used for a query, and the storage engine used.
A capped collection is a collection in your database where the natural order is guaranteed to be the order
in which the documents were inserted. Guaranteeing that the natural order will always match the insertion
order can be particularly useful when you’re querying data and need to be absolutely certain that the results
returned are already sorted based on their order of insertion.
Capped collections have another great benefit: they are a fixed size. Once a capped collection is full,
the oldest data will be purged and newer data will be added at the end, ensuring that the natural order
follows the order in which the records were inserted. This type of collection can be used for logging and
autoarchiving data.
Unlike a standard collection, a capped collection must be created explicitly, using the
createCollection function. You must also supply parameters that specify the size (in bytes) of the
collection you want to add. For example, imagine you want to create a capped collection named audit with
a maximum size of 20480 bytes:
> db.createCollection("audit", {capped:true, size:20480})
{ "ok" : 1 }
Given that a capped collection guarantees that the natural order matches the insertion order, you don’t
need to include any special parameters or any other special commands or functions when querying the data
either, except of course when you want to reverse the default results. This is where the $natural parameter
comes in. For example, assume you want to find the ten most recent entries from your capped collection that
lists failed login attempts. You could use the $natural parameter to find this information:
> db.audit.find().sort( { $natural: -1 } ).limit ( 10 )
■Note Documents already added to a capped collection can be updated, but they must not grow in size.
The update will fail if they do. Deleting documents from a capped collection is also not possible; instead, the
entire collection must be dropped and re-created if you want to do this. You will learn more about dropping a
collection later in this chapter.
You can also limit the number of items added into a capped collection using the max: parameter
when you create the collection. However, you must ensure that there is enough space in the collection for
the number of items you want to add. If the collection becomes full before the number of items has been
reached, the oldest item in the collection will be removed. The MongoDB shell includes a utility that lets
you see the amount of space used by an existing collection, whether it’s capped or uncapped. You invoke
this utility using the validate() function. This can be particularly useful if you want to estimate how large a
collection might become.
As stated previously, you can use the max: parameter to cap the number of items that can be inserted
into a collection, as in this example:
> db.createCollection("audit100", { capped:true, size:20480, max: 100})
{ "ok" : 1 }
Next, use the stats() function to check the size of the collection:
> db.audit100.stats()
{
"ns" : "library.audit100",
"count" : 0,
"size" : 0,
"storageSize" : 4096,
"capped" : true,
"max" : 100,
"maxSize" : 20480,
"sleepCount" : 0,
"sleepMS" : 0,
"wiredTiger" : {
[..]
},
"nindexes" : 1,
"totalIndexSize" : 4096,
"indexSizes" : {
"_id_" : 4096
},
"ok" : 1
}
The resulting output shows that the collection (named audit100) is a capped collection with a maximum of
100 items to be added, and it currently contains zero items.
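The interplay between the size and max limits can be pictured with a toy model. The following plain JavaScript class is only a sketch of the behavior described in this section, not how MongoDB stores capped collections; the class name CappedCollection and its methods are hypothetical, and the length of the JSON text is used as a rough stand-in for a document’s real BSON size.

```javascript
// Toy model of a capped collection: a fixed byte budget and an optional document cap.
class CappedCollection {
  constructor(sizeInBytes, maxDocs = Infinity) {
    this.sizeInBytes = sizeInBytes;
    this.maxDocs = maxDocs;
    this.docs = [];                       // kept in insertion (natural) order
  }
  // Assumption: JSON length stands in for the real BSON size of a document.
  static sizeOf(doc) {
    return JSON.stringify(doc).length;
  }
  usedBytes() {
    return this.docs.reduce((total, doc) => total + CappedCollection.sizeOf(doc), 0);
  }
  insert(doc) {
    this.docs.push(doc);
    // Purge the oldest documents until both limits are respected again.
    while (this.docs.length > this.maxDocs || this.usedBytes() > this.sizeInBytes) {
      this.docs.shift();
    }
  }
  // Comparable in spirit to find().sort({ $natural: -1 }).limit(n).
  mostRecent(n) {
    return [...this.docs].reverse().slice(0, n);
  }
}

const audit = new CappedCollection(20480, 3);
for (let attempt = 1; attempt <= 5; attempt++) audit.insert({ attempt });
console.log(audit.mostRecent(2));   // the two newest entries, newest first
```

Whichever limit is reached first triggers the purge, which is why a generous max value does not help if the size in bytes is too small.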
Retrieving a Single Document
So far we’ve only looked at examples that show how to retrieve multiple documents. If you want to receive
only one result, however, querying for all documents—which is what you generally do when executing a
find() function—would be a waste of CPU time and memory. For this case, you can use the findOne()
function to retrieve a single item from your collection. Overall, the result and execution methods are
identical to what occurs when you append the limit(1) function, but why make it harder on yourself than
you should?
The syntax of the findOne() function is identical to the syntax of the find() function:
> db.media.findOne()
It’s generally advised to use the findOne() function if you expect only one result.
Using the Aggregation Commands
MongoDB comes with a nice set of aggregation commands. You might not see their significance at first,
but once you get the hang of using them, you will see that the aggregation commands form an extremely
powerful set of tools. For instance, you might use them to get an overview of some basic statistics about your
database. In this section, we will take a closer look at how to use three of the functions from the available
aggregate commands: count, distinct, and group.
In addition to these three basic aggregation commands, MongoDB also includes an aggregation
framework. This powerful feature will allow you to calculate aggregated values without needing to use the
map/reduce framework. The aggregation framework will be discussed in Chapter 5.
Returning the Number of Documents with count( )
The count() function returns the number of documents in the specified collection. So far you’ve added a
number of documents in the media collection. The count() function can tell you exactly how many:
> db.media.count()
2
You can also perform additional filtering by combining count() with conditional operators,
as shown here:
> db.media.find( { Publisher : "Apress", Type: "Book" } ).count()
1
This example returns only the number of documents added in the collection that are published by
Apress and of the type Book. Note that the count() function ignores a skip() or limit() parameter by
default. To ensure that your query doesn’t skip these parameters and that your count results will match the
limit and/or skip parameters, use count(true):
> db.media.find( { Publisher: "Apress", Type: "Book" }).skip ( 2 ) .count (true)
0
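The effect of the true argument can be summarized in a few lines of plain JavaScript. This is a sketch of the counting rule only; countMatches is a hypothetical helper that receives the documents that already matched the query.

```javascript
// Model of count() versus count(true) over an array of already-matched documents.
function countMatches(matched, { skip = 0, limit = 0 } = {}, applySkipLimit = false) {
  if (!applySkipLimit) return matched.length;                 // skip and limit are ignored
  const remaining = Math.max(matched.length - skip, 0);
  return limit > 0 ? Math.min(remaining, limit) : remaining;  // a limit of 0 means no limit
}

const apressBooks = [{ Title: "Definitive Guide to MongoDB 3rd ed., The" }];
console.log(countMatches(apressBooks, { skip: 2 }));        // 1
console.log(countMatches(apressBooks, { skip: 2 }, true));  // 0
```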
Retrieving Unique Values with distinct( )
The preceding example shows a great way to retrieve the total number of documents from a specific
publisher. However, this approach is definitely not precise. After all, if you own more than one book with the
same title (for instance, the hardcopy and the e-book), then you would technically have just one book. This
is where distinct() can help you: it will only return unique values.
For the sake of completeness, you can add an additional item to the collection. This item carries the
same title, but has a different ISBN number:
> document = ( { "Type" : "Book", "Title" : "Definitive Guide to MongoDB 3rd ed., The", ISBN:
" 978-1-4842-1183-1", "Publisher" : "Apress", "Author" : ["Hows, David", "Membrey, Peter",
"Plugge, Eelco", "Hawkins, Tim"] } )
> db.media.insert (document)
WriteResult({ "nInserted" : 1 })
At this point, you should have two books in the database with identical titles. When using the
distinct() function on the titles in this collection, you will get a total of two unique items. The
titles of the two books are identical, so they will be grouped into one item. The other result will be the
title of the album “Nevermind”:
> db.media.distinct( "Title")
[ "Definitive Guide to MongoDB 3rd ed., The", "Nevermind" ]
Similarly, you will get two results if you query for a list of unique ISBN numbers:
> db.media.distinct ("ISBN")
[ "978-1-4842-1183-0", " 978-1-4842-1183-1" ]
The distinct() function also takes nested keys when querying; for instance, this command will give
you a list of the unique track titles on your CDs:
> db.media.distinct ("Tracklist.Title")
[ "In Bloom", "Smells Like Teen Spirit" ]
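To see why a nested key such as Tracklist.Title works, it helps to model how the values are collected. The following plain JavaScript sketch gathers every value at a dotted path, descending into arrays along the way, and then removes duplicates. The function names are hypothetical, only primitive values are deduplicated, and the order of the result is not meant to mirror the server’s.

```javascript
const media = [
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The", ISBN: "978-1-4842-1183-0" },
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The", ISBN: "978-1-4842-1183-1" },
  { Type: "CD", Title: "Nevermind",
    Tracklist: [{ Title: "Smells Like Teen Spirit" }, { Title: "In Bloom" }] },
];

// Collect every value found at a dotted path, descending into arrays along the way.
function valuesAt(value, parts) {
  if (Array.isArray(value)) return value.flatMap((item) => valuesAt(item, parts));
  if (parts.length === 0) return value === undefined ? [] : [value];
  if (value === null || typeof value !== "object") return [];
  return valuesAt(value[parts[0]], parts.slice(1));
}

// A Set keeps one copy of each primitive value.
function distinct(docs, path) {
  const all = docs.flatMap((doc) => valuesAt(doc, path.split(".")));
  return [...new Set(all)];
}

console.log(distinct(media, "Title"));
console.log(distinct(media, "Tracklist.Title"));
```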
Grouping Your Results
Last but not least, you can group your results. MongoDB’s group() function is similar to SQL’s GROUP BY
clause, although the syntax is a little different. The purpose of the command is to return an array of
grouped items. The group() function takes three parameters: key, initial, and reduce.
The key parameter specifies which results you want to group. For example, assume you want to group
results by Title. The initial parameter lets you provide a base for each grouped result (that is, the base
number of items to start off with). By default, you want to leave this parameter at zero if you want an exact
number returned. The reduce parameter groups all similar items together. Reduce takes two arguments: the
current document being iterated over and the aggregation counter object. These arguments are called items
and prev in the example that follows. Essentially, the reduce parameter adds a 1 to the sum of every item it
encounters that matches a title it has already found.
The group() function is ideal when you’re looking for a tagcloud kind of function. For example, assume
you want to obtain a list of all unique titles of any type of item in your collection. Additionally, assume you
want to group them together if any doubles are found, based on the title:
> db.media.group (
{
key: {Title : true},
initial: {Total : 0},
reduce : function (items,prev)
{
prev.Total += 1
}
}
)
[
{
"Title" : "Nevermind",
"Total" : 1
},
{
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"Total" : 2
}
]
In addition to the key, initial, and reduce parameters, you can specify three more optional parameters:
• keyf: You can use this parameter to replace the key parameter if you do not wish to
group the results on an existing key in your documents. Instead, you would group
them using another function you design that specifies how to do grouping.
• cond: You can use this parameter to specify an additional statement that must be true
before a document will be grouped. You can use this much as you use the find()
query to search for documents in your collection. If this parameter isn’t set (the
default), then all documents in the collection will be checked.
• finalize: You can use this parameter to specify a function you want to execute
before the final results are returned. For instance, you might calculate an average or
perform a count and include this information in the results.
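The way these parameters cooperate can be sketched in plain JavaScript. The function below is a simplified model rather than MongoDB’s implementation: key is a list of field names instead of a { Title : true } document, cond is a predicate function instead of a query document, and keyf is left out. The name group is reused here purely for illustration.

```javascript
const media = [
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The" },
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The" },
  { Type: "CD", Title: "Nevermind" },
];

// Simplified model of group(): one aggregation object per distinct key value.
function group(docs, { key, initial, reduce, cond = () => true, finalize }) {
  const buckets = new Map();
  for (const doc of docs.filter(cond)) {
    const keyValues = Object.fromEntries(key.map((name) => [name, doc[name]]));
    const id = JSON.stringify(keyValues);
    if (!buckets.has(id)) buckets.set(id, { ...keyValues, ...initial }); // fresh copy of initial
    reduce(doc, buckets.get(id));          // (current document, aggregation object)
  }
  const results = [...buckets.values()];
  if (finalize) results.forEach((result) => finalize(result));
  return results;
}

const totals = group(media, {
  key: ["Title"],
  initial: { Total: 0 },
  reduce: (item, prev) => { prev.Total += 1; },
});
console.log(totals);
```

Adding cond: (doc) => doc.Type === "Book" to the options would restrict the grouping to books, much as the cond parameter does in the shell.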
■Note The group() function does not currently work in sharded environments. For these, you should use the
mapReduce() function instead. Also, the resulting output cannot contain more than 20,000 keys in all with the
group() function or an exception will be raised. This, too, can be bypassed by using mapReduce().
Working with Conditional Operators
MongoDB supports a large set of conditional operators to better filter your results. The following sections
provide an overview of these operators, including some basic examples that show you how to use them.
Before walking through these examples, however, you should add a few more items to the database; doing so
will let you see the effects of these operators more plainly:
> dvd = ( { "Type" : "DVD", "Title" : "Matrix, The", "Released" : 1999,
"Cast" : ["Keanu Reeves","Carrie-Anne Moss","Laurence Fishburne",
"Hugo Weaving","Gloria Foster","Joe Pantoliano"] } )
{
"Type" : "DVD",
"Title" : "Matrix, The",
"Released" : 1999,
"Cast" : [
"Keanu Reeves",
"Carrie-Anne Moss",
"Laurence Fishburne",
"Hugo Weaving",
"Gloria Foster",
"Joe Pantoliano"
]
}
> db.media.insertOne(dvd)
> dvd = ( { "Type" : "DVD", Title : "Blade Runner", Released : 1982 } )
{ "Type" : "DVD", "Title" : "Blade Runner", "Released" : 1982 }
> db.media.insertOne(dvd)
> dvd = ( { "Type" : "DVD", Title : "Toy Story 3", Released : 2010 } )
{ "Type" : "DVD", "Title" : "Toy Story 3", "Released" : 2010 }
> db.media.insertOne(dvd)
Performing Greater-Than and Less-Than Comparisons
You can use the following special parameters to perform greater-than and less-than comparisons in queries:
$gt, $lt, $gte, and $lte. In this section, we’ll look at how to use each of these parameters.
The first one we’ll cover is the $gt (greater-than) parameter. You can use this to specify that a certain
integer should be greater than a specified value in order to be returned:
> db.media.find ( { Released : {$gt : 2000} }, { "Cast" : 0 } )
{ "_id" : ObjectId("4c4369a3c603000000007ed3"), "Type" : "DVD", "Title" : "Toy Story 3",
"Released" : 2010 }
Note that the year 2000 itself will not be included in the preceding query. For that, you use the $gte
(greater-than or equal-to) parameter:
> db.media.find ( { Released : {$gte : 1999 } }, { "Cast" : 0 } )
{ "_id" : ObjectId("4c43694bc603000000007ed1"), "Type" : "DVD", "Title" :
"Matrix, The", "Released" : 1999 }
{ "_id" : ObjectId("4c4369a3c603000000007ed3"), "Type" : "DVD", "Title" :
"Toy Story 3", "Released" : 2010 }
Likewise, you can use the $lt (less-than) parameter to find items in your collection that predate the
year 1999:
> db.media.find ( { Released : {$lt : 1999 } }, { "Cast" : 0 } )
{ "_id" : ObjectId("4c436969c603000000007ed2"), "Type" : "DVD", "Title" : "Blade Runner",
"Released" : 1982 }
You can also get a list of items older than or equal to the year 1999 by using the $lte (less-than or
equal-to) parameter:
> db.media.find( {Released : {$lte: 1999}}, { "Cast" : 0 })
{ "_id" : ObjectId("4c43694bc603000000007ed1"), "Type" : "DVD", "Title" :
"Matrix, The", "Released" : 1999 }
{ "_id" : ObjectId("4c436969c603000000007ed2"), "Type" : "DVD", "Title" :
"Blade Runner", "Released" : 1982 }
You can also combine these parameters to specify a range:
> db.media.find( {Released : {$gte: 1990, $lt : 2010}}, { "Cast" : 0 })
{ "_id" : ObjectId("4c43694bc603000000007ed1"), "Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999 }
These parameters might strike you as relatively simple to use; however, you will be using them a lot
when querying for a specific range of data.
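If it helps to think of these parameters as ordinary comparisons, the following plain JavaScript sketch models them on a list of release years. It is an illustration only; the comparators table and the satisfies function are hypothetical names, and every operator in a condition must hold, which is how a range is expressed.

```javascript
// Model of the four comparison operators applied to a single field value.
const comparators = {
  $gt: (value, bound) => value > bound,
  $gte: (value, bound) => value >= bound,
  $lt: (value, bound) => value < bound,
  $lte: (value, bound) => value <= bound,
};

// Every operator in the condition must hold for the value to qualify.
function satisfies(value, condition) {
  return Object.entries(condition).every(([op, bound]) => comparators[op](value, bound));
}

const released = [1999, 1982, 2010];
console.log(released.filter((year) => satisfies(year, { $gt: 2000 })));
console.log(released.filter((year) => satisfies(year, { $gte: 1990, $lt: 2010 })));
```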
Retrieving All Documents but Those Specified
You can use the $ne (not-equals) parameter to retrieve every document in your collection, except for the
ones that match certain criteria. It should be noted that $ne may be performance heavy when the field of
choice has many potential values. For example, you can use this snippet to obtain a list of all books where
the author is not Eelco Plugge:
> db.media.find( { Type : "Book", Author: {$ne : "Plugge, Eelco"}})
Specifying an Array of Matches
You can use the $in operator to specify an array of possible matches. The SQL equivalent is the IN operator.
You can use the following snippet to retrieve data from the media collection using the $in operator:
> db.media.find( {Released : {$in : [1999,2008,2009] } }, { "Cast" : 0 } )
{ "_id" : ObjectId("4c43694bc603000000007ed1"), "Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999 }
This example returns only one item, because only one item matches the release year of 1999, and there
are no matches for the years 2008 and 2009.
Finding a Value Not in an Array
The $nin operator functions similarly to the $in operator, except that it searches for the objects where the
specified field does not have a value in the specified array:
> db.media.find( {Released : {$nin : [1999,2008,2009] },Type : "DVD" },
{ "Cast" : 0 } )
{ "_id" : ObjectId("4c436969c603000000007ed2"), "Type" : "DVD", "Title" :
"Blade Runner", "Released" : 1982 }
{ "_id" : ObjectId("4c4369a3c603000000007ed3"), "Type" : "DVD", "Title" :
"Toy Story 3", "Released" : 2010 }
Matching All Attributes in a Document
The $all operator also works similarly to the $in operator. However, $all requires that the field contain
every value listed, whereas the $in operator needs only one of the listed values to match. Let’s look at an
example that illustrates these differences. First, here’s an example that uses $in (note that the years are
written as numbers, not strings, because the Released field stores numbers):
> db.media.find ( { Released : {$in : [2010,2009] } }, { "Cast" : 0 } )
{ "_id" : ObjectId("4c4369a3c603000000007ed3"), "Type" : "DVD", "Title" : "Toy Story 3",
"Released" : 2010 }
One document is returned for the $in operator because there’s a match for 2010, but not for 2009.
However, the $all operator doesn’t return any results, because no document has both 2010 and 2009
in its Released value:
> db.media.find ( { Released : {$all : [2010,2009] } }, { "Cast" : 0 } )
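The three array operators differ only in how many of the listed values must be present. The plain JavaScript sketch below models that difference; the function names are hypothetical, a field may hold either a single value or an array, and the lists passed in are assumed to be non-empty.

```javascript
// A field may hold a single value or an array; treat both as a list of values.
const asList = (value) => (Array.isArray(value) ? value : [value]);

// $in: at least one listed value is present.
const matchesIn = (value, candidates) => asList(value).some((v) => candidates.includes(v));
// $nin: none of the listed values is present.
const matchesNin = (value, candidates) => !matchesIn(value, candidates);
// $all: every listed value is present (assumes a non-empty list).
const matchesAll = (value, required) => required.every((r) => asList(value).includes(r));

const cast = ["Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne", "Hugo Weaving"];
console.log(matchesIn(2010, [2010, 2009]));                      // true
console.log(matchesAll(2010, [2010, 2009]));                     // false
console.log(matchesAll(cast, ["Keanu Reeves", "Hugo Weaving"])); // true
```

Note that the comparison is type sensitive in this model as well: the number 2010 does not match the string "2010".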
Searching for Multiple Expressions in a Document
You can use the $or operator to search for multiple expressions in a single query, where only one criterion
needs to match to return a given document. Unlike the $in operator, $or allows you to specify both the key
and the value, rather than only the value:
> db.media.find({ $or : [ { "Title" : "Toy Story 3" }, { "ISBN" : "978-1-4842-1183-0" } ] } )
{ "_id" : ObjectId("4c5fc7d8db290000000067c5"), "Type" : "Book", "Title" : "Definitive Guide
to MongoDB 3rd ed., The", "ISBN" : "978-1-4842-1183-0", "Publisher" : "Apress", "Author" :
["Hows, David", "Membrey, Peter", "Plugge, Eelco", "Hawkins, Tim" ] }
{ "_id" : ObjectId("4c5fc943db290000000067ca"), "Type" : "DVD", "Title" : "Toy Story 3",
"Released" : 2010 }
It’s also possible to combine the $or operator with another query parameter. This will restrict the
returned documents to only those that match the first query (mandatory), and then either of the two
key/value pairs specified at the $or operator, as in this example:
> db.media.find({ "Type" : "DVD", $or : [ { "Title" : "Toy Story 3" },
{ "ISBN" : "978-1-4842-1183-0" } ] })
{ "_id" : ObjectId("4c5fc943db290000000067ca"), "Type" : "DVD", "Title" : "Toy Story 3",
"Released" : 2010 }
You could say that the $or operator allows you to perform two queries at the same time, combining
the results of two otherwise unrelated queries on the same collection. It is worth noting here that, if all the
queries in an $or clause can be supported by indexes, MongoDB will perform index scans. If not, a collection
scan will be used instead. Lastly, each clause of the $or can use its own index.
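The combination of a mandatory condition with an $or list can be modeled in a few lines of plain JavaScript. This is a sketch under simplifying assumptions: only equality on top-level keys is supported, and matchesClause and matchesQuery are hypothetical names.

```javascript
const media = [
  { Type: "Book", Title: "Definitive Guide to MongoDB 3rd ed., The", ISBN: "978-1-4842-1183-0" },
  { Type: "DVD", Title: "Toy Story 3", Released: 2010 },
  { Type: "DVD", Title: "Blade Runner", Released: 1982 },
];

// Each clause is a plain { key: value } object; equality only.
const matchesClause = (doc, clause) =>
  Object.entries(clause).every(([key, value]) => doc[key] === value);

// The mandatory part must match, plus at least one clause of $or (if $or is given).
function matchesQuery(doc, { $or, ...mandatory }) {
  const anyClause = $or === undefined || $or.some((clause) => matchesClause(doc, clause));
  return matchesClause(doc, mandatory) && anyClause;
}

const clauses = [{ Title: "Toy Story 3" }, { ISBN: "978-1-4842-1183-0" }];
console.log(media.filter((doc) => matchesQuery(doc, { $or: clauses })).length);
console.log(media.filter((doc) => matchesQuery(doc, { Type: "DVD", $or: clauses })).length);
```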
Retrieving a Document with $slice
You can use the $slice projection to limit an array field to a subset of the array for each matching result.
This can be particularly useful if you want to limit a certain set of items added to save bandwidth. The
operator also lets you retrieve the results of n items per page, a feature generally known as paging.
The operator takes either a single number or an array of two numbers. A single number indicates the total
number of items to be returned. When you pass two numbers, the first defines the offset (the number of items
to skip), while the second defines the limit. Both the single number and the offset also accept a negative
value, which counts from the end of an array instead of the beginning.
The following example limits the items from the Cast list to the first three items:
> db.media.find({"Title" : "Matrix, The"}, {"Cast" : {$slice: 3}})
{ "_id" : ObjectId("4c5fcd3edb290000000067cb"), "Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999, "Cast" : [ "Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne" ] }
You can also get only the last three items by making the integer negative:
> db.media.find({"Title" : "Matrix, The"}, {"Cast" : {$slice: -3}})
{ "_id" : ObjectId("4c5fcd3edb290000000067cb"), "Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999, "Cast" : [ "Hugo Weaving", "Gloria Foster", "Joe Pantoliano" ] }
Or you can skip the first two items and limit the results to three from that particular point (pay careful
attention to the brackets):
> db.media.find({"Title" : "Matrix, The"}, {"Cast" : {$slice: [2,3] }})
{ "_id" : ObjectId("4c5fcd3edb290000000067cb"), "Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999, "Cast" : [ "Laurence Fishburne", "Hugo Weaving", "Gloria Foster" ] }
Finally, when specifying a negative integer, you can skip to the last five items and limit the results to
four, as in this example:
> db.media.find({"Title" : "Matrix, The"}, {"Cast" : {$slice: [-5,4] }})
{ "_id" : ObjectId("4c5fcd3edb290000000067cb"), "Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999, "Cast" : [ "Carrie-Anne Moss", "Laurence Fishburne", "Hugo Weaving",
"Gloria Foster" ] }
■Note With version 2.4, MongoDB also introduced the $slice operator for $push operations, allowing you to
limit the number of array elements when appending values to an array. This operator is discussed later in this
chapter. Do not confuse the two, however.
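The four $slice projections shown earlier in this section can be modeled in plain JavaScript. The following is a minimal sketch of the selection rules only, not MongoDB's implementation; the function name sliceProjection is hypothetical, and the six-item cast list is reconstructed from the outputs above.

```javascript
// Hypothetical helper modeling how a $slice projection selects array items.
// spec is either a number (limit) or a two-item array [offset, limit].
function sliceProjection(array, spec) {
  if (typeof spec === "number") {
    // Positive: first n items. Negative: last n items.
    return spec >= 0 ? array.slice(0, spec) : array.slice(spec);
  }
  const [offset, limit] = spec;
  // A negative offset counts back from the end of the array.
  const start = offset >= 0
    ? Math.min(offset, array.length)
    : Math.max(array.length + offset, 0);
  return array.slice(start, start + limit);
}

const castList = ["Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne",
                  "Hugo Weaving", "Gloria Foster", "Joe Pantoliano"];

console.log(sliceProjection(castList, 3));       // first three items
console.log(sliceProjection(castList, -3));      // last three items
console.log(sliceProjection(castList, [2, 3]));  // skip two, then take three
console.log(sliceProjection(castList, [-5, 4])); // start five from the end, take four
```

Paging follows from the two-parameter form: page p of size n corresponds to an offset of p * n and a limit of n.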
Searching for Odd/Even Integers
The $mod operator lets you search for documents in which a numeric field holds an even or odd number. The
operator takes two values, a divisor and a remainder. With a divisor of 2 and a remainder of 0, it matches
even-numbered values only.
For example, the following code returns any item in the collection that has an even-numbered integer
set to its Released field:
> db.media.find ( { Released : { $mod: [2,0] } }, {"Cast" : 0 } )
{ "_id" : ObjectId("4c45b5c18e0f0000000062aa"), "Type" : "DVD", "Title" : "Blade Runner",
"Released" : 1982 }
{ "_id" : ObjectId("4c45b5df8e0f0000000062ab"), "Type" : "DVD", "Title" : "Toy Story 3",
"Released" : 2010 }
Likewise, you can find any documents containing an odd value in the Released field by changing
the remainder passed to $mod, as follows:
> db.media.find ( { Released : { $mod: [2,1] } }, { "Cast" : 0 } )
{ "_id" : ObjectId("4c45b5b38e0f0000000062a9"), "Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999 }
■Note The $mod operator only works on integer values, not on strings that contain a numbered value. For
example, you can’t use the operator on { Released : "2010" } because it’s in quotes and therefore a string.
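The check that $mod performs can be illustrated in plain JavaScript. This is a minimal sketch under the assumption that a value matches when it is a number whose remainder after division equals the second parameter; the helper name matchesMod and the sample values are hypothetical.

```javascript
// Hypothetical helper modeling the $mod check: value % divisor === remainder.
// Non-numeric values (such as the string "2010") never match.
function matchesMod(value, [divisor, remainder]) {
  return typeof value === "number" && value % divisor === remainder;
}

const releasedValues = [1982, 1999, 2010, "2010"];
console.log(releasedValues.filter(v => matchesMod(v, [2, 0]))); // even years
console.log(releasedValues.filter(v => matchesMod(v, [2, 1]))); // odd years
```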
Filtering Results with $size
The $size operator lets you filter your results to match an array with the specified number of elements in it.
For example, you might use this operator to do a search for those CDs that have exactly two songs on them:
> db.media.find ( { Tracklist : {$size : 2} } )
{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",
"Title" : "Nevermind", "Tracklist" : [
{
"Track" : "1",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
] }
■Note You cannot use the $size operator to find a range of sizes. For example, you cannot use it to find
arrays with more than one element in them.
Returning a Specific Field Object
The $exists operator allows you to return a specific object if a specified field is either missing or found.
The following example returns all items in the collection with a key named Author:
> db.media.find ( { Author : {$exists : true } } )
Similarly, if you invoke this operator with a value of false, then all documents that don’t have a key
named Author will be returned:
> db.media.find ( { Author : {$exists : false } } )
■Warning Depending on your MongoDB version and the type of index, the $exists operator may be unable to use an index; in that case, using it requires a full collection scan.
Matching Results Based on the BSON Type
The $type operator lets you match results based on their BSON type. For instance, the following snippet lets
you find all items that have a track list of the type Embedded Object (that is, it contains a list of information):
> db.media.find ( { Tracklist: { $type : 3 } } )
{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",
"Title" : "Nevermind", "Tracklist" : [
{
"Track" : "1",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
] }
The known data types are defined in Table 4-1.
Table 4-1. Known BSON Types and Codes
Code Data Type
–1 MinKey
1 Double
2 Character string (UTF8)
3 Embedded object
4 Embedded array
5 Binary data
7 Object ID
8 Boolean type
9 Date type
10 Null type
11 Regular expression
13 JavaScript code
14 Symbol
15 JavaScript code with scope
16 32-bit integer
17 Timestamp
18 64-bit integer
127 MaxKey
255 MinKey
Matching an Entire Array
If you want all of your criteria to be matched by a single element of an array, you can use the $elemMatch
operator. This is particularly useful if multiple documents in your collection have arrays that share some of
the same information, which can make a default query incapable of finding the exact document you are
looking for. This is because the standard query syntax doesn’t restrict itself to a single embedded document
within an array; each condition may be satisfied by a different element.
Let’s look at an example that illustrates this principle. For this to work, you need to add another
document to the collection, one that has an identical item in it but is otherwise different. Specifically, let’s
add another CD from Nirvana that happens to have the same track on it as the aforementioned CD
(“Smells Like Teen Spirit”). However, on this version of the CD, the song is track 5, not track 1:
{
"Type" : "CD",
"Artist" : "Nirvana",
"Title" : "Nirvana",
"Tracklist" : [
{
"Track" : "1",
"Title" : "You Know You're Right",
"Length" : "3:38"
},
{
"Track" : "5",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
}
]
}
> nirvana = ( { "Type" : "CD", "Artist" : "Nirvana", "Title" : "Nirvana", "Tracklist" :
[ { "Track" : "1", "Title" : "You Know You're Right", "Length" : "3:38"}, {"Track" : "5",
"Title" : "Smells Like Teen Spirit", "Length" : "5:02" } ] } )
> db.media.insertOne(nirvana)
If you want to search for an album from Nirvana that has the song “Smells Like Teen Spirit” as Track 1
on the CD, you might think that the following query would do the job:
> db.media.find ( { "Tracklist.Title" : "Smells Like Teen Spirit", "Tracklist.Track" : "1" } )
Unfortunately, the preceding query will return both documents. The reason for this is that both
documents have a track with the title called “Smells Like Teen Spirit” and both have a track number 1. If you
want to match an entire document within the array, you can use $elemMatch, as in this example:
> db.media.find ( { Tracklist: { "$elemMatch" : { Title: "Smells Like Teen Spirit",
Track : "1" } } } )
{ "_id" : ObjectId("4c1a86bb2955000000004076"), "Type" : "CD", "Artist" : "Nirvana",
"Title" : "Nevermind", "Tracklist" : [
{
"Track" : "1",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
] }
This query gave the desired result and only returned the first document.
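The difference between the two queries can be reproduced in plain JavaScript. The sketch below is an illustration of the matching rules, not MongoDB code; the two sample documents are shortened versions of the Nirvana documents above, and the function names are hypothetical.

```javascript
// Two documents modeled on the Nirvana examples above (Length omitted).
const albums = [
  { Title: "Nevermind", Tracklist: [
    { Track: "1", Title: "Smells Like Teen Spirit" },
    { Track: "2", Title: "In Bloom" } ] },
  { Title: "Nirvana", Tracklist: [
    { Track: "1", Title: "You Know You're Right" },
    { Track: "5", Title: "Smells Like Teen Spirit" } ] }
];

// Dot notation: each condition may be satisfied by a different array element.
function dotNotationMatch(doc) {
  return doc.Tracklist.some(t => t.Title === "Smells Like Teen Spirit") &&
         doc.Tracklist.some(t => t.Track === "1");
}

// $elemMatch: one single element must satisfy every condition.
function elemMatch(doc) {
  return doc.Tracklist.some(
    t => t.Title === "Smells Like Teen Spirit" && t.Track === "1");
}

console.log(albums.filter(dotNotationMatch).map(d => d.Title)); // both albums
console.log(albums.filter(elemMatch).map(d => d.Title));        // only Nevermind
```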
Using the $not Metaoperator
You can use the $not metaoperator to negate any check performed by a standard operator. It should
be noted that $not may be performance heavy when the field of choice has many potential values. The
following example returns all documents in your collection, except for the one seen in the $elemMatch
example:
> db.media.find ( { Tracklist : { $not : { "$elemMatch" : { Title:
"Smells Like Teen Spirit", "Track" : "1" } } } } )
Specifying Additional Query Expressions
Apart from the structured query syntax you’ve seen so far, you can also specify additional query expressions
in JavaScript. The big advantage of this is that JavaScript is extremely flexible and allows you to do tons of
additional things. The downside of using JavaScript is that it’s a tad slower than the native operators baked
into MongoDB, as it cannot take advantage of indexes.
For example, assume you want to search for a DVD within your collection that is older than 1995. All of
the following code examples would return this information:
db.media.find ( { "Type" : "DVD", "Released" : { $lt : 1995 } } )
db.media.find ( { "Type" : "DVD", $where: "this.Released < 1995" } )
db.media.find ("this.Released < 1995")
f = function() { return this.Released < 1995 }
db.media.find(f)
And that’s how flexible MongoDB is! Using these operators should enable you to find just about
anything throughout your collections.
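To see why a JavaScript expression cannot benefit from an index, it helps to picture how such a function is evaluated. The following plain JavaScript sketch is an assumption about the general approach (call the function once per document with this bound to that document), not MongoDB's actual code; the in-memory array stands in for the media collection.

```javascript
// Hypothetical in-memory stand-in for a few documents in the media collection.
const mediaDocs = [
  { Type: "DVD", Title: "Blade Runner", Released: 1982 },
  { Type: "DVD", Title: "Matrix, The", Released: 1999 },
  { Type: "DVD", Title: "Toy Story 3", Released: 2010 }
];

// The same predicate as in the examples above; `this` is the current document.
const olderThan1995 = function () { return this.Released < 1995; };

// Every document has to be visited and the function invoked on each one,
// which is why an expression like this cannot be answered from an index.
const olderMatches = mediaDocs.filter(doc => olderThan1995.call(doc));
console.log(olderMatches.map(doc => doc.Title));
```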
Leveraging Regular Expressions
Regular expressions are another powerful tool you can use to query information. Regular expressions—regex,
for short—are special text strings that you can use to describe your search pattern. These work much like
wildcards, but they are far more powerful and flexible.
MongoDB allows you to use these regular expressions when searching for data in your collections;
however, to improve performance it will attempt to use an index whenever possible for simple prefix
expressions. Prefix expressions are those regular expressions that start with either a left anchor (“\A”) or a
caret (“^”) followed by a few characters (example: “^Matrix”). Querying with regular expressions that are not
prefix expressions cannot efficiently make use of an index.
■Note Please bear in mind that case-insensitive (“i”) regular-expression queries can cause poor
performance because of the number of comparisons MongoDB needs to perform for them.
The following example uses regex in a query to find all items in the media collection that start with the
word “Matrix” (case insensitive):
> db.media.find ( { Title : /^Matrix/i } )
Using regular expressions with MongoDB can make your life much simpler, so we recommend
exploring this feature in greater detail as time permits or when your circumstances call for it.
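The shell uses JavaScript regular-expression literals, so you can try a pattern on ordinary strings before using it in a query. The sketch below contrasts a prefix expression with a non-anchored one; the titles marked as hypothetical are invented for illustration and are not part of the media collection.

```javascript
const sampleTitles = ["Matrix, The", "matrix reloaded (hypothetical entry)",
                      "Toy Story 3", "The Animatrix (hypothetical entry)"];

const prefixPattern = /^Matrix/;             // anchored prefix expression, case sensitive
const prefixIgnoreCase = /^Matrix/i;         // same prefix, case insensitive
const anywherePattern = /matrix/i;           // not anchored, so not a prefix expression

console.log(sampleTitles.filter(t => prefixPattern.test(t)));
console.log(sampleTitles.filter(t => prefixIgnoreCase.test(t)));
console.log(sampleTitles.filter(t => anywherePattern.test(t)));
```

Only the first pattern is the kind of simple, case-sensitive prefix expression that the text above describes as index friendly.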
Updating Data
So far you’ve learned how to insert and query for data in your database. Next, you’ll learn how to
update those data. MongoDB supports quite a few update operators that you’ll learn how to use in the
following sections.
Updating with update()
MongoDB comes with the update() function for performing updates to your data. The update() function
takes three primary arguments: criteria, objNew, and options.
The criteria argument lets you specify the query that selects the record you want to update. You use
the objNew argument to specify the updated information; or you can use an operator to do this for you.
The options argument lets you specify your options when updating the document, and it has two possible
values: upsert and multi. The upsert option lets you specify whether the update should be an upsert—that
is, it tells MongoDB to update the record if it exists and create it if it doesn’t. Finally, the multi option lets you
specify whether all matching documents should be updated or just the first one (the default action).
The following simple example uses the update() function without any fancy operators:
> db.media.update( { "Title" : "Matrix, The"}, {"Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999, "Genre" : "Action"}, { upsert: true} )
This example updates a matching document in the collection if one exists or saves a new document
with the new values specified. Note that any fields you leave out are removed (the document is basically
being rewritten). The updateOne() function rejects a replacement document like this one, because it
requires update operators such as $set.
If multiple documents match the criteria and you wish to update them all, you can use the
updateMany() function together with the $set modifier operator, as shown here:
> db.media.updateMany( { "Title" : "Matrix, The"}, {$set: {"Type" : "DVD", "Title" :
"Matrix, The", "Released" : 1999, "Genre" : "Action"} }, {upsert: true} )
■Note An upsert tells the database to “update a record if a document is present or to insert the record
if it isn’t.”
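The difference between rewriting a document and changing selected fields with $set, both described earlier in this section, can be sketched in plain JavaScript. The helper names replaceDocument and setFields are hypothetical, and the stored document is a simplified stand-in for the Matrix entry.

```javascript
// Hypothetical helpers: a replacement-style update versus a $set-style update.
function replaceDocument(existing, replacement) {
  // Only _id survives; every field not listed in the replacement is dropped.
  return Object.assign({ _id: existing._id }, replacement);
}

function setFields(existing, fields) {
  // Fields not mentioned are left untouched.
  return Object.assign({}, existing, fields);
}

const storedMovie = { _id: 1, Type: "DVD", Title: "Matrix, The", Released: 1999,
                      Cast: ["Keanu Reeves"] };
const movieChange = { Type: "DVD", Title: "Matrix, The", Released: 1999, Genre: "Action" };

console.log(replaceDocument(storedMovie, movieChange)); // Cast is gone
console.log(setFields(storedMovie, movieChange));       // Cast is kept, Genre is added
```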
Implementing an Upsert with the save() Command
You can also perform an upsert with the save() command. To do this, you need to specify the _id value; you
can have this value added automatically or specify it manually yourself. If you do not specify the _id value,
the save() command will assume it’s an insert and simply add the document into your collection.
The main benefit of using the save() command is that you do not need to specify the upsert
option, as you do with the update() command. Thus, the save() command gives you a
quicker way to upsert data. In practice, the save() and update() commands look similar:
> db.media.update( { "_id" : "Matrix, The"}, {"Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999, "Genre" : "Action"}, { upsert: true} )
> db.media.save( { "_id" : "Matrix, The", "Type" : "DVD", "Title" : "Matrix, The",
"Released" : 1999, "Genre" : "Action"} )
Obviously, this example assumes that the Title value acts as the _id field.
Updating Information Automatically
You can use the modifier operations to update information quickly and simply in your documents, without
needing to type everything in manually. For example, you might use these operations to increase a number
or to remove an element from an array.
We’ll be exploring these operators next, providing practical examples that show you how to use them.
Incrementing a Value with $inc
The $inc operator enables you to perform an (atomic) update on a key to increase the value by the given
increment, assuming that the field exists. If the field doesn’t exist, it will be created. To see this in action,
begin by adding another document to the collection:
> manga = ( { "Type" : "Manga", "Title" : "One Piece", "Volumes" : 612, "Read" : 520 } )
{
"Type" : "Manga",
"Title" : "One Piece",
"Volumes" : 612,
"Read" : 520
}
> db.media.insertOne(manga)
Now you’re ready to update the document. For example, assume you’ve read another four volumes
of the One Piece manga, and you want to increment the number of Read volumes in the document. The
following example shows you how to do this:
> db.media.updateOne ( { "Title" : "One Piece"}, {$inc: {"Read" : 4} } )
> db.media.find ( { "Title" : "One Piece" } )
{
"Type" : "Manga",
"Title" : "One Piece",
"Volumes" : 612,
"Read" : 524
}
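The rules for $inc described above (increment an existing number, create a missing field) can be modeled in plain JavaScript. This is a minimal sketch, not MongoDB's implementation; the helper name applyInc and the Skipped field are hypothetical, and the error for non-numeric fields is modeled on MongoDB rejecting such an update.

```javascript
// Hypothetical helper modeling $inc on a single document.
function applyInc(doc, increments) {
  const result = Object.assign({}, doc);
  for (const [field, amount] of Object.entries(increments)) {
    if (!(field in result)) {
      result[field] = amount;            // missing field: created with the amount
    } else if (typeof result[field] === "number") {
      result[field] += amount;           // numeric field: incremented
    } else {
      // Modeled on MongoDB rejecting $inc on a non-numeric field.
      throw new TypeError("Cannot increment non-numeric field " + field);
    }
  }
  return result;
}

const mangaDoc = { Type: "Manga", Title: "One Piece", Volumes: 612, Read: 520 };
console.log(applyInc(mangaDoc, { Read: 4 }));    // Read becomes 524
console.log(applyInc(mangaDoc, { Skipped: 1 })); // Skipped is created as 1
```

This also shows why the Read value has to be stored as a number rather than as a quoted string.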
Setting a Field’s Value
You can use the $set operator to set a field’s value to one you specify. This works for any datatype, as in the
following example:
> db.media.update ( { "Title" : "Matrix, The" }, {$set : { Genre : "Sci-Fi" } } )
This snippet would update the genre in the document created earlier, setting it to Sci-Fi instead.
Deleting a Specified Field
The $unset operator lets you delete a given field, as in this example:
> db.media.updateOne ( {"Title": "Matrix, The"}, {$unset : { "Genre" : 1 } } )
This snippet would delete the Genre key and its value from the document.
Appending a Value to a Specified Field
The $push operator allows you to append a value to a specified field. If the field is an existing array, then the
value will be added. If the field doesn’t exist yet, then the field will be set to the array value. If the field exists
but it isn’t an array, then an error condition will be raised.
Begin by adding another author to your entry in the collection:
> db.media.updateOne ( {"ISBN" : "978-1-4842-1183-0"}, {$push: { Author :
"Griffin, Stewie"} } )
The next snippet raises an error message because the Title field is not an array:
> db.media.updateOne ( {"ISBN" : "978-1-4842-1183-0"}, {$push: { Title :
"This isn't an array"} } )
Cannot apply $push/$pushAll modifier to non-array
The following example shows how the document looks in the meantime:
> db.media.find ( { "ISBN" : "978-1-4842-1183-0" } )
{
"Author" :
[
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim",
"Griffin, Stewie"
],
"ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"Type" : "Book",
"_id" : ObjectId("4c436231c603000000007ed0")
}
Specifying Multiple Values in an Array
When working with arrays, the $push operator will append the value specified to the given array, expanding
the data stored within the given element. If you wish to add several separate values to the given array, you
can use the optional $each modifier, as in this example:
> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, { $push: { Author : { $each:
["Griffin, Peter", "Griffin, Brian"] } } } )
{
"Author" :
[
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim",
"Griffin, Stewie",
"Griffin, Peter",
"Griffin, Brian"
],
"ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"Type" : "Book",
"_id" : ObjectId("4c436231c603000000007ed0")
}
Optionally, you can use the $slice operator when using $each. This allows you to limit the number of
elements within an array during a $push operation. The $slice operator takes a negative number, zero, or
(since MongoDB 2.6) a positive number. Using a negative number ensures that only the last n elements will
be kept within the array, a positive number keeps only the first n elements, and zero empties the array.
Note that the $slice operator only takes effect when it is used together with the $each modifier:
> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, { $push: { Author : { $each:
["Griffin, Meg", "Griffin, Louis"], $slice: -2 } } } )
{
"Author" :
[
"Griffin, Meg",
"Griffin, Louis"
],
"ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"Type" : "Book",
"_id" : ObjectId("4c436231c603000000007ed0")
}
As you can see, the $slice operator ensured that not only were the two new values pushed, but that the
data kept within the array was also limited to the value specified (2). The $slice operator can be a valuable
tool when working with fixed-sized arrays.
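The combined effect of $each and $slice can be modeled in plain JavaScript: append first, then trim. The sketch below covers only the zero and negative cases discussed in the text; the helper name pushEachSlice is hypothetical.

```javascript
// Hypothetical helper modeling $push with the $each and $slice modifiers.
function pushEachSlice(array, each, slice) {
  const combined = array.concat(each);        // $each: append every value
  if (slice === undefined) return combined;   // no $slice: keep everything
  if (slice > 0) {
    throw new RangeError("positive $slice values are not modeled in this sketch");
  }
  // Zero empties the array; a negative number keeps only the last n elements.
  return slice === 0 ? [] : combined.slice(slice);
}

const authorList = ["Hows, David", "Membrey, Peter", "Plugge, Eelco", "Hawkins, Tim",
                    "Griffin, Stewie", "Griffin, Peter", "Griffin, Brian"];

console.log(pushEachSlice(authorList, ["Griffin, Meg", "Griffin, Louis"], -2));
```

Because the trim happens after the append, a $slice of -2 with two new values leaves exactly those two values in the array.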
Adding Data to an Array with $addToSet
The $addToSet operator is another command that lets you add data to an array. However, this operator only
adds the data to the array if the data are not already there. In this way, $addToSet is unlike $push. By default,
the $addToSet operator takes one argument. However, you can use the $each operator to specify additional
arguments when using $addToSet. The following snippet adds the author Griffin, Brian into the authors
array because it isn’t there yet:
> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, {$addToSet : { Author :
"Griffin, Brian" } } )
Executing the snippet again won’t change anything because the author is already in the array.
To add more than one value, however, you should take a different approach and use the $each operator
as well:
> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, {$addToSet : { Author : { $each :
["Griffin, Brian","Griffin, Meg"] } } } )
At this point, our document, which once looked tidy and trustworthy, has been transformed into
something like this:
{
"Author" :
[
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim",
"Griffin, Stewie",
"Griffin, Peter",
"Griffin, Brian",
"Griffin, Louis",
"Griffin, Meg"
],
"ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"Type" : "Book",
"_id" : ObjectId("4c436231c603000000007ed0")
}
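The behavior of $addToSet with $each can be modeled in plain JavaScript. This is a minimal sketch for arrays of strings only (MongoDB compares whole values, including embedded documents); the helper name addToSet and the shortened author list are hypothetical.

```javascript
// Hypothetical helper modeling $addToSet with $each on an array of strings.
function addToSet(array, values) {
  const result = array.slice();
  for (const value of values) {
    if (!result.includes(value)) result.push(value); // skip values already present
  }
  return result;
}

const shortAuthorList = ["Hows, David", "Griffin, Brian"];
const afterFirstRun = addToSet(shortAuthorList, ["Griffin, Brian", "Griffin, Meg"]);
const afterSecondRun = addToSet(afterFirstRun, ["Griffin, Brian", "Griffin, Meg"]);

console.log(afterFirstRun);  // Meg is added, Brian is not duplicated
console.log(afterSecondRun); // running it again changes nothing
```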
Removing Elements from an Array
MongoDB also includes several methods that let you remove elements from an array, including $pop,
$pull, and $pullAll. In the sections that follow, you’ll learn how to use each of these methods for removing
elements from an array.
The $pop operator lets you remove a single element from an array. This operator lets you remove the
first or last value in the array, depending on the parameter you pass down with it. For example, the following
snippet removes the last element from the array:
> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, {$pop : {Author : 1 } } )
In this case, the $pop operator will pop Meg’s name off the list of authors. Passing down a negative
number would remove the first element from the array. The following example removes David Hows’s
name from the list of authors:
> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0" }, {$pop : {Author : -1 } } )
■Note Specifying a value of -2 or 1000 wouldn’t change which element gets removed. Any negative
number would remove the first element, while any positive number would remove the last element. Using the
number 0 removes the last element from the array.
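The rule in the preceding note can be written down as a short plain JavaScript function. This sketch models the behavior exactly as the note describes it; other MongoDB versions may be stricter about which values they accept, and the helper name popElement is hypothetical.

```javascript
// Hypothetical helper modeling $pop as described in the note above.
function popElement(array, direction) {
  // Negative: drop the first element. Zero or positive: drop the last one.
  return direction < 0 ? array.slice(1) : array.slice(0, -1);
}

const letters = ["a", "b", "c"];
console.log(popElement(letters, 1));    // last element removed
console.log(popElement(letters, -1));   // first element removed
console.log(popElement(letters, 1000)); // same as 1
console.log(popElement(letters, -2));   // same as -1
```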
Removing Each Occurrence of a Specified Value
The $pull operator lets you remove each occurrence of a specified value from an array. This can be
particularly useful if you have multiple elements with the same value in your array. Let’s begin this example
by using the $push operator to add Stewie back to the list of authors:
> db.media.updateOne ( {"ISBN" : "978-1-4842-1183-0"}, {$push: { Author :
"Griffin, Stewie"} } )
Stewie will be in and out of the database a couple more times as we walk through this book’s examples.
You can remove all occurrences of this author in the document with the following code:
> db.media.updateOne ( {"ISBN" : "978-1-4842-1183-0"}, {$pull : { Author :
"Griffin, Stewie" } } )
Removing Multiple Elements from an Array
You can also remove multiple elements with different values from an array. The $pullAll operator enables
you to accomplish this. The $pullAll operator takes an array with all the elements you want to remove, as in
the following example:
> db.media.updateOne( { "ISBN" : "978-1-4842-1183-0"}, {$pullAll : { Author :
["Griffin, Louis","Griffin, Peter","Griffin, Brian"] } } )
The field from which you remove the elements (Author in the preceding example) needs to be an array.
If it isn’t, you’ll receive an error message.
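The difference between $pull and $pullAll can be modeled in plain JavaScript. This is a minimal sketch for arrays of strings, not MongoDB's implementation; the helper names and the sample list are hypothetical.

```javascript
// Hypothetical helpers modeling $pull (one value) and $pullAll (several values).
function pullValue(array, value) {
  return array.filter(item => item !== value); // every occurrence is removed
}

function pullAllValues(array, values) {
  if (!Array.isArray(array)) {
    // Modeled on the error raised when the target field is not an array.
    throw new TypeError("pullAllValues can only be applied to an array");
  }
  return array.filter(item => !values.includes(item));
}

const griffins = ["Griffin, Stewie", "Griffin, Louis", "Griffin, Stewie",
                  "Griffin, Peter", "Hows, David"];

console.log(pullValue(griffins, "Griffin, Stewie"));
console.log(pullAllValues(griffins, ["Griffin, Louis", "Griffin, Peter"]));
```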
Specifying the Position of a Matched Array
You can use the $ operator in your queries to specify the position of the matched array item in your query.
You can use this operator for data manipulation after finding an array member. For instance, assume you’ve
added another track to your track list, but you accidentally made a typo when entering the track number:
> db.media.updateOne( { "Artist" : "Nirvana" }, {$addToSet : { Tracklist : {"Track" :
2,"Title": "Been a Son", "Length":"2:23"} } } )
{
"Artist" : "Nirvana",
"Title" : "Nevermind",
"Tracklist" : [
{
"Track" : "1",
"Title" : "You Know You're Right",
"Length" : "3:38"
},
{
"Track" : "5",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : 2,
"Title" : "Been a Son",
"Length" : "2:23"
}
],
"Type" : "CD",
"_id" : ObjectId("4c443ad6c603000000007ed5")
}
It so happens you know that the track number of the most recent item should be 3 rather than 2.
You can use the $inc method in conjunction with the $ operator to increase the value from 2 to 3, as in
this example:
> db.media.updateOne( { "Tracklist.Title" : "Been a Son"}, {$inc:{"Tracklist.$.Track" : 1} } )
Note that only the first item it matches will be updated. Thus, if there are two identical elements in the
Tracklist array, only the first element will be incremented.
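The role of the $ placeholder can be pictured as "the index of the first array element that matched the query". The plain JavaScript sketch below models that idea; the helper name incFirstMatch is hypothetical, the track numbers are stored as numbers for illustration, and the duplicate entry is added only to show that the second match is left alone.

```javascript
// Hypothetical helper modeling {$inc: {"Tracklist.$.Track": 1}} for a query on
// "Tracklist.Title": only the first matching array element is changed.
function incFirstMatch(doc, title, amount) {
  const index = doc.Tracklist.findIndex(t => t.Title === title); // the "$" position
  if (index === -1) return doc;                                   // nothing matched
  const tracklist = doc.Tracklist.map(
    (t, i) => (i === index ? Object.assign({}, t, { Track: t.Track + amount }) : t));
  return Object.assign({}, doc, { Tracklist: tracklist });
}

const nirvanaAlbum = { Artist: "Nirvana", Tracklist: [
  { Track: 1, Title: "You Know You're Right" },
  { Track: 2, Title: "Been a Son" },
  { Track: 2, Title: "Been a Son" }   // duplicate added for illustration
] };

const updatedAlbum = incFirstMatch(nirvanaAlbum, "Been a Son", 1);
console.log(updatedAlbum.Tracklist.map(t => t.Track)); // [ 1, 3, 2 ]
```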
Atomic Operations
MongoDB supports atomic operations executed against single documents. An atomic operation is a set of
operations that can be combined in such a way that the set of operations appears to be merely one single
operation to the rest of the system. This set of operations will have either a positive or a negative outcome as
the final result.
You can call a set of operations an atomic operation if it meets the following pair of conditions:
1. No other process knows about the changes being made until the entire set of
operations has completed.
2. If one of the operations fails, the entire set of operations (the entire atomic
operation) will fail, resulting in a full rollback, where the data are restored to their
state prior to running the atomic operation.
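The two conditions above can be illustrated with a small plain JavaScript sketch: changes are made on a private copy, and the copy replaces the original only if every step succeeds. This is an illustration of the concept under simplifying assumptions (a single in-memory document, no concurrency), not a description of how MongoDB implements atomicity; the helper name applyAtomically is hypothetical.

```javascript
// Hypothetical helper: apply a list of operations to a copy of the document and
// only hand back the changed copy if every operation succeeded.
function applyAtomically(doc, operations) {
  const working = JSON.parse(JSON.stringify(doc)); // private copy, invisible to others
  try {
    for (const operation of operations) operation(working);
  } catch (error) {
    return { ok: false, doc: doc };                // rollback: original is untouched
  }
  return { ok: true, doc: working };               // all changes appear at once
}

const originalDoc = { Title: "One Piece", Read: 520 };
const goodRun = applyAtomically(originalDoc,
  [d => { d.Read += 4; }, d => { d.Genre = "Manga"; }]);
const badRun = applyAtomically(originalDoc,
  [d => { d.Read += 4; }, () => { throw new Error("simulated failure"); }]);

console.log(goodRun); // ok is true, Read is 524
console.log(badRun);  // ok is false, Read is still 520
```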
A standard behavior when executing atomic operations is that the data will be locked and therefore
unable to be reached by other queries. However, MongoDB does not support locking or complex
transactions for a number of reasons:
• In sharded environments (see Chapter 12 for more information on such
environments), distributed locks can be expensive and slow. MongoDB’s goal is to be
lightweight and fast, so expensive and slow go against this principle.
• MongoDB developers don’t like the idea of deadlocks. In their view, it’s preferable for
a system to be simple and predictable instead.
• MongoDB is designed to work well for real-time problems. When an operation is
executed that locks large amounts of data, it would also stop some smaller light
queries for an extended period of time. Again, this goes against the MongoDB goal
of speed.
MongoDB includes several update operators (as noted previously), all of which can atomically update
an element:
• $set: Sets a particular value.
• $unset: Removes a particular value.
• $inc: Increments a particular value by a certain amount.
• $push: Appends a value to an array.
• $pull: Removes one or more values from an existing array.
• $pullAll: Removes several values from an existing array.
Using the Update-If-Current Method
Another strategy that atomic update uses is the update-if-current method. This method takes the following
three steps:
1. It fetches the object from the document.
2. It modifies the object locally (with any of the previously mentioned operations,
or a combination of them).
3. It sends an update request to update the object to the new value, in case the
current value still matches the old value fetched.
You can check the result returned by the update to see whether all went well. Note that all of this happens
automatically. Let’s take a new look at an example shown previously:
> db.media.updateOne( { "Tracklist.Title" : "Been a Son"}, {$inc:{"Tracklist.$.Track" : 1} } )
Here, you can use the returned result document to check whether the update went smoothly:
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
In this example, you incremented Tracklist.Track using the track list title as an identifier. But now
consider what happens if the track list data are changed by another user using the same method while
MongoDB was modifying your data. Because Tracklist.Title remains the same, you might assume
(incorrectly) that you are updating the original data, when in fact you are overwriting the changes.
This is known as the ABA problem. This scenario might seem unlikely, but in a multiuser environment,
where many applications are working on data at the same time, this can be a significant problem.
To avoid this problem, you can do one of the following:
• Use the entire object in the update’s query expression, instead of just an identifying
field (such as Tracklist.Title in the preceding example).
• Use $set to set the field you care about. If other fields have changed, they won’t be
affected by this.
• Put a version variable in the object and increment it on each update.
• When possible, use a $ operator instead of an update-if-current sequence of
operations.
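The version-variable approach from the list above can be sketched in plain JavaScript. The example uses a hypothetical in-memory store and a hypothetical helper named updateIfCurrent; in a real application the version comparison would be part of the update's query expression.

```javascript
// Hypothetical in-memory store with a single document and a version counter.
const store = { doc: { _id: 1, Track: 2, version: 0 } };

// Update only if the stored version still equals the version that was read.
function updateIfCurrent(readVersion, changes) {
  if (store.doc.version !== readVersion) return false; // someone else got there first
  store.doc = Object.assign({}, store.doc, changes, { version: readVersion + 1 });
  return true;
}

const seenByA = store.doc.version; // both clients read version 0
const seenByB = store.doc.version;

console.log(updateIfCurrent(seenByA, { Track: 3 })); // true: version becomes 1
console.log(updateIfCurrent(seenByB, { Track: 9 })); // false: B holds a stale version
```

Client B can now fetch the document again, see the new value, and decide whether its change still makes sense.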
■Note MongoDB does not support updating multiple documents atomically in a single operation. Instead,
you can use nested objects, which effectively make them one document for atomic purposes.
Modifying and Returning a Document Atomically
The findAndModify command also allows you to perform an atomic update on a document. This command
modifies the document and returns it. The command takes three main operators: <query>, which is used
to specify the document you’re executing it against; <sort>, which is used to sort the matching documents
when multiple documents match, and <operations>, which is used to specify what needs to be done.
Now let’s look at a handful of examples that illustrate how to use this command. The first example finds
the document you’re searching for and removes it once it is found:
> db.media.findAndModify( { query: { "Title" : "One Piece" }, sort: {"Title": -1}, remove: true } )
{
"_id" : ObjectId("4c445218c603000000007ede"),
"Type" : "Manga",
"Title" : "One Piece",
"Volumes" : 612,
"Read" : 524
}
This code returned the document it found matching the criteria. In this case, it found and removed
the first item it found with the title “One Piece.” If you execute a find() function now, you will see that the
document is no longer within the collection.
The next example modifies the document rather than removing it:
> db.media.findAndModify( { query: { "ISBN" : "978-1-4842-1183-0" }, sort: {"Title":-1},
update: {$set: {"Title" : "Different Title"} } } )
The preceding example updates the title from “Definitive Guide to MongoDB 3rd ed., The” to “Different
Title” and returns the old document (as it was before the update) to your shell. If you would rather see the
results of the update on the document, you can add the new operator after your query:
> db.media.findAndModify( { query: { "ISBN" : "978-1-4842-1183-0" }, sort: {"Title":-1},
update: {$set: {"Title" : "Different Title"} }, new:true } )
Note that you can use any modifier operation with this command, not just $set.
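The difference between returning the old and the new document can be modeled in plain JavaScript. This is a minimal sketch on an in-memory array, not the MongoDB command itself: the helper name findAndModifyModel is hypothetical, sorting is not modeled, the update is treated like a $set, and the option is called returnNew because new is a reserved word in JavaScript.

```javascript
// Hypothetical helper modeling findAndModify on an in-memory array of documents.
function findAndModifyModel(collection, { query, update, remove = false, returnNew = false }) {
  const index = collection.findIndex(query);
  if (index === -1) return null;                 // no document matched
  const before = collection[index];
  if (remove) {
    collection.splice(index, 1);                 // delete and return the old document
    return before;
  }
  const after = Object.assign({}, before, update);
  collection[index] = after;
  return returnNew ? after : before;             // mirrors the new: true option
}

const books = [{ ISBN: "978-1-4842-1183-0", Title: "Definitive Guide to MongoDB 3rd ed., The" }];
const byIsbn = doc => doc.ISBN === "978-1-4842-1183-0";

const oldDoc = findAndModifyModel(books, { query: byIsbn, update: { Title: "Different Title" } });
const newDoc = findAndModifyModel(books,
  { query: byIsbn, update: { Title: "Another Title" }, returnNew: true });

console.log(oldDoc.Title); // the title as it was before the first update
console.log(newDoc.Title); // the title after the second update
```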
Processing Data in Bulk
MongoDB also allows you to perform write operations in bulk. This way, you can first define the dataset
prior to writing it all in a single go. Bulk write operations are limited to a single collection only and can be
used to insert, update, or remove data.
Before you can write your data in bulk, you will first need to tell MongoDB how those data are to be
written: ordered or unordered. When executing the operation in an ordered fashion, MongoDB will go over
the list of operations serially. That is, were an error to occur while processing one of the write operations,
the remaining operations will not be processed. In contrast, using an unordered write operation, MongoDB
will execute the operations in a parallel manner. Were an error to occur during one of the writing operations
here, MongoDB will continue to process the remaining write operations.
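The difference in error handling between ordered and unordered lists can be sketched in plain JavaScript. The helper name runBulk and the simulated error are hypothetical, and parallel execution is not modeled; the sketch only shows which operations still run after a failure.

```javascript
// Hypothetical helper: run write operations and collect errors.
// Ordered runs stop at the first error; unordered runs keep going.
// (Parallel execution is not modeled; operations simply run one by one.)
function runBulk(operations, ordered) {
  const result = { executed: 0, errors: [] };
  for (let i = 0; i < operations.length; i++) {
    try {
      operations[i]();
      result.executed++;
    } catch (error) {
      result.errors.push({ index: i, message: error.message });
      if (ordered) break;
    }
  }
  return result;
}

const insertedTitles = [];
const bulkOps = [
  () => insertedTitles.push("Deadpool"),
  () => { throw new Error("duplicate key (simulated)"); },
  () => insertedTitles.push("Paper Towns")
];

console.log(runBulk(bulkOps, true));   // executed: 1, one error, third op skipped
console.log(runBulk(bulkOps, false));  // executed: 2, one error, third op still runs
```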
For example, let’s assume you want to insert data in bulk to your media collection in an ordered fashion,
so that if an error were to occur the operation would halt. You first will need to initialize your ordered list
using the initializeOrderedBulkOp() function, as follows:
> var bulk = db.media.initializeOrderedBulkOp();
Now you can continue to add insert operations to your ordered list, named bulk, before finally executing
them with the execute() command (covered in the next section):
> bulk.insert({ "Type" : "Movie", "Title" : "Deadpool", "Released" : 2016});
> bulk.insert({ "Type" : "CD", "Artist" : "Iron Maiden", "Title" : "Book of Souls, The" });
> bulk.insert({ "Type" : "Book", "Title" : "Paper Towns", "ISBN" : "978-0142414934", "Author" : "Green, John" });
■Note Your list can contain a maximum of 1000 operations. MongoDB will automatically split and process
your list into separate groups of 1000 operations or less when your list exceeds this limit.
Executing Bulk Operations
Now that the list has been filled, you will notice that the data themselves have not been written into the
collection yet. You can verify this by doing a simple find() on the media collection, which will only show the
previously added content:
> db.media.find()
{ "_id" : ObjectId("55e6d1d8b54fe7a2c96567d4"),
"Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress",
"Author" : [
"Hows, David",
"Plugge, Eelco",
"Membrey, Peter",
"Hawkins, Tim"
] }
{ "_id" : ObjectId("4c1a86bb2955000000004076"),
"Type" : "CD",
"Artist" : "Nirvana",
"Title" : "Nevermind",
"Tracklist" : [
{
"Track" : "1",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
] }
To process the list of operations, the execute() command can be used like so:
> bulk.execute();
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 3,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]
})
As you can tell from the output, nInserted reports 3, meaning three items were inserted into your
collection. If your list were to include other operations such as upserts or removals, those would have been
listed here as well.
Evaluating the Output
Once the bulk operations have been executed using the execute() command, you are also able to review
the write operations performed. This can be used to evaluate whether all the data were written successfully
and in what order this was done. Moreover, when something does go wrong during the write operation, the
output will help you understand what has been executed. To review the write operations executed through
execute(), you can use the getOperations() command, like so:
> bulk.getOperations();
[
{
"originalZeroIndex" : 0,
"batchType" : 1,
"operations" : [
{
"_id" : ObjectId("55e7fa1db54fe7a2c96567d6"),
"Type" : "Movie",
"Title" : "Deadpool",
"Released" : 2016
},
{
"_id" : ObjectId("55e7fa1db54fe7a2c96567d7"),
"Type" : "CD",
"Artist" : "Iron Maiden",
"Title" : "Book of Souls, The"
},
{
"_id" : ObjectId("55e7fa1db54fe7a2c96567d8"),
"Type" : "Book",
"Title" : "Paper Towns",
"ISBN" : "978-0142414934",
"Author" : "Green, John"
}
]
}
]
Notice how the array returned includes all the data processed under the operations key, as well as the
batchType key indicating the type of operation performed. Here, its value is 1, indicating the items were
inserted into the collection. Table 4-2 describes the types of operations performed and their corresponding
batchType values.
Table 4-2. BatchType Values and Their Meaning
BatchType    Operation
1            Insert
2            Update
3            Remove
■Note When processing various types of operations in unordered lists, MongoDB will group these together
by type (inserts, updates, removals) to increase performance. As such, be sure your applications do not depend
on the order of operations performed. Ordered lists only group contiguous operations of the same type, so
these are still processed in order.
Bulk operations can be extremely useful for processing a large set of data in a single go without
influencing the available dataset beforehand.
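The grouping rule for ordered lists described in the preceding note can be sketched in plain JavaScript. This is
an illustrative model only, not MongoDB's implementation: the { type, doc } entry shape and the
groupOrderedOperations() name are assumptions made for this example, while the batchType codes follow
Table 4-2 and the default limit of 1000 follows the earlier note.

```javascript
// Illustrative model only; not how MongoDB itself is implemented.
// batchType codes as listed in Table 4-2.
var BATCH_TYPES = { insert: 1, update: 2, remove: 3 };

// Groups an ordered list of { type, doc } entries (a hypothetical shape)
// into batches of contiguous operations that share a type. A new batch is
// started whenever the type changes or the current batch is full.
function groupOrderedOperations(ops, maxBatchSize) {
  var limit = maxBatchSize || 1000;
  var batches = [];
  var current = null;
  ops.forEach(function (op, index) {
    var batchType = BATCH_TYPES[op.type];
    if (batchType === undefined) {
      throw new Error("unknown operation type: " + op.type);
    }
    var mustStartNewBatch =
      current === null ||
      current.batchType !== batchType ||
      current.operations.length >= limit;
    if (mustStartNewBatch) {
      current = { originalZeroIndex: index, batchType: batchType, operations: [] };
      batches.push(current);
    }
    current.operations.push(op.doc);
  });
  return batches;
}

// Two inserts, one removal, then another insert yield three batches whose
// batchType values are 1, 3, and 1, so the original order is preserved.
var exampleBatches = groupOrderedOperations([
  { type: "insert", doc: { Title: "Deadpool" } },
  { type: "insert", doc: { Title: "Paper Towns" } },
  { type: "remove", doc: { Title: "Deadpool" } },
  { type: "insert", doc: { Title: "Nevermind" } }
]);
```

An unordered list could be modelled by sorting the entries by type before grouping, which is exactly why
applications should not rely on the execution order of unordered lists.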
Renaming a Collection
It might happen that you discover you have named a collection incorrectly, but you’ve already inserted some
data into it. This might make it troublesome to remove the data and add them again from scratch.
Instead, you can use the renameCollection() function to rename your existing collection. The following
example shows you how to use this simple and straightforward command:
> db.media.renameCollection("newname")
{ "ok" : 1 }
If the command executes successfully, { "ok" : 1 } will be returned. If it fails, however (if the collection
doesn’t exist, for example), then the following message is returned:
{ "errmsg" : "assertion: source namespace does not exist", "ok" : 0 }
The renameCollection command doesn’t take many parameters (unlike some commands you’ve seen
so far); however, it can be quite useful in the right circumstances.
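If you would rather check for the failure case up front than interpret the error message afterward, a small
helper along the following lines may be useful. It is a minimal sketch: renameCollectionSafely() is a
hypothetical name, and the helper assumes it is handed a database object such as db, on which it calls the
shell's getCollectionNames(), getCollection(), and renameCollection() functions.

```javascript
// Renames a collection only when the source exists and the target name is
// still free. The returned { ok, errmsg } objects for the refused cases are
// this sketch's own convention, not messages produced by MongoDB.
function renameCollectionSafely(database, from, to) {
  var names = database.getCollectionNames();
  if (names.indexOf(from) === -1) {
    return { ok: 0, errmsg: "source collection '" + from + "' does not exist" };
  }
  if (names.indexOf(to) !== -1) {
    return { ok: 0, errmsg: "target collection '" + to + "' already exists" };
  }
  // Both checks passed: hand over to the shell's own renameCollection().
  return database.getCollection(from).renameCollection(to);
}
```

In the shell, renameCollectionSafely(db, "media", "newname") would then behave like the example above, while
a misspelled source name is reported without attempting the rename.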
Deleting Data
So far we’ve explored how to add, search for, and modify data. Next, we’ll examine how to delete documents,
entire collections, and the databases themselves.
Previously, you learned how to delete data from a specific document (using the $pop operator,
for instance). In this section, you will learn how to delete full documents and collections. Just as the
insertOne() function is used for inserting and updateOne() is used for modifying a document, deleteOne()
is used to delete a document.
To delete a single document from your collection, you need to specify the criteria you’ll use to find
the document. A good approach is to perform a find() first; this ensures that the criteria used are specific
to your document. Once you are sure of the criteria, you can invoke the deleteOne() function using them
as a parameter:
> db.newname.deleteOne( { "Title" : "Different Title" } )
This statement removes a single matching document. Any other item in your collection that matches
the criteria will not be removed when using the deleteOne() function. To delete multiple documents
matching your criteria, you can use the deleteMany() function instead.
Or you can use the following snippet to delete all documents from the newname collection (remember, we
renamed the media collection to this previously):
> db.newname.deleteMany({})
■Warning When deleting a document, you need to remember that any reference to that document will
remain within the database. For this reason, be sure you manually delete or update those references as well;
otherwise, these references will return null when evaluated. Referencing will be discussed in the next section.
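The advice given earlier in this section, to perform a find() before deleting, can be folded into a small
helper. The following is a minimal sketch rather than a recommended implementation: deleteOneIfUnique() is
a hypothetical name, and the shape of its return value is an assumption made for this example. It relies
only on find(), limit(), toArray(), and deleteOne().

```javascript
// Deletes a document only when the criteria match exactly one document.
// At most two matches are fetched, which is enough to tell "none",
// "exactly one", and "more than one" apart.
function deleteOneIfUnique(collection, criteria) {
  var matches = collection.find(criteria).limit(2).toArray();
  if (matches.length !== 1) {
    // Nothing is deleted; matchesSeen is 0 or 2 (meaning "two or more").
    return { deleted: false, matchesSeen: matches.length };
  }
  // Delete by _id so that exactly the document just inspected is removed.
  var result = collection.deleteOne({ _id: matches[0]._id });
  return { deleted: result.deletedCount === 1, matchesSeen: 1 };
}
```

In the shell, deleteOneIfUnique(db.newname, { "Title" : "Different Title" }) would refuse to delete anything
if several documents happened to share that title.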
If you want to delete everything in a collection, you can use either the drop() or the remove() function. Using
remove() will be a lot slower than drop(), because it deletes the documents one by one and keeps the collection
and its indexes in place. A drop() will be faster if you need to remove all data as well as indexes from a
collection. The following snippet removes the entire newname collection, including all of its documents:
> db.newname.drop()
true
The drop() function returns either true or false, depending on whether the operation has completed
successfully. Likewise, if you want to remove an entire database from MongoDB, you can use the
dropDatabase() function, as in this example:
> db.dropDatabase()
{ "dropped" : "library", "ok" : 1 }
Note that this snippet will remove the database you are currently working in (again, be sure to check db
to see which database is your current database).
Referencing a Database
At this point, you have an empty database again. You’re also familiar with inserting various kinds of data into
a collection. Now you’re ready to take things a step further and learn about database referencing (DBRef).
As you’ve already seen, there are plenty of scenarios where embedding data into your document will suffice for
your application (such as the track list or the list of authors in the book entry). However, sometimes you do need
to reference information in another document. The following sections will explain how to go about doing so.
Just as with SQL, references between documents in MongoDB are resolved by performing additional
queries on the server. MongoDB gives you two ways to accomplish this: referencing them manually or using
the DBRef standard, which many drivers also support.
Referencing Data Manually
The simplest and most straightforward way to reference data is to do so manually. When referencing data
manually, you store the value from the _id of the other document in your document, either through the full
ID or through a simpler common term. Before proceeding with an example, let’s add a new document and
specify the publisher’s information in it (pay close attention to the _id field):
> apress = ( { "_id" : "Apress", "Type" : "Technical Publisher", "Category" : ["IT",
"Software","Programming"] } )
{
"_id" : "Apress",
"Type" : "Technical Publisher",
"Category" : [
"IT",
"Software",
"Programming"
]
}
> db.publisherscollection.insertOne(apress)
Once you add the publisher’s information, you’re ready to add an actual document (for example, a
book’s information) into the media collection. The following example adds a document, specifying Apress as
the name of the publisher:
> book = ( { "Type" : "Book", "Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0", "Publisher" : "Apress",
"Author" : ["Hows, David","Membrey, Peter","Plugge, Eelco","Hawkins, Tim"] } )
{
"Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Publisher": "Apress",
"Author" : [
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim"
]
}
> db.media.insertOne(book)
All the information you need has been inserted into the publisherscollection and media collections,
respectively. You can now start using the database reference. First, assign the document that references the
publisher to a variable:
> book = db.media.findOne()
{
"_id" : ObjectId("4c458e848e0f00000000628e"),
"Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Publisher" : "Apress",
"Author" : [
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim"
]
}
To obtain the information itself, you combine the findOne function with some dot notation:
> db.publisherscollection.findOne( { _id : book.Publisher } )
{
"_id" : "Apress",
"Type" : "Technical Publisher",
"Category" : [
"IT",
"Software",
"Programming"
]
}
As this example illustrates, referencing data manually is straightforward and doesn’t require much
brainwork. Here, the _id in the documents placed in the publisherscollection collection has been manually
set and has not been generated by MongoDB (otherwise, the _id would be an object ID).
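The lookup shown above can be wrapped in a helper that also copes with a missing reference. This is a
minimal sketch under stated assumptions: resolvePublisher() is a hypothetical name, and the collection name
publisherscollection and the Publisher key are taken from the preceding example.

```javascript
// Looks up the publisher document that a book refers to through a manual
// reference. Returns null when the book has no Publisher value, or when no
// publisher with that _id exists (for example, because it was deleted).
function resolvePublisher(database, book) {
  if (!book || book.Publisher === undefined || book.Publisher === null) {
    return null;
  }
  return database.getCollection("publisherscollection")
                 .findOne({ _id: book.Publisher });
}
```

In the shell, resolvePublisher(db, db.media.findOne()) would return the Apress document shown earlier.
Because findOne() returns null when nothing matches, the caller can treat null as a sign of a dangling
reference.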
Referencing Data with DBRef
The DBRef standard provides a more formal specification for referencing data between documents. The
main reason for using DBRef over a manual reference is that the collection can change from one document
to the next. So, if your referenced collection will always be the same, referencing data manually (as just
described) is fine.
With DBRef, the database reference is stored as a standard embedded (JSON/BSON) object. Having a
standard way to represent references means that drivers and data frameworks can add helper methods that
manipulate the references in standard ways.
The syntax for adding a DBRef reference value looks like this:
{ $ref : <collectionname>, $id : <id value>[, $db : <database name>] }
Here, <collectionname> represents the name of the collection referenced (for example,
publisherscollection); <id value> represents the value of the _id field for the object you are referencing;
and the optional $db allows you to reference documents that are placed in other databases.
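To make the role of the three fields concrete, the following sketch resolves such a reference by hand. It
assumes the reference is available as a plain object with the $ref, $id, and optional $db fields from the
syntax above; the DBRef helper objects returned by a particular shell or driver may expose these values
under different property names. fetchReference() is a hypothetical name; getSiblingDB(), getCollection(),
and findOne() are shell functions.

```javascript
// Resolves a DBRef-style plain object of the form
//   { $ref : <collectionname>, $id : <id value>[, $db : <database name>] }
// by querying the referenced collection for the referenced _id.
function fetchReference(database, ref) {
  if (!ref || ref.$ref === undefined || ref.$id === undefined) {
    throw new Error("not a DBRef-style object: $ref and $id are required");
  }
  // Switch to the other database only when the optional $db field is set.
  var target = ref.$db ? database.getSiblingDB(ref.$db) : database;
  return target.getCollection(ref.$ref).findOne({ _id: ref.$id });
}
```

Because the collection name travels with each reference, the same function can resolve references that
point at different collections, which is the main reason for preferring DBRef over a manual reference.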
Let’s look at another example using DBRef from scratch. Begin by emptying your two collections and
adding a new document:
> db.publisherscollection.drop()
true
> db.media.drop()
true
> apress = ( { "Type" : "Technical Publisher", "Category" :
["IT","Software","Programming"] } )
{
"Type" : "Technical Publisher",
"Category" : [
"IT",
"Software",
"Programming"
]
}
> db.publisherscollection.save(apress)
So far you’ve defined the variable apress and saved it using the save() function. Next, display the
updated contents of the variable by typing in its name:
> apress
{
"Type" : "Technical Publisher",
"Category" : [
"IT",
"Software",
"Programming"
],
"_id" : ObjectId("4c4597e98e0f000000006290")
}
With the publisher defined and saved to the publisherscollection collection, you’re
ready to add an item to the media collection that references the data:
> book = { "Type" : "Book", "Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Author" : ["Hows, David","Membrey, Peter","Plugge, Eelco","Hawkins, Tim"],
"Publisher" : [ new DBRef('publisherscollection', apress._id) ] }
{
"Type" : "Book",
"Title" : "Definitive Guide to MongoDB 3rd ed., The",
"ISBN" : "978-1-4842-1183-0",
"Author" : [
"Hows, David",
"Membrey, Peter",
"Plugge, Eelco",
"Hawkins, Tim"
],
"Publisher" : [
DBRef("publisherscollection", ObjectId("4c4597e98e0f000000006290"))
]
}
> db.media.save(book)
And that’s it! Granted, the example looks a little less simple than the manual method of referencing
data; however, it’s a good alternative for cases where collections can change from one document to the next.
Implementing Index-Related Functions
In Chapter 3, you looked at what indexes can do for your database. Now it’s time to briefly learn how to
create and use indexes. Indexing will be discussed in greater detail in Chapter 10, but for now let’s look at the
basics. MongoDB includes a fair number of functions available for maintaining your indexes; we’ll begin by
creating an index with the createIndex() function.
The createIndex() function takes at least one parameter: a document that names the key (or keys) to index,
each paired with the direction of the index. In the previous example, you added a document to the media
collection that used the Title key. This collection would be well served by an index on this key.
■Tip The rule of thumb in MongoDB is to create an index for the same sort of scenarios where you’d want to
create one in relational databases and to support your more common queries.
You can create an index for this collection by invoking the following command:
> db.media.createIndex( { Title : 1 } )
This command ensures that an index will be created for all the Title values from all documents in the
media collection. The :1 at the end of the line specifies the direction of the index: 1 would order the index
entries in ascending order, whereas -1 would order the index entries in descending order:
// Ensure ascending index
db.media.createIndex( { Title :1 } )
// Ensure descending index
db.media.createIndex( { Title :-1 } )
■Tip Searching through indexed information is fast. Searching for nonindexed information is slow, as each
document needs to be checked to see if it’s a match.
BSON allows you to store full arrays in a document; however, it would also be beneficial to be able to
create an index on an embedded key. Luckily, the developers of MongoDB thought of this, too, and added
support for this feature. Let’s build on one of the earlier examples in this chapter, adding another document
into the database that has embedded information:
> db.media.insertOne( { "Type" : "CD", "Artist" : "Nirvana","Title" : "Nevermind",
"Tracklist" : [ { "Track" : "1", "Title" : "Smells Like Teen Spirit", "Length" : "5:02" },
{"Track" : "2","Title" : "In Bloom", "Length" : "4:15" } ] } )
{ "_id" : ObjectId("4c45aa2f8e0f000000006293"), "Type" : "CD", "Artist" : "Nirvana",
"Title" : "Nevermind", "Tracklist" : [
{
"Track" : "1",
"Title" : "Smells Like Teen Spirit",
"Length" : "5:02"
},
{
"Track" : "2",
"Title" : "In Bloom",
"Length" : "4:15"
}
] }
Next, you can create an index on the Title key for all entries in the track list:
> db.media.createIndex( { "Tracklist.Title" : 1 } )
The next time you search for any of the titles nested under Tracklist, MongoDB can use this index to
find them instead of checking every document. Next, you can take this concept one step further and use
an entire (sub)document as a key, as in this example:
> db.media.createIndex( { "Tracklist" : 1 } )
This statement indexes each element of the array, which means you can now search for any object in
the array. These types of keys are also known as multikeys. You can also create an index based on multiple
keys in a set of documents. This process is known as compound indexing. The method you use to create a
compound index is mostly the same; the difference is that you specify several keys instead of one, as in
this example:
> db.media.createIndex({"Tracklist.Title": 1, "Tracklist.Length": -1})
The benefit of this approach is that you can make an index on multiple keys (as in the previous example,
where you indexed an entire subdocument). Unlike the subdocument method, however, compound
indexing lets you choose the direction of each field separately, so one of the two fields can be indexed in
descending order. If you build your index with the subdocument method, a single direction (ascending or
descending) applies to the whole subdocument. There is more on compound indexes in Chapter 10.
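The effect of mixing directions in a compound index can be pictured with a plain JavaScript comparator.
This is an illustrative model of the resulting order only: it says nothing about how MongoDB stores its
indexes, it compares values with JavaScript's own operators rather than BSON comparison rules, and
compareByIndexSpec() is a hypothetical name. The third track is a made-up entry added to show the
tie-breaking.

```javascript
// Builds a comparator from an index-style specification such as
// { Title: 1, Length: -1 }: keys are compared in the order given, and a
// value of -1 reverses the order for that key.
function compareByIndexSpec(spec) {
  var keys = Object.keys(spec);
  return function (a, b) {
    for (var i = 0; i < keys.length; i++) {
      var key = keys[i];
      if (a[key] < b[key]) return -spec[key];
      if (a[key] > b[key]) return spec[key];
    }
    return 0; // all indexed keys are equal
  };
}

var tracks = [
  { Title: "In Bloom", Length: "4:15" },
  { Title: "Smells Like Teen Spirit", Length: "5:02" },
  { Title: "In Bloom", Length: "5:30" } // hypothetical extra entry
];

// Title ascending, then Length descending within equal titles:
// "In Bloom" 5:30, "In Bloom" 4:15, "Smells Like Teen Spirit" 5:02
var ordered = tracks.slice().sort(compareByIndexSpec({ Title: 1, Length: -1 }));
```

With a single direction for the whole key, as in the subdocument method, the two "In Bloom" entries could
not be ordered longest-first while the titles stay in ascending order.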
Surveying Index-Related Commands
So far you’ve taken a quick glance at one of the index-related commands, createIndex(). Without a doubt,
this is the command you will primarily use to create your indexes. However, you might also find a pair of
additional functions useful: hint() and min()/max(). You use these functions to query for data. We haven’t
covered them to this point because they won’t function without a custom index. But now let’s take a look at
what they can do for you.
Forcing a Specified Index to Query Data
You can use the hint() function to force the use of a specified index when querying for data. The intended
benefit of using this command is to improve the query performance where the query planner does not
consistently use a good index for a given query. This option should be used with caution, as you can also
force the use of an index that results in poor performance.
To see this principle in action, try performing a find with the hint() function without defining an index:
> db.media.find( { ISBN: "978-1-4842-1183-0" } ).hint( { ISBN: -1 } )
error: { "$err" : "bad hint", "code" : 10113 }
If you create an index on ISBN numbers, this technique will be more successful. Note that the first
command’s background parameter ensures that the indexing is done in the background. This is useful because,
by default, initial index builds are done in the foreground, which blocks other writes. The
background option allows the initial index build to happen without blocking other writes:
> db.media.createIndex({ISBN: 1}, {background: true});
> db.media.find( { ISBN: "978-1-4842-1183-0" } ).hint( { ISBN: 1 } )
{ "_id" : ObjectId("4c45a5418e0f000000006291"), "Type" : "Book", "Title" : "Definitive Guide
to MongoDB 3rd ed., The", "ISBN" : "978-1-4842-1183-0", "Author" : ["Hows, David","Membrey,
Peter", "Plugge, Eelco","Hawkins,Tim"], "Publisher" : [
{
"$ref" : "publisherscollection",
"$id" : ObjectId("4c4597e98e0f000000006290")
}
] }
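To confirm whether the given index is being used, you can optionally add the explain() function, which returns
information about the query plan chosen. The winningPlan section describes that plan: an IXSCAN stage with an
indexBounds value indicates an index scan, whereas a COLLSCAN stage indicates a full collection scan: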
> db.media.find( { ISBN: "978-1-4842-1183-0"} ) . hint ( { ISBN: 1 } ).explain()
{
"waitedMS" : NumberLong(0),
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "library.media",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "localhost",
"port" : 27017,
"version" : "3.1.7",
"gitVersion" : "7d7f4fb3b6f6a171eacf53384053df0fe728db42"
},
"ok" : 1
}
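When the explain() output is checked from a script rather than by eye, the winning plan can be inspected
programmatically. The following is a minimal sketch: winningPlanStages() and usesIndexScan() are
hypothetical names, and the sketch only follows the single-child inputStage link found in simple plans,
so plans that branch through several input stages are not covered.

```javascript
// Walks the winning plan of an explain() result and lists its stage names
// from the outermost stage inward, following the inputStage link.
function winningPlanStages(explainOutput) {
  var stages = [];
  var node = explainOutput.queryPlanner.winningPlan;
  while (node) {
    stages.push(node.stage);
    node = node.inputStage;
  }
  return stages;
}

// True when any stage of the winning plan is an index scan.
function usesIndexScan(explainOutput) {
  return winningPlanStages(explainOutput).indexOf("IXSCAN") !== -1;
}

// The winning plan from the output above, trimmed to the relevant fields.
var shownPlan = {
  queryPlanner: {
    winningPlan: { stage: "COLLSCAN", direction: "forward" }
  }
};
var shownPlanUsesIndex = usesIndexScan(shownPlan); // false: the plan is a COLLSCAN
```

In the shell, the same check could be applied to a live query with
usesIndexScan(db.media.find( { ISBN: "978-1-4842-1183-0" } ).hint( { ISBN: 1 } ).explain()).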
Constraining Query Matches
The min() and max() functions enable you to constrain query matches to only those that have index keys
between the min and max keys specified. Therefore, you will need to have an index for the keys you are
specifying. Also, you can either combine the two functions or use them separately. Let’s begin by adding a
few documents that enable you to take advantage of these functions, and then create an index on the Released
field:
> db.media.insertOne( { "Type" : "DVD", "Title" : "Matrix, The", "Released" : 1999} )
> db.media.insertOne( { "Type" : "DVD", "Title" : "Blade Runner", "Released" : 1982 } )
> db.media.insertOne( { "Type" : "DVD", "Title" : "Toy Story 3", "Released" : 2010} )
> db.media.createIndex( { "Released": 1 } )