MongoDB Administrator's Guide

Over 100 practical recipes to efficiently maintain and administer your MongoDB solution

Cyrus Dasadia

BIRMINGHAM - MUMBAI

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2017
Production reference: 1241017
Published by Packt Publishing Ltd., Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78712-648-0
www.packtpub.com

Credits

Author: Cyrus Dasadia
Reviewers: Nilap Shah, Ruben Oliva Ramos
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Viraj Madhav
Content Development Editor: Cheryl Dsa
Technical Editor: Dinesh Pawar
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Graphics: Tania Dutta
Production Coordinator: Shantanu Zagade

About the Author

Cyrus Dasadia has enjoyed tinkering with open source projects since 1996. He has been working as a Linux system administrator and part-time programmer for over a decade. He works at InMobi, where he loves designing tools and platforms. His love for MongoDB blossomed in 2013, when he was amazed by its ease of use and stability. Since then, almost all of his projects have been written with MongoDB as the primary backend. Cyrus is also the creator of an open source alert management system called CitoEngine. His spare time is devoted to trying to reverse-engineer software, playing computer games, or increasing his silliness quotient by watching reruns of Monty Python.

About the Reviewers

Nilap Shah is a lead software consultant with experience across various fields and technologies. He is an expert in .NET, UiPath (robotics), and MongoDB, and is a certified MongoDB developer and DBA. He is a technical writer as well as a technical speaker, and also provides MongoDB corporate training. Currently, he works as a lead MongoDB consultant, providing solutions with MongoDB technology (DBA and developer projects). His LinkedIn profile can be found at https://www.linkedin.com/in/nilap-shah8b6780a/ and he can be reached at +91-9537047334 on WhatsApp.

Ruben Oliva Ramos is a computer systems engineer from Tecnologico de Leon Institute, with a master's degree in computer and electronic systems engineering, teleinformatics, and networking specialization from the University of Salle Bajio in Leon, Guanajuato, Mexico. He has more than 5 years of experience in developing web applications to control and monitor devices connected with Arduino and Raspberry Pi, using web frameworks and cloud services to build Internet of Things applications.
He is a mechatronics teacher at the University of Salle Bajio and teaches students of the master's degree in design and engineering of mechatronics systems. Ruben also works at Centro de Bachillerato Tecnologico Industrial 225 in Leon, Guanajuato, Mexico, teaching subjects such as electronics, robotics and control, automation, and microcontrollers for the Mechatronics Technician Career. He is a consultant and developer for projects in areas such as monitoring systems and datalogger data, using technologies such as Android, iOS, Windows Phone, HTML5, PHP, CSS, Ajax, JavaScript, Angular, and ASP.NET; databases such as SQLite, MongoDB, and MySQL; web servers such as Node.js and IIS; hardware programming such as Arduino, Raspberry Pi, Ethernet Shield, GPS, GSM/GPRS, and ESP8266; and control and monitoring systems for data acquisition and programming.

He has authored the books Internet of Things Programming with JavaScript and Advanced Analytics with R and Tableau for Packt Publishing. He is also involved in monitoring, controlling, and the acquisition of data with Arduino and Visual Basic .NET for Alfaomega.

I would like to thank my savior and lord, Jesus Christ, for giving me the strength and courage to pursue this project; my dearest wife, Mayte; our two lovely sons, Ruben and Dario; my dear father, Ruben; my dearest mom, Rosalia; my brother, Juan Tomas; and my sister, Rosalia, whom I love, for all their support while reviewing this book, for allowing me to pursue my dream, and tolerating not being with them after my busy day job. I'm very grateful to Packt Publishing for giving me the opportunity to collaborate as an author and reviewer, and to belong to this honest and professional team.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/178712648X. If you'd like to join our team of regular reviewers, you can email us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Table of Contents

Preface

Chapter 1: Installation and Configuration
    Introduction
    Installing and starting MongoDB on Linux
    Installing and starting MongoDB on macOS
    Binding MongoDB process to a specific network interface and port
    Enabling SSL for MongoDB
    Choosing the right MongoDB storage engine
        WiredTiger
        MMAPv1
        The verdict
    Changing storage engine
    Separating directories per database
    Customizing the MongoDB configuration file
    Running MongoDB as a Docker container

Chapter 2: Understanding and Managing Indexes
    Introduction
    Creating an index
    Managing existing indexes
    How to use compound indexes
    Creating background indexes
    Creating TTL-based indexes
    Creating a sparse index
    Creating a partial index
    Creating a unique index

Chapter 3: Performance Tuning
    Introduction
    Configuring disks for better I/O
        Reading and writing from disks
        Few considerations while selecting storage devices
    Measuring disk I/O performance with mongoperf
    Finding slow running queries and operations
    Storage considerations when using Amazon EC2
    Figuring out the size of a working set

Chapter 4: High Availability with Replication
    Introduction
    Initializing a new replica set
    Adding a node to the replica set
    Removing a node from the replica set
    Working with an arbiter
    Switching between primary and secondary nodes
    Changing replica set configuration
    Changing priority to replica set nodes

Chapter 5: High Scalability with Sharding
    Understanding sharding and its components
        Components of MongoDB sharding infrastructure
        Config server
        The mongos query router
        The shard server
        Choosing the shard key
    Setting up and configuring a sharded cluster
    Managing chunks
    Moving non-sharded collection data from one shard to another
    Removing a shard from the cluster
    Understanding tag aware sharding – zones

Chapter 6: Managing MongoDB Backups
    Introduction
    Taking backup using mongodump tool
    Taking backup of a specific mongodb database or collection
    Taking backup of a small subset of documents in a collection
    Using bsondump tool to view mongodump output in human readable form
    Creating a point in time backup of replica sets
    Using the mongoexport tool
    Creating a backup of a sharded cluster

Chapter 7: Restoring MongoDB from Backups
    Introduction
    Restoring standalone MongoDB using the mongorestore tool
    Restoring specific database or specific collection
    Restoring data from one collection or database to another
    Creating a new MongoDB replica set node using backups
    Restoring a MongoDB sharded cluster from backup

Chapter 8: Monitoring MongoDB
    Introduction
    Monitoring MongoDB performance with mongostat
    Checking replication lag of nodes in a replica set
    Monitoring and killing long running operations on MongoDB
    Checking disk I/O usage
    Collecting MongoDB metrics using Diamond and Graphite

Chapter 9: Authentication and Security in MongoDB
    Introduction
    Setting up authentication in MongoDB and creating a superuser account
    Creating normal users and assigning built-in roles
    Creating and assigning custom roles
    Restoring access if you are locked out
    Using key files to authenticate servers in a replica set

Chapter 10: Deploying MongoDB in Production
    Introduction
    Configuring MongoDB for a production deployment
    Upgrading production MongoDB to a newer version
    Setting up and configuring TLS (SSL)
    Restricting network access using firewalls

Index

Preface

MongoDB is an extremely versatile NoSQL database that offers performance, scalability, and reliability of data. It has slowly become one of the leading NoSQL database systems used for storing extremely large datasets.
In addition to this, the fact that it is open source makes it the perfect candidate for any project. From prototyping a minimal viable product to storing millions of complex documents, MongoDB is clearly emerging as the go-to database system.

This book aims to help the reader in operating and managing MongoDB systems. The contents of this book are divided into sections covering all the core aspects of administering MongoDB systems. The primary goal of this book is not to duplicate the MongoDB documentation, but to gently nudge the reader towards topics that are often overlooked when designing MongoDB systems.

What this book covers

Chapter 1, Installation and Configuration, covers the basic details of how to install MongoDB, either from the bundled binaries or through the operating system's package managers. It also covers configuration details, as well as how to install MongoDB in a Docker container.

Chapter 2, Understanding and Managing Indexes, gives a quick overview of the benefits of indexes, their various types, and how to optimize database responses by choosing the correct indexes.

Chapter 3, Performance Tuning, covers various topics that can help optimize the infrastructure to deliver optimal database performance. We discuss disk I/O optimization, measuring slow queries, storage considerations in AWS, and managing working sets.

Chapter 4, High Availability with Replication, shows how to achieve high availability using MongoDB replica sets. Topics such as the configuration of replica sets, managing node subscriptions, arbiters, and so on are covered.

Chapter 5, High Scalability with Sharding, covers MongoDB's high scalability aspects using shards. The topics covered in this section include setting up a sharded cluster, managing chunks, managing non-sharded data, adding and removing nodes from the cluster, and creating a geographically distributed sharded cluster.

Chapter 6, Managing MongoDB Backups, helps the reader understand how to select an optimum backup strategy for their MongoDB setup. It covers how to take backups of standalone systems and replica sets, analyzing backup files, and so on.

Chapter 7, Restoring MongoDB from Backups, shows various techniques for restoring systems from previously generated backups. Topics covered include restoring standalone systems, specific databases, the backup of one database to another database, replica sets, and sharded clusters.

Chapter 8, Monitoring MongoDB, illustrates various aspects of monitoring the health of a MongoDB setup. This chapter includes recipes for using mongostat, monitoring replica set nodes, monitoring long-running operations, checking disk I/O, fetching database metrics, and storing them in a time-series database such as Graphite.

Chapter 9, Authentication and Security in MongoDB, looks into various aspects involved in securing a MongoDB infrastructure. Topics covered in this chapter include creating and managing users, implementing role-based access models, implementing SSL/TLS-based transport mechanisms, and so on.

Chapter 10, Deploying MongoDB in Production, provides insights into deploying MongoDB in a production environment, upgrading servers to newer versions, using configuration management tools to deploy MongoDB, and using Docker Swarm to set up MongoDB in containers.

What you need for this book

For the most part, this book requires only MongoDB 3.4 or higher.
Although most of the operating system commands used throughout the book are for Linux, the semantics are generic and the steps can be replayed on any operating system. It may be useful to have some knowledge of how MongoDB works but, for the most part, all chapters are verbose enough for beginners as well.

Who this book is for

This book is for database administrators or site reliability engineers who are keen on ensuring the stability and scalability of their MongoDB systems. Database administrators who have a basic understanding of the features of MongoDB and want to professionally configure, deploy, and administer a MongoDB database will find this book essential. If you are a MongoDB developer and want to get into MongoDB administration, this book will also help you.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it…, How it works…, There's more…, and See also). To give clear instructions on how to complete a recipe, we use these sections as follows.

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "You can view the available command line parameters by using --help or -h."

Any command-line input or output is written as follows:

    ln -s /data/mongodb-linux-x86_64-ubuntu1404-3.4.4/ /data/mongodb

New terms and important words are shown in bold. Warnings or important notes appear like this. Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's web page at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/MongoDB-Administrators-Guide. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at copyright@packtpub.com with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.

Chapter 1: Installation and Configuration

In this chapter, we will cover the following recipes:

Installing and starting MongoDB on Linux
Installing and starting MongoDB on macOS
Binding MongoDB process to a specific network interface and port
Enabling SSL for MongoDB
Choosing the right MongoDB storage engine
Changing storage engine
Separating directories per database
Customizing the MongoDB configuration file
Running MongoDB as a Docker container

Introduction

In this chapter, we will look at how to install a standalone MongoDB server. We will also look at how to perform some useful customization to the default configuration of a MongoDB server. Lastly, we will run a MongoDB server inside a Docker container.

MongoDB 3.4 was the latest stable release available while writing this book. All recipes in this and the subsequent chapters assume you are using MongoDB 3.4 or higher.
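If you are unsure which version your existing installation provides, the server binary reports it directly; the output shown is what the 3.4.4 build used in this chapter's examples prints:

    mongod --version
    # db version v3.4.4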
Installing and starting MongoDB on Linux

Getting ready

You will need a machine running Ubuntu 14.04 or higher, although in theory any Red Hat or Debian-based Linux distribution should be fine. You will also need to download the latest stable binary tarball from https://www.mongodb.com/download-center.

How to do it…

1. Create a directory /data and untar your downloaded file into this directory so that you now have a /data/mongodb-linux-x86_64-ubuntu1404-3.4.4 directory. All of MongoDB's core binaries are available in the /data/mongodb-linux-x86_64-ubuntu1404-3.4.4/bin directory.

2. Create a symbolic link to the versioned file directory for a simpler naming convention, also allowing us to use a generic directory name (for example, in scripts):

    ln -s /data/mongodb-linux-x86_64-ubuntu1404-3.4.4/ /data/mongodb

3. Create a directory for the database:

    mkdir /data/db

4. Start the MongoDB server:

    /data/mongodb/bin/mongod --dbpath /data/db

5. You should see output like this:

    2017-05-14T10:07:15.247+0000 I CONTROL [initandlisten] MongoDB starting : pid=3298 port=27017 dbpath=/data/db 64-bit host=vagrant-ubuntu-trusty-64
    2017-05-14T10:07:15.247+0000 I CONTROL [initandlisten] db version v3.4.4
    2017-05-14T10:07:15.248+0000 I CONTROL [initandlisten] git version: 888390515874a9debd1b6c5d36559ca86b44babd
    2017-05-14T10:07:15.248+0000 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
    2017-05-14T10:07:15.248+0000 I CONTROL [initandlisten] allocator: tcmalloc
    2017-05-14T10:07:15.249+0000 I CONTROL [initandlisten] modules: none
    2017-05-14T10:07:15.249+0000 I CONTROL [initandlisten] build environment:
    2017-05-14T10:07:15.249+0000 I CONTROL [initandlisten]     distmod: ubuntu1404
    2017-05-14T10:07:15.249+0000 I CONTROL [initandlisten]     distarch: x86_64
    2017-05-14T10:07:15.250+0000 I CONTROL [initandlisten]     target_arch: x86_64
    2017-05-14T10:07:15.250+0000 I CONTROL [initandlisten] options: { storage: { dbPath: "/data/db" } }
    < -- snip -- >
    2017-05-14T10:07:15.313+0000 I COMMAND [initandlisten] setting featureCompatibilityVersion to 3.4
    2017-05-14T10:07:15.313+0000 I NETWORK [thread1] waiting for connections on port 27017

6. You can stop the server by pressing Ctrl + C.

7. Additionally, for convenience, we can edit the system's PATH variable to include the mongodb binaries directory. This allows us to invoke the mongodb binaries without having to type the entire path. For example, to execute the mongo client, instead of having to type /data/mongodb/bin/mongo every time, we can simply type mongo. This can be done by appending your ~/.bashrc or ~/.zshrc files, for bash and zsh respectively, with the following lines:

    PATH=/data/mongodb/bin:${PATH}
    export PATH

How it works…

We downloaded a precompiled binary package and started the mongod server using the most basic command line parameter, --dbpath, so that it uses a customized directory, /data/db, for storing databases. As you might have noticed, the MongoDB server, by default, starts listening on TCP port 27017 on all interfaces.

There's more…

The mongod binary has a lot of interesting options. You can view the available command line parameters by using --help or -h. Alternatively, you can also find a detailed reference of available options at https://docs.mongodb.com/master/reference/program/mongod/. Just like most mature community projects, MongoDB also provides packages for formats supported by Debian/Ubuntu and Red Hat/CentOS package managers.
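For example, a package-based installation on Ubuntu 14.04 roughly follows the sketch below. The repository line and signing key are the ones published in MongoDB's 3.4 installation docs; treat them as an assumption and verify against the docs for your distribution before running:

    # Import MongoDB's package signing key (key ID as published for the 3.4 series)
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6
    # Register the 3.4 repository for Ubuntu 14.04 (trusty) and install the meta-package
    echo "deb [ arch=amd64 ] http://repo.mongodb.org/apt/ubuntu trusty/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
    sudo apt-get update
    sudo apt-get install -y mongodb-org
    # The package installs an init script, so the server can be managed as a service
    sudo service mongod start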
There is extensive documentation on how to configure your operating system's package manager to automatically download and install the MongoDB package. For more information on how to do so, see https://docs.mongodb.com/master/administration/install-on-linux/.

Installing and starting MongoDB on macOS

Similar to the previous recipe, Installing and starting MongoDB on Linux, we will see how to set up MongoDB on the macOS operating system.

Getting ready

MongoDB supports macOS 10.7 (Lion) or higher, so ensure that your operating system is upgraded. Download the latest stable binary tarball from https://www.mongodb.com/download-center.

How to do it...

1. In this recipe, we will be installing MongoDB in the user's home directory. Create a directory ~/data/ and extract the TAR file in this directory:

    tar xvf mongodb-osx-x86_64-3.4.4.tgz

All of MongoDB's core binaries are available in the ~/data/mongodb-osx-x86_64-3.4.4/bin directory.

2. Create a symbolic link to the versioned file directory for simpler naming conventions, also allowing us to use a generic directory name (for example, in scripts):

    cd ~/data/
    ln -s mongodb-osx-x86_64-3.4.4 mongodb

3. Create a directory for the database:

    mkdir ~/data/db

4. Start the MongoDB server:

    ~/data/mongodb/bin/mongod --dbpath ~/data/db

5. You should see output like this:

    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten] MongoDB starting : pid=960 port=27017 dbpath=/Users/cyrus.dasadia/data/db 64-bit host=foo
    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten] db version v3.4.4
    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten] git version: 888390515874a9debd1b6c5d36559ca86b44babd
    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten] allocator: system
    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten] modules: none
    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten] build environment:
    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten]     distarch: x86_64
    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten]     target_arch: x86_64
    2017-05-21T15:21:20.662+0530 I CONTROL [initandlisten] options: { storage: { dbPath: "/Users/cyrus.dasadia/data/db" } }
    <<--- snip -- >>
    2017-05-21T15:21:21.492+0530 I NETWORK [thread1] waiting for connections on port 27017

6. You can press Ctrl + C to stop the server.

7. Additionally, for convenience, we can edit the system's PATH variable to include the MongoDB binaries directory. This allows us to invoke the MongoDB binaries without having to type the entire path. For example, to execute the mongo client, instead of having to type ~/data/mongodb/bin/mongo every time, we can simply type mongo. This can be done by appending your ~/.bashrc or ~/.zshrc files, for bash and zsh respectively, with the following lines:

    PATH=~/data/mongodb/bin:${PATH}
    export PATH

How it works…

Similar to our first recipe, we downloaded a precompiled binary package and started the MongoDB server using the most basic command line parameter, --dbpath, such that it uses a customized directory, ~/data/db, for storing databases. As you might have noticed, the MongoDB server, by default, starts listening on TCP port 27017 on all interfaces. We also saw how to add the MongoDB binary directory's path to our system's PATH variable for a more convenient way to access the MongoDB binaries.
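If you prefer a package manager on macOS, Homebrew carried a MongoDB formula at the time of writing; a hedged sketch follows (the formula name and service handling may have changed since):

    # Install the MongoDB formula and run the server as a background service
    brew install mongodb
    brew services start mongodb
    # Or start it manually against a custom database path
    mongod --dbpath ~/data/db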
Binding MongoDB process to a specific network interface and port

As you might have observed, after starting the MongoDB server, the mongod process binds to all interfaces, which may not be suitable for all use cases. For example, if you are using MongoDB for development, or you are running a single node instance on the same server as your application, you probably do not wish to expose MongoDB to the entire network. You might also have a server with multiple network interfaces and may wish to have the MongoDB server listen on a specific network interface. In this recipe, we will see how to start MongoDB on a specific interface and port.

Getting ready

Make sure you have MongoDB installed on your system as shown in the previous recipes.

How to do it...

1. Find your system's network interfaces and corresponding IP address(es) using the ifconfig command. For example, let's assume your system's IP address is 192.168.1.112.

2. Start the mongod daemon without any special flags:

    mongod --dbpath /data/db

This starts the mongod daemon, which binds to all network interfaces on port 27017.

3. In a separate Terminal, connect to your MongoDB server on this IP:

    mongo 192.168.1.112:27017

You should see a MongoDB shell.

4. Now stop the previously running mongod daemon (press Ctrl + C in the Terminal) and start the daemon to listen on your loopback interface:

    mongod --dbpath /data/db --bind_ip 127.0.0.1

5. In a separate Terminal, connect to your MongoDB server on this IP:

    mongo 192.168.1.112:27017

6. This time the mongo client will exit with a connect failed message. Let's connect to your loopback IP instead, and it should work:

    mongo 127.0.0.1:27017

7. Stop the mongod daemon (press Ctrl + C in the Terminal) and start the daemon such that it binds to a different port as well:

    mongod --dbpath /data/db --bind_ip 127.0.0.1 --port 27000

8. In a separate Terminal, connect to your MongoDB server on this IP and port:

    mongo 127.0.0.1:27000

9. You should be connected to the server and see the mongo shell.

How it works...

By default, the mongod daemon binds to all interfaces on TCP port 27017. By passing the IP address with the --bind_ip parameter, we instructed the mongod daemon to listen only on this socket. Next, we passed the --port parameter along with --bind_ip to instruct the mongod daemon to listen on a particular port and IP. Using a non-standard port is a common practice when one wishes to run multiple instances of the mongod daemon (along with a different --dbpath) or wishes to add a little touch of security by obscurity. Either way, we will be using this practice in our later recipes to test shard and replica set setups running on a single server.
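As a concrete illustration of that multi-instance pattern, the sketch below runs two independent servers side by side on one host; the second data path and the port numbers are arbitrary choices for this example:

    # First instance on port 27000, second on 27001 with its own dbpath
    mkdir -p /data/db2
    mongod --dbpath /data/db --bind_ip 127.0.0.1 --port 27000 &
    mongod --dbpath /data/db2 --bind_ip 127.0.0.1 --port 27001 &
    # Each instance is addressed by its own port
    mongo 127.0.0.1:27001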
Enabling SSL for MongoDB

By default, connections to the MongoDB server are not encrypted. If one were to intercept this traffic, almost all the data transferred between the client and the server is visible as clear text. If you are curious, I would encourage you to use tcpdump or wireshark to capture packets between a mongod daemon and the client. As a result, it is highly advisable to make sure that you encrypt all connections to your mongod setup by enabling Transport Layer Security (TLS), also commonly known as SSL.

Getting ready

Make sure you have MongoDB installed on your system as shown in the previous recipes. The default MongoDB binaries for OS X are not compiled with SSL; you may need to manually recompile the source code or use Homebrew: brew install mongodb --with-openssl.

How to do it...

1. First, let us generate a self-signed certificate using OpenSSL, in the /data directory:

    openssl req -x509 -newkey rsa:4096 -nodes -keyout mongo-secure.key -out mongo-secure.crt -days 365

2. Combine the key and certificate into a single .pem file:

    cat mongo-secure.key mongo-secure.crt > mongo-secure.pem

3. Start the mongod daemon, with SSL enabled and listening on the default socket, that is, localhost 27017:

    mongod --dbpath /data/db --sslMode requireSSL --sslPEMKeyFile /data/mongo-secure.pem

4. In another window, connect to this server using a mongo client:

    mongo localhost:27017

5. You should see a connect failed error on the client Terminal. Switch to the server's console window and you should see a log message indicating that the connection was rejected, something like this:

    2017-05-13T16:51:08.031+0000 I NETWORK [thread1] connection accepted from 192.168.200.200:43441 #4 (1 connection now open)
    2017-05-13T16:51:08.032+0000 I -       [conn4] AssertionException handling request, closing client connection: 17189 The server is configured to only allow SSL connections
    2017-05-13T16:51:08.032+0000 I -       [conn4] end connection 192.168.200.200:43441 (1 connection now open)

6. Now, switch back to the other console window and connect to the server again, but this time using SSL:

    mongo --ssl --sslAllowInvalidCertificates

7. You should be connected to the server and see the mongo shell.

How it works...

In step 1, we created a self-signed certificate to get us started with SSL enabled connections. One could very well use a certificate signed by a valid Certificate Authority (CA), but for test purposes we are good with a self-signed certificate. In all honesty, if connection security is all you need, a self-signed certificate can also be used in a production environment, as long as you keep the keys secure. You might as well take it a step forward by creating your own CA certificate and using it to sign your certificates.

In step 2, we concatenated the key and the certificate file. Next, in step 3, we started the mongod daemon with --sslMode requireSSL, followed by the path to the concatenated .pem file. At this point, we have a standalone MongoDB server listening on the default port 27017, ready to accept only SSL based clients. Next, we attempted to connect to the mongod server using the default non-SSL mode, which was immediately rejected by the server. Finally, in step 6, we explicitly made an SSL connection by providing the --ssl parameter followed by --sslAllowInvalidCertificates. The latter parameter is used because we are using a self-signed certificate on the server. If we were using a certificate signed by an authorized CA, or even a self-signed CA, we could very well use --sslCAFile to provide the CA certificate.

There's more…

MongoDB also supports X.509 certificate-based authentication as an option to usernames and passwords. We will cover this topic in Chapter 9, Authentication and Security in MongoDB.
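Building on the idea of signing your own certificates, here is a minimal sketch of a private CA workflow with OpenSSL; all file names and CN values are illustrative choices, not something the recipe prescribes:

    # Create a CA key and self-signed CA certificate (names are illustrative)
    openssl req -x509 -newkey rsa:4096 -nodes -keyout myca.key -out myca.crt -days 3650 -subj "/CN=MyTestCA"
    # Create a server key plus a certificate signing request
    openssl req -newkey rsa:4096 -nodes -keyout server.key -out server.csr -subj "/CN=myserver.example.com"
    # Sign the server certificate with the CA
    openssl x509 -req -in server.csr -CA myca.crt -CAkey myca.key -CAcreateserial -out server.crt -days 365
    # Bundle it for mongod; clients can now validate against the CA file
    # instead of passing --sslAllowInvalidCertificates
    cat server.key server.crt > server.pem
    mongo --ssl --sslCAFile myca.crt --host myserver.example.com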
Choosing the right MongoDB storage engine

Starting with MongoDB version 3.0, a new storage engine named WiredTiger became available, and it soon became the default storage engine in version 3.2. Until then, MMAPv1 was used as the default storage engine. I will give you a brief rundown of the main features of both storage engines, and hopefully it should give you enough to decide which one suits your application's requirements.

WiredTiger

WiredTiger provides the ability for multiple clients to perform write operations on the same collection. This is achieved by providing document-level concurrency, such that during a given write operation the database only locks a given document in the collection, as against its predecessor, which would lock the entire collection. This drastically improves performance for write heavy applications. Additionally, WiredTiger provides compression of data for indexes and collections. The current compression algorithms used by WiredTiger are Google's Snappy and zlib. Although disabling compression is possible, one should not immediately jump the gun unless the decision is truly load-tested while planning your storage strategy.

WiredTiger uses Multi-Version Concurrency Control (MVCC), which allows asserting point-in-time snapshots of transactions. These finalized snapshots are written to disk, which helps create checkpoints in the database. These checkpoints eventually help determine the last good state of data files and help in the recovery of data during abnormal shutdowns. Additionally, journaling is also supported with WiredTiger, where write-ahead transaction logs are maintained. The combination of journaling and checkpoints increases the chance of data recovery during failures. WiredTiger uses internal caching as well as the filesystem cache to provide faster responses to queries. With high concurrency in mind, the architecture of WiredTiger is such that it better utilizes multi-core systems.

MMAPv1

MMAPv1 is quite mature and has proven to be quite stable over the years. One of the storage allocation strategies used with this engine is the power of two allocation strategy. This primarily involves reserving double the amount of document space (in powers of two) such that in-place updates of documents become highly likely, without having to move the documents during updates. Another storage strategy used with this engine is fixed sizing. In this, the documents are padded (for example, with zeros) such that maximum data allocation for each document is attained. This strategy is usually followed by applications that have fewer updates.

Consistency in MMAPv1 is achieved by journaling, where writes are first made to a private view in memory and then written to the on-disk journal. The changes are then written to a shared view, that is, the data files. There is no support for data compression with MMAPv1. Lastly, MMAPv1 relies heavily on page caches and hence uses up available memory to retain the working dataset in cache, thus providing good performance. MongoDB does, however, yield (free up) memory used for cache if another process demands it. Some production deployments avoid enabling swap space to ensure these caches are not written to disk, which may deteriorate performance.

The verdict

So which storage engine should you choose? Well, with the above mentioned points, I personally feel that you should go with WiredTiger, as document level concurrency itself is a good marker for attaining better performance. However, as all engineering decisions go, one should definitely not shy away from performing appropriate load testing of the application across both storage engines. The enterprise MongoDB version also provides an in-memory storage engine and supports encryption at rest. These are good features to have, depending on your application's requirements.
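If you settle on WiredTiger, its compression behavior is tunable per component through the configuration file. The option names below exist in MongoDB 3.4's configuration format, though the values shown are only illustrative settings to experiment with under load:

    storage:
      dbPath: /data/db
      engine: wiredTiger
      wiredTiger:
        engineConfig:
          journalCompressor: snappy     # snappy, zlib, or none
        collectionConfig:
          blockCompressor: zlib         # trades CPU for smaller on-disk collections
        indexConfig:
          prefixCompression: true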
Changing storage engine

In this recipe, we will look at how to migrate existing data onto a new storage engine. MongoDB does not allow on the fly (live) migrations, so we will have to do it the hard way.

Getting ready

Ensure you have a MongoDB database installation ready.

How to do it...

1. Start the mongod daemon to explicitly use the MMAPv1 storage engine:

    /data/mongodb/bin/mongod --dbpath /data/db --storageEngine mmapv1

2. Start the mongo client and you should be presented with the MongoDB shell. Execute the following commands in the shell:

    > var status = db.serverStatus()
    > status['storageEngine']
    {
      "name" : "mmapv1",
      "supportsCommittedReads" : false,
      "readOnly" : false,
      "persistent" : true
    }

3. Now let's add some random data into it. Run the following JavaScript code to insert 100 documents with random data:

    > use mydb
    > for(var x=0; x<100; x++){
        db.mycol.insert({ age:(Math.round(Math.random()*100)%20) })
      }
    > db.mycol.count()
    100

4. Exit the shell and perform a full backup using the mongodump command:

    mkdir /data/backup
    mongodump -o /data/backup --host localhost:27017

5. Now shut down the mongod process.

6. Create a new data directory for the migration and start the mongod daemon with the new storage engine:

    mkdir /data/newdb
    /data/mongodb/bin/mongod --dbpath /data/newdb --storageEngine wiredTiger

7. Let's restore the previous backup to this new instance:

    mongorestore /data/backup/

8. Start the mongo shell and check your data:

    > var status = db.serverStatus()
    > status['storageEngine']
    {
      "name" : "wiredTiger",
      "supportsCommittedReads" : true,
      "readOnly" : false,
      "persistent" : true
    }
    > use mydb
    switched to db mydb
    > db.mycol.count()
    100

How it works...

As WiredTiger is the default storage engine from MongoDB 3.2 onwards, for this exercise we explicitly started a MongoDB instance with the MMAPv1 storage engine in step 1. In step 2, we stored the db.serverStatus() command's output in a temporary variable to inspect the server's storageEngine key. This helps us see which storage engine our MongoDB instance is running on. In step 3, we switched to the database mydb and ran a simple JavaScript function to add 100 documents to a collection called mycol. Next, in step 4, we created a backup directory, /data/backup, which is passed as a parameter to the mongodump utility. We will discuss more about the mongodump utility in Chapter 6, Managing MongoDB Backups.

Once we shut down the mongod instance, in step 5, we are ready to start a new instance of MongoDB, but this time with the WiredTiger storage engine. We follow the basic practice of covering for failure, and instead of removing /data/db, we create a new path for this instance (#AlwaysHaveABackupPlan). Our new MongoDB instance is empty, so in step 7 we import the aforementioned backup into the database using the mongorestore utility. As the new MongoDB instance is running the WiredTiger storage engine, our backup (which is essentially BSON data) is restored and saved on disk using this storage engine. Lastly, in step 8, we simply inspect the storageEngine key in the db.serverStatus() output and confirm that we are indeed using WiredTiger.

As you can see, this is an overly simplistic example of how to convert MongoDB data from one storage engine format to another. One has to keep in mind that this operation will take a significant amount of time, depending on the size of the data. However, application downtime can be averted if we were to use a replica set. More on this later.
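If both the old and the new instance can run at the same time (on different ports), the dump and restore steps can also be collapsed into a single stream using mongodump's --archive mode, available since MongoDB 3.2. A hedged sketch, assuming the MMAPv1 server listens on 27017 and the new WiredTiger server on 27018:

    # Stream the entire dump of the old instance straight into the new one,
    # skipping the intermediate /data/backup directory
    mongodump --host localhost:27017 --archive | mongorestore --host localhost:27018 --archive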
Separating directories per database

In this recipe, we will be looking at how to optimize disk I/O by separating databases into different directories.

Getting ready

Ensure you have a MongoDB database installation ready.

How to do it...

1. Start the mongod daemon with no special parameters:

    /data/mongodb/bin/mongod --dbpath /data/db

2. Connect to the mongo shell, create a test db, and insert a sample document:

    mongo localhost:27017
    > use mydb
    > db.mycol.insert({foo:1})

3. Inspect the /data/db directory structure; it should contain entries like these:

    ls /data/db
    collection-0-626293768203557661.wt
    collection-2-626293768203557661.wt
    collection-5-626293768203557661.wt
    diagnostic.data
    index-1-626293768203557661.wt
    index-3-626293768203557661.wt
    index-4-626293768203557661.wt
    index-6-626293768203557661.wt
    journal
    _mdb_catalog.wt
    mongod.lock
    sizeStorer.wt
    storage.bson
    WiredTiger
    WiredTigerLAS.wt
    WiredTiger.lock
    WiredTiger.turtle
    WiredTiger.wt

4. Shut down the previous mongod instance.

5. Create a new db path and start mongod with the --directoryperdb option:

    mkdir /data/newdb
    /data/mongodb/bin/mongod --dbpath /data/newdb --directoryperdb

6. Connect to the mongo shell, create a test db, and insert a sample document:

    mongo localhost:27017
    > use mydb
    > db.mycol.insert({bar:1})

7. Inspect the /data/newdb directory structure; this time, each database (admin, local, mydb) has its own subdirectory:

    ls /data/newdb
    admin
    diagnostic.data
    journal
    local
    _mdb_catalog.wt
    mongod.lock
    mydb
    sizeStorer.wt
    storage.bson
    WiredTiger
    WiredTigerLAS.wt
    WiredTiger.lock
    WiredTiger.turtle
    WiredTiger.wt

How it works...

We started by running a mongod instance with no special parameters except for --dbpath. In step 2, we created a new database, mydb, and inserted a document in the collection mycol, using the mongo shell. By doing this, the data files for this new db are created and can be seen by inspecting the directory structure of our main database path, /data/db.
In that listing, among other files, you can see that database files begin with collection- and their relevant index files begin with index-. As we guessed, all databases and their relevant files are within the same directory as our db path.

If you are curious and wish to find the correlation between the files and the db, run the following commands in the mongo shell:

    > use mydb
    > var curiosity = db.mycol.stats()
    > curiosity['wiredTiger']['uri']
    statistics:table:collection-5-626293768203557661

The last part of this string, that is, collection-5-626293768203557661, corresponds to the file in our /data/db path.

Moving on, in steps 4 and 5, we stopped the previous mongod instance, created a new path for our data files, and started a new mongod instance, but this time with the --directoryperdb parameter. As before, in step 6 we inserted some random data in the mycol collection of a new database called mydb. In step 7, we looked at the directory listing of our data path, and we can see that there is a subdirectory in the data path which, as you guessed, matches our database name, mydb. If you look inside this directory, that is, /data/newdb/mydb, you should see a collection and an index file.

So one might ask, why go through all this trouble of having separate directories for databases? Well, in certain application scenarios, if your database workloads are significantly high, you should consider storing the database on a separate disk/volume. Ideally, this should be a physically separate disk or a RAID volume created using separate physical disks. This ensures the separation of disk I/O from other operations, including MongoDB journals. Additionally, this also helps you separate your fault domains. One thing you should keep in mind is that journals are stored separately, that is, outside the database's directory. So, using separate disks for databases allows the journals to not contend for the same disk I/O path.

Customizing the MongoDB configuration file

In all the previous recipes of this chapter, we have passed command line flags to the mongod daemon. In this recipe, we will look at how to use the config file as an alternative to passing command line flags.

Getting ready

Nothing special, just make sure you have a MongoDB database installation ready.

How to do it...

1. Start your favorite text editor and add the following in a file called mongod.conf:

    storage:
      dbPath: /data/db
      engine: wiredTiger
      directoryPerDB: true
    net:
      port: 27000
      bindIp: 127.0.0.1
      ssl:
        mode: requireSSL
        PEMKeyFile: /data/mongo-secure.pem

2. Start your mongod instance:

    mongodb/bin/mongod --config /data/mongod.conf

How it works...

MongoDB allows passing command line parameters to mongod using a YAML file. In step 1, we created a config file called mongod.conf. We added all the previously used command line parameters from this chapter into this config file in YAML format. A quick look at the file's content should make it clear that the parameters are divided into sections and relevant subsections. Next, in step 2, we started the mongod instance, but this time with just one parameter, --config, followed by the path of our config file.

Although passing configuration parameters on the command line may seem normal, it is highly advisable that one use configuration files instead. Having all parameters in a single configuration file not only makes it easier to view the parameters, but also helps us programmatically (YAML FTW!) inspect and manage the values of these variables. This simplifies operations and reduces the chance of errors.
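As a deployment grows, the same file usually picks up logging and process management settings as well. The option names below are valid in MongoDB 3.4's configuration format, while the paths and values are assumptions for illustration:

    storage:
      dbPath: /data/db
      engine: wiredTiger
      directoryPerDB: true
    net:
      port: 27000
      bindIp: 127.0.0.1
    systemLog:
      destination: file
      path: /data/logs/mongod.log     # assumed log location
      logAppend: true
    processManagement:
      fork: true                      # run as a daemon
      pidFilePath: /data/mongod.pid   # assumed pid file location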
There's more...

Do have a look at the other parameters available in the configuration file: https://docs.mongodb.com/manual/reference/configuration-options/.

Running MongoDB as a Docker container

In this recipe, we will look at how to run MongoDB as a Docker container. I will assume that you have at least a bare minimum understanding of how Docker works. If you do not, have a look at https://www.docker.com/what-container. It should help you get acquainted with Docker's concepts.

Getting ready

Make sure you have Docker installed on your system. If you are using Linux, then it is highly advisable to use kernel version 3.16 or higher.

How to do it...

1. Download the latest MongoDB Docker image:

    docker pull mongo:3.4.4

2. Check that the image exists:

    docker images

3. Start a container:

    docker run -d -v /data/db:/data/db --name mymongo mongo:3.4.4

4. Check if the container is running successfully:

    docker ps

5. Let's connect to our mongo server using the mongo client from the container:

    docker exec -it mymongo mongo

6. Stop the mongo instance and start it again with host mode networking:

    docker run -d -v /data/db:/data/db --name mymongo --net=host mongo:3.4.4 --bind_ip 127.0.0.1 --port 27000

7. Connect to the new instance using the mongo shell:

    docker exec -it mymongo mongo localhost:27000

How it works...

In step 1, we fetched the official MongoDB image from Docker's public repository. You can view it at https://hub.docker.com/_/mongo/. While fetching the image, we explicitly mentioned the version, that is, mongo:3.4.4. Although mentioning the version (also known as the Docker image tag) is optional, it is highly advisable that when you download any application images via Docker, you always fetch them with the relevant tag. Otherwise, you might end up fetching the latest tag, and as these change often, you could end up running different versions of your applications.

Next, in step 2, we ran the docker images command, which shows us a list of images available on the server; in our case, it should show you the MongoDB image with the tag 3.4.4 available for use.

In step 3, we started a container in detached (-d) mode. As all containers use ephemeral storage, and as we wish to retain the data, we mount a volume (-v) by providing a local path, /data/db, to be mounted on the container's internal directory /data/db. This ensures that even if the container is stopped or removed, our data is retained on the host's /data/db path. At this point, one could also use Docker volumes, but in order to keep things simple, I prefer using a regular directory. Next in the command, we provide a name (--name) for our container. This is followed by the Docker image and tag that should be used to run the container, in our case mongo:3.4.4. When you enter the command, you should get a large string as a return value; this is your new container's ID.

In step 4, we ran the docker ps command, which shows us a list of running containers. If your container is stopped or has exited, use docker ps -a to show all containers. In the output, you can see the container's details.

By default, Docker starts a container in bridge mode; that is, when Docker is installed, it creates a bridge interface on the host, and the resulting containers are run using a virtual network device attached to the bridge. This results in complete network isolation of the container. Thus, in our case, if we wish to connect to the container's mongod instance on 27017, we would need to explicitly expose TCP port 27017 to the base host, or bind the base host's port to that of the container, thus allowing an external MongoDB client to connect to our instance. You can read more about Docker's networking architecture at https://docs.docker.com/engine/userguide/networking/.
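A common middle ground between full isolation and host networking is to stay in bridge mode but publish the port; a brief sketch using Docker's standard -p flag (the host port choice is arbitrary):

    # Map host port 27017 to the container's 27017 through the bridge
    docker run -d -p 27017:27017 -v /data/db:/data/db --name mymongo mongo:3.4.4
    # An external client on the host can now connect normally
    mongo 127.0.0.1:27017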
In step 5, we executed the mongo shell command from the container to connect to the mongod instance. The official MongoDB container image also accepts command-line flags, passed via the docker run command. We did this in step 6, along with running the container in host mode networking. Host mode networking binds the server's network namespace onto the container, thus bypassing the bridge interface. We passed the --bind_ip and --port flags to the docker run command, which instructs mongod to bind to 127.0.0.1:27000. As we are using host mode networking, the mongod daemon effectively binds to the base host's loopback interface. In step 7, we connected to our new MongoDB instance, but this time we explicitly provided the connection address.

There's more...

If you ever wish to debug the container, you can always run the container in the foreground by passing the -it parameters in place of -d. Additionally, try running the following command and check the output:

    docker logs mymongo

Lastly, I would suggest you have a look at the start scripts used by this container's image to understand how configurations are templatized. It will definitely give you some pointers that will help when you are setting up your own MongoDB container.

With this recipe, we conclude this chapter. I hope these recipes have helped you gear up for getting started with MongoDB. As all things go, no amount of text can replace actual practice. So I sincerely request you to get your hands dirty and attempt these recipes yourself. In the next chapter, we will take a closer look at MongoDB's indexes and how they can be leveraged to gain a tremendous performance boost in data retrieval.

Chapter 2: Understanding and Managing Indexes

In this chapter, we will be covering the following topics:

Creating an index
Managing existing indexes
How to use compound indexes
Creating background indexes
Creating TTL-based indexes
Creating a sparse index
Creating a partial index
Creating a unique index

Introduction

In this chapter, we are going to look at how to create and manage database indexes in MongoDB. We will also look at how to view index sizes, create background indexes, and create various forms of indexes. So let's get started!

Creating an index

In this recipe, we will be using a fairly large dataset, add it into MongoDB, and then examine how a query executes in this dataset with and without an index.

Getting ready

Assuming that you are already running a MongoDB server, we will be importing a dataset of around 100,000 records, available in the form of a CSV file called chapter_2_mock_data.csv. You can download this file from the Packt website.

How to do it...

1. Import the sample data to the MongoDB server:

    mongoimport --headerline --ignoreBlanks --type=csv -d mydb -c mockdata -h localhost chapter_2_mock_data.csv

You should see output like this:

    2017-06-18T08:25:08.444+0530 connected to: localhost
    2017-06-18T08:25:09.498+0530 imported 100000 documents
Connect to the MongoDB instance and open a mongo shell: mongo localhost:27017 3. Check that the documents are in the right place: use mydb db.mockdata.count() You should see the following result: 105000 4. Let's fetch a document with the explain() method: > db.mockdata.find({city:'Singapore'}).explain("executionStats") You should see the following result: { "executionStats": { "executionStages": { "advanced": 1, "direction": "forward", "docsExamined": 100000, "executionTimeMillisEstimate": 44, "filter": { "city": { "$eq": "Singapore" } }, [ 30 ] Understanding and Managing Indexes "invalidates": 0, "isEOF": 1, "nReturned": 1, "needTime": 100000, "needYield": 0, "restoreState": 783, "saveState": 783, "stage": "COLLSCAN", "works": 100002 }, "executionSuccess": true, "executionTimeMillis": 41, "nReturned": 1, "totalDocsExamined": 100000, "totalKeysExamined": 0 }, "ok": 1, "queryPlanner": { "indexFilterSet": false, "namespace": "mydb.mockdata", "parsedQuery": { "city": { "$eq": "Singapore" } }, "plannerVersion": 1, "rejectedPlans": [], "winningPlan": { "direction": "forward", "filter": { "city": { "$eq": "Singapore" } }, "stage": "COLLSCAN" } }, "serverInfo": { "gitVersion": "888390515874a9debd1b6c5d36559ca86b44babd", "host": "vagrant-ubuntu-trusty-64", "port": 27017, "version": "3.4.4" } } [ 31 ] Understanding and Managing Indexes 5. Create an index on the city field: > db.mockdata.createIndex({'city': 1}) The following result is obtained: { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 } 6. Execute the same fetch query: > db.mockdata.find({city:'Singapore'}).explain("executionStats") { "executionStats": { "executionStages": { "advanced": 1, "alreadyHasObj": 0, "docsExamined": 1, "executionTimeMillisEstimate": 0, "inputStage": { "advanced": 1, "direction": "forward", "dupsDropped": 0, "dupsTested": 0, "executionTimeMillisEstimate": 0, "indexBounds": { "city": [ "[\"Singapore\", \"Singapore\"]" ] }, "indexName": "city_1", "indexVersion": 2, "invalidates": 0, "isEOF": 1, "isMultiKey": false, "isPartial": false, "isSparse": false, "isUnique": false, "keyPattern": { "city": 1 }, "keysExamined": 1, "multiKeyPaths": { [ 32 ] Understanding and Managing Indexes "city": [] }, "nReturned": 1, "needTime": 0, "needYield": 0, "restoreState": 0, "saveState": 0, "seeks": 1, "seenInvalidated": 0, "stage": "IXSCAN", "works": 2 }, "invalidates": 0, "isEOF": 1, "nReturned": 1, "needTime": 0, "needYield": 0, "restoreState": 0, "saveState": 0, "stage": "FETCH", "works": 2 }, "executionSuccess": true, "executionTimeMillis": 0, "nReturned": 1, "totalDocsExamined": 1, "totalKeysExamined": 1 }, "ok": 1, "queryPlanner": { "indexFilterSet": false, "namespace": "mydb.mockdata", "parsedQuery": { "city": { "$eq": "Singapore" } }, "plannerVersion": 1, "rejectedPlans": [], "winningPlan": { "inputStage": { "direction": "forward", "indexBounds": { "city": [ "[\"Singapore\", \"Singapore\"]" ] }, "indexName": "city_1", [ 33 ] Understanding and Managing Indexes "indexVersion": 2, "isMultiKey": false, "isPartial": false, "isSparse": false, "isUnique": false, "keyPattern": { "city": 1 }, "multiKeyPaths": { "city": [] }, "stage": "IXSCAN" }, "stage": "FETCH" } }, "serverInfo": { "gitVersion": "888390515874a9debd1b6c5d36559ca86b44babd", "host": "vagrant-ubuntu-trusty-64", "port": 27017, "version": "3.4.4" } } How it works... In step 1, we used the mongoimport utility to import our sample dataset from chapter_2_mock_data.csv which is a comma separated file. 
We'll discuss more about mongoimport in later chapters, so don't worry about it for now. Once we import the data, we open the mongo shell and confirm that we've indeed imported our sample dataset (100,000 documents).

In step 4, we run a simple find() function chained with the explain() function. The explain() function shows us the details of how our query was executed, especially the executionStats. Here, if you look at the value of the key executionStages['stage'], you can see it says COLLSCAN. This indicates that the entire collection was scanned, which can be confirmed by looking at the totalDocsExamined key's value, which should say 100000. Clearly, our collection needs an index!

In step 5, we create an index by calling db.mockdata.createIndex({'city': 1}). In the createIndex() function, we mention the city field with a value of 1, which tells MongoDB to create an ascending index on this key. You can use -1 to create a descending index, if need be. On executing this function, MongoDB immediately begins creating the index on the collection. Index creation is an intensive, blocking operation, which means database operations will be blocked until the index is created. We will examine how to create background indexes in a later recipe in this chapter.

In step 6, we execute the exact same find() query as we did in step 4, and upon inspecting the executionStats, you can observe that the value of the key executionStages now contains some more detail. In particular, the value of the stage key is FETCH and inputStage['stage'] is IXSCAN. In short, this indicates that the result was fetched by running an index scan. As this was a direct index hit, the value of totalDocsExamined is 1.

There's more...
Over time, you may come across scenarios that require redesigning your indexing strategy. This may be due to adding a new feature to your application, or simply because you have identified a more appropriate key to index. In either case, it is highly advisable to remove older (unused) indexes to ensure you do not have any unnecessary overhead on the database. In order to remove an index, use db.<collection>.dropIndex(<index name>). If you are not sure about your index's name, use the db.<collection>.getIndexes() function.
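Note that dropIndex() also accepts the index's key pattern in place of its name, which saves you the getIndexes() lookup. For example, a one-liner to drop the ascending city index we created in this recipe:

db.mockdata.dropIndex({city: 1})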
Managing existing indexes
In this recipe, we will be looking at some common operations we can perform on indexes, like viewing, deleting, checking index sizes, and re-indexing.

Getting ready
For this recipe, load the sample dataset and create an index on the city field, as described in the previous recipe.

How to do it...
1. We begin by connecting to the mongo shell of the server and viewing all indexes on the system:
> db.mockdata.getIndexes()
The following result is obtained:
[ { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "mydb.mockdata" }, { "v" : 2, "key" : { "city" : 1, "first_name" : 1 }, "name" : "city_1_first_name_1", "ns" : "mydb.mockdata" } ]
2. Execute a dropIndex() command to delete a particular index:
> db.mockdata.dropIndex('city_1_first_name_1')
You should see the following result:
{ "nIndexesWas" : 2, "ok" : 1 }
3. Let's recreate the index:
> db.mockdata.createIndex({'city':1}, {name: 'city_index'})
{ "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }
4. Run getIndexes() to fetch all indexes of the collection:
> db.mockdata.getIndexes()
We should see the following result:
[ { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "mydb.mockdata" }, { "v" : 2, "key" : { "city" : 1 }, "name" : "city_index", "ns" : "mydb.mockdata" } ]
5. Try creating the index again on the city field:
> db.mockdata.createIndex({'city':1})
You should see the following message:
{ "createdCollectionAutomatically" : false, "numIndexesBefore" : 2, "numIndexesAfter" : 2, "note" : "all indexes already exist", "ok" : 1 }
6. Check the total size of the collection's indexes:
stats = db.mockdata.stats()
stats["totalIndexSize"]
It should show the following result:
1818624
7. Let us view the size of each index:
stats["indexSizes"]
This should show the following result:
{ "_id_" : 905216, "city_index" : 913408 }
8. Re-index city_index:
> db.mockdata.reIndex('city_index')
The following result is obtained:
{ "nIndexesWas" : 2, "nIndexes" : 2, "indexes" : [ { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "mydb.mockdata" }, { "v" : 2, "key" : { "city" : 1 }, "name" : "city_index", "ns" : "mydb.mockdata" } ], "ok" : 1 }

How it works...
Most of the commands are pretty self-explanatory. In steps 1 and 2, we view and delete indexes, respectively. You can also use db.<collection>.dropIndexes() to delete all non-_id indexes on a collection. In step 3, we recreate the index on the city field, but this time we provide an additional parameter to customize the name of the index. This can be confirmed by viewing the output of the getIndexes() command in step 4. Next, in step 5, we try to create another index on the city field (in ascending order). However, as we already have an index on this field, this would be redundant, and hence MongoDB does not allow it. If you change the 1 to -1, that is, change the sort order to descending, then the operation would succeed and you'd end up with another index on the city field, sorted in descending order.

In steps 6 and 7, we run the stats() function on the collection, which can alternatively be run as db.mockdata.runCommand('collstats'), and save its output in a temporary variable called stats. If we inspect the totalIndexSize and indexSizes keys, we can find the total as well as the per-index sizes, respectively. At this point, I would strongly suggest you have a look at the other keys in the output. They should give you a peek into the low-level internals of how MongoDB manages each collection.

Lastly, in step 8, we re-index an existing index. This drops the existing index and rebuilds it, either in the foreground or the background, depending on how it was set up initially. It is usually not necessary to rebuild an index; however, as per MongoDB's documentation, you may choose to do so if you feel that the index size is disproportionate, or if your collection has significantly grown in size.

How to use compound indexes
The beauty of indexes is that they can be used with multiple keys. A single-key index can be thought of as a table with one column. A compound index, then, can be visualized as a multi-column table, where entries are sorted by the first column, then by the next, and so on. In this recipe, we will look at how to create a compound index and examine how it works.
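To picture what this ordering means before we start, consider a handful of made-up index entries for {city: 1, first_name: 1}; they are sorted by city first, and by first_name only within each city:

{ city: "Boston", first_name: "Adam" }
{ city: "Boston", first_name: "Sara" }
{ city: "Chicago", first_name: "Ben" }
{ city: "Chicago", first_name: "Zoe" }

A query on city alone can walk this list efficiently, but a query on first_name alone cannot, which is exactly what we will observe later in this recipe.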
Getting ready
Load the sample dataset, as described in the previous recipe.

How to do it...
1. If you have not already created an index on the city field (as described in the Creating an index recipe), create one now:
> db.mockdata.createIndex({'city': 1})
2. Run a find() query:
> plan = db.mockdata.find({city:'Boston', first_name: 'Sara'}).explain("executionStats")
3. Examine the executionStats:
> plan['executionStats']
You should see the following result:
{ "executionSuccess" : true, "nReturned" : 1, "executionTimeMillis" : 0, "totalKeysExamined" : 9, "totalDocsExamined" : 9, "executionStages" : { "stage" : "FETCH", "filter" : { "first_name" : { "$eq" : "Sara" } }, "nReturned" : 1, "executionTimeMillisEstimate" : 0, "works" : 10, "advanced" : 1, "needTime" : 8, "needYield" : 0, "saveState" : 0, "restoreState" : 0, "isEOF" : 1, "invalidates" : 0, "docsExamined" : 9, "alreadyHasObj" : 0, "inputStage" : { "stage" : "IXSCAN", "nReturned" : 9, "executionTimeMillisEstimate" : 0, "works" : 10, "advanced" : 9, "needTime" : 0, "needYield" : 0, "saveState" : 0, "restoreState" : 0, "isEOF" : 1, "invalidates" : 0, "keyPattern" : { "city" : 1 }, "indexName" : "city_1", "isMultiKey" : false, "multiKeyPaths" : { "city" : [ ] }, "isUnique" : false, "isSparse" : false, "isPartial" : false, "indexVersion" : 2, "direction" : "forward", "indexBounds" : { "city" : [ "[\"Boston\", \"Boston\"]" ] }, "keysExamined" : 9, "seeks" : 1, "dupsTested" : 0, "dupsDropped" : 0, "seenInvalidated" : 0 } } }
4. Now drop this index:
> db.mockdata.dropIndex('city_1')
You should see an output similar to this:
{ "nIndexesWas" : 2, "ok" : 1 }
5. Create a compound index on city and first_name:
> db.mockdata.createIndex({'city': 1, 'first_name': 1})
You should see an output similar to this:
{ "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }
6. Let's run the same fetch query again and examine the plan:
> plan = db.mockdata.find({city:'Boston', first_name: 'Sara'}).explain("executionStats")
> plan['executionStats']
You should see an output similar to this:
{ "executionSuccess": true, "nReturned": 1, "executionTimeMillis": 0, "totalKeysExamined": 1, "totalDocsExamined": 1, "executionStages": { "stage": "FETCH", "nReturned": 1, "executionTimeMillisEstimate": 0, "works": 2, "advanced": 1, "needTime": 0, "needYield": 0, "saveState": 0, "restoreState": 0, "isEOF": 1, "invalidates": 0, "docsExamined": 1, "alreadyHasObj": 0, "inputStage": { "stage": "IXSCAN", "nReturned": 1, "executionTimeMillisEstimate": 0, "works": 2, "advanced": 1, "needTime": 0, "needYield": 0, "saveState": 0, "restoreState": 0, "isEOF": 1, "invalidates": 0, "keyPattern": { "city": 1, "first_name": 1 }, "indexName": "city_1_first_name_1", "isMultiKey": false, "multiKeyPaths": { "city": [], "first_name": [] }, "isUnique": false, "isSparse": false, "isPartial": false, "indexVersion": 2, "direction": "forward", "indexBounds": { "city": [ "[\"Boston\", \"Boston\"]" ], "first_name": [ "[\"Sara\", \"Sara\"]" ] }, "keysExamined": 1, "seeks": 1, "dupsTested": 0, "dupsDropped": 0, "seenInvalidated": 0 } } }
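Before reading on, you can optionally convince yourself that the compound index is responsible for the improvement by forcing the planner to bypass it with the standard hint() modifier and comparing the two plans:

> db.mockdata.find({city:'Boston', first_name: 'Sara'}).hint({$natural: 1}).explain("executionStats")

The stage should revert to COLLSCAN, with totalDocsExamined jumping back to the full collection count.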
How it works…
We start by loading the sample dataset with an index on the city field. Next, we execute a find() command on our collection chained with the explain('executionStats') function, in steps 2 and 3 respectively. This time, we capture the output in a variable so that it is easier to examine later. In step 3, we specifically examine the execution stats. We can observe that nine documents were fetched from the index, among which we had one match. When we ran db.mockdata.find({city:'Boston', first_name: 'Sara'}), MongoDB first saw that the city field already had an index on it. So, for the remaining part of the query, MongoDB simply took the documents returned from the index and searched the field first_name in those documents until it matched the value Sara.

In step 4, we remove the existing index on the field city, and in step 5, we create a compound index on the two field names city and first_name. At this point, I would like to point out that the sequence of the field names is extremely important. As explained in the introduction of this recipe, compound indexes in MongoDB are created in the order in which the field names are mentioned. Hence, when we create a compound index with, say, {city: 1, first_name: 1}, MongoDB creates a B-tree index on the field city in ascending order, followed by first_name in ascending order. In step 6, we run the same find() query and examine the executionStats. We can observe that this time, as both keys were indexed, totalDocsExamined was 1; that is, we got an exact match in our compound index.

There's more...
Compound indexes, if used smartly, can dramatically reduce your document seek times. For example, let's assume our application has a view that only needs to show a list of names in a city. A traditional approach would be to run a find query, get the list of documents, and send them to the application's view. However, we know that the other fields in the document are not needed for this view. So, with a compound index on city and first_name, and with the addition of field projection, we can send just the index values down to the application, that is:
db.mockdata.find({city:'Boston', first_name:'Sara'}, {city:1, first_name:1, _id:0})
By doing this, not only do we leverage the speed of the index, but we also negate the need to fetch the non-indexed keys. The term for this is a covered query, and it can speed up our applications significantly!

Also, compound indexes allow us to use the index for the leftmost keys. In our example, if we were to run db.mockdata.find({city:'Boston'}), the result would be fetched from the index. However, if we were to search only on first_name, that is, db.mockdata.find({first_name:'Sara'}), the server would do a full collection scan to fetch the result. I would encourage you to run the preceding queries chained with the explain() function and see the details yourself.

Creating background indexes
In the previous recipes, whenever we've created indexes, it has always been in the foreground; that is, the database server blocks all changes to the database until the index creation is completed. This is definitely not suitable for larger datasets, where index creation can take more than a few seconds, which could result in application errors.

Getting ready
Load the sample dataset, as shown in the Creating an index recipe.

How to do it...
1. Remove all indexes:
> db.mockdata.dropIndexes()
{ "nIndexesWas" : 2, "msg" : "non-_id indexes dropped for collection", "ok" : 1 }
2. Add some additional data to increase the size of our collection. Run the following command in your Terminal window:
for x in $(seq 20); do mongoimport --headerline --type=csv -d mydb -c mockdata -h localhost chapter_2_mock_data.csv;done
3.
Open two mongo shells, we will create an index in one while we do an insert query in another. Ensure you've selected mydb by executing the command use mydb in both windows. 4. In the first mongo shell, create an index and immediately shift to the second shell: > db.mockdata.createIndex({city:1, first_name:1, last_name:1}) 5. In the second shell window, perform a simple insert operation: > db.mockdata.insert({foo:'bar'}) 6. Check the mongod server logs: 2017-06-13T03:54:26.296+0000 I INDEX [conn1] build index on: mydb.mockdata properties: { v: 2, key: { city: 1.0, first_name: 1.0, last_name: 1.0 }, name: "city_1_first_name_1_last_name_1", ns: "mydb.mockdata" } 2017-06-13T03:54:26.297+0000 I INDEX [conn1] building index using bulk method; build may temporarily use up to 500 megabytes of RAM 2017-06-13T03:54:36.575+0000 I INDEX [conn1] build index done. scanned 2100001 total records. 10 secs 2017-06-13T03:54:36.576+0000 I COMMAND [conn2] command mydb.mockdata appName: "MongoDB Shell" command: insert { insert: "mockdata", documents: [ { _id: ObjectId('59474af356e41a7db57952b6'), foo: "bar" } ], ordered: true } ninserted:1 keysInserted:3 numYields:0 reslen:29 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 9307131 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_command 9307ms 2017-06-13T03:54:36.577+0000 I COMMAND [conn1] command mydb.$cmd appName: "MongoDB Shell" command: createIndexes { createIndexes: "mockdata", indexes: [ { key: { city: 1.0, first_name: 1.0, last_name: 1.0 }, name: "city_1_first_name_1_last_name_1" } ] } numYields:0 reslen:98 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_command 10284ms 7. Now drop the indexes and get ready to repeat steps 4 and step 5 again. 8. In the first mongo shell window, recreate the index. As this command will take some time, switch to the second shell window: > db.mockdata.createIndex({city:1, first_name:1, last_name:1}, {background:1}) [ 46 ] Understanding and Managing Indexes 9. In the second shell window, perform an insert operation, this time it should immediately yield: > db.mockdata.insert({foo:'bar'}) You should see the following output: WriteResult({ "nInserted" : 1 }) 10. Look at the mongod server logs: 2017-06-13T04:00:29.248+0000 I INDEX [conn1] build index on: mydb.mockdata properties: { v: 2, key: { city: 1.0, first_name: 1.0, last_name: 1.0 }, name: "city_1_first_name_1_last_name_1", ns: "mydb.mockdata", background: 1.0 } 2017-06-13T04:00:32.008+0000 I - [conn1] Index Build (background): 397400/2200004 18% 2017-06-13T04:00:35.002+0000 I - [conn1] Index Build (background): 673800/2200005 30% 2017-06-13T04:00:38.009+0000 I - [conn1] Index Build (background): 762300/2200005 34% 2017-06-13T04:00:41.006+0000 I - [conn1] Index Build (background): 903400/2200005 41% << --- output snipped --- >> 2123200/2200005 96% 2017-06-13T04:02:32.021+0000 I - [conn1] Index Build (background): 2148300/2200005 97% 2017-06-13T04:02:35.021+0000 I - [conn1] Index Build (background): 2172800/2200005 98% 2017-06-13T04:02:38.019+0000 I - [conn1] Index Build (background): 2195800/2200005 99% 2017-06-13T04:02:38.566+0000 I INDEX [conn1] build index done. scanned 2100006 total records. 
129 secs 2017-06-13T04:02:38.572+0000 I COMMAND [conn1] command mydb.$cmd appName: "MongoDB Shell" command: createIndexes { createIndexes: "mockdata", indexes: [ { key: { city: 1.0, first_name: 1.0, last_name: 1.0 }, name: "city_1_first_name_1_last_name_1", background: 1.0 } ] } numYields:20353 reslen:98 locks:{ Global: { acquireCount: { r: 20354, w: 20354 } }, Database: { acquireCount: { w: 20354, W: 2 } }, Collection: { acquireCount: { w: 20354 } } } protocol:op_command 129326ms

How it works...
In step 1, we remove any existing indexes. Next, in order to better simulate index creation delays, we simply reimport our sample dataset about 20 times. This should give us about 2 million records in our collection by the end of step 2. As I still have the previous recipes' sample data, my document count may be slightly higher; don't worry about that. Now, in order to test how foreground index creation hinders database operations, we need to be able to perform two tasks simultaneously. For this, we set up two terminal windows, preferably side by side, with mongo shells connected, and ensure mydb is selected in both. In step 4, we create an index on the three fields city, first_name, and last_name. Again, this is intentional, to add a bit of computational overhead to our test database setup. Note that, unlike previous runs, this command will not return immediately. So, switch to the second terminal window and try inserting a simple record, as shown in step 5. If you have both windows stacked side by side, you will notice that both commands return almost simultaneously. If you look at the mongod server logs, you can see that both operations, in this case, took roughly 10 seconds to complete. Also, as expected, our insert query did not complete until the index creation had released the lock on the collection.

In step 7, we delete the index again, and in step 8, we recreate it, but this time with the option {background: 1}. This tells mongod to start the index creation process in the background. In step 9, we switch to the other terminal window and try inserting a random document into our collection. Lo and behold, our document gets inserted immediately. Now is a good time to switch to the mongod server logs. As shown in step 10, you can see that the index creation happens in small batches. When the index creation completes, you can see that mongod acquired about 20,354 locks for this process, as opposed to 1 when creating the index in the foreground. This lock-and-release method allowed our insert query to go through. However, it does come with a trade-off: the index creation time in the background was about 130 seconds, as compared to 10 seconds in the foreground.

There you have it, a simple test to show the effectiveness of creating background indexes. As real-world production scenarios go, it is always safe to create indexes in the background unless you have a very strong reason not to.
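Incidentally, while a long background build is running, you can watch its progress from a second shell. A rough sketch using the standard db.currentOp() helper; the filter shown is approximate, matching the "Index Build" message text seen in the logs above:

// From a second mongo shell: list in-progress operations that look like index builds
db.currentOp().inprog.filter(function(op) {
    return op.msg && op.msg.indexOf('Index Build') === 0
})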
Creating TTL-based indexes
In this recipe, we will explore the expireAfterSeconds property of MongoDB indexes, which allows automatic deletion of documents from a collection.

Getting ready
For this recipe, all you need is a running mongod instance. We will be creating and working on a new collection called ttlcol in the database mydb.

How to do it...
1. Ensure that our collection is empty:
db.ttlcol.drop()
2. Add 200 documents with past and future timestamps:
for(var x=1; x<=100; x++){
var past = new Date()
past.setSeconds(past.getSeconds() - (x * 60))
// Insert a document with a timestamp in the past
var doc = { foo: 'bar', timestamp: past }
db.ttlcol.insert(doc)
// Insert a document with a timestamp in the future
var future = new Date()
future.setSeconds(future.getSeconds() + (x * 60))
var doc = { foo: 'bar', timestamp: future }
db.ttlcol.insert(doc)
}
3. Check that the documents were added:
db.ttlcol.count()
4. Create an index with TTL:
db.ttlcol.createIndex({timestamp:1}, {expireAfterSeconds: 10})
You should see output similar to this:
{ "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }
5. Wait for about a minute and check the document count:
db.ttlcol.count()
The number of documents returned should be lower than 200.

How it works...
In step 1, we emptied the ttlcol collection in mydb to ensure there is no old data. Next, in step 2, we ran a simple piece of JavaScript that adds 200 records, each having a BSON Date() field called timestamp: 100 records with timestamps in the past and 100 in the future, spaced one minute apart. Then, in step 4, we created a regular index, but with an additional parameter, {expireAfterSeconds: 10}. With this, we are telling the server to expire documents 10 seconds after the time held in their timestamp field. Once the index is added, you can check that the number of documents in the collection has reduced from 200 to, in this case, 113 and counting. What happens here is that a background thread in the MongoDB server wakes up every minute and removes any document that matches our index's condition. At this point, I would like to point out that if our timestamp field did not hold a valid Date() value, or an array of Date() values, the document would not be removed.

There's more...
If you wish to have explicit expiry times, set expireAfterSeconds to 0. In that case, a document is removed as soon as the current time passes the timestamp in the indexed field (a short sketch follows at the end of this recipe).

So when would you need a TTL-based index? Well, if you happen to store time-sensitive documents, like user sessions, or documents that can be removed after a certain period, like events, logs, or transaction history, then TTL-based indexes are your best option. They offer you more control over document retention than traditional capped collections.
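To illustrate the explicit-expiry variant mentioned above, here is a minimal sketch using a hypothetical collection ttlcol2 and a hypothetical expiresAt field; each document then carries its own deletion time:

// expireAfterSeconds: 0 means "delete once the indexed date has passed"
db.ttlcol2.createIndex({expiresAt: 1}, {expireAfterSeconds: 0})
// This document becomes eligible for deletion roughly 30 seconds from now
db.ttlcol2.insert({session: 'abc123', expiresAt: new Date(Date.now() + 30 * 1000)})

Remember that the background deletion thread only runs about once a minute, so removal is not instantaneous.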
Creating a sparse index
MongoDB allows you to create an index on fields that may not exist in all documents of a given collection. These are called sparse indexes, and in this recipe, we will look at how to create them.

Getting ready
For this recipe, load the sample dataset and create an index on the city field, as described in the Creating an index recipe.

How to do it...
1. Check the total number of documents in our collection, and the number of documents without the language field:
db.mockdata.count()
The preceding command should return 100000.
db.mockdata.find({language: {$eq:null}}).count()
The preceding command should return 12704.
2. Create a sparse index on the language field:
db.mockdata.createIndex({language:1}, {sparse: true})
You should see output similar to this:
{ "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }
3. Check that our index was created with the sparse parameter:
db.mockdata.getIndexes()
The preceding command should give you output similar to this:
[ { "key": { "_id": 1 }, "name": "_id_", "ns": "mydb.mockdata", "v": 2 }, { "key": { "language": 1 }, "name": "language_1", "ns": "mydb.mockdata", "sparse": true, "v": 2 } ]
4. Run a simple find query:
db.mockdata.find({language: 'French'}).explain('executionStats')['executionStats']
The preceding command should give you output similar to this:
"executionStages": { "advanced": 893, "alreadyHasObj": 0, "docsExamined": 893, "executionTimeMillisEstimate": 0, "inputStage": { "advanced": 893, "direction": "forward", "dupsDropped": 0, "dupsTested": 0, "executionTimeMillisEstimate": 0, "indexBounds": { "language": [ "[\"French\", \"French\"]" ] }, "indexName": "language_1", "indexVersion": 2, "invalidates": 0, "isEOF": 1, "isMultiKey": false, "isPartial": false, "isSparse": true, "isUnique": false, "keyPattern": { "language": 1 }, "keysExamined": 893, "multiKeyPaths": { "language": [] }, "nReturned": 893, "needTime": 0, "needYield": 0, "restoreState": 6, "saveState": 6, "seeks": 1, "seenInvalidated": 0, "stage": "IXSCAN", "works": 894 }, "invalidates": 0, "isEOF": 1, "nReturned": 893, "needTime": 0, "needYield": 0, "restoreState": 6, "saveState": 6, "stage": "FETCH", "works": 894 }, "executionSuccess": true, "executionTimeMillis": 1, "nReturned": 893, "totalDocsExamined": 893, "totalKeysExamined": 893 }

How it works...
For this example, we picked a sparsely populated field, language, which does not exist in all documents of our sample dataset. In step 1, we can see that around 12,000 documents do not contain this field. Next, in step 2, we create an index with the optional parameter {sparse: true}, which tells the MongoDB server to create a sparse index on our field, language. The index gets created and works just like any other index, as seen in steps 3 and 4, respectively.

Creating a partial index
Partial indexes were introduced in MongoDB version 3.2. A partial index is somewhat similar to a sparse index, but with the added advantage of being able to use expressions ($eq, $gt, and so on) and operators ($and).

Getting ready
For this recipe, load the sample dataset and create an index on the city field, as described in the Creating an index recipe.

How to do it...
1. Check the total number of documents in our collection, and the number of documents without the language field:
db.mockdata.count()
The preceding command should return 100000.
db.mockdata.find({language: {$eq:null}}).count()
The preceding command should return 12704.
2. Create a partial index on the first_name field:
> db.mockdata.createIndex( {first_name:1}, {partialFilterExpression: { language: {$exists: true}}} )
This should give you output similar to this:
{ "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }
3. Confirm that the index was created:
db.mockdata.getIndexes()
The preceding command should give you output similar to this:
[ { "key": { "_id": 1 }, "name": "_id_", "ns": "mydb.mockdata", "v": 2 }, { "key": { "first_name": 1 }, "name": "first_name_1", "ns": "mydb.mockdata", "partialFilterExpression": { "language": { "$exists": true } },
"v": 2 } ]
4.
Find a record without language field: db.mockdata.find({first_name: 'Sara'}).explain('executionStats')['executionStats'] The preceding command should give you output similar to this: { "executionStages": { "advanced": 7, "direction": "forward", "docsExamined": 100000, "executionTimeMillisEstimate": 21, "filter": { "first_name": { "$eq": "Sara" } }, "invalidates": 0, "isEOF": 1, "nReturned": 7, "needTime": 99994, "needYield": 0, "restoreState": 782, "saveState": 782, "stage": "COLLSCAN", "works": 100002 }, "executionSuccess": true, "executionTimeMillis": 33, "nReturned": 7, "totalDocsExamined": 100000, "totalKeysExamined": 0 } 5. Find a record with language field: db.mockdata.find({first_name: 'Sara', language: 'Spanish'}).explain('executionStats')['executionStats'] [ 56 ] Understanding and Managing Indexes The preceding command should give you output similar to this: { "executionStages": { "advanced": 1, "alreadyHasObj": 0, "docsExamined": 7, "executionTimeMillisEstimate": 0, "filter": { "language": { "$eq": "Spanish" } }, "inputStage": { "advanced": 7, "direction": "forward", "dupsDropped": 0, "dupsTested": 0, "executionTimeMillisEstimate": 0, "indexBounds": { "first_name": [ "[\"Sara\", \"Sara\"]" ] }, "indexName": "first_name_1", "indexVersion": 2, "invalidates": 0, "isEOF": 1, "isMultiKey": false, "isPartial": true, "isSparse": false, "isUnique": false, "keyPattern": { "first_name": 1 }, "keysExamined": 7, "multiKeyPaths": { "first_name": [] }, "nReturned": 7, "needTime": 0, "needYield": 0, "restoreState": 0, "saveState": 0, "seeks": 1, "seenInvalidated": 0, "stage": "IXSCAN", "works": 8 [ 57 ] Understanding and Managing Indexes }, "invalidates": 0, "isEOF": 1, "nReturned": 1, "needTime": 6, "needYield": 0, "restoreState": 0, "saveState": 0, "stage": "FETCH", "works": 8 }, "executionSuccess": true, "executionTimeMillis": 0, "nReturned": 1, "totalDocsExamined": 7, "totalKeysExamined": 7 } How it works... As in the previous recipe, we have picked a sparsely populated field, language, which does not exist in all documents of our sample dataset. In step 1, we can see that around 12,000 documents do not contain this field. Next, in step 2, we create an index on the field first_name with the optional parameter partialFilterExpression. With this parameter, we have added a condition { language: {$exists: true}}. MongoDB is instructed to create an index on first_name only on documents which have the field language present. If we look at the executionStats in step 4, we can observe that the index is not used if we do a simple search on the field first_name. However, in step 5, we can see that our query is using the MongoDB index if we add an additional parameter of the field language. Apart from this simple example, there are tons of good variations possible if we use expressions like $lt, $gt, and so on. You can find some more examples at https://docs. mongodb.com/manual/core/index-partial/. So why would one use a partial index? Say, for example, you have a huge dataset and wish to have an index on a field which is sparsely spread across these documents. Traditional indexes would cause the entire collection to be indexed and may not be optimal if we are going to work on a subset of these documents. [ 58 ] Understanding and Managing Indexes Creating a unique index MongoDB allows you to create an index on a field with the option of ensuring that it is unique in the collection. In this recipe, we will explore how it can be done. 
Getting ready
For this recipe, we only need a running mongod instance.

How to do it...
1. Connect to the mongo shell and insert a random document:
use mydb
db.testuniq.insert({foo: 'zoidberg'})
2. Create an index with the unique parameter:
db.testuniq.createIndex({foo:1}, {unique:1})
The preceding command should give you an output similar to this:
{ "createdCollectionAutomatically": false, "numIndexesAfter": 2, "numIndexesBefore": 1, "ok": 1 }
3. Try to add another document with a duplicate value of the field:
db.testuniq.insert({foo: 'zoidberg'})
The preceding command should give you an error message similar to this:
WriteResult({ "nInserted" : 0, "writeError" : { "code" : 11000, "errmsg" : "E11000 duplicate key error collection: mydb.testuniq index: foo_1 dup key: { : \"zoidberg\" }" } })
4. Drop the index:
db.testuniq.dropIndexes()
5. Add a duplicate record:
db.testuniq.insert({foo: 'zoidberg'})
db.testuniq.find()
The preceding command should give you an output similar to this:
{ "_id" : ObjectId("59490cabc14da1366d83254f"), "foo" : "zoidberg" }
{ "_id" : ObjectId("59490d20c14da1366d832551"), "foo" : "zoidberg" }
6. Try creating the index again:
db.testuniq.createIndex({foo:1}, {unique:1})
The preceding command should give you an output similar to this:
{ "ok" : 0, "errmsg" : "E11000 duplicate key error collection: mydb.testuniq index: foo_1 dup key: { : \"zoidberg\" }", "code" : 11000, "codeName" : "DuplicateKey" }

How it works...
In step 1, we inserted a document into a new collection, testuniq. Next, in step 2, we created an index on the field foo with the parameter {unique: true}. In step 3, we try to add another record with the same value of the field foo as before, and we receive an error, as expected. In steps 4 and 5, we drop the indexes and add a duplicate record. Then, in step 6, we try to create a new unique index. This time we are not allowed to, because there are duplicates in our collection.

This is a simple example of how to create an index with a unique constraint. Additionally, we can also create unique indexes on fields that hold an array, for example {foo: ['bar', 'baz']}. MongoDB inspects each value of the array against the index. Try adding a document with these values and see what happens; a quick sketch follows below. Also, if you insert a document where the indexed field is missing, MongoDB will not allow you to add another document that is also missing that field. The missing field is considered a null value, and because of the unique constraint on the index, only one document in the collection can have a null (missing) value for that field.
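Here is a minimal sketch of both behaviors, run on an empty collection (the values are arbitrary):

db.testuniq.drop()
db.testuniq.createIndex({foo: 1}, {unique: 1})
db.testuniq.insert({foo: ['bar', 'baz']})  // succeeds; both 'bar' and 'baz' enter the index
db.testuniq.insert({foo: 'bar'})           // fails with E11000: 'bar' is already indexed
db.testuniq.insert({other: 1})             // succeeds; the missing foo is indexed as null
db.testuniq.insert({another: 2})           // fails with E11000: only one null key is allowed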
3
Performance Tuning
In this chapter we will be covering the following topics:
Configuring disks for better I/O
Measuring disk I/O performance with mongoperf
Finding slow running queries and operations
Figuring out the size of a working set

Introduction
This chapter is slightly different from the previous ones, in that we will be looking at the various technical aspects that should be considered to gain optimal performance from a MongoDB setup. As you are probably aware, application performance tuning is a highly nuanced art, hence not all aspects will be covered here. However, I will try to discuss the most important points, which should pave the way for more critical thinking on the subject.

Configuring disks for better I/O
In this recipe, we will be looking at the importance of provisioning your servers for better disk I/O.

Reading and writing from disks
Apart from CPU and memory (RAM), MongoDB, like most database applications, relies heavily on disk operations. To better understand this dependency, let's look at a very simple example of reading data from a file. Suppose you have a file that contains a few thousand lines, each containing a set of strings in no particular order. A program that searches for a particular string would need to open the file, iterate through each line, and search for the string. Once the string is found, the program closes the file. As disks are usually much slower than RAM, this approach of opening a file, reading it, and closing it on every query is suboptimal. To circumvent this, Linux (and most modern operating systems) relies heavily on the buffer cache. The operating system kernel uses this cache to store chunks of data, in blocks, that are frequently read from the disk. So, when a process tries to read a particular file, the kernel first does a lookup in its cache. If the data is not cached, the kernel reads it from the disk and loads it into the cache. Data is evicted from the cache based on its frequency of use, that is, less-used data gets removed first to make room for more frequently accessed data. Additionally, the kernel tries to utilize all available free memory for the cache, but it automatically reduces the cache size if a process requires memory.

This cache is designed to circumvent the delays inherent in reading from and writing to disks. Any application that relies on disk I/O is greatly impacted by the speed of the disk. RAM, on the other hand, is extremely fast. How fast, you ask? To put it in perspective, most disk operations are in the range of milliseconds (thousandths of a second), whereas for RAM, it is nanoseconds (billionths of a second). MongoDB is designed along similar lines, in that the database server tries to keep the indexes and the working set in memory, while for actual disk reads it relies heavily on the filesystem buffer cache. But even with everything optimized to be in memory, at some point MongoDB will need to either write to the disk or read from it.

The rate of disk read/write operations is commonly referred to as disk Input/Output Operations Per Second (IOPS). As disk I/O is a blocking operation, the amount of disk IOPS available to MongoDB will eventually determine how fast your database performs.

A few considerations while selecting storage devices
First things first, disks are slow. Neither magnetic nor solid state disks can perform anywhere near the speed of RAM. As MongoDB tries to store database indexes in memory, try to have workloads that utilize the benefits of indexes. It goes without saying that your servers need to have sufficient RAM to store indexes and the disk cache. While deciding the optimal RAM capacity for your server, consider aspects such as the rate of growth of data (and indexes), sufficient size for the disk buffer cache, and headroom for the underlying operating system.

A very simple way of calculating the theoretical IOPS of a disk is 1/(average disk latency + average seek time). So, for a disk with 2 ms average latency and 3 ms average seek time, the total supported IOPS would be 1/(0.002 + 0.003) = 200 IOPS. Again, this does not take into account a lot of other factors, such as disk degradation, ECC, and sequential or random seeks.
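To make the arithmetic concrete, here is the same calculation written out in the mongo shell (the numbers are illustrative, not a benchmark of any particular disk):

// Theoretical IOPS = 1 / (average latency + average seek time), both in seconds
var latency = 0.002  // 2 ms
var seek = 0.003     // 3 ms
var iops = 1 / (latency + seek)  // 1 / 0.005 = 200 IOPS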
With a limited cap on disk IOPS, you can substantially increase the server's IOPS capacity by using RAID 0 (disk striping). For example, an array of four disks in RAID 0 would theoretically give you 4 x 200 = 800 IOPS. The trade-off with RAID 0 is that you do not get data redundancy, that is, if a disk fails, your data is lost. But this can be easily rectified by having a MongoDB replica set. However, on the off-chance that you do decide to use any other RAID setup, keep in mind that your write operations will be directly affected by the RAID setup. That is, for RAID 1 or RAID 10 you would be performing two write operations for every one actual disk write. At the same time, RAID 5 and RAID 6 would not be suitable as they increase the additional writes even more. Lastly, know your application requirements. I cannot stress how important it is to analyze and monitor your applications' read and write operations. It is ideal to have, at the least, a rough estimate on the ratio of reads to writes. Filesystems also play a crucial role. MongoDB highly recommends using the XFS filesystem. For more information, see https://docs.mongodb. com/manual/administration/production-notes/#kernel-and-filesystems. We will discuss this in the recipe 'Configuring for production deployment' in Chapter 10. [ 65 ] Performance Tuning Measuring disk I/O performance with mongoperf By now, you should have a fair idea of the importance of disk I/O and how it directly impacts your database performance. MongoDB provides a nifty little utility called mongoperf that allows us to quickly measure disk I/O performance. Getting ready For this recipe, we only need the mongoperf utility, which is available in the bin directory of your MongoDB installation. How to do it... 1. Measure the read throughput with mmf disabled: root@ubuntu:~# echo "{ recSizeKB: 8, nThreads: 12, fileSizeMB: 10000, r: true, mmf: false }" | mongoperf You will get the following result: mongoperf use -h for help parsed options: { recSizeKB: 8, nThreads: 12, fileSizeMB: 10000, r: true, mmf: false } creating test file size:10000MB ... 1GB... 2GB... 3GB... 4GB... 5GB... 6GB... 7GB... 8GB... 9GB... testing... options:{ recSizeKB: 8, nThreads: 12, fileSizeMB: 10000, r: true, mmf: false } wthr 12 new thread, total running : 1 read:1 write:0 19789 ops/sec 77 MB/sec [ 66 ] Performance Tuning 19602 ops/sec 76 MB/sec 19173 ops/sec 74 MB/sec 19300 ops/sec 75 MB/sec 18838 ops/sec 73 MB/sec 19494 ops/sec 76 MB/sec 19579 ops/sec 76 MB/sec 19002 ops/sec 74 MB/sec new thread, total running : 2 <---- output truncated ---> new thread, total running : 12 read:1 write:0 read:1 write:0 read:1 write:0 read:1 write:0 40544 ops/sec 158 MB/sec 40237 ops/sec 157 MB/sec 40463 ops/sec 158 MB/sec 40463 ops/sec 158 MB/sec 2. In another Terminal window, run iostat to confirm the disk utilization as follows: 3. Measure the read throughput with mmf enabled and a payload larger than the server's total memory shown as follows: root@ubuntu:~# echo "{ recSizeKB: 8, nThreads: 12, fileSizeMB: 10000, r: true, mmf: true }" | mongoperf The following result is obtained: mongoperf use -h for help parsed options: { recSizeKB: 8, nThreads: 12, fileSizeMB: 10000, r: true, mmf: true } creating test file size:10000MB ... 1GB... 2GB... [ 67 ] Performance Tuning 3GB... 4GB... 5GB... 6GB... 7GB... 8GB... 9GB... testing... 
options:{ recSizeKB: 8, nThreads: 12, fileSizeMB: 10000, r: true, mmf: true } wthr 12 new thread, total running : 1 read:1 write:0 8107 ops/sec 9253 ops/sec 9258 ops/sec 9290 ops/sec 9088 ops/sec <---- output truncated ---> new thread, total running : 12 read:1 write:0 read:1 write:0 read:1 write:0 read:1 write:0 9430 ops/sec 9668 ops/sec 9804 ops/sec 9619 ops/sec 9371 ops/sec 4. Measure the read throughput with mmf enabled and a payload slightly less than the systems total memory: root@ubuntu:~# echo "{ recSizeKB: 8, nThreads: 12, fileSizeMB: 400, r: true, mmf: true }" | mongoperf You will see the following: mongoperf use -h for help parsed options: { recSizeKB: 8, nThreads: 12, fileSizeMB: 400, r: true, mmf: true } creating test file size:400MB ... testing... options:{ recSizeKB: 8, nThreads: 12, fileSizeMB: 400, r: true, mmf: true } wthr 12 [ 68 ] Performance Tuning new thread, total running : 1 read:1 write:0 2605344 ops/sec 4918429 ops/sec 4720891 ops/sec 4766924 ops/sec 4693762 ops/sec 4810953 ops/sec 4785765 ops/sec 4839164 ops/sec <---- output truncated ---> new thread, total running : 12 read:1 write:0 read:1 write:0 read:1 write:0 read:1 write:0 4835022 ops/sec 4962848 ops/sec 4945852 ops/sec 4945882 ops/sec 4970441 ops/sec How it works... The mongoperf utility takes parameters in the form of a JSON file. We can either provide this configuration in the form of a file or simply pipe the configuration to mongoperf's stdin. To view the available options of mongoperf simply run mongoperf -h and obtain the following: usage: mongoperf < myjsonconfigfile { nThreads: , // number of threads (default 1) fileSizeMB: , // test file size (default 1MB) sleepMicros: , // pause for sleepMicros/nThreads between each operation (default 0) mmf: , // if true do i/o's via memory mapped files (default false) r: , // do reads (default false) w: , // do writes (default false) recSizeKB: , // size of each write (default 4KB) syncDelay: // secs between fsyncs, like --syncdelay in mongod. (default 0/never) } [ 69 ] Performance Tuning In step 1, we pass a handful of parameters to mongoperf. Let's take a look at them: recSizeKB: The size of each record that would be written or read from the sample dataset. In our example, we are using an 8 KB record size. nThreads: The number of concurrent threads performing the (read/write) operations. In our case, it is set to 12. fileSizeMB: The size of the file to be read or written to. We are setting this to roughly 10 GB r: By indicating r:true, we will only be performing read operations. You can use w:true to test write operations or both. mmf: It is memory mapped file format. Disabling mmf causes mongoperf to bypass the file buffer and perform the operation directly on the disk. In order to truly test the underlying physical I/O, we are disabling mmf by setting it to false. In the subsequent steps, we will set it to true. As we fire up the mongoperf utility, mongoperf first tries to create a roughly 10 GB file on the disk. Once created, it starts one thread and slowly ramps up to 12 (nThreads). You can clearly see the increase in read operations per second as the number of threads increases. Depending on your disk's capabilities, you should expect to reach the maximum IOPS limit pretty soon. This can be observed, in step 2, by running the iostat command and observing the %util column. Once it reaches 100%, you can assume that the disk is peaking at its maximum operating limit. In step 3, we run the same test but this time with mmf set to true. 
Here, we are attempting to test the benefits of memory mapping, that is, reading the data through memory-mapped files rather than directly from the physical disk. However, you can see that the performance is not what we would expect; in fact, it is drastically lower than the IOPS achieved when reading directly from disk. The primary reason is that our working file is 10 GB in size, whereas my VM's memory is only 1 GB. As the entire dataset cannot fit in memory, mongoperf has to routinely seek data from the disk. This is even more suboptimal when the reads are random, as can be observed in the output.

In step 4, we confirm our theory by running the test again, but this time with a fileSizeMB of 400, which is smaller than the available memory. As you can see, the number of IOPS is drastically higher than in the previous run, confirming that it is extremely important that your working dataset fits in your system's memory.

So there you have it, a simple way to test your system's IOPS using the mongoperf utility. Although we only tested read operations, I would strongly urge you to test write as well as read/write operations when evaluating your systems; a sketch follows below. Additionally, you should also perform mmf-enabled tests to get an idea of what an adequately sized working set would be for a given server.
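As a starting point, a combined read/write run can be sketched like this, reusing the same options as before (adjust fileSizeMB and nThreads for your environment; with mmf: false, this will exercise the physical disk directly):

root@ubuntu:~# echo "{ recSizeKB: 8, nThreads: 12, fileSizeMB: 10000, r: true, w: true, mmf: false }" | mongoperf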
Finding slow running queries and operations
In this recipe, we will be looking at how to capture queries that have long execution times. By identifying slow running queries, you can work towards implementing appropriate database indexes, or even consider optimizing the application code.

Getting ready
Assuming that you are already running a MongoDB server, we will be importing a dataset of around 100,000 records that is available in the form of a CSV file called chapter_2_mock_data.csv. You can download this file from the Packt website.

How to do it...
1. Import the sample data into the MongoDB server:
mongoimport --headerline --ignoreBlanks --type=csv -d mydb -c mockdata -h localhost chapter_2_mock_data.csv
This will give us the following result:
2017-06-23T08:12:02.122+0530 connected to: localhost
2017-06-23T08:12:03.144+0530 imported 100000 documents
2. Connect to the MongoDB instance and open a MongoDB shell:
mongo localhost
3. Check that the documents are in the right place:
> use mydb
switched to db mydb
> db.mockdata.count()
100000
4. Enable profiling for slow queries:
> db.setProfilingLevel(1, 20)
{ "was" : 0, "slowms" : 20, "ok" : 1 }
5. Run a simple find query as follows:
> db.mockdata.find({first_name: "Pam"}).count()
10
6. Check the profiling collection:
> db.system.profile.find().pretty()
The following result is obtained:
{ "op" : "command", "ns" : "mydb.mockdata", "command" : { "count" : "mockdata", "query" : { "first_name" : "Pam" }, "fields" : { } }, "keysExamined" : 0, "docsExamined" : 100000, "numYield" : 781, "locks" : { "Global" : { "acquireCount" : { "r" : NumberLong(1564) } }, "Database" : { "acquireCount" : { "r" : NumberLong(782) } }, "Collection" : { "acquireCount" : { "r" : NumberLong(782) } } }, "responseLength" : 29, "protocol" : "op_command", "millis" : 37, "planSummary" : "COLLSCAN", "execStats" : { "stage" : "COUNT", "nReturned" : 0, "executionTimeMillisEstimate" : 26, "works" : 100002, "advanced" : 0, "needTime" : 100001, "needYield" : 0, "saveState" : 781, "restoreState" : 781, "isEOF" : 1, "invalidates" : 0, "nCounted" : 10, "nSkipped" : 0, "inputStage" : { "stage" : "COLLSCAN", "filter" : { "first_name" : { "$eq" : "Pam" } }, "nReturned" : 10, "executionTimeMillisEstimate" : 26, "works" : 100002, "advanced" : 10, "needTime" : 99991, "needYield" : 0, "saveState" : 781, "restoreState" : 781, "isEOF" : 1, "invalidates" : 0, "direction" : "forward", "docsExamined" : 100000 } }, "ts" : ISODate("2017-07-07T03:26:57.818Z"), "client" : "192.168.200.1", "appName" : "MongoDB Shell", "allUsers" : [ ], "user" : "" }
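The system.profile collection can be queried like any other collection, so rather than eyeballing the full output, you can narrow it down. For example, a quick way to pull the five most recent operations slower than 20 ms, using the millis and ts fields seen above:

db.system.profile.find({millis: {$gt: 20}}).sort({ts: -1}).limit(5).pretty()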
How it works...
We begin by importing a fairly large dataset using the mongoimport utility, as we did in the Creating an index recipe in Chapter 2, Understanding and Managing Indexes. Next, in steps 2 and 3, we start the MongoDB shell and check that our documents were inserted. In step 4, we enable database profiling by running the db.setProfilingLevel(1, 20) command. Database profiling is a feature in MongoDB that logs slow queries or operations, along with profiling information related to each operation. MongoDB allows three profiling levels:
Level 0: Disable database profiling
Level 1: Log only slow operations
Level 2: Log all operations
By default, profiling for all databases is set to level 0. This can be confirmed by running the following command:
db.getProfilingStatus()
{ "was" : 0, "slowms" : 100 }
The was field indicates the current profiling level, whereas the slowms field indicates the threshold execution time (in milliseconds) for operations. All operations taking longer than the slowms threshold will be recorded by the database profiler. In our recipe, we set the profiling level to 1, indicating that we want to record only slow operations, and the second parameter, 20, indicates that any operation taking longer than 20 ms should be recorded.

In step 5, we run a simple query to count the number of documents that have first_name = 'Pam'. As this field is not indexed, the server has to scan through all the documents, which should take more than 20 ms. Once the profiler's threshold is crossed (in our case, 20 ms), the data is stored in the system.profile collection.

In step 6, we query the system.profile collection to find the operations captured by the database profiler. Each document in this collection captures a lot of information regarding the query. A few of the fields are as follows:
client: The IP address of the connecting client.
appName: A string passed by the MongoDB driver that can help identify the connecting app. It's extremely helpful if you have multiple applications talking to the same database. In our example, this string was "MongoDB Shell", which was set by the mongo shell.
user: The authenticated user who ran the operation. This can be empty if no authentication was used.
millis: The time taken, in milliseconds, for the entire operation to finish.
command: The command for the given operation.
ns: The namespace on which the command was run. Its format is <database>.<collection>, so in our example, it was run on the mydb database's mockdata collection.
An exhaustive list can be found in MongoDB's official documentation: https://docs.mongodb.com/manual/reference/database-profiler/. Considering the wealth of information collected by the database profiler, it should be very easy not only to debug slow queries, but even to monitor the collection and alert on patterns (more on this in Chapter 8, Monitoring MongoDB).

There's more...
If, due to sheer boredom or just curiosity, you happen to inspect the system.profile collection, you will note that it is a capped collection with a size of 1 MB:
db.system.profile.stats()
The result is as follows:
{ "ns" : "mydb.system.profile", "size" : 0, "count" : 0, "numExtents" : 1, "storageSize" : 1048576, "lastExtentSize" : 1048576, "paddingFactor" : 1, "paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only.", "userFlags" : 1, "capped" : true, "max" : NumberLong("9223372036854775807"), "maxSize" : 1048576, "nindexes" : 0, "totalIndexSize" : 0, "indexSizes" : { }, "ok" : 1 }
This size may be sufficient for most cases, but if you need to increase the size of this collection, here is how to do it. First, we disable profiling:
> db.setProfilingLevel(0)
{ "was" : 1, "slowms" : 100, "ok" : 1 }
Next, we drop the system.profile collection and create a new capped collection with a size of 10 MB:
> db.createCollection('system.profile', {capped: true, size: 10485760})
{ "ok" : 1 }
Finally, enable profiling:
> db.setProfilingLevel(1,20)
{ "was" : 0, "slowms" : 100, "ok" : 1 }
That's it! Your system.profile collection's size is now 10 MB.

Storage considerations when using Amazon EC2
Amazon Web Services (AWS) provides a variety of instances in their Elastic Compute Cloud (EC2) offerings. With each type of EC2 instance, there are two distinct ways to store data: instance store and Elastic Block Storage (EBS).

Instance store refers to an ephemeral disk that is available as a block device to the instance and is physically present on the host of the instance. By being on the same host, these disks provide extremely high throughput. However, instance stores are ephemeral and thus provide no guarantee of data retention if the instance is terminated or stopped, or if the disk fails. This is clearly not suitable for a single-node MongoDB instance, as you might lose your data any time the instance goes down. Not all hope is lost, though: we can use a replica set of three or more nodes and ensure the redundancy of data. For a more robust deployment, we can consider having an extra node in the replica set cluster that uses EBS and has its priority set to zero. This ensures that the node is always in sync with the data, while at the same time not being used to serve actual queries.

EBS is network-attached storage that can be used as a block device and can be attached to any AWS instance. EBS volumes provide data persistence and can be reattached to any instance running in the same availability zone of the AWS region.
There are various types of EBS volumes available, such as general purpose SSDs, Provisioned IOPS (PIOPS) volumes, and high-throughput magnetic disks. As magnetic disks are focused on high-throughput data streams, mostly performing sequential reads on large files, they are not appropriate for MongoDB. General purpose SSDs provide single-digit millisecond latencies with a minimum baseline of 100 IOPS. They also provide the ability to burst up to 10,000 IOPS, depending on the volume size, through a rather unique burst bucket system with a baseline of 3 IOPS per GB; I would rather not go into too much detail here. PIOPS volumes are another EBS offering, with which you can choose a minimum guaranteed IOPS and are billed accordingly. For most small to medium sized workloads, general purpose SSDs should do the trick.
However, when provisioning EBS volumes, we need to keep network utilization in mind. As EBS volumes are accessed over the network, they tend to share the same network link as the instance. This may not be ideal for a database, as the application traffic to the instance would then contend with that of the EBS volumes. AWS does provide EBS-optimized EC2 instances that use a separate network path, so that your instance traffic does not affect your disk throughput.
Another significant optimization technique is to use multiple EBS volumes for different parts of your MongoDB data. For instance, we can have separate EBS volumes for the actual data, the database journal, the logs, and the backups. This separation of EBS volumes ensures that journal, log, and backup operations do not impinge on the throughput of the actual data.
Lastly, striping volumes over EBS (RAID 0) may prove to increase your overall volume's IOPS capacity. Although the official MongoDB documentation does not recommend using RAID 0 over EBS, I suggest testing your workload against RAID 0 EBS volumes to determine whether this suits your needs.
More on EBS can be found here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html.

Figuring out the size of a working set
In this recipe, we will be looking at what a working set is, why it is important, and how to calculate it. As you probably know, MongoDB relies heavily on caching objects and indexes in RAM. The primary reason to do so is to leverage the speed at which data can be retrieved from RAM as compared to physical disks. Theoretically, a working set is the amount of data accessed by your clients. For performance reasons, it is highly recommended that the server have sufficient RAM to fit the entire working set, while keeping sufficient room for other operations and services running on the same server. At a high level, the working set comprises the most frequently accessed data and indexes.
To get an idea of your database's size, you can run the db.stats() command in the MongoDB shell:

db.stats()

You will get the following result:

{
  "db" : "mydb",
  "collections" : 5,
  "views" : 0,
  "objects" : 100009,
  "avgObjSize" : 239.83617474427302,
  "dataSize" : 23985776,
  "storageSize" : 48304128,
  "numExtents" : 12,
  "indexes" : 2,
  "indexSize" : 3270400,
  "fileSize" : 67108864,
  "nsSizeMB" : 16,
  "extentFreeList" : {
    "num" : 1,
    "totalSize" : 1048576
  },
  "dataFileVersion" : {
    "major" : 4,
    "minor" : 22
  },
  "ok" : 1
}

In the output, dataSize represents the size of the entire (uncompressed) data of the given database, and indexSize represents the total size of all indexes in the database.
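Putting the two numbers together is a quick back-of-the-envelope exercise you can do right in the shell. A minimal sketch (it assumes the worst case, that is, the entire dataset is hot):

> var s = db.stats()
> (s.dataSize + s.indexSize) / (1024 * 1024)  // roughly 26 MB for the sample database above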
In theory, we want to have enough RAM to fit all the data and indexes. This would result in the fewest reads from physical storage and provide optimal read performance. For all practical purposes, however, this scenario may not hold in all cases. Say, for example, you have 24 GB of data and about 2 GB of indexes; the obvious recommendation would be a server with 32 GB of RAM. But what if your application's usage is such that you barely access about 4 GB of data? In that case, such an over-provisioned server may not be an ideal choice. Similarly, if you have a smaller working set, say 6 GB, and host it on a server with 8 GB of RAM, but the working set grows considerably fast, you may soon run out of memory to fit it. My point is, while understanding the size of a working set is an absolute must, you should not underestimate the importance of monitoring the actual usage of the data.
MongoDB maintains a thread per connection, and each such thread consumes 1 MB of RAM. Make sure you factor this in when doing capacity planning for your database server.

There's more...
From version 3.0, MongoDB has provided detailed statistics for the WiredTiger storage engine, especially its cache. Here is the output from a production system that has 16 GB of memory. The approximate size of the working set is 600 MB and the index size is 3 MB:

> db.serverStatus().wiredTiger.cache
{
  "tracked dirty bytes in the cache" : 0,
  "tracked bytes belonging to internal pages in the cache" : 299485,
  "bytes currently in the cache" : 641133907,
  "tracked bytes belonging to leaf pages in the cache" : 7515893283,
  "maximum bytes configured" : 7516192768,
  "tracked bytes belonging to overflow pages in the cache" : 0,
  "bytes read into cache" : 583725713,
  "bytes written from cache" : 711362477,
  <-- output truncated -->
  "tracked dirty pages in the cache" : 0,
  "pages currently held in the cache" : 3674,
  "pages read into cache" : 3784,
  "pages written from cache" : 117710
}

By default, WiredTiger uses the larger of either 50% of RAM minus 1 GB, or 256 MB, for its internal cache. In the preceding output, this can be seen in the value of maximum bytes configured, which is roughly 7 GB on a 16 GB RAM server. This can be changed by setting the storage.wiredTiger.engineConfig.cacheSizeGB parameter in the MongoDB configuration file, or at runtime through the wiredTigerEngineRuntimeConfig server parameter.
You should keep an eye on tracked dirty bytes in the cache. If it consistently grows to a high number, you may need to look at changing the cache size. Here's a simple rule of thumb:
tracked dirty bytes in the cache < bytes currently in the cache < maximum bytes configured
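For reference, both knobs look roughly like this (a sketch; the 8 GB figure is an arbitrary example, not a recommendation). In the configuration file:

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8

Or at runtime, from the mongo shell:

> db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "cache_size=8G" })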
4
High Availability with Replication
In this chapter, we will cover the following topics:
Initializing a new replica set
Adding a node to the replica set
Removing a node from the replica set
Working with an arbiter
Switching between primary and secondary nodes
Changing replica set configuration
Changing priority to replica set nodes

Introduction
This chapter aims to get you started with MongoDB replica sets. A replica set is essentially a group of MongoDB servers that form a quorum and replicate data across all nodes. Such a setup not only provides a high availability cluster but also allows the distribution of database reads across multiple nodes. A replica set consists of a single primary node along with secondary nodes. The primary node accepts all writes to the database, and each write operation is replicated to the secondary nodes through the replication of the operation log, also known as the oplog. A node is determined as primary by way of an election between the nodes in the replica set; thus, any node within the cluster can become the primary at any point. It is important to have an odd number of nodes in the replica set to ensure that the election process does not result in a tie. If you choose to have an even number of nodes in the replica set, MongoDB provides a non-resource-intensive arbiter server that can perform heartbeats and take part in the election process.
In this chapter, we will be looking at various aspects of setting up and managing replica sets.

Initializing a new replica set
In this recipe, we will be setting up the first node of a three node replica set on a single server. In a production setup, the nodes should be on three physically separate servers.

Getting ready
By now, I am assuming you are familiar with installing MongoDB and have it ready. Additionally, we will create individual directories for each MongoDB instance:

mkdir -p /data/server{1,2,3}/{conf,logs,db}

This should create three parent directories: /data/server1, /data/server2, and /data/server3, each containing subdirectories named conf, logs, and db. We will be using this directory format throughout the chapter.

How to do it...
1. Start the first node in the replica set:

mongod --dbpath /data/server1/db --replSet MyReplicaSet

2. Open a new Terminal window, connect to the replica set node using the MongoDB shell, and check the replica set's status:

rs.status()
{
  "info" : "run rs.initiate(...) if not yet done for the set",
  "ok" : 0,
  "errmsg" : "no replset config has been received",
  "code" : 94,
  "codeName" : "NotYetInitialized"
}

3. Initialize the replica set:

rs.initiate()
{
  "info2" : "no configuration specified. Using a default configuration for the set",
  "me" : "vagrant-ubuntu-trusty-64:27017",
  "ok" : 1
}

4. Check the replica set's status again:

rs.status()
{
  "set" : "MyReplicaSet",
  "date" : ISODate("2017-08-20T05:28:26.827Z"),
  "myState" : 1,
  "term" : NumberLong(1),
  "heartbeatIntervalMillis" : NumberLong(2000),
  "optimes" : {
    "lastCommittedOpTime" : {
      "ts" : Timestamp(1503206903, 1),
      "t" : NumberLong(1)
    },
    "appliedOpTime" : {
      "ts" : Timestamp(1503206903, 1),
      "t" : NumberLong(1)
    },
    "durableOpTime" : {
      "ts" : Timestamp(1503206903, 1),
      "t" : NumberLong(1)
    }
  },
  "members" : [
    {
      "_id" : 0,
      "name" : "vagrant-ubuntu-trusty-64:27017",
      "health" : 1,
      "state" : 1,
      "stateStr" : "PRIMARY",
      "uptime" : 35,
      "optime" : {
        "ts" : Timestamp(1503206903, 1),
        "t" : NumberLong(1)
      },
      "optimeDate" : ISODate("2017-08-20T05:28:23Z"),
      "infoMessage" : "could not find member to sync from",
      "electionTime" : Timestamp(1503206902, 2),
      "electionDate" : ISODate("2017-08-20T05:28:22Z"),
      "configVersion" : 1,
      "self" : true
    }
  ],
  "ok" : 1
}
5. Switch back to the mongod Terminal window and inspect the server logs:

2017-08-20T05:28:16.928+0000 I NETWORK [thread1] connection accepted from 192.168.200.1:55765 #1 (1 connection now open)
2017-08-20T05:28:16.929+0000 I NETWORK [conn1] received client metadata from 192.168.200.1:55765 conn1: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.4.4" }, os: { type: "Darwin", name: "Mac OS X", architecture: "x86_64", version: "14.5.0" } }
2017-08-20T05:28:22.625+0000 I COMMAND [conn1] initiate : no configuration specified. Using a default configuration for the set
2017-08-20T05:28:22.625+0000 I COMMAND [conn1] created this configuration for initiation : { _id: "MyReplicaSet", version: 1, members: [ { _id: 0, host: "vagrant-ubuntu-trusty-64:27017" } ] }
2017-08-20T05:28:22.625+0000 I REPL [conn1] replSetInitiate admin command received from client
2017-08-20T05:28:22.625+0000 I REPL [conn1] replSetInitiate config object with 1 members parses ok
2017-08-20T05:28:22.625+0000 I REPL [conn1] ******
2017-08-20T05:28:22.625+0000 I REPL [conn1] creating replication oplog of size: 1628MB...
2017-08-20T05:28:22.628+0000 I STORAGE [conn1] Starting WiredTigerRecordStoreThread local.oplog.rs
2017-08-20T05:28:22.628+0000 I STORAGE [conn1] The size storer reports that the oplog contains 0 records totaling to 0 bytes
2017-08-20T05:28:22.628+0000 I STORAGE [conn1] Scanning the oplog to determine where to place markers for truncation
2017-08-20T05:28:22.634+0000 I REPL [conn1] ******
2017-08-20T05:28:22.646+0000 I INDEX [conn1] build index on: admin.system.version properties: { v: 2, key: { version: 1 }, name: "incompatible_with_version_32", ns: "admin.system.version" }
2017-08-20T05:28:22.646+0000 I INDEX [conn1] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2017-08-20T05:28:22.646+0000 I INDEX [conn1] build index done. scanned 0 total records. 0 secs
2017-08-20T05:28:22.646+0000 I COMMAND [conn1] setting featureCompatibilityVersion to 3.4
2017-08-20T05:28:22.647+0000 I REPL [conn1] New replica set config in use: { _id: "MyReplicaSet", version: 1, protocolVersion: 1, members: [ { _id: 0, host: "vagrant-ubuntu-trusty-64:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('59991df64db063a571ae8680') } }
2017-08-20T05:28:22.647+0000 I REPL [conn1] This node is vagrant-ubuntu-trusty-64:27017 in the config
2017-08-20T05:28:22.647+0000 I REPL [conn1] transition to STARTUP2
2017-08-20T05:28:22.647+0000 I REPL [conn1] Starting replication storage threads
2017-08-20T05:28:22.647+0000 I REPL [conn1] Starting replication fetcher thread
2017-08-20T05:28:22.647+0000 I REPL [conn1] Starting replication applier thread
2017-08-20T05:28:22.647+0000 I REPL [conn1] Starting replication reporter thread
2017-08-20T05:28:22.647+0000 I REPL [rsSync] transition to RECOVERING
2017-08-20T05:28:22.648+0000 I REPL [rsSync] transition to SECONDARY
2017-08-20T05:28:22.648+0000 I REPL [rsSync] conducting a dry run election to see if we could be elected
2017-08-20T05:28:22.648+0000 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2017-08-20T05:28:22.654+0000 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 1
2017-08-20T05:28:22.654+0000 I REPL [ReplicationExecutor] transition to PRIMARY
2017-08-20T05:28:22.654+0000 I REPL [ReplicationExecutor] Entering primary catch-up mode.
2017-08-20T05:28:22.654+0000 I REPL [ReplicationExecutor] Exited primary catch-up mode.
2017-08-20T05:28:23.649+0000 I REPL [rsSync] transition to primary complete; database writes are now permitted

How it works...
In step 1, we begin by starting the mongod process with two parameters. First, we provide the database path with --dbpath, which is quite standard for all mongod processes. Next, we provide the --replSet parameter with the value MyReplicaSet. This parameter starts the mongod process with the explicit instruction that it will be running as a replica set node, and that the unique name of this replica set is MyReplicaSet. MongoDB uses this name to identify the replica set cluster. It can be changed in the future, but doing so would require you to shut down all the nodes within the cluster.
In step 2, we open a different Terminal window and start a mongo shell connected to the aforementioned node. We check the replica set's status by running the rs.status() command. If you ever happen to work with replica sets, rs.status() will become the most frequent command you use for eons to come. I would also like to point out that all major replica set operations are available in the rs.<command>() format. To view your options, type rs. (with the trailing dot) and press the Tab key twice.
OK, coming back to the output of rs.status(): we can see that MongoDB is indicating that our replica set has not been initialized. We fix that by running the rs.initiate() command in step 3.
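Incidentally, rs.initiate() also accepts an explicit configuration document if you do not want to rely on the generated defaults. A minimal sketch, using the same host and set name as our example:

rs.initiate({
  _id: "MyReplicaSet",
  members: [
    { _id: 0, host: "vagrant-ubuntu-trusty-64:27017" }
  ]
})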
At this point, if you keep pressing the Enter key (without entering any command), you can see your mongo shell showing the node's transition to SECONDARY and then PRIMARY:

rs.initiate()
{
  "info2" : "no configuration specified. Using a default configuration for the set",
  "me" : "vagrant-ubuntu-trusty-64:27017",
  "ok" : 1
}
MyReplicaSet:SECONDARY>
MyReplicaSet:PRIMARY>
MyReplicaSet:PRIMARY>

From now on, every time you connect to this node, you will see the replica set name followed by the node's status. Next, we run the rs.status() command again, and this time we get the detailed status of the replica set's configuration. Let's go through some of the key values of the output:
set: This indicates the name of the replica set.
myState: This indicates the status of the current node in the replica set. The most common states you will encounter are as follows:
0 (STARTUP): The node is parsing its configuration and starting up
1 (PRIMARY): The node is the primary member of the cluster
2 (SECONDARY): The node is a secondary member of the cluster
3 (RECOVERING): The node is completing either a rollback or a resync after starting up
7 (ARBITER): The node is an arbiter; it does not store any data
8 (DOWN): The node is marked as DOWN, usually when it is unreachable
10 (REMOVED): The node has been removed from the replica set configuration
There are more MongoDB replica set states; they can be found at https://docs.mongodb.com/manual/reference/replica-states/.
heartbeatIntervalMillis: This indicates the frequency of health checks between nodes, in milliseconds.
members: An array containing the list of members currently in the replica set. Each member entry is accompanied by details about the member, such as its name, state, uptime, and an information message showing its current state. We will be looking at these more closely in future recipes in this chapter. For now, I just want you to get acquainted with this format.
Once we execute the rs.initiate() command, MongoDB figures out any configuration parameters associated with this replica set (in the form of a config file or mongod command-line flags) and initializes the replica set. In our case, we only mentioned the name of the replica set, MyReplicaSet, as a mongod parameter.
In step 5, by looking at the mongod process logs, we can observe the various stages the application goes through while trying to bring up a node in a replica set. The information is pretty verbose, so I will not go into detail.

Adding a node to the replica set
In this recipe, we will be looking at how to add a node to an existing replica set.

Getting ready
Ensure that you have a single node replica set running, as described in the first recipe of this chapter.

How to do it...
1. Assuming you have the node from the previous recipe already running, open a new Terminal and start a new replica set node:

mongod --dbpath /data/server2/db --replSet MyReplicaSet --port 27018

2. In another Terminal window, connect to the primary server using the mongo shell (replace the IP with that of your server):

mongo mongodb://192.168.200.200:27017
3. Check the number of members in the replica set:

rs.status()['members']
[
  {
    "_id" : 0,
    "name" : "vagrant-ubuntu-trusty-64:27017",
    "health" : 1,
    "state" : 1,
    "stateStr" : "PRIMARY",
    "uptime" : 36,
    "optime" : {
      "ts" : Timestamp(1503664489, 1),
      "t" : NumberLong(3)
    },
    "optimeDate" : ISODate("2017-08-25T12:34:49Z"),
    "infoMessage" : "could not find member to sync from",
    "electionTime" : Timestamp(1503664458, 1),
    "electionDate" : ISODate("2017-08-25T12:34:18Z"),
    "configVersion" : 1,
    "self" : true
  }
]

4. Add the new node to the replica set:

rs.add('192.168.200.200:27018')

5. Once again, check the members in the replica set:

{
  "_id" : 0,
  "name" : "vagrant-ubuntu-trusty-64:27017",
  "health" : 1,
  "state" : 1,
  "stateStr" : "PRIMARY",
  "uptime" : 71,
  "optime" : {
    "ts" : Timestamp(1503664527, 1),
    "t" : NumberLong(3)
  },
  "optimeDate" : ISODate("2017-08-25T12:35:27Z"),
  "infoMessage" : "could not find member to sync from",
  "electionTime" : Timestamp(1503664458, 1),
  "electionDate" : ISODate("2017-08-25T12:34:18Z"),
  "configVersion" : 2,
  "self" : true
},
{
  "_id" : 1,
  "name" : "192.168.200.200:27018",
  "health" : 1,
  "state" : 0,
  "stateStr" : "STARTUP",
  "uptime" : 1,
  "optime" : {
    "ts" : Timestamp(0, 0),
    "t" : NumberLong(-1)
  },
  "optimeDurable" : {
    "ts" : Timestamp(0, 0),
    "t" : NumberLong(-1)
  },
  "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
  "optimeDurableDate" : ISODate("1970-01-01T00:00:00Z"),
  "lastHeartbeat" : ISODate("2017-08-25T12:35:27.327Z"),
  "lastHeartbeatRecv" : ISODate("2017-08-25T12:35:27.378Z"),
  "pingMs" : NumberLong(0),
  "configVersion" : -2
}

How it works...
As mentioned earlier, this recipe assumes that you are already running the first (primary) node of your replica set, as shown in the previous recipe. In step 1, we start another instance of mongod, listening on a different port (27018). I just want to reiterate that, as this is a test setup, we will be running all instances of mongod on the same server; in a production setup, all replica set members should run on separate servers.
In step 3, we look at the output of the rs.status() command, more importantly the members array. As of now, although we have started a new instance, the primary replica set node is not aware of its existence. Therefore, the list of members shows only one member. Let's fix this.
In step 4, we run rs.add('192.168.200.200:27018') in the mongo shell that is connected to the primary node. The rs.add() method is a wrapper around the actual replSetReconfig command: it adds a node to the members array and reconfigures the replica set. We will look into replica set reconfiguration in future recipes.
Next, we look again at the output of the rs.status() command. If you inspect the members array, you will find our second member. If you run the command soon after rs.add(...), you may see the following:

"_id" : 1,
"name" : "192.168.200.200:27018",
"health" : 1,
"state" : 0,
"stateStr" : "STARTUP",

The "state" : 0 entry indicates that the member is parsing its configuration and starting up. If you run the rs.status() command again, this should change to "state" : 2, indicating that the node is now a secondary.
Keep an eye on the configVersion key of each member. Every change in the replica set's configuration increments the value of configVersion by one. This can be handy for identifying a member's current configuration state. To finish off this recipe, I would like you to start another instance of mongod on port 27019 and add it to the cluster.
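If you want to check your work, the commands mirror steps 1 and 4 above; a sketch reusing the /data/server3 directory we created in the first recipe:

mongod --dbpath /data/server3/db --replSet MyReplicaSet --port 27019

Then, from a mongo shell connected to the primary:

rs.add('192.168.200.200:27019')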
Removing a node from the replica set
In this recipe, we will be looking at how to remove a member from a replica set. If you have done the previous two recipes in this chapter, this should be a breeze.

Getting ready
For this recipe, we will need a three node replica set. If you don't have one ready, I suggest referring to the first two recipes of this chapter.

How to do it...
1. Open the mongo shell and log in to one of the nodes. Run rs.status() to find the primary node:

rs.status()['members']
[
  {
    "_id" : 0,
    "name" : "vagrant-ubuntu-trusty-64:27017",
    "health" : 1,
    "state" : 1,
    "stateStr" : "PRIMARY",
    "uptime" : 57933,
    "optime" : {
      "ts" : Timestamp(1503722389, 1),
      "t" : NumberLong(5)
    },
    "optimeDate" : ISODate("2017-08-26T04:39:49Z"),
    "electionTime" : Timestamp(1503721808, 1),
    "electionDate" : ISODate("2017-08-26T04:30:08Z"),
    "configVersion" : 3,
    "self" : true
  },
  {
    "_id" : 1,
    "name" : "192.168.200.200:27018",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "uptime" : 51609,
    "optime" : {
      "ts" : Timestamp(1503722389, 1),
      "t" : NumberLong(5)
    },
    "optimeDurable" : {
      "ts" : Timestamp(1503722389, 1),
      "t" : NumberLong(5)
    },
    "optimeDate" : ISODate("2017-08-26T04:39:49Z"),
    "optimeDurableDate" : ISODate("2017-08-26T04:39:49Z"),
    "lastHeartbeat" : ISODate("2017-08-26T04:39:51.239Z"),
    "lastHeartbeatRecv" : ISODate("2017-08-26T04:39:51.240Z"),
    "pingMs" : NumberLong(0),
    "syncingTo" : "vagrant-ubuntu-trusty-64:27017",
    "configVersion" : 3
  },
  {
    "_id" : 2,
    "name" : "192.168.200.200:27019",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "uptime" : 84,
    "optime" : {
      "ts" : Timestamp(1503722389, 1),
      "t" : NumberLong(5)
    },
    "optimeDurable" : {
      "ts" : Timestamp(1503722389, 1),
      "t" : NumberLong(5)
    },
    "optimeDate" : ISODate("2017-08-26T04:39:49Z"),
    "optimeDurableDate" : ISODate("2017-08-26T04:39:49Z"),
    "lastHeartbeat" : ISODate("2017-08-26T04:39:51.240Z"),
    "lastHeartbeatRecv" : ISODate("2017-08-26T04:39:51.307Z"),
    "pingMs" : NumberLong(0),
    "syncingTo" : "192.168.200.200:27018",
    "configVersion" : 3
  }
]

2. Run rs.remove() to remove the last node in the replica set:

rs.remove('192.168.200.200:27019')

3. Check the status of the replica set:

rs.status()['members']
[
  {
    "_id" : 0,
    "name" : "vagrant-ubuntu-trusty-64:27017",
    "health" : 1,
    "state" : 1,
    "stateStr" : "PRIMARY",
    "uptime" : 57998,
    "optime" : {
      "ts" : Timestamp(1503722449, 2),
      "t" : NumberLong(5)
    },
    "optimeDate" : ISODate("2017-08-26T04:40:49Z"),
    "electionTime" : Timestamp(1503721808, 1),
    "electionDate" : ISODate("2017-08-26T04:30:08Z"),
    "configVersion" : 4,
    "self" : true
  },
  {
    "_id" : 1,
    "name" : "192.168.200.200:27018",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "uptime" : 51673,
    "optime" : {
      "ts" : Timestamp(1503722449, 2),
      "t" : NumberLong(5)
    },
    "optimeDurable" : {
      "ts" : Timestamp(1503722449, 2),
      "t" : NumberLong(5)
    },
    "optimeDate" : ISODate("2017-08-26T04:40:49Z"),
    "optimeDurableDate" : ISODate("2017-08-26T04:40:49Z"),
    "lastHeartbeat" : ISODate("2017-08-26T04:40:55.956Z"),
    "lastHeartbeatRecv" : ISODate("2017-08-26T04:40:55.956Z"),
    "pingMs" : NumberLong(0),
    "syncingTo" : "vagrant-ubuntu-trusty-64:27017",
    "configVersion" : 4
  }
]
4. Connect to the third replica set node, the one we just removed, and check rs.status():

rs.status()
{
  "state" : 10,
  "stateStr" : "REMOVED",
  "uptime" : 338,
  "optime" : {
    "ts" : Timestamp(1503722619, 1),
    "t" : NumberLong(5)
  },
  "optimeDate" : ISODate("2017-08-26T04:43:39Z"),
  "ok" : 0,
  "errmsg" : "Our replica set config is invalid or we are not a member of it",
  "code" : 93,
  "codeName" : "InvalidReplicaSetConfig"
}
MyReplicaSet:OTHER>

How it works...
In step 1, we connect to one of the three replica set members and check the replica set status. We want to ensure two things: first, that the node we are connected to is the primary, and second, that the node we want to remove is a secondary. You cannot remove a primary node from the replica set; you need to force it into becoming a secondary first and then remove it. We will look more closely at how to do this in the Switching between primary and secondary nodes recipe in this chapter.
Now that we've determined that we are connected to the primary node, in step 2 we remove one node from the replica set. By running rs.remove() with the IP and port of the node, we remove the node from the replica set.
In step 3, we confirm that the node has been removed by running rs.status() to get the list of configured nodes in the cluster.
Finally, in step 4, we connect to the mongo shell of the node that we just removed. As soon as you log in, you can observe that the console prompt shows OTHER instead of PRIMARY or SECONDARY. Also, the rs.status() command's output confirms that the node is in state 10 (REMOVED), indicating that this node is no longer in the replica set cluster. At this point, I would also like you to go through the mongod logs of this node and observe the sequence of events that occur when we run rs.remove():

2017-08-26T04:40:51.338+0000 I REPL [ReplicationExecutor] Cannot find self in new replica set configuration; I must be removed; NodeNotFound: No host described in new configuration 4 for replica set MyReplicaSet maps to this node
2017-08-26T04:40:51.339+0000 I REPL [ReplicationExecutor] New replica set config in use: { _id: "MyReplicaSet", version: 4, protocolVersion: 1, members: [ { _id: 0, host: "vagrant-ubuntu-trusty-64:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "192.168.200.200:27018", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('59991df64db063a571ae8680') } }
2017-08-26T04:40:51.339+0000 I REPL [ReplicationExecutor] This node is not a member of the config
2017-08-26T04:40:51.339+0000 I REPL [ReplicationExecutor] transition to REMOVED

As we ran rs.remove('192.168.200.200:27019') on the primary node, a new configuration was generated. This configuration is sent to all new or existing nodes of the replica set, and the relevant changes are implemented. In the log output shown previously, you can see that the replica set node received the new configuration and figured out that it had been removed from the replica set cluster. It then reconfigured itself and transitioned to the REMOVED state.
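Incidentally, the removed mongod keeps running in this state; should you change your mind, adding it back is a one-liner from the primary (assuming the node kept the same address):

rs.add('192.168.200.200:27019')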
Working with an arbiter
In MongoDB, nodes within replica sets perform elections to select a primary node. To ensure there is always a majority of voting nodes, you can add an arbiter to the replica set. An arbiter is a mongod instance that does not store data and is only involved in voting during an election process. This can prove very useful, especially during network partitions that could otherwise result in tied votes.

Getting ready
We can continue on from the previous recipe, in that all we need is a two node replica set.

How to do it...
1. Create the directories for the arbiter process:

mkdir -p /data/arbiter/db

2. Start the arbiter process:

mongod --dbpath /data/arbiter/db --replSet MyReplicaSet --port 30000

3. Open a new Terminal window and connect to the primary node:

mongo mongodb://192.168.200.200:27017

4. Add the arbiter:

rs.addArb('192.168.200.200:30000')

5. Check the members of the replica set:

rs.status()['members']
[
  {
    "_id" : 0,
    "name" : "vagrant-ubuntu-trusty-64:27017",
    "health" : 1,
    "state" : 1,
    "stateStr" : "PRIMARY",
    "uptime" : 61635,
    "optime" : {
      "ts" : Timestamp(1503726090, 1),
      "t" : NumberLong(8)
    },
    "optimeDate" : ISODate("2017-08-26T05:41:30Z"),
    "electionTime" : Timestamp(1503725438, 1),
    "electionDate" : ISODate("2017-08-26T05:30:38Z"),
    "configVersion" : 5,
    "self" : true
  },
  {
    "_id" : 1,
    "name" : "192.168.200.200:27018",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "uptime" : 1214,
    "optime" : {
      "ts" : Timestamp(1503726090, 1),
      "t" : NumberLong(8)
    },
    "optimeDurable" : {
      "ts" : Timestamp(1503726090, 1),
      "t" : NumberLong(8)
    },
    "optimeDate" : ISODate("2017-08-26T05:41:30Z"),
    "optimeDurableDate" : ISODate("2017-08-26T05:41:30Z"),
    "lastHeartbeat" : ISODate("2017-08-26T05:41:32.024Z"),
    "lastHeartbeatRecv" : ISODate("2017-08-26T05:41:30.034Z"),
    "pingMs" : NumberLong(0),
    "configVersion" : 5
  },
  {
    "_id" : 2,
    "name" : "192.168.200.200:30000",
    "health" : 1,
    "state" : 7,
    "stateStr" : "ARBITER",
    "uptime" : 3,
    "lastHeartbeat" : ISODate("2017-08-26T05:41:32.025Z"),
    "lastHeartbeatRecv" : ISODate("2017-08-26T05:41:30.034Z"),
    "pingMs" : NumberLong(0),
    "configVersion" : 5
  }
]

How it works...
We begin by creating the directories for the arbiter process. As mentioned at the beginning of this recipe, an arbiter is nothing but a mongod process that will not store any data. However, it does need to store some metadata about itself, and hence a minimal amount of state has to be maintained. For this purpose, in step 2, we provide the --dbpath parameter with a location to store its data, along with an arbitrary port, 30000. In step 3, we connect to the primary node of our replica set, and in step 4, we use the rs.addArb() wrapper to add the new arbiter.
Next, in step 5, we check the status of the replica set; lo and behold, the mighty arbiter has been added to the replica set. If you look at the state and stateStr keys, you will see that this member's state is set to 7, which confirms it is an arbiter.

Switching between primary and secondary nodes
In this recipe, we will be looking at how to force a primary node to become secondary and vice versa. Let's get to it then.

Getting ready
We need a three node replica set, preferably without an arbiter. If you have followed the previous recipes, you should have three mongod instances running on the same server on three different ports: 27017, 27018, and 27019. In order to keep things simple, we will call them node 1, node 2, and node 3 respectively.
Here, we assume that node 1 is the primary, whereas node 2 and node 3 are secondaries. In the first part of this recipe, we will force node 1 to become a secondary. Assuming that node 3 then gets elected as the primary, in the second part of the recipe we will make node 1 the primary again.

How to do it...
1. Connect to the primary member (node 1) of the replica set:

mongo mongodb://192.168.200.200:27017

2. Force it to become a secondary:

rs.stepDown()

3. Confirm the member is now a secondary:

rs.isMaster()['ismaster']

4. Log in to node 2, assuming it is a secondary, and prevent it from getting elected:

mongo mongodb://192.168.200.200:27018
rs.freeze(120)

5. Log in to the newly elected primary node (node 3) of the replica set and force it to step down:

mongo mongodb://192.168.200.200:27019
rs.stepDown()

6. Prevent it from getting elected again:

rs.freeze(120)

7. Check that the desired node (node 1) is now the primary:

mongo mongodb://192.168.200.200:27017
rs.isMaster()['ismaster']

How it works...
Forcing a primary node to step down is a fairly straightforward process. As shown in steps 1 and 2, we just need to log in to the primary node and run the rs.stepDown() command. This forces the node to become a secondary and initiates an election in the replica set. Within a few seconds (or less), one of the secondary nodes will be elected as the new primary. In this recipe, we assume that node 3 got elected as the new primary.
In step 3, we run another neat little helper, rs.isMaster(), and look for the value of the ismaster key. If its value is set to true, then the current node is a primary; otherwise, it is a secondary.
For the next part, we work towards converting a particular secondary node into a primary. This involves a new command, rs.freeze(). This wrapper executes the replSetFreeze command, which prevents the member from seeking election. So, our strategy is to prevent all nodes from seeking election, except the one that we want to become the primary. We do exactly that in step 4. Here, we log in to node 2 and run rs.freeze(120), which prevents it from seeking election for the next 120 seconds.
Next, in step 5, we log in to our newly elected primary, node 3, and make it step down as primary. In step 6, we run rs.freeze(120) on it as well, preventing it from seeking election for the next 120 seconds. Once done, we confirm that node 1 is now our primary, as expected. All hail Cthulhu!

Changing replica set configuration
Up until now, we have been performing replica set modifications using helper functions like rs.add(), rs.remove(), and so on. As mentioned earlier, these functions are wrappers that modify the replica set configuration. In this recipe, we will be looking at how to fetch and change the replica set configuration directly. This can be helpful for various operations, like setting priorities, setting up delayed nodes, changing member hostnames, and so on.

Getting ready
For this recipe, you will need a three node replica set.

How to do it...
1. Connect to the primary member of the replica set using the mongo shell:

mongo mongodb://192.168.200.200:27017

2. Fetch the configuration:

conf = rs.conf()

3. Remove the third member of the replica set:

conf['members'].pop(2)

4. Reconfigure the replica set:

rs.reconfig(conf)

5. Confirm that the third node was removed by inspecting the output of rs.status():

rs.status()['members']
6. Add the third node back to the replica set:

member = {"_id": 2, "host": "192.168.200.200:27019"}
conf['members'].push(member)

7. Reconfigure the replica set:

rs.reconfig(conf)

8. Confirm that the addition was successful:

rs.status()['members']

How it works...
As in our previous recipes, replica set configuration operations can only be performed on the primary node. Once we connect to the primary node, we fetch the running configuration of the replica set using rs.conf(). In step 2, we store the value of rs.conf() in a variable called conf. The replica set configuration is a JavaScript object, and we can therefore modify it within the mongo shell. The configuration contains an array of members, so in order to remove a member, we simply have to remove its entry from the array and reload the configuration with the new values.
In step 3, we use the native JavaScript pop() method to remove the last entry from the members array, which in our case is the third member (note that array indexes start from zero). Strictly speaking, pop() ignores its argument and always removes the last element; to remove a member at an arbitrary position, use splice() instead, for example conf['members'].splice(1, 1). Next, in step 4, we run the rs.reconfig() function, providing it with the modified configuration. This function reloads the configuration, and in step 5 we can confirm that the node was indeed removed.
In step 6, we create an object that contains the _id and host entries for the node that we wish to add, and append it to the configuration's members array. Finally, in step 7, we reload the configuration again and confirm that the node was added back to the replica set.

Changing priority to replica set nodes
By now, you will have noticed the priority key in the replica set configuration. Replica set members with higher priorities are more likely to be elected as the primary. The value of priority can range from 0 to 1000, where 0 indicates a member that can never become the primary. Such a member otherwise functions as a regular member of the replica set and can still vote in elections; to make it non-voting as well, you would also set its votes value to 0.

Getting ready
For this recipe, we need a three node replica set.

How to do it...
1. Connect to the primary member of the replica set using the mongo shell:

mongo mongodb://192.168.200.200:27017

2. Fetch the configuration:

conf = rs.conf()

3. Change the priorities of all members:

conf['members'][0].priority = 5
conf['members'][1].priority = 2
conf['members'][2].priority = 2

4. Reconfigure the replica set:

rs.reconfig(conf)

5. Check the new configuration:

rs.conf()['members']

How it works...
As in our previous recipe, we connect to the primary node and fetch the replica set configuration object. Next, in step 3, we modify the value of the priority key of each member in the members array. In step 4, we reconfigure the replica set, and lastly, in step 5, we confirm that the changes have taken effect by inspecting the output of the rs.conf() command.
So, why would you need to set priorities in the first place? Well, there can be various circumstances in which you need control over which member gets elected as the primary. As a simple example, say you need to perform sequential maintenance on your replica set members; with priorities, you can control which node becomes primary if an election kicks in during the maintenance.

There's more...
Along with priority, we can also set up delayed and hidden members in replica sets. We will be looking closely at how to set these up later in the book, in Chapter 7, Restoring MongoDB from Backups.
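As a quick preview, both are plain configuration changes of the same kind we made above. A minimal sketch (note that a hidden or delayed member must have its priority set to 0; in this MongoDB version, the delay field is called slaveDelay):

conf = rs.conf()
conf['members'][2].priority = 0
conf['members'][2].hidden = true
conf['members'][2].slaveDelay = 3600  // replicate with a one-hour delay
rs.reconfig(conf)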
5
High Scalability with Sharding
In this chapter, we will cover the following recipes:
Setting up and configuring a sharded cluster
Managing chunks
Moving non-sharded collection data from one shard to another
Removing a shard from the cluster
Understanding tag aware sharding – zones
Understanding sharding and its components

In the previous chapter, we saw how MongoDB provides high availability using replica sets. Replica sets also allow the distribution of read queries across secondaries, thus providing a fair bit of load distribution across a cluster of nodes. We have also seen that MongoDB performs most optimally if its working set can fit in memory with minimal disk operations. However, as databases grow, it becomes harder to provision servers that can effectively fit the entire working set in memory. This is one of the most common scalability problems faced by growing organizations. To address this, MongoDB provides sharding of collections. Sharding allows dividing the data into smaller chunks and distributing it across multiple machines.

Components of MongoDB's sharding infrastructure
Unlike a replica set, a sharded MongoDB cluster consists of multiple components.

Config server
The config server is used to store metadata about the sharded cluster. It contains details about authorizations, as well as the admin and config databases. The metadata stored in the config server is read by mongos and the shards, making its role extremely important to the operation of the sharded cluster. It is therefore highly recommended that the config server be set up as a replica set, with appropriate backup and monitoring configured.

The mongos query router
MongoDB's mongos server acts as an interface between the application and the sharded cluster. First, it gathers information (metadata) about the sharded cluster from the config server. Once it has the relevant information about the sharded cluster, it acts as a proxy for all read and write operations on the cluster. That is, applications only ever talk to the mongos server and never directly to a shard.
More information on how mongos routes queries can be found at: https://docs.mongodb.com/manual/core/sharded-cluster-query-router/.

The shard server
The shard server is nothing but a mongod instance executed with the --shardsvr switch. The config server delegates chunks to each shard server based on the shard key used for the collection. All queries executed on a shard have to originate from the mongos query router; applications should never communicate directly with a standalone shard.

Choosing the shard key
In order to partition data across multiple shards, MongoDB uses a shard key. This is an immutable key that can be used to identify a document within a sharded collection. Based on the boundaries of the shard key, the data is divided into chunks and spread across multiple shards within the cluster. It is important to note that MongoDB provides sharding at the collection level and that a sharded collection can have only one shard key. As shard keys are immutable, we cannot change a key once it is set; it is therefore extremely important to plan shard keys properly before setting up a sharded cluster.
MongoDB provides two sharding strategies: hashed shard keys and ranged shard keys. With a hashed shard key, MongoDB computes and indexes the hash of the shard key's value; the data is then evenly distributed across the cluster. So, at the expense of broadcast queries, we can achieve an even distribution of data across all shards.
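For instance, using the sh.shardCollection() helper that we will meet in the next recipe, a hashed shard key is declared along these lines (a sketch with a hypothetical users collection):

sh.shardCollection('myShardedDB.users', { _id: 'hashed' })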
A ranged shard key is the default strategy used by MongoDB. In this strategy, MongoDB splits the key's range into chunks and distributes those chunks accordingly. This increases the chance that documents whose key values lie close together are stored on the same shard. In such cases, queries need not be broadcast to all the shards, and DB operations become faster. However, this can also lead to shards getting overloaded for certain kinds of keys. For example, if we use a ranged key on language and keep adding a high number of documents for English speaking users, the shard holding that key range would receive all of those documents, so there is a good chance that the document distribution would be uneven. It is extremely important to plan out your sharding strategy far in advance; all aspects of your application must be thoroughly understood before choosing a shard key strategy.
More information about shard key specifications can be found at: https://docs.mongodb.com/manual/core/sharding-shard-key.

Setting up and configuring a sharded cluster
In this recipe, we will look at how to set up a sharded cluster in MongoDB. The cluster includes config servers, shards, and mongos servers. As this is a test setup, we will be running all the relevant binaries from a single virtual machine; however, in production, they should be located on separate nodes. Next, we will look at how to enable sharding on a database, followed by sharding an actual collection. Once the sharded cluster is ready, we will import some data into the cluster and execute queries that will give us a glimpse of how the data is partitioned across the shards. Much fun awaits, let's get started!

Getting ready
There are no additional components required besides the standard MongoDB binaries. Create the following directories in advance for the config server as well as the shards:

mkdir -p /data/{cfgserver1,shard1,shard2,shard3}/data

How to do it...
1. Start the config server:

mongod --configsvr --dbpath /data/cfgserver1/data --port 27019 --replSet MyConfigRepl

2. Initialize the config server replica set:

mongo localhost:27019
rs.initiate()
{
  "info2" : "no configuration specified. Using a default configuration for the set",
  "me" : "vagrant-ubuntu-trusty-64:27019",
  "ok" : 1
}
rs.status()['configsvr']
true

3. Start three shard servers:

mongod --shardsvr --dbpath /data/shard1/data --port 27027
mongod --shardsvr --dbpath /data/shard2/data --port 27028
mongod --shardsvr --dbpath /data/shard3/data --port 27029

4. Start the mongos query router:

mongos --configdb MyConfigRepl/192.168.200.200:27019

5. Connect to the mongos server:

mongo mongodb://127.0.0.1:27017
Then, add the shards to the cluster:

sh.addShard('192.168.200.200:27027')
{ "shardAdded" : "shard0000", "ok" : 1 }
sh.addShard('192.168.200.200:27028')
{ "shardAdded" : "shard0001", "ok" : 1 }
sh.addShard('192.168.200.200:27029')
{ "shardAdded" : "shard0002", "ok" : 1 }
sh.status()
--- Sharding Status ---
sharding version: {
  "_id" : 1,
  "minCompatibleVersion" : 5,
  "currentVersion" : 6,
  "clusterId" : ObjectId("59c7950c9be3cff24816915a")
}
shards:
  { "_id" : "shard0000", "host" : "192.168.200.200:27027", "state" : 1 }
  { "_id" : "shard0001", "host" : "192.168.200.200:27028", "state" : 1 }
  { "_id" : "shard0002", "host" : "192.168.200.200:27029", "state" : 1 }
<-- output truncated -->

6. Enable sharding for a database:

sh.enableSharding('myShardedDB')
{ "ok" : 1 }
sh.status()
--- Sharding Status ---
<-- output truncated -->
databases:
  { "_id" : "myShardedDB", "primary" : "shard0001", "partitioned" : true }

7. Shard a collection:

sh.shardCollection('myShardedDB.people', {language: 1})
{ "collectionsharded" : "myShardedDB.people", "ok" : 1 }
sh.status()
--- Sharding Status ---
sharding version: {
  "_id" : 1,
  "minCompatibleVersion" : 5,
  "currentVersion" : 6,
  "clusterId" : ObjectId("59c7950c9be3cff24816915a")
}
shards:
  { "_id" : "shard0000", "host" : "192.168.200.200:27027", "state" : 1 }
  { "_id" : "shard0001", "host" : "192.168.200.200:27028", "state" : 1 }
  { "_id" : "shard0002", "host" : "192.168.200.200:27029", "state" : 1 }
<-- output truncated -->
databases:
  { "_id" : "myShardedDB", "primary" : "shard0001", "partitioned" : true }
  myShardedDB.people
    shard key: { "language" : 1 }
    unique: false
    balancing: true
    chunks:
      shard0001 1
    { "language" : { "$minKey" : 1 } } -->> { "language" : { "$maxKey" : 1 } } on : shard0001 Timestamp(1, 0)

8. Add some data to our database:

mongoimport -h 192.168.200.200 --type csv --headerline -d myShardedDB -c people chapter_2_mock_data.csv

9. Inspect the data distribution:

sh.status()
--- Sharding Status ---
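Besides sh.status(), the mongo shell also offers a per-shard summary of a collection's data and chunk counts; a quick sketch of its use against our example collection:

mongos> use myShardedDB
mongos> db.people.getShardDistribution()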