Hadoop Beginner's Guide
Learn how to crunch big data to extract meaning from the
data avalanche
Garry Turkington
BIRMINGHAM - MUMBAI
Hadoop Beginner's Guide
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers
and distributors will be held liable for any damages caused or alleged to be caused directly or
indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2013
Production Reference: 1150213
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-730-0
www.packtpub.com
Cover Image by Asher Wishkerman (a.wishkerman@mpic.de)
Credits
Author
Garry Turkington

Reviewers
David Gruzman
Muthusamy Manigandan
Vidyasagar N V

Acquisition Editor
Robin de Jongh

Lead Technical Editor
Azharuddin Sheikh

Technical Editors
Ankita Meshram
Varun Pius Rodrigues

Copy Editors
Brandt D'Mello
Aditya Nair
Laxmi Subramanian
Ruta Waghmare

Project Coordinator
Leena Purkait

Proofreader
Maria Gould

Indexer
Hemangini Bari

Production Coordinator
Nitesh Thakur

Cover Work
Nitesh Thakur
About the Author
Garry Turkington has 14 years of industry experience, most of which has been focused
on the design and implementation of large-scale distributed systems. In his current roles as
VP Data Engineering and Lead Architect at Improve Digital, he is primarily responsible for
the realization of systems that store, process, and extract value from the company's large
data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led
several software development teams building systems that process Amazon catalog data for
every item worldwide. Prior to this, he spent a decade in various government positions in
both the UK and USA.
He has BSc and PhD degrees in Computer Science from the Queens University of Belfast in
Northern Ireland and an MEng in Systems Engineering from Stevens Institute of Technology
in the USA.
I would like to thank my wife Lea for her support and encouragement—not
to mention her patience—throughout the writing of this book and my
daughter, Maya, whose spirit and curiosity are more of an inspiration than
she could ever imagine.
About the Reviewers
David Gruzman is a Hadoop and big data architect with more than 18 years of hands-on
experience, specializing in the design and implementation of scalable high-performance
distributed systems. He has extensive expertise in OOA/OOD and (R)DBMS technology. He
is an Agile methodology adept and strongly believes that a daily coding routine makes good
software architects. He is interested in solving challenging problems related to real-time
analytics and the application of machine learning algorithms to big data sets.
He founded—and is working with—BigDataCraft.com, a boutique consulting firm in the area
of big data. Visit their site at www.bigdatacraft.com. David can be contacted at
david@bigdatacraft.com. More detailed information about his skills and experience can be
found at http://www.linkedin.com/in/davidgruzman.
Muthusamy Manigandan is a systems architect for a startup. Prior to this, he was a Staff
Engineer at VMWare and Principal Engineer with Oracle. Mani has been programming for
the past 14 years on large-scale distributed-computing applications. His areas of interest are
machine learning and algorithms.
Vidyasagar N V has been interested in computer science since an early age. Some of his
serious work in computers and computer networks began during his high school days. Later,
he went to the prestigious Institute Of Technology, Banaras Hindu University, for his B.Tech.
He has been working as a software developer and data expert, developing and building
scalable systems. He has worked with a variety of second, third, and fourth generation
languages. He has worked with flat files, indexed files, hierarchical databases, network
databases, relational databases, NoSQL databases, Hadoop, and related technologies.
Currently, he is working as Senior Developer at Collective Inc., developing big data-based
structured data extraction techniques from the Web and local information. He enjoys
producing high-quality software and web-based solutions and designing secure and
scalable data systems. He can be contacted at vidyasagar1729@gmail.com.
I would like to thank the Almighty, my parents, Mr. N Srinivasa Rao and
Mrs. Latha Rao, and my family who supported and backed me throughout
my life. I would also like to thank my friends for being good friends and
all those people willing to donate their time, effort, and expertise by
participating in open source software projects. Thank you, Packt Publishing
for selecting me as one of the technical reviewers for this wonderful book.
It is my honor to be a part of it.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related
to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com and as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view nine entirely free books. Simply use your login credentials for
immediate access.
Table of Contents

Preface

Chapter 1: What It's All About
    Big data processing
    The value of data
    Historically for the few and not the many
    Classic data processing systems
    Limiting factors
    A different approach
    All roads lead to scale-out
    Share nothing
    Expect failure
    Smart software, dumb hardware
    Move processing, not data
    Build applications, not infrastructure
    Hadoop
    Thanks, Google
    Thanks, Doug
    Thanks, Yahoo
    Parts of Hadoop
    Common building blocks
    HDFS
    MapReduce
    Better together
    Common architecture
    What it is and isn't good for
    Cloud computing with Amazon Web Services
    Too many clouds
    A third way
    Different types of costs
    AWS – infrastructure on demand from Amazon
    Elastic Compute Cloud (EC2)
    Simple Storage Service (S3)
    Elastic MapReduce (EMR)
    What this book covers
    A dual approach
    Summary

Chapter 2: Getting Hadoop Up and Running
    Hadoop on a local Ubuntu host
    Other operating systems
    Time for action – checking the prerequisites
    Setting up Hadoop
    A note on versions
    Time for action – downloading Hadoop
    Time for action – setting up SSH
    Configuring and running Hadoop
    Time for action – using Hadoop to calculate Pi
    Three modes
    Time for action – configuring the pseudo-distributed mode
    Configuring the base directory and formatting the filesystem
    Time for action – changing the base HDFS directory
    Time for action – formatting the NameNode
    Starting and using Hadoop
    Time for action – starting Hadoop
    Time for action – using HDFS
    Time for action – WordCount, the Hello World of MapReduce
    Monitoring Hadoop from the browser
    The HDFS web UI
    Using Elastic MapReduce
    Setting up an account on Amazon Web Services
    Creating an AWS account
    Signing up for the necessary services
    Time for action – WordCount in EMR using the management console
    Other ways of using EMR
    AWS credentials
    The EMR command-line tools
    The AWS ecosystem
    Comparison of local versus EMR Hadoop
    Summary

Chapter 3: Understanding MapReduce
    Key/value pairs
    What it means
    Why key/value data?
    Some real-world examples
    MapReduce as a series of key/value transformations
    The Hadoop Java API for MapReduce
    The 0.20 MapReduce Java API
    The Mapper class
    The Reducer class
    The Driver class
    Writing MapReduce programs
    Time for action – setting up the classpath
    Time for action – implementing WordCount
    Time for action – building a JAR file
    Time for action – running WordCount on a local Hadoop cluster
    Time for action – running WordCount on EMR
    The pre-0.20 Java MapReduce API
    Hadoop-provided mapper and reducer implementations
    Time for action – WordCount the easy way
    Walking through a run of WordCount
    Startup
    Splitting the input
    Task assignment
    Task startup
    Ongoing JobTracker monitoring
    Mapper input
    Mapper execution
    Mapper output and reduce input
    Partitioning
    The optional partition function
    Reducer input
    Reducer execution
    Reducer output
    Shutdown
    That's all there is to it!
    Apart from the combiner…maybe
    Why have a combiner?
    Time for action – WordCount with a combiner
    When you can use the reducer as the combiner
    Time for action – fixing WordCount to work with a combiner
    Reuse is your friend
    Hadoop-specific data types
    The Writable and WritableComparable interfaces
    Introducing the wrapper classes
    Primitive wrapper classes
    Array wrapper classes
    Map wrapper classes
    Time for action – using the Writable wrapper classes
    Other wrapper classes
    Making your own
    Input/output
    Files, splits, and records
    InputFormat and RecordReader
    Hadoop-provided InputFormat
    Hadoop-provided RecordReader
    Output formats and RecordWriter
    Hadoop-provided OutputFormat
    Don't forget Sequence files
    Summary

Chapter 4: Developing MapReduce Programs
    Using languages other than Java with Hadoop
    How Hadoop Streaming works
    Why to use Hadoop Streaming
    Time for action – WordCount using Streaming
    Differences in jobs when using Streaming
    Analyzing a large dataset
    Getting the UFO sighting dataset
    Getting a feel for the dataset
    Time for action – summarizing the UFO data
    Examining UFO shapes
    Time for action – summarizing the shape data
    Time for action – correlating sighting duration to UFO shape
    Using Streaming scripts outside Hadoop
    Time for action – performing the shape/time analysis from the command line
    Java shape and location analysis
    Time for action – using ChainMapper for field validation/analysis
    Too many abbreviations
    Using the Distributed Cache
    Time for action – using the Distributed Cache to improve location output
    Counters, status, and other output
    Time for action – creating counters, task states, and writing log output
    Too much information!
    Summary

Chapter 5: Advanced MapReduce Techniques
    Simple, advanced, and in-between
    Joins
    When this is a bad idea
    Map-side versus reduce-side joins
    Matching account and sales information
    Time for action – reduce-side joins using MultipleInputs
    DataJoinMapper and TaggedMapperOutput
    Implementing map-side joins
    Using the Distributed Cache
    Pruning data to fit in the cache
    Using a data representation instead of raw data
    Using multiple mappers
    To join or not to join...
    Graph algorithms
    Graph 101
    Graphs and MapReduce – a match made somewhere
    Representing a graph
    Time for action – representing the graph
    Overview of the algorithm
    The mapper
    The reducer
    Iterative application
    Time for action – creating the source code
    Time for action – the first run
    Time for action – the second run
    Time for action – the third run
    Time for action – the fourth and last run
    Running multiple jobs
    Final thoughts on graphs
    Using language-independent data structures
    Candidate technologies
    Introducing Avro
    Time for action – getting and installing Avro
    Avro and schemas
    Time for action – defining the schema
    Time for action – creating the source Avro data with Ruby
    Time for action – consuming the Avro data with Java
    Using Avro within MapReduce
    Time for action – generating shape summaries in MapReduce
    Time for action – examining the output data with Ruby
    Time for action – examining the output data with Java
    Going further with Avro
    Summary

Chapter 6: When Things Break
    Failure
    Embrace failure
    Or at least don't fear it
    Don't try this at home
    Types of failure
    Hadoop node failure
    The dfsadmin command
    Cluster setup, test files, and block sizes
    Fault tolerance and Elastic MapReduce
    Time for action – killing a DataNode process
    NameNode and DataNode communication
    Time for action – the replication factor in action
    Time for action – intentionally causing missing blocks
    When data may be lost
    Block corruption
    Time for action – killing a TaskTracker process
    Comparing the DataNode and TaskTracker failures
    Permanent failure
    Killing the cluster masters
    Time for action – killing the JobTracker
    Starting a replacement JobTracker
    Time for action – killing the NameNode process
    Starting a replacement NameNode
    The role of the NameNode in more detail
    File systems, files, blocks, and nodes
    The single most important piece of data in the cluster – fsimage
    DataNode startup
    Safe mode
    SecondaryNameNode
    So what to do when the NameNode process has a critical failure?
    BackupNode/CheckpointNode and NameNode HA
    Hardware failure
    Host failure
    Host corruption
    The risk of correlated failures
    Task failure due to software
    Failure of slow running tasks
    Time for action – causing task failure
    Hadoop's handling of slow-running tasks
    Speculative execution
    Hadoop's handling of failing tasks
    Task failure due to data
    Handling dirty data through code
    Using Hadoop's skip mode
    Time for action – handling dirty data by using skip mode
    To skip or not to skip...
    Summary

Chapter 7: Keeping Things Running
    A note on EMR
    Hadoop configuration properties
    Default values
    Time for action – browsing default properties
    Additional property elements
    Default storage location
    Where to set properties
    Setting up a cluster
    How many hosts?
    Calculating usable space on a node
    Location of the master nodes
    Sizing hardware
    Processor / memory / storage ratio
    EMR as a prototyping platform
    Special node requirements
    Storage types
    Commodity versus enterprise class storage
    Single disk versus RAID
    Finding the balance
    Network storage
    Hadoop networking configuration
    How blocks are placed
    Rack awareness
    Time for action – examining the default rack configuration
    Time for action – adding a rack awareness script
    What is commodity hardware anyway?
    Cluster access control
    The Hadoop security model
    Time for action – demonstrating the default security
    User identity
    More granular access control
    Working around the security model via physical access control
    Managing the NameNode
    Configuring multiple locations for the fsimage class
    Time for action – adding an additional fsimage location
    Where to write the fsimage copies
    Swapping to another NameNode host
    Having things ready before disaster strikes
    Time for action – swapping to a new NameNode host
    Don't celebrate quite yet!
    What about MapReduce?
    Managing HDFS
    Where to write data
    Using balancer
    When to rebalance
    MapReduce management
    Command line job management
    Job priorities and scheduling
    Time for action – changing job priorities and killing a job
    Alternative schedulers
    Capacity Scheduler
    Fair Scheduler
    Enabling alternative schedulers
    When to use alternative schedulers
    Scaling
    Adding capacity to a local Hadoop cluster
    Adding capacity to an EMR job flow
    Expanding a running job flow
    Summary

Chapter 8: A Relational View on Data with Hive
    Overview of Hive
    Why use Hive?
    Thanks, Facebook!
    Setting up Hive
    Prerequisites
    Getting Hive
    Time for action – installing Hive
    Using Hive
    Time for action – creating a table for the UFO data
    Time for action – inserting the UFO data
    Validating the data
    Time for action – validating the table
    Time for action – redefining the table with the correct column separator
    Hive tables – real or not?
    Time for action – creating a table from an existing file
    Time for action – performing a join
    Hive and SQL views
    Time for action – using views
    Handling dirty data in Hive
    Time for action – exporting query output
    Partitioning the table
    Time for action – making a partitioned UFO sighting table
    Bucketing, clustering, and sorting... oh my!
    User Defined Function
    Time for action – adding a new User Defined Function (UDF)
    To preprocess or not to preprocess...
    Hive versus Pig
    What we didn't cover
    Hive on Amazon Web Services
    Time for action – running UFO analysis on EMR
    Using interactive job flows for development
    Integration with other AWS products
    Summary

Chapter 9: Working with Relational Databases
    Common data paths
    Hadoop as an archive store
    Hadoop as a preprocessing step
    Hadoop as a data input tool
    The serpent eats its own tail
    Setting up MySQL
    Time for action – installing and setting up MySQL
    Did it have to be so hard?
    Time for action – configuring MySQL to allow remote connections
    Don't do this in production!
    Time for action – setting up the employee database
    Be careful with data file access rights
    Getting data into Hadoop
    Using MySQL tools and manual import
    Accessing the database from the mapper
    A better way – introducing Sqoop
    Time for action – downloading and configuring Sqoop
    Sqoop and Hadoop versions
    Sqoop and HDFS
    Time for action – exporting data from MySQL to HDFS
    Sqoop's architecture
    Importing data into Hive using Sqoop
    Time for action – exporting data from MySQL into Hive
    Time for action – a more selective import
    Datatype issues
    Time for action – using a type mapping
    Time for action – importing data from a raw query
    Sqoop and Hive partitions
    Field and line terminators
    Getting data out of Hadoop
    Writing data from within the reducer
    Writing SQL import files from the reducer
    A better way – Sqoop again
    Time for action – importing data from Hadoop into MySQL
    Differences between Sqoop imports and exports
    Inserts versus updates
    Sqoop and Hive exports
    Time for action – importing Hive data into MySQL
    Time for action – fixing the mapping and re-running the export
    Other Sqoop features
    AWS considerations
    Considering RDS
    Summary

Chapter 10: Data Collection with Flume
    A note about AWS
    Data data everywhere
    Types of data
    Getting network traffic into Hadoop
    Time for action – getting web server data into Hadoop
    Getting files into Hadoop
    Hidden issues
    Keeping network data on the network
    Hadoop dependencies
    Reliability
    Re-creating the wheel
    A common framework approach
    Introducing Apache Flume
    A note on versioning
    Time for action – installing and configuring Flume
    Using Flume to capture network data
    Time for action – capturing network traffic to a log file
    Time for action – logging to the console
    Writing network data to log files
    Time for action – capturing the output of a command in a flat file
    Logs versus files
    Time for action – capturing a remote file in a local flat file
    Sources, sinks, and channels
    Sources
    Sinks
    Channels
    Or roll your own
    Understanding the Flume configuration files
    It's all about events
    Time for action – writing network traffic onto HDFS
    Time for action – adding timestamps
    To Sqoop or to Flume...
    Time for action – multi level Flume networks
    Time for action – writing to multiple sinks
    Selectors replicating and multiplexing
    Handling sink failure
    Next, the world
    The bigger picture
    Data lifecycle
    Staging data
    Scheduling
    Summary

Chapter 11: Where to Go Next
    What we did and didn't cover in this book
    Upcoming Hadoop changes
    Alternative distributions
    Why alternative distributions?
    Bundling
    Free and commercial extensions
    Choosing a distribution
    Other Apache projects
    HBase
    Oozie
    Whir
    Mahout
    MRUnit
    Other programming abstractions
    Pig
    Cascading
    AWS resources
    HBase on EMR
    SimpleDB
    DynamoDB
    Sources of information
    Source code
    Mailing lists and forums
    LinkedIn groups
    HUGs
    Conferences
    Summary

Appendix: Pop Quiz Answers
    Chapter 3, Understanding MapReduce
    Chapter 7, Keeping Things Running

Index
Preface
This book is here to help you make sense of Hadoop and use it to solve your big data
problems. It's a really exciting time to work with data processing technologies such as
Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of
large corporations and government agencies—is now possible through free open source
software (OSS).
But because of the seeming complexity and pace of change in this area, getting a grip on
the basics can be somewhat intimidating. That's where this book comes in, giving you an
understanding of just what Hadoop is, how it works, and how you can use it to extract
value from your data now.
In addition to an explanation of core Hadoop, we also spend several chapters exploring
other technologies that either use Hadoop or integrate with it. Our goal is to give you an
understanding not just of what Hadoop is but also how to use it as a part of your broader
technical infrastructure.
A complementary technology is the use of cloud computing, and in particular, the offerings
from Amazon Web Services. Throughout the book, we will show you how to use these
services to host your Hadoop workloads, demonstrating that not only can you process
large data volumes, but also you don't actually need to buy any physical hardware to do so.
What this book covers
This book comprises three main parts: chapters 1 through 5, which cover the core of
Hadoop and how it works; chapters 6 and 7, which cover the more operational aspects
of Hadoop; and chapters 8 through 11, which look at the use of Hadoop alongside other
products and technologies.
Chapter 1, What It's All About, gives an overview of the trends that have made Hadoop and
cloud computing such important technologies today.
Chapter 2, Getting Hadoop Up and Running, walks you through the initial setup of a local
Hadoop cluster and the running of some demo jobs. For comparison, the same work is also
executed on the hosted Hadoop Amazon service.
Chapter 3, Understanding MapReduce, goes inside the workings of Hadoop to show how
MapReduce jobs are executed and shows how to write applications using the Java API.
Chapter 4, Developing MapReduce Programs, takes a case study of a moderately sized data
set to demonstrate techniques to help when deciding how to approach the processing and
analysis of a new data source.
Chapter 5, Advanced MapReduce Techniques, looks at a few more sophisticated ways of
applying MapReduce to problems that don't necessarily seem immediately applicable to the
Hadoop processing model.
Chapter 6, When Things Break, examines Hadoop's much-vaunted high availability and fault
tolerance in some detail and sees just how good it is by intentionally causing havoc through
killing processes and intentionally using corrupt data.
Chapter 7, Keeping Things Running, takes a more operational view of Hadoop and will be
of most use for those who need to administer a Hadoop cluster. Along with demonstrating
some best practice, it describes how to prepare for the worst operational disasters so you
can sleep at night.
Chapter 8, A Relational View On Data With Hive, introduces Apache Hive, which allows
Hadoop data to be queried with a SQL-like syntax.
Chapter 9, Working With Relational Databases, explores how Hadoop can be integrated with
existing databases, and in particular, how to move data from one to the other.
Chapter 10, Data Collection with Flume, shows how Apache Flume can be used to gather
data from multiple sources and deliver it to destinations such as Hadoop.
Chapter 11, Where To Go Next, wraps up the book with an overview of the broader Hadoop
ecosystem, highlighting other products and technologies of potential interest. In addition, it
gives some ideas on how to get involved with the Hadoop community and to get help.
What you need for this book
As we discuss the various Hadoop-related software packages used in this book, we will
describe the particular requirements for each chapter. However, you will generally need
somewhere to run your Hadoop cluster.
In the simplest case, a single Linux-based machine will give you a platform to explore almost
all the exercises in this book. We assume you have a recent distribution of Ubuntu, but as
long as you have command-line Linux familiarity any modern distribution will suffice.
Some of the examples in later chapters really need multiple machines to see things working,
so you will require access to at least four such hosts. Virtual machines are completely
acceptable; they're not ideal for production but are fine for learning and exploration.
Since we also explore Amazon Web Services in this book, you can run all the examples on
EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout
the book. AWS services are usable by anyone, but you will need a credit card to sign up!
Who this book is for
We assume you are reading this book because you want to know more about Hadoop at
a hands-on level; the key audience is those with software development experience but no
prior exposure to Hadoop or similar big data technologies.
For developers who want to know how to write MapReduce applications, we assume you are
comfortable writing Java programs and are familiar with the Unix command-line interface.
We will also show you a few programs in Ruby, but these are usually only to demonstrate
language independence, and you don't need to be a Ruby expert.
For architects and system administrators, the book also provides significant value in
explaining how Hadoop works, its place in the broader architecture, and how it can be
managed operationally. Some of the more involved techniques in Chapter 4, Developing
MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably
of less direct interest to this audience.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions of how to complete a procedure or task, we use:
Time for action – heading
1. Action 1
2. Action 2
3. Action 3
Instructions often need some extra explanation so that they make sense, so they are
followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop quiz – heading
These are short multiple-choice questions intended to help you test your own
understanding.
Have a go hero – heading
These set practical challenges and give you ideas for experimenting with what you
have learned.
You will also find a number of styles of text that distinguish between different kinds of
information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "You may notice that we used the Unix command
rm to remove the Drush directory rather than the DOS del command."
A block of code is set as follows:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
When we wish to draw your attention to a particular part of a code block, the relevant lines
or items are set in bold:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
Any command-line input or output is written as follows:
cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git
New terms and important words are shown in bold. Words that you see on the screen, in
menus or dialog boxes for example, appear in the text like this: "On the Select Destination
Location screen, click on Next to accept the default destination."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book—what you liked or may have disliked. Reader feedback is important for us to
develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to feedback@packtpub.com,
and mention the book title through the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from
your account at http://www.packtpub.com. If you purchased this book elsewhere,
you can visit http://www.packtpub.com/support and register to have the files
e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or the
code—we would be grateful if you would report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the errata submission form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata
will be uploaded to our website, or added to any list of existing errata, under the Errata
section of that title.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works, in any form, on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected
pirated material.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.
Questions
You can contact us at questions@packtpub.com if you are having a problem with any
aspect of the book, and we will do our best to address it.
1
What It's All About
This book is about Hadoop, an open source framework for large-scale data
processing. Before we get into the details of the technology and its use in later
chapters, it is important to spend a little time exploring the trends that led to
Hadoop's creation and its enormous success.
Hadoop was not created in a vacuum; instead, it exists due to the explosion
in the amount of data being created and consumed and a shift that sees this
data deluge arrive at small startups and not just huge multinationals. At the
same time, other trends have changed how software and systems are deployed,
using cloud resources alongside or even in preference to more traditional
infrastructures.
This chapter will explore some of these trends and explain in detail the specific
problems Hadoop seeks to solve and the drivers that shaped its design.
In the rest of this chapter we shall:
Learn about the big data revolution
Understand what Hadoop is and how it can extract value from data
Look into cloud computing and understand what Amazon Web Services provides
See how powerful the combination of big data processing and cloud computing
can be
Get an overview of the topics covered in the rest of this book
So let's get on with it!
Big data processing
Look around at the technology we have today, and it's easy to come to the conclusion that
it's all about data. As consumers, we have an increasing appetite for rich media, both in
terms of the movies we watch and the pictures and videos we create and upload. We also,
often without thinking, leave a trail of data across the Web as we perform the actions of
our daily lives.
Not only is the amount of data being generated increasing, but the rate of increase is also
accelerating. From emails to Facebook posts, from purchase histories to web links, there are
large data sets growing everywhere. The challenge is in extracting from this data the most
valuable aspects; sometimes this means particular data elements, and at other times, the
focus is instead on identifying trends and relationships between pieces of data.
There's a subtle change occurring behind the scenes that is all about using data in more
and more meaningful ways. Large companies have realized the value in data for some
time and have been using it to improve the services they provide to their customers, that
is, us. Consider how Google displays advertisements relevant to our web surfing, or how
Amazon or Netflix recommend new products or titles that often match well to our tastes
and interests.
The value of data
These corporations wouldn't invest in large-scale data processing if it didn't provide a
meaningful return on the investment or a competitive advantage. There are several main
aspects to big data that should be appreciated:
Some questions only give value when asked of sufficiently large data sets.
Recommending a movie based on the preferences of another person is, in the
absence of other factors, unlikely to be very accurate. Increase the number of
people to a hundred and the chances increase slightly. Use the viewing history of
ten million other people and the chances of detecting patterns that can be used to
give relevant recommendations improve dramatically.
Big data tools often enable the processing of data on a larger scale and at a lower
cost than previous solutions. As a consequence, it is often possible to perform data
processing tasks that were previously prohibitively expensive.
The cost of large-scale data processing isn't just about financial expense; latency is
also a critical factor. A system may be able to process as much data as is thrown at
it, but if the average processing time is measured in weeks, it is likely not useful. Big
data tools allow data volumes to be increased while keeping processing time under
control, usually by matching the increased data volume with additional hardware.
[8]
www.it-ebooks.info
Chapter 1
Previous assumptions of what a database should look like or how its data should be
structured may need to be revisited to meet the needs of the biggest data problems.
In combination with the preceding points, sufficiently large data sets and flexible
tools allow previously unimagined questions to be answered.
Historically for the few and not the many
The examples discussed in the previous section have generally been seen in the form of
innovations of large search engines and online companies. This is a continuation of a much
older trend wherein processing large data sets was an expensive and complex undertaking,
out of the reach of small- or medium-sized organizations.
Similarly, the broader approach of data mining has been around for a very long time but has
never really been a practical tool outside the largest corporations and government agencies.
This situation may have been regrettable but most smaller organizations were not at a
disadvantage as they rarely had access to the volume of data requiring such an investment.
The increase in data is not limited to the big players anymore, however; many small and
medium companies—not to mention some individuals—find themselves gathering larger
and larger amounts of data that they suspect may have some value they want to unlock.
Before understanding how this can be achieved, it is important to appreciate some of these
broader historical trends that have laid the foundations for systems such as Hadoop today.
Classic data processing systems
The fundamental reason that big data mining systems were rare and expensive is that scaling
a system to process large data sets is very difficult; as we will see, it has traditionally been
limited to the processing power that can be built into a single computer.
There are however two broad approaches to scaling a system as the size of the data
increases, generally referred to as scale-up and scale-out.
Scale-up
In most enterprises, data processing has typically been performed on impressively large
computers with impressively larger price tags. As the size of the data grows, the approach is
to move to a bigger server or storage array. Through an effective architecture—even today,
as we'll describe later in this chapter—the cost of such hardware could easily be measured in
hundreds of thousands or in millions of dollars.
The advantage of simple scale-up is that the architecture does not significantly change
through the growth. Though larger components are used, the basic relationship (for
example, database server and storage array) stays the same. For applications such as
commercial database engines, the software handles the complexities of utilizing the
available hardware, but in theory, increased scale is achieved by migrating the same
software onto larger and larger servers. Note though that the difficulty of moving software
onto more and more processors is never trivial; in addition, there are practical limits on just
how big a single host can be, so at some point, scale-up cannot be extended any further.
The promise of a single architecture at any scale is also unrealistic. Designing a scale-up system
to handle data sets of sizes such as 1 terabyte, 100 terabyte, and 1 petabyte may conceptually
apply larger versions of the same components, but the complexity of their connectivity may
vary from cheap commodity through custom hardware as the scale increases.
Early approaches to scale-out
Instead of growing a system onto larger and larger hardware, the scale-out approach
spreads the processing onto more and more machines. If the data set doubles, simply use
two servers instead of a single double-sized one. If it doubles again, move to four hosts.
The obvious benefit of this approach is that purchase costs remain much lower than for
scale-up. Server hardware costs tend to increase sharply when one seeks to purchase larger
machines, and though a single host may cost $5,000, one with ten times the processing
power may cost a hundred times as much. The downside is that we need to develop
strategies for splitting our data processing across a fleet of servers and the tools
historically used for this purpose have proven to be complex.
As a consequence, deploying a scale-out solution has required significant engineering effort;
the system developer often needs to handcraft the mechanisms for data partitioning and
reassembly, not to mention the logic to schedule the work across the cluster and handle
individual machine failures.
Limiting factors
These traditional approaches to scale-up and scale-out have not been widely adopted
outside large enterprises, government, and academia. The purchase costs are often high,
as is the effort to develop and manage the systems. These factors alone put them out of the
reach of many smaller businesses. In addition, the approaches themselves have had several
weaknesses that have become apparent over time:
As scale-out systems get large, or as scale-up systems deal with multiple CPUs, the
difficulties caused by the complexity of the concurrency in the systems have become
significant. Effectively utilizing multiple hosts or CPUs is a very difficult task, and
implementing the necessary strategy to maintain efficiency throughout execution
of the desired workloads can entail enormous effort.
Hardware advances—often couched in terms of Moore's law—have begun to
highlight discrepancies in system capability. CPU power has grown much faster than
network or disk speeds have; once CPU cycles were the most valuable resource in
the system, but today, that no longer holds. Whereas a modern CPU may be able to
execute millions of times as many operations as a CPU 20 years ago would, memory
and hard disk speeds have increased only by factors of hundreds or thousands.
It is quite easy to build a modern system with so much CPU power that the storage
system simply cannot feed it data fast enough to keep the CPUs busy.
A different approach
From the preceding scenarios there are a number of techniques that have been used
successfully to ease the pain in scaling data processing systems to the large scales
required by big data.
All roads lead to scale-out
As just hinted, taking a scale-up approach to scaling is not an open-ended tactic. There is
a limit to the size of individual servers that can be purchased from mainstream hardware
suppliers, and even more niche players can't offer an arbitrarily large server. At some point,
the workload will increase beyond the capacity of the single, monolithic scale-up server, so
then what? The unfortunate answer is that the best approach is to have two large servers
instead of one. Then, later, three, four, and so on. Or, in other words, the natural tendency
of scale-up architecture is—in extreme cases—to add a scale-out strategy to the mix.
Though this gives some of the benefits of both approaches, it also compounds the costs
and weaknesses; instead of very expensive hardware or the need to manually develop
the cross-cluster logic, this hybrid architecture requires both.
As a consequence of this end-game tendency and the general cost profile of scale-up
architectures, they are rarely used in the big data processing field and scale-out
architectures are the de facto standard.
If your problem space involves data workloads with strong internal
cross-references and a need for transactional integrity, big iron
scale-up relational databases are still likely to be a great option.
Share nothing
Anyone with children will have spent considerable time teaching the little ones that it's good
to share. This principle does not extend into data processing systems, and this idea applies to
both data and hardware.
The conceptual view of a scale-out architecture in particular shows individual hosts, each
processing a subset of the overall data set to produce its portion of the final result. Reality
is rarely so straightforward. Instead, hosts may need to communicate between each other,
or some pieces of data may be required by multiple hosts. These additional dependencies
create opportunities for the system to be negatively affected in two ways: bottlenecks and
increased risk of failure.
If a piece of data or individual server is required by every calculation in the system, there is
a likelihood of contention and delays as the competing clients access the common data or
host. If, for example, in a system with 25 hosts there is a single host that must be accessed
by all the rest, the overall system performance will be bounded by the capabilities of this
key host.
Worse still, if this "hot" server or storage system holding the key data fails, the entire
workload will collapse in a heap. Earlier cluster solutions often demonstrated this risk;
even though the workload was processed across a farm of servers, they often used a
shared storage system to hold all the data.
Instead of sharing resources, the individual components of a system should be as
independent as possible, allowing each to proceed regardless of whether others
are tied up in complex work or are experiencing failures.
Expect failure
Implicit in the preceding tenets is that more hardware will be thrown at the problem
with as much independence as possible. This is only achievable if the system is built
with an expectation that individual components will fail, often regularly and with
inconvenient timing.
You'll often hear terms such as "five nines" (referring to 99.999 percent uptime
or availability). Though this is absolute best-in-class availability, it is important
to realize that the overall reliability of a system comprised of many such devices
can vary greatly depending on whether the system can tolerate individual
component failures.
Assume a server with 99 percent reliability and a system that requires five such
hosts to function. The system availability is 0.99*0.99*0.99*0.99*0.99 which
equates to 95 percent availability. But if the individual servers are only rated
at 95 percent, the system reliability drops to a mere 77 percent.
Instead, if you build a system that only needs one of the five hosts to be functional at any
given time, the system availability is well into five nines territory. Thinking about system
uptime in relation to the criticality of each component can help focus on just what the
system availability is likely to be.
If figures such as 99 percent availability seem a little abstract to you, consider
it in terms of how much downtime that would mean in a given time period.
For example, 99 percent availability equates to a downtime of just over 3.5
days a year or 7 hours a month. Still sound as good as 99 percent?
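To make these figures concrete, the following short Java sketch (not taken from the book; it simply re-runs the example numbers above) calculates the chained availability when every host is required, the availability when any one of the five hosts is enough, and the yearly downtime implied by 99 percent availability:

public class AvailabilityMath {
    public static void main(String[] args) {
        int hosts = 5;

        // Every host is required: the individual availabilities multiply together.
        System.out.printf("5 x 99%% hosts, all required : %.3f%n", Math.pow(0.99, hosts)); // ~0.951
        System.out.printf("5 x 95%% hosts, all required : %.3f%n", Math.pow(0.95, hosts)); // ~0.774

        // Only one host is required: the system is down only when all five
        // hosts happen to be down at the same time.
        double anyOneOfFive = 1.0 - Math.pow(1.0 - 0.95, hosts);
        System.out.printf("5 x 95%% hosts, any 1 enough : %.7f%n", anyOneOfFive);          // ~0.9999997

        // Translate an availability percentage into downtime across a year.
        double availability = 0.99;
        double hoursPerYear = 365 * 24;
        System.out.printf("99%% availability = %.1f hours of downtime a year%n",
                (1.0 - availability) * hoursPerYear);                                      // ~87.6 hours
    }
}

Running it confirms the point: requiring all five 95 percent hosts gives roughly 77 percent availability, while needing only one of them pushes the figure comfortably beyond five nines.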
This approach of embracing failure is often one of the most difficult aspects of big data
systems for newcomers to fully appreciate. This is also where the approach diverges most
strongly from scale-up architectures. One of the main reasons for the high cost of large
scale-up servers is the amount of effort that goes into mitigating the impact of component
failures. Even low-end servers may have redundant power supplies, but in a big iron box,
you will see CPUs mounted on cards that connect across multiple backplanes to banks of
memory and storage systems. Big iron vendors have often gone to extremes to show how
resilient their systems are by doing everything from pulling out parts of the server while it's
running to actually shooting a gun at it. But if the system is built in such a way that instead of
treating every failure as a crisis to be mitigated it is reduced to irrelevance, a very different
architecture emerges.
Smart software, dumb hardware
If we wish to see a cluster of hardware used in as flexible a way as possible, providing hosting
to multiple parallel workflows, the answer is to push the smarts into the software and away
from the hardware.
In this model, the hardware is treated as a set of resources, and the responsibility for
allocating hardware to a particular workload is given to the software layer. This allows
hardware to be generic and hence both easier and less expensive to acquire, and the
functionality to efficiently use the hardware moves to the software, where the knowledge
about effectively performing this task resides.
Move processing, not data
Imagine you have a very large data set, say, 1000 terabytes (that is, 1 petabyte), and you
need to perform a set of four operations on every piece of data in the data set. Let's look
at different ways of implementing a system to solve this problem.
A traditional big iron scale-up solution would see a massive server attached to an equally
impressive storage system, almost certainly using technologies such as fibre channel to
maximize storage bandwidth. The system will perform the task but will become I/O-bound;
even high-end storage switches have a limit on how fast data can be delivered to the host.
Alternatively, the processing approach of previous cluster technologies would perhaps see
a cluster of 1,000 machines, each holding 1 terabyte of data, with the cluster divided into
four quadrants, each responsible for performing one of the operations. The cluster management software
would then coordinate the movement of the data around the cluster to ensure each piece
receives all four processing steps. As each piece of data can have one step performed on the
host on which it resides, it will need to stream the data to the other three quadrants, so we
are in effect consuming 3 petabytes of network bandwidth to perform the processing.
Remembering that processing power has increased faster than networking or disk
technologies, are these really the best ways to address the problem? Recent experience
suggests the answer is no and that an alternative approach is to avoid moving the data and
instead move the processing. Use a cluster as just mentioned, but don't segment it into
quadrants; instead, have each of the thousand nodes perform all four processing stages on
the locally held data. If you're lucky, you'll only have to stream the data from the disk once
and the only things travelling across the network will be program binaries and status reports,
both of which are dwarfed by the actual data set in question.
If a 1,000-node cluster sounds ridiculously large, think of some modern server form factors
being utilized for big data solutions. These see single hosts with as many as twelve 1- or
2-terabyte disks in each. Because modern processors have multiple cores it is possible to
build a 50-node cluster with a petabyte of storage and still have a CPU core dedicated to
process the data stream coming off each individual disk.
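As a rough back-of-the-envelope check on this section, here is a small Java sketch (illustrative only, using just the figures quoted above) that compares the network traffic implied by moving the data with the storage and core counts of the 50-node, data-local alternative:

public class ClusterBackOfEnvelope {
    public static void main(String[] args) {
        // The scenario above: a 1-petabyte data set and four processing stages.
        double datasetTb = 1000.0;
        int stages = 4;

        // Segmented cluster: each piece of data is processed locally for one stage
        // and has to be streamed across the network for the remaining three.
        double shippedTb = datasetTb * (stages - 1);
        System.out.printf("Data shipped between quadrants: ~%.0f TB (3 PB)%n", shippedTb);

        // Data-local cluster: 50 nodes, each with twelve 2-terabyte disks and one
        // CPU core per disk, processing only the data it already holds.
        int nodes = 50;
        int disksPerNode = 12;
        double diskSizeTb = 2.0;
        System.out.printf("50-node cluster raw storage   : %.0f TB (~1 PB)%n",
                nodes * disksPerNode * diskSizeTb);
        System.out.printf("CPU cores needed per node     : %d (one per disk)%n", disksPerNode);
    }
}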
Build applications, not infrastructure
When thinking of the scenario in the previous section, many people will focus on the
questions of data movement and processing. But, anyone who has ever built such a
system will know that less obvious elements such as job scheduling, error handling,
and coordination are where much of the magic truly lies.
If we had to implement the mechanisms for determining where to execute processing,
performing the processing, and combining all the subresults into the overall result, we
wouldn't have gained much from the older model. There, we needed to explicitly manage
data partitioning; we'd just be exchanging one difficult problem with another.
This touches on the most recent trend, which we'll highlight here: a system that handles
most of the cluster mechanics transparently and allows the developer to think in terms of
the business problem. Frameworks that provide well-defined interfaces that abstract all this
complexity—smart software—upon which business domain-specific applications can be built
give the best combination of developer and system efficiency.
Hadoop
The thoughtful (or perhaps suspicious) reader will not be surprised to learn that the
preceding approaches are all key aspects of Hadoop. But we still haven't actually
answered the question about exactly what Hadoop is.
Thanks, Google
It all started with Google, which in 2003 and 2004 released two academic papers describing
Google technology: the Google File System (GFS) (http://research.google.com/archive/gfs.html)
and MapReduce (http://research.google.com/archive/mapreduce.html). The two together
provided a platform for processing data on a very large scale in a highly efficient manner.
Thanks, Doug
At the same time, Doug Cutting was working on the Nutch open source web search
engine. He had been working on elements within the system that resonated strongly
once the Google GFS and MapReduce papers were published. Doug started work on the
implementations of these Google systems, and Hadoop was soon born, firstly as a subproject
of Lucene and soon afterwards as its own top-level project within the Apache open source foundation.
At its core, therefore, Hadoop is an open source platform that provides implementations of
both the MapReduce and GFS technologies and allows the processing of very large data sets
across clusters of low-cost commodity hardware.
Thanks, Yahoo
Yahoo hired Doug Cutting in 2006 and quickly became one of the most prominent supporters
of the Hadoop project. In addition to often publicizing some of the largest Hadoop
deployments in the world, Yahoo has allowed Doug and other engineers to contribute to
Hadoop while still under its employ; it has contributed some of its own internally developed
Hadoop improvements and extensions. Though Doug has now moved on to Cloudera
(another prominent startup supporting the Hadoop community) and much of Yahoo's
Hadoop team has been spun off into a startup called Hortonworks, Yahoo remains a major
Hadoop contributor.
Parts of Hadoop
The top-level Hadoop project has many component subprojects, several of which we'll
discuss in this book, but the two main ones are Hadoop Distributed File System (HDFS)
and MapReduce. These are direct implementations of Google's own GFS and MapReduce.
We'll discuss both in much greater detail, but for now, it's best to think of HDFS and
MapReduce as a pair of complementary yet distinct technologies.
HDFS is a filesystem that can store very large data sets by scaling out across a cluster of
hosts. It has specific design and performance characteristics; in particular, it is optimized
for throughput instead of latency, and it achieves high availability through replication
instead of redundancy.
MapReduce is a data processing paradigm that takes a specification of how the data will be
input and output from its two stages (called map and reduce) and then applies this across
arbitrarily large data sets. MapReduce integrates tightly with HDFS, ensuring that wherever
possible, MapReduce tasks run directly on the HDFS nodes that hold the required data.
Common building blocks
Both HDFS and MapReduce exhibit several of the architectural principles described in the
previous section. In particular:
Both are designed to run on clusters of commodity (that is, low-to-medium
specification) servers
Both scale their capacity by adding more servers (scale-out)
Both have mechanisms for identifying and working around failures
Both provide many of their services transparently, allowing the user to concentrate
on the problem at hand
Both have an architecture where a software cluster sits on the physical servers and
controls all aspects of system execution
HDFS
HDFS is a filesystem unlike most you may have encountered before. It is not a POSIX-compliant filesystem, which basically means it does not provide the same guarantees as a
regular filesystem. It is also a distributed filesystem, meaning that it spreads storage across
multiple nodes; lack of such an efficient distributed filesystem was a limiting factor in some
historical technologies. The key features are:
HDFS stores files in blocks typically at least 64 MB in size, much larger than the 4-32
KB seen in most filesystems.
HDFS is optimized for throughput over latency; it is very efficient at streaming
read requests for large files but poor at seek requests for many small ones.
HDFS is optimized for workloads that are generally of the write-once and
read-many type.
Each storage node runs a process called a DataNode that manages the blocks on
that host, and these are coordinated by a master NameNode process running on a
separate host.
- Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and the HDFS NameNode constantly monitors reports sent by each DataNode to ensure that failures have not dropped any block below the desired replication factor. If this does happen, it schedules the addition of another copy within the cluster.
MapReduce
Though MapReduce as a technology is relatively new, it builds upon much of the
fundamental work from both mathematics and computer science, particularly approaches
that look to express operations that would then be applied to each element in a set of data.
Indeed the individual concepts of functions called map and reduce come straight from
functional programming languages where they were applied to lists of input data.
Another key underlying concept is that of "divide and conquer", where a single problem is
broken into multiple individual subtasks. This approach becomes even more powerful when
the subtasks are executed in parallel; in a perfect case, a task that takes 1000 minutes could
be processed in 1 minute by 1,000 parallel subtasks.
MapReduce is a processing paradigm that builds upon these principles; it provides a series of
transformations from a source to a result data set. In the simplest case, the input data is fed
to the map function and the resultant temporary data to a reduce function. The developer
only defines the data transformations; Hadoop's MapReduce job manages the process of
how to apply these transformations to the data across the cluster in parallel. Though the
underlying ideas may not be novel, a major strength of Hadoop is in how it has brought
these principles together into an accessible and well-engineered platform.
Unlike traditional relational databases that require structured data with well-defined
schemas, MapReduce and Hadoop work best on semi-structured or unstructured data.
Instead of data conforming to rigid schemas, the requirement is instead that the data be
provided to the map function as a series of key value pairs. The output of the map function is
a set of other key value pairs, and the reduce function performs aggregation to collect the
final set of results.
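As a purely illustrative example of this key/value flow, consider counting word frequencies in a single line of text, a task we will actually run later in this book. The exact keys supplied to the map function depend on the input format being used, so treat the numeric offsets below as an assumption rather than a guarantee:

Input to map:    (0, "the cat sat on the mat")
Map output:      ("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1)
Reduce input:    ("cat", [1]), ("mat", [1]), ("on", [1]), ("sat", [1]), ("the", [1, 1])
Reduce output:   ("cat", 1), ("mat", 1), ("on", 1), ("sat", 1), ("the", 2)

The values for each distinct key are grouped together before being handed to the reduce function, which is what allows the simple aggregation in the final step.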
Hadoop provides a standard specification (that is, interface) for the map and reduce
functions, and implementations of these are often referred to as mappers and reducers.
A typical MapReduce job will comprise a number of mappers and reducers, and it is not
unusual for several of these to be extremely simple. The developer focuses on expressing the
transformation between source and result data sets, and the Hadoop framework manages all
aspects of job execution, parallelization, and coordination.
This last point is possibly the most important aspect of Hadoop. The platform takes
responsibility for every aspect of executing the processing across the data. After the user
defines the key criteria for the job, everything else becomes the responsibility of the system.
Critically, from the perspective of the size of data, the same MapReduce job can be applied
to data sets of any size hosted on clusters of any size. If the data is 1 gigabyte in size and on
a single host, Hadoop will schedule the processing accordingly. Even if the data is 1 petabyte
in size and hosted across one thousand machines, it still does likewise, determining how best
to utilize all the hosts to perform the work most efficiently. From the user's perspective, the
actual size of the data and cluster are transparent, and apart from affecting the time taken to
process the job, they do not change how the user interacts with Hadoop.
Better together
It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even
more powerful when combined. HDFS can be used without MapReduce, as it is intrinsically a
large-scale data storage platform. Though MapReduce can read data from non-HDFS sources,
the nature of its processing aligns so well with HDFS that using the two together is by far the
most common use case.
When a MapReduce job is executed, Hadoop needs to decide where to execute the code
most efficiently to process the data set. If all the MapReduce cluster hosts pull their data
from a single storage host or array, it largely doesn't matter where the processing runs, as the
storage system is a shared resource that will cause contention. But if the storage system is HDFS,
it allows MapReduce to execute data processing on the node holding the data of interest, building
on the principle that it is less expensive to move the data processing than the data itself.
The most common deployment model for Hadoop sees the HDFS and MapReduce clusters
deployed on the same set of servers. Each host that contains data and the HDFS component
to manage it also hosts a MapReduce component that can schedule and execute data
processing. When a job is submitted to Hadoop, it can use an optimization process to schedule
tasks, as far as possible, on the hosts where the data resides, minimizing network traffic
and maximizing performance.
Think back to our earlier example of how to process a four-step task on 1 petabyte of
data spread across one thousand servers. The MapReduce model would (in a somewhat
simplified and idealized way) perform the processing in a map function on each piece
of data on a host where the data resides in HDFS and then reuse the cluster in the reduce
function to collect the individual results into the final result set.
A part of the challenge with Hadoop is in breaking down the overall problem into the best
combination of map and reduce functions. The preceding approach would only work if the
four-stage processing chain could be applied independently to each data element in turn. As
we'll see in later chapters, the answer is sometimes to use multiple MapReduce jobs where
the output of one is the input to the next.
Common architecture
Both HDFS and MapReduce are, as mentioned, software clusters that display common
characteristics:
- Each follows an architecture where a cluster of worker nodes is managed by a special master/coordinator node
- The master in each case (NameNode for HDFS and JobTracker for MapReduce) monitors the health of the cluster and handles failures, either by moving data blocks around or by rescheduling failed work
- Processes on each server (DataNode for HDFS and TaskTracker for MapReduce) are responsible for performing work on the physical host, receiving instructions from the NameNode or JobTracker, and reporting health/progress status back to it
As a minor terminology point, we will generally use the terms host or server to refer to the
physical hardware hosting Hadoop's various components. The term node will refer to the
software component comprising a part of the cluster.
What it is and isn't good for
As with any tool, it's important to understand when Hadoop is a good fit for the problem
in question. Much of this book will highlight its strengths, based on the previous broad
overview on processing large data volumes, but it's important to also start appreciating
at an early stage where it isn't the best choice.
The architecture choices made within Hadoop enable it to be the flexible and scalable data
processing platform it is today. But, as with most architecture or design choices, there are
consequences that must be understood. Primary amongst these is the fact that Hadoop is a
batch processing system. When you execute a job across a large data set, the framework will
churn away until the final results are ready. With a large cluster, answers across even huge
data sets can be generated relatively quickly, but the fact remains that the answers are not
generated fast enough to service impatient users. Consequently, Hadoop alone is not well
suited to low-latency queries such as those received on a website, a real-time system, or a
similar problem domain.
When Hadoop is running jobs on large data sets, the overhead of setting up the job,
determining which tasks are run on each node, and all the other housekeeping activities
that are required is a trivial part of the overall execution time. But, for jobs on small data
sets, there is an execution overhead that means even simple MapReduce jobs may take a
minimum of 10 seconds.
Another member of the broader Hadoop family is HBase, an
open-source implementation of another Google technology.
This provides a (non-relational) database atop Hadoop that
uses various means to allow it to serve low-latency queries.
But haven't Google and Yahoo both been among the strongest proponents of this method
of computation, and aren't they all about such websites where response time is critical?
The answer is yes, and it highlights an important aspect of how to incorporate Hadoop into
any organization or activity or use it in conjunction with other technologies in a way that
exploits the strengths of each. In a paper (http://research.google.com/archive/
googlecluster.html), Google sketches how they utilized MapReduce at the time; after a
web crawler retrieved updated webpage data, MapReduce processed the huge data set, and
from this, produced the web index that a fleet of MySQL servers used to service end-user
search requests.
Cloud computing with Amazon Web Services
The other technology area we'll explore in this book is cloud computing, in the form
of several offerings from Amazon Web Services. But first, we need to cut through some
hype and buzzwords that surround this thing called cloud computing.
Too many clouds
Cloud computing has become an overused term, arguably to the point that its overuse risks
it being rendered meaningless. In this book, therefore, let's be clear what we mean—and
care about—when using the term. There are two main aspects to this: a new architecture
option and a different approach to cost.
A third way
We've talked about scale-up and scale-out as the options for scaling data processing systems.
But our discussion thus far has taken for granted that the physical hardware that makes
either option a reality will be purchased, owned, hosted, and managed by the organization
doing the system development. The cloud computing we care about adds a third approach;
put your application into the cloud and let the provider deal with the scaling problem.
It's not always that simple, of course, but for many cloud services the model truly is this
revolutionary. You develop the software according to some published guidelines or interfaces,
deploy it onto the cloud platform, and allow the provider to scale the service based on
demand, for a cost of course. But given the costs usually involved in building systems that scale,
this is often a compelling proposition.
Different types of costs
This approach to cloud computing also changes how system hardware is paid for. By
offloading infrastructure costs, all users benefit from the economies of scale achieved by
the cloud provider by building their platforms up to a size capable of hosting thousands
or millions of clients. As a user, not only do you get someone else to worry about difficult
engineering problems, such as scaling, but you pay for capacity as it's needed and you
don't have to size the system based on the largest possible workloads. Instead, you gain the
benefit of elasticity and use more or fewer resources as your workload demands.
An example helps illustrate this. Many companies' financial groups run end-of-month
workloads to generate tax and payroll data, and often, much larger data crunching occurs at
year end. If you were tasked with designing such a system, how much hardware would you
buy? If you only buy enough to handle the day-to-day workload, the system may struggle at
month end and may likely be in real trouble when the end-of-year processing rolls around. If
you scale for the end-of-month workloads, the system will have idle capacity for most of the
year and possibly still be in trouble performing the end-of-year processing. If you size for the
end-of-year workload, the system will have significant capacity sitting idle for the rest of the
year. And considering the purchase cost of hardware in addition to the hosting and running
costs—a server's electricity usage may account for a large majority of its lifetime costs—you
are basically wasting huge amounts of money.
The service-on-demand aspects of cloud computing allow you to start your application
on a small hardware footprint and then scale it up and down as the year progresses.
With a pay-for-use model, your costs follow your utilization and you have the capacity
to process your workloads without having to buy enough hardware to handle the peaks.
A more subtle aspect of this model is that this greatly reduces the costs of entry for an
organization to launch an online service. We all know that a new hot service that fails to
meet demand and suffers performance problems will find it hard to recover momentum and
user interest. In the year 2000, for example, an organization wanting a successful
launch needed to put in place, on launch day, enough capacity to meet the massive surge of
user traffic it hoped for but couldn't know for sure to expect. When the costs of hosting and physical
location are taken into consideration, it would have been easy to spend millions on a product launch.
Today, with cloud computing, the initial infrastructure cost could literally be as low as a
few tens or hundreds of dollars a month and that would only increase when—and if—the
traffic demanded.
AWS – infrastructure on demand from Amazon
Amazon Web Services (AWS) is a set of such cloud computing services offered by Amazon.
We will be using several of these services in this book.
Elastic Compute Cloud (EC2)
Amazon's Elastic Compute Cloud (EC2), found at http://aws.amazon.com/ec2/, is
basically a server on demand. After registering with AWS and EC2 and providing credit card
details, you can gain access to a dedicated virtual machine on which it's easy to run a variety
of operating systems, including Windows and many variants of Linux.
Need more servers? Start more. Need more powerful servers? Change to one of the higher
specification (and cost) types offered. Along with this, EC2 offers a suite of complementary
services, including load balancers, static IP addresses, high-performance additional virtual
disk drives, and many more.
Simple Storage Service (S3)
Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a
storage service that provides a simple key/value storage model. Using web, command-line,
or programmatic interfaces to create objects, which can be anything from text files
to images to MP3s, you can store and retrieve your data based on a hierarchical model.
You create buckets in this model that contain objects. Each bucket has a unique identifier,
and within each bucket, every object is uniquely named. This simple strategy enables an
extremely powerful service for which Amazon takes complete responsibility (for service
scaling, in addition to reliability and availability of data).
Elastic MapReduce (EMR)
Amazon's Elastic MapReduce (EMR), found at http://aws.amazon.com/
elasticmapreduce/, is basically Hadoop in the cloud and builds atop both EC2 and
S3. Once again, using any of the multiple interfaces (web console, CLI, or API), a Hadoop
workflow is defined with attributes such as the number of Hadoop hosts required and the
location of the source data. The Hadoop code implementing the MapReduce jobs is provided
and the virtual go button is pressed.
In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop
cluster it creates on EC2, push the results back into S3, and terminate the Hadoop cluster
and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually
based on the amount of data stored and the server time consumed), but the ability to access such
powerful data processing capabilities with no need for dedicated hardware is a compelling one.
What this book covers
In this book we will be learning how to write MapReduce programs to do some serious data
crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters.
Not only will we be looking at Hadoop as an engine for performing MapReduce processing,
but we'll also explore how a Hadoop capability can fit into the rest of an organization's
infrastructure and systems. We'll look at some of the common points of integration, such as
getting data between Hadoop and a relational database and also how to make Hadoop look
more like such a relational database.
A dual approach
In this book we will not be limiting our discussion to EMR or Hadoop hosted on Amazon EC2;
we will be discussing both the building and the management of local Hadoop clusters (on
Ubuntu Linux) in addition to showing how to push the processing into the cloud via EMR.
The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible,
there are aspects of the technology that only become apparent when manually
administering the cluster. Though it is also possible to use EMR in a more manual mode,
we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily
an either/or decision, many organizations use a mixture of in-house and cloud-hosted
capacity, sometimes due to a concern about over-reliance on a single external provider;
practically speaking, it's often convenient to do development and small-scale tests on local
capacity and then deploy at production scale into the cloud.
In some of the latter chapters, where we discuss additional products that integrate with
Hadoop, we'll only give examples of local clusters as there is no difference in how the
products work regardless of where they are deployed.
Summary
We learned a lot in this chapter about big data, Hadoop, and cloud computing.
Specifically, we covered the emergence of big data and how changes in the approach to
data processing and system architecture bring techniques that were previously prohibitively
expensive within the reach of almost any organization.
We also looked at the history of Hadoop and how it builds upon many of these trends
to provide a flexible and powerful data processing platform that can scale to massive
volumes. We also looked at how cloud computing provides another system architecture
approach, one which exchanges large up-front costs and direct physical responsibility
for a pay-as-you-go model and a reliance on the cloud provider for hardware provision,
management and scaling. We also saw what Amazon Web Services is and how its Elastic
MapReduce service utilizes other AWS services to provide Hadoop in the cloud.
We also discussed the aim of this book and its approach to exploration on both
locally-managed and AWS-hosted Hadoop clusters.
Now that we've covered the basics and know where this technology is coming from
and what its benefits are, we need to get our hands dirty and get things running,
which is what we'll do in Chapter 2, Getting Hadoop Up and Running.
Chapter 2 – Getting Hadoop Up and Running
Now that we have explored the opportunities and challenges presented
by large-scale data processing and why Hadoop is a compelling choice,
it's time to get things set up and running.
In this chapter, we will do the following:
- Learn how to install and run Hadoop on a local Ubuntu host
- Run some example Hadoop programs and get familiar with the system
- Set up the accounts required to use Amazon Web Services products such as EMR
- Create an on-demand Hadoop cluster on Elastic MapReduce
- Explore the key differences between a local and hosted Hadoop cluster
Hadoop on a local Ubuntu host
For our exploration of Hadoop outside the cloud, we shall give examples using one or
more Ubuntu hosts. A single machine (be it a physical computer or a virtual machine)
will be sufficient to run all the parts of Hadoop and explore MapReduce. However,
production clusters will most likely involve many more machines, so having even a
development Hadoop cluster deployed on multiple hosts will be good experience.
However, for getting started, a single host will suffice.
Nothing we discuss will be unique to Ubuntu, and Hadoop should run on any Linux
distribution. Obviously, you may have to alter how the environment is configured if
you use a distribution other than Ubuntu, but the differences should be slight.
Other operating systems
Hadoop does run well on other platforms. Windows and Mac OS X are popular choices
for developers. Windows is supported only as a development platform and Mac OS X is
not formally supported at all.
If you choose to use such a platform, the general situation will be similar to other Linux
distributions; all aspects of how to work with Hadoop will be the same on both platforms,
but you will need to use the operating system-specific mechanisms for setting up environment
variables and similar tasks. The Hadoop FAQs contain some information on alternative
platforms and should be your first port of call if you are considering such an approach.
The Hadoop FAQs can be found at http://wiki.apache.org/hadoop/FAQ.
Time for action – checking the prerequisites
Hadoop is written in Java, so you will need a recent Java Development Kit (JDK) installed
on the Ubuntu host. Perform the following steps to check the prerequisites:
1. First, check what's already available by opening up a terminal and typing the following:
$ javac
$ java -version
2. If either of these commands gives a no such file or directory or similar error, or if the latter mentions "Open JDK", it's likely you need to download the full JDK. Grab this from the Oracle download page at http://www.oracle.com/technetwork/java/javase/downloads/index.html; you should get the latest release.
3. Once Java is installed, add the JDK/bin directory to your path and set the JAVA_HOME environment variable with commands such as the following, modified for your specific Java version:
$ export JAVA_HOME=/opt/jdk1.6.0_24
$ export PATH=$JAVA_HOME/bin:${PATH}
What just happened?
These steps ensure the right version of Java is installed and available from the command line
without having to use lengthy pathnames to refer to the install location.
Remember that the preceding commands only affect the currently running shell and the
settings will be lost after you log out, close the shell, or reboot. To ensure the same setup
is always available, you can add these to the startup files for your shell of choice, within
the .bash_profile file for the BASH shell or the .cshrc file for TCSH, for example.
An alternative favored by me is to put all required configuration settings into a standalone
file and then explicitly call this from the command line; for example:
$ source Hadoop_config.sh
This technique allows you to keep multiple setup files in the same account without making
the shell startup overly complex; not to mention, the required configurations for several
applications may actually be incompatible. Just remember to begin by loading the file at the
start of each session!
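As a minimal sketch of such a file, assuming the Java location used in the preceding steps (the Hadoop-related variables can be added to the same file once Hadoop itself is installed later in this chapter):

# Hadoop_config.sh – load into the current shell with: source Hadoop_config.sh
export JAVA_HOME=/opt/jdk1.6.0_24
export PATH=$JAVA_HOME/bin:${PATH}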
Setting up Hadoop
One of the most confusing aspects of Hadoop to a newcomer is its various components,
projects, sub-projects, and their interrelationships. The fact that these have evolved over
time hasn't made the task of understanding it all any easier. For now, though, go to
http://hadoop.apache.org and you'll see that there are three prominent projects mentioned:
- Common
- HDFS
- MapReduce
The last two of these should be familiar from the explanation in Chapter 1, What It's All
About, and the Common project comprises a set of libraries and tools that help the Hadoop
product work in the real world. For now, the important thing is that the standard Hadoop
distribution bundles the latest versions of all three of these projects, and the combination is
what you need to get going.
A note on versions
Hadoop underwent a major change in the transition from the 0.19 to the 0.20 versions, most
notably with a migration to a set of new APIs used to develop MapReduce applications. We
will be primarily using the new APIs in this book, though we do include a few examples of the
older API in later chapters, as not all of the existing features have been ported to the new API.
Hadoop versioning also became complicated when the 0.20 branch was renamed to 1.0.
The 0.22 and 0.23 branches remained, and in fact included features not included in the 1.0
branch. At the time of this writing, things were becoming clearer with 1.1 and 2.0 branches
being used for future development releases. As most existing systems and third-party tools
are built against the 0.20 branch, we will use Hadoop 1.0 for the examples in this book.
Time for action – downloading Hadoop
Carry out the following steps to download Hadoop:
1. Go to the Hadoop download page at http://hadoop.apache.org/common/releases.html and retrieve the latest stable version of the 1.0.x branch; at the time of this writing, it was 1.0.4.
2. You'll be asked to select a local mirror; after that you need to download the file with a name such as hadoop-1.0.4-bin.tar.gz.
3. Copy this file to the directory where you want Hadoop to be installed (for example, /usr/local), using the following command:
$ cp hadoop-1.0.4-bin.tar.gz /usr/local
4. Decompress the file by using the following command:
$ tar -xf hadoop-1.0.4-bin.tar.gz
5. Add a convenient symlink to the Hadoop installation directory.
$ ln -s /usr/local/hadoop-1.0.4 /opt/hadoop
6. Now you need to add the Hadoop binary directory to your path and set the HADOOP_HOME environment variable, just as we did earlier with Java.
$ export HADOOP_HOME=/opt/hadoop
$ export PATH=$HADOOP_HOME/bin:$PATH
7. Go into the conf directory within the Hadoop installation and edit the hadoop-env.sh file. Search for JAVA_HOME and uncomment the line, modifying the location to point to your JDK installation, as mentioned earlier.
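After the edit, the relevant line in conf/hadoop-env.sh might look like the following; the path is only an example, matching the JDK location assumed earlier:

export JAVA_HOME=/opt/jdk1.6.0_24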
What just happened?
These steps ensure that Hadoop is installed and available from the command line.
By setting the path and configuration variables, we can use the Hadoop command-line
tool. The modification to the Hadoop configuration file is the only required change to
the setup needed to integrate with your host settings.
As mentioned earlier, you should put the export commands in your shell startup file
or a standalone-configuration script that you specify at the start of the session.
Don't worry about some of the details here; we'll cover Hadoop setup and use later.
Time for action – setting up SSH
Carry out the following steps to set up SSH:
1. Create a new OpenSSH key pair with the following commands:
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
…
2. Copy the new public key to the list of authorized keys by using the following command:
$ cp .ssh/id_rsa.pub .ssh/authorized_keys
3. Connect to the local host.
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is b6:0c:bd:57:32:b6:66:7c:33:7b:62:92:61:fd:ca:2a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
4. Confirm that the password-less SSH is working.
$ ssh localhost
$ ssh localhost
What just happened?
Because Hadoop requires communication between multiple processes on one or more
machines, we need to ensure that the user we are using for Hadoop can connect to each
required host without needing a password. We do this by creating a Secure Shell (SSH) key
pair that has an empty passphrase. We use the ssh-keygen command to start this process
and accept the offered defaults.
Once we create the key pair, we need to add the new public key to the stored list of trusted
keys; this means that when trying to connect to this machine, the public key will be trusted.
After doing so, we use the ssh command to connect to the local machine and should expect
to get a warning about trusting the host certificate as just shown. After confirming this, we
should then be able to connect without further passwords or prompts.
Note that when we move later to use a fully distributed cluster, we will
need to ensure that the Hadoop user account has the same key set up
on every host in the cluster.
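If you prefer to script this setup rather than answer the prompts, a non-interactive sketch such as the following should give the same result; the file locations are the OpenSSH defaults, and on a multi-host cluster you would repeat the last two steps (or use ssh-copy-id) for every host:

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys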
Configuring and running Hadoop
So far this has all been pretty straightforward, just downloading and system administration.
Now we can deal with Hadoop directly. Finally! We'll run a quick example to show Hadoop in
action. There is additional configuration and set up to be performed, but this next step will
help give confidence that things are installed and configured correctly so far.
Time for action – using Hadoop to calculate Pi
We will now use a sample Hadoop program to calculate the value of Pi. Right now,
this is primarily to validate the installation and to show how quickly you can get a
MapReduce job to execute. Assuming the HADOOP_HOME/bin directory is in your path,
type the following commands:
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar pi 4 1000
Number of Maps  = 4
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
12/10/26 22:56:11 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/10/26 22:56:11 INFO mapred.FileInputFormat: Total input paths to process : 4
12/10/26 22:56:12 INFO mapred.JobClient: Running job: job_local_0001
12/10/26 22:56:12 INFO mapred.FileInputFormat: Total input paths to process : 4
12/10/26 22:56:12 INFO mapred.MapTask: numReduceTasks: 1
…
12/10/26 22:56:14 INFO mapred.JobClient:  map 100% reduce 100%
12/10/26 22:56:14 INFO mapred.JobClient: Job complete: job_local_0001
12/10/26 22:56:14 INFO mapred.JobClient: Counters: 13
12/10/26 22:56:14 INFO mapred.JobClient:   FileSystemCounters
…
Job Finished in 2.904 seconds
Estimated value of Pi is 3.14000000000000000000
$
What just happened?
There's a lot of information here; even more so when you get the full output on your screen.
For now, let's unpack the fundamentals and not worry about much of Hadoop's status
output until later in the book. The first thing to clarify is some terminology; each Hadoop
program runs as a job that creates multiple tasks to do its work.
Looking at the output, we see it is broadly split into three sections:
- The start-up of the job
- The status as the job executes
- The output of the job
In our case, we can see the job creates four tasks to calculate Pi, and the overall job result
will be the combination of these subresults. This pattern should look familiar; it is the model
we came across in Chapter 1, What It's All About, of splitting a larger job into smaller pieces
and then bringing the results back together.
The majority of the output will appear as the job is being executed and provide status
messages showing progress. On successful completion, the job will print out a number of
counters and other statistics. The preceding example is actually unusual in that it is rare to see
the result of a MapReduce job displayed on the console. This is not a limitation of Hadoop,
but rather a consequence of the fact that jobs that process large data sets usually produce a
significant amount of output data that isn't well suited to a simple echoing on the screen.
Congratulations on your first successful MapReduce job!
Three modes
In our desire to get something running on Hadoop, we sidestepped an important issue: in
which mode should we run Hadoop? There are three possibilities that alter where the various
Hadoop components execute. Recall that HDFS comprises a single NameNode that acts as
the cluster coordinator and is the master for one or more DataNodes that store the data. For
MapReduce, the JobTracker is the cluster master and it coordinates the work executed by one
or more TaskTracker processes. The Hadoop modes deploy these components as follows:
- Local standalone mode: This is the default mode if, as in the preceding Pi example, you don't configure anything else. In this mode, all the components of Hadoop, such as NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process.
- Pseudo-distributed mode: In this mode, a separate JVM is spawned for each of the Hadoop components and they communicate across network sockets, effectively giving a fully functioning minicluster on a single host.
- Fully distributed mode: In this mode, Hadoop is spread across multiple machines, some of which will be general-purpose workers and others will be dedicated hosts for components such as NameNode and JobTracker.
Each mode has its benefits and drawbacks. Fully distributed mode is obviously the only one
that can scale Hadoop across a cluster of machines, but it requires more configuration work,
not to mention the cluster of machines. Local, or standalone, mode is the easiest to set
up, but you interact with it in a different manner than you would with the fully distributed
mode. In this book, we shall generally prefer the pseudo-distributed mode even when using
examples on a single host, as everything done in the pseudo-distributed mode is almost
identical to how it works on a much larger cluster.
Time for action – configuring the pseudo-distributed mode
Take a look in the conf directory within the Hadoop distribution. There are many
configuration files, but the ones we need to modify are core-site.xml, hdfs-site.xml
and mapred-site.xml.
1. Modify core-site.xml to look like the following code:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
2. Modify hdfs-site.xml to look like the following code:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
3. Modify mapred-site.xml to look like the following code:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
What just happened?
The first thing to note is the general format of these configuration files. They are obviously
XML and contain multiple property specifications within a single configuration element.
The property specifications always contain name and value elements with the possibility for
optional comments not shown in the preceding code.
We set three configuration variables here:
- The fs.default.name variable holds the location of the NameNode and is required by both HDFS and MapReduce components, which explains why it's in core-site.xml and not hdfs-site.xml.
- The dfs.replication variable specifies how many times each HDFS block should be replicated. Recall from Chapter 1, What It's All About, that HDFS handles failures by ensuring each block of filesystem data is replicated to a number of different hosts, usually 3. As we only have a single host and one DataNode in the pseudo-distributed mode, we change this value to 1.
- The mapred.job.tracker variable holds the location of the JobTracker just as fs.default.name holds the location of the NameNode. Because only MapReduce components need to know this location, it is in mapred-site.xml.
You are free, of course, to change the port numbers used, though 9000
and 9001 are common conventions in Hadoop.
The network addresses for the NameNode and the JobTracker specify the ports on which
the actual system requests should be directed. These are not user-facing locations, so don't
bother pointing your web browser at them. There are web interfaces that we will look at
shortly.
Configuring the base directory and formatting the filesystem
If the pseudo-distributed or fully distributed mode is chosen, there are two steps that need
to be performed before we start our first Hadoop cluster.
1. Set the base directory where Hadoop files will be stored.
2. Format the HDFS filesystem.
To be precise, we don't need to change the default directory, but, as we'll see
shortly, it's a good idea to think about it now.
Time for action – changing the base HDFS directory
Let's first set the base directory that specifies the location on the local filesystem under
which Hadoop will keep all its data. Carry out the following steps:
1. Create a directory into which Hadoop will store its data:
$ mkdir /var/lib/hadoop
2. Ensure the directory is writeable by any user:
$ chmod 777 /var/lib/hadoop
3. Modify core-site.xml once again to add the following property:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop</value>
</property>
What just happened?
As we will be storing data in Hadoop and all the various components are running on our local
host, this data will need to be stored on our local filesystem somewhere. Regardless of the
mode, Hadoop by default uses the hadoop.tmp.dir property as the base directory under
which all files and data are written.
MapReduce, for example, uses a /mapred directory under this base directory; HDFS uses
/dfs. The danger is that the default value of hadoop.tmp.dir is /tmp and some Linux
distributions delete the contents of /tmp on each reboot. So it's safer to explicitly state
where the data is to be held.
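Once the cluster has been started and used (as described in the following sections), you can confirm the setting has taken effect by looking inside the directory. Assuming the layout described above, the output might look something like:

$ ls /var/lib/hadoop
dfs  mapred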
Time for action – formatting the NameNode
Before starting Hadoop in either pseudo-distributed or fully distributed mode for the first
time, we need to format the HDFS filesystem that it will use. Type the following:
$ hadoop namenode -format
The output of this should look like the following:
$ hadoop namenode -format
12/10/26 22:45:25 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = vm193/10.0.0.193
STARTUP_MSG:   args = [-format]
…
12/10/26 22:45:25 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
12/10/26 22:45:25 INFO namenode.FSNamesystem: supergroup=supergroup
12/10/26 22:45:25 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/10/26 22:45:25 INFO common.Storage: Image file of size 96 saved in 0 seconds.
12/10/26 22:45:25 INFO common.Storage: Storage directory /var/lib/hadoop/dfs/name has been successfully formatted.
12/10/26 22:45:26 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at vm193/10.0.0.193
$
What just happened?
This is not a very exciting output because the step is only an enabler for our future use
of HDFS. However, it does help us think of HDFS as a filesystem; just like any new storage
device on any operating system, we need to format the device before we can use it. The
same is true for HDFS; initially there is a default location for the filesystem data but no
actual data for the equivalents of filesystem indexes.
Do this every time!
If your experience with Hadoop has been similar to the one I have had, there
will be a series of simple mistakes that are frequently made when setting
up new installations. It is very easy to forget about the formatting of the
NameNode and then get a cascade of failure messages when the first Hadoop
activity is tried.
But do it only once!
The command to format the NameNode can be executed multiple times, but in
doing so all existing filesystem data will be destroyed. It can only be executed
when the Hadoop cluster is shut down, and sometimes you will want to do it.
In most other cases, though, it is a quick way to irrevocably delete every piece of
data on HDFS, and it does not take much longer on large clusters. So be careful!
Starting and using Hadoop
After all that configuration and setup, let's now start our cluster and actually do something
with it.
Time for action – starting Hadoop
Unlike the local mode of Hadoop, where all the components run only for the lifetime of the
submitted job, with the pseudo-distributed or fully distributed mode of Hadoop, the cluster
components exist as long-running processes. Before we use HDFS or MapReduce, we need to
start up the needed components. Type the following commands; the output should look as
shown next, where the commands are included on the lines prefixed by $:
1. Type in the first command:
$ start-dfs.sh
starting namenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-namenode-vm193.out
localhost: starting datanode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-datanode-vm193.out
localhost: starting secondarynamenode, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-vm193.out
2. Type in the second command:
$ jps
9550 DataNode
9687 Jps
9638 SecondaryNameNode
9471 NameNode
3. Type in the third command:
$ hadoop dfs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2012-10-26 23:03 /tmp
drwxr-xr-x   - hadoop supergroup          0 2012-10-26 23:06 /user
4. Type in the fourth command:
$ start-mapred.sh
starting jobtracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-jobtracker-vm193.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop/bin/../logs/hadoop-hadoop-tasktracker-vm193.out
5. Type in the fifth command:
$ jps
9550 DataNode
9877 TaskTracker
9638 SecondaryNameNode
9471 NameNode
9798 JobTracker
9913 Jps
What just happened?
The start-dfs.sh command, as the name suggests, starts the components necessary for
HDFS. This is the NameNode to manage the filesystem and a single DataNode to hold data.
The SecondaryNameNode is an availability aid that we'll discuss in a later chapter.
After starting these components, we use the JDK's jps utility to see which Java processes are
running, and, as the output looks good, we then use Hadoop's dfs utility to list the root of
the HDFS filesystem.
After this, we use start-mapred.sh to start the MapReduce components—this time the
JobTracker and a single TaskTracker—and then use jps again to verify the result.
There is also a combined start-all.sh file that we'll use at a later stage, but in the early
days it's useful to do a two-stage start up to more easily verify the cluster configuration.
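The corresponding shutdown scripts live alongside the start scripts in the Hadoop bin directory; a typical sketch for stopping the cluster cleanly is simply to reverse the start order (there is also a combined stop-all.sh):

$ stop-mapred.sh
$ stop-dfs.sh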
Time for action – using HDFS
As the preceding example shows, there is a familiar-looking interface to HDFS that allows
us to use commands similar to those in Unix to manipulate files and directories on the
filesystem. Let's try it out by typing the following commands:
Type in the following commands:
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/hadoop
$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2012-10-26 23:09 /user/hadoop
$ echo "This is a test." >> test.txt
$ cat test.txt
This is a test.
$ hadoop dfs -copyFromLocal test.txt .
$ hadoop dfs -ls
Found 1 items
-rw-r--r--   1 hadoop supergroup         16 2012-10-26 23:19 /user/hadoop/test.txt
$ hadoop dfs -cat test.txt
This is a test.
$ rm test.txt
$ hadoop dfs -cat test.txt
This is a test.
$ hadoop fs -copyToLocal test.txt .
$ cat test.txt
This is a test.
What just happened?
This example shows the use of the fs subcommand to the Hadoop utility (note that the
dfs and fs subcommands are equivalent). Like most filesystems, Hadoop has the concept of a
home directory for each user. These home directories are stored under the /user directory
on HDFS and, before we go further, we create our home directory if it does not already exist.
We then create a simple text file on the local filesystem and copy it to HDFS by using the
copyFromLocal command and then check its existence and contents by using the -ls and
-cat utilities. As can be seen, the user home directory is aliased to . because, as in Unix, ls
commands with no path specified are assumed to refer to that location, and relative paths
(not starting with /) will start there.
We then deleted the file from the local filesystem, copied it back from HDFS by using the
-copyToLocal command, and checked its contents using the local cat utility.
Mixing HDFS and local filesystem commands, as in the preceding example,
is a powerful combination, and it's very easy to execute on HDFS commands
that were intended for the local filesystem and vice versa. So be careful,
especially when deleting.
There are other HDFS manipulation commands; try hadoop fs -help for a detailed list.
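For example, a few of the other commands reported by that help listing can be sketched as follows; the file names are just those used in the preceding example:

$ hadoop fs -du .                    # show the size of each item in the home directory
$ hadoop fs -cp test.txt test2.txt   # copy a file within HDFS
$ hadoop fs -rm test2.txt            # remove the copy again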
Time for action – WordCount, the Hello World of MapReduce
Many applications, over time, acquire a canonical example that no beginner's guide should
be without. For Hadoop, this is WordCount – an example bundled with Hadoop that counts
the frequency of words in an input text file.
1. First execute the following commands:
$ hadoop dfs -mkdir data
$ hadoop dfs -cp test.txt data
$ hadoop dfs -ls data
Found 1 items
-rw-r--r--   1 hadoop supergroup         16 2012-10-26 23:20 /user/hadoop/data/test.txt
2. Now execute these commands:
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar wordcount data out
12/10/26 23:22:49 INFO input.FileInputFormat: Total input paths to process : 1
12/10/26 23:22:50 INFO mapred.JobClient: Running job: job_201210262315_0002
12/10/26 23:22:51 INFO mapred.JobClient:  map 0% reduce 0%
12/10/26 23:23:03 INFO mapred.JobClient:  map 100% reduce 0%
12/10/26 23:23:15 INFO mapred.JobClient:  map 100% reduce 100%
12/10/26 23:23:17 INFO mapred.JobClient: Job complete: job_201210262315_0002
12/10/26 23:23:17 INFO mapred.JobClient: Counters: 17
12/10/26 23:23:17 INFO mapred.JobClient:   Job Counters
12/10/26 23:23:17 INFO mapred.JobClient:     Launched reduce tasks=1
12/10/26 23:23:17 INFO mapred.JobClient:     Launched map tasks=1
12/10/26 23:23:17 INFO mapred.JobClient:     Data-local map tasks=1
12/10/26 23:23:17 INFO mapred.JobClient:   FileSystemCounters
12/10/26 23:23:17 INFO mapred.JobClient:     FILE_BYTES_READ=46
12/10/26 23:23:17 INFO mapred.JobClient:     HDFS_BYTES_READ=16
12/10/26 23:23:17 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=124
12/10/26 23:23:17 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=24
12/10/26 23:23:17 INFO mapred.JobClient:   Map-Reduce Framework
12/10/26 23:23:17 INFO mapred.JobClient:     Reduce input groups=4
12/10/26 23:23:17 INFO mapred.JobClient:     Combine output records=4
12/10/26 23:23:17 INFO mapred.JobClient:     Map input records=1
12/10/26 23:23:17 INFO mapred.JobClient:     Reduce shuffle bytes=46
12/10/26 23:23:17 INFO mapred.JobClient:     Reduce output records=4
12/10/26 23:23:17 INFO mapred.JobClient:     Spilled Records=8
12/10/26 23:23:17 INFO mapred.JobClient:     Map output bytes=32
12/10/26 23:23:17 INFO mapred.JobClient:     Combine input records=4
12/10/26 23:23:17 INFO mapred.JobClient:     Map output records=4
12/10/26 23:23:17 INFO mapred.JobClient:     Reduce input records=4
3. Execute the following command:
$ hadoop fs -ls out
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2012-10-26 23:22 /user/hadoop/out/_logs
-rw-r--r--   1 hadoop supergroup         24 2012-10-26 23:23 /user/hadoop/out/part-r-00000
4. Now execute this command:
$ hadoop fs -cat out/part-r-00000
This    1
a       1
is      1
test.   1
What just happened?
We did three things here, as follows:
- Moved the previously created text file into a new directory on HDFS
- Ran the example WordCount job specifying this new directory and a non-existent output directory as arguments
- Used the fs utility to examine the output of the MapReduce job
As we said earlier, the pseudo-distributed mode has more Java processes, so it may seem
curious that the job output is significantly shorter than for the standalone Pi. The reason is
that the local standalone mode prints information about each individual task execution to
the screen, whereas in the other modes this information is written only to logfiles on the
running hosts.
The output directory is created by Hadoop itself and the actual result files follow the
part-nnnnn convention illustrated here; though given our setup, there is only one result
file. We use the fs -cat command to examine the file, and the results are as expected.
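When a job produces several part-nnnnn files, it is often convenient to pull them all back to the local filesystem in one go. One sketch, relying on the fact that the fs paths accept wildcards, is:

$ hadoop fs -cat out/part-* > wordcount-results.txt
$ cat wordcount-results.txt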
If you specify an existing directory as the output destination for a Hadoop job, it
will fail to run and will throw an exception complaining of an already existing
directory. If you want Hadoop to store its output to a directory, that directory must not exist.
Treat this as a safety mechanism that stops Hadoop from writing over the results of previous
valuable job runs; it is a check you will forget to make more than once. If you are
confident, you can override this behavior, as we will see later.
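So, if you want to re-run WordCount with the same output directory name, one approach is simply to delete the previous output first; a sketch using the directories from the preceding example:

$ hadoop fs -rmr out
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar wordcount data out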
The Pi and WordCount programs are only some of the examples that ship with Hadoop. Here
is how to get a list of them all. See if you can figure some of them out.
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar
Have a go hero – WordCount on a larger body of text
Running a complex framework like Hadoop utilizing five discrete Java processes to count the
words in a single-line text file is not terribly impressive. The power comes from the fact that
we can use exactly the same program to run WordCount on a larger file, or even a massive
corpus of text spread across a multinode Hadoop cluster. If we had such a setup, we would
execute exactly the same commands as we just did by running the program and simply
specifying the location of the directories for the source and output data.
Find a large online text file—Project Gutenberg at http://www.gutenberg.org is a good
starting point—and run WordCount on it by copying it onto the HDFS and executing the
WordCount example. The output may not be as you expect because, in a large body of text,
issues of dirty data, punctuation, and formatting will need to be addressed. Think about how
WordCount could be improved; we'll study how to expand it into a more complex processing
chain in the next chapter.
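A sketch of the mechanics, assuming you have downloaded a Project Gutenberg text to a local file called book.txt (a placeholder name for whichever file you choose):

$ hadoop fs -mkdir books
$ hadoop fs -copyFromLocal book.txt books
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar wordcount books books-out
$ hadoop fs -cat books-out/part-r-00000 | head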
Monitoring Hadoop from the browser
So far, we have been relying on command-line tools and direct command output to see what
our system is doing. Hadoop provides two web interfaces that you should become familiar
with, one for HDFS and the other for MapReduce. Both are useful in pseudo-distributed
mode and are critical tools when you have a fully distributed setup.
The HDFS web UI
Point your web browser to port 50070 on the host running Hadoop. By default, the web
interface should be available from both the local host and any other machine that has
network access. Here is an example screenshot:
There is a lot going on here, but the immediately critical data tells us the number of nodes
in the cluster, the filesystem size, used space, and links to drill down for more info and even
browse the filesystem.
Spend a little time playing with this interface; it needs to become familiar. With a multinode
cluster, the information about live and dead nodes plus the detailed information on their
status history will be critical to debugging cluster problems.
The MapReduce web UI
The JobTracker UI is available on port 50030 by default, and the same access rules stated
earlier apply. Here is an example screenshot:
This is more complex than the HDFS interface! Along with a similar count of the number
of live/dead nodes, there is a history of the number of jobs executed since startup and a
breakdown of their individual task counts.
The list of executing and historical jobs is a doorway to much more information; for every
job, we can access the history of every task attempt on every node and access logs for
detailed information. We now expose one of the most painful parts of working with any
distributed system: debugging. It can be really hard.
Imagine you have a cluster of 100 machines trying to process a massive data set where the
full job requires each host to execute hundreds of map and reduce tasks. If the job starts
running very slowly or explicitly fails, it is not always obvious where the problem lies. Looking
at the MapReduce web UI will likely be the first port of call because it provides such a rich
starting point to investigate the health of running and historical jobs.
Using Elastic MapReduce
We will now turn to Hadoop in the cloud, the Elastic MapReduce service offered by Amazon
Web Services. There are multiple ways to access EMR, but for now we will focus on the
provided web console to contrast a full point-and-click approach to Hadoop with the
previous command-line-driven examples.
Setting up an account in Amazon Web Services
Before using Elastic MapReduce, we need to set up an Amazon Web Services account and
register it with the necessary services.
Creating an AWS account
Amazon has integrated their general accounts with AWS, meaning that if you already have an
account for any of the Amazon retail websites, this is the only account you will need to use
AWS services.
Note that AWS services have a cost; you will need an active credit card associated with the
account to which charges can be made.
If you require a new Amazon account, go to http://aws.amazon.com, select create a new
AWS account, and follow the prompts. Amazon has added a free tier for some services, so
you may find that in the early days of testing and exploration you are keeping many of your
activities within the non-charged tier. The scope of the free tier has been expanding, so make
sure you know for what you will and won't be charged.
Signing up for the necessary services
Once you have an Amazon account, you will need to register it for use with the required
AWS services, that is, Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Elastic
MapReduce (EMR). There is no cost for simply signing up to any AWS service; the process
just makes the service available to your account.
Go to the S3, EC2, and EMR pages linked from http://aws.amazon.com and click on the
Sign up button on each page; then follow the prompts.
Caution! This costs real money!
Before going any further, it is critical to understand that use of AWS services will
incur charges that will appear on the credit card associated with your Amazon
account. Most of the charges are quite small and increase with the amount of
infrastructure consumed; storing 10 GB of data in S3 costs 10 times more than
for 1 GB, and running 20 EC2 instances costs 20 times as much as a single one.
There are tiered cost models, so the actual costs tend to have smaller marginal
increases at higher levels. But you should read carefully through the pricing
sections for each service before using any of them. Note also that currently
data transfer out of AWS services, such as EC2 and S3, is chargeable but data
transfer between services is not. This means it is often most cost-effective to
carefully design your use of AWS to keep data within AWS through as much of
the data processing as possible.
Time for action – WordCount on EMR using the management
console
Let's jump straight into an example on EMR using some provided example code. Carry out
the following steps:
1. Browse to http://aws.amazon.com, go to Developers | AWS Management Console, and then click on the Sign in to the AWS Console button. The default view should look like the following screenshot. If it does not, click on Amazon S3 from within the console.
2. As shown in the preceding screenshot, click on the Create bucket button and enter a name for the new bucket. Bucket names must be globally unique across all AWS users, so do not expect obvious bucket names such as mybucket or s3test to be available.
3. Click on the Region drop-down menu and select the geographic area nearest to you.
4. Click on the Elastic MapReduce link and click on the Create a new Job Flow button. You should see a screen like the following screenshot:
5. You should now see a screen like the preceding screenshot. Select the Run a sample application radio button and the Word Count (Streaming) menu item from the sample application drop-down box and click on the Continue button.
6. The next screen, shown in the preceding screenshot, allows us to specify the location of the output produced by running the job. In the edit box for the output location, enter the name of the bucket created in step 2 (garryt1use is the bucket we are using here); then click on the Continue button.
7. The next screenshot shows the page where we can modify the number and size of the virtual hosts utilized by our job. Confirm that the instance type for each combo box is Small (m1.small), and the number of nodes for the Core group is 2 and for the Task group it is 0. Then click on the Continue button.
8. This next screenshot involves options we will not be using in this example. For the Amazon EC2 key pair field, select the Proceed without key pair menu item and click on the No radio button for the Enable Debugging field. Ensure that the Keep Alive radio button is set to No and click on the Continue button.
9. The next screen, shown in the preceding screenshot, is one we will not be doing much with right now. Confirm that the Proceed with no Bootstrap Actions radio button is selected and click on the Continue button.
10. Confirm the job flow specifications are as expected and click on the Create Job Flow button. Then click on the View my Job Flows and check status buttons. This will give a list of your job flows; you can filter to show only running or completed jobs. The default is to show all, as in the example shown in the following screenshot:
11. Occasionally hit the Refresh button until the status of the listed job, Running or Starting, changes to Complete; then click its checkbox to see details of the job flow, as shown in the following screenshot:
12. Click the S3 tab and select the bucket you created for the output location. You will see it has a single entry called wordcount, which is a directory. Right-click on that and select Open. Then do the same until you see a list of actual files following the familiar Hadoop part-nnnnn naming scheme, as shown in the following screenshot:
13. Right-click on part-00000 and open it. It should look something like this:
a            14716
aa           52
aakar        3
aargau       3
abad         3
abandoned    46
abandonment  6
abate        9
abauj        3
abbassid     4
abbes        3
abbl         3
…
Does this type of output look familiar?
What just happened?
The first step deals with S3, and not EMR. S3 is a scalable storage service that allows you to
store files (called objects) within containers called buckets, and to access objects by their
bucket and object key (that is, name). The model is analogous to the usage of a filesystem, and
though there are underlying differences, they are unlikely to be important within this book.
S3 is where you will place the MapReduce programs and source data you want to process in
EMR, and where the output and logs of EMR Hadoop jobs will be stored. There is a plethora
of third-party tools to access S3, but here we are using the AWS management console, a
browser interface to most AWS services.
Though we suggested you choose the nearest geographic region for S3, this is not required;
non-US locations will typically give better latency for customers located nearer to them, but
they also tend to have a slightly higher cost. The decision of where to host your data and
applications is one you need to make after considering all these factors.
After creating the S3 bucket, we moved to the EMR console and created a new job flow.
This term is used within EMR to refer to a data processing task. As we will see, this can
be a one-time deal where the underlying Hadoop cluster is created and destroyed on
demand or it can be a long-running cluster on which multiple jobs are executed.
We left the default job flow name and then selected the use of an example application,
in this case, the Python implementation of WordCount. The term Hadoop Streaming refers
to a mechanism allowing scripting languages to be used to write map and reduce tasks, but
the functionality is the same as the Java WordCount we used earlier.
The form to specify the job flow requires a location for the source data, program, map and
reduce classes, and a desired location for the output data. For the example we just saw, most
of the fields were prepopulated; and, as can be seen, there are clear similarities to what was
required when running local Hadoop from the command line.
By not selecting the Keep Alive option, we chose a Hadoop cluster that would be created
specifically to execute this job, and destroyed afterwards. Such a cluster will have a longer
startup time but will minimize costs. If you choose to keep the job flow alive, you will see
additional jobs executed more quickly as you don't have to wait for the cluster to start up.
But you will be charged for the underlying EC2 resources until you explicitly terminate the
job flow.
After confirming that we did not need to add any bootstrap options, we selected the
number and types of hosts we wanted to deploy into our Hadoop cluster. EMR distinguishes
between three different groups of hosts:
Master group: This is a controlling node hosting the NameNode and the JobTracker.
There is only 1 of these.
Core group: These are nodes running both HDFS DataNodes and MapReduce
TaskTrackers. The number of hosts is configurable.
Task group: These hosts don't hold HDFS data but do run TaskTrackers and can
provide more processing horsepower. The number of hosts is configurable.
The type of host refers to different classes of hardware capability, the details of which can
be found on the EC2 page. Larger hosts are more powerful but have a higher cost. Currently,
by default, the total number of hosts in a job flow must be 20 or less, though Amazon has a
simple form to request higher limits.
After confirming that all is as expected, we launch the job flow and monitor it on the console
until the status changes to COMPLETED. At this point, we go back to S3, look inside the
bucket we specified as the output destination, and examine the output of our WordCount
job, which should look very similar to the output of a local Hadoop WordCount.
An obvious question is where did the source data come from? This was one of the
prepopulated fields in the job flow specification we saw during the creation process. For
nonpersistent job flows, the most common model is for the source data to be read from a
specified S3 source location and the resulting data written to the specified result S3 bucket.
That is it! The AWS management console allows fine-grained control of services such as S3
and EMR from the browser. Armed with nothing more than a browser and a credit card,
we can launch Hadoop jobs to crunch data without ever having to worry about any of the
mechanics around installing, running, or managing Hadoop.
Have a go hero – other EMR sample applications
EMR provides several other sample applications. Why not try some of them as well?
Other ways of using EMR
Although a powerful and impressive tool, the AWS management console is not always
how we want to access S3 and run EMR jobs. As with all AWS services, there are both
programmatic and command-line tools to use the services.
AWS credentials
Before using either programmatic or command-line tools, however, we need to look at how
an account holder authenticates for AWS to make such requests. As these are chargeable
services, we really do not want anyone else to make requests on our behalf. Note that as
we logged directly into the AWS management console with our AWS account in the
preceding example, we did not have to worry about this.
Each AWS account has several identifiers that are used when accessing the various services:
Account ID: Each AWS account has a numeric ID.
Access key: Each account has an associated access key that is used to identify the
account making the request.
Secret access key: The partner to the access key is the secret access key. The access
key is not a secret and could be exposed in service requests, but the secret access
key is what you use to validate yourself as the account owner.
Key pairs: These are the key pairs used to log in to EC2 hosts. It is possible to either
generate public/private key pairs within EC2 or to import externally generated keys
into the system.
If this sounds confusing, it's because it is. At least at first. When using a tool to access an
AWS service, however, there's usually a single up-front step of adding the right credentials
to a configuration file, and then everything just works. However, if you do decide to explore
programmatic or command-line tools, it will be worth a little time investment to read the
documentation for each service to understand how its security works.
The EMR command-line tools
In this book, we will not do anything with S3 and EMR that cannot be done from the AWS
management console. However, when working with operational workloads, looking to
integrate into other workflows, or automating service access, a browser-based tool is not
appropriate, regardless of how powerful it is. Using the direct programmatic interfaces to
a service provides the most granular control but requires the most effort.
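For a flavor of what the programmatic route looks like, here is a minimal sketch that uploads a file to S3, assuming the first-generation AWS SDK for Java is on the classpath; the credentials, bucket, and file names are placeholders only, not values from this walkthrough.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class S3UploadSketch
{
    public static void main(String[] args)
    {
        // Placeholder credentials; in practice these are the access key and
        // secret access key described in the preceding section.
        BasicAWSCredentials credentials =
            new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY");

        AmazonS3Client s3 = new AmazonS3Client(credentials);

        // Upload a local file as an object in an existing bucket; an EMR
        // job flow could then read it as input. Bucket and key names here
        // are illustrative only.
        s3.putObject("garryt1use", "input/data.txt", new File("data.txt"));
    }
}

A few lines like this are all it takes to script the S3 side of a job flow, which is exactly the kind of automation a browser console cannot offer.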
For many of its services, Amazon provides a set of command-line tools that offer a convenient
way of automating access while minimizing the amount of development required.
The Elastic MapReduce command-line tools, linked from the main EMR page, are worth a
look if you want a more CLI-based interface to EMR but don't want to write custom code
just yet.
The AWS ecosystem
Each AWS service also has a plethora of third-party tools, services, and libraries that can
provide different ways of accessing the service, provide additional functionality, or offer
new utility programs. Check out the developer tools hub at http://aws.amazon.com/
developertools, as a starting point.
Comparison of local versus EMR Hadoop
After our first experience of both a local Hadoop cluster and its equivalent in EMR, this is a
good point at which to consider the differences between the two approaches.
As may be apparent, the key differences are not really about capability; if all we want is an
environment to run MapReduce jobs, either approach is perfectly suitable. Instead, the
distinguishing characteristics revolve around a topic we touched on in Chapter 1, What It's
All About: whether you prefer a cost model that involves upfront infrastructure costs and
ongoing maintenance effort, or a pay-as-you-go model with a lower maintenance burden
along with rapid and conceptually infinite scalability. Other than the cost decisions, there are
a few things to keep in mind:
EMR supports specific versions of Hadoop and has a policy of upgrading over time.
If you have a need for a specific version, in particular if you need the latest and
greatest versions immediately after release, then the lag before these are live on
EMR may be unacceptable.
You can start up a persistent EMR job flow and treat it much as you would a local
Hadoop cluster, logging into the hosting nodes and tweaking their configuration. If
you find yourself doing this, it's worth asking whether that level of control is really needed
and, if so, whether it is preventing you from getting all the cost model benefits of a move to EMR.
If it does come down to a cost consideration, remember to factor in all the hidden
costs of a local cluster that are often forgotten. Think about the costs of power,
space, cooling, and facilities. Not to mention the administration overhead, which
can be nontrivial if things start breaking in the early hours of the morning.
Summary
We covered a lot of ground in this chapter with regard to getting a Hadoop cluster up and
running and executing MapReduce programs on it.
Specifically, we covered the prerequisites for running Hadoop on local Ubuntu hosts.
We also saw how to install and configure a local Hadoop cluster in either standalone or
pseudo-distributed modes. Then, we looked at how to access the HDFS filesystem and
submit MapReduce jobs. We then moved on and learned what accounts are needed to
access Elastic MapReduce and other AWS services.
We saw how to browse and create S3 buckets and objects using the AWS management
console, and also how to create a job flow and use it to execute a MapReduce job on an
EMR-hosted Hadoop cluster. We also discussed other ways of accessing AWS services and
studied the differences between local and EMR-hosted Hadoop.
Now that we have learned about running Hadoop locally or on EMR, we are ready to start
writing our own MapReduce programs, which is the topic of the next chapter.
3
Understanding MapReduce
The previous two chapters discussed the problems that Hadoop allows us
to solve and gave some hands-on experience of running example MapReduce
jobs. With this foundation, we will now go a little deeper.
In this chapter we will be:
Understanding how key/value pairs are the basis of Hadoop tasks
Learning the various stages of a MapReduce job
Examining the workings of the map, reduce, and optional combiner stages in detail
Looking at the Java API for Hadoop and using it to develop some simple
MapReduce jobs
Learning about Hadoop input and output
Key/value pairs
Since Chapter 1, What It's All About, we have been talking about operations that process
and provide the output in terms of key/value pairs without explaining why. It is time to
address that.
What it means
Firstly, we will clarify just what we mean by key/value pairs by highlighting similar concepts
in the Java standard library. The java.util.Map interface is the parent of commonly used
classes such as HashMap and (through some library backward reengineering) even the
original Hashtable.
For any Java Map object, its contents are a set of mappings from a given key of a specified
type to a related value of a potentially different type. A HashMap object could, for example,
contain mappings from a person's name (String) to his or her birthday (Date).
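As a quick refresher, the following fragment shows that idea in plain Java; the names and values are illustrative only.

import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class BirthdayLookup
{
    public static void main(String[] args)
    {
        // Keys are names (String), values are birthdays (Date).
        Map<String, Date> birthdays = new HashMap<String, Date>();
        birthdays.put("Alice", new Date());

        // The classic key/value questions: is this key present,
        // and what value is mapped to it?
        if (birthdays.containsKey("Alice"))
        {
            System.out.println(birthdays.get("Alice"));
        }
    }
}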
In the context of Hadoop, we are referring to data that also comprises keys that relate to
associated values. This data is stored in such a way that the various values in the data set
can be sorted and rearranged across a set of keys. If we are using key/value data, it will
make sense to ask questions such as the following:
Does a given key have a mapping in the data set?
What are the values associated with a given key?
What is the complete set of keys?
Think back to WordCount from the previous chapter. We will go into it in more detail shortly,
but the output of the program is clearly a set of key/value relationships; for each word
(the key), there is a count (the value) of its number of occurrences. Think about this simple
example and some important features of key/value data will become apparent, as follows:
Keys must be unique but values need not be
Each value must be associated with a key, but a key could have no values
(though not in this particular example)
Careful definition of the key is important; deciding on whether or not the
counts are applied with case sensitivity will give different results
Note that we need to define carefully what we mean by keys being unique
here. This does not mean the key occurs only once; in our data set we may see
a key occur numerous times and, as we shall see, the MapReduce model has
a stage where all values associated with each key are collected together. The
uniqueness of keys guarantees that if we collect together every value seen for
any given key, the result will be an association from a single instance of the key
to every value mapped in such a way, and none will be omitted.
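To see that collecting behavior in miniature, here is a plain-Java sketch of grouping every value seen for a key; it is not Hadoop code, and the sample pairs are illustrative only.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByKeySketch
{
    public static void main(String[] args)
    {
        // A key may occur many times in the raw data...
        String[][] pairs = { {"the", "1"}, {"cat", "1"}, {"the", "1"} };

        // ...but after grouping, each unique key maps to the complete list
        // of values seen for it; conceptually this is what the framework
        // does between the map and reduce stages.
        Map<String, List<String>> grouped = new HashMap<String, List<String>>();
        for (String[] pair : pairs)
        {
            if (!grouped.containsKey(pair[0]))
            {
                grouped.put(pair[0], new ArrayList<String>());
            }
            grouped.get(pair[0]).add(pair[1]);
        }

        // Prints something like {cat=[1], the=[1, 1]}; ordering is not guaranteed.
        System.out.println(grouped);
    }
}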
Why key/value data?
Using key/value data as the foundation of MapReduce operations allows for a powerful
programming model that is surprisingly widely applicable, as can be seen by the adoption of
Hadoop and MapReduce across a wide variety of industries and problem scenarios. Much
data is either intrinsically key/value in nature or can be represented in such a way. It is a
simple model with broad applicability and semantics straightforward enough that programs
defined in terms of it can be applied by a framework like Hadoop.
Of course, the data model itself is not the only thing that makes Hadoop useful; its real
power lies in how it uses the techniques of parallel execution, and divide and conquer
discussed in Chapter 1, What It's All About. We can have a large number of hosts on which
we can store data and execute our processing, and even use a framework that manages the division of
the larger task into smaller chunks, and the combination of partial results into the overall
answer. But we need this framework to provide us with a way of expressing our problems
that doesn't require us to be an expert in the execution mechanics; we want to express the
transformations required on our data and then let the framework do the rest. MapReduce,
with its key/value interface, provides such a level of abstraction, whereby the programmer
only has to specify these transformations and Hadoop handles the complex process of
applying this to arbitrarily large data sets.
Some real-world examples
To become less abstract, let's think of some real-world data that is key/value pair:
An address book relates a name (key) to contact information (value)
A bank account uses an account number (key) to associate with the account
details (value)
The index of a book relates a word (key) to the pages on which it occurs (value)
On a computer filesystem, filenames (keys) allow access to any sort of data,
such as text, images, and sound (values)
These examples are intentionally broad in scope, to help and encourage you to think that
key/value data is not some very constrained model used only in high-end data mining but
a very common model that is all around us.
We would not be having this discussion if this was not important to Hadoop. The bottom line
is that if the data can be expressed as key/value pairs, it can be processed by MapReduce.
MapReduce as a series of key/value transformations
You may have come across MapReduce described in terms of key/value transformations, in
particular the intimidating one looking like this:
{K1,V1} -> {K2, List<V2>} -> {K3,V3}
We are now in a position to understand what this means:
The input to the map method of a MapReduce job is a series of key/value pairs that
we'll call K1 and V1.
The output of the map method (and hence input to the reduce method) is a series
of keys and an associated list of values that are called K2 and V2. Note that each
mapper simply outputs a series of individual key/value outputs; these are combined
into a key and list of values in the shuffle stage.
The final output of the MapReduce job is another series of key/value pairs, called K3
and V3.
These sets of key/value pairs don't have to be different; it would be quite possible to input,
say, names and contact details and output the same, with perhaps some intermediary format
used in collating the information. Keep this three-stage model in mind as we explore the Java
API for MapReduce next. We will first walk through the main parts of the API you will need
and then do a systematic examination of the execution of a MapReduce job.
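To make the transformation concrete before we move on, here is a small, purely hypothetical trace of WordCount expressed in these terms; the sample line and counts are illustrative only.

// Hypothetical WordCount trace of {K1,V1} -> {K2, List<V2>} -> {K3,V3}.

// {K1,V1}: the map input, a position in the file and one line of text
//   (0, "the cat sat on the mat")

// {K2, List<V2>}: the reduce input, each word with every value emitted
// for it by the mappers, grouped together during the shuffle
//   ("the", [1, 1])   ("cat", [1])   ("sat", [1])   ...

// {K3,V3}: the final output, each word with its total count
//   ("the", 2)   ("cat", 1)   ("sat", 1)   ...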
Pop quiz – key/value pairs
Q1. The concept of key/value pairs is…
1. Something created by and specific to Hadoop.
2. A way of expressing relationships we often see but don't think of as such.
3. An academic concept from computer science.
Q2. Are username/password combinations an example of key/value data?
1. Yes, it's a clear case of one value being associated to the other.
2. No, the password is more of an attribute of the username, there's no index-type
relationship.
3. We'd not usually think of them as such, but Hadoop could still process a series
of username/password combinations as key/value pairs.
The Hadoop Java API for MapReduce
Hadoop underwent a major API change in its 0.20 release; the newer API is the primary interface
in the 1.0 version we use in this book. Though the prior API was certainly functional, the
community felt it was unwieldy and unnecessarily complex in some regards.
The new API, sometimes referred to as context objects for reasons we'll see later, is the future
of Java's MapReduce development; as such, we will use it wherever possible in this book.
Note that caveat: there are parts of the pre-0.20 MapReduce libraries that have
not been ported to the new API, so we will use the old interfaces when we need to examine
any of these.
The 0.20 MapReduce Java API
The 0.20 and above versions of the MapReduce API have most of the key classes and interfaces
either in the org.apache.hadoop.mapreduce package or its subpackages.
In most cases, the implementation of a MapReduce job will provide job-specific subclasses
of the Mapper and Reducer base classes found in this package.
We'll stick to the commonly used K1/K2/K3 (and so on) terminology,
though more recently the Hadoop API has, in places, used terms such as
KEYIN/VALUEIN and KEYOUT/VALUEOUT instead. For now, we will
stick with K1/K2/K3 as it helps in understanding the end-to-end data flow.
The Mapper class
This is a cut-down view of the base Mapper class provided by Hadoop. For our own
mapper implementations, we will subclass this base class and override the specified
method as follows:
class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context)
        throws IOException, InterruptedException
    {..}
}
Although the use of Java generics can make this look a little opaque at first, there is
actually not that much going on. The class is defined in terms of the key/value input
and output types, and then the map method takes an input key/value pair in its parameters.
The other parameter is an instance of the Context class that provides various mechanisms
to communicate with the Hadoop framework, one of which is to output the results of a map
or reduce method.
Notice that the map method only refers to a single instance of K1 and V1 key/
value pairs. This is a critical aspect of the MapReduce paradigm in which you
write classes that process single records and the framework is responsible
for all the work required to turn an enormous data set into a stream of key/
value pairs. You will never have to write map or reduce classes that try to
deal with the full data set. Hadoop also provides mechanisms through its
InputFormat and OutputFormat classes that provide implementations
of common file formats and likewise remove the need to write file
parsers for any but custom file types.
There are three additional methods that may sometimes need to be overridden.
protected void setup(Mapper.Context context)
throws IOException, InterruptedException
This method is called once before any key/value pairs are presented to the map method.
The default implementation does nothing.
protected void cleanup(Mapper.Context context)
throws IOException, InterruptedException
This method is called once after all key/value pairs have been presented to the map method.
The default implementation does nothing.
protected void run(Mapper.Context context)
throws IOException, InterruptedException
This method controls the overall flow of task processing within a JVM. The default
implementation calls the setup method once before repeatedly calling the map
method for each key/value pair in the split, and then finally calls the cleanup method.
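Putting these pieces together, here is a minimal sketch of a mapper subclass; it anticipates the WordCount implementation later in the chapter, and the class name and tokenizing logic are illustrative rather than the book's exact code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: emits (word, 1) for every word in each input line.
public class SketchMapper
    extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException
    {
        // Called once per task before any map() calls; nothing is needed
        // here, but this is where per-task initialization would go.
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException
    {
        // The framework hands us one record at a time; we never see
        // the whole data set.
        for (String token : value.toString().split("\\s+"))
        {
            if (!token.isEmpty())
            {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}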
Downloading the example code
You can download the example code files for all Packt books you have purchased
from your account at http://www.packtpub.com. If you purchased this
book elsewhere, you can visit http://www.packtpub.com/support
and register to have the files e-mailed directly to you.
The Reducer class
The Reducer base class works very similarly to the Mapper class, and usually requires only
subclasses to override a single reduce method. Here is the cut-down class definition:
public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values,
        Reducer.Context context)
        throws IOException, InterruptedException
    {..}
}
Again, notice the class definition in terms of the broader data flow (the reduce method
accepts K2/V2 as input and provides K3/V3 as output) while the actual reduce method
takes only a single key and its associated list of values. The Context object is again the
mechanism to output the result of the method.
This class also has the setup, run, and cleanup methods with similar default
implementations as with the Mapper class that can optionally be overridden:
protected void setup(Reducer.Context context)
throws IOException, InterruptedException
This method is called once before any key/lists of values are presented to the reduce
method. The default implementation does nothing.
protected void cleanup(Reducer.Context context)
throws IOException, InterruptedException
This method is called once after all key/lists of values have been presented to the reduce
method. The default implementation does nothing.
protected void run(Reducer.Context context)
throws IOException, InterruptedException
This method controls the overall flow of processing the task within the JVM. The default
implementation calls the setup method once before repeatedly calling the reduce method for
each key and its list of values provided to the Reducer class, and then finally calls the cleanup method.
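As with the mapper, a minimal sketch of a reducer subclass helps make this concrete; it again anticipates the WordCount implementation later in the chapter, and the class name is illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative only: sums the counts emitted for each word.
public class SketchReducer
    extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context)
        throws IOException, InterruptedException
    {
        // The framework has already grouped every value for this key;
        // we simply aggregate them and emit a single K3/V3 pair.
        int total = 0;
        for (IntWritable value : values)
        {
            total += value.get();
        }
        context.write(key, new IntWritable(total));
    }
}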
The Driver class
Although our mapper and reducer implementations are all we need to perform the
MapReduce job, there is one more piece of code required: the driver that communicates
with the Hadoop framework and specifies the configuration elements needed to run a
MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer
classes to use, where to find the input data and in what format, and where to place the
output data and how to format it. There is an additional variety of other configuration
options that can be set and which we will see throughout this book.
There is no default parent Driver class to subclass; the driver logic usually exists in the main
method of the class written to encapsulate a MapReduce job. Take a look at the following
code snippet as an example driver. Don't worry about how each line works, though you
should be able to work out generally what each is doing:
public class ExampleDriver
{
...
    public static void main(String[] args) throws Exception
    {
        // Create a Configuration object that is used to set other options
        Configuration conf = new Configuration();

        // Create the object representing the job
        Job job = new Job(conf, "ExampleJob");

        // Set the name of the main class in the job jarfile
        job.setJarByClass(ExampleDriver.class);

        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);

        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);

        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Execute the job and wait for it to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Given our previous talk of jobs, it is not surprising that much of the setup involves operations
on a Job object. This includes setting the job name and specifying which classes are to be
used for the mapper and reducer implementations.
Certain input/output configurations are set and, finally, the arguments passed to the main
method are used to specify the input and output locations for the job. This is a very common
model that you will see often.
There are a number of default values for configuration options, and we are implicitly using
some of them in the preceding class. Most notably, we don't say anything about the file
format of the input files or how the output files are to be written. These are defined through
the InputFormat and OutputFormat classes mentioned earlier; we will explore them
in detail later. The default input and output formats are text files that suit our WordCount
example. There are multiple ways of expressing data within text files, as well as
particularly optimized binary formats.
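For illustration, the following fragment shows what making those text-file defaults explicit would look like in a driver; the class name is hypothetical, and the two set*FormatClass calls simply restate what Hadoop assumes when nothing is specified.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExplicitFormatsSketch
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "ExplicitFormats");

        // These calls restate the defaults; plain text input and output are
        // what Hadoop assumes when nothing is set, which is why ExampleDriver
        // did not need them. Swapping in other classes is how a job opts
        // into different file formats.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
    }
}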
A common model for less complex MapReduce jobs is to have the Mapper and Reducer
classes as inner classes within the driver. This allows everything to be kept in a single file,
which simplifies the code distribution.
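A bare skeleton of that single-file layout might look like the following; it is only a sketch of the structure, with hypothetical class names, and the full WordCount implementation in the next section fills in the details.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Structural sketch only: mapper and reducer as static inner classes of
// the driver, so a single source file holds the whole job.
public class SingleFileJobSketch
{
    public static class InnerMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        // map() would go here
    }

    public static class InnerReducer
        extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        // reduce() would go here
    }

    public static void main(String[] args) throws Exception
    {
        // Driver logic (Job setup) would go here, as in ExampleDriver.
    }
}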
Writing MapReduce programs
We have been using and talking about WordCount for quite some time now; let's actually
write an implementation, compile, and run it, and then explore some modifications.
Time for action – setting up the classpath
To compile any Hadoop-related code, we will need to refer to the standard
Hadoop-bundled classes.
Add the hadoop-core-1.0.4.jar file from the distribution to the Java classpath
as follows:
$ export CLASSPATH=.:${HADOOP_HOME}/hadoop-core-1.0.4.jar:${CLASSPATH}
What just happened?
This adds the hadoop-core-1.0.4.jar file explicitly to the classpath alongside the
current directory and the previous contents of the CLASSPATH environment variable.
Once again, it would be good to put this in your shell startup file or a standalone file
to be sourced.
We will later need to also have many of the supplied third-party libraries
that come with Hadoop on our classpath, and there is a shortcut to do this.
For now, the explicit addition of the core JAR file will suffice.
Time for action – implementing WordCount
We have seen the use of the WordCount example program in Chapter 2, Getting Hadoop
Up and Running. Now we will explore our own Java implementation by performing the
following steps:
1. Enter the following code into the WordCount1.java file:
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount1
{
public static class WordCountMapper
extends Mapper