Hadoop Beginner's Guide
Learn how to crunch big data to extract meaning from the
data avalanche
Garry Turkington
BIRMINGHAM - MUMBAI
Hadoop Beginner's Guide
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers
and distributors will be held liable for any damages caused or alleged to be caused directly or
indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2013
Production Reference: 1150213
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-84951-730-0
www.packtpub.com
Cover Image by Asher Wishkerman (a.wishkerman@mpic.de)
Credits
Author
Garry Turkington
Reviewers
David Gruzman
Muthusamy Manigandan
Vidyasagar N V
Acquision Editor
Robin de Jongh
Lead Technical Editor
Azharuddin Sheikh
Technical Editors
Ankita Meshram
Varun Pius Rodrigues
Copy Editors
Brandt D'Mello
Aditya Nair
Laxmi Subramanian
Ruta Waghmare
Project Coordinator
Leena Purkait
Proofreader
Maria Gould
Indexer
Hemangini Bari
Producon Coordinator
Nitesh Thakur
Cover Work
Nitesh Thakur
About the Author
Garry Turkington has 14 years of industry experience, most of which has been focused
on the design and implementation of large-scale distributed systems. In his current roles as
VP Data Engineering and Lead Architect at Improve Digital, he is primarily responsible for
the realization of systems that store, process, and extract value from the company's large
data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led
several software development teams building systems that process Amazon catalog data for
every item worldwide. Prior to this, he spent a decade in various government positions in
both the UK and USA.
He has BSc and PhD degrees in Computer Science from the Queen's University of Belfast in
Northern Ireland and an MEng in Systems Engineering from Stevens Institute of Technology
in the USA.
I would like to thank my wife Lea for her support and encouragement—not
to mention her patience—throughout the writing of this book and my
daughter, Maya, whose spirit and curiosity are more of an inspiration than
she could ever imagine.
About the Reviewers
David Gruzman is a Hadoop and big data architect with more than 18 years of hands-on
experience, specializing in the design and implementation of scalable high-performance
distributed systems. He has extensive expertise in OOA/OOD and (R)DBMS technology. He
is an Agile methodology adept and strongly believes that a daily coding routine makes good
software architects. He is interested in solving challenging problems related to real-time
analytics and the application of machine learning algorithms to big data sets.
He founded—and is working with—BigDataCraft.com, a boutique consulting firm in the area
of big data. Visit their site at www.bigdatacraft.com. David can be contacted at
david@bigdatacraft.com. More detailed information about his skills and experience can be
found at http://www.linkedin.com/in/davidgruzman.
Muthusamy Manigandan is a systems architect for a startup. Prior to this, he was a Staff
Engineer at VMware and Principal Engineer with Oracle. Mani has been programming for
the past 14 years on large-scale distributed-computing applications. His areas of interest are
machine learning and algorithms.
Vidyasagar N V has been interested in computer science since an early age. Some of his
serious work in computers and computer networks began during his high school days. Later,
he went to the prestigious Institute of Technology, Banaras Hindu University, for his B.Tech.
He has been working as a software developer and data expert, developing and building
scalable systems. He has worked with a variety of second, third, and fourth generation
languages. He has worked with flat files, indexed files, hierarchical databases, network
databases, relational databases, NoSQL databases, Hadoop, and related technologies.
Currently, he is working as a Senior Developer at Collective Inc., developing big data-based
structured data extraction techniques from the Web and local information. He enjoys
producing high-quality software and web-based solutions and designing secure and
scalable data systems. He can be contacted at vidyasagar1729@gmail.com.
I would like to thank the Almighty, my parents, Mr. N Srinivasa Rao and
Mrs. Latha Rao, and my family who supported and backed me throughout
my life. I would also like to thank my friends for being good friends and
all those people willing to donate their time, effort, and expertise by
participating in open source software projects. Thank you, Packt Publishing,
for selecting me as one of the technical reviewers for this wonderful book.
It is my honor to be a part of it.
www.PacktPub.com
Support les, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support les and downloads related
to your book.
Did you know that Packt oers eBook versions of every book published, with PDF and ePub
les available? You can upgrade to the eBook version at www.PacktPub.com and as a
print book customer, you are entled to a discount on the eBook copy. Get in touch with
us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collecon of free technical arcles, sign
up for a range of free newsleers and receive exclusive discounts and oers on Packt
books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant soluons to your IT quesons? PacktLib is Packt's online digital book
library. Here, you can access, read and search across Packt's enre library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view nine enrely free books. Simply use your login credenals for
immediate access.
Table of Contents
Preface
Chapter 1: What It's All About
Big data processing
The value of data
Historically for the few and not the many
Classic data processing systems
Limiting factors
A different approach
All roads lead to scale-out
Share nothing
Expect failure
Smart software, dumb hardware
Move processing, not data
Build applications, not infrastructure
Hadoop
Thanks, Google
Thanks, Doug
Thanks, Yahoo
Parts of Hadoop
Common building blocks
HDFS
MapReduce
Better together
Common architecture
What it is and isn't good for
Cloud computing with Amazon Web Services
Too many clouds
A third way
Different types of costs
AWS – infrastructure on demand from Amazon
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elasc MapReduce (EMR) 22
What this book covers 23
A dual approach 23
Summary 24
Chapter 2: Geng Hadoop Up and Running 25
Hadoop on a local Ubuntu host 25
Other operang systems 26
Time for acon – checking the prerequisites 26
Seng up Hadoop 27
A note on versions 27
Time for acon – downloading Hadoop 28
Time for acon – seng up SSH 29
Conguring and running Hadoop 30
Time for acon – using Hadoop to calculate Pi 30
Three modes 32
Time for acon – conguring the pseudo-distributed mode 32
Conguring the base directory and formang the lesystem 34
Time for acon – changing the base HDFS directory 34
Time for acon – formang the NameNode 35
Starng and using Hadoop 36
Time for acon – starng Hadoop 36
Time for acon – using HDFS 38
Time for acon – WordCount, the Hello World of MapReduce 39
Monitoring Hadoop from the browser 42
The HDFS web UI 42
Using Elasc MapReduce 45
Seng up an account on Amazon Web Services 45
Creang an AWS account 45
Signing up for the necessary services 45
Time for acon – WordCount in EMR using the management console 46
Other ways of using EMR 54
AWS credenals 54
The EMR command-line tools 54
The AWS ecosystem 55
Comparison of local versus EMR Hadoop 55
Summary 56
Chapter 3: Understanding MapReduce 57
Key/value pairs 57
What it mean 57
Why key/value data? 58
Some real-world examples 59
MapReduce as a series of key/value transformaons 59
The Hadoop Java API for MapReduce
The 0.20 MapReduce Java API
The Mapper class
The Reducer class
The Driver class
Writing MapReduce programs
Time for action – setting up the classpath
Time for action – implementing WordCount
Time for action – building a JAR file
Time for action – running WordCount on a local Hadoop cluster
Time for action – running WordCount on EMR
The pre-0.20 Java MapReduce API
Hadoop-provided mapper and reducer implementations
Time for action – WordCount the easy way
Walking through a run of WordCount
Startup
Splitting the input
Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reduce input
Partitioning
The optional partition function
Reducer input
Reducer execution
Reducer output
Shutdown
That's all there is to it!
Apart from the combiner…maybe
Why have a combiner?
Time for action – WordCount with a combiner
When you can use the reducer as the combiner
Time for action – fixing WordCount to work with a combiner
Reuse is your friend
Hadoop-specific data types
The Writable and WritableComparable interfaces
Introducing the wrapper classes
Primitive wrapper classes
Array wrapper classes
Map wrapper classes
Time for acon – using the Writable wrapper classes 86
Other wrapper classes 88
Making your own 88
Input/output 88
Files, splits, and records 89
InputFormat and RecordReader 89
Hadoop-provided InputFormat 90
Hadoop-provided RecordReader 90
Output formats and RecordWriter 91
Hadoop-provided OutputFormat 91
Don't forget Sequence les 91
Summary 92
Chapter 4: Developing MapReduce Programs 93
Using languages other than Java with Hadoop 94
How Hadoop Streaming works 94
Why to use Hadoop Streaming 94
Time for acon – WordCount using Streaming 95
Dierences in jobs when using Streaming 97
Analyzing a large dataset 98
Geng the UFO sighng dataset 98
Geng a feel for the dataset 99
Time for acon – summarizing the UFO data 99
Examining UFO shapes 101
Time for acon – summarizing the shape data 102
Time for acon – correlang sighng duraon to UFO shape 103
Using Streaming scripts outside Hadoop 106
Time for acon – performing the shape/me analysis from the command line 107
Java shape and locaon analysis 107
Time for acon – using ChainMapper for eld validaon/analysis 108
Too many abbreviaons 112
Using the Distributed Cache 113
Time for acon – using the Distributed Cache to improve locaon output 114
Counters, status, and other output 117
Time for acon – creang counters, task states, and wring log output 118
Too much informaon! 125
Summary 126
Chapter 5: Advanced MapReduce Techniques 127
Simple, advanced, and in-between 127
Joins 128
When this is a bad idea
Map-side versus reduce-side joins
Matching account and sales information
Time for action – reduce-side joins using MultipleInputs
DataJoinMapper and TaggedMapperOutput
Implementing map-side joins
Using the Distributed Cache
Pruning data to fit in the cache
Using a data representation instead of raw data
Using multiple mappers
To join or not to join...
Graph algorithms
Graph 101
Graphs and MapReduce – a match made somewhere
Representing a graph
Time for action – representing the graph
Overview of the algorithm
The mapper
The reducer
Iterative application
Time for action – creating the source code
Time for action – the first run
Time for action – the second run
Time for action – the third run
Time for action – the fourth and last run
Running multiple jobs
Final thoughts on graphs
Using language-independent data structures
Candidate technologies
Introducing Avro
Time for action – getting and installing Avro
Avro and schemas
Time for action – defining the schema
Time for action – creating the source Avro data with Ruby
Time for action – consuming the Avro data with Java
Using Avro within MapReduce
Time for action – generating shape summaries in MapReduce
Time for action – examining the output data with Ruby
Time for action – examining the output data with Java
Going further with Avro
Summary
Chapter 6: When Things Break
Failure
Embrace failure
Or at least don't fear it
Don't try this at home
Types of failure
Hadoop node failure
The dfsadmin command
Cluster setup, test files, and block sizes
Fault tolerance and Elastic MapReduce
Time for action – killing a DataNode process
NameNode and DataNode communication
Time for action – the replication factor in action
Time for action – intentionally causing missing blocks
When data may be lost
Block corruption
Time for action – killing a TaskTracker process
Comparing the DataNode and TaskTracker failures
Permanent failure
Killing the cluster masters
Time for action – killing the JobTracker
Starting a replacement JobTracker
Time for action – killing the NameNode process
Starting a replacement NameNode
The role of the NameNode in more detail
File systems, files, blocks, and nodes
The single most important piece of data in the cluster – fsimage
DataNode startup
Safe mode
SecondaryNameNode
So what to do when the NameNode process has a critical failure?
BackupNode/CheckpointNode and NameNode HA
Hardware failure
Host failure
Host corruption
The risk of correlated failures
Task failure due to software
Failure of slow running tasks
Time for action – causing task failure
Hadoop's handling of slow-running tasks
Speculative execution
Hadoop's handling of failing tasks
Task failure due to data
Handling dirty data through code
Using Hadoop's skip mode
Time for acon – handling dirty data by using skip mode 197
To skip or not to skip... 202
Summary 202
Chapter 7: Keeping Things Running 205
A note on EMR 206
Hadoop conguraon properes 206
Default values 206
Time for acon – browsing default properes 206
Addional property elements 208
Default storage locaon 208
Where to set properes 209
Seng up a cluster 209
How many hosts? 210
Calculang usable space on a node 210
Locaon of the master nodes 211
Sizing hardware 211
Processor / memory / storage rao 211
EMR as a prototyping plaorm 212
Special node requirements 213
Storage types 213
Commodity versus enterprise class storage 214
Single disk versus RAID 214
Finding the balance 214
Network storage 214
Hadoop networking conguraon 215
How blocks are placed 215
Rack awareness 216
Time for acon – examining the default rack conguraon 216
Time for acon – adding a rack awareness script 217
What is commodity hardware anyway? 219
Cluster access control 220
The Hadoop security model 220
Time for acon – demonstrang the default security 220
User identy 223
More granular access control 224
Working around the security model via physical access control 224
Managing the NameNode 224
Conguring mulple locaons for the fsimage class 225
Time for acon – adding an addional fsimage locaon 225
Where to write the fsimage copies 226
Swapping to another NameNode host 227
Having things ready before disaster strikes 227
Time for acon – swapping to a new NameNode host 227
Don't celebrate quite yet! 229
What about MapReduce? 229
Managing HDFS 230
Where to write data 230
Using balancer 230
When to rebalance 230
MapReduce management 231
Command line job management 231
Job priories and scheduling 231
Time for acon – changing job priories and killing a job 232
Alternave schedulers 233
Capacity Scheduler 233
Fair Scheduler 234
Enabling alternave schedulers 234
When to use alternave schedulers 234
Scaling 235
Adding capacity to a local Hadoop cluster 235
Adding capacity to an EMR job ow 235
Expanding a running job ow 235
Summary 236
Chapter 8: A Relaonal View on Data with Hive 237
Overview of Hive 237
Why use Hive? 238
Thanks, Facebook! 238
Seng up Hive 238
Prerequisites 238
Geng Hive 239
Time for acon – installing Hive 239
Using Hive 241
Time for acon – creang a table for the UFO data 241
Time for acon – inserng the UFO data 244
Validang the data 246
Time for acon – validang the table 246
Time for acon – redening the table with the correct column separator 248
Hive tables – real or not? 250
Time for acon – creang a table from an exisng le 250
Time for acon – performing a join 252
Hive and SQL views 254
Time for acon – using views 254
Handling dirty data in Hive 257
Time for acon – exporng query output 258
Paroning the table 260
Time for acon – making a paroned UFO sighng table 260
Buckeng, clustering, and sorng... oh my! 264
User Dened Funcon 264
Time for acon – adding a new User Dened Funcon (UDF) 265
To preprocess or not to preprocess... 268
Hive versus Pig 269
What we didn't cover 269
Hive on Amazon Web Services 270
Time for acon – running UFO analysis on EMR 270
Using interacve job ows for development 277
Integraon with other AWS products 278
Summary 278
Chapter 9: Working with Relaonal Databases 279
Common data paths 279
Hadoop as an archive store 280
Hadoop as a preprocessing step 280
Hadoop as a data input tool 281
The serpent eats its own tail 281
Seng up MySQL 281
Time for acon – installing and seng up MySQL 281
Did it have to be so hard? 284
Time for acon – conguring MySQL to allow remote connecons 285
Don't do this in producon! 286
Time for acon – seng up the employee database 286
Be careful with data le access rights 287
Geng data into Hadoop 287
Using MySQL tools and manual import 288
Accessing the database from the mapper 288
A beer way – introducing Sqoop 289
Time for acon – downloading and conguring Sqoop 289
Sqoop and Hadoop versions 290
Sqoop and HDFS 291
Time for acon – exporng data from MySQL to HDFS 291
Sqoop's architecture 294
Imporng data into Hive using Sqoop 294
Time for acon – exporng data from MySQL into Hive 295
Time for acon – a more selecve import 297
Datatype issues 298
Time for acon – using a type mapping 299
Time for acon – imporng data from a raw query 300
Sqoop and Hive parons 302
Field and line terminators 302
Geng data out of Hadoop 303
Wring data from within the reducer 303
Wring SQL import les from the reducer 304
A beer way – Sqoop again 304
Time for acon – imporng data from Hadoop into MySQL 304
Dierences between Sqoop imports and exports 306
Inserts versus updates 307
Sqoop and Hive exports 307
Time for acon – imporng Hive data into MySQL 308
Time for acon – xing the mapping and re-running the export 310
Other Sqoop features 312
AWS consideraons 313
Considering RDS 313
Summary 314
Chapter 10: Data Collecon with Flume 315
A note about AWS 315
Data data everywhere 316
Types of data 316
Geng network trac into Hadoop 316
Time for acon – geng web server data into Hadoop 316
Geng les into Hadoop 318
Hidden issues 318
Keeping network data on the network 318
Hadoop dependencies 318
Reliability 318
Re-creang the wheel 318
A common framework approach 319
Introducing Apache Flume 319
A note on versioning 319
Time for acon – installing and conguring Flume 320
Using Flume to capture network data 321
Time for acon – capturing network trac to a log le 321
Time for acon – logging to the console 324
Wring network data to log les 326
Time for acon – capturing the output of a command in a at le 326
Logs versus les 327
Time for acon – capturing a remote le in a local at le 328
Sources, sinks, and channels 330
Sources
Sinks
Channels
Or roll your own
Understanding the Flume configuration files
It's all about events
Time for action – writing network traffic onto HDFS
Time for action – adding timestamps
To Sqoop or to Flume...
Time for action – multi-level Flume networks
Time for action – writing to multiple sinks
Selectors replicating and multiplexing
Handling sink failure
Next, the world
The bigger picture
Data lifecycle
Staging data
Scheduling
Summary
Chapter 11: Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Why alternative distributions?
Bundling
Free and commercial extensions
Choosing a distribution
Other Apache projects
HBase
Oozie
Whirr
Mahout
MRUnit
Other programming abstractions
Pig
Cascading
AWS resources
HBase on EMR
SimpleDB
DynamoDB
Sources of informaon 356
Source code 356
Mailing lists and forums 356
LinkedIn groups 356
HUGs 356
Conferences 357
Summary 357
Appendix: Pop Quiz Answers 359
Chapter 3, Understanding MapReduce 359
Chapter 7, Keeping Things Running 360
Index 361
Preface
This book is here to help you make sense of Hadoop and use it to solve your big data
problems. It's a really exciting time to work with data processing technologies such as
Hadoop. The ability to apply complex analytics to large data sets—once the monopoly of
large corporations and government agencies—is now possible through free open source
software (OSS).
But because of the seeming complexity and pace of change in this area, getting a grip on
the basics can be somewhat intimidating. That's where this book comes in, giving you an
understanding of just what Hadoop is, how it works, and how you can use it to extract
value from your data now.
In addition to an explanation of core Hadoop, we also spend several chapters exploring
other technologies that either use Hadoop or integrate with it. Our goal is to give you an
understanding not just of what Hadoop is but also of how to use it as a part of your broader
technical infrastructure.
A complementary technology is the use of cloud computing, and in particular, the offerings
from Amazon Web Services. Throughout the book, we will show you how to use these
services to host your Hadoop workloads, demonstrating that not only can you process
large data volumes, but you also don't actually need to buy any physical hardware to do so.
What this book covers
This book comprises three main parts: chapters 1 through 5, which cover the core of
Hadoop and how it works; chapters 6 and 7, which cover the more operational aspects
of Hadoop; and chapters 8 through 11, which look at the use of Hadoop alongside other
products and technologies.
Chapter 1, What It's All About, gives an overview of the trends that have made Hadoop and
cloud computing such important technologies today.
Chapter 2, Getting Hadoop Up and Running, walks you through the initial setup of a local
Hadoop cluster and the running of some demo jobs. For comparison, the same work is also
executed on Amazon's hosted Hadoop service.
Chapter 3, Understanding MapReduce, goes inside the workings of Hadoop to show how
MapReduce jobs are executed and shows how to write applications using the Java API.
Chapter 4, Developing MapReduce Programs, takes a case study of a moderately sized data
set to demonstrate techniques that help when deciding how to approach the processing and
analysis of a new data source.
Chapter 5, Advanced MapReduce Techniques, looks at a few more sophisticated ways of
applying MapReduce to problems that don't necessarily seem immediately applicable to the
Hadoop processing model.
Chapter 6, When Things Break, examines Hadoop's much-vaunted high availability and fault
tolerance in some detail and sees just how good it is by intentionally causing havoc through
killing processes and deliberately using corrupt data.
Chapter 7, Keeping Things Running, takes a more operational view of Hadoop and will be
of most use to those who need to administer a Hadoop cluster. Along with demonstrating
some best practices, it describes how to prepare for the worst operational disasters so you
can sleep at night.
Chapter 8, A Relational View On Data With Hive, introduces Apache Hive, which allows
Hadoop data to be queried with a SQL-like syntax.
Chapter 9, Working With Relational Databases, explores how Hadoop can be integrated with
existing databases, and in particular, how to move data from one to the other.
Chapter 10, Data Collection with Flume, shows how Apache Flume can be used to gather
data from multiple sources and deliver it to destinations such as Hadoop.
Chapter 11, Where To Go Next, wraps up the book with an overview of the broader Hadoop
ecosystem, highlighting other products and technologies of potential interest. In addition, it
gives some ideas on how to get involved with the Hadoop community and where to get help.
What you need for this book
As we discuss the various Hadoop-related software packages used in this book, we will
describe the particular requirements for each chapter. However, you will generally need
somewhere to run your Hadoop cluster.
In the simplest case, a single Linux-based machine will give you a platform to explore almost
all the exercises in this book. We assume you have a recent distribution of Ubuntu, but as
long as you are familiar with the Linux command line, any modern distribution will suffice.
Some of the examples in later chapters really need multiple machines to see things working,
so you will require access to at least four such hosts. Virtual machines are completely
acceptable; they're not ideal for production but are fine for learning and exploration.
Since we also explore Amazon Web Services in this book, you can run all the examples on
EC2 instances, and we will look at some other more Hadoop-specific uses of AWS throughout
the book. AWS services are usable by anyone, but you will need a credit card to sign up!
Who this book is for
We assume you are reading this book because you want to know more about Hadoop at
a hands-on level; the key audience is those with software development experience but no
prior exposure to Hadoop or similar big data technologies.
For developers who want to know how to write MapReduce applications, we assume you are
comfortable writing Java programs and are familiar with the Unix command-line interface.
We will also show you a few programs in Ruby, but these are usually only to demonstrate
language independence, and you don't need to be a Ruby expert.
For architects and system administrators, the book also provides significant value in
explaining how Hadoop works, its place in the broader architecture, and how it can be
managed operationally. Some of the more involved techniques in Chapter 4, Developing
MapReduce Programs, and Chapter 5, Advanced MapReduce Techniques, are probably
of less direct interest to this audience.
Conventions
In this book, you will find several headings appearing frequently.
To give clear instructions on how to complete a procedure or task, we use:
Time for action – heading
1. Action 1
2. Action 2
3. Action 3
Instructions often need some extra explanation so that they make sense, so they are
followed with:
What just happened?
This heading explains the working of the tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop quiz – heading
These are short multiple-choice questions intended to help you test your own
understanding.
Have a go hero – heading
These set practical challenges and give you ideas for experimenting with what you
have learned.
You will also find a number of styles of text that distinguish between different kinds of
information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: "You may notice that we used the Unix command
rm to remove the Drush directory rather than the DOS del command."
A block of code is set as follows:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size= 8
max_connections= 300
When we wish to draw your attention to a particular part of a code block, the relevant lines
or items are set in bold:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size= 8
max_connections= 300
Any command-line input or output is written as follows:
cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git
New terms and important words are shown in bold. Words that you see on the screen, in
menus or dialog boxes for example, appear in the text like this: "On the Select Destination
Location screen, click on Next to accept the default destination."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book—what you liked or may have disliked. Reader feedback is important for us to
develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to feedback@packtpub.com,
and mention the book title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from
your account at http://www.packtpub.com. If you purchased this book elsewhere,
you can visit http://www.packtpub.com/support and register to have the files
e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our books—maybe a mistake in the text or the
code—we would be grateful if you would report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the errata submission form link, and entering the details of
your errata. Once your errata are verified, your submission will be accepted and the errata
will be uploaded to our website, or added to any list of existing errata, under the Errata
section of that title.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works, in any form, on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected
pirated material.
We appreciate your help in protecting our authors, and our ability to bring you
valuable content.
Questions
You can contact us at questions@packtpub.com if you are having a problem with any
aspect of the book, and we will do our best to address it.
1
What It's All About
This book is about Hadoop, an open source framework for large-scale data
processing. Before we get into the details of the technology and its use in later
chapters, it is important to spend a little time exploring the trends that led to
Hadoop's creation and its enormous success.
Hadoop was not created in a vacuum; instead, it exists due to the explosion
in the amount of data being created and consumed and a shift that sees this
data deluge arrive at small startups and not just huge multinationals. At the
same time, other trends have changed how software and systems are deployed,
using cloud resources alongside or even in preference to more traditional
infrastructures.
This chapter will explore some of these trends and explain in detail the specific
problems Hadoop seeks to solve and the drivers that shaped its design.
In the rest of this chapter we shall:
Learn about the big data revolution
Understand what Hadoop is and how it can extract value from data
Look into cloud computing and understand what Amazon Web Services provides
See how powerful the combination of big data processing and cloud computing
can be
Get an overview of the topics covered in the rest of this book
So let's get on with it!
Big data processing
Look around at the technology we have today, and it's easy to come to the conclusion that
it's all about data. As consumers, we have an increasing appetite for rich media, both in
terms of the movies we watch and the pictures and videos we create and upload. We also,
often without thinking, leave a trail of data across the Web as we perform the actions of
our daily lives.
Not only is the amount of data being generated increasing, but the rate of increase is also
accelerating. From emails to Facebook posts, from purchase histories to web links, there are
large data sets growing everywhere. The challenge is in extracting from this data the most
valuable aspects; sometimes this means particular data elements, and at other times, the
focus is instead on identifying trends and relationships between pieces of data.
There's a subtle change occurring behind the scenes that is all about using data in more
and more meaningful ways. Large companies have realized the value in data for some
time and have been using it to improve the services they provide to their customers, that
is, us. Consider how Google displays advertisements relevant to our web surfing, or how
Amazon or Netflix recommend new products or titles that often match well to our tastes
and interests.
The value of data
These corporaons wouldn't invest in large-scale data processing if it didn't provide a
meaningful return on the investment or a compeve advantage. There are several main
aspects to big data that should be appreciated:
Some quesons only give value when asked of suciently large data sets.
Recommending a movie based on the preferences of another person is, in the
absence of other factors, unlikely to be very accurate. Increase the number of
people to a hundred and the chances increase slightly. Use the viewing history of
ten million other people and the chances of detecng paerns that can be used to
give relevant recommendaons improve dramacally.
Big data tools oen enable the processing of data on a larger scale and at a lower
cost than previous soluons. As a consequence, it is oen possible to perform data
processing tasks that were previously prohibively expensive.
The cost of large-scale data processing isn't just about nancial expense; latency is
also a crical factor. A system may be able to process as much data as is thrown at
it, but if the average processing me is measured in weeks, it is likely not useful. Big
data tools allow data volumes to be increased while keeping processing me under
control, usually by matching the increased data volume with addional hardware.
Previous assumpons of what a database should look like or how its data should be
structured may need to be revisited to meet the needs of the biggest data problems.
In combinaon with the preceding points, suciently large data sets and exible
tools allow previously unimagined quesons to be answered.
Historically for the few and not the many
The examples discussed in the previous secon have generally been seen in the form of
innovaons of large search engines and online companies. This is a connuaon of a much
older trend wherein processing large data sets was an expensive and complex undertaking,
out of the reach of small- or medium-sized organizaons.
Similarly, the broader approach of data mining has been around for a very long me but has
never really been a praccal tool outside the largest corporaons and government agencies.
This situaon may have been regreable but most smaller organizaons were not at a
disadvantage as they rarely had access to the volume of data requiring such an investment.
The increase in data is not limited to the big players anymore, however; many small and
medium companies—not to menon some individuals—nd themselves gathering larger
and larger amounts of data that they suspect may have some value they want to unlock.
Before understanding how this can be achieved, it is important to appreciate some of these
broader historical trends that have laid the foundaons for systems such as Hadoop today.
Classic data processing systems
The fundamental reason that big data mining systems were rare and expensive is that scaling
a system to process large data sets is very difficult; as we will see, it has traditionally been
limited to the processing power that can be built into a single computer.
There are, however, two broad approaches to scaling a system as the size of the data
increases, generally referred to as scale-up and scale-out.
Scale-up
In most enterprises, data processing has typically been performed on impressively large
computers with impressively larger price tags. As the size of the data grows, the approach is
to move to a bigger server or storage array. Through an effective architecture—even today,
as we'll describe later in this chapter—the cost of such hardware could easily be measured in
hundreds of thousands or in millions of dollars.
The advantage of simple scale-up is that the architecture does not significantly change
through the growth. Though larger components are used, the basic relationship (for
example, database server and storage array) stays the same. For applications such as
commercial database engines, the software handles the complexities of utilizing the
available hardware, but in theory, increased scale is achieved by migrating the same
software onto larger and larger servers. Note though that the difficulty of moving software
onto more and more processors is never trivial; in addition, there are practical limits on just
how big a single host can be, so at some point, scale-up cannot be extended any further.
The promise of a single architecture at any scale is also unrealistic. Designing a scale-up system
to handle data sets of sizes such as 1 terabyte, 100 terabytes, and 1 petabyte may conceptually
apply larger versions of the same components, but the complexity of their connectivity may
vary from cheap commodity through custom hardware as the scale increases.
Early approaches to scale-out
Instead of growing a system onto larger and larger hardware, the scale-out approach
spreads the processing onto more and more machines. If the data set doubles, simply use
two servers instead of a single double-sized one. If it doubles again, move to four hosts.
The obvious benefit of this approach is that purchase costs remain much lower than for
scale-up. Server hardware costs tend to increase sharply when one seeks to purchase larger
machines, and though a single host may cost $5,000, one with ten times the processing
power may cost a hundred times as much. The downside is that we need to develop
strategies for splitting our data processing across a fleet of servers, and the tools
historically used for this purpose have proven to be complex.
As a consequence, deploying a scale-out solution has required significant engineering effort;
the system developer often needs to handcraft the mechanisms for data partitioning and
reassembly, not to mention the logic to schedule the work across the cluster and handle
individual machine failures.
Limiting factors
These tradional approaches to scale-up and scale-out have not been widely adopted
outside large enterprises, government, and academia. The purchase costs are oen high,
as is the eort to develop and manage the systems. These factors alone put them out of the
reach of many smaller businesses. In addion, the approaches themselves have had several
weaknesses that have become apparent over me:
As scale-out systems get large, or as scale-up systems deal with mulple CPUs, the
dicules caused by the complexity of the concurrency in the systems have become
signicant. Eecvely ulizing mulple hosts or CPUs is a very dicult task, and
implemenng the necessary strategy to maintain eciency throughout execuon
of the desired workloads can entail enormous eort.
Hardware advances—often couched in terms of Moore's law—have begun to
highlight discrepancies in system capability. CPU power has grown much faster than
network or disk speeds have; once CPU cycles were the most valuable resource in
the system, but today, that no longer holds. Whereas a modern CPU may be able to
execute millions of times as many operations as a CPU 20 years ago would, memory
and hard disk speeds have only increased by factors of thousands or even hundreds.
It is quite easy to build a modern system with so much CPU power that the storage
system simply cannot feed it data fast enough to keep the CPUs busy.
A different approach
From the preceding scenarios, there are a number of techniques that have been used
successfully to ease the pain in scaling data processing systems to the large scales
required by big data.
All roads lead to scale-out
As just hinted, taking a scale-up approach to scaling is not an open-ended tactic. There is
a limit to the size of individual servers that can be purchased from mainstream hardware
suppliers, and even more niche players can't offer an arbitrarily large server. At some point,
the workload will increase beyond the capacity of the single, monolithic scale-up server, so
then what? The unfortunate answer is that the best approach is to have two large servers
instead of one. Then, later, three, four, and so on. Or, in other words, the natural tendency
of scale-up architecture is—in extreme cases—to add a scale-out strategy to the mix.
Though this gives some of the benefits of both approaches, it also compounds the costs
and weaknesses; instead of very expensive hardware or the need to manually develop
the cross-cluster logic, this hybrid architecture requires both.
As a consequence of this end-game tendency and the general cost profile of scale-up
architectures, they are rarely used in the big data processing field, and scale-out
architectures are the de facto standard.
If your problem space involves data workloads with strong internal
cross-references and a need for transactional integrity, big iron
scale-up relational databases are still likely to be a great option.
Share nothing
Anyone with children will have spent considerable time teaching the little ones that it's good
to share. This principle does not extend into data processing systems, and this idea applies to
both data and hardware.
The conceptual view of a scale-out architecture in particular shows individual hosts, each
processing a subset of the overall data set to produce its portion of the final result. Reality
is rarely so straightforward. Instead, hosts may need to communicate with each other,
or some pieces of data may be required by multiple hosts. These additional dependencies
create opportunities for the system to be negatively affected in two ways: bottlenecks and
increased risk of failure.
If a piece of data or an individual server is required by every calculation in the system, there is
a likelihood of contention and delays as the competing clients access the common data or
host. If, for example, in a system with 25 hosts there is a single host that must be accessed
by all the rest, the overall system performance will be bounded by the capabilities of this
key host.
Worse still, if this "hot" server or storage system holding the key data fails, the entire
workload will collapse in a heap. Earlier cluster solutions often demonstrated this risk;
even though the workload was processed across a farm of servers, they often used a
shared storage system to hold all the data.
Instead of sharing resources, the individual components of a system should be as
independent as possible, allowing each to proceed regardless of whether others
are tied up in complex work or are experiencing failures.
Expect failure
Implicit in the preceding tenets is that more hardware will be thrown at the problem
with as much independence as possible. This is only achievable if the system is built
with an expectation that individual components will fail, often regularly and with
inconvenient timing.
You'll often hear terms such as "five nines" (referring to 99.999 percent uptime
or availability). Though this is absolute best-in-class availability, it is important
to realize that the overall reliability of a system comprised of many such devices
can vary greatly depending on whether the system can tolerate individual
component failures.
Assume a server with 99 percent reliability and a system that requires five such
hosts to function. The system availability is 0.99*0.99*0.99*0.99*0.99, which
equates to roughly 95 percent availability. But if the individual servers are only rated
at 95 percent, the system reliability drops to just over 77 percent.
Instead, if you build a system that only needs one of the five hosts to be functional at any
given time, the system availability is well into five nines territory. Thinking about system
uptime in relation to the criticality of each component can help focus on just what the
system availability is likely to be.
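For readers who like to verify such figures, here is a minimal Java sketch (the class and method names are only illustrative) that computes the availability of a chain that needs every host against a group that needs only one of them:
public class AvailabilityCheck {
    // Availability of a system that needs ALL n components working at once.
    static double allOf(double perHost, int n) {
        return Math.pow(perHost, n);
    }

    // Availability of a system that needs at least ONE of n components working.
    static double anyOf(double perHost, int n) {
        return 1.0 - Math.pow(1.0 - perHost, n);
    }

    public static void main(String[] args) {
        System.out.printf("All 5 hosts at 99%% each : %.3f%n", allOf(0.99, 5));  // ~0.951
        System.out.printf("All 5 hosts at 95%% each : %.3f%n", allOf(0.95, 5));  // ~0.774
        System.out.printf("Any 1 of 5 hosts at 99%% : %.10f%n", anyOf(0.99, 5)); // well past five nines
    }
}
Every component added to a chain multiplies in another factor below 1, which is why large systems are designed to tolerate failure rather than try to prevent it.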
If gures such as 99 percent availability seem a lile abstract to you, consider
it in terms of how much downme that would mean in a given me period.
For example, 99 percent availability equates to a downme of just over 3.5
days a year or 7 hours a month. Sll sound as good as 99 percent?
This approach of embracing failure is oen one of the most dicult aspects of big data
systems for newcomers to fully appreciate. This is also where the approach diverges most
strongly from scale-up architectures. One of the main reasons for the high cost of large
scale-up servers is the amount of eort that goes into migang the impact of component
failures. Even low-end servers may have redundant power supplies, but in a big iron box,
you will see CPUs mounted on cards that connect across mulple backplanes to banks of
memory and storage systems. Big iron vendors have oen gone to extremes to show how
resilient their systems are by doing everything from pulling out parts of the server while it's
running to actually shoong a gun at it. But if the system is built in such a way that instead of
treang every failure as a crisis to be migated it is reduced to irrelevance, a very dierent
architecture emerges.
Smart software, dumb hardware
If we wish to see a cluster of hardware used in as flexible a way as possible, providing hosting
to multiple parallel workflows, the answer is to push the smarts into the software and away
from the hardware.
In this model, the hardware is treated as a set of resources, and the responsibility for
allocating hardware to a particular workload is given to the software layer. This allows
hardware to be generic and hence both easier and less expensive to acquire, and the
functionality to efficiently use the hardware moves to the software, where the knowledge
about effectively performing this task resides.
Move processing, not data
Imagine you have a very large data set, say, 1,000 terabytes (that is, 1 petabyte), and you
need to perform a set of four operations on every piece of data in the data set. Let's look
at different ways of implementing a system to solve this problem.
A traditional big iron scale-up solution would see a massive server attached to an equally
impressive storage system, almost certainly using technologies such as fibre channel to
maximize storage bandwidth. The system will perform the task but will become I/O-bound;
even high-end storage switches have a limit on how fast data can be delivered to the host.
Alternavely, the processing approach of previous cluster technologies would perhaps see
a cluster of 1,000 machines, each with 1 terabyte of data divided into four quadrants, with
each responsible for performing one of the operaons. The cluster management soware
would then coordinate the movement of the data around the cluster to ensure each piece
receives all four processing steps. As each piece of data can have one step performed on the
host on which it resides, it will need to stream the data to the other three quadrants, so we
are in eect consuming 3 petabytes of network bandwidth to perform the processing.
Remembering that processing power has increased faster than networking or disk
technologies, so are these really the best ways to address the problem? Recent experience
suggests the answer is no and that an alternave approach is to avoid moving the data and
instead move the processing. Use a cluster as just menoned, but don't segment it into
quadrants; instead, have each of the thousand nodes perform all four processing stages on
the locally held data. If you're lucky, you'll only have to stream the data from the disk once
and the only things travelling across the network will be program binaries and status reports,
both of which are dwarfed by the actual data set in queson.
If a 1,000-node cluster sounds ridiculously large, think of some modern server form factors
being ulized for big data soluons. These see single hosts with as many as twelve 1- or
2-terabyte disks in each. Because modern processors have mulple cores it is possible to
build a 50-node cluster with a petabyte of storage and sll have a CPU core dedicated to
process the data stream coming o each individual disk.
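A rough back-of-the-envelope sketch, using only the numbers assumed in this example, makes the difference stark:
public class BandwidthEstimate {
    public static void main(String[] args) {
        long nodes = 1000;          // hosts in the example cluster
        long terabytesPerNode = 1;  // 1 TB each, 1 PB in total
        int steps = 4;              // operations needed on every piece of data

        // Quadrant design: each record is processed once locally and must then
        // visit the three other quadrants over the network.
        long moveDataTB = nodes * terabytesPerNode * (steps - 1);

        System.out.println("Move the data      : ~" + moveDataTB + " TB over the network (3 PB)");
        System.out.println("Move the processing: only program binaries and status reports");
    }
}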
Build applications, not infrastructure
When thinking of the scenario in the previous secon, many people will focus on the
quesons of data movement and processing. But, anyone who has ever built such a
system will know that less obvious elements such as job scheduling, error handling,
and coordinaon are where much of the magic truly lies.
If we had to implement the mechanisms for determining where to execute processing,
performing the processing, and combining all the subresults into the overall result, we
wouldn't have gained much from the older model. There, we needed to explicitly manage
data paroning; we'd just be exchanging one dicult problem with another.
This touches on the most recent trend, which we'll highlight here: a system that handles
most of the cluster mechanics transparently and allows the developer to think in terms of
the business problem. Frameworks that provide well-dened interfaces that abstract all this
complexity—smart soware—upon which business domain-specic applicaons can be built
give the best combinaon of developer and system eciency.
Hadoop
The thoughul (or perhaps suspicious) reader will not be surprised to learn that the
preceding approaches are all key aspects of Hadoop. But we sll haven't actually
answered the queson about exactly what Hadoop is.
Thanks, Google
It all started with Google, which in 2003 and 2004 released two academic papers describing
Google technology: the Google File System (GFS) (http://research.google.com/
archive/gfs.html) and MapReduce (http://research.google.com/archive/
mapreduce.html). The two together provided a plaorm for processing data on a very
large scale in a highly ecient manner.
Thanks, Doug
At the same me, Doug Cung was working on the Nutch open source web search
engine. He had been working on elements within the system that resonated strongly
once the Google GFS and MapReduce papers were published. Doug started work on the
implementaons of these Google systems, and Hadoop was soon born, rstly as a subproject
of Lucene and soon was its own top-level project within the Apache open source foundaon.
At its core, therefore, Hadoop is an open source plaorm that provides implementaons of
both the MapReduce and GFS technologies and allows the processing of very large data sets
across clusters of low-cost commodity hardware.
Thanks, Yahoo
Yahoo hired Doug Cung in 2006 and quickly became one of the most prominent supporters
of the Hadoop project. In addion to oen publicizing some of the largest Hadoop
deployments in the world, Yahoo has allowed Doug and other engineers to contribute to
Hadoop while sll under its employ; it has contributed some of its own internally developed
Hadoop improvements and extensions. Though Doug has now moved on to Cloudera
(another prominent startup supporng the Hadoop community) and much of the Yahoo's
Hadoop team has been spun o into a startup called Hortonworks, Yahoo remains a major
Hadoop contributor.
Parts of Hadoop
The top-level Hadoop project has many component subprojects, several of which we'll
discuss in this book, but the two main ones are Hadoop Distributed File System (HDFS)
and MapReduce. These are direct implementaons of Google's own GFS and MapReduce.
We'll discuss both in much greater detail, but for now, it's best to think of HDFS and
MapReduce as a pair of complementary yet disnct technologies.
HDFS is a lesystem that can store very large data sets by scaling out across a cluster of
hosts. It has specic design and performance characteriscs; in parcular, it is opmized
for throughput instead of latency, and it achieves high availability through replicaon
instead of redundancy.
MapReduce is a data processing paradigm that takes a specicaon of how the data will be
input and output from its two stages (called map and reduce) and then applies this across
arbitrarily large data sets. MapReduce integrates ghtly with HDFS, ensuring that wherever
possible, MapReduce tasks run directly on the HDFS nodes that hold the required data.
Common building blocks
Both HDFS and MapReduce exhibit several of the architectural principles described in the
previous secon. In parcular:
Both are designed to run on clusters of commodity (that is, low-to-medium
specicaon) servers
Both scale their capacity by adding more servers (scale-out)
Both have mechanisms for idenfying and working around failures
Both provide many of their services transparently, allowing the user to concentrate
on the problem at hand
Both have an architecture where a soware cluster sits on the physical servers and
controls all aspects of system execuon
HDFS
HDFS is a filesystem unlike most you may have encountered before. It is not a POSIX-
compliant filesystem, which basically means it does not provide the same guarantees as a
regular filesystem. It is also a distributed filesystem, meaning that it spreads storage across
multiple nodes; lack of such an efficient distributed filesystem was a limiting factor in some
historical technologies. The key features are as follows (a short code sketch after this list
shows a few of them in use):
HDFS stores files in blocks typically at least 64 MB in size, much larger than the 4-32
KB seen in most filesystems.
HDFS is optimized for throughput over latency; it is very efficient at streaming
read requests for large files but poor at seek requests for many small ones.
HDFS is optimized for workloads that are generally of the write-once and
read-many type.
Each storage node runs a process called a DataNode that manages the blocks on
that host, and these are coordinated by a master NameNode process running on a
separate host.
Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and the HDFS NameNode constantly monitors reports sent by each DataNode to ensure that failures have not dropped any block below the desired replication factor. If this does happen, it schedules the addition of another copy within the cluster; a quick way of inspecting this from the command line is shown after this list.
MapReduce
Though MapReduce as a technology is relatively new, it builds upon much of the fundamental work from both mathematics and computer science, particularly approaches that look to express operations that would then be applied to each element in a set of data. Indeed, the individual concepts of functions called map and reduce come straight from functional programming languages, where they were applied to lists of input data.
Another key underlying concept is that of "divide and conquer", where a single problem is broken into multiple individual subtasks. This approach becomes even more powerful when the subtasks are executed in parallel; in a perfect case, a task that takes 1,000 minutes could be processed in 1 minute by 1,000 parallel subtasks.
MapReduce is a processing paradigm that builds upon these principles; it provides a series of transformations from a source to a result data set. In the simplest case, the input data is fed to the map function and the resultant temporary data to a reduce function. The developer only defines the data transformations; the Hadoop MapReduce framework manages the process of applying these transformations to the data across the cluster in parallel. Though the underlying ideas may not be novel, a major strength of Hadoop is in how it has brought these principles together into an accessible and well-engineered platform.
Unlike traditional relational databases that require structured data with well-defined schemas, MapReduce and Hadoop work best on semi-structured or unstructured data. Instead of data conforming to rigid schemas, the requirement is instead that the data be provided to the map function as a series of key/value pairs. The output of the map function is a set of other key/value pairs, and the reduce function performs aggregation to collect the final set of results.
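To make this flow concrete, consider a hypothetical job counting the words in the single input line "the cat sat on the mat". The map function emits one key/value pair per word and the reduce function sums the values seen for each distinct key:
Map input: (0, "the cat sat on the mat")
Map output: (the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1)
Reduce input: (cat, [1]), (mat, [1]), (on, [1]), (sat, [1]), (the, [1, 1])
Reduce output: (cat, 1), (mat, 1), (on, 1), (sat, 1), (the, 2)
We will run exactly this example, WordCount, in the next chapter; for now, the important thing is the shape of the data as it moves through the two stages.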
Hadoop provides a standard specification (that is, an interface) for the map and reduce functions, and implementations of these are often referred to as mappers and reducers. A typical MapReduce job will comprise a number of mappers and reducers, and it is not unusual for several of these to be extremely simple. The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination.
This last point is possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system.
Critically, from the perspective of the size of data, the same MapReduce job can be applied to data sets of any size hosted on clusters of any size. If the data is 1 gigabyte in size and on a single host, Hadoop will schedule the processing accordingly. Even if the data is 1 petabyte in size and hosted across one thousand machines, it still does likewise, determining how best to utilize all the hosts to perform the work most efficiently. From the user's perspective, the actual size of the data and cluster are transparent, and apart from affecting the time taken to process the job, they do not change how the user interacts with Hadoop.
Better together
It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even more powerful when combined. HDFS can be used without MapReduce, as it is intrinsically a large-scale data storage platform. Though MapReduce can read data from non-HDFS sources, the nature of its processing aligns so well with HDFS that using the two together is by far the most common use case.
When a MapReduce job is executed, Hadoop needs to decide where to execute the code most efficiently to process the data set. If the MapReduce cluster hosts all pull their data from a single storage host or an array, placement largely doesn't matter, as the storage system is a shared resource that will cause contention. But if the storage system is HDFS, it allows MapReduce to execute data processing on the node holding the data of interest, building on the principle that it is less expensive to move the data processing than the data itself.
The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage it also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, an optimization process schedules, as much as possible, tasks on the hosts where the data resides, minimizing network traffic and maximizing performance.
Think back to our earlier example of how to process a four-step task on 1 petabyte of data spread across one thousand servers. The MapReduce model would (in a somewhat simplified and idealized way) perform the processing in a map function on each piece of data on a host where the data resides in HDFS and then reuse the cluster in the reduce function to collect the individual results into the final result set.
A part of the challenge with Hadoop is in breaking down the overall problem into the best combination of map and reduce functions. The preceding approach would only work if the four-stage processing chain could be applied independently to each data element in turn. As we'll see in later chapters, the answer is sometimes to use multiple MapReduce jobs, where the output of one is the input to the next.
Common architecture
Both HDFS and MapReduce are, as mentioned, software clusters that display common characteristics:
Each follows an architecture where a cluster of worker nodes is managed by a special master/coordinator node
The master in each case (NameNode for HDFS and JobTracker for MapReduce) monitors the health of the cluster and handles failures, either by moving data blocks around or by rescheduling failed work
Processes on each server (DataNode for HDFS and TaskTracker for MapReduce) are responsible for performing work on the physical host, receiving instructions from the NameNode or JobTracker, and reporting health/progress status back to it
As a minor terminology point, we will generally use the terms host or server to refer to the physical hardware hosting Hadoop's various components. The term node will refer to the software component comprising a part of the cluster.
What it is and isn't good for
As with any tool, it's important to understand when Hadoop is a good fit for the problem in question. Much of this book will highlight its strengths, based on the previous broad overview of processing large data volumes, but it's important to also start appreciating at an early stage where it isn't the best choice.
The architecture choices made within Hadoop enable it to be the flexible and scalable data processing platform it is today. But, as with most architecture or design choices, there are consequences that must be understood. Primary amongst these is the fact that Hadoop is a batch processing system. When you execute a job across a large data set, the framework will churn away until the final results are ready. With a large cluster, answers across even huge data sets can be generated relatively quickly, but the fact remains that the answers are not generated fast enough to service impatient users. Consequently, Hadoop alone is not well suited to low-latency queries such as those received by a website, a real-time system, or a similar problem domain.
When Hadoop is running jobs on large data sets, the overhead of setting up the job, determining which tasks are run on each node, and all the other housekeeping activities that are required is a trivial part of the overall execution time. But, for jobs on small data sets, this execution overhead means that even simple MapReduce jobs may take a minimum of 10 seconds.
Another member of the broader Hadoop family is HBase, an open source implementation of another Google technology (BigTable). This provides a (non-relational) database atop Hadoop that uses various means to allow it to serve low-latency queries.
But haven't Google and Yahoo both been among the strongest proponents of this method of computation, and aren't they all about such websites where response time is critical? The answer is yes, and it highlights an important aspect of how to incorporate Hadoop into any organization or activity, or to use it in conjunction with other technologies in a way that exploits the strengths of each. In a paper (http://research.google.com/archive/googlecluster.html), Google sketches how it utilized MapReduce at the time; after a web crawler retrieved updated webpage data, MapReduce processed the huge data set, and from this produced the web index that a fleet of MySQL servers used to service end-user search requests.
Cloud computing with Amazon Web Services
The other technology area we'll explore in this book is cloud computing, in the form of several offerings from Amazon Web Services. But first, we need to cut through some of the hype and buzzwords that surround this thing called cloud computing.
Too many clouds
Cloud computing has become an overused term, arguably to the point that its overuse risks it being rendered meaningless. In this book, therefore, let's be clear about what we mean, and care about, when using the term. There are two main aspects to this: a new architecture option and a different approach to cost.
A third way
We've talked about scale-up and scale-out as the options for scaling data processing systems. But our discussion thus far has taken for granted that the physical hardware that makes either option a reality will be purchased, owned, hosted, and managed by the organization doing the system development. The cloud computing we care about adds a third approach: put your application into the cloud and let the provider deal with the scaling problem.
It's not always that simple, of course. But for many cloud services, the model truly is this revolutionary. You develop the software according to some published guidelines or interfaces, deploy it onto the cloud platform, and allow the provider to scale the service based on demand, for a cost, of course. But given the costs usually involved in building systems that scale, this is often a compelling proposition.
Different types of costs
This approach to cloud computing also changes how system hardware is paid for. By offloading infrastructure costs, all users benefit from the economies of scale achieved by the cloud provider in building platforms capable of hosting thousands or millions of clients. As a user, not only do you get someone else to worry about difficult engineering problems, such as scaling, but you pay for capacity as it's needed and you don't have to size the system based on the largest possible workloads. Instead, you gain the benefit of elasticity and use more or fewer resources as your workload demands.
An example helps illustrate this. Many companies' financial groups run end-of-month workloads to generate tax and payroll data, and often, much larger data crunching occurs at year end. If you were tasked with designing such a system, how much hardware would you buy? If you only buy enough to handle the day-to-day workload, the system may struggle at month end and will likely be in real trouble when the end-of-year processing rolls around. If you scale for the end-of-month workloads, the system will have idle capacity for most of the year and possibly still be in trouble performing the end-of-year processing. If you size for the end-of-year workload, the system will have significant capacity sitting idle for the rest of the year. And considering the purchase cost of hardware in addition to the hosting and running costs (a server's electricity usage may account for a large majority of its lifetime costs), you are basically wasting huge amounts of money.
The service-on-demand aspects of cloud computing allow you to start your application on a small hardware footprint and then scale it up and down as the year progresses. With a pay-for-use model, your costs follow your utilization, and you have the capacity to process your workloads without having to buy enough hardware to handle the peaks.
A more subtle aspect of this model is that it greatly reduces the cost of entry for an organization launching an online service. We all know that a new hot service that fails to meet demand and suffers performance problems will find it hard to recover momentum and user interest. For example, in the year 2000, an organization wanting a successful launch needed to put in place, on launch day, enough capacity to meet the massive surge of user traffic it hoped for but couldn't know for sure to expect. When the costs of a physical location are taken into consideration, it would have been easy to spend millions on a product launch.
Today, with cloud compung, the inial infrastructure cost could literally be as low as a
few tens or hundreds of dollars a month and that would only increase when—and if—the
trac demanded.
AWS – infrastructure on demand from Amazon
Amazon Web Services (AWS) is a set of such cloud compung services oered by Amazon.
We will be using several of these services in this book.
Elastic Compute Cloud (EC2)
Amazon's Elasc Compute Cloud (EC2), found at http://aws.amazon.com/ec2/, is
basically a server on demand. Aer registering with AWS and EC2, credit card details are
all that's required to gain access to a dedicated virtual machine, it's easy to run a variety
of operang systems including Windows and many variants of Linux on our server.
Need more servers? Start more. Need more powerful servers? Change to one of the higher
specicaon (and cost) types oered. Along with this, EC2 oers a suite of complimentary
services, including load balancers, stac IP addresses, high-performance addional virtual
disk drives, and many more.
Simple Storage Service (S3)
Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a
storage service that provides a simple key/value storage model. Using web, command-
line, or programmac interfaces to create objects, which can be everything from text les
to images to MP3s, you can store and retrieve your data based on a hierarchical model.
You create buckets in this model that contain objects. Each bucket has a unique idener,
and within each bucket, every object is uniquely named. This simple strategy enables an
extremely powerful service for which Amazon takes complete responsibility (for service
scaling, in addion to reliability and availability of data).
Elastic MapReduce (EMR)
Amazon's Elasc MapReduce (EMR), found at http://aws.amazon.com/
elasticmapreduce/, is basically Hadoop in the cloud and builds atop both EC2 and
S3. Once again, using any of the mulple interfaces (web console, CLI, or API), a Hadoop
workow is dened with aributes such as the number of Hadoop hosts required and the
locaon of the source data. The Hadoop code implemenng the MapReduce jobs is provided
and the virtual go buon is pressed.
In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop cluster it creates on EC2, push the results back into S3, and terminate the Hadoop cluster and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually based on the amount of data stored and the server time used), but the ability to access such powerful data processing capabilities with no need for dedicated hardware is a compelling one.
What this book covers
In this book we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters.
Not only will we be looking at Hadoop as an engine for performing MapReduce processing, but we'll also explore how a Hadoop capability can fit into the rest of an organization's infrastructure and systems. We'll look at some of the common points of integration, such as getting data between Hadoop and a relational database, and also how to make Hadoop look more like such a relational database.
A dual approach
In this book we will not be limiting our discussion to EMR or Hadoop hosted on Amazon EC2; we will be discussing both the building and the management of local Hadoop clusters (on Ubuntu Linux) in addition to showing how to push the processing into the cloud via EMR.
The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible, there are aspects of the technology that only become apparent when manually administering the cluster. Though it is also possible to use EMR in a more manual mode, we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily an either/or decision, many organizations use a mixture of in-house and cloud-hosted capacity, sometimes due to a concern about over-reliance on a single external provider; practically speaking, it's often convenient to do development and small-scale tests on local capacity and then deploy at production scale into the cloud.
In some of the later chapters, where we discuss additional products that integrate with Hadoop, we'll only give examples on local clusters, as there is no difference in how the products work regardless of where they are deployed.
Summary
We learned a lot in this chapter about big data, Hadoop, and cloud computing.
Specifically, we covered the emergence of big data and how changes in the approach to data processing and system architecture bring within the reach of almost any organization techniques that were previously prohibitively expensive.
We also looked at the history of Hadoop and how it builds upon many of these trends to provide a flexible and powerful data processing platform that can scale to massive volumes. We then saw how cloud computing provides another system architecture approach, one that exchanges large up-front costs and direct physical responsibility for a pay-as-you-go model and a reliance on the cloud provider for hardware provision, management, and scaling. We also saw what Amazon Web Services is and how its Elastic MapReduce service utilizes other AWS services to provide Hadoop in the cloud.
Finally, we discussed the aim of this book and its dual approach of exploration on both locally managed and AWS-hosted Hadoop clusters.
Now that we've covered the basics and know where this technology is coming from and what its benefits are, we need to get our hands dirty and get things running, which is what we'll do in Chapter 2, Getting Hadoop Up and Running.
2
Getting Hadoop Up and Running
Now that we have explored the opportunities and challenges presented by large-scale data processing and why Hadoop is a compelling choice, it's time to get things set up and running.
In this chapter, we will do the following:
Learn how to install and run Hadoop on a local Ubuntu host
Run some example Hadoop programs and get familiar with the system
Set up the accounts required to use Amazon Web Services products such as EMR
Create an on-demand Hadoop cluster on Elastic MapReduce
Explore the key differences between a local and hosted Hadoop cluster
Hadoop on a local Ubuntu host
For our exploration of Hadoop outside the cloud, we shall give examples using one or more Ubuntu hosts. A single machine (be it a physical computer or a virtual machine) will be sufficient to run all the parts of Hadoop and explore MapReduce. However, production clusters will most likely involve many more machines, so having even a development Hadoop cluster deployed on multiple hosts will be good experience. However, for getting started, a single host will suffice.
Nothing we discuss will be unique to Ubuntu, and Hadoop should run on any Linux distribution. Obviously, you may have to alter how the environment is configured if you use a distribution other than Ubuntu, but the differences should be slight.
Other operating systems
Hadoop does run well on other platforms. Windows and Mac OS X are popular choices for developers. Windows is supported only as a development platform, and Mac OS X is not formally supported at all.
If you choose to use such a platform, the general situation will be similar to other Linux distributions; all aspects of how to work with Hadoop will be the same, but you will need to use the operating system-specific mechanisms for setting up environment variables and similar tasks. The Hadoop FAQs contain some information on alternative platforms and should be your first port of call if you are considering such an approach. The Hadoop FAQs can be found at http://wiki.apache.org/hadoop/FAQ.
Time for action – checking the prerequisites
Hadoop is written in Java, so you will need a recent Java Development Kit (JDK) installed on the Ubuntu host. Perform the following steps to check the prerequisites:
1. First, check what's already available by opening up a terminal and typing the following:
$ javac
$ java -version
2. If either of these commands gives a no such file or directory or similar error, or if the latter mentions "OpenJDK", it's likely you need to download the full JDK. Grab this from the Oracle download page at http://www.oracle.com/technetwork/java/javase/downloads/index.html; you should get the latest release.
3. Once Java is installed, add the JDK/bin directory to your path and set the JAVA_HOME environment variable with commands such as the following, modified for your specific Java version:
$ export JAVA_HOME=/opt/jdk1.6.0_24
$ export PATH=$JAVA_HOME/bin:${PATH}
What just happened?
These steps ensure the right version of Java is installed and available from the command line without having to use lengthy pathnames to refer to the install location.
Remember that the preceding commands only affect the currently running shell, and the settings will be lost after you log out, close the shell, or reboot. To ensure the same setup is always available, you can add these to the startup files for your shell of choice, within the .bash_profile file for the BASH shell or the .cshrc file for TCSH, for example.
An alternative favored by me is to put all required configuration settings into a standalone file and then explicitly call this from the command line; for example:
$ source Hadoop_config.sh
This technique allows you to keep multiple setup files in the same account without making the shell startup overly complex; not to mention, the required configurations for several applications may actually be incompatible. Just remember to begin by loading the file at the start of each session!
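As a minimal sketch of what such a file might contain (the paths here are only examples; use wherever your own JDK and, once installed, Hadoop actually live), Hadoop_config.sh could be as simple as:
export JAVA_HOME=/opt/jdk1.6.0_24
export HADOOP_HOME=/opt/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
Sourcing this at the start of each session gives a predictable environment without cluttering the shell startup files.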
Setting up Hadoop
One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships. The fact that these have evolved over time hasn't made the task of understanding it all any easier. For now, though, go to http://hadoop.apache.org and you'll see that there are three prominent projects mentioned:
Common
HDFS
MapReduce
The last two of these should be familiar from the explanation in Chapter 1, What It's All About, and the Common project comprises a set of libraries and tools that help the Hadoop product work in the real world. For now, the important thing is that the standard Hadoop distribution bundles the latest versions of all three of these projects, and the combination is what you need to get going.
A note on versions
Hadoop underwent a major change in the transition from the 0.19 to the 0.20 versions, most notably with a migration to a set of new APIs used to develop MapReduce applications. We will be primarily using the new APIs in this book, though we do include a few examples of the older API in later chapters, as not all of the existing features have been ported to the new API.
Hadoop versioning also became complicated when the 0.20 branch was renamed to 1.0. The 0.22 and 0.23 branches remained, and in fact included features not included in the 1.0 branch. At the time of this writing, things were becoming clearer, with 1.1 and 2.0 branches being used for future development releases. As most existing systems and third-party tools are built against the 0.20 branch, we will use Hadoop 1.0 for the examples in this book.
Time for action – downloading Hadoop
Carry out the following steps to download Hadoop:
1. Go to the Hadoop download page at http://hadoop.apache.org/common/releases.html and retrieve the latest stable version of the 1.0.x branch; at the time of this writing, it was 1.0.4.
2. You'll be asked to select a local mirror; after that you need to download the file with a name such as hadoop-1.0.4-bin.tar.gz.
3. Copy this file to the directory where you want Hadoop to be installed (for example, /usr/local), using the following command:
$ cp hadoop-1.0.4-bin.tar.gz /usr/local
4. Decompress the file by using the following command:
$ tar -xf hadoop-1.0.4-bin.tar.gz
5. Add a convenient symlink to the Hadoop installation directory.
$ ln -s /usr/local/hadoop-1.0.4 /opt/hadoop
6. Now you need to add the Hadoop binary directory to your path and set the HADOOP_HOME environment variable, just as we did earlier with Java.
$ export HADOOP_HOME=/opt/hadoop
$ export PATH=$HADOOP_HOME/bin:$PATH
7. Go into the conf directory within the Hadoop installation and edit the hadoop-env.sh file. Search for JAVA_HOME and uncomment the line, modifying the location to point to your JDK installation, as mentioned earlier and shown in the sketch that follows.
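For reference, after this edit the relevant line in conf/hadoop-env.sh should end up looking something like the following (the JDK path is simply the example used earlier; substitute your own installation directory):
export JAVA_HOME=/opt/jdk1.6.0_24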
What just happened?
These steps ensure that Hadoop is installed and available from the command line. By setting the path and configuration variables, we can use the Hadoop command-line tool. The modification to the Hadoop configuration file is the only required change to the setup needed to integrate with your host settings.
As mentioned earlier, you should put the export commands in your shell startup file or a standalone configuration script that you specify at the start of the session.
Don't worry about some of the details here; we'll cover Hadoop setup and use later.
Time for action – setting up SSH
Carry out the following steps to set up SSH:
1. Create a new SSH key pair with the following command:
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
2. Copy the new public key to the list of authorized keys by using the following command:
$ cp .ssh/id_rsa.pub .ssh/authorized_keys
3. Connect to the local host.
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is b6:0c:bd:57:32:b6:66:7c:33:7b:62:92:61:fd:ca:2a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
4. Confirm that password-less SSH is working.
$ ssh localhost
$ ssh localhost
What just happened?
Because Hadoop requires communication between multiple processes on one or more machines, we need to ensure that the user we are using for Hadoop can connect to each required host without needing a password. We do this by creating a Secure Shell (SSH) key pair that has an empty passphrase. We use the ssh-keygen command to start this process and accept the offered defaults.
Once we create the key pair, we need to add the new public key to the stored list of trusted keys; this means that when trying to connect to this machine, the public key will be trusted. After doing so, we use the ssh command to connect to the local machine and should expect to get a warning about trusting the host key, as just shown. After confirming this, we should then be able to connect without further passwords or prompts.
Note that when we move later to use a fully distributed cluster, we will need to ensure that the Hadoop user account has the same key set up on every host in the cluster.
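As a sketch of what that involves (the hostname here is purely illustrative), the standard ssh-copy-id utility can push the public key we just generated to another machine in one step:
$ ssh-copy-id hadoop@anotherhost
Alternatively, the contents of id_rsa.pub can be appended manually to ~/.ssh/authorized_keys on each remote host.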
Conguring and running Hadoop
So far this has all been prey straighorward, just downloading and system administraon.
Now we can deal with Hadoop directly. Finally! We'll run a quick example to show Hadoop in
acon. There is addional conguraon and set up to be performed, but this next step will
help give condence that things are installed and congured correctly so far.
Time for action – using Hadoop to calculate Pi
We will now use a sample Hadoop program to calculate the value of Pi. Right now,
this is primarily to validate the installaon and to show how quickly you can get a
MapReduce job to execute. Assuming the HADOOP_HOME/bin directory is in your path,
type the following commands:
$ Hadoop jar hadoop/hadoop-examples-1.0.4.jar pi 4 1000
Number of Maps = 4
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
12/10/26 22:56:11 INFO jvm.JvmMetrics: Initializing JVM Metrics
with processName=JobTracker, sessionId=
12/10/26 22:56:11 INFO mapred.FileInputFormat: Total input paths
to process : 4
12/10/26 22:56:12 INFO mapred.JobClient: Running job: job_
local_0001
12/10/26 22:56:12 INFO mapred.FileInputFormat: Total input paths
to process : 4
12/10/26 22:56:12 INFO mapred.MapTask: numReduceTasks: 1
12/10/26 22:56:14 INFO mapred.JobClient: map 100% reduce 100%
12/10/26 22:56:14 INFO mapred.JobClient: Job complete: job_
local_0001
12/10/26 22:56:14 INFO mapred.JobClient: Counters: 13
12/10/26 22:56:14 INFO mapred.JobClient: FileSystemCounters
Job Finished in 2.904 seconds
Estimated value of Pi is 3.14000000000000000000
$
What just happened?
There's a lot of information here, even more so when you get the full output on your screen. For now, let's unpack the fundamentals and not worry about much of Hadoop's status output until later in the book. The first thing to clarify is some terminology: each Hadoop program runs as a job that creates multiple tasks to do its work.
Looking at the output, we see it is broadly split into three sections:
The start-up of the job
The status as the job executes
The output of the job
In our case, we can see the job creates four tasks to calculate Pi, and the overall job result will be the combination of these subresults. This pattern should look familiar; it is the model we came across in Chapter 1, What It's All About, used to split a larger job into smaller pieces and then bring together the results.
The majority of the output will appear as the job is being executed and provides status messages showing progress. On successful completion, the job will print out a number of counters and other statistics. The preceding example is actually unusual in that it is rare to see the result of a MapReduce job displayed on the console. This is not a limitation of Hadoop, but rather a consequence of the fact that jobs that process large data sets usually produce a significant amount of output data that isn't well suited to a simple echoing on the screen.
Congratulations on your first successful MapReduce job!
Three modes
In our desire to get something running on Hadoop, we sidestepped an important issue: in which mode should we run Hadoop? There are three possibilities that alter where the various Hadoop components execute. Recall that HDFS comprises a single NameNode that acts as the cluster coordinator and is the master for one or more DataNodes that store the data. For MapReduce, the JobTracker is the cluster master and it coordinates the work executed by one or more TaskTracker processes. The Hadoop modes deploy these components as follows:
Local standalone mode: This is the default mode if, as in the preceding Pi example, you don't configure anything else. In this mode, all the components of Hadoop, such as NameNode, DataNode, JobTracker, and TaskTracker, run in a single Java process.
Pseudo-distributed mode: In this mode, a separate JVM is spawned for each of the Hadoop components and they communicate across network sockets, effectively giving a fully functioning minicluster on a single host.
Fully distributed mode: In this mode, Hadoop is spread across multiple machines, some of which will be general-purpose workers and others will be dedicated hosts for components such as NameNode and JobTracker.
Each mode has its benefits and drawbacks. Fully distributed mode is obviously the only one that can scale Hadoop across a cluster of machines, but it requires more configuration work, not to mention the cluster of machines. Local, or standalone, mode is the easiest to set up, but you interact with it in a different manner than you would with the fully distributed mode. In this book, we shall generally prefer the pseudo-distributed mode, even when using examples on a single host, as everything done in the pseudo-distributed mode is almost identical to how it works on a much larger cluster.
Time for action – conguring the pseudo-distributed mode
Take a look in the conf directory within the Hadoop distribuon. There are many
conguraon les, but the ones we need to modify are core-site.xml, hdfs-site.xml
and mapred-site.xml.
1. Modify core-site.xml to look like the following code:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
2. Modify hdfs-site.xml to look like the following code:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
3. Modify mapred-site.xml to look like the following code:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
What just happened?
The first thing to note is the general format of these configuration files. They are obviously XML and contain multiple property specifications within a single configuration element. The property specifications always contain name and value elements, with the possibility for optional comments not shown in the preceding code.
We set three configuration variables here:
The fs.default.name variable holds the location of the NameNode and is required by both the HDFS and MapReduce components, which explains why it's in core-site.xml and not hdfs-site.xml.
The dfs.replication variable specifies how many times each HDFS block should be replicated. Recall from Chapter 1, What It's All About, that HDFS handles failures by ensuring each block of filesystem data is replicated to a number of different hosts, usually 3. As we only have a single host and one DataNode in the pseudo-distributed mode, we change this value to 1.
The mapred.job.tracker variable holds the location of the JobTracker, just as fs.default.name holds the location of the NameNode. Because only the MapReduce components need to know this location, it is in mapred-site.xml.
You are free, of course, to change the port numbers used, though 9000 and 9001 are common conventions in Hadoop.
The network addresses for the NameNode and the JobTracker specify the ports to which the actual system requests should be directed. These are not user-facing locations, so don't bother pointing your web browser at them. There are web interfaces that we will look at shortly.
Conguring the base directory and formatting the lesystem
If the pseudo-distributed or fully distributed mode is chosen, there are two steps that need
to be performed before we start our rst Hadoop cluster.
1. Set the base directory where Hadoop les will be stored.
2. Format the HDFS lesystem.
To be precise, we don't need to change the default directory; but, as
seen later, it's a good thing to think about it now.
Time for action – changing the base HDFS directory
Let's rst set the base directory that species the locaon on the local lesystem under
which Hadoop will keep all its data. Carry out the following steps:
1. Create a directory into which Hadoop will store its data:
$ mkdir /var/lib/hadoop
2. Ensure the directory is writeable by any user:
$ chmod 777 /var/lib/hadoop
3. Modify core-site.xml once again to add the following property:
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop</value>
</property>
What just happened?
As we will be storing data in Hadoop and all the various components are running on our local host, this data will need to be stored on our local filesystem somewhere. Regardless of the mode, Hadoop by default uses the hadoop.tmp.dir property as the base directory under which all files and data are written.
MapReduce, for example, uses a /mapred directory under this base directory; HDFS uses /dfs. The danger is that the default value of hadoop.tmp.dir is /tmp and some Linux distributions delete the contents of /tmp on each reboot. So it's safer to explicitly state where the data is to be held.
Time for action – formatting the NameNode
Before starting Hadoop in either pseudo-distributed or fully distributed mode for the first time, we need to format the HDFS filesystem that it will use. Type the following:
$ hadoop namenode -format
The output of this should look like the following:
$ hadoop namenode -format
12/10/26 22:45:25 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = vm193/10.0.0.193
STARTUP_MSG: args = [-format]
12/10/26 22:45:25 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
12/10/26 22:45:25 INFO namenode.FSNamesystem: supergroup=supergroup
12/10/26 22:45:25 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/10/26 22:45:25 INFO common.Storage: Image file of size 96 saved in 0
seconds.
12/10/26 22:45:25 INFO common.Storage: Storage directory /var/lib/hadoop-
hadoop/dfs/name has been successfully formatted.
12/10/26 22:45:26 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at vm193/10.0.0.193
$
What just happened?
This is not a very exciting output because the step is only an enabler for our future use of HDFS. However, it does help us think of HDFS as a filesystem; just like any new storage device on any operating system, we need to format the device before we can use it. The same is true for HDFS; initially there is a default location for the filesystem data but no actual data for the equivalents of filesystem indexes.
Do this every time!
If your experience with Hadoop has been similar to mine, there will be a series of simple mistakes that are frequently made when setting up new installations. It is very easy to forget about formatting the NameNode and then get a cascade of failure messages when the first Hadoop activity is tried.
But do it only once!
The command to format the NameNode can be executed multiple times, but in doing so all existing filesystem data will be destroyed. It can only be executed when the Hadoop cluster is shut down, and sometimes you will want to do it; but in most other cases it is a quick way to irrevocably delete every piece of data on HDFS, and it does not take much longer to do on large clusters. So be careful!
Starting and using Hadoop
After all that configuration and setup, let's now start our cluster and actually do something with it.
Time for action – starting Hadoop
Unlike the local mode of Hadoop, where all the components run only for the lifetime of the submitted job, with the pseudo-distributed or fully distributed mode of Hadoop, the cluster components exist as long-running processes. Before we use HDFS or MapReduce, we need to start up the needed components. Type the following commands; the output should look as shown next, where the commands are included on the lines prefixed by $:
1. Type in the rst command:
$ start-dfs.sh
starting namenode, logging to /home/hadoop/hadoop/bin/../logs/
hadoop-hadoop-namenode-vm193.out
localhost: starting datanode, logging to /home/hadoop/hadoop/
bin/../logs/hadoop-hadoop-datanode-vm193.out
localhost: starting secondarynamenode, logging to /home/hadoop/
hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-vm193.out
2. Type in the second command:
$ jps
9550 DataNode
9687 Jps
9638 SecondaryNameNode
9471 NameNode
3. Type in the third command:
$ hadoop dfs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2012-10-26 23:03 /tmp
drwxr-xr-x - hadoop supergroup 0 2012-10-26 23:06 /user
4. Type in the fourth command:
$ start-mapred.sh
starting jobtracker, logging to /home/hadoop/hadoop/bin/../logs/
hadoop-hadoop-jobtracker-vm193.out
localhost: starting tasktracker, logging to /home/hadoop/hadoop/
bin/../logs/hadoop-hadoop-tasktracker-vm193.out
5. Type in the h command:
$ jps
9550 DataNode
9877 TaskTracker
9638 SecondaryNameNode
9471 NameNode
9798 JobTracker
9913 Jps
What just happened?
The start-dfs.sh command, as the name suggests, starts the components necessary for HDFS. This is the NameNode to manage the filesystem and a single DataNode to hold data. The SecondaryNameNode is an availability aid that we'll discuss in a later chapter.
After starting these components, we use the JDK's jps utility to see which Java processes are running, and, as the output looks good, we then use Hadoop's dfs utility to list the root of the HDFS filesystem.
After this, we use start-mapred.sh to start the MapReduce components, this time the JobTracker and a single TaskTracker, and then use jps again to verify the result.
There is also a combined start-all.sh file that we'll use at a later stage, but in the early days it's useful to do a two-stage start up to more easily verify the cluster configuration.
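For completeness, the corresponding shutdown scripts live alongside the start scripts in the Hadoop bin directory; when you are finished, the components can be stopped in the reverse order (or all at once with stop-all.sh):
$ stop-mapred.sh
$ stop-dfs.sh
Running jps again afterwards should show no remaining Hadoop processes.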
Time for action – using HDFS
As the preceding example shows, there is a familiar-looking interface to HDFS that allows us to use commands similar to those in Unix to manipulate files and directories on the filesystem. Let's try it out by typing the following commands:
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/hadoop
$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2012-10-26 23:09 /user/hadoop
$ echo "This is a test." >> test.txt
$ cat test.txt
This is a test.
$ hadoop dfs -copyFromLocal test.txt .
$ hadoop dfs -ls
Found 1 items
-rw-r--r-- 1 hadoop supergroup 16 2012-10-26 23:19 /user/hadoop/test.txt
$ hadoop dfs -cat test.txt
This is a test.
$ rm test.txt
$ hadoop dfs -cat test.txt
This is a test.
$ hadoop fs -copyToLocal test.txt
$ cat test.txt
This is a test.
What just happened?
This example shows the use of the fs subcommand to the Hadoop utility (note that the dfs and fs subcommands are equivalent). Like most filesystems, Hadoop has the concept of a home directory for each user. These home directories are stored under the /user directory on HDFS and, before we go further, we create our home directory if it does not already exist.
We then create a simple text file on the local filesystem and copy it to HDFS by using the copyFromLocal command, and then check its existence and contents by using the -ls and -cat utilities. As can be seen, the user home directory is aliased to . because, in Unix, -ls commands with no path specified are assumed to refer to that location; relative paths (those not starting with /) also start there.
We then deleted the file from the local filesystem, copied it back from HDFS by using the -copyToLocal command, and checked its contents using the local cat utility.
Mixing HDFS and local filesystem commands, as in the preceding example, is a powerful combination, but it's very easy to execute commands on HDFS that were intended for the local filesystem, and vice versa. So be careful, especially when deleting.
There are other HDFS manipulation commands; try hadoop fs -help for a detailed list.
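A few of the others you are likely to reach for early on follow the same pattern; for example (the file names here are purely illustrative, so don't run these against data you want to keep):
$ hadoop fs -du somefile.txt
$ hadoop fs -mv somefile.txt renamed.txt
$ hadoop fs -rm renamed.txt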
Time for action – WordCount, the Hello World of MapReduce
Many applications, over time, acquire a canonical example that no beginner's guide should be without. For Hadoop, this is WordCount, an example bundled with Hadoop that counts the frequency of words in an input text file.
1. First execute the following commands:
$ hadoop dfs -mkdir data
$ hadoop dfs -cp test.txt data
$ hadoop dfs -ls data
Found 1 items
-rw-r--r-- 1 hadoop supergroup 16 2012-10-26 23:20 /
user/hadoop/data/test.txt
2. Now execute these commands:
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar wordcount data out
12/10/26 23:22:49 INFO input.FileInputFormat: Total input paths to
process : 1
12/10/26 23:22:50 INFO mapred.JobClient: Running job:
job_201210262315_0002
12/10/26 23:22:51 INFO mapred.JobClient: map 0% reduce 0%
12/10/26 23:23:03 INFO mapred.JobClient: map 100% reduce 0%
12/10/26 23:23:15 INFO mapred.JobClient: map 100% reduce 100%
12/10/26 23:23:17 INFO mapred.JobClient: Job complete:
job_201210262315_0002
12/10/26 23:23:17 INFO mapred.JobClient: Counters: 17
12/10/26 23:23:17 INFO mapred.JobClient: Job Counters
12/10/26 23:23:17 INFO mapred.JobClient: Launched reduce
tasks=1
12/10/26 23:23:17 INFO mapred.JobClient: Launched map tasks=1
12/10/26 23:23:17 INFO mapred.JobClient: Data-local map
tasks=1
12/10/26 23:23:17 INFO mapred.JobClient: FileSystemCounters
12/10/26 23:23:17 INFO mapred.JobClient: FILE_BYTES_READ=46
12/10/26 23:23:17 INFO mapred.JobClient: HDFS_BYTES_READ=16
12/10/26 23:23:17 INFO mapred.JobClient: FILE_BYTES_
WRITTEN=124
12/10/26 23:23:17 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=24
12/10/26 23:23:17 INFO mapred.JobClient: Map-Reduce Framework
12/10/26 23:23:17 INFO mapred.JobClient: Reduce input groups=4
12/10/26 23:23:17 INFO mapred.JobClient: Combine output
records=4
12/10/26 23:23:17 INFO mapred.JobClient: Map input records=1
12/10/26 23:23:17 INFO mapred.JobClient: Reduce shuffle
bytes=46
12/10/26 23:23:17 INFO mapred.JobClient: Reduce output
records=4
12/10/26 23:23:17 INFO mapred.JobClient: Spilled Records=8
12/10/26 23:23:17 INFO mapred.JobClient: Map output bytes=32
12/10/26 23:23:17 INFO mapred.JobClient: Combine input
records=4
12/10/26 23:23:17 INFO mapred.JobClient: Map output records=4
12/10/26 23:23:17 INFO mapred.JobClient: Reduce input
records=4
3. Execute the following command:
$ hadoop fs -ls out
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2012-10-26 23:22 /
user/hadoop/out/_logs
-rw-r--r-- 1 hadoop supergroup 24 2012-10-26 23:23 /
user/hadoop/out/part-r-00000
4. Now execute this command:
$ hadoop fs -cat out/part-r-00000
This 1
a 1
is 1
test. 1
What just happened?
We did three things here, as follows:
Moved the previously created text file into a new directory on HDFS
Ran the example WordCount job, specifying this new directory and a non-existent output directory as arguments
Used the fs utility to examine the output of the MapReduce job
As we said earlier, the pseudo-distributed mode has more Java processes, so it may seem curious that the job output is significantly shorter than for the standalone Pi example. The reason is that the local standalone mode prints information about each individual task execution to the screen, whereas in the other modes this information is written only to logfiles on the running hosts.
The output directory is created by Hadoop itself and the actual result files follow the part-nnnnn convention illustrated here; though given our setup, there is only one result file. We use the fs -cat command to examine the file, and the results are as expected.
If you specify an existing directory as the output destination for a Hadoop job, it will fail to run and will throw an exception complaining of an already existing directory. If you want Hadoop to store its output in a directory, that directory must not exist. Treat this as a safety mechanism that stops Hadoop from overwriting previous valuable job runs; be warned, forgetting to check for this is something you will do frequently. If you are confident, you can override this behavior, as we will see later.
The Pi and WordCount programs are only some of the examples that ship with Hadoop. Here is how to get a list of them all. See if you can figure some of them out.
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar
Have a go hero – WordCount on a larger body of text
Running a complex framework like Hadoop, utilizing five discrete Java processes, to count the words in a single-line text file is not terribly impressive. The power comes from the fact that we can use exactly the same program to run WordCount on a larger file, or even a massive corpus of text spread across a multinode Hadoop cluster. If we had such a setup, we would execute exactly the same commands as we just did by running the program and simply specifying the location of the directories for the source and output data.
Find a large online text file (Project Gutenberg at http://www.gutenberg.org is a good starting point) and run WordCount on it by copying it onto HDFS and executing the WordCount example. The output may not be as you expect because, in a large body of text, issues of dirty data, punctuation, and formatting will need to be addressed. Think about how WordCount could be improved; we'll study how to expand it into a more complex processing chain in the next chapter.
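As a sketch of how that might look once you have downloaded a large text file (saved locally as bigtext.txt, say; the file and directory names here are arbitrary), the whole exercise is only a handful of commands:
$ hadoop fs -mkdir bigdata
$ hadoop fs -copyFromLocal bigtext.txt bigdata
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar wordcount bigdata bigout
$ hadoop fs -cat bigout/part-r-00000 | head
Apart from the paths, these are exactly the commands we used for the single-line example.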
Monitoring Hadoop from the browser
So far, we have been relying on command-line tools and direct command output to see what our system is doing. Hadoop provides two web interfaces that you should become familiar with, one for HDFS and the other for MapReduce. Both are useful in pseudo-distributed mode and are critical tools when you have a fully distributed setup.
The HDFS web UI
Point your web browser to port 50070 on the host running Hadoop. By default, the web interface should be available from both the local host and any other machine that has network access. Here is an example screenshot:
There is a lot going on here, but the immediately critical data tells us the number of nodes in the cluster, the filesystem size, the used space, and links to drill down for more info and even browse the filesystem.
Spend a little time playing with this interface; it needs to become familiar. With a multinode cluster, the information about live and dead nodes, plus the detailed information on their status history, will be critical to debugging cluster problems.
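With the pseudo-distributed setup described in this chapter and the default ports, the two interfaces are at the following addresses (hostnames and ports will differ if you have changed the configuration):
http://localhost:50070/ for the HDFS (NameNode) web UI
http://localhost:50030/ for the MapReduce (JobTracker) web UI, which we look at next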
The MapReduce web UI
The JobTracker UI is available on port 50030 by default, and the same access rules stated earlier apply. Here is an example screenshot:
This is more complex than the HDFS interface! Along with a similar count of the number of live/dead nodes, there is a history of the number of jobs executed since startup and a breakdown of their individual task counts.
The list of executing and historical jobs is a doorway to much more information; for every job, we can access the history of every task attempt on every node and access logs for detailed information. We now expose one of the most painful parts of working with any distributed system: debugging. It can be really hard.
Imagine you have a cluster of 100 machines trying to process a massive data set where the full job requires each host to execute hundreds of map and reduce tasks. If the job starts running very slowly or explicitly fails, it is not always obvious where the problem lies. Looking at the MapReduce web UI will likely be the first port of call because it provides such a rich starting point to investigate the health of running and historical jobs.
Using Elastic MapReduce
We will now turn to Hadoop in the cloud, the Elastic MapReduce service offered by Amazon Web Services. There are multiple ways to access EMR, but for now we will focus on the provided web console to contrast a full point-and-click approach to Hadoop with the previous command-line-driven examples.
Setting up an account in Amazon Web Services
Before using Elastic MapReduce, we need to set up an Amazon Web Services account and register it with the necessary services.
Creating an AWS account
Amazon has integrated their general accounts with AWS, meaning that if you already have an account for any of the Amazon retail websites, this is the only account you will need to use AWS services.
Note that AWS services have a cost; you will need an active credit card associated with the account to which charges can be made.
If you require a new Amazon account, go to http://aws.amazon.com, select create a new AWS account, and follow the prompts. Amazon has added a free tier for some services, so you may find that in the early days of testing and exploration you are keeping many of your activities within the non-charged tier. The scope of the free tier has been expanding, so make sure you know for what you will and won't be charged.
Signing up for the necessary services
Once you have an Amazon account, you will need to register it for use with the required AWS services, that is, Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Elastic MapReduce (EMR). There is no cost for simply signing up to any AWS service; the process just makes the service available to your account.
Go to the S3, EC2, and EMR pages linked from http://aws.amazon.com and click on the Sign up button on each page; then follow the prompts.
Cauon! This costs real money!
Before going any further, it is crical to understand that use of AWS services will
incur charges that will appear on the credit card associated with your Amazon
account. Most of the charges are quite small and increase with the amount of
infrastructure consumed; storing 10 GB of data in S3 costs 10 mes more than
for 1 GB, and running 20 EC2 instances costs 20 mes as much as a single one.
There are ered cost models, so the actual costs tend to have smaller marginal
increases at higher levels. But you should read carefully through the pricing
secons for each service before using any of them. Note also that currently
data transfer out of AWS services, such as EC2 and S3, is chargeable but data
transfer between services is not. This means it is oen most cost-eecve to
carefully design your use of AWS to keep data within AWS through as much of
the data processing as possible.
Time for action – WordCount on EMR using the management
console
Let's jump straight into an example on EMR using some provided example code. Carry out
the following steps:
1. Browse to http://aws.amazon.com, go to Developers | AWS Management
Console, and then click on the Sign in to the AWS Console buon. The default
view should look like the following screenshot. If it does not, click on Amazon S3
from within the console.
2. As shown in the preceding screenshot, click on the Create bucket button and enter a name for the new bucket. Bucket names must be globally unique across all AWS users, so do not expect obvious bucket names such as mybucket or s3test to be available.
3. Click on the Region drop-down menu and select the geographic area nearest to you.
4. Click on the Elastic MapReduce link and click on the Create a new Job Flow button. You should see a screen like the following screenshot:
5. You should now see a screen like the preceding screenshot. Select the Run a sample application radio button and the Word Count (Streaming) menu item from the sample application drop-down box and click on the Continue button.
6. The next screen, shown in the preceding screenshot, allows us to specify the location of the output produced by running the job. In the edit box for the output location, enter the name of the bucket created in step 2 (garryt1use is the bucket we are using here); then click on the Continue button.
7. The next screenshot shows the page where we can modify the number and size of the virtual hosts utilized by our job. Confirm that the instance type for each combo box is Small (m1.small), and the number of nodes for the Core group is 2 and for the Task group it is 0. Then click on the Continue button.
8. This next screenshot involves options we will not be using in this example. For the Amazon EC2 key pair field, select the Proceed without key pair menu item and click on the No radio button for the Enable Debugging field. Ensure that the Keep Alive radio button is set to No and click on the Continue button.
9. The next screen, shown in the preceding screenshot, is one we will not be doing much with right now. Confirm that the Proceed with no Bootstrap Actions radio button is selected and click on the Continue button.
10. Confirm the job flow specifications are as expected and click on the Create Job Flow button. Then click on the View my Job Flows and check status buttons. This will give a list of your job flows; you can filter to show only running or completed jobs. The default is to show all, as in the example shown in the following screenshot:
11. Occasionally hit the Refresh button until the status of the listed job, Running or Starting, changes to Complete; then click its checkbox to see details of the job flow, as shown in the following screenshot:
12. Click the S3 tab and select the bucket you created for the output location. You will see it has a single entry called wordcount, which is a directory. Right-click on that and select Open. Then do the same until you see a list of actual files following the familiar Hadoop part-nnnnn naming scheme, as shown in the following screenshot:
Right-click on part-00000 and open it. It should look something like this:
a 14716
aa 52
aakar 3
aargau 3
abad 3
abandoned 46
abandonment 6
abate 9
abauj 3
abbassid 4
abbes 3
abbl 3
Does this type of output look familiar?
What just happened?
The first step deals with S3, and not EMR. S3 is a scalable storage service that allows you to store files (called objects) within containers called buckets, and to access objects by their bucket and object key (that is, name). The model is analogous to the usage of a filesystem, and though there are underlying differences, they are unlikely to be important within this book.
S3 is where you will place the MapReduce programs and source data you want to process in EMR, and where the output and logs of EMR Hadoop jobs will be stored. There is a plethora of third-party tools to access S3, but here we are using the AWS management console, a browser interface to most AWS services.
Though we suggested you choose the nearest geographic region for S3, this is not required; non-US locations will typically give better latency for customers located nearer to them, but they also tend to have a slightly higher cost. The decision of where to host your data and applications is one you need to make after considering all these factors.
Aer creang the S3 bucket, we moved to the EMR console and created a new job ow.
This term is used within EMR to refer to a data processing task. As we will see, this can
be a one-me deal where the underlying Hadoop cluster is created and destroyed on
demand or it can be a long-running cluster on which mulple jobs are executed.
We le the default job ow name and then selected the use of an example applicaon,
in this case, the Python implementaon of WordCount. The term Hadoop Streaming refers
to a mechanism allowing scripng languages to be used to write map and reduce tasks, but
the funconality is the same as the Java WordCount we used earlier.
The form to specify the job flow requires a location for the source data, program, map and reduce classes, and a desired location for the output data. For the example we just saw, most of the fields were prepopulated; and, as can be seen, there are clear similarities to what was required when running local Hadoop from the command line.
By not selecting the Keep Alive option, we chose a Hadoop cluster that would be created specifically to execute this job, and destroyed afterwards. Such a cluster will have a longer startup time but will minimize costs. If you choose to keep the job flow alive, you will see additional jobs executed more quickly as you don't have to wait for the cluster to start up. But you will be charged for the underlying EC2 resources until you explicitly terminate the job flow.
Aer conrming, we do not need to add any addional bootstrap opons; we selected the
number and types of hosts we wanted to deploy into our Hadoop cluster. EMR disnguishes
between three dierent groups of hosts:
Master group: This is a controlling node hosng the NameNode and the JobTracker.
There is only 1 of these.
Core group: These are nodes running both HDFS DataNodes and MapReduce
TaskTrackers. The number of hosts is congurable.
Task group: These hosts don't hold HDFS data but do run TaskTrackers and can
provide more processing horsepower. The number of hosts is congurable.
The type of host refers to dierent classes of hardware capability, the details of which can
be found on the EC2 page. Larger hosts are more powerful but have a higher cost. Currently,
by default, the total number of hosts in a job ow must be 20 or less, though Amazon has a
simple form to request higher limits.
Aer conrming, all is as expected—we launch the job ow and monitor it on the console
unl the status changes to COMPLETED. At this point, we go back to S3, look inside the
bucket we specied as the output desnaon, and examine the output of our WordCount
job, which should look very similar to the output of a local Hadoop WordCount.
An obvious queson is where did the source data come from? This was one of the
prepopulated elds in the job ow specicaon we saw during the creaon process. For
nonpersistent job ows, the most common model is for the source data to be read from a
specied S3 source locaon and the resulng data wrien to the specied result S3 bucket.
That is it! The AWS management console allows ne-grained control of services such as S3
and EMR from the browser. Armed with nothing more than a browser and a credit card,
we can launch Hadoop jobs to crunch data without ever having to worry about any of the
mechanics around installing, running, or managing Hadoop.
Have a go hero – other EMR sample applications
EMR provides several other sample applications. Why not try some of them as well?
Other ways of using EMR
Although a powerful and impressive tool, the AWS management console is not always how we want to access S3 and run EMR jobs. As with all AWS services, there are both programmatic and command-line tools to use the services.
AWS credentials
Before using either programmac or command-line tools, however, we need to look at how
an account holder authencates for AWS to make such requests. As these are chargeable
services, we really do not want anyone else to make requests on our behalf. Note that as
we logged directly into the AWS management console with our AWS account in the
preceding example, we did not have to worry about this.
Each AWS account has several ideners that are used when accessing the various services:
Account ID: Each AWS account has a numeric ID.
Access key: Each account has an associated access key that is used to idenfy the
account making the request.
Secret access key: The partner to the access key is the secret access key. The access
key is not a secret and could be exposed in service requests, but the secret access
key is what you use to validate yourself as the account owner.
Key pairs: These are the key pairs used to log in to EC2 hosts. It is possible to either
generate public/private key pairs within EC2 or to import externally generated keys
into the system.
If this sounds confusing, it's because it is. At least at rst. When using a tool to access an
AWS service, however, there's usually a single up-front step of adding the right credenals
to a congured le, and then everything just works. However, if you do decide to explore
programmac or command-line tools, it will be worth a lile me investment to read the
documentaon for each service to understand how its security works.
The EMR command-line tools
In this book, we will not do anything with S3 and EMR that cannot be done from the AWS management console. However, when working with operational workloads, looking to integrate into other workflows, or automating service access, a browser-based tool is not appropriate, regardless of how powerful it is. Using the direct programmatic interfaces to a service provides the most granular control but requires the most effort.
For many services, Amazon provides a group of command-line tools that offer a useful way of automating access to AWS services while minimizing the amount of required development. The Elastic MapReduce command-line tools, linked from the main EMR page, are worth a look if you want a more CLI-based interface to EMR but don't want to write custom code just yet.
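To give a flavour of what the programmatic route looks like, the following is a minimal sketch using the AWS SDK for Java, which is not used elsewhere in this book; the bucket name, object key, and credential strings are placeholders, and you would need the SDK libraries on your classpath for it to compile and run.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

// Create a bucket and upload a local file (for example, a job JAR) into S3.
public class S3UploadSketch
{
    public static void main(String[] args)
    {
        BasicAWSCredentials credentials =
            new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY");
        AmazonS3Client s3 = new AmazonS3Client(credentials);

        // Bucket names must be globally unique; this one is a placeholder.
        s3.createBucket("my-example-bucket");

        // Object keys behave much like file paths within the bucket.
        s3.putObject("my-example-bucket", "jars/wc1.jar", new File("wc1.jar"));
    }
}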
The AWS ecosystem
Each AWS service also has a plethora of third-party tools, services, and libraries that can provide different ways of accessing the service, provide additional functionality, or offer new utility programs. Check out the developer tools hub at http://aws.amazon.com/developertools as a starting point.
Comparison of local versus EMR Hadoop
After our first experience of both a local Hadoop cluster and its equivalent in EMR, this is a good point at which we can consider the differences between the two approaches.
As may be apparent, the key differences are not really about capability; if all we want is an environment to run MapReduce jobs, either approach is completely suited. Instead, the distinguishing characteristics revolve around a topic we touched on in Chapter 1, What It's All About: whether you prefer a cost model that involves upfront infrastructure costs and ongoing maintenance effort, or a pay-as-you-go model with a lower maintenance burden along with rapid and conceptually infinite scalability. Other than the cost decisions, there are a few things to keep in mind:
EMR supports specific versions of Hadoop and has a policy of upgrading over time. If you have a need for a specific version, in particular if you need the latest and greatest versions immediately after release, then the lag before these are live on EMR may be unacceptable.
You can start up a persistent EMR job flow and treat it much as you would a local Hadoop cluster, logging into the hosting nodes and tweaking their configuration. If you find yourself doing this, it's worth asking if that level of control is really needed and, if so, whether it is stopping you getting all the cost model benefits of a move to EMR.
If it does come down to a cost consideration, remember to factor in all the hidden costs of a local cluster that are often forgotten. Think about the costs of power, space, cooling, and facilities, not to mention the administration overhead, which can be nontrivial if things start breaking in the early hours of the morning.
Summary
We covered a lot of ground in this chapter, in regards to getting a Hadoop cluster up and running and executing MapReduce programs on it.
Specifically, we covered the prerequisites for running Hadoop on local Ubuntu hosts. We also saw how to install and configure a local Hadoop cluster in either standalone or pseudo-distributed modes. Then, we looked at how to access the HDFS filesystem and submit MapReduce jobs. We then moved on and learned what accounts are needed to access Elastic MapReduce and other AWS services.
We saw how to browse and create S3 buckets and objects using the AWS management console, and also how to create a job flow and use it to execute a MapReduce job on an EMR-hosted Hadoop cluster. We also discussed other ways of accessing AWS services and studied the differences between local and EMR-hosted Hadoop.
Now that we have learned about running Hadoop locally or on EMR, we are ready to start writing our own MapReduce programs, which is the topic of the next chapter.
3
Understanding MapReduce
The previous two chapters have discussed the problems that Hadoop allows us to solve, and gave some hands-on experience of running example MapReduce jobs. With this foundation, we will now go a little deeper.
In this chapter we will be:
Understanding how key/value pairs are the basis of Hadoop tasks
Learning the various stages of a MapReduce job
Examining the workings of the map, reduce, and optional combine stages in detail
Looking at the Java API for Hadoop and using it to develop some simple MapReduce jobs
Learning about Hadoop input and output
Key/value pairs
Since Chapter 1, What It's All About, we have been talking about operations that process and provide the output in terms of key/value pairs without explaining why. It is time to address that.
What it means
Firstly, we will clarify just what we mean by key/value pairs by highlighting similar concepts in the Java standard library. The java.util.Map interface is the parent of commonly used classes such as HashMap and (through some library backward reengineering) even the original Hashtable.
For any Java Map object, its contents are a set of mappings from a given key of a specified type to a related value of a potentially different type. A HashMap object could, for example, contain mappings from a person's name (String) to his or her birthday (Date).
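To make that concrete in plain Java (nothing Hadoop-specific is involved, the names and dates are invented, and strings are used for the dates to keep the sketch short):

import java.util.HashMap;
import java.util.Map;

// A plain Java key/value structure: names (keys) mapped to birthdays (values).
public class BirthdayMap
{
    public static void main(String[] args)
    {
        Map<String, String> birthdays = new HashMap<String, String>();
        birthdays.put("Alice", "1985-04-12");   // keys are unique...
        birthdays.put("Carol", "1985-04-12");   // ...but values need not be

        // Values are always retrieved via their key.
        System.out.println(birthdays.get("Alice"));
    }
}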
In the context of Hadoop, we are referring to data that also comprises keys that relate to associated values. This data is stored in such a way that the various values in the data set can be sorted and rearranged across a set of keys. If we are using key/value data, it will make sense to ask questions such as the following:
Does a given key have a mapping in the data set?
What are the values associated with a given key?
What is the complete set of keys?
Think back to WordCount from the previous chapter. We will go into it in more detail shortly, but the output of the program is clearly a set of key/value relationships; for each word (the key), there is a count (the value) of its number of occurrences. Think about this simple example and some important features of key/value data will become apparent, as follows:
Keys must be unique but values need not be
Each value must be associated with a key, but a key could have no values (though not in this particular example)
Careful definition of the key is important; deciding on whether or not the counts are applied with case sensitivity will give different results
Note that we need to dene carefully what we mean by keys being unique
here. This does not mean the key occurs only once; in our data set we may see
a key occur numerous mes and, as we shall see, the MapReduce model has
a stage where all values associated with each key are collected together. The
uniqueness of keys guarantees that if we collect together every value seen for
any given key, the result will be an associaon from a single instance of the key
to every value mapped in such a way, and none will be omied.
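To see what collecting together every value seen for a given key produces, here is a small plain-Java illustration; the words and counts are invented, and this is only a sketch of what the MapReduce shuffle stage (covered later in this chapter) does for us automatically.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group every value seen for each key into a single list per unique key.
public class GroupByKey
{
    public static void main(String[] args)
    {
        String[] keys = { "is", "a", "is" };
        int[] values = { 1, 1, 1 };

        Map<String, List<Integer>> grouped =
            new HashMap<String, List<Integer>>();
        for (int i = 0; i < keys.length; i++)
        {
            List<Integer> seen = grouped.get(keys[i]);
            if (seen == null)
            {
                seen = new ArrayList<Integer>();
                grouped.put(keys[i], seen);
            }
            seen.add(values[i]);
        }
        // grouped now maps "is" to [1, 1] and "a" to [1]: one entry per key.
        System.out.println(grouped);
    }
}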
Why key/value data?
Using key/value data as the foundation of MapReduce operations allows for a powerful programming model that is surprisingly widely applicable, as can be seen by the adoption of Hadoop and MapReduce across a wide variety of industries and problem scenarios. Much data is either intrinsically key/value in nature or can be represented in such a way. It is a simple model with broad applicability and semantics straightforward enough that programs defined in terms of it can be applied by a framework like Hadoop.
Of course, the data model itself is not the only thing that makes Hadoop useful; its real power lies in how it uses the techniques of parallel execution, and divide and conquer, discussed in Chapter 1, What It's All About. We can have a large number of hosts on which we can store data and execute tasks, and even use a framework that manages the division of the larger task into smaller chunks, and the combination of partial results into the overall answer. But we need this framework to provide us with a way of expressing our problems that doesn't require us to be experts in the execution mechanics; we want to express the transformations required on our data and then let the framework do the rest. MapReduce, with its key/value interface, provides such a level of abstraction, whereby the programmer only has to specify these transformations and Hadoop handles the complex process of applying this to arbitrarily large data sets.
Some real-world examples
To become less abstract, let's think of some real-world data that is key/value pair:
An address book relates a name (key) to contact information (value)
A bank account uses an account number (key) to associate with the account details (value)
The index of a book relates a word (key) to the pages on which it occurs (value)
On a computer filesystem, filenames (keys) allow access to any sort of data, such as text, images, and sound (values)
These examples are intentionally broad in scope, to help and encourage you to think that key/value data is not some very constrained model used only in high-end data mining but a very common model that is all around us.
We would not be having this discussion if this was not important to Hadoop. The bottom line is that if the data can be expressed as key/value pairs, it can be processed by MapReduce.
MapReduce as a series of key/value transformations
You may have come across MapReduce described in terms of key/value transformations, in particular the intimidating one that looks like this:
{K1,V1} -> {K2, List<V2>} -> {K3,V3}
We are now in a position to understand what this means:
The input to the map method of a MapReduce job is a series of key/value pairs that we'll call K1 and V1.
The output of the map method (and hence input to the reduce method) is a series of keys and an associated list of values that are called K2 and V2. Note that each mapper simply outputs a series of individual key/value outputs; these are combined into a key and list of values in the shuffle stage.
The final output of the MapReduce job is another series of key/value pairs, called K3 and V3.
These sets of key/value pairs don't have to be different; it would be quite possible to input, say, names and contact details and output the same, with perhaps some intermediary format used in collating the information. Keep this three-stage model in mind as we explore the Java API for MapReduce next. We will first walk through the main parts of the API you will need and then do a systematic examination of the execution of a MapReduce job.
Pop quiz – key/value pairs
Q1. The concept of key/value pairs is…
1. Something created by and specific to Hadoop.
2. A way of expressing relationships we often see but don't think of as such.
3. An academic concept from computer science.
Q2. Are username/password combinations an example of key/value data?
1. Yes, it's a clear case of one value being associated to the other.
2. No, the password is more of an attribute of the username, there's no index-type relationship.
3. We'd not usually think of them as such, but Hadoop could still process a series of username/password combinations as key/value pairs.
The Hadoop Java API for MapReduce
Hadoop underwent a major API change in its 0.20 release, which is the primary interface in the 1.0 version we use in this book. Though the prior API was certainly functional, the community felt it was unwieldy and unnecessarily complex in some regards.
The new API, sometimes generally referred to as context objects, for reasons we'll see later, is the future of Java's MapReduce development; and as such we will use it wherever possible in this book. Note that caveat: there are parts of the pre-0.20 MapReduce libraries that have not been ported to the new API, so we will use the old interfaces when we need to examine any of these.
The 0.20 MapReduce Java API
The 0.20 and above versions of the MapReduce API have most of the key classes and interfaces either in the org.apache.hadoop.mapreduce package or its subpackages.
In most cases, the implementation of a MapReduce job will provide job-specific subclasses of the Mapper and Reducer base classes found in this package.
We'll stick to the commonly used K1 / K2 / K3, and so on, terminology, though more recently the Hadoop API has, in places, used terms such as KEYIN/VALUEIN and KEYOUT/VALUEOUT instead. For now, we will stick with K1 / K2 / K3 as it helps us understand the end-to-end data flow.
The Mapper class
This is a cut-down view of the base Mapper class provided by Hadoop. For our own mapper implementations, we will subclass this base class and override the specified method as follows:

class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context)
        throws IOException, InterruptedException
    {..}
}

Although the use of Java generics can make this look a little opaque at first, there is actually not that much going on. The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair in its parameters. The other parameter is an instance of the Context class that provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method.
Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm in which you write classes that process single records and the framework is responsible for all the work required to turn an enormous data set into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full data set. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need of having to write file parsers for any but custom file types.
There are three addional methods that somemes may be required to be overridden.
protected void setup( Mapper.Context context)
throws IOException, Interrupted Exception
This method is called once before any key/value pairs are presented to the map method.
The default implementaon does nothing.
protected void cleanup( Mapper.Context context)
throws IOException, Interrupted Exception
This method is called once aer all key/value pairs have been presented to the map method.
The default implementaon does nothing.
protected void run( Mapper.Context context)
throws IOException, Interrupted Exception
This method controls the overall ow of task processing within a JVM. The default
implementaon calls the setup method once before repeatedly calling the map
method for each key/value pair in the split, and then nally calls the cleanup method.
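To make that concrete, the following is a sketch of a mapper whose run method reproduces that default behavior; the class name and key/value types here are chosen arbitrarily for illustration, and the real default implementation differs only in minor details.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A mapper whose run() mirrors the default: setup once, then map() for
// every key/value pair in the split, then cleanup once.
public class RunLoopMapper
    extends Mapper<LongWritable, Text, Text, IntWritable>
{
    @Override
    public void run(Context context)
        throws IOException, InterruptedException
    {
        setup(context);
        while (context.nextKeyValue())
        {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }
}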
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
The Reducer class
The Reducer base class works very similarly to the Mapper class, and usually requires only subclasses to override a single reduce method. Here is the cut-down class definition:

public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values,
        Reducer.Context context)
        throws IOException, InterruptedException
    {..}
}

Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output) while the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism to output the result of the method.
This class also has the setup, run, and cleanup methods with similar default implementations as with the Mapper class that can optionally be overridden:
protected void setup(Reducer.Context context)
    throws IOException, InterruptedException

This method is called once before any keys and their lists of values are presented to the reduce method. The default implementation does nothing.

protected void cleanup(Reducer.Context context)
    throws IOException, InterruptedException

This method is called once after all keys and their lists of values have been presented to the reduce method. The default implementation does nothing.

protected void run(Reducer.Context context)
    throws IOException, InterruptedException

This method controls the overall flow of processing the task within the JVM. The default implementation calls the setup method once before repeatedly calling the reduce method for each key and its associated list of values, and then finally calls the cleanup method.
The Driver class
Although our mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code required: the driver that communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. There is an additional variety of other configuration options that can be set and which we will see throughout this book.
There is no default parent Driver class to subclass; the driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver. Don't worry about how each line works, though you should be able to work out generally what each is doing:
public class ExampleDriver
{
    ...
    public static void main(String[] args) throws Exception
    {
        // Create a Configuration object that is used to set other options
        Configuration conf = new Configuration();
        // Create the object representing the job
        Job job = new Job(conf, "ExampleJob");
        // Set the name of the main class in the job jarfile
        job.setJarByClass(ExampleDriver.class);
        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);
        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);
        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute the job and wait for it to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Given our previous talk of jobs, it is not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations.
Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often.
There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the file format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files that suit our WordCount example. There are multiple ways of expressing the format within text files in addition to particularly optimized binary formats.
A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies the code distribution.
Writing MapReduce programs
We have been using and talking about WordCount for quite some time now; let's actually write an implementation, compile, and run it, and then explore some modifications.
Time for action – setting up the classpath
To compile any Hadoop-related code, we will need to refer to the standard Hadoop-bundled classes.
Add the hadoop-core-1.0.4.jar file from the distribution to the Java classpath as follows:
$ export CLASSPATH=.:${HADOOP_HOME}/hadoop-core-1.0.4.jar:${CLASSPATH}
What just happened?
This adds the hadoop-core-1.0.4.jar file explicitly to the classpath alongside the current directory and the previous contents of the CLASSPATH environment variable. Once again, it would be good to put this in your shell startup file or a standalone file to be sourced.
We will later need to also have many of the supplied third-party libraries that come with Hadoop on our classpath, and there is a shortcut to do this. For now, the explicit addition of the core JAR file will suffice.
Time for action – implementing WordCount
We have seen the use of the WordCount example program in Chapter 2, Getting Hadoop Up and Running. Now we will explore our own Java implementation by performing the following steps:
1. Enter the following code into the WordCount1.java file:
import java.io.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount1
{

    public static class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException
        {
            String[] words = value.toString().split(" ");

            for (String str : words)
            {
                word.set(str);
                context.write(word, one);
            }
        }
    }

    public static class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException
        {
            int total = 0;
            for (IntWritable val : values)
            {
                total++;
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount1.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
2. Now compile it by executing the following command:
$ javac WordCount1.java
What just happened?
This is our first complete MapReduce job. Look at the structure and you should recognize the elements we have previously discussed: the overall Job class with the driver configuration in its main method and the Mapper and Reducer implementations defined as inner classes.
We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now let's look at the preceding code and think of how it realizes the key/value transformations we talked about earlier.
The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the byte offset of the line within the file and the value is the text of that line. In reality, you may never actually see a mapper that uses that key, but it is provided.
The mapper is executed once for each line of text in the input source and every time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value of the form <word, 1>. These are our K2/V2 values.
We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect together the values for each key that facilitates this, which we'll not describe right now. Hadoop executes the reducer once for each key and the preceding reducer implementation simply counts the numbers in the Iterable object and gives output for each word in the form of <word, count>. These are our K3/V3 values.
Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class takes Object and Text as input and produces Text and IntWritable as output. The WordCountReducer class takes Text and IntWritable both as input and output. This is again quite a common pattern, where the map method transforms its input into a series of intermediate data pairs on which the reducer then performs aggregation.
The driver is more meaningful here, as we have real values for the parameters. We use arguments passed to the class to specify the input and output locations.
Time for action – building a JAR file
Before we run our job in Hadoop, we must collect the required class files into a single JAR file that we will submit to the system.
Create a JAR file from the generated class files.
$ jar cvf wc1.jar WordCount1*class
What just happened?
We must always package our class files into a JAR file before submitting to Hadoop, be it local or on Elastic MapReduce.
Be careful with the JAR command and file paths. If you include class files from a subdirectory in a JAR file, the classes may not be stored with the path you expect. This is especially common when using a catch-all classes directory into which all compiled classes are placed. It may be useful to write a script to change into the directory, add the required files to a JAR file, and move the JAR file to the required location.
Time for action – running WordCount on a local Hadoop cluster
Now we have generated the class files and collected them into a JAR file, we can run the application by performing the following steps:
1. Submit the new JAR file to Hadoop for execution.
$ hadoop jar wc1.jar WordCount1 test.txt output
2. If successful, you should see the output being very similar to the one we obtained when we ran the Hadoop-provided sample WordCount in the previous chapter. Check the output file; it should be as follows:
$ hadoop fs -cat output/part-r-00000
This 1
Yes 1
a 1
is 2
test 1
this 1
What just happened?
This is the first time we have used the Hadoop JAR command with our own code. There are four arguments:
1. The name of the JAR file.
2. The name of the driver class within the JAR file.
3. The location, on HDFS, of the input file (a relative reference to the /user/hadoop home folder, in this case).
4. The desired location of the output folder (again, a relative path).
The name of the driver class is only required if a main class has not (as in this case) been specified within the JAR file manifest.
Time for action – running WordCount on EMR
We will now show you how to run this same JAR file on EMR. Remember, as always, that this costs money!
1. Go to the AWS console at http://aws.amazon.com/console, sign in, and select S3.
2. You'll need two buckets: one to hold the JAR file and another for the job output. You can use existing buckets or create new ones.
3. Open the bucket where you will store the job file, click on Upload, and add the wc1.jar file created earlier.
4. Return to the main console home page, and then go to the EMR portion of the console by selecting Elastic MapReduce.
5. Click on the Create a New Job Flow button and you'll see a familiar screen as shown in the following screenshot:
6. Previously, we used a sample application; to run our code, we need to perform different steps. Firstly, select the Run your own application radio button.
7. In the Select a Job Type combobox, select Custom JAR.
8. Click on the Continue button and you'll see a new form, as shown in the following screenshot:
We now specify the arguments to the job. Within our uploaded JAR file, our code, particularly the driver class, specifies aspects such as the Mapper and Reducer classes. What we need to provide is the path to the JAR file and the input and output paths for the job. In the JAR Location field, put the location where you uploaded the JAR file. If the JAR file is called wc1.jar and you uploaded it into a bucket called mybucket, the path would be mybucket/wc1.jar.
In the JAR Arguments field, you need to enter the name of the main class and the input and output locations for the job. For files on S3, we can use URLs of the form s3://bucketname/objectname. Click on Continue and the familiar screen to specify the virtual machines for the job flow appears, as shown in the following screenshot:
Now continue through the job flow setup and execution as we did in Chapter 2, Getting Hadoop Up and Running.
What just happened?
The important lesson here is that we can reuse the code written on and for a local Hadoop cluster in EMR. Also, besides these first few steps, the majority of the EMR console is the same regardless of the source of the job code to be executed.
Through the remainder of this chapter, we will not explicitly show code being executed on EMR and will instead focus more on the local cluster, because running a JAR file on EMR is very easy.
The pre-0.20 Java MapReduce API
Our preference in this book is for the 0.20 and above versions of the MapReduce Java API, but we'll need to take a quick look at the older APIs for two reasons:
1. Many online examples and other reference materials are written for the older APIs.
2. Several areas within the MapReduce framework are not yet ported to the new API, and we will need to use the older APIs to explore them.
The older API's classes are found primarily in the org.apache.hadoop.mapred package.
The new API classes use concrete Mapper and Reducer classes, while the older API had this responsibility split across abstract classes and interfaces.
An implementation of a Mapper class will subclass the abstract MapReduceBase class and implement the Mapper interface, while a custom Reducer class will subclass the same MapReduceBase abstract class but implement the Reducer interface.
interfaces of pre-0.20 Mapper and Reducer are worth showing:
public interface Mapper<K1, V1, K2, V2>
{
void map( K1 key, V1 value, OutputCollector< K2, V2> output, Reporter
reporter) throws IOException ;
}
public interface Reducer<K2, V2, K3, V3>
{
void reduce( K2 key, Iterator<V2> values,
OutputCollector<K3, V3> output, Reporter reporter)
throws IOException ;
}
There are a few points to understand here:
The generic parameters to the OutputCollector class show more explicitly how the result of the methods is presented as output.
The old API used the OutputCollector class for this purpose, and the Reporter class to write status and metrics information to the Hadoop framework. The 0.20 API combines these responsibilities in the Context class.
The Reducer interface uses an Iterator object instead of an Iterable object; this was changed as the latter works with the Java for-each syntax and makes for cleaner code.
Neither the map nor the reduce method could throw InterruptedException in the old API.
As you can see, the changes between the APIs alter how MapReduce programs are written but don't change the purpose or responsibilities of mappers or reducers. Don't feel obliged to become an expert in both APIs unless you need to; familiarity with either should allow you to follow the rest of this book.
Hadoop-provided mapper and reducer implementations
We don't always have to write our own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in our jobs. If we don't override any of the methods in the Mapper and Reducer classes in the new API, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.
Note that more such prewritten Mapper and Reducer implementations may be added over time, and currently the new API does not have as many as the older one.
The mappers are found in org.apache.hadoop.mapreduce.lib.map, and include the following:
InverseMapper: This outputs (value, key)
TokenCounterMapper: This counts the number of discrete tokens in each line of input
The reducers are found in org.apache.hadoop.mapreduce.lib.reduce, and currently include the following:
IntSumReducer: This outputs the sum of the list of integer values per key
LongSumReducer: This outputs the sum of the list of long values per key
Time for action – WordCount the easy way
Let's revisit WordCount, but this time use some of these predefined map and reduce implementations:
1. Create a new WordCountPredefined.java file containing the following code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountPredefined
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count1");
        job.setJarByClass(WordCountPredefined.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
2. Now compile, create the JAR file, and run it as before.
3. Don't forget to delete the output directory before running the job, if you want to use the same location. Use the hadoop fs -rmr output command, for example.
What just happened?
Given the ubiquity of WordCount as an example in the MapReduce world, it's perhaps not entirely surprising that there are predefined Mapper and Reducer implementations that together realize the entire WordCount solution. The TokenCounterMapper class simply breaks each input line into a series of (token, 1) pairs and the IntSumReducer class provides a final count by summing the number of values for each key.
There are two important things to appreciate here:
Though WordCount was doubtless an inspiration for these implementations, they are in no way specific to it and can be widely applicable
This model of having reusable mapper and reducer implementations is one thing to remember, especially in combination with the fact that often the best starting point for a new MapReduce job implementation is an existing one
Walking through a run of WordCount
To explore the relationship between mapper and reducer in more detail, and to expose some of Hadoop's inner workings, we'll now go through just how WordCount (or indeed any MapReduce job) is executed.
Startup
The call to Job.waitForCompletion() in the driver is where all the action starts. The driver is the only piece of code that runs on our local machine, and this call starts the communication with the JobTracker. Remember that the JobTracker is responsible for all aspects of job scheduling and execution, so it becomes our primary interface when performing any task related to job management. The JobTracker communicates with the NameNode on our behalf and manages all interactions relating to the data stored on HDFS.
Splitting the input
The first of these interactions happens when the JobTracker looks at the input data and determines how to assign it to map tasks. Recall that HDFS files are usually split into blocks of at least 64 MB and the JobTracker will assign each block to one map task.
Our WordCount example, of course, used a trivial amount of data that was well within a single block. Picture a much larger input file measured in terabytes, and the split model makes more sense. Each segment of the file, or split in MapReduce terminology, is processed uniquely by one map task.
Once it has computed the splits, the JobTracker places them and the JAR file containing the Mapper and Reducer classes into a job-specific directory on HDFS, whose path will be passed to each task as it starts.
Task assignment
Once the JobTracker has determined how many map tasks will be needed, it looks at the number of hosts in the cluster, how many TaskTrackers are working, and how many map tasks each can concurrently execute (a user-definable configuration variable). The JobTracker also looks to see where the various input data blocks are located across the cluster and attempts to define an execution plan that maximizes the cases when a TaskTracker processes a split/block located on the same physical host, or, failing that, it processes at least one in the same hardware rack.
This data locality optimization is a huge reason behind Hadoop's ability to efficiently process such large datasets. Recall also that, by default, each block is replicated across three different hosts, so the likelihood of producing a task/host plan that sees most blocks processed locally is higher than it may seem at first.
Task startup
Each TaskTracker then starts up a separate Java virtual machine to execute the tasks. This does add a startup time penalty, but it isolates the TaskTracker from problems caused by misbehaving map or reduce tasks, and it can be configured to be shared between subsequently executed tasks.
If the cluster has enough capacity to execute all the map tasks at once, they will all be started and given a reference to the split they are to process and the job JAR file. Each TaskTracker then copies the split to the local filesystem.
If there are more tasks than the cluster capacity, the JobTracker will keep a queue of pending tasks and assign them to nodes as they complete their initially assigned map tasks.
We are now ready to see the map tasks process their data. If this all sounds like a lot of work, it is; and it explains why, when running any MapReduce job, there is always a non-trivial amount of time taken as the system gets started and performs all these steps.
Ongoing JobTracker monitoring
The JobTracker doesn't just stop work now and wait for the TaskTrackers to execute all the mappers and reducers. It is constantly exchanging heartbeat and status messages with the TaskTrackers, looking for evidence of progress or problems. It also collects metrics from the tasks throughout the job execution, some provided by Hadoop and others specified by the developer of the map and reduce tasks, though we don't use any in this example.
Mapper input
In Chapter 2, Getting Hadoop Up and Running, our WordCount input was a simple one-line text file. For the rest of this walkthrough, let's assume it was a not-much-less trivial two-line text file:
This is a test
Yes this is
The driver class specifies the format and structure of the input file by using TextInputFormat, and from this Hadoop knows to treat this as text with the offset of each line within the file as the key and the line contents as the value. The two invocations of the mapper will therefore be given the following input (the keys are the byte offsets at which each line begins):
0   This is a test
15  Yes this is
Mapper execution
The key/value pairs received by the mapper are the offset in the file of the line and the line contents respectively because of how the job is configured. Our implementation of the map method in WordCountMapper discards the key as we do not care where each line occurred in the file and splits the provided value into words using the split method on the standard Java String class. Note that better tokenization could be provided by use of regular expressions or the StringTokenizer class, but for our purposes this simple approach will suffice.
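As a sketch of that alternative, a mapper using StringTokenizer might look like the following; the class name is invented, and the behavior differs slightly from split(" ") in how runs of whitespace are treated.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A variant of WordCountMapper that tokenizes each line with StringTokenizer.
public class TokenizingWordCountMapper
    extends Mapper<Object, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException
    {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens())
        {
            word.set(tokens.nextToken());
            context.write(word, one);
        }
    }
}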
For each individual word, the mapper then emits a key comprised of the actual word itself,
and a value of 1.
We add a few opmizaons that we'll menon here, but don't worry
too much about them at this point. You will see that we don't create the
IntWritable object containing the value 1 each me, instead we
create it as a stac variable and re-use it in each invocaon. Similarly, we
use a single Text object and reset its contents for each execuon of the
method. The reason for this is that though it doesn't help much for our
ny input le, the processing of a huge data set would see the mapper
potenally called thousands or millions of mes. If each invocaon
potenally created a new object for both the key and value output, this
would become a resource issue and likely cause much more frequent
pauses due to garbage collecon. We use this single value and know the
Context.write method will not alter it.
Mapper output and reduce input
The output of the mapper is a series of pairs of the form (word, 1); in our example these will be:
(This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (this, 1), (is, 1)
These output pairs from the mapper are not passed directly to the reducer. Between mapping and reducing is the shuffle stage where much of the magic of MapReduce occurs.
Partitioning
One of the implicit guarantees of the Reduce interface is that a single reducer will be given all the values associated with a given key. With multiple reduce tasks running across a cluster, each mapper output must therefore be partitioned into the separate outputs destined for each reducer. These partitioned files are stored on the local node filesystem.
The number of reduce tasks across the cluster is not as dynamic as that of mappers, and indeed we can specify the value as part of our job submission. Each TaskTracker therefore knows how many reducers are in the cluster and from this how many partitions the mapper output should be split into.
We'll address failure tolerance in a later chapter, but at this point an obvious question is what happens to this calculation if a reducer fails. The answer is that the JobTracker will ensure that any failed reduce tasks are re-executed, potentially on a different node, so a transient failure will not be an issue. A more serious issue, such as that caused by a data-sensitive bug or very corrupt data in a split will, unless certain steps are taken, cause the whole job to fail.
The optional partition function
Within the org.apache.hadoop.mapreduce package is the Partitioner class, an abstract class with the following signature:

public abstract class Partitioner<Key, Value>
{
    public abstract int getPartition(Key key, Value value,
        int numPartitions);
}

By default, Hadoop will use a strategy that hashes the output key to perform the partitioning. This functionality is provided by the HashPartitioner class within the org.apache.hadoop.mapreduce.lib.partition package, but it is necessary in some cases to provide a custom subclass of Partitioner with application-specific partitioning logic. This would be particularly true if, for example, the data provided a very uneven distribution when the standard hash function was applied.
Reducer input
The reducer TaskTracker receives updates from the JobTracker that tell it which nodes in the cluster hold map output partitions which need to be processed by its local reduce task. It then retrieves these from the various nodes and merges them into a single file that will be fed to the reduce task.
Reducer execution
Our WordCountReducer class is very simple; for each word, it simply counts the number of elements in the Iterable and emits the final (Word, count) output for each word.
We don't worry about any sort of optimization to avoid excess object creation here. The number of reduce invocations is typically smaller than the number of mappers, and consequently the overhead is less of a concern. However, feel free to do so if you find yourself with very tight performance requirements.
For our invocation of WordCount on our sample input, all but one of the words have only one value in the list of values; is has two.
Note that the words this and This had discrete counts because we did not attempt to ignore case sensitivity. Similarly, ending each sentence with a period would have stopped is having a count of two, as is would be different from is.. Always be careful when working with textual data; aspects such as capitalization, punctuation, hyphenation, pagination, and so on can skew how the data is perceived. In such cases, it's common to have a precursor MapReduce job that applies a normalization or clean-up strategy to the data set.
Reducer output
The final set of reducer output for our example is therefore:
(This, 1), (is, 2), (a, 1), (test, 1), (Yes, 1), (this, 1)
This data will be output to partition files within the output directory specified in the driver that will be formatted using the specified OutputFormat implementation. Each reduce task writes to a single file with the filename part-r-nnnnn, where nnnnn starts at 00000 and is incremented. This is, of course, what we saw in Chapter 2, Getting Hadoop Up and Running; hopefully the part prefix now makes a little more sense.
Shutdown
Once all tasks have completed successfully, the JobTracker outputs the final state of the job to the client, along with the final aggregates of some of the more important counters that it has been aggregating along the way. The full job and task history is available in the log directory on each node or, more accessibly, via the JobTracker web UI; point your browser to port 50030 on the JobTracker node.
That's all there is to it!
As you've seen, each MapReduce program sits atop a significant amount of machinery provided by Hadoop, and the sketch provided is in many ways a simplification. As before, much of this isn't hugely valuable for such a small example, but never forget that we can use the same software and mapper/reducer implementations to do a WordCount on a much larger data set across a huge cluster, be it local or on EMR. The work that Hadoop does for you at that point is enormous and is what makes it possible to perform data analysis on such datasets; the effort to manually implement the distribution, synchronization, and parallelization of the code would otherwise be immense.
Apart from the combiner…maybe
There is one additional, and optional, step that we omitted previously. Hadoop allows the use of a combiner class to perform some early aggregation of the output from the map method before it is retrieved by the reducer.
Why have a combiner?
Much of Hadoop's design is predicated on reducing the expensive parts of a job that usually equate to disk and network I/O. The output of the mapper is often large; it's not infrequent to see it many times the size of the original input. Hadoop does allow configuration options to help reduce the impact of the reducers transferring such large chunks of data across the network. The combiner takes a different approach, where it is possible to perform early aggregation to require less data to be transferred in the first place.
The combiner does not have its own interface; a combiner must have the same signature as the reducer and hence also subclasses the Reducer class from the org.apache.hadoop.mapreduce package. The effect of this is to basically perform a mini-reduce on the mapper for the output destined for each reducer.
Hadoop does not guarantee whether the combiner will be executed. At times, it may not be executed at all, while at times it may be used once, twice, or more times depending on the size and number of output files generated by the mapper for each reducer.
Time for action – WordCount with a combiner
Let's add a combiner to our first WordCount example. In fact, let's use our reducer as the combiner. Since the combiner must have the same interface as the reducer, this is something you'll often see, though note that the type of processing involved in the reducer will determine if it is a true candidate for a combiner; we'll discuss this later.
Since we are looking to count word occurrences, we can do a partial count on the map node and pass these subtotals to the reducer.
1. Copy WordCount1.java to WordCount2.java and change the driver class to add
the following line between the definition of the Mapper and Reducer classes:
job.setCombinerClass(WordCountReducer.class);
2. Also change the class name to WordCount2 and then compile it.
$ javac WordCount2.java
3. Create the JAR file.
$ jar cvf wc2.jar WordCount2*.class
4. Run the job on Hadoop.
$ hadoop jar wc2.jar WordCount2 test.txt output
5. Examine the output.
$ hadoop fs -cat output/part-r-00000
What just happened?
This output may not be what you expected, as the value for the word is is now incorrectly
specified as 1 instead of 2.
The problem lies in how the combiner and reducer interact. The value provided to the
reducer, which was previously (is, 1, 1), is now (is, 2) because our combiner did its
own summation of the number of elements for each word. However, our reducer does not
look at the actual values in the Iterable object; it simply counts how many are there.
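To make the failure mode concrete, the problematic reduce logic looks something like the following sketch (a reconstruction for illustration, not the book's exact listing): it counts the entries in the Iterable rather than summing their values, which is only equivalent when every value is 1.

// Sketch of a count-based reduce body that breaks once a combiner runs.
// With raw map output the values for "is" are (1, 1) and counting them gives 2;
// after the combiner they arrive as the single value (2) and counting gives 1.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException
{
    int total = 0;
    for (IntWritable val : values)
    {
        total++;            // counts elements, ignores val.get()
    }
    context.write(key, new IntWritable(total));
}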
When you can use the reducer as the combiner
You need to be careful when writing a combiner. Remember that Hadoop makes no
guarantees on how many times it may be applied to map output; it may be 0, 1, or more times.
It is therefore critical that the operation performed by the combiner can safely be
applied in such a way. Operations that are commutative and associative, such as summation
or taking a maximum, are usually safe, but, as shown previously, ensure the reduce logic isn't
making implicit assumptions that might break this property.
Time for action – fixing WordCount to work with a combiner
Let's make the necessary modifications to WordCount to correctly use a combiner.
Copy WordCount2.java to a new file called WordCount3.java and change the reduce
method as follows:
public void reduce(Text key, Iterable<IntWritable> values,
    Context context) throws IOException, InterruptedException
{
    int total = 0 ;
    for (IntWritable val : values)
    {
        total += val.get() ;
    }
    context.write(key, new IntWritable(total));
}
Remember to also change the class name to WordCount3 and then compile, create the
JAR file, and run the job as before.
What just happened?
The output is now as expected. Any map-side invocations of the combiner perform
successfully and the reducer correctly produces the overall output value.
Would this have worked if the original reducer was used as the combiner and
the new reduce implementation as the reducer? The answer is no, though our
test example would not have demonstrated it. Because the combiner may be
invoked multiple times on the map output data, the same errors would arise
if the dataset was large enough, but they didn't occur here due
to the small input size. Fundamentally, the original reducer was incorrect, but
this wasn't immediately obvious; watch out for such subtle logic flaws. This
sort of issue can be really hard to debug as the code will reliably work on a
development box with a subset of the data set and fail on the much larger
operational cluster. Carefully craft your combiner classes and never rely on
testing that only processes a small sample of the data.
Reuse is your friend
In the previous section we took the existing job class file and made changes to it. This is a
small example of a very common Hadoop development workflow: use an existing job file as
the starting point for a new one. Even if the actual mapper and reducer logic is very different,
it's often a timesaver to take an existing working job as this helps you remember all the
required elements of the mapper, reducer, and driver implementations.
Pop quiz – MapReduce mechanics
Q1. What do you always have to specify for a MapReduce job?
1. The classes for the mapper and reducer.
2. The classes for the mapper, reducer, and combiner.
3. The classes for the mapper, reducer, partitioner, and combiner.
4. None; all classes have default implementations.
Q2. How many times will a combiner be executed?
1. At least once.
2. Zero or one times.
3. Zero, one, or many times.
4. It's configurable.
Q3. You have a mapper that for each key produces an integer value and the following set of
reduce operations:
Reducer A: outputs the sum of the set of integer values.
Reducer B: outputs the maximum of the set of values.
Reducer C: outputs the mean of the set of values.
Reducer D: outputs the difference between the largest and smallest values
in the set.
Which of these reduce operations could safely be used as a combiner?
1. All of them.
2. A and B.
3. A, B, and D.
4. C and D.
5. None of them.
Hadoop-specific data types
Up to this point we've glossed over the actual data types used as the input and output
of the map and reduce classes. Let's take a look at them now.
The Writable and WritableComparable interfaces
If you browse the Hadoop API for the org.apache.hadoop.io package, you'll
see some familiar classes such as Text and IntWritable along with others with
the Writable suffix.
This package also contains the Writable interface specified as follows:
import java.io.DataInput ;
import java.io.DataOutput ;
import java.io.IOException ;
public interface Writable
{
void write(DataOutput out) throws IOException ;
void readFields(DataInput in) throws IOException ;
}
The main purpose of this interface is to provide mechanisms for the serialization and
deserialization of data as it is passed across the network or read and written from the
disk. Every data type to be used as a value input or output from a mapper or reducer
(that is, V1, V2, or V3) must implement this interface.
Data to be used as keys (K1, K2, K3) has a stricter requirement: in addition to Writable,
it must also provide an implementation of the standard Java Comparable interface.
This has the following specification:
public interface Comparable
{
    public int compareTo( Object obj) ;
}
The compareTo method returns a negative number, zero, or a positive number depending on
whether the compared object is less than, equal to, or greater than the current object.
As a convenience interface, Hadoop provides the WritableComparable interface in the
org.apache.hadoop.io package.
public interface WritableComparable extends Writable, Comparable
{}
Introducing the wrapper classes
Fortunately, you don't have to start from scratch; as you've already seen, Hadoop provides
classes that wrap the Java primitive types and implement WritableComparable. They are
provided in the org.apache.hadoop.io package.
Primitive wrapper classes
These classes are conceptually similar to the primitive wrapper classes, such as Integer
and Long, found in java.lang. They hold a single primitive value that can be set either
at construction or via a setter method.
BooleanWritable
ByteWritable
DoubleWritable
FloatWritable
IntWritable
LongWritable
VIntWritable – a variable length integer type
VLongWritable – a variable length long type
Array wrapper classes
These classes provide writable wrappers for arrays of other Writable objects. For example,
an instance of either could hold an array of IntWritable or DoubleWritable, but not
arrays of the raw int or float types. A specific subclass for the required Writable class will
be required. They are as follows:
ArrayWritable
TwoDArrayWritable
Map wrapper classes
These classes allow implementations of the java.util.Map interface to be used as keys
or values. Note that they are defined as Map<Writable, Writable> and effectively
manage a degree of internal runtime type checking. This does mean that compile-time type
checking is weakened, so be careful.
AbstractMapWritable: This is a base class for other concrete Writable
map implementations
MapWritable: This is a general purpose map mapping Writable keys to
Writable values
SortedMapWritable: This is a specialization of the MapWritable class that
also implements the SortedMap interface
Time for action – using the Writable wrapper classes
Let's write a class to show some of these wrapper classes in action:
1. Create the following as WritablesTest.java:
import org.apache.hadoop.io.* ;
import java.util.* ;
public class WritablesTest
{
    public static class IntArrayWritable extends ArrayWritable
    {
        public IntArrayWritable()
        {
            super(IntWritable.class) ;
        }
    }
    public static void main(String[] args)
    {
        System.out.println("*** Primitive Writables ***") ;
        BooleanWritable bool1 = new BooleanWritable(true) ;
        ByteWritable byte1 = new ByteWritable( (byte)3) ;
        System.out.printf("Boolean:%s Byte:%d\n", bool1, byte1.get()) ;
        IntWritable i1 = new IntWritable(5) ;
        IntWritable i2 = new IntWritable( 17) ;
        System.out.printf("I1:%d I2:%d\n", i1.get(), i2.get()) ;
        i1.set(i2.get()) ;
        System.out.printf("I1:%d I2:%d\n", i1.get(), i2.get()) ;
        Integer i3 = new Integer( 23) ;
        i1.set( i3) ;
        System.out.printf("I1:%d I2:%d\n", i1.get(), i2.get()) ;
        System.out.println("*** Array Writables ***") ;
        ArrayWritable a = new ArrayWritable( IntWritable.class) ;
        a.set( new IntWritable[]{ new IntWritable(1), new IntWritable(3),
            new IntWritable(5)}) ;
        IntWritable[] values = (IntWritable[])a.get() ;
        for (IntWritable i: values)
            System.out.println(i) ;
        IntArrayWritable ia = new IntArrayWritable() ;
        ia.set( new IntWritable[]{ new IntWritable(1), new IntWritable(3),
            new IntWritable(5)}) ;
        IntWritable[] ivalues = (IntWritable[])ia.get() ;
        ia.set(new LongWritable[]{new LongWritable(1000l)}) ;
        System.out.println("*** Map Writables ***") ;
        MapWritable m = new MapWritable() ;
        IntWritable key1 = new IntWritable(5) ;
        NullWritable value1 = NullWritable.get() ;
        m.put(key1, value1) ;
        System.out.println(m.containsKey(key1)) ;
        System.out.println(m.get(key1)) ;
        m.put(new LongWritable(1000000000), key1) ;
        Set<Writable> keys = m.keySet() ;
        for(Writable w: keys)
            System.out.println(w.getClass()) ;
    }
}
2. Compile and run the class, and you should get the following output:
*** Primitive Writables ***
Boolean:true Byte:3
I1:5 I2:17
I1:17 I2:17
I1:23 I2:17
*** Array Writables ***
1
3
5
*** Map Writables ***
true
(null)
class org.apache.hadoop.io.LongWritable
class org.apache.hadoop.io.IntWritable
What just happened?
This output should be largely self-explanatory. We create various Writable wrapper objects
and show their general usage. There are several key points:
As mentioned, there is no type-safety beyond Writable itself. So it is possible to
have an array or map that holds multiple types, as shown previously.
We can use auto-unboxing, for example, by supplying an Integer object to methods
on IntWritable that expect an int variable.
The inner class demonstrates what is needed if an ArrayWritable class is to be
used as an input to a reduce function; a subclass with such a default constructor
must be defined.
Other wrapper classes
CompressedWritable: This is a base class to allow for large objects that
should remain compressed until their attributes are explicitly accessed
ObjectWritable: This is a general-purpose generic object wrapper
NullWritable: This is a singleton object representation of a null value
VersionedWritable: This is a base implementation to allow writable classes
to track versions over time
Have a go hero – playing with Writables
Write a class that exercises the NullWritable and ObjectWritable classes in the same
way as the previous example does.
Making your own
As you have seen from the Writable and Comparable interfaces, the required methods
are pretty straightforward; don't be afraid of adding this functionality if you want to use your
own custom classes as keys or values within a MapReduce job.
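As an illustration, the following is a minimal sketch, not taken from the book's examples, of a hypothetical custom key type that pairs a state abbreviation with a year; it implements WritableComparable so that Hadoop can serialize, deserialize, and sort it.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a state abbreviation plus a year.
public class StateYearKey implements WritableComparable<StateYearKey>
{
    private String state = "";
    private int year;

    public StateYearKey() { }                  // required no-arg constructor

    public StateYearKey(String state, int year)
    {
        this.state = state;
        this.year = year;
    }

    public void write(DataOutput out) throws IOException
    {
        out.writeUTF(state);
        out.writeInt(year);
    }

    public void readFields(DataInput in) throws IOException
    {
        state = in.readUTF();
        year = in.readInt();
    }

    public int compareTo(StateYearKey other)
    {
        int cmp = state.compareTo(other.state);
        return (cmp != 0) ? cmp : (year - other.year);
    }

    @Override
    public boolean equals(Object o)
    {
        return (o instanceof StateYearKey) && compareTo((StateYearKey) o) == 0;
    }

    @Override
    public int hashCode()
    {
        return state.hashCode() * 31 + year;   // keeps default partitioning consistent
    }

    @Override
    public String toString()
    {
        return state + "\t" + year;
    }
}

The no-argument constructor matters because Hadoop creates key instances reflectively during deserialization. For heavily used keys you might also register a raw comparator for faster sorting, but that is an optimization, not a requirement.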
Input/output
There is one aspect of our driver classes that we have mentioned several times without
getting into a detailed explanation: the format and structure of the data input into and
output from MapReduce jobs.
Files, splits, and records
We have talked about files being broken into splits as part of the job startup and the data
in a split being sent to the mapper implementation. However, this overlooks two aspects:
how the data is stored in the file and how the individual keys and values are passed to the
mapper.
InputFormat and RecordReader
Hadoop has the concept of an InputFormat for the first of these responsibilities.
The InputFormat abstract class in the org.apache.hadoop.mapreduce
package provides two methods as shown in the following code:
public abstract class InputFormat<K, V>
{
    public abstract List<InputSplit> getSplits( JobContext context) ;
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context) ;
}
These methods display the two responsibilities of the InputFormat class:
To provide the details on how to split an input file into the splits required for
map processing
To create a RecordReader class that will generate the series of key/value
pairs from a split
The RecordReader class is also an abstract class within the org.apache.hadoop.
mapreduce package:
public abstract class RecordReader<Key, Value> implements Closeable
{
    public abstract void initialize(InputSplit split,
        TaskAttemptContext context) ;
    public abstract boolean nextKeyValue()
        throws IOException, InterruptedException ;
    public abstract Key getCurrentKey()
        throws IOException, InterruptedException ;
    public abstract Value getCurrentValue()
        throws IOException, InterruptedException ;
    public abstract float getProgress()
        throws IOException, InterruptedException ;
    public abstract void close() throws IOException ;
}
A RecordReader instance is created for each split and its nextKeyValue method returns a
Boolean indicating if another key/value pair is available; if so, the getCurrentKey and
getCurrentValue methods are used to access the key and value respectively.
The combination of the InputFormat and RecordReader classes is therefore all
that is required to bridge between any kind of input data and the key/value pairs
required by MapReduce.
Hadoop-provided InputFormat
There are some Hadoop-provided InputFormat implementations within the org.apache.
hadoop.mapreduce.lib.input package:
FileInputFormat: This is an abstract base class that can be the parent of any
file-based input
SequenceFileInputFormat: This is an efficient binary file format that will be
discussed in an upcoming section
TextInputFormat: This is used for plain text files
The pre-0.20 API has additional InputFormats defined in the org.
apache.hadoop.mapred package.
Note that InputFormats are not restricted to reading from files;
FileInputFormat is itself a subclass of InputFormat. It is possible
to have Hadoop use data that is not based on files as the input to
MapReduce jobs; common sources are relational databases or HBase.
Hadoop-provided RecordReader
Similarly, Hadoop provides a few common RecordReader implementations, which are also
present within the org.apache.hadoop.mapreduce.lib.input package:
LineRecordReader: This implementation is the default RecordReader class for
text files that presents the byte offset of the line as the key and the line contents as the value
SequenceFileRecordReader: This implementation reads the key/value from the
binary SequenceFile container
Again, the pre-0.20 API has additional RecordReader classes in the org.apache.hadoop.
mapred package, such as KeyValueLineRecordReader, that have not yet been ported to the
new API.
OutputFormat and RecordWriter
There is a similar pattern for writing the output of a job, coordinated by subclasses of
OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce
package. We'll not explore these in any detail here, but the general approach is similar,
though OutputFormat does have a more involved API as it has methods for tasks such
as validation of the output specification.
It is this step that causes a job to fail if a specified output directory already
exists. If you wanted different behavior, it would require a subclass of
OutputFormat that overrides this method.
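As a hedged illustration of that last point, the following sketch subclasses TextOutputFormat and overrides checkOutputSpecs so that an existing output directory is tolerated rather than rejected. The class name is hypothetical and this is shown only to make the mechanism concrete; silently reusing output directories is rarely a good idea in production.

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Sketch: an OutputFormat that skips the "output directory must not exist" check.
public class LenientTextOutputFormat<K, V> extends TextOutputFormat<K, V>
{
    @Override
    public void checkOutputSpecs(JobContext context) throws IOException
    {
        // Deliberately do nothing: the parent implementation verifies that an
        // output path is configured and throws an exception if that path
        // already exists. Skipping the check means existing part files may be
        // overwritten or mixed with new output, so use with care.
    }
}

A driver would then select it with job.setOutputFormatClass(LenientTextOutputFormat.class); treat this purely as an illustration of where the validation hook lives.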
Hadoop-provided OutputFormat
The following OutputFormats are provided in the org.apache.hadoop.mapreduce.
lib.output package:
FileOutputFormat: This is the base class for all file-based OutputFormats
NullOutputFormat: This is a dummy implementation that discards the output and
writes nothing to file
SequenceFileOutputFormat: This writes to the binary SequenceFile format
TextOutputFormat: This writes a plain text file
Note that these classes define their required RecordWriter implementations as inner
classes, so there are no separately provided RecordWriter implementations.
Don't forget Sequence files
The SequenceFile class within the org.apache.hadoop.io package provides an
efficient binary file format that is often useful as an output from a MapReduce job. This
is especially true if the output from the job is processed as the input of another job. The
Sequence files have several advantages, as follows:
As binary files, they are intrinsically more compact than text files
They additionally support optional compression, which can also be applied at
different levels, that is, compress each record or an entire split
The file can be split and processed in parallel
This last characteristic is important, as most binary formats (particularly those that are
compressed or encrypted) cannot be split and must be read as a single linear stream of
data. Using such files as input to a MapReduce job means that a single mapper will be used
to process the entire file, causing a potentially large performance hit. In such a situation, it
is preferable to either use a splittable format such as SequenceFile or, if you cannot avoid
receiving the file in the other format, do a preprocessing step that converts it into a splittable
format. This will be a trade-off, as the conversion will take time; but in many cases, especially
with complex map tasks, this will be outweighed by the time saved.
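To make this concrete, the following is a minimal driver sketch, using the context-object API and the identity mapper and reducer, that copies text input into a block-compressed SequenceFile. The class name and command-line argument handling are illustrative assumptions rather than code from the book.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Sketch: convert text input into a compressed SequenceFile for later jobs.
public class SequenceFileConversionDriver
{
    public static void main(String[] args) throws Exception
    {
        Job job = new Job(new Configuration(), "to-sequencefile");
        job.setJarByClass(SequenceFileConversionDriver.class);

        // Identity mapper and reducer; only the storage format changes.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressionType(
            job, SequenceFile.CompressionType.BLOCK);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the output is now a SequenceFile, a follow-on job can read it with SequenceFileInputFormat and still have it split across multiple mappers even though it is compressed.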
Summary
We have covered a lot of ground in this chapter and we now have the foundation to explore
MapReduce in more detail. Specifically, we learned how key/value pairs are a broadly applicable
data model that is well suited to MapReduce processing. We also learned how to write mapper
and reducer implementations using the 0.20 and above versions of the Java API.
We then moved on and saw how a MapReduce job is processed and how the map
and reduce methods are tied together by significant coordination and task-scheduling
machinery. We also saw how certain MapReduce jobs require specialization in the form
of a custom partitioner or combiner.
We also learned how Hadoop reads data to and from the filesystem. It uses the concept of
InputFormat and OutputFormat to handle the file as a whole, and RecordReader and
RecordWriter to translate the format to and from key/value pairs.
With this knowledge, we will now move on to a case study in the next chapter, which
demonstrates the ongoing development and enhancement of a MapReduce application
that processes a large data set.
4
Developing MapReduce Programs
Now that we have explored the technology of MapReduce, we will spend
this chapter looking at how to put it to use. In particular, we will take a more
substantial dataset and look at ways to approach its analysis by using the tools
provided by MapReduce.
In this chapter we will cover the following topics:
Hadoop Streaming and its uses
The UFO sighting dataset
Using Streaming as a development/debugging tool
Using multiple mappers in a single job
Efficiently sharing utility files and data across the cluster
Reporting job and task status and log information useful for debugging
Throughout this chapter, the goal is to introduce both concrete tools and ideas about how
to approach the analysis of a new data set. We shall start by looking at how to use scripting
programming languages to aid MapReduce prototyping and initial analysis. Though it
may seem strange to learn the Java API in the previous chapter and immediately move to
different languages, our goal here is to provide you with an awareness of different ways to
approach the problems you face. Just as many jobs make little sense being implemented
in anything but the Java API, there are other situations where another approach is
best suited. Consider these techniques as new additions to your tool belt and with that
experience you will know more easily which is the best fit for a given scenario.
Using languages other than Java with Hadoop
We have mentioned previously that MapReduce programs don't have to be written in Java.
Most programs are written in Java, but there are several reasons why you may want or need
to write your map and reduce tasks in another language. Perhaps you have existing code to
leverage or need to use third-party binaries; the reasons are varied and valid.
Hadoop provides a number of mechanisms to aid non-Java development. Primary amongst
these are Hadoop Pipes, which provides a native C++ interface to Hadoop, and Hadoop
Streaming, which allows any program that uses standard input and output to be used
for map and reduce tasks. We will use Hadoop Streaming heavily in this chapter.
How Hadoop Streaming works
With the MapReduce Java API, both map and reduce tasks provide implementations for
methods that contain the task functionality. These methods receive the input to the task as
method arguments and then output results via the Context object. This is a clear and type-
safe interface but is by definition Java specific.
Hadoop Streaming takes a different approach. With Streaming, you write a map task that
reads its input from standard input, one line at a time, and gives the output of its results to
standard output. The reduce task then does the same, again using only standard input and
output for its data flow.
Any program that reads and writes from standard input and output can be used in
Streaming, such as compiled binaries, Unix shell scripts, or programs written in a
dynamic language such as Ruby or Python.
Why to use Hadoop Streaming
The biggest advantage of Streaming is that it can allow you to try ideas and iterate on them
more quickly than using Java. Instead of a compile/jar/submit cycle, you just write the scripts
and pass them as arguments to the Streaming jar file. Especially when doing initial analysis
on a new dataset or trying out new ideas, this can significantly speed up development.
The classic debate regarding dynamic versus static languages balances the benefits of swift
development against runtime performance and type checking. These dynamic downsides also
apply when using Streaming. Consequently, we favor the use of Streaming for up-front analysis
and Java for the implementation of jobs that will be executed on the production cluster.
We will use Ruby for Streaming examples in this chapter, but that is a personal preference.
If you prefer shell scripting or another language, such as Python, then take the opportunity
to convert the scripts used here into the language of your choice.
Time for action – implementing WordCount using Streaming
Let's flog the dead horse of WordCount one more time and implement it using Streaming
by performing the following steps:
1. Save the following file to wcmapper.rb:
#!/usr/bin/env ruby
while line = gets
  words = line.split
  words.each{ |word| puts word.strip+"\t1"}
end
2. Make the file executable by executing the following command:
$ chmod +x wcmapper.rb
3. Save the following file to wcreducer.rb:
#!/usr/bin/env ruby
current = nil
count = 0
while line = gets
  word, counter = line.split("\t")
  if word == current
    count = count+1
  else
    puts current+"\t"+count.to_s if current
    current = word
    count = 1
  end
end
puts current+"\t"+count.to_s
4. Make the file executable by executing the following command:
$ chmod +x wcreducer.rb
5. Execute the scripts as a Streaming job using the datafile from the previous chapter:
$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar
-file wcmapper.rb -mapper wcmapper.rb -file wcreducer.rb
-reducer wcreducer.rb -input test.txt -output output
packageJobJar: [wcmapper.rb, wcreducer.rb, /tmp/hadoop-
hadoop/hadoop-unjar1531650352198893161/] [] /tmp/
streamjob937274081293220534.jar tmpDir=null
12/02/05 12:43:53 INFO mapred.FileInputFormat: Total input paths
to process : 1
12/02/05 12:43:53 INFO streaming.StreamJob: getLocalDirs(): [/var/
hadoop/mapred/local]
12/02/05 12:43:53 INFO streaming.StreamJob: Running job:
job_201202051234_0005
12/02/05 12:44:01 INFO streaming.StreamJob: map 100% reduce 0%
12/02/05 12:44:13 INFO streaming.StreamJob: map 100% reduce 100%
12/02/05 12:44:16 INFO streaming.StreamJob: Job complete:
job_201202051234_0005
12/02/05 12:44:16 INFO streaming.StreamJob: Output: output
6. Check the result le:
$ hadoop fs -cat output/part-00000
What just happened?
Ignore the specifics of Ruby. If you don't know the language, it isn't important here.
Firstly, we created the script that will be our mapper. It uses the gets function to read a line
from standard input, splits this into words, and uses the puts function to write the word and
the value 1 to standard output. We then made the file executable.
Our reducer is a little more complex for reasons we will describe in the next section.
However, it performs the job we would expect: it reads from standard input, counts the
number of occurrences for each word, and gives the final value for each word to standard
output. Again we made sure to make the file executable.
Note that in both cases we are implicitly using the Hadoop input and output formats discussed
in the previous chapter. It is the TextInputFormat class that processes the source file
and provides each line one at a time to the map script. Conversely, the TextOutputFormat
class will ensure that the output of reduce tasks is also
correctly written as textual data. We can of course modify these if required.
Next, we submitted the Streaming job to Hadoop via the rather cumbersome command line
shown in the previous section. The reason for each file to be specified twice is that any file
not available on each node must be packaged up by Hadoop and shipped across the cluster,
which requires it to be specified by the -file option. Then, we also need to tell Hadoop
which script performs the mapper and reducer roles.
Finally, we looked at the output of the job, which should be identical to the previous
Java-based WordCount implementations.
Differences in jobs when using Streaming
The Streaming WordCount mapper looks a lot simpler than the Java version, but the reducer
appears to have more logic. Why? The reason is that the implied contract between Hadoop
and our tasks changes when we use Streaming.
In Java we knew that our map() method would be invoked once for each input key/value
pair and our reduce() method would be invoked for each key and its set of values.
With Streaming we don't have the concept of the map or reduce methods anymore; instead
we have written scripts that process streams of received data. This changes how we need to
write our reducer. In Java the grouping of values to each key was performed by Hadoop; each
invocation of the reduce method would receive a single key and all its values. In Streaming,
each instance of the reduce task is given the individual ungathered values one at a time.
Hadoop Streaming does sort the keys; for example, if a mapper emitted the following data:
First 1
Word 1
Word 1
A 1
First 1
The Streaming reducer would receive this data in the following order:
A 1
First 1
First 1
Word 1
Word 1
Hadoop still collects the values for each key and ensures that each key is passed only to a
single reducer. In other words, a reducer gets all the values for a number of keys and they are
grouped together; however, they are not packaged into individual executions of the reducer,
that is, one per key, as with the Java API.
This should explain the mechanism used in the Ruby reducer; it first sets empty default
values for the current word; then after reading each line it determines if this is another value
for the current key and, if so, increments the count. If not, then there will be no more values
for the previous key, so its final output is sent to standard output and the counting begins
again for the new word.
After reading so much in the earlier chapters about how much Hadoop does for us, this
may seem a lot more complex, but after you write a few Streaming reducers it's
actually not as bad as it may first appear. Also remember that Hadoop does still manage
the assignment of splits to individual map tasks and the necessary coordination that sends
the values for a given key to the same reducer. This behavior can be modified through
configuration settings to change the number of mappers and reducers, just as with the
Java API.
Analyzing a large dataset
Armed with our ability to write MapReduce jobs in both Java and Streaming, we'll now
explore a more significant dataset than any we've looked at before. In the following section,
we will attempt to show how to approach such analysis and the sorts of questions Hadoop
allows you to ask of a large dataset.
Getting the UFO sighting dataset
We will use a public domain dataset of over 60,000 UFO sightings. This is hosted by
InfoChimps at http://www.infochimps.com/datasets/60000-documented-ufo-
sightings-with-text-descriptions-and-metada.
You will need to register for a free InfoChimps account to download a copy of the data.
The data comprises a series of UFO sighting records with the following fields:
1. Sighting date: This field gives the date when the UFO sighting occurred.
2. Recorded date: This field gives the date when the sighting was reported, often
different from the sighting date.
3. Location: This field gives the location where the sighting occurred.
4. Shape: This field gives a brief summary of the shape of the UFO, for example,
diamond, lights, cylinder.
5. Duration: This field gives the duration of how long the sighting lasted.
6. Description: This field gives free text details of the sighting.
Once downloaded, you will find the data in a few formats. We will be using the .tsv (tab-
separated value) version.
Getting a feel for the dataset
When faced with a new dataset it is often difficult to get a feel for the nature, breadth, and
quality of the data involved. There are several questions whose answers will affect
how you approach the follow-on analysis, in particular:
How big is the dataset?
How complete are the records?
How well do the records match the expected format?
The first is a simple question of scale: are we talking hundreds, thousands, millions, or more
records? The second question asks how complete the records are. If you expect each record
to have 10 fields (if this is structured or semi-structured data), how many have key fields
populated with data? The last question expands on this point: how well do the records
match your expectations of format and representation?
Time for action – summarizing the UFO data
Now that we have the data, let's get an initial summarization of its size and how many records
may be incomplete:
1. With the UFO tab-separated value (TSV) file on HDFS saved as ufo.tsv, save the
following file to summarymapper.rb:
#!/usr/bin/env ruby
while line = gets
puts "total\t1"
parts = line.split("\t")
puts "badline\t1" if parts.size != 6
puts "sighted\t1" if !parts[0].empty?
puts "recorded\t1" if !parts[1].empty?
puts "location\t1" if !parts[2].empty?
puts "shape\t1" if !parts[3].empty?
puts "duration\t1" if !parts[4].empty?
puts "description\t1" if !parts[5].empty?
end
2. Make the file executable by executing the following command:
$ chmod +x summarymapper.rb
3. Execute the job as follows by using Streaming:
$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar
-file summarymapper.rb -mapper summarymapper.rb -file wcreducer.rb
-reducer wcreducer.rb -input ufo.tsv -output ufosummary
4. Retrieve the summary data:
$ hadoop fs -cat ufosummary/part-00000
What just happened?
Remember that our UFO sightings should have six fields as described previously.
They are listed as follows:
The date of the sighting
The date the sighting was reported
The location of the sighting
The shape of the object
The duration of the sighting
A free text description of the event
The mapper examines the file and counts the total number of records in addition to
identifying potentially incomplete records.
We produce the overall count by simply recording how many distinct records are
encountered while processing the file. We identify potentially incomplete records
by flagging those that either do not contain exactly six fields or have at least one
field that has a null value.
Therefore, the implementation of the mapper reads each line and does three things
as it proceeds through the file:
It gives the output of a token to be incremented in the total number of
records processed
It splits the record on tab boundaries and records any occurrence of lines which
do not result in six field values
For each of the six expected fields it reports when the values present are other than
an empty string, that is, there is data in the field, though this doesn't actually say
anything about the quality of that data
We wrote this mapper intentionally to produce output of the form (token, count).
Doing this allowed us to use our existing WordCount reducer from our earlier implementations
as the reducer for this job. There are certainly more efficient implementations, but as this job is
unlikely to be frequently executed, the convenience is worth it.
At the time of writing, the result of this job was as follows:
badline     324
description 61372
duration    58961
location    61377
recorded    61377
shape       58855
sighted     61377
total       61377
We see from these figures that we have 61,377 records. All of these provide values for the
sighted date, reported date, and location fields. Around 58,000-59,000 records have values
for shape and duration, and almost all have a description.
When split on tab characters, 324 lines were found to not have exactly six fields.
However, since only five records had no value for description, this suggests that the bad
records typically have too many tabs as opposed to too few. We could of course alter our
mapper to gather detailed information on this fact. This is likely due to tabs being used in
the free text description, so for now we will do our analysis expecting most records to have
correctly placed values for all six fields, but not make any assumptions regarding further
tabs in each record.
Examining UFO shapes
Out of all the fields in these reports, it was shape that immediately interested us most,
as it could offer some interesting ways of grouping the data depending on what sort of
information we have in that field.
Time for action – summarizing the shape data
Just as we provided a summarization for the overall UFO data set earlier, let's now do a more
focused summarization on the data provided for UFO shapes:
1. Save the following to shapemapper.rb:
#!/usr/bin/env ruby
while line = gets
parts = line.split("\t")
if parts.size == 6
shape = parts[3].strip
puts shape+"\t1" if !shape.empty?
end
end
2. Make the file executable:
$ chmod +x shapemapper.rb
3. Execute the job once again using the WordCount reducer:
$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar
-file shapemapper.rb -mapper shapemapper.rb -file wcreducer.rb
-reducer wcreducer.rb -input ufo.tsv -output shapes
4. Retrieve the shape info:
$ hadoop fs -cat shapes/part-00000
What just happened?
Our mapper here is pretty simple. It breaks each record into its constituent fields,
discards any without exactly six fields, and gives a counter as the output for any
non-empty shape value.
For our purposes here, we are happy to ignore any records that don't precisely match the
specification we expect. Perhaps one record is the single UFO sighting that will prove it once
and for all, but even so it wouldn't likely make much difference to our analysis. Think about
the potential value of individual records before deciding to so easily discard some. If you
are working primarily on large aggregations where you care mostly about trends, individual
records likely don't matter. But in cases where single individual values could materially
affect the analysis or must be accounted for, an approach of trying to parse and recover
more conservatively rather than discard may be best. We'll talk more about this trade-off
in Chapter 6, When Things Break.
After the usual routine of making the mapper executable and running the job, we produced
data showing that 29 different UFO shapes were reported. Here's some sample output tabulated
in compact form for space reasons:
changed 1        changing 1533    chevron 758      cigar 1774
circle 5250      cone 265         crescent 2       cross 177
cylinder 981     delta 8          diamond 909      disk 4798
dome 1           egg 661          fireball 3437    flare 1
flash 988        formation 1775   hexagon 1        light 12140
other 4574       oval 2859        pyramid 1        rectangle 957
round 2          sphere 3614      teardrop 592     triangle 6036
unknown 4459
As we can see, there is a wide variance in sighting frequency. Some, such as pyramid, occur
only once, while light comprises more than a fifth of all reported shapes. Considering many
UFO sightings are at night, it could be argued that a description of light is not terribly useful
or specific, and when combined with the values for other and unknown we see that around
21,000 of our 58,000 reported shapes may not actually be of any use. Since we are not about
to run out and do additional research, this doesn't matter very much, but what's important
is to start thinking of your data in these terms. Even these types of summary analysis can
start giving an insight into the nature of the data and indicate what quality of analysis may be
possible. In the case of reported shapes, for example, we have already discovered that out of
our 61,000 sightings only 58,000 reported the shape and of these 21,000 are of dubious value.
We have already determined that our 61,000-record sample set only provides around 37,000
shape reports that we may be able to work with. If your analysis is predicated on a minimum
number of samples, always be sure to do this sort of summarization up-front to determine if
the data set will actually meet your needs.
Time for action – correlating sighting duration to UFO shape
Let's do a little more detailed analysis in regards to this shape data. We wondered if there
was any correlation between the duration of a sighting and the reported shape. Perhaps
cigar-shaped UFOs hang around longer than the rest, or formations always appear for
the exact same amount of time.
1. Save the following to shapetimemapper.rb:
#!/usr/bin/env ruby
pattern = Regexp.new /\d* ?((min)|(sec))/
while line = gets
  parts = line.split("\t")
  if parts.size == 6
    shape = parts[3].strip
    duration = parts[4].strip.downcase
    if !shape.empty? && !duration.empty?
      match = pattern.match(duration)
      if match
        time = /\d*/.match(match[0])[0]
        unit = match[1]
        time = Integer(time)
        time = time * 60 if unit == "min"
        puts shape+"\t"+time.to_s
      end
    end
  end
end
2. Make the file executable by executing the following command:
$ chmod +x shapetimemapper.rb
3. Save the following to shapetimereducer.rb:
#!/usr/bin/env ruby
current = nil
min = 0
max = 0
mean = 0
total = 0
count = 0
while line = gets
  word, time = line.split("\t")
  time = Integer(time)
  if word == current
    count = count+1
    total = total+time
    min = time if time < min
    max = time if time > max
  else
    puts current+"\t"+min.to_s+" "+max.to_s+" "+(total/count).to_s if current
    current = word
    count = 1
    total = time
    min = time
    max = time
  end
end
puts current+"\t"+min.to_s+" "+max.to_s+" "+(total/count).to_s
4. Make the file executable by executing the following command:
$ chmod +x shapetimereducer.rb
5. Run the job:
$ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar
-file shapetimemapper.rb -mapper shapetimemapper.rb -file
shapetimereducer.rb -reducer shapetimereducer.rb -input ufo.tsv
-output shapetime
6. Retrieve the results:
$ hadoop fs -cat shapetime/part-00000
What just happened?
Our mapper here is a little more involved than previous examples due to the nature of the
duration field. Taking a quick look at some sample records, we found values as follows:
15 seconds
2 minutes
2 min
2minutes
5-10 seconds
In other words, there was a mixture of range and absolute values, different formatting, and
inconsistent terms for time units. Again for simplicity we decided on a limited interpretation
of the data; we will take the absolute value if present, and the upper part of a range if not.
We would assume that the strings min or sec would be present for the time units and
would convert all timings into seconds. With some regular expression magic, we unpack the
duration field into these parts and do the conversion. Note again that we simply discard
any record that does not work as we expect, which may not always be appropriate.
The reducer follows the same pattern as our earlier example, starting with a default key
and reading values until a new one is encountered. In this case, we want to capture the
minimum, maximum, and mean for each shape, so we use numerous variables to track the
needed data.
Remember that Streaming reducers need to handle a series of values grouped into their
associated keys and must identify when a new line has a changed key, and hence indicates that
the last value for the previous key has been processed. In contrast, a Java reducer would
be simpler as it only deals with the values for a single key in each execution.
After making both files executable, we run the job and get the following results, where we
removed any shape with less than 10 sightings and again made the output more compact
for space reasons. The numbers for each shape are the minimum value, the maximum value,
and the mean respectively:
changing 0 5400 670     chevron 0 3600 333
cigar 0 5400 370        circle 0 7200 423
cone 0 4500 498         cross 2 3600 460
cylinder 0 5760 380     diamond 0 7800 519
disk 0 5400 449         egg 0 5400 383
fireball 0 5400 236     flash 0 7200 303
formation 0 5400 434    light 0 9000 462
other 0 5400 418        oval 0 5400 405
rectangle 0 4200 352    sphere 0 14400 396
teardrop 0 2700 335     triangle 0 18000 375
unknown 0 6000 470
It is surprising to see the relatively narrow variance in the mean sighting duration across all
shape types; most have a mean value of between roughly 300 and 500 seconds. Interestingly,
we also see that the shortest mean duration is for fireballs and the longest for changing
objects, both of which make some degree of intuitive sense. A fireball by definition wouldn't
be a long-lasting phenomenon, and a changing object would need a lengthy duration for its
changes to be noticed.
Using Streaming scripts outside Hadoop
This last example, with its more involved mapper and reducer, is a good illustration of how
Streaming can help MapReduce development in another way; you can execute the scripts
outside of Hadoop.
It's generally good practice during MapReduce development to have a sample of the
production data against which to test your code. But when this is on HDFS and you are
writing Java map and reduce tasks, it can be difficult to debug problems or refine complex
logic. With map and reduce tasks that read input from the command line, you can directly
run them against some data to get quick feedback on the result. If you have a development
environment that provides Hadoop integration or are using Hadoop in standalone mode, the
problems are minimized; just remember that Streaming does give you this ability to try the
scripts outside of Hadoop; it may be useful some day.
While developing these scripts the author noticed that the last set of records in his UFO
datafile had data in a better structured manner than those at the start of the file. Therefore,
to do a quick test on the mapper all that was required was:
$ tail ufo.tsv | ./shapetimemapper.rb
This principle can be applied to the full workflow to exercise both the map and reduce scripts.
Time for action – performing the shape/time analysis from the command line
It may not be immediately obvious how to do this sort of local command-line analysis,
so let's look at an example.
With the UFO datafile on the local filesystem, execute the following command:
$ cat ufo.tsv | ./shapetimemapper.rb | sort | ./shapetimereducer.rb
What just happened?
With a single Unix command line, we produced output identical to our previous full
MapReduce job. If you look at what the command line does, this makes sense.
Firstly, the input file is sent, a line at a time, to the mapper. The output of this is passed
through the Unix sort utility and this sorted output is passed, a line at a time, to the reducer.
This is of course a very simplified representation of our general MapReduce job workflow.
Then the obvious question is why we should bother with Hadoop if we can do equivalent
analysis at the command line. The answer of course is our old friend, scale. This simple
approach works fine for a file such as the UFO sightings, which, though non-trivial, is only
71 MB in size. To put this into context, we could hold thousands of copies of this dataset
on a single modern disk drive.
So what if the dataset was 71 GB in size instead, or even 71 TB? In the latter case, at the very
least, we would have to spread the data across multiple hosts, and then decide how to split the
data, combine partial answers, and deal with the inevitable failures along the way. In other
words, we would need something like Hadoop.
However, don't discount the use of command-line tools like this; such approaches should
be well used during MapReduce development.
Java shape and location analysis
Let's return to the Java MapReduce API and consider some analysis of the shape and location
data within the reports.
However, before we start writing code, let's think about how we've been approaching the
per-field analysis of this dataset. The previous mappers have had a common pattern:
Discard records determined to be corrupt
Process valid records to extract the field of interest
Output a representation of the data we care about for the record
Now if we were to write Java mappers to analyze location and then perhaps the sighting
and reported time columns, we would follow a similar pattern. So can we avoid any of the
consequent code duplication?
The answer is yes, through the use of org.apache.hadoop.mapred.lib.ChainMapper.
This class provides a means by which multiple mappers are executed in sequence and it is
the output of the final mapper that is passed to the reducer. ChainMapper is applicable not
just for this type of data clean-up; it is not an uncommon pattern to perform multiple
map-type tasks before applying a reducer.
An example of this approach would be to write a validation mapper that could be used by all
future field analysis jobs. This mapper would discard lines deemed corrupt, passing only valid
lines to the actual business logic mapper that can now be focused on analyzing data instead
of worrying about coarse-level validation.
An alternative approach here would be to do the validation within a custom InputFormat
class that discards non-valid records; which approach makes the most sense will depend on
your particular situation.
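The book does not show such a class, but as a hedged sketch of that alternative the following hypothetical ValidUFOInputFormat (written against the context-object org.apache.hadoop.mapreduce API, unlike the old-API ChainMapper example that follows) wraps the standard LineRecordReader and silently skips any line that does not split into exactly six tab-separated fields:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: an InputFormat whose RecordReader drops malformed UFO records
// before they ever reach the mapper.
public class ValidUFOInputFormat extends TextInputFormat
{
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
        InputSplit split, TaskAttemptContext context)
    {
        return new ValidatingRecordReader();
    }

    public static class ValidatingRecordReader
        extends RecordReader<LongWritable, Text>
    {
        private final LineRecordReader reader = new LineRecordReader();

        public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException
        {
            reader.initialize(split, context);
        }

        public boolean nextKeyValue() throws IOException, InterruptedException
        {
            // Advance until a valid record is found or the split is exhausted.
            while (reader.nextKeyValue())
            {
                if (reader.getCurrentValue().toString().split("\t").length == 6)
                    return true;
            }
            return false;
        }

        public LongWritable getCurrentKey()
            throws IOException, InterruptedException
        {
            return reader.getCurrentKey();
        }

        public Text getCurrentValue()
            throws IOException, InterruptedException
        {
            return reader.getCurrentValue();
        }

        public float getProgress() throws IOException, InterruptedException
        {
            return reader.getProgress();
        }

        public void close() throws IOException { reader.close(); }
    }
}

A job using it would call job.setInputFormatClass(ValidUFOInputFormat.class) and its mapper could then assume well-formed records; the trade-off is that the validation logic is less visible than an explicit validation mapper in the chain.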
Each mapper in the chain is executed within a single JVM so there is no need to worry about
the use of multiple mappers increasing our filesystem I/O load.
Time for action – using ChainMapper for field validation/analysis
Let's use this principle and employ the ChainMapper class to help us provide some record
validation within our job:
1. Create the following class as UFORecordValidationMapper.java:
import java.io.IOException;
import org.apache.hadoop.io.* ;
import org.apache.hadoop.mapred.* ;
import org.apache.hadoop.mapred.lib.* ;
public class UFORecordValidationMapper extends MapReduceBase
implements Mapper<LongWritable, Text, LongWritable, Text>
{
public void map(LongWritable key, Text value,
OutputCollector<LongWritable, Text> output,
Reporter reporter) throws IOException
{
String line = value.toString();
if (validate(line))
output.collect(key, value);
}
private boolean validate(String str)
{
String[] parts = str.split("\t") ;
if (parts.length != 6)
return false ;
return true ;
}
}
2. Create the following as UFOLocation.java:
import java.io.IOException;
import java.util.Iterator ;
import java.util.regex.* ;
import org.apache.hadoop.conf.* ;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.* ;
import org.apache.hadoop.mapred.* ;
import org.apache.hadoop.mapred.lib.* ;
public class UFOLocation
{
public static class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, LongWritable>
{
private final static LongWritable one = new LongWritable(1);
private static Pattern locationPattern = Pattern.compile(
"[a-zA-Z]{2}[^a-zA-Z]*$") ;
public void map(LongWritable key, Text value,
OutputCollector<Text, LongWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString();
String[] fields = line.split("\t") ;
String location = fields[2].trim() ;
if (location.length() >= 2)
{
Matcher matcher = locationPattern.matcher(location) ;
if (matcher.find() )
{
int start = matcher.start() ;
String state = location.substring(start,start+2);
output.collect(new Text(state.toUpperCase()), one);
}
}
}
}
public static void main(String[] args) throws Exception
{
Configuration config = new Configuration() ;
JobConf conf = new JobConf(config, UFOLocation.class);
conf.setJobName("UFOLocation");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(LongWritable.class);
JobConf mapconf1 = new JobConf(false) ;
ChainMapper.addMapper( conf, UFORecordValidationMapper.class,
LongWritable.class, Text.class, LongWritable.class,
Text.class, true, mapconf1) ;
JobConf mapconf2 = new JobConf(false) ;
ChainMapper.addMapper( conf, MapClass.class,
LongWritable.class, Text.class,
Text.class, LongWritable.class, true, mapconf2) ;
conf.setMapperClass(ChainMapper.class);
conf.setCombinerClass(LongSumReducer.class);
conf.setReducerClass(LongSumReducer.class);
FileInputFormat.setInputPaths(conf,args[0]) ;
FileOutputFormat.setOutputPath(conf, new Path(args[1])) ;
JobClient.runJob(conf);
}
}
3. Compile both files:
$ javac UFORecordValidationMapper.java UFOLocation.java
4. Jar up the class files and submit the job to Hadoop:
$ hadoop jar ufo.jar UFOLocation ufo.tsv output
5. Copy the output file to the local filesystem and examine it:
$ hadoop fs -get output/part-00000 locations.txt
$ more locations.txt
What just happened?
There's quite a bit happening here, so let's look at it one piece at a time.
The first mapper is our simple validation mapper. The class follows the same interface as
the standard MapReduce API and the map method simply returns the result of a utility
validation method. We split this out into a separate method to highlight the functionality of
the mapper, but the checks could easily have been within the main map method itself. For
simplicity, we keep to our previous validation strategy of looking for the correct number of
fields and discarding lines that don't break into exactly six tab-delimited fields.
Note that the ChainMapper class has unfortunately been one of the last components to be
migrated to the context object API and as of Hadoop 1.0, it can only be used with the older
API. It remains a valid concept and useful tool but until Hadoop 2.0, where it will finally be
migrated into the org.apache.hadoop.mapreduce.lib.chain package, its current
use requires the older approach.
The other file contains another mapper implementation and an updated driver in the main
method. The mapper looks for a two-letter sequence at the end of the location field in a
UFO sighting report. From some manual examination of the data, it is obvious that most
location fields are of the form city, state, where the standard two-character abbreviation is
used for the state.
Some records, however, add trailing parentheses, periods, or other punctuation. Some others
are simply not in this format. For our purposes, we are happy to discard those records and
focus on those that have the trailing two-character state abbreviation we are looking for.
The map method extracts this from the location field using another regular expression and
gives the output as the capitalized form of the abbreviation along with a simple count.
The driver for the job has the most changes as the previous configuration involving a single
map class is replaced with multiple calls on the ChainMapper class.
The general model is to create a new configuration object for each mapper, then add the
mapper to the ChainMapper class along with a specification of its input and output,
and a reference to the overall job configuration object.
Notice that the two mappers have different signatures. Both input a key of type
LongWritable and a value of type Text, which are also the output types of
UFORecordValidationMapper. The MapClass mapper in UFOLocation, however, outputs the
reverse, with a key of type Text and a value of type LongWritable.
The important thing here is to match the output of the final mapper in the chain (MapClass)
with the input expected by the reduce class (LongSumReducer).
When using the ChainMapper class, the mappers in the chain can have different input and
output types as long as the following are true:
For all but the final mapper, each map output matches the input of the subsequent
mapper in the chain
For the final mapper, its output matches the input of the reducer
We compile these classes and put them in the same jar file. This is the first time we have
bundled the output from more than one Java source file together. As may be expected,
there is no magic here; the usual rules on jar files, paths, and class names apply. Because in
this case we have both our classes in the same package, we don't have to worry about an
additional import in the driver class file.
Have a go hero
Use the Java API and the previous ChainMapper example to reimplement the mappers
previously written in Ruby that produce the shape frequency and duration reports.
Too many abbreviations
The following are the first few entries from our result file of the previous job:
AB 286
AD 6
AE 7
AI 6
AK 234
AL 548
AM 22
AN 161
The file had 186 different two-character entries. Plainly, our approach of extracting the final
two-character sequence from the location field was not sufficiently robust.
We have a number of issues with the data which become apparent after a manual analysis
of the source file:
There is inconsistency in the capitalization of the state abbreviations
A non-trivial number of sightings are from outside the U.S. and though they
may follow a similar (city, area) pattern, the abbreviation is not one of
the 50 we'd expect
Some fields simply don't follow the pattern at all, yet would still be captured
by our regular expression
We need to filter these results, ideally by normalizing the U.S. records into correct state
output and by gathering everything else into a broader category.
To perform this task we need to add to the mapper some notion of what the valid U.S. state
abbreviations are. We could of course hardcode this into the mapper but that does not seem
right. Although we are for now going to treat all non-U.S. sightings as a single category, we
may wish to extend that over time and perhaps do a breakdown by country. If we hardcode
the abbreviations, we would need to recompile our mapper each time.
Using the Distributed Cache
Hadoop gives us an alternative mechanism to achieve the goal of sharing reference data
across all tasks in the job: the Distributed Cache. This can be used to efficiently make
available common read-only files that are used by the map or reduce tasks to all nodes.
The files can be text data as in this case but could also be additional jars, binary data, or
archives; anything is possible.
The files to be distributed are placed on HDFS and added to the DistributedCache within
the job driver. Hadoop copies the files onto the local filesystem of each node prior to job
execution, meaning every task has local access to the files.
An alternative is to bundle needed files into the job jar submitted to Hadoop. This does tie
the data to the job jar, making it more difficult to share across jobs and requiring the jar to
be rebuilt if the data changes.
Time for action – using the Distributed Cache to improve
location output
Let's now use the Distributed Cache to share a list of U.S. state names and abbreviations across the cluster:

1. Create a data file called states.txt on the local filesystem. It should have the state abbreviation and full name, tab separated, one per line. Or retrieve the file from this book's homepage. The file should start like the following:
AL Alabama
AK Alaska
AZ Arizona
AR Arkansas
CA California
2. Place the le on HDFS:
$ hadoop fs -put states.txt states.txt
3. Copy the previous UFOLocation.java file to a new UFOLocation2.java file and add the following import statements:
import java.io.* ;
import java.net.* ;
import java.util.* ;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache ;
4. Add the following line to the driver main method after the job name is set:
DistributedCache.addCacheFile(new URI ("/user/hadoop/states.txt"),
conf) ;
5. Replace the map class as follows:
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable>
{
    private final static LongWritable one = new LongWritable(1);
    private static Pattern locationPattern = Pattern.compile(
        "[a-zA-Z]{2}[^a-zA-Z]*$") ;
    private Map<String, String> stateNames ;

    @Override
    public void configure( JobConf job)
    {
        try
        {
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job) ;
            setupStateMap( cacheFiles[0].toString()) ;
        } catch (IOException e)
        {
            System.err.println("Error reading state file.") ;
            System.exit(1) ;
        }
    }

    private void setupStateMap(String filename) throws IOException
    {
        Map<String, String> states = new HashMap<String, String>() ;
        BufferedReader reader = new BufferedReader( new FileReader(filename)) ;
        String line = reader.readLine() ;
        while (line != null)
        {
            String[] split = line.split("\t") ;
            states.put(split[0], split[1]) ;
            line = reader.readLine() ;
        }
        stateNames = states ;
    }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, LongWritable> output,
        Reporter reporter) throws IOException
    {
        String line = value.toString();
        String[] fields = line.split("\t") ;
        String location = fields[2].trim() ;
        if (location.length() >= 2)
        {
            Matcher matcher = locationPattern.matcher(location) ;
            if (matcher.find())
            {
                int start = matcher.start() ;
                String state = location.substring(start, start + 2) ;
                output.collect(new Text(lookupState(state.toUpperCase())), one);
            }
        }
    }

    private String lookupState( String state)
    {
        String fullName = stateNames.get(state) ;
        return fullName == null ? "Other" : fullName ;
    }
}
6. Compile these classes and submit the job to Hadoop. Then retrieve the result file.
What just happened?
We rst created the lookup le we will use in our job and placed it on HDFS. Files to be
added to the Distributed Cache must inially be copied onto the HDFS lesystem.
Aer creang our new job le, we added the required class imports. Then we modied the
driver class to add the le we want on each node to be added to the DistributedCache.
The lename can be specied in mulple ways, but the easiest way is with an absolute
path to the le locaon on HDFS.
There were a number of changes to our mapper class. We added an overridden configure
method, which we use to populate a map that will be used to associate state abbreviaons
with their full name.
The configure method is called on task startup and the default implementaon does
nothing. In our overridden version, we retrieve the array of les that have been added to the
Distributed Cache. As we know there is only one le in the cache we feel safe in using the
rst index in this array, and pass that to a utility method that parses the le and uses the
contents to populate the state abbreviaon lookup map. Noce that once the le reference
is retrieved, we can access the le with standard Java I/O classes; it is aer all just a le on
the local lesystem.
We add another method to perform the lookup that takes the string extracted from the location field and returns either the full name of the state if there is a match or the string Other otherwise. This is called prior to the map result being written via the OutputCollector class.
The result of this job should be similar to the following data:
Alabama 548
Alaska 234
Arizona 2097
Arkansas 534
California 7679
Other 4531…
This works ne but we have been losing some informaon along the way. In our validaon
mapper, we simply drop any lines which don't meet our six eld criteria. Though we don't
care about individual lost records, we may care if the number of dropped records is very
large. Currently, our only way of determining that is to sum the number of records for each
recognized state and subtract from the total number of records in the le. We could also try
to have this data ow through the rest of the job to be gathered in a special reduced key but
that also seems wrong. Fortunately, there is a beer way.
Counters, status, and other output
At the end of every MapReduce job, we see output related to counters, such as the following:
12/02/12 06:28:51 INFO mapred.JobClient: Counters: 22
12/02/12 06:28:51 INFO mapred.JobClient: Job Counters
12/02/12 06:28:51 INFO mapred.JobClient: Launched reduce tasks=1
12/02/12 06:28:51 INFO mapred.JobClient: Launched map tasks=18
12/02/12 06:28:51 INFO mapred.JobClient: Data-local map tasks=18
12/02/12 06:28:51 INFO mapred.JobClient: SkippingTaskCounters
12/02/12 06:28:51 INFO mapred.JobClient: MapProcessedRecords=61393
It is possible to add user-defined counters that will likewise be aggregated from all tasks and reported in this final output as well as in the MapReduce web UI.
Time for action – creating counters, task states, and writing log
output
We'll modify our UFORecordValidationMapper to report statistics about skipped records and also highlight some other facilities for recording information about a job:
1. Create the following as the UFOCountingRecordValidationMapper.java file:
import java.io.IOException;

import org.apache.hadoop.io.* ;
import org.apache.hadoop.mapred.* ;
import org.apache.hadoop.mapred.lib.* ;

public class UFOCountingRecordValidationMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text>
{
    public enum LineCounters
    {
        BAD_LINES,
        TOO_MANY_TABS,
        TOO_FEW_TABS
    } ;

    public void map(LongWritable key, Text value,
        OutputCollector<LongWritable, Text> output,
        Reporter reporter) throws IOException
    {
        String line = value.toString();

        if (validate(line, reporter))
            output.collect(key, value);
    }

    private boolean validate(String str, Reporter reporter)
    {
        String[] parts = str.split("\t") ;

        if (parts.length != 6)
        {
            if (parts.length < 6)
            {
                reporter.incrCounter(LineCounters.TOO_FEW_TABS, 1) ;
            }
            else
            {
                reporter.incrCounter(LineCounters.TOO_MANY_TABS, 1) ;
            }

            reporter.incrCounter(LineCounters.BAD_LINES, 1) ;

            if ((reporter.getCounter(LineCounters.BAD_LINES).getCounter() % 10) == 0)
            {
                reporter.setStatus("Got 10 bad lines.") ;
                System.err.println("Read another 10 bad lines.") ;
            }

            return false ;
        }
        return true ;
    }
}
2. Make a copy of the UFOLocation2.java file as UFOLocation3.java and use this new mapper instead of UFORecordValidationMapper:
JobConf mapconf1 = new JobConf(false) ;
ChainMapper.addMapper( conf,
UFOCountingRecordValidationMapper.class,
LongWritable.class, Text.class, LongWritable.class,
Text.class,
true, mapconf1) ;
3. Compile the les, jar them up, and submit the job to Hadoop:
12/02/12 06:28:51 INFO mapred.JobClient: Counters: 22
12/02/12 06:28:51 INFO mapred.JobClient: UFOCountingRecordValidationMapper$LineCounters
12/02/12 06:28:51 INFO mapred.JobClient: TOO_MANY_TABS=324
12/02/12 06:28:51 INFO mapred.JobClient: BAD_LINES=326
12/02/12 06:28:51 INFO mapred.JobClient: TOO_FEW_TABS=2
12/02/12 06:28:51 INFO mapred.JobClient: Job Counters
4. Use a web browser to go to the MapReduce web UI (remember, by default it is on port 50030 on the JobTracker host). Select the job at the bottom of the Completed Jobs list and you should see a screen similar to the following screenshot:
5. Click on the link to the map tasks and you should see an overview screen like the
following screenshot:
6. For one of the tasks with our custom status message, click on the link to its counters.
This should give a screen similar to the one shown as follows:
7. Go back to the task list and click on the task ID to get the task overview similar to
the following screenshot:
8. Under the Task Logs column are options for the amount of data to be displayed. Click on All and the following screenshot should be displayed:
9. Now log into one of the task nodes and look through the files stored under hadoop/logs/userlogs. There is a directory for each task attempt and several files within each; the one to look for is stderr.
What just happened?
The rst thing we need to do in order to add new counters is to create a standard Java
enumeraon that will hold them. In this case we created what Hadoop would consider a
counter group called LineCounters and within that there are three counters for the total
number of bad lines, and ner grained counters for the number of lines with either too
few or too many elds. This is all you need to do to create a new set of counters; dene
the enumeraon and once you start seng the counter values, they will be automacally
understood by the framework.
To add to a counter, we simply increment it via the Reporter object; in each case here, we add one each time we encounter a bad line, one with fewer than six fields, or one with more than six fields.

We also retrieve the BAD_LINES counter for a task and, if it is a multiple of 10, do the following:

Set the task status to reflect this fact
Write a similar message to stderr with the standard Java System.err.println mechanism
We then go to the MapReduce UI and confirm that we can see both the counter totals in the job overview and the tasks with the custom status message in the task list.

We then explored the web UI further, looking at the counters for an individual job, and saw that from the detail page for a task we can click through to the log files for that task.

We then looked at one of the nodes to see that Hadoop also captures the logs from each task in a directory on the filesystem under the {HADOOP_HOME}/logs/userlogs directory. Under subdirectories for each task attempt, there are files for the standard streams as well as the general task logs. As you will see, a busy node can end up with a large number of task log directories and it is not always easy to identify the task directories of interest. The web interface proved itself to be a more efficient view on this data.
If you are using the Hadoop context object API, then counters are accessed
through the Context.getCounter().increment() method.
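As a minimal sketch (the class name here is hypothetical and the logic is a simplified version of the validation mapper above), the equivalent code in a context-object mapper looks something like this:

import java.io.IOException;
import org.apache.hadoop.io.* ;
import org.apache.hadoop.mapreduce.* ;

// Sketch: the same counter idea expressed with the context object API
public class ContextCountingMapper
    extends Mapper<LongWritable, Text, LongWritable, Text>
{
    public enum LineCounters { BAD_LINES, TOO_MANY_TABS, TOO_FEW_TABS }

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException
    {
        String[] parts = value.toString().split("\t") ;
        if (parts.length != 6)
        {
            // Counters are retrieved from the Context and incremented directly
            context.getCounter(LineCounters.BAD_LINES).increment(1) ;
            context.getCounter(parts.length < 6
                ? LineCounters.TOO_FEW_TABS
                : LineCounters.TOO_MANY_TABS).increment(1) ;
            context.setStatus("Seen another bad line.") ;
            return ;
        }
        context.write(key, value) ;
    }
}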
Too much information!
Aer not worrying much about how to get status and other informaon out of our jobs,
it may suddenly seem like we've got too many confusing opons. The fact of the maer is
that when running a fully distributed cluster in parcular, there really is no way around the
fact that the data may be spread across every node. With Java code we can't as easily mock
its usage on the command line as we did with our Ruby Streaming tasks; so care needs to be
taken to think about what informaon will be needed at runme. This should include details
concerning both the general job operaon (addional stascs) as well as indicators of
problems that may need further invesgaon.
Counters, task status messages, and good old-fashioned Java logging can work together. If there is a situation you care about, set up a counter that will record each time it occurs and consider setting the status message of the task that encountered it. If there is some specific data, write that to stderr. Since counters are so easily visible, you can know pretty quickly after job completion whether the situation of interest occurred. From this, you can go to the web UI and see at a glance all the tasks in which the situation was encountered. From there, you can click through to examine the more detailed logs for the task.

In fact, you don't need to wait until the job completes; counters and task status messages are updated in the web UI as the job proceeds, so you can start the investigation as soon as either counters or task status messages alert you to the situation. This is particularly useful in very long running jobs where the errors may cause you to abort the job.
Summary
This chapter covered the development of a MapReduce job, highlighting some of the issues and approaches you are likely to face frequently. In particular, we learned how Hadoop Streaming provides a means to use scripting languages to write map and reduce tasks, and how using Streaming can be an effective tool for the early stages of job prototyping and initial data analysis.

We also learned that writing tasks in a scripting language can provide the additional benefit of using command-line tools to directly test and debug the code. Within the Java API, we looked at the ChainMapper class, which provides an efficient way of decomposing a complex map task into a series of smaller, more focused ones.

We then saw how the Distributed Cache provides a mechanism for efficient sharing of data across all nodes. It copies files from HDFS onto the local filesystem on each node, providing local access to the data. We also learned how to add job counters by defining a Java enumeration for the counter group and using framework methods to increment their values, and how to use a combination of counters, task status messages, and debug logs to develop an efficient job analysis workflow.

We expect most of these techniques and ideas to be the ones that you will encounter frequently as you develop MapReduce jobs. In the next chapter, we will explore a series of more advanced techniques that are less often encountered but are invaluable when they are.
5
Advanced MapReduce Techniques
Now that we have looked at a few details of the fundamentals of MapReduce and its usage, it's time to examine some more techniques and concepts involved in MapReduce. This chapter will cover the following topics:

Performing joins on data
Implementing graph algorithms in MapReduce
How to represent complex datatypes in a language-independent fashion

Along the way, we'll use the case studies as examples in order to highlight other aspects such as tips and tricks and identifying some areas of best practice.
Simple, advanced, and in-between
Including the word "advanced" in a chapter title is a little dangerous, as complexity is a subjective concept. So let's be very clear about the material covered here. We don't, for even a moment, suggest that this is the pinnacle of distilled wisdom that would otherwise take years to acquire. Conversely, we also don't claim that some of the techniques and problems covered in this chapter will have occurred to someone new to the world of Hadoop.

For the purposes of this chapter, therefore, we use the term "advanced" to cover things that you don't see in the first days or weeks, or wouldn't necessarily appreciate if you did. These are techniques that provide specific solutions to particular problems but also highlight ways in which the standard Hadoop and related APIs can be employed to address problems that are not obviously suited to the MapReduce processing model. Along the way, we'll also point out some alternative approaches that we don't implement here but which may be useful sources for further research.

Our first case study is a very common example of this latter case: performing join-type operations within MapReduce.
Joins
Few problems use a single set of data. In many cases, there are easy ways to obviate the need to try and process numerous discrete yet related data sets within the MapReduce framework.

The analogy here is, of course, to the concept of a join in a relational database. It is very natural to segment data into numerous tables and then use SQL statements that join tables together to retrieve data from multiple sources. The canonical example is where a main table has only ID numbers for particular facts, and joins against other tables are used to extract data about the information referred to by the unique ID.
When this is a bad idea
It is possible to implement joins in MapReduce. Indeed, as we'll see, the problem is less about the ability to do it and more about the choice of which of many potential strategies to employ.

However, MapReduce joins are often difficult to write and easy to make inefficient. Work with Hadoop for any length of time and you will come across a situation where you need to do it. However, if you very frequently need to perform MapReduce joins, you may want to ask yourself if your data is well structured and more relational in nature than you first assumed. If so, you may want to consider Apache Hive (the main topic of Chapter 8, A Relational View on Data with Hive) or Apache Pig (briefly mentioned in the same chapter). Both provide additional layers atop Hadoop that allow data processing operations to be expressed in high-level languages; in the case of Hive, through a variant of SQL.
Map-side versus reduce-side joins
That caveat out of the way, there are two basic approaches to joining data in Hadoop, and they take their names from where in the job execution the join occurs. In either case, we need to bring multiple data streams together and perform the join through some logic. The basic difference between the two approaches is whether the multiple data streams are combined within the mapper or reducer functions.

Map-side joins, as the name implies, read the data streams into the mapper and use logic within the mapper function to perform the join. The great advantage of a map-side join is that by performing all joining, and more critically all data volume reduction, within the mapper, the amount of data transferred to the reduce stage is greatly minimized. The drawback of map-side joins is that you either need to find a way of ensuring one of the data sources is very small or you need to define the job input to follow very specific criteria. Often, the only way to do that is to preprocess the data with another MapReduce job whose sole purpose is to make the data ready for a map-side join.
In contrast, a reduce-side join has the multiple data streams processed through the map stage without performing any join logic and does the joining in the reduce stage. The potential drawback of this approach is that all the data from each source is pulled through the shuffle stage and passed into the reducers, where much of it may then be discarded by the join operation. For large data sets, this can become a very significant overhead.

The main advantage of the reduce-side join is its simplicity; you are largely responsible for how the jobs are structured and it is often quite straightforward to define a reduce-side join approach for related data sets. Let's look at an example.
Matching account and sales information
A common situaon in many companies is that sales records are kept separate from the
client data. There is, of course, a relaonship between the two; usually a sales record
contains the unique ID of the user account through which the sale was performed.
In the Hadoop world, these would be represented by two types of data les: one containing
records of the user IDs and informaon for sales, and the other would contain the full data
for each user account.
Frequent tasks require reporng that uses data from both these sources; say, for example,
we wanted to see the total number of sales and total value for each user but do not want
to associate it with an anonymous ID number, but rather with a name. This may be valuable
when customer service representaves wish to call the most frequent customers—data from
the sales records—but want to be able to refer to the person by name and not just a number.
Time for action – reduce-side join using MultipleInputs
We can perform the report explained in the previous section using a reduce-side join by performing the following steps:

1. Create the following tab-separated file and name it sales.txt:
001   35.99    2012-03-15
002   12.49    2004-07-02
004   13.42    2005-12-20
003   499.99   2010-12-20
001   78.95    2012-04-02
002   21.99    2006-11-30
002   93.45    2008-09-10
001   9.99     2012-05-17
2. Create the following tab-separated file and name it accounts.txt:
001   John Allen      Standard   2012-03-15
002   Abigail Smith   Premium    2004-07-13
003   April Stevens   Standard   2010-12-20
004   Nasser Hafez    Premium    2001-04-23
3. Copy the datales onto HDFS.
$ hadoop fs -mkdir sales
$ hadoop fs -put sales.txt sales/sales.txt
$ hadoop fs -mkdir accounts
$ hadoop fs -put accounts/accounts.txt
4. Create the following file and name it ReduceJoin.java:
import java.io.* ;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
public class ReduceJoin
{
public static class SalesRecordMapper
extends Mapper<Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
String record = value.toString() ;
String[] parts = record.split("\t") ;
context.write(new Text(parts[0]), new
Text("sales\t"+parts[1])) ;
}
}
public static class AccountRecordMapper
extends Mapper<Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
String record = value.toString() ;
String[] parts = record.split("\t") ;
context.write(new Text(parts[0]), new
Text("accounts\t"+parts[1])) ;
}
}
public static class ReduceJoinReducer
extends Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterable<Text> values,
Context context)
throws IOException, InterruptedException
{
String name = "" ;
double total = 0.0 ;
int count = 0 ;
for(Text t: values)
{
String parts[] = t.toString().split("\t") ;
if (parts[0].equals("sales"))
{
count++ ;
total+= Float.parseFloat(parts[1]) ;
}
else if (parts[0].equals("accounts"))
{
name = parts[1] ;
}
}
String str = String.format("%d\t%f", count, total) ;
context.write(new Text(name), new Text(str)) ;
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "Reduce-side join");
job.setJarByClass(ReduceJoin.class);
job.setReducerClass(ReduceJoinReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(args[0]),
TextInputFormat.class, SalesRecordMapper.class) ;
MultipleInputs.addInputPath(job, new Path(args[1]),
TextInputFormat.class, AccountRecordMapper.class) ;
Path outputPath = new Path(args[2]);
FileOutputFormat.setOutputPath(job, outputPath);
outputPath.getFileSystem(conf).delete(outputPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
5. Compile the le and add it to a JAR le.
$ javac ReduceJoin.java
$ jar -cvf join.jar *.class
6. Run the job by execung the following command:
$ hadoop jar join.jarReduceJoin sales accounts outputs
7. Examine the result le.
$ hadoop fs -cat /user/garry/outputs/part-r-00000
John Allen 3 124.929998
Abigail Smith 3 127.929996
April Stevens 1 499.989990
Nasser Hafez 1 13.420000
What just happened?
Firstly, we created the data files to be used in this example. We created two small data sets as this makes it easier to track the result output. The first data set we defined was the account details with four columns, as follows:

The account ID
The client name
The type of account
The date the account was opened

We then created a sales record with three columns:

The account ID of the purchaser
The value of the sale
The date of the sale

Naturally, real account and sales records would have many more fields than the ones mentioned here. After creating the files, we placed them onto HDFS.

We then created the ReduceJoin.java file, which looks very much like the previous MapReduce jobs we have used. There are a few aspects to this job that make it special and allow us to implement a join.

Firstly, the class has two defined mappers. As we have seen before, jobs can have multiple mappers executed in a chain; but in this case, we wish to apply a different mapper to each of the input locations. Accordingly, we have the sales and account data handled by the SalesRecordMapper and AccountRecordMapper classes. We used the MultipleInputs class from the org.apache.hadoop.mapreduce.lib.input package as follows:
MultipleInputs.addInputPath(job, new Path(args[0]),
TextInputFormat.class, SalesRecordMapper.class) ;
MultipleInputs.addInputPath(job, new Path(args[1]),
TextInputFormat.class, AccountRecordMapper.class) ;
As you can see, unlike in previous examples where we add a single input location, the MultipleInputs class allows us to add multiple sources and associate each with a distinct input format and mapper.

The mappers are pretty straightforward; the SalesRecordMapper class emits an output of the form <account number>, <sales value> while the AccountRecordMapper class emits an output of the form <account number>, <client name>. We therefore have the order value and client name for each sale being passed into the reducer where the actual join will happen.

Notice that both mappers actually emit more than the required values. The SalesRecordMapper class prefixes its value output with sales while the AccountRecordMapper class uses the tag accounts.

If we look at the reducer, we can see why this is so. The reducer retrieves each record for a given key, but without these explicit tags we would not know whether a given value came from the sales or account mapper and hence would not understand how to treat the data value.
The ReduceJoinReducer class therefore treats the values in the Iterable object differently, depending on which mapper they came from. Values from the AccountRecordMapper class (and there should be only one) are used to populate the client name in the final output. For each sales record (likely to be multiple, as most clients buy more than a single item), the total number of orders is counted, as is the overall combined value. The output from the reducer is therefore a key of the account holder name and a value string containing the number of orders and the total order value.

We compile and execute the class; notice how we provide three arguments representing the two input directories as well as the single output source. Because of how the MultipleInputs class is configured, we must also ensure we specify the directories in the right order; there is no dynamic mechanism to determine which type of file is in which location.

After execution, we examine the output file and confirm that it does indeed contain the overall totals for named clients as expected.
DataJoinMapper and TaggedMapOutput
There is a way of implementing a reduce-side join in a more sophisticated and object-oriented fashion. Within the org.apache.hadoop.contrib.join package are classes such as DataJoinMapperBase and TaggedMapOutput that provide an encapsulated means of deriving the tags for map output and having them processed at the reducer. This mechanism means you don't have to define explicit tag strings as we did previously and then carefully parse out the data received at the reducer to determine from which mapper the data came; there are methods in the provided classes that encapsulate this functionality.

This capability is particularly valuable when using numeric or other non-textual data. When creating our own explicit tags, as in the previous example, we would have to convert types such as integers into strings to allow us to add the required prefix tag. This is less efficient than using the numeric types in their normal form and relying on the additional classes to implement the tag.

The framework allows for quite sophisticated tag generation, as well as concepts such as tag grouping, that we didn't implement previously. There is additional work required to use this mechanism, which includes overriding additional methods and using a different map base class. For straightforward joins such as the previous example, this framework may be overkill, but if you find yourself implementing very complex tagging logic, it may be worth a look.
Implementing map-side joins
For a join to occur at a given point, we must have access to the appropriate records from each data set at that point. This is where the simplicity of the reduce-side join comes into its own; though it incurs the expense of additional network traffic, processing by definition ensures that the reducer has all the records associated with the join key.

If we wish to perform our join in the mapper, it isn't as easy to make this condition hold true. We can't assume that our input data is sufficiently well structured to allow associated records to be read simultaneously. We generally have two classes of approach here: obviate the need to read from multiple external sources, or preprocess the data so that it is amenable to map-side joining.
Using the Distributed Cache
The simplest way of realizing the first approach is to take all but one data set and make it available in the Distributed Cache that we used in the previous chapter. The approach can be used for multiple data sources, but for simplicity let's discuss just two.

If we have one large data set and one smaller one, such as with the sales and account info earlier, one option would be to package up the account info and push it into the Distributed Cache. Each mapper would then read this data into an efficient data structure, such as a hash table that uses the join key as the hash key. The sales records are then processed, and during the processing of each record the needed account information can be retrieved from the hash table.

This mechanism is very effective, and when one of the smaller data sets can easily fit into memory, it is a great approach. However, we are not always that lucky; sometimes the smallest data set is still too large to be copied to every worker machine and held in memory.
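The following is a minimal sketch of what such a mapper could look like with the context object API; the class name and cache file layout are illustrative assumptions, and the exercise that follows asks you to build this out fully against the sales and account data:

import java.io.* ;
import java.util.* ;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: a map-side join that looks up client names from a cached account file
public class MapSideJoinMapper extends Mapper<Object, Text, Text, Text>
{
    private Map<String, String> accountNames = new HashMap<String, String>() ;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        // The account file is assumed to have been added in the driver with
        // DistributedCache.addCacheFile(...), as in the states.txt example
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration()) ;
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString())) ;
        String line ;
        while ((line = reader.readLine()) != null)
        {
            String[] parts = line.split("\t") ;
            accountNames.put(parts[0], parts[1]) ;   // account ID -> client name
        }
        reader.close() ;
    }

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException
    {
        String[] parts = value.toString().split("\t") ;
        String name = accountNames.get(parts[0]) ;
        // The join happens here, in the mapper, with no shuffle of account data
        context.write(new Text(name == null ? "Unknown" : name), new Text(parts[1])) ;
    }
}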
Have a go hero - Implementing map-side joins
Take the previous sales/account record example and implement a map-side join using the
Distributed Cache. If you load the account records into a hash table that maps account ID
numbers to client names, you can use the account ID to retrieve the client name. Do this
within the mapper while processing the sales records.
Pruning data to t in the cache
If the smallest data set is sll too big to be used in the Distributed Cache, all is not
necessarily lost. Our earlier example, for instance, extracted only two elds from each record
and discarded the other elds not required by the job. In reality, an account will be described
by many aributes, and this sort of reducon will limit the data size dramacally. Oen the
data available to Hadoop is this full data set, but what we need is only a subset of the elds.
In such a case, therefore, it may be possible to extract from the full data set only the fields that are needed during the MapReduce job, and in doing so create a pruned data set that is small enough to be used in the cache.
This is very similar to the concept underlying column-oriented databases. Traditional relational databases store data a row at a time, meaning that the full row needs to be read to extract a single column. A column-based database instead stores each column separately, allowing a query to read only the columns in which it is interested.
If you take this approach, you need to consider what mechanism will be used to generate the data subset and how often this will be done. The obvious approach is to write another MapReduce job that does the necessary filtering, and this output is then used in the Distributed Cache for the follow-on job. If the smaller data set changes only rarely, you may be able to get away with generating the pruned data set on a scheduled basis; for example, refresh it every night. Otherwise, you will need to make a chain of two MapReduce jobs: one to produce the pruned data set and the other to perform the join operation using the large set and the data in the Distributed Cache.
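As an illustration of how small such a pruning job can be, the following sketch is a map-only job that projects each full account record down to just the join key and client name; the field positions are assumptions about a hypothetical full record layout:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: prune full account records down to "ID <tab> name" for use in the cache
public class AccountPruningMapper extends Mapper<Object, Text, Text, Text>
{
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException
    {
        String[] fields = value.toString().split("\t") ;
        // Keep only the join key and the client name; drop every other attribute
        context.write(new Text(fields[0]), new Text(fields[1])) ;
    }
}

// In the driver, job.setNumReduceTasks(0) makes this a map-only job so the
// projected records are written straight to the output files.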
Using a data representation instead of raw data
Somemes, one of the data sources is not used to retrieve addional data but is instead
used to derive some fact that is then used in a decision process. We may, for example, be
looking to lter sales records to extract only those for which the shipping address was in a
specic locale.
In such a case, we can reduce the required data size down to a list of the applicable sales
records that may more easily t into the cache. We can again store it as a hash table, where
we are just recording the fact that the record is valid, or even use something like a sorted
list or a tree. In cases where we can accept some false posives while sll guaranteeing no
false negaves, a Bloom lter provides an extremely compact way of represenng such
informaon.
As can be seen, applying this approach to enable a map-side join requires creativity and not a little luck in regards to the nature of the data set and the problem at hand. But remember that the best relational database administrators spend significant time optimizing queries to remove unnecessary data processing; so it's never a bad idea to ask if you truly need to process all that data.
Using multiple mappers
Fundamentally, the previous techniques are trying to remove the need for a full cross data set join. But sometimes this is what you have to do; you may simply have very large data sets that cannot be combined in any of these clever ways.
There are classes within the org.apache.hadoop.mapreduce.lib.join package that support this situation. The main class of interest is CompositeInputFormat, which applies a user-defined function to combine records from multiple data sources.

The main limitation of this approach is that the data sources must already be indexed based on the common key, in addition to being both sorted and partitioned in the same way. The reason for this is simple: when reading from each source, the framework needs to know if a given key is present at each location. If we know that each partition is sorted and contains the same key range, simple iteration logic can do the required matching.

This situation is obviously not going to happen by accident, so again you may find yourself writing preprocessing jobs to transform all the input data sources into the correct sort and partition structure.
This discussion starts to touch on distributed and parallel join algorithms; both are topics of extensive academic and commercial research. If you are interested in the ideas and want to learn more of the underlying theory, go searching on http://scholar.google.com.
To join or not to join...
Aer our tour of joins in the MapReduce world, let's come back to the original queson:
are you really sure you want to be doing this? The choice is oen between a relavely
easily implemented yet inecient reduce-side join, and more ecient but more complex
map-side alternaves. We have seen that joins can indeed be implemented in MapReduce,
but they aren't always prey. This is why we advise the use of something like Hive or Pig if
these types of problems comprise a large poron of your workload. Obviously, we can use
tools such as those that do their own translaon into MapReduce code under the hood
and directly implement both map-side and reduce-side joins, but it's oen beer to use
a well-engineered and well-opmized library for such workloads instead of building your
own. That is aer all why you are using Hadoop and not wring your own distributed
processing framework!
Graph algorithms
Any good computer scientist will tell you that the graph data structure is one of the most powerful tools around. Many complex systems are best represented by graphs, and a body of knowledge going back at least decades (centuries if you get more mathematical about it) provides very powerful algorithms to solve a vast variety of graph problems. But by their very nature, graphs and their algorithms are often very difficult to imagine in a MapReduce paradigm.
Graph 101
Let's take a step back and define some terminology. A graph is a structure comprising nodes (also called vertices) that are connected by links called edges. Depending on the type of graph, the edges may be bidirectional or unidirectional and may have weights associated with them. For example, a city road network can be seen as a graph where the roads are the edges, and intersections and points of interest are nodes. Some streets are one-way and some are not, some have tolls, some are closed at certain times of day, and so forth.

For transportation companies, there is much money to be made by optimizing the routes taken from one point to another. Different graph algorithms can derive such routes by taking into account attributes such as one-way streets and other costs expressed as weights that make a given road more attractive or less so.

For a more current example, think of the social graph popularized by sites such as Facebook, where the nodes are people and the edges are the relationships between them.
Graphs and MapReduce – a match made somewhere
The main reason graphs don't look like many other MapReduce problems is due to the stateful nature of graph processing, which can be seen in the path-based relationship between elements and often in the large number of nodes processed together for a single algorithm. Graph algorithms tend to use notions of global state to make determinations about which elements to process next and modify such global knowledge at each step.

In particular, most of the well-known algorithms often execute in an incremental or reentrant fashion, building up structures representing processed and pending nodes, and working through the latter while reducing the former.

MapReduce problems, on the other hand, are conceptually stateless and typically based upon a divide-and-conquer approach where each Hadoop worker host processes a small subset of the data, writing out a portion of the final result, and the total job output is viewed as the simple collection of these smaller outputs. Therefore, when implementing graph algorithms in Hadoop, we need to express algorithms that are fundamentally stateful and conceptually single-threaded in a stateless, parallel, and distributed framework. That's the challenge!
Most of the well-known graph algorithms are based upon search or traversal of the graph, often to find routes, frequently ranked by some notion of cost, between nodes. The most fundamental graph traversal algorithms are depth-first search (DFS) and breadth-first search (BFS). The difference between the algorithms is the ordering in which a node is processed in relationship to its neighbors.
We will look at represenng an algorithm that implements a specialized form of such a
traversal; for a given starng node in the graph, determine the distance between it and
every other node in the graph.
As can be seen, the eld of graph algorithms and theory is a huge one that
we barely scratch the surface of here. If you want to nd out more, the
Wikipedia entry on graphs is a good starng point; it can be found at http://
en.wikipedia.org/wiki/Graph_(abstract_data_type).
Representing a graph
The rst problem we face is how to represent the graph in a way we can eciently
process using MapReduce. There are several well-known graph representaons known
as pointer-based, adjacency matrix, and adjacency list. In most implementaons, these
representaons oen assume a single process space with a global view of the whole graph;
we need to modify the representaon to allow individual nodes to be processed in discrete
map and reduce tasks.
We'll use the graph shown here in the following examples. The graph does have some extra
informaon that will be explained later.
Our graph is quite simple; it has only seven nodes, and all but one of the edges is
bidireconal. We are also using a common coloring technique that is used in standard
graph algorithms, as follows:
White nodes are yet to be processed
Gray nodes are currently being processed
Black nodes have been processed
As we process our graph in the following steps, we will expect to see the nodes move
through these stages.
Time for action – representing the graph
Let's dene a textual representaon of the graph that we'll use in the following examples.
Create the following as graph.txt:
1   2,3,4   0   C
2   1,4
3   1,5,6
4   1,2
5   3,6
6   3,5
7   6
What just happened?
We dened a le structure that will represent our graph, based somewhat on the adjacency
list approach. We assumed that each node has a unique ID and the le structure has four
elds, as follows:
The node ID
A comma-separated list of neighbors
The distance from the start node
The node status
In the inial representaon, only the starng node has values for the third and fourth
columns: its distance from itself is 0 and its status is "C", which we'll explain later.
Our graph is direconal—more formally referred to as a directed graph—that is to say,
if node 1 lists node 2 as a neighbor, there is only a return path if node 2 also lists node 1
as its neighbor. We see this in the graphical representaon where all but one edge has an
arrow on both ends.
Overview of the algorithm
Because this algorithm and the corresponding MapReduce job are quite involved, we'll explain them before showing the code, and then demonstrate them in use later.

Given the previous representation, we will define a MapReduce job that will be executed multiple times to get the final output; the input to a given execution of the job will be the output from the previous execution.
Based on the color code described in the previous section, we will define three states for a node:

Pending: The node is yet to be processed; it is in the default state (white)
Currently processing: The node is being processed (gray)
Done: The final distance for the node has been determined (black)
The mapper
The mapper will read in the current representation of the graph and treat each node as follows:

If the node is marked as Done, it gives output with no changes.
If the node is marked as Currently processing, its state is changed to Done and it gives output with no other changes. Each of its neighbors is also given as output, as per the current record with its distance incremented by one, but with no neighbor list; node 1 doesn't know node 2's neighbors, for example.
If the node is marked as Pending, its state is changed to Currently processing and it gives output with no further changes.
The reducer
The reducer will receive one or more records for each node ID, and it will combine their values into the final output node record for that stage.

The general algorithm for the reducer is as follows:

A Done record is the final output and no further processing of the values is performed
For other nodes, the final output is built up by taking the list of neighbors, wherever it is found, and the highest distance and state
Iterative application
If we apply this algorithm once, we will get node 1 marked as Done, several more (its immediate neighbors) as Current, and a few others as Pending. Successive applications of the algorithm will see all nodes move to their final state; as each node is encountered, its neighbors are brought into the processing pipeline. We will show this later.
Time for action – creating the source code
We'll now see the source code to implement our graph traversal. Because the code is lengthy, we'll break it into multiple steps; obviously they should all be together in a single source file.
1. Create the following as GraphPath.java with these imports:
import java.io.* ;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
public class GraphPath
{
2. Create an inner class to hold an object-oriented representaon of a node:
// Inner class to represent a node
public static class Node
{
// The integer node id
private String id ;
// The ids of all nodes this node has a path to
private String neighbours ;
// The distance of this node to the starting node
private int distance ;
// The current node state
private String state ;
// Parse the text file representation into a Node object
Node( Text t)
{
String[] parts = t.toString().split("\t") ;
this.id = parts[0] ;
this.neighbours = parts[1] ;
if (parts.length<3 || parts[2].equals(""))
this.distance = -1 ;
else
this.distance = Integer.parseInt(parts[2]) ;
if (parts.length< 4 || parts[3].equals(""))
this.state = "P" ;
else
this.state = parts[3] ;
}
// Create a node from a key and value object pair
Node(Text key, Text value)
{
this(new Text(key.toString()+"\t"+value.toString())) ;
}
public String getId()
{
return this.id ;
}
public String getNeighbours()
{
return this.neighbours ;
}
public int getDistance()
{
return this.distance ;
}
public String getState()
{
return this.state ;
}
}
3. Create the mapper for the job. The mapper will create a new Node object for its
input and then examine it, and based on its state do the appropriate processing.
public static class GraphPathMapper
extends Mapper<Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
Node n = new Node(value) ;
if (n.getState().equals("C"))
{
// Output the node with its state changed to Done
context.write(new Text(n.getId()), new
Text(n.getNeighbours()+"\t"+n.getDistance()+"\t"+"D")) ;
for (String neighbour:n.getNeighbours().
split(","))
{
// Output each neighbour as a Currently processing node
// Increment the distance by 1; it is one link further away
context.write(new Text(neighbour), new
Text("\t"+(n.getDistance()+1)+"\tC")) ;
}
}
else
{
// Output a pending node unchanged
context.write(new Text(n.getId()), new
Text(n.getNeighbours()+"\t"+n.getDistance()
+"\t"+n.getState())) ;
}
}
}
4. Create the reducer for the job. As with the mapper, this reads in a representation of a node and gives as output a different value depending on the state of the node. The basic approach is to collect from the input the largest value for the state and distance columns, and through this converge to the final solution.
public static class GraphPathReducer
extends Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterable<Text> values,
Context context)
throws IOException, InterruptedException
{
// Set some default values for the final output
String neighbours = null ;
int distance = -1 ;
String state = "P" ;
for(Text t: values)
{
Node n = new Node(key, t) ;
if (n.getState().equals("D"))
{
// A done node should be the final output; ignore the remaining
// values
neighbours = n.getNeighbours() ;
distance = n.getDistance() ;
state = n.getState() ;
break ;
}
// Select the list of neighbours when found
if (n.getNeighbours() != null)
neighbours = n.getNeighbours() ;
// Select the largest distance
if (n.getDistance() > distance)
distance = n.getDistance() ;
// Select the highest remaining state
if (n.getState().equals("D") ||
(n.getState().equals("C") && state.equals("P")))
state = n.getState() ;
}
// Output a new node representation from the collected parts
context.write(key, new
Text(neighbours+"\t"+distance+"\t"+state)) ;
}
}
5. Create the job driver:
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "graph path");
job.setJarByClass(GraphPath.class);
job.setMapperClass(GraphPathMapper.class);
job.setReducerClass(GraphPathReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
What just happened?
The job here implements the previously described algorithm, which we'll execute in the following sections. The job setup is pretty standard, and apart from the algorithm definition the only new thing here is the use of an inner class to represent nodes.

The input to a mapper or reducer is often a flattened representation of a more complex structure or object. We could just use that representation, but in this case it would result in the mapper and reducer bodies being full of text and string manipulation code that would obscure the actual algorithm.

The use of the Node inner class allows the mapping from the flat file representation to an object representation to be encapsulated in an object that makes sense in terms of the business domain. This also makes the mapper and reducer logic clearer, as comparisons between object attributes are more semantically meaningful than comparisons with slices of a string identified only by absolute index positions.
Time for action – the rst run
Let's now perform the inial execuon of this algorithm on our starng representaon of
the graph:
1. Put the previously created graph.txt file onto HDFS:
$ hadoop fs -mkdir graphin
$ hadoop fs -put graph.txt graphin/graph.txt
2. Compile the job and create the JAR file:
$ javac GraphPath.java
$ jar -cvf graph.jar *.class
3. Execute the MapReduce job:
$ hadoop jar graph.jar GraphPath graphin graphout1
4. Examine the output file:
$ hadoop fs -cat /user/hadoop/graphout1/part-r-00000
1   2,3,4   0    D
2   1,4     1    C
3   1,5,6   1    C
4   1,2     1    C
5   3,6     -1   P
6   3,5     -1   P
7   6       -1   P
What just happened?
Aer pung the source le onto HDFS and creang the job JAR le, we executed the job in
Hadoop. The output representaon of the graph shows a few changes, as follows:
Node 1 is now marked as Done; its distance from itself is obviously 0
Nodes 2, 3, and 4 – the neighbors of node 1 — are marked as Currently processing
All other nodes are Pending
Our graph now looks like the following gure:
Given the algorithm, this is to be expected; the rst node is complete and its neighboring
nodes, extracted through the mapper, are in progress. All other nodes are yet to
begin processing.
Time for action – the second run
If we take this representaon as the input to another run of the job, we would expect nodes
2, 3, and 4 to now be complete, and for their neighbors to now be in the Current state. Let's
see; execute the following steps:
1. Execute the MapReduce job by execung the following command:
$ hadoop jar graph.jarGraphPathgraphout1graphout2
2. Examine the output le:
$ hadoop fs -cat /home/user/hadoop/graphout2/part-r000000
1   2,3,4   0    D
2   1,4     1    D
3   1,5,6   1    D
4   1,2     1    D
5   3,6     2    C
6   3,5     2    C
7   6       -1   P
What just happened?
As expected, nodes 1 through 4 are complete, nodes 5 and 6 are in progress, and node 7 is still pending, as seen in the following figure:

If we run the job again, we should expect nodes 5 and 6 to be Done and any unprocessed neighbors to become Current.
Time for action – the third run
Let's validate that assumption by running the algorithm for the third time.
1. Execute the MapReduce job:
$ hadoop jar graph.jar GraphPath graphout2 graphout3
2. Examine the output file:
$ hadoop fs -cat /user/hadoop/graphout3/part-r-00000
1   2,3,4   0    D
2   1,4     1    D
3   1,5,6   1    D
4   1,2     1    D
5   3,6     2    D
6   3,5     2    D
7   6       -1   P
What just happened?
We now see that nodes 1 through 6 are complete. But node 7 is still pending and no nodes are currently being processed, as shown in the following figure:

The reason for this state is that though node 7 has a link to node 6, there is no edge in the reverse direction. Node 7 is therefore effectively unreachable from node 1. If we run the algorithm one final time, we should expect to see the graph unchanged.
Time for action – the fourth and last run
Let's perform the fourth execution to validate that the output has now reached its final stable state.
1. Execute the MapReduce job:
$ hadoop jar graph.jar GraphPath graphout3 graphout4
2. Examine the output le:
$ hadoop fs -cat /user/hadoop/graphout4/part-r-00000
1   2,3,4   0    D
2   1,4     1    D
3   1,5,6   1    D
4   1,2     1    D
5   3,6     2    D
6   3,5     2    D
7   6       -1   P
What just happened?
The output is as expected; since node 7 is not reachable from node 1 or any of its neighbors, it will remain Pending and never be processed further. Consequently, our graph is unchanged, as shown in the following figure:

The one thing we did not build into our algorithm was a terminating condition; the process is complete if a run does not create any new D or C nodes. The mechanism we used here was manual, that is, we knew by examination that the graph representation had reached its final stable state. There are ways of doing this programmatically, however. Using custom job counters, as discussed in the previous chapter, we can, for example, increment a counter every time a new D or C node is created and only re-execute the job if that counter is greater than zero after the run.
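A sketch of what that could look like as a counter-driven driver loop is shown below; the GraphCounters enum and the increments that would need to be added inside GraphPathMapper and GraphPathReducer are assumptions, not part of the code shown earlier:

// Sketch: re-run the GraphPath job until an iteration creates no new C or D nodes.
// Assumes the mapper/reducer call context.getCounter(GraphCounters.STATE_CHANGES).increment(1)
// whenever they emit a node whose state has just changed.
public enum GraphCounters { STATE_CHANGES }

public static void main(String[] args) throws Exception
{
    String input = args[0] ;
    int iteration = 1 ;
    long changes = 1 ;

    while (changes > 0)
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "graph path iteration " + iteration);
        job.setJarByClass(GraphPath.class);
        job.setMapperClass(GraphPathMapper.class);
        job.setReducerClass(GraphPathReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        String output = args[1] + iteration ;
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        if (!job.waitForCompletion(true))
            System.exit(1);

        // Stop once a run produces no newly Currently processing or Done nodes
        changes = job.getCounters().findCounter(GraphCounters.STATE_CHANGES).getValue() ;
        input = output ;
        iteration++ ;
    }
}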
Running multiple jobs
The previous algorithm is the first time we have explicitly used the output of one MapReduce job as the input to another. In most cases, the jobs are different; but, as we have seen, there is value in repeatedly applying an algorithm until the output reaches a stable state.
Final thoughts on graphs
For anyone familiar with graph algorithms, the previous process will seem very alien. This is simply a consequence of the fact that we are implementing a stateful and potentially recursive global and reentrant algorithm as a series of serial, stateless MapReduce jobs. The important fact is not in the particular algorithm used; the lesson is in how we can take flat text structures and a series of MapReduce jobs, and from this implement something like graph traversal. You may have problems that at first don't appear to have any way of being implemented in the MapReduce paradigm; consider some of the techniques used here and remember that many algorithms can be modeled in MapReduce. They may look very different from the traditional approach, but the goal is the correct output and not an implementation of a known algorithm.
Using language-independent data structures
A cricism oen leveled at Hadoop, and which the community has been working
hard to address, is that it is very Java-centric. It may appear strange to accuse a project
fully implemented in Java of being Java-centric, but the consideraon is from a client's
perspecve.
We have shown how Hadoop Streaming allows the use of scripng languages to implement
map and reduce tasks and how Pipes provides similar mechanisms for C++. However, one
area that does remain Java-only is the nature of the input formats supported by Hadoop
MapReduce. The most ecient format is SequenceFile, a binary spliable container that
supports compression. However, SequenceFiles have only a Java API; they cannot be wrien
or read in any other language.
We could have an external process creang data to be ingested into Hadoop for MapReduce
processing, and the best way we could do this is either have it simply as an output of text
type or do some preprocessing to translate the output format into SequenceFiles to be
pushed onto HDFS. We also struggle here to easily represent complex data types; we either
have to aen them to a text format or write a converter across two binary formats, neither
of which is an aracve opon.
Candidate technologies
Fortunately, there have been several technologies released in recent years that address
the question of cross-language data representations. They are Protocol Buffers (created
by Google and hosted at http://code.google.com/p/protobuf), Thrift (originally
created by Facebook and now an Apache project at http://thrift.apache.org), and
Avro (created by Doug Cutting, the original creator of Hadoop). Given its heritage and tight
Hadoop integration, we will use Avro to explore this topic. We won't cover Thrift or Protocol
Buffers in this book, but both are solid technologies; if the topic of data serialization interests
you, check out their home pages for more information.
Introducing Avro
Avro, with its home page at http://avro.apache.org, is a data-persistence framework
with bindings for many programming languages. It creates a binary structured format
that is both compressible and splittable, meaning it can be efficiently used as the input
to MapReduce jobs.
Avro allows the definition of hierarchical data structures; so, for example, we can create a
record that contains an array, an enumerated type, and a subrecord. We can create these
files in any programming language, process them in Hadoop, and have the result read by
a third language.
We'll talk about these aspects of language independence over the next sections, but this
ability to express complex structured types is also very valuable. Even if we are using only
Java, we could employ Avro to allow us to pass complex data structures in and out of
mappers and reducers. Even things like graph nodes!
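As a purely illustrative sketch of that flexibility (this schema is not used elsewhere in the chapter and the names are invented), a record containing an array, an enumerated type, and a subrecord might be declared as follows:
{ "type": "record",
  "name": "Example_Record",
  "fields" : [
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "status", "type": {"type": "enum", "name": "Status",
      "symbols": ["NEW", "ACTIVE", "CLOSED"]}},
    {"name": "owner", "type": {"type": "record", "name": "Owner",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "id", "type": "long"}
      ]}}
  ]
}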
Time for action – getting and installing Avro
Let's download Avro and get it installed on our system.
1. Download the latest stable version of Avro from http://avro.apache.org/
releases.html.
2. Download the latest version of the ParaNamer library from http://paranamer.
codehaus.org.
3. Add the classes to the build classpath used by the Java compiler.
$ export CLASSPATH=avro-1.7.2.jar:${CLASSPATH}
$ export CLASSPATH=avro-mapred-1.7.2.jar:${CLASSPATH}
$ export CLASSPATH=paranamer-2.5.jar:${CLASSPATH}
4. Add existing JAR files from the Hadoop distribution to the build classpath.
$ export CLASSPATH=${HADOOP_HOME}/lib/jackson-core-asl-1.8.8.jar:${CLASSPATH}
$ export CLASSPATH=${HADOOP_HOME}/lib/jackson-mapper-asl-1.8.8.jar:${CLASSPATH}
$ export CLASSPATH=${HADOOP_HOME}/lib/commons-cli-1.2.jar:${CLASSPATH}
5. Add the new JAR files to the Hadoop lib directory.
$ cp avro-1.7.2.jar ${HADOOP_HOME}/lib
$ cp avro-mapred-1.7.2.jar ${HADOOP_HOME}/lib
$ cp paranamer-2.5.jar ${HADOOP_HOME}/lib
What just happened?
Setting up Avro is a little involved; it is a much newer project than the other Apache tools
we'll be using, so it requires more than a single download of a tarball.
We download the Avro and Avro-mapred JAR files from the Apache website. There is also
a dependency on ParaNamer that we download from its home page at codehaus.org.
The ParaNamer home page has a broken download link at the time of writing;
as an alternative, try the following link:
http://search.maven.org/remotecontent?filepath=com/
thoughtworks/paranamer/paranamer/2.5/paranamer-2.5.jar
After downloading these JAR files, we need to add them to the classpath used by our
environment, primarily for the Java compiler. We also need to
add to the build classpath several packages that ship with Hadoop because they are
required to compile and run Avro code.
Finally, we copy the three new JAR files into the Hadoop lib directory on each host
in the cluster to make the classes available to the map and reduce tasks at
runtime. We could distribute these JAR files through other mechanisms, but this is
the most straightforward means.
Avro and schemas
One advantage Avro has over tools such as Thrift and Protocol Buffers is the way it approaches
the schema describing an Avro datafile. While the other tools always require the schema to be
available as a distinct resource, Avro datafiles encode the schema in their header, which allows
the code to parse the files without ever seeing a separate schema file.
Avro supports, but does not require, code generation that produces code tailored to a specific
data schema. This is an optimization that is valuable when possible but not a necessity.
We can therefore write a series of Avro examples that never actually use the datafile schema,
but we'll only do that for parts of the process. In the following examples, we will define a
schema that represents a cut-down version of the UFO sighting records we used previously.
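If you want to see an embedded schema for yourself, the avro-tools jar that is downloadable alongside the Avro JARs has a getschema command; the exact jar name below is an assumption to match against the release you fetched, and the datafile name is a placeholder:
$ java -jar avro-tools-1.7.2.jar getschema <datafile.avro>
Run against any Avro datafile (such as the one we create later in this chapter), it should print the JSON schema stored in the file header.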
Time for action – defining the schema
Let's now create this simplified UFO schema in a single Avro schema file.
Create the following as ufo.avsc:
{ "type": "record",
"name": "UFO_Sighting_Record",
"fields" : [
{"name": "sighting_date", "type": "string"},
{"name": "city", "type": "string"},
{"name": "shape", "type": ["null", "string"]},
{"name": "duration", "type": "float"}
]
}
What just happened?
As can be seen, Avro uses JSON in its schemas, which are usually saved with the .avsc
extension. We create here a schema for a format that has four fields, as follows:
The sighting_date field of type string to hold a date of the form yyyy-mm-dd
The city field of type string that will contain the name of the city where the
sighting occurred
The shape field, an optional field of type string, that represents the UFO's shape
The duration field, which gives a representation of the sighting duration in
fractional minutes
With the schema defined, we will now create some sample data.
Time for action – creating the source Avro data with Ruby
Let's create the sample data using Ruby to demonstrate the cross-language capabilities
of Avro.
1. Add the rubygems package:
$ sudo apt-get install rubygems
2. Install the Avro gem:
$ gem install avro
3. Create the following as generate.rb:
require 'rubygems'
require 'avro'
file = File.open('sightings.avro', 'wb')
schema = Avro::Schema.parse(File.open("ufo.avsc", "rb").read)
writer = Avro::IO::DatumWriter.new(schema)
dw = Avro::DataFile::Writer.new(file, writer, schema)
dw << {"sighting_date" => "2012-01-12", "city" => "Boston", "shape" => "diamond", "duration" => 3.5}
dw << {"sighting_date" => "2011-06-13", "city" => "London", "shape" => "light", "duration" => 13}
dw << {"sighting_date" => "1999-12-31", "city" => "New York", "shape" => "light", "duration" => 0.25}
dw << {"sighting_date" => "2001-08-23", "city" => "Las Vegas", "shape" => "cylinder", "duration" => 1.2}
dw << {"sighting_date" => "1975-11-09", "city" => "Miami", "duration" => 5}
dw << {"sighting_date" => "2003-02-27", "city" => "Paris", "shape" => "light", "duration" => 0.5}
dw << {"sighting_date" => "2007-04-12", "city" => "Dallas", "shape" => "diamond", "duration" => 3.5}
dw << {"sighting_date" => "2009-10-10", "city" => "Milan", "shape" => "formation", "duration" => 0}
dw << {"sighting_date" => "2012-04-10", "city" => "Amsterdam", "shape" => "blur", "duration" => 6}
dw << {"sighting_date" => "2006-06-15", "city" => "Minneapolis", "shape" => "saucer", "duration" => 0.25}
dw.close
4. Run the program and create the datafile:
$ ruby generate.rb
What just happened?
Before we use Ruby, we ensure the rubygems package is installed on our Ubuntu host.
We then install the preexisting Avro gem for Ruby. This provides the libraries we need
to read and write Avro files from within the Ruby language.
The Ruby script itself simply reads the previously created schema and creates a datafile
with 10 test records. We then run the program to create the data.
This is not a Ruby tutorial, so I will leave analysis of the Ruby API as an exercise for the
reader; its documentation can be found at http://rubygems.org/gems/avro.
Time for action – consuming the Avro data with Java
Now that we have some Avro data, let's write some Java code to consume it:
1. Create the following as InputRead.java:
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
public class InputRead
{
public static void main(String[] args) throws IOException
{
String filename = args[0] ;
File file=new File(filename) ;
DatumReader<GenericRecord> reader= new
GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord>dataFileReader=new
DataFileReader<GenericRecord>(file,reader);
while (dataFileReader.hasNext())
{
GenericRecord result=dataFileReader.next();
String output = String.format("%s %s %s %f",
result.get("sighting_date"), result.get("city"),
result.get("shape"), result.get("duration")) ;
System.out.println(output) ;
}
}
}
2. Compile and run the program:
$ javac InputRead.java
$ java InputRead sightings.avro
The output will be as shown in the following screenshot:
What just happened?
We created the Java class InputRead, which takes the filename passed as a
command-line argument and parses this as an Avro datafile. When Avro reads
from a datafile, each individual element is called a datum and each datum will
follow the structure defined in the schema.
In this case, we don't use an explicit schema; instead, we read each datum into the
GenericRecord class, and from this extract each field by explicitly retrieving it by name.
The GenericRecord class is a very flexible class in Avro; it can be used to wrap any record
structure, such as our UFO-sighting type. Avro also supports primitive types such as integers,
floats, and booleans as well as other structured types such as arrays and enums. In these
examples, we'll use records as the most common structure, but this is only a convenience.
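If you do want to read against an explicit schema, the following small variant is a sketch (not from the book; the class name is illustrative) that supplies ufo.avsc as the reader schema, so Avro resolves it against the writer schema stored in the file header:
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class SchemaRead
{
    public static void main(String[] args) throws IOException
    {
        // The schema we expect to read; the file's own (writer) schema is taken from its header
        Schema readerSchema = new Schema.Parser().parse(new File("ufo.avsc"));
        DatumReader<GenericRecord> reader =
            new GenericDatumReader<GenericRecord>(readerSchema);
        DataFileReader<GenericRecord> dataFileReader =
            new DataFileReader<GenericRecord>(new File(args[0]), reader);
        while (dataFileReader.hasNext())
        {
            GenericRecord record = dataFileReader.next();
            System.out.println(record.get("city") + " " + record.get("shape"));
        }
        dataFileReader.close();
    }
}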
Using Avro within MapReduce
Avro's support for MapReduce revolves around several Avro-specific variants of familiar
classes. Whereas we'd normally expect a new datafile format to be supported
in Hadoop through new InputFormat and OutputFormat classes, here we use AvroJob,
AvroMapper, and AvroReducer instead of the non-Avro versions. AvroJob expects Avro
datafiles as its input and output, so instead of specifying input and output format types,
we configure it with details of the input and output Avro schemas.
The main difference for our mapper and reducer implementations is the types used. Avro,
by default, has a single input and output, whereas we're used to our Mapper and Reducer
classes having a key/value input and a key/value output. Avro also introduces the Pair class,
which is often used to emit intermediate key/value data.
Avro does also support AvroKey and AvroValue, which can wrap other types, but we'll not
use those in the following examples.
Time for action – generating shape summaries in MapReduce
In this section we will write a mapper that takes as input the UFO sighting record we defined
earlier. It will output the shape and a count of 1, and the reducer will take these shape and
count records and produce a new structured Avro datafile type containing the final counts
for each UFO shape. Perform the following steps:
1. Copy the sightings.avro file to HDFS.
$ hadoop fs -mkdir avroin
$ hadoop fs -put sightings.avro avroin/sightings.avro
2. Create the following as AvroMR.java:
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.Schema.Type;
import org.apache.avro.mapred.*;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.io.* ;
import org.apache.hadoop.util.*;
// Output record definition
class UFORecord
{
UFORecord()
{
}
public String shape ;
public long count ;
}
public class AvroMR extends Configured implements Tool
{
// Create schema for map output
public static final Schema PAIR_SCHEMA =
Pair.getPairSchema(Schema.create(Schema.Type.STRING),
Schema.create(Schema.Type.LONG));
// Create schema for reduce output
public final static Schema OUTPUT_SCHEMA =
ReflectData.get().getSchema(UFORecord.class);
@Override
public int run(String[] args) throws Exception
{
JobConf conf = new JobConf(getConf(), getClass());
conf.setJobName("UFO count");
String[] otherArgs = new GenericOptionsParser(conf, args).
getRemainingArgs();
if (otherArgs.length != 2)
{
System.err.println("Usage: avro UFO counter <in><out>");
System.exit(2);
}
FileInputFormat.addInputPath(conf, new Path(otherArgs[0]));
Path outputPath = new Path(otherArgs[1]);
FileOutputFormat.setOutputPath(conf, outputPath);
outputPath.getFileSystem(conf).delete(outputPath);
Schema input_schema =
Schema.parse(getClass().getResourceAsStream("ufo.avsc"));
AvroJob.setInputSchema(conf, input_schema);
AvroJob.setMapOutputSchema(conf,
Pair.getPairSchema(Schema.create(Schema.Type.STRING),
Schema.create(Schema.Type.LONG)));
AvroJob.setOutputSchema(conf, OUTPUT_SCHEMA);
AvroJob.setMapperClass(conf, AvroRecordMapper.class);
AvroJob.setReducerClass(conf, AvroRecordReducer.class);
conf.setInputFormat(AvroInputFormat.class) ;
JobClient.runJob(conf);
return 0 ;
}
public static class AvroRecordMapper extends
AvroMapper<GenericRecord, Pair<Utf8, Long>>
{
@Override
public void map(GenericRecord in, AvroCollector<Pair<Utf8,
Long>> collector, Reporter reporter) throws IOException
{
Pair<Utf8,Long> p = new Pair<Utf8,Long>(PAIR_SCHEMA) ;
Utf8 shape = (Utf8)in.get("shape") ;
if (shape != null)
{
p.set(shape, 1L) ;
collector.collect(p);
}
}
}
public static class AvroRecordReducer extends
AvroReducer<Utf8,
Long, GenericRecord>
{
public void reduce(Utf8 key, Iterable<Long> values,
AvroCollector<GenericRecord> collector,
Reporter reporter) throws IOException
{
long sum = 0;
for (Long val : values)
{
sum += val;
}
GenericRecord value = new
GenericData.Record(OUTPUT_SCHEMA);
value.put("shape", key);
value.put("count", sum);
collector.collect(value);
}
}
public static void main(String[] args) throws Exception
{
int res = ToolRunner.run(new Configuration(), new AvroMR(),
args);
System.exit(res);
}
}
3. Compile and run the job:
$ javac AvroMR.java
$ jar -cvf avroufo.jar *.class ufo.avsc
$ hadoop jar ~/classes/avroufo.jar AvroMR avroin avroout
4. Examine the output directory:
$ hadoop fs -ls avroout
Found 3 items
-rw-r--r-- 1 … /user/hadoop/avroout/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 … /user/hadoop/avroout/_logs
-rw-r--r-- 1 … /user/hadoop/avroout/part-00000.avro
5. Copy the output file to the local filesystem:
$ hadoop fs -get /user/hadoop/avroout/part-00000.avro result.avro
What just happened?
We created the Job class and examined its various components. The actual logic within the
Mapper and Reducer classes is relatively straightforward: the Mapper class just extracts
the shape column and emits it with a count of 1; the reducer then counts the total number
of entries for each shape. The interesting aspects are around the defined input and output
types of the Mapper and Reducer classes and how the job is configured.
The Mapper class has an input type of GenericRecord and an output type of Pair. The
Reducer class has a corresponding input type of Pair and output type of GenericRecord.
The GenericRecord class passed to the Mapper class wraps a datum that is the UFO
sighting record represented in the input file. This is how the Mapper class is able to retrieve
the shape field by name.
Recall that GenericRecords may or may not be explicitly created with a schema, and in
either case the structure can be determined from the datafile. For the GenericRecord
output by the Reducer class, we do pass a schema but use a new mechanism for its creation.
Within the preceding code, we created the additional UFORecord class and used
Avro reflection to generate its schema dynamically at runtime. We were then able to use this
schema to create a GenericRecord class specialized to wrap that particular record type.
Between the Mapper and Reducer classes we use the Avro Pair type to hold a key and
value pair. This allows us to express the same logic for the Mapper and Reducer classes
that we used in the original WordCount example back in Chapter 2, Getting Hadoop Up
and Running; the Mapper class emits singleton counts for each value and the reducer
sums these into an overall total for each shape.
In addition to the Mapper and Reducer classes' input and output, there is some
configuration unique to a job processing Avro data:
Schema input_schema = Schema.parse(getClass().
getResourceAsStream("ufo.avsc")) ;
AvroJob.setInputSchema(conf, input_schema);
AvroJob.setMapOutputSchema(conf, Pair.getPairSchema(Schema.
create(Schema.Type.STRING), Schema.create(Schema.Type.LONG)));
AvroJob.setOutputSchema(conf, OUTPUT_SCHEMA);
AvroJob.setMapperClass(conf, AvroRecordMapper.class);
AvroJob.setReducerClass(conf, AvroRecordReducer.class);
These conguraon elements demonstrate the cricality of schema denion to Avro;
though we can do without it, we must set the expected input and output schema types. Avro
will validate the input and output against the specied schemas, so there is a degree of data
type safety. For the other elements, such as seng up the Mapper and Reducer classes,
we simply set those on AvroJob instead of the more generic classes, and once done, the
MapReduce framework will perform appropriately.
This example is also the rst me we've explicitly implemented the Tool interface. When
running the Hadoop command-line program, there are a series of arguments (such as -D)
that are common across all the mulple subcommands. If a job class implements the Tool
interface as menoned in the previous secon, it automacally gets access to any of these
standard opons passed on the command line. It's a useful mechanism that prevents lots of
code duplicaon.
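A minimal, self-contained sketch of the pattern (not the book's code; the class and property names are illustrative) looks like this. ToolRunner strips the generic options such as -D key=value and places them in the configuration before calling run():
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolSkeleton extends Configured implements Tool
{
    @Override
    public int run(String[] args) throws Exception
    {
        // getConf() already reflects any -D overrides supplied on the command line
        Configuration conf = getConf();
        System.out.println("example.setting = " + conf.get("example.setting", "unset"));
        return 0;
    }

    public static void main(String[] args) throws Exception
    {
        System.exit(ToolRunner.run(new Configuration(), new ToolSkeleton(), args));
    }
}
Invoking it as, for example, hadoop jar tool.jar ToolSkeleton -D example.setting=foo would print the overridden value without any argument-parsing code of our own.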
Time for action – examining the output data with Ruby
Now that we have the output data from the job, let's examine it again using Ruby.
1. Create the following as read.rb:
require 'rubygems'
require 'avro'
file = File.open('result.avro', 'rb')
reader = Avro::IO::DatumReader.new()
dr = Avro::DataFile::Reader.new(file, reader)
dr.each {|record|
print record["shape"]," ",record["count"],"\n"
}
dr.close
2. Examine the created result file.
$ ruby read.rb
blur 1
cylinder 1
diamond 2
formation 1
light 3
saucer 1
What just happened?
As before, we'll not analyze the Ruby Avro API. The example created a Ruby script that
opens an Avro datafile, iterates through each datum, and displays it based on explicitly
named fields. Note that the script does not have access to the schema for the datafile;
the information in the header provides enough data to allow each field to be retrieved.
Time for action – examining the output data with Java
To show that the data is accessible from multiple languages, let's also display the job output
using Java.
1. Create the following as OutputRead.java:
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
public class OutputRead
{
public static void main(String[] args) throws IOException
{
String filename = args[0] ;
File file=new File(filename) ;
DatumReader<GenericRecord> reader= new
GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord>dataFileReader=new
DataFileReader<GenericRecord>(file,reader);
while (dataFileReader.hasNext())
{
GenericRecord result=dataFileReader.next();
String output = String.format("%s %d",
result.get("shape"), result.get("count")) ;
System.out.println(output) ;
}
}
}
2. Compile and run the program:
$ javac OutputRead.java
$ java OutputRead result.avro
blur 1
cylinder 1
diamond 2
formation 1
light 3
saucer 1
What just happened?
We added this example to show the Avro data being read by more than one language.
The code is very similar to the earlier InputRead class; the only difference is that the
named fields are used to display each datum as it is read from the datafile.
Have a go hero – graphs in Avro
As previously mentioned, we worked hard to reduce representation-related complexity in
our GraphPath class. But with mappings to and from flat lines of text and objects, there
was an overhead in managing these transformations.
With its support for nested complex types, Avro can natively support a representation of
a node that is much closer to the runtime object. Modify the GraphPath class job to read
and write the graph representation to an Avro datafile comprising a datum for each node.
The following example schema may be a good starting point, but feel free to enhance it:
{ "type": "record",
"name": "Graph_representation",
"fields" : [
{"name": "node_id", "type": "int"},
{"name": "neighbors", "type": "array", "items:"int" },
{"name": "distance", "type": "int"},
{"name": "status", "type": "enum",
"symbols": ["PENDING", "CURRENT", "DONE"
},]
]
}
Going forward with Avro
There are many features of Avro we did not cover in this case study. We focused only on its
value as an at-rest data representation. It can also be used within a remote procedure call
(RPC) framework and can optionally be used as the default RPC format in Hadoop 2.0. We
didn't use Avro's code generation facilities that produce a much more domain-focused API.
Nor did we cover issues such as Avro's ability to support schema evolution that, for example,
allows new fields to be added to recent records without invalidating old datums or breaking
existing clients. It's a technology you are very likely to see more of in the future.
Summary
This chapter has used three case studies to highlight some more advanced aspects of
Hadoop and its broader ecosystem. In particular, we covered the nature of join-type
problems and where they are seen, how reduce-side joins can be implemented with
relative ease but with an efficiency penalty, and how to use optimizations to avoid
full joins on the map side by pushing data into the Distributed Cache.
We then learned how full map-side joins can be implemented, but require significant input
data processing; how other tools such as Hive and Pig should be investigated if joins are a
frequently encountered use case; and how to think about complex types like graphs and
how they can be represented in a way that can be used in MapReduce.
We also saw techniques for breaking graph algorithms into multistage MapReduce jobs,
the importance of language-independent data types, how Avro can be used for both
language independence as well as complex Java-consumed types, and the Avro extensions
to the MapReduce APIs that allow structured types to be used as the input and output of
MapReduce jobs.
This now concludes our coverage of the programmatic aspects of the Hadoop MapReduce
framework. We will now move on in the next two chapters to explore how to manage and
scale a Hadoop environment.
6
When Things Break
One of the main promises of Hadoop is resilience to failure and an ability to
survive failures when they do happen. Tolerance to failure will be the focus
of this chapter.
In particular, we will cover the following topics:
How Hadoop handles failures of DataNodes and TaskTrackers
How Hadoop handles failures of the NameNode and JobTracker
The impact of hardware failure on Hadoop
How to deal with task failures caused by software bugs
How dirty data can cause tasks to fail and what to do about it
Along the way, we will deepen our understanding of how the various components
of Hadoop fit together and identify some areas of best practice.
Failure
With many technologies, the steps to be taken when things go wrong are rarely covered in
much of the documentation and are often treated as topics only of interest to the experts.
With Hadoop, it is much more front and center; much of the architecture and design of
Hadoop is predicated on executing in an environment where failures are both frequent
and expected.
Embrace failure
In recent years, a different mindset than the traditional one has been described by the term
embrace failure. Instead of hoping that failure does not happen, accept the fact that it will
and know how your systems and processes will respond when it does.
Or at least don't fear it
That's possibly a stretch, so instead, our goal in this chapter is to make you feel more
comfortable about failures in the system. We'll be killing the processes of a running cluster,
intentionally causing the software to fail, pushing bad data into our jobs, and generally
causing as much disruption as we can.
Don't try this at home
Often when trying to break a system, a test instance is abused, leaving the operational
system protected from the disruption. We will not advocate doing the things given in this
chapter to an operational Hadoop cluster, but the fact is that apart from one or two very
specific cases, you could. The goal is to understand the impact of the various types of failures
so that when they do happen on the business-critical system, you will know whether it is a
problem or not. Fortunately, the majority of cases are handled for you by Hadoop.
Types of failure
We will generally categorize failures into the following five types:
Failure of a node, that is, DataNode or TaskTracker process
Failure of a cluster's masters, that is, NameNode or JobTracker process
Failure of hardware, that is, host crash, hard drive failure, and so on
Failure of individual tasks within a MapReduce job due to software errors
Failure of individual tasks within a MapReduce job due to data problems
We will explore each of these in turn in the following sections.
Hadoop node failure
The first class of failure that we will explore is the unexpected termination of the individual
DataNode and TaskTracker processes. Given Hadoop's claims of managing system availability
through survival of failures on its commodity hardware, we can expect this area to be very
solid. Indeed, as clusters grow to hundreds or thousands of hosts, failures of individual
nodes are likely to become quite commonplace.
Before we start killing things, let's introduce a new tool and set up the cluster properly.
The dfsadmin command
As an alternative tool to constantly viewing the HDFS web UI to determine the cluster status,
we will use the dfsadmin command-line tool:
$ hadoop dfsadmin
This will give a list of the various options the command can take; for our purposes we'll
be using the -report option. This gives an overview of the overall cluster state, including
configured capacity, nodes, and files as well as specific details about each configured node.
Cluster setup, test files, and block sizes
We will need a fully distributed cluster for the following activities; refer to the setup
instructions given earlier in the book. The screenshots and examples that follow use a
cluster of one host for the JobTracker and NameNode and four slave nodes for running
the DataNode and TaskTracker processes.
Remember that you don't need physical hardware for each node;
we use virtual machines for our cluster.
In normal usage, 64 MB is the usual configured block size for a Hadoop cluster. For
our testing purposes, that is terribly inconvenient as we'll need pretty large files to get
meaningful block counts across our multinode cluster.
What we can do is reduce the configured block size; in this case, we will use 4 MB. Make the
following modifications to the hdfs-site.xml file within the Hadoop conf directory:
<property>
<name>dfs.block.size</name>
<value>4194304</value>
</property>
<property>
<name>dfs.namenode.logging.level</name>
<value>all</value>
</property>
The first property makes the required change to the block size and the second one increases
the NameNode logging level to make some of the block operations more visible.
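Once a file has been written with the new setting (such as the test file we copy shortly), you can confirm its block size directly. This is a quick check rather than something from the book; the %o format field should print the block size in bytes, but verify the format options with hadoop fs -help on your version:
$ hadoop fs -stat "%o" file1.data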
Both these sengs are appropriate for this test setup but would rarely be
seen on a producon cluster. Though the higher NameNode logging may be
required if a parcularly dicult problem is being invesgated, it is highly
unlikely you would ever want a block size as small as 4 MB. Though the
smaller block size will work ne, it will impact Hadoop's eciency.
We also need a reasonably-sized test le that will comprise of mulple 4 MB blocks. We
won't actually be using the content of the le, so the type of le is irrelevant. But you should
copy the largest le you can onto HDFS for the following secons. We used a CD ISO image:
$ Hadoop fs –put cd.iso file1.data
Fault tolerance and Elastic MapReduce
The examples in this book are for a local Hadoop cluster because this allows some of the
failure mode details to be more explicit. EMR provides exactly the same failure tolerance
as the local cluster, so the failure scenarios described here apply equally to a local Hadoop
cluster and the one hosted by EMR.
Time for action – killing a DataNode process
Firstly, we'll kill a DataNode. Recall that the DataNode process runs on each host in the
HDFS cluster and is responsible for the management of blocks within the HDFS filesystem.
Because Hadoop, by default, uses a replication factor of 3 for blocks, we should expect a
single DataNode failure to have no direct impact on availability; rather, it will result in some
blocks temporarily falling below the replication threshold. Execute the following steps to
kill a DataNode process:
1. Firstly, check on the original status of the cluster and check whether everything is
healthy. We'll use the dfsadmin command for this:
$ hadoop dfsadmin -report
Configured Capacity: 81376493568 (75.79 GB)
Present Capacity: 61117323920 (56.92 GB)
DFS Remaining: 59576766464 (55.49 GB)
DFS Used: 1540557456 (1.43 GB)
DFS Used%: 2.52%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 4 (4 total, 0 dead)
Name: 10.0.0.102:50010
Decommission Status : Normal
Configured Capacity: 20344123392 (18.95 GB)
DFS Used: 403606906 (384.91 MB)
Non DFS Used: 5063119494 (4.72 GB)
DFS Remaining: 14877396992(13.86 GB)
DFS Used%: 1.98%
DFS Remaining%: 73.13%
Last contact: Sun Dec 04 15:16:27 PST 2011
Now log onto one of the nodes and use the jps command to determine the process
ID of the DataNode process:
$ jps
2085 TaskTracker
2109 Jps
1928 DataNode
2. Use the process ID (PID) of the DataNode process and kill it:
$ kill -9 1928
3. Check that the DataNode process is no longer running on the host:
$ jps
2085 TaskTracker
4. Check the status of the cluster again by using the dfsadmin command:
$ hadoop dfsadmin -report
Configured Capacity: 81376493568 (75.79 GB)
Present Capacity: 61117323920 (56.92 GB)
DFS Remaining: 59576766464 (55.49 GB)
DFS Used: 1540557456 (1.43 GB)
DFS Used%: 2.52%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 4 (4 total, 0 dead)
5. The key lines to watch are the lines reporting on blocks, live nodes, and the last
contact time for each node. Once the last contact time for the dead node is around
10 minutes ago, use the command more frequently until the block and live node values
change:
$ hadoop dfsadmin -report
Configured Capacity: 61032370176 (56.84 GB)
Present Capacity: 46030327050 (42.87 GB)
DFS Remaining: 44520288256 (41.46 GB)
DFS Used: 1510038794 (1.41 GB)
DFS Used%: 3.28%
Under replicated blocks: 12
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 3 (4 total, 1 dead)
6. Repeat the process unl the count of under-replicated blocks is once again 0:
$ Hadoop dfsadmin -report
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 3 (4 total, 1 dead)
What just happened?
The high-level story is prey straighorward; Hadoop recognized the loss of a node and
worked around the problem. However, quite a lot is going on to make that happen.
When we killed the DataNode process, the process on that host was no longer available to
serve or receive data blocks as part of the read/write operations. However, we were not
actually accessing the filesystem at the time, so how did the NameNode process know this
particular DataNode was dead?
NameNode and DataNode communication
The answer lies in the constant communication between the NameNode and DataNode
processes that we have alluded to once or twice but never really explained. This occurs through
a constant series of heartbeat messages from the DataNode reporting on its current state
and the blocks it holds. In return, the NameNode gives instructions to the DataNode, such as
notification of the creation of a new file or an instruction to retrieve a block from another node.
It all begins when the NameNode process starts up and begins receiving status messages
from the DataNodes. Recall that each DataNode knows the location of its NameNode and
will continuously send status reports. These messages list the blocks held by each DataNode
and from this, the NameNode is able to construct a complete mapping that allows it to relate
files and directories to the blocks from which they are comprised and the nodes on which
they are stored.
The NameNode process monitors the last time it received a heartbeat from each DataNode
and, after a threshold is reached, it assumes the DataNode is no longer functional and marks
it as dead.
The exact threshold after which a DataNode is assumed to be dead is
not configurable as a single HDFS property. Instead, it is calculated from
several other properties, such as those defining the heartbeat interval. As we'll
see later, things are a little easier in the MapReduce world as the timeout
for TaskTrackers is controlled by a single configuration property.
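As a rough sketch of that calculation in Hadoop 1.x (treat the property names and defaults below as assumptions to verify against your distribution), the threshold works out to 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval, which with the usual defaults is around ten and a half minutes:
<!-- hdfs-site.xml; values shown are the usual defaults, for illustration only -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
  <!-- seconds between DataNode heartbeats -->
</property>
<property>
  <name>heartbeat.recheck.interval</name>
  <value>300000</value>
  <!-- milliseconds between NameNode liveness checks -->
</property>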
Once a DataNode is marked as dead, the NameNode process determines the blocks which
were held on that node and have now fallen below their replication target. In the default
case, each block held on the killed node would have been one of the three replicas, so each
block for which the node held a replica will now have only two replicas across the cluster.
In the preceding example, we captured the state when 12 blocks were still under-replicated,
that is, they did not have enough replicas across the cluster to meet the replication target.
When the NameNode process determines the under-replicated blocks, it assigns other
DataNodes to copy these blocks from the hosts where the existing replicas reside. In this
case we only had to re-replicate a very small number of blocks; in a live cluster, the failure of
a node can result in a period of high network traffic as the affected blocks are brought up to
their replication factor.
Note that if a failed node returns to the cluster, we have the situation of blocks having
more than the required number of replicas; in such a case the NameNode process will
send instructions to remove the surplus replicas. The specific replica to be deleted is
chosen randomly, so the result will be that the returned node will end up retaining
some of its blocks and deleting the others.
Have a go hero – NameNode log delving
We configured the NameNode process to log all its activities. Have a look through these
very verbose logs and attempt to identify the replication requests being sent.
The final output shows the status after the under-replicated blocks have been copied
to the live nodes. The cluster is down to only three live nodes but there are no
under-replicated blocks.
A quick way to restart the dead nodes across all hosts is to use the
start-all.sh script. It will attempt to start everything but is smart
enough to detect the running services, which means you get the dead
nodes restarted without the risk of duplicates.
Time for action – the replication factor in action
Let's repeat the preceding process, but this time, kill two DataNodes out of our cluster
of four. We will give an abbreviated walk-through of the activity as it is very similar to
the previous Time for action section:
1. Restart the dead DataNode and monitor the cluster until all nodes are marked
as live.
2. Pick two DataNodes, use the process ID, and kill the DataNode processes.
3. As done previously, wait for around 10 minutes then actively monitor the cluster
state via dfsadmin, paying particular attention to the reported number of under-
replicated blocks.
4. Wait until the cluster has stabilized with an output similar to the following:
Configured Capacity: 61032370176 (56.84 GB)
Present Capacity: 45842373555 (42.69 GB)
DFS Remaining: 44294680576 (41.25 GB)
DFS Used: 1547692979 (1.44 GB)
DFS Used%: 3.38%
Under replicated blocks: 125
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (4 total, 2 dead)
What just happened?
This is the same process as before; the difference is that due to two DataNode failures
there were significantly more blocks that fell below the replication factor, many going
down to a single remaining replica. Consequently, you should see more activity in the
reported number of under-replicated blocks as it first increases as nodes fail and
then drops as re-replication occurs. These events can also be seen in the NameNode logs.
Note that though Hadoop can use re-replication to bring those blocks with only a single
remaining replica up to two replicas, this still leaves the blocks in an under-replicated
state. With only two live nodes in the cluster, it is now impossible for any block to
meet the default replication target of three.
We have been truncating the dfsadmin output for space reasons; in particular, we have
been omitting the reported information for each node. However, let's take a look at the
first node in our cluster through the previous stages. Before we started killing any DataNode,
it reported the following:
Name: 10.0.0.101:50010
Decommission Status : Normal
Configured Capacity: 20344123392 (18.95 GB)
DFS Used: 399379827 (380.88 MB)
Non DFS Used: 5064258189 (4.72 GB)
DFS Remaining: 14880485376(13.86 GB)
DFS Used%: 1.96%
DFS Remaining%: 73.14%
Last contact: Sun Dec 04 15:16:27 PST 2011
Aer a single DataNode was killed and all blocks had been re-replicated as necessary, it
reported the following:
Name: 10.0.0.101:50010
Decommission Status : Normal
Configured Capacity: 20344123392 (18.95 GB)
DFS Used: 515236022 (491.37 MB)
Non DFS Used: 5016289098 (4.67 GB)
DFS Remaining: 14812598272(13.8 GB)
DFS Used%: 2.53%
DFS Remaining%: 72.81%
Last contact: Sun Dec 04 15:31:22 PST 2011
The thing to note is the increase in the local DFS storage used on the node. This shouldn't be a
surprise. With a dead node, the others in the cluster need to add some additional block
replicas and that will translate to a higher storage utilization on each.
Finally, the following is the node's report after two other DataNodes were killed:
Name: 10.0.0.101:50010
Decommission Status : Normal
Configured Capacity: 20344123392 (18.95 GB)
DFS Used: 514289664 (490.46 MB)
Non DFS Used: 5063868416 (4.72 GB)
DFS Remaining: 14765965312(13.75 GB)
DFS Used%: 2.53%
DFS Remaining%: 72.58%
Last contact: Sun Dec 04 15:43:47 PST 2011
With two dead nodes it may seem as if the remaining live nodes should consume even more
local storage space, but this isn't the case and it's yet again a natural consequence of the
replication factor.
If we have four nodes and a replication factor of 3, each block will have a replica on three
of the live nodes in the cluster. If a node dies, the blocks living on the other nodes are
unaffected, but any blocks with a replica on the dead node will need a new replica created.
However, with only three live nodes, each node will hold a replica of every block. If a second
node fails, the situation will result in under-replicated blocks and Hadoop does not have
anywhere to put the additional replicas. Since both remaining nodes already hold a replica
of each block, their storage utilization does not increase.
Time for action – intentionally causing missing blocks
The next step should be obvious; let's kill three DataNodes in quick succession.
This is the first of the activities we mentioned that you really should not do
on a production cluster. Although there will be no data loss if the steps are
followed properly, there is a period when the existing data is unavailable.
The following are the steps to kill three DataNodes in quick succession:
1. Restart all the nodes by using the following command:
$ start-all.sh
2. Wait unl Hadoop dfsadmin -report shows four live nodes.
3. Put a new copy of the test le onto HDFS:
$ Hadoop fs -put file1.data file1.new
4. Log onto three of the cluster hosts and kill the DataNode process on each.
5. Wait for the usual 10 minutes then start monitoring the cluster via dfsadmin unl
you get output similar to the following that reports the missing blocks:
Under replicated blocks: 123
Blocks with corrupt replicas: 0
Missing blocks: 33
-------------------------------------------------
Datanodes available: 1 (4 total, 3 dead)
6. Try and retrieve the test file from HDFS:
$ hadoop fs -get file1.new file1.new
11/12/04 16:18:05 INFO hdfs.DFSClient: No node available for
block: blk_1691554429626293399_1003 file=/user/hadoop/file1.new
11/12/04 16:18:05 INFO hdfs.DFSClient: Could not obtain block
blk_1691554429626293399_1003 from any node: java.io.IOException:
No live nodes contain current block
get: Could not obtain block: blk_1691554429626293399_1003 file=/
user/hadoop/file1.new
7. Restart the dead nodes using the start-all.sh script:
$ start-all.sh
8. Repeatedly monitor the status of the blocks:
$ hadoop dfsadmin -report | grep -i blocks
Under replicated blocks: 69
Blocks with corrupt replicas: 0
Missing blocks: 35
$ hadoop dfsadmin -report | grep -i blocks
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 30
9. Wait until there are no reported missing blocks then copy the test file onto
the local filesystem:
$ hadoop fs -get file1.new file1.new
10. Perform an MD5 check on this and the original file:
$ md5sum file1.*
f1f30b26b40f8302150bc2a494c1961d file1.data
f1f30b26b40f8302150bc2a494c1961d file1.new
What just happened?
After restarting the killed nodes, we copied the test file onto HDFS again. This isn't strictly
necessary as we could have used the existing file but, due to the shuffling of the replicas,
a clean copy gives the most representative results.
We then killed three DataNodes as before and waited for HDFS to respond. Unlike the
previous examples, killing this many nodes meant it was certain that some blocks would
have all of their replicas on the killed nodes. As we can see, this is exactly the result; the
remaining single-node cluster shows over a hundred blocks that are under-replicated
(obviously only one replica remains) but there are also 33 missing blocks.
Talking of blocks is a little abstract, so we then try to retrieve our test file which, as we
know, effectively has 33 holes in it. The attempt to access the file fails as Hadoop could
not find the missing blocks required to deliver the file.
We then restarted all the nodes and tried to retrieve the file again. This time it was
successful, but we took the added precaution of performing an MD5 cryptographic
check on the file to confirm that it was bitwise identical to the original one, which it is.
This is an important point: though node failure may result in data becoming unavailable,
there may not be a permanent data loss if the node recovers.
When data may be lost
Do not assume from this example that it's impossible to lose data in a Hadoop cluster. For
general use it is very hard, but disaster often has a habit of striking in just the wrong way.
As seen in the previous example, a parallel failure of a number of nodes equal to or greater
than the replication factor has a chance of resulting in missing blocks. In our example of
three dead nodes in a cluster of four, the chances were high; in a cluster of 1000, it would
be much lower but still non-zero. As the cluster size increases, so does the failure rate, and
having three node failures in a narrow window of time becomes more and more likely.
Conversely, the impact also decreases, but rapid multiple failures will always carry a
risk of data loss.
Another more insidious problem is recurring or partial failures, for example, when
power issues across the cluster cause nodes to crash and restart. It is possible for
Hadoop to end up chasing replication targets, constantly asking the recovering hosts
to replicate under-replicated blocks, and also seeing them fail midway through the task.
Such a sequence of events can also raise the potential of data loss.
Finally, never forget the human factor. Having a replication factor equal to the size of the
cluster (ensuring every block is on every node) won't help you when a user accidentally
deletes a file or directory.
The summary is that data loss through system failure is pretty unlikely but is possible through
almost inevitable human action. Replication is not a full alternative to backups; ensure that
you understand the importance of the data you process and the impact of the types of loss
discussed here.
The most catastrophic losses in a Hadoop cluster are actually caused by
NameNode failure and filesystem corruption; we'll discuss this topic in
some detail in the next chapter.
Block corruption
The reports from each DataNode also included a count of the corrupt blocks, which we
have not referred to. When a block is first stored, there is also a hidden file written to the
same HDFS directory containing cryptographic checksums for the block. By default, there
is a checksum for each 512-byte chunk within the block.
Whenever any client reads a block, it will also retrieve the list of checksums and compare
these to the checksums it generates on the block data it has read. If there is a checksum
mismatch, the block on that particular DataNode will be marked as corrupt and the client
will retrieve a different replica. On learning of the corrupt block, the NameNode will
schedule a new replica to be made from one of the existing uncorrupted replicas.
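If you want to inspect block health yourself, the fsck tool reports corrupt, missing, and under-replicated blocks for a path; the following invocation is a quick sketch and the path is illustrative:
$ hadoop fsck /user/hadoop -files -blocks -locations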
If the scenario seems unlikely, consider that faulty memory, disk drives, storage controllers, or
numerous other issues on an individual host could cause some corruption to a block as it is initially
being written, while it is being stored, or when it is being read. These are rare events and the chances
of the same corruption occurring on all DataNodes holding replicas of the same block
become exceptionally remote. However, remember, as previously mentioned, that replication
is not a full alternative to backup and if you need 100 percent data availability, you likely
need to think about off-cluster backup.
Time for action – killing a TaskTracker process
We've abused HDFS and its DataNode enough; now let's see what damage we can do to
MapReduce by killing some TaskTracker processes.
Though there is an mradmin command, it does not give the sort of status reports we are
used to with HDFS. So we'll use the MapReduce web UI (located by default on port 50030
on the JobTracker host) to monitor the MapReduce cluster health.
Perform the following steps:
1. Ensure everything is running via the start-all.sh script then point your browser
at the MapReduce web UI. The page should look like the following screenshot:
2. Start a long-running MapReduce job; the example pi estimator with large values
is great for this:
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar pi 2500 2500
3. Now log onto a cluster node and use jps to identify the TaskTracker process:
$ jps
21822 TaskTracker
3918 Jps
3891 DataNode
4. Kill the TaskTracker process:
$ kill -9 21822
5. Verify that the TaskTracker is no longer running:
$ jps
3918 Jps
3891 DataNode
6. Go back to the MapReduce web UI and after 10 minutes you should see that
the number of nodes and available map/reduce slots change as shown in the
following screenshot:
7. Monitor the job progress in the original window; it should be proceeding, even if
it is slow.
8. Restart the dead TaskTracker process:
$ start-all.sh
9. Monitor the MapReduce web UI. After a little time the number of nodes should
be back to its original number as shown in the following screenshot:
What just happened?
The MapReduce web interface provides a lot of information on both the cluster as well
as the jobs it executes. For our interests here, the important data is the cluster summary
that shows the currently executing number of map and reduce tasks, the total number of
submitted jobs, the number of nodes and their map and reduce capacity, and finally, any
blacklisted nodes.
The relationship of the JobTracker process to the TaskTracker process is quite different
from that between the NameNode and DataNode, but a similar heartbeat/monitoring
mechanism is used.
The TaskTracker process frequently sends heartbeats to the JobTracker, but instead of status
reports of block health, they contain progress reports of the assigned tasks and available
capacity. Each node has a configurable number of map and reduce task slots (the default
for each is two), which is why we see four nodes and eight map and reduce slots in the
first web UI screenshot.
When we kill the TaskTracker process, its lack of heartbeats is measured by the JobTracker
process and, after a configurable amount of time, the node is assumed to be dead and we
see the reduced cluster capacity reflected in the web UI.
The timeout for a TaskTracker process to be considered dead is modified by
the mapred.tasktracker.expiry.interval property, configured
in mapred-site.xml.
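A sketch of how that property might be set follows; the value is in milliseconds, and 600000 (ten minutes) is the usual default, but verify that against your distribution before relying on it:
<!-- mapred-site.xml -->
<property>
  <name>mapred.tasktracker.expiry.interval</name>
  <value>600000</value>
</property>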
When a TaskTracker process is marked as dead, the JobTracker process also considers its
in-progress tasks as failed and reassigns them to other nodes in the cluster. We see
this implicitly by watching the job proceed successfully despite a node being killed.
After the TaskTracker process is restarted, it sends a heartbeat to the JobTracker, which marks
it as alive and reintegrates it into the MapReduce cluster. We see this through the cluster node
and task slot capacity returning to their original values, as shown in the final screenshot.
Comparing the DataNode and TaskTracker failures
We'll not perform similar two or three node killing activities with TaskTrackers as the task
execution architecture renders individual TaskTracker failures relatively unimportant.
Because the TaskTracker processes are under the control and coordination of the JobTracker,
their individual failures have no direct effect other than to reduce the cluster execution
capacity. If a TaskTracker instance fails, the JobTracker will simply schedule the failed tasks on
a healthy TaskTracker process in the cluster. The JobTracker is free to reschedule tasks around
the cluster because the TaskTracker is conceptually stateless; a single failure does not affect
other parts of the job.
In contrast, loss of a DataNode, which is intrinsically stateful, can affect the persistent data
held on HDFS, potentially making it unavailable.
This highlights the nature of the various nodes and their relationship to the overall Hadoop
framework. The DataNode manages data, and the TaskTracker reads and writes that data.
Catastrophic failure of every TaskTracker would still leave us with a completely functional
HDFS; a similar failure of the NameNode process would leave a live MapReduce cluster that
is effectively useless (unless it was configured to use a different storage system).
Permanent failure
Our recovery scenarios so far have assumed that the dead node can be restarted on the
same physical host. But what if it can't due to the host having a critical failure? The answer is
simple; you can remove the host from the slaves file and Hadoop will no longer try to start a
DataNode or TaskTracker on that host. Conversely, if you get a replacement machine with a
different hostname, add this new host to the same file and run start-all.sh.
Note that the slaves file is only used by tools such as the start/stop and
slaves.sh scripts. You don't need to keep it updated on every node, but only
on the hosts where you generally run such commands. In practice, this is likely to
be either a dedicated head node or the host where the NameNode or JobTracker
processes run. We'll explore these setups in Chapter 7, Keeping Things Running.
Killing the cluster masters
Though the failure impact of DataNode and TaskTracker processes is different, each
individual node is relatively unimportant. Failure of any single TaskTracker or DataNode is
not a cause for concern; issues only occur if several fail, particularly in quick
succession. But we only have one JobTracker and NameNode; let's explore what happens
when they fail.
Time for action – killing the JobTracker
We'll first kill the JobTracker process, which we should expect to impact our ability to execute
MapReduce jobs but not affect the underlying HDFS filesystem.
1. Log on to the JobTracker host and kill its process.
2. Attempt to start a test MapReduce job such as Pi or WordCount:
$ hadoop jar wc.jar WordCount3 test.txt output
Starting Job
11/12/11 16:03:29 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9001. Already tried 0 time(s).
11/12/11 16:03:30 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9001. Already tried 1 time(s).
11/12/11 16:03:38 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9001. Already tried 9 time(s).
java.net.ConnectException: Call to /10.0.0.100:9001 failed on
connection exception: java.net.ConnectException: Connection
refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
3. Perform some HDFS operations:
$ hadoop fs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2011-12-11 19:19 /user
drwxr-xr-x - hadoop supergroup 0 2011-12-04 20:38 /var
$ hadoop fs -cat test.txt
This is a test file
What just happened?
After killing the JobTracker process we attempted to launch a MapReduce job. From the
walk-through in Chapter 2, Getting Hadoop Up and Running, we know that the client on
the machine where we are starting the job attempts to communicate with the JobTracker
process to initiate the job scheduling activities. But in this case there was no running
JobTracker, so this communication did not happen and the job failed.
We then performed a few HDFS operations to highlight the point made in the previous section;
a non-functional MapReduce cluster will not directly impact HDFS, which will still be
available to all clients and operations.
Starting a replacement JobTracker
The recovery of the MapReduce cluster is also pretty straightforward. Once the JobTracker
process is restarted, all subsequent MapReduce jobs are successfully processed.
Note that when the JobTracker was killed, any jobs that were in flight were lost and need to
be restarted. Watch out for temporary files and directories on HDFS; many MapReduce jobs
write temporary data to HDFS that is usually cleaned up on job completion. Failed jobs,
especially those that failed due to a JobTracker failure, are likely to leave such data behind
and this may require a manual clean-up.
Have a go hero – moving the JobTracker to a new host
But what happens if the host on which the JobTracker process was running has a fatal
hardware failure and cannot be recovered? In such situations you will need to start a new
JobTracker process on a different host. This requires all nodes to have their mapred-site.xml
file updated with the new location and the cluster restarted. Try this! We'll talk about it
more in the next chapter.
Time for action – killing the NameNode process
Let's now kill the NameNode process, which we should expect to directly stop us from
accessing HDFS and, by extension, prevent the MapReduce jobs from executing:
Don't try this on an operationally important cluster. Though the impact will
be short-lived, it effectively kills the entire cluster for a period of time.
1. Log onto the NameNode host and list the running processes:
$ jps
2372 SecondaryNameNode
2118 NameNode
2434 JobTracker
5153 Jps
2. Kill the NameNode process. Don't worry about the SecondaryNameNode; it can keep
running.
3. Try to access the HDFS filesystem:
$ hadoop fs -ls /
11/12/13 16:00:05 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9000. Already tried 0 time(s).
11/12/13 16:00:06 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9000. Already tried 1 time(s).
11/12/13 16:00:07 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9000. Already tried 2 time(s).
11/12/13 16:00:08 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9000. Already tried 3 time(s).
11/12/13 16:00:09 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9000. Already tried 4
time(s).
Bad connection to FS. command aborted.
4. Submit the MapReduce job:
$ hadoop jar hadoop/hadoop-examples-1.0.4.jar pi 10 100
Number of Maps = 10
Samples per Map = 100
11/12/13 16:00:35 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9000. Already tried 0 time(s).
11/12/13 16:00:36 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9000. Already tried 1 time(s).
11/12/13 16:00:37 INFO ipc.Client: Retrying connect to server:
/10.0.0.100:9000. Already tried 2 time(s).
java.lang.RuntimeException: java.net.ConnectException: Call
to /10.0.0.100:9000 failed on connection exception: java.net.
ConnectException: Connection refused
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.
java:371)
at org.apache.hadoop.mapred.FileInputFormat.
setInputPaths(FileInputFormat.java:309)
Caused by: java.net.ConnectException: Call to /10.0.0.100:9000
failed on connection exception: java.net.ConnectException:
Connection refused
5. Check the running processes:
$ jps
2372 SecondaryNameNode
5253 Jps
2434 JobTracker
Restart the NameNode:
$ start-all.sh
6. Access HDFS:
$ hadoop fs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2011-12-16 16:18 /user
drwxr-xr-x - hadoop supergroup 0 2011-12-16 16:23 /var
What just happened?
We killed the NameNode process and tried to access the HDFS filesystem. This of course
failed; without the NameNode there is no server to receive our filesystem commands.
We then tried to submit a MapReduce job and this also failed. From the abbreviated
exception stack trace you can see that, while trying to set up the input paths for the
job data, the job submission also tried and failed to connect to the NameNode.
We then confirmed that the JobTracker process was healthy and that it was the NameNode's
unavailability that caused the MapReduce job to fail.
Finally, we restarted the NameNode and confirmed that we could once again access
the HDFS filesystem.
Starting a replacement NameNode
With the dierences idened so far between the MapReduce and HDFS clusters, it
shouldn't be a surprise to learn that restarng a new NameNode on a dierent host is
not as simple as moving the JobTracker. To put it more starkly, having to move NameNode
due to a hardware failure is probably the worst crisis you can have with a Hadoop cluster.
Unless you have prepared carefully, the chance of losing all your data is very high.
That's quite a statement and we need to explore the nature of the NameNode process to
understand why this is the case.
The role of the NameNode in more detail
So far we've spoken of the NameNode process as the coordinator between the DataNode
processes and as the service responsible for ensuring that configuration parameters, such as
block replication values, are honored. This is an important set of tasks, but it's also very
operationally focused. The NameNode process also has the responsibility of managing
the HDFS filesystem metadata; a good analogy is to think of it as holding the equivalent
of the file allocation table in a traditional filesystem.
File systems, les, blocks, and nodes
When accessing HDFS you rarely care about blocks. You want to access a given file at a
certain location in the filesystem. To facilitate this, the NameNode process is required to
maintain numerous pieces of information:
The actual filesystem contents, the names of all the files, and their
containing directories
Additional metadata about each of these elements, such as size,
ownership, and replication factor
The mapping of which blocks hold the data for each file
The mapping of which nodes in the cluster hold which blocks and, from this, the
current replication state of each
All but the last of the preceding points are persistent data that must be maintained across
restarts of the NameNode process.
The single most important piece of data in the cluster – fsimage
The NameNode process stores two data structures to disk: the fsimage file and the edits
log of changes to it. The fsimage file holds the key filesystem attributes mentioned in the
previous section: the name and details of each file and directory on the filesystem and the
mapping of the blocks that correspond to each.
If the fsimage file is lost, you have a series of nodes holding blocks of data without any
knowledge of which blocks correspond to which part of which file. In fact, you don't even
know which files are supposed to be constructed in the first place. Loss of the fsimage file
leaves you with all the filesystem data but renders it effectively useless.
The fsimage file is read by the NameNode process at startup and is held and manipulated
in memory for performance reasons. To avoid changes to the filesystem being lost, any
modifications made are written to the edits log throughout the NameNode's uptime. The
next time it restarts, it looks for this log at startup and uses it to update the fsimage file,
which it then reads into memory.
This process can be optimized by the use of the SecondaryNameNode,
which we'll mention later.
DataNode startup
When a DataNode process starts up, it commences its heartbeat process by reporting to the
NameNode process on the blocks it holds. As explained earlier in this chapter, this is how the
NameNode process knows which node should be used to service a request for a given block.
If the NameNode process itself restarts, it uses the re-establishment of the heartbeats with
all the DataNode processes to construct its mapping of blocks to nodes.
With DataNode processes potentially coming in and out of the cluster, there is little use
in storing this mapping persistently, as the on-disk state would often be out of date
with the current reality. This is why the NameNode process does not persist which
blocks are held on which nodes.
Safe mode
If you look at the HDFS web UI or the output of dfsadmin shortly after starting an HDFS
cluster, you will see a reference to the cluster being in safe mode and the required threshold
of reported blocks before it will leave safe mode. This is the DataNode block reporting
mechanism at work.
As an additional safeguard, the NameNode process will hold the HDFS filesystem in a read-
only mode until it has confirmed that a given percentage of blocks meet their replication
threshold. In the usual case this will simply require all the DataNode processes to report in,
but if some have failed, the NameNode process will need to schedule some re-replication
before safe mode can be left.
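You can inspect and, if you are certain it is safe, manually control this state with the dfsadmin
tool. The following commands are a quick sketch of typical usage; the comments describe the
intent of each:
$ hadoop dfsadmin -safemode get     # report whether safe mode is currently on or off
$ hadoop dfsadmin -safemode wait    # block until the NameNode leaves safe mode
$ hadoop dfsadmin -safemode leave   # force an exit from safe mode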
SecondaryNameNode
The most unfortunately named entity in Hadoop is the SecondaryNameNode. When one
learns of the critical fsimage file for the first time, this thing called SecondaryNameNode
starts to sound like a helpful mitigation. Is it perhaps, as the name suggests, a second copy
of the NameNode process running on another host that can take over when the primary
fails? No, it isn't. The SecondaryNameNode has a very specific role; it periodically reads in the
state of the fsimage file and the edits log and writes out an updated fsimage file with the
changes in the log applied. This is a major time saver in terms of NameNode startup. If the
NameNode process has been running for a significant period of time, the edits log will be
huge and it will take a very long time (easily several hours) to apply all the changes to the old
fsimage file's state stored on the disk. The SecondaryNameNode facilitates a faster startup.
So what to do when the NameNode process has a critical failure?
Would it help to say don't panic? There are approaches to NameNode failure, and this is such
an important topic that we have an entire section on it in the next chapter. But for now, the
main point is that you can configure the NameNode process to write its fsimage file and
edits log to multiple locations. Typically, a network filesystem is added as a second location
to ensure a copy of the fsimage file exists outside the NameNode host.
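In Hadoop 1.x the property that controls this is dfs.name.dir in hdfs-site.xml, which accepts
a comma-separated list of directories; the NameNode writes its metadata to all of them. A
minimal sketch follows, in which both paths are placeholders for your environment and
/mnt/nfs/namenode is an assumed NFS mount point on the NameNode host:
<property>
<name>dfs.name.dir</name>
<value>/var/lib/hadoop/name,/mnt/nfs/namenode</value>
</property>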
But the process of moving to a new NameNode process on a new host requires manual
effort, and your Hadoop cluster is dead in the water until you do. This is something you want
to have a process for and to have tried (successfully!) in a test scenario. You really
don't want to be learning how to do this when your operational cluster is down, your CEO is
shouting at you, and the company is losing money.
BackupNode/CheckpointNode and NameNode HA
Hadoop 0.22 replaced SecondaryNameNode with two new components, BackupNode and
CheckpointNode. The latter of these is effectively a renamed SecondaryNameNode; it is
responsible for updating the fsimage file at regular checkpoints to decrease the NameNode
startup time.
The BackupNode, however, is a step closer to the goal of a fully functional hot-backup for
the NameNode. It receives a constant stream of filesystem updates from the NameNode
and its in-memory state is up-to-date at any point in time with the current state held in the
master NameNode. If the NameNode dies, the BackupNode is much more capable of being
brought into service as a new NameNode. The process isn't automatic and requires manual
intervention and a cluster restart, but it takes some of the pain out of a NameNode failure.
Remember that Hadoop 1.0 is a continuation of the version 0.20 branch, so it does not
contain the features mentioned previously.
Hadoop 2.0 will take these extensions to the next logical step: a fully automatic NameNode
failover from the current master NameNode to an up-to-date backup NameNode. This
NameNode High Availability (HA) is one of the most long-requested changes to the Hadoop
architecture and will be a welcome addition when complete.
Hardware failure
When we killed the various Hadoop components earlier, we were—in most cases—using
termination of the Hadoop processes as a proxy for the failure of the hosting physical
hardware. From experience, it is quite rare to see the Hadoop processes fail without
some underlying host issue causing the problem.
Host failure
Actual failure of the host is the simplest case to consider. A machine could fail due to a
critical hardware issue (failed CPU, blown power supply, stuck fans, and so on), causing
sudden failure of the Hadoop processes running on the host. Critical bugs in system-level
software (kernel panics, I/O locks, and so on) can also have the same effect.
Generally speaking, if the failure causes a host to crash, reboot, or otherwise become
unreachable for a period of time, we can expect Hadoop to act just as demonstrated
throughout this chapter.
Host corruption
A more insidious problem is when a host appears to be functioning but is in reality producing
corrupt results. Examples of this could be faulty memory resulting in corruption of data, or
disk sector errors resulting in data on the disk being damaged.
For HDFS, this is where the status reports of corrupted blocks that we discussed earlier come
into play.
For MapReduce there is no equivalent mechanism. Just as with most other software, the
TaskTracker relies on data being written and read correctly by the host and has no means
to detect corruption either in task execution or during the shuffle stage.
The risk of correlated failures
There is a phenomenon that most people don't consider until it bites them: sometimes the
cause of a failure will also result in subsequent failures and greatly increase the chance of
encountering a data loss scenario.
As an example, I once worked on a system that used four networking devices. One of these
failed and no one cared about it; there were three remaining devices, after all. Until they all
failed in an 18-hour period. It turned out they all contained hard drives from a faulty batch.
It doesn't have to be quite this exotic; more frequent causes will be faults in the
shared services or facilities. Network switches can fail, power distribution can spike, air
conditioning can fail, and equipment racks can short-circuit. As we'll see in the next chapter,
Hadoop doesn't assign blocks to random locations; it actively seeks to adopt a placement
strategy that provides some protection from such failures in shared services.
We are again talking about unlikely scenarios; most often a failed host is just that and not the
tip of a failure-crisis iceberg. However, remember to never discount the unlikely scenarios,
especially when taking clusters to progressively larger scale.
Task failure due to software
As menoned earlier, it is actually relavely rare to see the Hadoop processes themselves
crash or otherwise spontaneously fail. What you are likely to see more of in pracce are
failures caused by the tasks, that is faults in the map or reduce tasks that you are execung
on the cluster.
Failure of slow running tasks
We will rst look at what happens if tasks hang or otherwise appear to Hadoop to have
stopped making progress.
Time for action – causing task failure
Let's cause a task to fail; before we do, we will need to modify the default timeouts:
1. Add this conguraon property to mapred-site.xml:
<property>
<name>mapred.task.timeout</name>
<value>30000</value>
</property>
2. We will now modify our old friend WordCount from Chapter 3, Understanding
MapReduce. Copy WordCount3.java to a new file called WordCountTimeout.java
and add the following imports:
import java.util.concurrent.TimeUnit ;
import org.apache.hadoop.fs.FileSystem ;
import org.apache.hadoop.fs.FSDataOutputStream ;
3. Replace the map method with the following one:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
    String lockfile = "/user/hadoop/hdfs.lock";
    // Use a lock file on HDFS so that only the first task to run goes to sleep
    Configuration config = new Configuration();
    FileSystem hdfs = FileSystem.get(config);
    Path path = new Path(lockfile);

    if (!hdfs.exists(path)) {
        byte[] bytes = "A lockfile".getBytes();
        FSDataOutputStream out = hdfs.create(path);
        out.write(bytes, 0, bytes.length);
        out.close();
        TimeUnit.SECONDS.sleep(100);
    }

    String[] words = value.toString().split(" ");
    for (String str : words) {
        word.set(str);
        context.write(word, one);
    }
}
4. Compile the le aer changing the class name, jar it up, and execute it on
the cluster:
$ Hadoop jar wc.jar WordCountTimeout test.txt output
11/12/11 19:19:51 INFO mapred.JobClient: map 50% reduce 0%
11/12/11 19:20:25 INFO mapred.JobClient: map 0% reduce 0%
11/12/11 19:20:27 INFO mapred.JobClient: Task Id : attempt_2011121
11821_0004_m_000000_0, Status : FAILED
Task attempt_201112111821_0004_m_000000_0 failed to report status
for 32 seconds. Killing!
11/12/11 19:20:31 INFO mapred.JobClient: map 100% reduce 0%
11/12/11 19:20:43 INFO mapred.JobClient: map 100% reduce 100%
11/12/11 19:20:45 INFO mapred.JobClient: Job complete:
job_201112111821_0004
11/12/11 19:20:45 INFO mapred.JobClient: Counters: 18
11/12/11 19:20:45 INFO mapred.JobClient: Job Counters
What just happened?
We rst modied a default Hadoop property that manages how long a task can seemingly
make no progress before the Hadoop framework considers it for terminaon.
Then we modied WordCount3 to add some logic that causes the task to sleep for 100
seconds. We used a lock le on HDFS to ensure that only a single task instance sleeps.
If we just had the sleep statement in the map operaon without any checks, every
mapper would meout and the job would fail.
Have a go hero – HDFS programmatic access
We said we would not really deal with programmatic access to HDFS in this book.
However, take a look at what we have done here and browse through the Javadoc
for these classes. You will find that the interface largely follows the patterns for
access to a standard Java filesystem.
Then we compile, jar up the classes, and execute the job on the cluster. The first task goes
to sleep and, after exceeding the threshold we set (the value was specified in milliseconds),
Hadoop kills the task and reschedules another mapper to process the split assigned to the
failed task.
Hadoop's handling of slow-running tasks
Hadoop has a balancing act to perform here. It wants to terminate tasks that have become
stuck or, for other reasons, are running abnormally slowly; but sometimes complex tasks
simply take a long time. This is especially true if the task relies on any external resources
to complete its execution.
Hadoop looks for evidence of progress from a task when deciding how long it has been
idle/quiet/stuck. Generally this could be:
Emitting results
Writing values to counters
Explicitly reporting progress
For the latter, Hadoop provides the Progressable interface, which contains one method
of interest:
public void progress();
The Context class implements this interface, so any mapper or reducer can call context.
progress() to show it is alive and continuing to process.
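As a quick sketch of how this looks in practice, a mapper that spends a long time per record
can report progress from inside its loop; the slowOperation() call here is a hypothetical
stand-in for any long-running external work:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
    for (String record : value.toString().split(" ")) {
        slowOperation(record); // hypothetical long-running work, for example an external lookup
        context.progress();    // tell the framework this task is still alive
    }
}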
Speculative execution
Typically, a MapReduce job will comprise many discrete map and reduce task executions.
When run across a cluster, there is a real risk that a misconfigured or ill host will cause its
tasks to run significantly slower than the others.
To address this, Hadoop will assign duplicate map or reduce tasks across the cluster
towards the end of the map or reduce phase. This speculative task execution is aimed
at preventing one or two slow-running tasks from causing a significant impact on the
overall job execution time.
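Speculative execution is enabled by default. If your tasks have side effects, for example writing
to an external system, you may prefer to disable it; the property names below are the Hadoop
1.x ones, shown here as a sketch in mapred-site.xml with both set to false:
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
</property>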
Hadoop's handling of failing tasks
Tasks won't just hang; sometimes they'll explicitly throw exceptions, abort, or otherwise
stop executing in a less silent way than the ones mentioned previously.
Hadoop has three configuration properties that control how it responds to task failures,
all set in mapred-site.xml:
mapred.map.max.attempts: A given map task will be retried this many times
before causing the job to fail
mapred.reduce.max.attempts: A given reduce task will be retried this many
times before causing the job to fail
mapred.max.tracker.failures: The job will fail if this many individual task
failures are recorded
The default value for all of these is 4.
Note that it does not make sense for mapred.max.tracker.failures
to be set to a value smaller than either of the other two properties.
Which of these you consider setting will depend on the nature of your data
and jobs. If your jobs access external resources that may occasionally cause
transient errors, increasing the number of permitted repeat failures of a task
may be useful. But if the task is very data-specific, these properties may be less
applicable, as a task that fails once will do so again. However, note that a
default value higher than 1 does make sense, as in a large complex system
various transient failures are always possible.
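As a sketch of how you might raise these thresholds for a job that depends on a flaky external
service, the entries in mapred-site.xml could look like the following; the value of 8 is purely
illustrative:
<property>
<name>mapred.map.max.attempts</name>
<value>8</value>
</property>
<property>
<name>mapred.reduce.max.attempts</name>
<value>8</value>
</property>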
Have a go hero – causing tasks to fail
Modify the WordCount example; instead of sleeping, have it throw a RuntimeException
based on a random number. Modify the cluster configuration and explore the relationship
between the configuration properties that manage how many failed tasks will cause the
whole job to fail.
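As a starting point, one possible sketch of the change is to add something like the following
at the top of the map method; the one-in-a-hundred failure rate is arbitrary:
if (new java.util.Random().nextInt(100) == 0) {
    // Roughly 1 in every 100 map invocations will fail; tune this to explore the thresholds
    throw new RuntimeException("Simulated task failure");
}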
Task failure due to data
The nal types of failure that we will explore are those related to data. By this, we mean
tasks that crash because a given record had corrupt data, used the wrong data types or
formats, or a wide variety of related problems. We mean those cases where the data
received diverges from expectaons.
Handling dirty data through code
One approach to dirty data is to write mappers and reducers that deal with data defensively.
So, for example, if the value received by the mapper should be a comma-separated list of
values, first validate the number of items before processing the data. If the first value should
be a string representation of an integer, ensure that the conversion into a numerical type has
solid error handling and default behavior.
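A minimal sketch of this defensive style, assuming a hypothetical record format of id,count
and treating anything else as invalid, might look like the following inside a mapper:
String[] fields = value.toString().split(",");
if (fields.length != 2) {
    return; // reject malformed records; optionally increment a counter for later inspection
}
long count;
try {
    count = Long.parseLong(fields[1].trim());
} catch (NumberFormatException e) {
    count = 0L; // fall back to a safe default rather than crashing the task
}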
The problem with this approach is that there will always be some type of weird data input
that was not considered, no matter how careful you were. Did you consider receiving values
in a different Unicode character set? What about multiple character sets, null values, badly
terminated strings, wrongly encoded escape characters, and so on?
If the data input to your jobs is something you generate and/or control, these possibilities
are less of a concern. However, if you are processing data received from external sources,
there will always be grounds for surprise.
Using Hadoop's skip mode
The alternave is to congure Hadoop to approach task failures dierently. Instead of
looking upon a failed task as an atomic event, Hadoop can instead aempt to idenfy which
records may have caused the problem and exclude them from future task execuons. This
mechanism is known as skip mode. This can be useful if you are experiencing a wide variety
of data issues where coding around them is not desirable or praccal. Alternavely, you may
have lile choice if, within your job, you are using third-party libraries for which you may not
have the source code.
Skip mode is currently available only for jobs wrien to the pre 0.20 version of API, which is
another consideraon.
Time for action – handling dirty data by using skip mode
Let's see skip mode in action by writing a MapReduce job that receives data that causes
it to fail:
1. Save the following Ruby script as gendata.rb:
File.open("skipdata.txt", "w") do |file|
3.times do
500000.times{file.write("A valid record\n")}
5.times{file.write("skiptext\n")}
end
500000.times{file.write("A valid record\n")}
End
2. Run the script:
$ ruby gendata.rb
3. Check the size of the generated le and its number of lines:
$ ls -lh skipdata.txt
-rw-rw-r-- 1 hadoop hadoop 29M 2011-12-17 01:53 skipdata.txt
$ cat skipdata.txt | wc -l
2000015
4. Copy the le onto HDFS:
$ hadoop fs -put skipdata.txt skipdata.txt
5. Add the following property definition to mapred-site.xml:
<property>
<name>mapred.skip.map.max.skip.records</name>
<value>5</value>
</property>
6. Check the value set for mapred.map.max.attempts and set it to 20 if it is lower.
7. Save the following Java file as SkipData.java:
import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;

public class SkipData
{
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable>
    {
        private final static LongWritable one = new LongWritable(1);
        private Text word = new Text("totalcount");

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException
        {
            String line = value.toString();

            if (line.equals("skiptext"))
                throw new RuntimeException("Found skiptext");

            output.collect(word, one);
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration config = new Configuration();
        JobConf conf = new JobConf(config, SkipData.class);
        conf.setJobName("SkipData");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);

        FileInputFormat.setInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
8. Compile this le and jar it into skipdata.jar.
9. Run the job:
$ hadoop jar skip.jar SkipData skipdata.txt output
11/12/16 17:59:07 INFO mapred.JobClient: map 45% reduce 8%
11/12/16 17:59:08 INFO mapred.JobClient: Task Id : attempt_2011121
61623_0014_m_000003_0, Status : FAILED
java.lang.RuntimeException: Found skiptext
at SkipData$MapClass.map(SkipData.java:26)
at SkipData$MapClass.map(SkipData.java:12)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.
java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
11/12/16 17:59:11 INFO mapred.JobClient: map 42% reduce 8%
...
11/12/16 18:01:26 INFO mapred.JobClient: map 70% reduce 16%
11/12/16 18:01:35 INFO mapred.JobClient: map 71% reduce 16%
11/12/16 18:01:43 INFO mapred.JobClient: Task Id : attempt_2011111
61623_0014_m_000003_2, Status : FAILED
java.lang.RuntimeException: Found skiptext
...
11/12/16 18:12:44 INFO mapred.JobClient: map 99% reduce 29%
11/12/16 18:12:50 INFO mapred.JobClient: map 100% reduce 29%
11/12/16 18:13:00 INFO mapred.JobClient: map 100% reduce 100%
11/12/16 18:13:02 INFO mapred.JobClient: Job complete:
job_201112161623_0014
...
10. Examine the contents of the job output file:
$ hadoop fs -cat output/part-00000
totalcount 2000000
11. Look in the output directory for skipped records:
$ hadoop fs -ls output/_logs/skip
Found 15 items
-rw-r--r-- 3 hadoop supergroup 203 2011-12-16 18:05 /
user/hadoop/output/_logs/skip/attempt_201112161623_0014_m_000001_3
-rw-r--r-- 3 hadoop supergroup 211 2011-12-16 18:06 /
user/hadoop/output/_logs/skip/attempt_201112161623_0014_m_000001_4
12. Check the job details from the MapReduce UI to observe the recorded statistics, as
shown in the following screenshot:
What just happened?
We had to do a lot of setup here, so let's walk through it a step at a time.
Firstly, we needed to configure Hadoop to use skip mode; it is disabled by default. The key
configuration property was set to 5, meaning that we didn't want the framework to skip any
set of records greater than this number. Note that this includes the invalid records, and that by
setting this property to 0 (the default) Hadoop will not enter skip mode.
We also checked to ensure that Hadoop was configured with a sufficiently high threshold for
repeated task attempt failures, for reasons we will explain shortly.
Next we needed a test file that we could use to simulate dirty data. We wrote a simple
Ruby script that generated a file with 2 million lines we would treat as valid, with three
sets of five bad records interspersed through the file. We ran this script and confirmed that
the generated file did indeed have 2,000,015 lines. This file was then put on HDFS, where it
would be the job input.
We then wrote a simple MapReduce job that effectively counts the number of valid records.
Every time a line read from the input is the valid text, we emit an additional count of 1 to
what will be aggregated as a final total. When the invalid lines are encountered, the mapper
fails by throwing an exception.
We then compiled this file, jarred it up, and ran the job. The job takes a while to run and, as seen
from the extracts of the job status, it follows a pattern that we have not seen before. The
map progress counter will increase but, when a task fails, the progress will drop back and then
start increasing again. This is skip mode in action.
Every time a key/value pair is passed to the mapper, Hadoop by default increments a counter
that allows it to keep track of which record caused a failure.
If your map or reduce tasks process their input through mechanisms other
than directly receiving all data via the arguments to the map or reduce method
(for example, from asynchronous processes or caches), you will need to ensure
that you update this counter manually.
When a task fails, Hadoop retries it on the same block but attempts to work around the
invalid records. Through a binary search approach, the framework performs retries across
the data until the number of skipped records is no greater than the maximum value we
configured earlier, that is, 5. This process does require multiple task retries and failures as the
framework seeks the optimal batch to skip, which is why we had to ensure the framework
was configured to be tolerant of a higher-than-usual number of repeated task failures.
We watched the job continue following this back-and-forth process and, on completion,
checked the contents of the output file. This showed 2,000,000 processed records, which
is the correct number of valid records in our input file. Hadoop successfully managed to
skip only the three sets of five invalid records.
We then looked within the _logs directory in the job output directory and saw that
there is a skip directory containing the sequence files of the skipped records.
Finally, we looked at the MapReduce web UI to see the overall job status, which
included both the number of records processed while in skip mode and the
number of records skipped. Note that the total number of failed tasks was 22, which is
greater than our threshold for failed map attempts, but this number is an aggregate of failures
across multiple tasks.
To skip or not to skip...
Skip mode can be very effective but, as we have seen previously, there is a performance
penalty caused by Hadoop having to determine which record range to skip. Our test file was
actually quite helpful to Hadoop; the bad records were nicely grouped in three batches and
only accounted for a tiny fraction of the overall data set. If there were many more invalid
records in the input data and they were spread much more widely across the file, a more
effective approach may have been to use a precursor MapReduce job to filter out all the
invalid records.
This is why we have presented the topics of writing code to handle bad data and using
skip mode consecutively. Both are valid techniques that you should have in your tool
belt. There is no single answer as to when one or the other is the best approach; you need
to consider the input data, performance requirements, and opportunities for hardcoding
before making a decision.
Summary
We have caused a lot of destruction in this chapter, and I hope you never have to deal with
this much failure in a single day with an operational Hadoop cluster. There are some key
learning points from the experience.
In general, component failures are not something to fear in Hadoop. Particularly with large
clusters, failure of some component or host will be pretty commonplace, and Hadoop is
engineered to handle this situation. HDFS, with its responsibility to store data, actively
manages the replication of each block and schedules new copies to be made when
DataNode processes die.
MapReduce has a stateless approach to TaskTracker failure and in general simply schedules
duplicate tasks if one fails. It may also do this to prevent misbehaving hosts from slowing
down the whole job.
Failure of the HDFS and MapReduce master nodes is a more significant failure. In particular,
the NameNode process holds critical filesystem data, and you must actively ensure you have
it set up to allow a new NameNode process to take over.
In general, hardware failures will look much like the previous process failures, but always
be aware of the possibility of correlated failures. If tasks fail due to software errors, Hadoop
will retry them within configurable thresholds. Data-related errors can be worked around by
employing skip mode, though this comes with a performance penalty.
Now that we know how to handle failures in our cluster, we will spend the next chapter
working through the broader issues of cluster setup, health, and maintenance.
7
Keeping Things Running
Having a Hadoop cluster is not all about writing interesting programs to do
clever data analysis. You also need to maintain the cluster, and keep it tuned
and ready to do the data crunching you want.
In this chapter we will cover:
More about Hadoop configuration properties
How to select hardware for your cluster
How Hadoop security works
Managing the NameNode
Managing HDFS
Managing MapReduce
Scaling the cluster
Although these topics are operationally focused, they do give us an opportunity to explore
some aspects of Hadoop we have not looked at before. Therefore, even if you won't be
personally managing the cluster, there should be useful information here for you too.
A note on EMR
One of the main benets of using cloud services such as those oered by Amazon Web Services
is that much of the maintenance overhead is borne by the service provider. Elasc MapReduce
can create Hadoop clusters ed to the execuon of a single task (non-persistent job ows) or
allow long-running clusters that can be used for mulple jobs (persistent job ows). When
non-persistent job ows are used, the actual mechanics of how the underlying Hadoop cluster
is congured and run are largely invisible to the user. Consequently, users employing non-
persistent job ows will not need to consider many of the topics in this chapter. If you are
using EMR with persistent job ows, many topics (but not all) do become relevant.
We will generally talk about local Hadoop clusters in this chapter. If you need to recongure
a persistent job ow, use the same Hadoop properes but set them as described in Chapter
3, Wring MapReduce Jobs.
Hadoop conguration properties
Before we look at running the cluster, let's talk a little about Hadoop's configuration
properties. We have been introducing many of these along the way, and there are a
few additional points worth considering.
Default values
One of the most mystifying things to a new Hadoop user is the large number of
configuration properties. Where do they come from, what do they mean, and
what are their default values?
If you have the full Hadoop distribution—that is, not just the binary distribution—the
following XML files will answer your questions:
Hadoop/src/core/core-default.xml
Hadoop/src/hdfs/hdfs-default.xml
Hadoop/src/mapred/mapred-default.xml
Time for action – browsing default properties
Fortunately, the XML documents are not the only way of looking at the default values; there
are also more readable HTML versions, which we'll now take a quick look at.
These les are not included in the Hadoop binary-only distribuon; if you are using that,
you can also nd these les on the Hadoop website.
1. Point your browser at the docs/core-default.html le within your
Hadoop distribuon directory and browse its contents. It should look like
the next screenshot:
2. Now, similarly, browse these other files:
Hadoop/docs/hdfs-default.html
Hadoop/docs/mapred-default.html
What just happened?
As you can see, each property has a name, a default value, and a brief description. You will
also see that there are indeed a very large number of properties. Do not expect to understand
all of these now, but do spend a little time browsing to get a flavor for the type of
customization allowed by Hadoop.
Additional property elements
When we have previously set properties in the configuration files, we have used an XML
element of the following form:
<property>
<name>the.property.name</name>
<value>The property value</value>
</property>
There are an addional two oponal XML elements we can add, description and final.
A fully described property using these addional elements now looks as follows:
<property>
<name>the.property.name</name>
<value>The default property value</value>
<description>A textual description of the property</description>
<final>Boolean</final>
</property>
The descripon element is self-explanatory and provides the locaon for the descripve text
we saw for each property in the preceding HTML les.
The final property has a similar meaning as in Java: any property marked final cannot be
overridden by values in any other les or by other means; we will see this shortly. Use this
for those properes where for performance, integrity, security, or other reasons, you wish to
enforce cluster-wide values.
Default storage location
You will see properes that modify where Hadoop stores its data on both the local disk and
HDFS. There's one property used as the basis for many others hadoop.tmp.dir, which is
the root locaon for all Hadoop les, and its default value is /tmp.
Unfortunately, many Linux distribuons—including Ubuntu—are congured to remove
the contents of this directory on each reboot. This means that if you do not override this
property, you will lose all your HDFS data on the next host reboot. Therefore,
it is worthwhile to set something like the following in core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop</value>
</property>
Remember to ensure the location is writable by the user who will start Hadoop, and that
the disk the directory is located on has enough space. As you will see later, there are a
number of other properties that allow more granular control of where particular types
of data are stored.
Where to set properties
We have previously used the configuration files to specify new values for Hadoop properties.
This is fine, but it does have an overhead if we are trying to find the best value for a property
or are executing a job that requires special handling.
It is possible to use the JobConf class to programmatically set configuration properties on
the executing job. There are two types of methods supported, the first being those that
are dedicated to setting a specific property, such as the ones we've seen for setting the job
name, input, and output formats, among others. There are also methods to set properties
such as the preferred number of map and reduce tasks for the job.
In addition, there is a set of generic methods, such as the following:
void set(String key, String value);
void setIfUnset(String key, String value);
void setBoolean(String key, boolean value);
void setInt(String key, int value);
These are more flexible and do not require specific methods to be created for each
property we wish to modify. However, they also lose compile-time checking, meaning
you can use an invalid property name or assign the wrong type to a property and will
only find out at runtime.
This ability to set property values both programmatically and in the
configuration files is an important reason for the ability to mark a property as
final. For properties that you do not want any submitted job to be able to
override, set them as final within the master configuration files.
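As a brief sketch of the programmatic route, a driver might do something like the following
before submitting the job; MyJob and the property values are purely illustrative:
JobConf conf = new JobConf(MyJob.class);
conf.setJobName("property-example");
conf.setNumReduceTasks(2); // a dedicated setter for a common property
conf.set("mapred.task.timeout", "30000"); // generic setter; no compile-time check of the name
conf.setBoolean("mapred.map.tasks.speculative.execution", false);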
Setting up a cluster
Before we look at how to keep a cluster running, let's explore some aspects of setting it up in
the first place.
How many hosts?
When considering a new Hadoop cluster, one of the first questions is how much capacity to
start with. We know that we can add additional nodes as our needs grow, but we also want
to start off in a way that eases that growth.
There really is no clear-cut answer here, as it will depend largely on the size of the data sets
you will be processing and the complexity of the jobs to be executed. The only near-absolute
is to say that if you want a replication factor of n, you should have at least that many nodes.
Remember, though, that nodes will fail, and if you have the same number of nodes as the
default replication factor then any single failure will push blocks into an under-replicated
state. In most clusters with tens or hundreds of nodes this is not a concern, but for very
small clusters with a replication factor of 3, the safest approach would be a five-node cluster.
Calculating usable space on a node
An obvious starng point for the required number of nodes is to look at the size of the data
set to be processed on the cluster. If you have hosts with 2 TB of disk space and a 10 TB data
set, the temptaon would be to assume that ve nodes is the minimum number needed.
This is incorrect, as it omits consideraon of the replicaon factor and the need for
temporary space. Recall that the output of mappers is wrien to the local disk to be
retrieved by the reducers. We need to account for this non-trivial disk usage.
A good rule of thumb would be to assume a replicaon factor of 3, and that 25 percent of
what remains should be accounted for as temporary space. Using these assumpons, the
calculaon of the needed cluster for our 10 TB data set on 2 TB nodes would be as follows:
Divide the total storage space on a node by the replicaon factor:
2 TB/3 = 666 GB
Reduce this gure by 25 percent to account for temp space:
666 GB * 0.75 = 500 GB
Each 2 TB node therefore has approximately 500 GB (0.5 TB) of usable space
Divide the data set size by this gure:
10 TB / 500 GB = 20
So our 10 TB data set will likely need a 20 node cluster as a minimum, four mes our
naïve esmate.
This paern of needing more nodes than expected is not unusual and should be
remembered when considering how high-spec you want the hosts to be; see the
Sizing hardware secon later in this chapter.
Location of the master nodes
The next queson is where the NameNode, JobTracker, and SecondaryNameNode will
live. We have seen that a DataNode can run on the same host as the NameNode and the
TaskTracker can co-exist with the JobTracker, but this is unlikely to be a great setup for a
producon cluster.
As we will see, the NameNode and SecondaryNameNode have some specic resource
requirements, and anything that aects their performance is likely to slow down the enre
cluster operaon.
The ideal situaon would be to have the NameNode, JobTracker, and SecondaryNameNode
on their own dedicated hosts. However, for very small clusters, this would result in a
signicant increase in the hardware footprint without necessarily reaping the full benet.
If at all possible, the rst step should be to separate the NameNode, JobTracker, and
SecondaryNameNode onto a single dedicated host that does not have any DataNode or
TaskTracker processes running. As the cluster connues to grow, you can add an addional
server host and then move the NameNode onto its own host, keeping the JobTracker and
SecondaryNameNode co-located. Finally, as the cluster grows yet further, it will make sense
to move to full separaon.
As discussed in Chapter 6, Keeping Things Running, Hadoop 2.0 will split the
Secondary NameNode into Backup NameNodes and Checkpoint NameNodes.
Best pracce is sll evolving, but aiming towards having a dedicated host each
for the NameNode and at least one Backup NameNode looks sensible.
Sizing hardware
The amount of data to be stored is not the only consideration regarding the specification
of the hardware to be used for the nodes. Instead, you have to consider the amount of
processing power, memory, storage types, and networking available.
Much has been written about selecting hardware for a Hadoop cluster, and once again there
is no single answer that will work for all cases. The big variable is the type of MapReduce
tasks that will be executed on the data and, in particular, whether they are bounded by CPU,
memory, I/O, or something else.
Processor / memory / storage ratio
A good way of thinking of this is to look at potential hardware in terms of the CPU / memory
/ storage ratio. So, for example, a quad-core host with 8 GB memory and 2 TB storage could
be thought of as having two cores and 4 GB memory per 1 TB of storage.
Then look at the types of MapReduce jobs you will be running: does that ratio seem
appropriate? In other words, does your workload require proportionally more of one
of these resources, or will a more balanced configuration be sufficient?
This is, of course, best assessed by prototyping and gathering metrics, but that isn't always
possible. If not, consider which part of the job is the most expensive. For example, some
of the jobs we have seen are I/O bound and read data from the disk, perform simple
transformations, and then write results back to the disk. If this was typical of our workload,
we could likely use hardware with more storage—especially if it was delivered by multiple
disks to increase I/O—and use less CPU and memory.
Conversely, jobs that perform very heavy number crunching would need more CPU, and
those that create or use large data structures would benefit from memory.
Think of it in terms of limiting factors. If your job was running, would it be CPU-bound
(processors at full capacity; memory and I/O to spare), memory-bound (physical memory full
and swapping to disk; CPU and I/O to spare), or I/O-bound (CPU and memory to spare, but
data being read/written to/from disk at maximum possible speed)? Can you get hardware
that eases that bound?
This is of course a limitless process, as once you ease one bound another will manifest itself.
So always remember that the idea is to get a performance profile that makes sense in the
context of your likely usage scenario.
What if you really don't know the performance characteristics of your jobs? Ideally, try
to find out; do some prototyping on any hardware you have and use that to inform your
decision. However, if even that is not possible, you will have to go for a configuration and
try it out. Remember that Hadoop supports heterogeneous hardware—though having
uniform specifications makes your life easier in the end—so build the cluster to the
minimum possible size and assess the hardware. Use this knowledge to inform future
decisions regarding additional host purchases or upgrades of the existing fleet.
EMR as a prototyping platform
Recall that when we congured a job on Elasc MapReduce we chose the type of hardware
for both the master and data/task nodes. If you plan to run your jobs on EMR, you have
a built-in capability to tweak this conguraon to nd the best combinaon of hardware
specicaons to price and execuon speed.
However, even if you do not plan to use EMR full-me, it can be a valuable prototyping
plaorm. If you are sizing a cluster but do not know the performance characteriscs of
your jobs, consider some prototyping on EMR to gain beer insight. Though you may end
up spending money on the EMR service that you had not planned, this will likely be a lot less
than the cost of nding out you have bought completely unsuitable hardware for your cluster.
Special node requirements
Not all hosts have the same hardware requirements. In particular, the host for the
NameNode may look radically different to those hosting the DataNodes and TaskTrackers.
Recall that the NameNode holds an in-memory representation of the HDFS filesystem and
the relationships between files, directories, blocks, and nodes, plus various metadata concerning
all of this. This means that the NameNode will tend to be memory bound and may require
more memory than any other host, particularly for very large clusters or those with a huge
number of files. Though 16 GB may be a common memory size for DataNodes/TaskTrackers,
it's not unusual for the NameNode host to have 64 GB or more of memory. If the NameNode
ever ran out of physical memory and started to use swap space, the impact on cluster
performance would likely be severe.
However, though 64 GB is large for physical memory, it's tiny for modern storage, and
given that the filesystem image is the only data stored by the NameNode, we don't need
the massive storage common on the DataNode hosts. We care much more about NameNode
reliability, so we are likely to have several disks in a redundant configuration. Consequently,
the NameNode host will benefit from multiple small drives (for redundancy) rather than
large drives.
Overall, therefore, the NameNode host is likely to look quite different from the other
hosts in the cluster; this is why we made the earlier recommendations regarding moving
the NameNode to its own host as soon as budget/space allows, as its unique hardware
requirements are more easily satisfied this way.
The SecondaryNameNode (or CheckpointNameNode and BackupNameNode
in Hadoop 2.0) shares the same hardware requirements as the NameNode. You
can run it on a more generic host while in its secondary capacity, but if you do
ever need to switch and make it the NameNode due to failure of the primary
hardware, you may be in trouble.
Storage types
Though you will nd strong opinions on some of the previous points regarding the relave
importance of processor, memory, and storage capacity, or I/O, such arguments are usually
based around applicaon requirements and hardware characteriscs and metrics. Once we
start discussing the type of storage to be used, however, it is very easy to get into ame war
situaons, where you will nd extremely entrenched opinions.
Commodity versus enterprise class storage
The rst argument will be over whether it makes most sense to use hard drives aimed at
the commodity/consumer segments or those aimed at enterprise customers. The former
(primarily SATA disks) are larger, cheaper, and slower, and have lower quoted gures for
mean me between failures (MTBF). Enterprise disks will use technologies such as SAS or
Fiber Channel, and will on the whole be smaller, more expensive, faster, and have higher
quoted MTBF gures.
Single disk versus RAID
The next queson will be on how the disks are congured. The enterprise-class approach
would be to use Redundant Arrays of Inexpensive Disks (RAID) to group mulple disks into
a single logical storage device that can quietly survive one or more disk failures. This comes
with the cost of a loss in overall capacity and an impact on the read/write rates achieved.
The other posion is to treat each disk independently to maximize total storage and
aggregate I/O, at the cost of a single disk failure causing host downme.
Finding the balance
The Hadoop architecture is, in many ways, predicated on the assumpon that hardware will
fail. From this perspecve, it is possible to argue that there is no need to use any tradional
enterprise-focused storage features. Instead, use many large, cheap disks to maximize the
total storage and read and write from them in parallel to do likewise for I/O throughput.
A single disk failure may cause the host to fail, but the cluster will, as we have seen, work
around this failure.
This is a completely valid argument and in many cases makes perfect sense. What the
argument ignores, however, is the cost of bringing a host back into service. If your cluster
is in the next room and you have a shelf of spare disks, host recovery will likely be a quick,
painless, and inexpensive task. However, if you have your cluster hosted by a commercial
collocaon facility, any hands-on maintenance may cost a lot more. This is even more
the case if you are using fully-managed servers where you have to pay the provider for
maintenance tasks. In such a situaon, the extra cost and reduced capacity and I/O from
using RAID may make sense.
Network storage
One thing that will almost never make sense is to use networked storage for your primary
cluster storage. Be it block storage via a Storage Area Network (SAN) or le-based via
Network File System (NFS) or similar protocols, these approaches constrain Hadoop by
introducing unnecessary bolenecks and addional shared devices that would have a
crical impact on failure.
Somemes, however, you may be forced for non-technical reasons to use something like
this. It's not that it won't work, just that it changes how Hadoop will perform in regards to
speed and tolerance to failures, so be sure you understand the consequences if this happens.
Hadoop networking conguration
Hadoop's support of networking devices is not as sophiscated as it is for storage, and
consequently you have fewer hardware choices to make compared to CPU, memory, and
storage setup. The boom line is that Hadoop can currently support only one network device
and cannot, for example, use all 4-gigabit Ethernet connecons on a host for an aggregate
of 4-gigabit throughput. If you need network throughput greater than that provided by a
single-gigabit port then, unless your hardware or operang system can present mulple
ports as a single device to Hadoop, the only opon is to use a 10-gigabit Ethernet device.
How blocks are placed
We have talked a lot about HDFS using replication for redundancy, but have not explored
how Hadoop chooses where to place the replicas for a block.
In most traditional server farms, the various hosts (as well as networking and other devices)
are housed in standard-sized racks that stack the equipment vertically. Each rack will usually
have a common power distribution unit that feeds it and will often have a network switch
that acts as the interface between the broader network and all the hosts in the rack.
Given this setup, we can identify three broad types of failure:
Those that affect a single host (for example, CPU/memory/disk/motherboard failure)
Those that affect a single rack (for example, power unit or switch failure)
Those that affect the entire cluster (for example, larger power/network failures,
cooling/environmental outages)
Remember that Hadoop currently does not support a cluster that is spread
across multiple data centers, so instances of the third type of failure will
quite likely bring down your cluster.
By default, Hadoop will treat each node as if it is in the same physical rack. This implies that
the bandwidth and latency between any pair of hosts is approximately equal and that each
node is as likely to suffer a related failure as any other.
Rack awareness
If, however, you do have a multi-rack setup, or another configuration that otherwise
invalidates the previous assumptions, you can add the ability for each node to report
its rack ID to Hadoop, which will then take this into account when placing replicas.
In such a setup, Hadoop tries to place the first replica of a block on a given host, the second
on another host within the same rack, and the third on a host in a different rack.
This strategy provides a good balance between performance and availability. When racks
contain their own network switches, communication between hosts inside the rack often has
lower latency than communication with external hosts. This strategy places two replicas within a rack
to ensure maximum speed of writing for these replicas, but keeps one outside the rack to
provide redundancy in the event of a rack failure.
The rack-awareness script
If the topology.script.file.name property is set and points to an executable script
on the filesystem, it will be used by the NameNode to determine the rack for each host.
Note that the property needs to be set and the script needs to exist only on the
NameNode host.
The NameNode will pass to the script the IP address of each node it discovers, so the script
is responsible for the mapping from node IP address to rack name.
If no script is specified, each node will be reported as a member of a single default rack.
Time for action – examining the default rack configuration
Let's take a look at how the default rack configuration is set up in our cluster.
1. Execute the following command:
$ hadoop fsck -rack
2. The result should include output similar to the following:
Default replication factor: 3
Average block replication: 3.3045976
Corrupt blocks: 0
Missing replicas: 18 (0.5217391 %)
Number of data-nodes: 4
Number of racks: 1
The filesystem under path '/' is HEALTHY
What just happened?
Both the tool used and its output are of interest here. The tool is hadoop fsck, which
can be used to examine and x lesystem problems. As can be seen, this includes some
informaon not dissimilar to our old friend hadoop dfsadmin, though that tool is focused
more on the state of each node in detail while hadoop fsck reports on the internals of the
lesystem as a whole.
One of the things it reports is the total number of racks in the cluster, which, as seen in the
preceding output, has the value 1, as expected.
This command was executed on a cluster that had recently been used for some HDFS resilience testing. This explains the figures for average block replication and under-replicated blocks.
If a block ends up with more than the required number of replicas because a host failed temporarily (triggering re-replication) and then came back into service, the returning host will push the block above its replication factor. Along with ensuring that blocks have replicas added to meet the replication factor, Hadoop will also delete the excess replicas to return such blocks to the replication factor.
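If you want to inspect replication state yourself, hadoop fsck can report it directly. As a hedged illustration (these options exist in Hadoop 1.x, but check the usage output of your own version), the following command lists each file's blocks together with the DataNodes currently holding their replicas, which makes temporary over- or under-replication easy to spot:
$ hadoop fsck / -files -blocks -locations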
Time for action – adding a rack awareness script
We can enhance the default at rack conguraon by creang a script that derives the rack
locaon for each host.
1. Create a script in the Hadoop user's home directory on the NameNode host called
rack-script.sh, containing the following text. Remember to change the IP
address to one of your HDFS nodes.
#!/bin/bash
if [ $1 = "10.0.0.101" ]; then
echo -n "/rack1 "
else
echo -n "/default-rack "
fi
2. Make this script executable.
$ chmod +x rack-script.sh
3. Add the following property to core-site.xml on the NameNode host:
<property>
<name>topology.script.file.name</name>
<value>/home/hadoop/rack-script.sh</value>
</property>
4. Restart HDFS.
$ start-dfs.sh
5. Check the lesystem via fsck.
$ Hadoop fsck –rack
The output of the preceding command can be shown in the following screenshot:
What just happened?
We rst created a simple script that returns one value for a named node and a default value
for all others. We placed this on the NameNode host and added the needed conguraon
property to the NameNode core-site.xml le.
Aer starng HDFS, we used hadoop fsck to report on the lesystem and saw that
we now have a two-rack cluster. With this knowledge, Hadoop will now employ more
sophiscated block placement strategies, as described previously.
Using an external host le
A common approach is to keep a separate data le akin to the /etc/hosts
le on Unix and use this to specify the IP/rack mapping, one per line. This le
can then be updated independently and read by the rack-awareness script.
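As a minimal sketch of this approach (the mapping file name and its format are hypothetical rather than a Hadoop convention), the script below looks up each address passed to it in a file containing one "IP-address rack-name" pair per line and falls back to the default rack when no entry is found:
#!/bin/bash
# Hypothetical mapping file, for example a line such as: 10.0.0.101 /rack1
MAPPING_FILE=/home/hadoop/rack-mapping.txt
for node in "$@" ; do
  rack=$(awk -v ip="$node" '$1 == ip {print $2}' "$MAPPING_FILE")
  # Print the rack for known hosts, the default rack otherwise
  echo -n "${rack:-/default-rack} "
done
Note that the NameNode may pass several addresses to the script in a single invocation, which is why this sketch loops over all of its arguments rather than looking only at $1.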
What is commodity hardware anyway?
Let's revisit the queson of the general characteriscs of the hosts used for your cluster, and
whether they should look more like a commodity white box server or something built for a
high-end enterprise environment.
Part of the problem is that "commodity" is an ambiguous term. What looks cheap
and cheerful for one business may seem luxuriously high-end for another. We suggest
considering the following points to keep in mind when selecng hardware and then
remaining happy with your decision:
With your hardware, are you paying a premium for reliability features that duplicate
some of Hadoop's fault-tolerance capabilies?
Are the higher-end hardware features you are paying for addressing the need or risk
that you have conrmed is realisc in your environment?
Have you validated the cost of the higher-end hardware to be higher than dealing
with cheaper / less reliable hardware?
Pop quiz – setting up a cluster
Q1. Which of the following is most important when selecting hardware for your new Hadoop cluster?
1. The number of CPU cores and their speed.
2. The amount of physical memory.
3. The amount of storage.
4. The speed of the storage.
5. It depends on the most likely workload.
Q2. Why would you likely not want to use network storage in your cluster?
1. Because it may introduce a new single point of failure.
2. Because it most likely has approaches to redundancy and fault-tolerance that may
be unnecessary given Hadoop's fault tolerance.
3. Because such a single device may have inferior performance to Hadoop's use of multiple local disks simultaneously.
4. All of the above.
Q3. You will be processing 10 TB of data on your cluster. Your main MapReduce job processes financial transactions, using them to produce statistical models of behavior and future forecasts. Which of the following hardware choices would be your first choice for the cluster?
1. 20 hosts each with fast dual-core processors, 4 GB memory, and one 500 GB
disk drive.
2. 30 hosts each with fast dual-core processors, 8 GB memory, and two 500 GB
disk drives.
3. 30 hosts each with fast quad-core processors, 8 GB memory, and one 1 TB disk drive.
4. 40 hosts each with 16 GB memory, fast quad-core processors, and four 1 TB
disk drives.
Cluster access control
Once you have the shiny new cluster up and running, you need to consider questions of access and security. Who can access the data on the cluster? Is there sensitive data that you really don't want the whole user base to see?
The Hadoop security model
Until very recently, Hadoop had a security model that could, at best, be described as "marking only". It associated an owner and group with each file but, as we'll see, did very little validation of a given client connection. Strong security would manage not only the markings given to a file but also the identities of all connecting users.
Time for action – demonstrating the default security
When we have previously shown lisngs of les, we have seen user and group names for
them. However, we have not really explored what that means. Let's do so.
1. Create a test text le in the Hadoop user's home directory.
$ echo "I can read this!" > security-test.txt
$ hadoop fs -put security-test.txt security-test.txt
2. Change the permissions on the le to be accessible only by the owner.
$ hadoop fs -chmod 700 security-test.txt
$ hadoop fs -ls
The output of the preceding command is shown in the following screenshot:
3. Confirm you can still read the file.
$ hadoop fs -cat security-test.txt
You'll see the following line on the screen:
I can read this!
4. Connect to another node in the cluster and try to read the file from there.
$ ssh node2
$ hadoop fs -cat security-test.txt
You'll see the following line on the screen:
I can read this!
5. Log out from the other node.
$ exit
6. Create a home directory for another user and give them ownership.
$ hadoop fs -mkdir /user/garry
$ hadoop fs -chown garry /user/garry
$ hadoop fs -ls /user
The output of the preceding command is shown in the following screenshot:
7. Switch to that user.
$ su garry
8. Try to read the test le in the Hadoop user's home directory.
$ hadoop/bin/hadoop fs -cat /user/hadoop/security-test.txt
cat: org.apache.hadoop.security.AccessControlException: Permission
denied: user=garry, access=READ, inode="security-test.txt":hadoop:
supergroup:rw-------
9. Place a copy of the le in this user's home directory and again make it accessible
only by the owner.
$ Hadoop/bin/Hadoop fs -put security-test.txt security-test.txt
$ Hadoop/bin/Hadoop fs -chmod 700 security-test.txt
$ hadoop/bin/hadoop fs -ls
The output of the preceding command can be shown in following screenshot:
10. Conrm this user can access the le.
$ hadoop/bin/hadoop fs -cat security-test.txt
You'll see the following line on the screen:
I can read this!
11. Return to the Hadoop user.
$ exit
12. Try and read the le in the other user's home directory.
$ hadoop fs -cat /user/garry/security-test.txt
You'll see the following line on the screen:
I can read this!
What just happened?
We rstly used our Hadoop user to create a test le in its home directory on HDFS. We used
the -chmod opon to hadoop fs, which we have not seen before. This is very similar to the
standard Unix chmod tool that gives various levels of read/write/execute access to the le
owner, group members, and all users.
We then went to another host and tried to access the le, again as the Hadoop user. Not
surprisingly, this worked. But why? What did Hadoop know about the Hadoop user that
allowed it to give access to the le?
To explore this, we then created another home directory on HDFS (you can use any other
account on the host you have access to), and gave it ownership by using the -chown
opon to hadoop fs. This should once again look similar to standard Unix -chown. Then
we switched to this user and aempted to read the le stored in the Hadoop user's home
directory. This failed with the security excepon shown before, which is again what we
expected. Once again, we copied a test le into this user's home directory and made it only
accessible by the owner.
But we then muddied the waters by switching back to the Hadoop user and tried to access
the le in the other account's home directory, which, surprisingly, worked.
User identity
The answer to the rst part of the puzzle is that Hadoop uses the Unix ID of the user
execung the HDFS command as the user identy on HDFS. So any commands executed by a
user called alice will create les with an owner named alice and will only be able to read
or write les to which this user has the correct access.
The security-minded will realize that to access a Hadoop cluster all one needs to do is create a user with the same name as an already existing HDFS user on any host that can connect to the cluster. So, for instance, in the previous example, any user named hadoop created on any host that can access the NameNode can read all files accessible by the user hadoop, which is actually even worse than it seems.
The super user
The previous step saw the Hadoop user access another user's files. Hadoop treats the user ID that started the cluster as the super user, and gives it various privileges, such as the ability to read, write, and modify any file on HDFS. The security-minded will now appreciate even more keenly the risk of having users called hadoop randomly created on hosts outside the Hadoop administrator's control.
More granular access control
The preceding situaon has caused security to be a major weakness in Hadoop since its
incepon. The community has, however, not been standing sll, and aer much work the
very latest versions of Hadoop support a more granular and stronger security model.
To avoid reliance on simple user IDs, the developers need to learn the user identy from
somewhere, and the Kerberos system was chosen with which to integrate. This does require
the establishment and maintenance of services outside the scope of this book, but if such
security is important to you, consult the Hadoop documentaon. Note that this support does
allow integraon with third-party identy systems such as Microso Acve Directory, so it is
quite powerful.
Working around the security model via physical access control
If the burden of Kerberos is too great, or security is a nice-to-have rather than an absolute, there are ways of mitigating the risk. One favored by me is to place the entire cluster behind a firewall with tight access control. In particular, only allow access to the NameNode and JobTracker services from a single host that will be treated as the cluster head node and to which all users connect.
Accessing Hadoop from non-cluster hosts
Hadoop does not need to be running on a host for that host to use the command-line tools to access HDFS and run MapReduce jobs. As long as Hadoop is installed on the host and its configuration files have the correct locations of the NameNode and JobTracker, these will be found when invoking commands such as hadoop fs and hadoop jar.
This model works because only one host is used to interact with Hadoop; and since this host is controlled by the cluster administrator, normal users should be unable to create or access other user accounts.
Remember that this approach is not providing security. It is putting a hard shell around a soft system that reduces the ways in which the Hadoop security model can be subverted.
Managing the NameNode
Let's do some more risk reducon. In Chapter 6, When Things Break, I probably scared
you when talking about the potenal consequences of a failure of the host running the
NameNode. If that secon did not scare you, go back and re-read it—it should have. The
summary is that the loss of the NameNode could see you losing every single piece of data on
the cluster. This is because the NameNode writes a le called fsimage that contains all the
metadata for the lesystem and records which blocks comprise which les. If the loss of the
NameNode host makes the fsimage unrecoverable, all the HDFS data is likewise lost.
Conguring multiple locations for the fsimage class
The NameNode can be congured to simultaneously write fsimage to mulple locaons.
This is purely a redundancy mechanism, the same data is wrien to each locaon and there
is no aempt to use mulple storage devices for increased performance. Instead, the policy
is that mulple copies of fsimage will be harder to lose.
Time for action – adding an additional fsimage location
Let's now congure our NameNode to simultaneously write mulple copies of fsimage to
give us our desired data resilience. To do this, we require an NFS-exported directory.
1. Ensure the cluster is stopped.
$ stopall.sh
2. Add the following property to Hadoop/conf/core-site.xml, modifying the
second path to point to an NFS-mounted locaon to which the addional copy of
NameNode data can be wrien.
<property>
<name>dfs.name.dir</name>
<value>${hadoop.tmp.dir}/dfs/name,/share/backup/namenode</value>
</property>
3. Delete any exisng contents of the newly added directory.
$ rm -f /share/backup/namenode
4. Start the cluster.
$ start-all.sh
5. Verify that fsimage is being wrien to both the specied locaons by running the
md5sum command against the two les specied before (change the following code
depending on your congured locaons):
$ md5sum /var/hadoop/dfs/name/image/fsimage
a25432981b0ecd6b70da647e9b94304a /var/hadoop/dfs/name/image/
fsimage
$ md5sum /share/backup/namenode/image/fsimage
a25432981b0ecd6b70da647e9b94304a /share/backup/namenode/image/
fsimage
What just happened?
Firstly, we ensured the cluster was stopped; though changes to the core configuration files are not reread by a running cluster, it's a good habit to get into in case that capability is ever added to Hadoop.
We then added a new property to our cluster configuration, specifying a value for the dfs.name.dir property. This property takes a list of comma-separated values and writes fsimage to each of these locations. Note how the hadoop.tmp.dir property discussed earlier is de-referenced, as would be seen when using Unix variables. This syntax allows us to base property values on others and inherit changes when the parent properties are updated.
Do not forget all required locaons
The default value for this property is ${Hadoop.tmp.dir}/dfs/name.
When adding an addional value, remember to explicitly add the default
one also, as shown before. Otherwise, only the single new value will be
used for the property.
Before starng the cluster, we ensure the new directory exists and is empty. If the directory
doesn't exist, the NameNode will fail to start as should be expected. If, however, the
directory was previously used to store NameNode data, Hadoop will also fail to start as it will
idenfy that both directories contain dierent NameNode data and it does not know which
one is correct.
Be careful here! Especially if you are experimenng with various NameNode data locaons
or swapping back and forth between nodes; you really do not want to accidentally delete the
contents from the wrong directory.
Aer starng the HDFS cluster, we wait for a moment and then use MD5 cryptographic
checksums to verify that both locaons contain the idencal fsimage.
Where to write the fsimage copies
The recommendaon is to write fsimage to at least two locaons, one of which should be
the remote (such as a NFS) lesystem, as in the previous example. fsimage is only updated
periodically, so the lesystem does not need high performance.
In our earlier discussion regarding the choice of hardware, we alluded to other
consideraons for the NameNode host. Because of fsimage cricality, it may be useful
to ensure it is wrien to more than one disk and to perhaps invest in disks with higher
reliability, or even to write fsimage to a RAID array. If the host fails, using the copy wrien
to the remote lesystem will be the easiest opon; but just in case that has also experienced
problems, it's good to have the choice of pulling another disk from the dead host and using it
on another to recover the data.
Swapping to another NameNode host
We have ensured that fsimage is wrien to mulple locaons and this is the single most
important prerequisite for managing a swap to a dierent NameNode host. Now we need
to actually do it.
This is something you really should not do on a producon cluster. Absolutely not when
trying for the rst me, but even beyond that it's not a risk-free process. But do pracce
on other clusters and get an idea of what you'll do when disaster strikes.
Having things ready before disaster strikes
You don't want to be exploring this topic for the first time when you need to recover the production cluster. There are several things to do in advance that will make disaster recovery much less painful, not to mention possible:
Ensure the NameNode is writing the fsimage to multiple locations, as done before.
Decide which host will be the new NameNode location. If this is a host currently being used for a DataNode and TaskTracker, ensure it has the right hardware needed to host the NameNode and that the reduction in cluster performance due to the loss of these workers won't be too great.
Make a copy of the core-site.xml and hdfs-site.xml files, place them (ideally) on an NFS location, and update them to point to the new host. Any time you modify the current configuration files, remember to make the same changes to these copies (a small sketch of this follows the list).
Copy the slaves file from the NameNode onto either the new host or the NFS share. Also, make sure you keep it updated.
Know how you will handle a subsequent failure in the new host. How quickly can you likely repair or replace the original failed host? Which host will be the location of the NameNode (and SecondaryNameNode) in the interim?
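A small sketch of that configuration copying (using the /share/backup NFS mount that appears in the following walkthrough; adjust the paths to your own environment) might look like this:
$ cp $HADOOP_HOME/conf/core-site.xml $HADOOP_HOME/conf/hdfs-site.xml /share/backup/
$ cp $HADOOP_HOME/conf/slaves /share/backup/
# Edit the copies so that fs.default.name (and mapred.job.tracker if the JobTracker
# will also move) point to the new designated host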
Ready? Let's do it!
Time for action – swapping to a new NameNode host
In the following steps we keep the new configuration files on an NFS share mounted at /share/backup; change the paths to match where you have placed your new files. Also use a different string to grep; we use a portion of the IP address we know isn't shared with any other host in the cluster.
1. Log on to the current NameNode host and shut down the cluster.
$ stop-all.sh
2. Halt the host that runs the NameNode.
$ sudo poweroff
3. Log on to the new NameNode host and confirm the new configuration files have the correct NameNode location.
$ grep 110 /share/backup/*.xml
4. On the new host, first copy across the slaves file.
$ cp /share/backup/slaves hadoop/conf
5. Now copy across the updated configuration files.
$ cp /share/backup/*site.xml hadoop/conf
6. Remove any old NameNode data from the local filesystem.
$ rm -rf /var/hadoop/dfs/name/*
7. Copy the updated configuration files to every node in the cluster.
$ slaves.sh cp /share/backup/*site.xml hadoop/conf
8. Ensure each node now has the configuration files pointing to the new NameNode.
$ slaves.sh grep 110 hadoop/conf/*site.xml
9. Start the cluster.
$ start-all.sh
10. Check that HDFS is healthy, from the command line.
$ hadoop fs -ls /
11. Verify that HDFS is accessible from the web UI.
What just happened?
First, we shut down the cluster. This is a little unrepresentative as most failures see the NameNode die in a much less friendly way, but we do not want to talk about issues of filesystem corruption until later in the chapter.
We then shut down the old NameNode host. Though not strictly necessary, this is a good way of ensuring that nothing accesses the old host and gives you an incorrect view of how well the migration has occurred.
Before copying across les, we take a quick look at core-site.xml and hdfs-site.xml
to ensure the correct values are specied for the fs.default.dir property in
core-site.xml.
We then prepare the new host by rstly copying across the slaves conguraon le and
the cluster conguraon les and then removing any old NameNode data from the local
directory. Refer to the preceding steps about being very careful in this step.
Next, we use the slaves.sh script to get each host in the cluster to copy across the new
conguraon les. We know our new NameNode host is the only one with 110 in its IP
address, so we grep for that in the les to ensure all are up-to-date (obviously, you will
need to use a dierent paern for your system).
At this stage, all should be well; we start the cluster and access via both the command-line
tools and UI to conrm it is running as expected.
Don't celebrate quite yet!
Remember that even with a successful migraon to a new NameNode, you aren't done quite
yet. You decided in advance how to handle the SecondaryNameNode and which host would
be the new designated NameNode host should the newly migrated one fail. To be ready for
that, you will need to run through the "Be prepared" checklist menoned before once more
and act appropriately.
Do not forget to consider the chance of correlated failures. Invesgate the
cause of the NameNode host failure in case it is the start of a bigger problem.
What about MapReduce?
We did not menon moving the JobTracker as that is a much less painful process as
shown in Chapter 6, When Things Break. If your NameNode and JobTracker are running
on the same host, you will need to modify the preceding approach by also keeping a new
copy of mapred-site.xml, which has the locaon of the new host in the mapred.job.
tracker property.
Have a go hero – swapping to a new NameNode host
Perform a migraon of both the NameNode and JobTracker from one host to another.
Managing HDFS
As we saw when killing and restarng nodes in Chapter 6, When Things Break, Hadoop
automacally manages many of the availability concerns that would consume a lot of eort on
a more tradional lesystem. There are some things, however, that we sll need to be aware of.
Where to write data
Just as the NameNode can have mulple locaons for storage of fsimage specied via
the dfs.name.dir property, we explored earlier that there is a similar-appearing property
called dfs.data.dir that allows HDFS to use mulple data locaons on a host, which we
will look at now.
This is a useful mechanism that works very dierently from the NameNode property. If
mulple directories are specied in dfs.data.dir, Hadoop will view these as a series of
independent locaons that it can use in parallel. This is useful if you have mulple physical
disks or other storage devices mounted at disnct points on the lesystem. Hadoop will
use these mulple devices intelligently, maximizing not only the total storage capacity but
also by balancing reads and writes across the locaons to gain maximum throughput. As
menoned in the Storage types secon, this is the approach that maximizes these factors
at the cost of a single disk failure causing the whole host to fail.
Using balancer
Hadoop works hard to place data blocks on HDFS in a way that maximizes both performance and redundancy. However, in certain situations, the cluster can become unbalanced, with a large discrepancy between the data held on the various nodes. The classic situation that causes this is when a new node is added to the cluster. By default, Hadoop will consider the new node as a candidate for block placement alongside all other nodes, meaning that it will remain lightly utilized for a significant period of time. Nodes that have been out of service or have otherwise suffered issues may also have collected a smaller number of blocks than their peers.
Hadoop includes a tool called the balancer, started and stopped by the start-balancer.sh and stop-balancer.sh scripts respectively, to handle this situation.
When to rebalance
Hadoop does not have any automatic alarms that will alert you to an unbalanced filesystem. Instead, you need to keep an eye on the data reported by both hadoop fsck and hadoop dfsadmin and watch for imbalances across the nodes.
In reality, this is not something you usually need to worry about, as Hadoop is very good at
managing block placement and you likely only need to consider running the balancer to remove
major imbalances when adding new hardware or when returning faulty nodes to service. To
maintain maximum cluster health, however, it is not uncommon to have the balancer run on a
scheduled basis (for example, nightly) to keep the block balancing within a specified threshold.
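As a hedged sketch of such scheduling (the crontab entry, installation path, and threshold value are all illustrative; start-balancer.sh in Hadoop 1.x accepts a -threshold argument, but check your version), a nightly run might look like this:
# Run the balancer at 02:00 each night, stopping once no DataNode's utilization
# differs from the cluster average by more than 5 percent
0 2 * * * /usr/local/hadoop/bin/start-balancer.sh -threshold 5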
MapReduce management
As we saw in the previous chapter, the MapReduce framework is generally more tolerant of
problems and failures than HDFS. The JobTracker and TaskTrackers have no persistent data to
manage and, consequently, the management of MapReduce is more about the handling of
running jobs and tasks than servicing the framework itself.
Command line job management
The hadoop job command-line tool is the primary interface for this job management.
As usual, type the following to get a usage summary:
$ hadoop job --help
The opons to the command are generally self-explanatory; it allows you to start, stop,
list, and modify running jobs in addion to retrieving some elements of job history. Instead
of examining each individually, we will explore the use of several of these subcommands
together in the next secon.
Have a go hero – command line job management
The MapReduce UI also provides access to a subset of these capabilities. Explore the UI and
see what you can and cannot do from the web interface.
Job priorities and scheduling
So far, we have generally run a single job against our cluster and waited for it to complete. This has hidden the fact that, by default, Hadoop places subsequent job submissions into a First In, First Out (FIFO) queue. When a job finishes, Hadoop simply starts executing the next job in the queue. Unless we use one of the alternative schedulers that we will discuss in later sections, the FIFO scheduler dedicates the full cluster to the sole currently running job.
For small clusters with a pattern of job submission that rarely sees jobs waiting in the queue, this is completely fine. However, if jobs are often waiting in the queue, issues can arise. In particular, the FIFO model takes no account of job priority or resources needed. A long-running but low-priority job will execute before faster high-priority jobs that were submitted later.
To address this situaon, Hadoop denes ve levels of job priority: VERY_HIGH, HIGH,
NORMAL, LOW, and VERY_LOW. A job defaults to NORMAL priority, but this can be changed
with the hadoop job -set-priority command.
Time for action – changing job priorities and killing a job
Let's explore job priories by changing them dynamically and watching the result of
killing a job.
1. Start a relavely long-running job on the cluster.
$ hadoop jar hadoop-examples-1.0.4.jar pi 100 1000
2. Open another window and submit a second job.
$ hadoop jar hadoop-examples-1.0.4.jar wordcount test.txt out1
3. Open another window and submit a third.
$ hadoop jar hadoop-examples-1.0.4.jar wordcount test.txt out2
4. List the running jobs.
$ hadoop job -list
You'll see the following lines on the screen:
3 jobs currently running
JobId State StartTime UserName Priority SchedulingInfo
job_201201111540_0005 1 1326325810671 hadoop NORMAL NA
job_201201111540_0006 1 1326325938781 hadoop NORMAL NA
job_201201111540_0007 1 1326325961700 hadoop NORMAL NA
5. Check the status of the running job.
$ hadoop job -status job_201201111540_0005
You'll see the following lines on the screen:
Job: job_201201111540_0005
file: hdfs://head:9000/var/hadoop/mapred/system/
job_201201111540_0005/job.xml
tracking URL: http://head:50030/jobdetails.
jsp?jobid=job_201201111540_000
map() completion: 1.0
reduce() completion: 0.32666665
Counters: 18
6. Raise the priority of the last submied job to VERY_HIGH.
$ Hadoop job -set-priority job_201201111540_0007 VERY_HIGH
7. Kill the currently running job.
$ hadoop job -kill job_201201111540_0005
8. Watch the other jobs to see which begins processing.
What just happened?
We started a job on the cluster and then queued up another two jobs, confirming that the queued jobs were in the expected order by using hadoop job -list. The hadoop job -list all command would have listed completed as well as current jobs, and hadoop job -history would have allowed us to examine the jobs and their tasks in much more detail. To confirm the submitted job was running, we used hadoop job -status to get the current map and reduce task completion status for the job, in addition to the job counters.
We then used hadoop job -set-priority to increase the priority of the job currently last in the queue.
After using hadoop job -kill to abort the currently running job, we confirmed that the job with the increased priority executed next, even though the job remaining in the queue was submitted beforehand.
Alternative schedulers
Manually modifying job priories in the FIFO queue certainly does work, but it requires
acve monitoring and management of the job queue. If we think about the problem, the
reason we are having this diculty is the fact that Hadoop dedicates the enre cluster to
each job being executed.
Hadoop oers two addional job schedulers that take a dierent approach and share the
cluster among mulple concurrently execung jobs. There is also a plugin mechanism by
which addional schedulers can be added. Note that this type of resource sharing is one of
those problems that is conceptually simple but is in reality very complex and is an area of
much academic research. The goal is to maximize resource allocaon not only at a point in
me, but also over an extended period while honoring noons of relave priority.
Capacity Scheduler
The Capacity Scheduler uses mulple job queues (to which access control can be applied) to
which jobs are submied, each of which is allocated a poron of the cluster resources. You
could, for example, have a queue for large long-running jobs that is allocated 90 percent of
the cluster and one for smaller high-priority jobs allocated the remaining 10 percent. If both
queues have jobs submied, the cluster resources will be allocated in this proporon.
If, however, one queue is empty and the other has jobs to execute, the Capacity Scheduler will temporarily allocate the capacity of the empty queue to the busy one. Once a job is submitted to the empty queue, it will regain its capacity as the currently running tasks complete execution. This approach gives a reasonable balance between the desired resource allocation and preventing long periods of unused capacity.
Though disabled by default, the Capacity Scheduler supports job priorities within each queue. If a high priority job is submitted after a low priority one, its tasks will be scheduled in preference to the other jobs as capacity becomes available.
Fair Scheduler
The Fair Scheduler segments the cluster into pools into which jobs are submitted; there is often a correlation between the user and the pool. Though by default each pool gets an equal share of the cluster, this can be modified.
Within each pool, the default model is to share the pool across all jobs submitted to that pool. Therefore, if the cluster is split into pools for Alice and Bob, each of whom submits three jobs, the cluster will execute all six jobs in parallel. It is possible to place total limits on the number of concurrent jobs running in a pool, as too many running at once will potentially produce a large amount of temporary data and provide overall inefficient processing.
As with the Capacity Scheduler, the Fair Scheduler will over-allocate cluster capacity to other pools if one is empty, and then reclaim it as the pool receives jobs. It also supports job priorities within a pool to preferentially schedule tasks of high priority jobs over those with a lower priority.
Enabling alternative schedulers
Each of the alternave schedulers is provided as a JAR le in capacityScheduler and
fairScheduler directories within the contrib directory in the Hadoop installaon. To
enable a scheduler, either add its JAR to the hadoop/lib directory or explicitly place it on
the classpath. Note that each scheduler requires its own set of properes to congure its
usage. Refer to the documentaon for each for more details.
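As a rough sketch of enabling one of them (directory and JAR names vary between Hadoop releases, so treat these as illustrative and check your own contrib directory; the property and class names shown are those used by Hadoop 1.x), the Fair Scheduler could be enabled as follows:
$ cp $HADOOP_HOME/contrib/fairscheduler/hadoop-fairscheduler-*.jar $HADOOP_HOME/lib/
# Then set mapred.jobtracker.taskScheduler to org.apache.hadoop.mapred.FairScheduler
# in mapred-site.xml and restart the JobTracker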
When to use alternative schedulers
The alternave schedulers are very eecve, but are not really needed on small clusters
or those with no need to ensure mulple job concurrency or execuon of late-arriving
but high-priority jobs. Each has mulple conguraon parameters and requires tuning
to get opmal cluster ulizaon. But for any large cluster with mulple users and varying
job priories, they can be essenal.
Scaling
You have data and you have a running Hadoop cluster; now you get more of the former and need more of the latter. We have said repeatedly that Hadoop is an easily scalable system. So let us add some new capacity.
Adding capacity to a local Hadoop cluster
Hopefully, at this point, you should feel pretty underwhelmed at the idea of adding another node to a running cluster. All through Chapter 6, When Things Break, we constantly killed and restarted nodes. Adding a new node is really no different; all you need to do is perform the following steps:
1. Install Hadoop on the host.
2. Set the environment variables shown in Chapter 2, Getting Up and Running.
3. Copy the configuration files into the conf directory on the installation.
4. Add the host's DNS name or IP address to the slaves file on the node from which you usually run commands such as slaves.sh or cluster start/stop scripts.
And that's it!
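As a hedged sketch of those steps (the hostname newnode01 and the installation path are illustrative), the work on the command line amounts to little more than the following:
# On the new host: install Hadoop, then copy the cluster configuration across
$ scp master:/usr/local/hadoop/conf/*-site.xml /usr/local/hadoop/conf/
# On the host from which you run cluster commands: register the new node
$ echo "newnode01" >> /usr/local/hadoop/conf/slaves
# Back on the new host: start the worker daemons
$ hadoop-daemon.sh start datanode
$ hadoop-daemon.sh start tasktracker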
Have a go hero – adding a node and running balancer
Try out the process of adding a new node and afterwards examine the state of HDFS. If it is unbalanced, run the balancer to fix things. To help maximize the effect, ensure there is a reasonable amount of data on HDFS before adding the new node.
Adding capacity to an EMR job ow
If you are using Elasc MapReduce, for non-persistent clusters, the concept of scaling does
not always apply. Since you specify the number and type of hosts required when seng up
the job ow each me, you need only ensure that the cluster size is appropriate for the job
to be executed.
Expanding a running job ow
However, somemes you may have a long-running job that you want to complete more
quickly. In such a case, you can add more nodes to the running job ow. Recall that EMR has
three dierent types of node: master nodes for NameNode and JobTracker, core nodes for
HDFS, and task nodes for MapReduce workers. In this case, you could add addional task
nodes to help crunch the MapReduce job.
Another scenario is where you have dened a job ow comprising a series of MapReduce
jobs instead of just one. EMR now allows the job ow to be modied between steps in such
a series. This has the advantage of each job being given a tailored hardware conguraon
that gives beer control of balancing performance against cost.
The canonical model for EMR is for the job ow to pull its source data from S3, process that
data on a temporary EMR Hadoop cluster, and then write results back to S3. If, however,
you have a very large data set that requires frequent processing, the copying back and
forth of data could become too me-consuming. Another model that can be employed in
such a situaon is to use a persistent Hadoop cluster within a job ow that has been sized
with enough core nodes to store the needed data on HDFS. When processing is performed,
increase capacity as shown before by assigning more task nodes to the job ow.
These tasks to resize running job ows are not currently available from the AWS
Console and need to be performed through the API or command line tools.
Summary
This chapter covered how to build, maintain, and expand a Hadoop cluster. In particular, we learned where to find the default values for Hadoop configuration properties and how to set them programmatically on a per-job level. We learned how to choose hardware for a cluster and the value in understanding your likely workload before committing to purchases, and how Hadoop can use awareness of the physical location of hosts to optimize its block placement strategy through the use of rack awareness.
We then saw how the default Hadoop security model works, its weaknesses and how to mitigate them, how to mitigate the risks of NameNode failure we introduced in Chapter 6, When Things Break, and how to swap to a new NameNode host if disaster strikes. We learned more about block replica placement, how the cluster can become unbalanced, and what to do if it does.
We also saw the Hadoop model for MapReduce job scheduling and learned how job priorities can modify the behavior, how the Capacity Scheduler and Fair Scheduler give a more sophisticated way of managing cluster resources across multiple concurrent job submissions, and how to expand a cluster with new capacity.
This completes our exploration of core Hadoop in this book. In the remaining chapters, we will look at other systems and tools that build atop Hadoop to provide more sophisticated views on data and integration with other systems. We will start with a relational view on the data in HDFS through the use of Hive.
8
A Relational View on Data with Hive
MapReduce is a powerful paradigm which enables complex data processing that can reveal valuable insights. However, it does require a different mindset and some training and experience in breaking processing and analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views on the data held within HDFS. This chapter will introduce one of the most popular of these tools, Hive.
In this chapter, we will cover:
What Hive is and why you may want to use it
How to install and congure Hive
Using Hive to perform SQL-like analysis of the UFO data set
How Hive can approximate common features of a relaonal database such
as joins and views
How to eciently use Hive across very large data sets
How Hive allows the incorporaon of user-dened funcons into its queries
How Hive complements another common tool, Pig
Overview of Hive
Hive is a data warehouse that uses MapReduce to analyze data stored on HDFS. In parcular,
it provides a query language called HiveQL that closely resembles the common Structured
Query Language (SQL) standard.
Why use Hive?
In Chapter 4, Developing MapReduce Programs, we introduced Hadoop Streaming and explained that one large benefit of Streaming is how it allows faster turn-around in the development of MapReduce jobs. Hive takes this a step further. Instead of providing a way of more quickly developing map and reduce tasks, it offers a query language based on the industry standard SQL. Hive takes these HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user. Whereas Hadoop Streaming reduces the required code/compile/submit cycle, Hive removes it entirely and instead only requires the composition of HiveQL statements.
This interface to Hadoop not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce. Instead of requiring software development skills, anyone with a familiarity with SQL can use Hive.
The combination of these attributes means that Hive is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. Direct use of MapReduce requires map and reduce tasks to be written before the job can be executed, which means a necessary delay from the idea of a possible query to its execution. With Hive, the data analyst can work on refining HiveQL queries without the ongoing involvement of a software developer. There are of course operational and practical limitations (a badly written query will be inefficient regardless of technology) but the broad principle is compelling.
Thanks, Facebook!
Just as we earlier thanked Google, Yahoo!, and Doug Cutting for their contributions to Hadoop and the technologies that inspired it, it is to Facebook that we must now direct thanks.
Hive was developed by the Facebook Data team and, after being used internally, it was contributed to the Apache Software Foundation and made freely available as open source software. Its homepage is http://hive.apache.org.
Setting up Hive
In this secon, we will walk through the act of downloading, installing, and conguring Hive.
Prerequisites
Unlike Hadoop, there are no Hive masters, slaves, or nodes. Hive runs as a client applicaon
that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a
Hadoop cluster.
Although there is a mode suitable for small jobs and development usage, the usual situation is that Hive will require an existing functioning Hadoop cluster.
Just as other Hadoop clients don't need to be executed on the actual cluster nodes, Hive can be executed on any host where the following are true:
Hadoop is installed on the host (even if no processes are running)
The HADOOP_HOME environment variable is set and points to the location of the Hadoop installation
The ${HADOOP_HOME}/bin directory is added to the system or user path
Getting Hive
You should download the latest stable Hive version from http://hive.apache.org/releases.html.
The Hive getting started guide at http://cwiki.apache.org/confluence/display/Hive/GettingStarted will give recommendations on version compatibility, but as a general principle, you should expect the most recent stable versions of Hive, Hadoop, and Java to work together.
Time for action – installing Hive
Let's now set up Hive so we can start using it in action.
1. Download the latest stable version of Hive and move it to the location in which you wish to have it installed:
$ mv hive-0.8.1.tar.gz /usr/local
2. Uncompress the package:
$ tar -xzf hive-0.8.1.tar.gz
3. Set the HIVE_HOME variable to the installation directory:
$ export HIVE_HOME=/usr/local/hive
4. Add the Hive home directory to the path variable:
$ export PATH=${HIVE_HOME}/bin:${PATH}
5. Create directories required by Hive on HDFS:
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
6. Make both of these directories group writeable:
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse
7. Try to start Hive:
$ hive
You will receive the following response:
Logging initialized using configuration in jar:file:/opt/hive-
0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_
hadoop_201203031500_480385673.txt
hive>
8. Exit the Hive interacve shell:
$ hive> quit;
What just happened?
Aer downloading the latest stable Hive release, we copied it to the desired locaon
and uncompressed the archive le. This created a directory, hive-<version>.
Similarly, as we previously dened HADOOP_HOME and added the bin directory within
the installaon to the path variable, we then did something similar with HIVE_HOME
and its bin directory.
Remember that to avoid having to set these variables every me you log in,
add them to your shell login script or to a separate conguraon script that
you source when you want to use Hive.
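A small sketch of such a login script fragment (the HADOOP_HOME path is an assumption; use the location of your own installation) would be:
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export PATH=${HIVE_HOME}/bin:${HADOOP_HOME}/bin:${PATH}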
We then created two directories on HDFS that Hive requires and changed their attributes to make them group writeable. The /tmp directory is where Hive will, by default, write transient data created during query execution and will also place output data in this location. The /user/hive/warehouse directory is where Hive will store the data that is written into its tables.
After all this setup, we run the hive command; a successful installation will give output similar to that shown above. Running the hive command with no arguments enters an interactive shell; the hive> prompt is analogous to the sql> or mysql> prompts familiar from relational database interactive tools.
We then exit the interacve shell by typing quit;. Note the trailing semicolon ;. HiveQL is,
as menoned, very similar to SQL and follows the convenon that all commands must be
terminated by a semicolon. Pressing Enter without a semicolon will allow commands to
be connued on subsequent lines.
Using Hive
With our Hive installaon, we will now import and analyze the UFO data set introduced in
Chapter 4, Developing MapReduce Programs.
When imporng any new data into Hive, there is generally a three-stage process:
1. Create the specicaon of the table into which the data is to be imported.
2. Import the data into the created table.
3. Execute HiveQL queries against the table.
This process should look very familiar to those with experience with relaonal databases.
Hive gives a structured query view of our data and to enable that, we must rst dene the
specicaon of the table's columns and import the data into the table before we can execute
any queries.
We assume a general level of familiarity with SQL and will be focusing
more on how to get things done with Hive than in explaining parcular
SQL constructs in detail. A SQL reference may be handy for those with lile
familiarity with the language, though we will make sure you know what
each statement does, even if the details require deeper SQL knowledge.
Time for action – creating a table for the UFO data
Perform the following steps to create a table for the UFO data:
1. Start the Hive interacve shell:
$ hive
2. Create a table for the UFO data set, spling the statement across mulple lines for
easy readability:
hive> CREATE TABLE ufodata(sighted STRING, reported STRING,
sighting_location STRING, > shape STRING, duration STRING,
description STRING COMMENT 'Free text description')
COMMENT 'The UFO data set.' ;
You should see the following lines once you are done:
OK
Time taken: 0.238 seconds
3. List all exisng tables:
hive> show tables;
You will receive the following output:
OK
ufodata
Time taken: 0.156 seconds
4. Show tables matching a regular expression:
hive> show tables '.*data';
You will receive the following output:
OK
ufodata
Time taken: 0.065 seconds
5. Validate the table specicaon:
hive> describe ufodata;
You will receive the following output:
OK
sighted string
reported string
sighting_location string
shape string
duration string
description string Free text description
Time taken: 0.086 seconds
6. Display a more detailed descripon of the table:
hive> describe extended ufodata;
You will receive the following output:
OK
sighted string
reported string
Detailed Table Information Table(tableName:ufodata,
dbName:default, owner:hadoop, createTime:1330818664,
lastAccessTime:0, retention:0,
…location:hdfs://head:9000/user/hive/warehouse/
ufodata, inputFormat:org.apache.hadoop.mapred.
TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.
HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1,
What just happened?
Aer starng the interacve Hive interpreter, we used the CREATE TABLE command to
dene the structure of the UFO data table. As with standard SQL, this requires that each
column in the table has a name and datatype. HiveQL also oers oponal comments on
each column and on the overall table, as shown previously where we add one column
and one table comment.
For the UFO data, we use STRING as the data type; HiveQL, as with SQL, supports a variety
of datatypes:
Boolean types: BOOLEAN
Integer types: TINYINT, INT, BIGINT
Floang point types: FLOAT, DOUBLE
Textual types: STRING
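As a purely illustrative sketch (the sightings_summary table and its columns are hypothetical and not used elsewhere in this chapter), a definition mixing several of these types could look like the following:
$ hive -e "CREATE TABLE sightings_summary(sighting_year INT, shape STRING, sighting_count BIGINT, avg_duration_minutes DOUBLE, confirmed BOOLEAN);"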
Aer creang the table, we use the SHOW TABLES statement to verify that the table has
been created. This command lists all tables and in this case, our new UFO table is the only
one in the system.
We then use a variant on SHOW TABLES that takes an oponal Java regular expression to
match against the table name. In this case, the output is idencal to the previous command,
but in systems with a large number of tables—especially when you do not know the exact
name—this variant can be very useful.
We have seen the table exists but we have not validated whether
it was created properly. We next do this by using the DESCRIBE
TABLE command to display the specicaon of the named table.
We see that all is as specied (though note the table comment is
not shown by this command) and then use the DESCRIBE TABLE
EXTENDED variant to get much more informaon about the table.
We have omied much of this nal output though a few points of interest are present.
Note the input format is specied as TextInputFormat; by default, Hive will assume
any HDFS les inserted into a table are stored as text les.
We also see that the table data will be stored in a directory under the /user/hive/
warehouse HDFS directory we created earlier.
A note on case sensivity:
HiveQL, as with SQL, is not case sensive in terms of keywords, columns, or
table names. By convenon, SQL statements use uppercase for SQL language
keywords and we will generally follow this when using HiveQL within les, as
shown later. However, when typing interacve commands, we will frequently
take the line of least resistance and use lowercase.
Time for action – inserting the UFO data
Now that we have created a table, let us load the UFO data into it.
1. Copy the UFO data le onto HDFS:
$ hadoop fs -put ufo.tsv /tmp/ufo.tsv
2. Conrm that the le was copied:
$ hadoop fs -ls /tmp
You will receive the following response:
Found 2 items
drwxrwxr-x - hadoop supergroup 0 … 14:52 /tmp/hive-
hadoop
-rw-r--r-- 3 hadoop supergroup 75342464 2012-03-03 16:01 /tmp/
ufo.tsv
3. Enter the Hive interacve shell:
$ hive
4. Load the data from the previously copied le into the ufodata table:
hive> LOAD DATA INPATH '/tmp/ufo.tsv' OVERWRITE INTO TABLE
ufodata;
You will receive the following response:
Loading data to table default.ufodata
Deleted hdfs://head:9000/user/hive/warehouse/ufodata
OK
Time taken: 5.494 seconds
5. Exit the Hive shell:
hive> quit;
6. Check the locaon from which we copied the data le:
$ hadoop fs -ls /tmp
You will receive the following response:
Found 1 items
drwxrwxr-x - hadoop supergroup 0 … 16:10 /tmp/hive-
hadoop
What just happened?
We rst copied onto HDFS the tab-separated le of UFO sighngs used previously in Chapter
4, Developing MapReduce Programs. Aer validang the le's presence on HDFS, we started
the Hive interacve shell and used the LOAD DATA command to load the le into the
ufodata table.
Because we are using a le already on HDFS, the path was specied by INPATH alone.
We could have loaded directly from a le on the local lesystem (obviang the need
for the prior explicit HDFS copy) by using LOCAL INPATH.
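As a hedged illustration (the local path here is hypothetical), the equivalent load from the local filesystem would be:
$ hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/ufo.tsv' OVERWRITE INTO TABLE ufodata;"
With LOCAL INPATH the file is copied into the Hive warehouse rather than moved, so the local original is left in place.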
We specied the OVERWRITE statement which will delete any exisng data in the table
before loading the new data. This obviously should be used with care, as can be seen
from the output of the command, the directory holding the table data is removed by
use of OVERWRITE.
Note the command took only a lile over ve seconds to execute, signicantly longer
than it would have taken to copy the UFO data le onto HDFS.
Though we specied an explicit le in this example, it is possible to load mulple
les with a single statement by specifying a directory as the INPATH locaon; in
such a case, all les within the directory will be loaded into the table.
Aer exing the Hive shell, we look again at the directory into which we copied the data le
and nd it is no longer there. If a LOAD statement is given a path to data on HDFS, it will not
simply copy this into /user/hive/datawarehouse, but will move it there instead. If you
want to analyze data on HDFS that is used by other applicaons, then either create a copy or
use the EXTERNAL mechanism that will be described later.
Validating the data
Now that we have loaded the data into our table, it is good practice to do some quick validating queries to confirm all is as expected. Sometimes our initial table definition turns out to be incorrect.
Time for action – validating the table
The easiest way to do some initial validation is to perform some summary queries to validate the import. This is similar to the types of activities for which we used Hadoop Streaming in Chapter 4, Developing MapReduce Programs.
1. Instead of using the Hive shell, pass the following HiveQL to the hive command-line
tool to count the number of entries in the table:
$ hive -e "select count(*) from ufodata;"
You will receive the following response:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Hadoop job information for Stage-1: number of mappers: 1; number
of reducers: 1
2012-03-03 16:15:15,510 Stage-1 map = 0%, reduce = 0%
2012-03-03 16:15:21,552 Stage-1 map = 100%, reduce = 0%
2012-03-03 16:15:30,622 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201202281524_0006
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 HDFS Read: 75416209 HDFS Write: 6
SUCESS
Total MapReduce CPU Time Spent: 0 msec
OK
61393
Time taken: 28.218 seconds
2. Display a sample of ve values for the sighted column:
$ hive -e "select sighted from ufodata limit 5;"
You will receive the following response:
Total MapReduce jobs = 1
Launching Job 1 out of 1
OK
19951009 19951009 Iowa City, IA Man repts. witnessing
&quot;flash, followed by a classic UFO, w/ a tailfin at
back.&quot; Red color on top half of tailfin. Became triangular.
19951010 19951011 Milwaukee, WI 2 min. Man on Hwy 43 SW
of Milwaukee sees large, bright blue light streak by his car,
descend, turn, cross road ahead, strobe. Bizarre!
19950101 19950103 Shelton, WA Telephoned Report:CA
woman visiting daughter witness discs and triangular ships over
Squaxin Island in Puget Sound. Dramatic. Written report, with
illustrations, submitted to NUFORC.
19950510 19950510 Columbia, MO 2 min. Man repts. son&apos;s
bizarre sighting of small humanoid creature in back yard. Reptd.
in Acteon Journal, St. Louis UFO newsletter.
19950611 19950614 Seattle, WA Anonymous caller repts.
sighting 4 ufo&apos;s in NNE sky, 45 deg. above horizon. (No
other facts reptd. No return tel. #.)
Time taken: 11.693 seconds
What just happened?
In this example, we use the hive -e command to directly pass HiveQL to the Hive tool instead of using the interactive shell. The interactive shell is useful when performing a series of Hive operations. For simple statements, it is often more convenient to use this approach and pass the query string directly to the command-line tool. This also shows that Hive can be called from scripts like any other Unix tool.
When using hive -e, it is not necessary to terminate the HiveQL string with a semicolon, but if you are like me, the habit is hard to break. If you want multiple commands in a single string, they must obviously be separated by semicolons.
The result of the first query is 61393, the same number of records we saw when analyzing the UFO data set previously with direct MapReduce. This tells us the entire data set was indeed loaded into the table.
We then execute a second query to select five values of the first column in the table, which should return a list of five dates. However, the output instead includes the entire record, which has been loaded into the first column.
The issue is that though we relied on Hive loading our data file as a text file, we didn't take into account the separator between columns. Our file is tab separated, but Hive, by default, expects its input files to have fields separated by ASCII code 001 (Control-A).
Time for action – redefining the table with the correct column separator
Let's fix our table specification as follows:
1. Create the following file as commands.hql:
DROP TABLE ufodata ;
CREATE TABLE ufodata(sighted string, reported string, sighting_location string,
shape string, duration string, description string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' ;
LOAD DATA INPATH '/tmp/ufo.tsv' OVERWRITE INTO TABLE ufodata ;
2. Copy the data le onto HDFS:
$ hadoop fs -put ufo.tsv /tmp/ufo.tsv
3. Execute the HiveQL script:
$ hive -f commands.hql
You will receive the following response:
OK
Time taken: 5.821 seconds
OK
Time taken: 0.248 seconds
Loading data to table default.ufodata
Deleted hdfs://head:9000/user/hive/warehouse/ufodata
OK
Time taken: 0.285 seconds
4. Validate the number of rows in the table:
$ hive -e "select count(*) from ufodata;"
You will receive the following response:
OK
61393
Time taken: 28.077 seconds
5. Validate the contents of the reported column:
$ hive -e "select reported from ufodata limit 5"
You will receive the following response:
OK
19951009
19951011
19950103
19950510
19950614
Time taken: 14.852 seconds
What just happened?
We introduced a third way to invoke HiveQL commands in this example. In addition to using the interactive shell or passing query strings to the Hive tool, we can have Hive read and execute the contents of a file containing a series of HiveQL statements.
We first created such a file that deletes the old table, creates a new one, and loads the data file into it.
The main differences with the table specification are the ROW FORMAT and FIELDS TERMINATED BY clauses. We need both of these as the first tells Hive that the row contains multiple delimited fields, while the second specifies the actual separator. As can be seen here, we can use both explicit ASCII codes as well as common tokens such as \t for tab.
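As an illustration of the explicit form, Hive's default Ctrl-A delimiter can be written with an octal escape; the fragment below is a sketch showing only the relevant clauses rather than a full table definition:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001' ;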
Be careful with the separator specification as it must be precise and is case sensitive. Do not waste a few hours by accidentally writing \T instead of \t as I did recently.
Before running the script, we copy the data file onto HDFS again (the previous copy was moved into Hive's warehouse by the earlier LOAD DATA statement) and then use hive -f to execute the HiveQL file.
As before, we then execute two simple SELECT statements to first count the rows in the table and then extract specific values from a named column for a small number of rows.
The overall row count is, as should be expected, the same as before, but the second statement now produces what looks like correct data, showing that the rows are now correctly being split into their constituent fields.
Hive tables – real or not?
If you look closely at the time taken by the various commands in the preceding example, you'll see a pattern which may at first seem strange. Loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.
When loading data into a Hive table, the process is different from what may be expected with a traditional relational database. Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. What it does instead is create metadata around the data, which is then used by subsequent HiveQL queries.
Neither the CREATE TABLE nor the LOAD DATA statement, therefore, truly creates concrete table data as such; instead they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table.
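If you want to see this metadata for yourself, the DESCRIBE EXTENDED statement prints it, including the location of the files backing the table; this is a quick check rather than a required step:
$ hive -e "describe extended ufodata;"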
Time for action – creating a table from an existing file
So far we have loaded data into Hive directly from files over which Hive effectively takes control. It is also possible, however, to create tables that model data held in files external to Hive. This can be useful when we want the ability to perform Hive processing over data written and managed by external applications or otherwise required to be held in directories outside the Hive warehouse directory. Such files are not moved into the Hive warehouse directory or deleted when the table is dropped.
1. Save the following to a file called states.hql:
CREATE EXTERNAL TABLE states(abbreviation string, full_name
string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/tmp/states' ;
2. Copy the data file onto HDFS and confirm its presence afterwards:
$ hadoop fs -put states.txt /tmp/states/states.txt
$ hadoop fs -ls /tmp/states
You will receive the following response:
Found 1 items
-rw-r--r-- 3 hadoop supergroup 654 2012-03-03 16:54 /tmp/
states/states.txt
3. Execute the HiveQL script:
$ hive -f states.hql
You will receive the following response:
Logging initialized using configuration in jar:file:/opt/hive-
0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_
hadoop_201203031655_1132553792.txt
OK
Time taken: 3.954 seconds
OK
Time taken: 0.594 seconds
4. Check the source data file:
$ hadoop fs -ls /tmp/states
You will receive the following response:
Found 1 items
-rw-r--r-- 3 hadoop supergroup 654 … /tmp/states/states.
txt
5. Execute a sample query against the table:
$ hive -e "select full_name from states where abbreviation like
'CA'"
You will receive the following response:
Logging initialized using configuration in jar:file:/opt/hive-
0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_
hadoop_201203031655_410945775.txt
Total MapReduce jobs = 1
...
OK
California
Time taken: 15.75 seconds
What just happened?
The HiveQL statement to create an external table differs only slightly from the forms of CREATE TABLE we used previously. The EXTERNAL keyword specifies that the table exists in resources that Hive does not control, and the LOCATION clause specifies where the source file or directory is to be found.
After creating the HiveQL script, we copied the source file onto HDFS. For this table, we used the data file from Chapter 4, Developing MapReduce Programs, which maps U.S. states to their common two-letter abbreviations.
After confirming the file was in the expected location on HDFS, we executed the query to create the table and checked the source file again. Unlike previous table creations that moved the source file into the /user/hive/warehouse directory, the states.txt file is still in the HDFS location into which it was copied.
Finally, we executed a query against the table to confirm it was populated with the source data, and the expected result confirms this. This highlights an additional difference with this form of CREATE TABLE; for our previous non-external tables, the table creation statement does not ingest any data into the table, and a subsequent LOAD DATA or (as we'll see later) INSERT statement performs the actual table population. With table definitions that include the LOCATION specification, we can create the table and ingest data in a single statement.
We now have two tables in Hive: the larger table with UFO sighting data and a smaller one mapping U.S. state abbreviations to their full names. Wouldn't it be useful to use data from the second table to enrich the location column in the first?
Time for action – performing a join
Joins are a very frequently used tool in SQL, though they sometimes appear a little intimidating to those new to the language. Essentially, a join allows rows in multiple tables to be logically combined based on a conditional statement. Hive has rich support for joins, which we will now examine.
1. Create the following as join.hql:
SELECT t1.sighted, t2.full_name
FROM ufodata t1 JOIN states t2
ON (LOWER(t2.abbreviation) = LOWER(SUBSTR( t1.sighting_location,
(LENGTH(t1.sighting_location)-1))))
LIMIT 5 ;
2. Execute the query:
$ hive -f join.hql
You will receive the following response:
OK
20060930 Alaska
20051018 Alaska
20050707 Alaska
20100112 Alaska
20100625 Alaska
Time taken: 33.255 seconds
What just happened?
The actual join query is relatively straightforward; we want to extract the sighted date and location for a series of records, but instead of the raw location field, we wish to map this into the full state name. The HiveQL file we created performs such a query. The join itself is specified by the standard JOIN keyword, and the matching condition is contained in the ON clause.
Things are complicated by a restriction in Hive: it only supports equijoins, that is, those where the ON clause contains an equality check. It is not possible to specify a join condition using operators such as > or <, or, as we would have preferred to use here, the LIKE keyword.
Instead, therefore, we have an opportunity to introduce several of Hive's built-in functions, in particular those to convert a string to lowercase (LOWER), to extract a substring from a string (SUBSTR), and to return the number of characters in a string (LENGTH).
We know that most location entries are of the form "city, state_abbreviation". So we use SUBSTR to extract the third and second from last characters in the string, using LENGTH to calculate the indices. We convert both the state abbreviation and the extracted string to lowercase via LOWER because we cannot assume that all entries in the sightings table will use uniform capitalization.
Aer execung the script, we get the expected sample lines of output that indeed include
the sighng date and full state name instead of the abbreviaon.
Note the use of the LIMIT clause that simply constrains how many output rows will be
returned from the query. This is also an indicaon that HiveQL is most similar to SQL
dialects such as those found in open source databases such as MySQL.
This example shows an inner join; Hive also supports le and right outer joins as well as le
semi joins. There are a number of subtlees around the use of joins in Hive (such as the
aforemenoned equijoin restricon) and you should really read through the documentaon
on the Hive homepage if you are likely to use joins, especially when using very large tables.
This is not a criticism of Hive alone; joins are incredibly powerful tools, but it is probably fair to say that badly written joins, or those created in ignorance of critical constraints, have brought more relational databases to a grinding halt than any other type of SQL query.
Have a go hero – improve the join to use regular expressions
As well as the string functions we used previously, Hive also has functions such as RLIKE and REGEXP_EXTRACT that provide direct support for Java-style regular expression manipulation. Rewrite the preceding join specification using regular expressions to make a more accurate and elegant join statement.
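To give a flavour of these functions without spoiling the exercise, the following sketch pulls out a trailing two-letter group that follows a comma; the pattern is only an assumption about the location format and will need refinement for the messier records:
SELECT REGEXP_EXTRACT(sighting_location, ', *([A-Z][A-Z])', 1)
FROM ufodata LIMIT 5 ;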
Hive and SQL views
Another powerful SQL feature supported by Hive is views. These are useful when, instead of a static table, the contents of a logical table are specified by a SELECT statement; subsequent queries can then be executed against this dynamic view (hence the name) of the underlying data.
Time for action – using views
We can use views to hide the underlying query complexity such as the previous join example.
Let us now create a view to do just that.
1. Create the following as view.hql:
CREATE VIEW IF NOT EXISTS usa_sightings (sighted, reported,
shape, state)
AS select t1.sighted, t1.reported, t1.shape, t2.full_name
FROM ufodata t1 JOIN states t2
ON (LOWER(t2.abbreviation) = LOWER(substr( t1.sighting_location,
(LENGTH(t1.sighting_location)-1)))) ;
2. Execute the script:
$ hive -f view.hql
You will receive the following response:
Logging initialized using configuration in jar:file:/opt/hive-
0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_
hadoop_201203040557_1017700649.txt
OK
Time taken: 5.135 seconds
3. Execute the script again:
$ hive -f view.hql
You will receive the following response:
Logging initialized using configuration in jar:file:/opt/hive-
0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_
hadoop_201203040557_851275946.txt
OK
Time taken: 4.828 seconds
4. Execute a test query against the view:
$ hive -e "select count(state) from usa_sightings where state =
'California'"
You will receive the following response:
Logging initialized using configuration in jar:file:/opt/hive-
0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_
hadoop_201203040558_1729315866.txt
Total MapReduce jobs = 2
Launching Job 1 out of 2
2012-03-04 05:58:12,991 Stage-1 map = 0%, reduce = 0%
2012-03-04 05:58:16,021 Stage-1 map = 50%, reduce = 0%
2012-03-04 05:58:18,046 Stage-1 map = 100%, reduce = 0%
2012-03-04 05:58:24,092 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201203040432_0027
Launching Job 2 out of 2
2012-03-04 05:58:33,650 Stage-2 map = 0%, reduce = 0%
2012-03-04 05:58:36,673 Stage-2 map = 100%, reduce = 0%
2012-03-04 05:58:45,730 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201203040432_0028
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 HDFS Read: 75416863 HDFS Write: 116
SUCESS
Job 1: Map: 1 Reduce: 1 HDFS Read: 304 HDFS Write: 5 SUCESS
Total MapReduce CPU Time Spent: 0 msec.
OK
7599
Time taken: 47.03 seconds
5. Delete the view:
$ hive -e "drop view usa_sightings"
You will receive the following output on your screen:
OK
Time taken: 5.298 seconds
What just happened?
We rstly created the view using the CREATE VIEW statement. This is similar to CREATE
TABLE but has two main dierences:
The column denions include only the name as the type, which will be determined
from the underlying query
The AS clause species the SELECT statement that will be used to generate the view
We use the previous join statement to generate the view, so in eect we are creang a table
that has the locaon eld normalized to the full state name without directly requiring the
user to deal with how that normalizaon is performed.
The oponal IF NOT EXISTS clause (which can also be used with CREATE TABLE) means
that Hive will ignore duplicate aempts to create the view. Without this clause, repeated
aempts to create the view will generate errors, which isn't always the desired behavior.
We then execute this script twice, both to create the view and to demonstrate that the inclusion of the IF NOT EXISTS clause is preventing errors as we intended.
With the view created, we then execute a query against it, in this case to simply count how many of the sightings took place in California. All our previous Hive statements that generate MapReduce jobs have produced only a single one; this query against our view requires a pair of chained MapReduce jobs. Looking at the query and the view specification, this isn't necessarily surprising; it's not difficult to imagine how the view would be realized by the first MapReduce job and its output fed to the subsequent counting query performed as the second job. As a consequence, you will also see this two-stage job take much longer than any of our previous queries.
Hive is actually smarter than this. If the outer query can be folded into the view creation, then Hive will generate and execute only one MapReduce job. Given the time taken to hand-develop a series of co-operating MapReduce jobs, this is a great example of the benefits Hive can offer. Though a hand-written MapReduce job (or series of jobs) is likely to be much more efficient, Hive is a great tool for determining which jobs are useful in the first place. It is better to run a slow Hive query to discover that an idea isn't as useful as hoped than to spend a day developing a MapReduce job to come to the same conclusion.
We have mentioned that views can hide underlying complexity; this does often mean that executing views is intrinsically slow. For large-scale production workloads, you will want to optimize the SQL and possibly remove the view entirely.
After running the query, we delete the view through the DROP VIEW statement, which demonstrates again the similarity between how HiveQL (and SQL) handle tables and views.
Handling dirty data in Hive
The observant among you may notice that the number of California sightings reported by this query is different from the number we generated in Chapter 4, Developing MapReduce Programs. Why?
Recall that before running our Hadoop Streaming or Java MapReduce jobs in Chapter 4, Developing MapReduce Programs, we had a mechanism to ignore input rows that were malformed. Then, while processing the data, we used more precise regular expressions to extract the two-letter state abbreviation from the location field. In Hive, however, we did no such pre-processing and relied on quite crude mechanisms to extract the abbreviation. For the latter, we could use some of Hive's previously mentioned functions that support regular expressions, but for the former, we would at best be forced to add complex validation WHERE clauses to many of our queries.
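A crude validation clause of that sort might look like the following sketch, which keeps only rows whose sighted column is an eight-digit date (RLIKE uses Java regular expression syntax):
SELECT COUNT(*) FROM ufodata
WHERE TRIM(sighted) RLIKE '^[0-9]{8}$' ;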
A frequent pattern is instead to preprocess data before it is imported into Hive; in this case, for example, we could run a MapReduce job to remove all malformed records in the input file and another to normalize the location field in advance.
Have a go hero – do it!
Write MapReduce jobs (one or two) to do this pre-processing of the input data and generate a cleaned-up file more suited to direct importation into Hive. Then write a script to execute the jobs, create a Hive table, and import the new file into the table. This will also show how easily and powerfully Hadoop and Hive can be scripted together.
Time for action – exporting query output
We have previously either loaded large quantities of data into Hive or extracted very small quantities as query results. We can also export large result sets; let us look at an example.
1. Recreate the previously used view:
$ hive -f view.hql
2. Create the following file as export.hql:
INSERT OVERWRITE DIRECTORY '/tmp/out'
SELECT reported, shape, state
FROM usa_sightings
WHERE state = 'California' ;
3. Execute the script:
$ hive -f export.hql
You will receive the following response:
2012-03-04 06:20:44,571 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201203040432_0029
Moving data to: /tmp/out
7599 Rows loaded to /tmp/out
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 HDFS Read: 75416863 HDFS Write: 210901
SUCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 46.669 seconds
4. Look in the specified output directory:
$ hadoop fs -ls /tmp/out
You will receive the following response:
Found 1 items
-rw-r--r-- 3 hadoop supergroup 210901 … /tmp/out/000000_1
5. Examine the output file:
$ hadoop fs -cat /tmp/out/000000_1 | head
You will receive the following output on your screen:
20021014_ light_California
20050224_ other_California
20021001_ egg_California
20030527_ sphere_California
20050813_ light_California
20040701_ other_California
20031007_ light_California
What just happened?
Aer reusing the previous view, we created our HiveQL script using the INSERT OVERWRITE
DIRECTORY command. This, as the name suggests, places the results of the subsequent
statement into the specied locaon. The OVERWRITE modier is again oponal and simply
determines if any exisng content in the locaon is to be rstly removed or not. The INSERT
command is followed by a SELECT statement which produces the data to be wrien to the
output locaon. In this example, we use a query on our previously created view which you
will recall is built atop a join, demonstrang how the query here can be arbitrarily complex.
There is an addional oponal LOCAL modier for occasions when the output data is to be
wrien to the local lesystem of the host running the Hive command instead of HDFS.
When we run the script, the MapReduce output is mostly as we have come to expect, but with the addition of a line stating how many rows have been exported to the specified output location.
After running the script, we check the output directory to see that the result file is there, and when we look at it, the contents are as we would expect.
Just as Hive's default field separator for text input files is ASCII code 001 (Ctrl-A, written '\001'), it also uses this as the default separator for output files, as shown in the preceding example.
The INSERT command can also be used to populate one table with the results of a query on others, and we will look at that next. First, we need to explain a concept we will use at the same time.
Partitioning the table
We menoned earlier that badly wrien joins have a long and disreputable history of
causing relaonal databases to spend huge amounts of me grinding through unnecessary
work. A similar sad tale can be told of queries that perform full table scans (vising every
row in the table) instead of using indices that allow direct access to rows of interest.
For data stored on HDFS and mapped into a Hive table, the default situaon almost demands
full table scans. With no way of segmenng data into a more organized structure that allows
processing to apply only to the data subset of interest, Hive is forced to process the enre
data set. For our UFO le of approximately 70 MB, this really is not a problem as we see the
le processed in tens of seconds. However, what if it was a thousand mes larger?
As with tradional relaonal databases, Hive allows tables to be paroned based on the
values of virtual columns and for these values to then be used in query predicates later.
In parcular, when a table is created, it can have one or more paron columns and when
loading data into the table, the specied values for these columns will determine the
paron into which the data is wrien.
The most common paroning strategy for tables that see lots of data ingested on a daily basis
is for the paron column to be the date. Future queries can then be constrained to process
only that data contained within a parcular paron. Under the covers, Hive stores each
paron in its own directory and les, which is how it can then apply MapReduce jobs only on
the data of interest. Through the use of mulple paron columns, it is possible to create a rich
hierarchical structure and for large tables with queries that require only small subsets of data it
is worthwhile spending some me deciding on the opmal paroning strategy.
For our UFO data set, we will use the year of the sighng as the paron value but we have
to use a few less common features to make it happen. Hence, aer this introducon, let us
now make some parons!
Time for action – making a partitioned UFO sighting table
We will create a new table for the UFO data to demonstrate the usefulness of partitioning.
1. Save the following query as createpartition.hql:
CREATE TABLE partufo(sighted string, reported string, sighting_
location string,shape string, duration string, description string)
PARTITIONED BY (year string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' ;
2. Save the following query as insertpartition.hql:
SET hive.exec.dynamic.partition=true ;
SET hive.exec.dynamic.partition.mode=nonstrict ;
INSERT OVERWRITE TABLE partufo partition (year)
SELECT sighted, reported, sighting_location, shape, duration,
description,
SUBSTR(TRIM(sighted), 1,4) FROM ufodata ;
3. Create the partitioned table:
$ hive -f createpartition.hql
You will receive the following response:
Logging initialized using configuration in jar:file:/opt/hive-
0.8.1/lib/hive-common-0.8.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_
hadoop_201203101838_17331656.txt
OK
Time taken: 4.754 seconds
4. Examine the created table:
$ hive -e "describe partufo;"
OK
sighted string
reported string
sighting_location string
shape string
duration string
description string
year string
Time taken: 4.704 seconds
5. Populate the table:
$ hive -f insertpartition.hql
You will see the following lines on the screen:
Total MapReduce jobs = 2
Ended Job = job_201203040432_0041
Ended Job = 994255701, job is filtered out (removed at runtime).
Moving data to: hdfs://head:9000/tmp/hive-hadoop/hive_2012-03-
10_18-38-36_380_1188564613139061024/-ext-10000
Loading data to table default.partufo partition (year=null)
Loading partition {year=1977}
Loading partition {year=1880}
Loading partition {year=1975}
Loading partition {year=2007}
Loading partition {year=1957}
Table default.partufo stats: [num_partitions: 100, num_files: 100,
num_rows: 0, total_size: 74751215, raw_data_size: 0]
61393 Rows loaded to partufo
OK
Time taken: 46.285 seconds
6. Execute a count command against a partition:
$ hive -e "select count(*) from partufo where year = '1989'"
You will receive the following response:
OK
249
Time taken: 26.56 seconds
7. Execute a similar query on the non-partitioned table:
$ hive -e "select count(*) from ufodata where sighted like '1989%'"
You will receive the following response:
OK
249
Time taken: 28.61 seconds
8. List the contents of the Hive directory housing the partitioned table:
$ hadoop fs -ls /user/hive/warehouse/partufo
You will receive the following response:
Found 100 items
drwxr-xr-x - hadoop supergroup 0 2012-03-10 18:38 /
user/hive/warehouse/partufo/year=0000
drwxr-xr-x - hadoop supergroup 0 2012-03-10 18:38 /
user/hive/warehouse/partufo/year=1400
drwxr-xr-x - hadoop supergroup 0 2012-03-10 18:38 /
user/hive/warehouse/partufo/year=1762
drwxr-xr-x - hadoop supergroup 0 2012-03-10 18:38 /
user/hive/warehouse/partufo/year=1790
drwxr-xr-x - hadoop supergroup 0 2012-03-10 18:38 /
user/hive/warehouse/partufo/year=1860
drwxr-xr-x - hadoop supergroup 0 2012-03-10 18:38 /
user/hive/warehouse/partufo/year=1864
drwxr-xr-x - hadoop supergroup 0 2012-03-10 18:38 /
user/hive/warehouse/partufo/year=1865
What just happened?
We created two HiveQL scripts for this example. The first of these creates the new partitioned table. As we can see, it looks very much like the previous CREATE TABLE statements; the difference is the additional PARTITIONED BY clause.
After we execute this script, we describe the table and see that, from a HiveQL perspective, the table appears just like the previous ufodata table but with the addition of an extra column for the year. This allows the column to be treated like any other when it comes to specifying conditions in WHERE clauses, even though the column data does not actually exist in the on-disk data files.
We next execute the second script, which performs the actual loading of data into the partitioned table. There are several things of note here.
Firstly, we see that the INSERT command can be used with tables just as we previously used it for directories. The INSERT statement has a specification of where the data is to go, and a subsequent SELECT statement gathers the required data from existing tables or views.
The partitioning mechanism used here takes advantage of a relatively new feature in Hive, dynamic partitions. In most cases, the partition clause in this statement would include an explicit value for the year column. Though that would work if we were uploading a day's data into a daily partition, it isn't suitable for our type of data file, where the various rows should be inserted into a variety of partitions. By simply specifying the column name with no value, the partition name will be automatically generated from the value of the year column returned by the SELECT statement.
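For comparison, a statically partitioned load of a single year might look like the following sketch; the WHERE clause is an assumption about how we would pick out that year's rows, and no dynamic partition settings are needed in this form:
INSERT OVERWRITE TABLE partufo PARTITION (year='1989')
SELECT sighted, reported, sighting_location, shape, duration, description
FROM ufodata
WHERE sighted LIKE '1989%' ;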
This hopefully explains the strange final clause in the SELECT statement; after specifying all the standard columns from ufodata, we add an expression that extracts a string containing the first four characters of the sighted column. Because the partitioned table sees the year partition column as its seventh column, this means we are assigning the year component of the sighted string to the year column in each row. Consequently, each row is inserted into the partition associated with its sighting year.
To prove this is working as expected, we then perform two queries; one counts all records in the 1989 partition of the partitioned table, the other counts the records in ufodata that begin with the string "1989", that is, the component used to dynamically create the partitions previously.
As can be seen, both queries return the same result, verifying that our partitioning strategy is working as expected. We also note that the partitioned query is a little faster than the other, though not by very much. This is likely due to MapReduce start-up times dominating the processing of our relatively modest data set.
Finally, we take a look inside the directory where Hive stores the data for the partitioned table and see that there is indeed a directory for each of the 100 dynamically generated partitions. Any time we now express HiveQL statements that refer to specific partitions, Hive can perform a significant optimization by processing only the data found in the appropriate partitions' directories.
Bucketing, clustering, and sorting... oh my!
We will not explore it in detail here, but hierarchical partition columns are not the full extent of how Hive can optimize data access patterns within subsets of data. Within a partition, Hive provides a mechanism to further gather rows into buckets, using a hash function on specified CLUSTER BY columns. Within a bucket, the rows can be kept in sorted order using specified SORT BY columns. We could, for example, have bucketed our data based on the UFO shape and, within each bucket, sorted on the sighting date.
These aren't necessarily features you'll need to use on day one with Hive, but if you find yourself using larger and larger data sets, then considering this type of optimization may help query processing time significantly.
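A sketch of what such a bucketed table definition might look like follows; the table name and the bucket count of 32 are arbitrary choices for illustration, and note that in CREATE TABLE syntax the keywords are CLUSTERED BY and SORTED BY:
CREATE TABLE bucketufo(sighted string, reported string, sighting_location string,
shape string, duration string, description string)
PARTITIONED BY (year string)
CLUSTERED BY (shape) SORTED BY (sighted) INTO 32 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' ;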
User-Dened Function
Hive provides mechanisms for you to hook custom code directly into the HiveQL execuon.
This can be in the form of adding new library funcons or by specifying Hive transforms,
which work quite similarly to Hadoop Streaming. We will look at user-dened funcons in
this secon as they are where you are most likely to have an early need to add custom code.
Hive transforms are a somewhat more involved mechanism by which you can add custom
map and reduce classes that are invoked by the Hive runme. If transforms are of interest,
they are well documented on the Hive wiki.
Time for action – adding a new User Defined Function (UDF)
Let us show how to create and invoke some custom Java code via a new UDF.
1. Save the following code as City.java:
package com.kycorsystems ;

import java.util.regex.Matcher ;
import java.util.regex.Pattern ;

import org.apache.hadoop.hive.ql.exec.UDF ;
import org.apache.hadoop.io.Text ;

public class City extends UDF
{
    // Matches a city name (one or two words) up to and including the
    // separator that precedes the state abbreviation
    private static Pattern pattern = Pattern.compile(
            "[a-zA-Z]+?[\\. ]*[a-zA-Z]+?[\\, ][^a-zA-Z]") ;

    public Text evaluate( final Text str)
    {
        Text result ;
        String location = str.toString().trim() ;
        Matcher matcher = pattern.matcher(location) ;
        if (matcher.find())
        {
            // Drop the final two characters of the match (the separator),
            // leaving just the city name
            result = new Text( location.substring(matcher.start(), matcher.end()-2)) ;
        }
        else
        {
            result = new Text("Unknown") ;
        }
        return result ;
    }
}
2. Compile this file:
$ javac -cp hive/lib/hive-exec-0.8.1.jar:hadoop/hadoop-1.0.4-core.
jar -d . City.java
3. Package the generated class file into a JAR file:
$ jar cvf city.jar com
You will receive the following response:
added manifest
adding: com/(in = 0) (out= 0)(stored 0%)
adding: com/kycorsystems/(in = 0) (out= 0)(stored 0%)
adding: com/kycorsystems/City.class(in = 1101) (out= 647)(deflated
41%)
4. Start the interacve Hive shell:
$ hive
5. Add the new JAR file to the Hive classpath:
hive> add jar city.jar;
You will receive the following response:
Added city.jar to class path
Added resource: city.jar
6. Conrm that the JAR le was added:
hive> list jars;
You will receive the following response:
file:/opt/hive-0.8.1/lib/hive-builtins-0.8.1.jar
city.jar
7. Register the new code with a function name:
hive> create temporary function city as 'com.kycorsystems.City' ;
You will receive the following response:
OK
Time taken: 0.277 seconds
8. Execute a query using the new function:
hive> select city(sighting_location), count(*) as total
> from partufo
> where year = '1999'
> group by city(sighting_location)
> having total > 15 ;
You will receive the following response:
Total MapReduce jobs = 1
Launching Job 1 out of 1
OK
Chicago 19
Las Vegas 19
Phoenix 19
Portland 17
San Diego 18
Seattle 26
Unknown 34
Time taken: 29.055 seconds
What just happened?
The Java class we wrote extends the base org.apache.hadoop.hive.ql.exec.UDF (User Defined Function) class. In this class, we define a method for returning a city name given a location string that follows the general pattern we have seen previously.
UDF does not actually define a set of evaluate methods based on type; instead, you are free to add your own with arbitrary arguments and return types. Hive uses Java reflection to select the correct evaluate method, and if you require finer-grained selection, you can develop your own utility class that implements the UDFMethodResolver interface.
The regular expression used here is a little unwieldy; we wish to extract the name of the city, assuming it will be followed by a state abbreviation. However, inconsistency in how the names are delineated, and the handling of multi-word names, gives us the regular expression seen before. Apart from this, the class is pretty straightforward.
We compile the City.java file, adding the necessary JARs from both Hive and Hadoop as we do so.
Remember, of course, that the specific JAR filenames may be different if you are not using the same versions of both Hadoop and Hive.
We then bundle the generated class file into a JAR and start the Hive interactive shell. After creating the JAR, we need to configure Hive to use it. This is a two-step process. Firstly, we use the add jar command to add the new JAR file to the classpath used by Hive. After doing so, we use the list jars command to confirm that our new JAR has been registered in the system.
Adding the JAR only tells Hive that some code exists; it does not say how we wish to refer to the function within our HiveQL statements. The CREATE TEMPORARY FUNCTION command does this, associating a function name (in this case, city) with the fully qualified Java class that provides the implementation (in this case, com.kycorsystems.City).
With both the JAR file added to the classpath and the function created, we can now refer to our city() function within our HiveQL statements.
We next ran an example query that demonstrates the new function in action. Going back to the partitioned UFO sightings table, we thought it would be interesting to see where the most UFO sightings were occurring as everyone prepared for the end-of-millennium apocalypse.
As can be seen from the HiveQL statement, we can use our new function just like any other; indeed, the only way to know which functions are built-in and which are UDFs is through familiarity with the standard Hive function library.
The result shows a significant concentration of sightings in the north-west and south-west of the USA, Chicago being the only exception. We did, however, get quite a few Unknown results, and it would require further analysis to determine whether that was due to locations outside the U.S. or whether we need to further refine our regular expression.
To preprocess or not to preprocess...
Let us revisit an earlier topic: the potential need to pre-process data into a cleaner form before it is imported into Hive. As can be seen from the preceding example, we could perform similar processing on the fly through a series of UDFs. We could, for example, add functions called state and country that extract or infer the region and nation components from the sighting location string. There are rarely concrete rules for which approach is best, but a few guidelines may help.
If, as is the case here, we are unlikely to process the full location string for reasons other than to extract its distinct components, then preprocessing likely makes more sense. Instead of performing expensive text processing every time the column is accessed, we could either normalize it into a more predictable format or even break it out into separate city/region/country columns.
If, however, a column is usually used in HiveQL in its original form and additional processing is the exceptional case, then there is likely little benefit to an expensive processing step across the entire data set.
Use the strategy that makes the most sense for your data and workloads. Remember that UDFs are for much more than this sort of text processing; they can be used to encapsulate any type of logic that you wish to apply to the data in your tables.
Hive versus Pig
Search the Internet for articles about Hive and it won't be long before you find many comparing Hive to another Apache project called Pig. Some of the most common questions around this comparison are why both exist, when to use one over the other, which is better, and which makes you look cooler when wearing the project t-shirt in a bar.
The two projects overlap in what they do; the difference is that whereas Hive looks to present a familiar SQL-like interface to data, Pig uses a language called Pig Latin that specifies dataflow pipelines. Just as Hive translates HiveQL into MapReduce, which it then executes, Pig performs similar MapReduce code generation from Pig Latin scripts.
The biggest difference between HiveQL and Pig Latin is the amount of control expressed over how the job will be executed. HiveQL, just like SQL, specifies what is to be done but says almost nothing about how to actually structure the implementation. The HiveQL query planner is responsible for determining in which order to perform particular parts of the HiveQL command, in which order to evaluate functions, and so on. These decisions are made by Hive at runtime, analogous to a traditional relational database query planner, and this is also the level at which Pig Latin operates.
Both approaches obviate the need to write raw MapReduce code; they differ in the abstractions they provide.
The choice of Hive versus Pig will depend on your needs. If having a familiar SQL interface to the data is important as a means of making the data in Hadoop available to a wider audience, then Hive is the obvious choice. If instead you have personnel who think in terms of data pipelines and need finer-grained control over how the jobs are executed, then Pig may be a better fit. The Hive and Pig projects are looking at closer integration, so hopefully the false sense of competition will decrease and both will instead be seen as complementary ways of decreasing the Hadoop knowledge required to execute MapReduce jobs.
What we didn't cover
In this overview of Hive, we have covered its installation and setup, and the creation and manipulation of tables, views, and joins. We have looked at how to move data into and out of Hive, how to optimize data processing, and explored several of Hive's built-in functions.
In reality, we have barely scratched the surface. In addition to more depth on the previous topics and a variety of related concepts, we didn't even touch on topics such as the MetaStore, where Hive stores its configuration and metadata, or SerDe (serialize/deserialize) objects, which can be used to read data from more complex file formats such as JSON.
Hive is an incredibly rich tool with many powerful and complex features. If Hive is something that you feel may be of value to you, then it is recommended that, after running through the examples in this chapter, you spend some quality time with the documentation on the Hive website. There you will also find links to the user mailing list, which is a great source of information and help.
Hive on Amazon Web Services
Elastic MapReduce has significant support for Hive, with some specific mechanisms to help its integration with other AWS services.
Time for action – running UFO analysis on EMR
Let us explore the use of EMR with Hive by doing some UFO analysis on the platform.
1. Log in to the AWS management console at http://aws.amazon.com/console.
2. Every Hive job flow on EMR runs from an S3 bucket, and we need to select the bucket we wish to use for this purpose. Select S3 to see the list of the buckets associated with your account and then choose the bucket from which to run the example; in the example below, we select the bucket called garryt1use.
3. Use the web interface to create three directories called ufodata, ufoout, and ufologs within that bucket. The resulting list of the bucket's contents should look like the following screenshot:
4. Double-click on the ufodata directory to open it and within it create two
subdirectories called ufo and states.
5. Create the following as s3test.hql, click on the Upload link within the ufodata directory, and follow the prompts to upload the file:
CREATE EXTERNAL TABLE IF NOT EXISTS ufodata(sighted string,
reported string, sighting_location string,
shape string, duration string, description string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '${INPUT}/ufo' ;
CREATE EXTERNAL TABLE IF NOT EXISTS states(abbreviation string,
full_name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '${INPUT}/states' ;
CREATE VIEW IF NOT EXISTS usa_sightings (sighted, reported, shape,
state)
AS SELECT t1.sighted, t1.reported, t1.shape, t2.full_name
FROM ufodata t1 JOIN states t2
ON (LOWER(t2.abbreviation) = LOWER(SUBSTR( t1.sighting_location,
(LENGTH(t1.sighting_location)-1)))) ;
CREATE EXTERNAL TABLE IF NOT EXISTS state_results ( reported
string, shape string, state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '${OUTPUT}/states' ;
INSERT OVERWRITE TABLE state_results
SELECT reported, shape, state
FROM usa_sightings
WHERE state = 'California' ;
The contents of ufodata should now look like the following screenshot:
6. Double-click the states directory to open it and into this upload the states.txt file used earlier. The directory should now look like the following screenshot:
7. Click on the ufodata component at the top of the file list to return to this directory.
8. Double-click on the ufo directory to open it and into this upload the ufo.tsv file used earlier. The directory should now look like the following screenshot:
9. Now select Elastic MapReduce and click on Create a New Job Flow. Then select the option Run your own application and select a Hive application, as shown in the following screenshot:
10. Click on Connue and then ll in the required details for the Hive job ow. Use the
following screenshot as a guide, but remember to change the bucket name (the rst
component in the s3:// URLs) to the bucket you set up before:
11. Click on Connue, review the number and the type of hosts to be used, and then
click on Connue once again. Then ll in the name of the directory for the logs, as
shown in the following screenshot:
12. Click on Connue. Then do the same through the rest of the job creaon process as
there are no other default opons that need to be changed for this example. Finally
start the job ow and monitor its progress from the management console.
13. Once the job has completed successfully, go back to S3 and double-click on the
ufoout directory. Within that should be a directory called states and within that,
a le named something like 0000000. Double-click to download the le and verify
that its contents look something like the following:
20021014 light California
20050224 other California
20021001 egg California
20030527 sphere California
What just happened?
Before we could actually execute our EMR job flow, we needed to do a bit of setup in the preceding example. Firstly, we used the S3 web interface to prepare the directory structure for our job. We created three main directories: one to hold the input data, one into which to write results, and one for EMR to place logs of the job flow execution.
The HiveQL script is a modification of several of the Hive commands used earlier in this chapter. It creates the tables for the UFO sighting data and state names as well as the view joining them. Then it creates a new table with no source data and uses an INSERT OVERWRITE TABLE to populate the table with the results of a query.
The unique feature in this script is the way we specify the LOCATION clauses for each of the tables. For the input tables, we use a path relative to a variable called INPUT and do likewise with the OUTPUT variable for the result table.
Note that Hive in EMR expects the location of table data to be a directory and not a file. This is the reason we previously created subdirectories for each table into which we uploaded the specific source file, instead of specifying the table with the direct path to the data files themselves.
After setting up the required file and directory structure within our S3 bucket, we went to the EMR web console and started the job flow creation process.
After specifying that we wish to use our own program and that it would be a Hive application, we filled in a screen with the key data required for our job flow:
The location of the HiveQL script itself
The directory containing input data
The directory to be used for output data
The path to the HiveQL script is an explicit path and does not require any explanation. However, it is important to realize how the other values are mapped into the variables used within our Hive script.
The value for the input path is available to the Hive script as the INPUT variable, and this is how we then specify the directory containing the UFO sighting data as ${INPUT}/ufo. Similarly, the output value specified in this form will be used as the OUTPUT variable within our Hive script.
We did not make any changes to the default host setup, which will be one small master and two small core nodes. On the next screen, we added the location into which we wanted EMR to write the logs produced by the job flow execution.
Though optional, it is useful to capture these logs, particularly in the early stages of running a new script, though obviously S3 storage does have a cost. EMR can also write indexed log data into SimpleDB (another AWS service), but we did not show that in action here.
After completing the job flow definition, we started it and, on successful execution, went to the S3 interface to browse to the output location, which happily contained the data we were expecting.
Using interactive job flows for development
When developing a new Hive script to be executed on EMR, the previous batch job execution model is not a good fit. There is usually a latency of several minutes between job flow creation and execution, and if the job fails, then the cost of several hours of EC2 instance time will have been incurred (partial hours are rounded up).
Instead of selecting the option to create an EMR job flow to run a Hive script, as in the previous example, we can start a Hive job flow in interactive mode. This effectively spins up a Hadoop cluster without requiring a named script. You can then SSH into the master node as the Hadoop user, where you will find Hive installed and configured. It is much more efficient to do the script development in this environment and then, if required, set up batch script job flows to automatically execute the script in production.
Have a go hero – using an interactive EMR cluster
Start up an interactive Hive job flow in EMR. You will need to have SSH credentials already registered with EC2 so that you can connect to the master node. Run the previous script directly from the master node, remembering to pass the appropriate variables to the script.
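As a hint, Hive's -d option can supply the variable definitions when invoking the script; the sketch below assumes the garryt1use bucket from the earlier example and that s3test.hql has been copied onto the master node, and the exact option behaviour should be checked against your Hive version:
$ hive -d INPUT=s3://garryt1use/ufodata -d OUTPUT=s3://garryt1use/ufoout -f s3test.hql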
Integration with other AWS products
With a local Hadoop/Hive installation, the question of where data lives usually comes down to HDFS or local filesystems. As we have seen previously, Hive within EMR gives another option with its support for external tables whose data resides in S3.
Another AWS service with similar support is DynamoDB (at http://aws.amazon.com/dynamodb), a hosted NoSQL database solution in the cloud. Hive job flows within EMR can declare external tables that either read data from DynamoDB or use it as the destination for query output.
This is a very powerful model as it allows Hive to be used to process and combine data from multiple sources while the mechanics of mapping data from one system into Hive tables happen transparently. It also allows Hive to be used as a mechanism for moving data from one system to another. Getting data into such hosted services from existing stores is frequently a major adoption hurdle.
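To give a feel for the shape such a definition takes, the following sketch declares an external Hive table backed by a hypothetical DynamoDB table named Sightings; the storage handler class and table properties shown are those documented by AWS for EMR and should be checked against the current EMR documentation before use:
CREATE EXTERNAL TABLE dynamo_sightings(sighted string, shape string, state string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Sightings",
"dynamodb.column.mapping" = "sighted:sighted,shape:shape,state:state") ;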
Summary
We have looked at Hive in this chapter and learned how it provides many tools and features that will be familiar to anyone who uses relational databases. Instead of requiring the development of MapReduce applications, Hive makes the power of Hadoop available to a much broader community.
In particular, we downloaded and installed Hive, learning that it is a client application that translates its HiveQL language into MapReduce code, which it submits to a Hadoop cluster. We explored Hive's mechanism for creating tables and running queries against these tables. We saw how Hive can support various underlying data file formats and structures, and how to modify those options.
We also appreciated that Hive tables are largely a logical construct and that, behind the scenes, all the SQL-like operations on tables are in fact executed by MapReduce jobs on HDFS files. We then saw how Hive supports powerful features such as joins and views, and how to partition our tables to aid in efficient query execution.
We used Hive to output the results of a query to files on HDFS and saw how Hive is supported by Elastic MapReduce, where interactive job flows can be used to develop new Hive applications that can then be run automatically in batch mode.
As we have mentioned several times in this book, Hive looks like a relational database but is not really one. However, in many cases you will find existing relational databases are part of the broader infrastructure into which you need to integrate. Performing that integration, and how to move data across these different types of data sources, will be the topic of the next chapter.
9
Working with Relational Databases
As we saw in the previous chapter, Hive is a great tool that provides a relational
database-like view of the data stored in Hadoop. However, at the end of the
day, it is not truly a relational database. It does not fully implement the SQL
standard, and its performance and scale characteristics are vastly different
(not better or worse, just different) from a traditional relational database.
In many cases, you will find a Hadoop cluster sitting alongside and used with
(not instead of) relational databases. Often the business flows will require data
to be moved from one store to the other; we will now explore such integration.
In this chapter, we will:
Idenfy some common Hadoop/RDBMS use cases
Explore how we can move data from RDBMS into HDFS and Hive
Use Sqoop as a beer soluon for such problems
Move data with exports from Hadoop into an RDBMS
Wrap up with a discussion of how this can be applied to AWS
Common data paths
Back in Chapter 1, What It's All About, we touched on what we believe to be an artificial choice that causes a lot of controversy: whether to use Hadoop or a traditional relational database. As explained there, it is our contention that the thing to focus on is identifying the right tool for the task at hand, and that this is likely to lead to a situation where more than one technology is employed. It is worth looking at a few concrete examples to illustrate this idea.
Hadoop as an archive store
When an RDBMS is used as the main data repository, issues of scale and data retention often arise. As volumes of new data increase, what is to be done with the older and less valuable data?
Traditionally, there are two main approaches to this situation:
Partition the RDBMS to allow higher performance access to more recent data; sometimes the technology allows older data to be stored on slower and less expensive storage systems
Archive the data onto tape or another offline store
Both approaches are valid, and the decision between the two often rests on whether or not the older data is required for timely access. These are two extreme cases: the former maximizes for access at the cost of complexity and infrastructure expense, while the latter reduces costs but makes data less accessible.
The model being seen recently is for the most current data to be kept in the relational database and for the older data to be pushed into Hadoop. This can be either onto HDFS as structured files or into Hive to retain the RDBMS interface. This gives the best of both worlds, allowing the lower-volume, more recent data to be accessible by high-speed, low-latency SQL queries, while the much larger volume of archived data is accessed from Hadoop. The data therefore remains available for use cases requiring either type of access, though additional integration will be needed for any queries that have to span both the recent and the archived data.
Because of Hadoop's scalability, this model gives great future growth potential; we know we can continue to increase the amount of archive data being stored while retaining the ability to run analytics against it.
Hadoop as a preprocessing step
Several times in our Hive discussion, we highlighted opportunities where some preprocessing jobs to massage or otherwise clean up the data would be hugely useful. The unfortunate fact is that, in many (most?) big data situations, the large volumes of data coming from multiple sources mean that dirty data is simply a given. Although most MapReduce jobs only require a subset of the overall data to be processed, we should still expect to find incomplete or corrupt data across the data set. Just as Hive can benefit from preprocessing data, a traditional relational database can as well.
Hadoop can be a great tool here; it can pull data from multiple sources, combine them for necessary transformations, and clean it up prior to the data being inserted into the relational database.
Hadoop as a data input tool
Hadoop is not just valuable as a way of making data better and more suitable for ingestion into a relational database. In addition to such tasks, Hadoop can also be used to generate additional data sets or data views that are then served from the relational database.
A common pattern here is when we wish to display not only the primary data for an account but also, alongside it, secondary data generated from the account history. Such views could be summaries of transactions against types of expenditure for the previous months. The history data is held within Hadoop, which generates the actual summaries that can then be pushed back into the database for quicker display.
The serpent eats its own tail
Reality is often more complex than these well-defined situations, and it's not uncommon for the data flow between Hadoop and the relational database to be described by circles and arcs instead of a single straight line. The Hadoop cluster may, for example, do the preprocessing step on data that is then ingested into the RDBMS, and then receive frequent transaction dumps that are used to build aggregates, which are sent back to the database. Then, once the data gets older than a certain threshold, it is deleted from the database but kept in Hadoop for archival purposes.
Regardless of the situation, the ability to get data from Hadoop to a relational database and back again is a critical aspect of integrating Hadoop into your IT infrastructure. So, let's see how to do it.
Setting up MySQL
Before reading and writing data from a relational database, we need a running relational database. We will use MySQL in this chapter because it is freely and widely available and many developers have used it at some point in their career. You can of course use any RDBMS for which a JDBC driver is available, but if you do so, you'll need to modify the aspects of this chapter that require direct interaction with the database server.
Time for action – installing and setting up MySQL
Let's get MySQL installed and configured with the basic databases and access rights.
1. On an Ubuntu host, install MySQL using apt-get:
$ apt-get update
$ apt-get install mysql-server
2. Follow the prompts, and when asked, choose a suitable root password.
3. Once installed, connect to the MySQL server:
$ mysql -h localhost -u root -p
4. Enter the root password when prompted:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 40
mysql>
5. Create a new database to use for the examples in this chapter:
mysql> create database hadooptest;
You will receive the following response:
Query OK, 1 row affected (0.00 sec)
6. Create a user account with full access to the database:
mysql> grant all on hadooptest.* to 'hadoopuser'@'%' identified
by 'password';
You will receive the following response:
Query OK, 0 rows affected (0.01 sec)
7. Reload the user privileges to have the user changes take effect:
mysql> flush privileges;
You will receive the following response:
Query OK, 0 rows affected (0.01 sec)
8. Log out as root:
mysql> quit;
You will receive the following response:
Bye
9. Log in as the newly created user, entering the password when prompted:
$ mysql -u hadoopuser -p
10. Change to the newly created database:
mysql> use hadooptest;
11. Create a test table, drop it to confirm the user has the privileges in this database,
and then log out:
mysql> create table tabletest(id int);
mysql> drop table tabletest;
mysql> quit;
What just happened?
Due to the wonders of package managers such as apt, installing complex software such
as MySQL is really very easy. We just use the standard process to install a package; under
Ubuntu (and most other distributions in fact), requesting the main server package for MySQL
will bring along all needed dependencies as well as the client packages.
During the install, you will be prompted for the root password on the database. Even if this is
a test database instance that no one will use and that will have no valuable data, please give
the root user a strong password. Having weak root passwords is a bad habit, and we do not
want to encourage it.
Aer MySQL is installed, we connect to the database using the mysql command-line ulity.
This takes a range of opons, but the ones we will use are as follows:
-h: This opon is used to specify the hostname of the database (the local machine is
assumed if none is given)
-u: This opon is used for the username with which to connect (the default is the
current Linux user)
-p: This opon is used to be prompted for the user password
MySQL has the concept of multiple databases, each of which is a collective grouping
of tables. Every table needs to be associated with a database. MySQL has several built-in
databases, but we use the CREATE DATABASE statement to create a new one called
hadooptest for our later work.
MySQL refuses connections/requests to perform actions unless the requesting user has
explicitly been given the needed privileges to perform the action. We do not want to do
everything as the root user (a bad practice and quite dangerous, since root can modify or
delete everything), so we create a new user called hadoopuser by using the GRANT statement.
The GRANT statement we used actually does three distinct things:
Creates the hadoopuser account
Sets the hadoopuser password; we set it to password, which you should obviously
never do; pick something stronger that you can still remember
Gives hadoopuser all privileges on the hadooptest database and all its tables
We issue the FLUSH PRIVILEGES command to have these changes take effect, and then we
log out as root and connect as the new user to check whether all is working.
The USE statement here is a little superfluous. In future, we can instead add the database
name to the mysql command-line tool to automatically change to that database.
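For example, the following connects and selects the hadooptest database in a single step:
$ mysql -u hadoopuser -p hadooptest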
Connecng as the new user is a good sign, but to gain full condence, we create a new table
in the hadooptest database and then drop it. Success here shows that hadoopuser does
indeed have the requested privileges to modify the database.
Did it have to be so hard?
We are perhaps being a little cautious here by checking every step of the process along
the way. However, I have found in the past that subtle typos, in the GRANT statement in
particular, can result in really hard-to-diagnose problems later on. And to continue our
paranoia, let's make one change to the default MySQL configuration that we won't need
quite yet, but which, if we don't do it, we'll be sorry about later.
For any producon database, you would of course not have security-sensive statements,
such as GRANT, present that were typed in from a book. Refer to the documentaon of your
database to understand user accounts and privileges.
Time for action – conguring MySQL to allow remote
connections
We need to change the common default MySQL behavior, which will prevent us from
accessing the database from other hosts.
1. Edit /etc/mysql/my.cnf in your favorite text editor and look for this line:
bind-address = 127.0.0.1
2. Change it to this:
# bind-address = 127.0.0.1
3. Restart MySQL:
$ restart mysql
What just happened?
Most out-of-the-box MySQL configurations allow access only from the same host on which
the server is running. This is absolutely the correct default from a security standpoint.
However, it can also cause real confusion if, for example, you launch MapReduce jobs that try
to access the database on that host. You may see the job fail with connection errors. If that
happens, you fire up the mysql command-line client on the host; this will succeed. Then,
perhaps, you will write a quick JDBC client to test connectivity. This will also work. Only when
you try these steps from one of the Hadoop worker nodes will the problem be apparent. Yes,
this has bitten me several times in the past!
The previous change tells MySQL to bind to all available interfaces and thus be accessible
from remote clients.
Aer making the change, we need to restart the server. In Ubuntu 11.10, many of the service
scripts have been ported to the Upstart framework, and we can use the handy restart
command directly.
If you are using a distribuon other than Ubuntu—or potenally even a dierent version of
Ubuntu—the global MySQL conguraon le may be in a dierent locaon; /etc/my.cnf,
for example, on CentOS and Red Hat Enterprise Linux.
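A quick way to confirm that the change has taken effect is to connect from one of the Hadoop worker nodes rather than from the database host itself (assuming 10.0.0.100 is the database server, as in the Sqoop examples later in this chapter):
$ mysql -h 10.0.0.100 -u hadoopuser -p hadooptest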
Don't do this in production!
Or at least not without thinking about the consequences. In the earlier example, we gave a
really bad password to the new user; do not do that. However, especially don't do something
like that if you then make the database available across the network. Yes, it is a test database
with no valuable data, but it is amazing how many test databases live for a very long time
and start getting more and more critical. And will you remember to remove that user with
the weak password after you are done?
Enough lecturing. Databases need data. Let's add a table to the hadooptest database that
we'll use throughout this chapter.
Time for action – setting up the employee database
No discussion of databases is complete without the example of an employee table, so we will
follow tradition and start there.
1. Create a tab-separated file named employees.tsv with the following entries:
Alice Engineering 50000 2009-03-12
Bob Sales 35000 2011-10-01
Camille Marketing 40000 2003-04-20
David Executive 75000 2001-03-20
Erica Support 34000 2011-07-07
2. Connect to the MySQL server:
$ mysql -u hadoopuser -p hadooptest
3. Create the table:
mysql> create table employees(
first_name varchar(10) primary key,
dept varchar(15),
salary int,
start_date date
) ;
4. Load the data from the le into the database:
mysql> load data local infile '/home/garry/employees.tsv'
-> into table employees
-> fields terminated by '\t' lines terminated by '\n' ;
What just happened?
This is prey standard database stu. We created a tab-separated data le, created the table
in the database, and then used the LOAD DATA LOCAL INFILE statement to import the
data into the table.
We are using a very small set of data here as it is really for illustraon purposes only.
Be careful with data file access rights
Don't omit the LOCAL part from the LOAD DATA statement; doing so causes MySQL to try to
load the file as the MySQL server user, which usually results in access problems.
Getting data into Hadoop
Now that we have put in all that up-front effort, let us look at ways of bringing the data out
of MySQL and into Hadoop.
Using MySQL tools and manual import
The simplest way to export data into Hadoop is to use existing command-line tools and
statements. To export an entire table (or indeed an entire database), MySQL offers the
mysqldump utility. To do a more precise export, we can use a SELECT statement of the
following form:
SELECT col1, col2 FROM table
INTO OUTFILE '/tmp/out.csv'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
Once we have an export file, we can move it into HDFS using hadoop fs -put or into
Hive through the methods discussed in the previous chapter.
Have a go hero – exporting the employee table into HDFS
We don't want this chapter to turn into a MySQL tutorial, so look up the syntax of the
mysqldump utility, and use it or the SELECT … INTO OUTFILE statement to export
the employee table into a tab-separated file that you then copy onto HDFS.
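One possible starting point is sketched below; note that INTO OUTFILE writes to a path on the database server as the MySQL server user, so the target path (purely illustrative here) must be writable by that user:
mysql> select first_name, dept, salary, start_date from employees
    -> into outfile '/tmp/employees_export.tsv'
    -> fields terminated by '\t' lines terminated by '\n';
$ hadoop fs -put /tmp/employees_export.tsv employees_export.tsv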
Accessing the database from the mapper
For our trivial example, the preceding approaches are fine, but what if you need to export
a much larger set of data, especially if it then is to be processed by a MapReduce job?
The obvious approach is that of direct JDBC access within a MapReduce job that pulls
the data from the database and writes it onto HDFS, ready for additional processing.
This is a valid technique, but there are a few not-so-obvious gotchas.
You need to be careful how much load you place on the database. Throwing this sort of job
onto a very large cluster could very quickly melt the database as hundreds or thousands
of mappers try to simultaneously open connections and read the same table. The simplest
access pattern is also likely to issue one query per row, which rules out the use of more
efficient bulk access statements. Even if the database can take the load, it is quite possible
for the database network connection to quickly become the bottleneck.
To eecvely parallelize the query across all the mappers, you need a strategy to paron
the table into segments each mapper will retrieve. You then need to determine how each
mapper is to have its segment parameters passed in.
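As a sketch of the idea (the numeric id column here is hypothetical; our sample table has no such column), each mapper might issue a bounded query of the following form, with the range boundaries passed in through the job configuration:
SELECT first_name, dept, salary, start_date FROM employees
WHERE id >= 1 AND id < 250000;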
If the retrieved segments are large, there is a chance that you will end up with long-running
tasks that get terminated by the Hadoop framework unless you explicitly report progress.
That is actually quite a lot of work for a conceptually simple task. Wouldn't it be much
better to use an existing tool for the purpose? There is indeed such a tool that we will
use throughout the rest of this chapter: Sqoop.
A better way – introducing Sqoop
Sqoop was created by Cloudera (http://www.cloudera.com), a company that provides
numerous services related to Hadoop in addition to producing its own packaging of the
Hadoop distribution, something we will discuss in Chapter 11, Where to Go Next.
As well as providing this packaged Hadoop product, the company has also created a number
of tools that have been made available to the community, and one of these is Sqoop. Its
job is to do exactly what we need: to copy data between Hadoop and relational databases.
Though originally developed by Cloudera, it has been contributed to the Apache Software
Foundation, and its homepage is http://sqoop.apache.org.
Time for action – downloading and configuring Sqoop
Let's download and get Sqoop installed and configured.
1. Go to the Sqoop homepage, select the link for the most stable version that is
no earlier than 1.4.1, and match it with the version of Hadoop you are using.
Download the file.
2. Copy the retrieved file to where you want it installed on your system; then uncompress
it:
$ mv sqoop-1.4.1-incubating__hadoop-1.0.0.tar.gz /usr/local
$ cd /usr/local
$ tar -xzf sqoop-1.4.1-incubating__hadoop-1.0.0.tar.gz
3. Make a symlink:
$ ln -s sqoop-1.4.1-incubating__hadoop-1.0.0 sqoop
4. Update your environment:
$ export SQOOP_HOME=/usr/local/sqoop
$ export PATH=${SQOOP_HOME}/bin:${PATH}
5. Download the JDBC driver for your database; for MySQL, we find it at http://dev.
mysql.com/downloads/connector/j/5.0.html.
6. Copy the downloaded JAR file into the Sqoop lib directory:
$ cp mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop/lib
7. Test Sqoop:
$ sqoop help
You will see the following output:
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database
records
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
What just happened?
Sqoop is a prey straighorward tool to install. Aer downloading the required version from
the Sqoop homepage—being careful to pick the one that matches our Hadoop version—we
copied and unpacked the le.
Once again, we needed to set an environment variable and added the Sqoop bin directory
to our path so we can either set these directly in our shell, or as before, add these steps to a
conguraon le we can source prior to a development session.
Sqoop needs access to the JDBC driver for your database; for us, we downloaded the MySQL
Connector and copied it into the Sqoop lib directory. For the most popular databases, this
is as much configuration as Sqoop requires; if you want to use something exotic, consult the
Sqoop documentation.
After this minimal install, we executed the sqoop command-line utility to validate that it is
working properly.
You may see warning messages from Sqoop telling you that additional
variables such as HBASE_HOME have not been defined. As we are not
talking about HBase in this book, we do not need this setting and will
be omitting such warnings from our screenshots.
Sqoop and Hadoop versions
We were very specic in the version of Sqoop to be retrieved before; much more so than for
previous soware downloads. In Sqoop versions prior to 1.4.1, there is a dependency on an
addional method on one of the core Hadoop classes that was only available in the Cloudera
Hadoop distribuon or versions of Hadoop aer 0.21.
Unfortunately, the fact that Hadoop 1.0 is eecvely a connuaon of the 0.20 branch
meant that Sqoop 1.3, for example, would work with Hadoop 0.21 but not 0.20 or 1.0.
To avoid this version confusion, we recommend using version 1.4.1 or later, which removes
the dependency.
There is no addional MySQL conguraon required; we would discover if the server had not
been congured to allow remote clients, as described earlier, through use of Sqoop.
Sqoop and HDFS
The simplest import we can perform is to dump data from a database table onto structured
les on HDFS. Let's do that.
Time for action – exporting data from MySQL to HDFS
We'll use a straightforward example here, where we just pull all the data from a single
MySQL table and write it to a single file on HDFS.
1. Run Sqoop to export data from MySQL onto HDFS:
$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser --password password --table employees
2. Examine the output directory:
$ hadoop fs -ls employees
You will receive the following response:
Found 6 items
-rw-r--r-- 3 hadoop supergroup 0 2012-05-21 04:10 /
user/hadoop/employees/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2012-05-21 04:10 /
user/hadoop/employees/_logs
-rw-r--r-- 3 … /user/hadoop/employees/part-m-00000
-rw-r--r-- 3 … /user/hadoop/employees/part-m-00001
-rw-r--r-- 3 … /user/hadoop/employees/part-m-00002
-rw-r--r-- 3 … /user/hadoop/employees/part-m-00003
3. Display one of the result les:
$ hadoop fs -cat /user/hadoop/employees/part-m-00001
You will see the following output:
Bob,Sales,35000,2011-10-01
Camille,Marketing,40000,2003-04-20
What just happened?
We did not need any preamble; a single Sqoop statement is all we require here. As can be
seen, the Sqoop command line takes many options; let's unpack them one at a time.
The first option in Sqoop is the type of task to be performed; in this case, we wish to import
data from a relational source into Hadoop. The --connect option specifies the JDBC URI for
the database, of the standard form jdbc:<driver>://<host>/<database>. Obviously,
you need to change the IP or hostname to the server where your database is running.
We use the --username and --password options to specify those attributes and finally
use --table to indicate from which table we wish to retrieve the data. That is it! Sqoop
does the rest.
The Sqoop output is relatively verbose, but do read it as it gives a good idea of exactly what
is happening.
Repeated execuons of Sqoop may however include a nested error about
a generated le already exisng. Ignore that for now.
Firstly, in the preceding steps, we see Sqoop telling us not to use the --password option
as it is inherently insecure. Sqoop has an alternative -P option, which prompts for the
password; we will use that in future examples.
We also get a warning about using a textual primary key column and that it's a very bad
idea; more on that in a little while.
After all the setup and warnings, however, we see Sqoop execute a MapReduce job and
complete it successfully.
By default, Sqoop places the output files into a directory in the home directory of the
user who ran the job. The files will be in a directory of the same name as the source table.
To verify this, we used hadoop fs -ls to check this directory and confirmed that it
contained several files, likely more than we would have expected, given such a small
table. Note that we slightly abbreviated the output here to allow it to fit on one line.
We then examined one of the output files and discovered the reason for the multiple
files; even though the table is tiny, it was still split across multiple mappers, and hence,
output files. Sqoop uses four map tasks by default. It may look a little strange in this case,
but the usual situation will be a much larger data import. Given the desire to copy data onto
HDFS, this data is likely to be the source of a future MapReduce job, so multiple files make
perfect sense.
Mappers and primary key columns
We intenonally set up this situaon by somewhat arcially using a textual primary key
column in our employee data set. In reality, the primary key would much more likely be
an auto-incremenng, numeric employee ID. However, this choice highlighted the nature
of how Sqoop processes tables and its use of primary keys.
Sqoop uses the primary key column to determine how to divide the source data across
its mappers. But, as the warnings before state, this means we are reliant on string-based
comparisons, and in an environment with imperfect case signicance, the results may be
incorrect. The ideal situaon is to use a numeric column as suggested.
Alternavely, it is possible to control the number of mappers using the -m opon. If we use
-m 1, there will be a single mapper and no aempt will be made to paron the primary key
column. For small data sets such as ours, we can also do this to ensure a single output le.
This is not just an opon; if you try to import from a table with no primary key, Sqoop will
fail with an error stang that the only way to import from such a table is to explicitly set a
single mapper.
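For example, re-running the earlier import with a single mapper, a sketch using the same connection details as before, would produce one output file and avoid the textual primary key warning:
$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P --table employees -m 1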
Other options
Don't assume that Sqoop is all or nothing when it comes to importing data. Sqoop has
several other options to specify, restrict, and alter the data extracted from the database.
We will illustrate these in the following sections, where we discuss Hive, but bear in mind
that most can also be used when importing directly onto HDFS.
Sqoop's architecture
Now that we have seen Sqoop in action, it is worthwhile taking a few moments to clarify its
architecture and see how it works. In several ways, Sqoop interacts with Hadoop in much
the same way that Hive does; both are single client programs that create one or more
MapReduce jobs to perform their tasks.
Sqoop does not have any server processes; the command-line client we run is all there is
to it. However, because it can tailor its generated MapReduce code to the specific tasks
at hand, it tends to utilize Hadoop quite efficiently.
The preceding example of splitting a source RDBMS table on a primary key is a good
example of this. Sqoop knows the number of mappers that will be configured in the
MapReduce job—the default is four, as previously mentioned—and from this, it can
do smart partitioning of the source table.
If we assume a table with 1 million records and four mappers, then each will process
250,000 records. With its knowledge of the primary key column, Sqoop can create four
SQL statements to retrieve the data, each using the desired primary key column range
as a constraint. In the simplest case, this could be as straightforward as adding something like
WHERE id BETWEEN 1 AND 250000 to the first statement and using different id
ranges for the others.
We will see the reverse behavior when exporting data from Hadoop, as Sqoop again
parallelizes data retrieval across multiple mappers and works to optimize the insertion of this
data into the relational database. However, all these smarts are pushed into the MapReduce
jobs executed on Hadoop; the Sqoop command-line client's job is to generate this code as
efficiently as possible and then get out of the way as the processing occurs.
Importing data into Hive using Sqoop
Sqoop has signicant integraon with Hive, allowing it to import data from a relaonal
source into either new or exisng Hive tables. There are mulple ways in which this
process can be tailored, but again, let's start with the simple case.
Time for action – exporting data from MySQL into Hive
For this example, we'll export all the data from a single MySQL table into a correspondingly
named table in Hive. You will need Hive installed and configured as detailed in the
previous chapter.
1. Delete the output directory created in the previous section:
$ hadoop fs -rmr employees
You will receive the following response:
Deleted hdfs://head:9000/user/hadoop/employees
2. Conrm Hive doesn't already contain an employees table:
$ hive -e "show tables like 'employees'"
You will receive the following response:
OK
Time taken: 2.318 seconds
3. Perform the Sqoop import:
$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P
--table employees --hive-import --hive-table employees
4. Check the contents in Hive:
$ hive -e "select * from employees"
You will receive the following response:
OK
Alice Engineering 50000 2009-03-12
Camille Marketing 40000 2003-04-20
David Executive 75000 2001-03-20
Erica Support 34000 2011-07-07
Time taken: 2.739 seconds
5. Examine the created table in Hive:
$ hive -e "describe employees"
You will receive the following response:
OK
first_name string
dept string
salary int
start_date string
Time taken: 2.553 seconds
What just happened?
Again, we use the Sqoop command with two new options, --hive-import to tell Sqoop
the final destination is Hive and not HDFS, and --hive-table to specify the name of the
table in Hive where we want the data imported.
In actuality, we don't need to specify the name of the Hive table if it is the same as the
source table specified by the --table option. However, it does make things more explicit,
so we will typically include it.
As before, do read the full Sqoop output as it provides great insight into what's going on,
but the last few lines highlight the successful import into the new Hive table.
We see Sqoop retrieving five rows from MySQL and then going through the stages of
copying them to HDFS and importing them into Hive. We will talk about the warning regarding
type conversions next.
After Sqoop completes the process, we use Hive to retrieve the data from the new Hive table
and confirm that it is what we expected. Then, we examine the definition of the created table.
At this point, we do see one strange thing; the start_date column has been given the type
string even though it was originally a SQL DATE type in MySQL.
The warning we saw during the Sqoop execution explains this situation:
12/05/23 13:06:33 WARN hive.TableDefWriter: Column start_date had to be
cast to a less precise type in Hive
The cause of this is that Hive does not support any temporal datatype other than TIMESTAMP.
In those cases where imported data is of another type relating to dates or times, Sqoop
converts it to a string. We will look at a way of dealing with this situation a little later.
This example is a prey common situaon, but we do not always want to import an enre
table into Hive. Somemes, we want to only include parcular columns or to apply a
predicate to reduce the number of selected items. Sqoop allows us to do both.
Time for action – a more selective import
Let's see how this works by performing an import that is limited by a conditional expression.
1. Delete any existing employee import directory:
$ hadoop fs -rmr employees
You will receive the following response:
Deleted hdfs://head:9000/user/hadoop/employees
2. Import selected columns with a predicate:
sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P
--table employees --columns first_name,salary
--where "salary > 45000"
--hive-import --hive-table salary
You will receive the following response:
12/05/23 15:02:03 INFO hive.HiveImport: Hive import complete.
3. Examine the created table:
$ hive -e "describe salary"
You will receive the following response:
OK
first_name string
salary int
Time taken: 2.57 seconds
4. Examine the imported data:
$ hive -e "select * from salary"
You will see the following output:
OK
Alice 50000
David 75000
Time taken: 2.754 seconds
What just happened?
This me, our Sqoop command rst added the --columns opon that species which
columns to include in the import. This is a comma-separated list.
We also used the --where opon that allows the free text specicaon of a WHERE clause
that is applied to the SQL used to extract data from the database.
The combinaon of these opons is that our Sqoop command should import only the names
and salaries of those with a salary greater than the threshold specied in the WHERE clause.
We execute the command, see it complete successfully, and then examine the table created
in Hive. We see that it indeed only contains the specied columns, and we then display the
table contents to verify that the where predicate was also applied correctly.
Datatype issues
In Chapter 8, A Relaonal View on Data with Hive, we menoned that Hive does not support
all the common SQL datatypes. The DATE and DATETIME types in parcular are not currently
implemented though they do exist as idened Hive issues; so hopefully, they will be added
in the future. We saw this impact our rst Hive import earlier in this chapter. Though the
start_date column was of type DATE in MySQL, the Sqoop import agged a conversion
warning, and the resultant column in Hive was of type STRING.
Sqoop has an opon that is of use here, that is, we can use --map-column-hive to
explicitly tell Sqoop how to create the column in the generated Hive table.
Time for action – using a type mapping
Let's use a type mapping to improve our data import.
1. Delete any exisng output directory:
$ hadoop fs -rmr employees
2. Execute Sqoop with an explicit type mapping:
sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser
-P --table employees
--hive-import --hive-table employees
--map-column-hive start_date=timestamp
You will receive the following response:
12/05/23 14:53:38 INFO hive.HiveImport: Hive import complete.
3. Examine the created table denion:
$ hive -e "describe employees"
You will receive the following response:
OK
first_name string
dept string
salary int
start_date timestamp
Time taken: 2.547 seconds
4. Examine the imported data:
$ hive -e "select * from employees";
You will receive the following response:
OK
Failed with exception java.io.IOException:java.lang.
IllegalArgumentException: Timestamp format must be yyyy-mm-dd
hh:mm:ss[.fffffffff]
Time taken: 2.73 seconds
What just happened?
Our Sqoop command line here is similar to our original Hive import, except for the addition
of the column mapping specification. We specified that the start_date column should be
of type TIMESTAMP, and we could have added other specifications; the option takes a
comma-separated list of such mappings.
After confirming Sqoop executed successfully, we examined the created Hive table
and verified that the mapping was indeed applied and that the start_date column
has type TIMESTAMP.
We then tried to retrieve the data from the table and could not do so, receiving an error
about a type format mismatch.
On reflection, this should not be a surprise. Though we specified that the desired column type
was to be TIMESTAMP, the actual data being imported from MySQL was of type DATE, which
does not contain the time component required in a timestamp. This is an important lesson.
Ensuring that the type mappings are correct is only one part of the puzzle; we must also
ensure the data is valid for the specified column type.
Time for action – importing data from a raw query
Let's see an example of an import where a raw SQL statement is used to select the data
to be imported.
1. Delete any existing output directory:
$ hadoop fs -rmr employees
2. Drop any existing Hive employee table:
$ hive -e 'drop table employees'
3. Import data using an explicit query:
sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P
--target-dir employees
--query 'select first_name, dept, salary,
timestamp(start_date) as start_date from employees where
$CONDITIONS'
--hive-import --hive-table employees
--map-column-hive start_date=timestamp -m 1
4. Examine the created table:
$ hive -e "describe employees"
You will receive the following response:
OK
first_name string
dept string
salary int
start_date timestamp
Time taken: 2.591 seconds
5. Examine the data:
$ hive -e "select * from employees"
You will receive the following response:
OK
Alice Engineering 50000 2009-03-12 00:00:00
Bob Sales 35000 2011-10-01 00:00:00
Camille Marketing 40000 2003-04-20 00:00:00
David Executive 75000 2001-03-20 00:00:00
Erica Support 34000 2011-07-07 00:00:00
Time taken: 2.709 seconds
What just happened?
To achieve our goal, we used a very different form of the Sqoop import. Instead of specifying
the desired table and then either letting Sqoop import all columns or a specified subset, here
we use the --query option to define an explicit SQL statement.
In the statement, we select all the columns from the source table but apply the
timestamp() function to convert the start_date column to the correct type.
(Note that this function simply adds a 00:00:00 time element to the date.) We alias
the result of this function, which allows us to name it in the type mapping option.
Because we have no --table option, we have to add --target-dir to tell Sqoop the
name of the directory it should create on HDFS.
The WHERE clause in the SQL is required by Sqoop even though we are not actually using it.
Having no --table option does not just remove Sqoop's ability to auto-generate the name
of the export directory, it also means that Sqoop does not know from where data is being
retrieved, and hence, how to partition the data across multiple mappers. The $CONDITIONS
variable is used in conjunction with a --split-by option; specifying the latter provides Sqoop
with the information it needs to partition the table appropriately.
We take a dierent route here and instead explicitly set the number of mappers to 1, which
obviates the need for an explicit paroning clause.
Aer execung Sqoop, we examine the table denion in Hive, which as before, has the
correct datatypes for all columns. We then look at the data, and this is now successful, with
the start_date column data being appropriately converted into the TIMESTAMP values.
When we menoned in the Sqoop and HDFS secon that Sqoop provided
mechanisms to restrict the data extracted from the database, we were
referring to the query, where, and columns opons. Note that these
can be used by any Sqoop import regardless of the desnaon.
Have a go hero
Though it truly is not needed for such a small data set, the $CONDITIONS variable is an
important tool. Modify the preceding Sqoop statement to use multiple mappers with an
explicit partitioning statement.
Sqoop and Hive partitions
In Chapter 8, A Relaonal View on Data with Hive, we talked a lot about Hive parons
and highlighted how important they are in allowing query opmizaon for very large tables.
The good news is that Sqoop can support Hive parons; the bad news is that the support
is not complete.
To import data from a relaonal database into a paroned Hive table, we use the --hive-
partition-key opon to specify the paron column and the --hive-partition-
value opon to specify the value for the paron into which this Sqoop command will
import data.
This is excellent but does require each Sqoop statement to be imported into a single Hive
paron; there is currently no support for Hive auto-paroning. Instead, if a data set is
to be imported into mulple parons in a table, we need use a separate Sqoop statement
for inseron into each paron.
Field and line terminators
Unl now, we have been implicitly relying on some defaults but should discuss them at this
point. Our original text le was tab separated, but you may have noced that the data we
exported onto HDFS was comma-separated. If you go look in the les under /user/hive/
warehouse/employees (remember this is the default locaon on HDFS where Hive keeps
its source les), the records use ASCII code 001 as the separator. What is going on?
In the rst instance, we let Sqoop use its defaults, which in this case, means using a comma
to separate elds and using \n for records. However, when Sqoop is imporng into Hive, it
instead employs the Hive defaults, which include using the 001 code (^A) to separate elds.
We can explicitly set separators using the following Sqoop opons:
fields-terminated-by: This is the separator between elds
lines-terminated-by: The line terminator
escaped-by: Used to escape characters (for example, \)
enclosed-by: The character enclosing elds (for example, ")
optionally-enclosed-by: Similar to the preceding opon but not mandatory
mysql-delimiters: A shortcut to use the MySQL defaults
This may look a lile inmidang, but it's not as obscure as the terminology may suggest,
and the concepts and syntax should be familiar to those with SQL experience. The rst few
opons are prey self-explanatory; where it gets less clear is when talking of enclosing and
oponally enclosing characters.
This is really about (usually free-form) data where a given eld may include characters that
have special meanings. For example, a string column in a comma-separated le that includes
commas. In such a case, we could enclose the string columns within quotes to allow the
commas within the eld. If all elds need such enclosing characters, we would use the rst
form; if it was only required for a subset of the elds, it could be specied as oponal.
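For example, a sketch of an import that keeps the original tab separation and quotes only those fields that need it might look like this:
sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P --table employees
--fields-terminated-by '\t'
--optionally-enclosed-by '"'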
Getting data out of Hadoop
We said that the data flow between Hadoop and a relational database is rarely a linear,
single direction process. Indeed, the situation where data is processed within Hadoop
and then inserted into a relational database is arguably the more common case. We will
explore this now.
Writing data from within the reducer
Thinking about how to copy the output of a MapReduce job into a relational database,
we find similar considerations as when looking at the question of data import into Hadoop.
The obvious approach is to modify a reducer to generate the output for each key and its
associated values and then to directly insert them into a database via JDBC. We do not have
to worry about source column partitioning, as with the import case, but do still need to think
about how much load we are placing on the database and whether we need to consider
timeouts for long-running tasks. In addition, just as with the mapper situation, this approach
tends to perform many single queries against the database, which is typically much less
efficient than bulk operations.
Writing SQL import files from the reducer
Often, a superior approach is not to work around the usual MapReduce case of generating
output files, as with the preceding example, but instead to exploit it.
All relational databases have the ability to ingest data from source files, either through
custom tools or through the use of the LOAD DATA statement. Within the reducer,
therefore, we can modify the data output to make it more easily ingested into our relational
destination. This obviates the need to consider issues such as reducers placing load on the
database or how to handle long-running tasks, but it does require a second step external to
our MapReduce job.
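For example, if a reducer writes tab-separated records, the second step can be as simple as copying a result file off HDFS and loading it; the paths and the summary_table name here are purely illustrative:
$ hadoop fs -get output/part-r-00000 /tmp/summary.tsv
$ mysql -u hadoopuser -p hadooptest
mysql> load data local infile '/tmp/summary.tsv'
    -> into table summary_table
    -> fields terminated by '\t' lines terminated by '\n';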
A better way – Sqoop again
It probably won't come as a surprise—certainly not if you've looked at the output of Sqoop's
inbuilt help or its online documentation—to learn that Sqoop can also be our tool of choice
for data export from Hadoop.
Time for action – importing data from Hadoop into MySQL
Let's demonstrate this by importing data into a MySQL table from an HDFS file.
1. Create a tab-separated file named newemployees.tsv with the following entries:
Frances Operations 34000 2012-03-01
Greg Engineering 60000 2003-11-18
Harry Intern 22000 2012-05-15
Iris Executive 80000 2001-04-08
Jan Support 28500 2009-03-30
2. Create a new directory on HDFS and copy the file into it:
$ hadoop fs -mkdir edata
$ hadoop fs -put newemployees.tsv edata/newemployees.tsv
3. Confirm the current number of records in the employee table:
$ echo "select count(*) from employees" |
mysql -u hadoopuser -p hadooptest
You will receive the following response:
Enter password:
count(*)
5
4. Run a Sqoop export:
$ sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P --table employees
--export-dir edata --input-fields-terminated-by '\t'
You will receive the following response:
12/05/27 07:52:22 INFO mapreduce.ExportJobBase: Exported 5
records.
5. Check the number of records in the table after the export:
$ echo "select count(*) from employees"
| mysql -u hadoopuser -p hadooptest
You will receive the following response:
Enter password:
count(*)
10
6. Check the data:
$ echo "select * from employees"
| mysql -u hadoopuser -p hadooptest
You will receive the following response:
Enter password:
first_name dept salary start_date
Alice Engineering 50000 2009-03-12
Frances Operations 34000 2012-03-01
Greg Engineering 60000 2003-11-18
Harry Intern 22000 2012-05-15
Iris Executive 80000 2001-04-08
Jan Support 28500 2009-03-30
What just happened?
We rst created a data le containing informaon on ve more employees. We created a
directory for our data on HDFS into which we copied the new le.
Before running the export, we conrmed that the table in MySQL contained the original ve
employees only.
The Sqoop command has a similar structure as before with the biggest change being the use
of the export command. As the name suggests, Sqoop exported export data from Hadoop
into a relaonal database.
We used several similar opons as before, mainly to specify the database connecon, the
username and password needed to connect, and the table into which to insert the data.
Because we are exporng data from HDFS, we needed to specify the locaon containing any
les to be exported which we do via the --export-dir opon. All les contained within
the directory will be exported; they do not need be in a single le; Sqoop will include all les
within its MapReduce job. By default, Sqoop uses four mappers; if you have a large number
of les it may be more eecve to increase this number; do test, though, to ensure that load
on the database remains under control.
The nal opon passed to Sqoop specied the eld terminator used in the source les, in this
case, the tab character. It is your responsibility to ensure the data les are properly formaed;
Sqoop will assume there is the same number of elements in each record as columns in the
table (though null is acceptable), separated by the specied eld separator character.
Aer watching the Sqoop command complete successfully, we saw it reports that it exported
ve records. We check, using the mysql tool, the number of rows now in the database and
then view the data to conrm that our old friends are now joined by the new employees.
Differences between Sqoop imports and exports
Though similar conceptually and in the command-line invocations, there are a number of
important differences between Sqoop imports and exports that are worth exploring.
Firstly, Sqoop imports can assume much more about the data being processed; through
either explicitly named tables or added predicates, there is much information about both
the structure and type of the data. Sqoop exports, however, are given only a location of
source files and the characters used to separate and enclose fields and records. While Sqoop
imports into Hive can automatically create a new table based on the provided table name
and structure, a Sqoop export must be into an existing table in the relational database.
Even though our earlier demonstration with dates and timestamps showed there are some
sharp edges, Sqoop imports can also rely on the source data complying
with the defined column types; the data could not have been inserted into the
database otherwise. Sqoop exports, in contrast, effectively only have access to fields of characters
with no understanding of the real datatype. If you have the luxury of very clean and
well-formatted data, this may never matter, but for the rest of us, there will be a need to
consider data exports and type conversions, particularly in terms of null and default values.
The Sqoop documentation goes into these options in some detail and is worth a read.
Inserts versus updates
Our preceding example was very straightforward; we added an entire new set of data that
can happily coexist with the existing contents of the table. Sqoop exports by default do a
series of appends, adding each record as a new row in the table.
However, what if we later want to update data when, for example, our employees get
increased salaries at the end of the year? With the database table defining first_name
as a primary key, any attempt to insert a new row with the same name as an existing
employee will fail with a failed primary key constraint.
In such cases, we can set the Sqoop --update-key option to specify the primary key, and
Sqoop will generate UPDATE statements based on this key (it can be a comma-separated list
of keys), as opposed to INSERT statements adding new rows.
In this mode, any record that does not match an existing key value will
silently be ignored, and Sqoop will not flag errors if a statement updates
more than one row.
If we also want the option of an update that adds new rows for non-existing data, we can set
the --update-mode option to allowinsert.
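A sketch of such an export against our employees table, reusing the source directory from the earlier export, might look like the following:
sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P --table employees
--export-dir edata --input-fields-terminated-by '\t'
--update-key first_name --update-mode allowinsert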
Have a go hero
Create another data le that contains three new employees as well as updated salaries for
two of the exisng employees. Use Sqoop in import mode to both add the new employees
as well as apply the needed updates.
Sqoop and Hive exports
Given the preceding example, it may not be surprising to learn that Sqoop does not currently
have any direct support to export a Hive table into a relational database. More precisely,
there are no explicit equivalents to the --hive-import option we used earlier.
However, in some cases, we can work around this. If a Hive table is storing its data in text
format, we could point Sqoop at the location of the table data files on HDFS. In the case of tables
referring to external data, this may be straightforward, but once we start seeing Hive tables
with complex partitioning, the directory structure becomes more involved.
Hive can also store tables as binary SequenceFiles, and a current limitation is that Sqoop
cannot transparently export from tables stored in this format.
Time for action – importing Hive data into MySQL
Regardless of these limitations, let's demonstrate that, in the right situations, we can use
Sqoop to directly export data stored in Hive.
1. Remove any existing data in the employee table:
$ echo "truncate employees" | mysql -u hadoopuser -p hadooptest
You will receive the following response:
Query OK, 0 rows affected (0.01 sec)
2. Check the contents of the Hive warehouse for the employee table:
$ hadoop fs -ls /user/hive/warehouse/employees
You will receive the following response:
Found 1 items
… /user/hive/warehouse/employees/part-m-00000
3. Perform the Sqoop export:
sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P --table employees \
--export-dir /user/hive/warehouse/employees
--input-fields-terminated-by '\001'
--input-lines-terminated-by '\n'
What just happened?
Firstly, we truncated the employees table in MySQL to remove any existing data and then
confirmed the employee table data was where we expected it to be.
Note that Sqoop may also create an empty file named _SUCCESS in this
directory; if it is present, it should be deleted before running
the Sqoop export.
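Removing it, if present, is a single command:
$ hadoop fs -rm /user/hive/warehouse/employees/_SUCCESS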
The Sqoop export command is like before; the only changes are the different source
location for the data and the addition of explicit field and line terminators. Recall that
Hive, by default, uses ASCII code 001 and \n for its field and line terminators, respectively
(also recall, though, that we have previously imported files into Hive with other separators,
so this is something that always needs to be checked).
We execute the Sqoop command and watch it fail due to Java
IllegalArgumentExceptions when trying to create instances of java.sql.Date.
We are now hitting the reverse of the problem we encountered earlier; the original type
in the source MySQL table had a datatype not supported by Hive, and we converted the
data to match the available type of TIMESTAMP. When exporting data back again, however,
we are now trying to create a DATE using a TIMESTAMP value, which is not possible without
some conversion.
The lesson here is that our earlier approach of doing a one-way conversion only worked
for as long as we only had data flowing in one direction. As soon as we need bi-directional
data transfer, mismatched types between Hive and the relational store add complexity and
require the insertion of conversion routines.
Time for action – xing the mapping and re-running the export
In this case, however, let us do what probably makes more sense—modifying the denion
of the employee table to make it consistent in both data sources.
1. Start the mysql ulity:
$ mysql -u hadoopuser -p hadooptest
Enter password:
2. Change the type of the start_date column:
mysql> alter table employees modify column start_date timestamp;
You will receive the following response:
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
3. Display the table definition:
mysql> describe employees;
4. Quit the mysql tool:
mysql> quit;
5. Perform the Sqoop export:
sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P --table employees
--export-dir /user/hive/warehouse/employees
--input-fields-terminated-by '\001'
--input-lines-terminated-by '\n'
You will receive the following response:
12/05/27 09:17:39 INFO mapreduce.ExportJobBase: Exported 10
records.
6. Check the number of records in the MySQL database:
$ echo "select count(*) from employees"
| mysql -u hadoopuser -p hadooptest
You will receive the following output:
Enter password:
count(*)
10
What just happened?
Before trying the same Sqoop export as last time, we used the mysql tool to connect to
the database and modify the type of the start_date column. Note, of course, that such
changes should never be made casually on a production system, but given that we have a
currently empty test table, there are no issues here.
After making the change, we re-ran the Sqoop export, and this time it succeeded.
Other Sqoop features
Sqoop has a number of other features that we won't discuss in detail, but we'll highlight
them so the interested reader can look them up in the Sqoop documentation.
Incremental merge
The examples we've used have been all-or-nothing processing that, in most cases, makes the
most sense when importing data into empty tables. There are mechanisms to handle additions,
but if we foresee Sqoop performing ongoing imports, some additional support is available.
Sqoop supports the concept of incremental imports, where an import task is additionally
qualified by a date and only records more recent than that date are processed by the task.
This allows the construction of long-running workflows that include Sqoop.
Avoiding partial exports
We've already seen how errors can occur when exporting data from Hadoop into a relational
database. For us, it wasn't a significant problem as the issue caused all exported records to
fail. But it isn't uncommon for only part of an export to fail, resulting in partially committed
data in the database.
To migate this risk, Sqoop allows the use of a staging table; it loads all the data into this
secondary table, and only aer all data is successfully inserted, performs the move into the
main table in a single transacon. This can be very useful for failure-prone workloads but
does come with some important restricons, such as the inability to support update mode.
For very large imports, there are also performance and load impacts on the RDBMS of a
single very long-running transacon.
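A sketch of such an export follows; the employees_staging table is hypothetical and must already exist with the same structure as the target table:
sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P --table employees
--staging-table employees_staging --clear-staging-table
--export-dir edata --input-fields-terminated-by '\t'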
Sqoop as a code generator
We've been ignoring an error during Sqoop processing that we casually brushed off a while
ago: the exception thrown because the generated code required by Sqoop already exists.
When performing an import, Sqoop generates Java class files that provide a programmatic
means of accessing the fields and records in the created files. Sqoop uses these classes
internally, but they can also be used outside of a Sqoop invocation, and indeed, the
Sqoop codegen command can regenerate the classes without performing any data transfer.
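For example, the following would regenerate the classes for our employees table without moving any data, a sketch using the same connection options as before:
$ sqoop codegen --connect jdbc:mysql://10.0.0.100/hadooptest
--username hadoopuser -P --table employees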
AWS considerations
We've not menoned AWS so far in this chapter as there's been nothing in Sqoop that either
supports or prevents its use on AWS. We can run Sqoop on an EC2 host as easily as on a
local one, and it can access either a manually or EMR-created Hadoop cluster oponally
running Hive. The only possible quirk when considering use in AWS is security group access
as many default EC2 conguraons will not allow trac on the ports used by most relaonal
databases (3306 by default for MySQL). But, that's no more of an issue than if our Hadoop
cluster and MySQL database were to be located on dierent sides of a rewall or any other
network security boundary.
Considering RDS
There is another AWS service that we've not mentioned before that does deserve an
introduction now. Amazon Relational Database Service (RDS) offers hosted relational
databases in the cloud and provides MySQL, Oracle, and Microsoft SQL Server options.
Instead of having to worry about the installation, configuration, and management of a
database engine, RDS allows an instance to be started from either the console or command-
line tools. You then just point your database client tool at the database and start creating
tables and manipulating data.
RDS and EMR are a powerful combination, providing hosted services that take much of the
pain out of manually managing such services. If you need a relational database but don't
want to worry about its management, RDS may be for you.
The RDS and EMR combination can be particularly powerful if you use EC2 hosts to generate
data or store data in S3. Amazon has a general policy that there is no cost for data transfer
from one service to another within a single region. Consequently, it's possible to have a fleet
of EC2 hosts generating large data volumes that get pushed into a relational database in RDS
for query access and are stored in EMR for archival and long-term analytics. Getting data into
the storage and processing systems is often a technically challenging activity that can easily
consume significant expense if the data needs to be moved across commercial network links.
Architectures built atop collaborating AWS services such as EC2, RDS, and EMR can minimize
both these concerns.
Summary
In this chapter, we have looked at the integration of Hadoop and relational databases. In
particular, we explored the most common use cases and saw that Hadoop and relational
databases can be highly complementary technologies. We considered ways of exporting
data from a relational database onto HDFS files and realized that issues such as primary
key column partitioning and long-running tasks make it harder than it first seems.
We then introduced Sqoop, a Cloudera tool now donated to the Apache Software Foundation
that provides a framework for such data migration. We used Sqoop to import data from
MySQL into HDFS and then Hive, highlighting how we must consider aspects of datatype
compatibility in such tasks. We also used Sqoop to do the reverse—copying data from HDFS
into a MySQL database—and found out that this path has more subtle considerations than
the other direction, briefly discussed issues of file formats and update versus insert tasks, and
introduced additional Sqoop capabilities, such as code generation and incremental merging.
Relaonal databases are an important—oen crical—part of most IT infrastructures.
But, they aren't the only such component. One that has been growing in importance—oen
with lile fanfare—is the vast quanes of log les generated by web servers and other
applicaons. The next chapter will show how Hadoop is ideally suited to process and
store such data.
10
Data Collection with Flume
In the previous two chapters, we've seen how Hive and Sqoop give a
relational database interface to Hadoop and allow it to exchange data with
"real" databases. Although this is a very common use case, there are, of course,
many different types of data sources that we may want to get into Hadoop.
In this chapter, we will cover:
An overview of data commonly processed in Hadoop
Simple approaches to pull this data into Hadoop
How Apache Flume can make this task a lot easier
Common patterns for simple through to sophisticated Flume setups
Common issues, such as the data lifecycle, that need to be considered
regardless of technology
A note about AWS
This chapter will discuss AWS less than any other in the book. In fact, we won't even mention
it after this section. There are no Amazon services akin to Flume, so there is no AWS-specific
product that we could explore. On the other hand, when using Flume, it works exactly
the same, be it on a local host or an EC2 virtual instance. The rest of this chapter, therefore,
assumes nothing about the environment on which the examples are executed; they will
perform identically on each.
Data data everywhere...
In discussions concerning integration of Hadoop with other systems, it is easy to think of it as
a one-to-one pattern. Data comes out of one system, gets processed in Hadoop, and then is
passed on to a third.
Things may be like that on day one, but the reality is more often a series of collaborating
components with data flows passing back and forth between them. How we build this
complex network in a maintainable fashion is the focus of this chapter.
Types of data
For the sake of the discussion, we will categorize data into two broad categories:
Network traffic, where data is generated by a system and sent across a network
connection
File data, where data is generated by a system and written to files on a
filesystem somewhere
We don't assume these data categories are different in any way other than how the data
is retrieved.
Getting network trafc into Hadoop
When we say network data, we mean things like informaon retrieved from a web server
via an HTTP connecon, database contents pulled by a client applicaon, or messages sent
across a data bus. In each case, the data is retrieved by a client applicaon that either pulls
the data across the network or listens for its arrival.
In several of the following examples, we will use the curl ulity to
either retrieve or send network data. Ensure that it is installed on your
system and install it if not.
Time for action – getting web server data into Hadoop
Let's take a look at how we can simplistically copy data from a web server onto HDFS.
1. Retrieve the text of the NameNode web interface to a local file:
$ curl localhost:50070 > web.txt
2. Check the file size:
$ ls -ldh web.txt
You will receive the following response:
-rw-r--r-- 1 hadoop hadoop 246 Aug 19 08:53 web.txt
3. Copy the file to HDFS:
$ hadoop fs -put web.txt web.txt
4. Check the file on HDFS:
$ hadoop fs -ls
You will receive the following response:
Found 1 items
-rw-r--r-- 1 hadoop supergroup 246 2012-08-19 08:53 /user/hadoop/web.txt
What just happened?
There shouldn't be anything that is surprising here. We use the curl utility to retrieve a
web page from the embedded web server hosting the NameNode web interface and save
it to a local file. We check the file size, copy it to HDFS, and verify that the file has been
transferred successfully.
The point of note here is not the series of actions—it is, after all, just another use of the
hadoop fs command we have used since Chapter 2, Getting Up and Running—rather,
it is the pattern being used that we should discuss.
Though the data we wanted was in a web server and accessible via the HTTP protocol,
the out-of-the-box Hadoop tools are very file-based and have no intrinsic support
for such remote information sources. This is why we need to copy our network data into a
file before transferring it to HDFS.
We can, of course, write data directly to HDFS through the programmatic interface
mentioned back in Chapter 3, Writing MapReduce Jobs, and this would work well.
It would, however, require us to start writing custom clients for every different
network source from which we need to retrieve data.
Have a go hero
Programmacally retrieving data and wring it to HDFS is a very powerful capability
and worth some exploraon. A very popular Java library for HTTP is the Apache
HTTPClient, within the HTTP Components project found at http://hc.apache.org/
httpcomponents-client-ga/index.html.
Use HttpClient and the Java HDFS interface to retrieve a web page as before and write it
to HDFS.
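If you want to try this, a minimal sketch might look like the following. It assumes the HttpClient 4.x and Hadoop client libraries are on the classpath; the URL and HDFS path simply reuse the values from the earlier example and should be adjusted for your environment:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class WebToHdfs {
    public static void main(String[] args) throws Exception {
        // Fetch the page over HTTP (newer HttpClient versions prefer HttpClients.createDefault())
        HttpClient client = new DefaultHttpClient();
        HttpResponse response = client.execute(new HttpGet("http://localhost:50070/"));
        InputStream in = response.getEntity().getContent();

        // Write the response body straight to HDFS
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/web.txt"));

        // Copy the stream and close both ends when done
        IOUtils.copyBytes(in, out, 4096, true);
    }
}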
Getting les into Hadoop
Our previous example showed the simplest method for geng le-based data into Hadoop
and the use of the standard command-line tools or programmac APIs. There is lile else to
discuss here, as it is a topic we have dealt with throughout the book.
Hidden issues
Though the preceding approaches are good as far as they go, there are several reasons why
they may be unsuitable for production use.
Keeping network data on the network
Our model of copying network-accessed data to a file before placing it on HDFS will
have an impact on performance. There is added latency due to the round trip to disk,
the slowest part of a system. This may not be an issue for large amounts of data retrieved
in one call—though disk space potentially becomes a concern—but for small amounts of
data retrieved at high speed, it may become a real problem.
Hadoop dependencies
For the le-based approach, it is implicit in the model menoned before that the point at
which we can access the le must have access to the Hadoop installaon and be congured
to know the locaon of the cluster. This potenally adds addional dependencies in the
system; this could force us to add Hadoop to hosts that really need to know nothing about it.
We can migate this by using tools like SFTP to retrieve the les to a Hadoop-aware machine
and from there, copy onto HDFS.
Reliability
Noce the complete lack of error handling in the previous approaches. The tools we are
using do not have built-in retry mechanisms which means we would need to wrap a degree
of error detecon and retry logic around each data retrieval.
Re-creating the wheel
This last point touches on perhaps the biggest issue with these ad hoc approaches; it is
very easy to end up with a dozen different strings of command-line tools and scripts, each
of which is doing very similar tasks. The potential costs in terms of duplicate effort and more
difficult error tracking can be significant over time.
A common framework approach
Anyone with experience in enterprise computing will, at this point, be thinking that this
sounds like a problem best solved by some type of common integration framework.
This is exactly correct, and it is indeed a general type of product well known in fields such
as Enterprise Application Integration (EAI).
What we need, though, is a framework that is Hadoop-aware and can easily integrate with
Hadoop (and related projects) without requiring massive effort in writing custom adaptors.
We could create our own, but instead let's look at Apache Flume, which provides much of
what we need.
Introducing Apache Flume
Flume, found at http://flume.apache.org, is another Apache project with tight Hadoop
integration, and we will explore it for the remainder of this chapter.
Before we explain what Flume can do, let's make it clear what it is not. Flume is described
as a system for the retrieval and distribution of logs, meaning line-oriented textual data. It is
not a generic data-distribution platform; in particular, don't look to use it for the retrieval or
movement of binary data.
However, since the vast majority of the data processed in Hadoop matches this description,
it is likely that Flume will meet many of your data retrieval needs.
Flume is also not a generic data serialization framework like Avro, which we used
in Chapter 5, Advanced MapReduce Techniques, or similar technologies such as
Thrift and Protocol Buffers. As we'll see, Flume makes assumptions about the
data format and provides no way of serializing data outside of these.
Flume provides mechanisms for retrieving data from multiple sources, passing it to remote
locations (potentially multiple locations in either a fan-out or pipeline model), and then
delivering it to a variety of destinations. Though it does have a programmatic API that allows
the development of custom sources and destinations, the base product has built-in support
for many of the most common scenarios. Let's install it and take a look.
A note on versioning
Flume has gone through some major changes in recent times. The original Flume
(now renamed Flume OG, for Original Generation) is being superseded by Flume NG
(Next Generation). Though the general principles and capabilities are very similar, the
implementation is quite different.
Because Flume NG is the future, we will cover it in this book. For some time, though, it
will lack several of the features of the more mature Flume OG, so if you find a specific
requirement that Flume NG doesn't meet, it may be worth looking at Flume OG.
Time for action – installing and configuring Flume
Let's get Flume downloaded and installed.
1. Retrieve the most recent Flume NG binary from http://flume.apache.org/ and
save it to the local filesystem.
2. Move the file to the desired location and uncompress it:
$ mv apache-flume-1.2.0-bin.tar.gz /opt
$ tar -xzf /opt/apache-flume-1.2.0-bin.tar.gz
3. Create a symlink to the installation:
$ ln -s /opt/apache-flume-1.2.0 /opt/flume
4. Define the FLUME_HOME environment variable:
$ export FLUME_HOME=/opt/flume
5. Add the Flume bin directory to your path:
$ export PATH=${FLUME_HOME}/bin:${PATH}
6. Verify that JAVA_HOME is set:
$ echo ${JAVA_HOME}
7. Verify that the Hadoop libraries are in the classpath:
$ echo ${CLASSPATH}
8. Create the directory that will act as the Flume conf directory:
$ mkdir /home/hadoop/flume/conf
9. Copy the needed files into the conf directory:
$ cp /opt/flume/conf/log4j.properties /home/hadoop/flume/conf
$ cp /opt/flume/conf/flume-env.sh.sample /home/hadoop/flume/conf/flume-env.sh
10. Edit flume-env.sh and set JAVA_HOME.
What just happened?
The Flume installaon is straighorward and has similar prerequisites to previous tools we
have installed.
Firstly, we retrieved the latest version of Flume NG (any version of 1.2.x or later will do) and
saved it to the local lesystem. We moved it to the desired locaon, uncompressed it, and
created a convenience symlink to the locaon.
We needed to dene the FLUME_HOME environment variable and add the bin directory
within the installaon directory to our classpath. As before, this can be done directly on
the command line or within convenience scripts.
Flume requires JAVA_HOME to be dened and we conrmed this is the case. It also requires
Hadoop libraries, so we checked that the Hadoop classes are in the classpath.
The last steps are not strictly necessary for demonstraon though will be used in producon.
Flume looks for a conguraon directory within which are les dening the default logging
properes and environment setup variables (such as JAVA_HOME). We nd Flume performs
most predictably when this directory is properly set up, so we did this now and don't need
to change it much later.
We assumed /home/hadoop/flume is the working directory within which the
Flume configuration and other files will be stored; change this based on what's
appropriate for your system.
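For reference, a minimal flume-env.sh usually needs little more than the Java location; the paths and values shown here are only illustrations and should match your own installation:

# Set in /home/hadoop/flume/conf/flume-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
# Optionally give the agent JVM more memory, for example:
# export JAVA_OPTS="-Xms100m -Xmx200m"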
Using Flume to capture network data
Now that we have Flume installed, let's use it to capture some network data.
Time for action – capturing network traffic in a log file
In the first instance, let's use a simple Flume configuration that will capture the network data
to the main Flume log file.
1. Create the following file as agent1.conf within your Flume working directory:
agent1.sources = netsource
agent1.sinks = logsink
agent1.channels = memorychannel
agent1.sources.netsource.type = netcat
agent1.sources.netsource.bind = localhost
agent1.sources.netsource.port = 3000
www.it-ebooks.info
Data Collecon with Flume
[ 322 ]
agent1.sinks.logsink.type = logger
agent1.channels.memorychannel.type = memory
agent1.channels.memorychannel.capacity = 1000
agent1.channels.memorychannel.transactionCapacity = 100
agent1.sources.netsource.channels = memorychannel
agent1.sinks.logsink.channel = memorychannel
2. Start a Flume agent:
$ flume-ng agent --conf conf --conf-file agent1.conf --name agent1
The output of the preceding command is shown in the following screenshot:
3. In another window, open a telnet connection to port 3000 on the local host and
then type some text:
$ curl telnet://localhost:3000
Hello
OK
Flume!
OK
4. Close the curl connection with Ctrl + C.
5. Look at the Flume log file:
$ tail flume.log
You will receive the following response:
2012-08-19 00:37:32,702 INFO sink.LoggerSink: Event: { headers:{}
body: 68 65 6C 6C 6F Hello }
2012-08-19 00:37:32,702 INFO sink.LoggerSink: Event: { headers:{}
body: 46 6C 75 6D 65 Flume }
What just happened?
Firstly, we created a Flume configuration file within our Flume working directory. We'll go
into this in more detail later but, for now, think of Flume as receiving data through a component
called a source and writing it to a destination called a sink.
In this case, we create a Netcat source, which listens on a port for network connections.
You can see that we configure it to bind to port 3000 on the local machine.
The configured sink is of the type logger, which, not surprisingly, writes its output to a
log file. The rest of the configuration file defines an agent called agent1, which uses this
source and sink.
We then start a Flume agent by using the flume-ng binary. This is the tool we'll use to
launch all Flume processes. Note that we give a few options to this command:
The agent argument tells Flume to start an agent, which is the generic name
for a running Flume process involved in data movement
The conf directory, as mentioned earlier
The particular configuration file for the process we are going to launch
The name of the agent within the configuration file
The agent will start, and no further output will appear on that screen. (Obviously, we would
run the process in the background in a production setting.)
In another window, we open a telnet connection to port 3000 on the local machine using
the curl utility. The traditional way of opening such sessions is, of course, the telnet program
itself, but many Linux distributions install curl by default, while almost none still include the
older telnet utility.
We type a word on each line and hit Enter, then kill the session with Ctrl + C.
Finally, we look at the flume.log file that is being written into the Flume working directory
and see an entry for each of the words we typed in.
Time for action – logging to the console
It's not always convenient to have to look at log files, particularly when we already have the
agent screen open. Let's modify the agent to also log events to the screen.
1. Restart the Flume agent with an additional argument:
$ flume-ng agent --conf conf --conf-file agent1.conf --name agent1 -Dflume.root.logger=INFO,console
You will receive the following response:
Info: Sourcing environment configuration script /home/hadoop/flume/conf/flume-env.sh
org.apache.flume.node.Application --conf-file agent1.conf --name agent1
2012-08-19 00:41:45,462 (main) [INFO - org.apache.flume.lifecycle.LifecycleSupervisor.start(LifecycleSupervisor.java:67)] Starting lifecycle supervisor 1
2. In another window, connect to the server via curl:
$ curl telnet://localhost:3000
3. Type in Hello and Flume on separate lines, hit Ctrl + C, and then check the
agent window:
What just happened?
We added this example as it becomes very useful when debugging or creating new flows.
As seen in the previous example, Flume will, by default, write its logs to a file on the
filesystem. More precisely, this is the default behavior as specified within the log4j property
file within our conf directory. Sometimes we want more immediate feedback without
constantly looking at log files or having to change the property file.
By explicitly setting the flume.root.logger variable on the command line, we can override
the default logger configuration and have the output sent directly to the agent window. The
logger is standard log4j, so the usual log levels such as DEBUG and INFO are supported.
Writing network data to log files
The default behavior of Flume's logger sink, writing received data into log files, has some
limitations, particularly if we want to use the captured data in other applications.
By configuring a different type of sink, we can instead write the data into more
consumable data files.
Time for action – capturing the output of a command to a flat file
Let's show this in action, demonstrating a new kind of source along the way.
1. Create the following file as agent2.conf within the Flume working directory:
agent2.sources = execsource
agent2.sinks = filesink
agent2.channels = filechannel
agent2.sources.execsource.type = exec
agent2.sources.execsource.command = cat /home/hadoop/message
agent2.sinks.filesink.type = FILE_ROLL
agent2.sinks.filesink.sink.directory = /home/hadoop/flume/files
agent2.sinks.filesink.sink.rollInterval = 0
agent2.channels.filechannel.type = file
agent2.channels.filechannel.checkpointDir = /home/hadoop/flume/fc/checkpoint
agent2.channels.filechannel.dataDirs = /home/hadoop/flume/fc/data
agent2.sources.execsource.channels = filechannel
agent2.sinks.filesink.channel = filechannel
2. Create a simple test file in the home directory:
$ echo "Hello again Flume!" > /home/hadoop/message
3. Start the agent:
$ flume-ng agent --conf conf --conf-file agent2.conf --name agent2
4. In another window, check the file sink output directory:
$ ls files
$ cat files/*
The output of the preceding command is shown in the following screenshot:
What just happened?
The previous example follows a similar pattern to before. We created the configuration file
for a Flume agent, ran the agent, and then confirmed it had captured the data we expected.
This time we used an exec source and a file_roll sink. The former, as the name suggests,
executes a command on the host and captures its output as the input to the Flume agent.
Although in this case the command is executed only once, that was for illustration
purposes only; more common uses involve commands that produce an ongoing stream of
data. Note that the exec source can be configured to restart the command if it does terminate.
The output of the agent is written to a file, as specified in the configuration file. By default,
Flume rotates (rolls) to a new file every 30 seconds; we disabled this capability to make it
easier to track what's going on in a single file.
We see that the file does indeed contain the output of the specified exec command.
Logs versus les
It may not be immediately obvious why Flume has both log and le sinks. Conceptually
both do the same thing, so what's the dierence?
The logger sink in reality is more of a debug tool than anything else. It doesn't just
record the informaon captured by the source, but adds a lot of addional metadata
and events. The file sink, however, records the input data exactly as it was received, with
no alteration—though such alteration is possible if required, as we will see later.
In most cases, you'll want the file sink to capture the input data, but the logger sink may also be
of use in non-production situations, depending on your needs.
Time for action – capturing a remote file in a local flat file
Let's show another example of capturing data to a file sink. This time we will use another
Flume capability that allows it to receive data from a remote client.
1. Create the following file as agent3.conf in the Flume working directory:
agent3.sources = avrosource
agent3.sinks = filesink
agent3.channels = jdbcchannel
agent3.sources.avrosource.type = avro
agent3.sources.avrosource.bind = localhost
agent3.sources.avrosource.port = 4000
agent3.sources.avrosource.threads = 5
agent3.sinks.filesink.type = FILE_ROLL
agent3.sinks.filesink.sink.directory = /home/hadoop/flume/files
agent3.sinks.filesink.sink.rollInterval = 0
agent3.channels.jdbcchannel.type = jdbc
agent3.sources.avrosource.channels = jdbcchannel
agent3.sinks.filesink.channel = jdbcchannel
2. Create a new test file as /home/hadoop/message2:
Hello from Avro!
3. Start the Flume agent:
$ flume-ng agent --conf conf --conf-file agent3.conf --name agent3
4. In another window, use the Flume Avro client to send a file to the agent:
$ flume-ng avro-client -H localhost -p 4000 -F /home/hadoop/message2
5. As before, check the file in the configured output directory:
$ cat files/*
The output of the preceding command is shown in the following screenshot:
What just happened?
As before, we created a new configuration file, this time using an Avro source for the agent.
Recall from Chapter 5, Advanced MapReduce Techniques, that Avro is a data serialization
framework; that is, it manages the packaging and transport of data from one point to another
across the network. Similarly to the Netcat source, the Avro source requires configuration
parameters that specify its network settings. In this case, it will listen on port 4000 on the local
machine. The agent is configured to use the file sink as before, and we start it up as usual.
Flume comes with both an Avro source and a standalone Avro client. The latter can be used
to read a file and send it to an Avro source anywhere on the network. In our example, we
just use the local machine, but note that the Avro client requires the explicit hostname and
port of the Avro source to which it should send the file; it is not constrained to the local host
and can send files to a listening Flume Avro source anywhere on the network.
The Avro client reads the file, sends it to the agent, and the data gets written to the file sink. We
check this behavior by confirming that the file contents are in the file sink location as expected.
Sources, sinks, and channels
We intenonally used a variety of sources, sinks, and channels in the previous examples
just to show how they can be mixed and matched. However, we have not explored them—
especially channels—in much detail. Let's dig a lile deeper now.
Sources
We've looked at three sources: Netcat, exec, and Avro. Flume NG also supports a sequence
generator source (mostly for testing) as well as both TCP and UDP variants of a source that
reads syslogd data. Each source is configured within an agent and, after receiving enough
data to produce a Flume event, it sends this newly created event to the channel to which the
source is connected. Though a source may have logic relating to how it reads data, translates
events, and handles failure situations, the source has no knowledge of how the event is to be
stored. The source has the responsibility of delivering the event to the configured channel,
and all other aspects of the event processing are invisible to the source.
Sinks
In addion to the logger and le roll sinks we used previously, Flume also supports sinks for
HDFS, HBase (two types), Avro (for agent chaining), null (for tesng), and IRC (for an Internet
Relay Chat service). The sink is conceptually similar to the source but in reverse.
The sink waits for events to be received from the congured channel about whose inner
workings it knows nothing. On receipt, the sink handles the output of the event to its
parcular desnaon, managing all issues around me outs, retries, and rotaon.
Channels
So what are these mysterious channels that connect the source and sink? They are, as
the name and the earlier configuration entries suggest, the communication and retention
mechanism that manages event delivery.
When we define a source and a sink, there may be significant differences in how they read
and write data. An exec source may, for example, receive data much faster than a file roll sink
can write it, or the source may have times (such as when rolling to a new file or dealing with
system I/O congestion) when writing needs to be paused. The channel, therefore, needs to buffer
data between the source and sink to allow data to stream through the agent as efficiently
as possible. This is why the channel portions of our configuration files include
elements such as capacity.
The memory channel is the easiest to understand, as the events are read from the source
into memory and passed to the sink as it is able to receive them. But if the agent process
dies mid-way through the process (be it due to software or hardware failure), then all the
events currently in the memory channel are lost forever.
The le and JDBC channels that we also used provide persistent storage of events to prevent
such loss. Aer reading an event from a source, the le channel writes the contents to a le
on the lesystem that is deleted only aer successful delivery to the sink. Similarly, the JDBC
channel uses an embedded Derby database to store events in a recoverable fashion.
This is a classic performance versus reliability trade-o. The memory channel is the fastest
but has the risk of data loss. The le and JDBC channels are typically much slower but
eecvely provide guaranteed delivery to the sink. Which channel you choose depends
on the nature of the applicaon and the values of each event.
Don't worry too much about this trade-off; in the real world, the answer
is usually obvious. Also be sure to look carefully at the reliability of the
source and sink being used. If those are unreliable and you drop events
anyway, do you gain much from a persistent channel?
Or roll your own
Don't feel limited by the existing collection of sources, sinks, and channels. Flume offers
an interface to define your own implementation of each. In addition, there are a few
components present in Flume OG that have not yet been incorporated into Flume NG
but may appear in the future.
Understanding the Flume configuration files
Now that we've talked through sources, sinks, and channels, let's take a look at one of
the configuration files from earlier in a little more detail:
agent1.sources = netsource
agent1.sinks = logsink
agent1.channels = memorychannel
These rst lines name the agent and dene the sources, sinks, and channels associated
with it. We can have mulple values on each line; the values are space separated:
agent1.sources.netsource.type = netcat
agent1.sources.netsource.bind = localhost
agent1.sources.netsource.port = 3000
These lines specify the configuration for the source. Since we are using the Netcat source,
the configuration values specify how it should bind to the network. Each type of source
has its own configuration variables.
agent1.sinks.logsink.type = logger
This species the sink to be used is the logger sink which is further congured via the
command line or the log4j property le.
agent1.channels.memorychannel.type = memory
agent1.channels.memorychannel.capacity = 1000
agent1.channels.memorychannel.transactionCapacity = 100
These lines specify the channel to be used and then add the type-specific configuration
values. In this case, we are using the memory channel and we specify its capacity but – since
it is non-persistent – no external storage mechanism.
agent1.sources.netsource.channels = memorychannel
agent1.sinks.logsink.channel = memorychannel
These last lines configure the channel to be used by the source and sink. Though we used
different configuration files for our different agents, we could just as easily place all the
elements in a single configuration file, as the respective agent names provide the necessary
separation. This can, however, produce a pretty verbose file, which can be a little intimidating
when you are just learning Flume. We can also have multiple flows within a given agent; we
could, for example, combine the first two preceding examples into a single configuration file
and agent.
Have a go hero
Do just that! Create a configuration file that specifies the capabilities of both our previous
agent1 and agent2 in a single composite agent that contains:
A Netcat source and its associated logger sink
An exec source and its associated file sink
Two memory channels, one for each of the source/sink pairs mentioned before
To get you started, here's how the component definitions could look:
agentx.sources = netsource execsource
agentx.sinks = logsink filesink
agentx.channels = memorychannel1 memorychannel2
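If you want to check your attempt against something concrete, one possible completion is sketched below; it simply reuses the settings from agent1.conf and agent2.conf under the new agentx names, and any of the values could be varied:

agentx.sources.netsource.type = netcat
agentx.sources.netsource.bind = localhost
agentx.sources.netsource.port = 3000
agentx.sources.execsource.type = exec
agentx.sources.execsource.command = cat /home/hadoop/message
agentx.sinks.logsink.type = logger
agentx.sinks.filesink.type = FILE_ROLL
agentx.sinks.filesink.sink.directory = /home/hadoop/flume/files
agentx.sinks.filesink.sink.rollInterval = 0
agentx.channels.memorychannel1.type = memory
agentx.channels.memorychannel1.capacity = 1000
agentx.channels.memorychannel2.type = memory
agentx.channels.memorychannel2.capacity = 1000
agentx.sources.netsource.channels = memorychannel1
agentx.sources.execsource.channels = memorychannel2
agentx.sinks.logsink.channel = memorychannel1
agentx.sinks.filesink.channel = memorychannel2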
It's all about events
Let's discuss one more definition before we try another example. Just what is an event?
Remember that Flume is explicitly based around log files, so in most cases, an event equates
to a line of text followed by a newline character. That is the behavior we've seen with the
sources and sinks we've used.
This isn't always the case, however; the UDP syslogd source, for example, treats each packet
of data received as a single event, which gets passed through the system. When using these
sources and sinks, these definitions of an event are fixed; when reading
files, for example, we have no choice but to use line-based events.
Time for action – writing network traffic onto HDFS
This discussion of Flume in a book about Hadoop hasn't actually used Hadoop at all so far.
Let's remedy that by writing data onto HDFS via Flume.
1. Create the following file as agent4.conf within the Flume working directory:
agent4.sources = netsource
agent4.sinks = hdfssink
agent4.channels = memorychannel
agent4.sources.netsource.type = netcat
agent4.sources.netsource.bind = localhost
agent4.sources.netsource.port = 3000
agent4.sinks.hdfssink.type = hdfs
agent4.sinks.hdfssink.hdfs.path = /flume
agent4.sinks.hdfssink.hdfs.filePrefix = log
agent4.sinks.hdfssink.hdfs.rollInterval = 0
agent4.sinks.hdfssink.hdfs.rollCount = 3
agent4.sinks.hdfssink.hdfs.fileType = DataStream
agent4.channels.memorychannel.type = memory
agent4.channels.memorychannel.capacity = 1000
agent4.channels.memorychannel.transactionCapacity = 100
agent4.sources.netsource.channels = memorychannel
agent4.sinks.hdfssink.channel = memorychannel
2. Start the agent:
$ flume-ng agent --conf conf --conf-file agent4.conf --name agent4
3. In another window, open a telnet connection and send seven events to Flume:
$ curl telnet://localhost:3000
4. Check the contents of the directory specified in the Flume configuration file and
then examine the file contents:
$ hadoop fs -ls /flume
$ hadoop fs -cat "/flume/*"
The output of the preceding command is shown in the following screenshot:
What just happened?
This me we paired a Netcat source with the HDFS sink. As can be seen from the
conguraon le, we need to specify aspects such as the locaon for the les, any le prex,
and the strategy for rolling from one le to another. In this case, we specied les within the
/flume directory, each starng with log- and with a maximum of three entries in each le
(obviously, such a low value is for tesng only).
Aer starng the agent, we use curl once more to send seven single word events to Flume.
We then used the Hadoop command-line ulity to look at the directory contents and verify
that our input data was being wrien to HDFS.
Note that the third HDFS le has a .tmp extension. Remember that we specied three
entries per le but only input seven values. We, therefore, lled up two les and started
on another. Flume gives the le currently being wrien a .tmp extension, which makes it
easy to dierenate the completed les from in-progress les when specifying which les to
process via MapReduce jobs.
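In production, you would normally roll on time or file size rather than on a three-event count. A fragment such as the following (the values are purely illustrative) is more typical; setting one of these options to 0 disables that particular trigger:

agent4.sinks.hdfssink.hdfs.rollInterval = 300
agent4.sinks.hdfssink.hdfs.rollCount = 0
agent4.sinks.hdfssink.hdfs.rollSize = 134217728

This would start a new file every five minutes or after roughly 128 MB has been written, whichever comes first.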
Time for action – adding timestamps
We menoned earlier that there were mechanisms to have le data wrien in slightly more
sophiscated ways. Let's do something very common and write our data into a directory with
a dynamically-created mestamp.
1. Create the following conguraon le as agent5.conf:
agent5.sources = netsource
agent5.sinks = hdfssink
agent5.channels = memorychannel
agent5.sources.netsource.type = netcat
agent5.sources.netsource.bind = localhost
agent5.sources.netsource.port = 3000
agent5.sources.netsource.interceptors = ts
agent5.sources.netsource.interceptors.ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
agent5.sinks.hdfssink.type = hdfs
agent5.sinks.hdfssink.hdfs.path = /flume-%Y-%m-%d
agent5.sinks.hdfssink.hdfs.filePrefix = log-
agent5.sinks.hdfssink.hdfs.rollInterval = 0
agent5.sinks.hdfssink.hdfs.rollCount = 3
agent5.sinks.hdfssink.hdfs.fileType = DataStream
agent5.channels.memorychannel.type = memory
agent5.channels.memorychannel.capacity = 1000
agent5.channels.memorychannel.transactionCapacity = 100
agent5.sources.netsource.channels = memorychannel
agent5.sinks.hdfssink.channel = memorychannel
2. Start the agent:
$ flume-ng agent --conf conf --conf-file agent5.conf --name agent5
3. In another window, open up a telnet session and send seven events to Flume:
$ curl telnet://localhost:3000
4. Check the directory name on HDFS and the files within it:
$ hadoop fs -ls /
The output of the preceding command is shown in the following screenshot:
What just happened?
We made a few changes to the previous configuration file. We added an
interceptor specification to the Netcat source and gave its implementation
class as TimestampInterceptor.
Flume interceptors are plugins that can manipulate and modify events before they
pass from the source to the channel. Most interceptors either add metadata to the
event (as in this case) or drop events based on certain criteria. In addition to the several
built-in interceptors, there is naturally a mechanism for user-defined interceptors.
We used the timestamp interceptor here, which adds the Unix timestamp at the time the
event is read to the event metadata. This allows us to extend the definition of the
HDFS path into which events are to be written.
While previously we simply wrote all events to the /flume directory, we now specified the
path as /flume-%Y-%m-%d. After running the agent and sending some data to Flume, we
looked at HDFS and saw that these variables had been expanded to give the directory a
year/month/day suffix.
The HDFS sink supports many other variables, such as the hostname of the source and
additional temporal variables that allow precise partitioning down to the level of seconds.
The utility here is plain; instead of having all events written into a single directory that becomes
enormous over time, this simple mechanism can give automatic partitioning, making data
management easier and also providing a simpler interface to the data for MapReduce jobs. If,
for example, most of your MapReduce jobs process hourly data, then having Flume partition
incoming events into hourly directories will make your life much easier.
To be precise, the event passing through Flume has had a complete Unix timestamp added,
that is, accurate to the nearest second. In our example, we used only date-related variables
in the directory specification; if hourly or finer-grained directory partitioning is required,
the time-related variables would be used.
This assumes that the timestamp at the point of processing is sufficient for
your needs. If files are being batched and then fed to Flume, a file's
contents may have timestamps from an earlier hour than the one in which they are
processed. In such a case, you could write a custom interceptor to set
the timestamp header based on the contents of the file.
To Sqoop or to Flume...
An obvious queson is whether Sqoop or Flume is most appropriate if we have data in a
relaonal database that we want to export onto HDFS. We've seen how Sqoop can perform
such an export and we could do something similar using Flume, either with a custom source
or even just by wrapping a call to the mysql command within an exec source.
A good rule of thumb is to look at the type of data and ask if it is log data or something
more involved.
Flume was created in large part to handle log data and it excels in this area. But in most
cases, Flume networks are responsible for delivering events from sources to sinks without
any real transformaon on the log data itself. If you have log data in mulple relaonal
databases, then Flume is likely a great choice, though I would queson the long-term
scalability of using a database for storing log records.
Non-log data may require data manipulation that only Sqoop is capable of performing.
Many of the transformations we performed in the previous chapter using Sqoop, such
as specifying subsets of columns to be retrieved, are really not possible using Flume. It's
also quite possible that if you are dealing with structured data that requires individual field
processing, then Flume alone is not the ideal tool. If you want direct Hive integration, then
Sqoop is your only choice.
Remember, of course, that the tools can also work together in more complex workflows.
Events could be gathered onto HDFS by Flume, processed through MapReduce,
and then exported into a relational database by Sqoop.
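As a reminder of what the last step of such a workflow might look like, a Sqoop export of Flume-collected and MapReduce-processed data could be as simple as the following; the database, table, and directory names are placeholders:

$ sqoop export --connect jdbc:mysql://localhost/logdb --username hadoopuser -P \
--table processed_events --export-dir /processed/2012-08-19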
Time for action – multi-level Flume networks
Let's put together a few pieces we touched on earlier and see how one Flume agent can use
another as its sink.
1. Create the following file as agent6.conf:
agent6.sources = avrosource
agent6.sinks = avrosink
agent6.channels = memorychannel
agent6.sources.avrosource.type = avro
agent6.sources.avrosource.bind = localhost
agent6.sources.avrosource.port = 2000
agent6.sources.avrosource.threads = 5
agent6.sinks.avrosink.type = avro
agent6.sinks.avrosink.hostname = localhost
agent6.sinks.avrosink.port = 4000
agent6.channels.memorychannel.type = memory
agent6.channels.memorychannel.capacity = 1000
agent6.channels.memorychannel.transactionCapacity = 100
agent6.sources.avrosource.channels = memorychannel
agent6.sinks.avrosink.channel = memorychannel
2. Start an agent configured as per the agent3.conf file created earlier, that is, with
an Avro source and a file sink:
$ flume-ng agent --conf conf --conf-file agent3.conf --name agent3
3. In a second window, start another agent; this one configured with the preceding file:
$ flume-ng agent --conf conf --conf-file agent6.conf --name agent6
4. In a third window, use the Avro client to send a file to each of the Flume agents:
$ flume-ng avro-client -H localhost -p 4000 -F /home/hadoop/message
$ flume-ng avro-client -H localhost -p 2000 -F /home/hadoop/message2
5. Check the output directory for files and examine the files present:
What just happened?
Firstly, we dened a new agent with an Avro source and also an Avro sink. We've not used
this sink before; instead of wring the events to a local locaon or HDFS, this sink sends the
events to a remote source using Avro.
We start an instance of this new agent and then also start an instance of agent3 we used
earlier. Recall this agent has an Avro source and a le roll sink. We congure the Avro sink in
the rst agent to point to the host and port of the Avro sink in the second and by doing so,
build a data-roung chain.
With both agents running, we then use the Avro client to send a file to each and confirm that
both appear in the file location configured as the destination for the agent3 sink.
This isn't just technical capability for its own sake. This capability is the building block that
allows Flume to build arbitrarily complex distributed event collection networks. Instead of
one copy of each agent, think of multiple agents of each type feeding events into the next
link in the chain, which acts as an event aggregation point.
Time for action – writing to multiple sinks
We need one nal piece of capability to build such networks, namely, an agent that can write
to mulple sinks. Let's create one.
1. Create the following conguraon le as agent7.conf:
agent7.sources = netsource
agent7.sinks = hdfssink filesink
agent7.channels = memorychannel1 memorychannel2
agent7.sources.netsource.type = netcat
agent7.sources.netsource.bind = localhost
agent7.sources.netsource.port = 3000
agent7.sources.netsource.interceptors = ts
agent7.sources.netsource.interceptors.ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
agent7.sinks.hdfssink.type = hdfs
agent7.sinks.hdfssink.hdfs.path = /flume-%Y-%m-%d
agent7.sinks.hdfssink.hdfs.filePrefix = log
agent7.sinks.hdfssink.hdfs.rollInterval = 0
agent7.sinks.hdfssink.hdfs.rollCount = 3
agent7.sinks.hdfssink.hdfs.fileType = DataStream
agent7.sinks.filesink.type = FILE_ROLL
agent7.sinks.filesink.sink.directory = /home/hadoop/flume/files
agent7.sinks.filesink.sink.rollInterval = 0
agent7.channels.memorychannel1.type = memory
agent7.channels.memorychannel1.capacity = 1000
agent7.channels.memorychannel1.transactionCapacity = 100
agent7.channels.memorychannel2.type = memory
www.it-ebooks.info
Chapter 10
[ 341 ]
agent7.channels.memorychannel2.capacity = 1000
agent7.channels.memorychannel2.transactionCapacity = 100
agent7.sources.netsource.channels = memorychannel1 memorychannel2
agent7.sinks.hdfssink.channel = memorychannel1
agent7.sinks.filesink.channel = memorychannel2
agent7.sources.netsource.selector.type = replicating
2. Start the agent:
$ flume-ng agent --conf conf --conf-file agent7.conf --name agent7
3. Open a telnet session and send an event to Flume:
$ curl telnet://localhost:3000
Replicating!
4. Check the contents of the HDFS and file sinks:
$ cat files/*
$ hadoop fs -cat "/flume-*/*"
The output of the preceding commands is shown in the following screenshot:
What just happened?
We created a conguraon le containing a single Netcat source and both the le and HDFS
sink. We congured separate memory channels connecng the source to both sinks.
We then set the source selector type to replicating, which means events will be sent to
all congured channels.
Aer starng the agent as normal and sending an event to the source, we conrmed that
the event was indeed wrien to both the lesystem and HDFS sinks.
Selectors: replicating and multiplexing
The source selector has two modes: replicating, as we have seen here, and multiplexing.
A multiplexing source selector uses logic to determine the channel to which an event
should be sent, depending on the value of a specified header field.
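Configuring a multiplexing selector is similar to the replicating case we just used. A sketch follows, reusing the agent7 component names; the header name and its values are purely illustrative and would normally be set by an interceptor earlier in the flow:

agent7.sources.netsource.selector.type = multiplexing
agent7.sources.netsource.selector.header = eventtype
agent7.sources.netsource.selector.mapping.important = memorychannel1
agent7.sources.netsource.selector.mapping.routine = memorychannel2
agent7.sources.netsource.selector.default = memorychannel2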
Handling sink failure
By their nature as output destinations, it is to be expected that sinks may fail or
become unresponsive over time. As with any input/output device, a sink may be saturated,
run out of space, or go offline.
Just as Flume associates selectors with sources to allow the replication and multiplexing
behavior we have just seen, it also supports the concept of sink processors.
There are two defined sink processors, namely the failover sink processor and the load
balancing sink processor.
The sink processors view the sinks as being within a group and, depending on their type, react
differently when an event arrives. The load balancing sink processor sends events to the sinks one
at a time, using either a round-robin or a random algorithm to select which sink to use next. If a
sink fails, the event is retried on another sink, but the failed sink remains in the pool.
The failover sink processor, in contrast, views the sinks as a prioritized list and only tries a lower-priority
sink if the ones above it have failed. Failed sinks are removed from the list and are only
retried after a cooling-off period that increases with subsequent failures.
Have a go hero - Handling sink failure
Set up a Flume configuration that has three configured HDFS sinks, each writing to a different
location on HDFS. Use the load balancing sink processor to confirm events are written to
each sink, and then use the failover sink processor to show the prioritization.
Can you force the agent to select a sink other than the highest-priority one?
Next, the world
We have now covered most of the key features of Flume. As mentioned earlier, Flume is a
framework, and this should be considered carefully; Flume has much more flexibility in its
deployment model than any other product we have looked at.
It achieves its flexibility through a relatively small set of capabilities: the linking of sources
to sinks via channels, and multiple variations that allow multi-agent or multi-channel
configurations. This may not seem like much, but consider that these building blocks can be
composed to create a system such as the following, where multiple web server farms feed
their logs into a central Hadoop cluster:
Each node in each farm runs an agent pulling each local log file in turn.
These log files are sent to a highly available aggregation point, one within each farm,
which also performs some processing and adds additional metadata to the events,
categorizing them into three types of records.
These first-level aggregators then send the events to one of a series of agents that
access the Hadoop cluster. The aggregators offer multiple access points; event
types 1 and 2 are sent to the first, and event type 3 to the second.
Within the final aggregators, events of types 1 and 2 are written to different locations on
HDFS, with type 2 also being written to a local filesystem. Event type 3 is written
directly into HBase.
It is amazing how simple primitives can be composed to build complex systems like this!
Have a go hero - Next, the world
As a thought experiment, try to work through the preceding scenario and determine what
sort of Flume setup would be required at each step in the flow.
The bigger picture
It's important to realize that "simply" getting data from one point to another is rarely the
extent of your data considerations. Terms such as data lifecycle management have become
widely used recently, and for good reason. Let's briefly look at some things to consider, ideally
before you have data flooding across the system.
Data lifecycle
The main question to ask in terms of the data lifecycle is for how long the value you
gain from storing the data will remain greater than the cost of storing it. Keeping data forever
may seem attractive, but the costs of holding more and more data will increase over time. These
costs are not just financial; many systems see their performance degrade as volumes increase.
This queson isn't—or at least rarely should be—decided by technical factors. Instead,
you need the value and costs to the business to be the driving factors. Somemes data
becomes worthless very quickly, other mes the business cannot delete it for either
compeve or legal reasons. Determine the posion and act accordingly.
Remember of course that it is not a binary decision between keeping or deleng data;
you can also migrate data across ers of storage that decrease in cost and performance
as they age.
Staging data
On the other side of the process, it is often worthwhile to think about how data is fed into
processing platforms such as MapReduce. With multiple data sources, the last thing you
often want is to have all the data arrive on a single massive volume.
As we saw earlier, Flume's ability to parameterize the location to which it writes on HDFS is a
great help with this problem. However, it is often useful to view this initial drop-off point as
a temporary staging area to which data is written prior to processing. After it is processed, the
data may be moved into the long-term directory structure.
Scheduling
At many points in the flows, we have noted an implicit need for an external
task to do something. As mentioned before, we want MapReduce to process files once they
are written to HDFS by Flume, but how is that task scheduled? Alternatively, how do we
manage the post-processing, the archival or deletion of old data, or even the removal of log
files on the source hosts?
Some of these tasks, such as the latter, are likely managed by existing systems such as
logrotate on Linux, but the others may be things you need to build. Obvious tools such as
cron may be good enough but, as system complexity increases, you may need to investigate
more sophisticated scheduling systems. We will briefly mention one such system with tight
Hadoop integration in the next chapter.
Summary
This chapter discussed the problem of how to retrieve data from across the network
and make it available for processing in Hadoop. As we saw, this is actually a more general
challenge and, though we may use Hadoop-specific tools such as Flume, the principles are
not unique. In particular, we covered an overview of the types of data we may want to write
to Hadoop, generally categorizing it as network or file data. We explored some approaches
for such retrieval using existing command-line tools. Though functional, the approaches
lacked sophistication and did not suit extension into more complex scenarios.
We looked at Flume as a flexible framework for defining and managing the routing and
delivery of data (particularly from log files), and learned the Flume architecture, which sees
data arrive at sources, pass through channels, and then be written to sinks.
We then explored many of Flume's capabilities, such as how to use the different
types of sources, sinks, and channels. We saw how the simple building blocks could
be composed into very complex systems, and we closed with some more general
thoughts on data management.
This concludes the main content of this book. In the next chapter, we will sketch out a
number of other projects that may be of interest and highlight some ways of engaging
with the community and getting support.
11
Where to Go Next
This book has, as the title suggests, sought to give a beginner to Hadoop
in-depth knowledge of the technology and its application. As has been seen
on several occasions, there is a lot more to the Hadoop ecosystem than the
core product itself. We will give a quick highlight of some potential areas of
interest in this chapter.
In this chapter we will discuss:
What we covered in this book
What we didn't cover in this book
Upcoming Hadoop changes
Alternative Hadoop distributions
Other significant Apache projects
Alternative programming abstractions
Sources of information and help
What we did and didn't cover in this book
With our focus on beginners, the aim of this book was to give you a strong grounding in the
core Hadoop concepts and tools. In addition, we provided experience with some other tools
that help you integrate the technology into your infrastructure.
Though Hadoop started as the single core product, it's fair to say that the ecosystem
surrounding Hadoop has exploded in recent years. There are alternative distributions
of the technology, some providing commercial custom extensions. There are a plethora
of related projects and tools that build upon Hadoop and provide specific functionality
or alternative approaches to existing ideas. It's a really exciting time to get involved with
Hadoop; let's take a quick tour of what is out there.
Note, of course, that any overview of the ecosystem is both skewed by the
author's interests and preferences and outdated the moment it is written.
In other words, don't for a moment think this is all that's available; consider
it a whetting of the appetite.
Upcoming Hadoop changes
Before discussing alternative Hadoop distributions, let's look at some changes to Hadoop
itself in the near future. We've already discussed the HDFS changes coming in Hadoop 2.0,
particularly the high availability of the NameNode enabled by the new BackupNameNode and
CheckpointNameNode services. This is a significant capability for Hadoop, as it will make
HDFS much more robust, greatly enhancing its enterprise credentials and streamlining
cluster operations. The impact of NameNode HA is hard to exaggerate; it will almost
certainly become one of those capabilities that, in a few years' time, no one will be able
to remember how we lived without.
MapReduce is not standing still while all this is going on and, in fact, the changes
being introduced may not have as much immediate impact but are actually much
more fundamental.
These changes were initially developed under the name MapReduce 2.0 or MRV2.
However, the name now being used is YARN (Yet Another Resource Negotiator), which is
more accurate, as the changes are much more about Hadoop as a platform than about MapReduce
itself. The goal of YARN is to build a framework on Hadoop that allows cluster resources to be
allocated to given applications, with MapReduce being only one of these applications.
If you consider the JobTracker today, it is responsible for two quite different tasks:
identifying which cluster resources are available at any point in time and allocating them
to the various stages of a job, and managing the progress of a given MapReduce job.
YARN separates these into distinct roles: a global ResourceManager
that uses NodeManagers on each host to manage the cluster's resources, and a per-application
ApplicationMaster (the first example of which is MapReduce) that communicates with the
ResourceManager to get the resources it needs for its job.
The MapReduce interface in YARN will be unchanged, so, from a client perspective, all existing
code will still run on the new platform. But as new ApplicationMasters are developed, we
will start to see Hadoop being used more as a generic task-processing platform with multiple
types of processing models supported. Early examples of other models being ported to YARN
are stream-based processing and a port of the Message Passing Interface (MPI), which is
broadly used in scientific computing.
Alternative distributions
Way back in Chapter 2, Getting Up and Running, we went to the Hadoop homepage, from
which we downloaded the installation package. Odd as it may seem, this is far from the only
way to get Hadoop. Odder still may be the fact that most production deployments don't use
the Apache Hadoop distribution.
Why alternative distributions?
Hadoop is open source software. Anyone can, providing they comply with the Apache
Software License that governs Hadoop, make their own release of the software. There
are two main reasons alternative distributions have been created.
Bundling
Some providers seek to build a pre-bundled distribution containing not only Hadoop but
also other projects, such as Hive, HBase, Pig, and many more. Though installation of most
projects is rarely difficult—with the exception of HBase, which has historically been more
difficult to set up by hand—there can be subtle version incompatibilities that don't arise
until a particular production workload hits the system. A bundled release can provide a
pre-integrated set of compatible versions that are known to work together.
The bundled release can also provide the distribution not only as a tarball file but also in
packages that are easily installed through package managers such as RPM, Yum, or APT.
Free and commercial extensions
Because Hadoop is an open source project with a relatively liberal distribution license, creators
are also free to enhance it with proprietary extensions that are made available either as free
open source or as commercial products.
This can be a controversial issue, as some open source advocates dislike any
commercialization of successful open source projects; to them, it appears that the
commercial entity is freeloading by taking the fruits of the open source community without
having to build it for themselves. Others see this as a healthy aspect of the flexible Apache
license; the base product will always be free, and individuals and companies can choose to
go with commercial extensions or not. We do not pass judgment either way, but be aware
that this is a controversy you will almost certainly encounter.
Given the reasons for the existence of alternative distributions, let's look at a few popular
examples.
Cloudera Distribution for Hadoop
The most widely used Hadoop distribution is the Cloudera Distribution for Hadoop,
referred to as CDH. Recall that Cloudera is the company that first created Sqoop and
contributed it back to the open source community, and is where Doug Cutting now works.
The Cloudera distribution is available at http://www.cloudera.com/hadoop and
contains a large number of Apache products, from Hadoop itself, Hive, Pig, and HBase
through tools such as Sqoop and Flume, to other lesser-known products such as Mahout
and Whirr. We'll talk about some of these later.
CDH is available in several package formats and deploys the software in a ready-to-go
fashion. The base Hadoop product, for example, is separated into different packages
for components such as the NameNode, the TaskTracker, and so on, and for each there is
integration with the standard Linux service infrastructure.
CDH was the first widely available alternative distribution, and its breadth of available
software, proven level of quality, and free cost have made it a very popular choice.
Cloudera also offers additional commercial-only products, such as a Hadoop
management tool, in addition to training, support, and consultancy services. Details
are available on the company webpage.
Hortonworks Data Platform
In 2011, the Yahoo division responsible for so much of the development of Hadoop was
spun off into a new company called Hortonworks. They have also produced their own
pre-integrated Hadoop distribution, called the Hortonworks Data Platform (HDP),
available at http://hortonworks.com/products/hortonworksdataplatform/.
HDP is conceptually similar to CDH, but the two products differ in their focus.
Hortonworks makes much of the fact that HDP is fully open source, including the
management tool. It has also positioned HDP as a key integration platform through
support for tools such as Talend Open Studio. Hortonworks does not offer commercial
software; its business model focuses instead on offering professional services and support
for the platform.
Both Cloudera and Hortonworks are venture-backed companies with significant engineering
expertise; both companies employ many of the most prolific contributors to Hadoop. The
underlying technology is, however, the same set of Apache projects; the differences lie in how
they are packaged, the versions employed, and the additional value-added offerings provided
by the companies.
MapR
A dierent type of distribuon is oered by MapR Technologies, though the company and
distribuon are usually referred to simply as MapR. Available at http://www.mapr.com,
the distribuon is based on Hadoop but has added a number of changes and enhancements.
One main MapR focus is on performance and availability, for example, it was the rst
distribuon to oer a high-availability soluon for the Hadoop NameNode and JobTracker,
which you will remember (from Chapter 7, Keeping Things Running) is a signicant weakness
in core Hadoop. It also oers nave integraon with NFS le systems, which makes
processing of exisng data much easier; MapR replaced HDFS with a full POSIX-compliant
lesystem that can easily be mounted remotely.
MapR provides both a community and enterprise edion of its distribuon; not all the
extensions are available in the free product. The company also oers support services
as part of the enterprise product subscripon, in addion to training and consultancy.
IBM InfoSphere Big Insights
The last distribuon we'll menon here comes from IBM. The IBM InfoSphere Big Insights
distribuon is available at http://www-01.ibm.com/software/data/infosphere/
biginsights/ and (like MapR) oers commercial improvements and extensions to the
open source Hadoop core.
Big Insights comes in two versions, the free IBM InfoSphere Big Insights Basic Edion and the
commercial IBM InfoSphere Big Insights Enterprise Edion. Big Insights, big names! The basic
edion is an enhanced set of Apache Hadoop products, adding some free management and
deployment tools as well as integraon with other IBM products.
The Enterprise Edion is actually quite dierent from the Basic Edion; it is more of a layer
atop Hadoop, and in fact, can be used with other distribuons such as CDH or HDP. The
Enterprise Edion provides an array of data visualizaon, business analysis, and processing
tools. It also has deep integraon with other IBM products such as InfoSphere Streams, DB2,
and GPFS.
Choosing a distribution
As can be seen, the available distributions (and we didn't cover them all) range from convenience packaging and integration of fully open source products through to entire bespoke integration and analysis layers atop them. There is no overall best distribution; think carefully about your needs and consider the alternatives. Since all the previous distributions offer a free download of at least a basic version, it's also worthwhile to simply try them out and experience the options for yourself.
Other Apache projects
Whether you use a bundled distribution or stick with the base Apache Hadoop download, you will encounter many references to other, related Apache projects. We have covered Hive, Sqoop, and Flume in this book; we'll now highlight some of the others.
Note that this coverage seeks to point out the highlights (from my perspective) as well as give a taste of the wide range of projects available. As before, keep an eye out; new ones will be launching all the time.
HBase
Perhaps the most popular Apache Hadoop-related project that we didn't cover in this book is HBase; its homepage is at http://hbase.apache.org. Based on the BigTable model of data storage publicized by Google in an academic paper (sound familiar?), HBase is a non-relational data store sitting atop HDFS.
Whereas both MapReduce and Hive tasks focus on batch-like data access patterns, HBase instead seeks to provide very low-latency access to data. Consequently, HBase can, unlike the previously mentioned technologies, directly support user-facing services.
The HBase data model is not the relational approach we saw used in Hive and in all RDBMSs. Instead, it is a key-value, schemaless solution that takes a column-oriented view of data; columns can be added at runtime and depend on the values inserted into HBase. Each lookup operation is then very fast, as it is effectively a key-value mapping from the row key to the desired column. HBase also treats timestamps as another dimension of the data, so one can directly retrieve data from a point in time.
The data model is very powerful but does not suit all use cases, just as the relational model isn't universally applicable. But if you need structured low-latency views on large-scale data stored in Hadoop, HBase is absolutely something you should look at.
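To give a flavor of the programming model, the following is a minimal sketch using the HBase Java client API; the sightings table, its details column family, and the row key are purely illustrative, and the table is assumed to have been created beforehand (for example, from the HBase shell).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SightingStore {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath to locate the cluster
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "sightings");

            // Write one cell: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("1995-06-30/chicago"));
            put.add(Bytes.toBytes("details"), Bytes.toBytes("shape"), Bytes.toBytes("disk"));
            table.put(put);

            // Read it back; this is a direct lookup by row key, hence the low latency
            Get get = new Get(Bytes.toBytes("1995-06-30/chicago"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("details"), Bytes.toBytes("shape"))));

            table.close();
        }
    }

Note that there is no schema declaring the shape column anywhere; the column springs into existence when the value is written.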
Oozie
We have said many times that Hadoop clusters do not live in a vacuum and need to integrate with other systems and into broader workflows. Oozie, available at http://oozie.apache.org, is a Hadoop-focused workflow scheduler that addresses this latter scenario.
In its simplest form, Oozie provides mechanisms to schedule the execution of MapReduce jobs based either on time (for example, run this every hour) or on data availability (for example, run this when new data arrives in this location). It allows the specification of multi-stage workflows that can describe a complete end-to-end process.
In addion to straight-forward MapReduce jobs, Oozie can also schedule jobs that run
Hive or Pig commands as well as tasks enrely outside of Hadoop (such as sending emails,
running shell scripts, or running commands on remote hosts).
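To give a feel for how such workflows are described, here is a minimal sketch of an Oozie workflow definition containing a single MapReduce action; the workflow name, classes, and parameters are illustrative only, and a real deployment would also need a job.properties file supplying values such as ${jobTracker} and ${nameNode}, with the workflow directory uploaded to HDFS.

    <workflow-app xmlns="uri:oozie:workflow:0.2" name="hourly-summary">
        <start to="summarize"/>
        <action name="summarize">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.mapper.class</name>
                        <value>com.example.SummaryMapper</value>
                    </property>
                    <property>
                        <name>mapred.reducer.class</name>
                        <value>com.example.SummaryReducer</value>
                    </property>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>${inputDir}</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>${outputDir}</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Summary job failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>

A coordinator definition then wraps a workflow such as this with the time-based or data-availability trigger described earlier.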
There are many other ways of building workflows; a common approach is to use Extract, Transform, and Load (ETL) tools such as Pentaho Kettle (http://kettle.pentaho.com) and Spring Batch (http://static.springsource.org/spring-batch). These particular tools do include some Hadoop integration, but traditional dedicated workflow engines may not. Consider Oozie if you are building workflows with significant Hadoop interaction and you do not have an existing workflow tool with which you must integrate.
Whirr
When looking to use cloud services such as Amazon AWS for Hadoop deployments, it is usually a lot easier to use a higher-level service such as Elastic MapReduce than to set up your own cluster on EC2. Though there are scripts to help, the fact is that the overhead of Hadoop deployments on cloud infrastructure can be considerable. That is where Apache Whirr, from http://whirr.apache.org, comes in.
Whirr is not focused on Hadoop; it is about supplier-independent instantiation of cloud services, of which Hadoop is a single example. Whirr provides a programmatic way of specifying and creating Hadoop-based deployments on cloud infrastructure that handles all the underlying service aspects for you. And it does this in a provider-independent fashion, so that once you've launched on, say, EC2, you can use the same code to create an identical setup on another provider such as Rackspace or Eucalyptus. This makes vendor lock-in (often a concern with cloud deployments) less of an issue.
Whirr is not quite there yet. Today it is limited in what services it can create and only supports a single provider, AWS. However, if you are interested in cloud deployment with less pain, it is worth watching its progress.
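As a rough sketch of the approach, based on the Whirr quick start, a cluster is typically described in a small properties file and driven from the command line; the cluster name, instance counts, and key locations below are placeholders.

    # hadoop.properties -- a minimal Whirr cluster definition
    whirr.cluster-name=testcluster
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,3 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

The cluster is then launched with bin/whirr launch-cluster --config hadoop.properties and torn down with bin/whirr destroy-cluster --config hadoop.properties; switching providers is, in principle, a matter of changing the provider, identity, and credential properties.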
Mahout
The previous projects are all general-purpose in that they provide a capability that is independent of any area of application. Apache Mahout, located at http://mahout.apache.org, is instead a library of machine learning algorithms built atop Hadoop and MapReduce.
The Hadoop processing model is often well suited to machine learning applications, where the goal is to extract value and meaning from a large dataset. Mahout provides implementations of such common ML techniques as clustering and recommenders.
If you have a lot of data and need help finding the key patterns, the relationships, or just the needles in the haystack, Mahout may be able to help.
MRUnit
The nal Apache Hadoop project we will menon also highlights the wide range of what
is available. To a large extent, it does not maer how many cool technologies you use and
which distribuon if your MapReduce jobs frequently fail due to latent bugs. The recently
promoted MRUnit from http://mrunit.apache.org can help here.
Developing MapReduce jobs can be dicult, especially in the early days, but tesng and
debugging them is almost always hard. MRUnit takes the unit test model of its namesakes
such as JUnit and DBUnit and provides a framework to help write and execute tests that
can help improve the quality of your code. Build up a test suite, integrate with automated
test, and build tools, and suddenly, all those soware engineering best pracces that you
wouldn't dream of not following when wring non-MapReduce code are available here also.
MRUnit may be of interest, well, if you ever write any MapReduce jobs. In my humble
opinion, it's a really important project; please check it out.
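To give a taste of the model, the following is a minimal sketch of an MRUnit test for a WordCount-style mapper using the new-API MapDriver; WordCountMapper stands in for whatever mapper you are testing, and the expected outputs are illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Before;
    import org.junit.Test;

    public class WordCountMapperTest {
        private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

        @Before
        public void setUp() {
            // WordCountMapper is the (hypothetical) class under test
            mapDriver = MapDriver.newMapDriver(new WordCountMapper());
        }

        @Test
        public void emitsOneCountPerWord() throws Exception {
            // Feed one input record and assert the exact key/value pairs emitted
            mapDriver.withInput(new LongWritable(0), new Text("hello hadoop hello"))
                     .withOutput(new Text("hello"), new IntWritable(1))
                     .withOutput(new Text("hadoop"), new IntWritable(1))
                     .withOutput(new Text("hello"), new IntWritable(1))
                     .runTest();
        }
    }

Equivalent ReduceDriver and MapReduceDriver classes allow a reducer to be tested in isolation, or a mapper and reducer to be tested together.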
Other programming abstractions
Hadoop is not just extended by additional functionality; there are also tools that provide entirely different paradigms for writing the code used to process your data within Hadoop.
Pig
We menoned Pig (http://pig.apache.org) in Chapter 8, A Relaonal View on Data
with Hive, and won't say much else about it here. Just remember that it is available and
may be useful if you have processes or people for whom a data ow denion of the
Hadoop processes is a more intuive or beer t than wring raw MapReduce code or
HiveQL scripts. Remember that the major dierence is that Pig is an imperave language
(it denes how the process will be executed), while Hive is more declarave (denes the
desired results but not how they will be produced).
Cascading
Cascading is not an Apache project but is open source and is available at http://www.cascading.org. While Hive and Pig effectively define different languages with which to express data processing, Cascading provides a set of higher-level programmatic abstractions. Instead of thinking in terms of how multiple MapReduce jobs may process and share data, with Cascading the model is a data flow built from pipes, joins, taps, and similar constructs. These are assembled programmatically (the core API was originally Java, but there are numerous other language bindings), and Cascading manages the translation, deployment, and execution of the workflow on the cluster.
If you want a higher-level interface to MapReduce and the language-based styles of Pig and Hive don't suit, the programmatic model of Cascading may be what you are looking for.
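For a flavor of the model, here is a minimal sketch, assuming the Cascading 2.x API, of a flow that simply copies text files from one location to another; the paths are placeholders, and a real application would insert functions, group-bys, and joins into the pipe assembly.

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class CopyFlow {
        public static void main(String[] args) {
            String inPath = args[0];   // for example, an HDFS directory of text files
            String outPath = args[1];

            Properties properties = new Properties();
            AppProps.setApplicationJarClass(properties, CopyFlow.class);

            // Taps are the sources and sinks; schemes describe the data format
            Tap inTap = new Hfs(new TextLine(), inPath);
            Tap outTap = new Hfs(new TextLine(), outPath);

            // A pipe assembly; here it is a single pass-through pipe
            Pipe copyPipe = new Pipe("copy");

            // Wire taps to pipes and let Cascading plan and run the MapReduce job(s)
            FlowDef flowDef = FlowDef.flowDef()
                .addSource(copyPipe, inTap)
                .addTailSink(copyPipe, outTap);
            new HadoopFlowConnector(properties).connect(flowDef).complete();
        }
    }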
AWS resources
Many Hadoop technologies can be deployed on AWS as part of a self-managed cluster. But just as Amazon offers Elastic MapReduce, which provides Hadoop as a managed service, there are a few other AWS services worth mentioning.
HBase on EMR
This isn't really a distinct service per se, but just as EMR has native support for Hive and Pig, it also now offers direct support for HBase clusters. This is a relatively new capability, and it will be interesting to see how well it works in practice; HBase has historically been quite sensitive to the quality of the network and to system load.
SimpleDB
Amazon SimpleDB (http://aws.amazon.com/simpledb) is a service offering an HBase-like data model. It isn't actually implemented atop Hadoop, but we'll mention it and the following service because they provide hosted alternatives worth considering if an HBase-like data model is of interest. The service has been around for several years and is very mature, with well-understood use cases.
SimpleDB does have some limitations, particularly on table size and the need to manually partition large datasets, but if you need an HBase-type store at smaller volumes, it may be a good fit. It's also easy to set up and can be a nice way of trying out the column-based data model.
DynamoDB
A more recent service from AWS is DynamoDB, available at http://aws.amazon.com/dynamodb. Though its data model is again very similar to that of SimpleDB and HBase, it is aimed at a very different type of application. Where SimpleDB has quite a rich search API but is very limited in terms of size, DynamoDB provides a more constrained API but with a service guarantee of near-unlimited scalability.
The DynamoDB pricing model is particularly interesting; instead of paying for a certain number of servers hosting the service, you allocate a certain read/write capacity and DynamoDB manages the resources required to meet the provisioned capacity. This is a notable development, as it is a purer service model in which the mechanism for delivering the desired performance is kept completely opaque to the service user. Look at DynamoDB if you need a much larger data store than SimpleDB can offer, but do consider the pricing model carefully, as provisioning too much capacity can become very expensive very quickly.
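To make the provisioned-capacity model concrete, the following sketch uses the AWS SDK for Java to create a table while explicitly requesting read and write capacity; the table name, key attribute, and capacity figures are illustrative, and those figures are exactly what drives your bill.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
    import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
    import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
    import com.amazonaws.services.dynamodbv2.model.KeyType;
    import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
    import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

    public class CreateSightingsTable {
        public static void main(String[] args) {
            // Access key and secret key passed on the command line for brevity
            AmazonDynamoDBClient client = new AmazonDynamoDBClient(
                    new BasicAWSCredentials(args[0], args[1]));

            // You specify capacity, not servers; AWS provisions whatever is needed behind the scenes
            CreateTableRequest request = new CreateTableRequest()
                .withTableName("sightings")
                .withKeySchema(new KeySchemaElement()
                    .withAttributeName("sightingId").withKeyType(KeyType.HASH))
                .withAttributeDefinitions(new AttributeDefinition()
                    .withAttributeName("sightingId").withAttributeType(ScalarAttributeType.S))
                .withProvisionedThroughput(new ProvisionedThroughput()
                    .withReadCapacityUnits(10L).withWriteCapacityUnits(5L));

            client.createTable(request);
        }
    }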
Sources of information
New technologies and tools are not all you need, no matter how cool they are. Sometimes, a little help from a more experienced source can pull you out of a hole. In this regard, you are well covered; the Hadoop community is extremely strong in many areas.
Source code
It's somemes easy to overlook, but Hadoop and all the other Apache projects are aer
all fully open source. The actual source code is the ulmate source (pardon the pun) of
informaon about how the system works. Becoming familiar with the source and tracing
through some of the funconality can be hugely informave, not to menon helpful, when
you hit unexpected behavior.
Mailing lists and forums
Almost all the projects and services listed earlier have their own mailing lists and/or forums; check out their homepages for the specific links. If you are using AWS, make sure to check out the AWS developer forums at https://forums.aws.amazon.com.
Always read the posting guidelines carefully and understand the expected etiquette. These are tremendous sources of information, and the lists and forums are frequently visited by the developers of the particular project. Expect to see the core Hadoop developers on the Hadoop lists, Hive developers on the Hive list, EMR developers on the EMR forums, and so on.
LinkedIn groups
There are a number of Hadoop and related groups on the professional social network, LinkedIn. Do a search for your particular areas of interest, but a good starting point may be the general Hadoop Users group at http://www.linkedin.com/groups/Hadoop-Users-988957.
HUGs
If you want more face-to-face interaction, look for a Hadoop User Group (HUG) in your area; most should be listed at http://wiki.apache.org/hadoop/HadoopUserGroups. These tend to arrange semi-regular get-togethers that combine things such as quality presentations, the ability to discuss technology with like-minded individuals, and often pizza and drinks.
No HUG near where you live? Consider starting one!
Conferences
Though it's a relavely new technology, Hadoop already has some signicant
conference acon involving the open source, academic, and commercial worlds.
Events such as the Hadoop Summit are prey big; it and and other events are
linked to via http://wiki.apache.org/hadoop/Conferences.
Summary
In this chapter, we took a quick gallop around the broader Hadoop ecosystem. We looked at the upcoming changes in Hadoop, particularly HDFS high availability and YARN; why alternative Hadoop distributions exist, along with some of the more popular ones; and other Apache projects that provide capabilities, extensions, or Hadoop-supporting tools.
We also looked at alternative ways of writing or creating Hadoop jobs, at sources of information, and at how to connect with other enthusiasts.
Now go have fun and build something amazing!
Pop Quiz Answers
Chapter 3, Understanding MapReduce
Pop quiz – key/value pairs
Q1 2
Q2 3
Pop quiz – walking through a run of WordCount
Q1 1
Q2 3
Q3 2. Reducer C cannot be used because if such reduction were to
occur, the final reducer could receive from the combiner a series
of means with no knowledge of how many items were used to
generate them, meaning the overall mean is impossible to calculate.
Reducer D is subtle as the individual tasks of selecting a maximum
or minimum are safe for use as combiner operations. But if the goal
is to determine the overall variance between the maximum and
minimum value for each key, this would not work. If the combiner
that received the maximum key had values clustered around it, this
would generate small results; similarly for the one receiving the
minimum value. These subranges have little value in isolation and
again the final reducer cannot construct the desired result.
Chapter 7, Keeping Things Running
Pop quiz – setting up a cluster
Q1 5. Though some general guidelines are possible and you may need to
generalize whether your cluster will be running a variety of jobs, the best
fit depends on the anticipated workload.
Q2 4. Network storage comes in many flavors, but in many cases you may find a large Hadoop cluster of hundreds of hosts reliant on a single storage device (or, more usually, a pair). This adds a new failure scenario to the cluster, and one more likely to occur than many others. Where storage technology does look to address failure mitigation,
it is usually through disk-level redundancy. These disk arrays can be
highly performant but will usually have a penalty on either reads or
writes. Giving Hadoop control of its own failure handling and allowing
it full parallel access to the same number of disks is likely to give higher
overall performance.
Q3 3. Probably! We would suggest avoiding the first configuration as,
though it has just enough raw storage and is far from underpowered,
there is a good chance the setup will provide little room for growth.
An increase in data volumes would immediately require new hosts, and additional complexity in the MapReduce job could require additional processor power or memory.
Configurations B and C both look good as they have surplus storage for growth and provide similar headroom for both processor and memory.
B will have the higher disk I/O and C the better CPU performance.
Since the primary job is involved in financial modelling and forecasting,
we expect each task to be reasonably heavyweight in terms of CPU
and memory needs. Configuration B may have higher I/O but if the
processors are running at 100 percent utilization it is likely the extra disk
throughput will not be used. So the hosts with greater processor power
are likely the better fit.
Configuration D is more than adequate for the task and we don’t choose
it for that very reason; why buy more capacity than we know we need?
Index
Symbols
0.20 MapReduce Java API
about 61
driver class 63, 64
Mapper class 61, 62
Reducer class 62, 63
A
AccountRecordMapper class 133
add jar command 267
advanced techniques, MapReduce
about 127
graph algorithms 137
joins 128
language-independent data structures, using
151
agent
about 323
wring, to mulple sinks 340-342
alternave distribuons
about 349
bundling 349
Cloudera Distribuon 350
free and commercial extensions 349
Hortonworks Data Plaorm 350
IBM InfoSphere Big Insights 351
MapR 351
reasons 349
selecng 351
Amazon Web Services. See AWS
Amazon Web Services account. See AWS ac-
count
Apache projects
HBase 352
Mahout 353
MRUnit 354
Oozie 352
Whir 353
Apache Soware Foundaon 289
ApplicaonManager 348
array wrapper classes
about 85
ArrayWritable 85
TwoDArrayWritable 85
aternave schedulers, MapReduce management
Capacity Scheduler 233
enabling 234
Fair Scheduler 234
using 234
Avro
about 152, 330
advantages 154
Avro data, consuming with Java 156, 157
downloading 152, 153
features 165
graphs 165
installing 153
schema, dening 154
schemas 154
seng up 153
source Avro data, creang with Ruby 155, 156
URL 152
using, within MapReduce 158
Avro client 329
Avro code 153
Avro data
consuming, with Java 156, 157
creang, with Ruby 155, 156
AvroJob 158
AvroKey 158
AvroMapper 158
Avro-mapred JAR les 153
AvroReducer 158
AvroValue 158
Avro, within MapReduce
output data, examining with Java 163, 165
output data, examining with Ruby 163
shape summaries, generang 158-162
AWS
about 22, 315
consideraons 313
Elasc Compute Cloud (EC2) 22
Elasc MapReduce (EMR) 22
Simple Storage Service (S3) 22
AWS account
creang 45
management console 46
needed services, signing up 45
AWS credenals
about 54
access key 54
account ID 54
key pairs 54
secret access key 54
AWS developer forums
URL 356
AWS ecosystem
about 55
URL 55
AWS management console
URL 270, 273-276
used, for WordCount on EMR 46-51
AWS resources
about 355
DynamoDB 355
HBase on EMR 355
SimpleDB 355
B
BackupNameNode 348
base HDFS directory
changing 34
big data processing
about 8
aspects 8
dierent approach 11-14
historical trends 9
Bloom lter 136
breadth-rst search (BFS) 138
C
candidate technologies
about 152
Avro 152
Protocol Buers 152
Thri 152
capacity
adding, to EMR job ow 235
adding, to local Hadoop cluster 235
Capacity Scheduler 233
capacityScheduler directory 234
Cascading
about 354
URL 354
CDH 350
ChainMapper class
using 108, 109
channels 330, 331
CheckpointNameNode 348
C++ interface
using 94
city() funcon 268
classic data processing systems
about 9
scale-out approach 10
scale-up 9, 10
Cloud compung, with AWS
about 20
third approach 20
types of cost 21
Cloudera
about 289
URL 289
Cloudera Distribuon
about 350
URL 350
Cloudera Distribuon for Hadoop. See CDH
cluster access control
about 220
Hadoop security model 220
cluster masters, killing
BackupNode 191
blocks 188
CheckpointNode 191
DataNode start-up 189
les 188
lesystem 188
fsimage 189
JobTracker, killing 184, 185
JobTracker, moving 186
NameNode failure 190
NameNode HA 191
NameNode process 188
NameNode process, killing 186, 187
nodes 189
replacement JobTracker, starng 185
replacement NameNode, starng 188
safe mode 190
SecondaryNameNode 190
column-oriented databases 136
combiner class
about 80
adding, to WordCount 80, 81
features 80
command line job management 231
command output
capturing, to at le 326, 327
commodity hardware 219
commodity versus enterprise class storage 214
common architecture, Hadoop
about 19
advantages 19
disadvantages 20
CompressedWritable wrapper class 88
conferences
about 357
URL 357
conguraon les, Flume 331, 332
conguraon, Flume 320, 321
conguraon, MySQL
for remote connecons 285
conguraon, Sqoop 289, 290
consideraons, AWS 313
correlated failures 192
counters
adding 117
CPU / memory / storage rao, Hadoop cluster
211, 212
CREATE DATABASE statement 284
CREATE FUNCTION command 268
CREATE TABLE command 243
curl ulity 316, 317, 344
D
data
about 316
copying, from web server into HDFS 316, 317
exporng, from MySQL into Hive 295-297
exporng, from MySQL to HDFS 291-293
geng, into Hadoop 287
geng, out of Hadoop 303
hidden issues 318, 319
imporng, from Hadoop into MySQL 304-306
imporng, from raw query 300, 301
imporng, into Hive 294
lifecycle 343
scheduling 344
staging 344
types 316
wring, from within reducer 303
database
accessing, from mapper 288
data import
improving, type mapping used 299, 300
data input/output formats
about 88
les 89
Hadoop-provided input formats 90
Hadoop-provided OutputFormats 91
Hadoop-provided record readers 90
InputFormat 89
OutputFormats 91
RecordReaders 89
records 89
RecordWriters 91
Sequence les 91
splits 89
DataJoinMapperBase class 134
data lifecycle management 343
DataNode 211
data paths 279
dataset analysis
Java shape and locaon analysis 107
UFO sighng dataset 98
datatype issues 298
data, types
le data 316
network trac 316
datatypes, HiveQL
Boolean types 243
Floang point types 243
Integer types 243
Textual types 243
datum 157
default properes
about 206
browsing 206, 207
default security, Hadoop security model
demonstrang 220-222
default storage locaon, Hadoop conguraon
properes 208
depth-rst search (DFS) 138
DESCRIBE TABLE command 243
descripon property element 208
dfs.data.dir property 230
dfs.default.name variable 33
dfs.name.dir property 230
dfs.replicaon variable 34
dierent approach, big data processing 11
dirty data, Hive tables
handling 257
query output, exporng 258, 259
Distributed Cache
used, for improving Java locaon data
output 114-116
driver class, 0.20 MapReduce Java API 63, 64
dual approach 23
DynamoDB
about 278, 355
URL 278, 355
E
EC2 314
edges 138
Elasc Compute Cloud (EC2)
about 22, 45
URL 22
Elasc MapReduce (EMR)
about 22, 45, 206, 313, 314
as, prototyping plaorm 212
benets 206
URL 22
using 45
employee database
seng up 286, 287
employee table
exporng, into HDFS 288
EMR command-line tools 54, 55
EMR Hadoop
versus, local Hadoop 55
EMR job ow
capacity, adding 235
expanding 235
Enterprise Applicaon Integraon (EAI) 319
ETL tools
about 353
Pentaho Kele 353
Spring Batch 353
evaluate methods 267
events 332
exec 330
export command 310
Extract Transform and Load. See ETL tools
F
failover sink processor 342
failure types, Hadoop
about 168
cluster masters, killing 184
Hadoop node failures 168
Fair Scheduler 234
fairScheduler directory 234
features, Sqoop
code generator 313
incremental merge 312
paral exports, avoiding 312
le channel 331
le data 316
FileInputFormat 90
FileOutputFormat 91
le_roll sink 327
les
geng, into Hadoop 318
versus logs 327
nal property element 208
First In, First Out (FIFO) queue 231
at le
command output, capturing to 326, 327
Flume
about 319, 337, 350
channels 330, 331
conguraon les 331, 332
conguring 320, 321
features 343
installing 320, 321
logging, into console 324, 325
network data, wring to log les 326, 327
sink failure, handling 342
sinks 330
source 330
mestamps, adding 335-337
URL 319
used, for capturing network data 321-323
versioning 319
Flume NG 319
Flume OG 319
ume.root.logger variable 325
FLUSH PRIVILEGES command 284
fsimage class 225
fsimage locaon
adding, to NameNode 225
fully distributed mode 32
G
GenericRecord class 157
Google File System (GFS)
URL 15
GRANT statement 284
granular access control, Hadoop security
model 224
graph algorithms
about 137
adjacency list representaons 139
adjacency matrix representaons 139
black nodes 139
common coloring technique 139
nal thoughts 151
rst run 146, 147
fourth run 149, 150
Graph 101 138
graph nodes 139
graph, represenng 139, 140
Graphs and MapReduce 138
iterave applicaon 141
mapper 141
mulple jobs, running 151
nodes 138
overview 140
pointer-based representaons 139
reducer 141
second run 147, 148
source code, creang 142-145
states, for node 141
third run 148, 149
white nodes 139
graphs, Avro 165
H
Hadoop
about 15
alternave distribuons 349
architectural principles 16
as archive store 280
as data input tool 281
as preprocessing step 280
base folder, conguring 34
base HDFS directory, changing 34
common architecture 19
common building blocks 16
components 15
conguring 30
data, geng into 287
data paths 279
downloading 28
embrace failure 168
failure 167
failure, types 168
les, geng into 318
lesystem, formang 34
HDFS 16
HDFS and MapReduce 18
HDFS, using 38
HDFS web UI 42
MapReduce 17
MapReduce web UI 44
modes 32
monitoring 42
NameNode, formang 35
network trac, geng into 316, 317
on local Ubuntu host 25
on Mac OS X 26
on Windows 26
prerequisites 26
programming abstracons 354
running 30
scaling 235
seng up 27
SSH, seng up 29
starng 36, 37
used, for calculang Pi 30
versions 27, 290
web server data, geng into 316, 317
WordCount, execung on larger body
of text 42
WordCount, running 39
Hadoop changes
about 348
MapReduce 2.0 or MRV2 348
YARN (Yet Another Resource Negoator) 348
Hadoop cluster
commodity hardware 219
EMR, as prototyping plaorm 212
hardware, sizing 211
hosts 210
master nodes, locaon 211
networking conguraon 215
node and running balancer, adding 235
processor / memory / storage rao 211, 212
seng up 209
special node requirements 213
storage types 213
usable space on node, calculang 210
Hadoop community
about 356
conferences 357
HUGs 356
LinkedIn groups 356
mailing lists and forums 356
source code 356
Hadoop conguraon properes
about 206
default properes 206
default storage locaon 208
property elements 208
seng 209
Hadoop dependencies 318
Hadoop Distributed File System. See HDFS
Hadoop failure
correlated failures 192
hardware failures 191
host corrupon 192
host failures 191
Hadoop FAQ
URL 26
hadoop fs command 317
Hadoop, into MySQL
data, imporng from 304, 306
Hadoop Java API, for MapReduce
0.20 MapReduce Java API 61
about 60
hadoop job -history command 233
hadoop job -kill command 233
hadoop job -list all command 233
hadoop job -set-priority command 232, 233
hadoop job -status command 233
hadoop/lib directory 234
Hadoop networking conguraon
about 215
blocks, placing 215
default rack conguraon, examining 216
rack-awareness script 216
rack awareness script, adding 217, 218
Hadoop node failures
block corrupon 179
block sizes 169, 170
cluster setup 169
data loss 178, 179
DataNode and TaskTracker failures,
comparing 183
DataNode process, killing 170-173
dfsadmin command 169
Elasc MapReduce 170
fault tolerance 170
missing blocks, causing intenonally 176-178
NameNode and DataNode communicaon 173,
174
NameNode log delving 174
permanent failure 184
replicaon factor 174, 175
TaskTracker process, killing 180-183
test les 169
Hadoop Pipes 94
Hadoop-provided input formats
about 90
FileInputFormat 90
SequenceFileInputFormat 90
TextInputFormat 90
Hadoop-provided OutputFormats
about 91
FileOutputFormat 91
NullOutputFormat 91
SequenceFileOutputFormat 91
TextOutputFormat 91
Hadoop-provided record readers
about 90
LineRecordReader 90
SequenceFileRecordReader 90
Hadoop security model
about 220
default security, demonstrang 220-222
granular access control 224
user identy 223
working around, via physical access control 224
Hadoop-specic data types
about 83
wrapper classes 84
Writable interface 83, 84
Hadoop Streaming
about 94
advantages 94, 97, 98
using, in WordCount 95, 96
working 94
Hadoop Summit 357
Hadoop User Group. See HUGs
Hadoop versioning 27
hardware failure 191
HBase
about 20, 330, 352
URL 352
HBase on EMR 355
HDFS
about 16
and Sqoop 291
balancer, using 230
data, wring 230
employee table, exporng into 288
features 16
managing 230
network trac, wring onto 333, 334
rebalancing 230
using 38, 39
HDFS web UI 42
HDP. See Hortonworks Data Plaorm
hidden issues, data
about 318
common framework approach 319
Hadoop dependencies 318
network data, keeping on network 318
reliability 318
historical trends, big data processing
about 9
classic data processing systems 9
liming factors 10, 11
Hive
about 237
benets 238
buckeng 264
clustering 264
data, imporng into 294
data, validang 246
downloading 239
features 270
installing 239, 240
overview 237
prerequisites 238
seng up 238
sorng 264
table for UFO data, creang 241-243
table, validang 246, 247
UFO data, adding to table 244, 245
user-dened funcons 264
using 241
versus, Pig 269
Hive and SQL views
about 254
using 254, 256
Hive data
imporng, into MySQL 308-310
Hive exports
and Sqoob 307, 308
Hive, on AWS
interacve EMR cluster, using 277
interacve job ows, using for development
277
UFO analysis, running on EMR 270-276
Hive parons
about 302
and Sqoop 302
HiveQL
about 243
datatypes 243
HiveQL command 269
HiveQL query planner 269
Hive tables
about 250
creang, from exisng le 250-252
dirty data, handling 257
join, improving 254
join, performing 252, 253
paroned UFO sighng table, creang 260-
264
paroning 260
Hive transforms 264
Hortonworks 350
Hortonworks Data Plaorm
about 350
URL 350
host failure 191
HTTPClient 317, 318
HTTP Components 317
HTTP protocol 317
HUGs 356
I
IBM InfoSphere Big Insights
about 351
URL 351
InputFormat class 89, 158
INSERT command 263
insert statement
versus update statement 307
installaon, Flume 320, 321
installaon, MySQL 282-284
installaon, Sqoop 289, 290
interacve EMR cluster
using 277
interacve job ows
using, for development, 277
Iterator object 134
J
Java Development Kit (JDK) 26
Java HDFS interface 318
Java IllegalArgumentExcepons 310
Java shape and locaon analysis
about 108
ChainMapper, using for record validaon 108,
111, 112
Distributed Cache, using 113, 114
issues, with output data 112, 113
java.sql.Date 310
JDBC 304
JDBC channel 331
JobConf class 209
job priories, MapReduce management
changing 231, 233
scheduling 232
JobTracker 211
JobTracker UI 44
joins
about 128
account and sales informaon, mtaching 129
disadvantages 128
limitaons 137
map-side joins, implemenng 135
map-side, versus reduce-side joins 128
reduce-side join, implemenng 129
K
key/value data
about 58, 59
MapReduce, using 59
real-world examples 59
key/value pairs
about 57, 58
key/value data 58
L
language-independent data structures
about 151
Avro 152
candidate technologies 152
large-scale data processing. See big data pro-
cessing
LineCounters 124
LineRecordReader 90
LinkedIn groups
about 356
URL 356
list jars command 267
load balancing sink processor 342
LOAD DATA statement 287
local at le
remote le, capturing to 328, 329
local Hadoop
versus, EMR Hadoop 55
local standalone mode 32
log le
network trac, capturing to 321-323
logrotate 344
logs
versus les 327
M
Mahout
about 353
URL 353
mapper
database, accessing from 288
mapper and reducer implementaons 73
Mapper class, 0.20 MapReduce Java API
about 61, 62
cleanup method 62
map method 62
setup method 62
mappers 17, 293
MapR
about 351
URL 351
mapred.job.tracker property 229
mapred.job.tracker variable 34
mapred.map.max.aempts 195
mapred.max.tracker.failures 196
mapred.reduce.max.aempts 196
MapReduce
about 16, 17, 237, 344
advanced techniques 127
features 17
Hadoop Java API 60
used, as key/value transformaons 59, 60
MapReduce 2.0 or MRV2 348
MapReducejob analysis
developing 117-124
MapReduce management
about 231
alternave schedulers 233
alternave schedulers, enabling 234
alternave schedulers, using 234
command line job management 231
job priories 231
scheduling 231
MapReduce programs
classpath, seng up 65
developing 93
Hadoop-provided mapper and reducer
implementaons 73
JAR le, building 68
pre-0.20 Java MapReduce API 72
WordCount, implemenng 65-67
WordCount, on local Hadoop cluster 68
WordCount, running on EMR 69-71
wring 64
MapReduce programs development
counters 117
counters, creang 118
job analysis workow, developing 117
languages, using 94
large dataset, analyzing 98
status 117
task states 122, 123
MapReduce web UI 44
map-side joins
about 128
data pruning, for ng cache 135
data representaon, using 136
implemenng, Distributed Cache used 135
mulple mappers, using 136
map wrapper classes
AbstractMapWritable 85
MapWritable 85
SortedMapWritable 85
master nodes
locaon 211
mean me between failures (MTBF) 214
memory channel 330
Message Passing Interface (MPI) 349
MetaStore 269
modes
fully distributed mode 32
local standalone mode 32
pseudo-distributed mode 32
MRUnit
about 354
URL 354
mul-level Flume networks 338-340
MulpleInputs class 133
mulple sinks
agent, wring to 340-342
mulplexing 342
mulplexing source selector 342
MySQL
conguring, for remote connecons 285
Hive data, imporng into 308-310
installing 282-284
seng up 281-284
mysql command-line ulity
about 284, 337
opons 284
mysqldump ulity 288
MySQL, into Hive
data, exporng from 295-297
MySQL, to HDFS
data, exporng from 291-293
MySQL tools
used, for exporng data into Hadoop 288
N
NameNode
about 211
formang 35
fsimage copies, wring 226
fsimage locaon, adding 225
host, swaping 227
managing 224
mulple locaons, conguring 225
NameNode host, swapping
disaster recovery 227
swapping, to new NameNode host 227, 228
Netcat 323, 330
network
network data, keeping on 318
network data
capturing, Flume used 321-323
keeping, on network 318
wring, to log les 326, 327
Network File System (NFS) 214
network storage 214
network trac
about 316
capturing, to log le 321-323
geng, into Hadoop 316, 317
wring, onto HDFS 333, 334
Node inner class 146
NullOutputFormat 91
NullWritable wrapper class 88
O
ObjectWritable wrapper class 88
Oozie
about 352
URL 352
Open JDK 26
OutputFormat class 91
P
paroned UFO sighng table
creang 260-263
Pentaho Kele
URL 353
Pi
calculang, Hadoop used 30
Pig
about 269, 354
URL 354
Pig Lan 269
pre-0.20 Java MapReduce API 72
primary key column 293
primive wrapper classes
about 85
BooleanWritable 85
ByteWritable 85
DoubleWritable 85
FloatWritable 85
IntWritable 85
LongWritable 85
VIntWritable 85
VLongWritable 85
process ID (PID) 171
programming abstracons
about 354
Cascading 354
Pig 354
Project Gutenberg
URL 42
property elements
about 208
descripon 208
nal 208
Protocol Buers
about 152, 319
URL 152
pseudo-distributed mode
about 32
conguraon variables 33
conguring 32, 33
Q
query output, Hive
exporng 258, 259
R
raw query
data, imporng from 300, 301
RDBMS 280
RDS
considering 313
real-world examples, key/value data 59
RecordReader class 89
RecordWriters class 91
ReduceJoinReducer class 134
reducer
about 17
data, wring from 303
SQL import les, wring from 304
Reducer class, 0.20 MapReduce Java API
about 62, 63
cleanup method 63
reduce method 62
run method 62
setup method 62
reduce-side join
about 129
DataJoinMapper class 134
implemenng 129
implemenng, MulpleInputs used 129-132
TaggedMapperOutput class 134
Redundant Arrays of Inexpensive Disks (RAID)
214
Relaonal Database Service. See RDS
remote connecons
MySQL, conguring for 285
remote le
capturing, to local at le 328, 329
remote procedure call (RPC) framework 165
replicang 342
ResourceManager 348
Ruby API
URL 156
S
SalesRecordMapper class 133
scale-out approach
about 10
benets 10
scale-up approach
about 9
advantages 10
scaling
capacity, adding to EMR job ow 235
capacity, adding to local Hadoop cluster 235
schemas, Avro
City eld 154
dening 154
Duraon eld 154
Shape eld 154
Sighng_date eld 154
SecondaryNameNode 211
selecve import
performing 297, 298
SELECT statement 288
SequenceFile class 91
SequenceFileInputFormat 90
SequenceFileOutputFormat 91
SequenceFileRecordReader 90
SerDe 269
SimpleDB 277
about 355
URL 355
Simple Storage Service (S3)
about 22, 45
URL 22
single disk versus RAID 214
sink 323, 330
sink failure
handling 342
skip mode 197
source 323, 330
source code 356
special node requirements, Hadoop cluster 213
Spring Batch
URL 353
SQL import les
wring, from reducer 304
Sqoop
about 289, 337, 338, 350
and HDFS 291
and Hive exports 307, 308
and Hive parons 302
architecture 294
as code generator 313
conguring 289, 290
downloading 289, 290
export, re-running 310-312
features 312, 313
eld and line terminators 303
installing 289, 290
mappers 293
mapping, xing 310-312
primary key columns 293
URL, for homepage 289
used, for imporng data into Hive 294
versions 290
sqoop command-line ulity 290
Sqoop exports
versus Sqoop imports 306, 307
Sqoop imports
versus Sqoop exports 306, 307
start-balancer.sh script 230
stop-balancer.sh script 230
Storage Area Network (SAN) 214
storage types, Hadoop cluster
about 213
balancing 214
commodity, versus enterprise class storage 214
network storage 214
single disk, versus RAID 214
Streaming WordCount mapper 97
syslogd 330
T
TaggedMapperOutput class 134
task failures, due to data
about 196
dirty data, handling by skip mode 197-201
dirty data, handling through code 196
skip mode, using 197
task failures, due to soware
about 192
failing tasks, handling 195
HDFS programmac access 194
slow running tasks 192, 194
slow-running tasks, handling 195
speculave execuon 195
TextInputFormat 90
TextOutputFormat 91
Thri
about 152, 319
URL 152
mestamp() funcon 301
TimestampInterceptor class 336
mestamps
adding 335-337
used, for wring data into directory 335-337
tradional relaonal databases 136
type mapping
used, for improving data import 299, 300
U
Ubuntu 283
UDFMethodResolver interface 267
UDP syslogd source 333
UFO analysis
running, on EMR 270-273
ufodata 264
UFO dataset
shape data, summarizing 102, 103
shape/me analysis, performing from com-
mand line 107
sighng duraon, correlang to UFO shape
103-105
Streaming scripts, using outside Hadoop 106
UFO data, summarizing 99-101
UFO shapes, examining 101
UFO data table, Hive
creang 241-243
data, loading 244, 245
data, validang 246, 247
redening, with correct column
separator 248, 249
UFO sighng dataset
geng 98
UFO sighng records
descripon 98
duraon 98
locaon date 98
recorded date 98
shape 98
sighng date 98
Unix chmod 223
update statement
versus insert statement 307
user-dened funcons (UDF)
about 264
adding 265-267
user identy, Hadoop security model
about 223
super user 223
USE statement 284
V
VersionedWritable wrapper class 88
versioning 319
W
web server data
geng, into Hadoop 316, 317
WHERE clause 301
Whir
about 353
URL 353
WordCount example
combiner class, using 80, 81
execung 39-42
xing, to work with combiner 81, 82
implemenng, Streaming used 95, 96
input, spling 75
JobTracker monitoring 76
mapper and reducer implementaons, using
73, 74
mapper execuon 77
mapper input 76
mapper output 77
oponal paron funcon 78
paroning 77, 78
reduce input 77
reducer execuon 79
reducer input 78
reducer output 79
reducer, using as combiner 81
shutdown 79
start-up 75
task assignment 75
task start-up 76
WordCount example, on EMR
AWS management console used 46-50, 51
wrapper classes
about 84
array wrapper classes 85
CompressedWritable 88
map wrapper classes 85
NullWritable 88
ObjectWritable 88
primive wrapper classes 85
VersionedWritable 88
writable wrapper classes 86, 87
writable wrapper classes
about 86, 87
exercises 88
Y
Yet Another Resource Negoator (YARN) 348
Thank you for buying
Hadoop Beginner's Guide
About Packt Publishing
Packt, pronounced 'packed', published its first book "Mastering phpMyAdmin for Effective MySQL Management" in April 2004 and subsequently continued to specialize in publishing highly focused books on specific technologies and solutions.
Our books and publications share the experiences of your fellow IT professionals in adapting and customizing today's systems, applications, and frameworks. Our solution-based books give you the knowledge and power to customize the software and technologies you're using to get the job done. Packt books are more specific and less general than the IT books you have seen in the past. Our unique business model allows us to bring you more focused information, giving you more of what you need to know, and less of what you don't.
Packt is a modern, yet unique publishing company, which focuses on producing quality, cutting-edge books for communities of developers, administrators, and newbies alike. For more information, please visit our website: www.packtpub.com.
About Packt Open Source
In 2010, Packt launched two new brands, Packt Open Source and Packt Enterprise, in order to continue its focus on specialization. This book is part of the Packt Open Source brand, home to books published on software built around open source licences, and offering information to anybody from advanced developers to budding web designers. The Open Source brand also runs Packt's Open Source Royalty Scheme, by which Packt gives a royalty to each open source project about whose software a book is sold.
Writing for Packt
We welcome all inquiries from people who are interested in authoring. Book proposals should be sent to author@packtpub.com. If your book idea is still at an early stage and you would like to discuss it first before writing a formal book proposal, contact us; one of our commissioning editors will get in touch with you.
We're not just looking for published authors; if you have strong technical skills but no writing experience, our experienced editors can help you develop a writing career, or simply get some additional reward for your expertise.