Hadoop: The Definitive Guide, 3rd Edition, O'Reilly, May 2012
Page Count: 686
THIRD EDITION
Hadoop: The Definitive Guide
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Hadoop: The Definitive Guide, Third Edition
Editors:
Production Editor:
Copyeditor:
Proofreader:
Revision History for the Third Edition:
Indexer:
Cover Designer:
Interior Designer:
Illustrator:
Table of Contents
Foreword
Preface
1. Meet Hadoop
2. MapReduce
3. The Hadoop Distributed Filesystem
4. Hadoop I/O
5. Developing a MapReduce Application
6. How MapReduce Works
7. MapReduce Types and Formats
8. MapReduce Features
9. Setting Up a Hadoop Cluster
10. Administering Hadoop
11. Pig
12. Hive
13. HBase
14. ZooKeeper
15. Sqoop
16. Case Studies
A. Installing Apache Hadoop
B. Cloudera’s Distribution Including Apache Hadoop
C. Preparing the NCDC Weather Data
Index
Foreword
Preface
Administrative Notes
import org.apache.hadoop.io.*
What’s in This Book?
What’s New in the Second Edition?
What’s New in the Third Edition?
Conventions Used in This Book
Constant width
Constant width bold
Constant width italic
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
CHAPTER 1
Meet Hadoop
Data!
Data Storage and Analysis
Comparison with Other Systems
Relational Database Management System
             Traditional RDBMS            MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear
Grid Computing
Volunteer Computing
A Brief History of Hadoop
The Origin of the Name “Hadoop”
JobTracker
Hadoop at Yahoo!
Apache Hadoop and the Hadoop Ecosystem
Hadoop Releases
Feature                          1.x                                 0.22         2.x
Secure authentication            Yes                                 No           Yes
Old configuration names          Yes                                 Deprecated   Deprecated
New configuration names          No                                  Yes          Yes
Old MapReduce API                Yes                                 Yes          Yes
New MapReduce API                Yes (with some missing libraries)   Yes          Yes
MapReduce 1 runtime (Classic)    Yes                                 Yes          No
MapReduce 2 runtime (YARN)       No                                  No           Yes
HDFS federation                  No                                  No           Yes
HDFS high-availability           No                                  No           Yes
What’s Covered in This Book
Configuration names
dfs.namenode
dfs.name.dir
dfs.namenode.name.dir
mapreduce
mapred
mapreduce.job.name
mapred.job.name
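The property names listed above pair up as old name and new name (dfs.name.dir with dfs.namenode.name.dir, and mapred.job.name with mapreduce.job.name). The following is a minimal sketch of that pairing in plain Java. The class ConfigNames and its method toNewName are hypothetical names chosen for illustration and are not part of Hadoop; only the two pairs named in the text are included.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not a Hadoop class.
class ConfigNames {
  // Only the two old-to-new pairs that appear in the text are listed here.
  private static final Map<String, String> OLD_TO_NEW = Map.of(
      "dfs.name.dir", "dfs.namenode.name.dir",
      "mapred.job.name", "mapreduce.job.name");

  // Returns the newer name for a known old name, or the input unchanged.
  static String toNewName(String name) {
    return OLD_TO_NEW.getOrDefault(name, name);
  }

  public static void main(String[] args) {
    System.out.println(toNewName("dfs.name.dir"));    // dfs.namenode.name.dir
    System.out.println(toNewName("mapred.job.name")); // mapreduce.job.name
  }
}
```

Names that are not in the map pass through unchanged, so the helper can be applied to any property name without first checking which scheme it uses.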
MapReduce APIs
oldapi
Compatibility
InterfaceStability.Stable
InterfaceStability.Evolving
InterfaceStability.Unstable
org.apache.hadoop.classification
CHAPTER 2
MapReduce
A Weather Dataset
Data Format
0057
332130   # USAF weather station identifier
99999    # WBAN weather station identifier
19500101 # observation date
0300     # observation time
4
+51317   # latitude (degrees x 1000)
+028783  # longitude (degrees x 1000)
FM-12
+0171    # elevation (meters)
99999
V020
320      # wind direction (degrees)
1        # quality code
N
0072
1
00450    # sky ceiling height (meters)
1        # quality code
C
N
010000   # visibility distance (meters)
1        # quality code
N
9
-0128    # air temperature (degrees Celsius x 10)
1        # quality code
-0139    # dew point temperature (degrees Celsius x 10)
1        # quality code
10268    # atmospheric pressure (hectopascals x 10)
1        # quality code
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
Analyzing the Data with Unix Tools
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp !=9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
% ./max_temperature.sh
1901	317
1902	244
1903	289
1904	256
1905	283
...
Analyzing the Data with Hadoop
Map and Reduce
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
(1949, [111, 78])
(1950, [0, 22, −11])
(1949, 111)
(1950, 22)
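The data flow shown in the listings above (map output, grouping by key, then taking the maximum per key) can be sketched in plain Java without a Hadoop cluster. This is an illustration of the logical flow only, not how Hadoop implements it. The class MaxTemperatureFlow and its methods are hypothetical names, and the input pairs are the (year, temperature) values from the sample above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the map -> group -> reduce flow.
class MaxTemperatureFlow {

  // Groups (year, temperature) pairs by year, sorted by key,
  // which is what the framework does between the map and reduce phases.
  static Map<Integer, List<Integer>> group(int[][] pairs) {
    Map<Integer, List<Integer>> grouped = new TreeMap<>();
    for (int[] pair : pairs) {
      grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
    }
    return grouped;
  }

  // Reduces each year's list of temperatures to its maximum.
  static Map<Integer, Integer> reduce(Map<Integer, List<Integer>> grouped) {
    Map<Integer, Integer> result = new TreeMap<>();
    for (Map.Entry<Integer, List<Integer>> entry : grouped.entrySet()) {
      int max = Integer.MIN_VALUE;
      for (int temperature : entry.getValue()) {
        max = Math.max(max, temperature);
      }
      result.put(entry.getKey(), max);
    }
    return result;
  }

  public static void main(String[] args) {
    // Map output from the sample records: (year, temperature in tenths of a degree).
    int[][] mapOutput = { {1950, 0}, {1950, 22}, {1950, -11}, {1949, 111}, {1949, 78} };
    Map<Integer, List<Integer>> grouped = group(mapOutput);
    System.out.println(grouped);         // {1949=[111, 78], 1950=[0, 22, -11]}
    System.out.println(reduce(grouped)); // {1949=111, 1950=22}
  }
}
```

A TreeMap stands in for the sort-by-key step here; in a real job the framework performs the sorting and grouping, and user code supplies only the map and reduce functions.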
Java MapReduce
Mapper
map()
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Emit (year, temperature) only for present, trusted readings
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
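The field extraction in the mapper can be exercised without Hadoop by pulling the substring logic into a plain class. The sketch below reuses the same offsets (year at 15 to 19, signed temperature at 87 to 92, quality code at 92). The names TemperatureRecord and syntheticRecord are hypothetical, and the record it builds is synthetic filler with only those three fields set, not real NCDC data.

```java
import java.util.Arrays;

// Hypothetical standalone version of the mapper's parsing logic.
class TemperatureRecord {
  static final int MISSING = 9999;

  // Year occupies characters 15..18 of the record.
  static String year(String line) {
    return line.substring(15, 19);
  }

  // Signed temperature occupies characters 87..91; skip a leading plus sign
  // in the same way the mapper does.
  static int airTemperature(String line) {
    int start = line.charAt(87) == '+' ? 88 : 87;
    return Integer.parseInt(line.substring(start, 92));
  }

  // A reading counts only if it is present and its quality code is trusted.
  static boolean isValid(String line) {
    return airTemperature(line) != MISSING
        && line.substring(92, 93).matches("[01459]");
  }

  // Builds a synthetic 93-character record (an assumption for testing):
  // zeros everywhere except the year, temperature, and quality fields.
  static String syntheticRecord(String year, String signedTemp, char quality) {
    char[] chars = new char[93];
    Arrays.fill(chars, '0');
    year.getChars(0, 4, chars, 15);
    signedTemp.getChars(0, 5, chars, 87);
    chars[92] = quality;
    return new String(chars);
  }

  public static void main(String[] args) {
    String record = syntheticRecord("1950", "+0022", '1');
    System.out.println(year(record) + "\t" + airTemperature(record)); // 1950	22
  }
}
```

Keeping the parsing separate from the Mapper subclass means the offsets can be checked with ordinary unit tests before a job is ever submitted.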
Mapper
org.apache.hadoop.io
Long Text
String
LongWritable
IntWritable
Integer
map()
Text
String
substring()
map()
Context
Text
IntWritable
Reducer
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
  extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {

    // Find the largest temperature recorded for this year
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
Text
Text
IntWritable
IntWritable
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature